Commit Graph

3147 Commits

Author SHA1 Message Date
Steven Dake
e3ff3e2a23 Add groff as a BuildRequires to spec file
According to Fedora packaging guidelines, groff is not on the list
of package exceptions for BuildRequires.  A recent change in the Fedora
build system has triggered breakage in building rpm packages and it
is likely this package won't build for Fedora 18.

Reference:
http://fedoraproject.org/wiki/Packaging:Guidelines#Exceptions_2

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Fabio Di Nitto <fdinitto@redhat.com>
2012-08-03 22:47:09 -07:00
Jan Friesse
fed7fc23e1 Don't call sync_* funcs for unloaded services
When a service is unloaded, sync shouldn't call the sync_init|process|activate
and abort functions. It happens very rarely, but while all services are
being unloaded, totem can recreate the membership and bad things can happen
(the service is unloaded, so already freed memory may be accessed,
 ...)

The solution is to fetch the services' sync handlers every time we
build the service list instead of using a precreated one.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-08-02 09:34:58 +02:00
Jan Friesse
9fb7979370 Introduce SERVICES_COUNT_MAX macro
Sync/service was using the maximal number of services either in numeric form
(magic constant) or inconsistently, that is, using
SERVICE_HANDLER_MAXIMUM_COUNT, which means the maximal number of handlers.

The new macro solves this.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-08-02 09:32:05 +02:00
Jan Friesse
7ff258557f cmap_keys: Document few more runtime statistics
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-07-30 17:38:10 +02:00
Jan Friesse
b9eb19e623 cts: Delete shm blackbox after corosync kill
This makes SHM Audit pass in test CpgCfgChgOnExecCrash.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-07-30 10:26:40 +02:00
Jan Friesse
537bf56fcc cpg: Be more verbose for procjoin message
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-07-30 10:22:16 +02:00
Jan Friesse
908ed7dcb3 cts: Change DC_IDLE pattern
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-07-12 15:53:11 +02:00
Jan Friesse
2e3796a706 cts: Make shm_leak_audit run
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-07-12 15:53:09 +02:00
Jan Friesse
04dac3ff5d Correctly free state string in wd
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-07-12 15:53:04 +02:00
Jan Friesse
cb650e93a9 cts: Change local_start[ing|ed] pattern in CTS
The previous pattern is no longer sent to syslog. Use the first pattern
that is.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-07-12 15:53:01 +02:00
Jan Friesse
db4ac0fb38 Support for crypto_ and nodelist in lenses
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-07-12 15:52:56 +02:00
Jan Friesse
e4d75d1ab3 Revert "Free state variable allocated in wd_resource_state_is_ok"
This reverts commit 01c63ca17c.
2012-07-11 17:04:41 +02:00
Jan Friesse
a966506c1e cpg: Enhance downlist selection algorithm
Let's say we have 2 nodes:
- node 2 is paused
- node 1 create membership (one node)
- node 2 is unpaused

The result is that node 1's downlist is selected, which means that from
node 2's point of view, node 1 was never down.

The patch solves this situation by adding an additional check for the
largest previous membership.

So the current tests are:
1) largest (previous #nodes - #nodes known to have left)
2) (then) largest previous membership
3) (and last as a tie-breaker) node with smallest nodeid
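
The three tests above read as a lexicographic comparison. A minimal sketch in Python (not corosync's C code; the `Downlist` fields and the `choose_downlist` name are hypothetical, and the numbers in the usage note are made up to mirror the two-node scenario):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Downlist:
    sender_nodeid: int  # node that sent this downlist (hypothetical field)
    old_members: int    # size of the sender's previous membership
    left_nodes: int     # number of nodes the sender knows to have left


def choose_downlist(candidates):
    """Pick one downlist by applying the three tests in order."""
    return max(
        candidates,
        key=lambda d: (
            d.old_members - d.left_nodes,  # 1) largest (previous - left)
            d.old_members,                 # 2) then largest previous membership
            -d.sender_nodeid,              # 3) tie-breaker: smallest nodeid
        ),
    )
```

With `Downlist(1, 1, 0)` from node 1 and `Downlist(2, 2, 1)` from node 2, test 1 ties, so test 2 decides and node 2's downlist is chosen.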

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-06-14 15:15:42 +02:00
Jan Friesse
f3457c5d49 cpg: Print cpg name to debug information
In downlist and joinlist debug output, the group was printed in a
nonsensical format: a pointer to an array printed as an integer.

Now it's printed by full name.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-06-14 15:15:39 +02:00
Jan Friesse
35446d6bcc cpg: Process join list after downlists
Let's say the following situation happens:
- we have 3 nodes
- on the wire, messages look like D1,J1,D2,J2,D3,J3 (D is downlist, J is
  joinlist)
- let's say D1 and D3 contain node 2
- it means that J2 is applied, but right after that, D1 (or D3) is
  applied, which means node 2 is again considered down

This is solved by collecting joinlists and applying them after the
downlist, so the order is:
- apply best matching downlist
- apply all joinlists
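
The ordering above can be sketched as follows. This is an illustrative Python model, not the cpg implementation; `apply_sync_messages`, the `("D", nodes)` / `("J", nodes)` message tuples and the `pick_downlist` callback are all assumptions made for the example:

```python
def apply_sync_messages(members, messages, pick_downlist):
    """Apply one chosen downlist first, then every collected joinlist.

    members:       set of node ids currently considered up
    messages:      ("D", nodes) or ("J", nodes) tuples in wire order
    pick_downlist: hypothetical callable that picks the best downlist
    """
    messages = list(messages)
    downlists = [nodes for kind, nodes in messages if kind == "D"]
    joinlists = [nodes for kind, nodes in messages if kind == "J"]

    result = set(members)
    if downlists:
        result -= set(pick_downlist(downlists))  # apply best matching downlist
    for nodes in joinlists:                      # then apply all joinlists
        result |= set(nodes)
    return result
```

Because joins are replayed only after the downlist, a node listed in both a downlist and a later joinlist ends up as a member, whatever the interleaving on the wire.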

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-06-14 15:15:35 +02:00
Jan Friesse
816d7687b0 cpg: Never choose downlist with localnode
The test scenario is as follows:
- node 1, node 2
- node 1 is paused
- node 2 sees node 1 dead
- node 1 is unpaused
- node 1 and 2 both choose the same downlist message, which includes
node 2 -> node 2 is effectively disconnected

The patch includes an additional test of whether left_node is the local
node. If so, such a downlist is ignored.
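
As a rough illustration of that check (a Python sketch under assumed data shapes, not the actual cpg code; `usable_downlists` is a hypothetical name and each downlist is modelled as a plain collection of left node ids):

```python
def usable_downlists(downlists, local_nodeid):
    """Drop every downlist that lists the local node as having left."""
    return [left for left in downlists if local_nodeid not in left]
```

On node 2 in the scenario above, a downlist containing node 2 is filtered out before the selection step runs.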

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-06-14 15:15:32 +02:00
Jerome FLESCH
99faa3b864 When flushing, discard only memb_join messages
The patch solves a problem where 1 ring out of 2 went up/down quite often.

The simplest setup to reproduce the bug is the following:
- 2 VMs, connected by 2 network interfaces
- OS: Linux
- On one of the VMs, a test program sending some CPG messages (see the
  script "test_corosync.sh" attached to this mail, for example)

Here are the Corosync logs we get when we do this setup:

Jun 06 16:23:40 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 06 16:23:40 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.56.104) r(1) ip(192.168.57.104) ; members(old:1 left:0)
Jun 06 16:23:40 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jun 06 16:24:37 corosync [TOTEM ] Marking ringid 1 interface 192.168.57.105 FAULTY
Jun 06 16:24:38 corosync [TOTEM ] Automatically recovered ring 1
Jun 06 16:25:33 corosync [TOTEM ] Marking ringid 1 interface 192.168.57.105 FAULTY
Jun 06 16:25:34 corosync [TOTEM ] Automatically recovered ring 1
Jun 06 16:26:35 corosync [TOTEM ] Marking ringid 1 interface 192.168.57.105 FAULTY
Jun 06 16:26:36 corosync [TOTEM ] Automatically recovered ring 1
(...)

The second ring goes down about every 2 minutes and automatically back
up right after.

We spent some time looking for the commit that introduced this bug, and
it appears to be due to the following one:
Corosync 1.3.3 -> 1.3.4: e27a58d93d
Corosync 1.4.1 -> 1.4.2: be608c0502
Commit message: Ignore memb_join messages during flush operations

I had a look at this commit, and it seems to me it's dropping too many
packets:
Because of this commit, while totemrrp_recv_flush() is called, Corosync
drops memb_join packets, but also ORF tokens. In the end, it seems that
sometimes, we drop so many of them that Corosync marks the ring as
faulty.

To fix that, only memb_join messages are dropped now.

Signed-off-by: Jerome FLESCH <jerome.flesch@netasq.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-06-11 10:59:30 +02:00
Jan Friesse
2766e57ce5 Store fdata with timestamp and pid in name
This should allow easier handling of various blackbox dumps. The original
fdata name is now a symlink to the most recently created dump.
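
A minimal Python sketch of the idea, for illustration only: the exact file name format used by corosync is not given here, so the `fdata-<timestamp>-<pid>` scheme and the `write_dump` helper below are assumptions.

```python
import os
import time


def write_dump(directory, data, base="fdata"):
    """Write a uniquely named dump and repoint the `base` symlink at it."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    name = f"{base}-{stamp}-{os.getpid()}"  # hypothetical naming scheme
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)

    link = os.path.join(directory, base)
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(name, tmp_link)   # relative target inside the same directory
    os.replace(tmp_link, link)   # swap the symlink into place atomically
    return path
```

Older dumps stay on disk under their own names, while tools that open the plain `fdata` path keep working through the symlink.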

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-06-05 12:19:42 +02:00
Jan Friesse
ab328942cc Remove corosync-fplay
Libqb now ships with the qb-blackbox command, which does the same job as
corosync-fplay. It doesn't make sense to maintain two versions of the same
utility, so corosync-fplay can go. The corosync-blackbox command now calls
qb-blackbox directly.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-06-05 12:05:15 +02:00
Kazunori INOUE
cf147197d0 notifyd: handle addition of a members key to CMAP
When a new key (totem.pg.mrp.srp.members) is added to CMAP,
we would like to receive a trap at that time.

Signed-off-by: Kazunori INOUE <inouekazu@intellilink.co.jp>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-06-01 09:49:47 +02:00
Fabio M. Di Nitto
514f2a13bd testcpg: fix build warning
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-06-01 08:47:44 +02:00
Jan Friesse
7ce332a713 totemudpu: Bind sending sockets to bindto address
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-05-31 09:28:52 +02:00
Jan Friesse
56f512d05e snmp MIB: Remove unnecessary comma
Thanks to Hideo Yamauch for pointing out this bug.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
2012-05-29 16:25:09 +02:00
Kazunori INOUE
a5dfcb98d5 notifyd snmp: fix a function name
This fixes the bug where the snmp trap for rrp_faulty_event is not sent.

Signed-off-by: Kazunori INOUE <inouekazu@intellilink.co.jp>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-05-29 16:05:04 +02:00
Fabio M. Di Nitto
f008cf442c rename mainconfig to logconfig
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-05-29 09:36:00 +02:00
Fabio M. Di Nitto
b283ef8f12 mainconfig: allow mainconfig logic to be used both internally and externally
The corosync logging configuration logic is rather complex, and in order
to make it simpler to reuse (at least within the corosync/ tree)
we need to be able to use both icmap and cmap.

The patch might seem controversial, but it removes heaps of code
from qdevices (coming next).

It might be useful to consider moving this to a common shared library,
but there aren't enough users yet and a shared lib would force
corosync to link with cmap (which we want to avoid at all costs).

Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-05-29 09:04:03 +02:00
Angus Salkeld
5831136c87 LOG: make sure the log target is enabled.
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-05-29 14:02:42 +10:00
Angus Salkeld
e6b35bdb7a LOG: handle closing unused logfiles better
This fixes a bug where having a second log file will close
the previous one.

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-05-29 14:02:42 +10:00
Angus Salkeld
e6afc761fe LOG: be more explicit about the qb file names
else we can get messages being put in the wrong subsys.

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-05-29 14:02:42 +10:00
Angus Salkeld
775f71591b LOG: drop the number of logging subsystems from 64 to 32
Currently 14 are used, 64 seems like a waste of memory.

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-05-29 14:02:42 +10:00
Barney Desmond
323f0f4277 Correct the description of bindnetaddr config parameter in manpage
bindnetaddr has been wrongly described in the past, and the manpage did
not document the fact that it will also accept exact address matches.

Signed-off-by: Barney Desmond <barney.desmond@anchor.net.au>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-05-24 17:31:20 +02:00
Jan Friesse
2894f33c4f totemip: Support bind to exact address
The binding logic now works as follows:
- Try to find an exact match
- If no exact match is found, use the first network address found

This allows setting a specific IP even if the network settings contain
two IPs on the same network.
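
The two-step lookup can be sketched in Python as below. This is not totemip's code: `pick_bind_address` and the `(address, prefix_length)` interface list are hypothetical, and "network address" is interpreted here as the interface's network matching bindnetaddr.

```python
import ipaddress


def pick_bind_address(bindnetaddr, interfaces):
    """Return the local address to bind to, or None if nothing matches.

    interfaces: list of (address, prefix_length) pairs in system order.
    """
    wanted = ipaddress.ip_address(bindnetaddr)

    # 1) An exact address match wins.
    for addr, _prefix in interfaces:
        if ipaddress.ip_address(addr) == wanted:
            return addr

    # 2) Otherwise take the first interface on the requested network.
    for addr, prefix in interfaces:
        network = ipaddress.ip_interface(f"{addr}/{prefix}").network
        if network.network_address == wanted:
            return addr
    return None
```

With two addresses on the same /24, passing the network address selects the first one, while passing the second address exactly selects that one.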

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-05-24 14:01:12 +02:00
Jan Friesse
aaa575e091 totemip: insert items in correct order
list_add_tail is used instead of list_add so IP addresses are inserted
in the same order as returned by getifaddrs.
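
The ordering difference is easy to see with a small Python analogy (a `deque` standing in for the linked list; the addresses are made-up sample data, and `appendleft`/`append` only mimic head and tail insertion):

```python
from collections import deque

addresses = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical source order

head_inserted = deque()
tail_inserted = deque()
for addr in addresses:
    head_inserted.appendleft(addr)  # like list_add: newest entry goes first
    tail_inserted.append(addr)      # like list_add_tail: source order kept
```

Head insertion reverses the sequence, which matters when a later lookup takes the first matching entry.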

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-05-24 14:01:08 +02:00
Fabio M. Di Nitto
d7e205d197 init: major cleanup
- rename generic.in and notifyd.in to corosync.in and corosync-notifyd.in
  (makes build simpler)
- fix sysvinit corosync.in sleep time to include a check for when IPC
  is ready and drop cman bits (there is no cman with corosync 2.0)
- corosync-notifyd.service should always start after corosync.service
- corosync.service should always start after network
- corosync.service uses init script wrapper
- install/ship sysvinit as wrappers for systemd in /usr/share/corosync
  when necessary
- change the build system to deal with all of the above

Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-05-21 14:55:49 +02:00
Jan Friesse
0791f44c41 Include ringid in processor joined log message
This should help correlate syslog entries with their blackbox
counterparts.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Andrew Beekhof <andrew@beekhof.net>
2012-05-17 14:58:04 +02:00
Jan Friesse
dc5b8981de Update TODO file
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-05-15 16:42:42 +02:00
Dan Clark
88dd3e1eea Improve testcpg to handle change of node identity
Signed-off-by: Dan Clark <2clarkd@gmail.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-04-26 09:01:21 +02:00
Fabio M. Di Nitto
f2444effd0 icmap: don't leak memory when changing ro/rw status on a key
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-04-24 09:28:23 +02:00
Fabio M. Di Nitto
1dcb2d43d9 icmap: fix valgrind errors (pass 1)
Clean up a lot of allocated blocks at exit.
These changes have no runtime effect, but they make valgrind
output a bit more useful by removing over 700 errors/warnings to skip
over on every single run.

There are still a few icmap-related valgrind errors, but those need
more complex and time-consuming investigation.

pre patch:

==21844== HEAP SUMMARY:
==21844==     in use at exit: 1,229,321 bytes in 1,516 blocks
==21844==   total heap usage: 7,191 allocs, 5,675 frees, 3,819,853 bytes allocated

==21844== LEAK SUMMARY:
==21844==    definitely lost: 3,617 bytes in 11 blocks
==21844==    indirectly lost: 21,960 bytes in 11 blocks
==21844==      possibly lost: 1,080,101 bytes in 131 blocks
==21844==    still reachable: 123,643 bytes in 1,363 blocks
==21844==         suppressed: 0 bytes in 0 blocks

==21844== ERROR SUMMARY: 136 errors from 136 contexts (suppressed: 0 from 0)

post patch:

==25793== HEAP SUMMARY:
==25793==     in use at exit: 1,185,870 bytes in 808 blocks
==25793==   total heap usage: 9,427 allocs, 8,619 frees, 4,156,841 bytes allocated

==25793== LEAK SUMMARY:
==25793==    definitely lost: 3,697 bytes in 12 blocks
==25793==    indirectly lost: 22,248 bytes in 13 blocks
==25793==      possibly lost: 1,079,655 bytes in 113 blocks
==25793==    still reachable: 80,270 bytes in 670 blocks
==25793==         suppressed: 0 bytes in 0 blocks

==25793== ERROR SUMMARY: 119 errors from 119 contexts (suppressed: 0 from 0)

Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-04-24 09:28:23 +02:00
Fabio M. Di Nitto
d2872aec70 crypto init: release *_slot resource after init
Those are only used at init phase and we can free some memory for the system.

Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2012-04-20 10:57:16 +02:00
Fabio M. Di Nitto
b34c1e2870 ipcs: allow connections only after all services are ready
this fixes a rather annoying race condition at startup where a client
connects to corosync "too fast" before the service is ready to operate
and client gets some random data during initialization phase.

With this fix, we allow connections to ipc only after the main engine
is operational and configured (and after the first totem transition).

Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2012-04-16 13:39:03 +02:00
Jan Friesse
f89d7b715f Always allocate totemrrp stats array
This prevents segfault when rrp mode is set with only one ring.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-04-10 09:08:42 +02:00
Jan Friesse
92ead6106f Properly parse uidgid files
The full path to the key is now tested rather than the key name only.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-04-10 09:08:36 +02:00
Fabio M. Di Nitto
769a69e41f build: improve systemd service file handling
this solves the issue of having to special case before and after usrmove

Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-04-10 08:59:40 +02:00
Jan Friesse
f78671fe84 Remove info pages see also from cmapctl man
Corosync doesn't have documentation in info format, so the information
in corosync-cmapctl was not true.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-04-06 09:21:04 +02:00
Jan Friesse
de47f12437 Add man page with CMAP keys created by corosync
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-04-06 09:20:57 +02:00
Angus Salkeld
353e223377 Check before making a reference to __start___verbose
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
2012-04-05 23:49:47 +10:00
Fabio M. Di Nitto
835025f319 conf: add quorum section to example config
document only the provider option since all the others
(votes/expected_votes/etc) are provider specific.

Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-04-03 09:47:11 +02:00
Angus Salkeld
acad48bf38 Only call qb_ipcc_disconnect when the instance is fully dereferenced.
Sometimes calling xyz_finalize() within a dispatch would
cause a crash because qb_ipcc_disconnect actually
disconnects immediately and frees its memory, whereas
the corosync structure is reference counted. So this
makes use of the reference counting to only call
qb_ipcc_disconnect when it is fully dereferenced.
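
A minimal Python model of that pattern, for illustration only: the `Instance` class, its `get`/`put`/`finalize` methods and the `dispatch` helper are hypothetical stand-ins, and the `disconnect` callback plays the role of qb_ipcc_disconnect.

```python
class Instance:
    """Hypothetical stand-in for a reference-counted client handle."""

    def __init__(self, disconnect):
        self._refs = 1               # reference held by the creator
        self._finalized = False
        self._disconnect = disconnect

    def get(self):
        self._refs += 1
        return self

    def put(self):
        self._refs -= 1
        if self._refs == 0:
            self._disconnect()       # safe: no user of the instance remains

    def finalize(self):
        if not self._finalized:
            self._finalized = True
            self.put()               # drop only the creator's reference


def dispatch(instance, callback):
    instance.get()                   # dispatch holds its own reference
    try:
        callback(instance)           # the callback may call finalize()
    finally:
        instance.put()               # the last reference triggers disconnect
```

If the callback finalizes the instance, the disconnect is deferred until `dispatch` drops its own reference, so the callback never runs on freed state.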

Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
2012-04-03 16:03:07 +10:00
Jan Friesse
4b2cfc3f6b Convert udpu example to use nodelist
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-03-27 13:28:37 +02:00