Patch tries to fix incorrect behaviour during following test-case:
- 3 nodes
- Node 1 is paused
- Node 2 and 3 detects node 1 as failed and informs CPG clients
- Node 1 is unpaused
- Node 1 clients are informed about new membership, but not about Node 1
being paused, so from Node 1 point-of-view, Node 2 and 3 failure
Solution is to:
- Remove downlist master choose and always choose local node downlist.
For Node 1 in example above, downlist contains Node 2 and 3.
- Keep code which informs clients about left nodes
- Use joinlist as a authoritative source of nodes/clients which exists
in membership
This patch doesn't break backwards compatibility.
I've walked thru all the patches which changed behavior of cpg to ensure
patch does not break CPG behavior. Most important were:
- 058f50314c - Base. Code was significantly
changed to handle double free by split group_info into two structures
cpg_pd (local node clients) and process_info (all clients). Joinlist
was
- 97c28ea756 - This patch removed
confchg_fn and made CPG sync correct
- feff0e8542 - I've tested described
behavior without any issues
- 6bbbfcb6b4 - Added idea of using
heuristics to choose same downlist on all nodes. Sadly this idea
was beginning of the problems described in
040fda8872,
ac1d79ea7c,
559d4083ed,
02c5dffa5b,
64d0e5ace0 and
b55f32fe2e
- 02c5dffa5b - Made joinlist as
authoritative source of nodes/clients but left downlist_master_choose
as a source of information about left nodes
Long story made short. This patch basically reverts
idea of using heuristics to choose same downlist on all nodes.
(ported from needle 9c2a97f4f9)
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
This shrinks the srp_addr (and consequently every packet sent by
corosync) so that instead of containing loads of IP addresses to
identify a node, it just sends the nodeid.
This then allows us to make ring0 optional and replaceable when running
knet.
It also means that we need some other way of identifying the local
node in corosync.conf, so the nodelist.node.name entry is now mandatory
and is mapped to the local host using the same algorithm as used in
cman.
This code needs LOTS of testing as it touches a huge amount of totemsrp
and totemconfig.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
If a cpg client sends a message larger than 1Mb (actually slightly
less to allow for internal buffers) cpg will now fragment that into
several corosync messages before sending it around the ring.
cpg_mcast_joined() can now return CS_ERR_INTERRUPT which means that the
cpg membership was disrupted during the send operation and the message
needs to be resent.
The new API call cpg_max_atomic_msgsize_get() returns the maximum size
of a message that will not be fragmented internally.
New test program cpghum was written to stress test this functionality,
it checks message integrity and order of receipt.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When node is paused and other nodes has in meantime exited cpg process,
paused node after resume doesn't update it's membership correctly so on
previously paused node exited cpg process is still visible.
Solution is to compare join list with cpd and remove all pids which are
not included in join list.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Also number is prefixed by 0x so it's easier to spot that number is
hexadecimal.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Most of functionality is moved to do_proc_leave function to make it
reusable.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Code for zero-copy in cpg does following mmaps:
- Mmap anonymous, private memory to some address (-> malloc)
- Mmap shared memory of fd to address returned by first mmap
(effectively shadows first mapping)
This is not necessary and only one mapping is needed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Messages which are flow messages, rather then lifecycle are now logged
in trace level.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
drop all SOLARIS specific ifdefs and replace them with feature checks
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
IPC is using buffer of CS_MAX_NAME_LENGTH for name. If user calls
function with longer string, such string can be passed to service
incomplete.
Solution is to not allow string larger then CS_MAX_NAME_LENGTH
and return error.
Same applies to cpg service.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Let's say we have 2 nodes:
- node 2 is paused
- node 1 create membership (one node)
- node 2 is unpaused
Result is that node 1 downlist is selected, so it means that from node 2
point of view, node 1 was never down.
Patch solves situation by adding additional check for largest previous
membership.
So current tests are:
1) largest (previous #nodes - #nodes know to have left)
2) (then) largest previous membership
3) (and last as a tie-breaker) node with smallest nodeid
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
In downlist and joinlist debug output group was printed in nonsense
format of integer to pointer to array.
Now it's printed by full name.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
let's say following situation will happen:
- we have 3 nodes
- on wire messages looks like D1,J1,D2,J2,D3,J3 (D is downlist, J is
joinlist)
- let's say, D1 and D3 contains node 2
- it means that J2 is applied, but right after that, D1 (or D3) is
applied what means, node 2 is again considered down
It's solved by collecting joinlists and apply them after downlist, so
order is:
- apply best matching downlist
- apply all joinlists
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Test scenario is follows:
- node 1, node 2
- node 1 is paused
- node 2 sees node 1 dead
- node 1 unpaused
- node 1 and 2 both choose same dowlist message which includes node 2 ->
node 2 is efectivelly disconnected
Patch includes additional test if left_node is localnode. If so, such
downlist is ignored.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
this change breaks onwire compatibility.
cpg is the only user of sync_* interface and it's the only
service that will require extra testing.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
We would use libqb for hashing now if we needed hashing.
cpg no longer uses jhash.h.
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Fabio Di Nitto <fdinitto@redhat.com>
This patch fixes the issue that in some cases where cpg_finalize()
was called just after cpg_leave() was called, CPG_REASON_PROCDOWN
might also be sent while CPG_REASON_LEAVE had already been sent.
This behavior is not aligned with what the man page has described:
"CPG_REASON_PROCDOWN - the process left a group without calling
cpg_leave()."
And it will confuse CPG's clients in that one process left results
in two different reasons being sent.
The root cause of this issue is cpg_leave() will return after
adding the LEAVE message to the sending queue, but the cpg's group
name has not been cleared yet. Just at that time, cpg_finalize()
is being called, then it determines if there is the calling of
cpg_leave() happened only by the checking of cpg's group name, so
this method is not sufficient.
Signed-off-by: Jiaju Zhang <jjzhang@suse.de>
Reviewed-by: Steven Dake <sdake@redhat.com>
exec_init_fn now either returns NULL (success) or a string which indicates
the error that occured during service engine initialization. If an error
occurs, corosync will exit. This patch adds ykd and makes other suggestions
from Fabio Di Nitto.
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Fabio Di Nitto <fdinitto@redhat.com>
These look ugly, are inconsistently done and just have
to be removed later in libqb before calling syslog.
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Quorum is broken in this patch.
service.h needs to be cleaned up significantly
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Fabio Di Nitto <fdinitto@redhat.com>
This call causes a complete list of active groups and their
membership lists to be sent to a callback function.
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@1571 fd59a12c-fef9-0310-b244-a6a79926bd2f