This patch tries to fix incorrect behaviour in the following test case:
- 3 nodes
- Node 1 is paused
- Nodes 2 and 3 detect Node 1 as failed and inform CPG clients
- Node 1 is unpaused
- Node 1's clients are informed about the new membership, but not about
Node 1 having been paused, which from Node 1's point of view looks like
a failure of Nodes 2 and 3
The solution is to:
- Remove the downlist master choice and always use the local node's
downlist. For Node 1 in the example above, the downlist contains
Nodes 2 and 3.
- Keep the code which informs clients about left nodes
- Use joinlist as the authoritative source of nodes/clients which exist
in the membership
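
The steps above can be illustrated with a small model. This is only a
sketch under stated assumptions, not the corosync C code: the function
apply_sync and its data layout (processes as (nodeid, pid) pairs, the
joinlists as a dict keyed by nodeid) are hypothetical names made up for
this illustration.

```python
def apply_sync(known_procs, local_downlist, joinlists):
    """Model one CPG sync round as seen by a single node.

    known_procs    -- set of (nodeid, pid) pairs this node knew before sync
    local_downlist -- set of nodeids this node itself saw leaving
    joinlists      -- dict nodeid -> set of pids announced by that node
    (All three names are hypothetical, chosen for this sketch only.)
    """
    # Left processes come only from the local downlist; no downlist
    # "master" is selected among the nodes.
    left = sorted(p for p in known_procs if p[0] in local_downlist)
    survivors = known_procs - set(left)

    # Everything announced in the joinlists is taken as existing.
    announced = {(node, pid) for node, pids in joinlists.items() for pid in pids}
    joined = sorted(announced - survivors)

    members = survivors | announced
    return left, joined, members


# Node 1's view after being unpaused: its own downlist holds Nodes 2 and 3.
left, joined, members = apply_sync(
    known_procs={(1, 100), (2, 200), (3, 300)},
    local_downlist={2, 3},
    joinlists={1: {100}, 2: {200}, 3: {300}},
)
```

In this model Node 1's clients first see the processes on Nodes 2 and 3
leave and then see them join again, so the pause no longer goes
unnoticed on Node 1.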
This patch doesn't break backwards compatibility.
I've walked through all the patches which changed the behavior of CPG to
ensure this patch does not break CPG behavior. The most important were:
- 058f50314c - Base. The code was significantly
changed to handle a double free by splitting group_info into two structures
cpg_pd (local node clients) and process_info (all clients). Joinlist
was
- 97c28ea756 - This patch removed
confchg_fn and made CPG sync correct
- feff0e8542 - I've tested the described
behavior without any issues
- 6bbbfcb6b4 - Added the idea of using
heuristics to choose the same downlist on all nodes. Sadly this idea
was the beginning of the problems described in
040fda8872,
ac1d79ea7c,
559d4083ed,
02c5dffa5b,
64d0e5ace0 and
b55f32fe2e
- 02c5dffa5b - Made joinlist the
authoritative source of nodes/clients but left downlist_master_choose
as the source of information about left nodes
Long story short: this patch basically reverts the
idea of using heuristics to choose the same downlist on all nodes.
(ported from needle 9c2a97f4f9)
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>