A leave message in totem is just a join message in which the leaving
member is excluded from the member list and included in the fail list.
It also contains a special nodeid in the header.nodeid and
system_from.nodeid fields.
Before the "totem: Use nodeid ONLY in srp_addr" patch, most of the
functions used the system_from addresses rather than the nodeid, which
was used in only one specific case, for the memb_consensus_set function.
After that patch, the addresses are gone and only the nodeid is used. As
a result, the leaving node's nodeid is not added to the local fail list
(my_faillist), so the node is unable to reach consensus until the token
timeout, which starts a new gather process.
The solution is to send the valid nodeid of the leaving node in
system_from.nodeid and to handle the specific case for
memb_consensus_set in memb_join_process.
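The relationship between a join message and a leave message described
above can be sketched as follows. This is only an illustration, not the
totemsrp code: struct join_lists, MAX_NODES and make_leave_lists are
hypothetical names, and the special header nodeid is left out.

```c
#include <stddef.h>

#define MAX_NODES 16	/* hypothetical limit, for this sketch only */

/* Hypothetical, simplified view of the two lists carried by a join message. */
struct join_lists {
	unsigned int proc_list[MAX_NODES];
	size_t proc_entries;
	unsigned int fail_list[MAX_NODES];
	size_t fail_entries;
};

/*
 * Turn the lists of a join message into those of a leave message:
 * drop the leaving node from proc_list and add it to fail_list.
 * Returns 0 on success, -1 if the fail list is already full.
 */
int make_leave_lists(struct join_lists *l, unsigned int leaver)
{
	size_t i;
	size_t out = 0;

	/* Compact proc_list in place, skipping the leaving node. */
	for (i = 0; i < l->proc_entries; i++) {
		if (l->proc_list[i] != leaver) {
			l->proc_list[out++] = l->proc_list[i];
		}
	}
	l->proc_entries = out;

	/* Nothing to do if the node is already listed as failed. */
	for (i = 0; i < l->fail_entries; i++) {
		if (l->fail_list[i] == leaver) {
			return 0;
		}
	}
	if (l->fail_entries >= MAX_NODES) {
		return -1;
	}
	l->fail_list[l->fail_entries++] = leaver;
	return 0;
}
```

A receiver that compares nodes only by nodeid needs the real nodeid of
the leaver to do the equivalent bookkeeping on its own fail list.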
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
This information is useful, and at trace log level it should not be
too irritating.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
The "totem: Use nodeid ONLY in srp_addr" patch caused a regression in
the srp_addr_compare function. This function should be usable with
qsort, so it should return a value less than, equal to, or greater than
zero. However, it was returning only zero or the negation of zero. As a
result, nodes were unable to reach consensus in the following test case:
- 3 node cluster
- start nodes 1, 2, 3
- shutdown node 3
- start node 3
- shutdown node 2
- start node 2
- shutdown node 1
After these steps, nodes 2 and 3 were unable to reach consensus.
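For reference, a qsort-compatible three-way comparison on a nodeid can
be sketched like this. struct node_addr is a hypothetical stand-in for
srp_addr, and the function names are made up for the example; this is
not the code from the patch.

```c
#include <stdlib.h>

/* Hypothetical stand-in for srp_addr: only the nodeid matters here. */
struct node_addr {
	unsigned int nodeid;
};

/*
 * Three-way comparison as qsort expects: negative, zero or positive.
 * Explicit comparisons are used instead of subtraction, because the
 * difference of two unsigned values may not fit into an int.
 */
int node_addr_compare(const void *a, const void *b)
{
	const struct node_addr *na = a;
	const struct node_addr *nb = b;

	if (na->nodeid < nb->nodeid) {
		return -1;
	}
	if (na->nodeid > nb->nodeid) {
		return 1;
	}
	return 0;
}

/* Sort a list of addresses by nodeid. */
void node_addr_sort(struct node_addr *list, size_t entries)
{
	qsort(list, entries, sizeof(*list), node_addr_compare);
}
```

A comparator that only distinguishes "equal" from "not equal" gives
qsort no ordering, so two nodes can end up with differently ordered
lists for the same membership.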
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
If the number of proc_list, failed_list, or active members is too high,
it may be impossible to fit them into the message, which is allocated on
the stack, and this results in stack corruption.
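The kind of check that avoids this can be sketched as below.
MSG_BUF_SIZE, HEADER_SIZE and lists_fit_in_msg are hypothetical and
chosen only for illustration; the real message layout and limits differ.

```c
#include <stddef.h>

#define MSG_BUF_SIZE 1024	/* hypothetical size of the on-stack buffer */
#define HEADER_SIZE 32		/* hypothetical size of the fixed header */

/*
 * Return 1 if proc_entries + fail_entries list items of entry_size
 * bytes fit into the buffer after the header, 0 otherwise. Written so
 * that the sum of the two counts can never overflow.
 */
int lists_fit_in_msg(size_t proc_entries, size_t fail_entries,
	size_t entry_size)
{
	size_t avail = MSG_BUF_SIZE - HEADER_SIZE;
	size_t max_entries;

	if (entry_size == 0) {
		return 0;
	}
	max_entries = avail / entry_size;
	if (proc_entries > max_entries) {
		return 0;
	}
	return fail_entries <= max_entries - proc_entries;
}
```

The caller would refuse to build the message (and log the condition)
when the function returns 0, instead of writing past the buffer.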
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Sanity checks are used to prevent crashes caused by
accessing unallocated memory.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
To make it easier to find the source of incompatible messages, the IP of
the sender is logged. Propagating the IP through the layers makes the
patch slightly larger.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
A magic number (0xC070), together with the version in every packet,
is used to detect that the other node is really running
Corosync 3.x.
The endian_detector field is removed and the magic number is now
used instead.
If the magic number of a received packet differs, guessing is used to
show more about the source (detection of Corosync 2.3+ and 2.2 is quite
reliable, Knet and unencrypted Corosync 2.1/2.0/1.x/OpenAIS are
semi-reliable, and encrypted Corosync 2.1/2.0/1.x/OpenAIS is quite
unreliable).
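How a magic number can double as an endian detector can be sketched as
follows. Only the value 0xC070 comes from this commit message; the
16-bit width, the enum and the function names are assumptions made for
the example, and the guessing of older versions is not shown.

```c
#include <stdint.h>

#define SKETCH_MAGIC 0xC070u	/* value taken from the commit message */

/* Hypothetical classification of a received magic field. */
enum magic_result {
	MAGIC_NATIVE,	/* sender uses our byte order */
	MAGIC_SWAPPED,	/* sender uses the opposite byte order */
	MAGIC_UNKNOWN	/* not a packet we recognize */
};

static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v >> 8) | (v << 8));
}

/* Classify the magic field exactly as it was read from the wire. */
enum magic_result classify_magic(uint16_t magic_as_read)
{
	if (magic_as_read == SKETCH_MAGIC) {
		return MAGIC_NATIVE;
	}
	if (swap16(magic_as_read) == SKETCH_MAGIC) {
		return MAGIC_SWAPPED;
	}
	return MAGIC_UNKNOWN;
}
```

Because 0xC070 is not the same value when its two bytes are swapped, a
single field can signal both "this is the expected protocol" and "the
sender has the other byte order".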
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Even if it's not used for anything else.
Also, make cfgtool show the correct link ID when links are not
contiguous.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Fix crash introduced a couple of commits ago in iface_get
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This shrinks the srp_addr (and consequently every packet sent by
corosync) so that instead of containing loads of IP addresses to
identify a node, it just sends the nodeid.
This then allows us to make ring0 optional and replaceable when running
knet.
It also means that we need some other way of identifying the local
node in corosync.conf, so the nodelist.node.name entry is now mandatory
and is mapped to the local host using the same algorithm as used in
cman.
This code needs LOTS of testing as it touches a huge amount of totemsrp
and totemconfig.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
In my enthusiasm for removing code while integrating knet, I
also deleted the correct code for returning the IP address of a node,
so that only the IP address of the local node was ever returned.
This commit restores the previous code.
Also, because we now always return INTERFACE_MAX interfaces (they don't
have to be contiguous), set ss_family to zero if an interface is not
in use so that downstream apps know and don't display a lot of 0.0.0.0
addresses.
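From the point of view of a downstream application, the ss_family
convention can be sketched like this. SKETCH_INTERFACE_MAX and both
function names are hypothetical; only the "ss_family == 0 means unused"
rule comes from the text above.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#define SKETCH_INTERFACE_MAX 8	/* hypothetical value for this sketch */

/* Mark a slot as unused: a zeroed sockaddr_storage has ss_family == 0. */
void mark_interface_unused(struct sockaddr_storage *iface)
{
	memset(iface, 0, sizeof(*iface));
}

/*
 * Count the slots that are in use. A caller printing addresses would
 * skip exactly the slots this loop skips, rather than showing 0.0.0.0.
 */
size_t count_used_interfaces(const struct sockaddr_storage *ifaces,
	size_t n)
{
	size_t i;
	size_t used = 0;

	for (i = 0; i < n; i++) {
		if (ifaces[i].ss_family != 0) {
			used++;
		}
	}
	return used;
}
```

Since used links do not have to be contiguous, the caller has to walk
all slots rather than stop at the first unused one.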
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Now that we are using knet, it's possible to dynamically add, remove and
reconfigure links on the fly.
Also print 'n' for non-existent knet links. This will show up
only on loopback links >0, but it looks better than 'status ='.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
RRP doesn't exist any more so all the ring re-enable code is redundant.
I've removed it from the library and all the code that does anything,
but I've left the hole in the IPC just in case old libraries are
hanging around.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This is a big update that removes RRP & MRP from the codebase
and makes knet the default transport for corosync. UDP & UDPU
are still (currently) supported but are deprecated. Also, crypto
and multiple interfaces are only supported over knet.
To compile this codebase you will need to install libknet from
https://github.com/fabbione/kronosnet
The corosync.conf(5) man page has been updated with info on the new
options. Older config files should still work, but many options
have changed because of the knet implementation, so configs should
be checked carefully. In particular, any cluster using RRP
over UDP or UDPU will not start, as RRP is no longer present. If you
need multiple interface support then you should be using the knet transport.
Knet brings many benefits to the corosync codebase: it provides support
for more interfaces than RRP (up to 8), will be more reliable in the event
of network outages, and allows dynamic reconfiguration of interfaces.
It also fixes the ifup/ifdown and 127.0.0.1 binding problems that have
plagued corosync/openais from day 1.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
This patch from Hideo Yamauchi improves the logging of
whether nodes leave the cluster cleanly or uncleanly,
making it easier to determine if a node was shut down
by the operator. There is also the possibility that a
LEAVE message could get missed (due to the node being
in flush state), so this can also make that clearer.
The modifications are as follows.
Change 1) I added a list to totemsrp that tracks LEAVE nodes.
Change 2) I added registration, lookup, and clearing of LEAVE
nodes.
Change 3) I added output to the log.
Change 4) I changed the output level of a log message.
Signed-off-by: Hideo Yamauchi <renayama19661014@ybb.ne.jp>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
In active rrp mode, commit tokens are treated as mcast data messages,
so rrp delivers them directly to the srp layer via active_mcast_recv().
As a result, srp receives duplicated commit tokens from different
heartbeat links. If the node is in recovery state and has already
sent out the initial orf token, those duplicated commit tokens will
cause message_handler_memb_commit_token() to send the initial orf token
again! This is wrong because it resets the orf token content in
instance->orf_token_retransmit, which breaks the token retransmission
state.
Furthermore, sending those initial orf tokens again and again
may lead active_token_recv() to drop some subsequent orf tokens.
That would be OK for rrp, because srp will do token retransmission,
but as said above, the srp retransmission state has already been broken,
so we finally hit a "token lost in recovery state" condition caused
by software. If the token timeout value is large, it will take a long
time to create a new ring.
This can be reproduced with two nodes set to active rrp mode and
two heartbeat links. With one node always on, let the other one
stop and start again and again. The probability of reproducing it is low.
In theory, I think, the more heartbeat links are used, the more easily it
can be reproduced.
This problem can be resolved by letting
message_handler_memb_commit_token() ignore duplicated commit tokens
in recovery state if the node (the ring representative) has already sent
out the initial orf token.
Unlike the previous take, this version does not depend on stored token
data but uses originated_orf_token in totemsrp_instance to remember
whether the initial orf token has already been originated for the
current membership.
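The duplicate filter can be sketched as a small state machine. Only the
flag name originated_orf_token comes from the text above; the struct,
the enum and the functions are hypothetical simplifications, and the
real handler does considerably more work.

```c
#include <stdbool.h>

/* Hypothetical, reduced set of membership states. */
enum sketch_state {
	SKETCH_GATHER,
	SKETCH_RECOVERY
};

/* Hypothetical, reduced instance: just what the duplicate filter needs. */
struct sketch_instance {
	enum sketch_state state;
	bool originated_orf_token;	/* set once per membership */
	unsigned int tokens_originated;	/* demo counter only */
};

/* Entering recovery for a new membership re-arms the filter. */
void sketch_enter_recovery(struct sketch_instance *inst)
{
	inst->state = SKETCH_RECOVERY;
	inst->originated_orf_token = false;
}

/*
 * Handle a commit token on the ring representative. Returns true if
 * this call originated the initial orf token, false if the commit
 * token was ignored (wrong state, not the representative, or a
 * duplicate arriving over another link).
 */
bool sketch_on_commit_token(struct sketch_instance *inst, bool is_ring_rep)
{
	if (inst->state != SKETCH_RECOVERY || !is_ring_rep) {
		return false;
	}
	if (inst->originated_orf_token) {
		return false;	/* duplicate: keep retransmit state intact */
	}
	inst->originated_orf_token = true;
	inst->tokens_originated++;
	return true;
}
```

Keeping the flag in the instance means the decision does not depend on
the content of any stored token.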
Signed-off-by: Jason <huzhijiang@gmail.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Services are informed about membership changes, but if the same
information is needed inside totemrrp or totemnet, there is no way to
obtain it.
This patch makes that possible, for now only for RRP and with empty
callbacks.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When there is no other activity on the ring but only retransmission, and
the token is in hold mode, retransmission becomes slow. Moreover,
if retransmission always fails but token rotation works well, then
it takes quite a long time
(fail_to_recv_const * token_hold = 2500 * 180ms = 450sec) for the
retransmit requester to meet the "FAILED TO RECEIVE" condition and
re-construct a new ring.
This problem can be solved by checking whether retransmits are present
before going into hold. If a node is the retransmit requester or
the resender, it sets my_token_held to 0 to speed up retransmission
and omits further unnecessary sending of the token_hold_cancel signal.
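Both the worst-case figure and the hold decision can be written down in
a few lines. The function names are hypothetical; the constants 2500
and 180 ms are the ones quoted above.

```c
#include <stdbool.h>

/*
 * Worst-case time, in milliseconds, before the "FAILED TO RECEIVE"
 * condition triggers when every token rotation is slowed by the hold.
 */
unsigned long sketch_worst_case_ms(unsigned long fail_to_recv_const,
	unsigned long token_hold_ms)
{
	return fail_to_recv_const * token_hold_ms;
}

/*
 * Decide whether the token may be held: only when the ring is idle
 * and no retransmit is pending on this node, either as requester or
 * as resender.
 */
bool sketch_may_hold_token(bool ring_idle, bool retransmits_pending)
{
	return ring_idle && !retransmits_pending;
}
```

With the values above, sketch_worst_case_ms(2500, 180) is 450000 ms,
which is the 450 seconds mentioned in the description.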
Signed-off-by: Jason HU <huzhijiang@gmail.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
The functions for storing and loading the ring id were in the totem
library. This raises the problem of what to do when it's impossible to
load or store the ring id. The easy solution seemed to be an assert, but
sadly this makes it hard for the user to find out what happened (because
corosync was just aborted and logsys didn't flush).
The solution is to move these functions to main.c, where it is much
easier to handle errors. This also makes libtotem free of any file
system operations.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
The run dir (LOCALSTATEDIR/lib/corosync) was hardcoded throughout the
codebase. Totemsrp was trying to create it and chdir into it, but it
also took the COROSYNC_RUN_DIR environment variable into account,
creating an inconsistency.
get_run_dir correctly returns COROSYNC_RUN_DIR (when set) or
LOCALSTATEDIR/lib/corosync. This is now used by all functions instead of
the hardcoded string.
All occurrences of mkdir/chdir are removed from totemsrp, and chdir is
now called in the main function. The mkdir call is completely removed,
because it was not used anyway (the check in main.c was called before
totemsrp init, so mkdir was never called), and also because make install
and/or the packaging system should take care of creating this directory
with the correct permissions/context.
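The lookup order described for get_run_dir can be sketched as below.
This is not the actual implementation: the sketch_ names are
hypothetical, and the fallback definition of LOCALSTATEDIR is an
assumption that stands in for the value normally supplied by the build
system.

```c
#include <stdlib.h>

#ifndef LOCALSTATEDIR
#define LOCALSTATEDIR "/var"	/* assumption: normally set at build time */
#endif

/*
 * Pick the run dir from the value of the environment variable (NULL
 * when it is not set). Split out so the rule can be tested without
 * touching the process environment.
 */
const char *sketch_run_dir_from_env(const char *env_value)
{
	if (env_value != NULL) {
		return env_value;
	}
	return LOCALSTATEDIR "/lib/corosync";
}

/* Single place that every caller asks for the run dir. */
const char *sketch_get_run_dir(void)
{
	return sketch_run_dir_from_env(getenv("COROSYNC_RUN_DIR"));
}
```

Routing every caller through one function is what removes the
inconsistency between the hardcoded path and the environment variable.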
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Patch f3ffd3da5c introduced named states
for the state machine, but sadly it contains a logic problem that causes
stats.continuous_gather to increase even when it shouldn't. The problem
is not critical, because continuous_gather is set to 0 on successful
membership creation.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
According to the totem paper, if a processor
receives a join message in the operational state and the
receiver's identifier is in the join message's fail list,
then the join message should be ignored.
By applying this validation of join messages, we can avoid unnecessary
switching from operational state to gather state (which can even lead
to rings that cannot be merged), as in the following scenario.
1. Initially, there is only one ring containing three nodes, say
ring(A,B,C).
2. The network partitions A from B and, "at the same time", C goes down.
3. Node A sends a join message with proclist:A,B,C. faillist:NULL.
Node B sends a join message with proclist:A,B,C. faillist:NULL.
4. Both A and B hit the consensus timeout due to the network partition.
5. The networks of A and B remerge.
6. Node A sends a join message with proclist:A,B,C. faillist:B,C. and
creates ring(A).
Node B sends a join message with proclist:A,B,C. faillist:A,C. and
creates ring(B).
7. Say the join message with proclist:A,B,C. faillist:A,C sent
by node B is received by node A because the network remerged.
8. Node A shifts to gather state and sends out a modified join message
with proclist:A,B,C. faillist:B. Such a join message will prevent
both A and B from merging.
9. Node A hits the consensus timeout (caused by waiting for node C) and
sends a join message with proclist:A,B,C. faillist:B,C again.
The same thing happens on node B, so A and B will loop forever
through steps 7, 8 and 9.
The paper also says: "If a processor receives a join message in the
operational state and if the sender's identifier is in the receiver's
my_proclist and the join message's ring_seq is less than the receiver's
ring sequence number, then it ignores the join message too." So this
patch applies both of these validations of join messages together.
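The two rules quoted from the paper can be sketched as a single
predicate. The structs and names here are hypothetical and much simpler
than the real join message and instance; only the two conditions are
taken from the text.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the fields of a received join message. */
struct sketch_join {
	unsigned int sender;
	unsigned long long ring_seq;
	const unsigned int *fail_list;
	size_t fail_entries;
};

/* Hypothetical view of the receiving node in operational state. */
struct sketch_receiver {
	unsigned int my_id;
	unsigned long long my_ring_seq;
	const unsigned int *my_proc_list;
	size_t my_proc_entries;
};

static bool sketch_contains(const unsigned int *list, size_t n,
	unsigned int id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (list[i] == id) {
			return true;
		}
	}
	return false;
}

/* Return true if an operational receiver should ignore the join. */
bool sketch_ignore_join(const struct sketch_receiver *rx,
	const struct sketch_join *join)
{
	/* Rule 1: the receiver is in the join message's fail list. */
	if (sketch_contains(join->fail_list, join->fail_entries, rx->my_id)) {
		return true;
	}
	/* Rule 2: known sender, and the join is from an older ring. */
	if (sketch_contains(rx->my_proc_list, rx->my_proc_entries,
	    join->sender) && join->ring_seq < rx->my_ring_seq) {
		return true;
	}
	return false;
}
```

In step 7 of the scenario, node A receives a join whose fail list
contains A, so rule 1 applies and A stays in operational state.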
Signed-off-by: Jason <huzhijiang@gmail.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
The reason why memb_state_gather_enter is invoked was printed
as an integer code. This patch introduces human-readable English
messages for the codes.
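A common way to map such codes to text is a lookup table indexed by the
code. The enum values and strings below are invented for illustration
and are not the reasons or wording used in corosync.

```c
#include <stddef.h>

/* Hypothetical reason codes; the real list is longer and named differently. */
enum sketch_gather_reason {
	SKETCH_GATHER_TOKEN_LOSS,
	SKETCH_GATHER_CONSENSUS_TIMEOUT,
	SKETCH_GATHER_JOIN_MESSAGE,
	SKETCH_GATHER_REASON_COUNT
};

/* One entry per code; designated initializers keep both lists in step. */
static const char *const sketch_gather_text[SKETCH_GATHER_REASON_COUNT] = {
	[SKETCH_GATHER_TOKEN_LOSS] = "token loss",
	[SKETCH_GATHER_CONSENSUS_TIMEOUT] = "consensus timeout",
	[SKETCH_GATHER_JOIN_MESSAGE] = "join message received"
};

/* Translate a code for logging; out-of-range values get a fallback. */
const char *sketch_gather_reason_str(int reason)
{
	if (reason < 0 || reason >= SKETCH_GATHER_REASON_COUNT) {
		return "unknown reason";
	}
	return sketch_gather_text[reason];
}
```

Logging both the text and the number keeps old log searches by code
working.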
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
vague and unhelpful. People have to look for the following quorum
message and try to deduce which nodes have joined or left from that
and past membership messages, even though the routine printing the
message already has this information to hand.
This patch fixes that message so that it prints the nodeids of the nodes
that have joined/left the cluster.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-By: Jan Friesse <jfriesse@redhat.com>
The patch adding support for waiting_trans_ack may fail if
synchronization happens between deliveries of the parts of a fragmented
message. In such a situation, the fragmentation layer is waiting for a
message with the correct number, but it will never arrive.
The solution is to handle (via callback) the change of waiting_trans_ack
and use a different queue.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
If instance->memb_state is not OPERATION or RECOVERY, we were passing NULL
to the cs_queue_used call.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
This patch creates a special message queue for synchronization messages.
This prevents a situation in which messages are queued in the
new_message_queue but have not yet been originated from corrupting the
synchronization process.
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
If failed_to_recv is set (the node detects that it is not able to
receive messages), we can end up with an assert, because my_failed_list
and my_member_list are the same list. This happens because we do not
follow the specification and allow a node to mark itself as failed.
Because a single-node membership is created (ignoring both the fail
list and member_list) when failed_to_recv is set and we have reached
consensus across nodes, we can skip the assert.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Messages which are flow messages, rather than lifecycle messages, are
now logged at trace level.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Nothing in the corosync codebase references this structure.
Signed-off-by: Tim Beale <tlbeale@gmail.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This should help correlate syslog entries with their blackbox
counterparts.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Andrew Beekhof <andrew@beekhof.net>
The commit which added the number of addresses to the srp_address
structure didn't account for totemsrp_ifaces_get, where the whole
structure was copied instead of only the addresses. This is now fixed.
Also, to make the totempg API forward compatible, the size of the
interfaces array must be passed to ifaces_get-like functions to prevent
memory overwrites.
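Passing the size of the caller's array can be sketched as follows.
struct sketch_iface and sketch_ifaces_get are hypothetical and do not
reproduce the real totempg signature; the point is only that the copy
is bounded by what the caller says it can hold.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical address record; the real type is different. */
struct sketch_iface {
	unsigned char bytes[16];
};

/*
 * Copy at most out_size addresses into out and report how many were
 * copied through *iface_count. Returns 0 on success, or -1 when the
 * source had more entries than the caller's array can hold.
 */
int sketch_ifaces_get(const struct sketch_iface *src, size_t src_count,
	struct sketch_iface *out, size_t out_size, size_t *iface_count)
{
	size_t n = (src_count < out_size) ? src_count : out_size;

	memcpy(out, src, n * sizeof(*out));
	*iface_count = n;
	return (src_count > out_size) ? -1 : 0;
}
```

A caller built against an older, smaller array size then gets a
truncated but safe result if the number of interfaces grows later.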
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
This should allow a future change to a dynamic number of rings without
breaking wire compatibility.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Also, a few leftovers from cfg are removed and the version of totempg is
increased to 5 to reflect all the changes we made.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Tomcrypt in corosync has not been updated for a long time. Because we
have support for libnss, libtomcrypt can be removed.
Also, a few leftovers (AES is 256 bits, not 128, ...) are removed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
These look ugly, are inconsistently done and just have
to be removed later in libqb before calling syslog.
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>