Fix an 'error: success' stype message by propogating error_string
back down the stack.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Previously reload of configuration with enabled wait_for_all result in
set of wait_for_all_status which set cluster_is_quorate to 0 but didn't
inform the quorum service so votequorum and quorum information may get
out of sync.
Example is 1 node cluster, which is extended to 3 nodes. Quorum service
reports cluster as a quorate (incorrect) and votequorum as not-quorate
(correct). Similar behavior happens when extending cluster in general,
but some configurations are less incorrect (3->4).
Discussed solution was to inform quorum service but that would mean
every reload would cause loss of quorum until all nodes would be seen
again.
Such behaviour is consistent but seems to be a bit too strict.
Proposed solution sets wait_for_all_status only on startup and
doesn't touch it during reload.
This solution fulfills requirement of "cluster will be quorate for
the first time only after all nodes have been visible at least
once at the same time." because node clears wait_for_all_status only
after it sees all other nodes or joins cluster which is quorate. It also
solves problem with extending cluster, because when cluster becomes
unquorate (1->3) wait_for_all_status is set.
Added assert is only for ensure that I haven't missed any case when
quorate cluster may become unquorate.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Previously value of new expected_votes was checked so newly computed
quorum value was in the interval <total_votes / 2, total_votes>. The
upper range prevented the cluster to become unquorate, but bottom check
was almost useless because it allowed to change expected_votes so it is
smaller than total_votes.
Solution is to check if expected_votes is bigger or equal to total_votes
and for quorate cluster only check if cluster doesn't become unquorate
(for unquorate cluster one can set upper range freely - as it is
perfectly possible when using config file)
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When 2Node mode is set, WFA is also set unless WFA is configured
explicitly. This behavior was not reflected on runtime change, so
restarted corosync behavior was different (WFA not set). Also when
cluster is reduced from 3 nodes to 2 nodes during runtime, WFA was not
set, what may result in two quorate partitions.
Solution is to set WFA depending on 2Node when WFA
is not explicitly configured.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Express intention to ignore icmap_get_* return
value and rely on default behavior of not changing the output
parameter on error.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Some functions allocated memory on stack without clearing memory and
then send them on wire. This is not an issue, but valgrind reports this
as a problem so it is easy to miss real problem then.
Solution is to clear stack memory.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Previously node id was logged ether as a %d (most often), %u, %x or
PRI.32 and ring id ether as %lld, %llx with various separators (., :, /)
between rep nodeid and seq. This seems to cause confusion.
This patch adds macros CS_PRI_NODE_ID, CS_PRI_RING_ID and
CS_PRI_RING_ID_SEQ (CS prefix = corosync, PRI modeled in spirit of
inttypes.h PRIx32) and makes code use them.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
system.run_dir was a little bit unfortunate and confusing name. Rename
to state_dir makes more evident what is content of this directory. To
keep setting consistent with code, get_run_dir is changed to
get_state_dir.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
The conversion to the new srp_addr format broke the feature where
UDP/UDPU nodes could get their nodeids generated from the IP address.
A big part of this was the removal of mandatory ring0_addr - it was used
as a placeholder when reading down the nodelist. I replaced this with
nodeid thinking that nodeid was now mandatory, forgetting this use case.
So the compare on "ring0_addr" or "nodeid" is now replaced with a more
robust check that we're only reading keys from the same node_pos once,
this was needed in votequorum.c as well as totemconfig.c
Another tidying side-effect of this patch is that the nodeid generation
is now all in a single routine in totemconfig.c and not shared between
it and totemip.c.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This shrinks the srp_addr (and consequently every packet sent by
corosync) so that instead of containing loads of IP addresses to
identify a node, it just sends the nodeid.
This then allows us to make ring0 optional and replaceable when running
knet.
It also means that we need some other way of identifying the local
node in corosync.conf, so the nodelist.node.name entry is now mandatory
and is mapped to the local host using the same algorithm as used in
cman.
This code needs LOTS of testing as it touches a huge amount of totemsrp
and totemconfig.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When the cluster changes from even sized to odd sized corosync
disables auto-tie-breaker if wait_for_all is not enabled.
However when changing from odd sized to even sized it doesn't reenable
it, causing auto_tie_breaker to be inconsistent across the cluster:
the newly added node and any nodes that restart corosync
will have it, but all the previously running nodes won't.
Signed-off-by: Edwin Torok <edvin.torok@citrix.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
If votequorum_exec_send_reconfigure() returns an error (ie the
packet could not be sent) then we should either return it to the
sender (for a library call) or, for an internal call, log it.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
As we now have update_node_expected_votes(), we can use that
when receiving a new EXPECTED_VOTES value from another node
rather than having our own loop.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
If expected_votes was set via the library but the calculation
decides it's too high, then an error is correctly returned but
the value is still set in the nodes' expected_votes field and
turns up in the corosync-quorumtool display.
This patch separates out the quorum calculation from the updating
of expected_votes per node to prevent this from happening.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Revert patch 9f54f0a1fad7dad42c55562a50dfb9d773e6a660 as it causes
more troubles than it solves. Code that uses the quorum nodelist
to get a list of actual nodes in the cluster for communication
break using this as well as the display from corosync-quorumtool
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
We were looking for us in other node lists, rather than
others in our nodelist.
Also, remove debug print in votequorum.c
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
This patch tidies the two state change callbacks and explains them
in the man page:
The difference between votequorum_nodelist_notification_t and
votequorum_quorum_notification_t is subtle but important.
The 'nodelist' callback is sent at the start of a cluster state
transition and contains the new ring_id and only the list of
nodes that are included in the sync state - ie only active nodes. No
quorum information is included this callback because it is not
available at that time.
The 'quorum' callback is sent after the cluster state transition has
completed and does contain quorum information.
In addition, the nodelist contains a list of all nodes known to
votequorum (whether up or down) and their state as well
as information about the quorum device attached (if any). quorum
callbacks will not be sent for qdevice up and down
events unless they affect quorum.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
This split is needed for qdevice, so that it gets the ring_id and
nodelist as part of the sync process and not afterwards - when quorum
has been calculated.
As this is and unsupported API I'm not too worried about breaking
existing code - all the clients I know of are using the quorum API
anyway as they should be.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
There were several places where defaults were not restored
if the keys were removed from corosync.conf and the file reloaded.
This patch adds those back so that reloading corosync.conf
has the expected effect when keys are deleted.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
If we don't have it, fall back to fsync
Fixes the build on FreeBSD
Signed-off-by: Ruben Kerkhof <ruben@rubenkerkhof.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This patch aligns the votequorum callbacks so that they are
the same as the quorum ones. Previously it was quite common
for votequorum to send one callback for every node in the cluster
when a single new node joined (because it sent one for every
nodeinfo message it received).
This new system makes much more sense in itself and being
consistent with the internal quorum is also an advantage!
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
auto_tie_breaker can behave incorrectly in the case of a cluster
with an odd number of nodes. It's possible for a partition to
have quorum while the other side has the ATB node, and both will
continue working. (Of course in a properly configured cluster one side
will be fenced but that becomes an indeterminate race .. just what ATB
is supposed to avoid).
This patch prevents ATB from running in a partition if the 'other'
partition might have quorum, and also mandates the use of wait_for_all
in clusters with an odd number of nodes so that a quorate partition
cannot start services or fence an existing partition with the tie
breaker node.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
If quorum_trackstart() or votequorum_trackstart() are called twice with
CS_TRACK_CHANGES then the client gets added twice to the notifications
list effectively corrupting it. Users have reported segfaults in
corosync when they did this (by mistake!).
As there's already a tracking_enabled flag in the private-data, we check
that before adding to the list again and return an error if
the process is already registered.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
The two_node and auto_tie_breaker options are incompatible as they
specify conflicting methods of determining the quorate half of a cluster
partition.
This patch detects this error in corosync.conf, issues a message and
disables two_node if auto_tie_breaker is present.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
The default for auto_tie_breaker should be 'lowest' - which is what it
was before the extended ATB functionality of auto_tie_breaker_node was
added, and what the documentation states.
However this was broken so that if auto_tie_breaker_node was not
specified then auto_tie_breaker itself was ignored. This patch fixes
that.
It also fixes a typo in a comment.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
It's possible in a two_node cluster (and others but it's more likely
with just two) that a node could be booted up after downtime or failure
and the other node is not available for some reason. In this case it
would not be allowed to proceed because wait_for_all is enforced.
This patch provides a cmap key to clear this flag in the desperate
situation where that becomes necessary. It should only be used with
extreme caution and will be wrapped up in pcs which should also check
that fencing has been run.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Configuration option quorum.device.sync_timeout is available for setting
qdevice poll timeout for synchronization phase. Default value is 30
sec.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
If qdevice is registered a alive, corosync waits in sync phase until
timeout expires or qdevice votes with correct nodeid parameter.
This gives qdevice time to decide to vote or not undisturbed and without
time hazard.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
If votequorum service receives incorrect (not current) ringid, call is
ignored and CS_ERR_MESSAGE_ERROR is returned.
This and previous commits makes incompatible changes in votequorum
API/ABI, so library version is increased.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Returning ring id will be used in poll function.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
During reload, local_node_pos is deleted and reinstation is handled in
totemconfig after reload is finished. votequorum handles this events and
tries to reload it's configuration. This led to logging a little scary
messages (even nothing bad is happening, because after local_node_pos
reinstation everything back to normal).
Solution is to stop processing events during reload. Sadly, simple
tracking of config.reload_in_progress doesn't work because LibQB events
triggering order is undefined so votequorum reload handler can be called
before totemconfig (and before local_node_pos is reinstatied).
So new config.totemconfig_reload_in_progress key is defined with very
similar semanthic as config.reload_in_progress but set inside
totem_reload_notify function. Votequorum then use this new key.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Run dir (LOCALSTATEDIR/lib/corosync) was hardcoded thru whole codebase.
Totemsrp was trying to create and chdir into it, but also
takes into account environment variable COROSYNC_RUN_DIR creating
inconsistency.
get_run_dir correctly returns COROSYNC_RUN_DIR (when set) or
LOCALSTATEDIR/lib/corosync. This is now used by all functions instead of
hardcoded string.
All occurrences of mkdir/chdir are removed from totemsrp and chdir is
now called in main function. Mkdir call is completely removed, because
it was not used anyway (check in main.c was called before totemsrp init,
so mkdir was never called) and also make install and/or package system
should take care of creating this directory with correct
permissions/context.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
icmap_get_* behavior is to NOT modify passed variable when it doesn't
success. So we must initialize variable before icmap_get_* call.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
This patch adds more flexibility to the auto_tie_breaker feature of
votequorum. With this, not only can the lowest nodeid be used as
a tie breaker, but also the highest, or a node from a nominated list.
If there is a list of nodes, the first node in the list that was not
part of the previous partition is used. This allows the user to
specify a preferred set of nodes but prevents a split-brain if the
cluster divides evenly with a node in each half.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This patch adds the option to store expected_votes to
persistent storage. This is needed to allow_downscale
to operate properly.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
drop all SOLARIS specific ifdefs and replace them with feature checks
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
there is really no point to have a per node view of (vote)quorum
since all the info are always there.
drop the -n option for status/display nodes and improve
the output to provide a full cluster view at any given time.
Old format:
[root@fedora-master-node2 ~]# corosync-quorumtool -s
Quorum information
------------------
Date: Mon Aug 6 10:22:27 2012
Quorum provider: corosync_votequorum
Nodes: 2
Ring ID: 8
Quorate: Yes
Votequorum information
----------------------
Node ID: 3254954176
Node state: Member
Node votes: 1
Qdevice votes: 1
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Name
3238176960 1 fedora-master-node1.int.fabbione.net
3254954176 1 fedora-master-node2.int.fabbione.net
0 1 QDEVICE (Alive/Voting/NoMasterWins)
New format:
[root@fedora-master-node1 tools]# ./corosync-quorumtool -s
Quorum information
------------------
Date: Mon Aug 6 15:50:03 2012
Quorum provider: corosync_votequorum
Nodes: 2
Ring ID: 48
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Qdevice Name
3238176960 1 A,V,MW fedora-master-node1.int.fabbione.net
3254954176 1 NR fedora-master-node2.int.fabbione.net
0 1 QDEVICE
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>