this change breaks onwire compatibility.
cpg is the only user of sync_* interface and it's the only
service that will require extra testing.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
this should guarantee that votequorum won't fail under high memory
pressure. Price is 3500 bytes extra preallocated at startup.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
stop using malloc for each new node, because we cannot free the memory
easily. Move to a static allocated buffer that can contain
PROCESSOR_MAX + qdevice cluster_node instead.
We can never have more than PROCESSOR_MAX nodes anyway and the memory
footprint is small enough compared to memory leaks (those can
effectively happen only in very dynamic clusters with tons of different
nodes joining/leaveing with different nodeids).
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
pointed out that leave_remove can be easily confused with the old
cman leave_remove behavior. The two are substantially different
and we need to avoid confusion both for users and our support team.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
cmap changes are local to the node only and should not be broadcasted
as configuration changes.
if any change has happened to us, we will inform other nodes via
send_nodeinfo.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
this is mostly to avoid valgrind errors on exit and make the output
more readable.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
simply taking the safest path here since integration of qdevice is not
fully complete
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
no functional changes or extra features yet
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
The problem here is that user expectations, when using both modes
at the same time, have not been set yet. There are 2/3 options
that need investigation.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
old_flags was set to uint16_t but it needs to be uint32_t.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
this is a regression introduced by cb5fd775
when reading static config us->flags does not exists yet and therefor
setting it will cause a segfault.
Move the settings after cluster_node *us is created, with the long
term plan to simply kill the whole _static readconfig bits
in favour of dynamic (runtime changeable) bits.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
qdevice is a very special node in the cluster and it adds a certain
amount of complexity and special cases across the code.
most of the qdevice data are shared across the cluster (name/votes)
but effectively each node has a different view of the qdevice
(registered/unregistered/voting/etc.)
with this change, we align the qdevice view across the node,
exchanging more data between nodes and we fix how qdevice behaves
and it is configured.
The only side effect is that the amount of data transmitted on wire
is slightly higher.
The qdevice API is still disabled by default. This means that
the amount of real changes in current code are a lot smaller
than it appears by this patch.
TODO: documentation/man pages needs to be updated once
this change is in (and behavior finalized).
User visible changes:
- configuration (coroparse, exec/votequorum):
the quorum device section is now standalone within the quorum.
quorum {
provider: corosync_votequorum
device {
model: (name)
timeout: (millisec)
votes:
}
}
the keyword "model:" is mandatory to enable qdevice in configuration
and should express the name of the script/daemon that will provide
the qdevice. Looking into the future, an init script or systemd
service will look for that name in /path/to/be/decided/name
and start/stop qdevice.
timeout: defines the maximum interval the qdevice implementation
has available between poll (see votequorum_qdevice_poll.3) before
the device is considered dead and votes discarded
votes: is now a configuration parameter and not an API call.
quorum devices don't care what they need to vote.
votes is autocalculated when a nodelist is available and all
nodes in the list vote 1. Otherwise this parameter is mandatory.
- configuration (exec/votequorum):
startup and runtime configuration changes have been improved.
errors at startup are considered fatal. errors at runtime
have different exit paths.
startup:
* quorum.two_node and qdevice are incompatible.
* quorum.expected_votes requires quorum.device.votes.
* quorum.expected_votes - quorum.device.votes cannot be lower
than 2.
* qdevice and last_man_standing are mutually exclusive.
* qdevice and auto_tie_breaker are mutually exclusive.
runtime config changes:
* quorum.two_node and qdevice are incompatible:
if quorum device is alive, two_node is disabled.
if quorum device is not alive and node count is 2, two_node is
enabled, and quorum device cannot be registered
* if either last_man_standing or auto_tie_breaker were enabled
at startup, and at runtime quorum device is configured,
quorum device registration will be blocked.
* if quorum.expected_votes is configured but not quorum.device.votes,
quorum device registration will be blocked.
* if quorum.device.votes is not configured and we cannot
automatically calculate it, quorum device registration will be blocked.
* An error in configuring quorum.expected_votes and quorum.device.votes
will block quorum device registration.
blocking quorum device registation, also means dropping the votes.
quorum.device.votes (either set or automatically calculated) is now
used to determine current expected_votes in the cluster.
- logging (exec/votequorum):
all errors from configuration are treated as WARNING/CRITICAL.
lots of extra DEBUG output is added (see internal changes too).
- corosync-quorumtool (tools/corosync-quorumtool):
* added option to forcefully kick out a quorum device from the local
node. This is for emergency recovery only and it is only
available when qdevice API is built-in.
* Improved status output, specifically add node state and qdevice
information
[root@fedora-master-node2 coro]# corosync-quorumtool -s
Version: 1.99.4.12-9c7d-dirty
Quorum type: corosync_votequorum
Nodes: 2
Ring ID: 132
Quorate: Yes
Node votes: 1
Node state: Member
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Nodeid Votes Name
1 1 fedora-master-node1.int.fabbione.net
2 1 fedora-master-node2.int.fabbione.net
0 1 QDEVICE (Voting)
* allow to print status for any node in the cluster known to
local node.
[root@fedora-master-node1 coro]# corosync-quorumtool -s
Version: 1.99.4.12-9c7d-dirty
Quorum type: corosync_votequorum
Nodes: 2
Ring ID: 144
Quorate: Yes
Node votes: 1
Node state: Member
Expected votes: 3
Highest expected: 3
Total votes: 2
Quorum: 2
Flags: Quorate
Nodeid Votes Name
1 1 fedora-master-node1.int.fabbione.net
2 1 fedora-master-node2.int.fabbione.net
[root@fedora-master-node1 coro]# corosync-quorumtool -s -n 2
Version: 1.99.4.12-9c7d-dirty
Quorum type: corosync_votequorum
Nodes: 2
Ring ID: 144
Quorate: Yes
Node votes: 1
Node state: Member
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Nodeid Votes Name
1 1 fedora-master-node1.int.fabbione.net
2 1 fedora-master-node2.int.fabbione.net
0 1 QDEVICE (Voting)
Internal changes:
- change qdevice timer to not run all time, but only when necessary.
- change votequorum_nodeinfo on wire data to use flags instead of uint8_t
and add QDEVICE status.
- allocate nodeid 0 to qdevice since it's the only real
nodeid that be reserved.
- change send_nodeinfo to allow to send nodeinfo for any node
so that we can share qdevice info across the cluster
(and this might be useful in future if we need to sync
internal cluster view).
- add votequorum api call to update qdevice name
- add runtime data if quorum device has been forcefully disabled
by config error
- add qdevice votes to expected_votes calculation (this
is probably the biggest difference vs cman)
- change votequorum_read_nodelist_configuration so that
we can autocalculate votes for qdevice (we need the nodecount
vs votes).
- add all checks for startup/runtime config (see above).
- do not make qdevice part of the membership_list received from
totem. None of our users care about it and it is not a real node.
- change onwire message handlers to deal with "data for this node from any node"
case and undersand nodeid 0 for qdevice info
- always allocate qdevice at startup. this simplifies code a lot.
- dispatch qdevice nodeinfo on membership changes.
- inform libvotequorum users when a qdevice is registered
- improve substantially qdevice api and add a simple
barrier based on qdevice name.
- add qdevice API barrier at cluster level. This feature allow
only one qdevice name to be active in the cluster at any time.
- qdevice getinfo can now report status for qdevice on any node.
- change slightly the way the qdevice API is built-in/out:
only the libvotequorum calls are #ifdef'out now. Doing so in
the core is too complex and would make the code unreadable
with the risk of missing a bit or two effectively introducing
an on-wire incompatibility if we will ever turn the API on.
- probably added some bugs on the way...
TODO: update qdevice_* API once the above is settled and test
qdevice integration with other features.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com> (only second part)
nodeid = 0 is a valide nodeid and node associated with it should
not be freed
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
votequorum internal quorum/expected_vote check was slightly too
conservative and was not done correctly when leave_remove feature
is enabled.
this fix allows admins to effectively override expected_votes
and drive ev_barrier as expected.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
specifically ev_barrier, two_node, lowest_node_id and wait_for_all_status
are values that change internally at runtime and keeping track
of those can make debugging rather easy, specially when LOG_DEBUG is not
set.
Also track our node id.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-By: Christine Caulfield <ccaulfie@redhat.com>
this also cleanup NODESTATE for good. JOINING was never used
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
internal flags were not propagated correctly in the node status
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
all main services are loaded at priority 1.
vfs_quorum and votequorum did not specify a priority and
automatically defaulting to 0, that has a special meaning
of being loaded last and unloaded last.
this is not correct behavior and limits what votequorum
can do at shutdown, for example notify other nodes that
it is leaving (something that cannot be gathered by
totem membership change callback).
fix vsf_quorum to load at priority 1 as the other
default services and bump votequorum to 2 (needs to
unload before everything else currently known).
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
code inspection shows that those internal flags are never used
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
exec_init_fn now either returns NULL (success) or a string which indicates
the error that occured during service engine initialization. If an error
occurs, corosync will exit. This patch adds ykd and makes other suggestions
from Fabio Di Nitto.
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Fabio Di Nitto <fdinitto@redhat.com>
a quorum device is not necessarely a disk and this also aligns
various names to be generic
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-By: Christine Caulfield <ccaulfie@redhat.com>
it is not correct to randomly accept expected_votes from any node in
the cluster. We can only allow expected_votes from quorate nodes.
A quorate cluster is "always" right and have the correct expected_votes.
One of the different bug triggers:
quorum {
expected_votes: 8
auto_tie_breaker: 1
last_man_standing: 1
}
start all 8 nodes.
clean shut down 2 nodes.
wait for lms to kick in.
kill 3 nodes with highest nodeid
(we want to retain a quorate partition of 3 nodes)
start one node again -> cluster will be unquorate
This happens because the node rebooting/rejoining with
non current cluster status will propagate an expected_votes of 8,
while in reality the cluster is down to expected_votes: 3.
4 nodes are still < 5 (quorum for 8 nodes/votes).
In order to avoid this condition, we need to exchange expected_votes
information among nodes but we cannot randomly trust everybody.
1) Allow expected_votes to be changed cluster-wide only if the
information is coming from a quorate node.
2) Fix node->expected_votes based on quorate status
3) allow a joining node to decrease quorum and expected_votes
if the node is not yet quorate, but it's joining a quorate
cluster
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
auto_tie_breaker requires to know the lowest node id in the currently
quorate partition and not of the whole cluster.
this allow us to determine the lowest node id as soon as we are quorate
and remove the complexity to read it from WFA or nodelist. Add
the same time it adds the flexibility for dynamic nodeids in a cluster.
drop requirement on WFA if nodelist is not specified
update man page
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
this is another leftover from cman compatibility layer
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
as agreed on the mailing list, quorum.expected_votes should override
automatically calculated expected_votes from nodelist.
Also simplify the code to handle expected_votes. "silly defaults" is now
unnecessary because votequorum does config sanity checks upfront.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
This avoids fencing races at startup of a cluster.
It is still possible to override WFA by explicitly setting
wait_for_all: 0
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
expected votes is now calculated automatically and quorum.expected_votes
can be used to override nodelist calculation. The highest of the two
value is used for runtime.
quorum_votes can be specified either in the node list or in quorum.votes.
The node list has priority over global.
propagate votequorum initalization errors (due to config inconsistencies)
back to vsf_quorum.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
These look ugly, are inconsistently done and just have
to be removed later in libqb before calling syslog.
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
protocol needs to stay compatible across a corosync MAJOR release.
Implementing internal protocol version compat is at best suicidal.
Add extra space to the net struct and we can use flags to determine
feature sets.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
as agreed, the API has not been tested yet. Adding later is better than
removing it.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
reduce req_exec_quorum_nodeinfo from 40 to 16 bytes (onwire)
add 4 bytes to req_exec_quorum_reconfigure to be consisent
with feature/version checking (onwire data)
make all nodeid definition "unsigned int" instead of some random mix.
reduce size of different vars
remove lots of unnecessary swab due to reducing size of data
drop join_time from cluster_node, it's never used
fix printing of nodeids from random mix to uint for consistency
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
the problem is that votequorum was listed as part of default
services.
At service_link_and_init, votequorum library and exec were being
made available, even when votequorum was not in used at all, creating
all kind of problems.
By changing the service_link_and_init to be clever, we restore the
original and wanted behavior to link the service only when required.
This also fixed N*votequorum API calls segfaults, init segfaults
and a few dozen other small issues...
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
the problem is mostly in votequorum here, where votequorum_exec is
initialized with or without votequorum being configured as quorum
provider.
re-establish init order (regression from dropping lcrso) and make
sure we init correctly only the module configured.
ykd changes are for consistency only at this point in time.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Angus Salkeld <asalked@redhat.com>
Signed-off-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Fabio Di Nitto <fdinitto@redhat.com>
Reviewed-by: Fabio Di Nitto <fdinitto@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Quorum is broken in this patch.
service.h needs to be cleaned up significantly
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Fabio Di Nitto <fdinitto@redhat.com>