During reload, local_node_pos is deleted and reinstation is handled in
totemconfig after reload is finished. votequorum handles this events and
tries to reload it's configuration. This led to logging a little scary
messages (even nothing bad is happening, because after local_node_pos
reinstation everything back to normal).
Solution is to stop processing events during reload. Sadly, simple
tracking of config.reload_in_progress doesn't work because LibQB events
triggering order is undefined so votequorum reload handler can be called
before totemconfig (and before local_node_pos is reinstatied).
So new config.totemconfig_reload_in_progress key is defined with very
similar semanthic as config.reload_in_progress but set inside
totem_reload_notify function. Votequorum then use this new key.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
It's not very good idea to allow user apps changing internal key
reload_in_progress.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Previous safe_atoi didn't check range of input values so if for example
user used -1 s token timeout, it was converted to UINT32_MAX without
letting user know.
Another safe_atoi problem was using strtol. This works pretty well on
64-bit systems, where long integer is usually 64-bits long, sadly on
32-bit systems, it is usually 32-bit long. And because strtol returns
signed integer, it was not possible to enter 32-bit value with highest
bit set.
Solution is to use strtoll which is guaranteed to be at least 64-bits
long and check value range.
Also error message now contains also information about expected value
range.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Functions for storing and loading ring id was in the totem library. This
causes problem, what to do when it's impossible to load or store ring
id. Easy solution seemed to be assert, but sadly this makes hard for
user to find out what happened (because corosync was just aborted and
logsys didn't flush)
Solution is to move these functions to main.c, where is much easier to
handle error. This also makes libtotem free of any file system
operations.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Run dir (LOCALSTATEDIR/lib/corosync) was hardcoded thru whole codebase.
Totemsrp was trying to create and chdir into it, but also
takes into account environment variable COROSYNC_RUN_DIR creating
inconsistency.
get_run_dir correctly returns COROSYNC_RUN_DIR (when set) or
LOCALSTATEDIR/lib/corosync. This is now used by all functions instead of
hardcoded string.
All occurrences of mkdir/chdir are removed from totemsrp and chdir is
now called in main function. Mkdir call is completely removed, because
it was not used anyway (check in main.c was called before totemsrp init,
so mkdir was never called) and also make install and/or package system
should take care of creating this directory with correct
permissions/context.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
rdma_join_multicast failed ... message parameters was swapped.
Also information about multicast join is now logged as notice.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Totemiba wasn't able to survive SubnetManager handover or
restart. If SM was migrated to another node, corosync logged
"multicast error" and losses connectivity.
Commit should solve this situation.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
token_coefficient change in cmap didn't triggered change. So only way
how to change token_coefficient was editing config file and reload.
Patch let's key totem.token_coefficient to be processed so
token_coefficient can be dynamically changed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Token coefficient is used only when nodelist is specified and contains
at least 3 nodes. If so, real token timeout is then computed as
token + (number_of_nodes - 2) * token_coefficient. This allows cluster
to scale without manually changing token timeout every time new
node is added. This value can be set to 0 resulting in effective
removal of this feature.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When volatile key was changed (cmap set or reload) and checks fails,
nothing was logged.
Values are now checked and error string is logged on problems.
Also totem_config is dumped to log (DEBUG level) after every
volatile key change and every reload.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When key with dependency was changed, dependant keys were not recomputed.
Nice example is consensus timeout. If token timout was changed,
consensus timeout was not recomputed correctly (nether via cmap change
of key nor via cfg reload).
Solution is almost complete refactor of handling volatile defaults.
totem_volatile_config_read now handles not only storing cmap key to
totem_config structure, but also checking of existence, comparing with
zero value and properly storing defaults.
totem_set_volatile_defaults is gone. It's function was splitted into
totem_volatile_config_read and totem_volatile_config_validate functions.
Reload callback and change of key callback are now mostly same functions
and both calls totem_volatile_config_read.
Patch also fixes small memory leak. totem.vsftype key is not used for
long time and original totem_volatile_config_read wasn't freeing
allocated memory returned by icmap_get_string. Whole reading of
totem.vsftype is removed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When reload was called nodes were constantly added to totemconfig
nodelist.
So simple corosync-cfgtool -R resulted very quickly in filling whole
array and segfault.
Solution is to clear member_count.
Clearing is also moved directly to put_nodelist_members_to_config to
make sure it's always processed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When reload was called multiple times (~20), logging to file stopped
working.
Main problem was hidden in the fact, that log file was opened multiple
times, because even target_id was shared via subsystem loggers, file
name was not.
Solution is to ALWAYS set proper log file name into subsystem logger
(copy is stored). This will not only fix problem but also removes small
leak.
Also if filename didn't changed, function can return sooner.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When totem_set_volatile_defaults is called from totem_config_validate
return code is unchecked.
It's then perfectly possible to set (for example) join timeout to very
small value (1) and consensus value is then set to 0 making corosync
unable to create membership.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
icmap_get_* behavior is to NOT modify passed variable when it doesn't
success. So we must initialize variable before icmap_get_* call.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
When node is paused and other nodes has in meantime exited cpg process,
paused node after resume doesn't update it's membership correctly so on
previously paused node exited cpg process is still visible.
Solution is to compare join list with cpd and remove all pids which are
not included in join list.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Also number is prefixed by 0x so it's easier to spot that number is
hexadecimal.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Most of functionality is moved to do_proc_leave function to make it
reusable.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Patch f3ffd3da5c introduced named states
of state-machine, but sadly contains logical problem causing
stats.continuous_gather increasing even when it shouldn't. Problem is
not critical, because continuous_gather is set to 0 on successful
membership creation.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
This patch adds more flexibility to the auto_tie_breaker feature of
votequorum. With this, not only can the lowest nodeid be used as
a tie breaker, but also the highest, or a node from a nominated list.
If there is a list of nodes, the first node in the list that was not
part of the previous partition is used. This allows the user to
specify a preferred set of nodes but prevents a split-brain if the
cluster divides evenly with a node in each half.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Memory object allocated with malloc at quorum_register_callback
is not freed. The object is linked to internal_trackers_list.
The object is unlinked at quorum_unregister_callback. However,
it is not freed at the function.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Error message is displayed when it's impossible to create symlink to
fdata file.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
According to the totem paper, if a processor
receives a join message in the operational state and if the
receivers identifier is in the join messages fail list,
then join message should be ignored.
By applying this validation of join messages, we can avoid unnecessary
switching from operational state to gather state(or even lead to rings
can not be merged) like the following to happen.
1. Initially, there is only one ring contains three nodes, say
ring(A,B,C).
2. A and B network partition, "in the same time", C is down.
3. Node A sends join message with proclist:A,B,C. faillist:NULL.
Node B sends join message with proclist:A,B,C. faillist:NULL.
4. Both A and B consensus timeout due to network partition.
5. A and B network remerged.
6. Node A sends join message with proclist:A,B,C. faillist:B,C. and
create ring(A).
Node B sends join message with proclist:A,B,C. faillist:A,C. and
create ring(B).
7. Say join message with proclist:A,B,C. faillist:A,C which sent
by node B is received by node A because network remerged.
8. Node A shifts to gather state and send out a modified join message
with proclist:A,B,C. faillist:B. Such join message will prevent
both A and B from merging.
9. Node A consensus timeout (caused by waiting node C) and sends join
message with proclist:A,B,C. faillist:B,C again.
Same thing happens on node B, so A and B will dead loop forever
in step 7, 8 and 9.
As the paper also said: "If a processor receives a join message in the
operational state and if the sender's identifier is in the receiver's
my_proclist and the join message's ring_seq is less than the receiver's
ring sequence number, then it ignores the join message too." So these
patch applying these validations of join messages altogether.
Signed-off-by: Jason <huzhijiang@gmail.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This patch adds the option to store expected_votes to
persistent storage. This is needed to allow_downscale
to operate properly.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Because of change in libqb (9abb686) logging of TOTEM subsystem stopped
working.
Instead of rely on previous behavior (implicit substring match), all
totem files are now explicitly given.
Also QB subsystem now uses comma separated filelist instead of previous
function calling.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
The reason why memb_state_gather_enter is invoked was printed
in integer code. This patch introduces human readable English
messages for the code.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Solution use aproximation of totem structures. This needs to be
rewritten in proper way. Also MTU checking should be implemented for IP
transports.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Parameters in functions like mcast_cq_send_event_fn, ... were defined in
incorrect order. Also their names were weird.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Corosync freezes after several peer node connects/disconnects. The
freeze happens in recv_token_cq_recv_event_fn in ibv_get_cq_event call.
The problems is in fact, that after each peer node connect,
recv_token_accept_destroy is called, which tries to call
poll_dispatch_delete _after_ freeing of completion_channel. As
completion_channel contains fd, handlers are not disconnected from
poller properly. This leads to complete inconsistency in subsequent
calls to handlers.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
1. In UD mode receivnig side of RDMA application should have enough
space in buffer to hold data and GRH. Also, sge.length on the receiving
size should be set to max_msg_size + sizeof (struct ibv_grh). Current
corosync doesn't take grh in the account and does not work if mtu is set
to the real mtu of IB port (it works if netmtu is set to < 2048-40).
2. ibv_wc.byte_len is the actual lentgh of the received packet, i.e.
msg_len + GRH. GRH length should be substracted in further proceeding.
If not, it might cause problems when messages get retransmitted, as
their apparent size will constantly grow.
3. Current corosync will not work with rdma and mtus > 2048. Most modern
IB HW supports 4096 mtu.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When a reload is in progress, wait until it has all finished
before re-reading all of the logging parameters
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When a reload is in progress, wait until the whole thing has
finished before setting parameters
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Add the code to do the actual corosync.conf reload to cfg, along with
a corosync-cfgtool -R command to trigger it
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Pass an icmap hashtable into coroparse so we can load it into
a temporary one during reload
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This patch replaces the existing freopen method of
forcing stdin/out/err to /dev/null with the more
usual system of open/dup2.
While I don't like posting patches I don't fully understand,
this patch seems to fix a problem where stdout/err get
assigned to a socket causing double logging output
on systemd.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
I've seen a few instances where corosync has shut down for
apparently 'no reason'. In fact most of the time the shutdown
has been caused by an external source (often an init script)
but it's not been obvious what has happened and people
implicate the deamon
This patch simply adds a log message to the signal handler
when it is called so that the cause of the shutdown is obvious.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
icmap_get_r is now implemented using this function. Function is not very
safe tho defined as static.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Implementation should allow pass only parts of string (shorten string)
and must prohibit reading of uninitialized memory.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>