Token coefficient is used only when nodelist is specified and contains
at least 3 nodes. If so, real token timeout is then computed as
token + (number_of_nodes - 2) * token_coefficient. This allows cluster
to scale without manually changing token timeout every time new
node is added. This value can be set to 0 resulting in effective
removal of this feature.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When volatile key was changed (cmap set or reload) and checks fails,
nothing was logged.
Values are now checked and error string is logged on problems.
Also totem_config is dumped to log (DEBUG level) after every
volatile key change and every reload.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When key with dependency was changed, dependant keys were not recomputed.
Nice example is consensus timeout. If token timout was changed,
consensus timeout was not recomputed correctly (nether via cmap change
of key nor via cfg reload).
Solution is almost complete refactor of handling volatile defaults.
totem_volatile_config_read now handles not only storing cmap key to
totem_config structure, but also checking of existence, comparing with
zero value and properly storing defaults.
totem_set_volatile_defaults is gone. It's function was splitted into
totem_volatile_config_read and totem_volatile_config_validate functions.
Reload callback and change of key callback are now mostly same functions
and both calls totem_volatile_config_read.
Patch also fixes small memory leak. totem.vsftype key is not used for
long time and original totem_volatile_config_read wasn't freeing
allocated memory returned by icmap_get_string. Whole reading of
totem.vsftype is removed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When reload was called nodes were constantly added to totemconfig
nodelist.
So simple corosync-cfgtool -R resulted very quickly in filling whole
array and segfault.
Solution is to clear member_count.
Clearing is also moved directly to put_nodelist_members_to_config to
make sure it's always processed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When reload was called multiple times (~20), logging to file stopped
working.
Main problem was hidden in the fact, that log file was opened multiple
times, because even target_id was shared via subsystem loggers, file
name was not.
Solution is to ALWAYS set proper log file name into subsystem logger
(copy is stored). This will not only fix problem but also removes small
leak.
Also if filename didn't changed, function can return sooner.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When totem_set_volatile_defaults is called from totem_config_validate
return code is unchecked.
It's then perfectly possible to set (for example) join timeout to very
small value (1) and consensus value is then set to 0 making corosync
unable to create membership.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
icmap_get_* behavior is to NOT modify passed variable when it doesn't
success. So we must initialize variable before icmap_get_* call.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
When node is paused and other nodes has in meantime exited cpg process,
paused node after resume doesn't update it's membership correctly so on
previously paused node exited cpg process is still visible.
Solution is to compare join list with cpd and remove all pids which are
not included in join list.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Also number is prefixed by 0x so it's easier to spot that number is
hexadecimal.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Most of functionality is moved to do_proc_leave function to make it
reusable.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Patch f3ffd3da5c introduced named states
of state-machine, but sadly contains logical problem causing
stats.continuous_gather increasing even when it shouldn't. Problem is
not critical, because continuous_gather is set to 0 on successful
membership creation.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
This patch adds more flexibility to the auto_tie_breaker feature of
votequorum. With this, not only can the lowest nodeid be used as
a tie breaker, but also the highest, or a node from a nominated list.
If there is a list of nodes, the first node in the list that was not
part of the previous partition is used. This allows the user to
specify a preferred set of nodes but prevents a split-brain if the
cluster divides evenly with a node in each half.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Memory object allocated with malloc at quorum_register_callback
is not freed. The object is linked to internal_trackers_list.
The object is unlinked at quorum_unregister_callback. However,
it is not freed at the function.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Error message is displayed when it's impossible to create symlink to
fdata file.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
According to the totem paper, if a processor
receives a join message in the operational state and if the
receivers identifier is in the join messages fail list,
then join message should be ignored.
By applying this validation of join messages, we can avoid unnecessary
switching from operational state to gather state(or even lead to rings
can not be merged) like the following to happen.
1. Initially, there is only one ring contains three nodes, say
ring(A,B,C).
2. A and B network partition, "in the same time", C is down.
3. Node A sends join message with proclist:A,B,C. faillist:NULL.
Node B sends join message with proclist:A,B,C. faillist:NULL.
4. Both A and B consensus timeout due to network partition.
5. A and B network remerged.
6. Node A sends join message with proclist:A,B,C. faillist:B,C. and
create ring(A).
Node B sends join message with proclist:A,B,C. faillist:A,C. and
create ring(B).
7. Say join message with proclist:A,B,C. faillist:A,C which sent
by node B is received by node A because network remerged.
8. Node A shifts to gather state and send out a modified join message
with proclist:A,B,C. faillist:B. Such join message will prevent
both A and B from merging.
9. Node A consensus timeout (caused by waiting node C) and sends join
message with proclist:A,B,C. faillist:B,C again.
Same thing happens on node B, so A and B will dead loop forever
in step 7, 8 and 9.
As the paper also said: "If a processor receives a join message in the
operational state and if the sender's identifier is in the receiver's
my_proclist and the join message's ring_seq is less than the receiver's
ring sequence number, then it ignores the join message too." So these
patch applying these validations of join messages altogether.
Signed-off-by: Jason <huzhijiang@gmail.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This patch adds the option to store expected_votes to
persistent storage. This is needed to allow_downscale
to operate properly.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Because of change in libqb (9abb686) logging of TOTEM subsystem stopped
working.
Instead of rely on previous behavior (implicit substring match), all
totem files are now explicitly given.
Also QB subsystem now uses comma separated filelist instead of previous
function calling.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
The reason why memb_state_gather_enter is invoked was printed
in integer code. This patch introduces human readable English
messages for the code.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Solution use aproximation of totem structures. This needs to be
rewritten in proper way. Also MTU checking should be implemented for IP
transports.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Parameters in functions like mcast_cq_send_event_fn, ... were defined in
incorrect order. Also their names were weird.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Corosync freezes after several peer node connects/disconnects. The
freeze happens in recv_token_cq_recv_event_fn in ibv_get_cq_event call.
The problems is in fact, that after each peer node connect,
recv_token_accept_destroy is called, which tries to call
poll_dispatch_delete _after_ freeing of completion_channel. As
completion_channel contains fd, handlers are not disconnected from
poller properly. This leads to complete inconsistency in subsequent
calls to handlers.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
1. In UD mode receivnig side of RDMA application should have enough
space in buffer to hold data and GRH. Also, sge.length on the receiving
size should be set to max_msg_size + sizeof (struct ibv_grh). Current
corosync doesn't take grh in the account and does not work if mtu is set
to the real mtu of IB port (it works if netmtu is set to < 2048-40).
2. ibv_wc.byte_len is the actual lentgh of the received packet, i.e.
msg_len + GRH. GRH length should be substracted in further proceeding.
If not, it might cause problems when messages get retransmitted, as
their apparent size will constantly grow.
3. Current corosync will not work with rdma and mtus > 2048. Most modern
IB HW supports 4096 mtu.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When a reload is in progress, wait until it has all finished
before re-reading all of the logging parameters
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When a reload is in progress, wait until the whole thing has
finished before setting parameters
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Add the code to do the actual corosync.conf reload to cfg, along with
a corosync-cfgtool -R command to trigger it
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Pass an icmap hashtable into coroparse so we can load it into
a temporary one during reload
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This patch replaces the existing freopen method of
forcing stdin/out/err to /dev/null with the more
usual system of open/dup2.
While I don't like posting patches I don't fully understand,
this patch seems to fix a problem where stdout/err get
assigned to a socket causing double logging output
on systemd.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
I've seen a few instances where corosync has shut down for
apparently 'no reason'. In fact most of the time the shutdown
has been caused by an external source (often an init script)
but it's not been obvious what has happened and people
implicate the deamon
This patch simply adds a log message to the signal handler
when it is called so that the cause of the shutdown is obvious.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
icmap_get_r is now implemented using this function. Function is not very
safe tho defined as static.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Implementation should allow pass only parts of string (shorten string)
and must prohibit reading of uninitialized memory.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Patch adds reentrant version of most of functions (with exception of RO
flags support and tracking) to allow multiple icmap instances existence
inside corosync.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
qb_loop_timer_add expects the timeout to be in nanoseconds, but we were
passing the value in milliseconds. Scale the timeout appropriately.
Signed-off-by: Michael Chapman <mike@very.puzzling.org>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
We don't reference the connection object on creation, so there
is on reason to dereference it on disconnect.
Signed-off-by: David Vossel <dvossel@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
vague and unhelpful. People have to look for the following quorum
message and try to deduce which nodes have joined or left from that
and past membership messages, even though the routine printing the
message already has this information to hand.
This patch fixes that message so that it prints the nodeids of the nodes
that have joined/left the cluster.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-By: Jan Friesse <jfriesse@redhat.com>
When corosync was started in daemon mode and there was parse error, no
way existed how to find out what happened (this is usual situation with
systemd enabled systems). Solution seems to be output to syslog by
default.
Also redundant line with setting logsys is removed because it's no
longer needed, because FORK and THREADED mode options has no longer
effect. FORK is handled by libqb by default and THREADED mode is forced
by calling logsys_thread_start.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Status string should be same lenght as needed for cfg
ringstatusget function.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Even this check is really not needed, it's nice to have it and on fault
ensure that cluster_name is really NULL.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Code for zero-copy in cpg does following mmaps:
- Mmap anonymous, private memory to some address (-> malloc)
- Mmap shared memory of fd to address returned by first mmap
(effectively shadows first mapping)
This is not necessary and only one mapping is needed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Add poll timer scheduler to be called 3 times per token timeout.
If poll timer was not called for more then 0.8 * token timeout, it means
corosync process was not scheduled and ether token_timeout should be
increased or load should be reduced (useful for VM, where host is
overcommitted so VM is not scheduled as expected).
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
When using corosync with clear_node_high_bit setting to yes,
the highest bit is cleared. When all the cluster nodes are in
one subnet, we probably configure the IP addresses as follows:
node1: 147.2.207.64
node2: 147.2.207.192
If the byte order of the nodeid is little endian, wiping off the
highest bit will make the two nodes have the same nodeid!
This patch fixes this by converting the nodeid to network order.
Signed-off-by: Xia Li <xli@suse.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
If configuration file contains closing brace before opening brace
at top level, configuration parsing is stopped and file is not
completely parsed. Solution is to detect extra closing brace and display
error.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
If colon was entered as part of value on end of value, it is deleted.
This makes impossible to enter (legal) IPv6 address ending with :: (like
fed0::).
Also when line contains both brace and colon, it is parsed twice (first
as key = value and second as start of section). This is handled by
continue in if section.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
"Corosync Cluster Engine ... started" message is shown after
logsys is full configured.
Signed-off-by: Kazunori INOUE <inouekazu@intellilink.co.jp>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Creating qb_loop before daemonization is not problem for poll or epoll
type loops, but it's problem for kqueue, because kqueue is not shared
in child with parent after fork.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Patch for support waiting_trans_ack may fail if there is synchronization
happening between delivery of fragmented message. In such situation,
fragmentation layer is waiting for message with correct number, but it
will never arrive.
Solution is to handle (callback) change of waiting_trans_ack and use
different queue.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
If instance->memb_state is not OPERATION or RECOVERY, we was passing NULL
to cs_queue_used call.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
This patch creates a special message queue for synchronization messages.
This prevents a situation in which messages are queued in the
new_message_queue but have not yet been originated from corrupting the
synchronization process.
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Corosync now works with infiniband transport in any redundant ring mode
Signed-off-by: Evgeny Barskiy <barskiy@rts.ru>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
If failed_to_recv is set (node detect itself not able to receive
message), we can end up with assert, because my_failed_list and
my_member_list are same list. This is happening because we are not
following specification and we allow to mark node itself as failed.
Because if failed_to_recv is set and we reached consensus across nodes,
single node membership is created (ignoring both fail list and
member_list), we can skip assert.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
The libtotem_pg library uses symbols from libqb, so it should be
explicitely linked with it. This doesn't cause problems for corosync
binary itself, as it is linked to both libraries, but can cause
problems if anything else links to libtotem_pg.so and automated
checkers can show this as a library problem.
Signed-off-by: Jacek Konieczny <jajcus@jajcus.net>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
my_processing_idx is pointer to received service list, instead of global
service number. If we check state of service we should use service_id
instead of my_processing_idx.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
This patch returns back SUBJ functionality. It rely on fact, that
sendmsg will return error, and if such error is returned for long time,
it's probably because of firewall.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Instead of rely on multicast loop functionality of kernel, we now use
unix socket created by socketpair to deliver multicast messages to
local node. This handles problems with improperly configured local
firewall. So if output/input to/from ethernet interface is blocked, node
is still able to create single node membership.
Dark side of the patch is fact, that membership is always created, so
"Totem is unable to form a cluster..." will never appear (same applies
to continuous_gather key).
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Config version of other nodes is stored in
runtime.totem.pg.mrp.srp.members.NODEID.config_version key. Also when
local config_version is changed, all nodes are informed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Config version is requested from other nodes. If our config version is
not 0 and differes from highest config version of other nodes, corosync
quits.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Aligning function (kernel style magic) MAR_ALIGN_UP is used for
aligning of items in req_exec_cmap_mcast message.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Function is little more complex, but it is designed to be used in future
without big changes.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
When ringnumber in config file was set to value bigger or equal to
INTERFACE_MAX, we are using this big value as index to totemconfig
interfaces array, resulting to access to invalid memory and segfault.
Instead of that, ringnumber is now checked and proper error message is
printed if value is too big.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Messages which are flow messages, rather then lifecycle are now logged
in trace level.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
also set commont include dirs.
fPIC and DPIC are automatically detected and added
as required by libtool. We don't need to carry it around.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Previous two log releated patches tried to solve few problems with
threaded libqb, but introduced regressions when running in daemon mode.
This patch takes bigger hammer and hopefully solves all problems.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
So we relax netmask check and set to same family as ipaddr
if needed
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
drop all SOLARIS specific ifdefs and replace them with feature checks
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
getifaddrs is always available if there is freeifaddr.
all BSD and openindiana have it defined in ifaddr.h.
drop a bunch of obsoleted headers.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
drop duplicate code and remove the last COROSYNC_LINUX ifdefs
around
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Syslog and file log can block, so it's good idea to use libqb threaded
mode to prevent it.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Instead of hardcoded SHM, we should use NATIVE, so libqb is able to find
out what is best/availiable mechanism.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
openindiana toolchain is rather messy. This is the first cut only
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
We were checking 'hold_timeout == 0' in 3 different places when setting up
the default totem config.
Signed-off-by: Tim Beale <tlbeale@gmail.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Nowhere in the corosync codebase references this structure.
Signed-off-by: Tim Beale <tlbeale@gmail.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
MON and WD services are using fsm.h, which calls log function. Such
messages were incorrectly logged as SERV (or random service) which made
debugging hard.
Solution is to add callback parameter to fsm functions and do actual
logging there.
Handling of failure states is also done in calback now.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
send_ok was incorrectly tested as boolean, even it's errno type
variable.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
This will remove (non critical) debug message from QB about polling on
closed FD.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
IPC is using buffer of CS_MAX_NAME_LENGTH for name. If user calls
function with longer string, such string can be passed to service
incomplete.
Solution is to not allow string larger then CS_MAX_NAME_LENGTH
and return error.
Same applies to cpg service.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
When sync started and service is unloaded in meantime, it can happen that
sync will call sync_* functions on unloaded service.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Multithreaded corosync used to use many ugly workarounds. One of them is
shutdown process, where we had to solve problem with two locks. This was
solved by scheduling jobs between service exit_fn call and actual
service unload. Sadly this can cause to receive message from other node
in that meantime causing corosync to segfault on exit.
Because corosync is now single threaded, we don't need such hacks any
longer.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
there is really no point to have a per node view of (vote)quorum
since all the info are always there.
drop the -n option for status/display nodes and improve
the output to provide a full cluster view at any given time.
Old format:
[root@fedora-master-node2 ~]# corosync-quorumtool -s
Quorum information
------------------
Date: Mon Aug 6 10:22:27 2012
Quorum provider: corosync_votequorum
Nodes: 2
Ring ID: 8
Quorate: Yes
Votequorum information
----------------------
Node ID: 3254954176
Node state: Member
Node votes: 1
Qdevice votes: 1
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Name
3238176960 1 fedora-master-node1.int.fabbione.net
3254954176 1 fedora-master-node2.int.fabbione.net
0 1 QDEVICE (Alive/Voting/NoMasterWins)
New format:
[root@fedora-master-node1 tools]# ./corosync-quorumtool -s
Quorum information
------------------
Date: Mon Aug 6 15:50:03 2012
Quorum provider: corosync_votequorum
Nodes: 2
Ring ID: 48
Quorate: Yes
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate Qdevice
Membership information
----------------------
Nodeid Votes Qdevice Name
3238176960 1 A,V,MW fedora-master-node1.int.fabbione.net
3254954176 1 NR fedora-master-node2.int.fabbione.net
0 1 QDEVICE
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
votequorum has no business to device if master_wins setting is correct or not.
only the qdevice can decide and should set the value for votequorum.
Logic is:
- user requests master_wins from config
- corosync starts
- qdevice starts
- qdevice reads cmap values / register with votequorum
- qdevice decides if the node can support master_wins or not and tells votequorum
- at this point votequorum can check if an unquorate node is part of the master_wins
partition
it is the qdevice responsibility to keep that value up to date in votequorum and the
value can be changed at runtime.
this commit also exchange per node master_wins information to lay down the infrastructure
to verify discrepancies in node config for master_wins (coming next on this channel).
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
it's really pointless to have basically a duplicated API call
to transfer one value and one name.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
those are all info flags.. it's redudant and inconsistent
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
in previous incarnation of qdisk + cman, master_wins was restricted
to 2 node only.
In this new version it is possible to use master_wins for any cluster
size.
Let's assume a 4 node cluster. Each node votes 1, qdevice votes 3.
node 1 becomes qdevice master
node 2/3/4 no
In case of a split (let's assume 2/2):
partition 1: {4, 1}
partition 2: {1, 1}
node 2 in partition 1 would normally be unquorate, leaving effectively
only node 1 active.
master_wins allows node 2 to recognize to be part of a quorate partition
(since node1 is broadcasting that qdevice is voting) and retain
quorum.
node1 has never lost quorate status since qdevice is voting there.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
and cleanup similar code to make it more readable
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
also align naming of vote to cast_vote for info calls
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
- add a protection check to avoid spurious messages on membership
change
- greately simplify processing of nodeinfo, since the only
data that we send for qdevice over nodeinfo is the number of votes
- fix a flag check to trigger quorum calculation that would
leave a cluster unquorate under certain conditions
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
simplify different code path as checks are simpler, separate
ALIVE and CAST_VOTE
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
as data are moving around we can drop lots of special cases
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
this is a preparation commit for the next changes. right now it is
no more than an alias to ALIVE.
CAST_VOTE is required to support master/slave feature from qdevice.
Effectively a quorum device can be:
Not registered / registered (connected to API but nothing else is happening)
if registered:
Not alive / alive (quorum device is petting the API via poll and timer is running)
if alive:
Not voting (slave) / voting (master)
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
STATE is confusing and overloaded term in votequorum as it's used for nodes
and other bits.
make the name unique and ALIVE means that the qdevice is heartbeating
to votequorum.
improve display of the status in tools and tests.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
make the flag name explicit
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When service is unloaded, sync shouldn't call sync_init|process|activate
and abort functions. It happens very rare, but in process of unloading
all services, totem can recreate membership and bad things can happen
(service is unloaded, so there may be access to already freed memory,
...)
Solution is to fetch services sync handlers in every time when we are
building service list instead of using precreated one.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Sync/service was using maximal number of services in ehter numberic form
(magic constant) or inconsistently, this means using
SERVICE_HANDLER_MAXIMUM_COUNT which means maximal number of handlers.
New macro solves this.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Let's say we have 2 nodes:
- node 2 is paused
- node 1 create membership (one node)
- node 2 is unpaused
Result is that node 1 downlist is selected, so it means that from node 2
point of view, node 1 was never down.
Patch solves situation by adding additional check for largest previous
membership.
So current tests are:
1) largest (previous #nodes - #nodes know to have left)
2) (then) largest previous membership
3) (and last as a tie-breaker) node with smallest nodeid
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
In downlist and joinlist debug output group was printed in nonsense
format of integer to pointer to array.
Now it's printed by full name.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
let's say following situation will happen:
- we have 3 nodes
- on wire messages looks like D1,J1,D2,J2,D3,J3 (D is downlist, J is
joinlist)
- let's say, D1 and D3 contains node 2
- it means that J2 is applied, but right after that, D1 (or D3) is
applied what means, node 2 is again considered down
It's solved by collecting joinlists and apply them after downlist, so
order is:
- apply best matching downlist
- apply all joinlists
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Test scenario is follows:
- node 1, node 2
- node 1 is paused
- node 2 sees node 1 dead
- node 1 unpaused
- node 1 and 2 both choose same dowlist message which includes node 2 ->
node 2 is efectivelly disconnected
Patch includes additional test if left_node is localnode. If so, such
downlist is ignored.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Patch solves problem when 1 ring out of 2 went up/down quite often.
The simplest setup to reproduce bug is following:
- 2 VMs, connected by 2 network interfaces
- OS: Linux
- On one of the VMs, a test program sending some CPG messages (see the
script "test_corosync.sh" joined to this mail for example)
Here are the Corosync logs we get when we do this setup:
Jun 06 16:23:40 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 06 16:23:40 corosync [CPG ] chosen downlist: sender r(0)
ip(192.168.56.104) r(1) ip(192.168.57.104) ; members(old:1 left:0)
Jun 06 16:23:40 corosync [MAIN ] Completed service synchronization,
ready to provide service.
Jun 06 16:24:37 corosync [TOTEM ] Marking ringid 1 interface
192.168.57.105 FAULTY
Jun 06 16:24:38 corosync [TOTEM ] Automatically recovered ring 1
Jun 06 16:25:33 corosync [TOTEM ] Marking ringid 1 interface
192.168.57.105 FAULTY
Jun 06 16:25:34 corosync [TOTEM ] Automatically recovered ring 1
Jun 06 16:26:35 corosync [TOTEM ] Marking ringid 1 interface
192.168.57.105 FAULTY
Jun 06 16:26:36 corosync [TOTEM ] Automatically recovered ring 1
(...)
The second ring goes down about every 2 minutes and automatically back
up right after.
We spent some times looking for the commit that introduced this bug, and
it appears it's due the following one:
Corosync 1.3.3 -> 1.3.4: e27a58d93d
Corosync 1.4.1 -> 1.4.2: be608c0502
Commit message: Ignore memb_join messages during flush operations
I had a look at this commit, and it seems to me it's dropping too many
packets:
Because of this commit, while totemrrp_recv_flush() is called, Corosync
drops memb_join packets, but also ORF tokens. In the end, it seems that
sometimes, we drop so many of them that Corosync marks the ring as
faulty.
To fix that, only memb_join messages are dropped now.
Signed-off-by: Jerome FLESCH <jerome.flesch@netasq.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This should allow easier handling of various blackbox dumps. Original
fdata name is now symlink to latest created dump.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
corosync logging configuration logic is rather complex and in order
to make it simpler to reuse (at least within corosync/ tree)
we need to be able to use both icmap and cmap.
the patch might seem controversial, but it reduces heaps of code around
from qdevices (coming next).
It might be useful to consider moving this to a common shared library
but there aren't enough users yet and a shared lib would force
corosync to link with cmap (that we do not want at all costs)
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
This fixes a bug where having a second log file will close
the previous one.
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
else we can get messages been put in the wrong subsys.
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Logic for binding now works in following way:
- Try to find exact match
- If not exact match is found, use first found network address
This allows set concrete IP even if network settings contains two IPs on
same network.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
list_add_tail is used instead of list_add so ip addresses are inserted
in same order as returned by getifaddrs.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
This should help correlate syslog entires with their blackbox
counterparts.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Andrew Beekhof <andrew@beekhof.net>
clean up a lot of allocated blocks at exit.
those changes has no runtime effects, but it makes valgrind
output a bit more useful by dropping over 700 errors/warnings to skip
over every single run.
there are still a few icmap related valgrind errors but those need
some more complex and timeconsuming investigation.
pre patch:
==21844== HEAP SUMMARY:
==21844== in use at exit: 1,229,321 bytes in 1,516 blocks
==21844== total heap usage: 7,191 allocs, 5,675 frees, 3,819,853 bytes allocated
==21844== LEAK SUMMARY:
==21844== definitely lost: 3,617 bytes in 11 blocks
==21844== indirectly lost: 21,960 bytes in 11 blocks
==21844== possibly lost: 1,080,101 bytes in 131 blocks
==21844== still reachable: 123,643 bytes in 1,363 blocks
==21844== suppressed: 0 bytes in 0 blocks
==21844== ERROR SUMMARY: 136 errors from 136 contexts (suppressed: 0 from 0)
post patch:
==25793== HEAP SUMMARY:
==25793== in use at exit: 1,185,870 bytes in 808 blocks
==25793== total heap usage: 9,427 allocs, 8,619 frees, 4,156,841 bytes allocated
==25793== LEAK SUMMARY:
==25793== definitely lost: 3,697 bytes in 12 blocks
==25793== indirectly lost: 22,248 bytes in 13 blocks
==25793== possibly lost: 1,079,655 bytes in 113 blocks
==25793== still reachable: 80,270 bytes in 670 blocks
==25793== suppressed: 0 bytes in 0 blocks
==25793== ERROR SUMMARY: 119 errors from 119 contexts (suppressed: 0 from 0)
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Those are only used at init phase and we can free some memory for the system.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
this fixes a rather annoying race condition at startup where a client
connects to corosync "too fast" before the service is ready to operate
and client gets some random data during initialization phase.
With this fix, we allow connections to ipc only after the main engine
is operational and configured (and after the first totem transition).
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
This prevents segfault when rrp mode is set with only one ring.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Full path to key is now tested rather then key name only.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
- cleanup include list
- reorder code and functions (crypto then hash)
- split crypt/decrypt/hash functions
- some micro optimizations by dropping a few memcpy
- make the code more readable (better var names and buffers mapping)
- improve exit paths on error (return codes and free)
- store crypto header size instead of recalculating it per packet
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Commit which added number of addresses to srp_address structure didn't
count with totemsrp_ifaces_get where whole structure was copied instead
of addresses only. This is now fixed.
Also to make API totempg forward compatible, size of interfaces array
must be passed to ifaces_get like functions to prevent memory overwrite.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
This should allow us future change to dynamic number of rings without
breaking wire compatibility.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Also most of the key settings are now centralized in one function, so
it's easier to audit.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>