The two_node and auto_tie_breaker options are incompatible as they
specify conflicting methods of determining the quorate half of a cluster
partition.
This patch detects this error in corosync.conf, issues a message and
disables two_node if auto_tie_breaker is present.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
The default for auto_tie_breaker should be 'lowest' - which is what it
was before the extended ATB functionality of auto_tie_breaker_node was
added, and what the documentation states.
However this was broken so that if auto_tie_breaker_node was not
specified then auto_tie_breaker itself was ignored. This patch fixes
that.
It also fixes a typo in a comment.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When config file is reloaded with removed UDPU member, internal icmap
index of nodelist.node can change. This can result in removal and then
adding back node. This, with UDPU alive filtering (where member is by
default considered as not a member) makes corosync not sending messages
to such members resulting in new membership creation.
Solution is to properly test which members were really deleted and added
(instead of relying on internal and dynamic naming of icmap hash table
key name).
Also trully dynamic add and remove node (via cmap) is now handled by
same function so totem_config->interfaces is now updated properly.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
corosync_ring_id_store should use same (safer) permissions as
corosync_ring_id_create_or_load for (eventually) newly created ringid
file.
Credit to Sjerek for finding this problem.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
In active rrp mode, commit tokens are treated as mcast data messages,
thus, rrp directly delivers them to srp layer by active_mcast_recv().
This will result in duplicated commit tokens being received by srp from
different heartbeat links. If node is in recovery state and has already
sent out the initial orf token, those duplicated commit tokens will
cause message_handler_memb_commit_token() to send initial orf token
again! This is wrong because it resets the orf token content in
instance->orf_token_retransmit, which breaks the token retransmission
state.
Furthermore, by sending those initial orf tokens again and again,
it may lead active_token_recv() to drop some subsequent orf tokens.
It is OK for rrp because srp will do token retransmission,
but as said above, srp retransmission state has already been broken,
so finally we meet a "token lost in recovery state" condition caused
by software. If token timeout value is large, then it will takes long
time to create a new ring.
This can be reproduced by having two noded set to active rrp mode, with
two heartbeat links. Then with one node always on, let the other one do
stop/start again and again. It has a low probability to reproduce.
In theory, I think, the more heartbeat links used, the more easily it
can be reproduced.
This problem can be resolved by letting
message_handler_memb_commit_token() to ignore duplicated commit tokens
in recovery state if node (the ring representation) has already sent
out the initial orf token.
Different from prev take, this version do not depends on stored token
data but uses originated_orf_token in totemsrp_instance to remember
if initial orf token has been already originated for current membership.
Signed-off-by: Jason <huzhijiang@gmail.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Make sure to log auto-recovery of ring only once. Every
MESSAGE_TYPE_RING_TEST_ACTIVATE receive is logged, but with lower
priority and more detailed information.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Experience with larger production clusters showed that setting RR
priority for corosync is viable for prevent random fencing, ...
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
After a heartbeat link's FAULTY and its auto re-enable,
active_instance->timer_problem_decrementer did not reset to zero. So in
the next timer_function_active_token_expired() round,
active_timer_problem_decrementer_start() will not be called. This will
result in that the active_instance->counter_problems of this link can
not be decreased any more. Cause rrp lose the ability to tolerate
network fluctuation.
This problem can be reproduced by the following sequence:
1) Set RRP in active mode, configure at least 2 heartbeat links.
2) Unplug one link till corosync-cfgtool -s shows it is FAULTY.
3) Re-plug this link then corosync-cfgtool -s shows it is active with
no faults.
4) Unplug this link again but quicky re-plug it before it becomes
FAULTY.
5) Finally, you can see corosync-cfgtool -s shows it is in
"Incrementing problem counter" state despite it currently is physically
healthy.
It can be solved by not forget to reset timer_problem_decrementer to
zero in active_timer_problem_decrementer_cancel().
Signed-off-by: Jason <huzhijiang@gmail.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When using multiple interfaces, it's necessary to use different
multicast address/port pair for each interface to make
rrp work correctly. This is now checked in parser.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Broadcast option is global but in config set in interface section. When
more interfaces are defined, only broadcast from last section was used.
Solution is to use broadcast whenever at least one interface use
broadcast.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Checking code was there, sadly not correct, so it was possible to enter
one bindnet addr as IPv4 and second as IPv6. Fix is trivial.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Some totem configuration values (like token, consensus, ...) are ether
computed or default value is used. It's hard to find out, what
value is really used.
Solution is to store values in cmap.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
MTU for IPv6 is 20 bytes larger then IPv4. This fact was not taken into
account so IPv6 packets were larger then MTU resulting in fragmentation.
Solution is to substract correct IP header size.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
libnss is "weird" in this respect as some block sizes are hardcoded,
others need to be determined dynamically.
For AES we need to use the values we know since GetBlockSize would
return errors, for 3des (that hopefully nobody is using) the value
returned by GetBlockSize is 8, but let's use the call into libnss
to avoid possible conflicts with distro patching or older versions.
Now, given the correct block size, the old calculation simply added
block size to the hdr_size. This is not sufficient.
We use _PAD encryption methods and we need to take that into account.
_PAD is calculated given the current input buf len and rounded up
to block size boundary, then block_size is added.
Ideally we would do that on a per packet base but current transport
infrastructure doesn't allow it yet.
So round up the hdr_size to double the block_size reported by the
cipher.
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
To follow spec it's needed to send messages to all nodes (not only
active members) from time to time to detect merge.
This is needed in situations when totemsrp merge timer isn't running
(because there is enough messages sent by processors) to detect merge.
Example scenario:
- 3 nodes, all of them running cpgverify
- One node is isolated (iptables for example)
- Node is un-isolated
Without this commit, node will not merge as long as the cpgverify is
running.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Member active is used for sending "multicast" messages only to members
of ring. This reduces network load if some nodes are intentionally down.
Only regular multicast message load is reduced (messages sent by
totemudpu_mcast_noflush_send), because special messages (like hold
cancel, join message, ...) still have to be send to all members to
ensure correct behavior.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
All *_membership_changed calls totemnet_member_set_active passing 1 as
active parameter for joined nodes and 0 for left nodes.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
totemnet_member_set_active together with transport specific
member_set_active makes possible for totemnet (and more interestingly
transport) to be informed about membership changes.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Services are informed about membership changes, but if same information
is needed inside totemrrp or totemnet, it's impossible to gather this
information.
Patch makes this possible for now only for RRP with empty callbacks.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Although YKD is currently unsupported, untested and decprecated it's
handy for testing things in the quorum module.
This patch allows YKD to actually load without an error. It does not fix
anything else in the service!
Also remove vsftype and its reference to YKD being the preferred and
default provider from the corosync.conf man page,
as that hasn't been true for a considerable time.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
It's possible in a two_node cluster (and others but it's more likely
with just two) that a node could be booted up after downtime or failure
and the other node is not available for some reason. In this case it
would not be allowed to proceed because wait_for_all is enforced.
This patch provides a cmap key to clear this flag in the desperate
situation where that becomes necessary. It should only be used with
extreme caution and will be wrapped up in pcs which should also check
that fencing has been run.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When there is no other activty on ring but only retransmition, and
token is in hold mode, the retransmition will become slow. More over,
if the retransmition is always fail but token rotation works well, then
it takes quite a lone time
(fail_to_recv_const * token_hold = 2500 * 180ms = 450sec) for the
retransmit requester to meet the "FAILED TO RECEIVE" condition to
re-construct a new ring.
This problem can be solved by checking if retransmits are present
before going into hold. If a node is the retransmit requester or
the resender, it set my_token_held to 0 to speed up retransmition
and omit further unnecessary sending of token_hold_cancel signal.
Signed-off-by: Jason HU <huzhijiang@gmail.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Configuration option quorum.device.sync_timeout is available for setting
qdevice poll timeout for synchronization phase. Default value is 30
sec.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
If qdevice is registered a alive, corosync waits in sync phase until
timeout expires or qdevice votes with correct nodeid parameter.
This gives qdevice time to decide to vote or not undisturbed and without
time hazard.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
This is needed for qdevice to be able to process messages during
synchronization phase.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
If votequorum service receives incorrect (not current) ringid, call is
ignored and CS_ERR_MESSAGE_ERROR is returned.
This and previous commits makes incompatible changes in votequorum
API/ABI, so library version is increased.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Returning ring id will be used in poll function.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
The thesis contains this paragraph:
" The Join timeout is shorter than the Consensus timeout and is used to
increase the probability that Join messages from all currently
working processors are received during a single round of consensus."
Empirically I can confirm that making join less than consensus can cause
havoc with a cluster so I think we should enforce this.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Fix several places where 'then' is used instead of 'than' in error
messages and a comment.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Move finding of bindaddr in nodelist to generally usable function
totem_config_find_local_addr_in_nodelist and refactor
config_convert_nodelist_to_interface function to use it.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Add totem_config_get_ip_version to get user configured ip version.
Make totem_config_read use this newly introduced function.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
QB loop signal handler prototype differs from signal(2) prototype.
Solution is to create wrapper functions.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
SIGSEGV and SIGABRT signals are now correctly handled (blackbox is
dumped and logsys is finalized).
Signed-off-by: zouyu <hopkings2005@gmail.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
During reload, local_node_pos is deleted and reinstation is handled in
totemconfig after reload is finished. votequorum handles this events and
tries to reload it's configuration. This led to logging a little scary
messages (even nothing bad is happening, because after local_node_pos
reinstation everything back to normal).
Solution is to stop processing events during reload. Sadly, simple
tracking of config.reload_in_progress doesn't work because LibQB events
triggering order is undefined so votequorum reload handler can be called
before totemconfig (and before local_node_pos is reinstatied).
So new config.totemconfig_reload_in_progress key is defined with very
similar semanthic as config.reload_in_progress but set inside
totem_reload_notify function. Votequorum then use this new key.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
It's not very good idea to allow user apps changing internal key
reload_in_progress.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Previous safe_atoi didn't check range of input values so if for example
user used -1 s token timeout, it was converted to UINT32_MAX without
letting user know.
Another safe_atoi problem was using strtol. This works pretty well on
64-bit systems, where long integer is usually 64-bits long, sadly on
32-bit systems, it is usually 32-bit long. And because strtol returns
signed integer, it was not possible to enter 32-bit value with highest
bit set.
Solution is to use strtoll which is guaranteed to be at least 64-bits
long and check value range.
Also error message now contains also information about expected value
range.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Functions for storing and loading ring id was in the totem library. This
causes problem, what to do when it's impossible to load or store ring
id. Easy solution seemed to be assert, but sadly this makes hard for
user to find out what happened (because corosync was just aborted and
logsys didn't flush)
Solution is to move these functions to main.c, where is much easier to
handle error. This also makes libtotem free of any file system
operations.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Run dir (LOCALSTATEDIR/lib/corosync) was hardcoded thru whole codebase.
Totemsrp was trying to create and chdir into it, but also
takes into account environment variable COROSYNC_RUN_DIR creating
inconsistency.
get_run_dir correctly returns COROSYNC_RUN_DIR (when set) or
LOCALSTATEDIR/lib/corosync. This is now used by all functions instead of
hardcoded string.
All occurrences of mkdir/chdir are removed from totemsrp and chdir is
now called in main function. Mkdir call is completely removed, because
it was not used anyway (check in main.c was called before totemsrp init,
so mkdir was never called) and also make install and/or package system
should take care of creating this directory with correct
permissions/context.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
rdma_join_multicast failed ... message parameters was swapped.
Also information about multicast join is now logged as notice.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Totemiba wasn't able to survive SubnetManager handover or
restart. If SM was migrated to another node, corosync logged
"multicast error" and losses connectivity.
Commit should solve this situation.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
token_coefficient change in cmap didn't triggered change. So only way
how to change token_coefficient was editing config file and reload.
Patch let's key totem.token_coefficient to be processed so
token_coefficient can be dynamically changed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Token coefficient is used only when nodelist is specified and contains
at least 3 nodes. If so, real token timeout is then computed as
token + (number_of_nodes - 2) * token_coefficient. This allows cluster
to scale without manually changing token timeout every time new
node is added. This value can be set to 0 resulting in effective
removal of this feature.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When volatile key was changed (cmap set or reload) and checks fails,
nothing was logged.
Values are now checked and error string is logged on problems.
Also totem_config is dumped to log (DEBUG level) after every
volatile key change and every reload.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When key with dependency was changed, dependant keys were not recomputed.
Nice example is consensus timeout. If token timout was changed,
consensus timeout was not recomputed correctly (nether via cmap change
of key nor via cfg reload).
Solution is almost complete refactor of handling volatile defaults.
totem_volatile_config_read now handles not only storing cmap key to
totem_config structure, but also checking of existence, comparing with
zero value and properly storing defaults.
totem_set_volatile_defaults is gone. It's function was splitted into
totem_volatile_config_read and totem_volatile_config_validate functions.
Reload callback and change of key callback are now mostly same functions
and both calls totem_volatile_config_read.
Patch also fixes small memory leak. totem.vsftype key is not used for
long time and original totem_volatile_config_read wasn't freeing
allocated memory returned by icmap_get_string. Whole reading of
totem.vsftype is removed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When reload was called nodes were constantly added to totemconfig
nodelist.
So simple corosync-cfgtool -R resulted very quickly in filling whole
array and segfault.
Solution is to clear member_count.
Clearing is also moved directly to put_nodelist_members_to_config to
make sure it's always processed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When reload was called multiple times (~20), logging to file stopped
working.
Main problem was hidden in the fact, that log file was opened multiple
times, because even target_id was shared via subsystem loggers, file
name was not.
Solution is to ALWAYS set proper log file name into subsystem logger
(copy is stored). This will not only fix problem but also removes small
leak.
Also if filename didn't changed, function can return sooner.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
When totem_set_volatile_defaults is called from totem_config_validate
return code is unchecked.
It's then perfectly possible to set (for example) join timeout to very
small value (1) and consensus value is then set to 0 making corosync
unable to create membership.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
icmap_get_* behavior is to NOT modify passed variable when it doesn't
success. So we must initialize variable before icmap_get_* call.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
When node is paused and other nodes has in meantime exited cpg process,
paused node after resume doesn't update it's membership correctly so on
previously paused node exited cpg process is still visible.
Solution is to compare join list with cpd and remove all pids which are
not included in join list.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Also number is prefixed by 0x so it's easier to spot that number is
hexadecimal.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Most of functionality is moved to do_proc_leave function to make it
reusable.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Patch f3ffd3da5c introduced named states
of state-machine, but sadly contains logical problem causing
stats.continuous_gather increasing even when it shouldn't. Problem is
not critical, because continuous_gather is set to 0 on successful
membership creation.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
This patch adds more flexibility to the auto_tie_breaker feature of
votequorum. With this, not only can the lowest nodeid be used as
a tie breaker, but also the highest, or a node from a nominated list.
If there is a list of nodes, the first node in the list that was not
part of the previous partition is used. This allows the user to
specify a preferred set of nodes but prevents a split-brain if the
cluster divides evenly with a node in each half.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Memory object allocated with malloc at quorum_register_callback
is not freed. The object is linked to internal_trackers_list.
The object is unlinked at quorum_unregister_callback. However,
it is not freed at the function.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Error message is displayed when it's impossible to create symlink to
fdata file.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
According to the totem paper, if a processor
receives a join message in the operational state and if the
receivers identifier is in the join messages fail list,
then join message should be ignored.
By applying this validation of join messages, we can avoid unnecessary
switching from operational state to gather state(or even lead to rings
can not be merged) like the following to happen.
1. Initially, there is only one ring contains three nodes, say
ring(A,B,C).
2. A and B network partition, "in the same time", C is down.
3. Node A sends join message with proclist:A,B,C. faillist:NULL.
Node B sends join message with proclist:A,B,C. faillist:NULL.
4. Both A and B consensus timeout due to network partition.
5. A and B network remerged.
6. Node A sends join message with proclist:A,B,C. faillist:B,C. and
create ring(A).
Node B sends join message with proclist:A,B,C. faillist:A,C. and
create ring(B).
7. Say join message with proclist:A,B,C. faillist:A,C which sent
by node B is received by node A because network remerged.
8. Node A shifts to gather state and send out a modified join message
with proclist:A,B,C. faillist:B. Such join message will prevent
both A and B from merging.
9. Node A consensus timeout (caused by waiting node C) and sends join
message with proclist:A,B,C. faillist:B,C again.
Same thing happens on node B, so A and B will dead loop forever
in step 7, 8 and 9.
As the paper also said: "If a processor receives a join message in the
operational state and if the sender's identifier is in the receiver's
my_proclist and the join message's ring_seq is less than the receiver's
ring sequence number, then it ignores the join message too." So these
patch applying these validations of join messages altogether.
Signed-off-by: Jason <huzhijiang@gmail.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This patch adds the option to store expected_votes to
persistent storage. This is needed to allow_downscale
to operate properly.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Because of change in libqb (9abb686) logging of TOTEM subsystem stopped
working.
Instead of rely on previous behavior (implicit substring match), all
totem files are now explicitly given.
Also QB subsystem now uses comma separated filelist instead of previous
function calling.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
The reason why memb_state_gather_enter is invoked was printed
in integer code. This patch introduces human readable English
messages for the code.
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Solution use aproximation of totem structures. This needs to be
rewritten in proper way. Also MTU checking should be implemented for IP
transports.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Parameters in functions like mcast_cq_send_event_fn, ... were defined in
incorrect order. Also their names were weird.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Corosync freezes after several peer node connects/disconnects. The
freeze happens in recv_token_cq_recv_event_fn in ibv_get_cq_event call.
The problems is in fact, that after each peer node connect,
recv_token_accept_destroy is called, which tries to call
poll_dispatch_delete _after_ freeing of completion_channel. As
completion_channel contains fd, handlers are not disconnected from
poller properly. This leads to complete inconsistency in subsequent
calls to handlers.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
1. In UD mode receivnig side of RDMA application should have enough
space in buffer to hold data and GRH. Also, sge.length on the receiving
size should be set to max_msg_size + sizeof (struct ibv_grh). Current
corosync doesn't take grh in the account and does not work if mtu is set
to the real mtu of IB port (it works if netmtu is set to < 2048-40).
2. ibv_wc.byte_len is the actual lentgh of the received packet, i.e.
msg_len + GRH. GRH length should be substracted in further proceeding.
If not, it might cause problems when messages get retransmitted, as
their apparent size will constantly grow.
3. Current corosync will not work with rdma and mtus > 2048. Most modern
IB HW supports 4096 mtu.
Signed-off-by: Yevheniy Demchenko <zheka@uvt.cz>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When a reload is in progress, wait until it has all finished
before re-reading all of the logging parameters
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
When a reload is in progress, wait until the whole thing has
finished before setting parameters
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Add the code to do the actual corosync.conf reload to cfg, along with
a corosync-cfgtool -R command to trigger it
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Pass an icmap hashtable into coroparse so we can load it into
a temporary one during reload
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
This patch replaces the existing freopen method of
forcing stdin/out/err to /dev/null with the more
usual system of open/dup2.
While I don't like posting patches I don't fully understand,
this patch seems to fix a problem where stdout/err get
assigned to a socket causing double logging output
on systemd.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
I've seen a few instances where corosync has shut down for
apparently 'no reason'. In fact most of the time the shutdown
has been caused by an external source (often an init script)
but it's not been obvious what has happened and people
implicate the deamon
This patch simply adds a log message to the signal handler
when it is called so that the cause of the shutdown is obvious.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
icmap_get_r is now implemented using this function. Function is not very
safe tho defined as static.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Implementation should allow pass only parts of string (shorten string)
and must prohibit reading of uninitialized memory.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Patch adds reentrant version of most of functions (with exception of RO
flags support and tracking) to allow multiple icmap instances existence
inside corosync.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
qb_loop_timer_add expects the timeout to be in nanoseconds, but we were
passing the value in milliseconds. Scale the timeout appropriately.
Signed-off-by: Michael Chapman <mike@very.puzzling.org>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
We don't reference the connection object on creation, so there
is on reason to dereference it on disconnect.
Signed-off-by: David Vossel <dvossel@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
vague and unhelpful. People have to look for the following quorum
message and try to deduce which nodes have joined or left from that
and past membership messages, even though the routine printing the
message already has this information to hand.
This patch fixes that message so that it prints the nodeids of the nodes
that have joined/left the cluster.
Signed-Off-By: Christine Caulfield <ccaulfie@redhat.com>
Reviewed-By: Jan Friesse <jfriesse@redhat.com>
When corosync was started in daemon mode and there was parse error, no
way existed how to find out what happened (this is usual situation with
systemd enabled systems). Solution seems to be output to syslog by
default.
Also redundant line with setting logsys is removed because it's no
longer needed, because FORK and THREADED mode options has no longer
effect. FORK is handled by libqb by default and THREADED mode is forced
by calling logsys_thread_start.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Status string should be same lenght as needed for cfg
ringstatusget function.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
Even this check is really not needed, it's nice to have it and on fault
ensure that cluster_name is really NULL.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Code for zero-copy in cpg does following mmaps:
- Mmap anonymous, private memory to some address (-> malloc)
- Mmap shared memory of fd to address returned by first mmap
(effectively shadows first mapping)
This is not necessary and only one mapping is needed.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Add poll timer scheduler to be called 3 times per token timeout.
If poll timer was not called for more then 0.8 * token timeout, it means
corosync process was not scheduled and ether token_timeout should be
increased or load should be reduced (useful for VM, where host is
overcommitted so VM is not scheduled as expected).
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
When using corosync with clear_node_high_bit setting to yes,
the highest bit is cleared. When all the cluster nodes are in
one subnet, we probably configure the IP addresses as follows:
node1: 147.2.207.64
node2: 147.2.207.192
If the byte order of the nodeid is little endian, wiping off the
highest bit will make the two nodes have the same nodeid!
This patch fixes this by converting the nodeid to network order.
Signed-off-by: Xia Li <xli@suse.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
If configuration file contains closing brace before opening brace
at top level, configuration parsing is stopped and file is not
completely parsed. Solution is to detect extra closing brace and display
error.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
If colon was entered as part of value on end of value, it is deleted.
This makes impossible to enter (legal) IPv6 address ending with :: (like
fed0::).
Also when line contains both brace and colon, it is parsed twice (first
as key = value and second as start of section). This is handled by
continue in if section.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
"Corosync Cluster Engine ... started" message is shown after
logsys is full configured.
Signed-off-by: Kazunori INOUE <inouekazu@intellilink.co.jp>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Creating qb_loop before daemonization is not problem for poll or epoll
type loops, but it's problem for kqueue, because kqueue is not shared
in child with parent after fork.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Patch for support waiting_trans_ack may fail if there is synchronization
happening between delivery of fragmented message. In such situation,
fragmentation layer is waiting for message with correct number, but it
will never arrive.
Solution is to handle (callback) change of waiting_trans_ack and use
different queue.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
If instance->memb_state is not OPERATION or RECOVERY, we was passing NULL
to cs_queue_used call.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
This patch creates a special message queue for synchronization messages.
This prevents a situation in which messages are queued in the
new_message_queue but have not yet been originated from corrupting the
synchronization process.
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
Corosync now works with infiniband transport in any redundant ring mode
Signed-off-by: Evgeny Barskiy <barskiy@rts.ru>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
If failed_to_recv is set (node detect itself not able to receive
message), we can end up with assert, because my_failed_list and
my_member_list are same list. This is happening because we are not
following specification and we allow to mark node itself as failed.
Because if failed_to_recv is set and we reached consensus across nodes,
single node membership is created (ignoring both fail list and
member_list), we can skip assert.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
The libtotem_pg library uses symbols from libqb, so it should be
explicitely linked with it. This doesn't cause problems for corosync
binary itself, as it is linked to both libraries, but can cause
problems if anything else links to libtotem_pg.so and automated
checkers can show this as a library problem.
Signed-off-by: Jacek Konieczny <jajcus@jajcus.net>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
my_processing_idx is pointer to received service list, instead of global
service number. If we check state of service we should use service_id
instead of my_processing_idx.
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>