mirror_corosync/exec
Jason 4ee84c51fa totem: Ignore duplicated commit tokens in recovery
In active rrp mode, commit tokens are treated as mcast data messages,
thus, rrp directly delivers them to srp layer by active_mcast_recv().
This will result in duplicated commit tokens being received by srp from
different heartbeat links. If a node is in recovery state and has already
sent out the initial orf token, those duplicated commit tokens will
cause message_handler_memb_commit_token() to send the initial orf token
again. This is wrong because it resets the orf token content in
instance->orf_token_retransmit, which breaks the token retransmission
state.

Furthermore, sending the initial orf token again and again may lead
active_token_recv() to drop some subsequent orf tokens.
Normally this is OK for rrp because srp will do token retransmission,
but as said above, the srp retransmission state has already been broken,
so we finally hit a "token lost in recovery state" condition caused
by software. If the token timeout value is large, it will take a long
time to create a new ring.

This can be reproduced with two nodes set to active rrp mode and
two heartbeat links. Keep one node always on and let the other one
stop/start again and again. The problem reproduces with low probability.
In theory, I think, the more heartbeat links are used, the more easily it
can be reproduced.

This problem can be resolved by letting
message_handler_memb_commit_token() ignore duplicated commit tokens
in recovery state if the node (the ring representative) has already sent
out the initial orf token.

Unlike the previous take, this version does not depend on stored token
data but uses originated_orf_token in totemsrp_instance to remember
whether the initial orf token has already been originated for the current
membership.
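
The guard described above can be sketched roughly as follows. This is a
simplified, self-contained model, not the actual totemsrp.c code: the
struct srp_instance type, the enter_recovery() and handle_commit_token()
helpers and the orf_tokens_sent counter are hypothetical names used only
for illustration. Only the originated_orf_token flag comes from the
commit message, and where the real code resets it is an assumption here.

```c
/* Hypothetical, simplified stand-ins for the real totemsrp types. */
enum memb_state { MEMB_STATE_COMMIT, MEMB_STATE_RECOVERY };

struct srp_instance {
	enum memb_state memb_state;
	int originated_orf_token; /* 1 once the initial orf token was sent */
	int orf_tokens_sent;      /* illustration only: counts initial tokens */
};

/*
 * Entering recovery for a new membership: clear the flag so the next
 * commit token may originate the initial orf token exactly once.
 * (Assumption: the flag is reset per membership at this point.)
 */
static void enter_recovery(struct srp_instance *inst)
{
	inst->memb_state = MEMB_STATE_RECOVERY;
	inst->originated_orf_token = 0;
}

/*
 * Model of the commit token handler on the ring representative.
 * Returns 1 if the initial orf token was originated, 0 if the commit
 * token was ignored (wrong state, or a duplicate from another link).
 */
static int handle_commit_token(struct srp_instance *inst)
{
	if (inst->memb_state != MEMB_STATE_RECOVERY) {
		return 0;
	}
	if (inst->originated_orf_token) {
		/* Duplicate: leave the retransmission state untouched. */
		return 0;
	}
	inst->originated_orf_token = 1;
	inst->orf_tokens_sent++;
	return 1;
}
```

In this model, delivering the same commit token twice (once per
heartbeat link) after enter_recovery() originates the initial orf token
only on the first call; the second call is ignored until the next
membership resets the flag.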

Signed-off-by: Jason <huzhijiang@gmail.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Christine Caulfield <ccaulfie@redhat.com>
2015-01-15 17:33:04 +01:00
.gitignore Add .gitignore files. 2010-10-21 07:43:46 -07:00
apidef.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
apidef.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
cfg.c Indent: Remove newline before else branch start 2014-05-09 11:38:02 +02:00
cmap.c config: Fix typos 2014-07-24 10:27:45 +01:00
coroparse.c votequorum: Make qdev timeout in sync configurable 2014-08-05 17:22:52 +02:00
cpg.c cpg: Make sure left nodes are really removed 2014-02-19 10:59:14 +01:00
cs_queue.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
fsm.h Make logging of WD and MON service correct 2012-08-16 14:45:15 +02:00
icmap.c icmap: Add func to test equality of two key values 2013-09-10 17:02:12 +02:00
ipc_glue.c ipc: Process votequorum messages during sync 2014-08-05 17:22:44 +02:00
logconfig.c Reload: Add atomic reload to log config 2013-09-12 16:10:07 +01:00
logconfig.h rename mainconfig to logconfig 2012-05-29 09:36:00 +02:00
logsys.c logsys: Log warning if flightrecorder init fails 2014-06-02 14:36:10 +02:00
main.c Set RR priority by default 2015-01-05 15:01:49 +01:00
main.h Reload: Make coroparse use a designated icmap hash table 2013-09-12 16:09:06 +01:00
Makefile.am be consistent in using CPPFLAGS vs CFLAGS 2014-07-21 08:47:21 +02:00
mon.c mon: Make monitoring work 2014-02-25 14:57:20 +01:00
pload.c build: bring SOLARIS up to the same standard as other OSes 2012-08-30 15:00:27 +02:00
quorum.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
quorum.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
schedwrk.c Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
schedwrk.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
service.c service: Fix memleak in service_unlink_and_exit 2013-06-21 11:21:29 +02:00
service.h service: remove leftovers from mt corosync 2012-08-09 15:10:16 +02:00
sync.c Correctly check if service was unloaded 2012-10-17 15:06:36 +02:00
sync.h sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
timer.c Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
timer.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
totemconfig.c config: Ensure mcast address/port differs for rrp 2014-11-24 11:55:37 +01:00
totemconfig.h totemconfig: refactor nodelist_to_interface func 2014-07-22 14:59:31 +02:00
totemcrypto.c [crypto] fix crypto block rounding/padding calculation 2014-09-06 07:11:56 +02:00
totemcrypto.h crypto: drop < 2.3 protocols and onwire compat 2013-01-14 11:49:32 +01:00
totemiba.c totemiba: Fix incorrect failed log message 2014-05-15 15:28:51 +02:00
totemiba.h Return back "Totem is unable to form..." message 2012-10-08 16:53:35 +02:00
totemip.c Adjust MTU for IPv6 correctly 2014-10-01 14:20:21 +02:00
totemmrp.c Add waiting_trans_ack also to fragmentation layer 2012-11-22 11:48:12 +01:00
totemmrp.h Add waiting_trans_ack also to fragmentation layer 2012-11-22 11:48:12 +01:00
totemnet.c totemudpu: Implement member_set_active 2014-08-26 15:36:05 +02:00
totemnet.h totemnet: Add totemnet_member_set_active 2014-08-26 15:35:59 +02:00
totempg.c totempg: Make iov_delv local variable 2013-03-21 14:24:23 +01:00
totemrrp.c Log auto-recovery of ring only once 2015-01-14 18:13:29 +01:00
totemrrp.h totem: Inform RRP about membership changes 2014-08-26 15:35:56 +02:00
totemsrp.c totem: Ignore duplicated commit tokens in recovery 2015-01-15 17:33:04 +01:00
totemsrp.h Add waiting_trans_ack also to fragmentation layer 2012-11-22 11:48:12 +01:00
totemudp.c Adjust MTU for IPv6 correctly 2014-10-01 14:20:21 +02:00
totemudp.h Return back "Totem is unable to form..." message 2012-10-08 16:53:35 +02:00
totemudpu.c Adjust MTU for IPv6 correctly 2014-10-01 14:20:21 +02:00
totemudpu.h totemudpu: Implement member_set_active 2014-08-26 15:36:05 +02:00
util.c Introduce get_run_dir function 2014-06-02 14:53:18 +02:00
util.h Move ringid store and load from totem library 2014-06-02 14:54:57 +02:00
votequorum.c votequorum: Add cmap key to reset wait_for_all 2014-08-12 16:02:46 +01:00
votequorum.h Remove include/engine/quorum and integrate it into exec/engine.h 2012-02-08 08:31:10 -07:00
vsf_quorum.c Free object allocated at quorum_register_callback 2014-01-23 17:18:44 +01:00
vsf_ykd.c YKD: Fix loading of YKD quorum module 2014-08-18 09:33:59 +01:00
vsf_ykd.h Remove include/engine/quorum and integrate it into exec/engine.h 2012-02-08 08:31:10 -07:00
vsf.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
wd.c build: bring SOLARIS up to the same standard as other OSes 2012-08-30 15:00:27 +02:00