mirror_corosync/exec
Jerome FLESCH 99faa3b864 When flushing, discard only memb_join messages
This patch fixes a problem where one ring out of two went down and came
back up quite often.

The simplest setup to reproduce the bug is the following:
- 2 VMs, connected by 2 network interfaces
- OS: Linux
- On one of the VMs, a test program sending some CPG messages (see the
  script "test_corosync.sh" attached to this mail for an example)

Here are the Corosync logs we get with this setup:

Jun 06 16:23:40 corosync [TOTEM ] A processor joined or left the
membership and a new membership was formed.
Jun 06 16:23:40 corosync [CPG   ] chosen downlist: sender r(0)
ip(192.168.56.104) r(1) ip(192.168.57.104) ; members(old:1 left:0)
Jun 06 16:23:40 corosync [MAIN  ] Completed service synchronization,
ready to provide service.
Jun 06 16:24:37 corosync [TOTEM ] Marking ringid 1 interface
192.168.57.105 FAULTY
Jun 06 16:24:38 corosync [TOTEM ] Automatically recovered ring 1
Jun 06 16:25:33 corosync [TOTEM ] Marking ringid 1 interface
192.168.57.105 FAULTY
Jun 06 16:25:34 corosync [TOTEM ] Automatically recovered ring 1
Jun 06 16:26:35 corosync [TOTEM ] Marking ringid 1 interface
192.168.57.105 FAULTY
Jun 06 16:26:36 corosync [TOTEM ] Automatically recovered ring 1
(...)

The second ring goes down about every 2 minutes and comes back up
automatically right after.

We spent some time looking for the commit that introduced this bug, and
it appears to be due to the following one:
Corosync 1.3.3 -> 1.3.4: e27a58d93d
Corosync 1.4.1 -> 1.4.2: be608c0502
Commit message: Ignore memb_join messages during flush operations

I had a look at this commit, and it seems to me that it drops too many
packets:
Because of this commit, while totemrrp_recv_flush() is running, Corosync
drops not only memb_join packets but also ORF tokens. In the end, it
seems that sometimes so many tokens are dropped that Corosync marks the
ring as faulty.

To fix that, only memb_join messages are dropped now.
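
The filtering rule described above can be sketched roughly as follows.
This is a minimal illustration, not the code in totemudp.c: the
function should_discard(), the MSG_* tag values, and the assumption
that the first byte of a packet carries its message type are all
hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message-type tags, assumed to sit in the first byte of
 * every packet. The numeric values are illustrative only. */
enum msg_type {
    MSG_ORF_TOKEN = 0,
    MSG_MCAST     = 1,
    MSG_MEMB_JOIN = 3
};

/* Decide whether a received packet is discarded.
 * flushing: non-zero while a receive flush is in progress.
 * Returns 1 to discard the packet, 0 to deliver it. */
static int should_discard(int flushing, const unsigned char *packet,
                          size_t len)
{
    assert(packet != NULL || len == 0);

    /* Assumption for this sketch: an empty packet cannot be classified,
     * so it is dropped. */
    if (len == 0) {
        return 1;
    }

    /* During a flush, only memb_join messages are dropped. ORF tokens
     * and every other message type are still delivered, so the ring is
     * not starved of tokens. */
    if (flushing && packet[0] == MSG_MEMB_JOIN) {
        return 1;
    }

    return 0;
}
```

With this rule, a memb_join packet received during a flush is
discarded, while an ORF token received at the same moment is still
delivered. Outside of a flush, nothing is filtered.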

Signed-off-by: Jerome FLESCH <jerome.flesch@netasq.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-06-11 10:59:30 +02:00
.gitignore Add .gitignore files. 2010-10-21 07:43:46 -07:00
apidef.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
apidef.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
cfg.c Make ifaces_get work with dynamic no_rings 2012-03-26 11:54:26 +02:00
cmap.c Move hdb_error_to_cs to corotypes.h 2012-02-14 11:10:14 +11:00
coroparse.c Properly parse uidgid files 2012-04-10 09:08:36 +02:00
cpg.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
cs_queue.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
fsm.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
icmap.c icmap: don't leak memory when changing ro/rw status on a key 2012-04-24 09:28:23 +02:00
ipc_glue.c rename mainconfig to logconfig 2012-05-29 09:36:00 +02:00
logconfig.c rename mainconfig to logconfig 2012-05-29 09:36:00 +02:00
logconfig.h rename mainconfig to logconfig 2012-05-29 09:36:00 +02:00
logsys.c LOG: make sure the log target is enabled. 2012-05-29 14:02:42 +10:00
main.c Store fdata with timestamp and pid in name 2012-06-05 12:19:42 +02:00
main.h ipcs: allow connections only after all services are ready 2012-04-16 13:39:03 +02:00
Makefile.am rename mainconfig to logconfig 2012-05-29 09:36:00 +02:00
mon.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
pload.c pload: make it a test service and not a public one 2012-03-12 07:11:51 +01:00
quorum.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
quorum.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
schedwrk.c Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
schedwrk.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
service.c rename mainconfig to logconfig 2012-05-29 09:36:00 +02:00
service.h drop evs service 2012-03-12 15:51:50 +01:00
sync.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
sync.h sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
timer.c Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
timer.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
totemconfig.c crypto: Remove sha224 and add md5 hash 2012-03-15 17:36:56 +01:00
totemconfig.h Tweak nodeid warning 2012-02-21 16:33:56 +01:00
totemcrypto.c crypto init: release *_slot resource after init 2012-04-20 10:57:16 +02:00
totemcrypto.h crypto: change network packets and add dynamic crypto header/data 2012-03-14 15:57:01 +01:00
totemiba.c Update crypto_set API 2012-03-15 17:33:53 +01:00
totemiba.h Update crypto_set API 2012-03-15 17:33:53 +01:00
totemip.c totemip: Support bind to exact address 2012-05-24 14:01:12 +02:00
totemmrp.c Make ifaces_get work with dynamic no_rings 2012-03-26 11:54:26 +02:00
totemmrp.h Make ifaces_get work with dynamic no_rings 2012-03-26 11:54:26 +02:00
totemnet.c Update crypto_set API 2012-03-15 17:33:53 +01:00
totemnet.h Update crypto_set API 2012-03-15 17:33:53 +01:00
totempg.c Make ifaces_get work with dynamic no_rings 2012-03-26 11:54:26 +02:00
totemrrp.c Always allocate totemrrp stats array 2012-04-10 09:08:42 +02:00
totemrrp.h Update crypto_set API 2012-03-15 17:33:53 +01:00
totemsrp.c Include ringid in processor joined log message 2012-05-17 14:58:04 +02:00
totemsrp.h Make ifaces_get work with dynamic no_rings 2012-03-26 11:54:26 +02:00
totemudp.c When flushing, discard only memb_join messages 2012-06-11 10:59:30 +02:00
totemudp.h Update crypto_set API 2012-03-15 17:33:53 +01:00
totemudpu.c totemudpu: Bind sending sockets to bindto address 2012-05-31 09:28:52 +02:00
totemudpu.h Update crypto_set API 2012-03-15 17:33:53 +01:00
util.c drop evs service 2012-03-12 15:51:50 +01:00
util.h rename mainconfig to logconfig 2012-05-29 09:36:00 +02:00
votequorum.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
votequorum.h Remove include/engine/quorum and integrate it into exec/engine.h 2012-02-08 08:31:10 -07:00
vsf_quorum.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
vsf_ykd.c Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
vsf_ykd.h Remove include/engine/quorum and integrate it into exec/engine.h 2012-02-08 08:31:10 -07:00
vsf.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
wd.c sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00