mirror_corosync/exec
Jan Friesse e7a82370a7 totemsrp: Switch totempg buffers at the right time
Commit 92e0f9c7bb added switching of
totempg buffers in the sync phase. But because the buffers got switched
too early, there was a problem when delivering recovered messages
(messages got corrupted and/or lost). The solution is to switch the
buffers after the recovered messages have been delivered.

I think it is worth describing the complete history with reproducers so
it doesn't get lost.

It all started with 402638929e (more info
about the original problem is described in
https://bugzilla.redhat.com/show_bug.cgi?id=820821). That patch
solved a problem which can be reproduced with the following reproducer:
- 2 nodes
- Both nodes running corosync and testcpg
- Pause node 1 (SIGSTOP of corosync)
- On node 1, send some messages with testcpg
  (it's not answering, but this doesn't matter; simply hitting the
  ENTER key a few times is enough)
- Wait till node 2 detects that node 1 left
- Unpause node 1 (SIGCONT of corosync)

and on node 1 the newly mcasted cpg messages got sent before the sync
barrier, so node 2 logs "Unknown node -> we will not deliver message".

The solution was to add a switch of the totemsrp new messages buffer.
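
The pause/unpause steps of the reproducer above can be sketched as two
small shell helpers to run on node 1. This is only a sketch: the
function names are made up for illustration, and it assumes `pidof` is
available and that the caller is allowed to signal corosync.

```shell
# Hypothetical helper names, not part of corosync.

corosync_pid() {
    # pidof prints nothing and returns non-zero when corosync is not running
    pidof corosync
}

pause_corosync() {
    # SIGSTOP freezes corosync so node 2 eventually sees node 1 as gone
    pid=$(corosync_pid) || { echo "corosync is not running" >&2; return 1; }
    kill -STOP $pid
}

resume_corosync() {
    # SIGCONT lets corosync continue and rejoin the membership
    pid=$(corosync_pid) || { echo "corosync is not running" >&2; return 1; }
    kill -CONT $pid
}
```

Usage on node 1: call `pause_corosync`, hit ENTER a few times in
testcpg, wait till node 2 detects that node 1 left, then call
`resume_corosync`.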

That patch was not enough, so a new one
(92e0f9c7bb) was created. The reproducer of the
problem was similar, just cpgverify was used instead of testcpg.
Occasionally, when node 1 was unpaused, it hung in the sync phase
because there was a partial message in the totempg buffers. The new
sync message had a different frag cont, so it was thrown away and never
delivered.

After many years, the problem solved by this patch was found
(the original issue is described in
https://github.com/corosync/corosync/issues/660).
The reproducer is more complex:
- 2 nodes
- Node 1 is rate-limited (used script on the hypervisor side):
  ```
  iface=tapXXXX
  # ~0.1MB/s in bit/s
  rate=838856
  # 1 MiB (burst size in bytes)
  burst=1048576
  tc qdisc add dev $iface root handle 1: htb default 1
  tc class add dev $iface parent 1: classid 1:1 htb rate ${rate}bps \
    burst ${burst}b
  tc qdisc add dev $iface handle ffff: ingress
  tc filter add dev $iface parent ffff: prio 50 basic police rate \
    ${rate}bps burst ${burst}b mtu 64kb "drop"
  ```
- Node 2 is running corosync and cpgverify
- Node 1 keeps restarting corosync and running cpgverify in a loop
  - Console 1: while true; do corosync; sleep 20; \
      kill $(pidof corosync); sleep 20; done
  - Console 2: while true; do ./cpgverify;done

And from time to time (usually reproduced in less than 5 minutes)
cpgverify reports a corrupted message.
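
The rate value used in the tc script above can be checked with a little
shell arithmetic. This is only a sketch of the unit conversion; the
variable names `bytes_per_sec` and `mib_per_sec` are made up for
illustration. One thing worth keeping in mind when adapting the script:
per the tc documentation, the `bps` suffix means bytes per second and
`bit` means bits per second, so 838856 matches the "~0.1MB/s" comment
only when read as bit/s.

```shell
# Rate from the reproducer, read as bits per second
rate=838856

# 8 bits per byte
bytes_per_sec=$((rate / 8))

# Express the result in MiB/s (1 MiB = 1048576 bytes), two decimals
mib_per_sec=$(awk -v b="$bytes_per_sec" 'BEGIN { printf "%.2f", b / 1048576 }')

echo "rate: ${bytes_per_sec} bytes/s (${mib_per_sec} MiB/s)"
```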

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2021-11-03 10:19:44 +01:00
.gitignore Add .gitignore files. 2010-10-21 07:43:46 -07:00
apidef.c CFG: Remove ring-reenable code 2017-08-03 14:32:02 +02:00
apidef.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
cfg.c cfg: corosync_cfg_trackstop blocks forever 2021-05-19 18:28:45 +02:00
cmap.c cmap: Assert copied string length 2019-11-28 09:44:44 +01:00
coroparse.c main: Add support for cgroup v2 and auto mode 2021-07-23 15:31:52 +02:00
cpg.c cpg: Change downlist log level 2020-01-09 12:40:32 +01:00
cs_queue.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
fsm.h Make logging of WD and MON service correct 2012-08-16 14:45:15 +02:00
icmap.c icmap: icmap_init_r() leaks if trie_create() fails 2020-03-26 14:42:41 +01:00
ipc_glue.c main: Move sched paramaters to config file 2018-11-15 17:30:03 +01:00
ipcs_stats.h stats: Add cmap key to clear the various stats. 2017-10-31 17:39:14 +01:00
logconfig.c logconfig: Remove double free of value 2019-11-28 09:44:44 +01:00
logconfig.h list: Replace uses of list.h with qblist.h 2016-10-27 14:56:52 +02:00
logsys.c logsys: Unlock config mutex on error 2021-09-13 09:13:54 +02:00
main.c totemconfig: Do not process totem.nodeid 2021-08-02 15:13:04 +02:00
main.h main: Replace COROSYNC_MAIN_CONFIG_FILE 2018-11-15 17:30:14 +01:00
Makefile.am nozzle: Add support for libnozzle devices 2019-02-26 13:11:35 +01:00
mon.c list: Replace uses of list.h with qblist.h 2016-10-27 14:56:52 +02:00
pload.c build: bring SOLARIS up to the same standard as other OSes 2012-08-30 15:00:27 +02:00
quorum.c Remove redundant header file inclusion 2016-12-05 09:59:08 +01:00
quorum.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
schedwrk.c schedwrk: Cleanup and make it work on PPC BE 2016-05-17 16:29:25 +02:00
schedwrk.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
service.c service: Fix memleak in service_unlink_and_exit 2013-06-21 11:21:29 +02:00
service.h service: remove leftovers from mt corosync 2012-08-09 15:10:16 +02:00
stats.c stats: fix crash when iterating over deleted keys 2021-06-03 10:14:47 +02:00
stats.h stats: Add stats for scheduler misses 2020-01-22 17:06:10 +01:00
sync.c sync: Assert sync_callbacks.name length 2019-11-28 09:44:44 +01:00
sync.h sync: kill evil and syncv1 in one shot 2012-03-09 11:15:08 +01:00
timer.c Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
timer.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
totemconfig.c totem: Add cancel_hold_on_retransmit config option 2021-08-20 16:55:48 +02:00
totemconfig.h totemconfig: Do not process totem.nodeid 2021-08-02 15:13:04 +02:00
totemip.c Revert "totemip: compare sin6_scope_id and interface_num" 2020-04-22 13:30:36 +02:00
totemknet.c knet: Fix node status display 2021-07-29 14:38:53 +02:00
totemknet.h cfg: New API to get extended node/link infomation 2020-11-26 16:15:50 +01:00
totemnet.c cfg: New API to get extended node/link infomation 2020-11-26 16:15:50 +01:00
totemnet.h cfg: New API to get extended node/link infomation 2020-11-26 16:15:50 +01:00
totempg.c cfg: New API to get extended node/link infomation 2020-11-26 16:15:50 +01:00
totemsrp.c totemsrp: Switch totempg buffers at the right time 2021-11-03 10:19:44 +01:00
totemsrp.h cfg: New API to get extended node/link infomation 2020-11-26 16:15:50 +01:00
totemudp.c cfg: New API to get extended node/link infomation 2020-11-26 16:15:50 +01:00
totemudp.h cfg: New API to get extended node/link infomation 2020-11-26 16:15:50 +01:00
totemudpu.c cfg: New API to get extended node/link infomation 2020-11-26 16:15:50 +01:00
totemudpu.h cfg: New API to get extended node/link infomation 2020-11-26 16:15:50 +01:00
util.c config: Properly check crypto and compress models 2021-04-14 18:07:20 +02:00
util.h config: Properly check crypto and compress models 2021-04-14 18:07:20 +02:00
votequorum.c config: don't reload vquorum if reload fails 2020-04-24 16:27:01 +02:00
votequorum.h list: Replace uses of list.h with qblist.h 2016-10-27 14:56:52 +02:00
vsf_quorum.c quorum: Add support for nodelist callback 2020-10-12 13:22:11 +02:00
vsf_ykd.c YKD: Fix loading of YKD quorum module 2014-08-18 09:33:59 +01:00
vsf_ykd.h list: Replace uses of list.h with qblist.h 2016-10-27 14:56:52 +02:00
vsf.h Update copyright header dates in exec directory 2012-02-13 17:05:04 -07:00
wd.c wd: fix snprintf warnings 2017-12-01 17:23:54 +01:00