Commit Graph

1420 Commits

Author SHA1 Message Date
Fabio M. Di Nitto
10098dba27 votequorum: add last_man_standing support (default: off)
this flag (0|1) can be configured via quorum.last_man_standing and when
enabled, it allows expected_votes to be dynamically recalculated.

Assuming an 8 nodes cluster, every node votes 1 (mandatory requirement for
this feature).

In the first event, 3 nodes are lost.

The remaining partition of 5 is barely quorate.

After a configurable timeout (quorum.last_man_standing_window, default 10sec)
the quorate partition is allow to recalculate expected_votes based on
the remaining nodes.

This operation will bring expected_votes to 5 and quorum to 3.

Repeating the above loop, in the next event, 2 more nodes are allowed to
die. etc. etc.

Reviewed-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-01-10 15:48:17 +01:00
Fabio M. Di Nitto
9589611dc4 votequorum: drop concept of DISALLOWED
this is a very old leftover from the RHEL5 timeframe, not used in RHEL6.

Also change votequorum soname since this change implies an ABI change.

Reviewed-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-01-10 15:48:10 +01:00
Fabio M. Di Nitto
b41372c6b2 votequorum: add auto_tie_breaker support (default: off)
this flag (0|1) can be configured via quorum.auto_tie_breaker and when
enabled, support for perfect even split is on.

In case of a 50% of votes loss in one single transition, the partition
with the node that has the lowest node id will remain quorate.

Reviewed-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-01-10 15:47:57 +01:00
Fabio M. Di Nitto
e8d0af0bc8 votequorum: add wait_for_all support (default: off)
this flag (0|1) can be configured via quorum.wait_for_all and changes
behavior when granting quorum for the first time.

Normal behavior (default / 0) grants quorum as soon as enough nodes
are available in a cluster.

Setting this value to 1 will grant quorum only after all cluster
memembers are part of the cluster at the same time.

Reviewed-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-01-10 15:47:52 +01:00
Fabio M. Di Nitto
e34d509df7 quorum: change API to return quorum type at initialization time
corosync internal theory of operation is that without a quorum provider
the cluster is always quorate. This is fine for membership free clusters
but it does pose a problem for applications that need membership and
"real" quorum.

this change add quorum_type to quorum_initialize call to return QUORUM_FREE
or QUORUM_SET. Applications can then make their own decisions to error out
or continue operating.

The only other way to know if a quorum provider is enabled/configured is
to poke at confdb/objdb, but adds an unnecessary burden to applications
that really don't need to use an entire library for a boolean value.

Reviewed-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-01-10 15:47:24 +01:00
Steven Dake
e5aba30a49 Move coroapi out of external headers
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Angus Salkled <asalkeld@redhat.com>
2012-01-07 17:47:45 -07:00
Steven Dake
8ad583a54c Move logsys.c into corosync binary instead of a shared object
Our preferred shared logging system is exported via the libqb library.  As
a result, the corosync project no longer needs to export logsys.so and the
code can be directly included in the binary.  The header file can also be
removed.

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2012-01-06 18:19:59 -07:00
Angus Salkeld
7ef81f1235 Fix some iterator based mem leaks
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2012-01-06 12:25:08 +11:00
Fabio M. Di Nitto
de194160e1 ipc: make less noise
switch from LOG_INFO to LOG_DEBUG for some basic operations

Reviewed-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2012-01-03 11:12:52 +01:00
Jan Friesse
7c250a5147 Remove objdb and confdb
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-12-15 09:19:18 +01:00
Jan Friesse
8a45e2b152 Move corosync core to use icmap
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-12-15 09:19:17 +01:00
Jan Friesse
525e6a6ebe Add icmap
Icmap is replacement for objdb, based on libqb map (trie).

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-12-15 09:19:17 +01:00
Angus Salkeld
7b02f176df Check for the correct message size in totempg_groups_joined_reserve()
Currently:
- send_reserve() adds to the reserve
- msg_count_send_ok() tests ((avail - totempg_reserved) > msg_count)

So essentially we are checking to see if 2 * msg_count can fit in
the q.

So instead I am using byte_count_send_ok (size) to see if the
message will fit then calling send_reserve()

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-12-15 10:43:11 +11:00
Angus Salkeld
2ba4ebe09e Fix cpgbench (large message sizes)
To allow async cpg messages of 1M we need to:
1) increase the totem queue size 4 times
2) align the critical level to one large message free

There are a number of reasons for doing this:

We can't let cpg_mcast_joined() fail because the user will not see it
and will assume is has succeded.

The reason we are getting good performance is by providing a negative
feedback loop from the totem q to the IPC/poll system. This relies
on 4 q states low/med/high/crit. With messages of size 1M you
now have a q of size one and now go from level low to crit instantly
then back to low as messages are put on and taken off. I don't think
this is the best behaviour. By having a q size of 4 allows the system
to utilize the q better and give us time to respond to changes in
the q level.

To effectively achieve flow control with a q of size 1 would require
all the clients to request the space on the q like is done in
totempg_groups_joined_reserve() but probably in shared memory
This would take quite a bit of re-work.

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
2011-12-15 10:43:11 +11:00
Angus Salkeld
94b11502cb LOG: get the logging to work from loaded quorum modules
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-12-15 10:10:54 +11:00
Angus Salkeld
a748700cde Be more flexible (correct) with flowcontrol.
Many functions do not require flowcontrol and are two-way
so they can get failures from corosync.
Only cpg_mcast_joined() _really_ needs the current level
of flowcontrol.

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-12-14 12:03:42 +11:00
Angus Salkeld
c317ee433f LOG: Fix a crash in the shutdown.
Reviewed-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
2011-12-13 15:00:42 +11:00
Yunkai Zhang
232ac5a7fe Correct nodeid in memb_state_commit_token_send function
Signed-off-by: Yunkai Zhang <qiushu.zyk@taobao.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-11-30 11:21:22 -07:00
Steven Dake
e48ddf99a6 From: Yunkai Zhang:
Today, I have observed one of the reason that corosync running into
FAILED TO RECEIVE state.

There was five nodes(A,B,C,D,E) in my testing, and I limited the UDP
transmission rate of C nodes by iptables command:
iptables -A INPUT -i eth0 -p udp -m limit --limit 10000/s
--limit-burst 1 -j ACCEPT
iptables -A INPUT -i eth0 -p udp -j DROP

After one hour later, C node had been missing some MCAST messages,
it's state described as following:
==state of C node==
my_aru:0x805
my_high_seq_received:0xC2C
my_aru_count:7

=>receved MCAST message with seq:806 from B nodes
=>enter *message_handler_mcast*
  =>add this message to regular_sort_queue
  ...
  =>enter *update_aru* function
    => range = (my_high_seq_received - my_aru)
             = (0xC2C - 0x805)
             = 1063
    => if range>1024, do nothing and and return directly.
==END==

According this logic, after (my_high_req_received-my_aru)>1024, my_aru
will not be updated though corosync can receive MCAST messages
retransmitted by other nodes.

But at that timte, my_aru_count was only 7. So the corosync at C node
would keep in this status until my_aru_count increased to
fail_to_recv_const(the default value is 2500). This was a long time
for corosync, but we wasted it.

To solve this issue, maybe we can enlarge the range condition in
update_aru function? Or we just ingnore the checking of range value,
it seems no harmfull, because we have been using fail_to_recv_const to
control the things.

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2011-11-29 10:59:11 -07:00
Yunkai Zhang
19652c3d7c Correct nodeid of token when we retransmit it
Although incorrect nodeid will not affect program's logic, but it will
make us confused when we add some logs to record the transmission path of
token in debug mode.

Signed-off-by: Yunkai Zhang <qiushu.zyk@taobao.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-11-28 05:56:28 -07:00
Yunkai Zhang
d991400372 Fixed bug when corosync receive JoinMSG in OPERATIONAL state
Accordig the totem protocal, nodes should enter GATHER state when it
receive JoinMSG in OPERATIONAL state. If we discard it in OPERATIONAL
state, the nodes sending this JoinMSG could not receive the response
untill other nodes reach token lost timeout.

This bug will cause nodes having entered GATHER state spend more time to
rejoin the ring, and then it will make nodes reach token expired timeout
more easily.

Signed-off-by: Yunkai Zhang <qiushu.zyk@taobao.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-11-26 08:52:26 -07:00
Angus Salkeld
92ca91fa66 TOTEM: better clean up on exit
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-11-11 09:08:04 +11:00
Angus Salkeld
a6729003a6 OBJDB: free up resources on exit
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-11-11 09:06:50 +11:00
Angus Salkeld
0fc51c40fd LOG: cleanup logging resources at exit
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-11-11 09:05:08 +11:00
Angus Salkeld
21f1008be8 Clean up the poll loop resourses on exit
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-11-11 08:13:08 +11:00
Angus Salkeld
f5a31e55a2 Add calls to missing object_find_destroy() to fix mem leaks
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-11-11 08:12:13 +11:00
Angus Salkeld
390391acba Free mem allocated by getaddrinfo
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-11-11 08:11:17 +11:00
Anton Jouline
a358791d5b Adding support for dynamic membership with UDPU transport
Add a new object called totem.interface.dynamic to allow creation/deletion
of new child objects using the corosync-objctl utility:

to add new member:
linux#  corosync-objctl -c totem.interface.dynamic.10-211-55-12

to delete an existing member:
linux#  corosync-objctl -d totem.interface.dynamic.10-211-55-12

Corosync will dynamically add these members to the configuration and start
communicating with those nodes.

Signed-off-by: Anton Jouline <anton.jouline@cbsinteractive.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-27 23:52:16 -07:00
Jan Friesse
783dd4e553 Remove unused buf and len variables in log_printf
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-25 16:29:10 +02:00
Jan Friesse
26db8b21b2 api: Change some of totempg definitons
Recent changes in patch "Get rid of hdb usage in totempg.h interface"
caused incompatibility between corosync API and totempg.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-24 17:43:36 +02:00
Jan Friesse
87821f52a6 totemmrp: Allow compilation without warnings
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-24 17:43:32 +02:00
Jan Friesse
1711aea72f Allow compilation of totempg without warnings
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-24 17:43:28 +02:00
Angus Salkeld
2cf37d4063 Set the size of the blackbox to the size on flatiron
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-22 17:42:53 +11:00
Angus Salkeld
9fbd5c08c4 don't log an error if exiting with 0
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-22 10:51:47 +11:00
Steven Dake
8671c967e1 res could return an undefined value if there was no error in
totempg_groups_initialize

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2011-10-21 03:01:14 -07:00
Angus Salkeld
78a5260c06 LOG: use libqb facility conversion functions
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-21 19:34:43 +11:00
Angus Salkeld
0e58141a2f LOG: get logging to file working correctly
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-21 19:34:43 +11:00
Angus Salkeld
26a6e26f57 LOG: Fix debugging
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-10-21 19:34:43 +11:00
Masatake YAMATO
721e2d2a2a Remove cloned lines in main of main.c
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
2011-10-09 20:32:39 -07:00
Steven Dake
2ec4ddb039 Deliver all messages from my_high_seq_recieved to the last gap
This patch passes two test cases:

-------
Test #1
-------
Two node cluster - run cpgbench on each node

modify totemsrp with following defines:
Two test cases:

-------
Test #2
-------
5 node cluster

start 5 nodes randomly at about same time, start 5 nodes randomly at about
same time, wait 10 seconds and attempt to send a message.  If message blocks
on "TRY_AGAIN" likely a message loss has occured.  Wait a few minutes without
cyclng the nodes and see if the TRY_AGAIN state becomes unblocked.

If it doesn't the test case has failed

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2011-09-22 10:21:37 +02:00
Jan Friesse
f6c2a8dab7 totemconfig: change minimum RRP threshold
RRP threshold can be lower value then 5.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2011-09-08 09:52:16 +02:00
Steven Dake
48ffa8892d Ignore memb_join messages during flush operations
a memb_join operation that occurs during flushing can result in an
entry into the GATHER state from the RECOVERY state.  This results in the
regular sort queue being used instead of the recovery sort queue, resulting
in segfault.

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2011-09-02 09:58:44 -07:00
Jan Friesse
752239eaa1 rrp: Higher threshold in passive mode for mcast
There were too much false positives with passive mode rrp when high
number of messages were received.

Patch adds new configurable variable rrp_problem_count_mcast_threshold
which is by default 10 times rrp_problem_count_threshold and this is
used as threshold for multicast packets in passive mode. Variable is
unused in active mode.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed by: Steven Dake <sdake@redhat.com>
2011-09-01 11:21:09 +02:00
Jan Friesse
0eade8de79 rrp: Handle endless loop if all ifaces are faulty
If all interfaces were faulty, passive_mcast_flush_send and related
functions ended in endless loop. This is now handled and if there is no
live interface, message is dropped.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed by: Steven Dake <sdake@redhat.com>
2011-09-01 11:20:18 +02:00
Steven Dake
e920fef7e9 Get rid of hdb usage in totempg.h interface
hdb has some expense and is not necessary in the totempg.so runtime.  This
patch removes the dependence on hdb and instead uses a direct pointer.

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2011-08-23 22:29:01 -07:00
Steven Dake
32f11337b1 Remove hdb.h header includes from unnecessary files
The files in this patch do not use the hdb.h header.

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2011-08-23 22:28:40 -07:00
Steven Dake
71f044bfe7 Add totempg_threaded_mode_enable() api
This API allows totem to operate as a multithreaded library.  Performance is
better without threads but some library users may only have multithreaded
systems.  In the corosync case where we have removed threads, this reduces
cpu utilization by ~10% by removing about 50% of the mutex lock and unlock calls
that occur during typical operation.  Since the latest corosync is nearly
thread free, there is no need for mutex operations.

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2011-08-22 19:31:52 -07:00
Steven Dake
9f36a892a8 Move cs_queue.h from include directory to exec directory
This file is only used by totemsrp.c.  Move out of general include
directory.

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2011-08-22 19:31:33 -07:00
Steven Dake
67972efa7d use va version of external log function
This removes a sprintf operation in the totem and ipc logging operations

Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2011-08-22 19:31:15 -07:00
Tim Beale
370d9bcecf Display ring-ID consistently in debug
Ring ID was being displayed both as hex and decimal in places. Update so
it's displayed consistently (I chose hex) to make debugging easier.

Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2011-08-17 12:15:16 +10:00