Commit Graph

176 Commits

Author SHA1 Message Date
Tim Beale
08f07be323 A CPG client can sometimes lockup if the local node is in the downlist
In a 10-node cluster where all nodes are booting up and starting corosync
at the same time, sometimes during this process corosync detects a node as
leaving and rejoining the cluster.

Occasionally the downlist that gets picked contains the local node. When the
local node sends leave events for the downlist (including itself), it sets
its cpd state to CPD_STATE_UNJOINED and clears the cpd->group_name. This
means it no longer sends CPG events to the CPG client.

Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2011-08-18 14:57:15 +02:00
Tim Beale
5a724a9c39 Add code comment mapping for message handler defines
As a corosync-newbie it can be hard to bridge the gap between where a
particular message is sent and where the receive handler processes it,
and vice versa.

Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2011-08-17 11:52:25 +10:00
Angus Salkeld
37e17e7a94 libqb: logging & trace
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-08-09 10:37:16 +10:00
Angus Salkeld
a716f13bf9 Fix some compiler warnings
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-08-09 10:37:16 +10:00
Angus Salkeld
bd150728bf libqb: Improve IPC dispatch and async handling
Reviewed-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
2011-08-09 10:37:16 +10:00
Angus Salkeld
4dffef53fd CPG: downgrade some log messages
Reviewed-by: Steven Dake <sdake@redhat.com>
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
2011-08-09 10:37:16 +10:00
Angus Salkeld
4614c91fef libqb: fix valgrind warnings in mon/wd
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-08-09 10:37:15 +10:00
Angus Salkeld
f717bc60e1 libqb: make timer api a wrapper around qb_loop timers.
- change timeout values to nanoseconds
- fix timer handles (don't alloc on stack)

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-08-09 10:37:14 +10:00
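Because f717bc60e1 changes timer timeouts to nanoseconds, callers that think in milliseconds have to convert before arming a timer. The helper below is a minimal sketch under that assumption. `ms_to_ns()` and `NS_PER_MS` are hypothetical names and are not part of the libqb API.

```c
#include <stdint.h>

/* Hypothetical helper, not a libqb call: converts a millisecond
 * timeout to the nanosecond value the reworked timer API expects.
 * uint64_t keeps large timeouts from overflowing. */
#define NS_PER_MS 1000000ULL

uint64_t ms_to_ns(uint64_t ms)
{
	return ms * NS_PER_MS;
}
```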
Angus Salkeld
c6895faa05 libqb: change ipc -> qb_ipc
IPC: return 0/-ENOBUFS from message handler
IPC: use the new rate_limit API to improve perf.
CPG: add send_async API & hook up flow control
IPC: Fix flow control getting stuck.
IPC: Port the remaining libs to use libqb IPC
IPC: remove libqb flowcontrol API
TEST: put cpg_dispatch() in its own thread
IPC: clean up ipc_glue.c, name everything cs_ipcs_*()
IPC: add back statistics
IPC: remove coroipcc_ symbols from lib*.versions
IPC: init each se's IPC as it is loaded.
IPC: use the new connection_closed() event to free the context.
IPC: re-add zero copy functionality
IPC: remove cpg_mcast_joined_async() and make it the default
 -> now cpg_mcast_joined() == cpg_mcast_joined_async()
libqb: expose a libqb error converter
libqb: add missing error conversions
libqb: remove repeat try loop in lib/cpg.c
CPG: fix zero copy mcast
CPG: use newer return codes
Add ENOTCONN to qb_to_cs_error()
libqb: fix error conversion from errno to cs_error_t in confdb
libqb: change errno_to_cs to qb_to_cs_error
libqb: add a cs_strerror() to get a more meaningful message
libqb: fix some confusing error conversions.
libqb: set the timeout on recv's to -1 (wait forever)

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-08-09 10:37:14 +10:00
Angus Salkeld
fce8a3c3b6 libqb: convert coropoll calls to qb_loop calls.
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-08-09 10:37:14 +10:00
Steven Dake
c544e87bb0 Correct missing poll functions from service handler struct needed for confdb APIs
Signed-off-by: Steven Dake <sdake@redhat.com>
Reviewed-by: Jan Friesse <jfriesse@redhat.com>
2011-07-15 13:30:41 -07:00
Jan Friesse
5458d4f27a votequorum: free newly allocated node if nodeid==0
Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-06-29 11:59:57 +02:00
Jan Friesse
b5d2f4578a confdb: Resolve dispatch deadlock
The following situation could happen:
- one thread is waiting for a write operation to finish (line 853) while
  objdb is locked
- the flush (done in objdb_notify_dispatch) is called in the main thread,
  but this call never happens because the main thread is waiting for the
  objdb lock.

In this situation, a deadlock occurs.

This commit solves it by:
- setting the pipe to non-blocking mode
- using the pipe only as a trigger for coropoll
- storing dispatch messages in a list
- having the main thread process the messages from the list

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-06-22 11:20:55 +02:00
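The fix in b5d2f4578a follows a common pattern: producers append work to a list under a short-lived lock and write one byte to a non-blocking pipe; the main loop drains the pipe and then processes the whole list. The sketch below illustrates that pattern under stated assumptions. Every name in it (`dispatch_queue`, `dispatch_queue_push()`, `dispatch_queue_drain()`) is hypothetical, and the real confdb code stores its own message type and is driven by coropoll.

```c
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical message and queue types, not the confdb structures. */
struct dispatch_msg {
	struct dispatch_msg *next;
	int payload;
};

struct dispatch_queue {
	pthread_mutex_t lock;
	struct dispatch_msg *head, *tail;
	int pipe_fd[2]; /* [0] polled by the main loop, [1] written by producers */
};

int dispatch_queue_init(struct dispatch_queue *q)
{
	if (pipe(q->pipe_fd) != 0)
		return -1;
	/* Non-blocking: a producer holding other locks never sleeps on a full pipe. */
	for (int i = 0; i < 2; i++) {
		int flags = fcntl(q->pipe_fd[i], F_GETFL);

		if (flags < 0 || fcntl(q->pipe_fd[i], F_SETFL, flags | O_NONBLOCK) < 0)
			return -1;
	}
	pthread_mutex_init(&q->lock, NULL);
	q->head = q->tail = NULL;
	return 0;
}

/* Any thread: append the message to the list, then poke the pipe. */
int dispatch_queue_push(struct dispatch_queue *q, int payload)
{
	struct dispatch_msg *m = malloc(sizeof(*m));
	char token = 1;

	if (m == NULL)
		return -1;
	m->next = NULL;
	m->payload = payload;

	pthread_mutex_lock(&q->lock);
	if (q->tail)
		q->tail->next = m;
	else
		q->head = m;
	q->tail = m;
	pthread_mutex_unlock(&q->lock);

	/* EAGAIN means a wakeup is already pending; the message is safe in the list. */
	if (write(q->pipe_fd[1], &token, 1) < 0 && errno != EAGAIN)
		return -1;
	return 0;
}

/* Main thread, when poll reports pipe_fd[0] readable. Returns the
 * number of messages processed; deliver may be NULL. */
int dispatch_queue_drain(struct dispatch_queue *q, void (*deliver)(int))
{
	char buf[64];
	struct dispatch_msg *m;
	int n = 0;

	while (read(q->pipe_fd[0], buf, sizeof(buf)) > 0)
		; /* discard trigger bytes; the list is the source of truth */

	pthread_mutex_lock(&q->lock);
	m = q->head;
	q->head = q->tail = NULL;
	pthread_mutex_unlock(&q->lock);

	while (m) {
		struct dispatch_msg *next = m->next;

		if (deliver)
			deliver(m->payload);
		free(m);
		m = next;
		n++;
	}
	return n;
}
```

A leftover trigger byte only causes a spurious wakeup with an empty list, which is harmless; a message is never lost because it is on the list before its trigger byte is written.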
Jan Friesse
9afb4bdaa8 confdb: Properly check result of object_find_create
In confdb_object_iter, the result of object_find_create is now properly
checked. object_find_create can return -1 if the object doesn't exist.
Without this check, an incorrect handle (memory garbage) was passed
directly to object_find_next.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Angus Salkeld <asalkeld@redhat.com>
2011-06-10 12:33:07 +02:00
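A minimal sketch of the check added in 9afb4bdaa8, assuming a lookup that returns -1 and leaves its output handle untouched when the object is missing. `fake_find_create()` and `iter_start()` are hypothetical stand-ins; the real objdb functions have different signatures.

```c
#include <stdint.h>

/* Hypothetical stand-in for object_find_create(). */
typedef uint64_t find_handle_t;

static int fake_find_create(int object_exists, find_handle_t *handle)
{
	if (!object_exists)
		return -1; /* *handle is not written, so it stays garbage */
	*handle = 42;
	return 0;
}

/* Returns 0 with a valid handle, or -1 without touching *out. */
int iter_start(int object_exists, find_handle_t *out)
{
	find_handle_t handle;

	if (fake_find_create(object_exists, &handle) == -1)
		return -1; /* never pass an uninitialized handle onward */
	*out = handle;
	return 0;
}
```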
Jan Friesse
f95d3b3bf2 cpg: do_proc_join change list_slice to list_add
In this particular case the result is equivalent, but the change makes Coverity happy.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-06-03 11:10:08 +02:00
Angus Salkeld
956a1dcb42 cpg: fix sync master selection when one node paused.
If one node is paused it can miss a config change and
thus report a larger old_members than expected.

The solution is to use the left_nodes field.

Master selection used to be "choose node with":
1) largest previous membership
2) (then as a tie-breaker) node with smallest nodeid

New selection:
1) largest (previous #nodes - #nodes known to have left)
2) (then as a tie-breaker) node with smallest nodeid

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
2011-05-05 21:39:30 +10:00
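The new selection rule in 956a1dcb42 can be written as a single pass over the candidates. This is a sketch only: `struct node_info` and its fields are hypothetical, and it assumes a node's `left_nodes` never exceeds its `old_members`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node summary; field names are illustrative. */
struct node_info {
	uint32_t nodeid;
	uint32_t old_members; /* size of its previous membership */
	uint32_t left_nodes;  /* members it knows have left */
};

/* Returns the index of the chosen sync master, or -1 if n == 0. */
int choose_sync_master(const struct node_info *nodes, size_t n)
{
	int best = -1;
	uint32_t best_score = 0;

	for (size_t i = 0; i < n; i++) {
		/* 1) largest (previous #nodes - #nodes known to have left) */
		uint32_t score = nodes[i].old_members - nodes[i].left_nodes;

		/* 2) tie-breaker: smallest nodeid */
		if (best < 0 || score > best_score ||
		    (score == best_score && nodes[i].nodeid < nodes[best].nodeid)) {
			best = (int)i;
			best_score = score;
		}
	}
	return best;
}
```

With nodes {nodeid 3: 5 members, 0 left}, {nodeid 1: 5, 0} and a paused node {nodeid 2: 6, 2}, the old rule would pick node 2 for its larger previous membership; the new rule scores it 4 and picks node 1 through the nodeid tie-breaker.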
Tim Serong
5b92829d6c Add ipc_refcnt to message_handler_req_{exec, lib}_cfg_ringreenable()
Without refcounting the conn pointer here, corosync will segfault
if one kills a running instance of "corosync-cfgtool -r" (rhbz#695191)

Signed-off-by: Tim Serong <tserong@novell.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-04-14 20:14:12 -07:00
Angus Salkeld
265661745d Fix shutdown when a confdb client is still connected
If a client is connected to corosync and registered for
object notifications, and corosync is then asked to shut down,
the IPC server gets stuck. This is because the pipe
is closed and the refcount is increased. This leaves ipcs
with a connection that it can't destroy.

Solution:
1) if a write to the pipe fails (pipe closed) decrement the refcounter.
2) fix the object_track_stop() - it was not working as the functions
   did not match up. (this caused the late callbacks).
3) in ipcs call exit_fn() then stats_destroy_connection() so that
   the service engine can have time to call object_track_stop()
   before the object gets destroyed.

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-03-29 13:48:20 +11:00
Jan Friesse
033f7ced10 cfg_get_node_addrs: Return correct addresses
Zero-element array behavior is very different from that of a normal array
or pointer. This behavior is the root of the problem of not returning a
correctly filled array of addresses. It appeared only in rrp mode, where
more than one address is returned.

All memcpy calls are now correctly converted to copy via a pointer to char.

Signed-off-by: Jan Friesse <jfriesse@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-03-24 17:42:08 +01:00
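One way a zero-element array produces the symptom in 033f7ced10: with a GNU `char addrs[0]` member, `sizeof(addrs)` is 0, so `&r->addrs + i` advances by zero bytes and every copy lands on the first slot, which only shows up once more than one address is returned. The sketch below uses a C99 flexible array member and offsets from a `char *` instead. `struct addr_reply`, `build_reply()` and `ADDR_LEN` are hypothetical, and this illustrates the bug class rather than reproducing the actual cfg.c change.

```c
#include <stdlib.h>
#include <string.h>

#define ADDR_LEN 16 /* hypothetical fixed size of one stored address */

/* Reply with a variable number of addresses appended after the header. */
struct addr_reply {
	unsigned int num_addrs;
	char addrs[]; /* C99 flexible array member */
};

struct addr_reply *build_reply(const char src[][ADDR_LEN], unsigned int n)
{
	struct addr_reply *r = malloc(sizeof(*r) + (size_t)n * ADDR_LEN);

	if (r == NULL)
		return NULL;
	r->num_addrs = n;
	for (unsigned int i = 0; i < n; i++) {
		/* Offset from a char pointer so each copy lands ADDR_LEN
		 * bytes further on. */
		memcpy((char *)r->addrs + (size_t)i * ADDR_LEN, src[i], ADDR_LEN);
	}
	return r;
}
```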
Angus Salkeld
75087f7c1b confdb: send notifications from the main thread not IPC thread
corosync-notifyd has exposed an issue with confdb notifications.

The normal state of affairs is:
IPC thread > lock > objdb > lock

objdb notifications, while really useful, turn things around:
<middle of big call chain>
objdb > lock > confdb > ipc > lock

This reverse ordering of locks causes a horrible deadlock.

I see this patch as a workaround until corosync-2.0,
when most of the threads and locking disappear.

This patch adds a pipe to the confdb service. When we get an
objdb notification, a struct gets written to the pipe.
The poll loop then runs the dispatch in the main thread.
In the dispatch we call the real ipc_dispatch_send().

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-03-24 07:54:42 +11:00
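The pipe in 75087f7c1b moves the real `ipc_dispatch_send()` call out of the objdb callback and into the main poll loop. A minimal sketch of passing a fixed-size record through a pipe, assuming a hypothetical `struct notify_rec`; the real confdb record carries different fields.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical notification record; fields are illustrative. */
struct notify_rec {
	uint32_t change_type;
	uint64_t object_handle;
};

/* Called from the objdb notification callback (any thread). */
int notify_enqueue(int wfd, uint32_t type, uint64_t handle)
{
	struct notify_rec rec;

	memset(&rec, 0, sizeof(rec)); /* no uninitialized padding in the pipe */
	rec.change_type = type;
	rec.object_handle = handle;
	return write(wfd, &rec, sizeof(rec)) == (ssize_t)sizeof(rec) ? 0 : -1;
}

/* Called from the main poll loop when rfd is readable; the real code
 * would then call ipc_dispatch_send() outside the objdb lock. */
int notify_dequeue(int rfd, struct notify_rec *out)
{
	return read(rfd, out, sizeof(*out)) == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

Because the record is no larger than `PIPE_BUF`, POSIX guarantees each `write()` to the pipe is atomic, so records written by different threads cannot interleave.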
Angus Salkeld
0ad2494ae7 Fix some "set but not used" warnings [-Wunused-but-set-variable]
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-03-16 07:13:42 +11:00
Angus Salkeld
e1a6b2ccfb CONFDB: fix parent_get response id
Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-02-08 08:10:20 +11:00
Angus Salkeld
89e4c1c048 CONFDB: add confdb_object_name_get()
This is useful when tracking object changes.

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-02-04 09:47:15 -07:00
Angus Salkeld
6f098bba1d fix timersub warning on freebsd
Make them all protected by #ifndef timersub

Signed-off-by: Angus Salkeld <asalkeld@redhat.com>
Reviewed-by: Steven Dake <sdake@redhat.com>
2011-01-12 09:42:24 +11:00
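A sketch of the guard described in 6f098bba1d: define the BSD `timersub()` macro only when the system headers have not already done so, which avoids a redefinition warning on platforms such as FreeBSD that provide it. The fallback body is a conventional implementation written for illustration, not copied from corosync.

```c
#include <sys/time.h>

/* Fallback only; if <sys/time.h> already defines timersub(), use that. */
#ifndef timersub
#define timersub(a, b, result)                                   \
	do {                                                     \
		(result)->tv_sec = (a)->tv_sec - (b)->tv_sec;    \
		(result)->tv_usec = (a)->tv_usec - (b)->tv_usec; \
		if ((result)->tv_usec < 0) {                     \
			/* borrow one second */                  \
			--(result)->tv_sec;                      \
			(result)->tv_usec += 1000000;            \
		}                                                \
	} while (0)
#endif
```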
Angus Salkeld
83b24b660b WD/SAM integration.
- timestamps -> uint64_t and in nanosecs
- use clock_gettime
- common object naming
- common state names
- timeouts in milliseconds



git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@3054 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-09-27 21:13:15 +00:00
Angus Salkeld
07d06c0c0f Add monitoring and watchdog services.
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@3053 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-09-27 21:12:03 +00:00
Angus Salkeld
397e648080 objdb: fix some strange types (uint8_t* -> void*).
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@3045 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-09-25 06:48:24 +00:00
Angus Salkeld
2ab786f3d1 CPG: remove irritating log "downlist received left_list:"
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@3043 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-09-25 06:46:34 +00:00
Steven Dake
4ac55e52e4 Patch from Kacper Kowalik to support honoring user defined LDFLAGS.
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@3042 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-09-14 18:10:12 +00:00
Steven Dake
e94b3dd811 Patch from Honza:
Send CPG_REASON_PROCDOWN on process left

Our manual pages are clear:

CPG_REASON_PROCDOWN - the process left a group without calling
cpg_leave().

Currently, we are sending CPG_REASON_LEAVE in such a situation.



git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2946 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-06-15 19:35:32 +00:00
Steven Dake
3b457d30c7 Fix problem where callbacks are not delivered to evs service.
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2916 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-06-01 15:36:08 +00:00
Steven Dake
0e9f0bfeb4 Make cpg_membership_get() functional.
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2855 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-05-19 05:03:52 +00:00
Angus Salkeld
18a1ea648b Fix compile error in services/cfg.c
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2843 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-05-16 22:23:25 +00:00
Steven Dake
ed7b299290 Merge patch from Sato Yuki which fixes corosync-cfgtool -r
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2831 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-05-14 08:01:03 +00:00
Angus Salkeld
562616c79d cpg: fix uninitialized variable
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2814 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-05-12 09:25:58 +00:00
Angus Salkeld
8f430ecc8a cpg: fix sync'ing the downlist.
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2801 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-05-04 04:42:40 +00:00
Angus Salkeld
64fb3000f3 select a new sync member if the node with the lowest nodeid has left.
Problem:

Under certain circumstances cpg does not send group leave messages.

With a big token timeout (tested with token == 5min):
1 start all nodes
2 start ./test/testcpg on all nodes
3 go to the node with the lowest nodeid
4 ifconfig <int> down && killall -9 corosync && /etc/init.d/corosync restart && ./testcpg
5 the other nodes will not get the cpg leave event
6 testcpg reports an extra cpg group (basically one was not removed)

Solution:
If a member gets removed using the new trans_list and
that member is the node used for syncing (lowest nodeid)
then the next lowest node needs to be chosen for syncing.



git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2785 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-04-22 22:20:09 +00:00
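The solution in 64fb3000f3 can be sketched as: skip every member on the list of removed nodes and take the lowest remaining nodeid. `pick_sync_node()` is a hypothetical helper, and it assumes nodeid 0 is never a valid node so that 0 can mean "no candidate".

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the lowest nodeid in members[] that does not appear in
 * left[], or 0 if every member has left. */
uint32_t pick_sync_node(const uint32_t *members, size_t n_members,
			const uint32_t *left, size_t n_left)
{
	uint32_t best = 0;

	for (size_t i = 0; i < n_members; i++) {
		int gone = 0;

		for (size_t j = 0; j < n_left; j++) {
			if (members[i] == left[j]) {
				gone = 1;
				break;
			}
		}
		if (!gone && (best == 0 || members[i] < best))
			best = members[i];
	}
	return best;
}
```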
Jan Friesse
e8b143595c CPG model_initialize and ringid + members callback
This patch adds a new function to initialize cpg, cpg_model_initialize.
A model is a set of callbacks. With this function, future additions of
models should be possible without changing the ABI.

The patch also contains a callback in CPG_MODEL_V1 for notification about
Totem membership changes.


git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2770 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-04-20 12:40:48 +00:00
Angus Salkeld
9a862803aa Fix code coverage with lcrso's
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2729 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-03-24 22:14:25 +00:00
Christine Caulfield
1baa7b2ab3 Add a reload callback to libconfdb.
This also increments the libconfdb version to 4.1.0



git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2683 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-03-16 09:51:30 +00:00
Angus Salkeld
1e17751d0d Remove warnings.
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2682 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-03-11 00:27:04 +00:00
Jan Friesse
009dfc090e Support for lib_cpg_finalize
Add support for the MESSAGE_REQ_CPG_FINALIZE message. This allows us to
remove cpg_pd from the list of active connections, and fixes a problem where
cpg_finalize + cpg_initialize + cpg_join can result in a CPG_ERR_EXIST
error.


git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2676 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-03-04 12:17:47 +00:00
Jan Friesse
7e8da9a6fc Cpg join with undelivered leave message
This patch handles the situation where, on one node, a single process:
- joins cpg
- does some actions
- leaves cpg
- joins cpg again

This sequence can (due to a race) end with a broken process_info list.

To solve this problem, one more check is done in
message_handler_req_lib_cpg_join: if a process_info with the same pid and
group as the new join request exists, CPG_ERR_TRY_AGAIN is returned.


git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2675 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-03-04 12:12:24 +00:00
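A sketch of the extra join check described in 7e8da9a6fc, assuming a hypothetical singly linked `struct process_info` and a local `JOIN_TRY_AGAIN` value standing in for `CPG_ERR_TRY_AGAIN`:

```c
#include <string.h>
#include <sys/types.h>

/* Hypothetical, simplified process_info entry. */
struct process_info {
	pid_t pid;
	char group[128];
	struct process_info *next;
};

enum join_result { JOIN_OK, JOIN_TRY_AGAIN };

/* Refuse a join while an entry for the same pid and group is still
 * present, i.e. its leave has not been delivered yet. */
enum join_result check_join(const struct process_info *list,
			    pid_t pid, const char *group)
{
	for (const struct process_info *pi = list; pi; pi = pi->next) {
		if (pi->pid == pid && strcmp(pi->group, group) == 0)
			return JOIN_TRY_AGAIN;
	}
	return JOIN_OK;
}
```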
Angus Salkeld
ec09a97867 Fix some "make lint" problems
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2674 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-03-03 21:52:08 +00:00
Christine Caulfield
a22f051d04 Remove a double list_del() when a tracking CFG client shuts down without
calling cfg_track_stop. This caused corosync to crash.

The extra list_empty() check is redundant too, because it is also done in remove_ci_from_shutdown().



git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2655 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-02-12 07:46:02 +00:00
Angus Salkeld
c6beee076a pass transitional members into the sync_init() callbacks.
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2653 fd59a12c-fef9-0310-b244-a6a79926bd2f
2010-02-04 00:18:51 +00:00
Angus Salkeld
5f17683107 COVERITY 18: prevent deref after free.
Event deref_after_free: Dereferencing freed pointer "pi".



git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2543 fd59a12c-fef9-0310-b244-a6a79926bd2f
2009-11-22 06:22:49 +00:00
Angus Salkeld
73b7aa19bb Add value types to objdb keys.
This allows you to create a key with a known type,
and then get the type along with the key value.



git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2511 fd59a12c-fef9-0310-b244-a6a79926bd2f
2009-10-10 03:20:38 +00:00
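A sketch of what 73b7aa19bb describes: the type is recorded when the key is created and returned together with the value. `struct typed_key`, `key_create()`, `key_get()` and the `VT_*` values are hypothetical; objdb defines its own value-type enum and API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical value types, not objdb's enum. */
enum value_type { VT_INT32, VT_UINT64, VT_STRING };

struct typed_key {
	char name[32];
	enum value_type type;
	unsigned char value[64];
	size_t len;
};

/* Store the type alongside the value bytes when the key is created. */
int key_create(struct typed_key *k, const char *name, enum value_type type,
	       const void *value, size_t len)
{
	if (strlen(name) >= sizeof(k->name) || len > sizeof(k->value))
		return -1;
	strcpy(k->name, name);
	k->type = type;
	memcpy(k->value, value, len);
	k->len = len;
	return 0;
}

/* Return the value bytes and, in the same call, their length and type. */
const void *key_get(const struct typed_key *k, size_t *len,
		    enum value_type *type)
{
	*len = k->len;
	*type = k->type;
	return k->value;
}
```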
Christine Caulfield
2433ee3b2c This patch fixes a couple of small problems with votequorum:
- if a single node is booted with votequorum loaded, then
   corosync-quorumtool shows zero nodes and no votes.
- votequorum doesn't always tell the main quorum module when a new node
has joined the cluster (principally itself; this bug is actually tied
into the above)

I've also added quorum to the default list of services. As quorum has
been decoupled from sync, it will not interfere with normal operations as
it used to, and it makes more sense to have it there than not.



git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2510 fd59a12c-fef9-0310-b244-a6a79926bd2f
2009-10-06 12:57:35 +00:00
Steven Dake
9b56e33ee8 Remove pointless warning.
git-svn-id: http://svn.fedorahosted.org/svn/corosync/trunk@2478 fd59a12c-fef9-0310-b244-a6a79926bd2f
2009-09-25 06:01:35 +00:00