| .\"/*
 | |
| .\" * Copyright (c) 2009-2010 Red Hat, Inc.
 | |
| .\" *
 | |
| .\" * All rights reserved.
 | |
| .\" *
 | |
| .\" * Author: Jan Friesse (jfriesse@redhat.com)
 | |
| .\" * Author: Steven Dake (sdake@redhat.com)
 | |
| .\" *
 | |
| .\" * This software licensed under BSD license, the text of which follows:
 | |
| .\" *
 | |
| .\" * Redistribution and use in source and binary forms, with or without
 | |
| .\" * modification, are permitted provided that the following conditions are met:
 | |
| .\" *
 | |
| .\" * - Redistributions of source code must retain the above copyright notice,
 | |
| .\" *   this list of conditions and the following disclaimer.
 | |
| .\" * - Redistributions in binary form must reproduce the above copyright notice,
 | |
| .\" *   this list of conditions and the following disclaimer in the documentation
 | |
| .\" *   and/or other materials provided with the distribution.
 | |
| .\" * - Neither the name of the Red Hat, Inc. nor the names of its
 | |
| .\" *   contributors may be used to endorse or promote products derived from this
 | |
| .\" *   software without specific prior written permission.
 | |
| .\" *
 | |
| .\" * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 | |
| .\" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 | |
| .\" * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 | |
| .\" * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
 | |
| .\" * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 | |
| .\" * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 | |
| .\" * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 | |
| .\" * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 | |
| .\" * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 | |
| .\" * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
 | |
| .\" * THE POSSIBILITY OF SUCH DAMAGE.
 | |
| .\" */
 | |
| .TH "SAM_OVERVIEW" 8 "21/05/2010" "corosync Man Page" "Corosync Cluster Engine Programmer's Manual"
 | |
| 
 | |
| .SH NAME
 | |
| .P
 | |
| sam_overview \- Overview of the Simple Availability Manager
 | |
| 
 | |
| .SH OVERVIEW
 | |
| .P
 | |
| The SAM library provides a tool to check the health of an application.
 | |
| The main purpose of SAM is to restart a local process when it fails to respond
 | |
| to a healthcheck request in a configured time interval.
 | |
| 
 | |
| .P
 | |
| During \fBsam_initialize(3)\fR, a duplicate copy of the process is created using
 | |
| the \fBfork(2)\fR system call.  This duplicate process copy contains the logic
 | |
| for executing the SAM server.  The SAM server is responsible for requesting
 | |
| healthchecks from the active process, and controlling the lifecycle of the
 | |
| active process when it fails.  If the active process fails to respond to the
 | |
| healthcheck request sent by the SAM server, it will be sent a user configurable
 | |
| signal (default SIGTERM) to request shutdown of the application.  After a configured time interval, the
 | |
| process will be forcibly killed by being sent a SIGKILL signal.  Once the
 | |
| active process terminates, the SAM server will create a new active process.
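.P
The warn-then-kill sequence described above can be sketched with plain POSIX
calls.  This is only an illustration of the idea, not the SAM server source:
the helper names \fIspawn_child\fR and \fIstop_child\fR, the 10 ms polling
step and the grace period argument are assumptions made for this example.
.nf
```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that only waits for signals.
 * If ignore_warning is non-zero the child ignores SIGTERM, which
 * models an application that does not react to the warning.
 */
static pid_t spawn_child(int ignore_warning)
{
	struct sigaction ign, old;
	pid_t pid;

	/* Set the disposition before fork() so the child inherits it. */
	ign.sa_handler = SIG_IGN;
	ign.sa_flags = 0;
	sigemptyset(&ign.sa_mask);
	if (ignore_warning && sigaction(SIGTERM, &ign, &old) == -1)
		return -1;

	pid = fork();
	if (pid == 0) {
		for (;;)
			pause();
	}

	/* Parent: restore its own SIGTERM disposition. */
	if (ignore_warning)
		sigaction(SIGTERM, &old, NULL);
	return pid;
}

/*
 * Hypothetical helper: send warn_signal to the child, wait up to
 * grace_ms for it to terminate, then fall back to SIGKILL.
 * Returns the number of the signal that terminated the child,
 * 0 if it exited normally, or -1 on error.
 */
static int stop_child(pid_t pid, int warn_signal, int grace_ms)
{
	const struct timespec tick = { 0, 10 * 1000 * 1000 }; /* 10 ms */
	int waited_ms = 0;
	int status;

	if (kill(pid, warn_signal) == -1)
		return -1;

	for (;;) {
		pid_t r = waitpid(pid, &status, WNOHANG);

		if (r == pid)
			break;		/* child has terminated */
		if (r == -1)
			return -1;
		if (waited_ms >= grace_ms) {
			/* Grace period is over: force termination. */
			kill(pid, SIGKILL);
			if (waitpid(pid, &status, 0) == -1)
				return -1;
			break;
		}
		nanosleep(&tick, NULL);
		waited_ms += 10;
	}
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```
.fi
.P
A child that honours the warning is reported as terminated by SIGTERM, while
one that ignores it is reported as terminated by SIGKILL once the grace period
has passed.  A supervisor would then fork a new active process.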
 | |
| 
 | |
| .P
 | |
| The Simple Availability Manager is meant to be used in conjunction with the
 | |
| cpg service.  When the two are used together, a cpg process that fails
 | |
| healthchecking during operation can be restarted.
 | |
| 
 | |
| .P
 | |
| The main features of SAM include:
 | |
| 
 | |
| .RS
 | |
| .IP \(bu 3
 | |
| A configurable recovery policy.
 | |
| .IP \(bu 3
 | |
| A configurable time interval for health check operations.
 | |
| .IP \(bu 3
 | |
| A notification via signal before recovery action is taken.
 | |
| .IP \(bu 3
 | |
| A mechanism to indicate to the application the number of times an active
 | |
| process has been created by the SAM server.
 | |
| .IP \(bu 3
 | |
| Both application driven health checking and event driven health checking.
 | |
| .RE
 | |
| 
 | |
| .SH Initializing SAM
 | |
| .P
 | |
| The SAM library is initialized by \fBsam_initialize(3)\fR.
 | |
| \fBsam_initialize(3)\fR may only be called once per process.  Calling it more
 | |
| than once has undefined results and is not recommended or tested.
 | |
| 
 | |
| .SH Setting warning callback
 | |
| .P
 | |
| A user-configurable signal (default \fISIGTERM\fR) is sent to the application when a recovery action is
 | |
| planned.  The application can use the \fBsignal(3)\fR system call to install a
 | |
| handler for this signal.
 | |
| 
 | |
| .P
 | |
| There are no special constraints on what SAM APIs may be called in a warning
 | |
| callback.  After \fItime_interval\fR expires, a SIGKILL signal is sent to the
 | |
| active process to force its termination.
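.P
A handler for the warning signal might look like the following minimal sketch.
It uses \fBsigaction(2)\fR rather than the \fBsignal(3)\fR call mentioned
above, because its behaviour is more precisely specified.  The flag and
function names are hypothetical, and the warning signal is assumed to be
passed in by the caller.
.nf
```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set from the handler; sig_atomic_t is safe to write in a signal handler. */
static volatile sig_atomic_t recovery_pending = 0;

/* Handler: only records that a recovery action has been announced. */
static void on_warn_signal(int signo)
{
	(void)signo;
	recovery_pending = 1;
}

/* Hypothetical helper: install the handler; returns 0 on success, -1 on error. */
static int install_warn_handler(int warn_signal)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_warn_signal;
	sigemptyset(&sa.sa_mask);
	return sigaction(warn_signal, &sa, NULL);
}
```
.fi
.P
The main loop of the application can then test \fIrecovery_pending\fR and save
its state before the process is forcibly killed.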
 | |
| 
 | |
| .SH Registering the active process
 | |
| .P
 | |
| The active process is registered with SAM by calling \fBsam_register(3)\fR.
 | |
| This function should only be called once per process.  After a recovery
 | |
| action is taken, the new active process will begin execution at the next line 
 | |
| of code in a user process after \fBsam_register(3)\fR.
 | |
| 
 | |
| .SH Enabling event driven healthchecking
 | |
| .P
 | |
| Two types of healthchecking are available to the user.  The first model is one
 | |
| where the user application healthchecks during its normal operation.  It is
 | |
| never requested to healthcheck, and if the active process doesn't send one within
 | |
| the time interval, the process will be restarted.
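.P
The application driven model amounts to a heartbeat with a deadline.  The
sketch below illustrates that idea with a pipe and \fBpoll(2)\fR.  It is an
assumption-laden illustration and not a description of how libsam is
implemented; \fIsend_heartbeat\fR and \fIheartbeat_received\fR are
hypothetical names.
.nf
```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

/* Active process side: write one byte to announce that it is healthy. */
static int send_heartbeat(int fd)
{
	const char byte = 'h';

	return write(fd, &byte, 1) == 1 ? 0 : -1;
}

/*
 * Supervisor side: wait up to timeout_ms for one heartbeat byte.
 * Returns 1 if a heartbeat arrived, 0 if the deadline passed,
 * and -1 on error or if the writer closed the pipe.
 */
static int heartbeat_received(int fd, int timeout_ms)
{
	struct pollfd pfd;
	char byte;
	int r;

	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;

	r = poll(&pfd, 1, timeout_ms);
	if (r <= 0)
		return r;	/* 0: timed out, -1: poll failed */
	return read(fd, &byte, 1) == 1 ? 1 : -1;
}
```
.fi
.P
A return value of 0 from \fIheartbeat_received\fR corresponds to the case in
which the active process did not healthcheck within the time interval and
would be restarted.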
 | |
| 
 | |
| .P
 | |
| A more useful mechanism for healthchecking is event driven healthchecking.
 | |
| Because this model is directed by the SAM server, it isn't necessary to guess
 | |
| or add timers to the active process to signal that a healthcheck operation is
 | |
| successful.  To use event driven healthchecking,
 | |
| the \fBsam_hc_callback_register(3)\fR function should be called.
 | |
| 
 | |
| .SH Quorum integration
 | |
| .P
 | |
| SAM has special policies (\fISAM_RECOVERY_POLICY_QUORUM_QUIT\fR and \fISAM_RECOVERY_POLICY_QUORUM_RESTART\fR)
 | |
| for integration with the quorum service.  These policies change SAM behaviour in two respects:
 | |
| .RS
 | |
| .IP \(bu 3
 | |
| A call to \fBsam_start(3)\fR blocks until corosync becomes quorate.
 | |
| .IP \(bu 3
 | |
| The user-selected recovery action is taken immediately after loss of quorum.
 | |
| .RE
 | |
| 
 | |
| .SH Storing user data
 | |
| .P
 | |
| Sometimes data needs to be stored so that it survives between instances of the process.
 | |
| Files or databases can be used for this, but a much simpler in-memory solution is
 | |
| provided by the \fBsam_data_store(3)\fR, \fBsam_data_restore(3)\fR and \fBsam_data_getsize(3)\fR
 | |
| functions.
 | |
| 
 | |
| .SH Confdb integration
 | |
| .P
 | |
| SAM has a policy flag for confdb system integration (\fISAM_RECOVERY_POLICY_CONFDB\fR).
 | |
| If a process is registered with this flag, a new confdb object PROCESS_NAME:PID is created with the following
 | |
| keys:
 | |
| .RS
 | |
| .IP \(bu 3
 | |
| \fIrecovery\fR - the recovery action, either quit or restart depending on the policy
 | |
| .IP \(bu 3
 | |
| \fIpoll_period\fR - period of health checking in milliseconds
 | |
| .IP \(bu 3
 | |
| \fIlast_updated\fR - timestamp (in nanoseconds) of the last health check
 | |
| .IP \(bu 3
 | |
| \fIstate\fR - state of the process (one of registered, started, failed, waiting for quorum)
 | |
| .RE
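.P
Because \fIpoll_period\fR is in milliseconds and \fIlast_updated\fR is in
nanoseconds, a consumer of these keys has to convert units before comparing
them.  The following sketch shows the arithmetic only.  The function name
\fIhc_is_stale\fR is hypothetical, and it is assumed that the caller obtains
the current time from the same clock that produced \fIlast_updated\fR; this
page does not state which clock that is.
.nf
```c
#include <stdint.h>

#define NS_PER_MS UINT64_C(1000000)

/*
 * Returns 1 if more than poll_period_ms milliseconds have elapsed
 * between last_updated_ns and now_ns, otherwise 0.
 */
static int hc_is_stale(uint64_t last_updated_ns, uint64_t now_ns,
	uint64_t poll_period_ms)
{
	if (now_ns < last_updated_ns)
		return 0;	/* timestamps out of order: do not report stale */
	return (now_ns - last_updated_ns) > poll_period_ms * NS_PER_MS;
}
```
.fi
.P
For example, with a poll period of 1000 ms, a last health check at 0 ns is
stale at 1500000000 ns but not at 500000000 ns.  A real monitor would probably
allow some slack on top of the poll period.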
 | |
| 
 | |
| .P
 | |
| The object is automatically deleted if the process exits while health checking is stopped.
 | |
| 
 | |
| .P
 | |
| Confdb integration with the corosync watchdog can be used in an implicit or an explicit way.
 | |
| 
 | |
| .P
 | |
| The implicit way is achieved by setting the recovery policy to QUIT and letting the process exit while health checking is started.
 | |
| If this happens, the object is not deleted and the corosync watchdog will take the required action.
 | |
| 
 | |
| .P
 | |
| The explicit way is useful in situations where the developer can handle some non-fatal failures of the application.
 | |
| This mode is achieved by setting the policy to RESTART and using SAM the same way as without confdb integration.
 | |
| If a real failure needs to be reported (for example, too many restarts in total or per second), it is possible to call \fBsam_mark_failed(3)\fR
 | |
| and let the corosync watchdog take the required action.
 | |
| 
 | |
| .SH BUGS
 | |
| .SH "SEE ALSO"
 | |
| .BR sam_initialize (3),
 | |
| .BR sam_data_getsize (3),
 | |
| .BR sam_data_restore (3),
 | |
| .BR sam_data_store (3),
 | |
| .BR sam_finalize (3),
 | |
| .BR sam_mark_failed (3),
 | |
| .BR sam_start (3),
 | |
| .BR sam_stop (3),
 | |
| .BR sam_register (3),
 | |
| .BR sam_warn_signal_set (3),
 | |
| .BR sam_hc_send (3),
 | |
| .BR sam_hc_callback_register (3)
 |