[[chapter_ha_manager]]
ifdef::manvolnum[]
ha-manager(1)
=============
:pve-toplevel:

NAME
----

ha-manager - Proxmox VE HA Manager

SYNOPSIS
--------

include::ha-manager.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
High Availability
=================
:pve-toplevel:
endif::manvolnum[]

Our modern society depends heavily on information provided by
computers over the network. Mobile devices amplified that dependency,
because people can access the network any time from anywhere. If you
provide such services, it is very important that they are available
most of the time.

We can mathematically define the availability as the ratio of (A) the
total time a service is capable of being used during a given interval
to (B) the length of the interval. It is normally expressed as a
percentage of uptime in a given year.

.Availability - Downtime per Year
[width="60%",cols="<d,d",options="header"]
|===========================================================
|Availability % |Downtime per year
|99       |3.65 days
|99.9     |8.76 hours
|99.99    |52.56 minutes
|99.999   |5.26 minutes
|99.9999  |31.5 seconds
|99.99999 |3.15 seconds
|===========================================================

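
To relate the percentages to the listed downtimes, here is the arithmetic
for the 99.9% row (using a 365-day year):

----
(1 - 0.999) * 365 days * 24 hours/day = 8.76 hours downtime per year
----
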
There are several ways to increase availability. The most elegant
solution is to rewrite your software, so that you can run it on
several hosts at the same time. The software itself needs to have a way
to detect errors and do failover. This is relatively easy if you just
want to serve read-only web pages. But in general this is complex, and
sometimes impossible, because you cannot modify the software
yourself. The following solutions work without modifying the
software:

* Use reliable ``server'' components
+
NOTE: Computer components with the same functionality can have varying
reliability numbers, depending on the component quality. Most vendors
sell components with higher reliability as ``server'' components -
usually at a higher price.

* Eliminate single points of failure (redundant components)
** use an uninterruptible power supply (UPS)
** use redundant power supplies on the main boards
** use ECC-RAM
** use redundant network hardware
** use RAID for local storage
** use distributed, redundant storage for VM data

* Reduce downtime
** rapidly accessible administrators (24/7)
** availability of spare parts (other nodes in a {pve} cluster)
** automatic error detection (provided by `ha-manager`)
** automatic failover (provided by `ha-manager`)

Virtualization environments like {pve} make it much easier to reach
high availability because they remove the ``hardware'' dependency. They
also support setting up and using redundant storage and network
devices, so if one host fails, you can simply start those services on
another host within your cluster.

Even better, {pve} provides a software stack called `ha-manager`,
which can do that automatically for you. It is able to automatically
detect errors and do automatic failover.

{pve} `ha-manager` works like an ``automated'' administrator. First, you
configure what resources (VMs, containers, ...) it should
manage. `ha-manager` then observes correct functionality, and handles
service failover to another node in case of errors. `ha-manager` can
also handle normal user requests which may start, stop, relocate and
migrate a service.

But high availability comes at a price. High quality components are
more expensive, and making them redundant at least duplicates the
costs. Additional spare parts increase costs further. So you should
carefully calculate the benefits, and compare them with those additional
costs.

TIP: Increasing availability from 99% to 99.9% is relatively
simple. But increasing availability from 99.9999% to 99.99999% is very
hard and costly. `ha-manager` has typical error detection and failover
times of about 2 minutes, so you can get no more than 99.999%
availability.


Requirements
------------

You must meet the following requirements before you start with HA:

* at least three cluster nodes (to get reliable quorum)

* shared storage for VMs and containers

* hardware redundancy (everywhere)

* use reliable ``server'' components

* hardware watchdog - if not available we fall back to the
  Linux kernel software watchdog (`softdog`)

* optional hardware fencing devices


[[ha_manager_resources]]
Resources
---------

We call the primary management unit handled by `ha-manager` a
resource. A resource (also called ``service'') is uniquely
identified by a service ID (SID), which consists of the resource type
and a type specific ID, e.g.: `vm:100`. That example would be a
resource of type `vm` (virtual machine) with the ID 100.

For now we have two important resource types - virtual machines and
containers. One basic idea here is that we can bundle related software
into such a VM or container, so there is no need to compose one big
service from other services, as was done with `rgmanager`. In
general, an HA managed resource should not depend on other resources.


Management Tasks
----------------

This section provides a short overview of common management tasks. The
first step is to enable HA for a resource. This is done by adding the
resource to the HA resource configuration. You can do this using the
GUI, or simply use the command line tool, for example:

----
# ha-manager add vm:100
----

The HA stack now tries to start the resource and keep it
running. Please note that you can configure the ``requested''
resource state. For example, you may want the HA stack to stop the
resource:

----
# ha-manager set vm:100 --state stopped
----

and start it again later:

----
# ha-manager set vm:100 --state started
----

You can also use the normal VM and container management commands. They
automatically forward the commands to the HA stack, so

----
# qm start 100
----

simply sets the requested state to `started`. The same applies to `qm
stop`, which sets the requested state to `stopped`.

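
For example, the equivalent stop request through the normal VM tooling
(using the same VM ID as above) would be:

----
# qm stop 100
----
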
NOTE: The HA stack works fully asynchronously and needs to communicate
with other cluster members. So it takes some seconds until you see
the result of such actions.

To view the current HA resource configuration use:

----
# ha-manager config
vm:100
	state stopped
----

And you can view the actual HA manager and resource state with:

----
# ha-manager status
quorum OK
master node1 (active, Wed Nov 23 11:07:23 2016)
lrm elsa (active, Wed Nov 23 11:07:19 2016)
service vm:100 (node1, started)
----

You can also initiate resource migration to other nodes:

----
# ha-manager migrate vm:100 node2
----

This uses online migration and tries to keep the VM running. Online
migration needs to transfer all used memory over the network, so it is
sometimes faster to stop the VM, then restart it on the new node. This can be
done using the `relocate` command:

----
# ha-manager relocate vm:100 node2
----

Finally, you can remove the resource from the HA configuration using
the following command:

----
# ha-manager remove vm:100
----

NOTE: This does not start or stop the resource.

But all HA related tasks can also be done in the GUI, so there is no need to
use the command line at all.


How It Works
------------

This section provides a detailed description of the {PVE} HA manager
internals. It describes all involved daemons and how they work
together. To provide HA, two daemons run on each node:

`pve-ha-lrm`::

The local resource manager (LRM), which controls the services running on
the local node. It reads the requested states for its services from
the current manager status file and executes the respective commands.

`pve-ha-crm`::

The cluster resource manager (CRM), which makes the cluster wide
decisions. It sends commands to the LRM, processes the results,
and moves resources to other nodes if something fails. The CRM also
handles node fencing.

.Locks in the LRM & CRM
[NOTE]
Locks are provided by our distributed configuration file system (pmxcfs).
They are used to guarantee that each LRM is active exactly once and working. As an
LRM only executes actions when it holds its lock, we can mark a failed node
as fenced if we can acquire its lock. This lets us then recover any failed
HA services securely, without any interference from the now unknown failed node.
This is all supervised by the CRM, which currently holds the manager master
lock.


Service States
~~~~~~~~~~~~~~

The CRM uses a service state enumeration to record the current service
state. We display this state in the GUI and you can query it using
the `ha-manager` command line tool:

----
# ha-manager status
quorum OK
master elsa (active, Mon Nov 21 07:23:29 2016)
lrm elsa (active, Mon Nov 21 07:23:22 2016)
service ct:100 (elsa, stopped)
service ct:102 (elsa, started)
service vm:501 (elsa, started)
----

Here is the list of possible states:

stopped::

Service is stopped (confirmed by LRM). If the LRM detects a stopped
service is still running, it will stop it again.

request_stop::

Service should be stopped. The CRM waits for confirmation from the
LRM.

stopping::

Pending stop request, but the CRM did not get the request so far.

started::

Service is active, and the LRM should start it ASAP if not already running.
If the service fails and is detected to be not running, the LRM
restarts it
(see xref:ha_manager_start_failure_policy[Start Failure Policy]).

starting::

Pending start request, but the CRM has not got any confirmation from the
LRM that the service is running.

fence::

Wait for node fencing (the service's node is not inside the quorate cluster
partition). As soon as the node gets fenced successfully, the service will
be recovered to another node, if possible
(see xref:ha_manager_fencing[Fencing]).

freeze::

Do not touch the service state. We use this state while we reboot a
node, or when we restart the LRM daemon
(see xref:ha_manager_package_updates[Package Updates]).

migrate::

Migrate service (live) to another node.

error::

Service is disabled because of LRM errors. Needs manual intervention
(see xref:ha_manager_error_recovery[Error Recovery]).

queued::

Service is newly added, and the CRM has not seen it so far.

disabled::

Service is stopped and marked as `disabled`.


Local Resource Manager
~~~~~~~~~~~~~~~~~~~~~~

The local resource manager (`pve-ha-lrm`) is started as a daemon on
boot and waits until the HA cluster is quorate and thus cluster wide
locks are working.

It can be in three states:

wait for agent lock::

The LRM waits for our exclusive lock. This is also used as the idle state if no
service is configured.

active::

The LRM holds its exclusive lock and has services configured.

lost agent lock::

The LRM lost its lock. This means a failure happened and quorum was lost.

After the LRM gets into the active state, it reads the manager status
file in `/etc/pve/ha/manager_status` and determines the commands it
has to execute for the services it owns.
For each command a worker gets started. These workers run in
parallel and are limited to at most 4 by default. This default setting
may be changed through the datacenter configuration key `max_worker`.
When finished, the worker process gets collected and its result saved for
the CRM.

.Maximum Concurrent Worker Adjustment Tips
[NOTE]
The default value of at most 4 concurrent workers may be unsuited for
a specific setup. For example, 4 live migrations may happen at the same
time, which can lead to network congestion with slower networks and/or
big (memory wise) services. Ensure that no congestion happens even in the
worst case, and lower the `max_worker` value if needed. On the contrary, if you
have a particularly powerful, high end setup you may also want to increase it.

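
As an illustrative sketch only - assuming `max_worker` is set like other
datacenter-wide options in `/etc/pve/datacenter.cfg` - lowering the limit
for a slow network could look like this:

----
# /etc/pve/datacenter.cfg
max_worker: 2
----
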
Each command requested by the CRM is uniquely identifiable by a UID. When
the worker finishes, its result will be processed and written to the LRM
status file `/etc/pve/nodes/<nodename>/lrm_status`. There the CRM may collect
it and let its state machine act on it, according to the command's output.

The actions on each service between CRM and LRM are normally always synced.
This means that the CRM requests a state uniquely marked by a UID, the LRM
then executes this action *one time* and writes back the result, which is also
identifiable by the same UID. This is needed so that the LRM does not
execute an outdated command.
The only exceptions are the `stop` and the `error` commands;
these two do not depend on the result produced and are executed
always in the case of the stopped state, and once in the case of
the error state.

.Read the Logs
[NOTE]
The HA Stack logs every action it makes. This helps to understand what
happened in the cluster, and also why. Here it is important to see
what both daemons, the LRM and the CRM, did. You may use
`journalctl -u pve-ha-lrm` on the node(s) where the service is, and
the same command for `pve-ha-crm` on the node which is the current master.

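
For example, to inspect both daemons' logs (on the relevant node and on the
current master node, respectively):

----
# journalctl -u pve-ha-lrm
# journalctl -u pve-ha-crm
----
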
Cluster Resource Manager
~~~~~~~~~~~~~~~~~~~~~~~~

The cluster resource manager (`pve-ha-crm`) starts on each node and
waits there for the manager lock, which can only be held by one node
at a time. The node which successfully acquires the manager lock gets
promoted to the CRM master.

It can be in three states:

wait for agent lock::

The CRM waits for our exclusive lock. This is also used as the idle state if no
service is configured.

active::

The CRM holds its exclusive lock and has services configured.

lost agent lock::

The CRM lost its lock. This means a failure happened and quorum was lost.

Its main task is to manage the services which are configured to be highly
available and to try to always enforce the requested state. For example, a
service with the requested state 'started' will be started if it's not
already running. If it crashes, it will be automatically started again.
Thus the CRM dictates the actions which the LRM needs to execute.

When a node leaves the cluster quorum, its state changes to unknown.
If the current CRM can then secure the failed node's lock, the services
will be 'stolen' and restarted on another node.

When a cluster member determines that it is no longer in the cluster
quorum, the LRM waits for a new quorum to form. As long as there is no
quorum, the node cannot reset the watchdog. This will trigger a reboot
after the watchdog times out; this happens after 60 seconds.


Configuration
-------------

The HA stack is well integrated into the {pve} API. So, for example,
HA can be configured via the `ha-manager` command line interface, or
the {pve} web interface - both interfaces provide an easy way to
manage HA. Automation tools can use the API directly.

All HA configuration files are within `/etc/pve/ha/`, so they get
automatically distributed to the cluster nodes, and all nodes share
the same HA configuration.


[[ha_manager_resource_config]]
Resources
~~~~~~~~~

[thumbnail="gui-ha-manager-status.png"]


The resource configuration file `/etc/pve/ha/resources.cfg` stores
the list of resources managed by `ha-manager`. A resource configuration
inside that list looks like this:

----
<type>: <name>
	<property> <value>
	...
----

It starts with a resource type, followed by a resource specific name,
separated by a colon. Together this forms the HA resource ID, which is
used by all `ha-manager` commands to uniquely identify a resource
(example: `vm:100` or `ct:101`). The next lines contain additional
properties:

include::ha-resources-opts.adoc[]

Here is a real world example with one VM and one container. As you see,
the syntax of those files is really simple, so it is even possible to
read or edit those files using your favorite editor:

.Configuration Example (`/etc/pve/ha/resources.cfg`)
----
vm: 501
    state started
    max_relocate 2

ct: 102
    # Note: use default settings for everything
----

[thumbnail="gui-ha-manager-add-resource.png"]

The above config was generated using the `ha-manager` command line tool:

----
# ha-manager add vm:501 --state started --max_relocate 2
# ha-manager add ct:102
----


[[ha_manager_groups]]
Groups
~~~~~~

[thumbnail="gui-ha-manager-groups-view.png"]

The HA group configuration file `/etc/pve/ha/groups.cfg` is used to
define groups of cluster nodes. A resource can be restricted to run
only on the members of such a group. A group configuration looks like
this:

----
group: <group>
       nodes <node_list>
       <property> <value>
       ...
----

include::ha-groups-opts.adoc[]

[thumbnail="gui-ha-manager-add-group.png"]

A common requirement is that a resource should run on a specific
node. Usually the resource is able to run on other nodes, so you can define
an unrestricted group with a single member:

----
# ha-manager groupadd prefer_node1 --nodes node1
----

For bigger clusters, it makes sense to define a more detailed failover
behavior. For example, you may want to run a set of services on
`node1` if possible. If `node1` is not available, you want to run them
equally split on `node2` and `node3`. If those nodes also fail, the
services should run on `node4`. To achieve this you could set the node
list to:

----
# ha-manager groupadd mygroup1 -nodes "node1:2,node2:1,node3:1,node4"
----

Another use case is if a resource uses other resources only available
on specific nodes, let's say `node1` and `node2`. We need to make sure
that the HA manager does not use other nodes, so we need to create a
restricted group with said nodes:

----
# ha-manager groupadd mygroup2 -nodes "node1,node2" -restricted
----

The above commands created the following group configuration file:

.Configuration Example (`/etc/pve/ha/groups.cfg`)
----
group: prefer_node1
       nodes node1

group: mygroup1
       nodes node2:1,node4,node1:2,node3:1

group: mygroup2
       nodes node2,node1
       restricted 1
----


The `nofailback` option is mostly useful to avoid unwanted resource
movements during administration tasks. For example, if you need to
migrate a service to a node which doesn't have the highest priority in the
group, you need to tell the HA manager to not move this service
instantly back by setting the `nofailback` option.

Another scenario is when a service was fenced and it got recovered to
another node. The admin tries to repair the fenced node and brings it
up online again to investigate the failure cause and check if it runs
stably again. Setting the `nofailback` flag prevents the
recovered services from moving straight back to the fenced node.


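
For illustration only - `nofailback` is one of the group options included
above, and the group name and node list here are made up - such a group
could be created with the same `groupadd` syntax:

----
# ha-manager groupadd prefer_node1_nofb --nodes "node1:2,node2" --nofailback
----
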
[[ha_manager_fencing]]
Fencing
-------

On node failures, fencing ensures that the erroneous node is
guaranteed to be offline. This is required to make sure that no
resource runs twice when it gets recovered on another node. This is a
really important task, because without it, it would not be possible to
recover a resource on another node.

If a node did not get fenced, it would be in an unknown state where
it may still have access to shared resources. This is really
dangerous! Imagine that every network but the storage one broke. Now,
while not reachable from the public network, the VM still runs and
writes to the shared storage.

If we then simply start up this VM on another node, we would get a
dangerous race condition, because we write from both nodes. Such a
condition can destroy all VM data, and the whole VM could be rendered
unusable. The recovery could also fail if the storage protects against
multiple mounts.


How {pve} Fences
~~~~~~~~~~~~~~~~

There are different methods to fence a node, for example, fence
devices which cut off the power from the node or disable their
communication completely. Those are often quite expensive and bring
additional critical components into a system, because if they fail you
cannot recover any service.

We thus wanted to integrate a simpler fencing method, which does not
require additional external hardware. This can be done using
watchdog timers.

.Possible Fencing Methods
- external power switches
- isolate nodes by disabling complete network traffic on the switch
- self fencing using watchdog timers

Watchdog timers have been widely used in critical and dependable systems
since the beginning of microcontrollers. They are often independent
and simple integrated circuits which are used to detect and recover
from computer malfunctions.

During normal operation, `ha-manager` regularly resets the watchdog
timer to prevent it from elapsing. If, due to a hardware fault or
program error, the computer fails to reset the watchdog, the timer
will elapse and trigger a reset of the whole server (reboot).

Recent server motherboards often include such hardware watchdogs, but
these need to be configured. If no watchdog is available or
configured, we fall back to the Linux Kernel 'softdog'. While still
reliable, it is not independent of the server's hardware, and thus has
a lower reliability than a hardware watchdog.


Configure Hardware Watchdog
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, all hardware watchdog modules are blocked for security
reasons. They are like a loaded gun if not correctly initialized. To
enable a hardware watchdog, you need to specify the module to load in
'/etc/default/pve-ha-manager', for example:

----
# select watchdog module (default is softdog)
WATCHDOG_MODULE=iTCO_wdt
----

This configuration is read by the 'watchdog-mux' service, which loads
the specified module at startup.


Recover Fenced Services
~~~~~~~~~~~~~~~~~~~~~~~

After a node failed and its fencing was successful, the CRM tries to
move services from the failed node to nodes which are still online.

The selection of nodes on which those services get recovered is
influenced by the resource `group` settings, the list of currently active
nodes, and their respective active service count.

The CRM first builds a set out of the intersection between user selected
nodes (from the `group` setting) and available nodes. It then chooses the
subset of nodes with the highest priority, and finally selects the node
with the lowest active service count. This minimizes the possibility
of an overloaded node.

CAUTION: On node failure, the CRM distributes services to the
remaining nodes. This increases the service count on those nodes, and
can lead to high load, especially on small clusters. Please design
your cluster so that it can handle such worst case scenarios.


[[ha_manager_start_failure_policy]]
Start Failure Policy
---------------------

The start failure policy comes into effect if a service failed to start on a
node once or more times. It can be used to configure how often a restart
should be triggered on the same node, and how often a service should be
relocated, so that it gets an attempt to be started on another node.
The aim of this policy is to circumvent temporary unavailability of shared
resources on a specific node. For example, if a shared storage isn't available
on a quorate node anymore, e.g. due to network problems, but is still available
on other nodes, the relocate policy then allows the service to be started nonetheless.

There are two service start recover policy settings which can be configured
specifically for each resource.

max_restart::

Maximum number of tries to restart a failed service on the actual
node. The default is set to one.

max_relocate::

Maximum number of tries to relocate the service to a different node.
A relocate only happens after the max_restart value is exceeded on the
actual node. The default is set to one.

NOTE: The relocate count state will only reset to zero when the
service had at least one successful start. That means if a service is
re-started without fixing the error, only the restart policy gets
repeated.


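
Both values are per-resource properties. As a sketch (the numbers are
arbitrary), they could be adjusted with the same `ha-manager set` syntax
used earlier:

----
# ha-manager set vm:100 --max_restart 2 --max_relocate 2
----
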
[[ha_manager_error_recovery]]
Error Recovery
--------------

If, after all tries, the service state could not be recovered, it gets
placed in an error state. In this state, the service won't get touched
by the HA stack anymore. The only way out is disabling a service:

----
# ha-manager set vm:100 --state disabled
----

This can also be done in the web interface.

To recover from the error state you should do the following:

* bring the resource back into a safe and consistent state (e.g.:
kill its process if the service could not be stopped)

* disable the resource to remove the error flag

* fix the error which led to these failures

* *after* you fixed all errors you may request that the service starts again
(an example command sequence follows this list)


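
Using the commands already shown above, a typical sequence for `vm:100`
after the underlying problem has been fixed would be:

----
# ha-manager set vm:100 --state disabled
# ha-manager set vm:100 --state started
----
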
[[ha_manager_package_updates]]
Package Updates
---------------

When updating the ha-manager, you should do one node after the other, never
all at once, for various reasons. First, while we test our software
thoroughly, a bug affecting your specific setup cannot totally be ruled out.
Upgrading one node after the other and checking the functionality of each node
after finishing the update helps to recover from eventual problems, while
updating all at once could leave you with a broken cluster, and is generally not
good practice.

Also, the {pve} HA stack uses a request acknowledge protocol to perform
actions between the cluster and the local resource manager. For restarting,
the LRM makes a request to the CRM to freeze all its services. This prevents
them from being touched by the cluster during the short time the LRM is restarting.
After that, the LRM may safely close the watchdog during a restart.
Such a restart happens during an update and, as already stated, an active master
CRM is needed to acknowledge the requests from the LRM. If this is not the case,
the update process can take too long which, in the worst case, may result in
a watchdog reset.


Node Maintenance
----------------

It is sometimes necessary to shut down or reboot a node to do
maintenance tasks, either to replace hardware, or simply to install a
new kernel image.


Shutdown
~~~~~~~~

A shutdown ('poweroff') is usually done if the node is planned to stay
down for some time. The LRM stops all managed services in that
case. This means that other nodes will take over those services
afterwards.

NOTE: Recent hardware has large amounts of RAM. So we stop all
resources, then restart them, to avoid online migration of all that
RAM. If you want to use online migration, you need to invoke that
manually before you shut down the node.


Reboot
~~~~~~

Node reboots are initiated with the 'reboot' command. This is usually
done after installing a new kernel. Please note that this is different
from ``shutdown'', because the node immediately starts again.

The LRM tells the CRM that it wants to restart, and waits until the
CRM puts all resources into the `freeze` state (the same mechanism is used
for xref:ha_manager_package_updates[Package Updates]). This prevents
those resources from being moved to other nodes. Instead, the CRM starts
the resources after the reboot on the same node.


Manual Resource Movement
~~~~~~~~~~~~~~~~~~~~~~~~

Last but not least, you can also move resources manually to other
nodes before you shut down or restart a node. The advantage is that you
have full control, and you can decide if you want to use online
migration or not.

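
For example, to live migrate a resource away from the node you are about to
take down, you can use the `migrate` command shown earlier (VM ID and target
node are illustrative):

----
# ha-manager migrate vm:100 node2
----
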
NOTE: Please do not 'kill' services like `pve-ha-crm`, `pve-ha-lrm` or
`watchdog-mux`. They manage and use the watchdog, so this can result
in a node reboot.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]