[[chapter-ha-manager]]
ifdef::manvolnum[]
PVE({manvolnum})
================
include::attributes.txt[]

NAME
----

ha-manager - Proxmox VE HA Manager

SYNOPSIS
--------

include::ha-manager.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]

ifndef::manvolnum[]
High Availability
=================
include::attributes.txt[]
endif::manvolnum[]

'ha-manager' handles management of user-defined cluster services. This
includes handling of user requests such as service start, service
disable, service relocate, and service restart. The cluster resource
manager daemon also handles restarting and relocating services in the
event of failures.

HOW IT WORKS
------------

The local resource manager ('pve-ha-lrm') is started as a daemon on
each node at system start and waits until the HA cluster is quorate
and locks are working. After initialization, the LRM determines which
services are enabled and starts them. The watchdog is also initialized
at this point.

The cluster resource manager ('pve-ha-crm') starts on each node and
waits there for the manager lock, which can only be held by one node
at a time. The node which successfully acquires the manager lock gets
promoted to the CRM; it handles cluster-wide actions like migrations
and failures.

When a node leaves the cluster quorum, its state changes to unknown.
If the current CRM can then secure the failed node's lock, the services
will be 'stolen' and restarted on another node.

When a cluster member determines that it is no longer in the cluster
quorum, the LRM waits for a new quorum to form. As long as there is no
quorum, the node cannot reset the watchdog. This will trigger a reboot
after 60 seconds.

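To check what the HA stack is currently doing, for example which node
holds the manager lock and which services it tracks, the 'ha-manager'
CLI can be queried (the exact output format depends on your version):

----
# show the current view of the cluster resource manager
ha-manager status
----
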
CONFIGURATION
-------------

The HA stack is well integrated into the Proxmox VE API2. So, for
example, HA can be configured via 'ha-manager' or the PVE web
interface, both of which provide an easy-to-use tool.

The resource configuration file is located at
'/etc/pve/ha/resources.cfg' and the group configuration file at
'/etc/pve/ha/groups.cfg'. Use the provided tools to make changes;
there shouldn't be any need to edit them manually.

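For illustration only, and not meant for manual editing, entries in
these files look roughly like the sketch below; the resource 'vm:100',
the group name 'mygroup', and the exact syntax are assumptions that may
differ between releases:

----
# /etc/pve/ha/resources.cfg (sketch, normally created via 'ha-manager')
vm: 100
    group mygroup

# /etc/pve/ha/groups.cfg (sketch, normally created via 'ha-manager')
group: mygroup
    nodes node1,node2
----
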
RESOURCES/SERVICES AGENTS
-------------------------

A resource (also called a service) can be managed by the
ha-manager. Currently we support virtual machines and containers.

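Resources are addressed by a service ID made up of the resource type
and the VMID. As a short sketch, assuming a virtual machine with ID 100
and a container with ID 101 already exist, they can be put under HA
control like this:

----
# virtual machine
ha-manager add vm:100

# container
ha-manager add ct:101
----
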
GROUPS
------

A group is a collection of cluster nodes which a service may be bound to.

GROUP SETTINGS
~~~~~~~~~~~~~~

nodes::

List of group node members.

restricted::

Resources bound to this group may only run on nodes defined by the
group. If no group node member is available, the resource will be
placed in the stopped state.

nofailback::

The resource won't automatically fail back when a more preferred node
(re)joins the cluster.

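As a minimal sketch, assuming two cluster nodes called 'node1' and
'node2'; the group name, the resource 'vm:100', and the exact option
spelling are assumptions and may differ between versions:

----
# create a restricted group limited to node1 and node2,
# and keep services where they are when a preferred node rejoins
ha-manager groupadd mygroup --nodes node1,node2 --restricted 1 --nofailback 1

# bind an existing HA resource to that group
ha-manager set vm:100 --group mygroup
----
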
RECOVERY POLICY
---------------

There are two service recovery policy settings which can be configured
specifically for each resource.

max_restart::

Maximal number of tries to restart a failed service on the current
node. The default is set to one.

max_relocate::

Maximal number of tries to relocate the service to a different node.
A relocate only happens after the max_restart value is exceeded on the
current node. The default is set to one.

Note that the relocate count state will only reset to zero when the
service had at least one successful start. That means if a service is
re-enabled without fixing the error, only the restart policy gets
repeated.

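Both values can be changed per resource; a short sketch, assuming the
placeholder resource 'vm:100':

----
# try two local restarts and two relocations before giving up
ha-manager set vm:100 --max_restart 2 --max_relocate 2
----
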
ERROR RECOVERY
--------------

If after all tries the service state could not be recovered, it gets
placed in an error state. In this state the service won't be touched
by the HA stack anymore. To recover from this state you should follow
these steps:

* bring the resource back into a safe and consistent state (e.g.
killing its process)

* disable the HA resource to place it in a stopped state

* fix the error which led to these failures

* *after* you fixed all errors you may enable the service again

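Translated into CLI calls, the disable/enable steps look like this
(using the placeholder resource 'vm:100'):

----
# place the failed resource into the stopped state
ha-manager disable vm:100

# after the root cause has been fixed, hand it back to the HA stack
ha-manager enable vm:100
----
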
SERVICE OPERATIONS
------------------

This is how the basic user-initiated service operations (via
'ha-manager') work.

enable::

The service will be started by the LRM if not already running.

disable::

The service will be stopped by the LRM if running.

migrate/relocate::

The service will be relocated (live) to another node.

remove::

The service will be removed from the HA managed resource list. Its
current state will not be touched.

start/stop::

Start and stop commands can be issued to the resource-specific tools
(like 'qm' or 'pct'); they will forward the request to the
'ha-manager', which will then execute the action and set the resulting
service state (enabled, disabled).

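For reference, the corresponding 'ha-manager' invocations look like
this ('vm:100' and the target node 'node2' are placeholders):

----
ha-manager enable vm:100           # start via the LRM
ha-manager disable vm:100          # stop via the LRM
ha-manager migrate vm:100 node2    # (live) migrate to node2
ha-manager relocate vm:100 node2   # relocate to node2
ha-manager remove vm:100           # drop from the HA managed resource list
----
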
SERVICE STATES
--------------

stopped::

Service is stopped (confirmed by the LRM).

request_stop::

Service should be stopped. Waiting for confirmation from the LRM.

started::

Service is active and the LRM should start it ASAP if not already running.

fence::

Wait for node fencing (service node is not inside the quorate cluster
partition).

freeze::

Do not touch the service state. We use this state while we reboot a
node, or when we restart the LRM daemon.

migrate::

Migrate service (live) to another node.

error::

Service disabled because of LRM errors. Needs manual intervention.


ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]