ceph: maintenance: revise and expand section "Replace OSDs"

Remove redundant information that is already described in section
“Destroy OSDs” and link to it.

Mention and link to the troubleshooting section, as replacing the OSD
may not fix the underlying problem.

Mention that the replacement disk should be of the same type and size
and comply with the recommendations.

Mention how to acknowledge warnings of crashed OSDs.

Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>
Alexander Zeidler 2025-02-05 11:08:49 +01:00 committed by Aaron Lauterer
parent 84ba04863c
commit 0a52307436


@@ -1035,43 +1035,24 @@ Ceph Maintenance
 Replace OSDs
 ~~~~~~~~~~~~
 
-One of the most common maintenance tasks in Ceph is to replace the disk of an
-OSD. If a disk is already in a failed state, then you can go ahead and run
-through the steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate
-those copies on the remaining OSDs if possible. This rebalancing will start as
-soon as an OSD failure is detected or an OSD was actively stopped.
-
-NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
-`size + 1` nodes are available. The reason for this is that the Ceph object
-balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
-`failure domain'.
-
-To replace a functioning disk from the GUI, go through the steps in
-xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
-the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.
-
-On the command line, use the following commands:
-
-----
-ceph osd out osd.<id>
-----
-
-You can check with the command below if the OSD can be safely removed.
-
-----
-ceph osd safe-to-destroy osd.<id>
-----
-
-Once the above check tells you that it is safe to remove the OSD, you can
-continue with the following commands:
-
-----
-systemctl stop ceph-osd@<id>.service
-pveceph osd destroy <id>
-----
-
-Replace the old disk with the new one and use the same procedure as described
-in xref:pve_ceph_osd_create[Create OSDs].
+With the following steps you can replace the disk of an OSD, which is
+one of the most common maintenance tasks in Ceph. If there is a
+problem with an OSD while its disk still seems to be healthy, read the
+xref:pve_ceph_mon_and_ts[troubleshooting] section first.
+
+. If the disk failed, get a
+xref:pve_ceph_recommendation_disk[recommended] replacement disk of the
+same type and size.
+
+. xref:pve_ceph_osd_destroy[Destroy] the OSD in question.
+
+. Detach the old disk from the server and attach the new one.
+
+. xref:pve_ceph_osd_create[Create] the OSD again.
+
+. After automatic rebalancing, the cluster status should switch back
+to `HEALTH_OK`. Any still listed crashes can be acknowledged by
+running, for example, `ceph crash archive-all`.
 
 Trim/Discard
 ~~~~~~~~~~~~
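
For anyone performing the same replacement on the command line, the sketch below strings together the commands from the removed text and the linked Destroy OSDs and Create OSDs sections, plus the crash acknowledgment mentioned in the new last step. `<id>` and `/dev/sd[X]` are placeholders for the affected OSD and the newly attached replacement disk.

----
# take the OSD out of the cluster
ceph osd out osd.<id>

# check whether the OSD can be removed without risking data loss
ceph osd safe-to-destroy osd.<id>

# once it is safe, stop the service and destroy the OSD
systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>

# after swapping the disks, create the OSD again on the new device
pveceph osd create /dev/sd[X]

# when the cluster is back to HEALTH_OK, review and acknowledge remaining crash reports
ceph crash ls
ceph crash archive-all
----

`ceph crash ls` is only there to review the listed crashes before archiving them; a single entry can also be acknowledged with `ceph crash archive <crash-id>`.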