ceph: maintenance: revise and expand section "Replace OSDs"

Remove redundant information that is already described in section
“Destroy OSDs” and link to it.

Mention and link to the troubleshooting section, as replacing the OSD
may not fix the underlying problem.

Mention that the replacement disk should be of the same type and size
and comply with the recommendations.

Mention how to acknowledge warnings of crashed OSDs.

Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>
Alexander Zeidler 2025-02-05 11:08:49 +01:00 committed by Aaron Lauterer
parent 84ba04863c
commit 0a52307436


@@ -1035,43 +1035,24 @@ Ceph Maintenance
 Replace OSDs
 ~~~~~~~~~~~~
 
-One of the most common maintenance tasks in Ceph is to replace the disk of an
-OSD. If a disk is already in a failed state, then you can go ahead and run
-through the steps in xref:pve_ceph_osd_destroy[Destroy OSDs]. Ceph will recreate
-those copies on the remaining OSDs if possible. This rebalancing will start as
-soon as an OSD failure is detected or an OSD was actively stopped.
-
-NOTE: With the default size/min_size (3/2) of a pool, recovery only starts when
-`size + 1` nodes are available. The reason for this is that the Ceph object
-balancer xref:pve_ceph_device_classes[CRUSH] defaults to a full node as
-`failure domain'.
-
-To replace a functioning disk from the GUI, go through the steps in
-xref:pve_ceph_osd_destroy[Destroy OSDs]. The only addition is to wait until
-the cluster shows 'HEALTH_OK' before stopping the OSD to destroy it.
-
-On the command line, use the following commands:
-
-----
-ceph osd out osd.<id>
-----
-
-You can check with the command below if the OSD can be safely removed.
-
-----
-ceph osd safe-to-destroy osd.<id>
-----
-
-Once the above check tells you that it is safe to remove the OSD, you can
-continue with the following commands:
-
-----
-systemctl stop ceph-osd@<id>.service
-pveceph osd destroy <id>
-----
-
-Replace the old disk with the new one and use the same procedure as described
-in xref:pve_ceph_osd_create[Create OSDs].
+With the following steps you can replace the disk of an OSD, which is
+one of the most common maintenance tasks in Ceph. If there is a
+problem with an OSD while its disk still seems to be healthy, read the
+xref:pve_ceph_mon_and_ts[troubleshooting] section first.
+
+. If the disk failed, get a
+xref:pve_ceph_recommendation_disk[recommended] replacement disk of the
+same type and size.
+
+. xref:pve_ceph_osd_destroy[Destroy] the OSD in question.
+
+. Detach the old disk from the server and attach the new one.
+
+. xref:pve_ceph_osd_create[Create] the OSD again.
+
+. After automatic rebalancing, the cluster status should switch back
+to `HEALTH_OK`. Any still listed crashes can be acknowledged by
+running, for example, `ceph crash archive-all`.
 
 Trim/Discard
 ~~~~~~~~~~~~
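
For anyone performing the same replacement on the command line, the sketch below strings together the commands from the removed text and the linked Destroy OSDs and Create OSDs sections, plus the crash acknowledgment mentioned in the new last step. `<id>` and `/dev/sd[X]` are placeholders for the affected OSD and the newly attached replacement disk.

----
# take the OSD out of the cluster
ceph osd out osd.<id>

# check whether the OSD can be removed without risking data loss
ceph osd safe-to-destroy osd.<id>

# once it is safe, stop the service and destroy the OSD
systemctl stop ceph-osd@<id>.service
pveceph osd destroy <id>

# after swapping the disks, create the OSD again on the new device
pveceph osd create /dev/sd[X]

# when the cluster is back to HEALTH_OK, review and acknowledge remaining crash reports
ceph crash ls
ceph crash archive-all
----

`ceph crash ls` is only there to review the listed crashes before archiving them; a single entry can also be acknowledged with `ceph crash archive <crash-id>`.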