Mirror of https://git.proxmox.com/git/pve-docs, synced 2025-10-04 21:16:32 +00:00
ceph: troubleshooting: revise and add frequently needed information
Existing information is slightly modified and retained. Add information:

* List which logs are usually helpful for troubleshooting
* Explain how to acknowledge listed Ceph crashes and view details
* List common causes of Ceph problems and link to recommendations for a healthy cluster
* Briefly describe the common problem "OSDs down/crashed"

Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>

[AL]: use old anchor to sub chapter that was kept to not break links

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
parent 402893065f
commit 70b3fb96e1
pveceph.adoc | 70
@@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the cluster is healthy
 ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
 ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
 below will also give you an overview of the current events and actions to take.
+To stop their execution, press CTRL-C.
 
 ----
-# single time output
-pve# ceph -s
-# continuously output status changes (press CTRL+C to stop)
-pve# ceph -w
+# Continuously watch the cluster status
+pve# watch ceph --status
+
+# Print the cluster status once (not being updated)
+# and continuously append lines of status events
+pve# ceph --watch
 ----
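If the reported status is not `HEALTH_OK`, a frequently used follow-up is to list the individual issues behind a warning or error. A minimal example, runnable on any node:

----
# Summarize the current health status
pve# ceph health

# List the individual issues behind a warning or error
pve# ceph health detail
----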
 
+[[pve_ceph_ts]]
+Troubleshooting
+~~~~~~~~~~~~~~~
+
+This section includes frequently used troubleshooting information.
+More information can be found on the official Ceph website under
+Troubleshooting
+footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
+
+[[pve_ceph_ts_logs]]
+.Relevant Logs on Affected Node
+
+* xref:disk_health_monitoring[Disk Health Monitoring]
+* __System -> System Log__ (or, for example,
+`journalctl --since "2 days ago"`)
+* IPMI and RAID controller logs
+
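To narrow the system log down to Ceph, the `journalctl` call from the list above can be combined with unit filters. A sketch, with the unit globs and the time range chosen only as examples:

----
# Only messages of Ceph OSD services from the last two days
pve# journalctl --since "2 days ago" -u 'ceph-osd@*'

# Likewise for monitor and manager services
pve# journalctl --since "2 days ago" -u 'ceph-mon@*' -u 'ceph-mgr@*'
----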
+Ceph service crashes can be listed and viewed in detail by running
+`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
+new can be acknowledged by running, for example,
+`ceph crash archive-all`.
+
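A minimal session for this workflow could look like the following, with `<crash_id>` taken from the output of the list commands:

----
# List all recorded crashes, or only those not yet acknowledged
pve# ceph crash ls
pve# ceph crash ls-new

# Show the details of a single crash
pve# ceph crash info <crash_id>

# Acknowledge (archive) all new crashes
pve# ceph crash archive-all
----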
 To get a more detailed view, every Ceph service has a log file under
 `/var/log/ceph/`. If more detail is required, the log level can be
 adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
 
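For example, the debug level of a single OSD daemon could be raised at runtime and reset afterwards; `osd.3` and the chosen levels are only placeholders:

----
# Raise the log/memory level of the osd subsystem for osd.3
pve# ceph tell osd.3 config set debug_osd 10/10

# Reset it to the default again (1/5 at the time of writing)
pve# ceph tell osd.3 config set debug_osd 1/5
----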
-You can find more information about troubleshooting
-footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
-a Ceph cluster on the official website.
+[[pve_ceph_ts_causes]]
+.Common Causes of Ceph Problems
+
+* Network problems like congestion, a faulty switch, a shut down
+interface or a blocking firewall. Check whether all {pve} nodes are
+reliably reachable on the
+xref:pvecm_cluster_network[corosync cluster network] and on the
+xref:pve_ceph_install_wizard[Ceph public and cluster network].
+
+* Disk or connection parts which are:
+** defective
+** not firmly mounted
+** lacking I/O performance under higher load (e.g. when using HDDs,
+consumer hardware or
+xref:pve_ceph_recommendation_raid[inadvisable RAID controllers])
+
+* Not fulfilling the xref:_recommendations_for_a_healthy_ceph_cluster[recommendations] for
+a healthy Ceph cluster.
+
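Regarding the network reachability mentioned in the first bullet, a quick check could look as follows; the address is a placeholder for another node on the Ceph public or cluster network, and depending on the setup the networks may be defined elsewhere:

----
# Which Ceph networks are configured on this cluster?
pve# grep network /etc/pve/ceph.conf

# Is another node reachable on these networks?
pve# ping -c 3 10.10.10.12
----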
+[[pve_ceph_ts_problems]]
+.Common Ceph Problems
+::
+
+OSDs `down`/crashed:::
+A faulty OSD will be reported as `down` and mostly (auto) `out` 10
+minutes later. Depending on the cause, it can also automatically
+become `up` and `in` again. To try a manual activation via web
+interface, go to __Any node -> Ceph -> OSD__, select the OSD and click
+on **Start**, **In** and **Reload**. When using the shell, run on the
+affected node `ceph-volume lvm activate --all`.
++
+To activate a failed OSD, it may be necessary to
+xref:ha_manager_node_maintenance[safely reboot] the respective node
+or, as a last resort, to
+xref:pve_ceph_osd_replace[recreate or replace] the OSD.
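On the shell, the affected OSDs and their services can be inspected roughly like this; the OSD ID `7` is just an example:

----
# Which OSDs are currently down, and on which hosts?
pve# ceph osd tree down

# On the affected node: check and restart the OSD service
pve# systemctl status ceph-osd@7.service
pve# systemctl restart ceph-osd@7.service

# Or try to activate all OSD volumes on the node again
pve# ceph-volume lvm activate --all
----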
 
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]