ceph: troubleshooting: revise and add frequently needed information

Existing information is slightly modified and retained.

Add information:
* List which logs are usually helpful for troubleshooting
* Explain how to acknowledge listed Ceph crashes and view details
* List common causes of Ceph problems and link to recommendations for a
  healthy cluster
* Briefly describe the common problem "OSDs down/crashed"

Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>
[AL]: use old anchor to sub chapter that was kept to not break links
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
Alexander Zeidler 2025-02-05 11:08:47 +01:00 committed by Aaron Lauterer
parent 402893065f
commit 70b3fb96e1

@@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the cluster is healthy
('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
below will also give you an overview of the current events and actions to take.
To stop their execution, press CTRL-C.

----
# Continuously watch the cluster status
pve# watch ceph --status

# Print the cluster status once (not being updated)
# and continuously append lines of status events
pve# ceph --watch
----

[[pve_ceph_ts]]
Troubleshooting
~~~~~~~~~~~~~~~

This section includes frequently used troubleshooting information.
More information can be found on the official Ceph website under
Troubleshooting
footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].

[[pve_ceph_ts_logs]]
.Relevant Logs on Affected Node
* xref:disk_health_monitoring[Disk Health Monitoring]
* __System -> System Log__ (or, for example,
`journalctl --since "2 days ago"`)
* IPMI and RAID controller logs
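
For example, recent entries of the system log and of a single Ceph
service can be inspected as follows; the OSD ID `0` is only a
placeholder for an actually affected service:

----
# System log of the last two days
pve# journalctl --since "2 days ago"

# Log of a single Ceph service, for example OSD 0
pve# journalctl -u ceph-osd@0.service --since "2 days ago"
----
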
Ceph service crashes can be listed and viewed in detail by running
`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
new can be acknowledged by running, for example,
`ceph crash archive-all`.
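
For example, the recorded crashes might be reviewed and acknowledged
like this, with `<crash_id>` being a placeholder taken from the list
output:

----
# List all recorded crashes and show the details of one of them
pve# ceph crash ls
pve# ceph crash info <crash_id>

# Acknowledge a single crash, or all crashes marked as new
pve# ceph crash archive <crash_id>
pve# ceph crash archive-all
----
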
To get a more detailed view, every Ceph service has a log file under
`/var/log/ceph/`. If more detail is required, the log level can be
adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
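
For example, the log level of a single OSD may be raised temporarily
at runtime; the OSD ID `0` and the chosen level are only placeholders:

----
# Increase the log verbosity of OSD 0 (memory/file log level)
pve# ceph tell osd.0 config set debug_osd 5/5

# Revert to the default level afterwards
pve# ceph tell osd.0 config set debug_osd 1/5
----
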
[[pve_ceph_ts_causes]]
.Common Causes of Ceph Problems
* Network problems like congestion, a faulty switch, a shut-down
interface or a blocking firewall. Check whether all {pve} nodes are
reliably reachable on the
xref:pvecm_cluster_network[corosync cluster network] and on the
xref:pve_ceph_install_wizard[Ceph public and cluster network], for
example as sketched after this list.
* Disks or connection parts which are:
** defective
** not firmly mounted
** lacking I/O performance under higher load (e.g. when using HDDs,
consumer hardware or
xref:pve_ceph_recommendation_raid[inadvisable RAID controllers])
* Not fulfilling the xref:_recommendations_for_a_healthy_ceph_cluster[recommendations] for
a healthy Ceph cluster.
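
A basic reachability check could look like the following; the network
`10.10.10.0/24` and the node addresses are only placeholders for your
actual Ceph public/cluster network:

----
# Ping the other nodes on the Ceph network (example addresses)
pve# ping -c 3 10.10.10.12
pve# ping -c 3 10.10.10.13

# Check corosync link and quorum state
pve# corosync-cfgtool -s
pve# pvecm status
----
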
[[pve_ceph_ts_problems]]
.Common Ceph Problems
::
OSDs `down`/crashed:::
A faulty OSD will be reported as `down` and is usually automatically
marked `out` 10 minutes later. Depending on the cause, it can also
become `up` and `in` again on its own. To try a manual activation via
the web interface, go to __Any node -> Ceph -> OSD__, select the OSD
and click on **Start**, **In** and **Reload**. When using the shell,
run `ceph-volume lvm activate --all` on the affected node, as shown
in the sketch below.
+
To activate a failed OSD, it may be necessary to
xref:ha_manager_node_maintenance[safely reboot] the respective node
or, as a last resort, to
xref:pve_ceph_osd_replace[recreate or replace] the OSD.
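
A short sketch of the corresponding shell steps, with `<id>` as a
placeholder for the ID of the affected OSD:

----
# Show OSDs that are currently down and check the service state
pve# ceph osd tree down
pve# systemctl status ceph-osd@<id>.service

# Try to activate all OSD volumes on the node again
pve# ceph-volume lvm activate --all
----
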
ifdef::manvolnum[]
include::pve-copyright.adoc[]