Mirror of https://git.proxmox.com/git/pve-docs, synced 2025-10-04 21:16:32 +00:00
ceph: troubleshooting: revise and add frequently needed information
Existing information is slightly modified and retained. Add information:

* List which logs are usually helpful for troubleshooting
* Explain how to acknowledge listed Ceph crashes and view details
* List common causes of Ceph problems and link to recommendations for a healthy cluster
* Briefly describe the common problem "OSDs down/crashed"

Signed-off-by: Alexander Zeidler <a.zeidler@proxmox.com>

[AL]: use old anchor to sub chapter that was kept to not break links

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
parent 402893065f
commit 70b3fb96e1
pveceph.adoc | 70
@@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the cluster is healthy
 ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
 ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
 below will also give you an overview of the current events and actions to take.
+To stop their execution, press CTRL-C.
 
 ----
-# single time output
-pve# ceph -s
-# continuously output status changes (press CTRL+C to stop)
-pve# ceph -w
+# Continuously watch the cluster status
+pve# watch ceph --status
+
+# Print the cluster status once (not being updated)
+# and continuously append lines of status events
+pve# ceph --watch
 ----
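If the reported status is not `HEALTH_OK`, a frequently used follow-up is to list the individual issues behind a warning or error. A minimal example, runnable on any node:

----
# Summarize the current health status
pve# ceph health

# List the individual issues behind a warning or error
pve# ceph health detail
----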
 
+[[pve_ceph_ts]]
+Troubleshooting
+~~~~~~~~~~~~~~~
+
+This section includes frequently used troubleshooting information.
+More information can be found on the official Ceph website under
+Troubleshooting
+footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
+
+[[pve_ceph_ts_logs]]
+.Relevant Logs on Affected Node
+
+* xref:disk_health_monitoring[Disk Health Monitoring]
+* __System -> System Log__ (or, for example,
+`journalctl --since "2 days ago"`)
+* IPMI and RAID controller logs
+
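To narrow the system log down to Ceph, the `journalctl` call from the list above can be combined with unit filters. A sketch, with the unit globs and the time range chosen only as examples:

----
# Only messages of Ceph OSD services from the last two days
pve# journalctl --since "2 days ago" -u 'ceph-osd@*'

# Likewise for monitor and manager services
pve# journalctl --since "2 days ago" -u 'ceph-mon@*' -u 'ceph-mgr@*'
----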
+Ceph service crashes can be listed and viewed in detail by running
+`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
+new can be acknowledged by running, for example,
+`ceph crash archive-all`.
+
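A minimal session for this workflow could look like the following, with `<crash_id>` taken from the output of the list commands:

----
# List all recorded crashes, or only those not yet acknowledged
pve# ceph crash ls
pve# ceph crash ls-new

# Show the details of a single crash
pve# ceph crash info <crash_id>

# Acknowledge (archive) all new crashes
pve# ceph crash archive-all
----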
 To get a more detailed view, every Ceph service has a log file under
 `/var/log/ceph/`. If more detail is required, the log level can be
 adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
 
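For example, the debug level of a single OSD daemon could be raised at runtime and reset afterwards; `osd.3` and the chosen levels are only placeholders:

----
# Raise the log/memory level of the osd subsystem for osd.3
pve# ceph tell osd.3 config set debug_osd 10/10

# Reset it to the default again (1/5 at the time of writing)
pve# ceph tell osd.3 config set debug_osd 1/5
----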
-You can find more information about troubleshooting
-footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
-a Ceph cluster on the official website.
+[[pve_ceph_ts_causes]]
+.Common Causes of Ceph Problems
+
+* Network problems like congestion, a faulty switch, a shut down
+interface or a blocking firewall. Check whether all {pve} nodes are
+reliably reachable on the
+xref:pvecm_cluster_network[corosync cluster network] and on the
+xref:pve_ceph_install_wizard[Ceph public and cluster network].
+
+* Disk or connection parts which are:
+** defective
+** not firmly mounted
+** lacking I/O performance under higher load (e.g. when using HDDs,
+consumer hardware or
+xref:pve_ceph_recommendation_raid[inadvisable RAID controllers])
+
+* Not fulfilling the xref:_recommendations_for_a_healthy_ceph_cluster[recommendations] for
+a healthy Ceph cluster.
+
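Regarding the network reachability mentioned in the first bullet, a quick check could look as follows; the address is a placeholder for another node on the Ceph public or cluster network, and depending on the setup the networks may be defined elsewhere:

----
# Which Ceph networks are configured on this cluster?
pve# grep network /etc/pve/ceph.conf

# Is another node reachable on these networks?
pve# ping -c 3 10.10.10.12
----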
+[[pve_ceph_ts_problems]]
+.Common Ceph Problems
+::
+
+OSDs `down`/crashed:::
+A faulty OSD will be reported as `down` and mostly (auto) `out` 10
+minutes later. Depending on the cause, it can also automatically
+become `up` and `in` again. To try a manual activation via web
+interface, go to __Any node -> Ceph -> OSD__, select the OSD and click
+on **Start**, **In** and **Reload**. When using the shell, run on the
+affected node `ceph-volume lvm activate --all`.
++
+To activate a failed OSD, it may be necessary to
+xref:ha_manager_node_maintenance[safely reboot] the respective node
+or, as a last resort, to
+xref:pve_ceph_osd_replace[recreate or replace] the OSD.
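On the shell, the affected OSDs and their services can be inspected roughly like this; the OSD ID `7` is just an example:

----
# Which OSDs are currently down, and on which hosts?
pve# ceph osd tree down

# On the affected node: check and restart the OSD service
pve# systemctl status ceph-osd@7.service
pve# systemctl restart ceph-osd@7.service

# Or try to activate all OSD volumes on the node again
pve# ceph-volume lvm activate --all
----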
 
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]