[[chapter_pveceph]]
ifdef::manvolnum[]
pveceph(1)
==========
:pve-toplevel:

NAME
----

pveceph - Manage Ceph Services on Proxmox VE Nodes

SYNOPSIS
--------

include::pveceph.1-synopsis.adoc[]

DESCRIPTION
-----------
endif::manvolnum[]
ifndef::manvolnum[]
Manage Ceph Services on Proxmox VE Nodes
========================================
:pve-toplevel:
endif::manvolnum[]

[thumbnail="screenshot/gui-ceph-status.png"]
 | 
						||
 | 
						||
{pve} unifies your compute and storage systems, i.e. you can use the same
 | 
						||
physical nodes within a cluster for both computing (processing VMs and
 | 
						||
containers) and replicated storage. The traditional silos of compute and
 | 
						||
storage resources can be wrapped up into a single hyper-converged appliance.
 | 
						||
Separate storage networks (SANs) and connections via network attached storages
 | 
						||
(NAS) disappear. With the integration of Ceph, an open source software-defined
 | 
						||
storage platform, {pve} has the ability to run and manage Ceph storage directly
 | 
						||
on the hypervisor nodes.
 | 
						||
 | 
						||
Ceph is a distributed object store and file system designed to provide
 | 
						||
excellent performance, reliability and scalability.
 | 
						||
 | 
						||
.Some advantages of Ceph on {pve} are:
 | 
						||
- Easy setup and management with CLI and GUI support
 | 
						||
- Thin provisioning
 | 
						||
- Snapshots support
 | 
						||
- Self healing
 | 
						||
- Scalable to the exabyte level
 | 
						||
- Setup pools with different performance and redundancy characteristics
 | 
						||
- Data is replicated, making it fault tolerant
 | 
						||
- Runs on economical commodity hardware
 | 
						||
- No need for hardware RAID controllers
 | 
						||
- Open source
 | 
						||
 | 
						||
For small to mid sized deployments, it is possible to install a Ceph server for
 | 
						||
RADOS Block Devices (RBD) directly on your {pve} cluster nodes, see
 | 
						||
xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]. Recent
 | 
						||
hardware has plenty of CPU power and RAM, so running storage services
 | 
						||
and VMs on the same node is possible.
 | 
						||
 | 
						||
To simplify management, we provide 'pveceph' - a tool to install and
 | 
						||
manage {ceph} services on {pve} nodes.
 | 
						||
 | 
						||
.Ceph consists of a couple of Daemons footnote:[Ceph intro http://docs.ceph.com/docs/luminous/start/intro/], for use as a RBD storage:
 | 
						||
- Ceph Monitor (ceph-mon)
 | 
						||
- Ceph Manager (ceph-mgr)
 | 
						||
- Ceph OSD (ceph-osd; Object Storage Daemon)
 | 
						||
 | 
						||
TIP: We highly recommend to get familiar with Ceph's architecture
 | 
						||
footnote:[Ceph architecture http://docs.ceph.com/docs/luminous/architecture/]
 | 
						||
and vocabulary
 | 
						||
footnote:[Ceph glossary http://docs.ceph.com/docs/luminous/glossary].
 | 
						||
 | 
						||
Precondition
------------

To build a hyper-converged Proxmox + Ceph cluster, there should be at least
three (preferably identical) servers for the setup.

Also check the recommendations from
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/[Ceph's website].

.CPU
A higher CPU core frequency reduces latency and should be preferred. As a
simple rule of thumb, you should assign a CPU core (or thread) to each Ceph
service, to provide enough resources for stable and durable Ceph performance.

.Memory
Especially in a hyper-converged setup, the memory consumption needs to be
carefully monitored. In addition to the intended workload from virtual machines
and containers, Ceph needs enough memory available to provide good and stable
performance. As a rule of thumb, for roughly 1 TiB of data, 1 GiB of memory
will be used by an OSD. For example, a node with four 4 TiB OSDs should
reserve around 16 GiB of memory for its OSDs alone. OSD caching will use
additional memory.

.Network
We recommend a network bandwidth of at least 10 GbE, used exclusively for
Ceph. A meshed network setup
footnote:[Full Mesh Network for Ceph {webwiki-url}Full_Mesh_Network_for_Ceph_Server]
is also an option if there are no 10 GbE switches available.

The volume of traffic, especially during recovery, will interfere with other
services on the same network and may even break the {pve} cluster stack.

Furthermore, estimate your bandwidth needs. While one HDD might not saturate a
1 Gb link, multiple HDD OSDs per node can, and modern NVMe SSDs will quickly
saturate 10 Gbps of bandwidth. Deploying a network capable of even more
bandwidth will ensure that it isn't your bottleneck and won't be anytime soon;
25, 40 or even 100 Gbps are possible.
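
As a rough, illustrative estimate (the per-device throughput figures are
assumptions, not measurements):

----
4 HDD OSDs/node  * ~150 MB/s each ≈ 600 MB/s ≈ 4.8 Gbps -> fits a 10 GbE link
2 NVMe OSDs/node * ~2 GB/s each   ≈ 4 GB/s   ≈ 32 Gbps  -> needs 40 GbE or more
----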

.Disks
When planning the size of your Ceph cluster, it is important to take the
recovery time into consideration. Especially with small clusters, recovery
can take a long time. It is recommended that you use SSDs instead of HDDs in
small setups to reduce the recovery time, minimizing the likelihood of a
subsequent failure event during recovery.

In general, SSDs will provide more IOPS than spinning disks. This fact and the
higher cost may make a xref:pve_ceph_device_classes[class based] separation of
pools appealing. Another possibility to speed up OSDs is to use a faster disk
as a journal or DB/**W**rite-**A**head-**L**og device, see
xref:pve_ceph_osds[creating Ceph OSDs]. If a faster disk is used for multiple
OSDs, a proper balance between OSD and WAL / DB (or journal) disk must be
selected, otherwise the faster disk becomes the bottleneck for all linked OSDs.

Aside from the disk type, Ceph performs best with an evenly sized and
distributed amount of disks per node. For example, 4 x 500 GB disks in each
node is better than a mixed setup with a single 1 TB and three 250 GB disks.

One also needs to balance OSD count and single OSD capacity. More capacity
allows you to increase storage density, but it also means that a single OSD
failure forces Ceph to recover more data at once.

.Avoid RAID
As Ceph handles data object redundancy and multiple parallel writes to disks
(OSDs) on its own, using a RAID controller normally doesn't improve
performance or availability. On the contrary, Ceph is designed to handle whole
disks on its own, without any abstraction in between. RAID controllers are not
designed for the Ceph use case and may complicate things and sometimes even
reduce performance, as their write and caching algorithms may interfere with
the ones from Ceph.

WARNING: Avoid RAID controllers; use host bus adapters (HBA) instead.

NOTE: The above recommendations should be seen as rough guidance for choosing
hardware. Therefore, it is still essential to adapt them to your specific
needs. You should test your setup and monitor health and performance
continuously.

[[pve_ceph_install_wizard]]
Initial Ceph installation & configuration
-----------------------------------------

[thumbnail="screenshot/gui-node-ceph-install.png"]

With {pve} you have the benefit of an easy to use installation wizard
for Ceph. Click on one of your cluster nodes and navigate to the Ceph
section in the menu tree. If Ceph is not already installed, you will be
offered to do so now.

The wizard is divided into different sections, where each needs to be
finished successfully in order to use Ceph. After starting the installation,
the wizard will download and install all the required packages from {pve}'s
Ceph repository.

After finishing the first step, you will need to create a configuration.
This step is only needed once per cluster, as this configuration is distributed
automatically to all remaining cluster members through {pve}'s clustered
xref:chapter_pmxcfs[configuration file system (pmxcfs)].

The configuration step includes the following settings:

* *Public Network:* You should set up a dedicated network for Ceph; this
setting is required. Separating your Ceph traffic is highly recommended,
because otherwise it could cause trouble with other latency-dependent
services, for example cluster communication, and may decrease Ceph's
performance.

[thumbnail="screenshot/gui-node-ceph-install-wizard-step2.png"]

* *Cluster Network:* As an optional step, you can go even further and
separate the xref:pve_ceph_osds[OSD] replication & heartbeat traffic
as well. This will relieve the public network and could lead to
significant performance improvements, especially in large clusters.

You have two more options which are considered advanced and therefore
should only be changed if you are an expert.

* *Number of replicas*: Defines how often an object is replicated.
* *Minimum replicas*: Defines the minimum number of required replicas
  for I/O to be marked as complete.
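
These two options correspond to the `size` and `min_size` settings of the
pools created later on. Once the cluster is running, you can, for example,
verify them for any pool with:

[source,bash]
----
ceph osd pool get <pool-name> size
ceph osd pool get <pool-name> min_size
----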

Additionally, you need to choose your first monitor node; this is required.

That's it. You should see a success page as the last step, with further
instructions on how to proceed. You are now prepared to start using Ceph,
even though you will need to create additional xref:pve_ceph_monitors[monitors],
create some xref:pve_ceph_osds[OSDs] and at least one xref:pve_ceph_pools[pool].

The rest of this chapter will guide you on how to get the most out of
your {pve} based Ceph setup. This includes the aforementioned topics and
more, such as xref:pveceph_fs[CephFS], which is a very handy addition to your
new Ceph cluster.

[[pve_ceph_install]]
Installation of Ceph Packages
-----------------------------
Use the {pve} Ceph installation wizard (recommended) or run the following
command on each node:

[source,bash]
----
pveceph install
----

This sets up an `apt` package repository in
`/etc/apt/sources.list.d/ceph.list` and installs the required software.
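
The repository entry will look similar to the following (the exact Ceph and
Debian release names depend on your {pve} version; this line is an
illustrative example):

----
deb http://download.proxmox.com/debian/ceph-luminous stretch main
----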

Creating initial Ceph configuration
-----------------------------------

[thumbnail="screenshot/gui-ceph-config.png"]

Use the {pve} Ceph installation wizard (recommended) or run the
following command on one node:

[source,bash]
----
pveceph init --network 10.10.10.0/24
----

This creates an initial configuration at `/etc/pve/ceph.conf` with a
dedicated network for Ceph. That file is automatically distributed to
all {pve} nodes, using xref:chapter_pmxcfs[pmxcfs]. The command also
creates a symbolic link at `/etc/ceph/ceph.conf`, which points to that file.
Thus, you can simply run Ceph commands without the need to specify a
configuration file.

[[pve_ceph_monitors]]
Creating Ceph Monitors
----------------------

[thumbnail="screenshot/gui-ceph-monitor.png"]

The Ceph Monitor (MON)
footnote:[Ceph Monitor http://docs.ceph.com/docs/luminous/start/intro/]
maintains a master copy of the cluster map. For high availability, you need to
have at least 3 monitors. One monitor will already be installed if you
used the installation wizard. You won't need more than 3 monitors, as long
as your cluster is small to medium-sized; only really large clusters will
need more than that.

On each node where you want to place a monitor (three monitors are
recommended), create one by using the 'Ceph -> Monitor' tab in the GUI or run:

[source,bash]
----
pveceph createmon
----

This will also install the needed Ceph Manager ('ceph-mgr') by default. If you
do not want to install a manager, specify the '-exclude-manager' option.
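
After you have created a monitor on each of your three nodes, you can, for
example, check that they have formed a quorum with:

[source,bash]
----
ceph mon stat
----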

[[pve_ceph_manager]]
Creating Ceph Manager
---------------------

The Manager daemon runs alongside the monitors, providing an interface for
monitoring the cluster. Since the Ceph Luminous release, the
ceph-mgr footnote:[Ceph Manager http://docs.ceph.com/docs/luminous/mgr/] daemon
is required. During monitor installation, the Ceph Manager will be installed
as well.

NOTE: It is recommended to install the Ceph Manager on the monitor nodes. For
high availability, install more than one manager.

[source,bash]
----
pveceph createmgr
----

[[pve_ceph_osds]]
Creating Ceph OSDs
------------------

[thumbnail="screenshot/gui-ceph-osd-status.png"]

You can create an OSD via the GUI or via the CLI as follows:

[source,bash]
----
pveceph createosd /dev/sd[X]
----

TIP: We recommend a Ceph cluster size starting with 12 OSDs, distributed
evenly among your nodes, of which there should be at least three (4 OSDs on
each node).

If the disk was in use before (e.g. for ZFS, RAID or another OSD), first
remove the partition table, boot sector and any other OSD leftovers. The
following command should be sufficient:

[source,bash]
----
ceph-volume lvm zap /dev/sd[X] --destroy
----

WARNING: The above command will destroy all data on the disk!

Ceph Bluestore
~~~~~~~~~~~~~~

Starting with the Ceph Kraken release, a new Ceph OSD storage type was
introduced, the so-called Bluestore
footnote:[Ceph Bluestore http://ceph.com/community/new-luminous-bluestore/].
This is the default when creating OSDs since Ceph Luminous.

[source,bash]
----
pveceph createosd /dev/sd[X]
----

.Block.db and block.wal

If you want to use a separate DB/WAL device for your OSDs, you can specify it
through the '-db_dev' and '-wal_dev' options. The WAL is placed with the DB,
if it is not specified separately.

[source,bash]
----
pveceph createosd /dev/sd[X] -db_dev /dev/sd[Y] -wal_dev /dev/sd[Z]
----

You can directly choose the size for those with the '-db_size' and '-wal_size'
parameters respectively. If they are not given, the following values (in
order) will be used:

* bluestore_block_{db,wal}_size from Ceph configuration...
** ... database, section 'osd'
** ... database, section 'global'
** ... file, section 'osd'
** ... file, section 'global'
* 10% (DB)/1% (WAL) of OSD size

NOTE: The DB stores BlueStore's internal metadata, and the WAL is BlueStore's
internal journal or write-ahead log. It is recommended to use a fast SSD or
NVRAM for better performance.
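
As a minimal sketch of the configuration-file route listed above, you could
set a default DB size in the 'osd' section of `ceph.conf` (the value is in
bytes; 64424509440 = 60 GiB is an illustrative choice):

----
[osd]
bluestore_block_db_size = 64424509440
----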

Ceph Filestore
~~~~~~~~~~~~~~

Before Ceph Luminous, Filestore was used as the default storage type for Ceph
OSDs. Starting with Ceph Nautilus, {pve} does not support creating such OSDs
with 'pveceph' anymore. If you still want to create filestore OSDs, use
'ceph-volume' directly.

[source,bash]
----
ceph-volume lvm create --filestore --data /dev/sd[X] --journal /dev/sd[Y]
----

[[pve_ceph_pools]]
Creating Ceph Pools
-------------------

[thumbnail="screenshot/gui-ceph-pools.png"]

A pool is a logical group for storing objects. It holds **P**lacement
**G**roups (`PG`, `pg_num`), a collection of objects.

When no options are given, we set a default of **128 PGs**, a **size of 3
replicas** and a **min_size of 2 replicas** for serving objects in a degraded
state.

NOTE: The default number of PGs works for 2-5 disks. Ceph throws a
'HEALTH_WARN' if you have too few or too many PGs in your cluster.

It is advised to calculate the PG number based on your setup. You can find
the formula and the PG calculator footnote:[PG calculator
http://ceph.com/pgcalc/] online. While PGs can be increased later on, they can
never be decreased.
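
As a worked example of the commonly used rule of thumb behind the calculator
(total PGs ≈ OSDs x 100 / pool size, rounded up to the next power of two; the
numbers are illustrative):

----
(12 OSDs * 100) / 3 replicas = 400  ->  next power of two: pg_num = 512
----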

You can create pools through the command line or in the GUI, on each PVE host,
under **Ceph -> Pools**.

[source,bash]
----
pveceph createpool <name>
----

If you would also like to automatically get a storage definition for your
pool, mark the checkbox "Add storages" in the GUI, or use the command line
option '--add_storages' at pool creation.
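
The resulting entry in `/etc/pve/storage.cfg` will look similar to the
following sketch (the name `mypool` is illustrative, and the exact set of
options may differ between versions):

----
rbd: mypool
        content images,rootdir
        krbd 0
        pool mypool
----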

Further information on Ceph pool handling can be found in the Ceph pool
operation footnote:[Ceph pool operation
http://docs.ceph.com/docs/luminous/rados/operations/pools/]
manual.

[[pve_ceph_device_classes]]
Ceph CRUSH & device classes
---------------------------
The foundation of Ceph is its algorithm, **C**ontrolled **R**eplication
**U**nder **S**calable **H**ashing
(CRUSH footnote:[CRUSH https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf]).

CRUSH calculates where to store and retrieve data from. This has the
advantage that no central indexing service is needed. CRUSH works using a map
of OSDs, buckets (device locations) and rulesets (data replication) for pools.

NOTE: Further information can be found in the Ceph documentation, under the
section CRUSH map footnote:[CRUSH map http://docs.ceph.com/docs/luminous/rados/operations/crush-map/].

This map can be altered to reflect different replication hierarchies. The
object replicas can be separated (e.g. across failure domains), while
maintaining the desired distribution.

A common use case is to use different classes of disks for different Ceph
pools. For this reason, Ceph introduced device classes with Luminous, to
accommodate the need for easy ruleset generation.

The device classes can be seen in the 'ceph osd tree' output. These classes
represent their own root bucket, which can be seen with the command below.

[source, bash]
----
ceph osd crush tree --show-shadow
----

Example output from the above command:

[source, bash]
----
ID  CLASS WEIGHT  TYPE NAME
-16  nvme 2.18307 root default~nvme
-13  nvme 0.72769     host sumi1~nvme
 12  nvme 0.72769         osd.12
-14  nvme 0.72769     host sumi2~nvme
 13  nvme 0.72769         osd.13
-15  nvme 0.72769     host sumi3~nvme
 14  nvme 0.72769         osd.14
 -1       7.70544 root default
 -3       2.56848     host sumi1
 12  nvme 0.72769         osd.12
 -5       2.56848     host sumi2
 13  nvme 0.72769         osd.13
 -7       2.56848     host sumi3
 14  nvme 0.72769         osd.14
----

To let a pool distribute its objects only on a specific device class, you
first need to create a ruleset for that class:

[source, bash]
----
ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
----

[frame="none",grid="none", align="left", cols="30%,70%"]
|===
|<rule-name>|name of the rule, to connect with a pool (seen in GUI & CLI)
|<root>|which CRUSH root it should belong to (default Ceph root "default")
|<failure-domain>|at which failure domain the objects should be distributed (usually host)
|<class>|what type of OSD backing store to use (e.g. nvme, ssd, hdd)
|===

Once the rule is in the CRUSH map, you can tell a pool to use the ruleset.

[source, bash]
----
ceph osd pool set <pool-name> crush_rule <rule-name>
----
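
For example, to pin a hypothetical pool named `fast-pool` to NVMe-backed OSDs
(both names are illustrative):

[source, bash]
----
# create a replicated rule restricted to the 'nvme' device class
ceph osd crush rule create-replicated nvme-only default host nvme
# let the pool place its objects according to that rule
ceph osd pool set fast-pool crush_rule nvme-only
----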

TIP: If the pool already contains objects, all of these have to be moved
accordingly. Depending on your setup, this may introduce a big performance hit
on your cluster. As an alternative, you can create a new pool and move disks
separately.

Ceph Client
-----------

[thumbnail="screenshot/gui-ceph-log.png"]

You can then configure {pve} to use such pools to store VM or
Container images. Simply use the GUI to add a new `RBD` storage (see
section xref:ceph_rados_block_devices[Ceph RADOS Block Devices (RBD)]).

You also need to copy the keyring to a predefined location for an external
Ceph cluster. If Ceph is installed on the {pve} nodes themselves, then this
will be done automatically.

NOTE: The filename needs to be `<storage_id>` + `.keyring`, where
`<storage_id>` is the expression after 'rbd:' in `/etc/pve/storage.cfg`. This
is `my-ceph-storage` in the following example:

[source,bash]
----
mkdir /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring
----

[[pveceph_fs]]
CephFS
------

Ceph also provides a filesystem, which runs on top of the same object storage
as RADOS block devices do. A **M**eta**d**ata **S**erver (`MDS`) is used to map
the RADOS backed objects to files and directories, allowing Ceph to provide a
POSIX-compliant, replicated filesystem. This allows you to easily have a
clustered, highly available, shared filesystem, if Ceph is already in use. Its
Metadata Servers guarantee that files are balanced out over the whole Ceph
cluster; this way, even high load will not overload a single host, which can
be an issue with traditional shared filesystem approaches, like `NFS`, for
example.

[thumbnail="screenshot/gui-node-ceph-cephfs-panel.png"]

{pve} supports both using an existing xref:storage_cephfs[CephFS as storage]
to save backups, ISO files or container templates, and creating a
hyper-converged CephFS itself.

[[pveceph_fs_mds]]
Metadata Server (MDS)
~~~~~~~~~~~~~~~~~~~~~

CephFS needs at least one Metadata Server to be configured and running, in
order to function. You can simply create one through the {pve} web GUI's
`Node -> CephFS` panel or on the command line with:

----
pveceph mds create
----

Multiple metadata servers can be created in a cluster, but with the default
settings, only one can be active at any time. If an MDS, or its node, becomes
unresponsive (or crashes), another `standby` MDS will get promoted to `active`.
You can speed up the handover between the active and a standby MDS by using
the 'hotstandby' parameter option on creation, or, if you have already created
it, you may set or add:

----
mds standby replay = true
----

in the respective MDS section of `ceph.conf`. With this enabled, the specified
MDS will always poll the active one, so that it can take over faster, as it is
in a `warm` state. Naturally, though, the active polling will cause some
additional performance impact on your system and the active `MDS`.
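
A minimal sketch of creating a standby MDS with this behavior enabled at
creation time (assuming the 'hotstandby' parameter mentioned above):

----
pveceph mds create --hotstandby
----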

.Multiple Active MDS

Since Luminous (12.2.x), you can also have multiple active metadata servers
running, but this is normally only useful for a high number of parallel
clients, as otherwise the `MDS` is seldom the bottleneck. If you want to set
this up, please refer to the Ceph documentation. footnote:[Configuring
multiple active MDS daemons http://docs.ceph.com/docs/luminous/cephfs/multimds/]

[[pveceph_fs_create]]
Create a CephFS
~~~~~~~~~~~~~~~

With {pve}'s integration of CephFS, you can easily create a CephFS using the
web GUI, the CLI or an external API interface. Some prerequisites are required
for this to work:

.Prerequisites for a successful CephFS setup:
- xref:pve_ceph_install[Install Ceph packages] - if this was already done some
  time ago, you may want to rerun it on an up-to-date system to ensure that
  all CephFS related packages get installed.
- xref:pve_ceph_monitors[Setup Monitors]
- xref:pve_ceph_osds[Setup your OSDs]
- xref:pveceph_fs_mds[Setup at least one MDS]

After this is complete, you can simply create a CephFS through either the web
GUI's `Node -> CephFS` panel or the command line tool `pveceph`, for example:

----
pveceph fs create --pg_num 128 --add-storage
----

This creates a CephFS named `'cephfs'' using a pool for its data named
`'cephfs_data'' with `128` placement groups and a pool for its metadata named
`'cephfs_metadata'' with one quarter of the data pool's placement groups
(`32`). Check the xref:pve_ceph_pools[{pve} managed Ceph pool chapter] or
visit the Ceph documentation for more information regarding an appropriate
placement group number (`pg_num`) for your setup footnote:[Ceph Placement
Groups http://docs.ceph.com/docs/luminous/rados/operations/placement-groups/].
Additionally, the `'--add-storage'' parameter will add the CephFS to the {pve}
storage configuration after it has been created successfully.
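
To verify the state of the new filesystem, its MDS and its pools, you can, for
example, run:

----
ceph fs status
----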

Destroy CephFS
~~~~~~~~~~~~~~

WARNING: Destroying a CephFS will render all of its data unusable. This cannot
be undone!

If you really want to destroy an existing CephFS, you first need to stop, or
destroy, all metadata servers (`MDS`). You can destroy them either via the web
GUI or via the command line interface, by issuing

----
pveceph mds destroy NAME
----
on each {pve} node hosting an MDS daemon.

Then, you can remove (destroy) the CephFS by issuing

----
ceph fs rm NAME --yes-i-really-mean-it
----
on a single node hosting Ceph. After this, you may want to remove the created
data and metadata pools. This can be done either via the web GUI or the CLI
with:

----
pveceph pool destroy NAME
----

Ceph monitoring and troubleshooting
-----------------------------------
A good start is to continuously monitor the Ceph health from the start of the
initial deployment. This can be done through the Ceph tools themselves, or by
accessing the status through the {pve} link:api-viewer/index.html[API].

The following Ceph commands can be used to see if the cluster is healthy
('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
below will also give you an overview of the current events and actions to
take.

----
# single time output
pve# ceph -s
# continuously output status changes (press CTRL+C to stop)
pve# ceph -w
----
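
The same status information is also exposed through the {pve} API. For
example, you can query it from the shell with (the node name is a
placeholder):

----
pve# pvesh get /nodes/<node>/ceph/status
----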

To get a more detailed view, every Ceph service has a log file under
`/var/log/ceph/`, and if there is not enough detail, the log level can be
adjusted footnote:[Ceph log and debugging http://docs.ceph.com/docs/luminous/rados/troubleshooting/log-and-debug/].

You can find more information about troubleshooting
footnote:[Ceph troubleshooting http://docs.ceph.com/docs/luminous/rados/troubleshooting/]
a Ceph cluster on the official website.

ifdef::manvolnum[]
include::pve-copyright.adoc[]
endif::manvolnum[]