Merge remote-tracking branch 'origin/quincy-stable-8' into quincy-stable-7

This commit is contained in:
Thomas Lamprecht 2023-11-02 17:21:22 +01:00
commit 692686d018
1290 changed files with 38032 additions and 16268 deletions

View File

@ -21,6 +21,13 @@
- The Signed-off-by line in every git commit is important; see [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst)
-->
## Contribution Guidelines
- To sign and title your commits, please refer to [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/main/SubmittingPatches.rst).
- If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to [Submitting Patches to Ceph - Backports](https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst) for the proper workflow.
- When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.
## Checklist
- Tracker (select at least one)
- [ ] References tracker ticket

View File

@ -11,6 +11,9 @@ jobs:
runs-on: ubuntu-latest
name: Verify
steps:
- name: Sleep for 30 seconds
run: sleep 30s
shell: bash
- name: Action
id: checklist
uses: ceph/ceph-pr-checklist-action@32e92d1a2a7c9991ed51de5fccb2296551373d60

View File

@ -1,7 +1,7 @@
cmake_minimum_required(VERSION 3.16)
project(ceph
VERSION 17.2.6
VERSION 17.2.7
LANGUAGES CXX C ASM)
cmake_policy(SET CMP0028 NEW)

View File

@ -1,3 +1,49 @@
>=17.2.7
--------
* `ceph mgr dump` command now displays the name of the mgr module that
registered a RADOS client in the `name` field added to elements of the
`active_clients` array. Previously, only the address of a module's RADOS
client was shown in the `active_clients` array.
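As an illustration only (the use of `jq` and the exact output shape are assumptions, not part of the release note), the new field can be inspected with:

    ceph mgr dump | jq '.active_clients[].name'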
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
undergone significant usability and design improvements to address the slow
backfill issue. Some important changes are:
* The 'balanced' profile is set as the default mClock profile because it
represents a compromise between prioritizing client IO and recovery IO. Users
can then choose either the 'high_client_ops' profile to prioritize client IO
or the 'high_recovery_ops' profile to prioritize recovery IO.
* QoS parameters like reservation and limit are now specified in terms of a
fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
* The cost parameters (osd_mclock_cost_per_io_usec_* and
osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
is now determined using the random IOPS and maximum sequential bandwidth
capability of the OSD's underlying device.
* Degraded object recovery is given higher priority when compared to misplaced
object recovery because degraded objects present a data safety issue not
present with objects that are merely misplaced. Therefore, backfilling
operations with the 'balanced' and 'high_client_ops' mClock profiles may
progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
scheduler.
* The QoS allocations in all the mClock profiles are optimized based on the above
fixes and enhancements.
* For more detailed information see:
https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
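For illustration, a minimal sketch of overriding the default 'balanced' profile (the option and profile names are those referenced above; see the linked documentation for the authoritative procedure):

    # prioritize client IO on all OSDs
    ceph config set osd osd_mclock_profile high_client_ops
    # verify the active profile on one OSD
    ceph config show osd.0 osd_mclock_profile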
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
multi-site. Previously, the replicas of such objects were corrupted on decryption.
A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
identify these original multipart uploads. The ``LastModified`` timestamp of any
identified object is incremented by 1ns to cause peer zones to replicate it again.
For multi-site deployments that make any use of Server-Side Encryption, we
recommend running this command against every bucket in every zone after all
zones have upgraded.
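A minimal sketch of running the new tool against a single bucket (the bucket name is a placeholder, and `--bucket` is assumed to follow the usual radosgw-admin convention):

    radosgw-admin bucket resync encrypted multipart --bucket=mybucket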
* CEPHFS: The MDS now evicts clients that are not advancing their request tids, because
such clients cause a large buildup of session metadata, which results in the MDS going
read-only when the RADOS operation exceeds the size threshold. The
`mds_session_metadata_threshold` config option controls the maximum size to which
(encoded) session metadata can grow.
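For example, the threshold could be adjusted with a command of the following form (the value shown is purely illustrative):

    ceph config set mds mds_session_metadata_threshold 16777216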
* CEPHFS: After recovering a Ceph File System following the disaster recovery
procedure, the recovered files under `lost+found` directory can now be deleted.
>=17.2.6
--------
@ -7,6 +53,60 @@
>=17.2.5
--------
>=19.0.0
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
multi-site. Previously, the replicas of such objects were corrupted on decryption.
A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
identify these original multipart uploads. The ``LastModified`` timestamp of any
identified object is incremented by 1ns to cause peer zones to replicate it again.
For multi-site deployments that make any use of Server-Side Encryption, we
recommend running this command against every bucket in every zone after all
zones have upgraded.
* CEPHFS: The MDS now evicts clients that are not advancing their request tids, because
such clients cause a large buildup of session metadata, which results in the MDS going
read-only when the RADOS operation exceeds the size threshold. The
`mds_session_metadata_threshold` config option controls the maximum size to which
(encoded) session metadata can grow.
* CephFS: For clusters with multiple CephFS file systems, all the snap-schedule
commands now expect the '--fs' argument.
* CephFS: The period specifier ``m`` now implies minutes and the period specifier
``M`` now implies months. This has been made consistent with the rest
of the system.
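A sketch of the new command form, assuming a file system named `cephfs` and an hourly schedule on the root path (both are placeholders):

    ceph fs snap-schedule add / 1h --fs cephfs
    ceph fs snap-schedule list / --fs cephfs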
* RGW: New tools have been added to radosgw-admin for identifying and
correcting issues with versioned bucket indexes. Historical bugs with the
versioned bucket index transaction workflow made it possible for the index
to accumulate extraneous "book-keeping" olh entries and plain placeholder
entries. In some specific scenarios where clients made concurrent requests
referencing the same object key, it was likely that a lot of extra index
entries would accumulate. When a significant number of these entries are
present in a single bucket index shard, they can cause high bucket listing
latencies and lifecycle processing failures. To check whether a versioned
bucket has unnecessary olh entries, users can now run ``radosgw-admin
bucket check olh``. If the ``--fix`` flag is used, the extra entries will
be safely removed. Separate from the issue described thus far, it is also
possible that some versioned buckets are maintaining extra unlinked objects
that are not listable from the S3/Swift APIs. These extra objects
are typically a result of PUT requests that exited abnormally, in the middle
of a bucket index transaction - so the client would not have received a
successful response. Bugs in prior releases made these unlinked objects easy
to reproduce with any PUT request that was made on a bucket that was actively
resharding. Besides the extra space that these hidden, unlinked objects
consume, there can be another side effect in certain scenarios, caused by
the nature of the failure mode that produced them, where a client of a bucket
that was a victim of this bug may find the object associated with the key to
be in an inconsistent state. To check whether a versioned bucket has unlinked
entries, users can now run ``radosgw-admin bucket check unlinked``. If the
``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
a third issue made it possible for versioned bucket index stats to be
accounted inaccurately. The tooling for recalculating versioned bucket stats
also had a bug, and was not previously capable of fixing these inaccuracies.
This release resolves those issues and users can now expect that the existing
``radosgw-admin bucket check`` command will produce correct results. We
recommend that users with versioned buckets, especially those that existed
on prior releases, use these new tools to check whether their buckets are
affected and to clean them up accordingly.
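For illustration, the checks described above can be run per bucket as follows (the bucket name is a placeholder; omit `--fix` to only report):

    radosgw-admin bucket check olh --bucket=mybucket --fix
    radosgw-admin bucket check unlinked --bucket=mybucket --fix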
>=18.0.0
* RBD: The semantics of compare-and-write C++ API (`Image::compare_and_write`
and `Image::aio_compare_and_write` methods) now match those of C API. Both
@ -47,6 +147,100 @@
If that is the case, in OSD logs the "You can be hit by THE DUPS BUG" warning
will be visible.
Relevant tracker: https://tracker.ceph.com/issues/53729
* RBD: The `rbd device unmap` command gained a `--namespace` option. Support for
namespaces was added to RBD in Nautilus 14.2.0, and since then it has been
possible to map and unmap images in namespaces using the `image-spec` syntax,
but the corresponding option that is available in most other commands was missing.
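A minimal sketch, with pool, namespace, and image names as placeholders; the first form uses the `image-spec` syntax that has worked since Nautilus, and the second uses the new option (its exact placement is assumed to mirror the other rbd commands):

    rbd device unmap mypool/mynamespace/myimage
    rbd device unmap --pool mypool --namespace mynamespace myimage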
* RGW: Compression is now supported for objects uploaded with Server-Side Encryption.
When both are enabled, compression is applied before encryption.
* RGW: the "pubsub" functionality for storing bucket notifications inside Ceph
is removed. Together with it, the "pubsub" zone should not be used anymore.
The REST operations, as well as radosgw-admin commands for manipulating
subscriptions, as well as fetching and acking the notifications are removed
as well.
In case that the endpoint to which the notifications are sent maybe down or
disconnected, it is recommended to use persistent notifications to guarantee
the delivery of the notifications. In case the system that consumes the
notifications needs to pull them (instead of the notifications be pushed
to it), an external message bus (e.g. rabbitmq, Kafka) should be used for
that purpose.
* RGW: The serialized format of notifications and topics has changed, so that
new/updated topics will be unreadable by old RGWs. We recommend completing
the RGW upgrades before creating or modifying any notification topics.
* RBD: Trailing newline in passphrase files (`<passphrase-file>` argument in
`rbd encryption format` command and `--encryption-passphrase-file` option
in other commands) is no longer stripped.
* RBD: Support for layered client-side encryption is added. Cloned images
can now be encrypted each with its own encryption format and passphrase,
potentially different from that of the parent image. The efficient
copy-on-write semantics intrinsic to unformatted (regular) cloned images
are retained.
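Both RBD encryption notes above can be illustrated with a short sketch, assuming `mypool/child` is an existing clone of `mypool/parent` (all names and passphrases are placeholders): `printf '%s'` writes the passphrase without a trailing newline, which now matters because the newline is no longer stripped, and the clone is formatted with its own passphrase independent of the parent's:

    # write passphrases without a trailing newline
    printf '%s' 'parent secret' > parent-pass.txt
    printf '%s' 'child secret' > child-pass.txt
    # format the parent image and, independently, its clone
    rbd encryption format mypool/parent luks2 parent-pass.txt
    rbd encryption format mypool/child luks2 child-pass.txt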
* CEPHFS: The `mds_max_retries_on_remount_failure` option has been renamed to
`client_max_retries_on_remount_failure` and moved from mds.yaml.in to
mds-client.yaml.in, because this option has only ever been used by the MDS
client.
* The `perf dump` and `perf schema` commands are deprecated in favor of new
`counter dump` and `counter schema` commands. These new commands add support
for labeled perf counters and also emit existing unlabeled perf counters. Some
unlabeled perf counters became labeled in this release, with more to follow in
future releases; such converted perf counters are no longer emitted by the
`perf dump` and `perf schema` commands.
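For example, assuming access to an OSD's admin socket (the invocation below mirrors how `perf dump` is typically run and is shown as a sketch, not the only form):

    ceph daemon osd.0 counter dump
    ceph daemon osd.0 counter schema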
* `ceph mgr dump` command now outputs `last_failure_osd_epoch` and
`active_clients` fields at the top level. Previously, these fields were
output under `always_on_modules` field.
* `ceph mgr dump` command now displays the name of the mgr module that
registered a RADOS client in the `name` field added to elements of the
`active_clients` array. Previously, only the address of a module's RADOS
client was shown in the `active_clients` array.
* RBD: All rbd-mirror daemon perf counters became labeled and as such are now
emitted only by the new `counter dump` and `counter schema` commands. As part
of the conversion, many also got renamed to better disambiguate journal-based
and snapshot-based mirroring.
* RBD: list-watchers C++ API (`Image::list_watchers`) now clears the passed
`std::list` before potentially appending to it, aligning with the semantics
of the corresponding C API (`rbd_watchers_list`).
* Telemetry: Users who are opted-in to telemetry can also opt-in to
participating in a leaderboard in the telemetry public
dashboards (https://telemetry-public.ceph.com/). Users can now also add a
description of the cluster to publicly appear in the leaderboard.
For more details, see:
https://docs.ceph.com/en/latest/mgr/telemetry/#leaderboard
See a sample report with `ceph telemetry preview`.
Opt-in to telemetry with `ceph telemetry on`.
Opt-in to the leaderboard with
`ceph config set mgr mgr/telemetry/leaderboard true`.
Add leaderboard description with:
`ceph config set mgr mgr/telemetry/leaderboard_description Cluster description`.
* CEPHFS: After recovering a Ceph File System following the disaster recovery
procedure, the recovered files under `lost+found` directory can now be deleted.
* core: cache-tiering is now deprecated.
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
undergone significant usability and design improvements to address the slow
backfill issue. Some important changes are:
* The 'balanced' profile is set as the default mClock profile because it
represents a compromise between prioritizing client IO and recovery IO. Users
can then choose either the 'high_client_ops' profile to prioritize client IO
or the 'high_recovery_ops' profile to prioritize recovery IO.
* QoS parameters like reservation and limit are now specified in terms of a
fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
* The cost parameters (osd_mclock_cost_per_io_usec_* and
osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
is now determined using the random IOPS and maximum sequential bandwidth
capability of the OSD's underlying device.
* Degraded object recovery is given higher priority when compared to misplaced
object recovery because degraded objects present a data safety issue not
present with objects that are merely misplaced. Therefore, backfilling
operations with the 'balanced' and 'high_client_ops' mClock profiles may
progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
scheduler.
* The QoS allocations in all the mClock profiles are optimized based on the above
fixes and enhancements.
* For more detailed information see:
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
* mgr/snap_schedule: The snap-schedule mgr module now retains one snapshot fewer
than the number specified by the config tunable `mds_max_snaps_per_dir`, so that
a new snapshot can be created and retained during the next schedule run.
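For example, to inspect or raise the tunable that the retention count is derived from (the value shown is illustrative):

    ceph config get mds mds_max_snaps_per_dir
    ceph config set mds mds_max_snaps_per_dir 150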
>=17.2.1

View File

@ -1,81 +1,107 @@
# Ceph - a scalable distributed storage system
Please see http://ceph.com/ for current info.
See https://ceph.com/ for current information about Ceph.
## Contributing Code
Most of Ceph is dual licensed under the LGPL version 2.1 or 3.0. Some
miscellaneous code is under a BSD-style license or is public domain.
The documentation is licensed under Creative Commons
Attribution Share Alike 3.0 (CC-BY-SA-3.0). There are a handful of headers
included here that are licensed under the GPL. Please see the file
COPYING for a full inventory of licenses by file.
Most of Ceph is dual-licensed under the LGPL version 2.1 or 3.0. Some
miscellaneous code is either public domain or licensed under a BSD-style
license.
Code contributions must include a valid "Signed-off-by" acknowledging
the license for the modified or contributed file. Please see the file
SubmittingPatches.rst for details on what that means and on how to
generate and submit patches.
The Ceph documentation is licensed under Creative Commons Attribution Share
Alike 3.0 (CC-BY-SA-3.0).
We do not require assignment of copyright to contribute code; code is
Some headers included in the `ceph/ceph` repository are licensed under the GPL.
See the file `COPYING` for a full inventory of licenses by file.
All code contributions must include a valid "Signed-off-by" line. See the file
`SubmittingPatches.rst` for details on this and instructions on how to generate
and submit patches.
Assignment of copyright is not required to contribute code. Code is
contributed under the terms of the applicable license.
## Checking out the source
You can clone from github with
Clone the ceph/ceph repository from github by running the following command on
a system that has git installed:
git clone git@github.com:ceph/ceph
or, if you are not a github user,
Alternatively, if you are not a github user, you should run the following
command on a system that has git installed:
git clone git://github.com/ceph/ceph
Ceph contains many git submodules that need to be checked out with
When the `ceph/ceph` repository has been cloned to your system, run the
following commands to move into the cloned `ceph/ceph` repository and to check
out the git submodules associated with it:
cd ceph
git submodule update --init --recursive
## Build Prerequisites
The list of Debian or RPM packages dependencies can be installed with:
*section last updated 27 Jul 2023*
Make sure that ``curl`` is installed. The Debian and Ubuntu ``apt`` command is
provided here, but if you use a system with a different package manager, then
you must use whatever command is the proper counterpart of this one:
apt install curl
Install Debian or RPM package dependencies by running the following command:
./install-deps.sh
Install the ``python3-routes`` package:
apt install python3-routes
## Building Ceph
Note that these instructions are meant for developers who are
compiling the code for development and testing. To build binaries
suitable for installation we recommend you build deb or rpm packages
or refer to the `ceph.spec.in` or `debian/rules` to see which
configuration options are specified for production builds.
These instructions are meant for developers who are compiling the code for
development and testing. To build binaries that are suitable for installation
we recommend that you build `.deb` or `.rpm` packages, or refer to
``ceph.spec.in`` or ``debian/rules`` to see which configuration options are
specified for production builds.
Build instructions:
To build Ceph, make sure that you are in the top-level `ceph` directory that
contains `do_cmake.sh` and `CONTRIBUTING.rst` and run the following commands:
./do_cmake.sh
cd build
ninja
(do_cmake.sh now defaults to creating a debug build of ceph that can
be up to 5x slower with some workloads. Please pass
"-DCMAKE_BUILD_TYPE=RelWithDebInfo" to do_cmake.sh to create a non-debug
release.
``do_cmake.sh`` by default creates a "debug build" of Ceph, which can be up to
five times slower than a non-debug build. Pass
``-DCMAKE_BUILD_TYPE=RelWithDebInfo`` to ``do_cmake.sh`` to create a non-debug
build.
The number of jobs used by `ninja` is derived from the number of CPU cores of
the building host if unspecified. Use the `-j` option to limit the job number
if the build jobs are running out of memory. On average, each job takes around
2.5GiB memory.)
[Ninja](https://ninja-build.org/) is the buildsystem used by the Ceph project
to build test builds. The number of jobs used by `ninja` is derived from the
number of CPU cores of the building host if unspecified. Use the `-j` option to
limit the job number if the build jobs are running out of memory. If you
attempt to run `ninja` and receive a message that reads `g++: fatal error:
Killed signal terminated program cc1plus`, then you have run out of memory.
Using the `-j` option with an argument appropriate to the hardware on which the
`ninja` command is run is expected to result in a successful build. For example,
to limit the job number to 3, run the command `ninja -j 3`. On average, each
`ninja` job run in parallel needs approximately 2.5 GiB of RAM.
This assumes you make your build dir a subdirectory of the ceph.git
checkout. If you put it elsewhere, just point `CEPH_GIT_DIR` to the correct
path to the checkout. Any additional CMake args can be specified by setting ARGS
before invoking do_cmake. See [cmake options](#cmake-options)
for more details. Eg.
This documentation assumes that your build directory is a subdirectory of the
`ceph.git` checkout. If the build directory is located elsewhere, point
`CEPH_GIT_DIR` to the correct path of the checkout. Additional CMake args can
be specified by setting ARGS before invoking ``do_cmake.sh``. See [cmake
options](#cmake-options) for more details. For example:
ARGS="-DCMAKE_C_COMPILER=gcc-7" ./do_cmake.sh
To build only certain targets use:
To build only certain targets, run a command of the following form:
ninja [target name]
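For example, to build only the OSD daemon with three parallel jobs (`ceph-osd` is given here as a typical target name):

    ninja -j 3 ceph-osd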
@ -130,24 +156,25 @@ are committed to git.)
## Running a test cluster
To run a functional test cluster,
From the `ceph/` directory, run the following commands to launch a test Ceph
cluster:
cd build
ninja vstart # builds just enough to run vstart
../src/vstart.sh --debug --new -x --localhost --bluestore
./bin/ceph -s
Almost all of the usual commands are available in the bin/ directory.
For example,
Most Ceph commands are available in the `bin/` directory. For example:
./bin/rados -p rbd bench 30 write
./bin/rbd create foo --size 1000
./bin/rados -p foo bench 30 write
To shut down the test cluster,
To shut down the test cluster, run the following command from the `build/`
directory:
../src/stop.sh
To start or stop individual daemons, the sysvinit script can be used:
Use the sysvinit script to start or stop individual daemons:
./bin/init-ceph restart osd.0
./bin/init-ceph stop

View File

@ -166,7 +166,7 @@
# main package definition
#################################################################################
Name: ceph
Version: 17.2.6
Version: 17.2.7
Release: 0%{?dist}
%if 0%{?fedora} || 0%{?rhel}
Epoch: 2
@ -182,7 +182,7 @@ License: LGPL-2.1 and LGPL-3.0 and CC-BY-SA-3.0 and GPL-2.0 and BSL-1.0 and BSD-
Group: System/Filesystems
%endif
URL: http://ceph.com/
Source0: %{?_remote_tarball_prefix}ceph-17.2.6.tar.bz2
Source0: %{?_remote_tarball_prefix}ceph-17.2.7.tar.bz2
%if 0%{?suse_version}
# _insert_obs_source_lines_here
ExclusiveArch: x86_64 aarch64 ppc64le s390x
@ -1274,7 +1274,7 @@ This package provides Ceph default alerts for Prometheus.
# common
#################################################################################
%prep
%autosetup -p1 -n ceph-17.2.6
%autosetup -p1 -n ceph-17.2.7
%build
# Disable lto on systems that do not support symver attribute
@ -1863,6 +1863,7 @@ fi
%{_datadir}/ceph/mgr/prometheus
%{_datadir}/ceph/mgr/rbd_support
%{_datadir}/ceph/mgr/restful
%{_datadir}/ceph/mgr/rgw
%{_datadir}/ceph/mgr/selftest
%{_datadir}/ceph/mgr/snap_schedule
%{_datadir}/ceph/mgr/stats

View File

@ -1863,6 +1863,7 @@ fi
%{_datadir}/ceph/mgr/prometheus
%{_datadir}/ceph/mgr/rbd_support
%{_datadir}/ceph/mgr/restful
%{_datadir}/ceph/mgr/rgw
%{_datadir}/ceph/mgr/selftest
%{_datadir}/ceph/mgr/snap_schedule
%{_datadir}/ceph/mgr/stats

View File

@ -1,3 +1,9 @@
ceph (17.2.7-1) stable; urgency=medium
* New upstream release
-- Ceph Release Team <ceph-maintainers@ceph.io> Wed, 25 Oct 2023 23:46:13 +0000
ceph (17.2.6-1) stable; urgency=medium
* New upstream release

View File

@ -1 +1,3 @@
lib/systemd/system/cephfs-mirror*
usr/bin/cephfs-mirror
usr/share/man/man8/cephfs-mirror.8

View File

@ -30,60 +30,52 @@ A Ceph Storage Cluster consists of multiple types of daemons:
- :term:`Ceph Manager`
- :term:`Ceph Metadata Server`
.. ditaa::
+---------------+ +---------------+ +---------------+ +---------------+
| OSDs | | Monitors | | Managers | | MDS |
+---------------+ +---------------+ +---------------+ +---------------+
A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
monitors ensures high availability should a monitor daemon fail. Storage cluster
clients retrieve a copy of the cluster map from the Ceph Monitor.
Ceph Monitors maintain the master copy of the cluster map, which they provide
to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures
availability in the event that one of the monitor daemons or its host fails.
The Ceph monitor provides copies of the cluster map to storage cluster clients.
A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
back to monitors.
A Ceph Manager acts as an endpoint for monitoring, orchestration, and plug-in
A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in
modules.
A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
provide file services.
Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
to efficiently compute information about data location, instead of having to
depend on a central lookup table. Ceph's high-level features include a
native interface to the Ceph Storage Cluster via ``librados``, and a number of
service interfaces built on top of ``librados``.
Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm
to compute information about data location. This means that clients and OSDs
are not bottlenecked by a central lookup table. Ceph's high-level features
include a native interface to the Ceph Storage Cluster via ``librados``, and a
number of service interfaces built on top of ``librados``.
Storing Data
------------
The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph File System` or a custom implementation you create using
``librados``-- which is stored as RADOS objects. Each object is stored on an
:term:`Object Storage Device`. Ceph OSD Daemons handle read, write, and
replication operations on storage drives. With the older Filestore back end,
each RADOS object was stored as a separate file on a conventional filesystem
(usually XFS). With the new and default BlueStore back end, objects are
stored in a monolithic database-like fashion.
:term:`Ceph File System`, or a custom implementation that you create by using
``librados``. The data received by the Ceph Storage Cluster is stored as RADOS
objects. Each object is stored on an :term:`Object Storage Device` (this is
also called an "OSD"). Ceph OSDs control read, write, and replication
operations on storage drives. The default BlueStore back end stores objects
in a monolithic, database-like fashion.
.. ditaa::
/-----\ +-----+ +-----+
| obj |------>| {d} |------>| {s} |
\-----/ +-----+ +-----+
/------\ +-----+ +-----+
| obj |------>| {d} |------>| {s} |
\------/ +-----+ +-----+
Object OSD Drive
Ceph OSD Daemons store data as objects in a flat namespace (e.g., no
hierarchy of directories). An object has an identifier, binary data, and
metadata consisting of a set of name/value pairs. The semantics are completely
up to :term:`Ceph Client`\s. For example, CephFS uses metadata to store file
attributes such as the file owner, created date, last modified date, and so
forth.
Ceph OSD Daemons store data as objects in a flat namespace. This means that
objects are not stored in a hierarchy of directories. An object has an
identifier, binary data, and metadata consisting of name/value pairs.
:term:`Ceph Client`\s determine the semantics of the object data. For example,
CephFS uses metadata to store file attributes such as the file owner, the
created date, and the last modified date.
.. ditaa::
@ -102,20 +94,23 @@ forth.
.. index:: architecture; high availability, scalability
.. _arch_scalability_and_high_availability:
Scalability and High Availability
---------------------------------
In traditional architectures, clients talk to a centralized component (e.g., a
gateway, broker, API, facade, etc.), which acts as a single point of entry to a
complex subsystem. This imposes a limit to both performance and scalability,
while introducing a single point of failure (i.e., if the centralized component
goes down, the whole system goes down, too).
In traditional architectures, clients talk to a centralized component. This
centralized component might be a gateway, a broker, an API, or a facade. A
centralized component of this kind acts as a single point of entry to a complex
subsystem. Architectures that rely upon such a centralized component have a
single point of failure and incur limits to performance and scalability. If
the centralized component goes down, the whole system becomes unavailable.
Ceph eliminates the centralized gateway to enable clients to interact with
Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster
of monitors to ensure high availability. To eliminate centralization, Ceph
uses an algorithm called CRUSH.
Ceph eliminates this centralized component. This enables clients to interact
with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes
to ensure data safety and high availability. Ceph also uses a cluster of
monitors to ensure high availability. To eliminate centralization, Ceph uses an
algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`.
.. index:: CRUSH; architecture
@ -124,15 +119,15 @@ CRUSH Introduction
~~~~~~~~~~~~~~~~~~
Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
Replication Under Scalable Hashing)` algorithm to efficiently compute
information about object location, instead of having to depend on a
central lookup table. CRUSH provides a better data management mechanism compared
to older approaches, and enables massive scale by cleanly distributing the work
to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
replication to ensure resiliency, which is better suited to hyper-scale storage.
The following sections provide additional details on how CRUSH works. For a
detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.
Replication Under Scalable Hashing)` algorithm to compute information about
object location instead of relying upon a central lookup table. CRUSH provides
a better data management mechanism than do older approaches, and CRUSH enables
massive scale by distributing the work to all the OSD daemons in the cluster
and all the clients that communicate with them. CRUSH uses intelligent data
replication to ensure resiliency, which is better suited to hyper-scale
storage. The following sections provide additional details on how CRUSH works.
For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable,
Decentralized Placement of Replicated Data`_.
.. index:: architecture; cluster map
@ -141,109 +136,130 @@ Placement of Replicated Data`_.
Cluster Map
~~~~~~~~~~~
Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
cluster topology, which is inclusive of 5 maps collectively referred to as the
"Cluster Map":
In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs
must have current information about the cluster's topology. Current information
is stored in the "Cluster Map", which is in fact a collection of five maps. The
five maps that constitute the cluster map are:
#. **The Monitor Map:** Contains the cluster ``fsid``, the position, name
address and port of each monitor. It also indicates the current epoch,
when the map was created, and the last time it changed. To view a monitor
map, execute ``ceph mon dump``.
#. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name,
the address, and the TCP port of each monitor. The monitor map specifies the
current epoch, the time of the monitor map's creation, and the time of the
monitor map's last modification. To view a monitor map, run ``ceph mon
dump``.
#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
last modified, a list of pools, replica sizes, PG numbers, a list of OSDs
and their status (e.g., ``up``, ``in``). To view an OSD map, execute
``ceph osd dump``.
#. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's
creation, the time of the OSD map's last modification, a list of pools, a
list of replica sizes, a list of PG numbers, and a list of OSDs and their
statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph
osd dump``.
#. **The PG Map:** Contains the PG version, its time stamp, the last OSD
map epoch, the full ratios, and details on each placement group such as
the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g.,
``active + clean``), and data usage statistics for each pool.
#. **The PG Map:** Contains the PG version, its time stamp, the last OSD map
epoch, the full ratios, and the details of each placement group. This
includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for
example, ``active + clean``), and data usage statistics for each pool.
#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
traversing the hierarchy when storing data. To view a CRUSH map, execute
``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
You can view the decompiled map in a text editor or with ``cat``.
hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``),
and rules for traversing the hierarchy when storing data. To view a CRUSH
map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by
running ``crushtool -d {comp-crushmap-filename} -o
{decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the
decompiled map.
#. **The MDS Map:** Contains the current MDS map epoch, when the map was
created, and the last time it changed. It also contains the pool for
storing metadata, a list of metadata servers, and which metadata servers
are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.
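For example, the CRUSH map can be fetched and decompiled as described above (the file names are placeholders):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    cat crushmap.txt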
Each map maintains an iterative history of its operating state changes. Ceph
Monitors maintain a master copy of the cluster map including the cluster
members, state, changes, and the overall health of the Ceph Storage Cluster.
Each map maintains a history of changes to its operating state. Ceph Monitors
maintain a master copy of the cluster map. This master copy includes the
cluster members, the state of the cluster, changes to the cluster, and
information recording the overall health of the Ceph Storage Cluster.
.. index:: high availability; monitor architecture
High Availability Monitors
~~~~~~~~~~~~~~~~~~~~~~~~~~
Before Ceph Clients can read or write data, they must contact a Ceph Monitor
to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
can operate with a single monitor; however, this introduces a single
point of failure (i.e., if the monitor goes down, Ceph Clients cannot
read or write data).
A Ceph Client must contact a Ceph Monitor and obtain a current copy of the
cluster map in order to read data from or to write data to the Ceph cluster.
For added reliability and fault tolerance, Ceph supports a cluster of monitors.
In a cluster of monitors, latency and other faults can cause one or more
monitors to fall behind the current state of the cluster. For this reason, Ceph
must have agreement among various monitor instances regarding the state of the
cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.)
and the `Paxos`_ algorithm to establish a consensus among the monitors about the
current state of the cluster.
It is possible for a Ceph cluster to function properly with only a single
monitor, but a Ceph cluster that has only a single monitor has a single point
of failure: if the monitor goes down, Ceph clients will be unable to read data
from or write data to the cluster.
For details on configuring monitors, see the `Monitor Config Reference`_.
Ceph leverages a cluster of monitors in order to increase reliability and fault
tolerance. When a cluster of monitors is used, however, one or more of the
monitors in the cluster can fall behind due to latency or other faults. Ceph
mitigates these negative effects by requiring multiple monitor instances to
agree about the state of the cluster. To establish consensus among the monitors
regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a
majority of monitors (for example, one in a cluster that contains only one
monitor, two in a cluster that contains three monitors, three in a cluster that
contains five monitors, four in a cluster that contains six monitors, and so
on).
See the `Monitor Config Reference`_ for more detail on configuring monitors.
.. index:: architecture; high availability authentication
.. _arch_high_availability_authentication:
High Availability Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To identify users and protect against man-in-the-middle attacks, Ceph provides
its ``cephx`` authentication system to authenticate users and daemons.
The ``cephx`` authentication system is used by Ceph to authenticate users and
daemons and to protect against man-in-the-middle attacks.
.. note:: The ``cephx`` protocol does not address data encryption in transport
(e.g., SSL/TLS) or encryption at rest.
(for example, SSL/TLS) or encryption at rest.
Cephx uses shared secret keys for authentication, meaning both the client and
the monitor cluster have a copy of the client's secret key. The authentication
protocol is such that both parties are able to prove to each other they have a
copy of the key without actually revealing it. This provides mutual
authentication, which means the cluster is sure the user possesses the secret
key, and the user is sure that the cluster has a copy of the secret key.
``cephx`` uses shared secret keys for authentication. This means that both the
client and the monitor cluster keep a copy of the client's secret key.
A key scalability feature of Ceph is to avoid a centralized interface to the
Ceph object store, which means that Ceph clients must be able to interact with
OSDs directly. To protect data, Ceph provides its ``cephx`` authentication
system, which authenticates users operating Ceph clients. The ``cephx`` protocol
operates in a manner with behavior similar to `Kerberos`_.
The ``cephx`` protocol makes it possible for each party to prove to the other
that it has a copy of the key without revealing it. This provides mutual
authentication and allows the cluster to confirm (1) that the user has the
secret key and (2) that the user can be confident that the cluster has a copy
of the secret key.
A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each
monitor can authenticate users and distribute keys, so there is no single point
of failure or bottleneck when using ``cephx``. The monitor returns an
authentication data structure similar to a Kerberos ticket that contains a
session key for use in obtaining Ceph services. This session key is itself
encrypted with the user's permanent secret key, so that only the user can
request services from the Ceph Monitor(s). The client then uses the session key
to request its desired services from the monitor, and the monitor provides the
client with a ticket that will authenticate the client to the OSDs that actually
handle data. Ceph Monitors and OSDs share a secret, so the client can use the
ticket provided by the monitor with any OSD or metadata server in the cluster.
Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired
ticket or session key obtained surreptitiously. This form of authentication will
prevent attackers with access to the communications medium from either creating
bogus messages under another user's identity or altering another user's
legitimate messages, as long as the user's secret key is not divulged before it
expires.
As stated in :ref:`Scalability and High Availability
<arch_scalability_and_high_availability>`, Ceph does not have any centralized
interface between clients and the Ceph object store. By avoiding such a
centralized interface, Ceph avoids the bottlenecks that attend such centralized
interfaces. However, this means that clients must interact directly with OSDs.
Direct interactions between Ceph clients and OSDs require authenticated
connections. The ``cephx`` authentication system establishes and sustains these
authenticated connections.
To use ``cephx``, an administrator must set up users first. In the following
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
The ``cephx`` protocol operates in a manner similar to `Kerberos`_.
A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each
monitor can authenticate users and distribute keys, which means that there is
no single point of failure and no bottleneck when using ``cephx``. The monitor
returns an authentication data structure that is similar to a Kerberos ticket.
This authentication data structure contains a session key for use in obtaining
Ceph services. The session key is itself encrypted with the user's permanent
secret key, which means that only the user can request services from the Ceph
Monitors. The client then uses the session key to request services from the
monitors, and the monitors provide the client with a ticket that authenticates
the client against the OSDs that actually handle data. Ceph Monitors and OSDs
share a secret, which means that the clients can use the ticket provided by the
monitors to authenticate against any OSD or metadata server in the cluster.
Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an
expired ticket or session key that has been obtained surreptitiously. This form
of authentication prevents attackers who have access to the communications
medium from creating bogus messages under another user's identity and prevents
attackers from altering another user's legitimate messages, as long as the
user's secret key is not divulged before it expires.
An administrator must set up users before using ``cephx``. In the following
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
the command line to generate a username and secret key. Ceph's ``auth``
subsystem generates the username and key, stores a copy with the monitor(s) and
transmits the user's secret back to the ``client.admin`` user. This means that
subsystem generates the username and key, stores a copy on the monitor(s), and
transmits the user's secret back to the ``client.admin`` user. This means that
the client and the monitor share a secret key.
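As a sketch only (the user name and capabilities are placeholders; the command itself is the one named above):

    ceph auth get-or-create-key client.alice mon 'allow r' osd 'allow rw pool=mypool'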
.. note:: The ``client.admin`` user must provide the user ID and
@ -262,17 +278,16 @@ the client and the monitor share a secret key.
| transmit key |
| |
To authenticate with the monitor, the client passes in the user name to the
monitor, and the monitor generates a session key and encrypts it with the secret
key associated to the user name. Then, the monitor transmits the encrypted
ticket back to the client. The client then decrypts the payload with the shared
secret key to retrieve the session key. The session key identifies the user for
the current session. The client then requests a ticket on behalf of the user
signed by the session key. The monitor generates a ticket, encrypts it with the
user's secret key and transmits it back to the client. The client decrypts the
ticket and uses it to sign requests to OSDs and metadata servers throughout the
cluster.
Here is how a client authenticates with a monitor. The client passes the user
name to the monitor. The monitor generates a session key that is encrypted with
the secret key associated with the ``username``. The monitor transmits the
encrypted ticket to the client. The client uses the shared secret key to
decrypt the payload. The session key identifies the user, and this act of
identification will last for the duration of the session. The client requests
a ticket for the user, and the ticket is signed with the session key. The
monitor generates a ticket and uses the user's secret key to encrypt it. The
encrypted ticket is transmitted to the client. The client decrypts the ticket
and uses it to sign requests to OSDs and to metadata servers in the cluster.
.. ditaa::
@ -302,10 +317,11 @@ cluster.
|<----+ |
The ``cephx`` protocol authenticates ongoing communications between the client
machine and the Ceph servers. Each message sent between a client and server,
subsequent to the initial authentication, is signed using a ticket that the
monitors, OSDs and metadata servers can verify with their shared secret.
The ``cephx`` protocol authenticates ongoing communications between the clients
and Ceph daemons. After initial authentication, each message sent between a
client and a daemon is signed using a ticket that can be verified by monitors,
OSDs, and metadata daemons. This ticket is verified by using the secret shared
between the client and the daemon.
.. ditaa::
@ -341,83 +357,93 @@ monitors, OSDs and metadata servers can verify with their shared secret.
|<-------------------------------------------|
receive response
The protection offered by this authentication is between the Ceph client and the
Ceph server hosts. The authentication is not extended beyond the Ceph client. If
the user accesses the Ceph client from a remote host, Ceph authentication is not
This authentication protects only the connections between Ceph clients and Ceph
daemons. The authentication is not extended beyond the Ceph client. If a user
accesses the Ceph client from a remote host, cephx authentication will not be
applied to the connection between the user's host and the client host.
See `Cephx Config Guide`_ for more on configuration details.
For configuration details, see `Cephx Config Guide`_. For user management
details, see `User Management`_.
See `User Management`_ for more on user management.
See :ref:`A Detailed Description of the Cephx Authentication Protocol
<cephx_2012_peter>` for more on the distinction between authorization and
authentication and for a step-by-step explanation of the setup of ``cephx``
tickets and session keys.
.. index:: architecture; smart daemons and scalability
Smart Daemons Enable Hyperscale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A feature of many storage clusters is a centralized interface that keeps track
of the nodes that clients are permitted to access. Such centralized
architectures provide services to clients by means of a double dispatch. At the
petabyte-to-exabyte scale, such double dispatches are a significant
bottleneck.
In many clustered architectures, the primary purpose of cluster membership is
so that a centralized interface knows which nodes it can access. Then the
centralized interface provides services to the client through a double
dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.
Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are
cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph
OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly
with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being
cluster-aware makes it possible for Ceph clients to interact directly with Ceph
OSD Daemons.
Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD
Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with
other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients
to interact directly with Ceph OSD Daemons.
Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one
another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM
resources of the nodes in the Ceph cluster. This means that a Ceph cluster can
easily perform tasks that a cluster with a centralized interface would struggle
to perform. The ability of Ceph nodes to make use of the computing power of
the greater cluster provides several benefits:
The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
nodes to easily perform tasks that would bog down a centralized server. The
ability to leverage this computing power leads to several major benefits:
#. **OSDs Service Clients Directly:** Network devices can support only a
limited number of concurrent connections. Because Ceph clients contact
Ceph OSD daemons directly without first connecting to a central interface,
Ceph enjoys improved performance and increased system capacity relative to
storage redundancy strategies that include a central interface. Ceph clients
maintain sessions only when needed, and maintain those sessions with only
particular Ceph OSD daemons, not with a centralized interface.
#. **OSDs Service Clients Directly:** Since any network device has a limit to
the number of concurrent connections it can support, a centralized system
has a low physical limit at high scales. By enabling Ceph Clients to contact
Ceph OSD Daemons directly, Ceph increases both performance and total system
capacity simultaneously, while removing a single point of failure. Ceph
Clients can maintain a session when they need to, and with a particular Ceph
OSD Daemon instead of a centralized server.
#. **OSD Membership and Status**: When Ceph OSD Daemons join a cluster, they
report their status. At the lowest level, the Ceph OSD Daemon status is
``up`` or ``down``: this reflects whether the Ceph OSD daemon is running and
able to service Ceph Client requests. If a Ceph OSD Daemon is ``down`` and
``in`` the Ceph Storage Cluster, this status may indicate the failure of the
Ceph OSD Daemon. If a Ceph OSD Daemon is not running because it has crashed,
the Ceph OSD Daemon cannot notify the Ceph Monitor that it is ``down``. The
OSDs periodically send messages to the Ceph Monitor (in releases prior to
Luminous, this was done by means of ``MPGStats``, and beginning with the
Luminous release, this has been done with ``MOSDBeacon``). If the Ceph
Monitors receive no such message after a configurable period of time,
then they mark the OSD ``down``. This mechanism is a failsafe, however.
Normally, Ceph OSD Daemons determine if a neighboring OSD is ``down`` and
report it to the Ceph Monitors. This contributes to making Ceph Monitors
lightweight processes. See `Monitoring OSDs`_ and `Heartbeats`_ for
additional details.
#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
or ``down`` reflecting whether or not it is running and able to service
Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
Storage Cluster, this status may indicate the failure of the Ceph OSD
Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs
periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous,
and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that
message after a configurable period of time then it marks the OSD down.
This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
determine if a neighboring OSD is down and report it to the Ceph Monitor(s).
This assures that Ceph Monitors are lightweight processes. See `Monitoring
OSDs`_ and `Heartbeats`_ for additional details.
#. **Data Scrubbing:** To maintain data consistency, Ceph OSD Daemons scrub
RADOS objects. Ceph OSD Daemons compare the metadata of their own local
objects against the metadata of the replicas of those objects, which are
stored on other OSDs. Scrubbing occurs on a per-Placement-Group basis, finds
mismatches in object size and finds metadata mismatches, and is usually
performed daily. Ceph OSD Daemons perform deeper scrubbing by comparing the
data in objects, bit-for-bit, against their checksums. Deep scrubbing finds
bad sectors on drives that are not detectable with light scrubs. See `Data
Scrubbing`_ for details on configuring scrubbing.
#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
Ceph OSD Daemons can scrub objects. That is, Ceph OSD Daemons can compare
their local objects metadata with its replicas stored on other OSDs. Scrubbing
happens on a per-Placement Group base. Scrubbing (usually performed daily)
catches mismatches in size and other metadata. Ceph OSD Daemons also perform deeper
scrubbing by comparing data in objects bit-for-bit with their checksums.
Deep scrubbing (usually performed weekly) finds bad sectors on a drive that
weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
configuring scrubbing.
#. **Replication:** Data replication involves a collaboration between Ceph
Clients and Ceph OSD Daemons. Ceph OSD Daemons use the CRUSH algorithm to
determine the storage location of object replicas. Ceph clients use the
CRUSH algorithm to determine the storage location of an object, then the
object is mapped to a pool and to a placement group, and then the client
consults the CRUSH map to identify the placement group's primary OSD.
#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
objects should be stored (and for rebalancing). In a typical write scenario,
a client uses the CRUSH algorithm to compute where to store an object, maps
the object to a pool and placement group, then looks at the CRUSH map to
identify the primary OSD for the placement group.
The client writes the object to the identified placement group in the
primary OSD. Then, the primary OSD with its own copy of the CRUSH map
identifies the secondary and tertiary OSDs for replication purposes, and
replicates the object to the appropriate placement groups in the secondary
and tertiary OSDs (as many OSDs as additional replicas), and responds to the
client once it has confirmed the object was stored successfully.
After identifying the target placement group, the client writes the object
to the identified placement group's primary OSD. The primary OSD then
consults its own copy of the CRUSH map to identify secondary and tertiary
OSDS, replicates the object to the placement groups in those secondary and
tertiary OSDs, confirms that the object was stored successfully in the
secondary and tertiary OSDs, and reports to the client that the object
was stored successfully.
.. ditaa::
@ -444,19 +470,18 @@ ability to leverage this computing power leads to several major benefits:
| | | |
+---------------+ +---------------+
With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
clients from that duty, while ensuring high data availability and data safety.
By performing this act of data replication, Ceph OSD Daemons relieve Ceph
clients of the burden of replicating data.
Dynamic Cluster Management
--------------------------
In the `Scalability and High Availability`_ section, we explained how Ceph uses
CRUSH, cluster awareness and intelligent daemons to scale and maintain high
CRUSH, cluster topology, and intelligent daemons to scale and maintain high
availability. Key to Ceph's design is the autonomous, self-healing, and
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
enable modern cloud storage infrastructures to place data, rebalance the cluster
and recover from faults dynamically.
enable modern cloud storage infrastructures to place data, rebalance the
cluster, and recover from faults adaptively.
.. index:: architecture; pools
@ -465,10 +490,11 @@ About Pools
The Ceph storage system supports the notion of 'Pools', which are logical
partitions for storing objects.
Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
pools. The pool's ``size`` or number of replicas, the CRUSH rule and the
number of placement groups determine how Ceph will place the data.
Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write RADOS
objects to pools. The way that Ceph places the data in the pools is determined
by the pool's ``size`` or number of replicas, the CRUSH rule, and the number of
placement groups in the pool.
.. ditaa::
@ -501,20 +527,23 @@ See `Set Pool Values`_ for details.
Mapping PGs to OSDs
~~~~~~~~~~~~~~~~~~~
Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
When a Ceph Client stores objects, CRUSH will map each object to a placement
group.
Each pool has a number of placement groups (PGs) within it. CRUSH dynamically
maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS
object to a PG.
Mapping objects to placement groups creates a layer of indirection between the
Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
Client "knew" which Ceph OSD Daemon had which object, that would create a tight
coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
algorithm maps each object to a placement group and then maps each placement
group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
come online. The following diagram depicts how CRUSH maps objects to placement
groups, and placement groups to OSDs.
This mapping of RADOS objects to PGs implements an abstraction and indirection
layer between Ceph OSD Daemons and Ceph Clients. The Ceph Storage Cluster must
be able to grow (or shrink) and redistribute data adaptively when the internal
topology changes.
If the Ceph Client "knew" which Ceph OSD Daemons were storing which objects, a
tight coupling would exist between the Ceph Client and the Ceph OSD Daemon.
But Ceph avoids any such tight coupling. Instead, the CRUSH algorithm maps each
RADOS object to a placement group and then maps each placement group to one or
more Ceph OSD Daemons. This "layer of indirection" allows Ceph to rebalance
dynamically when new Ceph OSD Daemons and their underlying OSD devices come
online. The following diagram shows how the CRUSH algorithm maps objects to
placement groups, and how it maps placement groups to OSDs.
.. ditaa::
@ -540,44 +569,45 @@ groups, and placement groups to OSDs.
| | | | | | | |
\----------/ \----------/ \----------/ \----------/
With a copy of the cluster map and the CRUSH algorithm, the client can compute
exactly which OSD to use when reading or writing a particular object.
The client uses its copy of the cluster map and the CRUSH algorithm to compute
precisely which OSD it will use when reading or writing a particular object.
.. index:: architecture; calculating PG IDs
Calculating PG IDs
~~~~~~~~~~~~~~~~~~
When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
`Cluster Map`_. With the cluster map, the client knows about all of the monitors,
OSDs, and metadata servers in the cluster. **However, it doesn't know anything
about object locations.**
When a Ceph Client binds to a Ceph Monitor, it retrieves the latest version of
the `Cluster Map`_. When a client has been equipped with a copy of the cluster
map, it is aware of all the monitors, OSDs, and metadata servers in the
cluster. **However, even equipped with a copy of the latest version of the
cluster map, the client doesn't know anything about object locations.**
.. epigraph::
**Object locations must be computed.**
Object locations get computed.
The client requires only the object ID and the name of the pool in order to
compute the object location.
Ceph stores data in named pools (for example, "liverpool"). When a client
stores a named object (for example, "john", "paul", "george", or "ringo") it
calculates a placement group by using the object name, a hash code, the number
of PGs in the pool, and the pool name. Ceph clients use the following steps to
compute PG IDs.
The only input required by the client is the object ID and the pool.
It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.)
it calculates a placement group using the object name, a hash code, the
number of PGs in the pool and the pool name. Ceph clients use the following
steps to compute PG IDs.
#. The client inputs the pool name and the object ID. (for example: pool =
"liverpool" and object-id = "john")
#. Ceph hashes the object ID.
#. Ceph calculates the hash, modulo the number of PGs (for example: ``58``), to
get a PG ID.
#. Ceph uses the pool name to retrieve the pool ID: (for example: "liverpool" =
``4``)
#. Ceph prepends the pool ID to the PG ID (for example: ``4.58``).
#. The client inputs the pool name and the object ID. (e.g., pool = "liverpool"
and object-id = "john")
#. Ceph takes the object ID and hashes it.
#. Ceph calculates the hash modulo the number of PGs (e.g., ``58``) to get
a PG ID.
#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``).
#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
Computing object locations is much faster than performing object location query
over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
Hashing)` algorithm allows a client to compute where objects *should* be stored,
and enables the client to contact the primary OSD to store or retrieve the
objects.
It is much faster to compute object locations than to perform object location
query over a chatty session. The :abbr:`CRUSH (Controlled Replication Under
Scalable Hashing)` algorithm allows a client to compute where objects are
expected to be stored, and enables the client to contact the primary OSD to
store or retrieve the objects.
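This client-side computation can be cross-checked from the command line. For example,
assuming a hypothetical pool named ``liverpool`` and an object named ``john``, the
following command shows the placement group and the OSDs that the object maps to::

   ceph osd map liverpool john

The output includes the PG ID (for example, ``4.58``) along with the up and acting
sets of OSDs that CRUSH computed for that PG.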
.. index:: architecture; PG Peering
@ -585,46 +615,51 @@ Peering and Sets
~~~~~~~~~~~~~~~~
In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons
do is called 'peering', which is the process of bringing all of the OSDs that
store a Placement Group (PG) into agreement about the state of all of the
objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve
themselves; however, if the problem persists, you may need to refer to the
`Troubleshooting Peering Failure`_ section.
heartbeats and report back to Ceph Monitors. Ceph OSD daemons also 'peer',
which is the process of bringing all of the OSDs that store a Placement Group
(PG) into agreement about the state of all of the RADOS objects (and their
metadata) in that PG. Ceph OSD Daemons `Report Peering Failure`_ to the Ceph
Monitors. Peering issues usually resolve themselves; however, if the problem
persists, you may need to refer to the `Troubleshooting Peering Failure`_
section.
.. Note:: Agreeing on the state does not mean that the PGs have the latest contents.
.. Note:: PGs that have agreed on the state of their objects do not
necessarily have the current data yet.
The Ceph Storage Cluster was designed to store at least two copies of an object
(i.e., ``size = 2``), which is the minimum requirement for data safety. For high
availability, a Ceph Storage Cluster should store more than two copies of an object
(e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a
``degraded`` state while maintaining data safety.
(that is, ``size = 2``), which is the minimum requirement for data safety. For
high availability, a Ceph Storage Cluster should store more than two copies of
an object (that is, ``size = 3`` and ``min size = 2``) so that it can continue
to run in a ``degraded`` state while maintaining data safety.
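As a brief illustration of these settings (the pool name ``mypool`` is hypothetical),
the replica counts of a replicated pool can be adjusted with commands of the following
form::

   ceph osd pool set mypool size 3
   ceph osd pool set mypool min_size 2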
Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
the *Primary* is the first OSD in the *Acting Set*, and is responsible for
coordinating the peering process for each placement group where it acts as
the *Primary*, and is the **ONLY** OSD that that will accept client-initiated
writes to objects for a given placement group where it acts as the *Primary*.
.. warning:: Although we say here that R2 (replication with two copies) is the
minimum requirement for data safety, R3 (replication with three copies) is
recommended. On a long enough timeline, data stored with an R2 strategy will
be lost.
When a series of OSDs are responsible for a placement group, that series of
OSDs, we refer to them as an *Acting Set*. An *Acting Set* may refer to the Ceph
OSD Daemons that are currently responsible for the placement group, or the Ceph
OSD Daemons that were responsible for a particular placement group as of some
As explained in the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (for example, ``osd.0``, ``osd.1``,
etc.), but rather refer to them as *Primary*, *Secondary*, and so forth. By
convention, the *Primary* is the first OSD in the *Acting Set*, and is
responsible for orchestrating the peering process for each placement group
where it acts as the *Primary*. The *Primary* is the **ONLY** OSD in a given
placement group that accepts client-initiated writes to objects.
The set of OSDs that is responsible for a placement group is called the
*Acting Set*. The term "*Acting Set*" can refer either to the Ceph OSD Daemons
that are currently responsible for the placement group, or to the Ceph OSD
Daemons that were responsible for a particular placement group as of some
epoch.
The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``.
When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
Daemons when an OSD fails.
.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
removed from the *Up Set*.
The Ceph OSD daemons that are part of an *Acting Set* might not always be
``up``. When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*.
The *Up Set* is an important distinction, because Ceph can remap PGs to other
Ceph OSD Daemons when an OSD fails.
.. note:: Consider a hypothetical *Acting Set* for a PG that contains
``osd.25``, ``osd.32`` and ``osd.61``. The first OSD (``osd.25``) is the
*Primary*. If that OSD fails, the Secondary (``osd.32``) becomes the
*Primary*, and ``osd.25`` is removed from the *Up Set*.
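The up set and acting set of any PG can be inspected directly. For example, for a
hypothetical PG ``4.58``::

   ceph pg map 4.58

The output reports the OSD map epoch together with the up set and the acting set,
which are identical during normal operation and diverge temporarily while a PG is
being remapped.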
.. index:: architecture; Rebalancing
@ -1467,11 +1502,11 @@ Ceph Clients
Ceph Clients include a number of service interfaces. These include:
- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
provides resizable, thin-provisioned block devices with snapshotting and
cloning. Ceph stripes a block device across the cluster for high
performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
that uses ``librbd`` directly--avoiding the kernel object overhead for
- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
provides resizable, thin-provisioned block devices that can be snapshotted
and cloned. Ceph stripes a block device across the cluster for high
performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
that uses ``librbd`` directly--avoiding the kernel object overhead for
virtualized systems.
- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service

View File

@ -11,9 +11,9 @@ Run a command of this form to list hosts associated with the cluster:
.. prompt:: bash #
ceph orch host ls [--format yaml] [--host-pattern <name>] [--label <label>] [--host-status <status>]
ceph orch host ls [--format yaml] [--host-pattern <name>] [--label <label>] [--host-status <status>] [--detail]
In commands of this form, the arguments "host-pattern", "label" and
In commands of this form, the arguments "host-pattern", "label", and
"host-status" are optional and are used for filtering.
- "host-pattern" is a regex that matches against hostnames and returns only
@ -25,6 +25,16 @@ In commands of this form, the arguments "host-pattern", "label" and
against name, label and status simultaneously, or to filter against any
proper subset of name, label and status.
The "detail" parameter provides more host related information for cephadm based
clusters. For example:
.. prompt:: bash #
# ceph orch host ls --detail
HOSTNAME ADDRESS LABELS STATUS VENDOR/MODEL CPU HDD SSD NIC
ceph-master 192.168.122.73 _admin QEMU (Standard PC (Q35 + ICH9, 2009)) 4C/4T 4/1.6TB - 1
1 hosts in cluster
.. _cephadm-adding-hosts:
Adding Hosts
@ -193,10 +203,18 @@ Place a host in and out of maintenance mode (stops all Ceph daemons on host):
.. prompt:: bash #
ceph orch host maintenance enter <hostname> [--force]
ceph orch host maintenance enter <hostname> [--force] [--yes-i-really-mean-it]
ceph orch host maintenance exit <hostname>
Where the force flag when entering maintenance allows the user to bypass warnings (but not alerts)
The ``--force`` flag allows the user to bypass warnings (but not alerts). The ``--yes-i-really-mean-it``
flag bypasses all safety checks and will attempt to force the host into maintenance mode no
matter what.
.. warning:: Using the --yes-i-really-mean-it flag to force the host to enter maintenance
mode can potentially cause loss of data availability, the mon quorum to break down due
to too few running monitors, mgr module commands (such as ``ceph orch ...`` commands)
to become unresponsive, and a number of other possible issues. Please only use this
flag if you're absolutely certain you know what you're doing.
See also :ref:`cephadm-fqdn`
@ -269,7 +287,7 @@ create a new CRUSH host located in the specified hierarchy.
.. note::
The ``location`` attribute will only affect the initial CRUSH location. Subsequent
changes of the ``location`` property will be ignored. Also, removing a host will no remove
changes of the ``location`` property will be ignored. Also, removing a host will not remove
any CRUSH buckets.
See also :ref:`crush_map_default_types`.

View File

@ -142,6 +142,9 @@ cluster's first "monitor daemon", and that monitor daemon needs an IP address.
You must pass the IP address of the Ceph cluster's first host to the ``ceph
bootstrap`` command, so you'll need to know the IP address of that host.
.. important:: ``ssh`` must be installed and running in order for the
bootstrapping procedure to succeed.
.. note:: If there are multiple networks and interfaces, be sure to choose one
that will be accessible by any host accessing the Ceph cluster.
@ -288,18 +291,21 @@ its status with:
Adding Hosts
============
Next, add all hosts to the cluster by following :ref:`cephadm-adding-hosts`.
Add all hosts to the cluster by following the instructions in
:ref:`cephadm-adding-hosts`.
By default, a ``ceph.conf`` file and a copy of the ``client.admin`` keyring
are maintained in ``/etc/ceph`` on all hosts with the ``_admin`` label, which is initially
applied only to the bootstrap host. We usually recommend that one or more other hosts be
given the ``_admin`` label so that the Ceph CLI (e.g., via ``cephadm shell``) is easily
accessible on multiple hosts. To add the ``_admin`` label to additional host(s):
By default, a ``ceph.conf`` file and a copy of the ``client.admin`` keyring are
maintained in ``/etc/ceph`` on all hosts that have the ``_admin`` label. This
label is initially applied only to the bootstrap host. We usually recommend
that one or more other hosts be given the ``_admin`` label so that the Ceph CLI
(for example, via ``cephadm shell``) is easily accessible on multiple hosts. To add
the ``_admin`` label to additional host(s), run a command of the following form:
.. prompt:: bash #
ceph orch host label add *<host>* _admin
Adding additional MONs
======================

View File

@ -676,6 +676,22 @@ To disable the automatic management of dameons, set ``unmanaged=True`` in the
ceph orch apply -i mgr.yaml
Cephadm also supports setting the unmanaged parameter to true or false
using the ``ceph orch set-unmanaged`` and ``ceph orch set-managed`` commands.
The commands take the service name (as reported in ``ceph orch ls``) as
the only argument. For example,
.. prompt:: bash #
ceph orch set-unmanaged mon
would set ``unmanaged: true`` for the mon service and
.. prompt:: bash #
ceph orch set-managed mon
would set ``unmanaged: false`` for the mon service.
.. note::
@ -683,6 +699,13 @@ To disable the automatic management of dameons, set ``unmanaged=True`` in the
longer deploy any new daemons (even if the placement specification matches
additional hosts).
.. note::
The "osd" service used to track OSDs that are not tied to any specific
service spec is special and will always be marked unmanaged. Attempting
to modify it with ``ceph orch set-unmanaged`` or ``ceph orch set-managed``
will result in a message ``No service of name osd found. Check "ceph orch ls" for all known services``
Deploying a daemon on a host manually
-------------------------------------

View File

@ -20,7 +20,18 @@ For example:
ceph fs volume create <fs_name> --placement="<placement spec>"
where ``fs_name`` is the name of the CephFS and ``placement`` is a
:ref:`orchestrator-cli-placement-spec`.
:ref:`orchestrator-cli-placement-spec`. For example, to place
MDS daemons for the new ``foo`` volume on hosts labeled with ``mds``:
.. prompt:: bash #
ceph fs volume create foo --placement="label:mds"
You can also update the placement after-the-fact via:
.. prompt:: bash #
ceph orch apply mds foo 'mds-[012]'
For manually deploying MDS daemons, use this specification:
@ -30,6 +41,7 @@ For manually deploying MDS daemons, use this specification:
service_id: fs_name
placement:
count: 3
label: mds
The specification can then be applied using:

View File

@ -4,8 +4,8 @@
MGR Service
===========
The cephadm MGR service is hosting different modules, like the :ref:`mgr-dashboard`
and the cephadm manager module.
The cephadm MGR service hosts multiple modules. These include the
:ref:`mgr-dashboard` and the cephadm manager module.
.. _cephadm-mgr-networks:

View File

@ -170,6 +170,64 @@ network ``10.1.2.0/24``, run the following commands:
ceph orch apply mon --placement="newhost1,newhost2,newhost3"
Setting Crush Locations for Monitors
------------------------------------
Cephadm supports setting CRUSH locations for mon daemons
using the mon service spec. The CRUSH locations are set
by hostname. When cephadm deploys a mon on a host that matches
a hostname specified in the CRUSH locations, it will add
``--set-crush-location <CRUSH-location>`` where the CRUSH location
is the first entry in the list of CRUSH locations for that
host. If multiple CRUSH locations are set for one host, cephadm
will attempt to set the additional locations using the
"ceph mon set_location" command.
.. note::
Setting the CRUSH location in the spec is the recommended way of
replacing tiebreaker mon daemons, as they require having a location
set when they are added.
.. note::
Tiebreaker mon daemons are a part of stretch mode clusters. For more
info on stretch mode clusters see :ref:`stretch_mode`
Example syntax for setting the CRUSH locations:
.. code-block:: yaml
service_type: mon
service_name: mon
placement:
count: 5
spec:
crush_locations:
host1:
- datacenter=a
host2:
- datacenter=b
- rack=2
host3:
- datacenter=a
.. note::
Sometimes, based on the timing of mon daemons being admitted to the mon
quorum, cephadm may fail to set the CRUSH location for some mon daemons
when multiple locations are specified. In this case, the recommended
action is to re-apply the same mon spec to retrigger the service action.
.. note::
Mon daemons will only get the ``--set-crush-location`` flag set when cephadm
actually deploys them. This means if a spec is applied that includes a CRUSH
location for a mon that is already deployed, the flag may not be set until
a redeploy command is issued for that mon daemon.
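For example, to trigger such a redeploy for a hypothetical mon daemon named
``mon.host1``, run a command of the following form:

.. prompt:: bash #

   ceph orch daemon redeploy mon.host1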
Further Reading
===============

View File

@ -197,12 +197,26 @@ configuration files for monitoring services.
Internally, cephadm already uses `Jinja2
<https://jinja.palletsprojects.com/en/2.11.x/>`_ templates to generate the
configuration files for all monitoring components. To be able to customize the
configuration of Prometheus, Grafana or the Alertmanager it is possible to store
a Jinja2 template for each service that will be used for configuration
generation instead. This template will be evaluated every time a service of that
kind is deployed or reconfigured. That way, the custom configuration is
preserved and automatically applied on future deployments of these services.
configuration files for all monitoring components. Starting from version 17.2.3,
cephadm supports Prometheus http service discovery, and uses this endpoint for the
definition and management of the embedded Prometheus service. The endpoint listens on
``https://<mgr-ip>:8765/sd/`` (the port is
configurable through the variable ``service_discovery_port``) and returns scrape target
information in `http_sd_config format
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config/>`_.
Users who run an external monitoring stack can use the `ceph-mgr` service discovery
endpoint to get the scraping configuration. The root certificate of the server can be
obtained by running the following command:
.. prompt:: bash #
ceph orch sd dump cert
The configuration of Prometheus, Grafana, or Alertmanager may be customized by storing
a Jinja2 template for each service. This template will be evaluated every time a service
of that kind is deployed or reconfigured. That way, the custom configuration is preserved
and automatically applied on future deployments of these services.
.. note::
@ -292,6 +306,21 @@ cluster.
By default, ceph-mgr presents prometheus metrics on port 9283 on each host
running a ceph-mgr daemon. Configure prometheus to scrape these.
To make this integration easier, cephadm provides a service discovery endpoint at
``https://<mgr-ip>:8765/sd/``. This endpoint can be used by an external
Prometheus server to retrieve target information for a specific service. Information returned
by this endpoint uses the format specified by the Prometheus `http_sd_config option
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config/>`_.

Here is an example Prometheus job definition that uses the cephadm service discovery endpoint:

.. code-block:: yaml
- job_name: 'ceph-exporter'
http_sd_configs:
- url: http://<mgr-ip>:8765/sd/prometheus/sd-config?service=ceph-exporter
* To enable the dashboard's prometheus-based alerting, see :ref:`dashboard-alerting`.
* To enable dashboard integration with Grafana, see :ref:`dashboard-grafana`.
@ -429,6 +458,28 @@ Then apply this specification:
Grafana will now create an admin user called ``admin`` with the
given password.
Turning off anonymous access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, cephadm allows anonymous users (users who have not provided any
login information) limited, viewer-only access to the Grafana dashboard. To
configure Grafana so that only logged-in users can view the dashboard, set
``anonymous_access: False`` in your Grafana spec.
.. code-block:: yaml
service_type: grafana
placement:
hosts:
- host1
spec:
anonymous_access: False
initial_admin_password: "mypassword"
Since deploying grafana with anonymous access set to false without an initial
admin password set would make the dashboard inaccessible, cephadm requires
setting the ``initial_admin_password`` when ``anonymous_access`` is set to false.
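As with the other service specifications on this page, the spec can then be applied
(assuming it was saved to a hypothetical file named ``grafana.yaml``):

.. prompt:: bash #

   ceph orch apply -i grafana.yaml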
Setting up Alertmanager
-----------------------

View File

@ -113,6 +113,54 @@ A few notes:
a *port* property that is not 2049 to avoid conflicting with the
ingress service, which could be placed on the same host(s).
NFS with virtual IP but no haproxy
----------------------------------
Cephadm also supports deploying NFS with keepalived but without haproxy. This
offers a virtual IP, managed by keepalived, that the NFS daemon can bind to
directly instead of having traffic go through haproxy.
In this setup, you'll either want to set up the service using the nfs module
(see :ref:`nfs-module-cluster-create`) or place the ingress service first, so
that the virtual IP is present for the NFS daemon to bind to. The ingress service
should include the attribute ``keepalive_only`` set to true. For example:
.. code-block:: yaml
service_type: ingress
service_id: nfs.foo
placement:
count: 1
hosts:
- host1
- host2
- host3
spec:
backend_service: nfs.foo
monitor_port: 9049
virtual_ip: 192.168.122.100/24
keepalive_only: true
Then, an nfs service could be created that specifies a ``virtual_ip`` attribute
that will tell it to bind to that specific IP.
.. code-block:: yaml
service_type: nfs
service_id: foo
placement:
count: 1
hosts:
- host1
- host2
- host3
spec:
port: 2049
virtual_ip: 192.168.122.100
Note that in these setups, one should make sure to include ``count: 1`` in the
nfs placement, as it's only possible for one nfs daemon to bind to the virtual IP.
Further Reading
===============

View File

@ -308,7 +308,7 @@ Replacing an OSD
.. prompt:: bash #
orch osd rm <osd_id(s)> --replace [--force]
ceph orch osd rm <osd_id(s)> --replace [--force]
Example:

View File

@ -83,23 +83,23 @@ To deploy RGWs serving the multisite *myorg* realm and the *us-east-1* zone on
.. prompt:: bash #
ceph orch apply rgw east --realm=myorg --zone=us-east-1 --placement="2 myhost1 myhost2"
ceph orch apply rgw east --realm=myorg --zonegroup=us-east-zg-1 --zone=us-east-1 --placement="2 myhost1 myhost2"
Note that in a multisite situation, cephadm only deploys the daemons. It does not create
or update the realm or zone configurations. To create a new realm and zone, you need to do
something like:
or update the realm or zone configurations. To create new realms, zones, and zonegroups,
you can use the :ref:`mgr-rgw-module` or create them manually with commands like the following:
.. prompt:: bash #
radosgw-admin realm create --rgw-realm=<realm-name> --default
.. prompt:: bash #
radosgw-admin zonegroup create --rgw-zonegroup=<zonegroup-name> --master --default
radosgw-admin realm create --rgw-realm=<realm-name>
.. prompt:: bash #
radosgw-admin zone create --rgw-zonegroup=<zonegroup-name> --rgw-zone=<zone-name> --master --default
radosgw-admin zonegroup create --rgw-zonegroup=<zonegroup-name> --master
.. prompt:: bash #
radosgw-admin zone create --rgw-zonegroup=<zonegroup-name> --rgw-zone=<zone-name> --master
.. prompt:: bash #
@ -212,12 +212,14 @@ It is a yaml format file with the following properties:
- host2
- host3
spec:
backend_service: rgw.something # adjust to match your existing RGW service
virtual_ip: <string>/<string> # ex: 192.168.20.1/24
frontend_port: <integer> # ex: 8080
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
ssl_cert: | # optional: SSL certificate and key
backend_service: rgw.something # adjust to match your existing RGW service
virtual_ip: <string>/<string> # ex: 192.168.20.1/24
frontend_port: <integer> # ex: 8080
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
use_keepalived_multicast: <bool> # optional: Default is False.
vrrp_interface_network: <string>/<string> # optional: ex: 192.168.20.0/24
ssl_cert: | # optional: SSL certificate and key
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
@ -243,6 +245,7 @@ It is a yaml format file with the following properties:
frontend_port: <integer> # ex: 8080
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
first_virtual_router_id: <integer> # optional: default 50
ssl_cert: | # optional: SSL certificate and key
-----BEGIN CERTIFICATE-----
...
@ -276,6 +279,21 @@ where the properties of this service specification are:
* ``ssl_cert``:
SSL certificate, if SSL is to be enabled. This must contain the both the certificate and
private key blocks in .pem format.
* ``use_keepalived_multicast``
Default is False. By default, cephadm deploys a keepalived configuration that uses
unicast IPs, namely the same IPs that cephadm uses to connect to the hosts. If
multicast is preferred, set ``use_keepalived_multicast`` to ``True`` and keepalived
will use the multicast IP (224.0.0.18) to communicate between instances, on the same
interfaces that carry the VIPs.
* ``vrrp_interface_network``
By default, cephadm configures keepalived to use the same interface that carries the
VIPs for VRRP communication. If another interface is needed, set
``vrrp_interface_network`` to a network that identifies which ethernet interface to use.
* ``first_virtual_router_id``
Default is 50. When deploying more than one ingress service, this parameter can be
used to ensure that each keepalived instance uses a different ``virtual_router_id``.
When ``virtual_ips_list`` is used, each IP creates its own virtual router, so the
first one uses ``first_virtual_router_id``, the second uses
``first_virtual_router_id`` + 1, and so on. Valid values range from 1 to 255.
.. _ingress-virtual-ip:

View File

@ -15,7 +15,7 @@ creation of multiple file systems use ``ceph fs flag set enable_multiple true``.
::
fs new <file system name> <metadata pool name> <data pool name>
ceph fs new <file system name> <metadata pool name> <data pool name>
This command creates a new file system. The file system name and metadata pool
name are self-explanatory. The specified data pool is the default data pool and
@ -25,19 +25,19 @@ to accommodate the new file system.
::
fs ls
ceph fs ls
List all file systems by name.
::
fs lsflags <file system name>
ceph fs lsflags <file system name>
List all the flags set on a file system.
::
fs dump [epoch]
ceph fs dump [epoch]
This dumps the FSMap at the given epoch (default: current) which includes all
file system settings, MDS daemons and the ranks they hold, and the list of
@ -46,7 +46,7 @@ standby MDS daemons.
::
fs rm <file system name> [--yes-i-really-mean-it]
ceph fs rm <file system name> [--yes-i-really-mean-it]
Destroy a CephFS file system. This wipes information about the state of the
file system from the FSMap. The metadata pool and data pools are untouched and
@ -54,28 +54,28 @@ must be destroyed separately.
::
fs get <file system name>
ceph fs get <file system name>
Get information about the named file system, including settings and ranks. This
is a subset of the same information from the ``fs dump`` command.
is a subset of the same information from the ``ceph fs dump`` command.
::
fs set <file system name> <var> <val>
ceph fs set <file system name> <var> <val>
Change a setting on a file system. These settings are specific to the named
file system and do not affect other file systems.
::
fs add_data_pool <file system name> <pool name/id>
ceph fs add_data_pool <file system name> <pool name/id>
Add a data pool to the file system. This pool can be used for file layouts
as an alternate location to store file data.
::
fs rm_data_pool <file system name> <pool name/id>
ceph fs rm_data_pool <file system name> <pool name/id>
This command removes the specified pool from the list of data pools for the
file system. If any files have layouts for the removed data pool, the file
@ -84,7 +84,7 @@ system) cannot be removed.
::
fs rename <file system name> <new file system name> [--yes-i-really-mean-it]
ceph fs rename <file system name> <new file system name> [--yes-i-really-mean-it]
Rename a Ceph file system. This also changes the application tags on the data
pools and metadata pool of the file system to the new file system name.
@ -98,7 +98,7 @@ Settings
::
fs set <fs name> max_file_size <size in bytes>
ceph fs set <fs name> max_file_size <size in bytes>
CephFS has a configurable maximum file size, and it's 1TB by default.
You may wish to set this limit higher if you expect to store large files
@ -132,13 +132,13 @@ Taking a CephFS cluster down is done by setting the down flag:
::
fs set <fs_name> down true
ceph fs set <fs_name> down true
To bring the cluster back online:
::
fs set <fs_name> down false
ceph fs set <fs_name> down false
This will also restore the previous value of max_mds. MDS daemons are brought
down in a way such that journals are flushed to the metadata pool and all
@ -149,11 +149,11 @@ Taking the cluster down rapidly for deletion or disaster recovery
-----------------------------------------------------------------
To allow rapidly deleting a file system (for testing) or to quickly bring the
file system and MDS daemons down, use the ``fs fail`` command:
file system and MDS daemons down, use the ``ceph fs fail`` command:
::
fs fail <fs_name>
ceph fs fail <fs_name>
This command sets a file system flag to prevent standbys from
activating on the file system (the ``joinable`` flag).
@ -162,7 +162,7 @@ This process can also be done manually by doing the following:
::
fs set <fs_name> joinable false
ceph fs set <fs_name> joinable false
Then the operator can fail all of the ranks which causes the MDS daemons to
respawn as standbys. The file system will be left in a degraded state.
@ -170,7 +170,7 @@ respawn as standbys. The file system will be left in a degraded state.
::
# For all ranks, 0-N:
mds fail <fs_name>:<n>
ceph mds fail <fs_name>:<n>
Once all ranks are inactive, the file system may also be deleted or left in
this state for other purposes (perhaps disaster recovery).
@ -179,7 +179,7 @@ To bring the cluster back up, simply set the joinable flag:
::
fs set <fs_name> joinable true
ceph fs set <fs_name> joinable true
Daemons
@ -198,34 +198,35 @@ Commands to manipulate MDS daemons:
::
mds fail <gid/name/role>
ceph mds fail <gid/name/role>
Mark an MDS daemon as failed. This is equivalent to what the cluster
would do if an MDS daemon had failed to send a message to the mon
for ``mds_beacon_grace`` seconds. If the daemon was active and a suitable
standby is available, using ``mds fail`` will force a failover to the standby.
standby is available, using ``ceph mds fail`` will force a failover to the
standby.
If the MDS daemon was in reality still running, then using ``mds fail``
If the MDS daemon was in reality still running, then using ``ceph mds fail``
will cause the daemon to restart. If it was active and a standby was
available, then the "failed" daemon will return as a standby.
::
tell mds.<daemon name> command ...
ceph tell mds.<daemon name> command ...
Send a command to the MDS daemon(s). Use ``mds.*`` to send a command to all
daemons. Use ``ceph tell mds.* help`` to learn available commands.
::
mds metadata <gid/name/role>
ceph mds metadata <gid/name/role>
Get metadata about the given MDS known to the Monitors.
::
mds repaired <role>
ceph mds repaired <role>
Mark the file system rank as repaired. Contrary to what the name suggests, this
command does not change an MDS; it manipulates the file system rank which has been
@ -244,14 +245,14 @@ Commands to manipulate required client features of a file system:
::
fs required_client_features <fs name> add reply_encoding
fs required_client_features <fs name> rm reply_encoding
ceph fs required_client_features <fs name> add reply_encoding
ceph fs required_client_features <fs name> rm reply_encoding
To list all CephFS features
::
fs feature ls
ceph fs feature ls
Clients that are missing newly added features will be evicted automatically.
@ -346,7 +347,7 @@ Global settings
::
fs flag set <flag name> <flag val> [<confirmation string>]
ceph fs flag set <flag name> <flag val> [<confirmation string>]
Sets a global CephFS flag (i.e. not specific to a particular file system).
Currently, the only flag setting is 'enable_multiple' which allows having
@ -368,13 +369,13 @@ file system.
::
mds rmfailed
ceph mds rmfailed
This removes a rank from the failed set.
::
fs reset <file system name>
ceph fs reset <file system name>
This command resets the file system state to defaults, except for the name and
pools. Non-zero ranks are saved in the stopped set.
@ -382,7 +383,7 @@ pools. Non-zero ranks are saved in the stopped set.
::
fs new <file system name> <metadata pool name> <data pool name> --fscid <fscid> --force
ceph fs new <file system name> <metadata pool name> <data pool name> --fscid <fscid> --force
This command creates a file system with a specific **fscid** (file system cluster ID).
You may want to do this when an application expects the file system's ID to be

View File

@ -154,14 +154,8 @@ readdir. The behavior of the decay counter is the same as for cache trimming or
caps recall. Each readdir call increments the counter by the number of files in
the result.
The ratio of ``mds_max_caps_per_client`` that a client must exceed before readdir
may be throttled by the cap acquisition throttle:
.. confval:: mds_session_max_caps_throttle_ratio
The timeout in seconds after which a client request is retried due to cap
acquisition throttling:
.. confval:: mds_cap_acquisition_throttle_retry_request_timeout
If the number of caps acquired by the client per session is greater than the

View File

@ -14,6 +14,8 @@ Requirements
The primary (local) and secondary (remote) Ceph clusters version should be Pacific or later.
.. _cephfs_mirroring_creating_users:
Creating Users
--------------
@ -42,80 +44,155 @@ Mirror daemon should be spawned using `systemctl(1)` unit files::
$ cephfs-mirror --id mirror --cluster site-a -f
.. note:: User used here is `mirror` created in the `Creating Users` section.
.. note:: The user specified here is `mirror`, the creation of which is
described in the :ref:`Creating Users<cephfs_mirroring_creating_users>`
section.
Multiple ``cephfs-mirror`` daemons may be deployed for concurrent
synchronization and high availability. Mirror daemons share the synchronization
load using a simple ``M/N`` policy, where ``M`` is the number of directories
and ``N`` is the number of ``cephfs-mirror`` daemons.
When ``cephadm`` is used to manage a Ceph cluster, ``cephfs-mirror`` daemons can be
deployed by running the following command:
.. prompt:: bash $
ceph orch apply cephfs-mirror
To deploy multiple mirror daemons, run a command of the following form:
.. prompt:: bash $
ceph orch apply cephfs-mirror --placement=<placement-spec>
For example, to deploy 3 `cephfs-mirror` daemons on different hosts, run a command of the following form:
.. prompt:: bash $
ceph orch apply cephfs-mirror --placement="3 host1,host2,host3"
Interface
---------
`Mirroring` module (manager plugin) provides interfaces for managing directory snapshot
mirroring. Manager interfaces are (mostly) wrappers around monitor commands for managing
file system mirroring and is the recommended control interface.
The `Mirroring` module (manager plugin) provides interfaces for managing
directory snapshot mirroring. These are (mostly) wrappers around monitor
commands for managing file system mirroring and are the recommended control
interface.
Mirroring Module
----------------
The mirroring module is responsible for assigning directories to mirror daemons for
synchronization. Multiple mirror daemons can be spawned to achieve concurrency in
directory snapshot synchronization. When mirror daemons are spawned (or terminated)
, the mirroring module discovers the modified set of mirror daemons and rebalances
the directory assignment amongst the new set thus providing high-availability.
The mirroring module is responsible for assigning directories to mirror daemons
for synchronization. Multiple mirror daemons can be spawned to achieve
concurrency in directory snapshot synchronization. When mirror daemons are
spawned (or terminated), the mirroring module discovers the modified set of
mirror daemons and rebalances directory assignments across the new set, thus
providing high-availability.
.. note:: Multiple mirror daemons is currently untested. Only a single mirror daemon
is recommended.
.. note:: Deploying a single mirror daemon is recommended. Running multiple
daemons is untested.
Mirroring module is disabled by default. To enable mirroring use::
The mirroring module is disabled by default. To enable the mirroring module,
run the following command:
$ ceph mgr module enable mirroring
.. prompt:: bash $
Mirroring module provides a family of commands to control mirroring of directory
snapshots. To add or remove directories, mirroring needs to be enabled for a given
file system. To enable mirroring use::
ceph mgr module enable mirroring
$ ceph fs snapshot mirror enable <fs_name>
The mirroring module provides a family of commands that can be used to control
the mirroring of directory snapshots. To add or remove directories, mirroring
must be enabled for a given file system. To enable mirroring for a given file
system, run a command of the following form:
.. note:: Mirroring module commands use `fs snapshot mirror` prefix as compared to
the monitor commands which `fs mirror` prefix. Make sure to use module
commands.
.. prompt:: bash $
To disable mirroring, use::
ceph fs snapshot mirror enable <fs_name>
$ ceph fs snapshot mirror disable <fs_name>
.. note:: "Mirroring module" commands are prefixed with ``fs snapshot mirror``.
This distinguishes them from "monitor commands", which are prefixed with ``fs
mirror``. Be sure (in this context) to use module commands.
Once mirroring is enabled, add a peer to which directory snapshots are to be mirrored.
Peers follow `<client>@<cluster>` specification and get assigned a unique-id (UUID)
when added. See `Creating Users` section on how to create Ceph users for mirroring.
To disable mirroring for a given file system, run a command of the following form:
To add a peer use::
.. prompt:: bash $
$ ceph fs snapshot mirror peer_add <fs_name> <remote_cluster_spec> [<remote_fs_name>] [<remote_mon_host>] [<cephx_key>]
ceph fs snapshot mirror disable <fs_name>
`<remote_fs_name>` is optional, and defaults to `<fs_name>` (on the remote cluster).
After mirroring is enabled, add a peer to which directory snapshots are to be
mirrored. Peers are specified by the ``<client>@<cluster>`` format, which is
referred to elsewhere in this document as the ``remote_cluster_spec``. Peers
are assigned a unique-id (UUID) when added. See the :ref:`Creating
Users<cephfs_mirroring_creating_users>` section for instructions that describe
how to create Ceph users for mirroring.
This requires the remote cluster ceph configuration and user keyring to be available in
the primary cluster. See `Bootstrap Peers` section to avoid this. `peer_add` additionally
supports passing the remote cluster monitor address and the user key. However, bootstrapping
a peer is the recommended way to add a peer.
To add a peer, run a command of the following form:
.. prompt:: bash $
ceph fs snapshot mirror peer_add <fs_name> <remote_cluster_spec> [<remote_fs_name>] [<remote_mon_host>] [<cephx_key>]
``<remote_cluster_spec>`` is of the format ``client.<id>@<cluster_name>``.
``<remote_fs_name>`` is optional, and defaults to `<fs_name>` (on the remote
cluster).
For this command to succeed, the remote cluster's Ceph configuration and user
keyring must be available in the primary cluster. For example, if a user named
``mirror_remote`` is created on the remote cluster which has ``rwps``
permissions for the remote file system named ``remote_fs`` (see `Creating
Users`) and the remote cluster is named ``remote_ceph`` (that is, the remote
cluster configuration file is named ``remote_ceph.conf`` on the primary
cluster), run the following command to add the remote filesystem as a peer to
the primary filesystem ``primary_fs``:
.. prompt:: bash $
ceph fs snapshot mirror peer_add primary_fs client.mirror_remote@remote_ceph remote_fs
To avoid having to maintain the remote cluster configuration file and remote
ceph user keyring in the primary cluster, users can bootstrap a peer (which
stores the relevant remote cluster details in the monitor config store on the
primary cluster). See the :ref:`Bootstrap
Peers<cephfs_mirroring_bootstrap_peers>` section.
The ``peer_add`` command supports passing the remote cluster monitor address
and the user key. However, bootstrapping a peer is the recommended way to add a
peer.
.. note:: Only a single peer is supported right now.
To remove a peer use::
To remove a peer, run a command of the following form:
$ ceph fs snapshot mirror peer_remove <fs_name> <peer_uuid>
.. prompt:: bash $
To list file system mirror peers use::
ceph fs snapshot mirror peer_remove <fs_name> <peer_uuid>
$ ceph fs snapshot mirror peer_list <fs_name>
To list file system mirror peers, run a command of the following form:
To configure a directory for mirroring, use::
.. prompt:: bash $
$ ceph fs snapshot mirror add <fs_name> <path>
ceph fs snapshot mirror peer_list <fs_name>
To stop a mirroring directory snapshots use::
To configure a directory for mirroring, run a command of the following form:
$ ceph fs snapshot mirror remove <fs_name> <path>
.. prompt:: bash $
Only absolute directory paths are allowed. Also, paths are normalized by the mirroring
module, therefore, `/a/b/../b` is equivalent to `/a/b`.
ceph fs snapshot mirror add <fs_name> <path>
To stop mirroring directory snapshots, run a command of the following form:
.. prompt:: bash $
ceph fs snapshot mirror remove <fs_name> <path>
Only absolute directory paths are allowed.
Paths are normalized by the mirroring module. This means that ``/a/b/../b`` is
equivalent to ``/a/b``. Paths always start from the CephFS file-system root and
not from the host system mount point.
For example::
$ mkdir -p /d0/d1/d2
$ ceph fs snapshot mirror add cephfs /d0/d1/d2
@ -123,16 +200,19 @@ module, therefore, `/a/b/../b` is equivalent to `/a/b`.
$ ceph fs snapshot mirror add cephfs /d0/d1/../d1/d2
Error EEXIST: directory /d0/d1/d2 is already tracked
Once a directory is added for mirroring, its subdirectory or ancestor directories are
disallowed to be added for mirroring::
After a directory is added for mirroring, the additional mirroring of
subdirectories or ancestor directories is disallowed::
$ ceph fs snapshot mirror add cephfs /d0/d1
Error EINVAL: /d0/d1 is a ancestor of tracked path /d0/d1/d2
$ ceph fs snapshot mirror add cephfs /d0/d1/d2/d3
Error EINVAL: /d0/d1/d2/d3 is a subtree of tracked path /d0/d1/d2
Commands to check directory mapping (to mirror daemons) and directory distribution are
detailed in `Mirroring Status` section.
The :ref:`Mirroring Status<cephfs_mirroring_mirroring_status>` section contains
information about the commands for checking the directory mapping (to mirror
daemons) and for checking the directory distribution.
.. _cephfs_mirroring_bootstrap_peers:
Bootstrap Peers
---------------
@ -160,6 +240,9 @@ e.g.::
$ ceph fs snapshot mirror peer_bootstrap import cephfs eyJmc2lkIjogIjBkZjE3MjE3LWRmY2QtNDAzMC05MDc5LTM2Nzk4NTVkNDJlZiIsICJmaWxlc3lzdGVtIjogImJhY2t1cF9mcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcGVlcl9ib290c3RyYXAiLCAic2l0ZV9uYW1lIjogInNpdGUtcmVtb3RlIiwgImtleSI6ICJBUUFhcDBCZ0xtRmpOeEFBVnNyZXozai9YYUV0T2UrbUJEZlJDZz09IiwgIm1vbl9ob3N0IjogIlt2MjoxOTIuMTY4LjAuNTo0MDkxOCx2MToxOTIuMTY4LjAuNTo0MDkxOV0ifQ==
.. _cephfs_mirroring_mirroring_status:
Mirroring Status
----------------

View File

@ -78,7 +78,15 @@ By default, `cephfs-top` connects to cluster name `ceph`. To use a non-default c
$ cephfs-top -d <seconds>
Interval should be greater than or equal to 0.5 seconds. Fractional seconds are honoured.
Refresh interval should be a positive integer.
To dump the metrics to stdout without creating a curses display use::
$ cephfs-top --dump
To dump the metrics of the given filesystem to stdout without creating a curses display use::
$ cephfs-top --dumpfs <fs_name>
Interactive Commands
--------------------
@ -104,3 +112,5 @@ The metrics display can be scrolled using the Arrow Keys, PgUp/PgDn, Home/End an
Sample screenshot running `cephfs-top` with 2 filesystems:
.. image:: cephfs-top.png
.. note:: The minimum compatible Python version for cephfs-top is 3.6.0. cephfs-top is supported on RHEL 8, Ubuntu 18.04, CentOS 8, and later distributions.

View File

@ -8,10 +8,17 @@ Creating pools
A Ceph file system requires at least two RADOS pools, one for data and one for metadata.
When configuring these pools, you might consider:
- Using a higher replication level for the metadata pool, as any data loss in
this pool can render the whole file system inaccessible.
- Using lower-latency storage such as SSDs for the metadata pool, as this will
directly affect the observed latency of file system operations on clients.
- We recommend configuring *at least* 3 replicas for the metadata pool,
as data loss in this pool can render the entire file system inaccessible.
Configuring 4 would not be extreme, especially since the metadata pool's
capacity requirements are quite modest.
- We recommend the fastest feasible low-latency storage devices (NVMe, Optane,
or at the very least SAS/SATA SSD) for the metadata pool, as this will
directly affect the latency of client file system operations.
- We strongly suggest that the CephFS metadata pool be provisioned on dedicated
SSD / NVMe OSDs. This ensures that high client workload does not adversely
impact metadata operations. See :ref:`device_classes` to configure pools this
way.
- The data pool used to create the file system is the "default" data pool and
the location for storing all inode backtrace information, used for hard link
management and disaster recovery. For this reason, all inodes created in

View File

@ -149,8 +149,8 @@ errors.
::
cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_extents [<data pool> [<extra data pool> ...]]
cephfs-data-scan scan_inodes [<data pool>]
cephfs-data-scan scan_links
'scan_extents' and 'scan_inodes' commands may take a *very long* time
@ -166,22 +166,22 @@ The example below shows how to run 4 workers simultaneously:
::
# Worker 0
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4
# Worker 1
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4
# Worker 2
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4
# Worker 3
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4
# Worker 0
cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4
# Worker 1
cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4
# Worker 2
cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4
# Worker 3
cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>
cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4
It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
@ -191,8 +191,13 @@ operation to delete ancillary data generated during recovery.
::
cephfs-data-scan cleanup <data pool>
cephfs-data-scan cleanup [<data pool>]
Note that the data pool parameters for the 'scan_extents', 'scan_inodes' and
'cleanup' commands are optional, and usually the tool will be able to
detect the pools automatically. Still, you may override this. The
'scan_extents' command needs all data pools to be specified, while the
'scan_inodes' and 'cleanup' commands need only the main data pool.
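For example, with a hypothetical primary data pool named ``cephfs_data`` and an
additional data pool named ``cephfs_data_ec``, the pools could be passed explicitly
as follows:

::

   cephfs-data-scan scan_extents cephfs_data cephfs_data_ec
   cephfs-data-scan scan_inodes cephfs_data
   cephfs-data-scan cleanup cephfs_data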
Using an alternate metadata pool for recovery
@ -229,35 +234,29 @@ backed by the original data pool.
::
ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph osd pool create cephfs_recovery_meta
ceph fs new cephfs_recovery recovery <data_pool> --allow-dangerous-metadata-overlay
ceph fs new cephfs_recovery cephfs_recovery_meta <data_pool> --recover --allow-dangerous-metadata-overlay
.. note::
The recovery file system starts with an MDS rank that will initialize the new
metadata pool with some metadata. This is necessary to bootstrap recovery.
However, now we will take the MDS down as we do not want it interacting with
the metadata pool further.
The ``--recover`` flag prevents any MDS from joining the new file system.
Next, we will create the initial metadata for the fs:
::
ceph fs fail cephfs_recovery
Next, we will reset the initial metadata the MDS created:
::
cephfs-table-tool cephfs_recovery:all reset session
cephfs-table-tool cephfs_recovery:all reset snap
cephfs-table-tool cephfs_recovery:all reset inode
cephfs-table-tool cephfs_recovery:0 reset session
cephfs-table-tool cephfs_recovery:0 reset snap
cephfs-table-tool cephfs_recovery:0 reset inode
cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force
Now perform the recovery of the metadata pool from the data pool:
::
cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool>
cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool>
cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name>
cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt
cephfs-data-scan scan_links --filesystem cephfs_recovery
.. note::
@ -272,7 +271,6 @@ with:
::
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta
cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force
After recovery, some recovered directories will have incorrect statistics.
Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are
@ -283,20 +281,22 @@ set to false (the default) to prevent the MDS from checking the statistics:
ceph config rm mds mds_verify_scatter
ceph config rm mds mds_debug_scatterstat
(Note, the config may also have been set globally or via a ceph.conf file.)
.. note::
Also verify the config has not been set globally or with a local ceph.conf file.
Now, allow an MDS to join the recovery file system:
::
ceph fs set cephfs_recovery joinable true
Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair the statistics.
Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair recursive statistics.
Ensure you have an MDS running and issue:
::
ceph fs status # get active MDS
ceph tell mds.<id> scrub start / recursive repair
ceph tell mds.cephfs_recovery:0 scrub start / recursive,repair,force
.. note::

View File

@ -3,13 +3,13 @@
FS volumes and subvolumes
=========================
A single source of truth for CephFS exports is implemented in the volumes
module of the :term:`Ceph Manager` daemon (ceph-mgr). The OpenStack shared
file system service (manila_), Ceph Container Storage Interface (CSI_),
storage administrators among others can use the common CLI provided by the
ceph-mgr volumes module to manage the CephFS exports.
The volumes module of the :term:`Ceph Manager` daemon (ceph-mgr) provides a
single source of truth for CephFS exports. The OpenStack shared file system
service (manila_), the Ceph Container Storage Interface (CSI_), and storage
administrators, among others, use the common CLI provided by the ceph-mgr
``volumes`` module to manage CephFS exports.
The ceph-mgr volumes module implements the following file system export
The ceph-mgr ``volumes`` module implements the following file system export
abstractions:
* FS volumes, an abstraction for CephFS file systems
@ -17,87 +17,82 @@ abstractions:
* FS subvolumes, an abstraction for independent CephFS directory trees
* FS subvolume groups, an abstraction for a directory level higher than FS
subvolumes to effect policies (e.g., :doc:`/cephfs/file-layouts`) across a
set of subvolumes
subvolumes. Used to effect policies (e.g., :doc:`/cephfs/file-layouts`)
across a set of subvolumes
Some possible use-cases for the export abstractions:
* FS subvolumes used as manila shares or CSI volumes
* FS subvolumes used as Manila shares or CSI volumes
* FS subvolume groups used as manila share groups
* FS subvolume groups used as Manila share groups
Requirements
------------
* Nautilus (14.2.x) or a later version of Ceph
* Nautilus (14.2.x) or later Ceph release
* Cephx client user (see :doc:`/rados/operations/user-management`) with
the following minimum capabilities::
at least the following capabilities::
mon 'allow r'
mgr 'allow rw'
FS Volumes
----------
Create a volume using::
Create a volume by running the following command:
$ ceph fs volume create <vol_name> [<placement>]
.. prompt:: bash #
ceph fs volume create <vol_name> [placement]
This creates a CephFS file system and its data and metadata pools. It can also
try to create MDSes for the filesystem using the enabled ceph-mgr orchestrator
module (see :doc:`/mgr/orchestrator`), e.g. rook.
deploy MDS daemons for the filesystem using a ceph-mgr orchestrator module (for
example Rook). See :doc:`/mgr/orchestrator`.
<vol_name> is the volume name (an arbitrary string), and
``<vol_name>`` is the volume name (an arbitrary string). ``[placement]`` is an
optional string that specifies the :ref:`orchestrator-cli-placement-spec` for
the MDS. See also :ref:`orchestrator-cli-cephfs` for more examples on
placement.
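For example, to create a volume whose MDS daemons are placed on three hypothetical
hosts, run a command of the following form:

.. prompt:: bash #

   ceph fs volume create vol_a "3 host1,host2,host3"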
<placement> is an optional string signifying which hosts should have NFS Ganesha
daemon containers running on them and, optionally, the total number of NFS
Ganesha daemons the cluster (should you want to have more than one NFS Ganesha
daemon running per node). For example, the following placement string means
"deploy NFS Ganesha daemons on nodes host1 and host2 (one daemon per host):
.. note:: Specifying placement via a YAML file is not supported through the
volume interface.
"host1,host2"
and this placement specification says to deploy two NFS Ganesha daemons each
on nodes host1 and host2 (for a total of four NFS Ganesha daemons in the
cluster):
"4 host1,host2"
For more details on placement specification refer to the :ref:`orchestrator-cli-service-spec`,
but keep in mind that specifying the placement via a YAML file is not supported.
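For example, assuming a hypothetical volume name ``myfs`` and hosts ``host1`` and ``host2`` (illustrative values only), a creation command might look like this::

    $ ceph fs volume create myfs "host1,host2"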
Remove a volume using::
To remove a volume, run the following command:
$ ceph fs volume rm <vol_name> [--yes-i-really-mean-it]
This removes a file system and its data and metadata pools. It also tries to
remove MDSes using the enabled ceph-mgr orchestrator module.
remove MDS daemons using the enabled ceph-mgr orchestrator module.
List volumes using::
.. note:: After volume deletion, it is recommended to restart `ceph-mgr`
if a new file system is created on the same cluster and subvolume interface
is being used. Please see https://tracker.ceph.com/issues/49605#note-5
for more details.
List volumes by running the following command:
$ ceph fs volume ls
Rename a volume using::
Rename a volume by running the following command:
$ ceph fs volume rename <vol_name> <new_vol_name> [--yes-i-really-mean-it]
Renaming a volume can be an expensive operation. It does the following:
Renaming a volume can be an expensive operation that requires the following:
- renames the orchestrator managed MDS service to match the <new_vol_name>.
This involves launching a MDS service with <new_vol_name> and bringing down
the MDS service with <vol_name>.
- renames the file system matching <vol_name> to <new_vol_name>
- changes the application tags on the data and metadata pools of the file system
to <new_vol_name>
- renames the metadata and data pools of the file system.
- Renaming the orchestrator-managed MDS service to match the <new_vol_name>.
This involves launching a MDS service with ``<new_vol_name>`` and bringing
down the MDS service with ``<vol_name>``.
- Renaming the file system matching ``<vol_name>`` to ``<new_vol_name>``.
- Changing the application tags on the data and metadata pools of the file system
to ``<new_vol_name>``.
- Renaming the metadata and data pools of the file system.
The CephX IDs authorized to <vol_name> need to be reauthorized to <new_vol_name>. Any
on-going operations of the clients using these IDs may be disrupted. Mirroring is
expected to be disabled on the volume.
The CephX IDs that are authorized for ``<vol_name>`` must be reauthorized for
``<new_vol_name>``. Any ongoing operations of the clients using these IDs may
be disrupted. Ensure that mirroring is disabled on the volume.
Fetch the information of a CephFS volume using::
To fetch the information of a CephFS volume, run the following command:
$ ceph fs volume info vol_name [--human_readable]
@ -105,15 +100,15 @@ The ``--human_readable`` flag shows used and available pool capacities in KB/MB/
The output format is JSON and contains fields as follows:
* pools: Attributes of data and metadata pools
* avail: The amount of free space available in bytes
* used: The amount of storage consumed in bytes
* name: Name of the pool
* mon_addrs: List of monitor addresses
* used_size: Current used size of the CephFS volume in bytes
* pending_subvolume_deletions: Number of subvolumes pending deletion
* ``pools``: Attributes of data and metadata pools
* ``avail``: The amount of free space available in bytes
* ``used``: The amount of storage consumed in bytes
* ``name``: Name of the pool
* ``mon_addrs``: List of Ceph monitor addresses
* ``used_size``: Current used size of the CephFS volume in bytes
* ``pending_subvolume_deletions``: Number of subvolumes pending deletion
Sample output of volume info command::
Sample output of the ``volume info`` command::
$ ceph fs volume info vol_name
{
@ -143,88 +138,91 @@ Sample output of volume info command::
FS Subvolume groups
-------------------
Create a subvolume group using::
Create a subvolume group by running the following command:
$ ceph fs subvolumegroup create <vol_name> <group_name> [--size <size_in_bytes>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>]
The command succeeds even if the subvolume group already exists.
When creating a subvolume group you can specify its data pool layout (see
:doc:`/cephfs/file-layouts`), uid, gid, file mode in octal numerals and
:doc:`/cephfs/file-layouts`), uid, gid, file mode in octal numerals, and
size in bytes. The size of the subvolume group is specified by setting
a quota on it (see :doc:`/cephfs/quota`). By default, the subvolume group
is created with an octal file mode '755', uid '0', gid '0' and data pool
is created with octal file mode ``755``, uid ``0``, gid ``0`` and the data pool
layout of its parent directory.
Remove a subvolume group using::
Remove a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup rm <vol_name> <group_name> [--force]
The removal of a subvolume group fails if it is not empty or non-existent.
'--force' flag allows the non-existent subvolume group remove command to succeed.
The removal of a subvolume group fails if the subvolume group is not empty or
is non-existent. The ``--force`` flag allows the "subvolume group remove"
command to succeed when the subvolume group is non-existent.
Fetch the absolute path of a subvolume group using::
Fetch the absolute path of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup getpath <vol_name> <group_name>
List subvolume groups using::
List subvolume groups by running a command of the following form:
$ ceph fs subvolumegroup ls <vol_name>
.. note:: Subvolume group snapshot feature is no longer supported in mainline CephFS (existing group
snapshots can still be listed and deleted)
Fetch the metadata of a subvolume group using::
Fetch the metadata of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup info <vol_name> <group_name>
The output format is json and contains fields as follows.
The output format is JSON and contains fields as follows:
* atime: access time of subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* mtime: modification time of subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* ctime: change time of subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* uid: uid of subvolume group path
* gid: gid of subvolume group path
* mode: mode of subvolume group path
* mon_addrs: list of monitor addresses
* bytes_pcent: quota used in percentage if quota is set, else displays "undefined"
* bytes_quota: quota size in bytes if quota is set, else displays "infinite"
* bytes_used: current used size of the subvolume group in bytes
* created_at: time of creation of subvolume group in the format "YYYY-MM-DD HH:MM:SS"
* data_pool: data pool the subvolume group belongs to
* ``atime``: access time of the subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* ``mtime``: modification time of the subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* ``ctime``: change time of the subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* ``uid``: uid of the subvolume group path
* ``gid``: gid of the subvolume group path
* ``mode``: mode of the subvolume group path
* ``mon_addrs``: list of monitor addresses
* ``bytes_pcent``: quota used in percentage if quota is set, else displays "undefined"
* ``bytes_quota``: quota size in bytes if quota is set, else displays "infinite"
* ``bytes_used``: current used size of the subvolume group in bytes
* ``created_at``: creation time of the subvolume group in the format "YYYY-MM-DD HH:MM:SS"
* ``data_pool``: data pool to which the subvolume group belongs
Check the presence of any subvolume group using::
Check the presence of any subvolume group by running a command of the following form:
$ ceph fs subvolumegroup exist <vol_name>
The strings returned by the 'exist' command:
The ``exist`` command outputs:
* "subvolumegroup exists": if any subvolumegroup is present
* "no subvolumegroup exists": if no subvolumegroup is present
.. note:: It checks for the presence of custom groups and not the default one. To validate the emptiness of the volume, subvolumegroup existence check alone is not sufficient. The subvolume existence also needs to be checked as there might be subvolumes in the default group.
.. note:: This command checks for the presence of custom groups and not
presence of the default one. To validate the emptiness of the volume, a
subvolumegroup existence check alone is not sufficient. Subvolume existence
also needs to be checked as there might be subvolumes in the default group.
Resize a subvolume group using::
Resize a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup resize <vol_name> <group_name> <new_size> [--no_shrink]
The command resizes the subvolume group quota using the size specified by 'new_size'.
The '--no_shrink' flag prevents the subvolume group to shrink below the current used
size of the subvolume group.
The command resizes the subvolume group quota, using the size specified by
``new_size``. The ``--no_shrink`` flag prevents the subvolume group from
shrinking below the current used size.
The subvolume group can be resized to an infinite size by passing 'inf' or 'infinite'
as the new_size.
The subvolume group may be resized to an infinite size by passing ``inf`` or
``infinite`` as the ``new_size``.
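For example, with a hypothetical volume ``cephfs`` and group ``csi`` (illustrative names and sizes), the quota could be grown to 100 GiB or removed entirely::

    $ ceph fs subvolumegroup resize cephfs csi 107374182400
    $ ceph fs subvolumegroup resize cephfs csi inf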
Remove a snapshot of a subvolume group using::
Remove a snapshot of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup snapshot rm <vol_name> <group_name> <snap_name> [--force]
Using the '--force' flag allows the command to succeed that would otherwise
fail if the snapshot did not exist.
Supplying the ``--force`` flag allows the command to succeed when it would otherwise
fail due to the nonexistence of the snapshot.
List snapshots of a subvolume group using::
List snapshots of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup snapshot ls <vol_name> <group_name>
@ -232,7 +230,7 @@ List snapshots of a subvolume group using::
FS Subvolumes
-------------
Create a subvolume using::
Create a subvolume using:
$ ceph fs subvolume create <vol_name> <subvol_name> [--size <size_in_bytes>] [--group_name <subvol_group_name>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>] [--namespace-isolated]
@ -247,11 +245,10 @@ default a subvolume is created within the default subvolume group, and with an o
mode '755', uid of its subvolume group, gid of its subvolume group, data pool layout of
its parent directory and no size limit.
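For example, a subvolume might be created with a 10 GiB quota inside a hypothetical group ``csi`` (names and size are illustrative)::

    $ ceph fs subvolume create cephfs sub0 --size 10737418240 --group_name csi --namespace-isolated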
Remove a subvolume using::
Remove a subvolume using:
$ ceph fs subvolume rm <vol_name> <subvol_name> [--group_name <subvol_group_name>] [--force] [--retain-snapshots]
The command removes the subvolume and its contents. It does this in two steps.
First, it moves the subvolume to a trash folder, and then asynchronously purges
its contents.
@ -267,95 +264,95 @@ empty for all operations not involving the retained snapshots.
.. note:: Retained snapshots can be used as a clone source to recreate the subvolume, or clone to a newer subvolume.
Resize a subvolume using::
Resize a subvolume using:
$ ceph fs subvolume resize <vol_name> <subvol_name> <new_size> [--group_name <subvol_group_name>] [--no_shrink]
The command resizes the subvolume quota using the size specified by 'new_size'.
'--no_shrink' flag prevents the subvolume to shrink below the current used size of the subvolume.
The command resizes the subvolume quota using the size specified by ``new_size``.
The ``--no_shrink`` flag prevents the subvolume from shrinking below the current used size of the subvolume.
The subvolume can be resized to an infinite size by passing 'inf' or 'infinite' as the new_size.
The subvolume can be resized to an unlimited (but sparse) logical size by passing ``inf`` or ``infinite`` as ``new_size``.
Authorize cephx auth IDs, the read/read-write access to fs subvolumes::
Authorize cephx auth IDs with read or read-write access to fs subvolumes:
$ ceph fs subvolume authorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>] [--access_level=<access_level>]
The 'access_level' takes 'r' or 'rw' as value.
The ``access_level`` takes ``r`` or ``rw`` as value.
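For example, to grant a hypothetical auth ID ``user1`` read-write access to a subvolume ``sub0`` (illustrative names)::

    $ ceph fs subvolume authorize cephfs sub0 user1 --access_level=rw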
Deauthorize cephx auth IDs, the read/read-write access to fs subvolumes::
Deauthorize cephx auth IDs, removing their read or read-write access to fs subvolumes:
$ ceph fs subvolume deauthorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]
List cephx auth IDs authorized to access fs subvolume::
List cephx auth IDs authorized to access fs subvolume:
$ ceph fs subvolume authorized_list <vol_name> <sub_name> [--group_name=<group_name>]
Evict fs clients based on auth ID and subvolume mounted::
Evict fs clients based on auth ID and subvolume mounted:
$ ceph fs subvolume evict <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]
Fetch the absolute path of a subvolume using::
Fetch the absolute path of a subvolume using:
$ ceph fs subvolume getpath <vol_name> <subvol_name> [--group_name <subvol_group_name>]
Fetch the information of a subvolume using::
Fetch the information of a subvolume using:
$ ceph fs subvolume info <vol_name> <subvol_name> [--group_name <subvol_group_name>]
The output format is json and contains fields as follows.
The output format is JSON and contains fields as follows.
* atime: access time of subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* mtime: modification time of subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* ctime: change time of subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* uid: uid of subvolume path
* gid: gid of subvolume path
* mode: mode of subvolume path
* mon_addrs: list of monitor addresses
* bytes_pcent: quota used in percentage if quota is set, else displays "undefined"
* bytes_quota: quota size in bytes if quota is set, else displays "infinite"
* bytes_used: current used size of the subvolume in bytes
* created_at: time of creation of subvolume in the format "YYYY-MM-DD HH:MM:SS"
* data_pool: data pool the subvolume belongs to
* path: absolute path of a subvolume
* type: subvolume type indicating whether it's clone or subvolume
* pool_namespace: RADOS namespace of the subvolume
* features: features supported by the subvolume
* state: current state of the subvolume
* ``atime``: access time of the subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* ``mtime``: modification time of the subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* ``ctime``: change time of the subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* ``uid``: uid of the subvolume path
* ``gid``: gid of the subvolume path
* ``mode``: mode of the subvolume path
* ``mon_addrs``: list of monitor addresses
* ``bytes_pcent``: quota used in percentage if quota is set, else displays ``undefined``
* ``bytes_quota``: quota size in bytes if quota is set, else displays ``infinite``
* ``bytes_used``: current used size of the subvolume in bytes
* ``created_at``: creation time of the subvolume in the format "YYYY-MM-DD HH:MM:SS"
* ``data_pool``: data pool to which the subvolume belongs
* ``path``: absolute path of a subvolume
* ``type``: subvolume type indicating whether it's clone or subvolume
* ``pool_namespace``: RADOS namespace of the subvolume
* ``features``: features supported by the subvolume
* ``state``: current state of the subvolume
If a subvolume has been removed retaining its snapshots, the output only contains fields as follows.
If a subvolume has been removed retaining its snapshots, the output contains only fields as follows.
* type: subvolume type indicating whether it's clone or subvolume
* features: features supported by the subvolume
* state: current state of the subvolume
* ``type``: subvolume type indicating whether it's clone or subvolume
* ``features``: features supported by the subvolume
* ``state``: current state of the subvolume
The subvolume "features" are based on the internal version of the subvolume and is a list containing
a subset of the following features,
A subvolume's ``features`` are based on the internal version of the subvolume and are
a subset of the following:
* "snapshot-clone": supports cloning using a subvolumes snapshot as the source
* "snapshot-autoprotect": supports automatically protecting snapshots, that are active clone sources, from deletion
* "snapshot-retention": supports removing subvolume contents, retaining any existing snapshots
* ``snapshot-clone``: supports cloning using a subvolumes snapshot as the source
* ``snapshot-autoprotect``: supports automatically protecting snapshots, that are active clone sources, from deletion
* ``snapshot-retention``: supports removing subvolume contents, retaining any existing snapshots
The subvolume "state" is based on the current state of the subvolume and contains one of the following values.
A subvolume's ``state`` is based on the current state of the subvolume and contains one of the following values.
* "complete": subvolume is ready for all operations
* "snapshot-retained": subvolume is removed but its snapshots are retained
* ``complete``: subvolume is ready for all operations
* ``snapshot-retained``: subvolume is removed but its snapshots are retained
List subvolumes using::
List subvolumes using:
$ ceph fs subvolume ls <vol_name> [--group_name <subvol_group_name>]
.. note:: Subvolumes that are removed but have retained snapshots are also listed.
Check the presence of any subvolume using::
Check the presence of any subvolume using:
$ ceph fs subvolume exist <vol_name> [--group_name <subvol_group_name>]
The strings returned by the 'exist' command:
These are the possible results of the ``exist`` command:
* "subvolume exists": if any subvolume of given group_name is present
* "no subvolume exists": if no subvolume of given group_name is present
* ``subvolume exists``: if any subvolume of given group_name is present
* ``no subvolume exists``: if no subvolume of given group_name is present
Set custom metadata on the subvolume as a key-value pair using::
Set custom metadata on the subvolume as a key-value pair using:
$ ceph fs subvolume metadata set <vol_name> <subvol_name> <key_name> <value> [--group_name <subvol_group_name>]
@ -365,52 +362,51 @@ Set custom metadata on the subvolume as a key-value pair using::
.. note:: Custom metadata on a subvolume is not preserved when snapshotting the subvolume, and hence, is also not preserved when cloning the subvolume snapshot.
Get custom metadata set on the subvolume using the metadata key::
Get custom metadata set on the subvolume using the metadata key:
$ ceph fs subvolume metadata get <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>]
List custom metadata (key-value pairs) set on the subvolume using::
List custom metadata (key-value pairs) set on the subvolume using:
$ ceph fs subvolume metadata ls <vol_name> <subvol_name> [--group_name <subvol_group_name>]
Remove custom metadata set on the subvolume using the metadata key::
Remove custom metadata set on the subvolume using the metadata key:
$ ceph fs subvolume metadata rm <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>] [--force]
Using the '--force' flag allows the command to succeed that would otherwise
Using the ``--force`` flag allows the command to succeed that would otherwise
fail if the metadata key did not exist.
Create a snapshot of a subvolume using::
Create a snapshot of a subvolume using:
$ ceph fs subvolume snapshot create <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
Remove a snapshot of a subvolume using::
Remove a snapshot of a subvolume using:
$ ceph fs subvolume snapshot rm <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>] [--force]
Using the '--force' flag allows the command to succeed that would otherwise
Using the ``--force`` flag allows the command to succeed that would otherwise
fail if the snapshot did not exist.
.. note:: if the last snapshot within a snapshot retained subvolume is removed, the subvolume is also removed
List snapshots of a subvolume using::
List snapshots of a subvolume using:
$ ceph fs subvolume snapshot ls <vol_name> <subvol_name> [--group_name <subvol_group_name>]
Fetch the information of a snapshot using::
Fetch the information of a snapshot using:
$ ceph fs subvolume snapshot info <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
The output format is json and contains fields as follows.
* created_at: time of creation of snapshot in the format "YYYY-MM-DD HH:MM:SS:ffffff"
* data_pool: data pool the snapshot belongs to
* has_pending_clones: "yes" if snapshot clone is in progress otherwise "no"
* pending_clones: list of in progress or pending clones and their target group if exist otherwise this field is not shown
* orphan_clones_count: count of orphan clones if snapshot has orphan clones otherwise this field is not shown
* ``created_at``: creation time of the snapshot in the format "YYYY-MM-DD HH:MM:SS:ffffff"
* ``data_pool``: data pool to which the snapshot belongs
* ``has_pending_clones``: ``yes`` if snapshot clone is in progress, otherwise ``no``
* ``pending_clones``: list of in-progress or pending clones and their target group if any exist, otherwise this field is not shown
* ``orphan_clones_count``: count of orphan clones if the snapshot has orphan clones, otherwise this field is not shown
Sample output if snapshot clones are in progress or pending state::
Sample output when snapshot clones are in progress or pending::
$ ceph fs subvolume snapshot info cephfs subvol snap
{
@ -432,7 +428,7 @@ Sample output if snapshot clones are in progress or pending state::
]
}
Sample output if no snapshot clone is in progress or pending state::
Sample output when no snapshot clone is in progress or pending::
$ ceph fs subvolume snapshot info cephfs subvol snap
{
@ -441,90 +437,93 @@ Sample output if no snapshot clone is in progress or pending state::
"has_pending_clones": "no"
}
Set custom metadata on the snapshot as a key-value pair using::
Set custom key-value metadata on the snapshot by running:
$ ceph fs subvolume snapshot metadata set <vol_name> <subvol_name> <snap_name> <key_name> <value> [--group_name <subvol_group_name>]
.. note:: If the key_name already exists then the old value will get replaced by the new value.
.. note:: The key_name and value should be a string of ASCII characters (as specified in python's string.printable). The key_name is case-insensitive and always stored in lower case.
.. note:: The key_name and value should be strings of ASCII characters (as specified in Python's ``string.printable``). The key_name is case-insensitive and always stored in lowercase.
.. note:: Custom metadata on a snapshots is not preserved when snapshotting the subvolume, and hence, is also not preserved when cloning the subvolume snapshot.
.. note:: Custom metadata on a snapshot is not preserved when snapshotting the subvolume, and hence is also not preserved when cloning the subvolume snapshot.
Get custom metadata set on the snapshot using the metadata key::
Get custom metadata set on the snapshot using the metadata key:
$ ceph fs subvolume snapshot metadata get <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>]
List custom metadata (key-value pairs) set on the snapshot using::
List custom metadata (key-value pairs) set on the snapshot using:
$ ceph fs subvolume snapshot metadata ls <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
Remove custom metadata set on the snapshot using the metadata key::
Remove custom metadata set on the snapshot using the metadata key:
$ ceph fs subvolume snapshot metadata rm <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>] [--force]
Using the '--force' flag allows the command to succeed that would otherwise
Using the ``--force`` flag allows the command to succeed that would otherwise
fail if the metadata key did not exist.
Cloning Snapshots
-----------------
Subvolumes can be created by cloning subvolume snapshots. Cloning is an asynchronous operation involving copying
data from a snapshot to a subvolume. Due to this bulk copy nature, cloning is currently inefficient for very huge
Subvolumes can be created by cloning subvolume snapshots. Cloning is an asynchronous operation that copies
data from a snapshot to a subvolume. Due to this bulk copying, cloning is inefficient for very large
data sets.
.. note:: Removing a snapshot (source subvolume) would fail if there are pending or in progress clone operations.
Protecting snapshots prior to cloning was a pre-requisite in the Nautilus release, and the commands to protect/unprotect
snapshots were introduced for this purpose. This pre-requisite, and hence the commands to protect/unprotect, is being
deprecated in mainline CephFS, and may be removed from a future release.
Protecting snapshots prior to cloning was a prerequisite in the Nautilus release, and the commands to protect/unprotect
snapshots were introduced for this purpose. This prerequisite, and hence the commands to protect/unprotect, is being
deprecated and may be removed from a future release.
The commands being deprecated are:
$ ceph fs subvolume snapshot protect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
$ ceph fs subvolume snapshot unprotect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
.. note:: Using the above commands would not result in an error, but they serve no useful function.
.. prompt:: bash #
.. note:: Use subvolume info command to fetch subvolume metadata regarding supported "features" to help decide if protect/unprotect of snapshots is required, based on the "snapshot-autoprotect" feature availability.
ceph fs subvolume snapshot protect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
ceph fs subvolume snapshot unprotect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
To initiate a clone operation use::
.. note:: Using the above commands will not result in an error, but they have no useful purpose.
.. note:: Use the ``subvolume info`` command to fetch subvolume metadata regarding supported ``features`` to help decide if protect/unprotect of snapshots is required, based on the availability of the ``snapshot-autoprotect`` feature.
To initiate a clone operation use:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name>
If a snapshot (source subvolume) is a part of non-default group, the group name needs to be specified as per::
If a snapshot (source subvolume) is a part of non-default group, the group name needs to be specified:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --group_name <subvol_group_name>
Cloned subvolumes can be a part of a different group than the source snapshot (by default, cloned subvolumes are created in default group). To clone to a particular group use::
Cloned subvolumes can be a part of a different group than the source snapshot (by default, cloned subvolumes are created in default group). To clone to a particular group use:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --target_group_name <subvol_group_name>
Similar to specifying a pool layout when creating a subvolume, pool layout can be specified when creating a cloned subvolume. To create a cloned subvolume with a specific pool layout use::
Similar to specifying a pool layout when creating a subvolume, pool layout can be specified when creating a cloned subvolume. To create a cloned subvolume with a specific pool layout use:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --pool_layout <pool_layout>
Configure maximum number of concurrent clones. The default is set to 4::
Configure the maximum number of concurrent clones. The default is 4:
$ ceph config set mgr mgr/volumes/max_concurrent_clones <value>
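For example, to allow up to eight concurrent clones (an illustrative value; tune it to the available CPU and I/O headroom)::

    $ ceph config set mgr mgr/volumes/max_concurrent_clones 8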
To check the status of a clone operation use::
To check the status of a clone operation use:
$ ceph fs clone status <vol_name> <clone_name> [--group_name <group_name>]
A clone can be in one of the following states:
#. `pending` : Clone operation has not started
#. `in-progress` : Clone operation is in progress
#. `complete` : Clone operation has successfully finished
#. `failed` : Clone operation has failed
#. `canceled` : Clone operation is cancelled by user
#. ``pending`` : Clone operation has not started
#. ``in-progress`` : Clone operation is in progress
#. ``complete`` : Clone operation has successfully finished
#. ``failed`` : Clone operation has failed
#. ``canceled`` : Clone operation is cancelled by user
The reason for a clone failure is shown below:
#. `errno` : error number
#. `error_msg` : failure error string
#. ``errno`` : error number
#. ``error_msg`` : failure error string
Sample output of an `in-progress` clone operation::
Here is an example of an ``in-progress`` clone::
$ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
$ ceph fs clone status cephfs clone1
@ -539,9 +538,9 @@ Sample output of an `in-progress` clone operation::
}
}
.. note:: The `failure` section will be shown only if the clone is in failed or cancelled state
.. note:: The ``failure`` section will be shown only if the clone's state is ``failed`` or ``cancelled``
Sample output of a `failed` clone operation::
Here is an example of a ``failed`` clone::
$ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
$ ceph fs clone status cephfs clone1
@ -561,11 +560,11 @@ Sample output of a `failed` clone operation::
}
}
(NOTE: since `subvol1` is in default group, `source` section in `clone status` does not include group name)
(NOTE: since ``subvol1`` is in the default group, the ``source`` object's ``clone status`` does not include the group name)
.. note:: Cloned subvolumes are accessible only after the clone operation has successfully completed.
For a successful clone operation, `clone status` would look like so::
After a successful clone operation, ``clone status`` will look like the below::
$ ceph fs clone status cephfs clone1
{
@ -574,21 +573,21 @@ For a successful clone operation, `clone status` would look like so::
}
}
or `failed` state when clone is unsuccessful.
If a clone operation is unsuccessful, the ``state`` value will be ``failed``.
On failure of a clone operation, the partial clone needs to be deleted and the clone operation needs to be retriggered.
To retry a failed clone operation, the incomplete clone must be deleted and the clone operation must be issued again.
To delete a partial clone use::
$ ceph fs subvolume rm <vol_name> <clone_name> [--group_name <group_name>] --force
.. note:: Cloning only synchronizes directories, regular files and symbolic links. Also, inode timestamps (access and
.. note:: Cloning synchronizes only directories, regular files and symbolic links. Inode timestamps (access and
modification times) are synchronized up to seconds granularity.
An `in-progress` or a `pending` clone operation can be canceled. To cancel a clone operation use the `clone cancel` command::
An ``in-progress`` or a ``pending`` clone operation may be canceled. To cancel a clone operation use the ``clone cancel`` command:
$ ceph fs clone cancel <vol_name> <clone_name> [--group_name <group_name>]
On successful cancellation, the cloned subvolume is moved to `canceled` state::
On successful cancellation, the cloned subvolume is moved to the ``canceled`` state::
$ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
$ ceph fs clone cancel cephfs clone1
@ -604,7 +603,7 @@ On successful cancellation, the cloned subvolume is moved to `canceled` state::
}
}
.. note:: The canceled cloned can be deleted by using --force option in `fs subvolume rm` command.
.. note:: The canceled clone may be deleted by supplying the ``--force`` option to the ``fs subvolume rm`` command.
.. _subvol-pinning:
@ -612,17 +611,16 @@ On successful cancellation, the cloned subvolume is moved to `canceled` state::
Pinning Subvolumes and Subvolume Groups
---------------------------------------
Subvolumes and subvolume groups can be automatically pinned to ranks according
to policies. This can help distribute load across MDS ranks in predictable and
Subvolumes and subvolume groups may be automatically pinned to ranks according
to policies. This can distribute load across MDS ranks in predictable and
stable ways. Review :ref:`cephfs-pinning` and :ref:`cephfs-ephemeral-pinning`
for details on how pinning works.
Pinning is configured by::
Pinning is configured by:
$ ceph fs subvolumegroup pin <vol_name> <group_name> <pin_type> <pin_setting>
or for subvolumes::
or for subvolumes:
$ ceph fs subvolume pin <vol_name> <group_name> <pin_type> <pin_setting>
@ -631,7 +629,7 @@ one of ``export``, ``distributed``, or ``random``. The ``pin_setting``
corresponds to the extended attributed "value" as in the pinning documentation
referenced above.
So, for example, setting a distributed pinning strategy on a subvolume group::
So, for example, setting a distributed pinning strategy on a subvolume group:
$ ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1

View File

@ -130,7 +130,9 @@ other daemons, please see :ref:`health-checks`.
from properly cleaning up resources used by client requests. This message
appears if a client appears to have more than ``max_completed_requests``
(default 100000) requests that are complete on the MDS side but haven't
yet been accounted for in the client's *oldest tid* value.
yet been accounted for in the client's *oldest tid* value. The last tid
used by the MDS to trim completed client requests (or flush) is included
in the output of the ``session ls`` (or ``client ls``) command, as a debugging aid.
``MDS_DAMAGE``
--------------

View File

@ -57,6 +57,8 @@
.. confval:: mds_kill_import_at
.. confval:: mds_kill_link_at
.. confval:: mds_kill_rename_at
.. confval:: mds_inject_skip_replaying_inotable
.. confval:: mds_kill_skip_replaying_inotable
.. confval:: mds_wipe_sessions
.. confval:: mds_wipe_ino_prealloc
.. confval:: mds_skip_ino

View File

@ -225,3 +225,17 @@ For the reverse situation:
The ``home/patrick`` directory and its children will be pinned to rank 2
because its export pin overrides the policy on ``home``.
To remove a partitioning policy, remove the respective extended attribute
or set the value to 0.
.. code:: bash
$ setfattr -n ceph.dir.pin.distributed -v 0 home
# or
$ setfattr -x ceph.dir.pin.distributed home
For export pins, remove the extended attribute or set the extended attribute
value to `-1`.
.. code:: bash
$ setfattr -n ceph.dir.pin -v -1 home

View File

@ -56,6 +56,18 @@ in the sample conf. There are options to do the following:
- enable read delegations (need at least v13.0.1 ``libcephfs2`` package
and v2.6.0 stable ``nfs-ganesha`` and ``nfs-ganesha-ceph`` packages)
.. important::
Under certain conditions, NFS access using the CephFS FSAL fails. This
causes an error to be thrown that reads "Input/output error". Under these
circumstances, the application metadata must be set for the CephFS metadata
and CephFS data pools. Do this by running the following command:
.. prompt:: bash $
ceph osd pool application set <cephfs_metadata_pool> cephfs <cephfs_data_pool> cephfs
Configuration for libcephfs clients
-----------------------------------

View File

@ -143,3 +143,14 @@ The types of damage that can be reported and repaired by File System Scrub are:
* BACKTRACE : Inode's backtrace in the data pool is corrupted.
Evaluate strays using recursive scrub
=====================================
- To evaluate strays, i.e. to purge stray directories in ``~mdsdir``, use the following command::
ceph tell mds.<fsname>:0 scrub start ~mdsdir recursive
- ``~mdsdir`` is not enqueued by default when scrubbing at the CephFS root. In order to perform stray evaluation
at root, run scrub with flags ``scrub_mdsdir`` and ``recursive``::
ceph tell mds.<fsname>:0 scrub start / recursive,scrub_mdsdir

View File

@ -142,6 +142,24 @@ Examples::
ceph fs snap-schedule retention add / 24h4w # add 24 hourly and 4 weekly to retention
ceph fs snap-schedule retention remove / 7d4w # remove 7 daily and 4 weekly, leaves 24 hourly
.. note:: When adding a path to snap-schedule, remember to strip off the mount
point path prefix. Paths to snap-schedule should start at the appropriate
CephFS file system root and not at the host file system root.
For example, if the Ceph File System is mounted at ``/mnt`` and the path under which
snapshots need to be taken is ``/mnt/some/path``, then the actual path required
by snap-schedule is only ``/some/path``, as shown in the example below.
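For example, using the illustrative path above, an hourly schedule would be added with::

    ceph fs snap-schedule add /some/path 1h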
.. note:: The "created" field in the snap-schedule status command output is the
timestamp at which the schedule was created. The "created" timestamp has nothing
to do with the creation of actual snapshots. Actual snapshot creation is
accounted for in the "created_count" field, which is a cumulative count of the
total number of snapshots created so far.
.. note:: The maximum number of snapshots to retain per directory is limited by the
config tunable ``mds_max_snaps_per_dir``, which defaults to 100.
To ensure that a new snapshot can always be created, one snapshot less than this
is retained, so by default a maximum of 99 snapshots is retained. See the example
below for raising the limit.
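For example, the limit could be raised with an illustrative value of 150::

    ceph config set mds mds_max_snaps_per_dir 150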
Active and inactive schedules
-----------------------------
Snapshot schedules can be added for a path that doesn't exist yet in the

View File

@ -21,6 +21,133 @@ We can get hints about what's going on by dumping the MDS cache ::
If high logging levels are set on the MDS, that will almost certainly hold the
information we need to diagnose and solve the issue.
Stuck during recovery
=====================
Stuck in up:replay
------------------
If your MDS is stuck in ``up:replay`` then it is likely that the journal is
very long. Did you see ``MDS_HEALTH_TRIM`` cluster warnings saying the MDS is
behind on trimming its journal? If the journal has grown very large, it can
take hours to read the journal. There is no working around this but there
are things you can do to speed things along:
Reduce MDS debugging to 0. Even at the default settings, the MDS logs some
messages to memory for dumping if a fatal error is encountered. You can avoid
this:
.. code:: bash
ceph config set mds debug_mds 0
ceph config set mds debug_ms 0
ceph config set mds debug_monc 0
Note if the MDS fails then there will be virtually no information to determine
why. If you can calculate when ``up:replay`` will complete, you should restore
these configs just prior to entering the next state:
.. code:: bash
ceph config rm mds debug_mds
ceph config rm mds debug_ms
ceph config rm mds debug_monc
Once you've got replay moving along faster, you can calculate when the MDS will
complete. This is done by examining the journal replay status:
.. code:: bash
$ ceph tell mds.<fs_name>:0 status | jq .replay_status
{
"journal_read_pos": 4195244,
"journal_write_pos": 4195244,
"journal_expire_pos": 4194304,
"num_events": 2,
"num_segments": 2
}
Replay completes when the ``journal_read_pos`` reaches the
``journal_write_pos``. The write position will not change during replay. Track
the progression of the read position to compute the expected time to complete.
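As a rough sketch (assuming ``jq`` is installed, rank 0 is the replaying rank, and some progress is made between the two samples), the remaining time can be estimated by sampling the read position twice:

.. code:: bash

   # Sample the journal read position twice, 60 seconds apart, then
   # extrapolate how long the remaining journal will take to replay.
   READ1=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
   sleep 60
   READ2=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
   WRITE=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_write_pos)
   RATE=$(( (READ2 - READ1) / 60 ))   # bytes replayed per second (assumed > 0)
   echo "estimated seconds remaining: $(( (WRITE - READ2) / RATE ))"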
Avoiding recovery roadblocks
----------------------------
When trying to urgently restore your file system during an outage, here are some
things to do:
* **Deny all reconnect to clients.** This effectively blocklists all existing
CephFS sessions so all mounts will hang or become unavailable.
.. code:: bash
ceph config set mds mds_deny_all_reconnect true
Remember to undo this after the MDS becomes active.
.. note:: This does not prevent new sessions from connecting. For that, see the ``refuse_client_session`` file system setting.
* **Extend the MDS heartbeat grace period**. This avoids replacing an MDS that appears
"stuck" doing some operation. Sometimes recovery of an MDS may involve an
operation that may take longer than expected (from the programmer's
perspective). This is more likely when recovery is already taking a longer than
normal amount of time to complete (indicated by your reading this document).
Avoid unnecessary replacement loops by extending the heartbeat grace period:
.. code:: bash
ceph config set mds mds_heartbeat_reset_grace 3600
This has the effect of having the MDS continue to send beacons to the monitors
even when its internal "heartbeat" mechanism has not been reset (beat) in one
hour. Note the previous mechanism for achieving this was via the
`mds_beacon_grace` monitor setting.
* **Disable open file table prefetch.** Normally, the MDS will prefetch
directory contents during recovery to heat up its cache. During long
recovery, the cache is probably already hot **and large**. So this behavior
can be undesirable. Disable using:
.. code:: bash
ceph config set mds mds_oft_prefetch_dirfrags false
* **Turn off clients.** Clients reconnecting to the newly ``up:active`` MDS may
cause new load on the file system when it's just getting back on its feet.
There will likely be some general maintenance to do before workloads should be
resumed. For example, expediting journal trim may be advisable if the recovery
took a long time because replay was reading an overly large journal.
You can do this manually or use the new file system tunable:
.. code:: bash
ceph fs set <fs_name> refuse_client_session true
That prevents any clients from establishing new sessions with the MDS.
Expediting MDS journal trim
===========================
If your MDS journal grew too large (maybe your MDS was stuck in up:replay for a
long time!), you will want to have the MDS trim its journal more frequently.
You will know the journal is too large because of ``MDS_HEALTH_TRIM`` warnings.
The main tunable available to do this is to modify the MDS tick interval. The
"tick" interval drives several upkeep activities in the MDS. It is strongly
recommended no significant file system load be present when modifying this tick
interval. This setting only affects an MDS in ``up:active``. The MDS does not
trim its journal during recovery.
.. code:: bash
ceph config set mds mds_tick_interval 2
RADOS Health
============
@ -188,6 +315,98 @@ You can enable dynamic debug against the CephFS module.
Please see: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh
In-memory Log Dump
==================
In-memory logs can be dumped by setting ``mds_extraordinary_events_dump_interval``
when debugging at a lower log level (log level < 10). ``mds_extraordinary_events_dump_interval``
is the interval in seconds for dumping the recent in-memory logs when there is an Extra-Ordinary event.
The Extra-Ordinary events are classified as:
* Client Eviction
* Missed Beacon ACK from the monitors
* Missed Internal Heartbeats
In-memory Log Dump is disabled by default to prevent log file bloat in a production environment.
The following commands, run in sequence, enable it::
$ ceph config set mds debug_mds <log_level>/<gather_level>
$ ceph config set mds mds_extraordinary_events_dump_interval <seconds>
The ``log_level`` should be < 10 and ``gather_level`` should be >= 10 to enable in-memory log dump.
When it is enabled, the MDS checks for the extra-ordinary events every
``mds_extraordinary_events_dump_interval`` seconds and, if any of them occurs, the MDS dumps the
in-memory logs containing the relevant event details to the ceph-mds log.
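For example, the dump might be enabled with illustrative values (a debug level of 1 with a gather level of 20, checking every 60 seconds; tune these to your environment)::

    $ ceph config set mds debug_mds 1/20
    $ ceph config set mds mds_extraordinary_events_dump_interval 60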
.. note:: For higher log levels (log_level >= 10) there is no reason to dump the in-memory logs, and a
lower gather level (gather_level < 10) is insufficient to gather them. Thus a
log level >= 10 or a gather level < 10 in debug_mds prevents enabling the in-memory log dump.
In such cases, when there is a failure, reset the value of
``mds_extraordinary_events_dump_interval`` to 0 before enabling the dump using the above commands.
The In-memory Log Dump can be disabled using::
$ ceph config set mds mds_extraordinary_events_dump_interval 0
Filesystems Become Inaccessible After an Upgrade
================================================
.. note::
You can avoid ``operation not permitted`` errors by running this procedure
before an upgrade. As of May 2023, it seems that ``operation not permitted``
errors of the kind discussed here occur after upgrades after Nautilus
(inclusive).
IF
you have CephFS file systems that have data and metadata pools that were
created by a ``ceph fs new`` command (meaning that they were not created
with the defaults)
OR
you have an existing CephFS file system and are upgrading to a new post-Nautilus
major version of Ceph
THEN
in order for the documented ``ceph fs authorize...`` commands to function as
documented (and to avoid 'operation not permitted' errors when doing file I/O
or similar security-related problems for all users except the ``client.admin``
user), you must first run:
.. prompt:: bash $
ceph osd pool application set <your metadata pool name> cephfs metadata <your ceph fs filesystem name>
and
.. prompt:: bash $
ceph osd pool application set <your data pool name> cephfs data <your ceph fs filesystem name>
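For example, for a hypothetical file system named ``myfs`` whose pools use the newer default names, the commands might look like this (illustrative names only; substitute your actual pool and file system names):

.. prompt:: bash $

   ceph osd pool application set myfs.meta cephfs metadata myfs
   ceph osd pool application set myfs.data cephfs data myfs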
Otherwise, when the OSDs receive a request to read or write data (not the
directory info, but file data) they will not know which Ceph file system name
to look up. This is true also of pool names, because the 'defaults' themselves
changed in the major releases, from::
data pool=fsname
metadata pool=fsname_metadata
to::
data pool=fsname.data and
metadata pool=fsname.meta
Any setup that used ``client.admin`` for all mounts did not run into this
problem, because the admin key gave blanket permissions.
A temporary fix involves changing mount requests to the 'client.admin' user and
its associated key. A less drastic but only partial fix is to change the osd cap for
your user to just ``caps osd = "allow rw"`` and delete ``tag cephfs
data=....``
Reporting Issues
================

View File

@ -2,38 +2,44 @@
CephFS Mirroring
================
CephFS supports asynchronous replication of snapshots to a remote CephFS file system via
`cephfs-mirror` tool. Snapshots are synchronized by mirroring snapshot data followed by
creating a snapshot with the same name (for a given directory on the remote file system) as
the snapshot being synchronized.
CephFS supports asynchronous replication of snapshots to a remote CephFS file
system via `cephfs-mirror` tool. Snapshots are synchronized by mirroring
snapshot data followed by creating a snapshot with the same name (for a given
directory on the remote file system) as the snapshot being synchronized.
Requirements
------------
The primary (local) and secondary (remote) Ceph clusters version should be Pacific or later.
The primary (local) and secondary (remote) Ceph clusters version should be
Pacific or later.
Key Idea
--------
For a given snapshot pair in a directory, `cephfs-mirror` daemon will rely on readdir diff
to identify changes in a directory tree. The diffs are applied to directory in the remote
file system thereby only synchronizing files that have changed between two snapshots.
For a given snapshot pair in a directory, `cephfs-mirror` daemon will rely on
readdir diff to identify changes in a directory tree. The diffs are applied to
directory in the remote file system thereby only synchronizing files that have
changed between two snapshots.
This feature is tracked here: https://tracker.ceph.com/issues/47034.
Currently, snapshot data is synchronized by bulk copying to the remote filesystem.
Currently, snapshot data is synchronized by bulk copying to the remote
filesystem.
.. note:: Synchronizing hardlinks is not supported -- hardlinked files get synchronized
as separate files.
.. note:: Synchronizing hardlinks is not supported -- hardlinked files get
synchronized as separate files.
Creating Users
--------------
Start by creating a user (on the primary/local cluster) for the mirror daemon. This user
requires write capability on the metadata pool to create RADOS objects (index objects)
for watch/notify operation and read capability on the data pool(s).
Start by creating a user (on the primary/local cluster) for the mirror daemon.
This user requires write capability on the metadata pool to create RADOS
objects (index objects) for watch/notify operation and read capability on the
data pool(s).
$ ceph auth get-or-create client.mirror mon 'profile cephfs-mirror' mds 'allow r' osd 'allow rw tag cephfs metadata=*, allow r tag cephfs data=*' mgr 'allow r'
.. prompt:: bash $
ceph auth get-or-create client.mirror mon 'profile cephfs-mirror' mds 'allow r' osd 'allow rw tag cephfs metadata=*, allow r tag cephfs data=*' mgr 'allow r'
Create a user for each file system peer (on the secondary/remote cluster). This user needs
to have full capabilities on the MDS (to take snapshots) and the OSDs::
@ -371,7 +377,7 @@ information. To check which mirror daemon a directory has been mapped to use::
"state": "mapped"
}
.. note:: `instance_id` is the RAODS instance-id associated with a mirror daemon.
.. note:: `instance_id` is the RADOS instance-id associated with a mirror daemon.
Other information such as `state` and `last_shuffled` are interesting when running
multiple mirror daemons.

View File

@ -0,0 +1,426 @@
===============
Deduplication
===============
Introduction
============
Applying data deduplication to an existing software stack is not easy,
because it requires additional metadata management and changes to the
original data processing procedure.
In a typical deduplication system, the input source as a data
object is split into multiple chunks by a chunking algorithm.
The deduplication system then compares each chunk with
the existing chunks already stored in the underlying storage.
To avoid searching all stored contents for every comparison,
the deduplication system employs a fingerprint index that stores
the hash value of each chunk, so that existing chunks can be
found simply by comparing hash values.
There are many challenges in implementing deduplication on top
of Ceph. Two of them are essential: first, managing the scalability of
the fingerprint index; second, ensuring compatibility between the newly
introduced deduplication metadata and the existing metadata.
Key Idea
========
1. Content hashing (double hashing): Each client can find an object's data
for an object ID using CRUSH. With CRUSH, a client knows the object's location
in the Base tier.
By hashing the object's content at the Base tier, a new OID (chunk ID) is generated.
The Chunk tier stores the chunk under the new OID, which holds part of the original object's content.
Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
CRUSH(K) -> chunk's location
2. Self-contained object: An external metadata design
makes integration with existing storage features difficult,
since those features cannot recognize the
additional external data structures. If the data
deduplication system is designed without any external component, the
original storage features can be reused.
More details in https://ieeexplore.ieee.org/document/8416369
Design
======
.. ditaa::
+-------------+
| Ceph Client |
+------+------+
^
Tiering is |
Transparent | Metadata
to Ceph | +---------------+
Client Ops | | |
| +----->+ Base Pool |
| | | |
| | +-----+---+-----+
| | | ^
v v | | Dedup metadata in Base Pool
+------+----+--+ | | (Dedup metadata contains chunk offsets
| Objecter | | | and fingerprints)
+-----------+--+ | |
^ | | Data in Chunk Pool
| v |
| +-----+---+-----+
| | |
+----->| Chunk Pool |
| |
+---------------+
Data
Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since these two pools are divided by
purpose and usage, each pool can be managed more
efficiently according to its own characteristics. The base
pool and the chunk pool can separately select a redundancy
scheme (replication or erasure coding) depending on
their usage, and each pool can be placed in a different storage
location depending on the required performance.

For details on how to use this, please see ``osd_internals/manifest.rst``
Usage Patterns
==============
Each Ceph interface layer presents unique opportunities and costs for
deduplication and tiering in general.
RadosGW
-------
S3 big data workloads seem like a good opportunity for deduplication. These
objects tend to be write once, read mostly objects which don't see partial
overwrites. As such, it makes sense to fingerprint and dedup up front.
Unlike cephfs and rbd, radosgw has a system for storing
explicit metadata in the head object of a logical s3 object for
locating the remaining pieces. As such, radosgw could use the
refcounting machinery (``osd_internals/refcount.rst``) directly without
needing direct support from rados for manifests.
RBD/Cephfs
----------
RBD and CephFS both use deterministic naming schemes to partition
block devices/file data over rados objects. As such, the redirection
metadata would need to be included as part of rados, presumably
transparently.
Moreover, unlike radosgw, rbd/cephfs rados objects can see overwrites.
For those objects, we don't really want to perform dedup, and we don't
want to pay a write latency penalty in the hot path to do so anyway.
As such, performing tiering and dedup on cold objects in the background
is likely to be preferred.
One important wrinkle, however, is that both rbd and cephfs workloads
often feature usage of snapshots. This means that the rados manifest
support needs robust support for snapshots.
RADOS Machinery
===============
For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
For more information on rados refcount support, see ``osd_internals/refcount.rst``.
Status and Future Work
======================
At the moment, there exists some preliminary support for manifest
objects within the OSD as well as a dedup tool.
RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add
direct support for fingerprinting and redirects into the refcount pool
to radosgw.
Aside from radosgw, completing work on manifest object support in the
OSD particularly as it relates to snapshots would be the next step for
rbd and cephfs workloads.
How to use deduplication
========================
* This feature is highly experimental and is subject to change or removal.
Ceph provides deduplication using RADOS machinery.
Below we explain how to perform deduplication.
Prerequisite
------------
If the Ceph cluster is started from Ceph mainline, users need to check that the
``ceph-test`` package, which includes ceph-dedup-tool, is installed.
Detailed Instructions
---------------------
Users can use ceph-dedup-tool with the ``estimate``, ``sample-dedup``,
``chunk-scrub``, and ``chunk-repair`` operations. For user convenience,
the necessary operations are exposed through ceph-dedup-tool, and they
can be driven freely from any kind of script.
1. Estimate space saving ratio of a target pool using ``ceph-dedup-tool``.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
ceph-dedup-tool --op estimate
--pool [BASE_POOL]
--chunk-size [CHUNK_SIZE]
--chunk-algorithm [fixed|fastcdc]
--fingerprint-algorithm [sha1|sha256|sha512]
--max-thread [THREAD_COUNT]
This CLI command will show how much storage space can be saved when deduplication
is applied to the pool. If the amount of saved space is higher than the user's expectation,
the pool is probably a good candidate for deduplication.

Users should specify the ``BASE_POOL``, within which the objects targeted for deduplication
are stored. Users also need to run ceph-dedup-tool multiple times
with varying ``chunk_size`` values to find the optimal chunk size. Note that the
optimal value probably differs depending on the content of each object when the fastcdc
chunk algorithm (rather than fixed) is used.
Example output:
.. code:: bash
{
"chunk_algo": "fastcdc",
"chunk_sizes": [
{
"target_chunk_size": 8192,
"dedup_bytes_ratio": 0.4897049
"dedup_object_ratio": 34.567315
"chunk_size_average": 64439,
"chunk_size_stddev": 33620
}
],
"summary": {
"examined_objects": 95,
"examined_bytes": 214968649
}
}
The above is an example output when executing ``estimate``. ``target_chunk_size`` is the same as
``chunk_size`` given by the user. ``dedup_bytes_ratio`` shows how many bytes are redundant among the
examined bytes. For instance, 1 - ``dedup_bytes_ratio`` is the fraction of storage space saved.
``dedup_object_ratio`` is the number of generated chunk objects / ``examined_objects``. ``chunk_size_average``
is the average chunk size produced when performing CDC---this may differ from ``target_chunk_size``
because CDC generates different chunk boundaries depending on the content. ``chunk_size_stddev``
represents the standard deviation of the chunk size.
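As a rough illustration using the sample output above and the formula just described, 1 - 0.4897 ≈ 51% of the 214,968,649 examined bytes (roughly 110 MB) would be the expected space saving for the 8192-byte target chunk size.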
2. Create chunk pool.
^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
ceph osd pool create [CHUNK_POOL]
3. Run dedup command (there are two ways).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- **sample-dedup**
.. code:: bash
ceph-dedup-tool --op sample-dedup
--pool [BASE_POOL]
--chunk-pool [CHUNK_POOL]
--chunk-size [CHUNK_SIZE]
--chunk-algorithm [fastcdc]
--fingerprint-algorithm [sha1|sha256|sha512]
--chunk-dedup-threshold [THRESHOLD]
--max-thread [THREAD_COUNT]
--sampling-ratio [SAMPLE_RATIO]
--wakeup-period [WAKEUP_PERIOD]
--loop
--snap
The ``sample-dedup`` command spawns ``THREAD_COUNT`` threads to deduplicate objects in
``BASE_POOL``. The threads sample objects according to ``SAMPLE_RATIO`` (a value of 100 performs a full search)
and deduplicate a chunk only if it is found to be redundant more than ``THRESHOLD`` times during the iteration.
If ``--loop`` is set, the threads wake up again after ``WAKEUP_PERIOD``; if not, the threads exit after one iteration.
Example output:
.. code:: bash
$ bin/ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 303 GiB 294 GiB 9.0 GiB 9.0 GiB 2.99
TOTAL 303 GiB 294 GiB 9.0 GiB 9.0 GiB 2.99
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 577 KiB 2 1.7 MiB 0 97 GiB
base 2 32 2.0 GiB 517 6.0 GiB 2.02 97 GiB
chunk 3 32 0 B 0 0 B 0 97 GiB
$ bin/ceph-dedup-tool --op sample-dedup --pool base --chunk-pool chunk
--fingerprint-algorithm sha1 --chunk-algorithm fastcdc --loop --sampling-ratio 100
--chunk-dedup-threshold 2 --chunk-size 8192 --max-thread 4 --wakeup-period 60
$ bin/ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 303 GiB 298 GiB 5.4 GiB 5.4 GiB 1.80
TOTAL 303 GiB 298 GiB 5.4 GiB 5.4 GiB 1.80
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 577 KiB 2 1.7 MiB 0 98 GiB
base 2 32 452 MiB 262 1.3 GiB 0.50 98 GiB
chunk 3 32 258 MiB 25.91k 938 MiB 0.31 98 GiB
- **object dedup**
.. code:: bash
ceph-dedup-tool --op object-dedup
--pool [BASE_POOL]
--object [OID]
--chunk-pool [CHUNK_POOL]
--fingerprint-algorithm [sha1|sha256|sha512]
--dedup-cdc-chunk-size [CHUNK_SIZE]
The ``object-dedup`` command triggers deduplication on the RADOS object specified by ``OID``.
All parameters shown above must be specified. ``CHUNK_SIZE`` should be taken from
the results of step 1 above.
Note that when this command is executed, ``fastcdc`` is set by default, and other parameters
such as the ``fingerprint-algorithm`` and ``CHUNK_SIZE`` are set as defaults for the pool.
Deduplicated objects will appear in the chunk pool. If the object is mutated over time, the user needs to re-run
``object-dedup`` because the chunk boundaries must be recalculated from the updated contents.
The user needs to specify ``snap`` if the target object is snapshotted. After deduplication is done, the target
object's size in ``BASE_POOL`` is zero (the object is evicted) and chunk objects are generated---these appear in ``CHUNK_POOL``.
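As a sketch, deduplicating a single object named ``testfile2`` in the ``base`` pool with the
chunk size chosen in step 1 (object, pool, and chunk-size values are illustrative) could look like:
.. code:: bash
   ceph-dedup-tool --op object-dedup \
     --pool base \
     --object testfile2 \
     --chunk-pool chunk \
     --fingerprint-algorithm sha1 \
     --dedup-cdc-chunk-size 8192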
4. Read/write I/Os
^^^^^^^^^^^^^^^^^^
After step 3, users do not need to do anything special for I/O. Deduplicated objects are
fully compatible with existing RADOS operations.
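For example, a plain ``rados`` read of a deduplicated object works exactly as it does for
any other object (pool and object names follow the earlier examples):
.. code:: bash
   rados -p base get testfile2 /tmp/testfile2.out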
5. Run scrub to fix reference count
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
On rare occasions, false positives in reference-count handling for deduplicated RADOS
objects can leave reference mismatches. These mismatches can be fixed by periodically scrubbing the pool:
.. code:: bash
ceph-dedup-tool --op chunk-scrub
--chunk-pool [CHUNK_POOL]
--pool [POOL]
--max-thread [THREAD_COUNT]
The ``chunk-scrub`` command identifies reference mismatches between a
metadata object and a chunk object. The ``chunk-pool`` parameter tells
ceph-dedup-tool where the target chunk objects are located.
Example output:
A reference mismatch is intentionally created by inserting a reference (dummy-obj) into a chunk object (2ac67f70d3dd187f8f332bb1391f61d4e5c9baae) by using chunk-get-ref.
.. code:: bash
$ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
{
"type": "by_object",
"count": 2,
"refs": [
{
"oid": "testfile2",
"key": "",
"snapid": -2,
"hash": 2905889452,
"max": 0,
"pool": 2,
"namespace": ""
},
{
"oid": "dummy-obj",
"key": "",
"snapid": -2,
"hash": 1203585162,
"max": 0,
"pool": 2,
"namespace": ""
}
]
}
$ bin/ceph-dedup-tool --op chunk-scrub --chunk-pool chunk --max-thread 10
10 seconds is set as report period by default
join
join
2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
--done--
2ac67f70d3dd187f8f332bb1391f61d4e5c9baae ref 10:5102bde2:::dummy-obj:head: referencing pool does not exist
--done--
Total object : 1
Examined object : 1
Damaged object : 1
6. Repair a mismatched chunk reference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any reference mismatches are found by ``chunk-scrub``, it is
recommended to run the ``chunk-repair`` operation, which resolves the
mismatch and restores consistency.
.. code:: bash
ceph-dedup-tool --op chunk-repair
--chunk-pool [CHUNK_POOL_NAME]
--object [CHUNK_OID]
--target-ref [TARGET_OID]
--target-ref-pool-id [TARGET_POOL_ID]
``chunk-repair`` fixes the ``target-ref``, which is an incorrect reference held by
an ``object``. To fix it correctly, the user must supply the correct
``TARGET_OID`` and ``TARGET_POOL_ID``.
.. code:: bash
$ bin/ceph-dedup-tool --op chunk-repair --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae --target-ref dummy-obj --target-ref-pool-id 10
2ac67f70d3dd187f8f332bb1391f61d4e5c9baae has 1 references for dummy-obj
dummy-obj has 0 references for 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
fix dangling reference from 1 to 0
$ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
{
"type": "by_object",
"count": 1,
"refs": [
{
"oid": "testfile2",
"key": "",
"snapid": -2,
"hash": 2905889452,
"max": 0,
"pool": 2,
"namespace": ""
}
]
}

View File

@ -1,3 +1,5 @@
.. _dev_deploying_a_development_cluster:
=================================
Deploying a development cluster
=================================

View File

@ -50,11 +50,10 @@ optional Ceph internal services are started automatically when it is used to
start a Ceph cluster. vstart is the basis for the three most commonly used
development environments in Ceph Dashboard.
You can read more about vstart in `Deploying a development cluster`_.
Additional information for developers can also be found in the `Developer
Guide`_.
You can read more about vstart in :ref:`Deploying a development cluster
<dev_deploying_a_development_cluster>`. Additional information for developers
can also be found in the `Developer Guide`_.
.. _Deploying a development cluster: https://docs.ceph.com/docs/master/dev/dev_cluster_deployement/
.. _Developer Guide: https://docs.ceph.com/docs/master/dev/quick_guide/
Host-based vs Docker-based Development Environments
@ -1269,7 +1268,6 @@ Tests can be found under the `a11y folder <./src/pybind/mgr/dashboard/frontend/c
beforeEach(() => {
cy.login();
Cypress.Cookies.preserveOnce('token');
shared.navigateTo();
});

View File

@ -55,7 +55,7 @@ using `vstart_runner.py`_. To do that, you'd need `teuthology`_ installed::
$ virtualenv --python=python3 venv
$ source venv/bin/activate
$ pip install 'setuptools >= 12'
$ pip install git+https://github.com/ceph/teuthology#egg=teuthology[test]
$ pip install teuthology[test]@git+https://github.com/ceph/teuthology
$ deactivate
The above steps installs teuthology in a virtual environment. Before running

View File

@ -3,9 +3,74 @@ Serialization (encode/decode)
=============================
When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Serializable structures have
``encode`` and ``decode`` methods that write and read from ``bufferlist``
objects representing byte strings.
encoded into a string of bytes. Usually (but not always -- multiple
serialization facilities coexist in Ceph) serializable structures
have ``encode`` and ``decode`` methods that write and read from
``bufferlist`` objects representing byte strings.
Terminology
-----------
It is best to think not in terms of daemons and clients but in terms of
encoders and decoders. An encoder serializes a structure into a bufferlist
while a decoder does the opposite.
Encoders and decoders are collectively referred to as dencoders.
Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder is used by the client and the decoder by the OSD. However,
these roles can swap -- consider the handling of the response: the OSD encodes
the ``MOSDOpReply`` while the RBD client decodes it.
Encoders and decoders operate according to a format that the programmer
defines by implementing the ``encode`` and ``decode`` methods.
Principles for format change
----------------------------
It is not unusual for the serialization format to change. This
process requires careful attention during both development
and review.
The general rule is that a decoder must understand what was
encoded by an encoder. Most of the problems come from ensuring
that compatibility is maintained between old decoders and new encoders
as well as between new decoders and old encoders. One should assume
that -- unless stated otherwise -- any mix of old and new is
possible in a cluster. There are two main reasons for that:
1. Upgrades. Although there are recommendations related to the order
of entity types (mons/osds/clients), it is not mandatory and
no assumption should be made about it.
2. Huge variability of client versions. Kernel (and thus kernel client)
upgrades have always been decoupled from Ceph upgrades. Moreover,
the proliferation of containerization brings this variability even to
user-space libraries such as ``librbd`` -- they now live inside their
own containers.
That said, there are a few rules limiting the degree
of interoperability between dencoders:
* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client-involved scenarios,
* ``n-3..`` soft requirement for client-involved scenarios. Ideally,
every client should be able to talk to any version of the daemons.
As the underlying reasons are the same, the rules dencoders
follow are virtually the same as for deprecation of our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.
Frameworks
----------
Currently, multiple genres of dencoding helpers coexist:
* ``encoding.h`` (the most widespread one),
* ``denc.h`` (performance-optimized, seen mostly in ``BlueStore``),
* the ``Message`` hierarchy.
Although the details vary, the interoperability rules stay the same.
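Regardless of the framework used, encodings can be inspected with the
``ceph-dencoder`` utility that ships with Ceph. A rough sketch of such a session
(the type name and file path below are only illustrative):
.. code-block:: bash
   # list the structures ceph-dencoder knows how to decode
   ceph-dencoder list_types
   # decode a previously captured encoding of an object_info_t and dump it as JSON
   ceph-dencoder type object_info_t import /tmp/object_info.bin decode dump_json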
Adding a field to a structure
-----------------------------
@ -93,3 +158,69 @@ because we might still be passed older-versioned messages that do not
have the field. The ``struct_v`` variable is a local set by the ``DECODE_START``
macro.
Into the weeds
--------------
The append-extensibility of our dencoders is a result of the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros bring.
They implement the extensibility facilities. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.
.. code-block:: cpp
/**
* start encoding block
*
* @param v current (code) version of the encoding
* @param compat oldest code version that can decode it
* @param bl bufferlist to encode to
*
*/
#define ENCODE_START(v, compat, bl) \
__u8 struct_v = v; \
__u8 struct_compat = compat; \
ceph_le32 struct_len; \
auto filler = (bl).append_hole(sizeof(struct_v) + \
sizeof(struct_compat) + sizeof(struct_len)); \
const auto starting_bl_len = (bl).length(); \
using ::ceph::encode; \
do {
The ``struct_len`` field allows the decoder to consume all the bytes that were
left undecoded by the user-provided ``decode`` implementation.
Analogously, the decoder tracks how much input has been consumed by the
user-provided ``decode`` methods.
.. code-block:: cpp
#define DECODE_START(bl) \
unsigned struct_end = 0; \
__u32 struct_len; \
decode(struct_len, bl); \
... \
struct_end = bl.get_off() + struct_len; \
} \
do {
The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist is critical because dencoders tend to be
nested; leaving it intact would work only for the very last ``decode`` call
in a nested structure.
.. code-block:: cpp
#define DECODE_FINISH(bl) \
} while (false); \
if (struct_end) { \
... \
if (bl.get_off() < struct_end) \
bl += struct_end - bl.get_off(); \
}
This cooperative mechanism allows newer encoder revisions to generate a longer
byte stream (e.g. by adding a new field at the end) without worrying that the
residue will crash older decoder revisions.

View File

@ -16,32 +16,6 @@ mgr module
The following diagrams outline the involved parties and how they interact when the clients
query for the reports:
.. seqdiag::
seqdiag {
default_note_color = lightblue;
osd; mon; ceph-cli;
osd => mon [ label = "update osdmap service" ];
osd => mon [ label = "update osdmap service" ];
ceph-cli -> mon [ label = "send 'health' command" ];
mon -> mon [ leftnote = "gather checks from services" ];
ceph-cli <-- mon [ label = "checks and mutes" ];
}
.. seqdiag::
seqdiag {
default_note_color = lightblue;
osd; mon; mgr; mgr-module;
mgr -> mon [ label = "subscribe for 'mgrdigest'" ];
osd => mon [ label = "update osdmap service" ];
osd => mon [ label = "update osdmap service" ];
mon -> mgr [ label = "send MMgrDigest" ];
mgr -> mgr [ note = "update cluster state" ];
mon <-- mgr;
mgr-module -> mgr [ label = "mgr.get('health')" ];
mgr-module <-- mgr [ label = "heath reports in json" ];
}
Where are the Reports Generated
===============================
@ -68,19 +42,6 @@ later loaded and decoded, so they can be collected on demand. When it comes to
``MDSMonitor``, it persists the health metrics in the beacon sent by the MDS daemons,
and prepares health reports when storing the pending changes.
.. seqdiag::
seqdiag {
default_note_color = lightblue;
mds; mon-mds; mon-health; ceph-cli;
mds -> mon-mds [ label = "send beacon" ];
mon-mds -> mon-mds [ note = "store health metrics in beacon" ];
mds <-- mon-mds;
mon-mds -> mon-mds [ note = "encode_health(checks)" ];
ceph-cli -> mon-health [ label = "send 'health' command" ];
mon-health => mon-mds [ label = "gather health checks" ];
ceph-cli <-- mon-health [ label = "checks and mutes" ];
}
So, if we want to add a new warning related to cephfs, probably the best place to
start is ``MDSMonitor::encode_pending()``, where health reports are collected from
@ -106,23 +67,3 @@ metrics and status to mgr using ``MMgrReport``. On the mgr side, it periodically
an aggregated report to the ``MgrStatMonitor`` service on mon. As explained earlier,
this service just persists the health reports in the aggregated report to the monstore.
.. seqdiag::
seqdiag {
default_note_color = lightblue;
service; mgr; mon-mgr-stat; mon-health;
service -> mgr [ label = "send(open)" ];
mgr -> mgr [ note = "register the new service" ];
service <-- mgr;
mgr => service [ label = "send(configure)" ];
service -> mgr [ label = "send(report)" ];
mgr -> mgr [ note = "update/aggregate service metrics" ];
service <-- mgr;
service => mgr [ label = "send(report)" ];
mgr -> mon-mgr-stat [ label = "send(mgr-report)" ];
mon-mgr-stat -> mon-mgr-stat [ note = "store health checks in the report" ];
mgr <-- mon-mgr-stat;
mon-health => mon-mgr-stat [ label = "gather health checks" ];
service => mgr [ label = "send(report)" ];
service => mgr [ label = "send(close)" ];
}

View File

@ -87,7 +87,8 @@ Optionals are represented as a presence byte, followed by the item if it exists.
T element[present? 1 : 0]; // Only if present is non-zero.
}
Optionals are used to encode ``boost::optional``.
Optionals are used to encode ``boost::optional`` and, since introducing
C++17 to Ceph, ``std::optional``.
Pair
----

View File

@ -5,7 +5,7 @@ jerasure plugin
Introduction
------------
The parameters interpreted by the jerasure plugin are:
The parameters interpreted by the ``jerasure`` plugin are:
::
@ -31,3 +31,5 @@ upstream repositories `http://jerasure.org/jerasure/jerasure
`http://jerasure.org/jerasure/gf-complete
<http://jerasure.org/jerasure/gf-complete>`_ . The difference
between the two, if any, should match pull requests against upstream.
Note that as of 2023, the ``jerasure.org`` web site may no longer be
legitimate and/or associated with the original project.

View File

@ -114,29 +114,6 @@ baseline throughput for each device type was determined:
256 KiB. For HDDs, it was 40MiB. The above throughput was obtained
by running 4 KiB random writes at a queue depth of 64 for 300 secs.
Factoring I/O Cost in mClock
============================
The services using mClock have a cost associated with them. The cost can be
different for each service type. The mClock scheduler factors in the cost
during calculations for parameters like *reservation*, *weight* and *limit*.
The calculations determine when the next op for the service type can be
dequeued from the operation queue. In general, the higher the cost, the longer
an op remains in the operation queue.
A cost modeling study was performed to determine the cost per I/O and the cost
per byte for SSD and HDD device types. The following cost specific options are
used under the hood by mClock,
- :confval:`osd_mclock_cost_per_io_usec`
- :confval:`osd_mclock_cost_per_io_usec_hdd`
- :confval:`osd_mclock_cost_per_io_usec_ssd`
- :confval:`osd_mclock_cost_per_byte_usec`
- :confval:`osd_mclock_cost_per_byte_usec_hdd`
- :confval:`osd_mclock_cost_per_byte_usec_ssd`
See :doc:`/rados/configuration/mclock-config-ref` for more details.
MClock Profile Allocations
==========================

View File

@ -0,0 +1,93 @@
=============
PastIntervals
=============
Purpose
-------
There are two situations where we need to consider the set of all acting-set
OSDs for a PG back to some epoch ``e``:
* During peering, we need to consider the acting set for every epoch back to
``last_epoch_started``, the last epoch in which the PG completed peering and
became active.
(see :doc:`/dev/osd_internals/last_epoch_started` for a detailed explanation)
* During recovery, we need to consider the acting set for every epoch back to
``last_epoch_clean``, the last epoch at which all of the OSDs in the acting
set were fully recovered, and the acting set was full.
For either of these purposes, we could build such a set by iterating backwards
from the current OSDMap to the relevant epoch. Instead, we maintain a structure
PastIntervals for each PG.
An ``interval`` is a contiguous sequence of OSDMap epochs where the PG mapping
didn't change. This includes changes to the acting set, the up set, the
primary, and several other parameters fully spelled out in
PastIntervals::check_new_interval.
Maintenance and Trimming
------------------------
The PastIntervals structure stores a record for each ``interval`` back to
last_epoch_clean. On each new ``interval`` (See AdvMap reactions,
PeeringState::should_restart_peering, and PeeringState::start_peering_interval)
each OSD with the PG will add the new ``interval`` to its local PastIntervals.
Activation messages to OSDs which do not already have the PG contain the
sender's PastIntervals so that the recipient needn't rebuild it. (See
PeeringState::activate needs_past_intervals).
PastIntervals are trimmed in two places. First, when the primary marks the
PG clean, it clears its past_intervals instance
(PeeringState::try_mark_clean()). The replicas will do the same thing when
they receive the info (See PeeringState::update_history).
The second, more complex, case is in PeeringState::start_peering_interval. In
the event of a "map gap", we assume that the PG actually has gone clean, but we
haven't received a pg_info_t with the updated ``last_epoch_clean`` value yet.
To explain this behavior, we need to discuss OSDMap trimming.
OSDMap Trimming
---------------
OSDMaps are created by the Monitor quorum and gossiped out to the OSDs. The
Monitor cluster also determines when OSDs (and the Monitors) are allowed to
trim old OSDMap epochs. For the reasons explained above in this document, the
primary constraint is that we must retain all OSDMaps back to some epoch such
that all PGs have been clean at that or a later epoch (min_last_epoch_clean).
(See OSDMonitor::get_trim_to).
The Monitor quorum determines min_last_epoch_clean through MOSDBeacon messages
sent periodically by each OSD. Each message contains a set of PGs for which
the OSD is primary at that moment as well as the min_last_epoch_clean across
that set. The Monitors track these values in OSDMonitor::last_epoch_clean.
There is a subtlety in the min_last_epoch_clean value used by the OSD to
populate the MOSDBeacon. OSD::collect_pg_stats invokes PG::with_pg_stats to
obtain the lec value, which actually uses
pg_stat_t::get_effective_last_epoch_clean() rather than
info.history.last_epoch_clean. If the PG is currently clean,
pg_stat_t::get_effective_last_epoch_clean() is the current epoch rather than
last_epoch_clean -- this works because the PG is clean at that epoch and it
allows OSDMaps to be trimmed during periods where OSDMaps are being created
(due to snapshot activity, perhaps), but no PGs are undergoing ``interval``
changes.
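One way to observe these bounds on a running cluster (a sketch only; field names
and output formats can vary between releases) is to compare the monitors' current
epoch with an individual OSD's superblock state:
.. code-block:: bash
   # current OSDMap epoch as seen by the monitors
   ceph osd dump | head -n 1
   # a specific OSD's view of the oldest and newest maps it still holds
   # (see the "oldest_map" and "newest_map" fields); run on that OSD's host
   ceph daemon osd.0 status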
Back to PastIntervals
---------------------
We can now understand our second trimming case above. If OSDMaps have been
trimmed up to epoch ``e``, we know that the PG must have been clean at some epoch
>= ``e`` (indeed, **all** PGs must have been), so we can drop our PastIntervals.
This dependency also pops up in PeeringState::check_past_interval_bounds().
PeeringState::get_required_past_interval_bounds takes as a parameter
oldest_epoch, which comes from OSDSuperblock::cluster_osdmap_trim_lower_bound.
We use cluster_osdmap_trim_lower_bound rather than a specific OSD's oldest_map
because an OSD does not necessarily trim all the way up to
MOSDMap::cluster_osdmap_trim_lower_bound immediately.
In order to avoid doing too much work at once we limit the amount of osdmaps
trimmed using ``osd_target_transaction_size`` in OSD::trim_maps().
For this reason, a specific OSD's oldest_map can lag behind
OSDSuperblock::cluster_osdmap_trim_lower_bound
for a while.
See https://tracker.ceph.com/issues/49689 for an example.

View File

@ -28,8 +28,8 @@ Premier
-------
* `Bloomberg <https://bloomberg.com>`_
* `China Mobile <https://www.chinamobileltd.com/>`_
* `DigitalOcean <https://www.digitalocean.com/>`_
* `Clyso <https://www.clyso.com/en/>`_
* `IBM <https://ibm.com>`_
* `Intel <http://www.intel.com/>`_
* `OVH <https://www.ovh.com/>`_
* `Red Hat <https://www.redhat.com/>`_
@ -37,16 +37,16 @@ Premier
* `SoftIron <https://www.softiron.com/>`_
* `SUSE <https://www.suse.com/>`_
* `Western Digital <https://www.wdc.com/>`_
* `XSKY <https://www.xsky.com/en/>`_
* `ZTE <https://www.zte.com.cn/global/>`_
General
-------
* `42on <https://www.42on.com/>`_
* `Akamai <https://www.akamai.com/>`_
* `ARM <http://www.arm.com/>`_
* `Canonical <https://www.canonical.com/>`_
* `Cloudbase Solutions <https://cloudbase.it/>`_
* `Clyso <https://www.clyso.com/en/>`_
* `CloudFerro <https://cloudferro.com/>`_
* `croit <http://www.croit.io/>`_
* `EasyStack <https://www.easystack.io/>`_
* `ISS <http://iss-integration.com/>`_
@ -97,22 +97,17 @@ Members
-------
* Anjaneya "Reddy" Chagam (Intel)
* Dan van der Ster (CERN) - Associate member representative
* Haomai Wang (XSKY)
* James Page (Canonical)
* Lenz Grimmer (SUSE) - Ceph Leadership Team representative
* Lars Marowsky-Bree (SUSE)
* Carlos Maltzahn (UCSC) - Associate member representative
* Dan van der Ster (Clyso) - Ceph Council representative
* Joachim Kraftmayer (Clyso)
* Josh Durgin (IBM) - Ceph Council representative
* Matias Bjorling (Western Digital)
* Matthew Leonard (Bloomberg)
* Mike Perez (Red Hat) - Ceph community manager
* Myoungwon Oh (Samsung Electronics)
* Martin Verges (croit) - General member representative
* Pawel Sadowski (OVH)
* Phil Straw (SoftIron)
* Robin Johnson (DigitalOcean)
* Sage Weil (Red Hat) - Ceph project leader
* Xie Xingguo (ZTE)
* Zhang Shaowen (China Mobile)
* Vincent Hsu (IBM)
Joining
=======

View File

@ -12,12 +12,13 @@
:ref:`BlueStore<rados_config_storage_devices_bluestore>`
OSD BlueStore is a storage back end used by OSD daemons, and
was designed specifically for use with Ceph. BlueStore was
introduced in the Ceph Kraken release. In the Ceph Luminous
release, BlueStore became Ceph's default storage back end,
supplanting FileStore. Unlike :term:`filestore`, BlueStore
stores objects directly on Ceph block devices without any file
system interface. Since Luminous (12.2), BlueStore has been
Ceph's default and recommended storage back end.
introduced in the Ceph Kraken release. The Luminous release of
Ceph promoted BlueStore to the default OSD back end,
supplanting FileStore. As of the Reef release, FileStore is no
longer available as a storage backend.
BlueStore stores objects directly on Ceph block devices without
a mounted file system.
Bucket
In the context of :term:`RGW`, a bucket is a group of objects.
@ -187,9 +188,13 @@
applications, Ceph Users, and :term:`Ceph Client`\s. Ceph
Storage Clusters receive data from :term:`Ceph Client`\s.
cephx
The Ceph authentication protocol. Cephx operates like Kerberos,
but it has no single point of failure.
CephX
The Ceph authentication protocol. CephX authenticates users and
daemons. CephX operates like Kerberos, but it has no single
point of failure. See the :ref:`High-availability
Authentication section<arch_high_availability_authentication>`
of the Architecture document and the :ref:`CephX Configuration
Reference<rados-cephx-config-ref>`.
Client
A client is any program external to Ceph that uses a Ceph
@ -248,6 +253,9 @@
Any single machine or server in a Ceph Cluster. See :term:`Ceph
Node`.
Hybrid OSD
Refers to an OSD that has both HDD and SSD drives.
LVM tags
Extensible metadata for LVM volumes and groups. It is used to
store Ceph-specific information about devices and its
@ -302,12 +310,33 @@
state of a multi-site configuration. When the period is updated,
the "epoch" is said thereby to have been changed.
Placement Groups (PGs)
Placement groups (PGs) are subsets of each logical Ceph pool.
Placement groups perform the function of placing objects (as a
group) into OSDs. Ceph manages data internally at
placement-group granularity: this scales better than would
managing individual (and therefore more numerous) RADOS
objects. A cluster that has a larger number of placement groups
(for example, 100 per OSD) is better balanced than an otherwise
identical cluster with a smaller number of placement groups.
Ceph's internal RADOS objects are each mapped to a specific
placement group, and each placement group belongs to exactly
one Ceph pool.
:ref:`Pool<rados_pools>`
A pool is a logical partition used to store objects.
Pools
See :term:`pool`.
:ref:`Primary Affinity <rados_ops_primary_affinity>`
The characteristic of an OSD that governs the likelihood that
a given OSD will be selected as the primary OSD (or "lead
OSD") in an acting set. Primary affinity was introduced in
Firefly (v. 0.80). See :ref:`Primary Affinity
<rados_ops_primary_affinity>`.
RADOS
**R**\eliable **A**\utonomic **D**\istributed **O**\bject
**S**\tore. RADOS is the object store that provides a scalable
@ -370,6 +399,28 @@
Amazon S3 RESTful API and the OpenStack Swift API. Also called
"RADOS Gateway" and "Ceph Object Gateway".
scrubs
The processes by which Ceph ensures data integrity. During the
process of scrubbing, Ceph generates a catalog of all objects
in a placement group, then ensures that none of the objects are
missing or mismatched by comparing each primary object against
its replicas, which are stored across other OSDs. Any PG
that is determined to have a copy of an object that differs
from the other copies, or that is missing a copy entirely, is marked
"inconsistent" (that is, the PG is marked "inconsistent").
There are two kinds of scrubbing: light scrubbing and deep
scrubbing (also called "normal scrubbing" and "deep scrubbing",
respectively). Light scrubbing is performed daily and does
nothing more than confirm that a given object exists and that
its metadata is correct. Deep scrubbing is performed weekly and
reads the data and uses checksums to ensure data integrity.
See :ref:`Scrubbing <rados_config_scrubbing>` in the RADOS OSD
Configuration Reference Guide and page 141 of *Mastering Ceph,
second edition* (Fisk, Nick. 2019).
secrets
Secrets are credentials used to perform digital authentication
whenever privileged users must access systems that require
@ -387,6 +438,12 @@
Teuthology
The collection of software that performs scripted tests on Ceph.
User
An individual or a system actor (for example, an application)
that uses Ceph clients to interact with the :term:`Ceph Storage
Cluster`. See :ref:`User<rados-ops-user>` and :ref:`User
Management<user-management>`.
Zone
In the context of :term:`RGW`, a zone is a logical group that
consists of one or more :term:`RGW` instances. A zone's

View File

@ -53,9 +53,8 @@ the CLT itself.
Current CLT members are:
* Casey Bodley <cbodley@redhat.com>
* Dan van der Ster <daniel.vanderster@cern.ch>
* David Galloway <dgallowa@redhat.com>
* David Orman <ormandj@iland.com>
* Dan van der Ster <dan.vanderster@clyso.com>
* David Orman <ormandj@1111systems.com>
* Ernesto Puerta <epuerta@redhat.com>
* Gregory Farnum <gfarnum@redhat.com>
* Haomai Wang <haomai@xsky.com>

View File

@ -4,13 +4,19 @@
Ceph delivers **object, block, and file storage in one unified system**.
.. warning::
.. warning::
:ref:`If this is your first time using Ceph, read the "Basic Workflow"
page in the Ceph Developer Guide to learn how to contribute to the
Ceph project. (Click anywhere in this paragraph to read the "Basic
:ref:`If this is your first time using Ceph, read the "Basic Workflow"
page in the Ceph Developer Guide to learn how to contribute to the
Ceph project. (Click anywhere in this paragraph to read the "Basic
Workflow" page of the Ceph Developer Guide.) <basic workflow dev guide>`.
.. note::
:ref:`If you want to make a commit to the documentation but you don't
know how to get started, read the "Documenting Ceph" page. (Click anywhere
in this paragraph to read the "Documenting Ceph" page.) <documenting_ceph>`.
.. container:: columns-3
.. container:: column
@ -104,6 +110,7 @@ about Ceph, see our `Architecture`_ section.
radosgw/index
mgr/index
mgr/dashboard
monitoring/index
api/index
architecture
Developer Guide <dev/developer_guide/index>

View File

@ -36,6 +36,22 @@ Options
Perform a selftest. This mode performs a sanity check of ``stats`` module.
.. option:: --conffile [CONFFILE]
Path to cluster configuration file
.. option:: -d [DELAY], --delay [DELAY]
Refresh interval in seconds (default: 1)
.. option:: --dump
Dump the metrics to stdout
.. option:: --dumpfs <fs_name>
Dump the metrics of the given filesystem to stdout
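For example, a one-shot metrics dump for a (hypothetical) file system named ``cephfs``
could be obtained with:
.. code-block:: bash
   cephfs-top --dumpfs cephfs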
Descriptions of fields
======================

View File

@ -15,15 +15,15 @@ Synopsis
Description
===========
:program:`radosgw-admin` is a RADOS gateway user administration utility. It
allows creating and modifying users.
:program:`radosgw-admin` is a Ceph Object Gateway user administration utility. It
is used to create and modify users.
Commands
========
:program:`radosgw-admin` utility uses many commands for administration purpose
which are as follows:
:program:`radosgw-admin` utility provides commands for administration purposes
as follows:
:command:`user create`
Create a new user.
@ -32,8 +32,7 @@ which are as follows:
Modify a user.
:command:`user info`
Display information of a user, and any potentially available
subusers and keys.
Display information for a user including any subusers and keys.
:command:`user rename`
Renames a user.
@ -51,7 +50,7 @@ which are as follows:
Check user info.
:command:`user stats`
Show user stats as accounted by quota subsystem.
Show user stats as accounted by the quota subsystem.
:command:`user list`
List all users.
@ -78,10 +77,10 @@ which are as follows:
Remove access key.
:command:`bucket list`
List buckets, or, if bucket specified with --bucket=<bucket>,
list its objects. If bucket specified adding --allow-unordered
removes ordering requirement, possibly generating results more
quickly in buckets with large number of objects.
List buckets, or, if a bucket is specified with --bucket=<bucket>,
list its objects. Adding --allow-unordered
removes the ordering requirement, possibly generating results more
quickly for buckets with large number of objects.
:command:`bucket limit check`
Show bucket sharding stats.
@ -93,8 +92,8 @@ which are as follows:
Unlink bucket from specified user.
:command:`bucket chown`
Link bucket to specified user and update object ACLs.
Use --marker to resume if command gets interrupted.
Change bucket ownership to the specified user and update object ACLs.
Invoke with --marker to resume if the command is interrupted.
:command:`bucket stats`
Returns bucket statistics.
@ -109,12 +108,13 @@ which are as follows:
Rewrite all objects in the specified bucket.
:command:`bucket radoslist`
List the rados objects that contain the data for all objects is
the designated bucket, if --bucket=<bucket> is specified, or
otherwise all buckets.
List the RADOS objects that contain the data for all objects in
the designated bucket, if --bucket=<bucket> is specified.
Otherwise, list the RADOS objects that contain data for all
buckets.
:command:`bucket reshard`
Reshard a bucket.
Reshard a bucket's index.
:command:`bucket sync disable`
Disable bucket sync.
@ -306,16 +306,16 @@ which are as follows:
Run data sync for the specified source zone.
:command:`sync error list`
list sync error.
List sync errors.
:command:`sync error trim`
trim sync error.
Trim sync errors.
:command:`zone rename`
Rename a zone.
:command:`zone placement list`
List zone's placement targets.
List a zone's placement targets.
:command:`zone placement add`
Add a zone placement target.
@ -365,7 +365,7 @@ which are as follows:
List all bucket lifecycle progress.
:command:`lc process`
Manually process lifecycle. If a bucket is specified (e.g., via
Manually process lifecycle transitions. If a bucket is specified (e.g., via
--bucket_id or via --bucket and optional --tenant), only that bucket
is processed.
@ -385,7 +385,7 @@ which are as follows:
List metadata log which is needed for multi-site deployments.
:command:`mdlog trim`
Trim metadata log manually instead of relying on RGWs integrated log sync.
Trim metadata log manually instead of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync.
@ -397,7 +397,7 @@ which are as follows:
:command:`bilog trim`
Trim bucket index log (use start-marker, end-marker) manually instead
of relying on RGWs integrated log sync.
of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync.
@ -405,7 +405,7 @@ which are as follows:
List data log which is needed for multi-site deployments.
:command:`datalog trim`
Trim data log manually instead of relying on RGWs integrated log sync.
Trim data log manually instead of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync.
@ -413,19 +413,19 @@ which are as follows:
Read data log status.
:command:`orphans find`
Init and run search for leaked rados objects.
Init and run search for leaked RADOS objects.
DEPRECATED. See the "rgw-orphan-list" tool.
:command:`orphans finish`
Clean up search for leaked rados objects.
Clean up search for leaked RADOS objects.
DEPRECATED. See the "rgw-orphan-list" tool.
:command:`orphans list-jobs`
List the current job-ids for the orphans search.
List the current orphans search job IDs.
DEPRECATED. See the "rgw-orphan-list" tool.
:command:`role create`
create a new AWS role for use with STS.
Create a new role for use with STS (Security Token Service).
:command:`role rm`
Remove a role.
@ -485,7 +485,7 @@ which are as follows:
Show events in a pubsub subscription
:command:`subscription ack`
Ack (remove) an events in a pubsub subscription
Acknowledge (remove) events in a pubsub subscription
Options
@ -499,7 +499,8 @@ Options
.. option:: -m monaddress[:port]
Connect to specified monitor (instead of looking through ceph.conf).
Connect to specified monitor (instead of selecting one
from ceph.conf).
.. option:: --tenant=<tenant>
@ -507,19 +508,19 @@ Options
.. option:: --uid=uid
The radosgw user ID.
The user on which to operate.
.. option:: --new-uid=uid
ID of the new user. Used with 'user rename' command.
The new ID of the user. Used with 'user rename' command.
.. option:: --subuser=<name>
Name of the subuser.
Name of the subuser.
.. option:: --access-key=<key>
S3 access key.
S3 access key.
.. option:: --email=email
@ -531,28 +532,29 @@ Options
.. option:: --gen-access-key
Generate random access key (for S3).
Generate random access key (for S3).
.. option:: --gen-secret
Generate random secret key.
Generate random secret key.
.. option:: --key-type=<type>
key type, options are: swift, s3.
Key type, options are: swift, s3.
.. option:: --temp-url-key[-2]=<key>
Temporary url key.
Temporary URL key.
.. option:: --max-buckets
max number of buckets for a user (0 for no limit, negative value to disable bucket creation).
Default is 1000.
Maximum number of buckets for a user (0 for no limit, negative value to disable bucket creation).
Default is 1000.
.. option:: --access=<access>
Set the access permissions for the sub-user.
Set the access permissions for the subuser.
Available access permissions are read, write, readwrite and full.
.. option:: --display-name=<name>
@ -600,24 +602,24 @@ Options
.. option:: --bucket-new-name=[tenant-id/]<bucket>
Optional for `bucket link`; use to rename a bucket.
While tenant-id/ can be specified, this is never
necessary for normal operation.
While the tenant-id can be specified, this is not
necessary in normal operation.
.. option:: --shard-id=<shard-id>
Optional for mdlog list, bi list, data sync status. Required for ``mdlog trim``.
Optional for mdlog list, bi list, data sync status. Required for ``mdlog trim``.
.. option:: --max-entries=<entries>
Optional for listing operations to specify the max entries.
Optional for listing operations to specify the max entries.
.. option:: --purge-data
When specified, user removal will also purge all the user data.
When specified, user removal will also purge the user's data.
.. option:: --purge-keys
When specified, subuser removal will also purge all the subuser keys.
When specified, subuser removal will also purge the subuser's keys.
.. option:: --purge-objects
@ -625,7 +627,7 @@ Options
.. option:: --metadata-key=<key>
Key to retrieve metadata from with ``metadata get``.
Key from which to retrieve metadata, used with ``metadata get``.
.. option:: --remote=<remote>
@ -633,11 +635,11 @@ Options
.. option:: --period=<id>
Period id.
Period ID.
.. option:: --url=<url>
url for pushing/pulling period or realm.
URL for pushing/pulling period or realm.
.. option:: --epoch=<number>
@ -657,7 +659,7 @@ Options
.. option:: --master-zone=<id>
Master zone id.
Master zone ID.
.. option:: --rgw-realm=<name>
@ -665,11 +667,11 @@ Options
.. option:: --realm-id=<id>
The realm id.
The realm ID.
.. option:: --realm-new-name=<name>
New name of realm.
New name for the realm.
.. option:: --rgw-zonegroup=<name>
@ -677,7 +679,7 @@ Options
.. option:: --zonegroup-id=<id>
The zonegroup id.
The zonegroup ID.
.. option:: --zonegroup-new-name=<name>
@ -685,11 +687,11 @@ Options
.. option:: --rgw-zone=<zone>
Zone in which radosgw is running.
Zone in which the gateway is running.
.. option:: --zone-id=<id>
The zone id.
The zone ID.
.. option:: --zone-new-name=<name>
@ -709,7 +711,7 @@ Options
.. option:: --placement-id
Placement id for the zonegroup placement commands.
Placement ID for the zonegroup placement commands.
.. option:: --tags=<list>
@ -737,7 +739,7 @@ Options
.. option:: --data-extra-pool=<pool>
The placement target data extra (non-ec) pool.
The placement target data extra (non-EC) pool.
.. option:: --placement-index-type=<type>
@ -765,11 +767,11 @@ Options
.. option:: --sync-from=[zone-name][,...]
Set the list of zones to sync from.
Set the list of zones from which to sync.
.. option:: --sync-from-rm=[zone-name][,...]
Remove the zones from list of zones to sync from.
Remove zone(s) from list of zones from which to sync.
.. option:: --bucket-index-max-shards
@ -780,71 +782,71 @@ Options
.. option:: --fix
Besides checking bucket index, will also fix it.
Fix the bucket index in addition to checking it.
.. option:: --check-objects
bucket check: Rebuilds bucket index according to actual objects state.
Bucket check: Rebuilds the bucket index according to actual object state.
.. option:: --format=<format>
Specify output format for certain operations. Supported formats: xml, json.
Specify output format for certain operations. Supported formats: xml, json.
.. option:: --sync-stats
Option for 'user stats' command. When specified, it will update user stats with
the current stats reported by user's buckets indexes.
Option for the 'user stats' command. When specified, it will update user stats with
the current stats reported by the user's buckets indexes.
.. option:: --show-config
Show configuration.
Show configuration.
.. option:: --show-log-entries=<flag>
Enable/disable dump of log entries on log show.
Enable/disable dumping of log entries on log show.
.. option:: --show-log-sum=<flag>
Enable/disable dump of log summation on log show.
Enable/disable dump of log summation on log show.
.. option:: --skip-zero-entries
Log show only dumps entries that don't have zero value in one of the numeric
field.
Log show only dumps entries that don't have zero value in one of the numeric
field.
.. option:: --infile
Specify a file to read in when setting data.
Specify a file to read when setting data.
.. option:: --categories=<list>
Comma separated list of categories, used in usage show.
Comma separated list of categories, used in usage show.
.. option:: --caps=<caps>
List of caps (e.g., "usage=read, write; user=read").
List of capabilities (e.g., "usage=read, write; user=read").
.. option:: --compression=<compression-algorithm>
Placement target compression algorithm (lz4|snappy|zlib|zstd)
Placement target compression algorithm (lz4|snappy|zlib|zstd).
.. option:: --yes-i-really-mean-it
Required for certain operations.
Required as a guardrail for certain destructive operations.
.. option:: --min-rewrite-size
Specify the min object size for bucket rewrite (default 4M).
Specify the minimum object size for bucket rewrite (default 4M).
.. option:: --max-rewrite-size
Specify the max object size for bucket rewrite (default ULLONG_MAX).
Specify the maximum object size for bucket rewrite (default ULLONG_MAX).
.. option:: --min-rewrite-stripe-size
Specify the min stripe size for object rewrite (default 0). If the value
Specify the minimum stripe size for object rewrite (default 0). If the value
is set to 0, then the specified object will always be
rewritten for restriping.
rewritten when restriping.
.. option:: --warnings-only
@ -854,7 +856,7 @@ Options
.. option:: --bypass-gc
When specified with bucket deletion,
triggers object deletions by not involving GC.
triggers object deletion without involving GC.
.. option:: --inconsistent-index
@ -863,25 +865,25 @@ Options
.. option:: --max-concurrent-ios
Maximum concurrent ios for bucket operations. Affects operations that
scan the bucket index, e.g., listing, deletion, and all scan/search
operations such as finding orphans or checking the bucket index.
Default is 32.
Maximum concurrent bucket operations. Affects operations that
scan the bucket index, e.g., listing, deletion, and all scan/search
operations such as finding orphans or checking the bucket index.
The default is 32.
Quota Options
=============
.. option:: --max-objects
Specify max objects (negative value to disable).
Specify the maximum number of objects (negative value to disable).
.. option:: --max-size
Specify max size (in B/K/M/G/T, negative value to disable).
Specify the maximum object size (in B/K/M/G/T, negative value to disable).
.. option:: --quota-scope
The scope of quota (bucket, user).
The scope of quota (bucket, user).
Orphans Search Options
@ -889,16 +891,16 @@ Orphans Search Options
.. option:: --num-shards
Number of shards to use for keeping the temporary scan info
Number of shards to use for temporary scan info
.. option:: --orphan-stale-secs
Number of seconds to wait before declaring an object to be an orphan.
Default is 86400 (24 hours).
Number of seconds to wait before declaring an object to be an orphan.
The default is 86400 (24 hours).
.. option:: --job-id
Set the job id (for orphans find)
Set the job id (for orphans find)
Orphans list-jobs options

View File

@ -53,10 +53,6 @@ Options
Run in foreground, log to usual location
.. option:: --rgw-socket-path=path
Specify a unix domain socket path.
.. option:: --rgw-region=region
The region where radosgw runs
@ -80,30 +76,24 @@ and ``mod_proxy_fcgi`` have to be present in the server. Unlike ``mod_fastcgi``,
or process management may be available in the FastCGI application framework
in use.
``Apache`` can be configured in a way that enables ``mod_proxy_fcgi`` to be used
with localhost tcp or through unix domain socket. ``mod_proxy_fcgi`` that doesn't
support unix domain socket such as the ones in Apache 2.2 and earlier versions of
Apache 2.4, needs to be configured for use with localhost tcp. Later versions of
Apache like Apache 2.4.9 or later support unix domain socket and as such they
allow for the configuration with unix domain socket instead of localhost tcp.
``Apache`` must be configured in a way that enables ``mod_proxy_fcgi`` to be
used with localhost tcp.
The following steps show the configuration in Ceph's configuration file i.e,
``/etc/ceph/ceph.conf`` and the gateway configuration file i.e,
``/etc/httpd/conf.d/rgw.conf`` (RPM-based distros) or
``/etc/apache2/conf-available/rgw.conf`` (Debian-based distros) with localhost
tcp and through unix domain socket:
tcp:
#. For distros with Apache 2.2 and early versions of Apache 2.4 that use
localhost TCP and do not support Unix Domain Socket, append the following
contents to ``/etc/ceph/ceph.conf``::
localhost TCP, append the following contents to ``/etc/ceph/ceph.conf``::
[client.radosgw.gateway]
host = {hostname}
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = ""
log file = /var/log/ceph/client.radosgw.gateway.log
rgw frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
rgw print continue = false
log_file = /var/log/ceph/client.radosgw.gateway.log
rgw_frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
rgw_print_continue = false
#. Add the following content in the gateway configuration file:
@ -149,16 +139,6 @@ tcp and through unix domain socket:
</VirtualHost>
#. For distros with Apache 2.4.9 or later that support Unix Domain Socket,
append the following configuration to ``/etc/ceph/ceph.conf``::
[client.radosgw.gateway]
host = {hostname}
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
log file = /var/log/ceph/client.radosgw.gateway.log
rgw print continue = false
#. Add the following content in the gateway configuration file:
For CentOS/RHEL add in ``/etc/httpd/conf.d/rgw.conf``::
@ -182,10 +162,6 @@ tcp and through unix domain socket:
</VirtualHost>
Please note, ``Apache 2.4.7`` does not have Unix Domain Socket support in
it and as such it has to be configured with localhost tcp. The Unix Domain
Socket support is available in ``Apache 2.4.9`` and later versions.
#. Generate a key for radosgw to use for authentication with the cluster. ::
ceph-authtool -C -n client.radosgw.gateway --gen-key /etc/ceph/keyring.radosgw.gateway

Binary file not shown.

After

Width:  |  Height:  |  Size: 17 KiB

View File

@ -41,14 +41,16 @@ So, prior to start consuming the Ceph API, a valid JSON Web Token (JWT) has to
be obtained, and it may then be reused for subsequent requests. The
``/api/auth`` endpoint will provide the valid token:
.. code-block:: sh
.. prompt:: bash $
$ curl -X POST "https://example.com:8443/api/auth" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Content-Type: application/json" \
-d '{"username": <username>, "password": <password>}'
curl -X POST "https://example.com:8443/api/auth" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Content-Type: application/json" \
-d '{"username": <username>, "password": <password>}'
{ "token": "<redacted_token>", ...}
::
{ "token": "<redacted_token>", ...}
The token obtained must be passed together with every API request in the
``Authorization`` HTTP header::
@ -74,11 +76,11 @@ purpose, Ceph API is built upon the following principles:
An example:
.. code-block:: bash
.. prompt:: bash $
$ curl -X GET "https://example.com:8443/api/osd" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Authorization: Bearer <token>"
curl -X GET "https://example.com:8443/api/osd" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Authorization: Bearer <token>"
Specification

Binary file not shown.

After

Width:  |  Height:  |  Size: 67 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 139 KiB

View File

@ -127,62 +127,67 @@ The Ceph Dashboard offers the following monitoring and management capabilities:
Overview of the Dashboard Landing Page
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Displays overall cluster status, performance, and capacity metrics. Shows instant
feedback for changes in the cluster and provides easy access to subpages of the
dashboard.
The landing page of Ceph Dashboard serves as the home page and features metrics
such as the overall cluster status, performance, and capacity. It provides real-time
updates on any changes in the cluster and allows quick access to other sections of the dashboard.
.. image:: dashboard-landing-page.png
.. note::
You can change the landing page to the previous version from:
``Cluster >> Manager Modules >> Dashboard >> Edit``.
Editing the ``FEATURE_TOGGLE_DASHBOARD`` option switches the landing page from one view to the other.
Note that the previous version of the landing page will be disabled in future releases.
.. _dashboard-landing-page-details:
Details
"""""""
Provides an overview of the cluster configuration, displaying various critical aspects of the cluster.
.. image:: details-card.png
.. _dashboard-landing-page-status:
Status
""""""
Provides a visual indication of cluster health, and displays cluster alerts grouped by severity.
* **Cluster Status**: Displays overall cluster health. In case of any error it
displays a short description of the error and provides a link to the logs.
* **Hosts**: Displays the total number of hosts associated to the cluster and
links to a subpage that lists and describes each.
* **Monitors**: Displays mons and their quorum status and
open sessions. Links to a subpage that lists and describes each.
* **OSDs**: Displays object storage daemons (ceph-osds) and
the numbers of OSDs running (up), in service
(in), and out of the cluster (out). Provides links to
subpages providing a list of all OSDs and related management actions.
* **Managers**: Displays active and standby Ceph Manager
daemons (ceph-mgr).
* **Object Gateway**: Displays active object gateways (RGWs) and
provides links to subpages that list all object gateway daemons.
* **Metadata Servers**: Displays active and standby CephFS metadata
service daemons (ceph-mds).
* **iSCSI Gateways**: Display iSCSI gateways available,
active (up), and inactive (down). Provides a link to a subpage
showing a list of all iSCSI Gateways.
.. image:: status-card-open.png
.. _dashboard-landing-page-capacity:
Capacity
""""""""
* **Used**: Displays the used capacity out of the total physical capacity provided by storage nodes (OSDs)
* **Warning**: Displays the `nearfull` threshold of the OSDs
* **Danger**: Displays the `full` threshold of the OSDs
* **Raw Capacity**: Displays the capacity used out of the total
physical capacity provided by storage nodes (OSDs).
* **Objects**: Displays the number and status of RADOS objects
including the percentages of healthy, misplaced, degraded, and unfound
objects.
* **PG Status**: Displays the total number of placement groups and
their status, including the percentage clean, working,
warning, and unknown.
* **Pools**: Displays pools and links to a subpage listing details.
* **PGs per OSD**: Displays the number of placement groups assigned to
object storage daemons.
.. image:: capacity-card.png
.. _dashboard-landing-page-inventory:
Inventory
"""""""""
An inventory for all assets within the cluster.
Provides direct access to subpages of the dashboard from each item of this card.
.. image:: inventory-card.png
.. _dashboard-landing-page-performance:
Performance
"""""""""""
Cluster Utilization
"""""""""""""""""""
* **Used Capacity**: Total capacity used of the cluster. The maximum value of the chart is the maximum capacity of the cluster.
* **IOPS (Input/Output Operations Per Second)**: Number of read and write operations.
* **Latency**: Amount of time that it takes to process a read or a write request.
* **Client Throughput**: Amount of data that clients read or write to the cluster.
* **Recovery Throughput**: Amount of recovery data that clients read or write to the cluster.
* **Client READ/Write**: Displays an overview of
client input and output operations.
* **Client Throughput**: Displays the data transfer rates to and from Ceph clients.
* **Recovery throughput**: Displays rate of cluster healing and balancing operations.
* **Scrubbing**: Displays light and deep scrub status.
.. image:: cluster-utilization-card.png
Supported Browsers
^^^^^^^^^^^^^^^^^^

Binary file not shown.

After

Width:  |  Height:  |  Size: 16 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 22 KiB

View File

@ -24,12 +24,14 @@ see :ref:`nfs-ganesha-config`.
NFS Cluster management
======================
.. _nfs-module-cluster-create:
Create NFS Ganesha Cluster
--------------------------
.. code:: bash
$ ceph nfs cluster create <cluster_id> [<placement>] [--port <port>] [--ingress --virtual-ip <ip>]
$ ceph nfs cluster create <cluster_id> [<placement>] [--ingress] [--virtual_ip <value>] [--ingress-mode {default|keepalive-only|haproxy-standard|haproxy-protocol}] [--port <int>]
This creates a common recovery pool for all NFS Ganesha daemons, a new user based on
``cluster_id``, and a common NFS Ganesha config RADOS object.
@ -94,6 +96,18 @@ of the details of NFS redirecting traffic on the virtual IP to the
appropriate backend NFS servers, and redeploying NFS servers when they
fail.
If a user additionally supplies ``--ingress-mode keepalive-only``, a
partial *ingress* service is deployed that still provides a virtual
IP, but the NFS server binds directly to that virtual IP, leaving out any
load balancing or traffic redirection. This setup restricts
users to deploying only one NFS daemon, because multiple daemons cannot bind
to the same port on the virtual IP.
Providing ``--ingress-mode default`` instead results in the same setup
as not providing the ``--ingress-mode`` flag at all. In this setup, keepalived is
deployed to manage the virtual IP and haproxy is deployed
to handle load balancing and traffic redirection.
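As a sketch, a keepalive-only cluster pinned to a single host might be created like
this (the cluster name, placement, and virtual IP below are placeholders):
.. code:: bash
   $ ceph nfs cluster create mynfs "1 host1" --ingress --virtual_ip 10.0.0.10/24 --ingress-mode keepalive-only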
Enabling ingress via the ``ceph nfs cluster create`` command deploys a
simple ingress configuration with the most common configuration
options. Ingress can also be added to an existing NFS service (e.g.,

View File

@ -18,9 +18,11 @@ for all reporting entities are returned in text exposition format.
Enabling prometheus output
==========================
The *prometheus* module is enabled with::
The *prometheus* module is enabled with:
ceph mgr module enable prometheus
.. prompt:: bash $
ceph mgr module enable prometheus
Configuration
-------------
@ -47,10 +49,10 @@ configurable with ``ceph config set``, with keys
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.
::
ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
ceph config set mgr mgr/prometheus/server_port 9283
.. prompt:: bash $
ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
ceph config set mgr mgr/prometheus/server_port 9283
.. warning::
@ -65,9 +67,11 @@ recommended to use 15 seconds as scrape interval, though, in some cases it
might be useful to increase the scrape interval.
To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value::
``scrape_interval`` to the desired value:
ceph config set mgr mgr/prometheus/scrape_interval 20
.. prompt:: bash $
ceph config set mgr mgr/prometheus/scrape_interval 20
On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
@ -75,7 +79,7 @@ in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:``mgr/prometheus/scrape_interval``.
Ceph exceeds the configured :confval:`mgr/prometheus/scrape_interval`.
If that is the case, **a warning will be logged** and the module will either
@ -86,35 +90,47 @@ This behavior can be configured. By default, it will return a 503 HTTP status
code (service unavailable). You can set other options using the ``ceph config
set`` commands.
To tell the module to respond with possibly stale data, set it to ``return``::
To tell the module to respond with possibly stale data, set it to ``return``:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/stale_cache_strategy return
To tell the module to respond with "service unavailable", set it to ``fail``::
To tell the module to respond with "service unavailable", set it to ``fail``:
ceph config set mgr mgr/prometheus/stale_cache_strategy fail
.. prompt:: bash $
If you are confident that you don't require the cache, you can disable it::
ceph config set mgr mgr/prometheus/stale_cache_strategy fail
ceph config set mgr mgr/prometheus/cache false
If you are confident that you don't require the cache, you can disable it:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/cache false
If you are using the prometheus module behind some kind of reverse proxy or
load balancer, you can simplify discovering the active instance by switching
to ``error``-mode::
to ``error``-mode:
ceph config set mgr mgr/prometheus/standby_behaviour error
.. prompt:: bash $
ceph config set mgr mgr/prometheus/standby_behaviour error
If set, the prometheus module will respond with an HTTP error when requesting ``/``
from the standby instance. The default error code is 500, but you can configure
the HTTP response code with::
the HTTP response code with:
ceph config set mgr mgr/prometheus/standby_error_status_code 503
.. prompt:: bash $
ceph config set mgr mgr/prometheus/standby_error_status_code 503
Valid error codes are between 400-599.
To switch back to the default behaviour, simply set the config key to ``default``::
To switch back to the default behaviour, simply set the config key to ``default``:
ceph config set mgr mgr/prometheus/standby_behaviour default
.. prompt:: bash $
ceph config set mgr mgr/prometheus/standby_behaviour default
.. _prometheus-rbd-io-statistics:
@ -165,9 +181,17 @@ configuration parameter. The parameter is a comma or space separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.
Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::
Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:
ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
.. prompt:: bash $
ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
The wildcard can be used to indicate all pools or namespaces:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/rbd_stats_pools "*"
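A single namespace within a pool can also be targeted; for example (the pool and namespace names here are hypothetical):

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1/namespace1 pool2"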
The module makes the list of all available images scanning the specified
pools and namespaces and refreshes it periodically. The period is
@ -176,9 +200,22 @@ parameter (in sec) and is 300 sec (5 minutes) by default. The module will
force refresh earlier if it detects statistics from a previously unknown
RBD image.
Example to turn up the sync interval to 10 minutes::
Example to turn up the sync interval to 10 minutes:
ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
.. prompt:: bash $
ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
Ceph daemon performance counters metrics
-----------------------------------------
With the introduction of the ``ceph-exporter`` daemon, the prometheus module will no longer export Ceph daemon
perf counters as prometheus metrics by default. However, one may re-enable exporting these metrics by setting
the module option ``exclude_perf_counters`` to ``false``:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/exclude_perf_counters false
Statistic names and labels
==========================

View File

@ -2,8 +2,9 @@
RGW Module
============
The rgw module helps with bootstraping and configuring RGW realm
and the different related entities.
The rgw module provides a simple interface to deploy RGW multisite.
It helps with bootstrapping and configuring RGW realm, zonegroup and
the different related entities.
Enabling
--------
@ -18,57 +19,120 @@ RGW Realm Operations
Bootstrapping RGW realm creates a new RGW realm entity, a new zonegroup,
and a new zone. It configures a new system user that can be used for
multisite sync operations, and returns a corresponding token. It sets
up new RGW instances via the orchestrator.
multisite sync operations. Under the hood, this module instructs the
orchestrator to create and deploy the corresponding RGW daemons. The module
supports passing the arguments either on the command line or in a spec file:
It is also possible to create a new zone that connects to the master
zone and synchronizes data to/from it.
.. prompt:: bash #
ceph rgw realm bootstrap [--realm-name] [--zonegroup-name] [--zone-name] [--port] [--placement] [--start-radosgw]
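For example, a command-line-only bootstrap (the names and port below are hypothetical and mirror the spec file shown further down) might look like this:

.. prompt:: bash #

   ceph rgw realm bootstrap --realm-name myrealm --zonegroup-name myzonegroup --zone-name myzone --port 5500 --start-radosgw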
The command supports providing the configuration through a spec file (`-i option`):
.. prompt:: bash #
ceph rgw realm bootstrap -i myrgw.yaml
Following is an example of RGW multisite spec file:
.. code-block:: yaml
rgw_realm: myrealm
rgw_zonegroup: myzonegroup
rgw_zone: myzone
placement:
hosts:
- ceph-node-1
- ceph-node-2
spec:
rgw_frontend_port: 5500
.. note:: The spec file used by RGW has the same format as the one used by the orchestrator. Thus,
the user can provide any orchestrator-supported RGW parameters, including advanced
configuration features such as SSL certificates.
Users can also specify custom zone endpoints in the spec (or on the command line). In this case,
cephadm will not launch any RGW daemons. Following is an example RGW spec file with zone endpoints:
.. code-block:: yaml
rgw_realm: myrealm
rgw_zonegroup: myzonegroup
rgw_zone: myzone
zone_endpoints: http://<rgw_host1>:<rgw_port1>, http://<rgw_host2>:<rgw_port2>
Realm Credentials Token
-----------------------
A new token is created when bootstrapping a new realm, and also
when creating one explicitly. The token encapsulates
the master zone endpoint, and a set of credentials that are associated
with a system user.
Removal of this token would remove the credentials, and if the corresponding
system user has no more access keys, it is removed.
Users can list the available tokens for the created (or already existing) realms.
The token is a base64 string that encapsulates the realm information and its
master zone endpoint authentication data. Following is an example of
the `ceph rgw realm tokens` output:
.. prompt:: bash #
ceph rgw realm tokens | jq
.. code-block:: json
[
{
"realm": "myrealm1",
"token": "ewogICAgInJlYWxtX25hbWUiOiAibXlyZWFs....NHlBTFhoIgp9"
},
{
"realm": "myrealm2",
"token": "ewogICAgInJlYWxtX25hbWUiOiAibXlyZWFs....RUU12ZDB0Igp9"
}
]
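Because the output is JSON, a single realm's token can be extracted with ``jq``; for example (realm name as in the sample output above):

.. prompt:: bash #

   ceph rgw realm tokens | jq -r '.[] | select(.realm == "myrealm1") | .token'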
Users can use the token to pull a realm and create a secondary zone on a
different cluster that syncs with the master zone on the primary cluster,
by using the `ceph rgw zone create` command and providing the corresponding token.
Following is an example of zone spec file:
.. code-block:: yaml
rgw_zone: my-secondary-zone
rgw_realm_token: <token>
placement:
hosts:
- ceph-node-1
- ceph-node-2
spec:
rgw_frontend_port: 5500
.. prompt:: bash #
ceph rgw zone create -i zone-spec.yaml
.. note:: The spec file used by RGW has the same format as the one used by the orchestrator. Thus,
the user can provide any orchestrator-supported RGW parameters, including advanced
configuration features such as SSL certificates.
Commands
--------
::
ceph rgw realm bootstrap
ceph rgw realm bootstrap -i spec.yaml
Create a new realm + zonegroup + zone and deploy rgw daemons via the
orchestrator. Command returns a realm token that allows new zones to easily
join this realm
orchestrator using the information specified in the YAML file.
::
ceph rgw zone create
ceph rgw realm tokens
Create a new zone and join existing realm (using the realm token)
List the tokens of all the available realms
::
ceph rgw zone-creds create
ceph rgw zone create -i spec.yaml
Create new credentials and return a token for new zone connection
::
ceph rgw zone-creds remove
Remove credentials and/or user that are associated with the specified
token
::
ceph rgw realm reconcile
Update the realm configuration to match the orchestrator deployment
Join an existing realm by creating a new secondary zone (using the realm token)
::

Binary file not shown.


View File

@ -269,3 +269,24 @@ completely optional, and disabled by default.::
ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
ceph config set mgr mgr/telemetry/channel_ident true
Leaderboard
-----------
To participate in a leaderboard in the `public dashboards
<https://telemetry-public.ceph.com/>`_, run the following command:
.. prompt:: bash $
ceph config set mgr mgr/telemetry/leaderboard true
The leaderboard displays basic information about the cluster. This includes the
total storage capacity and the number of OSDs. To add a description of the
cluster, run a command of the following form:
.. prompt:: bash $
ceph config set mgr mgr/telemetry/leaderboard_description 'Ceph cluster for Computational Biology at the University of XYZ'
If the ``ident`` channel is enabled, its details will not be displayed in the
leaderboard.

View File

@ -0,0 +1,474 @@
.. _monitoring:
===================
Monitoring overview
===================
The aim of this part of the documentation is to explain the Ceph monitoring
stack and the meaning of the main Ceph metrics.
With a good understanding of the Ceph monitoring stack and metrics, users can
create customized monitoring tools, such as Prometheus queries, Grafana
dashboards, or scripts.
Ceph Monitoring stack
=====================
Ceph provides a default monitoring stack, which is installed by cephadm and
explained in the :ref:`Monitoring Services <mgr-cephadm-monitoring>` section of
the cephadm documentation.
Ceph metrics
============
The main source of Ceph metrics is the set of performance counters exposed by
each Ceph daemon. The :doc:`../dev/perf_counters` are the native Ceph monitoring data.
Performance counters are transformed into standard Prometheus metrics by the
Ceph exporter daemon. This daemon runs on every Ceph cluster host and exposes a
metrics endpoint where all the performance counters exposed by all the Ceph
daemons running on the host are published in the form of Prometheus metrics.
In addition to the Ceph exporter, there is another agent that exposes Ceph
metrics: the Prometheus manager module, which exposes metrics related to the
whole cluster, essentially metrics that are not produced by individual Ceph
daemons.
The main source for obtaining Ceph metrics is the metrics endpoint exposed by
the cluster's Prometheus server. Ceph can provide you with the Prometheus
endpoint where you can obtain the complete list of metrics (coming from the
Ceph exporter daemons and the Prometheus manager module) and execute queries.
Use the following command to obtain the Prometheus server endpoint in your
cluster:
.. code-block:: bash
# ceph orch ps --service_name prometheus
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
prometheus.cephtest-node-00 cephtest-node-00.cephlab.com *:9095 running (103m) 50s ago 5w 142M - 2.33.4 514e6a882f6e efe3cbc2e521
With this information you can connect to
``http://cephtest-node-00.cephlab.com:9095`` to access the Prometheus server
interface.
The complete list of metrics (with help text) for your cluster is available
at:
``http://cephtest-node-00.cephlab.com:9095/api/v1/targets/metadata``
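The same Prometheus HTTP API can also be queried from the command line; as a quick sketch (``ceph_health_status`` here is one of the cluster-level metrics exported by the Prometheus manager module):

.. code-block:: bash

   # Run an instant query against the Prometheus HTTP API
   curl -s 'http://cephtest-node-00.cephlab.com:9095/api/v1/query?query=ceph_health_status'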
Note that the main tool that allows users to observe and monitor a Ceph cluster is the **Ceph dashboard**. It provides graphs on which the most important cluster and service metrics are represented. Most of the examples in this document are extracted from the dashboard graphs or extrapolated from the metrics exposed by the Ceph dashboard.
Performance metrics
===================
The main metrics used to measure Ceph cluster performance are listed below.
All of these metrics have the following labels:
``ceph_daemon``: identifier of the OSD daemon generating the metric
``instance``: the IP address of the Ceph exporter instance exposing the metric
``job``: Prometheus scrape job
Example:
.. code-block:: bash
ceph_osd_op_r{ceph_daemon="osd.0", instance="192.168.122.7:9283", job="ceph"} = 73981
*Cluster I/O (throughput):*
Use ``ceph_osd_op_r_out_bytes`` and ``ceph_osd_op_w_in_bytes`` to obtain the cluster throughput generated by clients
Example:
.. code-block:: bash
Writes (B/s):
sum(irate(ceph_osd_op_w_in_bytes[1m]))
Reads (B/s):
sum(irate(ceph_osd_op_r_out_bytes[1m]))
*Cluster I/O (operations):*
Use ``ceph_osd_op_r``, ``ceph_osd_op_w`` to obtain the number of operations generated by clients
Example:
.. code-block:: bash
Writes (ops/s):
sum(irate(ceph_osd_op_w[1m]))
Reads (ops/s):
sum(irate(ceph_osd_op_r[1m]))
*Latency:*
Use ``ceph_osd_op_latency_sum``, which represents the delay before an OSD data transfer begins in response to a client request.
Example:
.. code-block:: bash
sum(irate(ceph_osd_op_latency_sum[1m]))
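To express this as an average latency per operation, divide the sum by the matching count counter (assuming ``ceph_osd_op_latency_count`` is exposed alongside the sum, as the per-OSD read/write latency counters below are):

.. code-block:: bash

   Average client operation latency (seconds):
   sum(irate(ceph_osd_op_latency_sum[1m])) / sum(irate(ceph_osd_op_latency_count[1m]))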
OSD performance
===============
The cluster performance metrics explained above are based on OSD metrics. By selecting the right label, we can obtain the same performance information for a single OSD:
Example:
.. code-block:: bash
OSD 0 read latency
irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"osd.0"}[1m]) / on (ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])
OSD 0 write IOPS
irate(ceph_osd_op_w{ceph_daemon=~"osd.0"}[1m])
OSD 0 write throughput (bytes)
irate(ceph_osd_op_w_in_bytes{ceph_daemon=~"osd.0"}[1m])
OSD.0 total raw capacity available
ceph_osd_stat_bytes{ceph_daemon="osd.0", instance="cephtest-node-00.cephlab.com:9283", job="ceph"} = 536451481
Physical disk performance
==========================
By combining Prometheus ``node_exporter`` metrics with Ceph metrics, we can obtain
information about the performance of the physical disks used by OSDs.
Example:
.. code-block:: bash
Read latency of device used by OSD 0:
label_replace(irate(node_disk_read_time_seconds_total[1m]) / irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Write latency of device used by OSD 0
label_replace(irate(node_disk_write_time_seconds_total[1m]) / irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
IOPS (device used by OSD.0)
reads:
label_replace(irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
writes:
label_replace(irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Throughput (device used by OSD.0)
reads:
label_replace(irate(node_disk_read_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
writes:
label_replace(irate(node_disk_written_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Physical Device Utilization (%) for OSD.0 in the last 5 minutes
label_replace(irate(node_disk_io_time_seconds_total[5m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Pool metrics
============
These metrics have the following labels:
``instance``: the ip address of the Ceph exporter daemon producing the metric.
``pool_id``: identifier of the pool
``job``: prometheus scrape job
- ``ceph_pool_metadata``: Information about the pool. It can be used together
with other metrics to provide more contextual information in queries and
graphs. Apart from the three common labels, this metric provides the following
extra labels:
- ``compression_mode``: compression used in the pool (lz4, snappy, zlib,
zstd, none). Example: compression_mode="none"
- ``description``: brief description of the pool type (replica:number of
replicas or Erasure code: ec profile). Example: description="replica:3"
- ``name``: name of the pool. Example: name=".mgr"
- ``type``: type of pool (replicated/erasure code). Example: type="replicated"
- ``ceph_pool_bytes_used``: Total raw capacity consumed by user data and associated overheads per pool (metadata + redundancy).
- ``ceph_pool_stored``: Total of CLIENT data stored in the pool
- ``ceph_pool_compress_under_bytes``: Data eligible to be compressed in the pool
- ``ceph_pool_compress_bytes_used``: Data compressed in the pool
- ``ceph_pool_rd``: CLIENT read operations per pool (reads per second)
- ``ceph_pool_rd_bytes``: CLIENT read operations in bytes per pool
- ``ceph_pool_wr``: CLIENT write operations per pool (writes per second)
- ``ceph_pool_wr_bytes``: CLIENT write operations in bytes per pool
**Useful queries**:
.. code-block:: bash
Total raw capacity available in the cluster:
sum(ceph_osd_stat_bytes)
Total raw capacity consumed in the cluster (including metadata + redundancy):
sum(ceph_pool_bytes_used)
Total of CLIENT data stored in the cluster:
sum(ceph_pool_stored)
Compression savings:
sum(ceph_pool_compress_under_bytes - ceph_pool_compress_bytes_used)
CLIENT IOPS for a pool (testrbdpool)
reads: irate(ceph_pool_rd[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
writes: irate(ceph_pool_wr[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
CLIENT Throughput for a pool
reads: irate(ceph_pool_rd_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
writes: irate(ceph_pool_wr_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
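The ``pool_id`` label can be translated into a pool name by joining with ``ceph_pool_metadata``, as in this sketch that reports stored client data per pool name:

.. code-block:: bash

   CLIENT data stored per pool (by pool name):
   sum by (name) (ceph_pool_stored * on(pool_id) group_left(name) ceph_pool_metadata)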
Object metrics
==============
These metrics have the following labels:
``instance``: the ip address of the ceph exporter daemon providing the metric
``instance_id``: identifier of the rgw daemon
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_rgw_req{instance="192.168.122.7:9283", instance_id="154247", job="ceph"} = 12345
Generic metrics
---------------
- ``ceph_rgw_metadata``: Provides generic information about the RGW daemon. It
can be used together with other metrics to provide more contextual
information in queries and graphs. Apart from the three common labels, this
metric provides the following extra labels:
- ``ceph_daemon``: Name of the Ceph daemon. Example:
ceph_daemon="rgw.rgwtest.cephtest-node-00.sxizyq",
- ``ceph_version``: Version of Ceph daemon. Example: ceph_version="ceph
version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)",
- ``hostname``: Name of the host where the daemon runs. Example:
hostname:"cephtest-node-00.cephlab.com",
- ``ceph_rgw_req``: Total number of requests for the daemon (GET + PUT + DELETE).
Useful to detect bottlenecks and optimize load distribution.
- ``ceph_rgw_qlen``: RGW operations queue length for the daemon.
Useful to detect bottlenecks and optimize load distribution.
- ``ceph_rgw_failed_req``: Aborted requests.
Useful to detect daemon errors
GET operations: related metrics
-------------------------------
- ``ceph_rgw_get_initial_lat_count``: Number of GET operations
- ``ceph_rgw_get_initial_lat_sum``: Total latency time for the GET operations
- ``ceph_rgw_get``: Total number of GET requests
- ``ceph_rgw_get_b``: Total bytes transferred in GET operations
Put operations: related metrics
-------------------------------
- ``ceph_rgw_put_initial_lat_count``: Number of PUT operations
- ``ceph_rgw_put_initial_lat_sum``: Total latency time for the PUT operations
- ``ceph_rgw_put``: Total number of PUT operations
- ``ceph_rgw_put_b``: Total bytes transferred in PUT operations
Useful queries
--------------
.. code-block:: bash
The average of get latencies:
rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
The average of put latencies:
rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Total requests per second:
rate(ceph_rgw_req[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Total number of "other" operations (LIST, DELETE)
rate(ceph_rgw_req[30s]) - (rate(ceph_rgw_get[30s]) + rate(ceph_rgw_put[30s]))
GET latencies
rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
PUT latencies
rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Bandwidth consumed by GET operations
sum(rate(ceph_rgw_get_b[30s]))
Bandwidth consumed by PUT operations
sum(rate(ceph_rgw_put_b[30s]))
Bandwidth consumed by RGW instance (PUTs + GETs)
sum by (instance_id) (rate(ceph_rgw_get_b[30s]) + rate(ceph_rgw_put_b[30s])) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
HTTP errors:
rate(ceph_rgw_failed_req[30s])
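Building on the counters listed above, the failed-request ratio per RGW daemon can be approximated as follows (a sketch using the same join pattern as the queries above):

.. code-block:: bash

   Ratio of failed requests per RGW daemon:
   rate(ceph_rgw_failed_req[30s]) / rate(ceph_rgw_req[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata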
Filesystem Metrics
==================
These metrics have the following labels:
``ceph_daemon``: The name of the MDS daemon
``instance``: the IP address (and port) of the Ceph exporter daemon exposing the metric
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_mds_request{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", instance="192.168.122.7:9283", job="ceph"} = 1452
Main metrics
------------
- ``ceph_mds_metadata``: Provides general information about the MDS daemon. It
can be used together with other metrics to provide more contextual
information in queries and graphs. It provides the following extra labels:
- ``ceph_version``: MDS daemon Ceph version
- ``fs_id``: filesystem cluster id
- ``hostname``: Host name where the MDS daemon runs
- ``public_addr``: Public address where the MDS daemon runs
- ``rank``: Rank of the MDS daemon
Example:
.. code-block:: bash
ceph_mds_metadata{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", ceph_version="ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)", fs_id="-1", hostname="cephtest-node-00.cephlab.com", instance="cephtest-node-00.cephlab.com:9283", job="ceph", public_addr="192.168.122.145:6801/118896446", rank="-1"}
- ``ceph_mds_request``: Total number of requests for the MDS daemon
- ``ceph_mds_reply_latency_sum``: Reply latency total
- ``ceph_mds_reply_latency_count``: Reply latency count
- ``ceph_mds_server_handle_client_request``: Number of client requests
- ``ceph_mds_sessions_session_count``: Session count
- ``ceph_mds_sessions_total_load``: Total load
- ``ceph_mds_sessions_sessions_open``: Sessions currently open
- ``ceph_mds_sessions_sessions_stale``: Sessions currently stale
- ``ceph_objecter_op_r``: Number of read operations
- ``ceph_objecter_op_w``: Number of write operations
- ``ceph_mds_root_rbytes``: Total number of bytes managed by the daemon
- ``ceph_mds_root_rfiles``: Total number of files managed by the daemon
Useful queries
---------------
.. code-block:: bash
Total MDS daemons read workload:
sum(rate(ceph_objecter_op_r[1m]))
Total MDS daemons write workload:
sum(rate(ceph_objecter_op_w[1m]))
MDS daemon read workload: (daemon name is "mdstest")
sum(rate(ceph_objecter_op_r{ceph_daemon=~"mdstest"}[1m]))
MDS daemon write workload: (daemon name is "mdstest")
sum(rate(ceph_objecter_op_w{ceph_daemon=~"mdstest"}[1m]))
The average of reply latencies:
rate(ceph_mds_reply_latency_sum[30s]) / rate(ceph_mds_reply_latency_count[30s])
Total requests per second:
rate(ceph_mds_request[30s]) * on (instance) group_right (ceph_daemon) ceph_mds_metadata
Block metrics
=============
By default, RBD metrics for images are not available, in order to provide the
best performance in the Prometheus manager module.
To generate metrics for RBD images, the manager option
``mgr/prometheus/rbd_stats_pools`` must be configured properly. For more
information, see :ref:`prometheus-rbd-io-statistics`.
These metrics have the following labels:
``image``: Name of the image which produces the metric value.
``instance``: Node where the rbd metric is produced. (It points to the Ceph exporter daemon)
``job``: Name of the Prometheus scrape job.
``pool``: Image pool name.
Example:
.. code-block:: bash
ceph_rbd_read_bytes{image="test2", instance="cephtest-node-00.cephlab.com:9283", job="ceph", pool="testrbdpool"}
Main metrics
------------
- ``ceph_rbd_read_bytes``: RBD image bytes read
- ``ceph_rbd_read_latency_count``: RBD image reads latency count
- ``ceph_rbd_read_latency_sum``: RBD image reads latency total
- ``ceph_rbd_read_ops``: RBD image reads count
- ``ceph_rbd_write_bytes``: RBD image bytes written
- ``ceph_rbd_write_latency_count``: RBD image writes latency count
- ``ceph_rbd_write_latency_sum``: RBD image writes latency total
- ``ceph_rbd_write_ops``: RBD image writes count
Useful queries
--------------
.. code-block:: bash
The average of read latencies:
rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s]) * on (instance) group_left (ceph_daemon) ceph_rgw_metadata

View File

@ -1,107 +1,110 @@
.. _rados-cephx-config-ref:
========================
Cephx Config Reference
CephX Config Reference
========================
The ``cephx`` protocol is enabled by default. Cryptographic authentication has
some computational costs, though they should generally be quite low. If the
network environment connecting your client and server hosts is very safe and
you cannot afford authentication, you can turn it off. **This is not generally
recommended**.
The CephX protocol is enabled by default. The cryptographic authentication that
CephX provides has some computational costs, though they should generally be
quite low. If the network environment connecting your client and server hosts
is very safe and you cannot afford authentication, you can disable it.
**Disabling authentication is not generally recommended**.
.. note:: If you disable authentication, you are at risk of a man-in-the-middle
attack altering your client/server messages, which could lead to disastrous
security effects.
.. note:: If you disable authentication, you will be at risk of a
man-in-the-middle attack that alters your client/server messages, which
could have disastrous security effects.
For creating users, see `User Management`_. For details on the architecture
of Cephx, see `Architecture - High Availability Authentication`_.
For information about creating users, see `User Management`_. For details on
the architecture of CephX, see `Architecture - High Availability
Authentication`_.
Deployment Scenarios
====================
There are two main scenarios for deploying a Ceph cluster, which impact
how you initially configure Cephx. Most first time Ceph users use
``cephadm`` to create a cluster (easiest). For clusters using
other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need
to use the manual procedures or configure your deployment tool to
How you initially configure CephX depends on your scenario. There are two
common strategies for deploying a Ceph cluster. If you are a first-time Ceph
user, you should probably take the easiest approach: using ``cephadm`` to
deploy a cluster. But if your cluster uses other deployment tools (for example,
Ansible, Chef, Juju, or Puppet), you will need either to use the manual
deployment procedures or to configure your deployment tool so that it will
bootstrap your monitor(s).
Manual Deployment
-----------------
When you deploy a cluster manually, you have to bootstrap the monitor manually
and create the ``client.admin`` user and keyring. To bootstrap monitors, follow
the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are
the logical steps you must perform when using third party deployment tools like
Chef, Puppet, Juju, etc.
When you deploy a cluster manually, it is necessary to bootstrap the monitors
manually and to create the ``client.admin`` user and keyring. To bootstrap
monitors, follow the steps in `Monitor Bootstrapping`_. Follow these steps when
using third-party deployment tools (for example, Chef, Puppet, and Juju).
Enabling/Disabling Cephx
Enabling/Disabling CephX
========================
Enabling Cephx requires that you have deployed keys for your monitors,
OSDs and metadata servers. If you are simply toggling Cephx on / off,
you do not have to repeat the bootstrapping procedures.
Enabling CephX is possible only if the keys for your monitors, OSDs, and
metadata servers have already been deployed. If you are simply toggling CephX
on or off, it is not necessary to repeat the bootstrapping procedures.
Enabling Cephx
Enabling CephX
--------------
When ``cephx`` is enabled, Ceph will look for the keyring in the default search
path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override
this location by adding a ``keyring`` option in the ``[global]`` section of
your `Ceph configuration`_ file, but this is not recommended.
When CephX is enabled, Ceph will look for the keyring in the default search
path: this path includes ``/etc/ceph/$cluster.$name.keyring``. It is possible
to override this search-path location by adding a ``keyring`` option in the
``[global]`` section of your `Ceph configuration`_ file, but this is not
recommended.
Execute the following procedures to enable ``cephx`` on a cluster with
authentication disabled. If you (or your deployment utility) have already
To enable CephX on a cluster for which authentication has been disabled, carry
out the following procedure. If you (or your deployment utility) have already
generated the keys, you may skip the steps related to generating keys.
#. Create a ``client.admin`` key, and save a copy of the key for your client
host
host:
.. prompt:: bash $
ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring
**Warning:** This will clobber any existing
**Warning:** This step will clobber any existing
``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a
deployment tool has already done it for you. Be careful!
deployment tool has already generated a keyring file for you. Be careful!
#. Create a keyring for your monitor cluster and generate a monitor
secret key.
#. Create a monitor keyring and generate a monitor secret key:
.. prompt:: bash $
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's
``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``,
use the following
#. For each monitor, copy the monitor keyring into a ``ceph.mon.keyring`` file
in the monitor's ``mon data`` directory. For example, to copy the monitor
keyring to ``mon.a`` in a cluster called ``ceph``, run the following
command:
.. prompt:: bash $
cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring
#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter
#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter:
.. prompt:: bash $
ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring
#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number
#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:
.. prompt:: bash $
ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring
#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter
#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:
.. prompt:: bash $
ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring
#. Enable ``cephx`` authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file
#. Enable CephX authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file:
.. code-block:: ini
@ -109,23 +112,23 @@ generated the keys, you may skip the steps related to generating keys.
auth_service_required = cephx
auth_client_required = cephx
#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details.
#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_.
For details on bootstrapping a monitor manually, see `Manual Deployment`_.
Disabling Cephx
Disabling CephX
---------------
The following procedure describes how to disable Cephx. If your cluster
environment is relatively safe, you can offset the computation expense of
running authentication. **We do not recommend it.** However, it may be easier
during setup and/or troubleshooting to temporarily disable authentication.
The following procedure describes how to disable CephX. If your cluster
environment is safe, you might want to disable CephX in order to offset the
computational expense of running authentication. **We do not recommend doing
so.** However, setup and troubleshooting might be easier if authentication is
temporarily disabled and subsequently re-enabled.
#. Disable ``cephx`` authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file
#. Disable CephX authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file:
.. code-block:: ini
@ -133,8 +136,7 @@ during setup and/or troubleshooting to temporarily disable authentication.
auth_service_required = none
auth_client_required = none
#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details.
#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_.
Configuration Settings
@ -144,70 +146,230 @@ Enablement
----------
.. confval:: auth_cluster_required
.. confval:: auth_service_required
.. confval:: auth_client_required
``auth_cluster_required``
:Description: If this configuration setting is enabled, the Ceph Storage
Cluster daemons (that is, ``ceph-mon``, ``ceph-osd``,
``ceph-mds``, and ``ceph-mgr``) are required to authenticate with
each other. Valid settings are ``cephx`` or ``none``.
:Type: String
:Required: No
:Default: ``cephx``.
``auth_service_required``
:Description: If this configuration setting is enabled, then Ceph clients can
access Ceph services only if those clients authenticate with the
Ceph Storage Cluster. Valid settings are ``cephx`` or ``none``.
:Type: String
:Required: No
:Default: ``cephx``.
``auth_client_required``
:Description: If this configuration setting is enabled, then communication
between the Ceph client and Ceph Storage Cluster can be
established only if the Ceph Storage Cluster authenticates
against the Ceph client. Valid settings are ``cephx`` or
``none``.
:Type: String
:Required: No
:Default: ``cephx``.
.. index:: keys; keyring
Keys
----
When you run Ceph with authentication enabled, ``ceph`` administrative commands
and Ceph Clients require authentication keys to access the Ceph Storage Cluster.
When Ceph is run with authentication enabled, ``ceph`` administrative commands
and Ceph clients can access the Ceph Storage Cluster only if they use
authentication keys.
The most common way to provide these keys to the ``ceph`` administrative
commands and clients is to include a Ceph keyring under the ``/etc/ceph``
directory. For Octopus and later releases using ``cephadm``, the filename
is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``).
If you include the keyring under the ``/etc/ceph`` directory, you don't need to
specify a ``keyring`` entry in your Ceph configuration file.
The most common way to make these keys available to ``ceph`` administrative
commands and Ceph clients is to include a Ceph keyring under the ``/etc/ceph``
directory. For Octopus and later releases that use ``cephadm``, the filename is
usually ``ceph.client.admin.keyring``. If the keyring is included in the
``/etc/ceph`` directory, then it is unnecessary to specify a ``keyring`` entry
in the Ceph configuration file.
We recommend copying the Ceph Storage Cluster's keyring file to nodes where you
will run administrative commands, because it contains the ``client.admin`` key.
Because the Ceph Storage Cluster's keyring file contains the ``client.admin``
key, we recommend copying the keyring file to nodes from which you run
administrative commands.
To perform this step manually, execute the following:
To perform this step manually, run the following command:
.. prompt:: bash $
sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set
(e.g., ``chmod 644``) on your client machine.
.. tip:: Make sure that the ``ceph.keyring`` file has appropriate permissions
(for example, ``chmod 644``) set on your client machine.
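For example, a minimal permissions fix on the client machine (path as in the copy command above) might be:

.. prompt:: bash $

   sudo chmod 644 /etc/ceph/ceph.client.admin.keyring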
You may specify the key itself in the Ceph configuration file using the ``key``
setting (not recommended), or a path to a keyfile using the ``keyfile`` setting.
You can specify the key itself by using the ``key`` setting in the Ceph
configuration file (this approach is not recommended), or instead specify a
path to a keyfile by using the ``keyfile`` setting in the Ceph configuration
file.
``keyring``
:Description: The path to the keyring file.
:Type: String
:Required: No
:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin``
``keyfile``
:Description: The path to a keyfile (that is, a file containing only the key).
:Type: String
:Required: No
:Default: None
``key``
:Description: The key (that is, the text string of the key itself). We do not
recommend that you use this setting unless you know what you're
doing.
:Type: String
:Required: No
:Default: None
Daemon Keyrings
---------------
Administrative users or deployment tools (for example, ``cephadm``) generate
daemon keyrings in the same way that they generate user keyrings. By default,
Ceph stores the keyring of a daemon inside that daemon's data directory. The
default keyring locations and the capabilities that are necessary for the
daemon to function are shown below.
``ceph-mon``
:Location: ``$mon_data/keyring``
:Capabilities: ``mon 'allow *'``
``ceph-osd``
:Location: ``$osd_data/keyring``
:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'``
``ceph-mds``
:Location: ``$mds_data/keyring``
:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'``
``ceph-mgr``
:Location: ``$mgr_data/keyring``
:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'``
``radosgw``
:Location: ``$rgw_data/keyring``
:Capabilities: ``mon 'allow rwx' osd 'allow rwx'``
.. note:: The monitor keyring (that is, ``mon.``) contains a key but no
capabilities, and this keyring is not part of the cluster ``auth`` database.
The daemon's data-directory locations default to directories of the form::
/var/lib/ceph/$type/$cluster-$id
For example, ``osd.12`` would have the following data directory::
/var/lib/ceph/osd/ceph-12
It is possible to override these locations, but it is not recommended.
.. confval:: keyring
:default: /etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
.. confval:: keyfile
.. confval:: key
.. index:: signatures
Signatures
----------
Ceph performs a signature check that provides some limited protection
against messages being tampered with in flight (e.g., by a "man in the
middle" attack).
Ceph performs a signature check that provides some limited protection against
messages being tampered with in flight (for example, by a "man in the middle"
attack).
Like other parts of Ceph authentication, Ceph provides fine-grained control so
you can enable/disable signatures for service messages between clients and
Ceph, and so you can enable/disable signatures for messages between Ceph daemons.
As with other parts of Ceph authentication, signatures admit of fine-grained
control. You can enable or disable signatures for service messages between
clients and Ceph, and for messages between Ceph daemons.
Note that even with signatures enabled data is not encrypted in
flight.
Note that even when signatures are enabled data is not encrypted in flight.
``cephx_require_signatures``
:Description: If this configuration setting is set to ``true``, Ceph requires
signatures on all message traffic between the Ceph client and the
Ceph Storage Cluster, and between daemons within the Ceph Storage
Cluster.
.. note::
**ANTIQUATED NOTE:**
Neither Ceph Argonaut nor Linux kernel versions prior to 3.19
support signatures; if one of these clients is in use, ``cephx_require_signatures``
can be disabled in order to allow the client to connect.
:Type: Boolean
:Required: No
:Default: ``false``
``cephx_cluster_require_signatures``
:Description: If this configuration setting is set to ``true``, Ceph requires
signatures on all message traffic between Ceph daemons within the
Ceph Storage Cluster.
:Type: Boolean
:Required: No
:Default: ``false``
``cephx_service_require_signatures``
:Description: If this configuration setting is set to ``true``, Ceph requires
signatures on all message traffic between Ceph clients and the
Ceph Storage Cluster.
:Type: Boolean
:Required: No
:Default: ``false``
``cephx_sign_messages``
:Description: If this configuration setting is set to ``true``, and if the Ceph
version supports message signing, then Ceph will sign all
messages so that they are more difficult to spoof.
:Type: Boolean
:Default: ``true``
.. confval:: cephx_require_signatures
.. confval:: cephx_cluster_require_signatures
.. confval:: cephx_service_require_signatures
.. confval:: cephx_sign_messages
Time to Live
------------
.. confval:: auth_service_ticket_ttl
``auth_service_ticket_ttl``
:Description: When the Ceph Storage Cluster sends a ticket for authentication
to a Ceph client, the Ceph Storage Cluster assigns that ticket a
Time To Live (TTL).
:Type: Double
:Default: ``60*60``
.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping
.. _Operating a Cluster: ../../operations/operating

View File

@ -1,84 +1,95 @@
==========================
BlueStore Config Reference
==========================
==================================
BlueStore Configuration Reference
==================================
Devices
=======
BlueStore manages either one, two, or (in certain cases) three storage
devices.
BlueStore manages either one, two, or in certain cases three storage devices.
These *devices* are "devices" in the Linux/Unix sense. This means that they are
assets listed under ``/dev`` or ``/devices``. Each of these devices may be an
entire storage drive, or a partition of a storage drive, or a logical volume.
BlueStore does not create or mount a conventional file system on devices that
it uses; BlueStore reads and writes to the devices directly in a "raw" fashion.
In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, occupying the full device that
is managed directly by BlueStore. This *primary device* is normally identified
by a ``block`` symlink in the data directory.
In the simplest case, BlueStore consumes all of a single storage device. This
device is known as the *primary device*. The primary device is identified by
the ``block`` symlink in the data directory.
The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD, like: its identifier, which cluster it belongs to,
and its private keyring.
The data directory is a ``tmpfs`` mount. When this data directory is booted or
activated by ``ceph-volume``, it is populated with metadata files and links
that hold information about the OSD: for example, the OSD's identifier, the
name of the cluster that the OSD belongs to, and the OSD's private keyring.
It is also possible to deploy BlueStore across one or two additional devices:
In more complicated cases, BlueStore is deployed across one or two additional
devices:
* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be
used for BlueStore's internal journal or write-ahead log. It is only useful
to use a WAL device if the device is faster than the primary device (e.g.,
when it is on an SSD and the primary device is an HDD).
* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
directory) can be used to separate out BlueStore's internal journal or
write-ahead log. Using a WAL device is advantageous only if the WAL device
is faster than the primary device (for example, if the WAL device is an SSD
and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
for storing BlueStore's internal metadata. BlueStore (or rather, the
embedded RocksDB) will put as much metadata as it can on the DB device to
improve performance. If the DB device fills up, metadata will spill back
onto the primary device (where it would have been otherwise). Again, it is
only helpful to provision a DB device if it is faster than the primary
device.
to store BlueStore's internal metadata. BlueStore (or more precisely, the
embedded RocksDB) will put as much metadata as it can on the DB device in
order to improve performance. If the DB device becomes full, metadata will
spill back onto the primary device (where it would have been located in the
absence of the DB device). Again, it is advantageous to provision a DB device
only if it is faster than the primary device.
If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit). This means that if a DB device is specified but an explicit
WAL device is not, the WAL will be implicitly colocated with the DB on the faster
device.
If there is only a small amount of fast storage available (for example, less
than a gigabyte), we recommend using the available space as a WAL device. But
if more fast storage is available, it makes more sense to provision a DB
device. Because the BlueStore journal is always placed on the fastest device
available, using a DB device provides the same benefit that using a WAL device
would, while *also* allowing additional metadata to be stored off the primary
device (provided that it fits). DB devices make this possible because whenever
a DB device is specified but an explicit WAL device is not, the WAL will be
implicitly colocated with the DB on the faster device.
A single-device (colocated) BlueStore OSD can be provisioned with:
To provision a single-device (colocated) BlueStore OSD, run the following
command:
.. prompt:: bash $
ceph-volume lvm prepare --bluestore --data <device>
To specify a WAL device and/or DB device:
To specify a WAL device or DB device, run the following command:
.. prompt:: bash $
ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
devices can be existing logical volumes or GPT partitions.
.. note:: The option ``--data`` can take as its argument any of the
following devices: logical volumes specified using *vg/lv* notation,
existing logical volumes, and GPT partitions.
Provisioning strategies
-----------------------
Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore
which had just one), there are two common arrangements that should help clarify
the deployment strategy:
BlueStore differs from Filestore in that there are several ways to deploy a
BlueStore OSD. However, the overall deployment strategy for BlueStore can be
clarified by examining just these two common arrangements:
.. _bluestore-single-type-device-config:
**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are the same type, for example all rotational drives, and
there are no fast devices to use for metadata, it makes sense to specify the
block device only and to not separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like:
If all devices are of the same type (for example, they are all HDDs), and if
there are no fast devices available for the storage of metadata, then it makes
sense to specify the block device only and to leave ``block.db`` and
``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
``/dev/sda`` device is as follows:
.. prompt:: bash $
ceph-volume lvm create --bluestore --data /dev/sda
If logical volumes have already been created for each device, (a single LV
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
``ceph-vg/block-lv`` would look like:
If the devices to be used for a BlueStore OSD are pre-created logical volumes,
then the :ref:`ceph-volume-lvm` call for a logical volume named
``ceph-vg/block-lv`` is as follows:
.. prompt:: bash $
@ -88,15 +99,18 @@ using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^
If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower (spinning drive).
You must create these volume groups and logical volumes manually as
the ``ceph-volume`` tool is currently not able to do so automatically.
If you have a mix of fast and slow devices (for example, SSD or HDD), then we
recommend placing ``block.db`` on the faster device while ``block`` (that is,
the data) is stored on the slower device (that is, the rotational drive).
For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``)
and one (fast) solid state drive (``sdx``). First create the volume groups:
You must create these volume groups and logical volumes manually, because the
``ceph-volume`` tool is currently unable to create them automatically.
The following procedure illustrates the manual creation of volume groups and
logical volumes. For this example, we shall assume four rotational drives
(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
to create the volume groups, run the following commands:
.. prompt:: bash $
@ -105,7 +119,7 @@ and one (fast) solid state drive (``sdx``). First create the volume groups:
vgcreate ceph-block-2 /dev/sdc
vgcreate ceph-block-3 /dev/sdd
Now create the logical volumes for ``block``:
Next, to create the logical volumes for ``block``, run the following commands:
.. prompt:: bash $
@ -114,8 +128,9 @@ Now create the logical volumes for ``block``:
lvcreate -l 100%FREE -n block-2 ceph-block-2
lvcreate -l 100%FREE -n block-3 ceph-block-3
We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:
Because there are four HDDs, there will be four OSDs. Supposing that there is a
200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
the following commands:
.. prompt:: bash $
@ -125,7 +140,7 @@ SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:
lvcreate -L 50GB -n db-2 ceph-db-0
lvcreate -L 50GB -n db-3 ceph-db-0
Finally, create the 4 OSDs with ``ceph-volume``:
Finally, to create the four OSDs, run the following commands:
.. prompt:: bash $
@ -134,54 +149,57 @@ Finally, create the 4 OSDs with ``ceph-volume``:
ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
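As an optional verification step (not part of the original procedure), the resulting logical volumes and their OSD associations can be listed with ``ceph-volume``:

.. prompt:: bash $

   ceph-volume lvm list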
These operations should end up creating four OSDs, with ``block`` on the slower
rotational drives with a 50 GB logical volume (DB) for each on the solid state
drive.
After this procedure is finished, there should be four OSDs, ``block`` should
be on the four HDDs, and each HDD should have a 50GB logical volume
(specifically, a DB device) on the shared SSD.
Sizing
======
When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>` it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have
*as large as possible* logical volumes.
When using a :ref:`mixed spinning-and-solid-drive setup
<bluestore-mixed-device-config>`, it is important to make a large enough
``block.db`` logical volume for BlueStore. The logical volumes associated with
``block.db`` should be *as large as possible*.
The general recommendation is to have ``block.db`` size in between 1% to 4%
of ``block`` size. For RGW workloads, it is recommended that the ``block.db``
size isn't smaller than 4% of ``block``, because RGW heavily uses it to store
metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't
be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough.
It is generally recommended that the size of ``block.db`` be somewhere between
1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
use of ``block.db`` to store metadata (in particular, omap keys). For example,
if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
2% of the ``block`` size.
In older releases, internal level sizes mean that the DB can fully utilize only
specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
so forth. Most deployments will not substantially benefit from sizing to
accommodate L3 and higher, though DB compaction can be facilitated by doubling
these figures to 6GB, 60GB, and 600GB.
In older releases, internal level sizes are such that the DB can fully utilize
only those specific partition / logical volume sizes that correspond to sums of
L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly
3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
sizing that accommodates L3 and higher, though DB compaction can be facilitated
by doubling these figures to 6GB, 60GB, and 600GB.
Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
enable better utilization of arbitrary DB device sizes, and the Pacific
release brings experimental dynamic level support. Users of older releases may
thus wish to plan ahead by provisioning larger DB devices today so that their
benefits may be realized with future upgrades.
When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.
Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow
for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific
release brings experimental dynamic-level support. Because of these advances,
users of older releases might want to plan ahead by provisioning larger DB
devices today so that the benefits of scale can be realized when upgrades are
made in the future.
When *not* using a mix of fast and slow devices, there is no requirement to
create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore
will automatically colocate these devices within the space of ``block``.
Automatic Cache Sizing
======================
BlueStore can be configured to automatically resize its caches when TCMalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best effort
algorithm and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.
BlueStore can be configured to automatically resize its caches, provided that
certain conditions are met: TCMalloc must be configured as the memory allocator
and the ``bluestore_cache_autotune`` configuration option must be enabled (note
that it is currently enabled by default). When automatic cache sizing is in
effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
size (as determined by ``osd_memory_target``). This approach makes use of a
best-effort algorithm and caches do not shrink smaller than the size defined by
the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
with a hierarchy of priorities. But if priority information is not available,
the values specified in the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options are used as fallback cache ratios.
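For example, the memory target can be raised for all OSDs via the central
configuration database (the value ``6G`` is only illustrative; choose a value
appropriate to the host's RAM and OSD count):

.. prompt:: bash $

   ceph config set osd osd_memory_target 6G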
.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
@ -195,34 +213,33 @@ used as fallbacks.
Manual Cache Sizing
===================
The amount of memory consumed by each OSD for BlueStore caches is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), there is a
different default value that is used depending on whether an HDD or
SSD is used for the primary device (set by the
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
options).
The amount of memory consumed by each OSD to be used for its BlueStore cache is
determined by the ``bluestore_cache_size`` configuration option. If that option
has not been specified (that is, if it remains at 0), then Ceph uses a
different configuration option to determine the default memory budget:
``bluestore_cache_size_hdd`` if the primary device is an HDD, or
``bluestore_cache_size_ssd`` if the primary device is an SSD.
BlueStore and the rest of the Ceph OSD daemon do the best they can
to work within this memory budget. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
some additional utilization due to memory fragmentation and other
allocator overhead.
BlueStore and the rest of the Ceph OSD daemon make every effort to work within
this memory budget. Note that in addition to the configured cache size, there
is also memory consumed by the OSD itself. There is additional utilization due
to memory fragmentation and other allocator overhead.
The configured cache memory budget can be used in a few different ways:
The configured cache-memory budget can be used to store the following types of
things:
* Key/Value metadata (i.e., RocksDB's internal cache)
* Key/Value metadata (that is, RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)
* BlueStore data (that is, recently read or recently written object data)
Cache memory usage is governed by the following options:
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``.
The fraction of the cache devoted to data
is governed by the effective bluestore cache size (depending on
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary
device) as well as the meta and kv ratios.
The data fraction can be calculated by
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``
Cache memory usage is governed by the configuration options
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction
of the cache that is reserved for data is governed by both the effective
BlueStore cache size (which depends on the relevant
``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary
device) and the "meta" and "kv" ratios. This data fraction can be calculated
with the following formula: ``<effective_cache_size> * (1 -
bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.
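As a worked example (the numbers are illustrative, not recommendations): with
an effective cache size of 3GiB, ``bluestore_cache_meta_ratio`` set to ``0.4``,
and ``bluestore_cache_kv_ratio`` set to ``0.4``, the data fraction is
``3GiB * (1 - 0.4 - 0.4) = 0.6GiB``. Such settings might be applied as follows:

.. prompt:: bash $

   ceph config set osd bluestore_cache_size_ssd 3221225472
   ceph config set osd bluestore_cache_meta_ratio 0.4
   ceph config set osd bluestore_cache_kv_ratio 0.4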
.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
@ -233,29 +250,28 @@ The data fraction can be calculated by
Checksums
=========
BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data
checksumming is done by BlueStore and can make use of `crc32c`,
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
suitable for most purposes.
BlueStore checksums all metadata and all data written to disk. Metadata
checksumming is handled by RocksDB and uses the `crc32c` algorithm. By
contrast, data checksumming is handled by BlueStore and can use any of
`crc32c`, `xxhash32`, or `xxhash64`. The default checksum algorithm is
`crc32c`, which is suitable for most purposes.
Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.
Full data checksumming increases the amount of metadata that BlueStore must
store and manage. Whenever possible (for example, when clients hint that data
is written and read sequentially), BlueStore will checksum larger blocks. In
many cases, however, it must store a checksum value (usually 4 bytes) for every
4 KB block of data.
It is possible to use a smaller checksum value by truncating the
checksum to two or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error will not be
detected is higher with a smaller checksum, going from about one in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting `crc32c_16` or
`crc32c_8` as the checksum algorithm.
It is possible to obtain a smaller checksum value by truncating the checksum to
one or two bytes and reducing the metadata overhead. A drawback of this
approach is that it increases the probability of a random error going
undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in
65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte)
checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8`
as the checksum algorithm.
The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example:
The *checksum algorithm* can be specified either via a per-pool ``csum_type``
configuration option or via the global configuration option. For example:
.. prompt:: bash $
@ -266,34 +282,35 @@ The *checksum algorithm* can be set either via a per-pool
Inline Compression
==================
BlueStore supports inline compression using `snappy`, `zlib`, or
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.
BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`.
Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:
Whether data in BlueStore is compressed is determined by two factors: (1) the
*compression mode* and (2) any client hints associated with a write operation.
The compression modes are as follows:
* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
*compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
* **aggressive**: Do compress data unless the write operation has an
*incompressible* hint set.
* **force**: Try to compress data no matter what.
For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.
For more information about the *compressible* and *incompressible* I/O hints,
see :c:func:`rados_set_alloc_hint`.
Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently it will not be used and the original
(uncompressed) data will be stored. For example, if the ``bluestore
compression required ratio`` is set to ``.7`` then the compressed data
must be 70% of the size of the original (or smaller).
Note that data in BlueStore is stored in compressed form only if compression
reduces the size of the data chunk sufficiently (as determined by the
``bluestore compression required ratio`` setting). No matter which compression
mode is in use, if the compressed chunk is too big, it is discarded and the
original (uncompressed) data is stored instead. For example, if ``bluestore
compression required ratio`` is set to ``.7``, then data is stored in
compressed form only if the size of the compressed data is no more than 70% of
the size of the original data.
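For example, the required ratio could be applied to all OSDs via the central
configuration database (the per-pool mechanism is described next; the value
``0.7`` is only illustrative):

.. prompt:: bash $

   ceph config set osd bluestore_compression_required_ratio 0.7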
The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with:
The *compression mode*, *compression algorithm*, *compression required ratio*,
*min blob size*, and *max blob size* settings can be specified either via a
per-pool property or via a global config option. To specify pool properties,
run the following commands:
.. prompt:: bash $
@ -318,27 +335,30 @@ set with:
RocksDB Sharding
================
Internally BlueStore uses multiple types of key-value data,
stored in RocksDB. Each data type in BlueStore is assigned a
unique prefix. Until Pacific all key-value data was stored in
single RocksDB column family: 'default'. Since Pacific,
BlueStore can divide this data into multiple RocksDB column
families. When keys have similar access frequency, modification
frequency and lifetime, BlueStore benefits from better caching
and more precise compaction. This improves performance, and also
requires less disk space during compaction, since each column
family is smaller and can compact independent of others.
BlueStore maintains several types of internal key-value data, all of which are
stored in RocksDB. Each data type in BlueStore is assigned a unique prefix.
Prior to the Pacific release, all key-value data was stored in a single RocksDB
column family: 'default'. In Pacific and later releases, however, BlueStore can
divide key-value data into several RocksDB column families. BlueStore achieves
better caching and more precise compaction when keys are similar: specifically,
when keys have similar access frequency, similar modification frequency, and a
similar lifetime. Under such conditions, performance is improved and less disk
space is required during compaction (because each column family is smaller and
is able to compact independently of the others).
OSDs deployed in Pacific or later use RocksDB sharding by default.
If Ceph is upgraded to Pacific from a previous version, sharding is off.
OSDs deployed in Pacific or later releases use RocksDB sharding by default.
However, if Ceph has been upgraded to Pacific or a later version from a
previous version, sharding is disabled on any OSDs that were created before
Pacific.
To enable sharding and apply the Pacific defaults, stop an OSD and run
To enable sharding and apply the Pacific defaults to a specific OSD, stop the
OSD and run the following command:
.. prompt:: bash #
ceph-bluestore-tool \
ceph-bluestore-tool \
--path <data path> \
--sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
--sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \
reshard
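To check the sharding definition that a given (stopped) OSD is currently using,
the ``show-sharding`` command of ``ceph-bluestore-tool`` can be run against the
same data path, for example:

.. prompt:: bash #

   ceph-bluestore-tool --path <data path> show-sharding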
.. confval:: bluestore_rocksdb_cf
@ -354,165 +374,179 @@ Throttling
.. confval:: bluestore_throttle_cost_per_io_ssd
SPDK Usage
==================
==========
If you want to use the SPDK driver for NVMe devices, you must prepare your system.
Refer to `SPDK document`__ for more details.
To use the SPDK driver for NVMe devices, you must first prepare your system.
See `SPDK document`__.
.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples
SPDK offers a script to configure the device automatically. Users can run the
script as root:
SPDK offers a script that will configure the device automatically. Run this
script with root permissions:
.. prompt:: bash $
sudo src/spdk/scripts/setup.sh
You will need to specify the subject NVMe device's device selector with
the "spdk:" prefix for ``bluestore_block_path``.
You will need to specify the subject NVMe device's device selector with the
"spdk:" prefix for ``bluestore_block_path``.
For example, you can find the device selector of an Intel PCIe SSD with:
In the following example, you first find the device selector of an Intel NVMe
SSD by running the following command:
.. prompt:: bash $
lspci -mm -n -D -d 8086:0953
lspci -mm -n -D -d 8086:0953
The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.
The form of the device selector is either ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``.
and then set::
Next, supposing that ``0000:01:00.0`` is the device selector found in the
output of the ``lspci`` command, you can specify the device selector by running
the following command::
bluestore_block_path = spdk:0000:01:00.0
Where ``0000:01:00.0`` is the device selector found in the output of ``lspci``
command above.
You may also specify a remote NVMeoF target over the TCP transport, as in the
following example::
To run multiple SPDK instances per node, you must specify the
amount of dpdk memory in MB that each instance will use, to make sure each
instance uses its own dpdk memory
bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"
In most cases, a single device can be used for data, DB, and WAL. We describe
To run multiple SPDK instances per node, you must make sure each instance uses
its own DPDK memory by specifying for each instance the amount of DPDK memory
(in MB) that the instance will use.
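One possible sketch of such a setup, assuming that the per-instance DPDK memory
is controlled by the ``bluestore_spdk_mem`` option (a value in MB), assigns
each OSD instance its own allocation:

.. prompt:: bash $

   ceph config set osd.0 bluestore_spdk_mem 512
   ceph config set osd.1 bluestore_spdk_mem 512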
In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all IOs are issued through SPDK.::
settings to ensure that all I/Os are issued through SPDK::
bluestore_block_db_path = ""
bluestore_block_db_size = 0
bluestore_block_wal_path = ""
bluestore_block_wal_size = 0
Otherwise, the current implementation will populate the SPDK map files with
kernel file system symbols and will use the kernel driver to issue DB/WAL IO.
If these settings are not entered, then the current implementation will
populate the SPDK map files with kernel file system symbols and will use the
kernel driver to issue DB/WAL I/Os.
Minimum Allocation Size
========================
=======================
There is a configured minimum amount of storage that BlueStore will allocate on
an OSD. In practice, this is the least amount of capacity that a RADOS object
can consume. The value of :confval:`bluestore_min_alloc_size` is derived from the
value of :confval:`bluestore_min_alloc_size_hdd` or :confval:`bluestore_min_alloc_size_ssd`
depending on the OSD's ``rotational`` attribute. This means that when an OSD
is created on an HDD, BlueStore will be initialized with the current value
of :confval:`bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices)
with the value of :confval:`bluestore_min_alloc_size_ssd`.
There is a configured minimum amount of storage that BlueStore allocates on an
underlying storage device. In practice, this is the least amount of capacity
that even a tiny RADOS object can consume on each OSD's primary device. The
configuration option in question--:confval:`bluestore_min_alloc_size`--derives
its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or
:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs
(including NVMe devices), BlueStore is initialized with the current value of
:confval:`bluestore_min_alloc_size_ssd`.
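To check the values that will be applied to newly created OSDs, the central
configuration database can be queried, for example:

.. prompt:: bash $

   ceph config get osd bluestore_min_alloc_size_hdd
   ceph config get osd bluestore_min_alloc_size_ssd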
Through the Mimic release, the default values were 64KB and 16KB for rotational
(HDD) and non-rotational (SSD) media respectively. Octopus changed the default
for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD
(rotational) media to 4KB as well.
In Mimic and earlier releases, the default values were 64KB for rotational
media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
changed the default value for non-rotational media (SSD) to 4KB, and the
Pacific release changed the default value for rotational media (HDD) to 4KB.
These changes were driven by space amplification experienced by Ceph RADOS
GateWay (RGW) deployments that host large numbers of small files
These changes were driven by space amplification that was experienced by Ceph
RADOS Gateway (RGW) deployments that hosted large numbers of small files
(S3/Swift objects).
For example, when an RGW client stores a 1KB S3 object, it is written to a
single RADOS object. With the default :confval:`min_alloc_size` value, 4KB of
underlying drive space is allocated. This means that roughly
(4KB - 1KB) == 3KB is allocated but never used, which corresponds to 300%
overhead or 25% efficiency. Similarly, a 5KB user object will be stored
as one 4KB and one 1KB RADOS object, again stranding 4KB of device capacity,
though in this case the overhead is a much smaller percentage. Think of this
in terms of the remainder from a modulus operation. The overhead *percentage*
thus decreases rapidly as user object size increases.
For example, when an RGW client stores a 1 KB S3 object, that object is written
to a single RADOS object. In accordance with the default
:confval:`min_alloc_size` value, 4 KB of underlying drive space is allocated.
This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated but never
used: this corresponds to 300% overhead or 25% efficiency. Similarly, a 5 KB
user object will be stored as two RADOS objects, a 4 KB RADOS object and a 1 KB
RADOS object, with the result that 4KB of device capacity is stranded. In this
case, however, the overhead percentage is much smaller. Think of this in terms
of the remainder from a modulus operation. The overhead *percentage* thus
decreases rapidly as object size increases.
An easily missed additional subtlety is that this
takes place for *each* replica. So when using the default three copies of
data (3R), a 1KB S3 object actually consumes roughly 9KB of storage device
capacity. If erasure coding (EC) is used instead of replication, the
amplification may be even higher: for a ``k=4,m=2`` pool, our 1KB S3 object
will allocate (6 * 4KB) = 24KB of device capacity.
There is an additional subtlety that is easily missed: the amplification
phenomenon just described takes place for *each* replica. For example, when
using the default of three copies of data (3R), a 1 KB S3 object actually
strands roughly 9 KB of storage device capacity. If erasure coding (EC) is used
instead of replication, the amplification might be even higher: for a ``k=4,
m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied by 6)
of device capacity.
When an RGW bucket pool contains many relatively large user objects, the effect
of this phenomenon is often negligible, but should be considered for deployments
that expect a significant fraction of relatively small objects.
of this phenomenon is often negligible. However, with deployments that can
expect a significant fraction of relatively small user objects, the effect
should be taken into consideration.
The 4KB default value aligns well with conventional HDD and SSD devices. Some
new coarse-IU (Indirection Unit) QLC SSDs however perform and wear best
when :confval:`bluestore_min_alloc_size_ssd`
is set at OSD creation to match the device's IU:. 8KB, 16KB, or even 64KB.
These novel storage drives allow one to achieve read performance competitive
with conventional TLC SSDs and write performance faster than HDDs, with
high density and lower cost than TLC SSDs.
The 4KB default value aligns well with conventional HDD and SSD devices.
However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear
best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation
to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel
storage drives can achieve read performance that is competitive with that of
conventional TLC SSDs and write performance that is faster than that of HDDs,
with higher density and lower cost than TLC SSDs.
Note that when creating OSDs on these devices, one must carefully apply the
non-default value only to appropriate devices, and not to conventional SSD and
HDD devices. This may be done through careful ordering of OSD creation, custom
OSD device classes, and especially by the use of central configuration _masks_.
Note that when creating OSDs on these novel devices, one must be careful to
apply the non-default value only to appropriate devices, and not to
conventional HDD and SSD devices. Errors can be avoided through careful ordering
of OSD creation, with custom OSD device classes, and especially by the use of
central configuration *masks*.
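As one illustration of the mask-based approach (the host name and the value are
only examples), the non-default value could be confined to a host that carries
only coarse-IU QLC devices before its OSDs are created:

.. prompt:: bash $

   ceph config set osd/host:qlc-host-1 bluestore_min_alloc_size_ssd 16384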
Quincy and later releases add
the :confval:`bluestore_use_optimal_io_size_for_min_alloc_size`
option that enables automatic discovery of the appropriate value as each OSD is
created. Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``,
``ATA over Ethernet``, `iSCSI`, or other device layering / abstraction
technologies may confound the determination of appropriate values. OSDs
deployed on top of VMware storage have been reported to also
sometimes report a ``rotational`` attribute that does not match the underlying
hardware.
In Quincy and later releases, you can use the
:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow
automatic discovery of the correct value as each OSD is created. Note that the
use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, `iSCSI`, or
other device-layering and abstraction technologies might confound the
determination of correct values. Moreover, OSDs deployed on top of VMware
storage have sometimes been found to report a ``rotational`` attribute that
does not match the underlying hardware.
We suggest inspecting such OSDs at startup via logs and admin sockets to ensure that
behavior is appropriate. Note that this also may not work as desired with
older kernels. You can check for this by examining the presence and value
of ``/sys/block/<drive>/queue/optimal_io_size``.
We suggest inspecting such OSDs at startup via logs and admin sockets in order
to ensure that their behavior is correct. Be aware that this kind of inspection
might not work as expected with older kernels. To check for this issue,
examine the presence and value of ``/sys/block/<drive>/queue/optimal_io_size``.
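For example, on the OSD node (``sda`` is a placeholder device name):

.. prompt:: bash #

   cat /sys/block/sda/queue/optimal_io_size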
You may also inspect a given OSD:
.. note:: When running Reef or a later Ceph release, the ``min_alloc_size``
baked into each OSD is conveniently reported by ``ceph osd metadata``.
To inspect a specific OSD, run the following command:
.. prompt:: bash #
ceph osd metadata osd.1701 | grep rotational
ceph osd metadata osd.1701 | egrep rotational\|alloc
This space amplification may manifest as an unusually high ratio of raw to
stored data reported by ``ceph df``. ``ceph osd df`` may also report
anomalously high ``%USE`` / ``VAR`` values when
compared to other, ostensibly identical OSDs. A pool using OSDs with
mismatched ``min_alloc_size`` values may experience unexpected balancer
behavior as well.
Note that this BlueStore attribute takes effect *only* at OSD creation; if
changed later, a given OSD's behavior will not change unless / until it is
destroyed and redeployed with the appropriate option value(s). Upgrading
to a later Ceph release will *not* change the value used by OSDs deployed
under older releases or with other settings.
This space amplification might manifest as an unusually high ratio of raw to
stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR``
values reported by ``ceph osd df`` that are unusually high in comparison to
other, ostensibly identical, OSDs. Finally, there might be unexpected balancer
behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values.
This BlueStore attribute takes effect *only* at OSD creation; if the attribute
is changed later, a specific OSD's behavior will not change unless and until
the OSD is destroyed and redeployed with the appropriate option value(s).
Upgrading to a later Ceph release will *not* change the value used by OSDs that
were deployed under older releases or with other settings.
.. confval:: bluestore_min_alloc_size
.. confval:: bluestore_min_alloc_size_hdd
.. confval:: bluestore_min_alloc_size_ssd
.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size
DSA (Data Streaming Accelerator Usage)
DSA (Data Streaming Accelerator) Usage
======================================
If you want to use the DML library to drive DSA device for offloading
read/write operations on Persist memory in Bluestore. You need to install
`DML`_ and `idxd-config`_ library in your machine with SPR (Sapphire Rapids) CPU.
If you want to use the DML library to drive the DSA device for offloading
read/write operations on persistent memory (PMEM) in BlueStore, you need to
install `DML`_ and the `idxd-config`_ library. This will work only on machines
that have an SPR (Sapphire Rapids) CPU.
.. _DML: https://github.com/intel/DML
.. _dml: https://github.com/intel/dml
.. _idxd-config: https://github.com/intel/idxd-config
After installing the DML software, you need to configure the shared
work queues (WQs) with the following WQ configuration example via accel-config tool:
After installing the DML software, configure the shared work queues (WQs) with
reference to the following WQ configuration example:
.. prompt:: bash $
accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
accel-config config-engine dsa0/engine0.1 --group-id=1
accel-config enable-device dsa0
accel-config enable-wq dsa0/wq0.1

View File

@ -4,116 +4,116 @@
Configuring Ceph
==================
When Ceph services start, the initialization process activates a series
of daemons that run in the background. A :term:`Ceph Storage Cluster` runs
at a minimum three types of daemons:
When Ceph services start, the initialization process activates a series of
daemons that run in the background. A :term:`Ceph Storage Cluster` runs at
least three types of daemons:
- :term:`Ceph Monitor` (``ceph-mon``)
- :term:`Ceph Manager` (``ceph-mgr``)
- :term:`Ceph OSD Daemon` (``ceph-osd``)
Ceph Storage Clusters that support the :term:`Ceph File System` also run at
least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that
support :term:`Ceph Object Storage` run Ceph RADOS Gateway daemons
(``radosgw``) as well.
least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that support
:term:`Ceph Object Storage` run Ceph RADOS Gateway daemons (``radosgw``).
Each daemon has a number of configuration options, each of which has a
default value. You may adjust the behavior of the system by changing these
configuration options. Be careful to understand the consequences before
Each daemon has a number of configuration options, each of which has a default
value. You may adjust the behavior of the system by changing these
configuration options. Be careful to understand the consequences before
overriding default values, as it is possible to significantly degrade the
performance and stability of your cluster. Also note that default values
sometimes change between releases, so it is best to review the version of
this documentation that aligns with your Ceph release.
performance and stability of your cluster. Note too that default values
sometimes change between releases. For this reason, it is best to review the
version of this documentation that applies to your Ceph release.
Option names
============
All Ceph configuration options have a unique name consisting of words
formed with lower-case characters and connected with underscore
(``_``) characters.
Each of the Ceph configuration options has a unique name that consists of words
formed with lowercase characters and connected with underscore characters
(``_``).
When option names are specified on the command line, either underscore
(``_``) or dash (``-``) characters can be used interchangeable (e.g.,
When option names are specified on the command line, underscore (``_``) and
dash (``-``) characters can be used interchangeably (for example,
``--mon-host`` is equivalent to ``--mon_host``).
When option names appear in configuration files, spaces can also be
used in place of underscore or dash. We suggest, though, that for
clarity and convenience you consistently use underscores, as we do
When option names appear in configuration files, spaces can also be used in
place of underscores or dashes. However, for the sake of clarity and
convenience, we suggest that you consistently use underscores, as we do
throughout this documentation.
Config sources
==============
Each Ceph daemon, process, and library will pull its configuration
from several sources, listed below. Sources later in the list will
override those earlier in the list when both are present.
Each Ceph daemon, process, and library pulls its configuration from one or more
of the several sources listed below. Sources that occur later in the list
override those that occur earlier in the list (when both are present).
- the compiled-in default value
- the monitor cluster's centralized configuration database
- a configuration file stored on the local host
- environment variables
- command line arguments
- runtime overrides set by an administrator
- command-line arguments
- runtime overrides that are set by an administrator
One of the first things a Ceph process does on startup is parse the
configuration options provided via the command line, environment, and
local configuration file. The process will then contact the monitor
cluster to retrieve configuration stored centrally for the entire
cluster. Once a complete view of the configuration is available, the
daemon or process startup will proceed.
configuration options provided via the command line, via the environment, and
via the local configuration file. Next, the process contacts the monitor
cluster to retrieve centrally-stored configuration for the entire cluster.
After a complete view of the configuration is available, the startup of the
daemon or process will commence.
.. _bootstrap-options:
Bootstrap options
-----------------
Some configuration options affect the process's ability to contact the
monitors, to authenticate, and to retrieve the cluster-stored configuration.
For this reason, these options might need to be stored locally on the node, and
set by means of a local configuration file. These options include the
following:
Bootstrap options are configuration options that affect the process's ability
to contact the monitors, to authenticate, and to retrieve the cluster-stored
configuration. For this reason, these options might need to be stored locally
on the node, and set by means of a local configuration file. These options
include the following:
.. confval:: mon_host
.. confval:: mon_host_override
- :confval:`mon_dns_srv_name`
- ``mon_data``, ``osd_data``, ``mds_data``, ``mgr_data``, and
similar options that define which local directory the daemon
stores its data in.
- :confval:`keyring`, :confval:`keyfile`, and/or :confval:`key`, which can be used to
specify the authentication credential to use to authenticate with
the monitor. Note that in most cases the default keyring location
is in the data directory specified above.
- :confval:`mon_data`, :confval:`osd_data`, :confval:`mds_data`,
:confval:`mgr_data`, and similar options that define which local directory
the daemon stores its data in.
- :confval:`keyring`, :confval:`keyfile`, and/or :confval:`key`, which can be
used to specify the authentication credential to use to authenticate with the
monitor. Note that in most cases the default keyring location is in the data
directory specified above.
In most cases, the default values of these options are suitable. There is one
exception to this: the :confval:`mon_host` option that identifies the addresses
of the cluster's monitors. When DNS is used to identify monitors, a local Ceph
In most cases, there is no reason to modify the default values of these
options. However, there is one exception: the :confval:`mon_host` option,
which identifies the addresses of the cluster's monitors. When :ref:`DNS is
used to identify monitors<mon-dns-lookup>`, a local Ceph configuration file
can be avoided entirely.
Skipping monitor config
-----------------------
Pass the option ``--no-mon-config`` to any process to skip the step that
retrieves configuration information from the cluster monitors. This is useful
in cases where configuration is managed entirely via configuration files, or
when the monitor cluster is down and some maintenance activity needs to be
done.
The option ``--no-mon-config`` can be passed in any command in order to skip
the step that retrieves configuration information from the cluster's monitors.
Skipping this retrieval step can be useful in cases where configuration is
managed entirely via configuration files, or when maintenance activity needs to
be done but the monitor cluster is down.
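For example, a daemon might be started with configuration taken only from local
sources (the OSD id ``0`` is a placeholder):

.. prompt:: bash #

   ceph-osd -i 0 --no-mon-config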
.. _ceph-conf-file:
Configuration sections
======================
Any given process or daemon has a single value for each configuration
option. However, values for an option may vary across different
daemon types even daemons of the same type. Ceph options that are
stored in the monitor configuration database or in local configuration
files are grouped into sections to indicate which daemons or clients
they apply to.
Each of the configuration options associated with a single process or daemon
has a single value. However, the values for a configuration option can vary
across daemon types, and can vary even across different daemons of the same
type. Ceph options that are stored in the monitor configuration database or in
local configuration files are grouped into sections |---| so-called "configuration
sections" |---| to indicate which daemons or clients they apply to.
These sections include:
These sections include the following:
.. confsec:: global
@ -156,43 +156,42 @@ These sections include:
.. confsec:: client
Settings under ``client`` affect all Ceph Clients
(e.g., mounted Ceph File Systems, mounted Ceph Block Devices,
etc.) as well as Rados Gateway (RGW) daemons.
Settings under ``client`` affect all Ceph clients
(for example, mounted Ceph File Systems, mounted Ceph Block Devices)
as well as RADOS Gateway (RGW) daemons.
:example: ``objecter_inflight_ops = 512``
Sections may also specify an individual daemon or client name. For example,
Configuration sections can also specify an individual daemon or client name. For example,
``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names.
Any given daemon will draw its settings from the global section, the
daemon or client type section, and the section sharing its name.
Settings in the most-specific section take precedence, so for example
if the same option is specified in both :confsec:`global`, :confsec:`mon`, and
``mon.foo`` on the same source (i.e., in the same configurationfile),
the ``mon.foo`` value will be used.
Any given daemon will draw its settings from the global section, the daemon-
or client-type section, and the section sharing its name. Settings in the
most-specific section take precedence: for example, if the same option is
specified in :confsec:`global`, :confsec:`mon`, and ``mon.foo`` in the same
source (that is, in the same configuration file), the ``mon.foo`` setting will
be used.
If multiple values of the same configuration option are specified in the same
section, the last value wins.
Note that values from the local configuration file always take
precedence over values from the monitor configuration database,
regardless of which section they appear in.
section, the last value specified takes precedence.
Note that values from the local configuration file always take precedence over
values from the monitor configuration database, regardless of the section in
which they appear.
.. _ceph-metavariables:
Metavariables
=============
Metavariables simplify Ceph Storage Cluster configuration
dramatically. When a metavariable is set in a configuration value,
Ceph expands the metavariable into a concrete value at the time the
configuration value is used. Ceph metavariables are similar to variable expansion in the Bash shell.
Metavariables dramatically simplify Ceph storage cluster configuration. When a
metavariable is set in a configuration value, Ceph expands the metavariable at
the time the configuration value is used. In this way, Ceph metavariables
behave similarly to the way that variable expansion works in the Bash shell.
Ceph supports the following metavariables:
Ceph supports the following metavariables:
.. describe:: $cluster
@ -204,7 +203,7 @@ Ceph supports the following metavariables:
.. describe:: $type
Expands to a daemon or process type (e.g., ``mds``, ``osd``, or ``mon``)
Expands to a daemon or process type (for example, ``mds``, ``osd``, or ``mon``)
:example: ``/var/lib/ceph/$type``
@ -233,33 +232,32 @@ Ceph supports the following metavariables:
:example: ``/var/run/ceph/$cluster-$name-$pid.asok``
The Configuration File
======================
Ceph configuration file
=======================
On startup, Ceph processes search for a configuration file in the
following locations:
#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF``
#. ``$CEPH_CONF`` (that is, the path following the ``$CEPH_CONF``
environment variable)
#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument)
#. ``-c path/path`` (that is, the ``-c`` command line argument)
#. ``/etc/ceph/$cluster.conf``
#. ``~/.ceph/$cluster.conf``
#. ``./$cluster.conf`` (*i.e.,* in the current working directory)
#. ``./$cluster.conf`` (that is, in the current working directory)
#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf``
where ``$cluster`` is the cluster's name (default ``ceph``).
Here ``$cluster`` is the cluster's name (default: ``ceph``).
The Ceph configuration file uses an *ini* style syntax. You can add comment
text after a pound sign (#) or a semi-colon (;). For example:
The Ceph configuration file uses an ``ini`` style syntax. You can add comment
text after a pound sign (#) or a semicolon (;). For example:
.. code-block:: ini
# <--A number (#) sign precedes a comment.
; A comment may be anything.
# Comments always follow a semi-colon (;) or a pound (#) on each line.
# The end of the line terminates a comment.
# We recommend that you provide comments in your configuration file(s).
# <--A number sign (#) precedes a comment.
; A comment may be anything.
# Comments always follow a semicolon (;) or a pound sign (#) on each line.
# The end of the line terminates a comment.
# We recommend that you provide comments in your configuration file(s).
.. _ceph-conf-settings:
@ -268,40 +266,41 @@ Config file section names
-------------------------
The configuration file is divided into sections. Each section must begin with a
valid configuration section name (see `Configuration sections`_, above)
surrounded by square brackets. For example,
valid configuration section name (see `Configuration sections`_, above) that is
surrounded by square brackets. For example:
.. code-block:: ini
[global]
debug_ms = 0
[osd]
debug_ms = 1
[global]
debug_ms = 0
[osd.1]
debug_ms = 10
[osd]
debug_ms = 1
[osd.2]
debug_ms = 10
[osd.1]
debug_ms = 10
[osd.2]
debug_ms = 10
Config file option values
-------------------------
The value of a configuration option is a string. If it is too long to
fit in a single line, you can put a backslash (``\``) at the end of line
as the line continuation marker, so the value of the option will be
the string after ``=`` in current line combined with the string in the next
line::
The value of a configuration option is a string. If the string is too long to
fit on a single line, you can put a backslash (``\``) at the end of the line
and the backslash will act as a line continuation marker. In such a case, the
value of the option will be the string after ``=`` in the current line,
combined with the string in the next line. Here is an example::
[global]
foo = long long ago\
long ago
In the example above, the value of "``foo``" would be "``long long ago long ago``".
In this example, the value of the "``foo``" option is "``long long ago long
ago``".
Normally, the option value ends with a new line, or a comment, like
An option value typically ends with either a newline or a comment. For
example:
.. code-block:: ini
@ -309,100 +308,108 @@ Normally, the option value ends with a new line, or a comment, like
obscure_one = difficult to explain # I will try harder in next release
simpler_one = nothing to explain
In the example above, the value of "``obscure one``" would be "``difficult to explain``";
and the value of "``simpler one`` would be "``nothing to explain``".
In this example, the value of the "``obscure_one``" option is "``difficult to
explain``" and the value of the "``simpler_one``" option is "``nothing to
explain``".
If an option value contains spaces, and we want to make it explicit, we
could quote the value using single or double quotes, like
When an option value contains spaces, it can be enclosed within single quotes
or double quotes in order to make its scope clear and to ensure that the first
space in the value is not interpreted as the end of the value. For example:
.. code-block:: ini
[global]
line = "to be, or not to be"
Certain characters are not allowed to be present in the option values directly.
They are ``=``, ``#``, ``;`` and ``[``. If we have to, we need to escape them,
like
In option values, four characters have special meaning: ``=``, ``#``, ``;``,
and ``[``. These characters are permitted to occur in an option value only if
they are immediately preceded by the backslash character (``\``). For example:
.. code-block:: ini
[global]
secret = "i love \# and \["
Every configuration option is typed with one of the types below:
Each configuration option falls under one of the following types:
.. describe:: int
64-bit signed integer, Some SI prefixes are supported, like "K", "M", "G",
"T", "P", "E", meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`,
10\ :sup:`9`, etc. And "B" is the only supported unit. So, "1K", "1M", "128B" and "-1" are all valid
option values. Some times, a negative value implies "unlimited" when it comes to
an option for threshold or limit.
64-bit signed integer. Some SI suffixes are supported, such as "K", "M",
"G", "T", "P", and "E" (meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`,
10\ :sup:`9`, etc.). "B" is the only supported unit string. Thus "1K", "1M",
"128B" and "-1" are all valid option values. When a negative value is
assigned to a threshold option, this can indicate that the option is
"unlimited" -- that is, that there is no threshold or limit in effect.
:example: ``42``, ``-1``
.. describe:: uint
It is almost identical to ``integer``. But a negative value will be rejected.
This differs from ``integer`` only in that negative values are not
permitted.
:example: ``256``, ``0``
.. describe:: str
Free style strings encoded in UTF-8, but some characters are not allowed. Please
reference the above notes for the details.
A string encoded in UTF-8. Certain characters are not permitted. Reference
the above notes for the details.
:example: ``"hello world"``, ``"i love \#"``, ``yet-another-name``
.. describe:: boolean
one of the two values ``true`` or ``false``. But an integer is also accepted,
where "0" implies ``false``, and any non-zero values imply ``true``.
Typically either of the two values ``true`` or ``false``. However, any
integer is permitted: "0" implies ``false``, and any non-zero value implies
``true``.
:example: ``true``, ``false``, ``1``, ``0``
.. describe:: addr
a single address optionally prefixed with ``v1``, ``v2`` or ``any`` for the messenger
protocol. If the prefix is not specified, ``v2`` protocol is used. Please see
:ref:`address_formats` for more details.
A single address, optionally prefixed with ``v1``, ``v2`` or ``any`` for the
messenger protocol. If no prefix is specified, the ``v2`` protocol is used.
For more details, see :ref:`address_formats`.
:example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789``
.. describe:: addrvec
a set of addresses separated by ",". The addresses can be optionally quoted with ``[`` and ``]``.
A set of addresses separated by ",". The addresses can be optionally quoted
with ``[`` and ``]``.
:example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567`` ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]``
.. describe:: uuid
the string format of a uuid defined by `RFC4122 <https://www.ietf.org/rfc/rfc4122.txt>`_.
And some variants are also supported, for more details, see
`Boost document <https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_.
The string format of a uuid defined by `RFC4122
<https://www.ietf.org/rfc/rfc4122.txt>`_. Certain variants are also
supported: for more details, see `Boost document
<https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_.
:example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6``
.. describe:: size
denotes a 64-bit unsigned integer. Both SI prefixes and IEC prefixes are
supported. And "B" is the only supported unit. A negative value will be
rejected.
64-bit unsigned integer. Both SI prefixes and IEC prefixes are supported.
"B" is the only supported unit string. Negative values are not permitted.
:example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``.
.. describe:: secs
denotes a duration of time. By default the unit is second if not specified.
Following units of time are supported:
Denotes a duration of time. The default unit of time is the second.
The following units of time are supported:
* second: "s", "sec", "second", "seconds"
* minute: "m", "min", "minute", "minutes"
* hour: "hs", "hr", "hour", "hours"
* day: "d", "day", "days"
* week: "w", "wk", "week", "weeks"
* month: "mo", "month", "months"
* year: "y", "yr", "year", "years"
* second: ``s``, ``sec``, ``second``, ``seconds``
* minute: ``m``, ``min``, ``minute``, ``minutes``
* hour: ``hs``, ``hr``, ``hour``, ``hours``
* day: ``d``, ``day``, ``days``
* week: ``w``, ``wk``, ``week``, ``weeks``
* month: ``mo``, ``month``, ``months``
* year: ``y``, ``yr``, ``year``, ``years``
:example: ``1 m``, ``1m`` and ``1 week``
@ -411,102 +418,103 @@ Every configuration option is typed with one of the types below:
Monitor configuration database
==============================
The monitor cluster manages a database of configuration options that
can be consumed by the entire cluster, enabling streamlined central
configuration management for the entire system. The vast majority of
configuration options can and should be stored here for ease of
administration and transparency.
The monitor cluster manages a database of configuration options that can be
consumed by the entire cluster. This allows for streamlined central
configuration management of the entire system. For ease of administration and
transparency, the vast majority of configuration options can and should be
stored in this database.
A handful of settings may still need to be stored in local
configuration files because they affect the ability to connect to the
monitors, authenticate, and fetch configuration information. In most
cases this is limited to the ``mon_host`` option, although this can
also be avoided through the use of DNS SRV records.
Some settings might need to be stored in local configuration files because they
affect the ability of the process to connect to the monitors, to authenticate,
and to fetch configuration information. In most cases this applies only to the
``mon_host`` option. This issue can be avoided by using :ref:`DNS SRV
records<mon-dns-lookup>`.
Sections and masks
------------------
Configuration options stored by the monitor can live in a global
section, daemon type section, or specific daemon section, just like
options in a configuration file can.
Configuration options stored by the monitor can reside in a global section,
in a daemon-type section, or in a specific daemon section. In this, they are
no different from the options in a configuration file.
In addition, options may also have a *mask* associated with them to
further restrict which daemons or clients the option applies to.
Masks take two forms:
In addition, options may have a *mask* associated with them to further restrict
which daemons or clients the option applies to. Masks take two forms:
#. ``type:location`` where *type* is a CRUSH property like `rack` or
`host`, and *location* is a value for that property. For example,
#. ``type:location`` where ``type`` is a CRUSH property like ``rack`` or
``host``, and ``location`` is a value for that property. For example,
``host:foo`` would limit the option only to daemons or clients
running on a particular host.
#. ``class:device-class`` where *device-class* is the name of a CRUSH
device class (e.g., ``hdd`` or ``ssd``). For example,
#. ``class:device-class`` where ``device-class`` is the name of a CRUSH
device class (for example, ``hdd`` or ``ssd``). For example,
``class:ssd`` would limit the option only to OSDs backed by SSDs.
(This mask has no effect for non-OSD daemons or clients.)
(This mask has no effect on non-OSD daemons or clients.)
When setting a configuration option, the `who` may be a section name,
a mask, or a combination of both separated by a slash (``/``)
character. For example, ``osd/rack:foo`` would mean all OSD daemons
in the ``foo`` rack.
When viewing configuration options, the section name and mask are
generally separated out into separate fields or columns to ease readability.
In commands that specify a configuration option, the argument of the option (in
the following examples, this is the "who" string) may be a section name, a
mask, or a combination of both separated by a slash character (``/``). For
example, ``osd/rack:foo`` would refer to all OSD daemons in the ``foo`` rack.
When configuration options are shown, the section name and mask are presented
in separate fields or columns to make them more readable.
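For example, to apply one option only to OSDs backed by SSDs and another only
to OSDs on the host ``foo`` (the option values are only illustrative), commands
like the following could be used:

.. prompt:: bash $

   ceph config set osd/class:ssd osd_max_backfills 3
   ceph config set osd/host:foo debug_osd 10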
Commands
--------
The following CLI commands are used to configure the cluster:
* ``ceph config dump`` will dump the entire configuration database for
the cluster.
* ``ceph config dump`` dumps the entire monitor configuration
database for the cluster.
* ``ceph config get <who>`` will dump the configuration for a specific
daemon or client (e.g., ``mds.a``), as stored in the monitors'
configuration database.
* ``ceph config get <who>`` dumps the configuration options stored in
the monitor configuration database for a specific daemon or client
(for example, ``mds.a``).
* ``ceph config set <who> <option> <value>`` will set a configuration
option in the monitors' configuration database.
* ``ceph config get <who> <option>`` shows either a configuration value
stored in the monitor configuration database for a specific daemon or client
(for example, ``mds.a``), or, if that value is not present in the monitor
configuration database, the compiled-in default value.
* ``ceph config show <who>`` will show the reported running
configuration for a running daemon. These settings may differ from
those stored by the monitors if there are also local configuration
files in use or options have been overridden on the command line or
at run time. The source of the option values is reported as part
of the output.
* ``ceph config set <who> <option> <value>`` specifies a configuration
option in the monitor configuration database.
* ``ceph config assimilate-conf -i <input file> -o <output file>``
will ingest a configuration file from *input file* and move any
valid options into the monitors' configuration database. Any
settings that are unrecognized, invalid, or cannot be controlled by
the monitor will be returned in an abbreviated config file stored in
*output file*. This command is useful for transitioning from legacy
configuration files to centralized monitor-based configuration.
* ``ceph config show <who>`` shows the configuration for a running daemon.
These settings might differ from those stored by the monitors if there are
also local configuration files in use or if options have been overridden on
the command line or at run time. The source of the values of the options is
displayed in the output.
* ``ceph config assimilate-conf -i <input file> -o <output file>`` ingests a
configuration file from *input file* and moves any valid options into the
monitor configuration database. Any settings that are unrecognized, are
invalid, or cannot be controlled by the monitor will be returned in an
abbreviated configuration file stored in *output file*. This command is
useful for transitioning from legacy configuration files to centralized
monitor-based configuration.
Note that ``ceph config set <who> <option> <value>`` and ``ceph config get
<who> <option>`` will not necessarily return the same values. The latter
command will show compiled-in default values. In order to determine whether a
configuration option is present in the monitor configuration database, run
``ceph config dump``.
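As a brief illustration that ties these commands together (the daemon id and
the option are arbitrary examples):

.. prompt:: bash $

   ceph config set osd.123 debug_osd 10
   ceph config get osd.123 debug_osd
   ceph config dump | grep debug_osd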
Help
====
You can get help for a particular option with:
To get help for a particular option, run the following command:
.. prompt:: bash $
ceph config help <option>
Note that this will use the configuration schema that is compiled into the running monitors. If you have a mixed-version cluster (e.g., during an upgrade), you might also want to query the option schema from a specific running daemon:
.. prompt:: bash $
ceph daemon <name> config help [option]
For example:
.. prompt:: bash $
ceph config help log_file
::
::
log_file - path to log file
log_file - path to log file
(std::string, basic)
Default (non-daemon):
Default (daemon): /var/log/ceph/$cluster-$name.log
@ -543,20 +551,29 @@ or:
"can_update_at_runtime": false
}
The ``level`` property can be any of `basic`, `advanced`, or `dev`.
The `dev` options are intended for use by developers, generally for
testing purposes, and are not recommended for use by operators.
The ``level`` property can be ``basic``, ``advanced``, or ``dev``. The ``dev``
options are intended for use by developers, generally for testing purposes, and
are not recommended for use by operators.
.. note:: This command uses the configuration schema that is compiled into the
running monitors. If you have a mixed-version cluster (as might exist, for
example, during an upgrade), you might want to query the option schema from
a specific running daemon by running a command of the following form:
.. prompt:: bash $
ceph daemon <name> config help [option]
Runtime Changes
===============
In most cases, Ceph permits changes to the configuration of a daemon at
runtime. This can be used for increasing or decreasing the amount of logging
run time. This can be used for increasing or decreasing the amount of logging
output, for enabling or disabling debug settings, and for runtime optimization.
Configuration options can be updated via the ``ceph config set`` command. For
example, to enable the debug log level on a specific OSD, run a command of this form:
Use the ``ceph config set`` command to update configuration options. For
example, to enable the most verbose debug log level on a specific OSD, run a
command of the following form:
.. prompt:: bash $
@ -565,129 +582,133 @@ example, to enable the debug log level on a specific OSD, run a command of this
.. note:: If an option has been customized in a local configuration file, the
`central config
<https://ceph.io/en/news/blog/2018/new-mimic-centralized-configuration-management/>`_
setting will be ignored (it has a lower priority than the local
configuration file).
setting will be ignored because it has a lower priority than the local
configuration file.
.. note:: Log levels range from 0 to 20.
Override values
---------------
Options can be set temporarily by using the `tell` or `daemon` interfaces on
the Ceph CLI. These *override* values are ephemeral, which means that they
affect only the current instance of the daemon and revert to persistently
configured values when the daemon restarts.
Options can be set temporarily by using the ``tell`` or ``daemon``
interfaces of the Ceph CLI. These *override* values are ephemeral, which means
that they affect only the current instance of the daemon and revert to
persistently configured values when the daemon restarts.
Override values can be set in two ways:
#. From any host, send a message to a daemon with a command of the following
form:
.. prompt:: bash $
ceph tell <name> config set <option> <value>
For example:
.. prompt:: bash $
ceph tell osd.123 config set debug_osd 20
The ``tell`` command can also accept a wildcard as the daemon identifier.
For example, to adjust the debug level on all OSD daemons, run a command of
this form:
the following form:
.. prompt:: bash $
ceph tell osd.* config set debug_osd 20
#. On the host where the daemon is running, connect to the daemon via a socket
in ``/var/run/ceph`` by running a command of this form:
in ``/var/run/ceph`` by running a command of the following form:
.. prompt:: bash $
ceph daemon <name> config set <option> <value>
For example:
.. prompt:: bash $
ceph daemon osd.4 config set debug_osd 20
.. note:: In the output of the ``ceph config show`` command, these temporary
values are shown with a source of ``override``.
values are shown to have a source of ``override``.
Viewing runtime settings
========================
You can see the current options set for a running daemon with the ``ceph config show`` command. For example:
You can see the current settings specified for a running daemon with the ``ceph
config show`` command. For example, to see the (non-default) settings for the
daemon ``osd.0``, run the following command:
.. prompt:: bash $
ceph config show osd.0
will show you the (non-default) options for that daemon. You can also look at a specific option with:
To see a specific setting, run the following command:
.. prompt:: bash $
ceph config show osd.0 debug_osd
or view all options (even those with default values) with:
To see all settings (including those with default values), run the following
command:
.. prompt:: bash $
ceph config show-with-defaults osd.0
You can also observe settings for a running daemon by connecting to it from the local host via the admin socket. For example:
You can see all settings for a daemon that is currently running by connecting
to it on the local host via the admin socket. For example, to dump all
current settings, run the following command:
.. prompt:: bash $
ceph daemon osd.0 config show
will dump all current settings:
To see non-default settings and to see where each value came from (for example,
a config file, the monitor, or an override), run the following command:
.. prompt:: bash $
ceph daemon osd.0 config diff
will show only non-default settings (as well as where the value came from: a config file, the monitor, an override, etc.), and:
To see the value of a single setting, run the following command:
.. prompt:: bash $
ceph daemon osd.0 config get debug_osd
will report the value of a single option.
Changes since Nautilus
======================
Changes introduced in Octopus
=============================
With the Octopus release, we changed the way the configuration file is parsed.
These changes are as follows:
- Repeated configuration options are allowed, and no warnings will be printed.
The value of the last one is used, which means that the setting last in the file
is the one that takes effect. Before this change, we would print warning messages
when lines with duplicated options were encountered, like::
- Repeated configuration options are allowed, and no warnings will be
displayed. This means that the setting that comes last in the file is the one
that takes effect. Prior to this change, Ceph displayed warning messages when
lines containing duplicate options were encountered, such as::
warning line 42: 'foo' in section 'bar' redefined
- Invalid UTF-8 options were ignored with warning messages. But since Octopus,
they are treated as fatal errors.
- Backslash ``\`` is used as the line continuation marker to combine the next
line with current one. Before Octopus, it was required to follow a backslash with
a non-empty line. But in Octopus, an empty line following a backslash is now allowed.
- Prior to Octopus, options containing invalid UTF-8 characters were ignored
with warning messages. But in Octopus, they are treated as fatal errors.
- The backslash character ``\`` is used as the line-continuation marker that
combines the next line with the current one. Prior to Octopus, there was a
requirement that any end-of-line backslash be followed by a non-empty line.
But in Octopus, an empty line following a backslash is allowed.
- In the configuration file, each line specifies an individual configuration
option. The option's name and its value are separated with ``=``, and the
value may be quoted using single or double quotes. If an invalid
value may be enclosed within single or double quotes. If an invalid
configuration is specified, we will treat it as an invalid configuration
file ::
file::
bad option ==== bad value
- Prior to Octopus, if no section name was specified in the configuration file,
all options would be set as though they were within the :confsec:`global`
section. This approach is discouraged. Since Octopus, any configuration
file that has no section name must contain only a single option.
- Before Octopus, if no section name was specified in the configuration file,
all options would be set as though they were within the :confsec:`global` section. This is
now discouraged. Since Octopus, only a single option is allowed for
configuration files without a section name.
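The following hypothetical ``ceph.conf`` fragment illustrates the parsing rules
described above (the option values are illustrative only):

.. code-block:: ini

   [global]
   # Repeated options are allowed; the last occurrence wins, so the
   # effective value of "debug_ms" is 1/5.
   debug_ms = 0/0
   debug_ms = 1/5

   # A trailing backslash combines the next line with the current one.
   mon_host = 10.0.0.1, 10.0.0.2, \
   10.0.0.3

   # A value may be enclosed in single or double quotes.
   cluster_network = "10.0.0.0/24"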
.. |---| unicode:: U+2014 .. EM DASH :trim:

View File

@ -1,4 +1,3 @@
.. _ceph-conf-common-settings:
Common Settings
@ -7,30 +6,33 @@ Common Settings
The `Hardware Recommendations`_ section provides some hardware guidelines for
configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph
Node` to run multiple daemons. For example, a single node with multiple drives
may run one ``ceph-osd`` for each drive. Ideally, you will have a node for a
particular type of process. For example, some nodes may run ``ceph-osd``
daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may
run ``ceph-mon`` daemons.
usually runs one ``ceph-osd`` for each drive. Ideally, each node will be
assigned to a particular type of process. For example, some nodes might run
``ceph-osd`` daemons, other nodes might run ``ceph-mds`` daemons, and still
other nodes might run ``ceph-mon`` daemons.
Each node has a name. The name of a node can be found in its ``host`` setting.
Monitors also specify a network address and port (that is, a domain name or IP
address) that can be found in the ``addr`` setting. A basic configuration file
typically specifies only minimal settings for each instance of monitor daemons.
For example:
Each node has a name identified by the ``host`` setting. Monitors also specify
a network address and port (i.e., domain name or IP address) identified by the
``addr`` setting. A basic configuration file will typically specify only
minimal settings for each instance of monitor daemons. For example:
.. code-block:: ini
[global]
mon_initial_members = ceph1
mon_host = 10.0.0.1
[global]
mon_initial_members = ceph1
mon_host = 10.0.0.1
.. important:: The ``host`` setting is the short name of the node (i.e., not
an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on
the command line to retrieve the name of the node. Do not use ``host``
settings for anything other than initial monitors unless you are deploying
Ceph manually. You **MUST NOT** specify ``host`` under individual daemons
when using deployment tools like ``chef`` or ``cephadm``, as those tools
will enter the appropriate values for you in the cluster map.
.. important:: The ``host`` setting's value is the short name of the node. It
is not an FQDN. It is **NOT** an IP address. To retrieve the name of the
node, enter ``hostname -s`` on the command line. Unless you are deploying
Ceph manually, do not use ``host`` settings for anything other than initial
monitor setup. **DO NOT** specify the ``host`` setting under individual
daemons when using deployment tools like ``chef`` or ``cephadm``. Such tools
are designed to enter the appropriate values for you in the cluster map.
.. _ceph-network-config:
@ -38,34 +40,35 @@ minimal settings for each instance of monitor daemons. For example:
Networks
========
See the `Network Configuration Reference`_ for a detailed discussion about
configuring a network for use with Ceph.
For more about configuring a network for use with Ceph, see the `Network
Configuration Reference`_ .
Monitors
========
Production Ceph clusters typically provision a minimum of three :term:`Ceph Monitor`
daemons to ensure availability should a monitor instance crash. A minimum of
three ensures that the Paxos algorithm can determine which version
of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph
Ceph production clusters typically provision at least three :term:`Ceph
Monitor` daemons to ensure availability in the event of a monitor instance
crash. A minimum of three :term:`Ceph Monitor` daemons ensures that the Paxos
algorithm is able to determine which version of the :term:`Ceph Cluster Map` is
the most recent. It makes this determination by consulting a majority of Ceph
Monitors in the quorum.
.. note:: You may deploy Ceph with a single monitor, but if the instance fails,
the lack of other monitors may interrupt data service availability.
the lack of other monitors might interrupt data-service availability.
Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and ``6789`` for the old v1 protocol.
Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and on
port ``6789`` for the old v1 protocol.
By default, Ceph expects to store monitor data under the
following path::
By default, Ceph expects to store monitor data on the following path::
/var/lib/ceph/mon/$cluster-$id
/var/lib/ceph/mon/$cluster-$id
You or a deployment tool (e.g., ``cephadm``) must create the corresponding
directory. With metavariables fully expressed and a cluster named "ceph", the
foregoing directory would evaluate to::
You or a deployment tool (for example, ``cephadm``) must create the
corresponding directory. With metavariables fully expressed and a cluster named
"ceph", the path specified in the above example evaluates to::
/var/lib/ceph/mon/ceph-a
/var/lib/ceph/mon/ceph-a
For additional details, see the `Monitor Config Reference`_.
@ -74,22 +77,22 @@ For additional details, see the `Monitor Config Reference`_.
.. _ceph-osd-config:
Authentication
==============
.. versionadded:: Bobtail 0.56
For Bobtail (v 0.56) and beyond, you should expressly enable or disable
authentication in the ``[global]`` section of your Ceph configuration file.
Authentication is explicitly enabled or disabled in the ``[global]`` section of
the Ceph configuration file, as shown here:
.. code-block:: ini
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
Additionally, you should enable message signing. See `Cephx Config Reference`_ for details.
In addition, you should enable message signing. For details, see `Cephx Config
Reference`_.
.. _Cephx Config Reference: ../auth-config-ref
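For example, message signing is typically enabled with settings along the
following lines in the ``[global]`` section; consult the Cephx Config Reference
for the authoritative option names and defaults:

.. code-block:: ini

   [global]
   cephx_require_signatures = true
   cephx_cluster_require_signatures = true
   cephx_service_require_signatures = true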
@ -100,65 +103,68 @@ Additionally, you should enable message signing. See `Cephx Config Reference`_ f
OSDs
====
Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node
has one OSD daemon running a Filestore on one storage device. The BlueStore back
end is now default, but when using Filestore you specify a journal size. For example:
When Ceph production clusters deploy :term:`Ceph OSD Daemons`, the typical
arrangement is that one node has one OSD daemon running Filestore on one
storage device. BlueStore is now the default back end, but when using Filestore
you must specify a journal size. For example:
.. code-block:: ini
[osd]
osd_journal_size = 10000
[osd]
osd_journal_size = 10000
[osd.0]
host = {hostname} #manual deployments only.
[osd.0]
host = {hostname} #manual deployments only.
By default, Ceph expects to store a Ceph OSD Daemon's data at the
following path::
By default, Ceph expects to store a Ceph OSD Daemon's data on the following
path::
/var/lib/ceph/osd/$cluster-$id
/var/lib/ceph/osd/$cluster-$id
You or a deployment tool (e.g., ``cephadm``) must create the corresponding
directory. With metavariables fully expressed and a cluster named "ceph", this
example would evaluate to::
You or a deployment tool (for example, ``cephadm``) must create the
corresponding directory. With metavariables fully expressed and a cluster named
"ceph", the path specified in the above example evaluates to::
/var/lib/ceph/osd/ceph-0
/var/lib/ceph/osd/ceph-0
You may override this path using the ``osd_data`` setting. We recommend not
changing the default location. Create the default directory on your OSD host.
You can override this path using the ``osd_data`` setting. We recommend that
you do not change the default location. To create the default directory on your
OSD host, run the following commands:
.. prompt:: bash $
ssh {osd-host}
sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
ssh {osd-host}
sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
The ``osd_data`` path ideally leads to a mount point with a device that is
separate from the device that contains the operating system and
daemons. If an OSD is to use a device other than the OS device, prepare it for
use with Ceph, and mount it to the directory you just created
The ``osd_data`` path ought to lead to a mount point backed by a device that is
distinct from the device that contains the operating system and the daemons. To
use such a device, prepare it for use with Ceph and mount it on
the directory you just created by running the following commands:
.. prompt:: bash $
ssh {new-osd-host}
sudo mkfs -t {fstype} /dev/{disk}
sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number}
ssh {new-osd-host}
sudo mkfs -t {fstype} /dev/{disk}
sudo mount -o user_xattr /dev/{disk} /var/lib/ceph/osd/ceph-{osd-number}
We recommend using the ``xfs`` file system when running
:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and are no
longer tested.)
We recommend using the ``xfs`` file system when running :command:`mkfs`. (The
``btrfs`` and ``ext4`` file systems are not recommended and are no longer
tested.)
See the `OSD Config Reference`_ for additional configuration details.
For additional configuration details, see `OSD Config Reference`_.
Heartbeats
==========
During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons
and report their findings to the Ceph Monitor. You do not have to provide any
settings. However, if you have network latency issues, you may wish to modify
the settings.
and report their findings to the Ceph Monitor. This process does not require
you to provide any settings. However, if you have network latency issues, you
might want to modify the default settings.
See `Configuring Monitor/OSD Interaction`_ for additional details.
For additional details, see `Configuring Monitor/OSD Interaction`_.
.. _ceph-logging-and-debugging:
@ -166,9 +172,9 @@ See `Configuring Monitor/OSD Interaction`_ for additional details.
Logs / Debugging
================
Sometimes you may encounter issues with Ceph that require
modifying logging output and using Ceph's debugging. See `Debugging and
Logging`_ for details on log rotation.
You might sometimes encounter issues with Ceph that require you to use Ceph's
logging and debugging features. For details on log rotation, see `Debugging and
Logging`_.
.. _Debugging and Logging: ../../troubleshooting/log-and-debug
@ -186,32 +192,29 @@ Example ceph.conf
Naming Clusters (deprecated)
============================
Each Ceph cluster has an internal name that is used as part of configuration
and log file names as well as directory and mountpoint names. This name
defaults to "ceph". Previous releases of Ceph allowed one to specify a custom
name instead, for example "ceph2". This was intended to facilitate running
multiple logical clusters on the same physical hardware, but in practice this
was rarely exploited and should no longer be attempted. Prior documentation
could also be misinterpreted as requiring unique cluster names in order to
use ``rbd-mirror``.
Each Ceph cluster has an internal name. This internal name is used as part of
the names of configuration files, log files, directories, and mount points.
This name defaults to "ceph". Previous
releases of Ceph allowed one to specify a custom name instead, for example
"ceph2". This option was intended to facilitate the running of multiple logical
clusters on the same physical hardware, but in practice it was rarely
exploited. Custom cluster names should no longer be attempted. Old
documentation might lead readers to wrongly think that unique cluster names are
required to use ``rbd-mirror``. They are not required.
Custom cluster names are now considered deprecated and the ability to deploy
them has already been removed from some tools, though existing custom name
deployments continue to operate. The ability to run and manage clusters with
custom names may be progressively removed by future Ceph releases, so it is
strongly recommended to deploy all new clusters with the default name "ceph".
them has already been removed from some tools, although existing custom-name
deployments continue to operate. The ability to run and manage clusters with
custom names might be progressively removed by future Ceph releases, so **it is
strongly recommended to deploy all new clusters with the default name "ceph"**.
Some Ceph CLI commands accept an optional ``--cluster`` (cluster name) option. This
option is present purely for backward compatibility and need not be accommodated
by new tools and deployments.
Some Ceph CLI commands accept a ``--cluster`` (cluster name) option. This
option is present only for the sake of backward compatibility. New tools and
deployments cannot be relied upon to accommodate this option.
If you do need to allow multiple clusters to exist on the same host, please use
If you need to allow multiple clusters to exist on the same host, use
:ref:`cephadm`, which uses containers to fully isolate each cluster.
.. _Hardware Recommendations: ../../../start/hardware-recommendations
.. _Network Configuration Reference: ../network-config-ref
.. _OSD Config Reference: ../osd-config-ref

View File

@ -2,8 +2,14 @@
Filestore Config Reference
============================
The Filestore back end is no longer the default when creating new OSDs,
though Filestore OSDs are still supported.
.. note:: Since the Luminous release of Ceph, BlueStore, rather than Filestore,
has been Ceph's default storage back end. However, Filestore OSDs are still
supported. See :ref:`OSD Back Ends
<rados_config_storage_devices_osd_backends>`. See :ref:`BlueStore Migration
<rados_operations_bluestore_migration>` for instructions explaining how to
replace an existing Filestore back end with a BlueStore back end.
``filestore debug omap check``
@ -18,26 +24,31 @@ though Filestore OSDs are still supported.
Extended Attributes
===================
Extended Attributes (XATTRs) are important for Filestore OSDs.
Some file systems have limits on the number of bytes that can be stored in XATTRs.
Additionally, in some cases, the file system may not be as fast as an alternative
method of storing XATTRs. The following settings may help improve performance
by using a method of storing XATTRs that is extrinsic to the underlying file system.
Extended Attributes (XATTRs) are important for Filestore OSDs. However, certain
disadvantages can occur when the underlying file system is used for the storage
of XATTRs: some file systems have limits on the number of bytes that can be
stored in XATTRs, and your file system might in some cases therefore run slower
than would an alternative method of storing XATTRs. For this reason, a method
of storing XATTRs extrinsic to the underlying file system might improve
performance. To implement such an extrinsic method, refer to the following
settings.
Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided
by the underlying file system, if it does not impose a size limit. If
there is a size limit (4KB total on ext4, for instance), some Ceph
XATTRs will be stored in a key/value database when either the
If the underlying file system has no size limit, then Ceph XATTRs are stored as
``inline xattr``, using the XATTRs provided by the file system. But if there is
a size limit (for example, ext4 imposes a limit of 4 KB total), then some Ceph
XATTRs will be stored in a key/value database when the limit is reached. More
precisely, this begins to occur when either the
``filestore_max_inline_xattr_size`` or ``filestore_max_inline_xattrs``
threshold is reached.
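For example, a hypothetical ``ceph.conf`` fragment that forces Ceph to spill
larger or more numerous XATTRs into the key/value database might look like this
(the values are illustrative only):

.. code-block:: ini

   [osd]
   # Store at most 2 KB of XATTR data per object inline in the file system.
   filestore_max_inline_xattr_size = 2048
   # Store at most 6 XATTRs per object inline in the file system.
   filestore_max_inline_xattrs = 6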
``filestore_max_inline_xattr_size``
:Description: The maximum size of an XATTR stored in the file system (i.e., XFS,
Btrfs, EXT4, etc.) per object. Should not be larger than the
file system can handle. Default value of 0 means to use the value
specific to the underlying file system.
:Description: Defines the maximum size per object of an XATTR that can be
stored in the file system (for example, XFS, Btrfs, ext4). The
specified size should not be larger than the file system can
handle. Using the default value of 0 instructs Filestore to use
the value specific to the file system.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``0``
@ -45,8 +56,9 @@ threshold is reached.
``filestore_max_inline_xattr_size_xfs``
:Description: The maximum size of an XATTR stored in the XFS file system.
Only used if ``filestore_max_inline_xattr_size`` == 0.
:Description: Defines the maximum size of an XATTR that can be stored in the
XFS file system. This setting is used only if
``filestore_max_inline_xattr_size`` == 0.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``65536``
@ -54,8 +66,9 @@ threshold is reached.
``filestore_max_inline_xattr_size_btrfs``
:Description: The maximum size of an XATTR stored in the Btrfs file system.
Only used if ``filestore_max_inline_xattr_size`` == 0.
:Description: Defines the maximum size of an XATTR that can be stored in the
Btrfs file system. This setting is used only if
``filestore_max_inline_xattr_size`` == 0.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``2048``
@ -63,8 +76,8 @@ threshold is reached.
``filestore_max_inline_xattr_size_other``
:Description: The maximum size of an XATTR stored in other file systems.
Only used if ``filestore_max_inline_xattr_size`` == 0.
:Description: Defines the maximum size of an XATTR that can be stored in other file systems.
This setting is used only if ``filestore_max_inline_xattr_size`` == 0.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``512``
@ -72,9 +85,8 @@ threshold is reached.
``filestore_max_inline_xattrs``
:Description: The maximum number of XATTRs stored in the file system per object.
Default value of 0 means to use the value specific to the
underlying file system.
:Description: Defines the maximum number of XATTRs per object that can be stored in the file system.
Using the default value of 0 instructs Filestore to use the value specific to the file system.
:Type: 32-bit Integer
:Required: No
:Default: ``0``
@ -82,8 +94,8 @@ threshold is reached.
``filestore_max_inline_xattrs_xfs``
:Description: The maximum number of XATTRs stored in the XFS file system per object.
Only used if ``filestore_max_inline_xattrs`` == 0.
:Description: Defines the maximum number of XATTRs per object that can be stored in the XFS file system.
This setting is used only if ``filestore_max_inline_xattrs`` == 0.
:Type: 32-bit Integer
:Required: No
:Default: ``10``
@ -91,8 +103,8 @@ threshold is reached.
``filestore_max_inline_xattrs_btrfs``
:Description: The maximum number of XATTRs stored in the Btrfs file system per object.
Only used if ``filestore_max_inline_xattrs`` == 0.
:Description: Defines the maximum number of XATTRs per object that can be stored in the Btrfs file system.
This setting is used only if ``filestore_max_inline_xattrs`` == 0.
:Type: 32-bit Integer
:Required: No
:Default: ``10``
@ -100,8 +112,8 @@ threshold is reached.
``filestore_max_inline_xattrs_other``
:Description: The maximum number of XATTRs stored in other file systems per object.
Only used if ``filestore_max_inline_xattrs`` == 0.
:Description: Defines the maximum number of XATTRs per object that can be stored in other file systems.
This setting is used only if ``filestore_max_inline_xattrs`` == 0.
:Type: 32-bit Integer
:Required: No
:Default: ``2``
@ -111,18 +123,19 @@ threshold is reached.
Synchronization Intervals
=========================
Filestore needs to periodically quiesce writes and synchronize the
file system, which creates a consistent commit point. It can then free journal
entries up to the commit point. Synchronizing more frequently tends to reduce
the time required to perform synchronization, and reduces the amount of data
that needs to remain in the journal. Less frequent synchronization allows the
backing file system to coalesce small writes and metadata updates more
optimally, potentially resulting in more efficient synchronization at the
expense of potentially increasing tail latency.
Filestore must periodically quiesce writes and synchronize the file system.
Each synchronization creates a consistent commit point. When the commit point
is created, Filestore is able to free all journal entries up to that point.
More-frequent synchronization tends to reduce both synchronization time and
the amount of data that needs to remain in the journal. Less-frequent
synchronization allows the backing file system to coalesce small writes and
metadata updates, potentially increasing synchronization
efficiency but also potentially increasing tail latency.
``filestore_max_sync_interval``
:Description: The maximum interval in seconds for synchronizing Filestore.
:Description: Defines the maximum interval (in seconds) for synchronizing Filestore.
:Type: Double
:Required: No
:Default: ``5``
@ -130,7 +143,7 @@ expense of potentially increasing tail latency.
``filestore_min_sync_interval``
:Description: The minimum interval in seconds for synchronizing Filestore.
:Description: Defines the minimum interval (in seconds) for synchronizing Filestore.
:Type: Double
:Required: No
:Default: ``.01``
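For example, to make Filestore commit more frequently than the default, both
intervals could be lowered in ``ceph.conf``; the values below are purely
illustrative, not recommendations:

.. code-block:: ini

   [osd]
   # Never wait longer than 2 seconds between syncs (default: 5).
   filestore_max_sync_interval = 2
   # Never sync more often than every 0.01 seconds (default: .01).
   filestore_min_sync_interval = 0.01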
@ -142,14 +155,14 @@ Flusher
=======
The Filestore flusher forces data from large writes to be written out using
``sync_file_range`` before the sync in order to (hopefully) reduce the cost of
the eventual sync. In practice, disabling 'filestore_flusher' seems to improve
performance in some cases.
``sync_file_range`` prior to the synchronization.
Ideally, this action reduces the cost of the eventual synchronization. In practice, however, disabling
``filestore_flusher`` seems in some cases to improve performance.
``filestore_flusher``
:Description: Enables the filestore flusher.
:Description: Enables the Filestore flusher.
:Type: Boolean
:Required: No
:Default: ``false``
@ -158,7 +171,7 @@ performance in some cases.
``filestore_flusher_max_fds``
:Description: Sets the maximum number of file descriptors for the flusher.
:Description: Defines the maximum number of file descriptors for the flusher.
:Type: Integer
:Required: No
:Default: ``512``
@ -176,7 +189,7 @@ performance in some cases.
``filestore_fsync_flushes_journal_data``
:Description: Flush journal data during file system synchronization.
:Description: Flushes journal data during file-system synchronization.
:Type: Boolean
:Required: No
:Default: ``false``
@ -187,11 +200,11 @@ performance in some cases.
Queue
=====
The following settings provide limits on the size of the Filestore queue.
The following settings define limits on the size of the Filestore queue:
``filestore_queue_max_ops``
:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations.
:Description: Defines the maximum number of in-progress operations that Filestore accepts before it blocks the queueing of any new operations.
:Type: Integer
:Required: No. Minimal impact on performance.
:Default: ``50``
@ -199,23 +212,20 @@ The following settings provide limits on the size of the Filestore queue.
``filestore_queue_max_bytes``
:Description: The maximum number of bytes for an operation.
:Description: Defines the maximum number of bytes permitted per operation.
:Type: Integer
:Required: No
:Default: ``100 << 20``
.. index:: filestore; timeouts
Timeouts
========
``filestore_op_threads``
:Description: The number of file system operation threads that execute in parallel.
:Description: Defines the number of file-system operation threads that execute in parallel.
:Type: Integer
:Required: No
:Default: ``2``
@ -223,7 +233,7 @@ Timeouts
``filestore_op_thread_timeout``
:Description: The timeout for a file system operation thread (in seconds).
:Description: Defines the timeout (in seconds) for a file-system operation thread.
:Type: Integer
:Required: No
:Default: ``60``
@ -231,7 +241,7 @@ Timeouts
``filestore_op_thread_suicide_timeout``
:Description: The timeout for a commit operation before cancelling the commit (in seconds).
:Description: Defines the timeout (in seconds) for a commit operation before the commit is cancelled.
:Type: Integer
:Required: No
:Default: ``180``
@ -245,17 +255,17 @@ B-Tree Filesystem
``filestore_btrfs_snap``
:Description: Enable snapshots for a ``btrfs`` filestore.
:Description: Enables snapshots for a ``btrfs`` Filestore.
:Type: Boolean
:Required: No. Only used for ``btrfs``.
:Required: No. Used only for ``btrfs``.
:Default: ``true``
``filestore_btrfs_clone_range``
:Description: Enable cloning ranges for a ``btrfs`` filestore.
:Description: Enables cloning ranges for a ``btrfs`` Filestore.
:Type: Boolean
:Required: No. Only used for ``btrfs``.
:Required: No. Used only for ``btrfs``.
:Default: ``true``
@ -267,7 +277,7 @@ Journal
``filestore_journal_parallel``
:Description: Enables parallel journaling, default for Btrfs.
:Description: Enables parallel journaling, default for ``btrfs``.
:Type: Boolean
:Required: No
:Default: ``false``
@ -275,7 +285,7 @@ Journal
``filestore_journal_writeahead``
:Description: Enables writeahead journaling, default for XFS.
:Description: Enables write-ahead journaling, default for XFS.
:Type: Boolean
:Required: No
:Default: ``false``
@ -283,7 +293,7 @@ Journal
``filestore_journal_trailing``
:Description: Deprecated, never use.
:Description: Deprecated. **Never use.**
:Type: Boolean
:Required: No
:Default: ``false``
@ -295,8 +305,8 @@ Misc
``filestore_merge_threshold``
:Description: Min number of files in a subdir before merging into parent
NOTE: A negative value means to disable subdir merging
:Description: Defines the minimum number of files permitted in a subdirectory before the subdirectory is merged into its parent directory.
NOTE: A negative value means that subdirectory merging is disabled.
:Type: Integer
:Required: No
:Default: ``-10``
@ -305,8 +315,8 @@ Misc
``filestore_split_multiple``
:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16``
is the maximum number of files in a subdirectory before
splitting into child directories.
is the maximum number of files permitted in a subdirectory
before the subdirectory is split into child directories.
:Type: Integer
:Required: No
@ -316,10 +326,10 @@ Misc
``filestore_split_rand_factor``
:Description: A random factor added to the split threshold to avoid
too many (expensive) Filestore splits occurring at once. See
``filestore_split_multiple`` for details.
This can only be changed offline for an existing OSD,
via the ``ceph-objectstore-tool apply-layout-settings`` command.
too many (expensive) Filestore splits occurring at the same time.
For details, see ``filestore_split_multiple``.
To change this setting for an existing OSD, it is necessary to take the OSD
offline before running the ``ceph-objectstore-tool apply-layout-settings`` command.
:Type: Unsigned 32-bit Integer
:Required: No
@ -328,7 +338,7 @@ Misc
``filestore_update_to``
:Description: Limits Filestore auto upgrade to specified version.
:Description: Limits automatic upgrades to a specified version of Filestore. Useful in cases in which you want to avoid upgrading to a specific version.
:Type: Integer
:Required: No
:Default: ``1000``
@ -336,7 +346,7 @@ Misc
``filestore_blackhole``
:Description: Drop any new transactions on the floor.
:Description: Drops any new transactions on the floor, similar to redirecting to NULL.
:Type: Boolean
:Required: No
:Default: ``false``
@ -344,7 +354,7 @@ Misc
``filestore_dump_file``
:Description: File onto which store transaction dumps.
:Description: Defines the file to which transaction dumps are written.
:Type: Boolean
:Required: No
:Default: ``false``
@ -352,7 +362,7 @@ Misc
``filestore_kill_at``
:Description: inject a failure at the n'th opportunity
:Description: Injects a failure at the *n*\th opportunity.
:Type: String
:Required: No
:Default: ``false``
@ -360,8 +370,7 @@ Misc
``filestore_fail_eio``
:Description: Fail/Crash on eio.
:Description: Fail/Crash on EIO.
:Type: Boolean
:Required: No
:Default: ``true``

View File

@ -21,6 +21,9 @@ the QoS related parameters:
* total capacity (IOPS) of each OSD (determined automatically -
See `OSD Capacity Determination (Automated)`_)
* the max sequential bandwidth capacity (MiB/s) of each OSD -
See *osd_mclock_max_sequential_bandwidth_[hdd|ssd]* option
* an mclock profile type to enable
Using the settings in the specified profile, an OSD determines and applies the
@ -39,15 +42,15 @@ Each service can be considered as a type of client from mclock's perspective.
Depending on the type of requests handled, mclock clients are classified into
the buckets shown in the table below:
+------------------------+----------------------------------------------------+
| Client Type | Request Types |
+========================+====================================================+
| Client | I/O requests issued by external clients of Ceph |
+------------------------+----------------------------------------------------+
| Background recovery | Internal recovery/backfill requests |
+------------------------+----------------------------------------------------+
| Background best-effort | Internal scrub, snap trim and PG deletion requests |
+------------------------+----------------------------------------------------+
+------------------------+--------------------------------------------------------------+
| Client Type | Request Types |
+========================+==============================================================+
| Client | I/O requests issued by external clients of Ceph |
+------------------------+--------------------------------------------------------------+
| Background recovery | Internal recovery requests |
+------------------------+--------------------------------------------------------------+
| Background best-effort | Internal backfill, scrub, snap trim and PG deletion requests |
+------------------------+--------------------------------------------------------------+
The mclock profiles allocate parameters like reservation, weight and limit
(see :ref:`dmclock-qos`) differently for each client type. The next sections
@ -85,32 +88,54 @@ Built-in Profiles
-----------------
Users can choose between the following built-in profile types:
.. note:: The values mentioned in the tables below represent the percentage
.. note:: The values mentioned in the tables below represent the proportion
of the total IOPS capacity of the OSD allocated for the service type.
By default, the *high_client_ops* profile is enabled to ensure that a larger
chunk of the bandwidth allocation goes to client ops. Background recovery ops
are given lower allocation (and therefore take a longer time to complete). But
there might be instances that necessitate giving higher allocations to either
client ops or recovery ops. In order to deal with such a situation, the
alternate built-in profiles may be enabled by following the steps mentioned
in next sections.
* balanced (default)
* high_client_ops
* high_recovery_ops
high_client_ops (*default*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
This profile optimizes client performance over background activities by
allocating more reservation and limit to client operations as compared to
background operations in the OSD. This profile is enabled by default. The table
shows the resource control parameters set by the profile:
balanced (*default*)
^^^^^^^^^^^^^^^^^^^^
The *balanced* profile is the default mClock profile. This profile allocates
equal reservation/priority to client operations and background recovery
operations. Background best-effort ops are given lower reservation and therefore
take a longer time to complete when there are competing operations. This profile
helps meet the normal/steady-state requirements of the cluster. This is the
case when the external client performance requirement is not critical and there are
other background operations that still need attention within the OSD.
But there might be instances that necessitate giving higher allocations to either
client ops or recovery ops. In order to deal with such a situation, the alternate
built-in profiles may be enabled by following the steps mentioned in the next sections.
+------------------------+-------------+--------+-------+
| Service Type | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client | 50% | 2 | MAX |
| client | 50% | 1 | MAX |
+------------------------+-------------+--------+-------+
| background recovery | 25% | 1 | 100% |
| background recovery | 50% | 1 | MAX |
+------------------------+-------------+--------+-------+
| background best-effort | 25% | 2 | MAX |
| background best-effort | MIN | 1 | 90% |
+------------------------+-------------+--------+-------+
high_client_ops
^^^^^^^^^^^^^^^
This profile optimizes client performance over background activities by
allocating more reservation and limit to client operations as compared to
background operations in the OSD. This profile, for example, may be enabled
to provide the needed performance for I/O intensive applications for a
sustained period of time at the cost of slower recoveries. The table shows
the resource control parameters set by the profile:
+------------------------+-------------+--------+-------+
| Service Type | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client | 60% | 2 | MAX |
+------------------------+-------------+--------+-------+
| background recovery | 40% | 1 | MAX |
+------------------------+-------------+--------+-------+
| background best-effort | MIN | 1 | 70% |
+------------------------+-------------+--------+-------+
high_recovery_ops
@ -124,34 +149,16 @@ parameters set by the profile:
+------------------------+-------------+--------+-------+
| Service Type | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client | 30% | 1 | 80% |
| client | 30% | 1 | MAX |
+------------------------+-------------+--------+-------+
| background recovery | 60% | 2 | 200% |
| background recovery | 70% | 2 | MAX |
+------------------------+-------------+--------+-------+
| background best-effort | 1 (MIN) | 2 | MAX |
+------------------------+-------------+--------+-------+
balanced
^^^^^^^^
This profile allocates equal reservation to client I/O operations and background
recovery operations. This means that equal I/O resources are allocated to both
external and background recovery operations. This profile, for example, may be
enabled by an administrator when external client performance requirement is not
critical and there are other background operations that still need attention
within the OSD.
+------------------------+-------------+--------+-------+
| Service Type | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client | 40% | 1 | 100% |
+------------------------+-------------+--------+-------+
| background recovery | 40% | 1 | 150% |
+------------------------+-------------+--------+-------+
| background best-effort | 20% | 2 | MAX |
| background best-effort | MIN | 1 | MAX |
+------------------------+-------------+--------+-------+
.. note:: Across the built-in profiles, internal background best-effort clients
of mclock include "scrub", "snap trim", and "pg deletion" operations.
of mclock include "backfill", "scrub", "snap trim", and "pg deletion"
operations.
Custom Profile
@ -170,6 +177,11 @@ in order to ensure mClock scheduler is able to provide predictable QoS.
mClock Config Options
---------------------
.. important:: These defaults cannot be changed using any of the config
subsystem commands like *config set* or via the *config daemon* or *config
tell* interfaces. Although the above command(s) report success, the mclock
QoS parameters are reverted to their respective built-in profile defaults.
When a built-in profile is enabled, the mClock scheduler calculates the low
level mclock parameters [*reservation*, *weight*, *limit*] based on the profile
enabled for each client type. The mclock parameters are calculated based on
@ -188,30 +200,35 @@ config parameters cannot be modified when using any of the built-in profiles:
Recovery/Backfill Options
-------------------------
The following recovery and backfill related Ceph options are set to new defaults
for mClock:
.. warning:: The recommendation is to not change these options as the built-in
profiles are optimized based on them. Changing these defaults can result in
unexpected performance outcomes.
The following recovery and backfill related Ceph options are overridden to
mClock defaults:
- :confval:`osd_max_backfills`
- :confval:`osd_recovery_max_active`
- :confval:`osd_recovery_max_active_hdd`
- :confval:`osd_recovery_max_active_ssd`
The following table shows the new mClock defaults. This is done to maximize the
impact of the built-in profile:
The following table shows the mClock defaults, which are the same as the
current defaults. This is done to maximize the performance of the foreground
(client) operations:
+----------------------------------------+------------------+----------------+
| Config Option | Original Default | mClock Default |
+========================================+==================+================+
| :confval:`osd_max_backfills` | 1 | 10 |
| :confval:`osd_max_backfills` | 1 | 1 |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active` | 0 | 0 |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_hdd` | 3 | 10 |
| :confval:`osd_recovery_max_active_hdd` | 3 | 3 |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_ssd` | 10 | 20 |
| :confval:`osd_recovery_max_active_ssd` | 10 | 10 |
+----------------------------------------+------------------+----------------+
The above mClock defaults, can be modified if necessary by enabling
If necessary, the above mClock defaults can be modified by enabling
:confval:`osd_mclock_override_recovery_settings` (default: false). The
steps for this are discussed in the
`Steps to Modify mClock Max Backfills/Recovery Limits`_ section.
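For example, the sequence of steps for raising the backfill limit on a single
OSD might look like the following sketch (``osd.0`` and the value ``3`` are
illustrative; see the section referenced above for the authoritative
procedure):

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_override_recovery_settings true
   ceph config set osd.0 osd_max_backfills 3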
@ -246,8 +263,8 @@ all its clients.
Steps to Enable mClock Profile
==============================
As already mentioned, the default mclock profile is set to *high_client_ops*.
The other values for the built-in profiles include *balanced* and
As already mentioned, the default mclock profile is set to *balanced*.
The other values for the built-in profiles include *high_client_ops* and
*high_recovery_ops*.
If there is a requirement to change the default profile, then the option
@ -297,15 +314,17 @@ command can be used:
After switching to the *custom* profile, the desired mClock configuration
option may be modified. For example, to change the client reservation IOPS
allocation for a specific OSD (say osd.0), the following command can be used:
ratio for a specific OSD (say osd.0) to 0.5 (or 50%), the following
command can be used:
.. prompt:: bash #
ceph config set osd.0 osd_mclock_scheduler_client_res 3000
ceph config set osd.0 osd_mclock_scheduler_client_res 0.5
.. important:: Care must be taken to change the reservations of other services like
recovery and background best effort accordingly to ensure that the sum of the
reservations do not exceed the maximum IOPS capacity of the OSD.
.. important:: Care must be taken to change the reservations of other services
like recovery and background best effort accordingly to ensure that the sum
of the reservations does not exceed the maximum proportion (1.0) of the IOPS
capacity of the OSD.
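For example, continuing the sketch above in which the client reservation was
set to 0.5, the reservations of the remaining services might be adjusted so
that the three reservations sum to no more than 1.0 (the values shown are
illustrative only):

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_scheduler_background_recovery_res 0.3
   ceph config set osd.0 osd_mclock_scheduler_background_best_effort_res 0.2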
.. tip:: The reservation and limit parameter allocations are per-shard based on
the type of backing device (HDD/SSD) under the OSD. See
@ -673,12 +692,8 @@ mClock Config Options
.. confval:: osd_mclock_profile
.. confval:: osd_mclock_max_capacity_iops_hdd
.. confval:: osd_mclock_max_capacity_iops_ssd
.. confval:: osd_mclock_cost_per_io_usec
.. confval:: osd_mclock_cost_per_io_usec_hdd
.. confval:: osd_mclock_cost_per_io_usec_ssd
.. confval:: osd_mclock_cost_per_byte_usec
.. confval:: osd_mclock_cost_per_byte_usec_hdd
.. confval:: osd_mclock_cost_per_byte_usec_ssd
.. confval:: osd_mclock_max_sequential_bandwidth_hdd
.. confval:: osd_mclock_max_sequential_bandwidth_ssd
.. confval:: osd_mclock_force_run_benchmark_on_init
.. confval:: osd_mclock_skip_benchmark
.. confval:: osd_mclock_override_recovery_settings

View File

@ -16,24 +16,27 @@ consistent, but you can add, remove or replace a monitor in a cluster. See
Background
==========
Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`, which means a
:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD
Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and
retrieving a current cluster map. Before Ceph Clients can read from or write to
Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor
first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph
Client can compute the location for any object. The ability to compute object
locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a
very important aspect of Ceph's high scalability and performance. See
`Scalability and High Availability`_ for additional details.
Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`.
The primary role of the Ceph Monitor is to maintain a master copy of the cluster
map. Ceph Monitors also provide authentication and logging services. Ceph
Monitors write all changes in the monitor services to a single Paxos instance,
and Paxos writes the changes to a key/value store for strong consistency. Ceph
Monitors can query the most recent version of the cluster map during sync
operations. Ceph Monitors leverage the key/value store's snapshots and iterators
(using leveldb) to perform store-wide synchronization.
The :term:`Cluster Map` makes it possible for :term:`Ceph client`\s to
determine the location of all Ceph Monitors, Ceph OSD Daemons, and Ceph
Metadata Servers. Clients do this by connecting to one Ceph Monitor and
retrieving a current cluster map. Ceph clients must connect to a Ceph Monitor
before they can read from or write to Ceph OSD Daemons or Ceph Metadata
Servers. A Ceph client that has a current copy of the cluster map and the CRUSH
algorithm can compute the location of any RADOS object within the cluster. This
makes it possible for Ceph clients to talk directly to Ceph OSD Daemons. Direct
communication between clients and Ceph OSD Daemons improves upon traditional
storage architectures that required clients to communicate with a central
component. See `Scalability and High Availability`_ for more on this subject.
The Ceph Monitor's primary function is to maintain a master copy of the cluster
map. Monitors also provide authentication and logging services. All changes in
the monitor services are written by the Ceph Monitor to a single Paxos
instance, and Paxos writes the changes to a key/value store. This provides
strong consistency. Ceph Monitors are able to query the most recent version of
the cluster map during sync operations, and they use the key/value store's
snapshots and iterators (using RocksDB) to perform store-wide synchronization.
.. ditaa::
/-------------\ /-------------\
@ -56,12 +59,6 @@ operations. Ceph Monitors leverage the key/value store's snapshots and iterators
| cCCC |*---------------------+
\-------------/
.. deprecated:: version 0.58
In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for
each service and store the map as a file.
.. index:: Ceph Monitor; cluster map
Cluster Maps
@ -541,6 +538,8 @@ Trimming requires that the placement groups are ``active+clean``.
.. index:: Ceph Monitor; clock
.. _mon-config-ref-clock:
Clock
-----

View File

@ -1,16 +1,22 @@
.. _mon-dns-lookup:
===============================
Looking up Monitors through DNS
===============================
Since version 11.0.0 RADOS supports looking up Monitors through DNS.
Since Ceph version 11.0.0 (Kraken), RADOS has supported looking up monitors
through DNS.
This way daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file.
The addition of the ability to look up monitors through DNS means that daemons
and clients do not require a *mon host* configuration directive in their
``ceph.conf`` configuration file.
Using DNS SRV TCP records clients are able to look up the monitors.
With a DNS update, clients and daemons can be made aware of changes
in the monitor topology. To be more precise and technical, clients look up the
monitors by using ``DNS SRV TCP`` records.
This allows for less configuration on clients and monitors. Using a DNS update clients and daemons can be made aware of changes in the monitor topology.
By default clients and daemons will look for the TCP service called *ceph-mon* which is configured by the *mon_dns_srv_name* configuration directive.
By default, clients and daemons look for the TCP service called *ceph-mon*,
which is configured by the *mon_dns_srv_name* configuration directive.
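As an illustration, the SRV records for a cluster with three monitors listening
on the v2 port (3300) might look like the following in a BIND-style zone file;
the domain and host names here are hypothetical::

   _ceph-mon._tcp.example.com. 60 IN SRV 10 20 3300 mon1.example.com.
   _ceph-mon._tcp.example.com. 60 IN SRV 10 20 3300 mon2.example.com.
   _ceph-mon._tcp.example.com. 60 IN SRV 10 20 3300 mon3.example.com.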
.. confval:: mon_dns_srv_name

View File

@ -91,9 +91,8 @@ Similarly, two options control whether IPv4 and IPv6 addresses are used:
to an IPv6 address
.. note:: The ability to bind to multiple ports has paved the way for
dual-stack IPv4 and IPv6 support. That said, dual-stack support is
not yet tested as of Nautilus v14.2.0 and likely needs some
additional code changes to work correctly.
dual-stack IPv4 and IPv6 support. That said, dual-stack operation is
not yet supported as of Quincy v17.2.0.
Connection modes
----------------

View File

@ -140,6 +140,8 @@ See `Pool & PG Config Reference`_ for details.
.. index:: OSD; scrubbing
.. _rados_config_scrubbing:
Scrubbing
=========

View File

@ -1,3 +1,5 @@
.. _rados_config_pool_pg_crush_ref:
======================================
Pool, PG and CRUSH Config Reference
======================================

View File

@ -25,6 +25,7 @@ There are several Ceph daemons in a storage cluster:
additional monitoring and providing interfaces to external
monitoring and management systems.
.. _rados_config_storage_devices_osd_backends:
OSD Back Ends
=============

View File

@ -4,74 +4,70 @@
Adding/Removing Monitors
==========================
When you have a cluster up and running, you may add or remove monitors
from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_
or `Monitor Bootstrap`_.
It is possible to add monitors to a running cluster as long as redundancy is
maintained. To bootstrap a monitor, see `Manual Deployment`_ or `Monitor
Bootstrap`_.
.. _adding-monitors:
Adding Monitors
===============
Ceph monitors are lightweight processes that are the single source of truth
for the cluster map. You can run a cluster with 1 monitor but we recommend at least 3
for a production cluster. Ceph monitors use a variation of the
`Paxos`_ algorithm to establish consensus about maps and other critical
information across the cluster. Due to the nature of Paxos, Ceph requires
a majority of monitors to be active to establish a quorum (thus establishing
consensus).
Ceph monitors serve as the single source of truth for the cluster map. It is
possible to run a cluster with only one monitor, but for a production cluster
it is recommended to have at least three monitors provisioned and in quorum.
Ceph monitors use a variation of the `Paxos`_ algorithm to maintain consensus
about maps and about other critical information across the cluster. Due to the
nature of Paxos, Ceph is able to maintain quorum (and thus establish
consensus) only if a majority of the monitors are ``active``.
It is advisable to run an odd number of monitors. An
odd number of monitors is more resilient than an
even number. For instance, with a two monitor deployment, no
failures can be tolerated and still maintain a quorum; with three monitors,
one failure can be tolerated; in a four monitor deployment, one failure can
be tolerated; with five monitors, two failures can be tolerated. This avoids
the dreaded *split brain* phenomenon, and is why an odd number is best.
In short, Ceph needs a majority of
monitors to be active (and able to communicate with each other), but that
majority can be achieved using a single monitor, or 2 out of 2 monitors,
2 out of 3, 3 out of 4, etc.
It is best to run an odd number of monitors. This is because a cluster that is
running an odd number of monitors is more resilient than a cluster running an
even number. For example, in a two-monitor deployment, no failures can be
tolerated if quorum is to be maintained; in a three-monitor deployment, one
failure can be tolerated; in a four-monitor deployment, one failure can be
tolerated; and in a five-monitor deployment, two failures can be tolerated. In
general, a cluster running an odd number of monitors is best because it avoids
what is called the *split brain* phenomenon. In short, Ceph is able to operate
only if a majority of monitors are ``active`` and able to communicate with each
other (for example, there must be a single monitor, two out of two monitors,
two out of three monitors, three out of five monitors, or the like).
For small or non-critical deployments of multi-node Ceph clusters, it is
advisable to deploy three monitors, and to increase the number of monitors
to five for larger clusters or to survive a double failure. There is rarely
justification for seven or more.
recommended to deploy three monitors. For larger clusters or for clusters that
are intended to survive a double failure, it is recommended to deploy five
monitors. Only in rare circumstances is there any justification for deploying
seven or more monitors.
Since monitors are lightweight, it is possible to run them on the same
host as OSDs; however, we recommend running them on separate hosts,
because `fsync` issues with the kernel may impair performance.
Dedicated monitor nodes also minimize disruption since monitor and OSD
daemons are not inactive at the same time when a node crashes or is
taken down for maintenance.
Dedicated
monitor nodes also make for cleaner maintenance by avoiding both OSDs and
a mon going down if a node is rebooted, taken down, or crashes.
It is possible to run a monitor on the same host that is running an OSD.
However, this approach has disadvantages: for example, `fsync` issues with the
kernel might weaken performance, and monitor and OSD daemons might be inactive
at the same time and cause disruption if the node crashes, is rebooted, or is
taken down for maintenance. Because of these risks, it is instead
recommended to run monitors and managers on dedicated hosts.
.. note:: A *majority* of monitors in your cluster must be able to
reach each other in order to establish a quorum.
reach each other in order for quorum to be established.
Deploy your Hardware
--------------------
Deploying your Hardware
-----------------------
If you are adding a new host when adding a new monitor, see `Hardware
Recommendations`_ for details on minimum recommendations for monitor hardware.
To add a monitor host to your cluster, first make sure you have an up-to-date
version of Linux installed (typically Ubuntu 16.04 or RHEL 7).
Some operators choose to add a new monitor host at the same time that they add
a new monitor. For details on the minimum recommendations for monitor hardware,
see `Hardware Recommendations`_. Before adding a monitor host to the cluster,
make sure that there is an up-to-date version of Linux installed.
Add your monitor host to a rack in your cluster, connect it to the network
and ensure that it has network connectivity.
Add the newly installed monitor host to a rack in your cluster, connect the
host to the network, and make sure that the host has network connectivity.
.. _Hardware Recommendations: ../../../start/hardware-recommendations
Install the Required Software
-----------------------------
Installing the Required Software
--------------------------------
For manually deployed clusters, you must install Ceph packages
manually. See `Installing Packages`_ for details.
You should configure SSH to a user with password-less authentication
and root permissions.
In manually deployed clusters, it is necessary to install Ceph packages
manually. For details, see `Installing Packages`_. Configure SSH so that it can
be used by a user that has passwordless authentication and root permissions.
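A minimal sketch of such an SSH setup follows (the user ``cephuser`` and the
host ``new-mon-host`` are hypothetical placeholders; adapt them to your
environment):
.. prompt:: bash $
ssh-keygen -t ed25519                # generate a key pair if one does not already exist
ssh-copy-id cephuser@new-mon-host    # install the public key on the new host
ssh cephuser@new-mon-host sudo true  # confirm passwordless SSH and sudo access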
.. _Installing Packages: ../../../install/install-storage-cluster
@ -81,67 +77,65 @@ and root permissions.
Adding a Monitor (Manual)
-------------------------
This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map
and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If
this results in only two monitor daemons, you may add more monitors by
repeating this procedure until you have a sufficient number of ``ceph-mon``
daemons to achieve a quorum.
The procedure in this section creates a ``ceph-mon`` data directory, retrieves
both the monitor map and the monitor keyring, and adds a ``ceph-mon`` daemon to
the cluster. The procedure might result in a Ceph cluster that contains only
two monitor daemons. To add more monitors until there are enough ``ceph-mon``
daemons to establish quorum, repeat the procedure.
At this point you should define your monitor's id. Traditionally, monitors
have been named with single letters (``a``, ``b``, ``c``, ...), but you are
free to define the id as you see fit. For the purpose of this document,
please take into account that ``{mon-id}`` should be the id you chose,
without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a``
on ``mon.a``).
This is a good point at which to define the new monitor's ``id``. Monitors have
often been named with single letters (``a``, ``b``, ``c``, etc.), but you are
free to define the ``id`` however you see fit. In this document, ``{mon-id}``
refers to the ``id`` exclusive of the ``mon.`` prefix: for example, if
``mon.a`` has been chosen as the ``id`` of a monitor, then ``{mon-id}`` is
``a``.
#. Create the default directory on the machine that will host your
new monitor:
#. Create a data directory on the machine that will host the new monitor:
.. prompt:: bash $
ssh {new-mon-host}
sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}
ssh {new-mon-host}
sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}
#. Create a temporary directory ``{tmp}`` to keep the files needed during
this process. This directory should be different from the monitor's default
directory created in the previous step, and can be removed after all the
steps are executed:
#. Create a temporary directory ``{tmp}`` that will contain the files needed
during this procedure. This directory should be different from the data
directory created in the previous step. Because this is a temporary
directory, it can be removed after the procedure is complete:
.. prompt:: bash $
mkdir {tmp}
mkdir {tmp}
#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to
the retrieved keyring, and ``{key-filename}`` is the name of the file
containing the retrieved monitor key:
#. Retrieve the keyring for your monitors (``{tmp}`` is the path to the
retrieved keyring and ``{key-filename}`` is the name of the file that
contains the retrieved monitor key):
.. prompt:: bash $
ceph auth get mon. -o {tmp}/{key-filename}
#. Retrieve the monitor map, where ``{tmp}`` is the path to
the retrieved monitor map, and ``{map-filename}`` is the name of the file
containing the retrieved monitor map:
#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map
and ``{map-filename}`` is the name of the file that contains the retrieved
monitor map):
.. prompt:: bash $
ceph mon getmap -o {tmp}/{map-filename}
#. Prepare the monitor's data directory created in the first step. You must
specify the path to the monitor map so that you can retrieve the
information about a quorum of monitors and their ``fsid``. You must also
specify a path to the monitor keyring:
#. Prepare the monitor's data directory, which was created in the first step.
The following command must specify the path to the monitor map (so that
information about a quorum of monitors and their ``fsid``\s can be
retrieved) and specify the path to the monitor keyring:
.. prompt:: bash $
sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}
#. Start the new monitor and it will automatically join the cluster.
The daemon needs to know which address to bind to, via either the
``--public-addr {ip}`` or ``--public-network {network}`` argument.
#. Start the new monitor. It will automatically join the cluster. To provide
information to the daemon about which address to bind to, use either the
``--public-addr {ip}`` option or the ``--public-network {network}`` option.
For example:
.. prompt:: bash $
ceph-mon -i {mon-id} --public-addr {ip:port}
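As a concrete illustration (the monitor id ``d`` and the address below are
hypothetical):
.. prompt:: bash $
ceph-mon -i d --public-addr 192.168.0.10:6789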
@ -151,44 +145,47 @@ on ``mon.a``).
Removing Monitors
=================
When you remove monitors from a cluster, consider that Ceph monitors use
Paxos to establish consensus about the master cluster map. You must have
a sufficient number of monitors to establish a quorum for consensus about
the cluster map.
When monitors are removed from a cluster, it is important to remember
that Ceph monitors use Paxos to maintain consensus about the cluster
map. Such consensus is possible only if the number of monitors is sufficient
to establish quorum.
.. _Removing a Monitor (Manual):
Removing a Monitor (Manual)
---------------------------
This procedure removes a ``ceph-mon`` daemon from your cluster. If this
procedure results in only two monitor daemons, you may add or remove another
monitor until you have a number of ``ceph-mon`` daemons that can achieve a
quorum.
The procedure in this section removes a ``ceph-mon`` daemon from the cluster.
The procedure might result in a Ceph cluster that contains a number of monitors
insufficient to maintain quorum, so plan carefully. When replacing an old
monitor with a new monitor, add the new monitor first, wait for quorum to be
established, and then remove the old monitor. This ensures that quorum is not
lost.
#. Stop the monitor:
.. prompt:: bash $
service ceph -a stop mon.{mon-id}
#. Remove the monitor from the cluster:
.. prompt:: bash $
ceph mon remove {mon-id}
#. Remove the monitor entry from ``ceph.conf``.
#. Remove the monitor entry from the ``ceph.conf`` file.
.. _rados-mon-remove-from-unhealthy:
Removing Monitors from an Unhealthy Cluster
-------------------------------------------
This procedure removes a ``ceph-mon`` daemon from an unhealthy
cluster, for example a cluster where the monitors cannot form a
quorum.
The procedure in this section removes a ``ceph-mon`` daemon from an unhealthy
cluster (for example, a cluster whose monitors are unable to form a quorum).
#. Stop all ``ceph-mon`` daemons on all monitor hosts:
@ -197,63 +194,68 @@ quorum.
ssh {mon-host}
systemctl stop ceph-mon.target
Repeat for all monitor hosts.
Repeat this step on every monitor host.
#. Identify a surviving monitor and log in to that host:
#. Identify a surviving monitor and log in to the monitor's host:
.. prompt:: bash $
ssh {mon-host}
#. Extract a copy of the monmap file:
#. Extract a copy of the ``monmap`` file by running a command of the following
form:
.. prompt:: bash $
ceph-mon -i {mon-id} --extract-monmap {map-path}
In most cases, this command will be:
Here is a more concrete example. In this example, ``hostname`` is the
``{mon-id}`` and ``/tmp/monmap`` is the ``{map-path}``:
.. prompt:: bash $
ceph-mon -i `hostname` --extract-monmap /tmp/monmap
#. Remove the non-surviving or problematic monitors. For example, if
you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where
only ``mon.a`` will survive, follow the example below:
#. Remove the non-surviving or otherwise problematic monitors:
.. prompt:: bash $
monmaptool {map-path} --rm {mon-id}
For example,
For example, suppose that there are three monitors |---| ``mon.a``, ``mon.b``,
and ``mon.c`` |---| and that only ``mon.a`` will survive:
.. prompt:: bash $
monmaptool /tmp/monmap --rm b
monmaptool /tmp/monmap --rm c
#. Inject the surviving map with the removed monitors into the
surviving monitor(s). For example, to inject a map into monitor
``mon.a``, follow the example below:
#. Inject the surviving map (that is, the map from which the non-surviving
monitors have been removed) into the surviving monitor(s):
.. prompt:: bash $
ceph-mon -i {mon-id} --inject-monmap {map-path}
For example:
Continuing with the above example, inject a map into monitor ``mon.a`` by
running the following command:
.. prompt:: bash $
ceph-mon -i a --inject-monmap /tmp/monmap
#. Start only the surviving monitors.
#. Verify the monitors form a quorum (``ceph -s``).
#. Verify that the monitors form a quorum by running the command ``ceph -s``.
#. You may wish to archive the removed monitors' data directory in
``/var/lib/ceph/mon`` in a safe location, or delete it if you are
confident the remaining monitors are healthy and are sufficiently
redundant.
#. The data directory of the removed monitors is in ``/var/lib/ceph/mon``:
either archive this data directory in a safe location or delete this data
directory. However, do not delete it unless you are confident that the
remaining monitors are healthy and sufficiently redundant. Make sure that
there is enough room for the live DB to expand and compact, and make sure
that there is also room for an archived copy of the DB. The archived copy
can be compressed.
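For example, a minimal archiving sketch (the monitor id ``b`` and the
destination path below are hypothetical):
.. prompt:: bash $
sudo tar czf /root/mon-b-store.tar.gz -C /var/lib/ceph/mon ceph-b
sudo rm -rf /var/lib/ceph/mon/ceph-b    # only after the archive and cluster health have been verified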
.. _Changing a Monitor's IP address:
@ -262,185 +264,195 @@ Changing a Monitor's IP Address
.. important:: Existing monitors are not supposed to change their IP addresses.
Monitors are critical components of a Ceph cluster, and they need to maintain a
quorum for the whole system to work properly. To establish a quorum, the
monitors need to discover each other. Ceph has strict requirements for
discovering monitors.
Monitors are critical components of a Ceph cluster. The entire system can work
properly only if the monitors maintain quorum, and quorum can be established
only if the monitors have discovered each other by means of their IP addresses.
Ceph has strict requirements on the discovery of monitors.
Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors.
However, monitors discover each other using the monitor map, not ``ceph.conf``.
For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you
need to obtain the current monmap for the cluster when creating a new monitor,
as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The
following sections explain the consistency requirements for Ceph monitors, and a
few safe ways to change a monitor's IP address.
Although the ``ceph.conf`` file is used by Ceph clients and other Ceph daemons
to discover monitors, the monitor map is used by monitors to discover each
other. This is why it is necessary to obtain the current ``monmap`` at the time
a new monitor is created: as can be seen above in `Adding a Monitor (Manual)`_,
the ``monmap`` is one of the arguments required by the ``ceph-mon -i {mon-id}
--mkfs`` command. The following sections explain the consistency requirements
for Ceph monitors, and also explain a number of safe ways to change a monitor's
IP address.
Consistency Requirements
------------------------
A monitor always refers to the local copy of the monmap when discovering other
monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids
errors that could break the cluster (e.g., typos in ``ceph.conf`` when
specifying a monitor address or port). Since monitors use monmaps for discovery
and they share monmaps with clients and other Ceph daemons, the monmap provides
monitors with a strict guarantee that their consensus is valid.
When a monitor discovers other monitors in the cluster, it always refers to the
local copy of the monitor map. Using the monitor map instead of using the
``ceph.conf`` file avoids errors that could break the cluster (for example,
typos or other slight errors in ``ceph.conf`` when a monitor address or port is
specified). Because monitors use monitor maps for discovery and because they
share monitor maps with Ceph clients and other Ceph daemons, the monitor map
provides monitors with a strict guarantee that their consensus is valid.
Strict consistency also applies to updates to the monmap. As with any other
updates on the monitor, changes to the monmap always run through a distributed
consensus algorithm called `Paxos`_. The monitors must agree on each update to
the monmap, such as adding or removing a monitor, to ensure that each monitor in
the quorum has the same version of the monmap. Updates to the monmap are
the monmap, such as adding or removing a monitor, to ensure that each monitor
in the quorum has the same version of the monmap. Updates to the monmap are
incremental so that monitors have the latest agreed upon version, and a set of
previous versions, allowing a monitor that has an older version of the monmap to
catch up with the current state of the cluster.
previous versions, allowing a monitor that has an older version of the monmap
to catch up with the current state of the cluster.
If monitors discovered each other through the Ceph configuration file instead of
through the monmap, it would introduce additional risks because the Ceph
configuration files are not updated and distributed automatically. Monitors
might inadvertently use an older ``ceph.conf`` file, fail to recognize a
monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able
to determine the current state of the system accurately. Consequently, making
changes to an existing monitor's IP address must be done with great care.
There are additional advantages to using the monitor map rather than
``ceph.conf`` when monitors discover each other. Because ``ceph.conf`` is not
automatically updated and distributed, its use would bring certain risks:
monitors might use an outdated ``ceph.conf`` file, might fail to recognize a
specific monitor, might fall out of quorum, and might develop a situation in
which `Paxos`_ is unable to accurately ascertain the current state of the
system. Because of these risks, any changes to an existing monitor's IP address
must be made with great care.
.. _operations_add_or_rm_mons_changing_mon_ip:
Changing a Monitor's IP address (The Right Way)
-----------------------------------------------
Changing a Monitor's IP address (Preferred Method)
--------------------------------------------------
Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to
ensure that other monitors in the cluster will receive the update. To change a
monitor's IP address, you must add a new monitor with the IP address you want
to use (as described in `Adding a Monitor (Manual)`_), ensure that the new
monitor successfully joins the quorum; then, remove the monitor that uses the
old IP address. Then, update the ``ceph.conf`` file to ensure that clients and
other daemons know the IP address of the new monitor.
If a monitor's IP address is changed only in the ``ceph.conf`` file, there is
no guarantee that the other monitors in the cluster will receive the update.
For this reason, the preferred method to change a monitor's IP address is as
follows: add a new monitor with the desired IP address (as described in `Adding
a Monitor (Manual)`_), make sure that the new monitor successfully joins the
quorum, remove the monitor that is using the old IP address, and update the
``ceph.conf`` file to ensure that clients and other daemons are made aware of
the new monitor's IP address.
For example, lets assume there are three monitors in place, such as ::
For example, suppose that there are three monitors in place::
[mon.a]
host = host01
addr = 10.0.0.1:6789
[mon.b]
host = host02
addr = 10.0.0.2:6789
[mon.c]
host = host03
addr = 10.0.0.3:6789
[mon.a]
host = host01
addr = 10.0.0.1:6789
[mon.b]
host = host02
addr = 10.0.0.2:6789
[mon.c]
host = host03
addr = 10.0.0.3:6789
To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the
steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure
that ``mon.d`` is running before removing ``mon.c``, or it will break the
quorum. Remove ``mon.c`` as described on `Removing a Monitor (Manual)`_. Moving
all three monitors would thus require repeating this process as many times as
needed.
To change ``mon.c`` so that its name is ``host04`` and its IP address is
``10.0.0.4``: (1) follow the steps in `Adding a Monitor (Manual)`_ to add a new
monitor ``mon.d``, (2) make sure that ``mon.d`` is running before removing
``mon.c`` or else quorum will be broken, and (3) follow the steps in `Removing
a Monitor (Manual)`_ to remove ``mon.c``. To move all three monitors to new IP
addresses, repeat this process.
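A minimal sketch of the verification and removal steps (continuing the
``mon.d``/``mon.c`` example above; the steps for adding ``mon.d`` are the ones
described in `Adding a Monitor (Manual)`_):
.. prompt:: bash $
ceph quorum_status --format json-pretty   # confirm that mon.d has joined the quorum
ceph mon remove c                         # only then remove the monitor at the old address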
Changing a Monitor's IP address (Advanced Method)
-------------------------------------------------
Changing a Monitor's IP address (The Messy Way)
-----------------------------------------------
There are cases in which the method outlined in :ref:`Changing a Monitor's IP
address (Preferred Method) <operations_add_or_rm_mons_changing_mon_ip>` cannot
be used. For example, it might be necessary to move the cluster's monitors to a
different network, to a different part of the datacenter, or to a different
datacenter altogether. It is still possible to change the monitors' IP
addresses, but a different method must be used.
There may come a time when the monitors must be moved to a different network, a
different part of the datacenter or a different datacenter altogether. While it
is possible to do it, the process becomes a bit more hazardous.
For such cases, a new monitor map with updated IP addresses for every monitor
in the cluster must be generated and injected on each monitor. Although this
method is not particularly easy, such a major migration is unlikely to be a
routine task. As stated at the beginning of this section, existing monitors are
not supposed to change their IP addresses.
In such a case, the solution is to generate a new monmap with updated IP
addresses for all the monitors in the cluster, and inject the new map on each
individual monitor. This is not the most user-friendly approach, but we do not
expect this to be something that needs to be done every other week. As it is
clearly stated on the top of this section, monitors are not supposed to change
IP addresses.
Continue with the monitor configuration in the example from :ref:`Changing a
Monitor's IP address (Preferred Method)
<operations_add_or_rm_mons_changing_mon_ip>`. Suppose that all of the monitors
are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that
these networks are unable to communicate. Carry out the following procedure:
Using the previous monitor configuration as an example, assume you want to move
all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these
networks are unable to communicate. Use the following procedure:
#. Retrieve the monitor map, where ``{tmp}`` is the path to
the retrieved monitor map, and ``{filename}`` is the name of the file
containing the retrieved monitor map:
#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor
map, and ``{filename}`` is the name of the file that contains the retrieved
monitor map):
.. prompt:: bash $
ceph mon getmap -o {tmp}/{filename}
#. The following example demonstrates the contents of the monmap:
#. Check the contents of the monitor map:
.. prompt:: bash $
monmaptool --print {tmp}/{filename}
::
::
monmaptool: monmap file {tmp}/{filename}
epoch 1
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
last_changed 2012-12-17 02:46:41.591248
created 2012-12-17 02:46:41.591248
0: 10.0.0.1:6789/0 mon.a
1: 10.0.0.2:6789/0 mon.b
2: 10.0.0.3:6789/0 mon.c
monmaptool: monmap file {tmp}/{filename}
epoch 1
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
last_changed 2012-12-17 02:46:41.591248
created 2012-12-17 02:46:41.591248
0: 10.0.0.1:6789/0 mon.a
1: 10.0.0.2:6789/0 mon.b
2: 10.0.0.3:6789/0 mon.c
#. Remove the existing monitors:
#. Remove the existing monitors from the monitor map:
.. prompt:: bash $
monmaptool --rm a --rm b --rm c {tmp}/{filename}
::
monmaptool: monmap file {tmp}/{filename}
monmaptool: removing a
monmaptool: removing b
monmaptool: removing c
monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
monmaptool: monmap file {tmp}/{filename}
monmaptool: removing a
monmaptool: removing b
monmaptool: removing c
monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
#. Add the new monitor locations:
#. Add the new monitor locations to the monitor map:
.. prompt:: bash $
monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename}
::
monmaptool: monmap file {tmp}/{filename}
monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors)
#. Check new contents:
#. Check the new contents of the monitor map:
.. prompt:: bash $
monmaptool --print {tmp}/{filename}
::
monmaptool: monmap file {tmp}/{filename}
epoch 1
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
last_changed 2012-12-17 02:46:41.591248
created 2012-12-17 02:46:41.591248
0: 10.1.0.1:6789/0 mon.a
1: 10.1.0.2:6789/0 mon.b
2: 10.1.0.3:6789/0 mon.c
monmaptool: monmap file {tmp}/{filename}
epoch 1
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
last_changed 2012-12-17 02:46:41.591248
created 2012-12-17 02:46:41.591248
0: 10.1.0.1:6789/0 mon.a
1: 10.1.0.2:6789/0 mon.b
2: 10.1.0.3:6789/0 mon.c
At this point, we assume the monitors (and stores) are installed at the new
location. The next step is to propagate the modified monmap to the new
monitors, and inject the modified monmap into each new monitor.
At this point, we assume that the monitors (and stores) have been installed at
the new location. Next, propagate the modified monitor map to the new monitors,
and inject the modified monitor map into each new monitor.
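One way to propagate the modified map is to copy it to each new monitor host
before injecting it, for example (the host names below are hypothetical):
.. prompt:: bash $
scp {tmp}/{filename} new-mon-host-01:/tmp/monmap
scp {tmp}/{filename} new-mon-host-02:/tmp/monmap
scp {tmp}/{filename} new-mon-host-03:/tmp/monmap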
#. First, make sure to stop all your monitors. Injection must be done while
the daemon is not running.
#. Make sure all of your monitors have been stopped. Never inject into a
monitor while the monitor daemon is running.
#. Inject the monmap:
#. Inject the monitor map:
.. prompt:: bash $
ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename}
#. Restart the monitors.
#. Restart all of the monitors.
Migration to the new location is now complete. The monitors should operate
successfully.
After this step, migration to the new location is complete and
the monitors should operate successfully.
.. _Manual Deployment: ../../../install/manual-deployment
.. _Monitor Bootstrap: ../../../dev/mon-bootstrap
.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
.. |---| unicode:: U+2014 .. EM DASH
:trim:

View File

@ -2,49 +2,51 @@
Adding/Removing OSDs
======================
When you have a cluster up and running, you may add OSDs or remove OSDs
from the cluster at runtime.
When a cluster is up and running, it is possible to add or remove OSDs.
Adding OSDs
===========
When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an
OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a
host machine. If your host has multiple storage drives, you may map one
``ceph-osd`` daemon for each drive.
OSDs can be added to a cluster in order to expand the cluster's capacity and
resilience. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on one
storage drive within a host machine. But if your host machine has multiple
storage drives, you may map one ``ceph-osd`` daemon for each drive on the
machine.
Generally, it's a good idea to check the capacity of your cluster to see if you
are reaching the upper end of its capacity. As your cluster reaches its ``near
full`` ratio, you should add one or more OSDs to expand your cluster's capacity.
It's a good idea to check the capacity of your cluster so that you know when it
approaches its capacity limits. If your cluster has reached its ``near full``
ratio, then you should add OSDs to expand your cluster's capacity.
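For example, cluster-wide and per-OSD utilization can be checked with commands
such as the following:
.. prompt:: bash $
ceph df      # cluster-wide and per-pool utilization
ceph osd df  # per-OSD utilization, weight, and variance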
.. warning:: Do not let your cluster reach its ``full ratio`` before
adding an OSD. OSD failures that occur after the cluster reaches
its ``near full`` ratio may cause the cluster to exceed its
``full ratio``.
.. warning:: Do not add an OSD after your cluster has reached its ``full
ratio``. OSD failures that occur after the cluster reaches its ``near full
ratio`` might cause the cluster to exceed its ``full ratio``.
Deploy your Hardware
--------------------
If you are adding a new host when adding a new OSD, see `Hardware
Deploying your Hardware
-----------------------
If you are also adding a new host when adding a new OSD, see `Hardware
Recommendations`_ for details on minimum recommendations for OSD hardware. To
add an OSD host to your cluster, first make sure you have an up-to-date version
of Linux installed, and you have made some initial preparations for your
storage drives. See `Filesystem Recommendations`_ for details.
add an OSD host to your cluster, begin by making sure that an appropriate
version of Linux has been installed on the host machine and that all initial
preparations for your storage drives have been carried out. For details, see
`Filesystem Recommendations`_.
Next, add your OSD host to a rack in your cluster, connect the host to the
network, and ensure that the host has network connectivity. For details, see
`Network Configuration Reference`_.
Add your OSD host to a rack in your cluster, connect it to the network
and ensure that it has network connectivity. See the `Network Configuration
Reference`_ for details.
.. _Hardware Recommendations: ../../../start/hardware-recommendations
.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations
.. _Network Configuration Reference: ../../configuration/network-config-ref
Install the Required Software
-----------------------------
Installing the Required Software
--------------------------------
For manually deployed clusters, you must install Ceph packages
manually. See `Installing Ceph (Manual)`_ for details.
You should configure SSH to a user with password-less authentication
If your cluster has been manually deployed, you will need to install Ceph
software packages manually. For details, see `Installing Ceph (Manual)`_.
Configure SSH for the appropriate user to have both passwordless authentication
and root permissions.
.. _Installing Ceph (Manual): ../../../install
@ -53,48 +55,56 @@ and root permissions.
Adding an OSD (Manual)
----------------------
This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive,
and configures the cluster to distribute data to the OSD. If your host has
multiple drives, you may add an OSD for each drive by repeating this procedure.
The following procedure sets up a ``ceph-osd`` daemon, configures this OSD to
use one drive, and configures the cluster to distribute data to the OSD. If
your host machine has multiple drives, you may add an OSD for each drive on the
host by repeating this procedure.
To add an OSD, create a data directory for it, mount a drive to that directory,
add the OSD to the cluster, and then add it to the CRUSH map.
As the following procedure will demonstrate, adding an OSD involves creating a
metadata directory for it, configuring a data storage drive, adding the OSD to
the cluster, and then adding it to the CRUSH map.
When you add the OSD to the CRUSH map, consider the weight you give to the new
OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger
hard drives than older hosts in the cluster (i.e., they may have greater
weight).
When you add the OSD to the CRUSH map, you will need to consider the weight you
assign to the new OSD. Since storage drive capacities increase over time, newer
OSD hosts are likely to have larger hard drives than the older hosts in the
cluster have and therefore might have greater weight as well.
.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives
of dissimilar size, you can adjust their weights. However, for best
performance, consider a CRUSH hierarchy with drives of the same type/size.
.. tip:: Ceph works best with uniform hardware across pools. It is possible to
add drives of dissimilar size and then adjust their weights accordingly.
However, for best performance, consider a CRUSH hierarchy that has drives of
the same type and size. It is better to add larger drives uniformly to
existing hosts. This can be done incrementally, replacing smaller drives
each time the new drives are added.
#. Create the OSD. If no UUID is given, it will be set automatically when the
OSD starts up. The following command will output the OSD number, which you
will need for subsequent steps:
#. Create the new OSD by running a command of the following form. If you opt
not to specify a UUID in this command, the UUID will be set automatically
when the OSD starts up. The OSD number, which is needed for subsequent
steps, is found in the command's output:
.. prompt:: bash $
ceph osd create [{uuid} [{id}]]
If the optional parameter {id} is given it will be used as the OSD id.
Note, in this case the command may fail if the number is already in use.
If the optional parameter {id} is specified it will be used as the OSD ID.
However, if the ID number is already in use, the command will fail.
.. warning:: In general, explicitly specifying {id} is not recommended.
IDs are allocated as an array, and skipping entries consumes some extra
memory. This can become significant if there are large gaps and/or
clusters are large. If {id} is not specified, the smallest available is
used.
.. warning:: Explicitly specifying the ``{id}`` parameter is not
recommended. IDs are allocated as an array, and any skipping of entries
consumes extra memory. This memory consumption can become significant if
there are large gaps or if clusters are large. By leaving the ``{id}``
parameter unspecified, we ensure that Ceph uses the smallest ID number
available and that these problems are avoided.
#. Create the default directory on your new OSD:
#. Create the default directory for your new OSD by running commands of the
following form:
.. prompt:: bash $
ssh {new-osd-host}
sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
#. If the OSD is for a drive other than the OS drive, prepare it
for use with Ceph, and mount it to the directory you just created:
#. If the OSD will be created on a drive other than the OS drive, prepare it
for use with Ceph. Run commands of the following form:
.. prompt:: bash $
@ -102,41 +112,49 @@ weight).
sudo mkfs -t {fstype} /dev/{drive}
sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number}
#. Initialize the OSD data directory:
#. Initialize the OSD data directory by running commands of the following form:
.. prompt:: bash $
ssh {new-osd-host}
ceph-osd -i {osd-num} --mkfs --mkkey
The directory must be empty before you can run ``ceph-osd``.
Make sure that the directory is empty before running ``ceph-osd``.
#. Register the OSD authentication key. The value of ``ceph`` for
``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your
cluster name differs from ``ceph``, use your cluster name instead:
#. Register the OSD authentication key by running a command of the following
form:
.. prompt:: bash $
ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring
#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The
``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy
wherever you wish. If you specify at least one bucket, the command
will place the OSD into the most specific bucket you specify, *and* it will
move that bucket underneath any other buckets you specify. **Important:** If
you specify only the root bucket, the command will attach the OSD directly
to the root, but CRUSH rules expect OSDs to be inside of hosts.
This presentation of the command has ``ceph-{osd-num}`` in the listed path
because many clusters have the name ``ceph``. However, if your cluster name
is not ``ceph``, then the string ``ceph`` in ``ceph-{osd-num}`` needs to be
replaced with your cluster name. For example, if your cluster name is
``cluster1``, then the path in the command should be
``/var/lib/ceph/osd/cluster1-{osd-num}/keyring``.
Execute the following:
#. Add the OSD to the CRUSH map by running the following command. This allows
the OSD to begin receiving data. The ``ceph osd crush add`` command can add
OSDs to the CRUSH hierarchy wherever you want. If you specify one or more
buckets, the command places the OSD in the most specific of those buckets,
and it moves that bucket underneath any other buckets that you have
specified. **Important:** If you specify only the root bucket, the command
will attach the OSD directly to the root, but CRUSH rules expect OSDs to be
inside of hosts. If the OSDs are not inside hosts, the OSDs will likely not
receive any data.
.. prompt:: bash $
ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...]
You may also decompile the CRUSH map, add the OSD to the device list, add the
host as a bucket (if it's not already in the CRUSH map), add the device as an
item in the host, assign it a weight, recompile it and set it. See
`Add/Move an OSD`_ for details.
Note that there is another way to add a new OSD to the CRUSH map: decompile
the CRUSH map, add the OSD to the device list, add the host as a bucket (if
it is not already in the CRUSH map), add the device as an item in the host,
assign the device a weight, recompile the CRUSH map, and set the CRUSH map.
For details, see `Add/Move an OSD`_. This is rarely necessary with recent
releases (this sentence was written the month that Reef was released).
.. _rados-replacing-an-osd:
@ -144,193 +162,206 @@ weight).
Replacing an OSD
----------------
.. note:: If the instructions in this section do not work for you, try the
instructions in the cephadm documentation: :ref:`cephadm-replacing-an-osd`.
.. note:: If the procedure in this section does not work for you, try the
instructions in the ``cephadm`` documentation:
:ref:`cephadm-replacing-an-osd`.
When disks fail, or if an administrator wants to reprovision OSDs with a new
backend, for instance, for switching from FileStore to BlueStore, OSDs need to
be replaced. Unlike `Removing the OSD`_, replaced OSD's id and CRUSH map entry
need to be keep intact after the OSD is destroyed for replacement.
Sometimes OSDs need to be replaced: for example, when a disk fails, or when an
administrator wants to reprovision OSDs with a new back end (perhaps when
switching from Filestore to BlueStore). Replacing an OSD differs from `Removing
the OSD`_ in that the replaced OSD's ID and CRUSH map entry must be kept intact
after the OSD is destroyed for replacement.
#. Make sure it is safe to destroy the OSD:
#. Make sure that it is safe to destroy the OSD:
.. prompt:: bash $
while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done
#. Destroy the OSD first:
#. Destroy the OSD:
.. prompt:: bash $
ceph osd destroy {id} --yes-i-really-mean-it
#. Zap a disk for the new OSD, if the disk was used before for other purposes.
It's not necessary for a new disk:
#. *Optional*: If the disk that you plan to use is not a new disk and has been
used before for other purposes, zap the disk:
.. prompt:: bash $
ceph-volume lvm zap /dev/sdX
#. Prepare the disk for replacement by using the previously destroyed OSD id:
#. Prepare the disk for replacement by using the ID of the OSD that was
destroyed in previous steps:
.. prompt:: bash $
ceph-volume lvm prepare --osd-id {id} --data /dev/sdX
#. And activate the OSD:
#. Finally, activate the OSD:
.. prompt:: bash $
ceph-volume lvm activate {id} {fsid}
Alternatively, instead of preparing and activating, the device can be recreated
in one call, like:
Alternatively, instead of carrying out the final two steps (preparing the disk
and activating the OSD), you can re-create the OSD by running a single command
of the following form:
.. prompt:: bash $
ceph-volume lvm create --osd-id {id} --data /dev/sdX
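For instance, a single-command re-creation might look like this (the OSD id
``12`` and the device ``/dev/sdb`` are hypothetical):
.. prompt:: bash $
ceph-volume lvm create --osd-id 12 --data /dev/sdb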
Starting the OSD
----------------
After you add an OSD to Ceph, the OSD is in your configuration. However,
it is not yet running. The OSD is ``down`` and ``in``. You must start
your new OSD before it can begin receiving data. You may use
``service ceph`` from your admin host or start the OSD from its host
machine:
After an OSD is added to Ceph, the OSD is in the cluster. However, until it is
started, the OSD is considered ``down`` and ``in``. The OSD is not running and
will be unable to receive data. To start an OSD, either run ``service ceph``
from your admin host or run a command of the following form to start the OSD
from its host machine:
.. prompt:: bash $
sudo systemctl start ceph-osd@{osd-num}
After the OSD is started, it is considered ``up`` and ``in``.
Once you start your OSD, it is ``up`` and ``in``.
Observing the Data Migration
----------------------------
Observe the Data Migration
--------------------------
Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing
the server by migrating placement groups to your new OSD. You can observe this
process with the `ceph`_ tool. :
After the new OSD has been added to the CRUSH map, Ceph begins rebalancing the
cluster by migrating placement groups (PGs) to the new OSD. To observe this
process by using the `ceph`_ tool, run the following command:
.. prompt:: bash $
ceph -w
You should see the placement group states change from ``active+clean`` to
``active, some degraded objects``, and finally ``active+clean`` when migration
completes. (Control-c to exit.)
Or:
.. prompt:: bash $
watch ceph status
The PG states will first change from ``active+clean`` to ``active, some
degraded objects`` and then return to ``active+clean`` when migration
completes. When you are finished observing, press Ctrl-C to exit.
.. _Add/Move an OSD: ../crush-map#addosd
.. _ceph: ../monitoring
Removing OSDs (Manual)
======================
When you want to reduce the size of a cluster or replace hardware, you may
remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd``
daemon for one storage drive within a host machine. If your host has multiple
storage drives, you may need to remove one ``ceph-osd`` daemon for each drive.
Generally, it's a good idea to check the capacity of your cluster to see if you
are reaching the upper end of its capacity. Ensure that when you remove an OSD
that your cluster is not at its ``near full`` ratio.
It is possible to remove an OSD manually while the cluster is running: you
might want to do this in order to reduce the size of the cluster or when
replacing hardware. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on
one storage drive within a host machine. Alternatively, if your host machine
has multiple storage drives, you might need to remove multiple ``ceph-osd``
daemons: one daemon for each drive on the machine.
.. warning:: Do not let your cluster reach its ``full ratio`` when
removing an OSD. Removing OSDs could cause the cluster to reach
or exceed its ``full ratio``.
.. warning:: Before you begin the process of removing an OSD, make sure that
your cluster is not near its ``full ratio``. Otherwise the act of removing
OSDs might cause the cluster to reach or exceed its ``full ratio``.
Take the OSD out of the Cluster
-----------------------------------
Taking the OSD ``out`` of the Cluster
-------------------------------------
Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it
out of the cluster so that Ceph can begin rebalancing and copying its data to
other OSDs. :
OSDs are typically ``up`` and ``in`` before they are removed from the cluster.
Before the OSD can be removed from the cluster, the OSD must be taken ``out``
of the cluster so that Ceph can begin rebalancing and copying its data to other
OSDs. To take an OSD ``out`` of the cluster, run a command of the following
form:
.. prompt:: bash $
ceph osd out {osd-num}
Observe the Data Migration
--------------------------
Observing the Data Migration
----------------------------
Once you have taken your OSD ``out`` of the cluster, Ceph will begin
rebalancing the cluster by migrating placement groups out of the OSD you
removed. You can observe this process with the `ceph`_ tool. :
After the OSD has been taken ``out`` of the cluster, Ceph begins rebalancing
the cluster by migrating placement groups out of the OSD that was removed. To
observe this process by using the `ceph`_ tool, run the following command:
.. prompt:: bash $
ceph -w
You should see the placement group states change from ``active+clean`` to
``active, some degraded objects``, and finally ``active+clean`` when migration
completes. (Control-c to exit.)
The PG states will change from ``active+clean`` to ``active, some degraded
objects`` and will then return to ``active+clean`` when migration completes.
When you are finished observing, press Ctrl-C to exit.
.. note:: Sometimes, typically in a "small" cluster with few hosts (for
instance with a small testing cluster), the fact to take ``out`` the
OSD can spawn a CRUSH corner case where some PGs remain stuck in the
``active+remapped`` state. If you are in this case, you should mark
the OSD ``in`` with:
.. note:: Under certain conditions, the action of taking ``out`` an OSD
might lead CRUSH to encounter a corner case in which some PGs remain stuck
in the ``active+remapped`` state. This problem sometimes occurs in small
clusters with few hosts (for example, in a small testing cluster). To
address this problem, mark the OSD ``in`` by running a command of the
following form:
.. prompt:: bash $
ceph osd in {osd-num}
to come back to the initial state and then, instead of marking ``out``
the OSD, set its weight to 0 with:
After the OSD has come back to its initial state, do not mark the OSD
``out`` again. Instead, set the OSD's weight to ``0`` by running a command
of the following form:
.. prompt:: bash $
ceph osd crush reweight osd.{osd-num} 0
After that, you can observe the data migration which should come to its
end. The difference between marking ``out`` the OSD and reweighting it
to 0 is that in the first case the weight of the bucket which contains
the OSD is not changed whereas in the second case the weight of the bucket
is updated (and decreased of the OSD weight). The reweight command could
be sometimes favoured in the case of a "small" cluster.
After the OSD has been reweighted, observe the data migration and confirm
that it has completed successfully. The difference between marking an OSD
``out`` and reweighting the OSD to ``0`` has to do with the bucket that
contains the OSD. When an OSD is marked ``out``, the weight of the bucket is
not changed. But when an OSD is reweighted to ``0``, the weight of the
bucket is updated (namely, the weight of the OSD is subtracted from the
overall weight of the bucket). When operating small clusters, it can
sometimes be preferable to use the above reweight command.
Stopping the OSD
----------------
After you take an OSD out of the cluster, it may still be running.
That is, the OSD may be ``up`` and ``out``. You must stop
your OSD before you remove it from the configuration:
After you take an OSD ``out`` of the cluster, the OSD might still be running.
In such a case, the OSD is ``up`` and ``out``. Before it is removed from the
cluster, the OSD must be stopped by running commands of the following form:
.. prompt:: bash $
ssh {osd-host}
sudo systemctl stop ceph-osd@{osd-num}
Once you stop your OSD, it is ``down``.
After the OSD has been stopped, it is ``down``.
Removing the OSD
----------------
This procedure removes an OSD from a cluster map, removes its authentication
key, removes the OSD from the OSD map, and removes the OSD from the
``ceph.conf`` file. If your host has multiple drives, you may need to remove an
OSD for each drive by repeating this procedure.
The following procedure removes an OSD from the cluster map, removes the OSD's
authentication key, removes the OSD from the OSD map, and removes the OSD from
the ``ceph.conf`` file. If your host has multiple drives, it might be necessary
to remove an OSD from each drive by repeating this procedure.
#. Let the cluster forget the OSD first. This step removes the OSD from the CRUSH
map, removes its authentication key. And it is removed from the OSD map as
well. Please note the :ref:`purge subcommand <ceph-admin-osd>` is introduced in Luminous, for older
versions, please see below:
#. Begin by having the cluster forget the OSD. This step removes the OSD from
the CRUSH map, removes the OSD's authentication key, and removes the OSD
from the OSD map. (The :ref:`purge subcommand <ceph-admin-osd>` was
introduced in Luminous. For older releases, see :ref:`the procedure linked
here <ceph_osd_purge_procedure_pre_luminous>`.):
.. prompt:: bash $
ceph osd purge {id} --yes-i-really-mean-it
#. Navigate to the host where you keep the master copy of the cluster's
``ceph.conf`` file:
#. Navigate to the host where the master copy of the cluster's
``ceph.conf`` file is kept:
.. prompt:: bash $
@ -338,46 +369,48 @@ OSD for each drive by repeating this procedure.
cd /etc/ceph
vim ceph.conf
#. Remove the OSD entry from your ``ceph.conf`` file (if it exists)::
#. Remove the OSD entry from your ``ceph.conf`` file (if such an entry
exists)::
[osd.1]
host = {hostname}
[osd.1]
host = {hostname}
#. From the host where you keep the master copy of the cluster's ``ceph.conf``
file, copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of
other hosts in your cluster.
#. Copy the updated ``ceph.conf`` file from the location on the host where the
master copy of the cluster's ``ceph.conf`` is kept to the ``/etc/ceph``
directory of the other hosts in your cluster.
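For example (the host names below are hypothetical):
.. prompt:: bash $
scp /etc/ceph/ceph.conf osd-host-02:/etc/ceph/ceph.conf
scp /etc/ceph/ceph.conf osd-host-03:/etc/ceph/ceph.conf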
If your Ceph cluster is older than Luminous, instead of using ``ceph osd
purge``, you need to perform this step manually:
.. _ceph_osd_purge_procedure_pre_luminous:
If your Ceph cluster is older than Luminous, you will be unable to use the
``ceph osd purge`` command. Instead, carry out the following procedure:
#. Remove the OSD from the CRUSH map so that it no longer receives data. You may
also decompile the CRUSH map, remove the OSD from the device list, remove the
device as an item in the host bucket or remove the host bucket (if it's in the
CRUSH map and you intend to remove the host), recompile the map and set it.
See `Remove an OSD`_ for details:
#. Remove the OSD from the CRUSH map so that it no longer receives data (for
more details, see `Remove an OSD`_):
.. prompt:: bash $
ceph osd crush remove {name}
Instead of removing the OSD from the CRUSH map, you might opt for one of two
alternatives: (1) decompile the CRUSH map, remove the OSD from the device
list, and remove the device from the host bucket; (2) remove the host bucket
from the CRUSH map (provided that it is in the CRUSH map and that you intend
to remove the host), recompile the map, and set it.
#. Remove the OSD authentication key:
.. prompt:: bash $
ceph auth del osd.{osd-num}
The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the
``$cluster-$id``. If your cluster name differs from ``ceph``, use your
cluster name instead.
#. Remove the OSD:
.. prompt:: bash $
ceph osd rm {osd-num}
for example:
For example:
.. prompt:: bash $

View File

@ -3,14 +3,15 @@
Balancer
========
The *balancer* can optimize the placement of PGs across OSDs in
order to achieve a balanced distribution, either automatically or in a
supervised fashion.
The *balancer* can optimize the allocation of placement groups (PGs) across
OSDs in order to achieve a balanced distribution. The balancer can operate
either automatically or in a supervised fashion.
Status
------
The current status of the balancer can be checked at any time with:
To check the current status of the balancer, run the following command:
.. prompt:: bash $
@ -20,70 +21,78 @@ The current status of the balancer can be checked at any time with:
Automatic balancing
-------------------
The automatic balancing feature is enabled by default in ``upmap``
mode. Please refer to :ref:`upmap` for more details. The balancer can be
turned off with:
When the balancer is in ``upmap`` mode, the automatic balancing feature is
enabled by default. For more details, see :ref:`upmap`. To disable the
balancer, run the following command:
.. prompt:: bash $
ceph balancer off
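If automatic balancing is needed again later, it can be re-enabled with the
counterpart command:
.. prompt:: bash $
ceph balancer on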
The balancer mode can be changed to ``crush-compat`` mode, which is
backward compatible with older clients, and will make small changes to
the data distribution over time to ensure that OSDs are equally utilized.
The balancer mode can be changed from ``upmap`` mode to ``crush-compat`` mode.
``crush-compat`` mode is backward compatible with older clients. In
``crush-compat`` mode, the balancer automatically makes small changes to the
data distribution in order to ensure that OSDs are utilized equally.
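For example, the mode can be switched with the following command (the same
command is covered in the Modes section below):
.. prompt:: bash $
ceph balancer mode crush-compat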
Throttling
----------
No adjustments will be made to the PG distribution if the cluster is
degraded (e.g., because an OSD has failed and the system has not yet
healed itself).
If the cluster is degraded (that is, if an OSD has failed and the system hasn't
healed itself yet), then the balancer will not make any adjustments to the PG
distribution.
When the cluster is healthy, the balancer will throttle its changes
such that the percentage of PGs that are misplaced (i.e., that need to
be moved) is below a threshold of (by default) 5%. The
``target_max_misplaced_ratio`` threshold can be adjusted with:
When the cluster is healthy, the balancer will incrementally move a small
fraction of unbalanced PGs in order to improve distribution. This fraction
will not exceed a certain threshold that defaults to 5%. To adjust this
``target_max_misplaced_ratio`` threshold setting, run the following command:
.. prompt:: bash $
ceph config set mgr target_max_misplaced_ratio .07 # 7%
Set the number of seconds to sleep in between runs of the automatic balancer:
The balancer sleeps between runs. To set the number of seconds for this
interval of sleep, run the following command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/sleep_interval 60
Set the time of day to begin automatic balancing in HHMM format:
To set the time of day (in HHMM format) at which automatic balancing begins,
run the following command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/begin_time 0000
Set the time of day to finish automatic balancing in HHMM format:
To set the time of day (in HHMM format) at which automatic balancing ends, run
the following command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/end_time 2359
Restrict automatic balancing to this day of the week or later.
Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on:
Automatic balancing can be restricted to certain days of the week. To restrict
it to a specific day of the week or later (as with crontab, ``0`` is Sunday,
``1`` is Monday, and so on), run the following command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/begin_weekday 0
Restrict automatic balancing to this day of the week or earlier.
Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on:
To restrict automatic balancing to a specific day of the week or earlier
(again, ``0`` is Sunday, ``1`` is Monday, and so on), run the following
command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/end_weekday 6
Pool IDs to which the automatic balancing will be limited.
The default for this is an empty string, meaning all pools will be balanced.
The numeric pool IDs can be gotten with the :command:`ceph osd pool ls detail` command:
Automatic balancing can be restricted to certain pools. By default, the value
of this setting is an empty string, so that all pools are automatically
balanced. To restrict automatic balancing to specific pools, retrieve their
numeric pool IDs (by running the :command:`ceph osd pool ls detail` command),
and then run the following command:
.. prompt:: bash $
@ -93,43 +102,41 @@ The numeric pool IDs can be gotten with the :command:`ceph osd pool ls detail` c
Modes
-----
There are currently two supported balancer modes:
There are two supported balancer modes:
#. **crush-compat**. The CRUSH compat mode uses the compat weight-set
feature (introduced in Luminous) to manage an alternative set of
weights for devices in the CRUSH hierarchy. The normal weights
should remain set to the size of the device to reflect the target
amount of data that we want to store on the device. The balancer
then optimizes the weight-set values, adjusting them up or down in
small increments, in order to achieve a distribution that matches
the target distribution as closely as possible. (Because PG
placement is a pseudorandom process, there is a natural amount of
variation in the placement; by optimizing the weights we
counter-act that natural variation.)
#. **crush-compat**. This mode uses the compat weight-set feature (introduced
in Luminous) to manage an alternative set of weights for devices in the
CRUSH hierarchy. When the balancer is operating in this mode, the normal
weights should remain set to the size of the device in order to reflect the
target amount of data intended to be stored on the device. The balancer will
then optimize the weight-set values, adjusting them up or down in small
increments, in order to achieve a distribution that matches the target
distribution as closely as possible. (Because PG placement is a pseudorandom
process, it is subject to a natural amount of variation; optimizing the
weights serves to counteract that natural variation.)
Notably, this mode is *fully backwards compatible* with older
clients: when an OSDMap and CRUSH map is shared with older clients,
we present the optimized weights as the "real" weights.
Note that this mode is *fully backward compatible* with older clients: when
an OSD Map and CRUSH map are shared with older clients, Ceph presents the
optimized weights as the "real" weights.
The primary restriction of this mode is that the balancer cannot
handle multiple CRUSH hierarchies with different placement rules if
the subtrees of the hierarchy share any OSDs. (This is normally
not the case, and is generally not a recommended configuration
because it is hard to manage the space utilization on the shared
OSDs.)
The primary limitation of this mode is that the balancer cannot handle
multiple CRUSH hierarchies with different placement rules if the subtrees of
the hierarchy share any OSDs. (Such sharing of OSDs is not typical and,
because of the difficulty of managing the space utilization on the shared
OSDs, is generally not recommended.)
#. **upmap**. Starting with Luminous, the OSDMap can store explicit
mappings for individual OSDs as exceptions to the normal CRUSH
placement calculation. These `upmap` entries provide fine-grained
control over the PG mapping. This CRUSH mode will optimize the
placement of individual PGs in order to achieve a balanced
distribution. In most cases, this distribution is "perfect," which
an equal number of PGs on each OSD (+/-1 PG, since they might not
divide evenly).
#. **upmap**. In Luminous and later releases, the OSDMap can store explicit
mappings for individual OSDs as exceptions to the normal CRUSH placement
calculation. These ``upmap`` entries provide fine-grained control over the
PG mapping. This balancer mode optimizes the placement of individual PGs in
order to achieve a balanced distribution. In most cases, the resulting
distribution is nearly perfect: that is, there is an equal number of PGs on
each OSD (±1 PG, since the total number might not divide evenly).
Note that using upmap requires that all clients be Luminous or newer.
To use ``upmap``, all clients must be Luminous or newer.
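If you are not certain that all clients are recent enough, one way to have the
cluster enforce this requirement before enabling ``upmap`` (a sketch; confirm
first that no older clients still need to connect) is:

.. prompt:: bash $

   ceph osd set-require-min-compat-client luminous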
The default mode is ``upmap``. The mode can be adjusted with:
The default mode is ``upmap``. The mode can be changed to ``crush-compat`` by
running the following command:
.. prompt:: bash $
@ -138,69 +145,77 @@ The default mode is ``upmap``. The mode can be adjusted with:
Supervised optimization
-----------------------
The balancer operation is broken into a few distinct phases:
Supervised use of the balancer can be understood in terms of three distinct
phases:
#. building a *plan*
#. evaluating the quality of the data distribution, either for the current PG distribution, or the PG distribution that would result after executing a *plan*
#. executing the *plan*
#. building a plan
#. evaluating the quality of the data distribution, either for the current PG
distribution or for the PG distribution that would result after executing a
plan
#. executing the plan
To evaluate and score the current distribution:
To evaluate the current distribution, run the following command:
.. prompt:: bash $
ceph balancer eval
You can also evaluate the distribution for a single pool with:
To evaluate the distribution for a single pool, run the following command:
.. prompt:: bash $
ceph balancer eval <pool-name>
Greater detail for the evaluation can be seen with:
To see the evaluation in greater detail, run the following command:
.. prompt:: bash $
ceph balancer eval-verbose ...
The balancer can generate a plan, using the currently configured mode, with:
To instruct the balancer to generate a plan (using the currently configured
mode), make up a name (any useful identifying string) for the plan, and run the
following command:
.. prompt:: bash $
ceph balancer optimize <plan-name>
The name is provided by the user and can be any useful identifying string. The contents of a plan can be seen with:
To see the contents of a plan, run the following command:
.. prompt:: bash $
ceph balancer show <plan-name>
All plans can be shown with:
To display all plans, run the following command:
.. prompt:: bash $
ceph balancer ls
Old plans can be discarded with:
To discard an old plan, run the following command:
.. prompt:: bash $
ceph balancer rm <plan-name>
Currently recorded plans are shown as part of the status command:
To see currently recorded plans, examine the output of the following status
command:
.. prompt:: bash $
ceph balancer status
The quality of the distribution that would result after executing a plan can be calculated with:
To evaluate the distribution that would result from executing a specific plan,
run the following command:
.. prompt:: bash $
ceph balancer eval <plan-name>
Assuming the plan is expected to improve the distribution (i.e., it has a lower score than the current cluster state), the user can execute that plan with:
If a plan is expected to improve the distribution (that is, the plan's score is
lower than the current cluster state's score), you can execute that plan by
running the following command:
.. prompt:: bash $
ceph balancer execute <plan-name>
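Putting these commands together, a typical supervised balancing session might
look like the following sketch, where ``myplan`` is just an example plan name:

.. prompt:: bash $

   ceph balancer eval
   ceph balancer optimize myplan
   ceph balancer eval myplan
   ceph balancer show myplan
   ceph balancer execute myplan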

View File

@ -1,69 +1,68 @@
.. _rados_operations_bluestore_migration:
=====================
BlueStore Migration
=====================
Each OSD can run either BlueStore or Filestore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
Filestore OSDs should transition to BlueStore in order to
take advantage of the improved performance and robustness. Moreover,
Ceph releases beginning with Reef do not support Filestore. There are
several strategies for making such a transition.
Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph
cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs.
Because BlueStore is superior to Filestore in performance and robustness, and
because Filestore is not supported by Ceph releases beginning with Reef, users
deploying Filestore OSDs should transition to BlueStore. There are several
strategies for making the transition to BlueStore.
An individual OSD cannot be converted in place;
BlueStore and Filestore are simply too different for that to be
feasible. The conversion process uses either the cluster's normal
replication and healing support or tools and strategies that copy OSD
content from an old (Filestore) device to a new (BlueStore) one.
BlueStore is so different from Filestore that an individual OSD cannot be
converted in place. Instead, the conversion process must use either (1) the
cluster's normal replication and healing support, or (2) tools and strategies
that copy OSD content from an old (Filestore) device to a new (BlueStore) one.
Deploying new OSDs with BlueStore
=================================
Deploy new OSDs with BlueStore
==============================
Use BlueStore when deploying new OSDs (for example, when the cluster is
expanded). Because this is the default behavior, no specific change is
needed.
New OSDs (e.g., when the cluster is expanded) should be deployed
using BlueStore. This is the default behavior so no specific change
is needed.
Similarly, use BlueStore for any OSDs that have been reprovisioned after
a failed drive was replaced.
Similarly, any OSDs that are reprovisioned after replacing a failed drive
should use BlueStore.
Converting existing OSDs
========================
Convert existing OSDs
=====================
"Mark-``out``" replacement
--------------------------
Mark out and replace
--------------------
The simplest approach is to ensure that the cluster is healthy,
then mark ``out`` each device in turn, wait for
data to replicate across the cluster, reprovision the OSD, and mark
it back ``in`` again. Proceed to the next OSD when recovery is complete.
This is easy to automate but results in more data migration than
is strictly necessary, which in turn presents additional wear to SSDs and takes
longer to complete.
The simplest approach is to verify that the cluster is healthy and
then follow these steps for each Filestore OSD in succession: mark the OSD
``out``, wait for the data to replicate across the cluster, reprovision the OSD,
mark the OSD back ``in``, and wait for recovery to complete before proceeding
to the next OSD. This approach is easy to automate, but it entails unnecessary
data migration that carries costs in time and SSD wear.
#. Identify a Filestore OSD to replace::
ID=<osd-id-number>
DEVICE=<disk-device>
You can tell whether a given OSD is Filestore or BlueStore with:
#. Determine whether a given OSD is Filestore or BlueStore:
.. prompt:: bash $
.. prompt:: bash $
ceph osd metadata $ID | grep osd_objectstore
ceph osd metadata $ID | grep osd_objectstore
You can get a current count of Filestore and BlueStore OSDs with:
#. Get a current count of Filestore and BlueStore OSDs:
.. prompt:: bash $
.. prompt:: bash $
ceph osd count-metadata osd_objectstore
ceph osd count-metadata osd_objectstore
#. Mark the Filestore OSD ``out``:
#. Mark a Filestore OSD ``out``:
.. prompt:: bash $
ceph osd out $ID
#. Wait for the data to migrate off the OSD in question:
#. Wait for the data to migrate off this OSD:
.. prompt:: bash $
@ -75,7 +74,9 @@ longer to complete.
systemctl kill ceph-osd@$ID
#. Note which device this OSD is using:
.. _osd_id_retrieval:
#. Note which device the OSD is using:
.. prompt:: bash $
@ -87,25 +88,27 @@ longer to complete.
umount /var/lib/ceph/osd/ceph-$ID
#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy
the contents of the device; be certain the data on the device is
not needed (i.e., that the cluster is healthy) before proceeding:
#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! These commands will destroy
the contents of the device; you must be certain that the data on the device is
not needed (in other words, that the cluster is healthy) before proceeding:
.. prompt:: bash $
ceph-volume lvm zap $DEVICE
#. Tell the cluster the OSD has been destroyed (and a new OSD can be
reprovisioned with the same ID):
#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
reprovisioned with the same OSD ID):
.. prompt:: bash $
ceph osd destroy $ID --yes-i-really-mean-it
#. Provision a BlueStore OSD in its place with the same OSD ID.
This requires you do identify which device to wipe based on what you saw
mounted above. BE CAREFUL! Also note that hybrid OSDs may require
adjustments to these commands:
#. Provision a BlueStore OSD in place by using the same OSD ID. This requires
you to identify which device to wipe, and to make certain that you target
the correct and intended device, using the information that was retrieved in
the :ref:`"Note which device the OSD is using" <osd_id_retrieval>` step. BE
CAREFUL! Note that you may need to modify these commands when dealing with
hybrid OSDs:
.. prompt:: bash $
@ -113,15 +116,15 @@ longer to complete.
#. Repeat.
You can allow balancing of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to
only zap / destroy OSDs within a single CRUSH failure domain, e.g. ``host`` or
``rack``. Failure to do so will reduce the redundancy and availability of
your data and increase the risk of (or even cause) data loss.
You may opt to (1) have the balancing of the replacement BlueStore OSD take
place concurrently with the draining of the next Filestore OSD, or instead
(2) follow the same procedure for multiple OSDs in parallel. In either case,
however, you must ensure that the cluster is fully clean (in other words, that
all data has all replicas) before destroying any OSDs. If you opt to reprovision
multiple OSDs in parallel, be **very** careful to destroy OSDs only within a
single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to
satisfy this requirement will reduce the redundancy and availability of your
data and increase the risk of data loss (or even guarantee data loss).
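Before destroying any OSD, one way to double-check that the cluster is clean
and that the OSD in question is no longer needed (a sketch, where ``$ID`` is
the OSD you are about to reprovision) is:

.. prompt:: bash $

   ceph -s
   while ! ceph osd safe-to-destroy osd.$ID ; do sleep 60 ; done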
Advantages:
@ -131,29 +134,29 @@ Advantages:
Disadvantages:
* Data is copied over the network twice: once to some other OSD in the
cluster (to maintain the desired number of replicas), and then again
back to the reprovisioned BlueStore OSD.
* Data is copied over the network twice: once to another OSD in the cluster (to
maintain the specified number of replicas), and again back to the
reprovisioned BlueStore OSD.
"Whole host" replacement
------------------------
Whole host replacement
----------------------
If you have a spare host in the cluster, or sufficient free space to evacuate
an entire host for use as a spare, then the conversion can be done on a
host-by-host basis so that each stored copy of the data is migrated only once.
If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.
To use this approach, you need an empty host that has no OSDs provisioned.
There are two ways to do this: either by using a new, empty host that is not
yet part of the cluster, or by offloading data from an existing host that is
already part of the cluster.
First, you need an empty host that has no OSDs provisioned. There are two
ways to do this: either by starting with a new, empty host that isn't yet
part of the cluster, or by offloading data from an existing host in the cluster.
Using a new, empty host
^^^^^^^^^^^^^^^^^^^^^^^
Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^
Ideally the host will have roughly the same capacity as each of the other hosts
you will be converting. Add the host to the CRUSH hierarchy, but do not attach
it to the root:
Ideally the host should have roughly the
same capacity as other hosts you will be converting.
Add the host to the CRUSH hierarchy, but do not attach it to the root:
.. prompt:: bash $
@ -162,23 +165,22 @@ Add the host to the CRUSH hierarchy, but do not attach it to the root:
Make sure that Ceph packages are installed on the new host.
Use an existing host
^^^^^^^^^^^^^^^^^^^^
Using an existing host
^^^^^^^^^^^^^^^^^^^^^^
If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off to
other cluster hosts, you can instead do::
If you would like to use an existing host that is already part of the cluster,
and if there is sufficient free space on that host so that all of its data can
be migrated off to other cluster hosts, you can do the following (instead of
using a new, empty host):
.. prompt:: bash $
.. prompt:: bash $
OLDHOST=<existing-cluster-host-to-offload>
ceph osd crush unlink $OLDHOST default
where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now
be "default", but it might instead be a rack name.) You should now
see the host at the top of the OSD tree output with no parent:
.. prompt:: bash $
@ -199,15 +201,18 @@ see the host at the top of the OSD tree output with no parent:
2 ssd 1.00000 osd.2 up 1.00000 1.00000
...
If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.
If everything looks good, jump directly to the :ref:`"Wait for the data
migration to complete" <bluestore_data_migration_step>` step below and proceed
from there to clean up the old OSDs.
Migration process
^^^^^^^^^^^^^^^^^
If you're using a new host, start at step #1. For an existing host,
jump to step #5 below.
If you're using a new host, start at :ref:`the first step
<bluestore_migration_process_first_step>`. If you're using an existing host,
jump to :ref:`this step <bluestore_data_migration_step>`.
.. _bluestore_migration_process_first_step:
#. Provision new BlueStore OSDs for all devices:
@ -215,14 +220,14 @@ jump to step #5 below.
ceph-volume lvm create --bluestore --data /dev/$DEVICE
#. Verify OSDs join the cluster with:
#. Verify that the new OSDs have joined the cluster:
.. prompt:: bash $
ceph osd tree
You should see the new host ``$NEWHOST`` with all of the OSDs beneath
it, but the host should *not* be nested beneath any other node in
it, but the host should *not* be nested beneath any other node in the
hierarchy (like ``root default``). For example, if ``newhost`` is
the empty host, you might see something like::
@ -251,13 +256,16 @@ jump to step #5 below.
ceph osd crush swap-bucket $NEWHOST $OLDHOST
At this point all data on ``$OLDHOST`` will start migrating to OSDs
on ``$NEWHOST``. If there is a difference in the total capacity of
the old and new hosts you may also see some data migrate to or from
other nodes in the cluster, but as long as the hosts are similarly
sized this will be a relatively small amount of data.
At this point all data on ``$OLDHOST`` will begin migrating to the OSDs on
``$NEWHOST``. If there is a difference between the total capacity of the
old hosts and the total capacity of the new hosts, you may also see some
data migrate to or from other nodes in the cluster. Provided that the hosts
are similarly sized, however, this will be a relatively small amount of
data.
#. Wait for data migration to complete:
.. _bluestore_data_migration_step:
#. Wait for the data migration to complete:
.. prompt:: bash $
@ -279,14 +287,14 @@ jump to step #5 below.
ceph osd purge $osd --yes-i-really-mean-it
done
#. Wipe the old OSD devices. This requires you do identify which
devices are to be wiped manually (BE CAREFUL!). For each device:
#. Wipe the old OSDs. This requires you to identify which devices are to be
wiped manually. BE CAREFUL! For each device:
.. prompt:: bash $
ceph-volume lvm zap $DEVICE
#. Use the now-empty host as the new host, and repeat::
#. Use the now-empty host as the new host, and repeat:
.. prompt:: bash $
@ -295,54 +303,53 @@ jump to step #5 below.
Advantages:
* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can parallelize to converting multiple hosts at a time.
* No spare devices are required on each host.
* An entire host's OSDs are converted at once.
* Can be parallelized, to make possible the conversion of multiple hosts at the same time.
* No host involved in this process needs to have a spare device.
Disadvantages:
* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
* An entire host's worth of OSDs will be migrating data at a time. This
is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.
Per-OSD device copy
-------------------
A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has twelve OSDs, then you'd need a
thirteenth unused device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.
included in ``ceph-objectstore-tool``. This requires that the host have one or more free
devices to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has twelve OSDs, then you need a
thirteenth unused device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.
Caveats:
* This strategy requires that an empty BlueStore OSD be prepared
without allocating a new OSD ID, something that the ``ceph-volume``
tool doesn't support. More importantly, the setup of *dmcrypt* is
closely tied to the OSD identity, which means that this approach
does not work with encrypted OSDs.
* This approach requires that we prepare an empty BlueStore OSD but that we do not allocate
a new OSD ID to it. The ``ceph-volume`` tool does not support such an operation. **IMPORTANT:**
because the setup of *dmcrypt* is closely tied to the identity of the OSD, this approach does not
work with encrypted OSDs.
* The device must be manually partitioned.
* An unsupported user-contributed script that shows this process may be found at
* An unsupported user-contributed script that demonstrates this process may be found here:
https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash
Advantages:
* Little or no data migrates over the network during the conversion, so long as
the `noout` or `norecover`/`norebalance` flags are set on the OSD or the cluster
while the process proceeds.
* Provided that the ``noout`` flag or the ``norecover``/``norebalance`` flags are set on the OSD or the
cluster while the conversion process is underway, little or no data migrates over the
network during the conversion.
Disadvantages:
* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
with the OSD in their acting set may not be ideally redundant until the
subject OSD comes up and recovers. This increases the risk of data
loss due to an overlapping failure. However, if another OSD fails before
conversion and start-up are complete, the original Filestore OSD can be
loss due to an overlapping failure. However, if another OSD fails before
conversion and startup have completed, the original Filestore OSD can be
started to provide access to its original data.

View File

@ -1,6 +1,10 @@
===============
Cache Tiering
===============
.. warning:: Cache tiering has been deprecated in the Reef release as it
has lacked a maintainer for a very long time. This does not mean
that it will certainly be removed, but we may choose to remove it
without much further notice.
A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a

View File

@ -1,88 +1,100 @@
.. _changing_monitor_elections:
=====================================
Configure Monitor Election Strategies
=====================================
=======================================
Configuring Monitor Election Strategies
=======================================
By default, the monitors will use the ``classic`` mode. We
recommend that you stay in this mode unless you have a very specific reason.
By default, the monitors are in ``classic`` mode. We recommend staying in this
mode unless you have a very specific reason.
If you want to switch modes BEFORE constructing the cluster, change
the ``mon election default strategy`` option. This option is an integer value:
If you want to switch modes BEFORE constructing the cluster, change the ``mon
election default strategy`` option. This option takes an integer value:
* 1 for "classic"
* 2 for "disallow"
* 3 for "connectivity"
* ``1`` for ``classic``
* ``2`` for ``disallow``
* ``3`` for ``connectivity``
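For example, to have a newly bootstrapped cluster start out with the
connectivity strategy, one approach (a sketch; adapt it to however your
deployment tooling manages ``ceph.conf``) is to set the option before the
first monitor is created::

    [mon]
        mon_election_default_strategy = 3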
Once your cluster is running, you can change strategies by running ::
After your cluster has started running, you can change strategies by running a
command of the following form:
$ ceph mon set election_strategy {classic|disallow|connectivity}
Choosing a mode
===============
The modes other than classic provide different features. We recommend
you stay in classic mode if you don't need the extra features as it is
the simplest mode.
The disallow Mode
=================
This mode lets you mark monitors as disallowed, in which case they will
participate in the quorum and serve clients, but cannot be elected leader. You
may wish to use this if you have some monitors which are known to be far away
from clients.
You can disallow a leader by running:
The modes other than ``classic`` provide specific features. We recommend staying
in ``classic`` mode if you don't need these extra features because it is the
simplest mode.
.. _rados_operations_disallow_mode:
Disallow Mode
=============
The ``disallow`` mode allows you to mark monitors as disallowed. Disallowed
monitors participate in the quorum and serve clients, but cannot be elected
leader. You might want to use this mode for monitors that are far away from
clients.
To disallow a monitor from being elected leader, run a command of the following
form:
.. prompt:: bash $
ceph mon add disallowed_leader {name}
You can remove a monitor from the disallowed list, and allow it to become
a leader again, by running:
To remove a monitor from the disallowed list and allow it to be elected leader,
run a command of the following form:
.. prompt:: bash $
ceph mon rm disallowed_leader {name}
The list of disallowed_leaders is included when you run:
To see the list of disallowed leaders, examine the output of the following
command:
.. prompt:: bash $
ceph mon dump
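The disallowed monitors appear in the monmap portion of the output; the
relevant field is typically named ``disallowed_leaders``, so (as a sketch) you
can filter for it directly:

.. prompt:: bash $

   ceph mon dump | grep disallowed_leaders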
The connectivity Mode
=====================
This mode evaluates connection scores provided by each monitor for its
peers and elects the monitor with the highest score. This mode is designed
to handle network partitioning or *net-splits*, which may happen if your cluster
is stretched across multiple data centers or otherwise has a non-uniform
or unbalanced network topology.
Connectivity Mode
=================
This mode also supports disallowing monitors from being the leader
using the same commands as above in disallow.
The ``connectivity`` mode evaluates connection scores that are provided by each
monitor for its peers and elects the monitor with the highest score. This mode
is designed to handle network partitioning (also called *net-splits*): network
partitioning might occur if your cluster is stretched across multiple data
centers or otherwise has a non-uniform or unbalanced network topology.
The ``connectivity`` mode also supports disallowing monitors from being elected
leader by using the same commands that were presented in :ref:`Disallow Mode <rados_operations_disallow_mode>`.
Examining connectivity scores
=============================
The monitors maintain connection scores even if they aren't in
the connectivity election mode. You can examine the scores a monitor
has by running:
The monitors maintain connection scores even if they aren't in ``connectivity``
mode. To examine a specific monitor's connection scores, run a command of the
following form:
.. prompt:: bash $
ceph daemon mon.{name} connection scores dump
Scores for individual connections range from 0-1 inclusive, and also
include whether the connection is considered alive or dead (determined by
whether it returned its latest ping within the timeout).
Scores for an individual connection range from ``0`` to ``1`` inclusive and
include whether the connection is considered alive or dead (as determined by
whether it returned its latest ping before timeout).
While this would be an unexpected occurrence, if for some reason you experience
problems and troubleshooting makes you think your scores have become invalid,
you can forget history and reset them by running:
Connectivity scores are expected to remain valid. However, if during
troubleshooting you determine that these scores have for some reason become
invalid, drop the history and reset the scores by running a command of the
following form:
.. prompt:: bash $
ceph daemon mon.{name} connection scores reset
While resetting scores has low risk (monitors will still quickly determine
if a connection is alive or dead, and trend back to the previous scores if they
were accurate!), it should also not be needed and is not recommended unless
requested by your support team or a developer.
Resetting connectivity scores carries little risk: monitors will still quickly
determine whether a connection is alive or dead and trend back to the previous
scores if those scores were accurate. Nevertheless, resetting scores ought to
be unnecessary and it is not recommended unless advised by your support team
or by a developer.

View File

@ -8,13 +8,13 @@
Monitor Commands
================
Monitor commands are issued using the ``ceph`` utility:
To issue monitor commands, use the ``ceph`` utility:
.. prompt:: bash $
ceph [-m monhost] {command}
The command is usually (though not always) of the form:
In most cases, monitor commands have the following form:
.. prompt:: bash $
@ -24,48 +24,49 @@ The command is usually (though not always) of the form:
System Commands
===============
Execute the following to display the current cluster status. :
To display the current cluster status, run the following commands:
.. prompt:: bash $
ceph -s
ceph status
Execute the following to display a running summary of cluster status
and major events. :
To display a running summary of cluster status and major events, run the
following command:
.. prompt:: bash $
ceph -w
Execute the following to show the monitor quorum, including which monitors are
participating and which one is the leader. :
To display the monitor quorum, including which monitors are participating and
which one is the leader, run the following commands:
.. prompt:: bash $
ceph mon stat
ceph quorum_status
Execute the following to query the status of a single monitor, including whether
or not it is in the quorum. :
To query the status of a single monitor, including whether it is in the quorum,
run the following command:
.. prompt:: bash $
ceph tell mon.[id] mon_status
where the value of ``[id]`` can be determined, e.g., from ``ceph -s``.
Here the value of ``[id]`` can be found by consulting the output of ``ceph
-s``.
Authentication Subsystem
========================
To add a keyring for an OSD, execute the following:
To add an OSD keyring for a specific OSD, run the following command:
.. prompt:: bash $
ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring}
To list the cluster's keys and their capabilities, execute the following:
To list the cluster's keys and their capabilities, run the following command:
.. prompt:: bash $
@ -75,42 +76,57 @@ To list the cluster's keys and their capabilities, execute the following:
Placement Group Subsystem
=========================
To display the statistics for all placement groups (PGs), execute the following:
To display the statistics for all placement groups (PGs), run the following
command:
.. prompt:: bash $
ceph pg dump [--format {format}]
The valid formats are ``plain`` (default), ``json`` ``json-pretty``, ``xml``, and ``xml-pretty``.
When implementing monitoring and other tools, it is best to use ``json`` format.
JSON parsing is more deterministic than the human-oriented ``plain``, and the layout is much
less variable from release to release. The ``jq`` utility can be invaluable when extracting
data from JSON output.
Here the valid formats are ``plain`` (default), ``json``, ``json-pretty``,
``xml``, and ``xml-pretty``. When implementing monitoring tools and other
tools, it is best to use the ``json`` format. JSON parsing is more
deterministic than the ``plain`` format (which is more human readable), and the
layout is much more consistent from release to release. The ``jq`` utility is
very useful for extracting data from JSON output.
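As an illustration of the kind of extraction ``jq`` makes convenient, the
following sketch (the field names are taken from the OSD map dump described in
the OSD Subsystem section below and may vary between releases) lists each
OSD's ID together with its ``up`` and ``in`` flags:

.. prompt:: bash $

   ceph osd dump --format json | jq -r '.osds[] | "\(.osd) up=\(.up) in=\(.in)"'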
To display the statistics for all placement groups stuck in a specified state,
execute the following:
To display the statistics for all PGs stuck in a specified state, run the
following command:
.. prompt:: bash $
ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}]
Here ``--format`` may be ``plain`` (default), ``json``, ``json-pretty``,
``xml``, or ``xml-pretty``.
``--format`` may be ``plain`` (default), ``json``, ``json-pretty``, ``xml``, or ``xml-pretty``.
The ``--threshold`` argument determines the time interval (in seconds) for a PG
to be considered ``stuck`` (default: 300).
``--threshold`` defines how many seconds "stuck" is (default: 300)
PGs might be stuck in any of the following states:
**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come back.
**Inactive**
**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.
PGs are unable to process reads or writes because they are waiting for an
OSD that has the most up-to-date data to return to an ``up`` state.
**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by
``mon_osd_report_timeout``).
Delete "lost" objects or revert them to their prior state, either a previous version
or delete them if they were just created. :
**Unclean**
PGs contain objects that have not been replicated the desired number of
times. These PGs have not yet completed the process of recovering.
**Stale**
PGs are in an unknown state, because the OSDs that host them have not
reported to the monitor cluster for a certain period of time (specified by
the ``mon_osd_report_timeout`` configuration setting).
To delete a ``lost`` object or revert an object to its prior state, either by
reverting it to its previous version or by deleting it because it was just
created and has no previous version, run the following command:
.. prompt:: bash $
@ -122,227 +138,262 @@ or delete them if they were just created. :
OSD Subsystem
=============
Query OSD subsystem status. :
To query OSD subsystem status, run the following command:
.. prompt:: bash $
ceph osd stat
Write a copy of the most recent OSD map to a file. See
:ref:`osdmaptool <osdmaptool>`. :
To write a copy of the most recent OSD map to a file (see :ref:`osdmaptool
<osdmaptool>`), run the following command:
.. prompt:: bash $
ceph osd getmap -o file
Write a copy of the crush map from the most recent OSD map to
file. :
To write a copy of the CRUSH map from the most recent OSD map to a file, run
the following command:
.. prompt:: bash $
ceph osd getcrushmap -o file
The foregoing is functionally equivalent to :
Note that this command is functionally equivalent to the following two
commands:
.. prompt:: bash $
ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --export-crush file
Dump the OSD map. Valid formats for ``-f`` are ``plain``, ``json``, ``json-pretty``,
``xml``, and ``xml-pretty``. If no ``--format`` option is given, the OSD map is
dumped as plain text. As above, JSON format is best for tools, scripting, and other automation. :
To dump the OSD map, run the following command:
.. prompt:: bash $
ceph osd dump [--format {format}]
Dump the OSD map as a tree with one line per OSD containing weight
and state. :
The ``--format`` option accepts the following arguments: ``plain`` (default),
``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON is
the recommended format for tools, scripting, and other forms of automation.
To dump the OSD map as a tree that lists one OSD per line and displays
information about the weights and states of the OSDs, run the following
command:
.. prompt:: bash $
ceph osd tree [--format {format}]
Find out where a specific object is or would be stored in the system:
To find out where a specific RADOS object is stored in the system, run a
command of the following form:
.. prompt:: bash $
ceph osd map <pool-name> <object-name>
Add or move a new item (OSD) with the given id/name/weight at the specified
location. :
To add or move a new OSD with a given ID or name and a given weight to a
specific CRUSH location, run the following command:
.. prompt:: bash $
ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]]
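For example, the following command (which uses hypothetical bucket names;
substitute the bucket types and names of your own CRUSH hierarchy) places
``osd.7`` with a CRUSH weight of ``1.8`` under host ``node5`` in rack
``rack2``:

.. prompt:: bash $

   ceph osd crush set osd.7 1.8 root=default rack=rack2 host=node5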
Remove an existing item (OSD) from the CRUSH map. :
To remove an existing OSD from the CRUSH map, run the following command:
.. prompt:: bash $
ceph osd crush remove {name}
Remove an existing bucket from the CRUSH map. :
To remove an existing bucket from the CRUSH map, run the following command:
.. prompt:: bash $
ceph osd crush remove {bucket-name}
Move an existing bucket from one position in the hierarchy to another. :
To move an existing bucket from one position in the CRUSH hierarchy to another,
run the following command:
.. prompt:: bash $
ceph osd crush move {id} {loc1} [{loc2} ...]
Set the weight of the item given by ``{name}`` to ``{weight}``. :
To set the CRUSH weight of a specific OSD (specified by ``{name}``) to
``{weight}``, run the following command:
.. prompt:: bash $
ceph osd crush reweight {name} {weight}
Mark an OSD as ``lost``. This may result in permanent data loss. Use with caution. :
To mark an OSD as ``lost``, run the following command:
.. prompt:: bash $
ceph osd lost {id} [--yes-i-really-mean-it]
Create a new OSD. If no UUID is given, it will be set automatically when the OSD
starts up. :
.. warning::
This could result in permanent data loss. Use with caution!
To create a new OSD, run the following command:
.. prompt:: bash $
ceph osd create [{uuid}]
Remove the given OSD(s). :
If no UUID is given as part of this command, the UUID will be set automatically
when the OSD starts up.
To remove one or more specific OSDs, run the following command:
.. prompt:: bash $
ceph osd rm [{id}...]
Query the current ``max_osd`` parameter in the OSD map. :
To display the current ``max_osd`` parameter in the OSD map, run the following
command:
.. prompt:: bash $
ceph osd getmaxosd
Import the given crush map. :
To import a specific CRUSH map, run the following command:
.. prompt:: bash $
ceph osd setcrushmap -i file
Set the ``max_osd`` parameter in the OSD map. This defaults to 10000 now so
most admins will never need to adjust this. :
To set the ``max_osd`` parameter in the OSD map, run the following command:
.. prompt:: bash $
ceph osd setmaxosd
Mark OSD ``{osd-num}`` down. :
The parameter has a default value of 10000. Most operators will never need to
adjust it.
To mark a specific OSD ``down``, run the following command:
.. prompt:: bash $
ceph osd down {osd-num}
Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). :
To mark a specific OSD ``out`` (so that no data will be allocated to it), run
the following command:
.. prompt:: bash $
ceph osd out {osd-num}
Mark ``{osd-num}`` in the distribution (i.e. allocated data). :
To mark a specific OSD ``in`` (so that data will be allocated to it), run the
following command:
.. prompt:: bash $
ceph osd in {osd-num}
Set or clear the pause flags in the OSD map. If set, no IO requests
will be sent to any OSD. Clearing the flags via unpause results in
resending pending requests. :
By using the "pause flags" in the OSD map, you can pause or unpause I/O
requests. If the flags are set, then no I/O requests will be sent to any OSD.
When the flags are cleared, then pending I/O requests will be resent. To set or
clear pause flags, run one of the following commands:
.. prompt:: bash $
ceph osd pause
ceph osd unpause
Set the override weight (reweight) of ``{osd-num}`` to ``{weight}``. Two OSDs with the
same weight will receive roughly the same number of I/O requests and
store approximately the same amount of data. ``ceph osd reweight``
sets an override weight on the OSD. This value is in the range 0 to 1,
and forces CRUSH to re-place (1-weight) of the data that would
otherwise live on this drive. It does not change weights assigned
to the buckets above the OSD in the crush map, and is a corrective
measure in case the normal CRUSH distribution is not working out quite
right. For instance, if one of your OSDs is at 90% and the others are
at 50%, you could reduce this weight to compensate. :
You can assign an override or ``reweight`` weight value to a specific OSD if
the normal CRUSH distribution seems to be suboptimal. The weight of an OSD
helps determine the extent of its I/O requests and data storage: two OSDs with
the same weight will receive approximately the same number of I/O requests and
store approximately the same amount of data. The ``ceph osd reweight`` command
assigns an override weight to an OSD. The weight value is in the range 0 to 1,
and the command forces CRUSH to relocate a certain amount (1 - ``weight``) of
the data that would otherwise be on this OSD. The command does not change the
weights of the buckets above the OSD in the CRUSH map. Using the command is
merely a corrective measure: for example, if one of your OSDs is at 90% and the
others are at 50%, you could reduce the outlier weight to correct this
imbalance. To assign an override weight to a specific OSD, run the following
command:
.. prompt:: bash $
ceph osd reweight {osd-num} {weight}
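For example, if ``ceph osd df`` shows that a hypothetical ``osd.7`` is far
more utilized than its peers, and the balancer is not managing weights for
you, you might temporarily shift some data off that OSD:

.. prompt:: bash $

   ceph osd df
   ceph osd reweight 7 0.85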
Balance OSD fullness by reducing the override weight of OSDs which are
overly utilized. Note that these override aka ``reweight`` values
default to 1.00000 and are relative only to each other; they not absolute.
It is crucial to distinguish them from CRUSH weights, which reflect the
absolute capacity of a bucket in TiB. By default this command adjusts
override weight on OSDs which have + or - 20% of the average utilization,
but if you include a ``threshold`` that percentage will be used instead. :
.. note:: Any assigned override reweight value will conflict with the balancer.
This means that if the balancer is in use, all override reweight values
should be ``1.0000`` in order to avoid suboptimal cluster behavior.
A cluster's OSDs can be reweighted in order to maintain balance if some OSDs
are being disproportionately utilized. Note that override or ``reweight``
weights have values relative to one another that default to 1.00000; their
values are not absolute, and these weights must be distinguished from CRUSH
weights (which reflect the absolute capacity of a bucket, as measured in TiB).
To reweight OSDs by utilization, run the following command:
.. prompt:: bash $
ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing]
To limit the step by which any OSD's reweight will be changed, specify
``max_change`` which defaults to 0.05. To limit the number of OSDs that will
be adjusted, specify ``max_osds`` as well; the default is 4. Increasing these
parameters can speed leveling of OSD utilization, at the potential cost of
greater impact on client operations due to more data moving at once.
By default, this command adjusts the override weight of OSDs whose utilization
deviates from the average utilization by 20% or more, but you can specify a
different percentage in the ``threshold`` argument.
To determine which and how many PGs and OSDs will be affected by a given invocation
you can test before executing. :
To limit the increment by which any OSD's reweight is to be changed, use the
``max_change`` argument (default: 0.05). To limit the number of OSDs that are
to be adjusted, use the ``max_osds`` argument (default: 4). Increasing these
variables can accelerate the reweighting process, but perhaps at the cost of
slower client operations (as a result of the increase in data movement).
You can test the ``osd reweight-by-utilization`` command before running it. To
find out which and how many PGs and OSDs will be affected by a specific use of
the ``osd reweight-by-utilization`` command, run the following command:
.. prompt:: bash $
ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing]
Adding ``--no-increasing`` to either command prevents increasing any
override weights that are currently < 1.00000. This can be useful when
you are balancing in a hurry to remedy ``full`` or ``nearful`` OSDs or
when some OSDs are being evacuated or slowly brought into service.
The ``--no-increasing`` option can be added to the ``reweight-by-utilization``
and ``test-reweight-by-utilization`` commands in order to prevent any override
weights that are currently less than 1.00000 from being increased. This option
can be useful in certain circumstances: for example, when you are hastily
balancing in order to remedy ``full`` or ``nearfull`` OSDs, or when there are
OSDs being evacuated or slowly brought into service.
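For example, the following dry run (the numbers here are only illustrative)
reports what would change if OSDs more than 15% above the average utilization
were adjusted, changing each weight by at most 0.05, touching at most 8 OSDs,
and never increasing a weight:

.. prompt:: bash $

   ceph osd test-reweight-by-utilization 115 0.05 8 --no-increasing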
Deployments utilizing Nautilus (or later revisions of Luminous and Mimic)
that have no pre-Luminous cients may instead wish to instead enable the
`balancer`` module for ``ceph-mgr``.
Operators of deployments that run Nautilus or newer (or later revisions of
Luminous and Mimic) and that have no pre-Luminous clients may instead want to
enable the ``balancer`` module for ``ceph-mgr``.
Add/remove an IP address or CIDR range to/from the blocklist.
When adding to the blocklist,
you can specify how long it should be blocklisted in seconds; otherwise,
it will default to 1 hour. A blocklisted address is prevented from
connecting to any OSD. If you blocklist an IP or range containing an OSD, be aware
that OSD will also be prevented from performing operations on its peers where it
acts as a client. (This includes tiering and copy-from functionality.)
If you want to blocklist a range (in CIDR format), you may do so by
including the ``range`` keyword.
These commands are mostly only useful for failure testing, as
blocklists are normally maintained automatically and shouldn't need
manual intervention. :
The blocklist can be modified by adding or removing an IP address or a CIDR
range. If an address is blocklisted, it will be unable to connect to any OSD.
If an OSD is contained within an IP address or CIDR range that has been
blocklisted, the OSD will be unable to perform operations on its peers when it
acts as a client: such blocked operations include tiering and copy-from
functionality. To add an IP address or CIDR range to the blocklist, or to
remove one from the blocklist, run one of the following commands:
.. prompt:: bash $
ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME]
ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits]
Creates/deletes a snapshot of a pool. :
If you add something to the blocklist with the above ``add`` command, you can
use the ``TIME`` keyword to specify the length of time (in seconds) that it
will remain on the blocklist (default: one hour). To add or remove a CIDR
range, use the ``range`` keyword in the above commands.
Note that these commands are useful primarily in failure testing. Under normal
conditions, blocklists are maintained automatically and do not need any manual
intervention.
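For example (using a placeholder address), the following commands blocklist a
client address for ten minutes, list the current blocklist entries, and then
remove the entry:

.. prompt:: bash $

   ceph osd blocklist add 198.51.100.7 600
   ceph osd blocklist ls
   ceph osd blocklist rm 198.51.100.7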
To create or delete a snapshot of a specific storage pool, run one of the
following commands:
.. prompt:: bash $
ceph osd pool mksnap {pool-name} {snap-name}
ceph osd pool rmsnap {pool-name} {snap-name}
Creates/deletes/renames a storage pool. :
To create, delete, or rename a specific storage pool, run one of the following
commands:
.. prompt:: bash $
@ -350,20 +401,20 @@ Creates/deletes/renames a storage pool. :
ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
ceph osd pool rename {old-name} {new-name}
Changes a pool setting. :
To change a pool setting, run the following command:
.. prompt:: bash $
ceph osd pool set {pool-name} {field} {value}
Valid fields are:
The following are valid fields:
* ``size``: Sets the number of copies of data in the pool.
* ``pg_num``: The placement group number.
* ``pgp_num``: Effective number when calculating pg placement.
* ``crush_rule``: rule number for mapping placement.
* ``size``: The number of copies of data in the pool.
* ``pg_num``: The PG number.
* ``pgp_num``: The effective number of PGs when calculating placement.
* ``crush_rule``: The rule number for mapping placement.
Get the value of a pool setting. :
To retrieve the value of a pool setting, run the following command:
.. prompt:: bash $
@ -371,40 +422,43 @@ Get the value of a pool setting. :
Valid fields are:
* ``pg_num``: The placement group number.
* ``pgp_num``: Effective number of placement groups when calculating placement.
* ``pg_num``: The PG number.
* ``pgp_num``: The effective number of PGs when calculating placement.
Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. :
To send a scrub command to a specific OSD, or to all OSDs (by using ``*``), run
the following command:
.. prompt:: bash $
ceph osd scrub {osd-num}
Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. :
To send a repair command to a specific OSD, or to all OSDs (by using ``*``),
run the following command:
.. prompt:: bash $
ceph osd repair N
Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES``
in write requests of ``BYTES_PER_WRITE`` each. By default, the test
writes 1 GB in total in 4-MB increments.
The benchmark is non-destructive and will not overwrite existing live
OSD data, but might temporarily affect the performance of clients
concurrently accessing the OSD. :
You can run a simple throughput benchmark test against a specific OSD. This
test writes a total size of ``TOTAL_DATA_BYTES`` (default: 1 GB) incrementally,
in multiple write requests that each have a size of ``BYTES_PER_WRITE``
(default: 4 MB). The test is not destructive and it will not overwrite existing
live OSD data, but it might temporarily affect the performance of clients that
are concurrently accessing the OSD. To launch this benchmark test, run the
following command:
.. prompt:: bash $
ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
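For example, to benchmark a hypothetical ``osd.3`` with the defaults spelled
out explicitly (1 GB written in 4 MB requests):

.. prompt:: bash $

   ceph tell osd.3 bench 1073741824 4194304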
To clear an OSD's caches between benchmark runs, use the 'cache drop' command :
To clear the caches of a specific OSD during the interval between one benchmark
run and another, run the following command:
.. prompt:: bash $
ceph tell osd.N cache drop
To get the cache statistics of an OSD, use the 'cache status' command :
To retrieve the cache statistics of a specific OSD, run the following command:
.. prompt:: bash $
@ -413,7 +467,8 @@ To get the cache statistics of an OSD, use the 'cache status' command :
MDS Subsystem
=============
Change configuration parameters on a running mds. :
To change the configuration parameters of a running metadata server, run the
following command:
.. prompt:: bash $
@ -425,19 +480,20 @@ Example:
ceph tell mds.0 config set debug_ms 1
The above example enables debug messages on that MDS.

To display the status of all metadata servers, run the following command:
.. prompt:: bash $
ceph mds stat
To mark the active metadata server as failed (and to trigger failover to a
standby if a standby is present), run the following command:

.. prompt:: bash $

   ceph mds fail 0
.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
@ -445,157 +501,165 @@ Marks the active MDS as failed, triggering failover to a standby if present.
Mon Subsystem
=============
Show monitor stats:
To display monitor statistics, run the following command:
.. prompt:: bash $
ceph mon stat
This command returns output similar to the following:
::
e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
There is a ``quorum`` list at the end of the output. It lists those monitor
nodes that are part of the current quorum.
The ``quorum`` list at the end lists monitor nodes that are part of the current quorum.
This is also available more directly:
To retrieve this information in a more direct way, run the following command:
.. prompt:: bash $
ceph quorum_status -f json-pretty
.. code-block:: javascript
{
"election_epoch": 6,
"quorum": [
0,
1,
2
],
"quorum_names": [
"a",
"b",
"c"
],
"quorum_leader_name": "a",
"monmap": {
"epoch": 2,
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
"modified": "2016-12-26 14:42:09.288066",
"created": "2016-12-26 14:42:03.573585",
"features": {
"persistent": [
"kraken"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "a",
"addr": "127.0.0.1:40000\/0",
"public_addr": "127.0.0.1:40000\/0"
},
{
"rank": 1,
"name": "b",
"addr": "127.0.0.1:40001\/0",
"public_addr": "127.0.0.1:40001\/0"
},
{
"rank": 2,
"name": "c",
"addr": "127.0.0.1:40002\/0",
"public_addr": "127.0.0.1:40002\/0"
}
]
}
}
This command returns output similar to the following:
.. code-block:: javascript
{
"election_epoch": 6,
"quorum": [
0,
1,
2
],
"quorum_names": [
"a",
"b",
"c"
],
"quorum_leader_name": "a",
"monmap": {
"epoch": 2,
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
"modified": "2016-12-26 14:42:09.288066",
"created": "2016-12-26 14:42:03.573585",
"features": {
"persistent": [
"kraken"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "a",
"addr": "127.0.0.1:40000\/0",
"public_addr": "127.0.0.1:40000\/0"
},
{
"rank": 1,
"name": "b",
"addr": "127.0.0.1:40001\/0",
"public_addr": "127.0.0.1:40001\/0"
},
{
"rank": 2,
"name": "c",
"addr": "127.0.0.1:40002\/0",
"public_addr": "127.0.0.1:40002\/0"
}
]
}
}
The above will block until a quorum is reached.
For a status of just a single monitor:
To see the status of a specific monitor, run the following command:
.. prompt:: bash $
ceph tell mon.[name] mon_status
where the value of ``[name]`` can be taken from ``ceph quorum_status``. Sample
output::
{
"name": "b",
"rank": 1,
"state": "peon",
"election_epoch": 6,
"quorum": [
0,
1,
2
],
"features": {
"required_con": "9025616074522624",
"required_mon": [
"kraken"
],
"quorum_con": "1152921504336314367",
"quorum_mon": [
"kraken"
]
},
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 2,
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
"modified": "2016-12-26 14:42:09.288066",
"created": "2016-12-26 14:42:03.573585",
"features": {
"persistent": [
"kraken"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "a",
"addr": "127.0.0.1:40000\/0",
"public_addr": "127.0.0.1:40000\/0"
},
{
"rank": 1,
"name": "b",
"addr": "127.0.0.1:40001\/0",
"public_addr": "127.0.0.1:40001\/0"
},
{
"rank": 2,
"name": "c",
"addr": "127.0.0.1:40002\/0",
"public_addr": "127.0.0.1:40002\/0"
}
]
}
}
A dump of the monitor state:
Here the value of ``[name]`` can be found by consulting the output of the
``ceph quorum_status`` command. This command returns output similar to the
following:
::
{
"name": "b",
"rank": 1,
"state": "peon",
"election_epoch": 6,
"quorum": [
0,
1,
2
],
"features": {
"required_con": "9025616074522624",
"required_mon": [
"kraken"
],
"quorum_con": "1152921504336314367",
"quorum_mon": [
"kraken"
]
},
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 2,
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
"modified": "2016-12-26 14:42:09.288066",
"created": "2016-12-26 14:42:03.573585",
"features": {
"persistent": [
"kraken"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "a",
"addr": "127.0.0.1:40000\/0",
"public_addr": "127.0.0.1:40000\/0"
},
{
"rank": 1,
"name": "b",
"addr": "127.0.0.1:40001\/0",
"public_addr": "127.0.0.1:40001\/0"
},
{
"rank": 2,
"name": "c",
"addr": "127.0.0.1:40002\/0",
"public_addr": "127.0.0.1:40002\/0"
}
]
}
}
To see a dump of the monitor state, run the following command:
.. prompt:: bash $
ceph mon dump
This command returns output similar to the following:
::
dumped monmap epoch 2
epoch 2
fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
last_changed 2016-12-26 14:42:09.288066
created 2016-12-26 14:42:03.573585
0: 127.0.0.1:40000/0 mon.a
1: 127.0.0.1:40001/0 mon.b
2: 127.0.0.1:40002/0 mon.c
dumped monmap epoch 2
epoch 2
fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
last_changed 2016-12-26 14:42:09.288066
created 2016-12-26 14:42:03.573585
0: 127.0.0.1:40000/0 mon.a
1: 127.0.0.1:40001/0 mon.b
2: 127.0.0.1:40002/0 mon.c

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@ -2,40 +2,45 @@
Data Placement Overview
=========================
Ceph stores, replicates and rebalances data objects across a RADOS cluster
dynamically. With many different users storing objects in different pools for
different purposes on countless OSDs, Ceph operations require some data
placement planning. The main data placement planning concepts in Ceph include:
Ceph stores, replicates, and rebalances data objects across a RADOS cluster
dynamically. Because different users store objects in different pools for
different purposes on many OSDs, Ceph operations require a certain amount of
data-placement planning. The main data-placement planning concepts in Ceph
include:
- **Pools:** Ceph stores data within pools, which are logical groups for storing
objects. Pools manage the number of placement groups, the number of replicas,
and the CRUSH rule for the pool. To store data in a pool, you must have
an authenticated user with permissions for the pool. Ceph can snapshot pools.
See `Pools`_ for additional details.
- **Pools:** Ceph stores data within pools, which are logical groups used for
storing objects. Pools manage the number of placement groups, the number of
replicas, and the CRUSH rule for the pool. To store data in a pool, it is
necessary to be an authenticated user with permissions for the pool. Ceph is
able to make snapshots of pools. For additional details, see `Pools`_.
- **Placement Groups:** Ceph maps objects to placement groups (PGs).
Placement groups (PGs) are shards or fragments of a logical object pool
that place objects as a group into OSDs. Placement groups reduce the amount
of per-object metadata when Ceph stores the data in OSDs. A larger number of
placement groups (e.g., 100 per OSD) leads to better balancing. See
:ref:`placement groups` for additional details.
- **Placement Groups:** Ceph maps objects to placement groups. Placement
groups (PGs) are shards or fragments of a logical object pool that place
objects as a group into OSDs. Placement groups reduce the amount of
per-object metadata that is necessary for Ceph to store the data in OSDs. A
greater number of placement groups (for example, 100 PGs per OSD as compared
with 50 PGs per OSD) leads to better balancing. For additional details, see
:ref:`placement groups`.
- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without
performance bottlenecks, without limitations to scalability, and without a
single point of failure. CRUSH maps provide the physical topology of the
cluster to the CRUSH algorithm to determine where the data for an object
and its replicas should be stored, and how to do so across failure domains
for added data safety among other things. See `CRUSH Maps`_ for additional
details.
- **CRUSH Maps:** CRUSH plays a major role in allowing Ceph to scale while
avoiding certain pitfalls, such as performance bottlenecks, limitations to
scalability, and single points of failure. CRUSH maps provide the physical
topology of the cluster to the CRUSH algorithm, so that it can determine both
(1) where the data for an object and its replicas should be stored and (2)
how to store that data across failure domains so as to improve data safety.
For additional details, see `CRUSH Maps`_.
- **Balancer:** The balancer is a feature that will automatically optimize the
distribution of PGs across devices to achieve a balanced data distribution,
maximizing the amount of data that can be stored in the cluster and evenly
distributing the workload across OSDs.
- **Balancer:** The balancer is a feature that automatically optimizes the
distribution of placement groups across devices in order to achieve a
balanced data distribution, in order to maximize the amount of data that can
be stored in the cluster, and in order to evenly distribute the workload
across OSDs.
When you initially set up a test cluster, you can use the default values. Once
you begin planning for a large Ceph cluster, refer to pools, placement groups
and CRUSH for data placement operations.
It is possible to use the default values for each of the above components.
Default values are recommended for a test cluster's initial setup. However,
when planning a large Ceph cluster, values should be customized for
data-placement operations with reference to the different roles played by
pools, placement groups, and CRUSH.
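As a quick orientation, the following commands show how these concepts surface on a running cluster: the pools that exist, the PG count and CRUSH rule of one pool, and the status of the balancer. This is a minimal sketch; the pool name ``mypool`` is an assumption.
.. prompt:: bash $
ceph osd lspools
ceph osd pool get mypool pg_num
ceph osd pool get mypool crush_rule
ceph balancer status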
.. _Pools: ../pools
.. _CRUSH Maps: ../crush-map

View File

@ -3,28 +3,32 @@
Device Management
=================
Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.
Device management allows Ceph to address hardware failure. Ceph tracks hardware
storage devices (HDDs, SSDs) to see which devices are managed by which daemons.
Ceph also collects health metrics about these devices. By doing so, Ceph can
provide tools that predict hardware failure and can automatically respond to
hardware failure.
Device tracking
---------------
You can query which storage devices are in use with:
To see a list of the storage devices that are in use, run the following
command:
.. prompt:: bash $
ceph device ls
You can also list devices by daemon or by host:
Alternatively, to list devices by daemon or by host, run a command of one of
the following forms:
.. prompt:: bash $
ceph device ls-by-daemon <daemon>
ceph device ls-by-host <host>
For any individual device, you can query information about its
location and how it is being consumed with:
To see information about the location of a specific device and about how the
device is being consumed, run a command of the following form:
.. prompt:: bash $
@ -33,103 +37,107 @@ location and how it is being consumed with:
Identifying physical devices
----------------------------
You can blink the drive LEDs on hardware enclosures to make the replacement of
failed disks easy and less error-prone. Use the following command::
To make the replacement of failed disks easier and less error-prone, you can
(in some cases) "blink" the drive's LEDs on hardware enclosures by running a
command of the following form::
device light on|off <devid> [ident|fault] [--force]
The ``<devid>`` parameter is the device identification. You can obtain this
information using the following command:
.. note:: Using this command to blink the lights might not work. Whether it
works will depend upon such factors as your kernel revision, your SES
firmware, or the setup of your HBA.
The ``<devid>`` parameter is the device identification. To retrieve this
information, run the following command:
.. prompt:: bash $
ceph device ls
The ``[ident|fault]`` parameter is used to set the kind of light to blink.
By default, the `identification` light is used.
The ``[ident|fault]`` parameter determines which kind of light will blink. By
default, the `identification` light is used.
.. note::
This command needs the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module enabled.
The orchestrator module enabled is shown by executing the following command:
.. note:: This command works only if the Cephadm or the Rook `orchestrator
<https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_
module is enabled. To see which orchestrator module is enabled, run the
following command:
.. prompt:: bash $
ceph orch status
The command behind the scene to blink the drive LEDs is `lsmcli`. If you need
to customize this command you can configure this via a Jinja2 template::
The command that makes the drive's LEDs blink is `lsmcli`. To customize this
command, configure it via a Jinja2 template by running commands of the
following forms::
ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"
The Jinja2 template is rendered using the following arguments:
The following arguments can be used to customize the Jinja2 template:
* ``on``
A boolean value.
* ``ident_fault``
A string containing `ident` or `fault`.
A string that contains `ident` or `fault`.
* ``dev``
A string containing the device ID, e.g. `SanDisk_X400_M.2_2280_512GB_162924424784`.
A string that contains the device ID: for example, `SanDisk_X400_M.2_2280_512GB_162924424784`.
* ``path``
A string containing the device path, e.g. `/dev/sda`.
A string that contains the device path: for example, `/dev/sda`.
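For example, using the device ID shown above, the fault LED might be turned on and then off again. This is a sketch only, assuming the command is issued through the ``ceph`` CLI and that your enclosure supports it:
.. prompt:: bash $
ceph device light on SanDisk_X400_M.2_2280_512GB_162924424784 fault
ceph device light off SanDisk_X400_M.2_2280_512GB_162924424784 fault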
.. _enabling-monitoring:
Enabling monitoring
-------------------
Ceph can also monitor health metrics associated with your device. For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, like the number of hours powered on, number of power cycles,
or unrecoverable read errors. Other device types like SAS and NVMe
implement a similar set of metrics (via slightly different standards).
All of these can be collected by Ceph via the ``smartctl`` tool.
Ceph can also monitor the health metrics associated with your device. For
example, SATA drives implement a standard called SMART that provides a wide
range of internal metrics about the device's usage and health (for example: the
number of hours powered on, the number of power cycles, the number of
unrecoverable read errors). Other device types such as SAS and NVMe present a
similar set of metrics (via slightly different standards). All of these
metrics can be collected by Ceph via the ``smartctl`` tool.
You can enable or disable health monitoring with:
You can enable or disable health monitoring by running one of the following
commands:
.. prompt:: bash $
ceph device monitoring on
or:
.. prompt:: bash $
ceph device monitoring off
Scraping
--------
If monitoring is enabled, metrics will automatically be scraped at regular intervals. That interval can be configured with:
If monitoring is enabled, device metrics will be scraped automatically at
regular intervals. To configure that interval, run a command of the following
form:
.. prompt:: bash $
ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>
The default is to scrape once every 24 hours.
By default, device metrics are scraped once every 24 hours.
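For example, to scrape twice a day instead, the interval might be halved as follows (a sketch; the value is expressed in seconds):
.. prompt:: bash $
ceph config set mgr mgr/devicehealth/scrape_frequency 43200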
You can manually trigger a scrape of all devices with:
To manually scrape all devices, run the following command:
.. prompt:: bash $
ceph device scrape-health-metrics
A single device can be scraped with:
To scrape a single device, run a command of the following form:
.. prompt:: bash $
ceph device scrape-health-metrics <device-id>
Or a single daemon's devices can be scraped with:
To scrape a single daemon's devices, run a command of the following form:
.. prompt:: bash $
ceph device scrape-daemon-health-metrics <who>
The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with:
To retrieve the stored health metrics for a device (optionally for a specific
timestamp), run a command of the following form:
.. prompt:: bash $
@ -138,71 +146,82 @@ for a specific timestamp) with:
Failure prediction
------------------
Ceph can predict life expectancy and device failures based on the
health metrics it collects. There are three modes:
Ceph can predict drive life expectancy and device failures by analyzing the
health metrics that it collects. The prediction modes are as follows:
* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon
* *local*: use a pre-trained prediction model from the ``ceph-mgr`` daemon.
The prediction mode can be configured with:
To configure the prediction mode, run a command of the following form:
.. prompt:: bash $
ceph config set global device_failure_prediction_mode <mode>
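For example, to use the pre-trained model shipped with the ``ceph-mgr`` daemon, set the mode to ``local`` (one of the modes listed above):
.. prompt:: bash $
ceph config set global device_failure_prediction_mode local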
Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated. You
can see the life expectancy of all devices in output from:
Under normal conditions, failure prediction runs periodically in the
background. For this reason, life expectancy values might be populated only
after a significant amount of time has passed. The life expectancy of all
devices is displayed in the output of the following command:
.. prompt:: bash $
ceph device ls
You can also query the metadata for a specific device with:
To see the metadata of a specific device, run a command of the following form:
.. prompt:: bash $
ceph device info <devid>
You can explicitly force prediction of a device's life expectancy with:
To explicitly force prediction of a specific device's life expectancy, run a
command of the following form:
.. prompt:: bash $
ceph device predict-life-expectancy <devid>
If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with:
In addition to Ceph's internal device failure prediction, you might have an
external source of information about device failures. To inform Ceph of a
specific device's life expectancy, run a command of the following form:
.. prompt:: bash $
ceph device set-life-expectancy <devid> <from> [<to>]
Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.
Life expectancies are expressed as a time interval. This means that the
uncertainty of the life expectancy can be expressed in the form of a range of
time, and perhaps a wide range of time. The interval's end can be left
unspecified.
Health alerts
-------------
The ``mgr/devicehealth/warn_threshold`` controls how soon an expected
device failure must be before we generate a health warning.
The ``mgr/devicehealth/warn_threshold`` configuration option controls the
health check for an expected device failure. If the device is expected to fail
within the specified time interval, an alert is raised.
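For example, to raise the warning only when a device is predicted to fail within the next four weeks, the threshold might be set as follows. This is a sketch; the value is an assumption, expressed in seconds, and follows the same pattern as the other ``mgr/devicehealth`` options:
.. prompt:: bash $
ceph config set mgr mgr/devicehealth/warn_threshold 2419200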
The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with:
To check the stored life expectancy of all devices and generate any appropriate
health alert, run the following command:
.. prompt:: bash $
ceph device check-health
Automatic Mitigation
--------------------
Automatic Migration
-------------------
If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
"out".
The ``mgr/devicehealth/self_heal`` option (enabled by default) automatically
migrates data away from devices that are expected to fail soon. If this option
is enabled, the module marks such devices ``out`` so that automatic migration
will occur.
The ``mgr/devicehealth/mark_out_threshold`` controls how soon an
expected device failure must be before we automatically mark an osd
"out".
.. note:: The ``mon_osd_min_up_ratio`` configuration option can help prevent
this process from cascading to total failure. If the "self heal" module
marks ``out`` so many OSDs that the ratio value of ``mon_osd_min_up_ratio``
is exceeded, then the cluster raises the ``DEVICE_HEALTH_TOOMANY`` health
check. For instructions on what to do in this situation, see
:ref:`DEVICE_HEALTH_TOOMANY<rados_health_checks_device_health_toomany>`.
The ``mgr/devicehealth/mark_out_threshold`` configuration option specifies the
time interval for automatic migration. If a device is expected to fail within
the specified time interval, it will be automatically marked ``out``.
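A sketch of adjusting this interval to two weeks follows; the value is an assumption, expressed in seconds:
.. prompt:: bash $
ceph config set mgr mgr/devicehealth/mark_out_threshold 1209600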

View File

@ -6,9 +6,11 @@ The *jerasure* plugin is the most generic and flexible plugin, it is
also the default for Ceph erasure coded pools.
The *jerasure* plugin encapsulates the `Jerasure
<http://jerasure.org>`_ library. It is
recommended to read the *jerasure* documentation to get a better
understanding of the parameters.
<https://github.com/ceph/jerasure>`_ library. It is
recommended to read the ``jerasure`` documentation to
understand the parameters. Note that the ``jerasure.org``
web site as of 2023 may no longer be connected to the original
project or legitimate.
Create a jerasure profile
=========================

View File

@ -110,6 +110,8 @@ To remove an erasure code profile::
If the profile is referenced by a pool, the deletion will fail.
.. warning:: Removing an erasure code profile using ``osd erasure-code-profile rm`` does not automatically delete the CRUSH rule associated with the erasure code profile. It is recommended to manually remove the associated CRUSH rule using ``ceph osd crush rule remove {rule-name}`` to avoid unexpected behavior.
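A sketch of the full cleanup, assuming a profile named ``myprofile`` whose associated CRUSH rule happens to carry the same name (rule names do not always match profile names, so check with ``ceph osd crush rule ls`` first):
.. prompt:: bash $
ceph osd crush rule ls
ceph osd erasure-code-profile rm myprofile
ceph osd crush rule remove myprofile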
osd erasure-code-profile get
============================

View File

@ -1,14 +1,14 @@
.. _ecpool:
=============
==============
Erasure code
=============
==============
By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks (this
multiple copying is the "replication").
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the method of data protection known as "replication".
In contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
@ -16,17 +16,17 @@ used to rebuild the data. At scale, erasure coding saves space relative to
replication.
In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "encoding chunks".
and parity blocks are referred to as "coding chunks".
Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.
Creating a sample erasure coded pool
Creating a sample erasure-coded pool
------------------------------------
The simplest erasure coded pool is equivalent to `RAID5
The simplest erasure-coded pool is similar to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:
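A minimal sketch of creating and exercising such a pool follows; the pool and object names are assumptions chosen to match the example data shown below:
.. prompt:: bash $
ceph osd pool create ecpool erasure
echo ABCDEFGHI | rados --pool ecpool put NYAN -
rados --pool ecpool get NYAN -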
@ -47,12 +47,13 @@ requires at least three hosts:
ABCDEFGHI
Erasure code profiles
Erasure-code profiles
---------------------
The default erasure code profile can sustain the loss of two OSDs. This erasure
code profile is equivalent to a replicated pool of size three, but requires
2TB to store 1TB of data instead of 3TB to store 1TB of data. The default
The default erasure-code profile can sustain the overlapping loss of two OSDs
without losing data. This erasure-code profile is equivalent to a replicated
pool of size three, but with different storage requirements: instead of
requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default
profile can be displayed with this command:
.. prompt:: bash $
@ -68,26 +69,27 @@ profile can be displayed with this command:
technique=reed_sol_van
.. note::
The default erasure-coded pool, the profile of which is displayed here, is
not the same as the simplest erasure-coded pool.
The default erasure-coded pool has two data chunks (k) and two coding chunks
(m). The profile of the default erasure-coded pool is "k=2 m=2".
The profile just displayed is for the *default* erasure-coded pool, not the
*simplest* erasure-coded pool. These two pools are not the same:
The simplest erasure-coded pool has two data chunks (k) and one coding chunk
(m). The profile of the simplest erasure-coded pool is "k=2 m=1".
The default erasure-coded pool has two data chunks (K) and two coding chunks
(M). The profile of the default erasure-coded pool is "k=2 m=2".
The simplest erasure-coded pool has two data chunks (K) and one coding chunk
(M). The profile of the simplest erasure-coded pool is "k=2 m=1".
Choosing the right profile is important because the profile cannot be modified
after the pool is created. If you find that you need an erasure-coded pool with
a profile different than the one you have created, you must create a new pool
with a different (and presumably more carefully-considered) profile. When the
new pool is created, all objects from the wrongly-configured pool must be moved
to the newly-created pool. There is no way to alter the profile of a pool after its creation.
with a different (and presumably more carefully considered) profile. When the
new pool is created, all objects from the wrongly configured pool must be moved
to the newly created pool. There is no way to alter the profile of a pool after
the pool has been created.
The most important parameters of the profile are *K*, *M* and
The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain* because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67% overhead,
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:
.. prompt:: bash $
@ -106,7 +108,7 @@ the following profile can be defined:
The *NYAN* object will be divided in three (*K=3*) and two additional
*chunks* will be created (*M=2*). The value of *M* defines how many
OSD can be lost simultaneously without losing any data. The
OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.
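A sketch of defining such a profile and creating a pool that uses it; the profile and pool names are assumptions:
.. prompt:: bash $
ceph osd erasure-code-profile set myrackprofile k=3 m=2 crush-failure-domain=rack
ceph osd pool create ecpool_rack erasure myrackprofile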
@ -155,19 +157,19 @@ no two *chunks* are stored in the same rack.
+------+
More information can be found in the `erasure code profiles
More information can be found in the `erasure-code profiles
<../erasure-code-profile>`_ documentation.
Erasure Coding with Overwrites
------------------------------
By default, erasure coded pools only work with uses like RGW that
perform full object writes and appends.
By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).
Since Luminous, partial writes for an erasure coded pool may be
Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool:
data in an erasure-coded pool:
.. prompt:: bash $
@ -175,31 +177,33 @@ data in an erasure coded pool:
This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. In addition to being unsafe, using Filestore with
EC overwrites results in lower performance compared to BlueStore.
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.
Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool, and
Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation:
erasure-coded pool as the ``--data-pool`` during image creation:
.. prompt:: bash $
rbd create --size 1G --data-pool ec_pool replicated_pool/image_name
For CephFS, an erasure coded pool can be set as the default data pool during
For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.
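A sketch of the file-layout approach, assuming a file system named ``cephfs``, an erasure-coded pool named ``ec_pool``, and a CephFS mount at ``/mnt/cephfs`` (all of these names are assumptions):
.. prompt:: bash $
ceph osd pool set ec_pool allow_ec_overwrites true
ceph fs add_data_pool cephfs ec_pool
mkdir /mnt/cephfs/ecdir
setfattr -n ceph.dir.layout.pool -v ec_pool /mnt/cephfs/ecdir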
Erasure coded pool and cache tiering
------------------------------------
Erasure-coded pools and cache tiering
-------------------------------------
Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
before the erasure coded pool.
Erasure-coded pools require more resources than replicated pools and
lack some of the functionality supported by replicated pools (for example, omap).
To overcome these limitations, one can set up a `cache tier <../cache-tiering>`_
before setting up the erasure-coded pool.
For instance, if the pool *hot-storage* is made of fast storage:
For example, if the pool *hot-storage* is made of fast storage, the following commands
will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode:
.. prompt:: bash $
@ -207,58 +211,60 @@ For instance, if the pool *hot-storage* is made of fast storage:
ceph osd tier cache-mode hot-storage writeback
ceph osd tier set-overlay ecpool hot-storage
will place the *hot-storage* pool as tier of *ecpool* in *writeback*
mode so that every write and read to the *ecpool* are actually using
the *hot-storage* and benefit from its flexibility and speed.
The result is that every write and read to the *ecpool* actually uses
the *hot-storage* pool and benefits from its flexibility and speed.
More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note however that cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.
Erasure coded pool recovery
Erasure-coded pool recovery
---------------------------
If an erasure coded pool loses some data shards, it must recover them from others.
This involves reading from the remaining shards, reconstructing the data, and
If an erasure-coded pool loses any data shards, it must recover them from others.
This recovery involves reading from the remaining shards, reconstructing the data, and
writing new shards.
In Octopus and later releases, erasure-coded pools can recover as long as there are at least *K* shards
available. (With fewer than *K* shards, you have actually lost data!)
Prior to Octopus, erasure coded pools required at least ``min_size`` shards to be
available, even if ``min_size`` is greater than ``K``. We recommend ``min_size``
be ``K+2`` or more to prevent loss of writes and data.
This conservative decision was made out of an abundance of caution when
designing the new pool mode. As a result pools with lost OSDs but without
complete loss of any data were unable to recover and go active
without manual intervention to temporarily change the ``min_size`` setting.
Prior to Octopus, erasure-coded pools required that at least ``min_size`` shards be
available, even if ``min_size`` was greater than ``K``. This was a conservative
decision made out of an abundance of caution when designing the new pool
mode. As a result, however, pools with lost OSDs but without complete data loss were
unable to recover and go active without manual intervention to temporarily change
the ``min_size`` setting.
We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes and
loss of data.
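For example, for a pool created with ``k=4 m=2``, a ``min_size`` of ``K+1 = 5`` could be set as follows (a sketch; the pool name is an assumption):
.. prompt:: bash $
ceph osd pool set ecpool min_size 5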
Glossary
--------
*chunk*
when the encoding function is called, it returns chunks of the same
size. Data chunks which can be concatenated to reconstruct the original
object and coding chunks which can be used to rebuild a lost chunk.
When the encoding function is called, it returns chunks of the same size as each other. There are two
kinds of chunks: (1) *data chunks*, which can be concatenated to reconstruct the original object, and
(2) *coding chunks*, which can be used to rebuild a lost chunk.
*K*
the number of data *chunks*, i.e. the number of *chunks* in which the
original object is divided. For instance if *K* = 2 a 10KB object
will be divided into *K* objects of 5KB each.
The number of data chunks into which an object is divided. For example, if *K* = 2, then a 10KB object
is divided into two chunks of 5KB each.
*M*
the number of coding *chunks*, i.e. the number of additional *chunks*
computed by the encoding functions. If there are 2 coding *chunks*,
it means 2 OSDs can be out without losing data.
The number of coding chunks computed by the encoding function. *M* is equal to the number of OSDs that can
be missing from the cluster without the cluster suffering data loss. For example, if there are two coding
chunks, then two OSDs can be missing without data loss.
Table of content
----------------
Table of contents
-----------------
.. toctree::
:maxdepth: 1
:maxdepth: 1
erasure-code-profile
erasure-code-jerasure
erasure-code-isa
erasure-code-lrc
erasure-code-shec
erasure-code-clay
erasure-code-profile
erasure-code-jerasure
erasure-code-isa
erasure-code-lrc
erasure-code-shec
erasure-code-clay

File diff suppressed because it is too large Load Diff

View File

@ -3,35 +3,38 @@
=========================
High availability and high reliability require a fault-tolerant approach to
managing hardware and software issues. Ceph has no single point-of-failure, and
can service requests for data in a "degraded" mode. Ceph's `data placement`_
introduces a layer of indirection to ensure that data doesn't bind directly to
particular OSD addresses. This means that tracking down system faults requires
finding the `placement group`_ and the underlying OSDs at root of the problem.
managing hardware and software issues. Ceph has no single point of failure and
it can service requests for data even when in a "degraded" mode. Ceph's `data
placement`_ introduces a layer of indirection to ensure that data doesn't bind
directly to specific OSDs. For this reason, tracking system faults
requires finding the `placement group`_ (PG) and the underlying OSDs at the
root of the problem.
.. tip:: A fault in one part of the cluster may prevent you from accessing a
particular object, but that doesn't mean that you cannot access other objects.
When you run into a fault, don't panic. Just follow the steps for monitoring
your OSDs and placement groups. Then, begin troubleshooting.
.. tip:: A fault in one part of the cluster might prevent you from accessing a
particular object, but that doesn't mean that you are prevented from
accessing other objects. When you run into a fault, don't panic. Just
follow the steps for monitoring your OSDs and placement groups, and then
begin troubleshooting.
Ceph is generally self-repairing. However, when problems persist, monitoring
OSDs and placement groups will help you identify the problem.
Ceph is self-repairing. However, when problems persist, monitoring OSDs and
placement groups will help you identify the problem.
Monitoring OSDs
===============
An OSD's status is either in the cluster (``in``) or out of the cluster
(``out``); and, it is either up and running (``up``), or it is down and not
running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster
(you can read and write data) or it is ``out`` of the cluster. If it was
``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate
placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will
not assign placement groups to the OSD. If an OSD is ``down``, it should also be
``out``.
An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is
either running and reachable (``up``), or it is not running and not
reachable (``down``).
.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster
will not be in a healthy state.
If an OSD is ``up``, it may be either ``in`` service (clients can read and
write data) or it is ``out`` of service. If the OSD was ``in`` but then, due to a failure or a manual action, was set to the ``out`` state, Ceph will migrate placement groups to other OSDs in order to maintain the configured redundancy.
If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
If an OSD is ``down``, it will also be ``out``.
.. note:: If an OSD is ``down`` and ``in``, there is a problem and this
indicates that the cluster is not in a healthy state.
.. ditaa::
@ -50,129 +53,128 @@ not assign placement groups to the OSD. If an OSD is ``down``, it should also be
| | | |
+----------------+ +----------------+
If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
you may notice that the cluster does not always echo back ``HEALTH OK``. Don't
panic. With respect to OSDs, you should expect that the cluster will **NOT**
echo ``HEALTH OK`` in a few expected circumstances:
If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. Don't
panic. There are certain circumstances in which it is expected and normal that
the cluster will **NOT** show ``HEALTH OK``:
#. You haven't started the cluster yet (it won't respond).
#. You have just started or restarted the cluster and it's not ready yet,
because the placement groups are getting created and the OSDs are in
the process of peering.
#. You just added or removed an OSD.
#. You just have modified your cluster map.
#. You haven't started the cluster yet.
#. You have just started or restarted the cluster and it's not ready to show
health statuses yet, because the PGs are in the process of being created and
the OSDs are in the process of peering.
#. You have just added or removed an OSD.
#. You have just modified your cluster map.
An important aspect of monitoring OSDs is to ensure that when the cluster
is up and running that all OSDs that are ``in`` the cluster are ``up`` and
running, too. To see if all OSDs are running, execute:
Checking to see if OSDs are ``up`` and running is an important aspect of monitoring them:
whenever the cluster is up and running, every OSD that is ``in`` the cluster should also
be ``up`` and running. To see if all of the cluster's OSDs are running, run the following
command:
.. prompt:: bash $
ceph osd stat
ceph osd stat
The result should tell you the total number of OSDs (x),
how many are ``up`` (y), how many are ``in`` (z) and the map epoch (eNNNN). ::
The output provides the following information: the total number of OSDs (x),
how many OSDs are ``up`` (y), how many OSDs are ``in`` (z), and the map epoch (eNNNN). ::
x osds: y up, z in; epoch: eNNNN
x osds: y up, z in; epoch: eNNNN
If the number of OSDs that are ``in`` the cluster is more than the number of
OSDs that are ``up``, execute the following command to identify the ``ceph-osd``
If the number of OSDs that are ``in`` the cluster is greater than the number of
OSDs that are ``up``, run the following command to identify the ``ceph-osd``
daemons that are not running:
.. prompt:: bash $
ceph osd tree
ceph osd tree
::
#ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 2.00000 pool openstack
-3 2.00000 rack dell-2950-rack-A
-2 2.00000 host dell-2950-A1
0 ssd 1.00000 osd.0 up 1.00000 1.00000
1 ssd 1.00000 osd.1 down 1.00000 1.00000
#ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 2.00000 pool openstack
-3 2.00000 rack dell-2950-rack-A
-2 2.00000 host dell-2950-A1
0 ssd 1.00000 osd.0 up 1.00000 1.00000
1 ssd 1.00000 osd.1 down 1.00000 1.00000
.. tip:: The ability to search through a well-designed CRUSH hierarchy may help
you troubleshoot your cluster by identifying the physical locations faster.
.. tip:: Searching through a well-designed CRUSH hierarchy to identify the physical
locations of particular OSDs might help you troubleshoot your cluster.
If an OSD is ``down``, start it:
If an OSD is ``down``, start it by running the following command:
.. prompt:: bash $
sudo systemctl start ceph-osd@1
sudo systemctl start ceph-osd@1
For problems associated with OSDs that have stopped or won't restart, see `OSD Not Running`_.
See `OSD Not Running`_ for problems associated with OSDs that have stopped or
won't restart.
PG Sets
=======
When CRUSH assigns placement groups to OSDs, it looks at the number of replicas
for the pool and assigns the placement group to OSDs such that each replica of
the placement group gets assigned to a different OSD. For example, if the pool
requires three replicas of a placement group, CRUSH may assign them to
``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a
pseudo-random placement that will take into account failure domains you set in
your `CRUSH map`_, so you will rarely see placement groups assigned to nearest
neighbor OSDs in a large cluster.
When CRUSH assigns a PG to OSDs, it takes note of how many replicas of the PG
are required by the pool and then assigns each replica to a different OSD.
For example, if the pool requires three replicas of a PG, CRUSH might assign
them individually to ``osd.1``, ``osd.2`` and ``osd.3``. CRUSH seeks a
pseudo-random placement that takes into account the failure domains that you
have set in your `CRUSH map`_; for this reason, PGs are rarely assigned to
immediately adjacent OSDs in a large cluster.
Ceph processes a client request using the **Acting Set**, which is the set of
OSDs that will actually handle the requests since they have a full and working
version of a placement group shard. The set of OSDs that should contain a shard
of a particular placement group as the **Up Set**, i.e. where data is
moved/copied to (or planned to be).
Ceph processes client requests with the **Acting Set** of OSDs: this is the set
of OSDs that currently have a full and working version of a PG shard and that
are therefore responsible for handling requests. By contrast, the **Up Set** is
the set of OSDs that contain a shard of a specific PG. Data is moved or copied
to the **Up Set**, or is planned to be moved or copied there. See
:ref:`Placement Group Concepts <rados_operations_pg_concepts>`.
In some cases, an OSD in the Acting Set is ``down`` or otherwise not able to
service requests for objects in the placement group. When these situations
arise, don't panic. Common examples include:
Sometimes an OSD in the Acting Set is ``down`` or otherwise unable to
service requests for objects in the PG. When this kind of situation
arises, don't panic. Common examples of such a situation include:
- You added or removed an OSD. Then, CRUSH reassigned the placement group to
other OSDs--thereby changing the composition of the Acting Set and spawning
the migration of data with a "backfill" process.
- You added or removed an OSD, CRUSH reassigned the PG to
other OSDs, and this reassignment changed the composition of the Acting Set and triggered
the migration of data by means of a "backfill" process.
- An OSD was ``down``, was restarted, and is now ``recovering``.
- An OSD in the Acting Set is ``down`` or unable to service requests,
- An OSD in the Acting Set is ``down`` or unable to service requests,
and another OSD has temporarily assumed its duties.
In most cases, the Up Set and the Acting Set are identical. When they are not,
it may indicate that Ceph is migrating the PG (it's remapped), an OSD is
recovering, or that there is a problem (i.e., Ceph usually echoes a "HEALTH
WARN" state with a "stuck stale" message in such scenarios).
Typically, the Up Set and the Acting Set are identical. When they are not, it
might indicate that Ceph is migrating the PG (in other words, that the PG has
been remapped), that an OSD is recovering, or that there is a problem with the
cluster (in such scenarios, Ceph usually shows a "HEALTH WARN" state with a
"stuck stale" message).
To retrieve a list of placement groups, execute:
To retrieve a list of PGs, run the following command:
.. prompt:: bash $
ceph pg dump
To view which OSDs are within the Acting Set or the Up Set for a given placement
group, execute:
ceph pg dump
To see which OSDs are within the Acting Set and the Up Set for a specific PG, run the following command:
.. prompt:: bash $
ceph pg map {pg-num}
ceph pg map {pg-num}
The result should tell you the osdmap epoch (eNNN), the placement group number
({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set
The output provides the following information: the osdmap epoch (eNNN), the PG number
({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set
(acting[])::
osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2]
osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2]
.. note:: If the Up Set and Acting Set do not match, this may be an indicator
that the cluster rebalancing itself or of a potential problem with
.. note:: If the Up Set and the Acting Set do not match, this might indicate
that the cluster is rebalancing itself or that there is a problem with
the cluster.
Peering
=======
Before you can write data to a placement group, it must be in an ``active``
state, and it **should** be in a ``clean`` state. For Ceph to determine the
current state of a placement group, the primary OSD of the placement group
(i.e., the first OSD in the acting set), peers with the secondary and tertiary
OSDs to establish agreement on the current state of the placement group
(assuming a pool with 3 replicas of the PG).
Before you can write data to a PG, it must be in an ``active`` state and it
will preferably be in a ``clean`` state. For Ceph to determine the current
state of a PG, peering must take place. That is, the primary OSD of the PG
(that is, the first OSD in the Acting Set) must peer with the secondary and
tertiary OSDs so that consensus on the current state of the PG can be established. In
the following diagram, we assume a pool with three replicas of the PG:
.. ditaa::
@ -187,109 +189,110 @@ OSDs to establish agreement on the current state of the placement group
| Peering |
| |
| Request To |
| Peer |
|----------------------------->|
| Peer |
|----------------------------->|
|<-----------------------------|
| Peering |
The OSDs also report their status to the monitor. See `Configuring Monitor/OSD
Interaction`_ for details. To troubleshoot peering issues, see `Peering
The OSDs also report their status to the monitor. For details, see `Configuring Monitor/OSD
Interaction`_. To troubleshoot peering issues, see `Peering
Failure`_.
Monitoring Placement Group States
=================================
Monitoring PG States
====================
If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
you may notice that the cluster does not always echo back ``HEALTH OK``. After
you check to see if the OSDs are running, you should also check placement group
states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a
number of placement group peering-related circumstances:
If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. After
first checking to see if the OSDs are running, you should also check PG
states. There are certain PG-peering-related circumstances in which it is expected
and normal that the cluster will **NOT** show ``HEALTH OK``:
#. You have just created a pool and placement groups haven't peered yet.
#. The placement groups are recovering.
#. You have just created a pool and the PGs haven't peered yet.
#. The PGs are recovering.
#. You have just added an OSD to or removed an OSD from the cluster.
#. You have just modified your CRUSH map and your placement groups are migrating.
#. There is inconsistent data in different replicas of a placement group.
#. Ceph is scrubbing a placement group's replicas.
#. You have just modified your CRUSH map and your PGs are migrating.
#. There is inconsistent data in different replicas of a PG.
#. Ceph is scrubbing a PG's replicas.
#. Ceph doesn't have enough storage capacity to complete backfilling operations.
If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases, you
may need to take action. An important aspect of monitoring placement groups is
to ensure that when the cluster is up and running that all placement groups are
``active``, and preferably in the ``clean`` state. To see the status of all
placement groups, execute:
If one of these circumstances causes Ceph to show ``HEALTH WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases, however, you
might need to take action. An important aspect of monitoring PGs is to check their
status as ``active`` and ``clean``: that is, it is important to ensure that, when the
cluster is up and running, all PGs are ``active`` and (preferably) ``clean``.
To see the status of every PG, run the following command:
.. prompt:: bash $
ceph pg stat
ceph pg stat
The result should tell you the total number of placement groups (x), how many
placement groups are in a particular state such as ``active+clean`` (y) and the
The output provides the following information: the total number of PGs (x), how many
PGs are in a particular state such as ``active+clean`` (y), and the
amount of data stored (z). ::
x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail
x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail
.. note:: It is common for Ceph to report multiple states for placement groups.
.. note:: It is common for Ceph to report multiple states for PGs (for example,
``active+clean``, ``active+clean+remapped``, and ``active+clean+scrubbing``).
In addition to the placement group states, Ceph will also echo back the amount of
storage capacity used (aa), the amount of storage capacity remaining (bb), and the total
storage capacity for the placement group. These numbers can be important in a
few cases:
Here Ceph shows not only the PG states, but also storage capacity used (aa),
the amount of storage capacity remaining (bb), and the total storage capacity
of the PG. These values can be important in a few cases:
- You are reaching your ``near full ratio`` or ``full ratio``.
- Your data is not getting distributed across the cluster due to an
error in your CRUSH configuration.
- The cluster is reaching its ``near full ratio`` or ``full ratio``.
- Data is not being distributed across the cluster due to an error in the
CRUSH configuration.
.. topic:: Placement Group IDs
Placement group IDs consist of the pool number (not pool name) followed
by a period (.) and the placement group ID--a hexadecimal number. You
can view pool numbers and their names from the output of ``ceph osd
lspools``. For example, the first pool created corresponds to
pool number ``1``. A fully qualified placement group ID has the
PG IDs consist of the pool number (not the pool name) followed by a period
(.) and a hexadecimal number. You can view pool numbers and their names in
the output of ``ceph osd lspools``. For example, the first pool that was
created corresponds to pool number ``1``. A fully qualified PG ID has the
following form::
{pool-num}.{pg-id}
And it typically looks like this::
1.1f
To retrieve a list of placement groups, execute the following:
{pool-num}.{pg-id}
It typically resembles the following::
1.1701b
To retrieve a list of PGs, run the following command:
.. prompt:: bash $
ceph pg dump
You can also format the output in JSON format and save it to a file:
ceph pg dump
To format the output in JSON format and save it to a file, run the following command:
.. prompt:: bash $
ceph pg dump -o {filename} --format=json
ceph pg dump -o {filename} --format=json
To query a particular placement group, execute the following:
To query a specific PG, run the following command:
.. prompt:: bash $
ceph pg {poolnum}.{pg-id} query
ceph pg {poolnum}.{pg-id} query
Ceph will output the query in JSON format.
The following subsections describe the common pg states in detail.
The following subsections describe the most common PG states in detail.
Creating
--------
When you create a pool, it will create the number of placement groups you
specified. Ceph will echo ``creating`` when it is creating one or more
placement groups. Once they are created, the OSDs that are part of a placement
group's Acting Set will peer. Once peering is complete, the placement group
status should be ``active+clean``, which means a Ceph client can begin writing
to the placement group.
PGs are created when you create a pool: the command that creates a pool
specifies the total number of PGs for that pool, and when the pool is created
all of those PGs are created as well. Ceph will echo ``creating`` while it is
creating PGs. After the PG(s) are created, the OSDs that are part of a PG's
Acting Set will peer. Once peering is complete, the PG status should be
``active+clean``. This status means that Ceph clients can begin writing to
the PG.
.. ditaa::
@ -300,43 +303,38 @@ to the placement group.
Peering
-------
When Ceph is Peering a placement group, Ceph is bringing the OSDs that
store the replicas of the placement group into **agreement about the state**
of the objects and metadata in the placement group. When Ceph completes peering,
this means that the OSDs that store the placement group agree about the current
state of the placement group. However, completion of the peering process does
**NOT** mean that each replica has the latest contents.
When a PG peers, the OSDs that store the replicas of its data converge on an
agreed state of the data and metadata within that PG. When peering is complete,
those OSDs agree about the state of that PG. However, completion of the peering
process does **NOT** mean that each replica has the latest contents.
.. topic:: Authoritative History
Ceph will **NOT** acknowledge a write operation to a client, until
all OSDs of the acting set persist the write operation. This practice
ensures that at least one member of the acting set will have a record
of every acknowledged write operation since the last successful
peering operation.
Ceph will **NOT** acknowledge a write operation to a client until that write
operation is persisted by every OSD in the Acting Set. This practice ensures
that at least one member of the Acting Set will have a record of every
acknowledged write operation since the last successful peering operation.
With an accurate record of each acknowledged write operation, Ceph can
construct and disseminate a new authoritative history of the placement
group--a complete, and fully ordered set of operations that, if performed,
would bring an OSDs copy of a placement group up to date.
Given an accurate record of each acknowledged write operation, Ceph can
construct a new authoritative history of the PG--that is, a complete and
fully ordered set of operations that, if performed, would bring an OSD's
copy of the PG up to date.
Active
------
Once Ceph completes the peering process, a placement group may become
``active``. The ``active`` state means that the data in the placement group is
generally available in the primary placement group and the replicas for read
and write operations.
After Ceph has completed the peering process, a PG should become ``active``.
The ``active`` state means that the data in the PG is generally available for
read and write operations in the primary and replica OSDs.
Clean
-----
When a placement group is in the ``clean`` state, the primary OSD and the
replica OSDs have successfully peered and there are no stray replicas for the
placement group. Ceph replicated all objects in the placement group the correct
number of times.
When a PG is in the ``clean`` state, all OSDs holding its data and metadata
have successfully peered and there are no stray replicas. Ceph has replicated
all objects in the PG the correct number of times.
Degraded
@ -344,143 +342,147 @@ Degraded
When a client writes an object to the primary OSD, the primary OSD is
responsible for writing the replicas to the replica OSDs. After the primary OSD
writes the object to storage, the placement group will remain in a ``degraded``
writes the object to storage, the PG will remain in a ``degraded``
state until the primary OSD has received an acknowledgement from the replica
OSDs that Ceph created the replica objects successfully.
The reason a placement group can be ``active+degraded`` is that an OSD may be
``active`` even though it doesn't hold all of the objects yet. If an OSD goes
``down``, Ceph marks each placement group assigned to the OSD as ``degraded``.
The OSDs must peer again when the OSD comes back online. However, a client can
still write a new object to a ``degraded`` placement group if it is ``active``.
The reason that a PG can be ``active+degraded`` is that an OSD can be
``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes
``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must
peer again when the OSD comes back online. However, a client can still write a
new object to a ``degraded`` PG if it is ``active``.
If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the
If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark the
``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
to another OSD. The time between being marked ``down`` and being marked ``out``
is controlled by ``mon osd down out interval``, which is set to ``600`` seconds
is determined by ``mon_osd_down_out_interval``, which is set to ``600`` seconds
by default.
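To check or adjust this interval at runtime, commands of the following form can be used (shown here with the default value of 600 seconds):
.. prompt:: bash $
ceph config get mon mon_osd_down_out_interval
ceph config set mon mon_osd_down_out_interval 600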
A placement group can also be ``degraded``, because Ceph cannot find one or more
objects that Ceph thinks should be in the placement group. While you cannot
read or write to unfound objects, you can still access all of the other objects
in the ``degraded`` placement group.
A PG can also be in the ``degraded`` state because there are one or more
objects that Ceph expects to find in the PG but that Ceph cannot find. Although
you cannot read or write to unfound objects, you can still access all of the other
objects in the ``degraded`` PG.
Recovering
----------
Ceph was designed for fault-tolerance at a scale where hardware and software
problems are ongoing. When an OSD goes ``down``, its contents may fall behind
the current state of other replicas in the placement groups. When the OSD is
back ``up``, the contents of the placement groups must be updated to reflect the
current state. During that time period, the OSD may reflect a ``recovering``
state.
Ceph was designed for fault-tolerance, because hardware and other server
problems are expected or even routine. When an OSD goes ``down``, its contents
might fall behind the current state of other replicas in the PGs. When the OSD
has returned to the ``up`` state, the contents of the PGs must be updated to
reflect that current state. During that time period, the OSD might be in a
``recovering`` state.
Recovery is not always trivial, because a hardware failure might cause a
cascading failure of multiple OSDs. For example, a network switch for a rack or
cabinet may fail, which can cause the OSDs of a number of host machines to fall
behind the current state of the cluster. Each one of the OSDs must recover once
the fault is resolved.
cabinet might fail, which can cause the OSDs of a number of host machines to
fall behind the current state of the cluster. In such a scenario, general
recovery is possible only if each of the OSDs recovers after the fault has been
resolved.
Ceph provides a number of settings to balance the resource contention between
new service requests and the need to recover data objects and restore the
placement groups to the current state. The ``osd recovery delay start`` setting
allows an OSD to restart, re-peer and even process some replay requests before
starting the recovery process. The ``osd
recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail,
restart and re-peer at staggered rates. The ``osd recovery max active`` setting
limits the number of recovery requests an OSD will entertain simultaneously to
prevent the OSD from failing to serve . The ``osd recovery max chunk`` setting
limits the size of the recovered data chunks to prevent network congestion.
Ceph provides a number of settings that determine how the cluster balances the
resource contention between the need to process new service requests and the
need to recover data objects and restore the PGs to the current state. The
``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and
even process some replay requests before starting the recovery process. The
``osd_recovery_thread_timeout`` setting determines the duration of a thread
timeout, because multiple OSDs might fail, restart, and re-peer at staggered
rates. The ``osd_recovery_max_active`` setting limits the number of recovery
requests an OSD can entertain simultaneously, in order to prevent the OSD from
failing to serve requests. The ``osd_recovery_max_chunk`` setting limits the size of
the recovered data chunks, in order to prevent network congestion.
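A sketch of inspecting and tuning one of these options at runtime follows; the value shown is an assumption for illustration, not a recommendation:
.. prompt:: bash $
ceph config get osd osd_recovery_max_active
ceph config set osd osd_recovery_max_active 3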
Back Filling
------------
When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs
in the cluster to the newly added OSD. Forcing the new OSD to accept the
reassigned placement groups immediately can put excessive load on the new OSD.
Back filling the OSD with the placement groups allows this process to begin in
the background. Once backfilling is complete, the new OSD will begin serving
requests when it is ready.
When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are
already in the cluster to the newly added OSD. It can put excessive load on the
new OSD to force it to immediately accept the reassigned PGs. Back filling the
OSD with the PGs allows this process to begin in the background. After the
backfill operations have completed, the new OSD will begin serving requests as
soon as it is ready.
During the backfill operations, you may see one of several states:
During the backfill operations, you might see one of several states:
``backfill_wait`` indicates that a backfill operation is pending, but is not
underway yet; ``backfilling`` indicates that a backfill operation is underway;
and, ``backfill_toofull`` indicates that a backfill operation was requested,
but couldn't be completed due to insufficient storage capacity. When a
placement group cannot be backfilled, it may be considered ``incomplete``.
yet underway; ``backfilling`` indicates that a backfill operation is currently
underway; and ``backfill_toofull`` indicates that a backfill operation was
requested but couldn't be completed due to insufficient storage capacity. When
a PG cannot be backfilled, it might be considered ``incomplete``.
The ``backfill_toofull`` state may be transient. It is possible that as PGs
are moved around, space may become available. The ``backfill_toofull`` is
similar to ``backfill_wait`` in that as soon as conditions change
backfill can proceed.
The ``backfill_toofull`` state might be transient. It might happen that, as PGs
are moved around, space becomes available. The ``backfill_toofull`` state is
similar to ``backfill_wait`` in that backfill operations can proceed as soon as
conditions change.
Ceph provides a number of settings to manage the load spike associated with
reassigning placement groups to an OSD (especially a new OSD). By default,
``osd_max_backfills`` sets the maximum number of concurrent backfills to and from
an OSD to 1. The ``backfill full ratio`` enables an OSD to refuse a
backfill request if the OSD is approaching its full ratio (90%, by default) and
change with ``ceph osd set-backfillfull-ratio`` command.
If an OSD refuses a backfill request, the ``osd backfill retry interval``
enables an OSD to retry the request (after 30 seconds, by default). OSDs can
also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan
intervals (64 and 512, by default).
Ceph provides a number of settings to manage the load spike associated with the
reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills``
setting specifies the maximum number of concurrent backfills to and from an OSD
(default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a
backfill request if the OSD is approaching its full ratio (default: 90%). This
setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If
an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting
allows an OSD to retry the request after a certain interval (default: 30
seconds). OSDs can also set ``osd_backfill_scan_min`` and
``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and
512, respectively).
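These settings can also be inspected and adjusted at runtime. The following is
a sketch; the values shown are only examples:

.. prompt:: bash $

   ceph config set osd osd_max_backfills 2
   ceph osd set-backfillfull-ratio 0.9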
Remapped
--------
When the Acting Set that services a placement group changes, the data migrates
from the old acting set to the new acting set. It may take some time for a new
primary OSD to service requests. So it may ask the old primary to continue to
service requests until the placement group migration is complete. Once data
migration completes, the mapping uses the primary OSD of the new acting set.
When the Acting Set that services a PG changes, the data migrates from the old
Acting Set to the new Acting Set. Because it might take time for the new
primary OSD to begin servicing requests, the old primary OSD might be required
to continue servicing requests until the PG data migration is complete. After
data migration has completed, the mapping uses the primary OSD of the new
Acting Set.
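To see which PGs are currently remapped, you can filter the PG listing by
state. This is a sketch; ``ceph pg ls`` accepts a PG state as a filter:

.. prompt:: bash $

   ceph pg ls remapped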
Stale
-----
While Ceph uses heartbeats to ensure that hosts and daemons are running, the
``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
reporting statistics in a timely manner (e.g., a temporary network fault). By
default, OSD daemons report their placement group, up through, boot and failure
statistics every half second (i.e., ``0.5``), which is more frequent than the
heartbeat thresholds. If the **Primary OSD** of a placement group's acting set
fails to report to the monitor or if other OSDs have reported the primary OSD
``down``, the monitors will mark the placement group ``stale``.
Although Ceph uses heartbeats in order to ensure that hosts and daemons are
running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are
not reporting statistics in a timely manner (for example, there might be a
temporary network fault). By default, OSD daemons report their PG, up through,
boot, and failure statistics every half second (that is, in accordance with a
value of ``0.5``), which is more frequent than the reports defined by the
heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report
to the monitor or if other OSDs have reported the primary OSD ``down``, the
monitors will mark the PG ``stale``.
When you start your cluster, it is common to see the ``stale`` state until
the peering process completes. After your cluster has been running for awhile,
seeing placement groups in the ``stale`` state indicates that the primary OSD
for those placement groups is ``down`` or not reporting placement group statistics
to the monitor.
When you start your cluster, it is common to see the ``stale`` state until the
peering process completes. After your cluster has been running for a while,
however, seeing PGs in the ``stale`` state indicates that the primary OSD for
those PGs is ``down`` or not reporting PG statistics to the monitor.
Identifying Troubled PGs
========================
As previously noted, a placement group is not necessarily problematic just
because its state is not ``active+clean``. Generally, Ceph's ability to self
repair may not be working when placement groups get stuck. The stuck states
include:
As previously noted, a PG is not necessarily having problems just because its
state is not ``active+clean``. When PGs are stuck, this might indicate that
Ceph cannot perform self-repairs. The stuck states include:
- **Unclean**: Placement groups contain objects that are not replicated the
desired number of times. They should be recovering.
- **Inactive**: Placement groups cannot process reads or writes because they
are waiting for an OSD with the most up-to-date data to come back ``up``.
- **Stale**: Placement groups are in an unknown state, because the OSDs that
host them have not reported to the monitor cluster in a while (configured
by ``mon osd report timeout``).
- **Unclean**: PGs contain objects that have not been replicated the desired
number of times. Under normal conditions, it can be assumed that these PGs
are recovering.
- **Inactive**: PGs cannot process reads or writes because they are waiting for
an OSD that has the most up-to-date data to come back ``up``.
- **Stale**: PGs are in an unknown state, because the OSDs that host them have
not reported to the monitor cluster for a certain period of time (determined
by ``mon_osd_report_timeout``).
To identify stuck placement groups, execute the following:
To identify stuck PGs, run the following command:
.. prompt:: bash $
ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
See `Placement Group Subsystem`_ for additional details. To troubleshoot
stuck placement groups, see `Troubleshooting PG Errors`_.
For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs,
see `Troubleshooting PG Errors`_.
Finding an Object Location
@ -491,55 +493,54 @@ To store object data in the Ceph Object Store, a Ceph client must:
#. Set an object name
#. Specify a `pool`_
The Ceph client retrieves the latest cluster map and the CRUSH algorithm
calculates how to map the object to a `placement group`_, and then calculates
how to assign the placement group to an OSD dynamically. To find the object
location, all you need is the object name and the pool name. For example:
The Ceph client retrieves the latest cluster map, the CRUSH algorithm
calculates how to map the object to a PG, and then the algorithm calculates how
to dynamically assign the PG to an OSD. To find the object location given only
the object name and the pool name, run a command of the following form:
.. prompt:: bash $
ceph osd map {poolname} {object-name} [namespace]
ceph osd map {poolname} {object-name} [namespace]
.. topic:: Exercise: Locate an Object
As an exercise, let's create an object. Specify an object name, a path
to a test file containing some object data and a pool name using the
As an exercise, let's create an object. We can specify an object name, a path
to a test file that contains some object data, and a pool name by using the
``rados put`` command on the command line. For example:
.. prompt:: bash $
rados put {object-name} {file-path} --pool=data
rados put test-object-1 testfile.txt --pool=data
rados put {object-name} {file-path} --pool=data
rados put test-object-1 testfile.txt --pool=data
To verify that the Ceph Object Store stored the object, execute the
following:
To verify that the Ceph Object Store stored the object, run the
following command:
.. prompt:: bash $
rados -p data ls
Now, identify the object location:
To identify the object location, run the following commands:
.. prompt:: bash $
ceph osd map {pool-name} {object-name}
ceph osd map data test-object-1
Ceph should output the object's location. For example::
osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
To remove the test object, simply delete it using the ``rados rm``
command. For example:
Ceph should output the object's location. For example::
osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
To remove the test object, simply delete it by running the ``rados rm``
command. For example:
.. prompt:: bash $
rados rm test-object-1 --pool=data
As the cluster evolves, the object location may change dynamically. One benefit
of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform
the migration manually. See the `Architecture`_ section for details.
of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually
performing the migration. For details, see the `Architecture`_ section.
.. _data placement: ../data-placement
.. _pool: ../pools
View File
@ -2,9 +2,9 @@
Monitoring a Cluster
======================
Once you have a running cluster, you may use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status and metadata server status.
After you have a running cluster, you can use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status, and metadata server status.
Using the command line
======================
@ -13,11 +13,11 @@ Interactive mode
----------------
To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
with no arguments. For example:
with no arguments. For example:
.. prompt:: bash $
ceph
ceph
.. prompt:: ceph>
:prompts: ceph>
@ -30,8 +30,9 @@ with no arguments. For example:
Non-default paths
-----------------
If you specified non-default locations for your configuration or keyring,
you may specify their locations:
If you specified non-default locations for your configuration or keyring when
you installed the cluster, you may specify their locations to the ``ceph`` tool
by running the following command:
.. prompt:: bash $
@ -40,30 +41,32 @@ you may specify their locations:
Checking a Cluster's Status
===========================
After you start your cluster, and before you start reading and/or
writing data, check your cluster's status first.
After you start your cluster, and before you start reading and/or writing data,
you should check your cluster's status.
To check a cluster's status, execute the following:
To check a cluster's status, run the following command:
.. prompt:: bash $
ceph status
Or:
Alternatively, you can run the following command:
.. prompt:: bash $
ceph -s
In interactive mode, type ``status`` and press **Enter**:
In interactive mode, this operation is performed by typing ``status`` and
pressing **Enter**:
.. prompt:: ceph>
:prompts: ceph>
status
Ceph will print the cluster status. For example, a tiny Ceph demonstration
cluster with one of each service may print the following:
Ceph will print the cluster status. For example, a tiny Ceph "demonstration
cluster" that is running one instance of each service (monitor, manager, and
OSD) might print the following:
::
@ -84,33 +87,35 @@ cluster with one of each service may print the following:
pgs: 16 active+clean
.. topic:: How Ceph Calculates Data Usage
How Ceph Calculates Data Usage
------------------------------
The ``usage`` value reflects the *actual* amount of raw storage used. The
``xxx GB / xxx GB`` value means the amount available (the lesser number)
of the overall storage capacity of the cluster. The notional number reflects
the size of the stored data before it is replicated, cloned or snapshotted.
Therefore, the amount of data actually stored typically exceeds the notional
amount stored, because Ceph creates replicas of the data and may also use
storage capacity for cloning and snapshotting.
The ``usage`` value reflects the *actual* amount of raw storage used. The ``xxx
GB / xxx GB`` value means the amount available (the lesser number) out of the
overall storage capacity of the cluster. The notional number reflects the size
of the stored data before it is replicated, cloned or snapshotted. Therefore,
the amount of data actually stored typically exceeds the notional amount
stored, because Ceph creates replicas of the data and may also use storage
capacity for cloning and snapshotting.
Watching a Cluster
==================
In addition to local logging by each daemon, Ceph clusters maintain
a *cluster log* that records high level events about the whole system.
This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by
default), but can also be monitored via the command line.
Each daemon in the Ceph cluster maintains a log of events, and the Ceph cluster
itself maintains a *cluster log* that records high-level events about the
entire Ceph cluster. These events are logged to disk on monitor servers (in
the default location ``/var/log/ceph/ceph.log``), and they can be monitored via
the command line.
To follow the cluster log, use the following command:
To follow the cluster log, run the following command:
.. prompt:: bash $
ceph -w
Ceph will print the status of the system, followed by each log message as it
is emitted. For example:
Ceph will print the status of the system, followed by each log message as it is
added. For example:
::
@ -135,21 +140,20 @@ is emitted. For example:
2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available
In addition to using ``ceph -w`` to print log lines as they are emitted,
use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster
log.
Instead of printing log lines as they are added, you might want to print only
the most recent lines. Run ``ceph log last [n]`` to see the most recent ``n``
lines from the cluster log.
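For example, to print the ten most recent cluster log entries, you might run:

.. prompt:: bash $

   ceph log last 10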
Monitoring Health Checks
========================
Ceph continuously runs various *health checks* against its own status. When
a health check fails, this is reflected in the output of ``ceph status`` (or
``ceph health``). In addition, messages are sent to the cluster log to
indicate when a check fails, and when the cluster recovers.
Ceph continuously runs various *health checks*. When
a health check fails, this failure is reflected in the output of ``ceph status`` and
``ceph health``. The cluster log receives messages that
indicate when a check has failed and when the cluster has recovered.
For example, when an OSD goes down, the ``health`` section of the status
output may be updated as follows:
output is updated as follows:
::
@ -157,7 +161,7 @@ output may be updated as follows:
1 osds down
Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded
At this time, cluster log messages are also emitted to record the failure of the
At the same time, cluster log messages are emitted to record the failure of the
health checks:
::
@ -166,7 +170,7 @@ health checks:
2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)
When the OSD comes back online, the cluster log records the cluster's return
to a health state:
to a healthy state:
::
@ -177,21 +181,23 @@ to a health state:
Network Performance Checks
--------------------------
Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon availability. We
also use the response times to monitor network performance.
While it is possible that a busy OSD could delay a ping response, we can assume
that if a network switch fails multiple delays will be detected between distinct pairs of OSDs.
Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon
availability and network performance. If a single delayed response is detected,
this might indicate nothing more than a busy OSD. But if multiple delays
between distinct pairs of OSDs are detected, this might indicate a failed
network switch, a NIC failure, or a layer 1 failure.
By default we will warn about ping times which exceed 1 second (1000 milliseconds).
By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a
health check (a ``HEALTH_WARN``). For example:
::
HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms)
The health detail will add the combination of OSDs are seeing the delays and by how much. There is a limit of 10
detail line items.
::
In the output of the ``ceph health detail`` command, you can see which OSDs are
experiencing delays and how long the delays are. The output of ``ceph health
detail`` is limited to ten lines. Here is an example of the output you can
expect from the ``ceph health detail`` command::
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms)
Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving
@ -199,11 +205,15 @@ detail line items.
Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec
Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec
To see even more detail and a complete dump of network performance information the ``dump_osd_network`` command can be used. Typically, this would be
sent to a mgr, but it can be limited to a particular OSD's interactions by issuing it to any OSD. The current threshold which defaults to 1 second
(1000 milliseconds) can be overridden as an argument in milliseconds.
To see more detail and to collect a complete dump of network performance
information, use the ``dump_osd_network`` command. This command is usually sent
to a Ceph Manager Daemon, but it can be used to collect information about a
specific OSD's interactions by sending it to that OSD. The default threshold
for a slow heartbeat is 1 second (1000 milliseconds), but this can be
overridden by providing a number of milliseconds as an argument.
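For example, to query a single OSD's view of its heartbeat peers through the
admin socket, a sketch might look like the following (``osd.0`` and the
threshold of ``0`` milliseconds are illustrative choices):

.. prompt:: bash $

   ceph daemon osd.0 dump_osd_network 0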
The following command will show all gathered network performance data by specifying a threshold of 0 and sending to the mgr.
To show all network performance data with a specified threshold of 0, send the
following command to the mgr:
.. prompt:: bash $
@ -287,26 +297,26 @@ The following command will show all gathered network performance data by specify
Muting health checks
Muting Health Checks
--------------------
Health checks can be muted so that they do not affect the overall
reported status of the cluster. Alerts are specified using the health
check code (see :ref:`health-checks`):
Health checks can be muted so that they have no effect on the overall
reported status of the cluster. For example, if the cluster has raised a
single health check and you then mute that health check, the cluster will
report a status of ``HEALTH_OK``.
To mute a specific health check, use the health check code that corresponds to that health check (see :ref:`health-checks`), and
run the following command:
.. prompt:: bash $
ceph health mute <code>
For example, if there is a health warning, muting it will make the
cluster report an overall status of ``HEALTH_OK``. For example, to
mute an ``OSD_DOWN`` alert,:
For example, to mute an ``OSD_DOWN`` health check, run the following command:
.. prompt:: bash $
ceph health mute OSD_DOWN
Mutes are reported as part of the short and long form of the ``ceph health`` command.
Mutes are reported as part of the short and long form of the ``ceph health`` command's output.
For example, in the above scenario, the cluster would report:
.. prompt:: bash $
@ -327,7 +337,7 @@ For example, in the above scenario, the cluster would report:
(MUTED) OSD_DOWN 1 osds down
osd.1 is down
A mute can be explicitly removed with:
A mute can be removed by running the following command:
.. prompt:: bash $
@ -339,56 +349,44 @@ For example:
ceph health unmute OSD_DOWN
A health check mute may optionally have a TTL (time to live)
associated with it, such that the mute will automatically expire
after the specified period of time has elapsed. The TTL is specified as an optional
duration argument, e.g.:
A "health mute" can have a TTL (**T**\ime **T**\o **L**\ive)
associated with it: this means that the mute will automatically expire
after a specified period of time. The TTL is specified as an optional
duration argument, as seen in the following examples:
.. prompt:: bash $
ceph health mute OSD_DOWN 4h # mute for 4 hours
ceph health mute MON_DOWN 15m # mute for 15 minutes
ceph health mute MON_DOWN 15m # mute for 15 minutes
Normally, if a muted health alert is resolved (e.g., in the example
above, the OSD comes back up), the mute goes away. If the alert comes
Normally, if a muted health check is resolved (for example, if the OSD that raised the ``OSD_DOWN`` health check
in the example above has come back up), the mute goes away. If the health check comes
back later, it will be reported in the usual way.
It is possible to make a mute "sticky" such that the mute will remain even if the
alert clears. For example:
It is possible to make a health mute "sticky": this means that the mute will remain even if the
health check clears. For example, to make a health mute "sticky", you might run the following command:
.. prompt:: bash $
ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour
Most health mutes also disappear if the extent of an alert gets worse. For example,
if there is one OSD down, and the alert is muted, the mute will disappear if one
or more additional OSDs go down. This is true for any health alert that involves
a count indicating how much or how many of something is triggering the warning or
error.
Most health mutes disappear if the unhealthy condition that triggered the health check gets worse.
For example, suppose that there is one OSD down and the health check is muted. In that case, if
one or more additional OSDs go down, then the health mute disappears. This behavior occurs in any health check with a threshold value.
Detecting configuration issues
==============================
In addition to the health checks that Ceph continuously runs on its
own status, there are some configuration issues that may only be detected
by an external tool.
Use the `ceph-medic`_ tool to run these additional checks on your Ceph
cluster's configuration.
Checking a Cluster's Usage Stats
================================
To check a cluster's data usage and data distribution among pools, you can
use the ``df`` option. It is similar to Linux ``df``. Execute
the following:
To check a cluster's data usage and data distribution among pools, use the
``df`` command. This option is similar to Linux's ``df`` command. Run the
following command:
.. prompt:: bash $
ceph df
The output of ``ceph df`` looks like this::
The output of ``ceph df`` resembles the following::
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00
@ -400,52 +398,49 @@ The output of ``ceph df`` looks like this::
cephfs.a.meta 2 32 6.8 KiB 6.8 KiB 0 B 22 96 KiB 96 KiB 0 B 0 297 GiB N/A N/A 22 0 B 0 B
cephfs.a.data 3 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 99 GiB N/A N/A 0 0 B 0 B
test 4 32 22 MiB 22 MiB 50 KiB 248 19 MiB 19 MiB 50 KiB 0 297 GiB N/A N/A 248 0 B 0 B
- **CLASS:** for example, "ssd" or "hdd"
- **CLASS:** For example, "ssd" or "hdd".
- **SIZE:** The amount of storage capacity managed by the cluster.
- **AVAIL:** The amount of free space available in the cluster.
- **USED:** The amount of raw storage consumed by user data (excluding
BlueStore's database)
BlueStore's database).
- **RAW USED:** The amount of raw storage consumed by user data, internal
overhead, or reserved capacity.
- **%RAW USED:** The percentage of raw storage used. Use this number in
conjunction with the ``full ratio`` and ``near full ratio`` to ensure that
you are not reaching your cluster's capacity. See `Storage Capacity`_ for
additional details.
overhead, and reserved capacity.
- **%RAW USED:** The percentage of raw storage used. Watch this number in
conjunction with ``full ratio`` and ``near full ratio`` to be forewarned when
your cluster approaches the fullness thresholds. See `Storage Capacity`_.
**POOLS:**
**POOLS:**
The **POOLS** section of the output provides a list of pools and the notional
usage of each pool. The output from this section **DOES NOT** reflect replicas,
clones or snapshots. For example, if you store an object with 1MB of data, the
notional usage will be 1MB, but the actual usage may be 2MB or more depending
on the number of replicas, clones and snapshots.
The POOLS section of the output provides a list of pools and the *notional*
usage of each pool. This section of the output **DOES NOT** reflect replicas,
clones, or snapshots. For example, if you store an object with 1MB of data,
then the notional usage will be 1MB, but the actual usage might be 2MB or more
depending on the number of replicas, clones, and snapshots.
- **ID:** The number of the node within the pool.
- **STORED:** actual amount of data user/Ceph has stored in a pool. This is
similar to the USED column in earlier versions of Ceph but the calculations
(for BlueStore!) are more precise (gaps are properly handled).
- **ID:** The number of the specific node within the pool.
- **STORED:** The actual amount of data that the user has stored in a pool.
This is similar to the USED column in earlier versions of Ceph, but the
calculations (for BlueStore!) are more precise (in that gaps are properly
handled).
- **(DATA):** usage for RBD (RADOS Block Device), CephFS file data, and RGW
- **(DATA):** Usage for RBD (RADOS Block Device), CephFS file data, and RGW
(RADOS Gateway) object data.
- **(OMAP):** key-value pairs. Used primarily by CephFS and RGW (RADOS
- **(OMAP):** Key-value pairs. Used primarily by CephFS and RGW (RADOS
Gateway) for metadata storage.
- **OBJECTS:** The notional number of objects stored per pool. "Notional" is
defined above in the paragraph immediately under "POOLS".
- **USED:** The space allocated for a pool over all OSDs. This includes
replication, allocation granularity, and erasure-coding overhead. Compression
savings and object content gaps are also taken into account. BlueStore's
database is not included in this amount.
- **OBJECTS:** The notional number of objects stored per pool (that is, the
number of objects other than replicas, clones, or snapshots).
- **USED:** The space allocated for a pool over all OSDs. This includes space
for replication, space for allocation granularity, and space for the overhead
associated with erasure-coding. Compression savings and object-content gaps
are also taken into account. However, BlueStore's database is not included in
the amount reported under USED.
- **(DATA):** object usage for RBD (RADOS Block Device), CephFS file data, and RGW
(RADOS Gateway) object data.
- **(OMAP):** object key-value pairs. Used primarily by CephFS and RGW (RADOS
- **(DATA):** Object usage for RBD (RADOS Block Device), CephFS file data,
and RGW (RADOS Gateway) object data.
- **(OMAP):** Object key-value pairs. Used primarily by CephFS and RGW (RADOS
Gateway) for metadata storage.
- **%USED:** The notional percentage of storage used per pool.
@ -454,50 +449,51 @@ on the number of replicas, clones and snapshots.
- **QUOTA OBJECTS:** The number of quota objects.
- **QUOTA BYTES:** The number of bytes in the quota objects.
- **DIRTY:** The number of objects in the cache pool that have been written to
the cache pool but have not been flushed yet to the base pool. This field is
only available when cache tiering is in use.
- **USED COMPR:** amount of space allocated for compressed data (i.e. this
includes compressed data plus all the allocation, replication and erasure
coding overhead).
- **UNDER COMPR:** amount of data passed through compression (summed over all
replicas) and beneficial enough to be stored in a compressed form.
the cache pool but have not yet been flushed to the base pool. This field is
available only when cache tiering is in use.
- **USED COMPR:** The amount of space allocated for compressed data. This
includes compressed data in addition to all of the space required for
replication, allocation granularity, and erasure-coding overhead.
- **UNDER COMPR:** The amount of data that has passed through compression
(summed over all replicas) and that is worth storing in a compressed form.
.. note:: The numbers in the POOLS section are notional. They are not
inclusive of the number of replicas, snapshots or clones. As a result, the
sum of the USED and %USED amounts will not add up to the USED and %USED
amounts in the RAW section of the output.
.. note:: The numbers in the POOLS section are notional. They do not include
the number of replicas, clones, or snapshots. As a result, the sum of the
USED and %USED amounts in the POOLS section of the output will not be equal
to the sum of the USED and %USED amounts in the RAW section of the output.
.. note:: The MAX AVAIL value is a complicated function of the replication
or erasure code used, the CRUSH rule that maps storage to devices, the
utilization of those devices, and the configured ``mon_osd_full_ratio``.
.. note:: The MAX AVAIL value is a complicated function of the replication or
the kind of erasure coding used, the CRUSH rule that maps storage to
devices, the utilization of those devices, and the configured
``mon_osd_full_ratio`` setting.
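To see the fullness thresholds that are currently in effect, you can inspect
the OSD map. This is a sketch only:

.. prompt:: bash $

   ceph osd dump | grep ratio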
Checking OSD Status
===================
You can check OSDs to ensure they are ``up`` and ``in`` by executing the
To check if OSDs are ``up`` and ``in``, run the
following command:
.. prompt:: bash #
ceph osd stat
Or:
Alternatively, you can run the following command:
.. prompt:: bash #
ceph osd dump
You can also check view OSDs according to their position in the CRUSH map by
using the following command:
To view OSDs according to their position in the CRUSH map, run the following
command:
.. prompt:: bash #
ceph osd tree
Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up
and their weight:
To print out a CRUSH tree that displays a host, its OSDs, whether the OSDs are
``up``, and the weight of the OSDs, run the following command:
.. code-block:: bash
@ -509,88 +505,90 @@ and their weight:
1 ssd 1.00000 osd.1 up 1.00000 1.00000
2 ssd 1.00000 osd.2 up 1.00000 1.00000
For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
See `Monitoring OSDs and Placement Groups`_.
Checking Monitor Status
=======================
If your cluster has multiple monitors (likely), you should check the monitor
quorum status after you start the cluster and before reading and/or writing data. A
quorum must be present when multiple monitors are running. You should also check
monitor status periodically to ensure that they are running.
If your cluster has multiple monitors, then you need to perform certain
"monitor status" checks. After starting the cluster and before reading or
writing data, you should check quorum status. A quorum must be present when
multiple monitors are running to ensure proper functioning of your Ceph
cluster. Check monitor status regularly in order to ensure that all of the
monitors are running.
To see display the monitor map, execute the following:
To display the monitor map, run the following command:
.. prompt:: bash $
ceph mon stat
Or:
Alternatively, you can run the following command:
.. prompt:: bash $
ceph mon dump
To check the quorum status for the monitor cluster, execute the following:
To check the quorum status for the monitor cluster, run the following command:
.. prompt:: bash $
ceph quorum_status
Ceph will return the quorum status. For example, a Ceph cluster consisting of
three monitors may return the following:
Ceph returns the quorum status. For example, a Ceph cluster that consists of
three monitors might return the following:
.. code-block:: javascript
{ "election_epoch": 10,
"quorum": [
0,
1,
2],
"quorum_names": [
"a",
"b",
"c"],
"quorum_leader_name": "a",
"monmap": { "epoch": 1,
"fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
"modified": "2011-12-12 13:28:27.505520",
"created": "2011-12-12 13:28:27.505520",
"features": {"persistent": [
"kraken",
"luminous",
"mimic"],
"optional": []
},
"mons": [
{ "rank": 0,
"name": "a",
"addr": "127.0.0.1:6789/0",
"public_addr": "127.0.0.1:6789/0"},
{ "rank": 1,
"name": "b",
"addr": "127.0.0.1:6790/0",
"public_addr": "127.0.0.1:6790/0"},
{ "rank": 2,
"name": "c",
"addr": "127.0.0.1:6791/0",
"public_addr": "127.0.0.1:6791/0"}
]
}
}
{ "election_epoch": 10,
"quorum": [
0,
1,
2],
"quorum_names": [
"a",
"b",
"c"],
"quorum_leader_name": "a",
"monmap": { "epoch": 1,
"fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
"modified": "2011-12-12 13:28:27.505520",
"created": "2011-12-12 13:28:27.505520",
"features": {"persistent": [
"kraken",
"luminous",
"mimic"],
"optional": []
},
"mons": [
{ "rank": 0,
"name": "a",
"addr": "127.0.0.1:6789/0",
"public_addr": "127.0.0.1:6789/0"},
{ "rank": 1,
"name": "b",
"addr": "127.0.0.1:6790/0",
"public_addr": "127.0.0.1:6790/0"},
{ "rank": 2,
"name": "c",
"addr": "127.0.0.1:6791/0",
"public_addr": "127.0.0.1:6791/0"}
]
}
}
Checking MDS Status
===================
Metadata servers provide metadata services for CephFS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To ensure your
metadata servers are ``up`` and ``active``, execute the following:
Metadata servers provide metadata services for CephFS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To check if your
metadata servers are ``up`` and ``active``, run the following command:
.. prompt:: bash $
ceph mds stat
To display details of the metadata cluster, execute the following:
To display details of the metadata servers, run the following command:
.. prompt:: bash $
@ -600,9 +598,9 @@ To display details of the metadata cluster, execute the following:
Checking Placement Group States
===============================
Placement groups map objects to OSDs. When you monitor your
placement groups, you will want them to be ``active`` and ``clean``.
For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
Placement groups (PGs) map objects to OSDs. PGs are monitored in order to
ensure that they are ``active`` and ``clean``. See `Monitoring OSDs and
Placement Groups`_.
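For a quick summary of PG states across the cluster, you might run:

.. prompt:: bash $

   ceph pg stat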
.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg
@ -611,37 +609,36 @@ For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
Using the Admin Socket
======================
The Ceph admin socket allows you to query a daemon via a socket interface.
By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon
via the admin socket, login to the host running the daemon and use the
following command:
The Ceph admin socket allows you to query a daemon via a socket interface. By
default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon via
the admin socket, log in to the host that is running the daemon and run one of
the two following commands:
.. prompt:: bash $
ceph daemon {daemon-name}
ceph daemon {path-to-socket-file}
For example, the following are equivalent:
For example, the following commands are equivalent to each other:
.. prompt:: bash $
ceph daemon osd.0 foo
ceph daemon /var/run/ceph/ceph-osd.0.asok foo
To view the available admin socket commands, execute the following command:
To view the available admin-socket commands, run the following command:
.. prompt:: bash $
ceph daemon {daemon-name} help
The admin socket command enables you to show and set your configuration at
runtime. See `Viewing a Configuration at Runtime`_ for details.
Additionally, you can set configuration values at runtime directly (i.e., the
admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id}
config set``, which relies on the monitor but doesn't require you to login
directly to the host in question ).
Admin-socket commands enable you to view and set your configuration at runtime.
For more on viewing your configuration, see `Viewing a Configuration at
Runtime`_. There are two methods of setting configuration values at runtime: (1)
using the admin socket, which bypasses the monitor and requires a direct login
to the host in question, and (2) using the ``ceph tell {daemon-type}.{id}
config set`` command, which relies on the monitor and does not require a direct
login.
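For example, the following two commands are a sketch of these two methods;
``osd.0`` and ``debug_osd`` stand in for whatever daemon and option you want to
change:

.. prompt:: bash $

   ceph daemon osd.0 config set debug_osd 20   # via the admin socket; run on the daemon's host
   ceph tell osd.0 config set debug_osd 20     # relies on the monitor; can be run from elsewhere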
.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime
.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/
View File
@ -6,50 +6,52 @@
Running Ceph with systemd
==========================
=========================
For all distributions that support systemd (CentOS 7, Fedora, Debian
Jessie 8 and later, SUSE), ceph daemons are now managed using native
systemd files instead of the legacy sysvinit scripts. For example:
In all distributions that support systemd (CentOS 7, Fedora, Debian
Jessie 8 and later, and SUSE), systemd files (and NOT legacy SysVinit scripts)
are used to manage Ceph daemons. Ceph daemons therefore behave like any other daemons
that can be controlled by the ``systemctl`` command, as in the following examples:
.. prompt:: bash $
sudo systemctl start ceph.target # start all daemons
sudo systemctl status ceph-osd@12 # check status of osd.12
To list the Ceph systemd units on a node, execute:
To list all of the Ceph systemd units on a node, run the following command:
.. prompt:: bash $
sudo systemctl status ceph\*.service ceph\*.target
Starting all Daemons
Starting all daemons
--------------------
To start all daemons on a Ceph Node (irrespective of type), execute the
following:
To start all of the daemons on a Ceph node (regardless of their type), run the
following command:
.. prompt:: bash $
sudo systemctl start ceph.target
Stopping all Daemons
Stopping all daemons
--------------------
To stop all daemons on a Ceph Node (irrespective of type), execute the
following:
To stop all of the daemons on a Ceph node (regardless of their type), run the
following command:
.. prompt:: bash $
sudo systemctl stop ceph\*.service ceph\*.target
Starting all Daemons by Type
Starting all daemons by type
----------------------------
To start all daemons of a particular type on a Ceph Node, execute one of the
following:
To start all of the daemons of a particular type on a Ceph node, run one of the
following commands:
.. prompt:: bash $
@ -58,24 +60,24 @@ following:
sudo systemctl start ceph-mds.target
Stopping all Daemons by Type
Stopping all daemons by type
----------------------------
To stop all daemons of a particular type on a Ceph Node, execute one of the
following:
To stop all of the daemons of a particular type on a Ceph node, run one of the
following commands:
.. prompt:: bash $
sudo systemctl stop ceph-mon\*.service ceph-mon.target
sudo systemctl stop ceph-osd\*.service ceph-osd.target
sudo systemctl stop ceph-mon\*.service ceph-mon.target
sudo systemctl stop ceph-mds\*.service ceph-mds.target
Starting a Daemon
Starting a daemon
-----------------
To start a specific daemon instance on a Ceph Node, execute one of the
following:
To start a specific daemon instance on a Ceph node, run one of the
following commands:
.. prompt:: bash $
@ -92,11 +94,11 @@ For example:
sudo systemctl start ceph-mds@ceph-server
Stopping a Daemon
Stopping a daemon
-----------------
To stop a specific daemon instance on a Ceph Node, execute one of the
following:
To stop a specific daemon instance on a Ceph node, run one of the
following commands:
.. prompt:: bash $
@ -115,15 +117,14 @@ For example:
.. index:: sysvinit; operating a cluster
Running Ceph with sysvinit
Running Ceph with SysVinit
==========================
Each time you to **start**, **restart**, and **stop** Ceph daemons (or your
entire cluster) you must specify at least one option and one command. You may
also specify a daemon type or a daemon instance. ::
{commandline} [options] [commands] [daemons]
Each time you start, restart, or stop Ceph daemons, you must specify at least one option and one command.
Likewise, each time you start, restart, or stop your entire cluster, you must specify at least one option and one command.
In both cases, you can also specify a daemon type or a daemon instance. ::
{commandline} [options] [commands] [daemons]
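For instance, assuming the legacy SysVinit script is installed at
``/etc/init.d/ceph``, a sketch of this syntax might look like this:

.. prompt:: bash $

   sudo /etc/init.d/ceph -a start        # start all daemons on all nodes listed in ceph.conf
   sudo /etc/init.d/ceph start osd.0     # start a single daemon instance on the local node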
The ``ceph`` options include:
@ -134,12 +135,12 @@ The ``ceph`` options include:
+-----------------+----------+-------------------------------------------------+
| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. |
+-----------------+----------+-------------------------------------------------+
| ``--allhosts`` | ``-a`` | Execute on all nodes in ``ceph.conf.`` |
| ``--allhosts`` | ``-a`` | Execute on all nodes listed in ``ceph.conf``. |
| | | Otherwise, it only executes on ``localhost``. |
+-----------------+----------+-------------------------------------------------+
| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. |
+-----------------+----------+-------------------------------------------------+
| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. |
| ``--norestart`` | ``N/A`` | Do not restart a daemon if it core dumps. |
+-----------------+----------+-------------------------------------------------+
| ``--conf`` | ``-c`` | Use an alternate configuration file. |
+-----------------+----------+-------------------------------------------------+
@ -153,24 +154,21 @@ The ``ceph`` commands include:
+------------------+------------------------------------------------------------+
| ``stop`` | Stop the daemon(s). |
+------------------+------------------------------------------------------------+
| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9`` |
| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9``. |
+------------------+------------------------------------------------------------+
| ``killall`` | Kill all daemons of a particular type. |
| ``killall`` | Kill all daemons of a particular type. |
+------------------+------------------------------------------------------------+
| ``cleanlogs`` | Cleans out the log directory. |
+------------------+------------------------------------------------------------+
| ``cleanalllogs`` | Cleans out **everything** in the log directory. |
+------------------+------------------------------------------------------------+
For subsystem operations, the ``ceph`` service can target specific daemon types
by adding a particular daemon type for the ``[daemons]`` option. Daemon types
include:
The ``[daemons]`` option allows the ``ceph`` service to target specific daemon types
in order to perform subsystem operations. Daemon types include:
- ``mon``
- ``osd``
- ``mds``
.. _Valgrind: http://www.valgrind.org/
.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html
View File
@ -1,3 +1,5 @@
.. _rados_operations_pg_concepts:
==========================
Placement Group Concepts
==========================
View File
@ -1,59 +1,60 @@
============================
Repairing PG inconsistencies
Repairing PG Inconsistencies
============================
Sometimes a placement group might become "inconsistent". To return the
placement group to an active+clean state, you must first determine which
of the placement groups has become inconsistent and then run the "pg
repair" command on it. This page contains commands for diagnosing placement
groups and the command for repairing placement groups that have become
Sometimes a Placement Group (PG) might become ``inconsistent``. To return the PG
to an ``active+clean`` state, you must first determine which of the PGs has become
inconsistent and then run the ``pg repair`` command on it. This page contains
commands for diagnosing PGs and the command for repairing PGs that have become
inconsistent.
.. highlight:: console
Commands for Diagnosing Placement-group Problems
================================================
The commands in this section provide various ways of diagnosing broken placement groups.
Commands for Diagnosing PG Problems
===================================
The commands in this section provide various ways of diagnosing broken PGs.
The following command provides a high-level (low detail) overview of the health of the ceph cluster:
To see a high-level (low-detail) overview of Ceph cluster health, run the
following command:
.. prompt:: bash #
ceph health detail
The following command provides more detail on the status of the placement groups:
To see more detail on the status of the PGs, run the following command:
.. prompt:: bash #
ceph pg dump --format=json-pretty
The following command lists inconsistent placement groups:
To see a list of inconsistent PGs, run the following command:
.. prompt:: bash #
rados list-inconsistent-pg {pool}
The following command lists inconsistent rados objects:
To see a list of inconsistent RADOS objects, run the following command:
.. prompt:: bash #
rados list-inconsistent-obj {pgid}
The following command lists inconsistent snapsets in the given placement group:
To see a list of inconsistent snapsets in a specific PG, run the following
command:
.. prompt:: bash #
rados list-inconsistent-snapset {pgid}
Commands for Repairing Placement Groups
=======================================
The form of the command to repair a broken placement group is:
Commands for Repairing PGs
==========================
The form of the command to repair a broken PG is as follows:
.. prompt:: bash #
ceph pg repair {pgid}
Where ``{pgid}`` is the id of the affected placement group.
Here ``{pgid}`` represents the id of the affected PG.
For example:
@ -61,23 +62,57 @@ For example:
ceph pg repair 1.4
More Information on Placement Group Repair
==========================================
Ceph stores and updates the checksums of objects stored in the cluster. When a scrub is performed on a placement group, the OSD attempts to choose an authoritative copy from among its replicas. Among all of the possible cases, only one case is consistent. After a deep scrub, Ceph calculates the checksum of an object read from the disk and compares it to the checksum previously recorded. If the current checksum and the previously recorded checksums do not match, that is an inconsistency. In the case of replicated pools, any mismatch between the checksum of any replica of an object and the checksum of the authoritative copy means that there is an inconsistency.
.. note:: PG IDs have the form ``N.xxxxx``, where ``N`` is the number of the
pool that contains the PG. The command ``ceph osd lspools`` and the
command ``ceph osd dump | grep pool`` return a list of pool numbers.
The ``pg repair`` command attempts to fix inconsistencies of various kinds. If ``pg repair`` finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If ``pg repair`` finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of ``pg repair``.
More Information on PG Repair
=============================
Ceph stores and updates the checksums of objects stored in the cluster. When a
scrub is performed on a PG, the OSD attempts to choose an authoritative copy
from among its replicas. Only one of the possible cases is consistent. After
performing a deep scrub, Ceph calculates the checksum of an object that is read
from disk and compares it to the checksum that was previously recorded. If the
current checksum and the previously recorded checksum do not match, that
mismatch is considered to be an inconsistency. In the case of replicated pools,
any mismatch between the checksum of any replica of an object and the checksum
of the authoritative copy means that there is an inconsistency. The discovery
of these inconsistencies causes a PG's state to be set to ``inconsistent``.
For erasure coded and BlueStore pools, Ceph will automatically repair
if ``osd_scrub_auto_repair`` (default ``false`) is set to ``true`` and
at most ``osd_scrub_auto_repair_num_errors`` (default ``5``) errors are found.
The ``pg repair`` command attempts to fix inconsistencies of various kinds. If
``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of
the inconsistent copy with the digest of the authoritative copy. If ``pg
repair`` finds an inconsistent replicated pool, it marks the inconsistent copy
as missing. In the case of replicated pools, recovery is beyond the scope of
``pg repair``.
``pg repair`` will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them.
In the case of erasure-coded and BlueStore pools, Ceph will automatically
perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to
``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default
``5``) errors are found.
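For example, to enable automatic repair with the default error threshold, you
might run the following commands (a sketch; enable this only if it is
appropriate for your cluster):

.. prompt:: bash $

   ceph config set osd osd_scrub_auto_repair true
   ceph config set osd osd_scrub_auto_repair_num_errors 5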
The checksum of a RADOS object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. "ceph pg repair" is able to repair things even when checksums are not available to it, as in the case of Filestore. When replicated Filestore pools are in play, users might prefer manual repair over ``ceph pg repair``.
The ``pg repair`` command will not solve every problem. Ceph does not
automatically repair PGs when they are found to contain inconsistencies.
The material in this paragraph is relevant for Filestore, and BlueStore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, ``pg repair`` favors the data on the primary. this might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the ``eph-objectstore-tool``.
The checksum of a RADOS object or an omap is not always available. Checksums
are calculated incrementally. If a replicated object is updated
non-sequentially, the write operation involved in the update changes the object
and invalidates its checksum. The whole object is not read while the checksum
is recalculated. The ``pg repair`` command is able to make repairs even when
checksums are not available to it, as in the case of Filestore. Users working
with replicated Filestore pools might prefer manual repair to ``ceph pg
repair``.
This material is relevant for Filestore, but not for BlueStore, which has its
own internal checksums. The matched-record checksum and the calculated checksum
cannot prove that any specific copy is in fact authoritative. If there is no
checksum available, ``pg repair`` favors the data on the primary, but this
might not be the uncorrupted replica. Because of this uncertainty, human
intervention is necessary when an inconsistency is discovered. This
intervention sometimes involves use of ``ceph-objectstore-tool``.
External Links
==============
https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page contains a walkthrough of the repair of a placement group, and is recommended reading if you want to repair a placement
group but have never done so.
https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page
contains a walkthrough of the repair of a PG. It is recommended reading if you
want to repair a PG but have never done so.
File diff suppressed because it is too large

File diff suppressed because it is too large

View File
@ -7,208 +7,256 @@ Stretch Clusters
Stretch Clusters
================
Ceph generally expects all parts of its network and overall cluster to be
equally reliable, with failures randomly distributed across the CRUSH map.
So you may lose a switch that knocks out a number of OSDs, but we expect
the remaining OSDs and monitors to route around that.
This is usually a good choice, but may not work well in some
stretched cluster configurations where a significant part of your cluster
is stuck behind a single network component. For instance, a single
cluster which is located in multiple data centers, and you want to
sustain the loss of a full DC.
A stretch cluster is a cluster that has servers in geographically separated
data centers, distributed over a WAN. Stretch clusters have LAN-like high-speed
and low-latency connections, but limited links. Stretch clusters have a higher
likelihood of (possibly asymmetric) network splits, and a higher likelihood of
temporary or complete loss of an entire data center (which can represent
one-third to one-half of the total cluster).
There are two standard configurations we've seen deployed, with either
two or three data centers (or, in clouds, availability zones). With two
zones, we expect each site to hold a copy of the data, and for a third
site to have a tiebreaker monitor (this can be a VM or high-latency compared
to the main sites) to pick a winner if the network connection fails and both
DCs remain alive. For three sites, we expect a copy of the data and an equal
number of monitors in each site.
Ceph is designed with the expectation that all parts of its network and cluster
will be reliable and that failures will be distributed randomly across the
CRUSH map. Even if a switch goes down and causes the loss of many OSDs, Ceph is
designed so that the remaining OSDs and monitors will route around such a loss.
Note that the standard Ceph configuration will survive MANY failures of the
network or data centers and it will never compromise data consistency. If you
bring back enough Ceph servers following a failure, it will recover. If you
lose a data center, but can still form a quorum of monitors and have all the data
available (with enough copies to satisfy pools' ``min_size``, or CRUSH rules
that will re-replicate to meet it), Ceph will maintain availability.
Sometimes this cannot be relied upon. If you have a "stretched-cluster"
deployment in which much of your cluster is behind a single network component,
you might need to use **stretch mode** to ensure data integrity.
What can't it handle?
Here we will consider two standard configurations: a configuration with two
data centers (or, in clouds, two availability zones), and a configuration with
three data centers (or, in clouds, three availability zones).
In the two-site configuration, Ceph expects each of the sites to hold a copy of
the data, and Ceph also expects there to be a third site that has a tiebreaker
monitor. This tiebreaker monitor picks a winner if the network connection fails
and both data centers remain alive.
The tiebreaker monitor can be a VM. It can also have high latency relative to
the two main sites.
The standard Ceph configuration is able to survive MANY network failures or
data-center failures without ever compromising data availability. If enough
Ceph servers are brought back following a failure, the cluster *will* recover.
If you lose a data center but are still able to form a quorum of monitors and
still have all the data available, Ceph will maintain availability. (This
assumes that the cluster has enough copies to satisfy the pools' ``min_size``
configuration option, or (failing that) that the cluster has CRUSH rules in
place that will cause the cluster to re-replicate the data until the
``min_size`` configuration option has been met.)
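To inspect or change the ``min_size`` of a pool, you can use commands of the
following form (``mypool`` is a placeholder for your pool name):

.. prompt:: bash $

   ceph osd pool get mypool min_size
   ceph osd pool set mypool min_size 2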
Stretch Cluster Issues
======================
No matter what happens, Ceph will not compromise on data integrity
and consistency. If there's a failure in your network or a loss of nodes and
you can restore service, Ceph will return to normal functionality on its own.
But there are scenarios where you lose data availability despite having
enough servers available to satisfy Ceph's consistency and sizing constraints, or
where you may be surprised to not satisfy Ceph's constraints.
The first important category of these failures resolve around inconsistent
networks -- if there's a netsplit, Ceph may be unable to mark OSDs down and kick
them out of the acting PG sets despite the primary being unable to replicate data.
If this happens, IO will not be permitted, because Ceph can't satisfy its durability
guarantees.
Ceph does not permit the compromise of data integrity and data consistency
under any circumstances. When service is restored after a network failure or a
loss of Ceph nodes, Ceph will restore itself to a state of normal functioning
without operator intervention.
Ceph does not permit the compromise of data integrity or data consistency, but
there are situations in which *data availability* is compromised. These
situations can occur even though there are enough servers available to satisfy
Ceph's consistency and sizing constraints. In some situations, you might
discover that your cluster does not satisfy those constraints.
The first category of these failures that we will discuss involves inconsistent
networks -- if there is a netsplit (a disconnection between two servers that
splits the network into two pieces), Ceph might be unable to mark OSDs ``down``
and remove them from the acting PG sets. This failure to mark OSDs ``down``
will occur, despite the fact that the primary PG is unable to replicate data (a
situation that, under normal non-netsplit circumstances, would result in the
marking of affected OSDs as ``down`` and their removal from the PG). If this
happens, Ceph will be unable to satisfy its durability guarantees and
consequently IO will not be permitted.
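
During a suspected netsplit, the usual inspection commands can show whether
OSDs have been marked ``down`` and whether any PGs have stopped serving IO.
This is only a sketch; none of these commands are specific to stretch clusters:

.. prompt:: bash $

   ceph health detail            # reports down OSDs and inactive PGs, if any
   ceph osd tree                 # shows which OSDs the monitors consider up or down
   ceph pg dump_stuck inactive   # lists PGs that are not currently serving IO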
The second category of failures that we will discuss involves the situation in
which the constraints are not sufficient to guarantee the replication of data
across data centers, though it might seem that the data is correctly replicated
across data centers. For example, in a scenario in which there are two data
centers named Data Center A and Data Center B, and the CRUSH rule targets three
replicas and places a replica in each data center with a ``min_size`` of ``2``,
the PG might go active with two replicas in Data Center A and zero replicas in
Data Center B. In a situation of this kind, the loss of Data Center A means
that the data is lost and Ceph will not be able to operate on it. This
situation is surprisingly difficult to avoid using only standard CRUSH rules.
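
To see how this can happen, it can help to look at where a PG's replicas have
actually been placed. For example (``mypool`` and the PG ID ``1.0`` are
placeholders):

.. prompt:: bash $

   ceph osd pool get mypool size       # e.g. 3 replicas requested
   ceph osd pool get mypool min_size   # e.g. 2
   ceph pg map 1.0                     # shows the OSDs in the PG's up and acting sets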
Stretch Mode
============
Stretch mode is designed to handle the two-site case: deployments in which you
cannot guarantee the replication of data across two data centers. (Three-site
deployments are just as susceptible to netsplit issues, but they are much more
tolerant of component availability outages than two-site clusters are.) The
problematic situation can arise when the cluster's CRUSH rule specifies that
three copies are to be made, but a copy is placed in each data center with a
``min_size`` of 2. Under such conditions, a placement group can become active
with two copies in the first data center and no copies in the second data
center.
Entering Stretch Mode
---------------------

To enable stretch mode, you must set the location of each monitor, matching
your CRUSH map. This procedure shows how to do this.

#. Place ``mon.a`` in your first data center:

   .. prompt:: bash $

      ceph mon set_location a datacenter=site1

#. Generate a CRUSH rule that places two copies in each data center.
   This requires editing the CRUSH map directly:

   .. prompt:: bash $

      ceph osd getcrushmap > crush.map.bin
      crushtool -d crush.map.bin -o crush.map.txt

#. Edit the ``crush.map.txt`` file to add a new rule. Here there is only one
   other rule (``id 1``), but you might need to use a different rule ID. We
   have two data-center buckets named ``site1`` and ``site2``:

   ::

       rule stretch_rule {
               id 1
               min_size 1
               max_size 10
               type replicated
               step take site1
               step chooseleaf firstn 2 type host
               step emit
               step take site2
               step chooseleaf firstn 2 type host
               step emit
       }

#. Inject the CRUSH map to make the rule available to the cluster:

   .. prompt:: bash $

      crushtool -c crush.map.txt -o crush2.map.bin
      ceph osd setcrushmap -i crush2.map.bin

#. Run the monitors in connectivity mode. See `Changing Monitor Elections`_.

#. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the
   tiebreaker monitor and we are splitting across data centers. The tiebreaker
   monitor must be assigned a data center that is neither ``site1`` nor
   ``site2``. For this purpose you can create another data-center bucket named
   ``site3`` in your CRUSH and place ``mon.e`` there:

   .. prompt:: bash $

      ceph mon set_location e datacenter=site3
      ceph mon enable_stretch_mode e stretch_rule datacenter
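
After running these steps, you may want to confirm that the rule and the
monitor locations look as expected. The following is only a sketch, and the
exact output varies by release; ``stretch_rule`` and the bucket names are the
ones used in the example above:

.. prompt:: bash $

   ceph osd crush rule ls   # the new stretch_rule should be listed
   ceph mon dump            # each monitor should report its data-center location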
When stretch mode is enabled, PGs will become active only when they peer
across data centers (or across whichever CRUSH bucket type was specified),
assuming both are alive. Pools will increase in size from the default ``3`` to
``4``, and two copies will be expected in each site. OSDs will be allowed to
connect to monitors only if they are in the same data center as the monitors.
New monitors will not be allowed to join the cluster if they do not specify a
location.
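
As a quick check (a sketch only; output formatting differs between releases),
you can confirm the new replication settings on the cluster's pools:

.. prompt:: bash $

   ceph osd pool ls detail   # replicated pools should now report size 4 and min_size 2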
If all OSDs and monitors in one of the data centers become inaccessible at once,
the surviving data center enters a "degraded stretch mode". A warning will be
issued, the ``min_size`` will be reduced to ``1``, and the cluster will be
allowed to go active with the data in the single remaining site. The pool size
does not change, so warnings will be generated that report that the pools are
too small -- but a special stretch mode flag will prevent the OSDs from
creating extra copies in the remaining data center. This means that the data
center will keep only two copies, just as before.
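
To observe this state, the standard health commands are sufficient (shown here
only as a sketch; the exact warning text depends on the release):

.. prompt:: bash $

   ceph status          # summary, including the stretch mode health warning
   ceph health detail   # details of the degraded stretch mode warning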
When the missing data center comes back, the cluster will enter a "recovery
stretch mode". This changes the warning and allows peering, but requires OSDs
only from the data center that was ``up`` throughout the duration of the
downtime. When all PGs are in a known state, and are neither degraded nor
incomplete, the cluster transitions back to regular stretch mode, ends the
warning, restores ``min_size`` to its original value (``2``), requires both
sites to peer, and no longer requires the site that was up throughout the
duration of the downtime when peering (which makes failover to the other site
possible, if needed).
.. _Changing Monitor elections: ../change-mon-elections
Limitations of Stretch Mode
===========================
When using stretch mode, OSDs must be located at exactly two sites.
Two monitors should be run in each data center, plus a tiebreaker in a third
(or in the cloud) for a total of five monitors. While in stretch mode, OSDs
will connect only to monitors within the data center in which they are located.
OSDs *DO NOT* connect to the tiebreaker monitor.
Erasure-coded pools cannot be used with stretch mode. Attempts to use
erasure-coded pools with stretch mode will fail, and erasure-coded pools cannot
be created while stretch mode is active.
To use stretch mode, you will need to create a CRUSH rule that provides two
replicas in each data center. Ensure that there are four total replicas: two in
each data center. If pools exist in the cluster that do not have the default
``size`` or ``min_size``, Ceph will not enter stretch mode. An example of such
a CRUSH rule is given above.
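
Before enabling stretch mode, it can be worth confirming that no pool carries a
non-default ``size`` or ``min_size``. This is only a sketch; ``rbd`` is an
example pool name:

.. prompt:: bash $

   ceph osd pool ls detail      # lists size and min_size for every pool
   ceph osd pool get rbd size   # expect the default of 3 before entering stretch mode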
Because stretch mode runs with ``min_size`` set to ``1`` when degraded, we
recommend enabling stretch mode only when using OSDs on SSDs (including NVMe
OSDs). Hybrid HDD+SSD or HDD-only OSDs are not recommended, because of the long
time they take to recover after connectivity between data centers has been
restored; minimizing that recovery time minimizes the potential for data loss.
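
One way to check which kinds of devices back your OSDs is to look at their
CRUSH device classes (a sketch; the class names are whatever your deployment
assigned):

.. prompt:: bash $

   ceph osd crush class ls   # e.g. ["hdd", "ssd", "nvme"]
   ceph osd tree             # the CLASS column shows the device class of each OSD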
In the future, stretch mode might support erasure-coded pools and might support
deployments that have more than two data centers.
Other commands
==============
Replacing a failed tiebreaker monitor
-------------------------------------
Turn on a new monitor and run the following command:
.. prompt:: bash $
ceph mon set_new_tiebreaker mon.<new_mon_name>
This command protests if the new monitor is in the same location as the
existing non-tiebreaker monitors. **This command WILL NOT remove the previous
tiebreaker monitor.** Remove the previous tiebreaker monitor yourself.
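
For example, if the failed tiebreaker was ``mon.e`` and the replacement is
``mon.f`` (both names are assumptions for illustration), the sequence might
look like this:

.. prompt:: bash $

   ceph mon set_new_tiebreaker mon.f   # promote the new tiebreaker
   ceph mon remove e                   # then remove the old tiebreaker yourself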
Using "--set-crush-location" and not "ceph mon set_location"
------------------------------------------------------------
If you write your own tooling for deploying Ceph, use the
``--set-crush-location`` option when booting monitors instead of running ``ceph
mon set_location``. This option accepts only a single ``bucket=loc`` pair (for
example, ``ceph-mon --set-crush-location 'datacenter=a'``), and that pair must
match the bucket type that was specified when running ``enable_stretch_mode``.
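
As a hypothetical example of what such tooling might run when starting a
monitor daemon (the monitor ID and bucket name here are assumptions):

.. prompt:: bash $

   ceph-mon -i e --set-crush-location 'datacenter=site3'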
Forcing recovery stretch mode
-----------------------------
When in stretch degraded mode, the cluster will go into "recovery" mode
automatically when the disconnected data center comes back. If that does not
happen or you want to enable recovery mode early, run the following command:
.. prompt:: bash $
ceph osd force_recovery_stretch_mode --yes-i-really-mean-it
This command should not be necessary; it is included to deal with
unanticipated situations.
Forcing normal stretch mode
---------------------------
When in recovery mode, the cluster should go back into normal stretch mode when
the PGs are healthy. If this fails to happen or if you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), run the following command:
.. prompt:: bash $
ceph osd force_healthy_stretch_mode --yes-i-really-mean-it
This command should not be necessary, but you might wish to invoke it to remove
the ``HEALTH_WARN`` state that recovery mode generates.
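
Afterwards, a quick health check (again, only a sketch) shows whether the
recovery-mode warning has cleared, assuming no unrelated warnings are present:

.. prompt:: bash $

   ceph health   # should return HEALTH_OK once normal stretch mode has resumed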