Merge remote-tracking branch 'origin/quincy-stable-8' into quincy-stable-7

Thomas Lamprecht 2023-11-02 17:21:22 +01:00
commit 692686d018
1290 changed files with 38032 additions and 16268 deletions


@ -21,6 +21,13 @@
- The Signed-off-by line in every git commit is important; see [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst)
-->
## Contribution Guidelines
- To sign and title your commits, please refer to [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/main/SubmittingPatches.rst).
- If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to [Submitting Patches to Ceph - Backports](https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst) for the proper workflow.
- When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.
## Checklist
- Tracker (select at least one)
  - [ ] References tracker ticket


@ -11,6 +11,9 @@ jobs:
    runs-on: ubuntu-latest
    name: Verify
    steps:
      - name: Sleep for 30 seconds
        run: sleep 30s
        shell: bash
      - name: Action
        id: checklist
        uses: ceph/ceph-pr-checklist-action@32e92d1a2a7c9991ed51de5fccb2296551373d60


@ -1,7 +1,7 @@
cmake_minimum_required(VERSION 3.16)
project(ceph
  VERSION 17.2.7
  LANGUAGES CXX C ASM)
cmake_policy(SET CMP0028 NEW)


@ -1,3 +1,49 @@
>=17.2.7
--------
* `ceph mgr dump` command now displays the name of the mgr module that
registered a RADOS client in the `name` field added to elements of the
`active_clients` array. Previously, only the address of a module's RADOS
client was shown in the `active_clients` array.
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
undergone significant usability and design improvements to address the slow
backfill issue. Some important changes are:
* The 'balanced' profile is set as the default mClock profile because it
represents a compromise between prioritizing client IO and recovery IO. Users
can then choose either the 'high_client_ops' profile to prioritize client IO
or the 'high_recovery_ops' profile to prioritize recovery IO.
* QoS parameters like reservation and limit are now specified in terms of a
fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
* The cost parameters (osd_mclock_cost_per_io_usec_* and
osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
is now determined using the random IOPS and maximum sequential bandwidth
capability of the OSD's underlying device.
* Degraded object recovery is given higher priority when compared to misplaced
object recovery because degraded objects present a data safety issue not
present with objects that are merely misplaced. Therefore, backfilling
operations with the 'balanced' and 'high_client_ops' mClock profiles may
progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
scheduler.
* The QoS allocations in all the mClock profiles are optimized based on the above
fixes and enhancements.
* For more detailed information see:
https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
Example commands for switching mClock profiles are shown after this list.
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
multi-site. Previously, the replicas of such objects were corrupted on decryption.
A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
identify these original multipart uploads. The ``LastModified`` timestamp of any
identified object is incremented by 1ns to cause peer zones to replicate it again.
For multi-site deployments that make any use of Server-Side Encryption, we
recommend running this command against every bucket in every zone after all
zones have upgraded.
* CEPHFS: The MDS now evicts clients that are not advancing their request tids, because
such clients cause a large buildup of session metadata and can drive the MDS read-only
when the resulting RADOS operation exceeds the size threshold. The
`mds_session_metadata_threshold` config option controls the maximum size to which the
(encoded) session metadata can grow.
* CEPHFS: After a Ceph File System has been recovered by following the disaster recovery
procedure, the recovered files under the `lost+found` directory can now be deleted.
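As a quick illustration of the mClock changes noted above, the active profile can be
inspected and switched at runtime with `ceph config`. This is only a sketch; the OSD id
and the chosen profile below are examples, not recommendations:

    # check the currently active mClock profile on osd.0
    ceph config get osd.0 osd_mclock_profile

    # temporarily favor recovery/backfill over client IO
    ceph config set osd osd_mclock_profile high_recovery_ops

    # return to the default profile
    ceph config set osd osd_mclock_profile balanced
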
>=17.2.6
--------
@ -7,6 +53,60 @@
>=17.2.5
--------
>=19.0.0
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
multi-site. Previously, the replicas of such objects were corrupted on decryption.
A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
identify these original multipart uploads. The ``LastModified`` timestamp of any
identified object is incremented by 1ns to cause peer zones to replicate it again.
For multi-site deployments that make any use of Server-Side Encryption, we
recommend running this command against every bucket in every zone after all
zones have upgraded.
* CEPHFS: The MDS now evicts clients that are not advancing their request tids, because
such clients cause a large buildup of session metadata and can drive the MDS read-only
when the resulting RADOS operation exceeds the size threshold. The
`mds_session_metadata_threshold` config option controls the maximum size to which the
(encoded) session metadata can grow.
* CephFS: For clusters with multiple CephFS file systems, all the snap-schedule
commands now expect the '--fs' argument.
* CephFS: The period specifier ``m`` now implies minutes and the period specifier
``M`` now implies months. This has been made consistent with the rest
of the system.
* RGW: New tools have been added to radosgw-admin for identifying and
correcting issues with versioned bucket indexes. Historical bugs with the
versioned bucket index transaction workflow made it possible for the index
to accumulate extraneous "book-keeping" olh entries and plain placeholder
entries. In some specific scenarios where clients made concurrent requests
referencing the same object key, it was likely that a lot of extra index
entries would accumulate. When a significant number of these entries are
present in a single bucket index shard, they can cause high bucket listing
latencies and lifecycle processing failures. To check whether a versioned
bucket has unnecessary olh entries, users can now run ``radosgw-admin
bucket check olh``. If the ``--fix`` flag is used, the extra entries will
be safely removed. Separately from the issue described thus far, it is
also possible that some versioned buckets are maintaining extra unlinked
objects that are not listable from the S3/Swift APIs. These extra objects
are typically a result of PUT requests that exited abnormally, in the middle
of a bucket index transaction - so the client would not have received a
successful response. Bugs in prior releases made these unlinked objects easy
to reproduce with any PUT request that was made on a bucket that was actively
resharding. Besides the extra space that these hidden, unlinked objects
consume, there can be another side effect in certain scenarios, caused by
the nature of the failure mode that produced them, where a client of a bucket
that was a victim of this bug may find the object associated with the key to
be in an inconsistent state. To check whether a versioned bucket has unlinked
entries, users can now run ``radosgw-admin bucket check unlinked``. If the
``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
a third issue made it possible for versioned bucket index stats to be
accounted inaccurately. The tooling for recalculating versioned bucket stats
also had a bug, and was not previously capable of fixing these inaccuracies.
This release resolves those issues and users can now expect that the existing
``radosgw-admin bucket check`` command will produce correct results. We
recommend that users with versioned buckets, especially those that existed
on prior releases, use these new tools to check whether their buckets are
affected and to clean them up accordingly; example invocations are shown
after this list.
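As a sketch of how the new checks described above might be run, assuming a bucket
named `mybucket` (a placeholder; consult `radosgw-admin help` for the exact flags in
your release):

    # report extraneous olh entries, then remove them
    radosgw-admin bucket check olh --bucket=mybucket
    radosgw-admin bucket check olh --bucket=mybucket --fix

    # report unlinked objects, then remove them
    radosgw-admin bucket check unlinked --bucket=mybucket
    radosgw-admin bucket check unlinked --bucket=mybucket --fix
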
>=18.0.0
* RBD: The semantics of compare-and-write C++ API (`Image::compare_and_write`
and `Image::aio_compare_and_write` methods) now match those of C API. Both
@ -47,6 +147,100 @@
If that is the case, in OSD logs the "You can be hit by THE DUPS BUG" warning
will be visible.
Relevant tracker: https://tracker.ceph.com/issues/53729
* RBD: `rbd device unmap` command gained `--namespace` option. Support for
namespaces was added to RBD in Nautilus 14.2.0 and it has been possible to
map and unmap images in namespaces using the `image-spec` syntax since then
but the corresponding option available in most other commands was missing.
* RGW: Compression is now supported for objects uploaded with Server-Side Encryption.
When both are enabled, compression is applied before encryption.
* RGW: the "pubsub" functionality for storing bucket notifications inside Ceph
is removed. Together with it, the "pubsub" zone should not be used anymore.
The REST operations, as well as radosgw-admin commands for manipulating
subscriptions, as well as fetching and acking the notifications are removed
as well.
In case that the endpoint to which the notifications are sent maybe down or
disconnected, it is recommended to use persistent notifications to guarantee
the delivery of the notifications. In case the system that consumes the
notifications needs to pull them (instead of the notifications be pushed
to it), an external message bus (e.g. rabbitmq, Kafka) should be used for
that purpose.
* RGW: The serialized format of notification and topics has changed, so that
new/updated topics will be unreadable by old RGWs. We recommend completing
the RGW upgrades before creating or modifying any notification topics.
* RBD: Trailing newline in passphrase files (`<passphrase-file>` argument in
`rbd encryption format` command and `--encryption-passphrase-file` option
in other commands) is no longer stripped.
* RBD: Support for layered client-side encryption is added. Cloned images
can now be encrypted each with its own encryption format and passphrase,
potentially different from that of the parent image. The efficient
copy-on-write semantics intrinsic to unformatted (regular) cloned images
are retained.
* CEPHFS: The `mds_max_retries_on_remount_failure` option has been renamed to
`client_max_retries_on_remount_failure` and moved from mds.yaml.in to
mds-client.yaml.in, because this option has only ever been used by the MDS
client.
* The `perf dump` and `perf schema` commands are deprecated in favor of new
`counter dump` and `counter schema` commands. These new commands add support
for labeled perf counters and also emit existing unlabeled perf counters. Some
unlabeled perf counters became labeled in this release, with more to follow in
future releases; such converted perf counters are no longer emitted by the
`perf dump` and `perf schema` commands. An example of the new commands is shown
after this list.
* `ceph mgr dump` command now outputs `last_failure_osd_epoch` and
`active_clients` fields at the top level. Previously, these fields were
output under `always_on_modules` field.
* `ceph mgr dump` command now displays the name of the mgr module that
registered a RADOS client in the `name` field added to elements of the
`active_clients` array. Previously, only the address of a module's RADOS
client was shown in the `active_clients` array.
* RBD: All rbd-mirror daemon perf counters became labeled and as such are now
emitted only by the new `counter dump` and `counter schema` commands. As part
of the conversion, many also got renamed to better disambiguate journal-based
and snapshot-based mirroring.
* RBD: list-watchers C++ API (`Image::list_watchers`) now clears the passed
`std::list` before potentially appending to it, aligning with the semantics
of the corresponding C API (`rbd_watchers_list`).
* Telemetry: Users who are opted in to telemetry can also opt in to
participating in a leaderboard in the telemetry public
dashboards (https://telemetry-public.ceph.com/). Users can now also add a
description of the cluster to publicly appear in the leaderboard.
For more details, see:
https://docs.ceph.com/en/latest/mgr/telemetry/#leaderboard
See a sample report with `ceph telemetry preview`.
Opt in to telemetry with `ceph telemetry on`.
Opt in to the leaderboard with
`ceph config set mgr mgr/telemetry/leaderboard true`.
Add leaderboard description with:
`ceph config set mgr mgr/telemetry/leaderboard_description Cluster description`.
* CEPHFS: After a Ceph File System has been recovered by following the disaster recovery
procedure, the recovered files under the `lost+found` directory can now be deleted.
* core: cache-tiering is now deprecated.
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
undergone significant usability and design improvements to address the slow
backfill issue. Some important changes are:
* The 'balanced' profile is set as the default mClock profile because it
represents a compromise between prioritizing client IO and recovery IO. Users
can then choose either the 'high_client_ops' profile to prioritize client IO
or the 'high_recovery_ops' profile to prioritize recovery IO.
* QoS parameters like reservation and limit are now specified in terms of a
fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
* The cost parameters (osd_mclock_cost_per_io_usec_* and
osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
is now determined using the random IOPS and maximum sequential bandwidth
capability of the OSD's underlying device.
* Degraded object recovery is given higher priority when compared to misplaced
object recovery because degraded objects present a data safety issue not
present with objects that are merely misplaced. Therefore, backfilling
operations with the 'balanced' and 'high_client_ops' mClock profiles may
progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
scheduler.
* The QoS allocations in all the mClock profiles are optimized based on the above
fixes and enhancements.
* For more detailed information see:
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
* mgr/snap_schedule: The snap-schedule mgr module now retains one snapshot fewer
than the number specified by the config tunable `mds_max_snaps_per_dir`,
so that a new snapshot can be created and retained during the next schedule
run.
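A minimal illustration of the new labeled-counter commands mentioned above, issued
against a daemon's admin socket (the daemon name `osd.0` is just an example):

    # labeled performance counters and their schema
    ceph daemon osd.0 counter dump
    ceph daemon osd.0 counter schema

    # the legacy command still works but no longer emits converted counters
    ceph daemon osd.0 perf dump
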
>=17.2.1


@ -1,81 +1,107 @@
# Ceph - a scalable distributed storage system

See https://ceph.com/ for current information about Ceph.

## Contributing Code

Most of Ceph is dual-licensed under the LGPL version 2.1 or 3.0. Some
miscellaneous code is either public domain or licensed under a BSD-style
license.

The Ceph documentation is licensed under Creative Commons Attribution Share
Alike 3.0 (CC-BY-SA-3.0).

Some headers included in the `ceph/ceph` repository are licensed under the GPL.
See the file `COPYING` for a full inventory of licenses by file.

All code contributions must include a valid "Signed-off-by" line. See the file
`SubmittingPatches.rst` for details on this and instructions on how to generate
and submit patches.

Assignment of copyright is not required to contribute code. Code is
contributed under the terms of the applicable license.
## Checking out the source

Clone the ceph/ceph repository from github by running the following command on
a system that has git installed:

    git clone git@github.com:ceph/ceph

Alternatively, if you are not a github user, you should run the following
command on a system that has git installed:

    git clone git://github.com/ceph/ceph

When the `ceph/ceph` repository has been cloned to your system, run the
following commands to move into the cloned `ceph/ceph` repository and to check
out the git submodules associated with it:

    cd ceph
    git submodule update --init --recursive
## Build Prerequisites

*section last updated 27 Jul 2023*

Make sure that ``curl`` is installed. The Debian and Ubuntu ``apt`` command is
provided here, but if you use a system with a different package manager, then
you must use whatever command is the proper counterpart of this one:

    apt install curl

Install Debian or RPM package dependencies by running the following command:

    ./install-deps.sh

Install the ``python3-routes`` package:

    apt install python3-routes
## Building Ceph

These instructions are meant for developers who are compiling the code for
development and testing. To build binaries that are suitable for installation
we recommend that you build `.deb` or `.rpm` packages, or refer to
``ceph.spec.in`` or ``debian/rules`` to see which configuration options are
specified for production builds.

To build Ceph, make sure that you are in the top-level `ceph` directory that
contains `do_cmake.sh` and `CONTRIBUTING.rst` and run the following commands:

    ./do_cmake.sh
    cd build
    ninja

``do_cmake.sh`` by default creates a "debug build" of Ceph, which can be up to
five times slower than a non-debug build. Pass
``-DCMAKE_BUILD_TYPE=RelWithDebInfo`` to ``do_cmake.sh`` to create a non-debug
build.

[Ninja](https://ninja-build.org/) is the buildsystem used by the Ceph project
to build test builds. The number of jobs used by `ninja` is derived from the
number of CPU cores of the building host if unspecified. Use the `-j` option to
limit the job number if the build jobs are running out of memory. If you
attempt to run `ninja` and receive a message that reads `g++: fatal error:
Killed signal terminated program cc1plus`, then you have run out of memory.
Using the `-j` option with an argument appropriate to the hardware on which the
`ninja` command is run is expected to result in a successful build. For example,
to limit the job number to 3, run the command `ninja -j 3`. On average, each
`ninja` job run in parallel needs approximately 2.5 GiB of RAM.

This documentation assumes that your build directory is a subdirectory of the
`ceph.git` checkout. If the build directory is located elsewhere, point
`CEPH_GIT_DIR` to the correct path of the checkout. Additional CMake args can
be specified by setting ARGS before invoking ``do_cmake.sh``. See [cmake
options](#cmake-options) for more details. For example:

    ARGS="-DCMAKE_C_COMPILER=gcc-7" ./do_cmake.sh

To build only certain targets, run a command of the following form:

    ninja [target name]
@ -130,24 +156,25 @@ are committed to git.)
## Running a test cluster

From the `ceph/` directory, run the following commands to launch a test Ceph
cluster:

    cd build
    ninja vstart        # builds just enough to run vstart
    ../src/vstart.sh --debug --new -x --localhost --bluestore
    ./bin/ceph -s

Most Ceph commands are available in the `bin/` directory. For example:

    ./bin/rbd create foo --size 1000
    ./bin/rados -p foo bench 30 write

To shut down the test cluster, run the following command from the `build/`
directory:

    ../src/stop.sh

Use the sysvinit script to start or stop individual daemons:

    ./bin/init-ceph restart osd.0
    ./bin/init-ceph stop


@ -166,7 +166,7 @@
# main package definition
#################################################################################
Name: ceph
Version: 17.2.7
Release: 0%{?dist}
%if 0%{?fedora} || 0%{?rhel}
Epoch: 2
@ -182,7 +182,7 @@ License: LGPL-2.1 and LGPL-3.0 and CC-BY-SA-3.0 and GPL-2.0 and BSL-1.0 and BSD-
Group: System/Filesystems
%endif
URL: http://ceph.com/
Source0: %{?_remote_tarball_prefix}ceph-17.2.7.tar.bz2
%if 0%{?suse_version}
# _insert_obs_source_lines_here
ExclusiveArch: x86_64 aarch64 ppc64le s390x
@ -1274,7 +1274,7 @@ This package provides Ceph default alerts for Prometheus.
# common
#################################################################################
%prep
%autosetup -p1 -n ceph-17.2.7
%build
# Disable lto on systems that do not support symver attribute
@ -1863,6 +1863,7 @@ fi
%{_datadir}/ceph/mgr/prometheus
%{_datadir}/ceph/mgr/rbd_support
%{_datadir}/ceph/mgr/restful
%{_datadir}/ceph/mgr/rgw
%{_datadir}/ceph/mgr/selftest
%{_datadir}/ceph/mgr/snap_schedule
%{_datadir}/ceph/mgr/stats



@ -1,3 +1,9 @@
ceph (17.2.7-1) stable; urgency=medium
* New upstream release
-- Ceph Release Team <ceph-maintainers@ceph.io> Wed, 25 Oct 2023 23:46:13 +0000
ceph (17.2.6-1) stable; urgency=medium
* New upstream release


@ -1 +1,3 @@
lib/systemd/system/cephfs-mirror*
usr/bin/cephfs-mirror
usr/share/man/man8/cephfs-mirror.8


@ -30,60 +30,52 @@ A Ceph Storage Cluster consists of multiple types of daemons:
- :term:`Ceph Manager`
- :term:`Ceph Metadata Server`
Ceph Monitors maintain the master copy of the cluster map, which they provide
to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures
availability in the event that one of the monitor daemons or its host fails.
The Ceph monitor provides copies of the cluster map to storage cluster clients.
A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
back to monitors.

A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in
modules.

A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
provide file services.

Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm
to compute information about data location. This means that clients and OSDs
are not bottlenecked by a central lookup table. Ceph's high-level features
include a native interface to the Ceph Storage Cluster via ``librados``, and a
number of service interfaces built on top of ``librados``.
Storing Data
------------

The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph File System`, or a custom implementation that you create by using
``librados``. The data received by the Ceph Storage Cluster is stored as RADOS
objects. Each object is stored on an :term:`Object Storage Device` (this is
also called an "OSD"). Ceph OSDs control read, write, and replication
operations on storage drives. The default BlueStore back end stores objects
in a monolithic, database-like fashion.
.. ditaa::

   /------\       +-----+       +-----+
   | obj  |------>| {d} |------>| {s} |
   \------/       +-----+       +-----+

    Object         OSD           Drive
Ceph OSD Daemons store data as objects in a flat namespace. This means that
objects are not stored in a hierarchy of directories. An object has an
identifier, binary data, and metadata consisting of name/value pairs.
:term:`Ceph Client`\s determine the semantics of the object data. For example,
CephFS uses metadata to store file attributes such as the file owner, the
created date, and the last modified date.

.. ditaa::
@ -102,20 +94,23 @@ forth.
.. index:: architecture; high availability, scalability

.. _arch_scalability_and_high_availability:

Scalability and High Availability
---------------------------------
In traditional architectures, clients talk to a centralized component. This
centralized component might be a gateway, a broker, an API, or a facade. A
centralized component of this kind acts as a single point of entry to a complex
subsystem. Architectures that rely upon such a centralized component have a
single point of failure and incur limits to performance and scalability. If
the centralized component goes down, the whole system becomes unavailable.

Ceph eliminates this centralized component. This enables clients to interact
with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes
to ensure data safety and high availability. Ceph also uses a cluster of
monitors to ensure high availability. To eliminate centralization, Ceph uses an
algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`.
.. index:: CRUSH; architecture
@ -124,15 +119,15 @@ CRUSH Introduction
~~~~~~~~~~~~~~~~~~
Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
Replication Under Scalable Hashing)` algorithm to compute information about
object location instead of relying upon a central lookup table. CRUSH provides
a better data management mechanism than do older approaches, and CRUSH enables
massive scale by distributing the work to all the OSD daemons in the cluster
and all the clients that communicate with them. CRUSH uses intelligent data
replication to ensure resiliency, which is better suited to hyper-scale
storage. The following sections provide additional details on how CRUSH works.
For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable,
Decentralized Placement of Replicated Data`_.

.. index:: architecture; cluster map
@ -141,108 +136,129 @@ Placement of Replicated Data`_.
Cluster Map
~~~~~~~~~~~

In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs
must have current information about the cluster's topology. Current information
is stored in the "Cluster Map", which is in fact a collection of five maps. The
five maps that constitute the cluster map are:

#. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name,
   the address, and the TCP port of each monitor. The monitor map specifies the
   current epoch, the time of the monitor map's creation, and the time of the
   monitor map's last modification. To view a monitor map, run ``ceph mon
   dump``.

#. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's
   creation, the time of the OSD map's last modification, a list of pools, a
   list of replica sizes, a list of PG numbers, and a list of OSDs and their
   statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph
   osd dump``.

#. **The PG Map:** Contains the PG version, its time stamp, the last OSD map
   epoch, the full ratios, and the details of each placement group. This
   includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for
   example, ``active + clean``), and data usage statistics for each pool.

#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
   hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``),
   and rules for traversing the hierarchy when storing data. To view a CRUSH
   map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by
   running ``crushtool -d {comp-crushmap-filename} -o
   {decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the
   decompiled map.

#. **The MDS Map:** Contains the current MDS map epoch, when the map was
   created, and the last time it changed. It also contains the pool for
   storing metadata, a list of metadata servers, and which metadata servers
   are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.

Each map maintains a history of changes to its operating state. Ceph Monitors
maintain a master copy of the cluster map. This master copy includes the
cluster members, the state of the cluster, changes to the cluster, and
information recording the overall health of the Ceph Storage Cluster.
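For convenience, each of the maps described above can be inspected from the
command line. This is only a summary of the commands already noted in the list;
the PG map subcommand shown is one of several output modes:

    ceph mon dump                        # monitor map
    ceph osd dump                        # OSD map
    ceph pg dump summary                 # PG map (summary)
    ceph osd getcrushmap -o crush.bin    # CRUSH map (compiled)
    crushtool -d crush.bin -o crush.txt  # decompile for reading
    ceph fs dump                         # MDS map
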
.. index:: high availability; monitor architecture
High Availability Monitors
~~~~~~~~~~~~~~~~~~~~~~~~~~

A Ceph Client must contact a Ceph Monitor and obtain a current copy of the
cluster map in order to read data from or to write data to the Ceph cluster.

It is possible for a Ceph cluster to function properly with only a single
monitor, but a Ceph cluster that has only a single monitor has a single point
of failure: if the monitor goes down, Ceph clients will be unable to read data
from or write data to the cluster.

Ceph leverages a cluster of monitors in order to increase reliability and fault
tolerance. When a cluster of monitors is used, however, one or more of the
monitors in the cluster can fall behind due to latency or other faults. Ceph
mitigates these negative effects by requiring multiple monitor instances to
agree about the state of the cluster. To establish consensus among the monitors
regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a
majority of monitors (for example, one in a cluster that contains only one
monitor, two in a cluster that contains three monitors, three in a cluster that
contains five monitors, four in a cluster that contains six monitors, and so
on).

See the `Monitor Config Reference`_ for more detail on configuring monitors.
.. index:: architecture; high availability authentication
.. _arch_high_availability_authentication:
High Availability Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``cephx`` authentication system is used by Ceph to authenticate users and
daemons and to protect against man-in-the-middle attacks.

.. note:: The ``cephx`` protocol does not address data encryption in transport
   (for example, SSL/TLS) or encryption at rest.

``cephx`` uses shared secret keys for authentication. This means that both the
client and the monitor cluster keep a copy of the client's secret key.

The ``cephx`` protocol makes it possible for each party to prove to the other
that it has a copy of the key without revealing it. This provides mutual
authentication and allows the cluster to confirm (1) that the user has the
secret key and (2) that the user can be confident that the cluster has a copy
of the secret key.

As stated in :ref:`Scalability and High Availability
<arch_scalability_and_high_availability>`, Ceph does not have any centralized
interface between clients and the Ceph object store. By avoiding such a
centralized interface, Ceph avoids the bottlenecks that attend such centralized
interfaces. However, this means that clients must interact directly with OSDs.
Direct interactions between Ceph clients and OSDs require authenticated
connections. The ``cephx`` authentication system establishes and sustains these
authenticated connections.

The ``cephx`` protocol operates in a manner similar to `Kerberos`_.

A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each
monitor can authenticate users and distribute keys, which means that there is
no single point of failure and no bottleneck when using ``cephx``. The monitor
returns an authentication data structure that is similar to a Kerberos ticket.
This authentication data structure contains a session key for use in obtaining
Ceph services. The session key is itself encrypted with the user's permanent
secret key, which means that only the user can request services from the Ceph
Monitors. The client then uses the session key to request services from the
monitors, and the monitors provide the client with a ticket that authenticates
the client against the OSDs that actually handle data. Ceph Monitors and OSDs
share a secret, which means that the clients can use the ticket provided by the
monitors to authenticate against any OSD or metadata server in the cluster.

Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an
expired ticket or session key that has been obtained surreptitiously. This form
of authentication prevents attackers who have access to the communications
medium from creating bogus messages under another user's identity and prevents
attackers from altering another user's legitimate messages, as long as the
user's secret key is not divulged before it expires.

An administrator must set up users before using ``cephx``. In the following
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
the command line to generate a username and secret key. Ceph's ``auth``
subsystem generates the username and key, stores a copy on the monitor(s), and
transmits the user's secret back to the ``client.admin`` user. This means that
the client and the monitor share a secret key.
@ -262,17 +278,16 @@ the client and the monitor share a secret key.
Here is how a client authenticates with a monitor. The client passes the user
name to the monitor. The monitor generates a session key that is encrypted with
the secret key associated with the ``username``. The monitor transmits the
encrypted ticket to the client. The client uses the shared secret key to
decrypt the payload. The session key identifies the user, and this act of
identification will last for the duration of the session. The client requests
a ticket for the user, and the ticket is signed with the session key. The
monitor generates a ticket and uses the user's secret key to encrypt it. The
encrypted ticket is transmitted to the client. The client decrypts the ticket
and uses it to sign requests to OSDs and to metadata servers in the cluster.
.. ditaa::
@ -302,10 +317,11 @@ cluster.
The ``cephx`` protocol authenticates ongoing communications between the clients
and Ceph daemons. After initial authentication, each message sent between a
client and a daemon is signed using a ticket that can be verified by monitors,
OSDs, and metadata daemons. This ticket is verified by using the secret shared
between the client and the daemon.
.. ditaa::
@ -341,83 +357,93 @@ monitors, OSDs and metadata servers can verify with their shared secret.
|<-------------------------------------------|
receive response
This authentication protects only the connections between Ceph clients and Ceph
daemons. The authentication is not extended beyond the Ceph client. If a user
accesses the Ceph client from a remote host, cephx authentication will not be
applied to the connection between the user's host and the client host.

See `Cephx Config Guide`_ for more on configuration details.

See `User Management`_ for more on user management.

See :ref:`A Detailed Description of the Cephx Authentication Protocol
<cephx_2012_peter>` for more on the distinction between authorization and
authentication and for a step-by-step explanation of the setup of ``cephx``
tickets and session keys.
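For example, an administrator might create a user and fetch its key as described
earlier in this section. The entity name and capabilities below are purely
illustrative; grant only the capabilities your deployment actually needs:

    # create (or fetch) a key for a new client user
    ceph auth get-or-create-key client.example mon 'allow r' osd 'allow rw pool=mypool'

    # inspect the entity and its capabilities
    ceph auth get client.example
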
.. index:: architecture; smart daemons and scalability
Smart Daemons Enable Hyperscale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A feature of many storage clusters is a centralized interface that keeps track
of the nodes that clients are permitted to access. Such centralized
architectures provide services to clients by means of a double dispatch. At the
petabyte-to-exabyte scale, such double dispatches are a significant
bottleneck.
Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are
cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph
OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly
with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being
cluster-aware makes it possible for Ceph clients to interact directly with Ceph
OSD Daemons.

Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one
another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM
resources of the nodes in the Ceph cluster. This means that a Ceph cluster can
easily perform tasks that a cluster with a centralized interface would struggle
to perform. The ability of Ceph nodes to make use of the computing power of
the greater cluster provides several benefits:
#. **OSDs Service Clients Directly:** Network devices can support only a
   limited number of concurrent connections. Because Ceph clients contact
   Ceph OSD daemons directly without first connecting to a central interface,
   Ceph enjoys improved performance and increased system capacity relative to
   storage redundancy strategies that include a central interface. Ceph clients
   maintain sessions only when needed, and maintain those sessions with only
   particular Ceph OSD daemons, not with a centralized interface.
#. **OSDs Service Clients Directly:** Since any network device has a limit to #. **OSD Membership and Status**: When Ceph OSD Daemons join a cluster, they
the number of concurrent connections it can support, a centralized system report their status. At the lowest level, the Ceph OSD Daemon status is
has a low physical limit at high scales. By enabling Ceph Clients to contact ``up`` or ``down``: this reflects whether the Ceph OSD daemon is running and
Ceph OSD Daemons directly, Ceph increases both performance and total system able to service Ceph Client requests. If a Ceph OSD Daemon is ``down`` and
capacity simultaneously, while removing a single point of failure. Ceph ``in`` the Ceph Storage Cluster, this status may indicate the failure of the
Clients can maintain a session when they need to, and with a particular Ceph Ceph OSD Daemon. If a Ceph OSD Daemon is not running because it has crashed,
OSD Daemon instead of a centralized server. the Ceph OSD Daemon cannot notify the Ceph Monitor that it is ``down``. The
OSDs periodically send messages to the Ceph Monitor (in releases prior to
Luminous, this was done by means of ``MPGStats``, and beginning with the
Luminous release, this has been done with ``MOSDBeacon``). If the Ceph
Monitors receive no such message after a configurable period of time,
then they mark the OSD ``down``. This mechanism is a failsafe, however.
Normally, Ceph OSD Daemons determine if a neighboring OSD is ``down`` and
report it to the Ceph Monitors. This contributes to making Ceph Monitors
lightweight processes. See `Monitoring OSDs`_ and `Heartbeats`_ for
additional details.
#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report #. **Data Scrubbing:** To maintain data consistency, Ceph OSD Daemons scrub
on their status. At the lowest level, the Ceph OSD Daemon status is ``up`` RADOS objects. Ceph OSD Daemons compare the metadata of their own local
or ``down`` reflecting whether or not it is running and able to service objects against the metadata of the replicas of those objects, which are
Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph stored on other OSDs. Scrubbing occurs on a per-Placement-Group basis, finds
Storage Cluster, this status may indicate the failure of the Ceph OSD mismatches in object size and finds metadata mismatches, and is usually
Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD performed daily. Ceph OSD Daemons perform deeper scrubbing by comparing the
Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs data in objects, bit-for-bit, against their checksums. Deep scrubbing finds
periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous, bad sectors on drives that are not detectable with light scrubs. See `Data
and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that Scrubbing`_ for details on configuring scrubbing.
message after a configurable period of time then it marks the OSD down.
This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
determine if a neighboring OSD is down and report it to the Ceph Monitor(s).
This assures that Ceph Monitors are lightweight processes. See `Monitoring
OSDs`_ and `Heartbeats`_ for additional details.
#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness, #. **Replication:** Data replication involves a collaboration between Ceph
Ceph OSD Daemons can scrub objects. That is, Ceph OSD Daemons can compare Clients and Ceph OSD Daemons. Ceph OSD Daemons use the CRUSH algorithm to
their local objects metadata with its replicas stored on other OSDs. Scrubbing determine the storage location of object replicas. Ceph clients use the
happens on a per-Placement Group base. Scrubbing (usually performed daily) CRUSH algorithm to determine the storage location of an object, then the
catches mismatches in size and other metadata. Ceph OSD Daemons also perform deeper object is mapped to a pool and to a placement group, and then the client
scrubbing by comparing data in objects bit-for-bit with their checksums. consults the CRUSH map to identify the placement group's primary OSD.
Deep scrubbing (usually performed weekly) finds bad sectors on a drive that
weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
configuring scrubbing.
#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH After identifying the target placement group, the client writes the object
algorithm, but the Ceph OSD Daemon uses it to compute where replicas of to the identified placement group's primary OSD. The primary OSD then
objects should be stored (and for rebalancing). In a typical write scenario, consults its own copy of the CRUSH map to identify secondary and tertiary
a client uses the CRUSH algorithm to compute where to store an object, maps OSDS, replicates the object to the placement groups in those secondary and
the object to a pool and placement group, then looks at the CRUSH map to tertiary OSDs, confirms that the object was stored successfully in the
identify the primary OSD for the placement group. secondary and tertiary OSDs, and reports to the client that the object
was stored successfully.
The client writes the object to the identified placement group in the
primary OSD. Then, the primary OSD with its own copy of the CRUSH map
identifies the secondary and tertiary OSDs for replication purposes, and
replicates the object to the appropriate placement groups in the secondary
and tertiary OSDs (as many OSDs as additional replicas), and responds to the
client once it has confirmed the object was stored successfully.
.. ditaa:: .. ditaa::
@ -444,19 +470,18 @@ ability to leverage this computing power leads to several major benefits:
                 |               | |               |
                 +---------------+ +---------------+

   By performing this act of data replication, Ceph OSD Daemons relieve Ceph
   clients of the burden of replicating data.
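The primary-copy write path described above can be sketched in a few lines of
Python. This is an illustrative model only (it is not Ceph code, and the class
and function names are invented for the example): the client sends the write to
the primary OSD alone, and the primary fans the object out to the secondary and
tertiary OSDs before the write is acknowledged.

.. code-block:: python

    # Minimal sketch, assuming a 3-way replicated acting set.
    class OSD:
        def __init__(self, name):
            self.name = name
            self.store = {}

        def write(self, obj_name, data):
            self.store[obj_name] = data
            return True

    def client_write(obj_name, data, acting_set):
        """acting_set[0] is the primary; the rest are replicas."""
        primary, replicas = acting_set[0], acting_set[1:]
        primary.write(obj_name, data)                            # client -> primary
        acks = [osd.write(obj_name, data) for osd in replicas]   # primary -> replicas
        return all(acks)                                         # ack to client only after all replicas store

    acting = [OSD("osd.25"), OSD("osd.32"), OSD("osd.61")]
    assert client_write("john", b"hello", acting)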
Dynamic Cluster Management
--------------------------

In the `Scalability and High Availability`_ section, we explained how Ceph uses
CRUSH, cluster topology, and intelligent daemons to scale and maintain high
availability. Key to Ceph's design is the autonomous, self-healing, and
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
enable modern cloud storage infrastructures to place data, rebalance the
cluster, and recover from faults adaptively.

.. index:: architecture; pools
@ -466,9 +491,10 @@ About Pools
The Ceph storage system supports the notion of 'Pools', which are logical
partitions for storing objects.

Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write RADOS
objects to pools. The way that Ceph places the data in the pools is determined
by the pool's ``size`` or number of replicas, the CRUSH rule, and the number of
placement groups in the pool.

.. ditaa::
@ -501,20 +527,23 @@ See `Set Pool Values`_ for details.
Mapping PGs to OSDs
~~~~~~~~~~~~~~~~~~~

Each pool has a number of placement groups (PGs) within it. CRUSH dynamically
maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS
object to a PG.

This mapping of RADOS objects to PGs implements an abstraction and indirection
layer between Ceph OSD Daemons and Ceph Clients. The Ceph Storage Cluster must
be able to grow (or shrink) and redistribute data adaptively when the internal
topology changes.

If the Ceph Client "knew" which Ceph OSD Daemons were storing which objects, a
tight coupling would exist between the Ceph Client and the Ceph OSD Daemon. But
Ceph avoids any such tight coupling. Instead, the CRUSH algorithm maps each
RADOS object to a placement group and then maps each placement group to one or
more Ceph OSD Daemons. This "layer of indirection" allows Ceph to rebalance
dynamically when new Ceph OSD Daemons and their underlying OSD devices come
online. The following diagram shows how the CRUSH algorithm maps objects to
placement groups, and how it maps placement groups to OSDs.

.. ditaa::
@ -540,44 +569,45 @@ groups, and placement groups to OSDs.
      |            |            |            |
   \----------/ \----------/ \----------/ \----------/

The client uses its copy of the cluster map and the CRUSH algorithm to compute
precisely which OSD it will use when reading or writing a particular object.
.. index:: architecture; calculating PG IDs

Calculating PG IDs
~~~~~~~~~~~~~~~~~~

When a Ceph Client binds to a Ceph Monitor, it retrieves the latest version of
the `Cluster Map`_. When a client has been equipped with a copy of the cluster
map, it is aware of all the monitors, OSDs, and metadata servers in the
cluster. **However, even equipped with a copy of the latest version of the
cluster map, the client doesn't know anything about object locations.**

**Object locations must be computed.**

The client requires only the object ID and the name of the pool in order to
compute the object location.

Ceph stores data in named pools (for example, "liverpool"). When a client
stores a named object (for example, "john", "paul", "george", or "ringo") it
calculates a placement group by using the object name, a hash code, the number
of PGs in the pool, and the pool name. Ceph clients use the following steps to
compute PG IDs.

#. The client inputs the pool name and the object ID. (for example: pool =
   "liverpool" and object-id = "john")
#. Ceph hashes the object ID.
#. Ceph calculates the hash, modulo the number of PGs (for example: ``58``), to
   get a PG ID.
#. Ceph uses the pool name to retrieve the pool ID: (for example: "liverpool" =
   ``4``)
#. Ceph prepends the pool ID to the PG ID (for example: ``4.58``).

It is much faster to compute object locations than to perform an object
location query over a chatty session. The :abbr:`CRUSH (Controlled Replication
Under Scalable Hashing)` algorithm allows a client to compute where objects are
expected to be stored, and enables the client to contact the primary OSD to
store or retrieve the objects.
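The steps above can be modeled in a few lines of Python. This is a hedged,
simplified illustration only: Ceph's real implementation hashes object names
with its own rjenkins-based hash and uses a "stable modulo" that behaves well
when the PG count is not a power of two, so the CRC and plain modulo below are
stand-ins for those details.

.. code-block:: python

    # Simplified illustration of the PG ID calculation described above.
    import zlib

    def compute_pg_id(pool_id, object_id, pg_num):
        """Return a PG ID string such as '4.3a'."""
        obj_hash = zlib.crc32(object_id.encode())  # step 2: hash the object ID
        pg = obj_hash % pg_num                     # step 3: hash modulo the number of PGs
        return "{}.{:x}".format(pool_id, pg)       # step 5: prepend the pool ID

    # Example: object "john" in pool "liverpool" (pool ID 4) with 128 PGs
    print(compute_pg_id(4, "john", 128))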
.. index:: architecture; PG Peering
@ -585,46 +615,51 @@ Peering and Sets
~~~~~~~~~~~~~~~~

In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to Ceph Monitors. Ceph OSD daemons also 'peer',
which is the process of bringing all of the OSDs that store a Placement Group
(PG) into agreement about the state of all of the RADOS objects (and their
metadata) in that PG. Ceph OSD Daemons `Report Peering Failure`_ to the Ceph
Monitors. Peering issues usually resolve themselves; however, if the problem
persists, you may need to refer to the `Troubleshooting Peering Failure`_
section.

.. Note:: PGs that agree on the state of the cluster do not necessarily have
   the current data yet.

The Ceph Storage Cluster was designed to store at least two copies of an object
(that is, ``size = 2``), which is the minimum requirement for data safety. For
high availability, a Ceph Storage Cluster should store more than two copies of
an object (that is, ``size = 3`` and ``min size = 2``) so that it can continue
to run in a ``degraded`` state while maintaining data safety.

.. warning:: Although we say here that R2 (replication with two copies) is the
   minimum requirement for data safety, R3 (replication with three copies) is
   recommended. On a long enough timeline, data stored with an R2 strategy will
   be lost.

As explained in the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (for example, ``osd.0``, ``osd.1``,
etc.), but rather refer to them as *Primary*, *Secondary*, and so forth. By
convention, the *Primary* is the first OSD in the *Acting Set*, and is
responsible for orchestrating the peering process for each placement group
where it acts as the *Primary*. The *Primary* is the **ONLY** OSD in a given
placement group that accepts client-initiated writes to objects.

The set of OSDs that is responsible for a placement group is called the
*Acting Set*. The term "*Acting Set*" can refer either to the Ceph OSD Daemons
that are currently responsible for the placement group, or to the Ceph OSD
Daemons that were responsible for a particular placement group as of some
epoch.

The Ceph OSD daemons that are part of an *Acting Set* might not always be
``up``. When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*.
The *Up Set* is an important distinction, because Ceph can remap PGs to other
Ceph OSD Daemons when an OSD fails.

.. note:: Consider a hypothetical *Acting Set* for a PG that contains
   ``osd.25``, ``osd.32`` and ``osd.61``. The first OSD (``osd.25``) is the
   *Primary*. If that OSD fails, the Secondary (``osd.32``) becomes the
   *Primary*, and ``osd.25`` is removed from the *Up Set*.
.. index:: architecture; Rebalancing
@ -1468,8 +1503,8 @@ Ceph Clients
Ceph Clients include a number of service interfaces. These include:

- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
  provides resizable, thin-provisioned block devices that can be snapshotted
  and cloned. Ceph stripes a block device across the cluster for high
  performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
  that uses ``librbd`` directly--avoiding the kernel object overhead for
  virtualized systems.
@ -11,9 +11,9 @@ Run a command of this form to list hosts associated with the cluster:
.. prompt:: bash #

    ceph orch host ls [--format yaml] [--host-pattern <name>] [--label <label>] [--host-status <status>] [--detail]

In commands of this form, the arguments "host-pattern", "label", and
"host-status" are optional and are used for filtering.

- "host-pattern" is a regex that matches against hostnames and returns only
@ -25,6 +25,16 @@ In commands of this form, the arguments "host-pattern", "label" and
against name, label and status simultaneously, or to filter against any
proper subset of name, label and status.

The "detail" parameter provides more host-related information for cephadm-based
clusters. For example:

.. prompt:: bash #

    # ceph orch host ls --detail
    HOSTNAME     ADDRESS         LABELS  STATUS  VENDOR/MODEL                            CPU    HDD      SSD  NIC
    ceph-master  192.168.122.73  _admin          QEMU (Standard PC (Q35 + ICH9, 2009))   4C/4T  4/1.6TB  -    1
    1 hosts in cluster
.. _cephadm-adding-hosts:

Adding Hosts
@ -193,10 +203,18 @@ Place a host in and out of maintenance mode (stops all Ceph daemons on host):
.. prompt:: bash #

    ceph orch host maintenance enter <hostname> [--force] [--yes-i-really-mean-it]
    ceph orch host maintenance exit <hostname>

The ``--force`` flag allows the user to bypass warnings (but not alerts). The
``--yes-i-really-mean-it`` flag bypasses all safety checks and will attempt to
force the host into maintenance mode no matter what.

.. warning:: Using the ``--yes-i-really-mean-it`` flag to force the host to enter
   maintenance mode can potentially cause loss of data availability, the mon
   quorum to break down due to too few running monitors, mgr module commands
   (such as ``ceph orch . . .`` commands) to become unresponsive, and a number
   of other possible issues. Please only use this flag if you're absolutely
   certain you know what you're doing.

See also :ref:`cephadm-fqdn`
@ -269,7 +287,7 @@ create a new CRUSH host located in the specified hierarchy.
.. note::
   The ``location`` attribute will only affect the initial CRUSH location.
   Subsequent changes of the ``location`` property will be ignored. Also,
   removing a host will not remove any CRUSH buckets.

See also :ref:`crush_map_default_types`.
@ -142,6 +142,9 @@ cluster's first "monitor daemon", and that monitor daemon needs an IP address.
You must pass the IP address of the Ceph cluster's first host to the ``ceph
bootstrap`` command, so you'll need to know the IP address of that host.

.. important:: ``ssh`` must be installed and running in order for the
   bootstrapping procedure to succeed.

.. note:: If there are multiple networks and interfaces, be sure to choose one
   that will be accessible by any host accessing the Ceph cluster.
@ -288,18 +291,21 @@ its status with:
Adding Hosts
============

Add all hosts to the cluster by following the instructions in
:ref:`cephadm-adding-hosts`.

By default, a ``ceph.conf`` file and a copy of the ``client.admin`` keyring are
maintained in ``/etc/ceph`` on all hosts that have the ``_admin`` label. This
label is initially applied only to the bootstrap host. We usually recommend
that one or more other hosts be given the ``_admin`` label so that the Ceph CLI
(for example, via ``cephadm shell``) is easily accessible on multiple hosts. To
add the ``_admin`` label to additional host(s), run a command of the following
form:

.. prompt:: bash #

    ceph orch host label add *<host>* _admin

Adding additional MONs
======================
@ -676,6 +676,22 @@ To disable the automatic management of dameons, set ``unmanaged=True`` in the
    ceph orch apply -i mgr.yaml

Cephadm also supports setting the unmanaged parameter to true or false
using the ``ceph orch set-unmanaged`` and ``ceph orch set-managed`` commands.
The commands take the service name (as reported in ``ceph orch ls``) as
the only argument. For example,

.. prompt:: bash #

    ceph orch set-unmanaged mon

would set ``unmanaged: true`` for the mon service and

.. prompt:: bash #

    ceph orch set-managed mon

would set ``unmanaged: false`` for the mon service.
.. note::
@ -683,6 +699,13 @@ To disable the automatic management of dameons, set ``unmanaged=True`` in the
   longer deploy any new daemons (even if the placement specification matches
   additional hosts).

.. note::

   The "osd" service used to track OSDs that are not tied to any specific
   service spec is special and will always be marked unmanaged. Attempting
   to modify it with ``ceph orch set-unmanaged`` or ``ceph orch set-managed``
   will result in a message ``No service of name osd found. Check "ceph orch ls" for all known services``

Deploying a daemon on a host manually
-------------------------------------
@ -20,7 +20,18 @@ For example:
    ceph fs volume create <fs_name> --placement="<placement spec>"

where ``fs_name`` is the name of the CephFS and ``placement`` is a
:ref:`orchestrator-cli-placement-spec`. For example, to place
MDS daemons for the new ``foo`` volume on hosts labeled with ``mds``:

.. prompt:: bash #

    ceph fs volume create foo --placement="label:mds"

You can also update the placement after-the-fact via:

.. prompt:: bash #

    ceph orch apply mds foo 'mds-[012]'

For manually deploying MDS daemons, use this specification:
@ -30,6 +41,7 @@ For manually deploying MDS daemons, use this specification:
    service_id: fs_name
    placement:
      count: 3
      label: mds

The specification can then be applied using:
@ -4,8 +4,8 @@
MGR Service
===========

The cephadm MGR service hosts multiple modules. These include the
:ref:`mgr-dashboard` and the cephadm manager module.

.. _cephadm-mgr-networks:
@ -170,6 +170,64 @@ network ``10.1.2.0/24``, run the following commands:
    ceph orch apply mon --placement="newhost1,newhost2,newhost3"

Setting Crush Locations for Monitors
------------------------------------

Cephadm supports setting CRUSH locations for mon daemons
using the mon service spec. The CRUSH locations are set
by hostname. When cephadm deploys a mon on a host that matches
a hostname specified in the CRUSH locations, it will add
``--set-crush-location <CRUSH-location>`` where the CRUSH location
is the first entry in the list of CRUSH locations for that
host. If multiple CRUSH locations are set for one host, cephadm
will attempt to set the additional locations using the
"ceph mon set_location" command.

.. note::

   Setting the CRUSH location in the spec is the recommended way of
   replacing tiebreaker mon daemons, as they require having a location
   set when they are added.

.. note::

   Tiebreaker mon daemons are a part of stretch mode clusters. For more
   info on stretch mode clusters see :ref:`stretch_mode`

Example syntax for setting the CRUSH locations:

.. code-block:: yaml

    service_type: mon
    service_name: mon
    placement:
      count: 5
    spec:
      crush_locations:
        host1:
          - datacenter=a
        host2:
          - datacenter=b
          - rack=2
        host3:
          - datacenter=a

.. note::

   Sometimes, based on the timing of mon daemons being admitted to the mon
   quorum, cephadm may fail to set the CRUSH location for some mon daemons
   when multiple locations are specified. In this case, the recommended
   action is to re-apply the same mon spec to retrigger the service action.

.. note::

   Mon daemons will only get the ``--set-crush-location`` flag set when cephadm
   actually deploys them. This means if a spec is applied that includes a CRUSH
   location for a mon that is already deployed, the flag may not be set until
   a redeploy command is issued for that mon daemon.

Further Reading
===============
@ -197,12 +197,26 @@ configuration files for monitoring services.
Internally, cephadm already uses `Jinja2
<https://jinja.palletsprojects.com/en/2.11.x/>`_ templates to generate the
configuration files for all monitoring components. Starting from version 17.2.3,
cephadm supports Prometheus http service discovery, and uses this endpoint for the
definition and management of the embedded Prometheus service. The endpoint listens on
``https://<mgr-ip>:8765/sd/`` (the port is
configurable through the variable ``service_discovery_port``) and returns scrape target
information in `http_sd_config format
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config/>`_

Customers with an external monitoring stack can use the `ceph-mgr` service discovery
endpoint to get scraping configuration. The root certificate of the server can be
obtained with the following command:

.. prompt:: bash #

    ceph orch sd dump cert

The configuration of Prometheus, Grafana, or Alertmanager may be customized by storing
a Jinja2 template for each service. This template will be evaluated every time a service
of that kind is deployed or reconfigured. That way, the custom configuration is preserved
and automatically applied on future deployments of these services.

.. note::
@ -292,6 +306,21 @@ cluster.
By default, ceph-mgr presents prometheus metrics on port 9283 on each host
running a ceph-mgr daemon. Configure prometheus to scrape these.

To make this integration easier, cephadm provides a service discovery endpoint at
``https://<mgr-ip>:8765/sd/``. This endpoint can be used by an external
Prometheus server to retrieve target information for a specific service. Information
returned by this endpoint uses the format specified by the Prometheus `http_sd_config option
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config/>`_

Here's an example prometheus job definition that uses the cephadm service discovery endpoint:

.. code-block:: yaml

    - job_name: 'ceph-exporter'
      http_sd_configs:
      - url: http://<mgr-ip>:8765/sd/prometheus/sd-config?service=ceph-exporter
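If you want to see what the endpoint returns before wiring it into Prometheus,
a short script can fetch and print the advertised targets. This is an
illustrative sketch only: the hostname, certificate path, and service name are
placeholders, and it assumes the endpoint serves the ``http_sd_config`` JSON
format described above (a list of objects carrying ``targets`` and ``labels``).

.. code-block:: python

    # Hedged example: print the scrape targets advertised by the cephadm
    # service discovery endpoint. The root certificate is assumed to have been
    # saved beforehand (for example with `ceph orch sd dump cert`).
    import json
    import ssl
    import urllib.request

    ctx = ssl.create_default_context(cafile="./root_cert.pem")
    url = "https://mgr.example.com:8765/sd/prometheus/sd-config?service=ceph-exporter"

    with urllib.request.urlopen(url, context=ctx) as resp:
        for entry in json.load(resp):                     # http_sd_config: list of target groups
            print(entry["targets"], entry.get("labels", {}))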
* To enable the dashboard's prometheus-based alerting, see :ref:`dashboard-alerting`.
* To enable dashboard integration with Grafana, see :ref:`dashboard-grafana`.
@ -429,6 +458,28 @@ Then apply this specification:
Grafana will now create an admin user called ``admin`` with the
given password.

Turning off anonymous access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, cephadm allows anonymous users (users who have not provided any
login information) limited, viewer-only access to the grafana dashboard. In
order to set up grafana to only allow viewing from logged-in users, you can
set ``anonymous_access: False`` in your grafana spec.

.. code-block:: yaml

    service_type: grafana
    placement:
      hosts:
        - host1
    spec:
      anonymous_access: False
      initial_admin_password: "mypassword"

Since deploying grafana with anonymous access set to false without an initial
admin password set would make the dashboard inaccessible, cephadm requires
setting the ``initial_admin_password`` when ``anonymous_access`` is set to false.

Setting up Alertmanager
-----------------------
@ -113,6 +113,54 @@ A few notes:
   a *port* property that is not 2049 to avoid conflicting with the
   ingress service, which could be placed on the same host(s).

NFS with virtual IP but no haproxy
----------------------------------

Cephadm also supports deploying nfs with keepalived but not haproxy. This
offers a virtual IP supported by keepalived that the nfs daemon can directly bind
to instead of having traffic go through haproxy.

In this setup, you'll either want to set up the service using the nfs module
(see :ref:`nfs-module-cluster-create`) or place the ingress service first, so
the virtual IP is present for the nfs daemon to bind to. The ingress service
should include the attribute ``keepalive_only`` set to true. For example:

.. code-block:: yaml

    service_type: ingress
    service_id: nfs.foo
    placement:
      count: 1
      hosts:
      - host1
      - host2
      - host3
    spec:
      backend_service: nfs.foo
      monitor_port: 9049
      virtual_ip: 192.168.122.100/24
      keepalive_only: true

Then, an nfs service could be created that specifies a ``virtual_ip`` attribute
that will tell it to bind to that specific IP.

.. code-block:: yaml

    service_type: nfs
    service_id: foo
    placement:
      count: 1
      hosts:
      - host1
      - host2
      - host3
    spec:
      port: 2049
      virtual_ip: 192.168.122.100

Note that in these setups, one should make sure to include ``count: 1`` in the
nfs placement, as it's only possible for one nfs daemon to bind to the virtual IP.

Further Reading
===============
@ -308,7 +308,7 @@ Replacing an OSD
.. prompt:: bash #

    ceph orch osd rm <osd_id(s)> --replace [--force]

Example:
@ -83,23 +83,23 @@ To deploy RGWs serving the multisite *myorg* realm and the *us-east-1* zone on
.. prompt:: bash #

    ceph orch apply rgw east --realm=myorg --zonegroup=us-east-zg-1 --zone=us-east-1 --placement="2 myhost1 myhost2"

Note that in a multisite situation, cephadm only deploys the daemons. It does not create
or update the realm or zone configurations. To create new realms, zones, and zonegroups,
you can use :ref:`mgr-rgw-module` or create them manually using something like:

.. prompt:: bash #

    radosgw-admin realm create --rgw-realm=<realm-name>

.. prompt:: bash #

    radosgw-admin zonegroup create --rgw-zonegroup=<zonegroup-name> --master

.. prompt:: bash #

    radosgw-admin zone create --rgw-zonegroup=<zonegroup-name> --rgw-zone=<zone-name> --master

.. prompt:: bash #
@ -217,6 +217,8 @@ It is a yaml format file with the following properties:
    frontend_port: <integer>            # ex: 8080
    monitor_port: <integer>             # ex: 1967, used by haproxy for load balancer status
    virtual_interface_networks: [ ... ] # optional: list of CIDR networks
    use_keepalived_multicast: <bool>    # optional: Default is False.
    vrrp_interface_network: <string>/<string> # optional: ex: 192.168.20.0/24
    ssl_cert: |                         # optional: SSL certificate and key
      -----BEGIN CERTIFICATE-----
      ...
@ -243,6 +245,7 @@ It is a yaml format file with the following properties:
    frontend_port: <integer>            # ex: 8080
    monitor_port: <integer>             # ex: 1967, used by haproxy for load balancer status
    virtual_interface_networks: [ ... ] # optional: list of CIDR networks
    first_virtual_router_id: <integer>  # optional: default 50
    ssl_cert: |                         # optional: SSL certificate and key
      -----BEGIN CERTIFICATE-----
      ...
@ -276,6 +279,21 @@ where the properties of this service specification are:
* ``ssl_cert``:
    SSL certificate, if SSL is to be enabled. This must contain both the certificate and
    private key blocks in .pem format.

* ``use_keepalived_multicast``
    Default is False. By default, cephadm will deploy keepalived config to use unicast IPs,
    using the IPs of the hosts. The IPs chosen will be the same IPs cephadm uses to connect
    to the machines. But if multicast is preferred, we can set ``use_keepalived_multicast``
    to ``True`` and Keepalived will use multicast IP (224.0.0.18) to communicate between instances,
    using the same interfaces as where the VIPs are.

* ``vrrp_interface_network``
    By default, cephadm will configure keepalived to use the same interface where the VIPs are
    for VRRP communication. If another interface is needed, it can be set via ``vrrp_interface_network``
    with a network to identify which ethernet interface to use.

* ``first_virtual_router_id``
    Default is 50. When deploying more than 1 ingress, this parameter can be used to ensure each
    keepalived will have a different virtual_router_id. In the case of using ``virtual_ips_list``,
    each IP will create its own virtual router. So the first one will have ``first_virtual_router_id``,
    the second one will have ``first_virtual_router_id`` + 1, etc. Valid values go from 1 to 255.
    (A small sketch of this numbering follows below.)
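As a hedged illustration of the numbering rule above (not cephadm's actual
implementation), the virtual router ID assigned to each entry of a hypothetical
``virtual_ips_list`` can be modeled like this:

.. code-block:: python

    # Illustrative only: derive consecutive virtual_router_ids from
    # first_virtual_router_id for a list of virtual IPs.
    def virtual_router_ids(first_virtual_router_id, virtual_ips_list):
        ids = {vip: first_virtual_router_id + i
               for i, vip in enumerate(virtual_ips_list)}
        if any(not 1 <= vrid <= 255 for vrid in ids.values()):
            raise ValueError("virtual_router_id values must stay within 1..255")
        return ids

    vips = ["192.168.122.100/24", "192.168.122.101/24", "192.168.122.102/24"]
    print(virtual_router_ids(50, vips))
    # {'192.168.122.100/24': 50, '192.168.122.101/24': 51, '192.168.122.102/24': 52}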
.. _ingress-virtual-ip:
@ -15,7 +15,7 @@ creation of multiple file systems use ``ceph fs flag set enable_multiple true``.
::

    ceph fs new <file system name> <metadata pool name> <data pool name>

This command creates a new file system. The file system name and metadata pool
name are self-explanatory. The specified data pool is the default data pool and
@ -25,19 +25,19 @@ to accommodate the new file system.
::

    ceph fs ls

List all file systems by name.

::

    ceph fs lsflags <file system name>

List all the flags set on a file system.

::

    ceph fs dump [epoch]

This dumps the FSMap at the given epoch (default: current) which includes all
file system settings, MDS daemons and the ranks they hold, and the list of
@ -46,7 +46,7 @@ standby MDS daemons.
::

    ceph fs rm <file system name> [--yes-i-really-mean-it]

Destroy a CephFS file system. This wipes information about the state of the
file system from the FSMap. The metadata pool and data pools are untouched and
@ -54,28 +54,28 @@ must be destroyed separately.
::

    ceph fs get <file system name>

Get information about the named file system, including settings and ranks. This
is a subset of the same information from the ``ceph fs dump`` command.

::

    ceph fs set <file system name> <var> <val>

Change a setting on a file system. These settings are specific to the named
file system and do not affect other file systems.

::

    ceph fs add_data_pool <file system name> <pool name/id>

Add a data pool to the file system. This pool can be used for file layouts
as an alternate location to store file data.

::

    ceph fs rm_data_pool <file system name> <pool name/id>

This command removes the specified pool from the list of data pools for the
file system. If any files have layouts for the removed data pool, the file
@ -84,7 +84,7 @@ system) cannot be removed.
::

    ceph fs rename <file system name> <new file system name> [--yes-i-really-mean-it]

Rename a Ceph file system. This also changes the application tags on the data
pools and metadata pool of the file system to the new file system name.
@ -98,7 +98,7 @@ Settings
::

    ceph fs set <fs name> max_file_size <size in bytes>

CephFS has a configurable maximum file size, and it's 1TB by default.
You may wish to set this limit higher if you expect to store large files
@ -132,13 +132,13 @@ Taking a CephFS cluster down is done by setting the down flag:
::

    ceph fs set <fs_name> down true

To bring the cluster back online:

::

    ceph fs set <fs_name> down false

This will also restore the previous value of max_mds. MDS daemons are brought
down in a way such that journals are flushed to the metadata pool and all
@ -149,11 +149,11 @@ Taking the cluster down rapidly for deletion or disaster recovery
-----------------------------------------------------------------

To allow rapidly deleting a file system (for testing) or to quickly bring the
file system and MDS daemons down, use the ``ceph fs fail`` command:

::

    ceph fs fail <fs_name>

This command sets a file system flag to prevent standbys from
activating on the file system (the ``joinable`` flag).
@ -162,7 +162,7 @@ This process can also be done manually by doing the following:
::

    ceph fs set <fs_name> joinable false

Then the operator can fail all of the ranks which causes the MDS daemons to
respawn as standbys. The file system will be left in a degraded state.
@ -170,7 +170,7 @@ respawn as standbys. The file system will be left in a degraded state.
::

    # For all ranks, 0-N:
    ceph mds fail <fs_name>:<n>

Once all ranks are inactive, the file system may also be deleted or left in
this state for other purposes (perhaps disaster recovery).
@ -179,7 +179,7 @@ To bring the cluster back up, simply set the joinable flag:
::

    ceph fs set <fs_name> joinable true

Daemons
@ -198,34 +198,35 @@ Commands to manipulate MDS daemons:
::

    ceph mds fail <gid/name/role>

Mark an MDS daemon as failed. This is equivalent to what the cluster
would do if an MDS daemon had failed to send a message to the mon
for ``mds_beacon_grace`` seconds. If the daemon was active and a suitable
standby is available, using ``ceph mds fail`` will force a failover to the
standby.

If the MDS daemon was in reality still running, then using ``ceph mds fail``
will cause the daemon to restart. If it was active and a standby was
available, then the "failed" daemon will return as a standby.

::

    ceph tell mds.<daemon name> command ...

Send a command to the MDS daemon(s). Use ``mds.*`` to send a command to all
daemons. Use ``ceph tell mds.* help`` to learn available commands.

::

    ceph mds metadata <gid/name/role>

Get metadata about the given MDS known to the Monitors.

::

    ceph mds repaired <role>

Mark the file system rank as repaired. Unlike the name suggests, this command
does not change a MDS; it manipulates the file system rank which has been
@ -244,14 +245,14 @@ Commands to manipulate required client features of a file system:
::

    ceph fs required_client_features <fs name> add reply_encoding
    ceph fs required_client_features <fs name> rm reply_encoding

To list all CephFS features

::

    ceph fs feature ls

Clients that are missing newly added features will be evicted automatically.
@ -346,7 +347,7 @@ Global settings
::

    ceph fs flag set <flag name> <flag val> [<confirmation string>]

Sets a global CephFS flag (i.e. not specific to a particular file system).
Currently, the only flag setting is 'enable_multiple' which allows having
@ -368,13 +369,13 @@ file system.
::

    ceph mds rmfailed

This removes a rank from the failed set.

::

    ceph fs reset <file system name>

This command resets the file system state to defaults, except for the name and
pools. Non-zero ranks are saved in the stopped set.
@ -382,7 +383,7 @@ pools. Non-zero ranks are saved in the stopped set.
::

    ceph fs new <file system name> <metadata pool name> <data pool name> --fscid <fscid> --force

This command creates a file system with a specific **fscid** (file system cluster ID).
You may want to do this when an application expects the file system's ID to be
@ -154,14 +154,8 @@ readdir. The behavior of the decay counter is the same as for cache trimming or
caps recall. Each readdir call increments the counter by the number of files in
the result.

.. confval:: mds_session_max_caps_throttle_ratio

.. confval:: mds_cap_acquisition_throttle_retry_request_timeout

If the number of caps acquired by the client per session is greater than the
@ -14,6 +14,8 @@ Requirements
The primary (local) and secondary (remote) Ceph clusters version should be Pacific or later.

.. _cephfs_mirroring_creating_users:

Creating Users
--------------
@ -42,80 +44,155 @@ Mirror daemon should be spawned using `systemctl(1)` unit files::
    $ cephfs-mirror --id mirror --cluster site-a -f

.. note:: The user specified here is `mirror`, the creation of which is
   described in the :ref:`Creating Users<cephfs_mirroring_creating_users>`
   section.

Multiple ``cephfs-mirror`` daemons may be deployed for concurrent
synchronization and high availability. Mirror daemons share the synchronization
load using a simple ``M/N`` policy, where ``M`` is the number of directories
and ``N`` is the number of ``cephfs-mirror`` daemons.
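As a rough illustration of that ``M/N`` sharing (an assumption for clarity, not
the mirroring module's actual assignment algorithm), spreading ``M`` directories
across ``N`` daemons might look like this:

.. code-block:: python

    # Toy sketch: distribute M mirrored directories across N cephfs-mirror
    # daemons so each daemon handles roughly M/N of them.
    def assign_directories(directories, daemons):
        assignment = {daemon: [] for daemon in daemons}
        for i, path in enumerate(sorted(directories)):
            assignment[daemons[i % len(daemons)]].append(path)  # round-robin M over N
        return assignment

    dirs = ["/d0/d1/d2", "/projects/a", "/projects/b", "/home/archive"]
    print(assign_directories(dirs, ["mirror.a", "mirror.b"]))
    # {'mirror.a': ['/d0/d1/d2', '/projects/a'], 'mirror.b': ['/home/archive', '/projects/b']}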
When ``cephadm`` is used to manage a Ceph cluster, ``cephfs-mirror`` daemons can be
deployed by running the following command:

.. prompt:: bash $

    ceph orch apply cephfs-mirror

To deploy multiple mirror daemons, run a command of the following form:

.. prompt:: bash $

    ceph orch apply cephfs-mirror --placement=<placement-spec>

For example, to deploy 3 `cephfs-mirror` daemons on different hosts, run a command of the following form:

.. prompt:: bash $

    ceph orch apply cephfs-mirror --placement="3 host1,host2,host3"
Interface
---------

The `Mirroring` module (manager plugin) provides interfaces for managing
directory snapshot mirroring. These are (mostly) wrappers around monitor
commands for managing file system mirroring and are the recommended control
interface.
Mirroring Module
----------------

The mirroring module is responsible for assigning directories to mirror daemons
for synchronization. Multiple mirror daemons can be spawned to achieve
concurrency in directory snapshot synchronization. When mirror daemons are
spawned (or terminated), the mirroring module discovers the modified set of
mirror daemons and rebalances directory assignments across the new set, thus
providing high-availability.

.. note:: Deploying a single mirror daemon is recommended. Running multiple
   daemons is untested.

The mirroring module is disabled by default. To enable the mirroring module,
run the following command:

.. prompt:: bash $

    ceph mgr module enable mirroring

The mirroring module provides a family of commands that can be used to control
the mirroring of directory snapshots. To add or remove directories, mirroring
must be enabled for a given file system. To enable mirroring for a given file
system, run a command of the following form:

.. prompt:: bash $

    ceph fs snapshot mirror enable <fs_name>

.. note:: "Mirroring module" commands are prefixed with ``fs snapshot mirror``.
   This distinguishes them from "monitor commands", which are prefixed with
   ``fs mirror``. Be sure (in this context) to use module commands.

To disable mirroring for a given file system, run a command of the following form:

.. prompt:: bash $

    ceph fs snapshot mirror disable <fs_name>

After mirroring is enabled, add a peer to which directory snapshots are to be
mirrored. Peers are specified by the ``<client>@<cluster>`` format, which is
referred to elsewhere in this document as the ``remote_cluster_spec``. Peers
are assigned a unique-id (UUID) when added. See the :ref:`Creating
Users<cephfs_mirroring_creating_users>` section for instructions that describe
how to create Ceph users for mirroring.

To add a peer, run a command of the following form:

.. prompt:: bash $

    ceph fs snapshot mirror peer_add <fs_name> <remote_cluster_spec> [<remote_fs_name>] [<remote_mon_host>] [<cephx_key>]

``<remote_cluster_spec>`` is of the format ``client.<id>@<cluster_name>``.

``<remote_fs_name>`` is optional, and defaults to `<fs_name>` (on the remote
cluster).

For this command to succeed, the remote cluster's Ceph configuration and user
keyring must be available in the primary cluster. For example, if a user named
``client_mirror`` is created on the remote cluster which has ``rwps``
permissions for the remote file system named ``remote_fs`` (see `Creating
Users`) and the remote cluster is named ``remote_ceph`` (that is, the remote
cluster configuration file is named ``remote_ceph.conf`` on the primary
cluster), run the following command to add the remote filesystem as a peer to
the primary filesystem ``primary_fs``:

.. prompt:: bash $

    ceph fs snapshot mirror peer_add primary_fs client.mirror_remote@remote_ceph remote_fs

To avoid having to maintain the remote cluster configuration file and remote
ceph user keyring in the primary cluster, users can bootstrap a peer (which
stores the relevant remote cluster details in the monitor config store on the
primary cluster). See the :ref:`Bootstrap
Peers<cephfs_mirroring_bootstrap_peers>` section.

The ``peer_add`` command supports passing the remote cluster monitor address
and the user key. However, bootstrapping a peer is the recommended way to add a
peer.

.. note:: Only a single peer is supported right now.
To remove a peer, run a command of the following form:

.. prompt:: bash $

    ceph fs snapshot mirror peer_remove <fs_name> <peer_uuid>

To list file system mirror peers, run a command of the following form:

.. prompt:: bash $

    ceph fs snapshot mirror peer_list <fs_name>

To configure a directory for mirroring, run a command of the following form:

.. prompt:: bash $

    ceph fs snapshot mirror add <fs_name> <path>

To stop mirroring directory snapshots, run a command of the following form:

.. prompt:: bash $

    ceph fs snapshot mirror remove <fs_name> <path>

Only absolute directory paths are allowed.

Paths are normalized by the mirroring module. This means that ``/a/b/../b`` is
equivalent to ``/a/b`` (a short sketch of this normalization follows the
example below). Paths always start from the CephFS file-system root and not
from the host system mount point.
For example::
    $ mkdir -p /d0/d1/d2
    $ ceph fs snapshot mirror add cephfs /d0/d1/d2
@ -123,16 +200,19 @@ module, therefore, `/a/b/../b` is equivalent to `/a/b`.
    $ ceph fs snapshot mirror add cephfs /d0/d1/../d1/d2
    Error EEXIST: directory /d0/d1/d2 is already tracked
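The normalization behaviour shown above can be approximated in Python. This is
an assumption for illustration only (the mirroring module has its own
implementation): redundant path components are collapsed before a directory is
tracked, so ``/d0/d1/../d1/d2`` and ``/d0/d1/d2`` refer to the same tracked path.

.. code-block:: python

    # Sketch of path normalization for mirrored directories.
    import posixpath

    def normalized(path):
        if not path.startswith("/"):
            raise ValueError("only absolute directory paths are allowed")
        return posixpath.normpath(path)

    assert normalized("/d0/d1/../d1/d2") == normalized("/d0/d1/d2") == "/d0/d1/d2"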
Once a directory is added for mirroring, its subdirectory or ancestor directories are After a directory is added for mirroring, the additional mirroring of
disallowed to be added for mirroring:: subdirectories or ancestor directories is disallowed::
$ ceph fs snapshot mirror add cephfs /d0/d1 $ ceph fs snapshot mirror add cephfs /d0/d1
Error EINVAL: /d0/d1 is a ancestor of tracked path /d0/d1/d2 Error EINVAL: /d0/d1 is a ancestor of tracked path /d0/d1/d2
$ ceph fs snapshot mirror add cephfs /d0/d1/d2/d3 $ ceph fs snapshot mirror add cephfs /d0/d1/d2/d3
Error EINVAL: /d0/d1/d2/d3 is a subtree of tracked path /d0/d1/d2 Error EINVAL: /d0/d1/d2/d3 is a subtree of tracked path /d0/d1/d2
Commands to check directory mapping (to mirror daemons) and directory distribution are The :ref:`Mirroring Status<cephfs_mirroring_mirroring_status>` section contains
detailed in `Mirroring Status` section. information about the commands for checking the directory mapping (to mirror
daemons) and for checking the directory distribution.
.. _cephfs_mirroring_bootstrap_peers:
Bootstrap Peers Bootstrap Peers
--------------- ---------------
@ -160,6 +240,9 @@ e.g.::
$ ceph fs snapshot mirror peer_bootstrap import cephfs eyJmc2lkIjogIjBkZjE3MjE3LWRmY2QtNDAzMC05MDc5LTM2Nzk4NTVkNDJlZiIsICJmaWxlc3lzdGVtIjogImJhY2t1cF9mcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcGVlcl9ib290c3RyYXAiLCAic2l0ZV9uYW1lIjogInNpdGUtcmVtb3RlIiwgImtleSI6ICJBUUFhcDBCZ0xtRmpOeEFBVnNyZXozai9YYUV0T2UrbUJEZlJDZz09IiwgIm1vbl9ob3N0IjogIlt2MjoxOTIuMTY4LjAuNTo0MDkxOCx2MToxOTIuMTY4LjAuNTo0MDkxOV0ifQ== $ ceph fs snapshot mirror peer_bootstrap import cephfs eyJmc2lkIjogIjBkZjE3MjE3LWRmY2QtNDAzMC05MDc5LTM2Nzk4NTVkNDJlZiIsICJmaWxlc3lzdGVtIjogImJhY2t1cF9mcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcGVlcl9ib290c3RyYXAiLCAic2l0ZV9uYW1lIjogInNpdGUtcmVtb3RlIiwgImtleSI6ICJBUUFhcDBCZ0xtRmpOeEFBVnNyZXozai9YYUV0T2UrbUJEZlJDZz09IiwgIm1vbl9ob3N0IjogIlt2MjoxOTIuMTY4LjAuNTo0MDkxOCx2MToxOTIuMTY4LjAuNTo0MDkxOV0ifQ==
.. _cephfs_mirroring_mirroring_status:
Mirroring Status Mirroring Status
---------------- ----------------

View File

@ -78,7 +78,15 @@ By default, `cephfs-top` connects to cluster name `ceph`. To use a non-default c
$ cephfs-top -d <seconds> $ cephfs-top -d <seconds>
Interval should be greater than or equal to 0.5 seconds. Fractional seconds are honoured. Refresh interval should be a positive integer.
To dump the metrics to stdout without creating a curses display use::
$ cephfs-top --dump
To dump the metrics of the given filesystem to stdout without creating a curses display use::
$ cephfs-top --dumpfs <fs_name>
Interactive Commands Interactive Commands
-------------------- --------------------
@ -104,3 +112,5 @@ The metrics display can be scrolled using the Arrow Keys, PgUp/PgDn, Home/End an
Sample screenshot running `cephfs-top` with 2 filesystems: Sample screenshot running `cephfs-top` with 2 filesystems:
.. image:: cephfs-top.png .. image:: cephfs-top.png
.. note:: The minimum compatible Python version for cephfs-top is 3.6.0. cephfs-top is supported on RHEL 8, Ubuntu 18.04, CentOS 8 and later versions of these distributions.

View File

@ -8,10 +8,17 @@ Creating pools
A Ceph file system requires at least two RADOS pools, one for data and one for metadata. A Ceph file system requires at least two RADOS pools, one for data and one for metadata.
When configuring these pools, you might consider: When configuring these pools, you might consider:
- Using a higher replication level for the metadata pool, as any data loss in - We recommend configuring *at least* 3 replicas for the metadata pool,
this pool can render the whole file system inaccessible. as data loss in this pool can render the entire file system inaccessible.
- Using lower-latency storage such as SSDs for the metadata pool, as this will Configuring 4 would not be extreme, especially since the metadata pool's
directly affect the observed latency of file system operations on clients. capacity requirements are quite modest.
- We recommend the fastest feasible low-latency storage devices (NVMe, Optane,
or at the very least SAS/SATA SSD) for the metadata pool, as this will
directly affect the latency of client file system operations.
- We strongly suggest that the CephFS metadata pool be provisioned on dedicated
SSD / NVMe OSDs. This ensures that high client workload does not adversely
impact metadata operations. See :ref:`device_classes` to configure pools this
way.
- The data pool used to create the file system is the "default" data pool and - The data pool used to create the file system is the "default" data pool and
the location for storing all inode backtrace information, used for hard link the location for storing all inode backtrace information, used for hard link
management and disaster recovery. For this reason, all inodes created in management and disaster recovery. For this reason, all inodes created in
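As a rough sketch of the metadata-pool recommendations above (pool and CRUSH
rule names are illustrative; placement group counts are left to the
autoscaler), the pools for a new file system might be created like this::

    $ ceph osd crush rule create-replicated ssd-rule default host ssd
    $ ceph osd pool create cephfs_metadata
    $ ceph osd pool set cephfs_metadata crush_rule ssd-rule
    $ ceph osd pool set cephfs_metadata size 3
    $ ceph osd pool create cephfs_data
    $ ceph fs new cephfs cephfs_metadata cephfs_data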

View File

@ -149,8 +149,8 @@ errors.
:: ::
cephfs-data-scan scan_extents <data pool> cephfs-data-scan scan_extents [<data pool> [<extra data pool> ...]]
cephfs-data-scan scan_inodes <data pool> cephfs-data-scan scan_inodes [<data pool>]
cephfs-data-scan scan_links cephfs-data-scan scan_links
'scan_extents' and 'scan_inodes' commands may take a *very long* time 'scan_extents' and 'scan_inodes' commands may take a *very long* time
@ -166,22 +166,22 @@ The example below shows how to run 4 workers simultaneously:
:: ::
# Worker 0 # Worker 0
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool> cephfs-data-scan scan_extents --worker_n 0 --worker_m 4
# Worker 1 # Worker 1
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool> cephfs-data-scan scan_extents --worker_n 1 --worker_m 4
# Worker 2 # Worker 2
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool> cephfs-data-scan scan_extents --worker_n 2 --worker_m 4
# Worker 3 # Worker 3
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool> cephfs-data-scan scan_extents --worker_n 3 --worker_m 4
# Worker 0 # Worker 0
cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool> cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4
# Worker 1 # Worker 1
cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool> cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4
# Worker 2 # Worker 2
cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool> cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4
# Worker 3 # Worker 3
cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool> cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4
It is **important** to ensure that all workers have completed the It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase. scan_extents phase before any workers enter the scan_inodes phase.
@ -191,8 +191,13 @@ operation to delete ancillary data generated during recovery.
:: ::
cephfs-data-scan cleanup <data pool> cephfs-data-scan cleanup [<data pool>]
Note that the data pool parameters for the 'scan_extents', 'scan_inodes' and
'cleanup' commands are optional; the tool can usually detect the pools
automatically. You may still override this. The 'scan_extents' command needs
all data pools to be specified, while the 'scan_inodes' and 'cleanup' commands
need only the main data pool.
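For example, with hypothetical pool names, explicitly overriding the pool
detection might look like this::

    cephfs-data-scan scan_extents cephfs_data cephfs_data_extra
    cephfs-data-scan scan_inodes cephfs_data
    cephfs-data-scan cleanup cephfs_data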
Using an alternate metadata pool for recovery Using an alternate metadata pool for recovery
@ -229,35 +234,29 @@ backed by the original data pool.
:: ::
ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph osd pool create cephfs_recovery_meta ceph osd pool create cephfs_recovery_meta
ceph fs new cephfs_recovery recovery <data_pool> --allow-dangerous-metadata-overlay ceph fs new cephfs_recovery recovery <data_pool> --recover --allow-dangerous-metadata-overlay
.. note::
The recovery file system starts with an MDS rank that will initialize the new The ``--recover`` flag prevents any MDS from joining the new file system.
metadata pool with some metadata. This is necessary to bootstrap recovery.
However, now we will take the MDS down as we do not want it interacting with Next, we will create the initial metadata for the file system:
the metadata pool further.
:: ::
ceph fs fail cephfs_recovery cephfs-table-tool cephfs_recovery:0 reset session
cephfs-table-tool cephfs_recovery:0 reset snap
Next, we will reset the initial metadata the MDS created: cephfs-table-tool cephfs_recovery:0 reset inode
cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force
::
cephfs-table-tool cephfs_recovery:all reset session
cephfs-table-tool cephfs_recovery:all reset snap
cephfs-table-tool cephfs_recovery:all reset inode
Now perform the recovery of the metadata pool from the data pool: Now perform the recovery of the metadata pool from the data pool:
:: ::
cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool> cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name>
cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool> cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt
cephfs-data-scan scan_links --filesystem cephfs_recovery cephfs-data-scan scan_links --filesystem cephfs_recovery
.. note:: .. note::
@ -272,7 +271,6 @@ with:
:: ::
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta
cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force
After recovery, some recovered directories will have incorrect statistics. After recovery, some recovered directories will have incorrect statistics.
Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are
@ -283,20 +281,22 @@ set to false (the default) to prevent the MDS from checking the statistics:
ceph config rm mds mds_verify_scatter ceph config rm mds mds_verify_scatter
ceph config rm mds mds_debug_scatterstat ceph config rm mds mds_debug_scatterstat
(Note, the config may also have been set globally or via a ceph.conf file.) .. note::
Also verify the config has not been set globally or with a local ceph.conf file.
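One way to verify this is to query the configuration database; both options
should report ``false``::

    ceph config get mds mds_verify_scatter
    ceph config get mds mds_debug_scatterstat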
Now, allow an MDS to join the recovery file system: Now, allow an MDS to join the recovery file system:
:: ::
ceph fs set cephfs_recovery joinable true ceph fs set cephfs_recovery joinable true
Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair the statistics. Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair recursive statistics.
Ensure you have an MDS running and issue: Ensure you have an MDS running and issue:
:: ::
ceph fs status # get active MDS ceph tell mds.cephfs_recovery:0 scrub start / recursive,repair,force
ceph tell mds.<id> scrub start / recursive repair
.. note:: .. note::

View File

@ -3,13 +3,13 @@
FS volumes and subvolumes FS volumes and subvolumes
========================= =========================
A single source of truth for CephFS exports is implemented in the volumes The volumes module of the :term:`Ceph Manager` daemon (ceph-mgr) provides a
module of the :term:`Ceph Manager` daemon (ceph-mgr). The OpenStack shared single source of truth for CephFS exports. The OpenStack shared file system
file system service (manila_), Ceph Container Storage Interface (CSI_), service (manila_) and the Ceph Container Storage Interface (CSI_) storage
storage administrators among others can use the common CLI provided by the administrators use the common CLI provided by the ceph-mgr ``volumes`` module
ceph-mgr volumes module to manage the CephFS exports. to manage CephFS exports.
The ceph-mgr volumes module implements the following file system export The ceph-mgr ``volumes`` module implements the following file system export
abstractions: abstractions:
* FS volumes, an abstraction for CephFS file systems * FS volumes, an abstraction for CephFS file systems
@ -17,87 +17,82 @@ abstractions:
* FS subvolumes, an abstraction for independent CephFS directory trees * FS subvolumes, an abstraction for independent CephFS directory trees
* FS subvolume groups, an abstraction for a directory level higher than FS * FS subvolume groups, an abstraction for a directory level higher than FS
subvolumes to effect policies (e.g., :doc:`/cephfs/file-layouts`) across a subvolumes. Used to effect policies (e.g., :doc:`/cephfs/file-layouts`)
set of subvolumes across a set of subvolumes
Some possible use-cases for the export abstractions: Some possible use-cases for the export abstractions:
* FS subvolumes used as manila shares or CSI volumes * FS subvolumes used as Manila shares or CSI volumes
* FS subvolume groups used as manila share groups * FS subvolume groups used as Manila share groups
Requirements Requirements
------------ ------------
* Nautilus (14.2.x) or a later version of Ceph * Nautilus (14.2.x) or a later Ceph release
* Cephx client user (see :doc:`/rados/operations/user-management`) with * Cephx client user (see :doc:`/rados/operations/user-management`) with
the following minimum capabilities:: at least the following capabilities::
mon 'allow r' mon 'allow r'
mgr 'allow rw' mgr 'allow rw'
FS Volumes FS Volumes
---------- ----------
Create a volume using:: Create a volume by running the following command:
$ ceph fs volume create <vol_name> [<placement>] .. prompt:: bash #
ceph fs volume create <vol_name> [placement]
This creates a CephFS file system and its data and metadata pools. It can also This creates a CephFS file system and its data and metadata pools. It can also
try to create MDSes for the filesystem using the enabled ceph-mgr orchestrator deploy MDS daemons for the filesystem using a ceph-mgr orchestrator module (for
module (see :doc:`/mgr/orchestrator`), e.g. rook. example Rook). See :doc:`/mgr/orchestrator`.
<vol_name> is the volume name (an arbitrary string), and ``<vol_name>`` is the volume name (an arbitrary string). ``[placement]`` is an
optional string that specifies the :ref:`orchestrator-cli-placement-spec` for
the MDS. See also :ref:`orchestrator-cli-cephfs` for more examples on
placement.
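For example, a volume whose MDS daemons should be spread across two specific
hosts (hostnames here are purely illustrative) could be created by passing a
placement string directly::

    $ ceph fs volume create vol_a "2 host1,host2"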
<placement> is an optional string signifying which hosts should have NFS Ganesha .. note:: Specifying placement via a YAML file is not supported through the
daemon containers running on them and, optionally, the total number of NFS volume interface.
Ganesha daemons the cluster (should you want to have more than one NFS Ganesha
daemon running per node). For example, the following placement string means
"deploy NFS Ganesha daemons on nodes host1 and host2 (one daemon per host):
"host1,host2" To remove a volume, run the following command:
and this placement specification says to deploy two NFS Ganesha daemons each
on nodes host1 and host2 (for a total of four NFS Ganesha daemons in the
cluster):
"4 host1,host2"
For more details on placement specification refer to the :ref:`orchestrator-cli-service-spec`,
but keep in mind that specifying the placement via a YAML file is not supported.
Remove a volume using::
$ ceph fs volume rm <vol_name> [--yes-i-really-mean-it] $ ceph fs volume rm <vol_name> [--yes-i-really-mean-it]
This removes a file system and its data and metadata pools. It also tries to This removes a file system and its data and metadata pools. It also tries to
remove MDSes using the enabled ceph-mgr orchestrator module. remove MDS daemons using the enabled ceph-mgr orchestrator module.
List volumes using:: .. note:: After volume deletion, it is recommended to restart `ceph-mgr`
if a new file system is created on the same cluster and subvolume interface
is being used. Please see https://tracker.ceph.com/issues/49605#note-5
for more details.
List volumes by running the following command:
$ ceph fs volume ls $ ceph fs volume ls
Rename a volume using:: Rename a volume by running the following command:
$ ceph fs volume rename <vol_name> <new_vol_name> [--yes-i-really-mean-it] $ ceph fs volume rename <vol_name> <new_vol_name> [--yes-i-really-mean-it]
Renaming a volume can be an expensive operation. It does the following: Renaming a volume can be an expensive operation that requires the following:
- renames the orchestrator managed MDS service to match the <new_vol_name>. - Renaming the orchestrator-managed MDS service to match the <new_vol_name>.
This involves launching a MDS service with <new_vol_name> and bringing down This involves launching a MDS service with ``<new_vol_name>`` and bringing
the MDS service with <vol_name>. down the MDS service with ``<vol_name>``.
- renames the file system matching <vol_name> to <new_vol_name> - Renaming the file system matching ``<vol_name>`` to ``<new_vol_name>``.
- changes the application tags on the data and metadata pools of the file system - Changing the application tags on the data and metadata pools of the file system
to <new_vol_name> to ``<new_vol_name>``.
- renames the metadata and data pools of the file system. - Renaming the metadata and data pools of the file system.
The CephX IDs authorized to <vol_name> need to be reauthorized to <new_vol_name>. Any The CephX IDs that are authorized for ``<vol_name>`` must be reauthorized for
on-going operations of the clients using these IDs may be disrupted. Mirroring is ``<new_vol_name>``. Any ongoing operations of the clients using these IDs may
expected to be disabled on the volume. be disrupted. Ensure that mirroring is disabled on the volume.
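A hedged sketch of a complete rename, assuming an illustrative CephX ID
``client.app`` and that the mirroring module had been enabled on the volume::

    $ ceph fs snapshot mirror disable old_vol
    $ ceph fs volume rename old_vol new_vol --yes-i-really-mean-it
    $ ceph fs authorize new_vol client.app / rw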
Fetch the information of a CephFS volume using:: To fetch the information of a CephFS volume, run the following command:
$ ceph fs volume info vol_name [--human_readable] $ ceph fs volume info vol_name [--human_readable]
@ -105,15 +100,15 @@ The ``--human_readable`` flag shows used and available pool capacities in KB/MB/
The output format is JSON and contains fields as follows: The output format is JSON and contains fields as follows:
* pools: Attributes of data and metadata pools * ``pools``: Attributes of data and metadata pools
* avail: The amount of free space available in bytes * ``avail``: The amount of free space available in bytes
* used: The amount of storage consumed in bytes * ``used``: The amount of storage consumed in bytes
* name: Name of the pool * ``name``: Name of the pool
* mon_addrs: List of monitor addresses * ``mon_addrs``: List of Ceph monitor addresses
* used_size: Current used size of the CephFS volume in bytes * ``used_size``: Current used size of the CephFS volume in bytes
* pending_subvolume_deletions: Number of subvolumes pending deletion * ``pending_subvolume_deletions``: Number of subvolumes pending deletion
Sample output of volume info command:: Sample output of the ``volume info`` command::
$ ceph fs volume info vol_name $ ceph fs volume info vol_name
{ {
@ -143,88 +138,91 @@ Sample output of volume info command::
FS Subvolume groups FS Subvolume groups
------------------- -------------------
Create a subvolume group using:: Create a subvolume group by running the following command:
$ ceph fs subvolumegroup create <vol_name> <group_name> [--size <size_in_bytes>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>] $ ceph fs subvolumegroup create <vol_name> <group_name> [--size <size_in_bytes>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>]
The command succeeds even if the subvolume group already exists. The command succeeds even if the subvolume group already exists.
When creating a subvolume group you can specify its data pool layout (see When creating a subvolume group you can specify its data pool layout (see
:doc:`/cephfs/file-layouts`), uid, gid, file mode in octal numerals and :doc:`/cephfs/file-layouts`), uid, gid, file mode in octal numerals, and
size in bytes. The size of the subvolume group is specified by setting size in bytes. The size of the subvolume group is specified by setting
a quota on it (see :doc:`/cephfs/quota`). By default, the subvolume group a quota on it (see :doc:`/cephfs/quota`). By default, the subvolume group
is created with an octal file mode '755', uid '0', gid '0' and data pool is created with octal file mode ``755``, uid ``0``, gid ``0`` and the data pool
layout of its parent directory. layout of its parent directory.
Remove a subvolume group by running a command of the following form:
Remove a subvolume group using::
$ ceph fs subvolumegroup rm <vol_name> <group_name> [--force] $ ceph fs subvolumegroup rm <vol_name> <group_name> [--force]
The removal of a subvolume group fails if it is not empty or non-existent. The removal of a subvolume group fails if the subvolume group is not empty or
'--force' flag allows the non-existent subvolume group remove command to succeed. does not exist. The ``--force`` flag allows the remove command to succeed even
if the subvolume group does not exist.
Fetch the absolute path of a subvolume group using:: Fetch the absolute path of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup getpath <vol_name> <group_name> $ ceph fs subvolumegroup getpath <vol_name> <group_name>
List subvolume groups using:: List subvolume groups by running a command of the following form:
$ ceph fs subvolumegroup ls <vol_name> $ ceph fs subvolumegroup ls <vol_name>
.. note:: Subvolume group snapshot feature is no longer supported in mainline CephFS (existing group .. note:: Subvolume group snapshot feature is no longer supported in mainline CephFS (existing group
snapshots can still be listed and deleted) snapshots can still be listed and deleted)
Fetch the metadata of a subvolume group using:: Fetch the metadata of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup info <vol_name> <group_name> $ ceph fs subvolumegroup info <vol_name> <group_name>
The output format is json and contains fields as follows. The output format is JSON and contains fields as follows:
* atime: access time of subvolume group path in the format "YYYY-MM-DD HH:MM:SS" * ``atime``: access time of the subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* mtime: modification time of subvolume group path in the format "YYYY-MM-DD HH:MM:SS" * ``mtime``: modification time of the subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* ctime: change time of subvolume group path in the format "YYYY-MM-DD HH:MM:SS" * ``ctime``: change time of the subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* uid: uid of subvolume group path * ``uid``: uid of the subvolume group path
* gid: gid of subvolume group path * ``gid``: gid of the subvolume group path
* mode: mode of subvolume group path * ``mode``: mode of the subvolume group path
* mon_addrs: list of monitor addresses * ``mon_addrs``: list of monitor addresses
* bytes_pcent: quota used in percentage if quota is set, else displays "undefined" * ``bytes_pcent``: quota used in percentage if quota is set, else displays "undefined"
* bytes_quota: quota size in bytes if quota is set, else displays "infinite" * ``bytes_quota``: quota size in bytes if quota is set, else displays "infinite"
* bytes_used: current used size of the subvolume group in bytes * ``bytes_used``: current used size of the subvolume group in bytes
* created_at: time of creation of subvolume group in the format "YYYY-MM-DD HH:MM:SS" * ``created_at``: creation time of the subvolume group in the format "YYYY-MM-DD HH:MM:SS"
* data_pool: data pool the subvolume group belongs to * ``data_pool``: data pool to which the subvolume group belongs
Check the presence of any subvolume group using:: Check the presence of any subvolume group by running a command of the following form:
$ ceph fs subvolumegroup exist <vol_name> $ ceph fs subvolumegroup exist <vol_name>
The strings returned by the 'exist' command: The ``exist`` command outputs:
* "subvolumegroup exists": if any subvolumegroup is present * "subvolumegroup exists": if any subvolumegroup is present
* "no subvolumegroup exists": if no subvolumegroup is present * "no subvolumegroup exists": if no subvolumegroup is present
.. note:: It checks for the presence of custom groups and not the default one. To validate the emptiness of the volume, subvolumegroup existence check alone is not sufficient. The subvolume existence also needs to be checked as there might be subvolumes in the default group. .. note:: This command checks for the presence of custom groups and not
presence of the default one. To validate the emptiness of the volume, a
subvolumegroup existence check alone is not sufficient. Subvolume existence
also needs to be checked as there might be subvolumes in the default group.
Resize a subvolume group using:: Resize a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup resize <vol_name> <group_name> <new_size> [--no_shrink] $ ceph fs subvolumegroup resize <vol_name> <group_name> <new_size> [--no_shrink]
The command resizes the subvolume group quota using the size specified by 'new_size'. The command resizes the subvolume group quota, using the size specified by
The '--no_shrink' flag prevents the subvolume group to shrink below the current used ``new_size``. The ``--no_shrink`` flag prevents the subvolume group from
size of the subvolume group. shrinking below the current used size.
The subvolume group can be resized to an infinite size by passing 'inf' or 'infinite' The subvolume group may be resized to an infinite size by passing ``inf`` or
as the new_size. ``infinite`` as the ``new_size``.
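For example (volume, group, and size values are illustrative)::

    $ ceph fs subvolumegroup resize vol_a group_a 107374182400 --no_shrink
    $ ceph fs subvolumegroup resize vol_a group_a inf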
Remove a snapshot of a subvolume group using:: Remove a snapshot of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup snapshot rm <vol_name> <group_name> <snap_name> [--force] $ ceph fs subvolumegroup snapshot rm <vol_name> <group_name> <snap_name> [--force]
Using the '--force' flag allows the command to succeed that would otherwise Supplying the ``--force`` flag allows the command to succeed when it would otherwise
fail if the snapshot did not exist. fail due to the nonexistence of the snapshot.
List snapshots of a subvolume group using:: List snapshots of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup snapshot ls <vol_name> <group_name> $ ceph fs subvolumegroup snapshot ls <vol_name> <group_name>
@ -232,7 +230,7 @@ List snapshots of a subvolume group using::
FS Subvolumes FS Subvolumes
------------- -------------
Create a subvolume using:: Create a subvolume using:
$ ceph fs subvolume create <vol_name> <subvol_name> [--size <size_in_bytes>] [--group_name <subvol_group_name>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>] [--namespace-isolated] $ ceph fs subvolume create <vol_name> <subvol_name> [--size <size_in_bytes>] [--group_name <subvol_group_name>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>] [--namespace-isolated]
@ -247,11 +245,10 @@ default a subvolume is created within the default subvolume group, and with an o
mode '755', uid of its subvolume group, gid of its subvolume group, data pool layout of mode '755', uid of its subvolume group, gid of its subvolume group, data pool layout of
its parent directory and no size limit. its parent directory and no size limit.
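For example, a subvolume with a 10 GiB quota in a non-default group (names and
size are illustrative) could be created with::

    $ ceph fs subvolume create vol_a subvol_1 --size 10737418240 --group_name group_a --mode 750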
Remove a subvolume using:: Remove a subvolume using:
$ ceph fs subvolume rm <vol_name> <subvol_name> [--group_name <subvol_group_name>] [--force] [--retain-snapshots] $ ceph fs subvolume rm <vol_name> <subvol_name> [--group_name <subvol_group_name>] [--force] [--retain-snapshots]
The command removes the subvolume and its contents. It does this in two steps. The command removes the subvolume and its contents. It does this in two steps.
First, it moves the subvolume to a trash folder, and then asynchronously purges First, it moves the subvolume to a trash folder, and then asynchronously purges
its contents. its contents.
@ -267,95 +264,95 @@ empty for all operations not involving the retained snapshots.
.. note:: Retained snapshots can be used as a clone source to recreate the subvolume, or clone to a newer subvolume. .. note:: Retained snapshots can be used as a clone source to recreate the subvolume, or clone to a newer subvolume.
Resize a subvolume using:: Resize a subvolume using:
$ ceph fs subvolume resize <vol_name> <subvol_name> <new_size> [--group_name <subvol_group_name>] [--no_shrink] $ ceph fs subvolume resize <vol_name> <subvol_name> <new_size> [--group_name <subvol_group_name>] [--no_shrink]
The command resizes the subvolume quota using the size specified by 'new_size'. The command resizes the subvolume quota using the size specified by ``new_size``.
'--no_shrink' flag prevents the subvolume to shrink below the current used size of the subvolume. The ``--no_shrink`` flag prevents the subvolume from shrinking below the current used size of the subvolume.
The subvolume can be resized to an infinite size by passing 'inf' or 'infinite' as the new_size. The subvolume can be resized to an unlimited (but sparse) logical size by passing ``inf`` or ``infinite`` as ``new_size``.
Authorize cephx auth IDs, the read/read-write access to fs subvolumes:: Authorize cephx auth IDs with read or read-write access to fs subvolumes:
$ ceph fs subvolume authorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>] [--access_level=<access_level>] $ ceph fs subvolume authorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>] [--access_level=<access_level>]
The 'access_level' takes 'r' or 'rw' as value. The ``access_level`` takes ``r`` or ``rw`` as value.
Deauthorize cephx auth IDs, the read/read-write access to fs subvolumes:: Deauthorize cephx auth IDs, removing their read or read-write access to fs subvolumes:
$ ceph fs subvolume deauthorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>] $ ceph fs subvolume deauthorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]
List cephx auth IDs authorized to access fs subvolume:: List cephx auth IDs authorized to access fs subvolume:
$ ceph fs subvolume authorized_list <vol_name> <sub_name> [--group_name=<group_name>] $ ceph fs subvolume authorized_list <vol_name> <sub_name> [--group_name=<group_name>]
Evict fs clients based on auth ID and subvolume mounted:: Evict fs clients based on auth ID and subvolume mounted:
$ ceph fs subvolume evict <vol_name> <sub_name> <auth_id> [--group_name=<group_name>] $ ceph fs subvolume evict <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]
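As a hedged illustration of the authorization workflow above (volume,
subvolume, and auth ID names are hypothetical)::

    $ ceph fs subvolume authorize vol_a subvol_1 user_1 --access_level=rw
    $ ceph fs subvolume authorized_list vol_a subvol_1
    $ ceph fs subvolume deauthorize vol_a subvol_1 user_1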
Fetch the absolute path of a subvolume using:: Fetch the absolute path of a subvolume using:
$ ceph fs subvolume getpath <vol_name> <subvol_name> [--group_name <subvol_group_name>] $ ceph fs subvolume getpath <vol_name> <subvol_name> [--group_name <subvol_group_name>]
Fetch the information of a subvolume using:: Fetch the information of a subvolume using:
$ ceph fs subvolume info <vol_name> <subvol_name> [--group_name <subvol_group_name>] $ ceph fs subvolume info <vol_name> <subvol_name> [--group_name <subvol_group_name>]
The output format is json and contains fields as follows. The output format is JSON and contains fields as follows.
* atime: access time of subvolume path in the format "YYYY-MM-DD HH:MM:SS" * ``atime``: access time of the subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* mtime: modification time of subvolume path in the format "YYYY-MM-DD HH:MM:SS" * ``mtime``: modification time of the subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* ctime: change time of subvolume path in the format "YYYY-MM-DD HH:MM:SS" * ``ctime``: change time of the subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* uid: uid of subvolume path * ``uid``: uid of the subvolume path
* gid: gid of subvolume path * ``gid``: gid of the subvolume path
* mode: mode of subvolume path * ``mode``: mode of the subvolume path
* mon_addrs: list of monitor addresses * ``mon_addrs``: list of monitor addresses
* bytes_pcent: quota used in percentage if quota is set, else displays "undefined" * ``bytes_pcent``: quota used in percentage if quota is set, else displays ``undefined``
* bytes_quota: quota size in bytes if quota is set, else displays "infinite" * ``bytes_quota``: quota size in bytes if quota is set, else displays ``infinite``
* bytes_used: current used size of the subvolume in bytes * ``bytes_used``: current used size of the subvolume in bytes
* created_at: time of creation of subvolume in the format "YYYY-MM-DD HH:MM:SS" * ``created_at``: creation time of the subvolume in the format "YYYY-MM-DD HH:MM:SS"
* data_pool: data pool the subvolume belongs to * ``data_pool``: data pool to which the subvolume belongs
* path: absolute path of a subvolume * ``path``: absolute path of a subvolume
* type: subvolume type indicating whether it's clone or subvolume * ``type``: subvolume type indicating whether it's clone or subvolume
* pool_namespace: RADOS namespace of the subvolume * ``pool_namespace``: RADOS namespace of the subvolume
* features: features supported by the subvolume * ``features``: features supported by the subvolume
* state: current state of the subvolume * ``state``: current state of the subvolume
If a subvolume has been removed retaining its snapshots, the output only contains fields as follows. If a subvolume has been removed while retaining its snapshots, the output contains only the following fields.
* type: subvolume type indicating whether it's clone or subvolume * ``type``: subvolume type indicating whether it's clone or subvolume
* features: features supported by the subvolume * ``features``: features supported by the subvolume
* state: current state of the subvolume * ``state``: current state of the subvolume
The subvolume "features" are based on the internal version of the subvolume and is a list containing A subvolume's ``features`` are based on the internal version of the subvolume and are
a subset of the following features, a subset of the following:
* "snapshot-clone": supports cloning using a subvolumes snapshot as the source * ``snapshot-clone``: supports cloning using a subvolumes snapshot as the source
* "snapshot-autoprotect": supports automatically protecting snapshots, that are active clone sources, from deletion * ``snapshot-autoprotect``: supports automatically protecting snapshots, that are active clone sources, from deletion
* "snapshot-retention": supports removing subvolume contents, retaining any existing snapshots * ``snapshot-retention``: supports removing subvolume contents, retaining any existing snapshots
The subvolume "state" is based on the current state of the subvolume and contains one of the following values. A subvolume's ``state`` is based on the current state of the subvolume and contains one of the following values.
* "complete": subvolume is ready for all operations * ``complete``: subvolume is ready for all operations
* "snapshot-retained": subvolume is removed but its snapshots are retained * ``snapshot-retained``: subvolume is removed but its snapshots are retained
List subvolumes using:: List subvolumes using:
$ ceph fs subvolume ls <vol_name> [--group_name <subvol_group_name>] $ ceph fs subvolume ls <vol_name> [--group_name <subvol_group_name>]
.. note:: subvolumes that are removed but have snapshots retained, are also listed. .. note:: subvolumes that are removed but have snapshots retained, are also listed.
Check the presence of any subvolume using:: Check the presence of any subvolume using:
$ ceph fs subvolume exist <vol_name> [--group_name <subvol_group_name>] $ ceph fs subvolume exist <vol_name> [--group_name <subvol_group_name>]
The strings returned by the 'exist' command: These are the possible results of the ``exist`` command:
* "subvolume exists": if any subvolume of given group_name is present * ``subvolume exists``: if any subvolume of given group_name is present
* "no subvolume exists": if no subvolume of given group_name is present * ``no subvolume exists``: if no subvolume of given group_name is present
Set custom metadata on the subvolume as a key-value pair using:: Set custom metadata on the subvolume as a key-value pair using:
$ ceph fs subvolume metadata set <vol_name> <subvol_name> <key_name> <value> [--group_name <subvol_group_name>] $ ceph fs subvolume metadata set <vol_name> <subvol_name> <key_name> <value> [--group_name <subvol_group_name>]
@ -365,52 +362,51 @@ Set custom metadata on the subvolume as a key-value pair using::
.. note:: Custom metadata on a subvolume is not preserved when snapshotting the subvolume, and hence, is also not preserved when cloning the subvolume snapshot. .. note:: Custom metadata on a subvolume is not preserved when snapshotting the subvolume, and hence, is also not preserved when cloning the subvolume snapshot.
Get custom metadata set on the subvolume using the metadata key:: Get custom metadata set on the subvolume using the metadata key:
$ ceph fs subvolume metadata get <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>] $ ceph fs subvolume metadata get <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>]
List custom metadata (key-value pairs) set on the subvolume using:: List custom metadata (key-value pairs) set on the subvolume using:
$ ceph fs subvolume metadata ls <vol_name> <subvol_name> [--group_name <subvol_group_name>] $ ceph fs subvolume metadata ls <vol_name> <subvol_name> [--group_name <subvol_group_name>]
Remove custom metadata set on the subvolume using the metadata key:: Remove custom metadata set on the subvolume using the metadata key:
$ ceph fs subvolume metadata rm <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>] [--force] $ ceph fs subvolume metadata rm <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>] [--force]
Using the '--force' flag allows the command to succeed that would otherwise Using the ``--force`` flag allows the command to succeed that would otherwise
fail if the metadata key did not exist. fail if the metadata key did not exist.
Create a snapshot of a subvolume using:: Create a snapshot of a subvolume using:
$ ceph fs subvolume snapshot create <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>] $ ceph fs subvolume snapshot create <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
Remove a snapshot of a subvolume using:
Remove a snapshot of a subvolume using::
$ ceph fs subvolume snapshot rm <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>] [--force] $ ceph fs subvolume snapshot rm <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>] [--force]
Using the '--force' flag allows the command to succeed that would otherwise Using the ``--force`` flag allows the command to succeed that would otherwise
fail if the snapshot did not exist. fail if the snapshot did not exist.
.. note:: if the last snapshot within a snapshot retained subvolume is removed, the subvolume is also removed .. note:: if the last snapshot within a snapshot retained subvolume is removed, the subvolume is also removed
List snapshots of a subvolume using:: List snapshots of a subvolume using:
$ ceph fs subvolume snapshot ls <vol_name> <subvol_name> [--group_name <subvol_group_name>] $ ceph fs subvolume snapshot ls <vol_name> <subvol_name> [--group_name <subvol_group_name>]
Fetch the information of a snapshot using:: Fetch the information of a snapshot using:
$ ceph fs subvolume snapshot info <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>] $ ceph fs subvolume snapshot info <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
The output format is json and contains fields as follows. The output format is json and contains fields as follows.
* created_at: time of creation of snapshot in the format "YYYY-MM-DD HH:MM:SS:ffffff" * ``created_at``: creation time of the snapshot in the format "YYYY-MM-DD HH:MM:SS:ffffff"
* data_pool: data pool the snapshot belongs to * ``data_pool``: data pool to which the snapshot belongs
* has_pending_clones: "yes" if snapshot clone is in progress otherwise "no" * ``has_pending_clones``: ``yes`` if snapshot clone is in progress, otherwise ``no``
* pending_clones: list of in progress or pending clones and their target group if exist otherwise this field is not shown * ``pending_clones``: list of in-progress or pending clones and their target group if any exist, otherwise this field is not shown
* orphan_clones_count: count of orphan clones if snapshot has orphan clones otherwise this field is not shown * ``orphan_clones_count``: count of orphan clones if the snapshot has orphan clones, otherwise this field is not shown
Sample output if snapshot clones are in progress or pending state:: Sample output when snapshot clones are in progress or pending::
$ ceph fs subvolume snapshot info cephfs subvol snap $ ceph fs subvolume snapshot info cephfs subvol snap
{ {
@ -432,7 +428,7 @@ Sample output if snapshot clones are in progress or pending state::
] ]
} }
Sample output if no snapshot clone is in progress or pending state:: Sample output when no snapshot clone is in progress or pending::
$ ceph fs subvolume snapshot info cephfs subvol snap $ ceph fs subvolume snapshot info cephfs subvol snap
{ {
@ -441,90 +437,93 @@ Sample output if no snapshot clone is in progress or pending state::
"has_pending_clones": "no" "has_pending_clones": "no"
} }
Set custom metadata on the snapshot as a key-value pair using:: Set custom key-value metadata on the snapshot by running:
$ ceph fs subvolume snapshot metadata set <vol_name> <subvol_name> <snap_name> <key_name> <value> [--group_name <subvol_group_name>] $ ceph fs subvolume snapshot metadata set <vol_name> <subvol_name> <snap_name> <key_name> <value> [--group_name <subvol_group_name>]
.. note:: If the key_name already exists then the old value will get replaced by the new value. .. note:: If the key_name already exists then the old value will get replaced by the new value.
.. note:: The key_name and value should be a string of ASCII characters (as specified in python's string.printable). The key_name is case-insensitive and always stored in lower case. .. note:: The key_name and value should be strings of ASCII characters (as specified in Python's ``string.printable``). The key_name is case-insensitive and always stored in lowercase.
.. note:: Custom metadata on a snapshots is not preserved when snapshotting the subvolume, and hence, is also not preserved when cloning the subvolume snapshot. .. note:: Custom metadata on a snapshot is not preserved when snapshotting the subvolume, and hence is also not preserved when cloning the subvolume snapshot.
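For instance, with hypothetical names, a key can be set and read back like
this::

    $ ceph fs subvolume snapshot metadata set vol_a subvol_1 snap_1 owner team-a
    $ ceph fs subvolume snapshot metadata get vol_a subvol_1 snap_1 owner
    team-a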
Get custom metadata set on the snapshot using the metadata key:: Get custom metadata set on the snapshot using the metadata key:
$ ceph fs subvolume snapshot metadata get <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>] $ ceph fs subvolume snapshot metadata get <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>]
List custom metadata (key-value pairs) set on the snapshot using:: List custom metadata (key-value pairs) set on the snapshot using:
$ ceph fs subvolume snapshot metadata ls <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>] $ ceph fs subvolume snapshot metadata ls <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
Remove custom metadata set on the snapshot using the metadata key:: Remove custom metadata set on the snapshot using the metadata key:
$ ceph fs subvolume snapshot metadata rm <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>] [--force] $ ceph fs subvolume snapshot metadata rm <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>] [--force]
Using the '--force' flag allows the command to succeed that would otherwise Using the ``--force`` flag allows the command to succeed that would otherwise
fail if the metadata key did not exist. fail if the metadata key did not exist.
Cloning Snapshots Cloning Snapshots
----------------- -----------------
Subvolumes can be created by cloning subvolume snapshots. Cloning is an asynchronous operation involving copying Subvolumes can be created by cloning subvolume snapshots. Cloning is an asynchronous operation that copies
data from a snapshot to a subvolume. Due to this bulk copy nature, cloning is currently inefficient for very huge data from a snapshot to a subvolume. Due to this bulk copying, cloning is inefficient for very large
data sets. data sets.
.. note:: Removing a snapshot (source subvolume) would fail if there are pending or in progress clone operations. .. note:: Removing a snapshot (source subvolume) would fail if there are pending or in progress clone operations.
Protecting snapshots prior to cloning was a pre-requisite in the Nautilus release, and the commands to protect/unprotect Protecting snapshots prior to cloning was a prerequisite in the Nautilus release, and the commands to protect/unprotect
snapshots were introduced for this purpose. This pre-requisite, and hence the commands to protect/unprotect, is being snapshots were introduced for this purpose. This prerequisite, and hence the commands to protect/unprotect, is being
deprecated in mainline CephFS, and may be removed from a future release. deprecated and may be removed from a future release.
The commands being deprecated are: The commands being deprecated are:
$ ceph fs subvolume snapshot protect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
$ ceph fs subvolume snapshot unprotect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
.. note:: Using the above commands would not result in an error, but they serve no useful function. .. prompt:: bash #
.. note:: Use subvolume info command to fetch subvolume metadata regarding supported "features" to help decide if protect/unprotect of snapshots is required, based on the "snapshot-autoprotect" feature availability. ceph fs subvolume snapshot protect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
ceph fs subvolume snapshot unprotect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
To initiate a clone operation use:: .. note:: Using the above commands will not result in an error, but they have no useful purpose.
.. note:: Use the ``subvolume info`` command to fetch subvolume metadata regarding supported ``features`` to help decide if protect/unprotect of snapshots is required, based on the availability of the ``snapshot-autoprotect`` feature.
To initiate a clone operation use:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> $ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name>
If a snapshot (source subvolume) is a part of non-default group, the group name needs to be specified as per:: If a snapshot (source subvolume) is a part of non-default group, the group name needs to be specified:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --group_name <subvol_group_name> $ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --group_name <subvol_group_name>
Cloned subvolumes can be a part of a different group than the source snapshot (by default, cloned subvolumes are created in default group). To clone to a particular group use:: Cloned subvolumes can be a part of a different group than the source snapshot (by default, cloned subvolumes are created in default group). To clone to a particular group use:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --target_group_name <subvol_group_name> $ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --target_group_name <subvol_group_name>
Similar to specifying a pool layout when creating a subvolume, pool layout can be specified when creating a cloned subvolume. To create a cloned subvolume with a specific pool layout use:: Similar to specifying a pool layout when creating a subvolume, pool layout can be specified when creating a cloned subvolume. To create a cloned subvolume with a specific pool layout use:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --pool_layout <pool_layout> $ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --pool_layout <pool_layout>
Configure maximum number of concurrent clones. The default is set to 4:: Configure the maximum number of concurrent clones. The default is 4:
$ ceph config set mgr mgr/volumes/max_concurrent_clones <value> $ ceph config set mgr mgr/volumes/max_concurrent_clones <value>
To check the status of a clone operation use:: To check the status of a clone operation use:
$ ceph fs clone status <vol_name> <clone_name> [--group_name <group_name>] $ ceph fs clone status <vol_name> <clone_name> [--group_name <group_name>]
A clone can be in one of the following states: A clone can be in one of the following states:
#. `pending` : Clone operation has not started #. ``pending`` : Clone operation has not started
#. `in-progress` : Clone operation is in progress #. ``in-progress`` : Clone operation is in progress
#. `complete` : Clone operation has successfully finished #. ``complete`` : Clone operation has successfully finished
#. `failed` : Clone operation has failed #. ``failed`` : Clone operation has failed
#. `canceled` : Clone operation is cancelled by user #. ``canceled`` : Clone operation is cancelled by user
The reason for a clone failure is shown as below: The reason for a clone failure is shown as below:
#. `errno` : error number #. ``errno`` : error number
#. `error_msg` : failure error string #. ``error_msg`` : failure error string
Sample output of an `in-progress` clone operation:: Here is an example of an ``in-progress`` clone::
$ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1 $ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
$ ceph fs clone status cephfs clone1 $ ceph fs clone status cephfs clone1
@ -539,9 +538,9 @@ Sample output of an `in-progress` clone operation::
} }
} }
.. note:: The `failure` section will be shown only if the clone is in failed or cancelled state .. note:: The ``failure`` section will be shown only if the clone's state is ``failed`` or ``cancelled``
Sample output of a `failed` clone operation:: Here is an example of a ``failed`` clone::
$ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1 $ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
$ ceph fs clone status cephfs clone1 $ ceph fs clone status cephfs clone1
@ -561,11 +560,11 @@ Sample output of a `failed` clone operation::
} }
} }
(NOTE: since `subvol1` is in default group, `source` section in `clone status` does not include group name) (NOTE: since ``subvol1`` is in the default group, the ``source`` object in the ``clone status`` output does not include the group name)
.. note:: Cloned subvolumes are accessible only after the clone operation has successfully completed. .. note:: Cloned subvolumes are accessible only after the clone operation has successfully completed.
For a successful clone operation, `clone status` would look like so:: After a successful clone operation, ``clone status`` will look like the below::
$ ceph fs clone status cephfs clone1 $ ceph fs clone status cephfs clone1
{ {
@ -574,21 +573,21 @@ For a successful clone operation, `clone status` would look like so::
} }
} }
or `failed` state when clone is unsuccessful. If a clone operation is unsuccessful, the ``state`` value will be ``failed``.
On failure of a clone operation, the partial clone needs to be deleted and the clone operation needs to be retriggered. To retry a failed clone operation, the incomplete clone must be deleted and the clone operation must be issued again.
To delete a partial clone use:: To delete a partial clone use::
$ ceph fs subvolume rm <vol_name> <clone_name> [--group_name <group_name>] --force $ ceph fs subvolume rm <vol_name> <clone_name> [--group_name <group_name>] --force
.. note:: Cloning only synchronizes directories, regular files and symbolic links. Also, inode timestamps (access and .. note:: Cloning synchronizes only directories, regular files and symbolic links. Inode timestamps (access and
modification times) are synchronized up to seconds granularity. modification times) are synchronized up to seconds granularity.
An `in-progress` or a `pending` clone operation can be canceled. To cancel a clone operation use the `clone cancel` command:: An ``in-progress`` or a ``pending`` clone operation may be canceled. To cancel a clone operation use the ``clone cancel`` command:
$ ceph fs clone cancel <vol_name> <clone_name> [--group_name <group_name>] $ ceph fs clone cancel <vol_name> <clone_name> [--group_name <group_name>]
On successful cancellation, the cloned subvolume is moved to `canceled` state:: On successful cancellation, the cloned subvolume is moved to the ``canceled`` state::
$ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1 $ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
$ ceph fs clone cancel cephfs clone1 $ ceph fs clone cancel cephfs clone1
@ -604,7 +603,7 @@ On successful cancellation, the cloned subvolume is moved to `canceled` state::
} }
} }
.. note:: The canceled cloned can be deleted by using --force option in `fs subvolume rm` command. .. note:: The canceled clone may be deleted by supplying the ``--force`` option to the ``fs subvolume rm`` command.
.. _subvol-pinning: .. _subvol-pinning:
@ -612,17 +611,16 @@ On successful cancellation, the cloned subvolume is moved to `canceled` state::
Pinning Subvolumes and Subvolume Groups Pinning Subvolumes and Subvolume Groups
--------------------------------------- ---------------------------------------
Subvolumes and subvolume groups may be automatically pinned to ranks according
Subvolumes and subvolume groups can be automatically pinned to ranks according to policies. This can distribute load across MDS ranks in predictable and
to policies. This can help distribute load across MDS ranks in predictable and
stable ways. Review :ref:`cephfs-pinning` and :ref:`cephfs-ephemeral-pinning` stable ways. Review :ref:`cephfs-pinning` and :ref:`cephfs-ephemeral-pinning`
for details on how pinning works. for details on how pinning works.
Pinning is configured by:: Pinning is configured by:
$ ceph fs subvolumegroup pin <vol_name> <group_name> <pin_type> <pin_setting> $ ceph fs subvolumegroup pin <vol_name> <group_name> <pin_type> <pin_setting>
or for subvolumes:: or for subvolumes:
$ ceph fs subvolume pin <vol_name> <group_name> <pin_type> <pin_setting> $ ceph fs subvolume pin <vol_name> <group_name> <pin_type> <pin_setting>
@ -631,7 +629,7 @@ one of ``export``, ``distributed``, or ``random``. The ``pin_setting``
corresponds to the extended attributed "value" as in the pinning documentation corresponds to the extended attributed "value" as in the pinning documentation
referenced above. referenced above.
So, for example, setting a distributed pinning strategy on a subvolume group:: So, for example, setting a distributed pinning strategy on a subvolume group:
$ ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1 $ ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1
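A subvolume can be pinned in the same way; for example, pinning a subvolume to
rank 0 (names are illustrative)::

    $ ceph fs subvolume pin cephfs subvol_1 export 0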

View File

@ -130,7 +130,9 @@ other daemons, please see :ref:`health-checks`.
from properly cleaning up resources used by client requests. This message from properly cleaning up resources used by client requests. This message
appears if a client appears to have more than ``max_completed_requests`` appears if a client appears to have more than ``max_completed_requests``
(default 100000) requests that are complete on the MDS side but haven't (default 100000) requests that are complete on the MDS side but haven't
yet been accounted for in the client's *oldest tid* value. yet been accounted for in the client's *oldest tid* value. The last tid
used by the MDS to trim completed client requests (or flush) is included
in the output of the ``session ls`` (or ``client ls``) command as a debugging aid.
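For example, one way to inspect these values for rank 0 of a file system named
``cephfs`` (the name is illustrative) is::

    ceph tell mds.cephfs:0 session ls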
``MDS_DAMAGE`` ``MDS_DAMAGE``
-------------- --------------

View File

@ -57,6 +57,8 @@
.. confval:: mds_kill_import_at .. confval:: mds_kill_import_at
.. confval:: mds_kill_link_at .. confval:: mds_kill_link_at
.. confval:: mds_kill_rename_at .. confval:: mds_kill_rename_at
.. confval:: mds_inject_skip_replaying_inotable
.. confval:: mds_kill_skip_replaying_inotable
.. confval:: mds_wipe_sessions .. confval:: mds_wipe_sessions
.. confval:: mds_wipe_ino_prealloc .. confval:: mds_wipe_ino_prealloc
.. confval:: mds_skip_ino .. confval:: mds_skip_ino

View File

@ -225,3 +225,17 @@ For the reverse situation:
The ``home/patrick`` directory and its children will be pinned to rank 2 The ``home/patrick`` directory and its children will be pinned to rank 2
because its export pin overrides the policy on ``home``. because its export pin overrides the policy on ``home``.
To remove a partitioning policy, remove the respective extended attribute
or set the value to 0.
.. code:: bash
$ setfattr -n ceph.dir.pin.distributed -v 0 home
# or
$ setfattr -x ceph.dir.pin.distributed home
For export pins, remove the extended attribute or set the extended attribute
value to `-1`.
.. code:: bash
$ setfattr -n ceph.dir.pin -v -1 home

View File

@ -56,6 +56,18 @@ in the sample conf. There are options to do the following:
- enable read delegations (need at least v13.0.1 ``libcephfs2`` package - enable read delegations (need at least v13.0.1 ``libcephfs2`` package
and v2.6.0 stable ``nfs-ganesha`` and ``nfs-ganesha-ceph`` packages) and v2.6.0 stable ``nfs-ganesha`` and ``nfs-ganesha-ceph`` packages)
.. important::
Under certain conditions, NFS access using the CephFS FSAL fails. This
causes an error to be thrown that reads "Input/output error". Under these
circumstances, the application metadata must be set for the CephFS metadata
and CephFS data pools. Do this by running the following command:
.. prompt:: bash $
ceph osd pool application set <cephfs_metadata_pool> cephfs <cephfs_data_pool> cephfs
Configuration for libcephfs clients Configuration for libcephfs clients
----------------------------------- -----------------------------------

View File

@ -143,3 +143,14 @@ The types of damage that can be reported and repaired by File System Scrub are:
* BACKTRACE : Inode's backtrace in the data pool is corrupted. * BACKTRACE : Inode's backtrace in the data pool is corrupted.
Evaluate strays using recursive scrub
=====================================
- In order to evaluate strays, i.e. to purge stray directories in ``~mdsdir``, use the following command::
ceph tell mds.<fsname>:0 scrub start ~mdsdir recursive
- ``~mdsdir`` is not enqueued by default when scrubbing at the CephFS root. In order to perform stray evaluation
at root, run scrub with flags ``scrub_mdsdir`` and ``recursive``::
ceph tell mds.<fsname>:0 scrub start / recursive,scrub_mdsdir

View File

@ -142,6 +142,24 @@ Examples::
ceph fs snap-schedule retention add / 24h4w # add 24 hourly and 4 weekly to retention ceph fs snap-schedule retention add / 24h4w # add 24 hourly and 4 weekly to retention
ceph fs snap-schedule retention remove / 7d4w # remove 7 daily and 4 weekly, leaves 24 hourly ceph fs snap-schedule retention remove / 7d4w # remove 7 daily and 4 weekly, leaves 24 hourly
.. note:: When adding a path to snap-schedule, remember to strip off the mount
point path prefix. Paths to snap-schedule should start at the appropriate
CephFS file system root and not at the host file system root.
e.g. if the Ceph File System is mounted at ``/mnt`` and the path under which
snapshots need to be taken is ``/mnt/some/path`` then the actual path required
by snap-schedule is only ``/some/path``.
.. note:: It should be noted that the "created" field in the snap-schedule status
command output is the timestamp at which the schedule was created. The "created"
timestamp has nothing to do with the creation of actual snapshots. The actual
snapshot creation is accounted for in the "created_count" field, which is a
cumulative count of the total number of snapshots created so far.
.. note:: The maximum number of snapshots to retain per directory is limited by the
config tunable `mds_max_snaps_per_dir`. This tunable defaults to 100.
To ensure a new snapshot can be created, one snapshot less than this will be
retained. So by default, a maximum of 99 snapshots will be retained.
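For example, for a file system mounted at ``/mnt`` where snapshots should be
taken under ``/mnt/some/path``, the schedule is added against the CephFS-rooted
path, and the retention ceiling can be checked via the config tunable (the
values here are illustrative)::

    ceph fs snap-schedule add /some/path 1h
    ceph fs snap-schedule status /some/path
    ceph config get mds mds_max_snaps_per_dir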
Active and inactive schedules Active and inactive schedules
----------------------------- -----------------------------
Snapshot schedules can be added for a path that doesn't exist yet in the Snapshot schedules can be added for a path that doesn't exist yet in the

View File

@ -21,6 +21,133 @@ We can get hints about what's going on by dumping the MDS cache ::
If high logging levels are set on the MDS, that will almost certainly hold the If high logging levels are set on the MDS, that will almost certainly hold the
information we need to diagnose and solve the issue. information we need to diagnose and solve the issue.
Stuck during recovery
=====================
Stuck in up:replay
------------------
If your MDS is stuck in ``up:replay`` then it is likely that the journal is
very long. Did you see ``MDS_HEALTH_TRIM`` cluster warnings saying the MDS is
behind on trimming its journal? If the journal has grown very large, it can
take hours to read the journal. There is no way to work around this, but there
are things you can do to speed things along:
Reduce MDS debugging to 0. Even at the default settings, the MDS logs some
messages to memory for dumping if a fatal error is encountered. You can avoid
this:
.. code:: bash
ceph config set mds debug_mds 0
ceph config set mds debug_ms 0
ceph config set mds debug_monc 0
Note that if the MDS fails, there will be virtually no information to determine
why. If you can calculate when ``up:replay`` will complete, you should restore
these configs just prior to entering the next state:
.. code:: bash
ceph config rm mds debug_mds
ceph config rm mds debug_ms
ceph config rm mds debug_monc
Once you've got replay moving along faster, you can calculate when the MDS will
complete. This is done by examining the journal replay status:
.. code:: bash
$ ceph tell mds.<fs_name>:0 status | jq .replay_status
{
"journal_read_pos": 4195244,
"journal_write_pos": 4195244,
"journal_expire_pos": 4194304,
"num_events": 2,
"num_segments": 2
}
Replay completes when the ``journal_read_pos`` reaches the
``journal_write_pos``. The write position will not change during replay. Track
the progression of the read position to compute the expected time to complete.
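A rough estimate of the time remaining can be obtained by sampling the read
position twice. The following is only a sketch and assumes replay progresses at
a steady rate:

.. code:: bash

   fs=cephfs   # adjust to your file system name
   p1=$(ceph tell mds.$fs:0 status | jq .replay_status.journal_read_pos)
   sleep 60
   p2=$(ceph tell mds.$fs:0 status | jq .replay_status.journal_read_pos)
   wp=$(ceph tell mds.$fs:0 status | jq .replay_status.journal_write_pos)
   # bytes left to replay divided by bytes replayed per minute (+1 avoids division by zero)
   echo "estimated minutes remaining: $(( (wp - p2) / (p2 - p1 + 1) ))"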
Avoiding recovery roadblocks
----------------------------
When trying to urgently restore your file system during an outage, here are some
things to do:
* **Deny all client reconnects.** This effectively blocklists all existing
CephFS sessions so all mounts will hang or become unavailable.
.. code:: bash
ceph config set mds mds_deny_all_reconnect true
Remember to undo this after the MDS becomes active.
.. note:: This does not prevent new sessions from connecting. For that, see the ``refuse_client_session`` file system setting.
* **Extend the MDS heartbeat grace period**. This avoids replacing an MDS that appears
"stuck" doing some operation. Sometimes recovery of an MDS may involve an
operation that may take longer than expected (from the programmer's
perspective). This is more likely when recovery is already taking longer than
normal to complete (indicated by your reading this document).
Avoid unnecessary replacement loops by extending the heartbeat grace period:
.. code:: bash
ceph config set mds mds_heartbeat_reset_grace 3600
This has the effect of having the MDS continue to send beacons to the monitors
even when its internal "heartbeat" mechanism has not been reset (beat) in one
hour. Note the previous mechanism for achieving this was via the
`mds_beacon_grace` monitor setting.
* **Disable open file table prefetch.** Normally, the MDS will prefetch
directory contents during recovery to heat up its cache. During long
recovery, the cache is probably already hot **and large**. So this behavior
can be undesirable. Disable using:
.. code:: bash
ceph config set mds mds_oft_prefetch_dirfrags false
* **Turn off clients.** Clients reconnecting to the newly ``up:active`` MDS may
cause new load on the file system when it's just getting back on its feet.
There will likely be some general maintenance to do before workloads should be
resumed. For example, expediting journal trim may be advisable if the recovery
took a long time because replay was reading an overly large journal.
You can do this manually or use the new file system tunable:
.. code:: bash
ceph fs set <fs_name> refuse_client_session true
That prevents any clients from establishing new sessions with the MDS.
Expediting MDS journal trim
===========================
If your MDS journal grew too large (maybe your MDS was stuck in up:replay for a
long time!), you will want to have the MDS trim its journal more frequently.
You will know the journal is too large because of ``MDS_HEALTH_TRIM`` warnings.
The main tunable available to do this is to modify the MDS tick interval. The
"tick" interval drives several upkeep activities in the MDS. It is strongly
recommended that no significant file system load be present when modifying this tick
interval. This setting only affects an MDS in ``up:active``. The MDS does not
trim its journal during recovery.
.. code:: bash
ceph config set mds mds_tick_interval 2
RADOS Health RADOS Health
============ ============
@ -188,6 +315,98 @@ You can enable dynamic debug against the CephFS module.
Please see: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh Please see: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh
In-memory Log Dump
==================
In-memory logs can be dumped by setting ``mds_extraordinary_events_dump_interval``
during lower-level debugging (log level < 10). ``mds_extraordinary_events_dump_interval``
is the interval in seconds for dumping the recent in-memory logs when there is an Extra-Ordinary event.
The Extra-Ordinary events are classified as:
* Client Eviction
* Missed Beacon ACK from the monitors
* Missed Internal Heartbeats
In-memory Log Dump is disabled by default to prevent log file bloat in a production environment.
The following commands enable it::
$ ceph config set mds debug_mds <log_level>/<gather_level>
$ ceph config set mds mds_extraordinary_events_dump_interval <seconds>
The ``log_level`` should be < 10 and ``gather_level`` should be >= 10 to enable in-memory log dump.
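For example, a workable combination (illustrative values) is a log level of 1
with a gather level of 20, dumping every 60 seconds::

    $ ceph config set mds debug_mds 1/20
    $ ceph config set mds mds_extraordinary_events_dump_interval 60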
When it is enabled, the MDS checks for the extra-ordinary events every
``mds_extraordinary_events_dump_interval`` seconds and if any of them occurs, MDS dumps the
in-memory logs containing the relevant event details in ceph-mds log.
.. note:: For higher log levels (log_level >= 10) there is no reason to dump the In-memory Logs and a
lower gather level (gather_level < 10) is insufficient to gather In-memory Logs. Thus a
log level >=10 or a gather level < 10 in debug_mds would prevent enabling the In-memory Log Dump.
In such cases, when enabling fails, the value of
mds_extraordinary_events_dump_interval must be reset to 0 before enabling it again using the above commands.
The In-memory Log Dump can be disabled using::
$ ceph config set mds mds_extraordinary_events_dump_interval 0
Filesystems Become Inaccessible After an Upgrade
================================================
.. note::
You can avoid ``operation not permitted`` errors by running this procedure
before an upgrade. As of May 2023, it seems that ``operation not permitted``
errors of the kind discussed here occur after upgrades after Nautilus
(inclusive).
IF
you have CephFS file systems that have data and metadata pools that were
created by a ``ceph fs new`` command (meaning that they were not created
with the defaults)
OR
you have an existing CephFS file system and are upgrading to a new post-Nautilus
major version of Ceph
THEN
in order for the documented ``ceph fs authorize...`` commands to function as
documented (and to avoid 'operation not permitted' errors when doing file I/O
or similar security-related problems for all users except the ``client.admin``
user), you must first run:
.. prompt:: bash $
ceph osd pool application set <your metadata pool name> cephfs metadata <your ceph fs filesystem name>
and
.. prompt:: bash $
ceph osd pool application set <your data pool name> cephfs data <your ceph fs filesystem name>
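For example, for a hypothetical file system named ``cephfs_a`` whose pools still
carry the old default names (``cephfs_a`` for data and ``cephfs_a_metadata`` for
metadata), the two commands would be:

.. prompt:: bash $

   ceph osd pool application set cephfs_a_metadata cephfs metadata cephfs_a
   ceph osd pool application set cephfs_a cephfs data cephfs_a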
Otherwise, when the OSDs receive a request to read or write data (not the
directory info, but file data) they will not know which Ceph file system name
to look up. This is true also of pool names, because the 'defaults' themselves
changed in the major releases, from::
data pool=fsname
metadata pool=fsname_metadata
to::
data pool=fsname.data and
metadata pool=fsname.meta
Any setup that used ``client.admin`` for all mounts did not run into this
problem, because the admin key gave blanket permissions.
A temporary fix involves changing mount requests to the 'client.admin' user and
its associated key. A less drastic but only partial fix is to change the osd cap for
your user to just ``caps osd = "allow rw"`` and delete ``tag cephfs
data=....``
Reporting Issues Reporting Issues
================ ================

View File

@ -2,38 +2,44 @@
CephFS Mirroring CephFS Mirroring
================ ================
CephFS supports asynchronous replication of snapshots to a remote CephFS file system via CephFS supports asynchronous replication of snapshots to a remote CephFS file
`cephfs-mirror` tool. Snapshots are synchronized by mirroring snapshot data followed by system via `cephfs-mirror` tool. Snapshots are synchronized by mirroring
creating a snapshot with the same name (for a given directory on the remote file system) as snapshot data followed by creating a snapshot with the same name (for a given
the snapshot being synchronized. directory on the remote file system) as the snapshot being synchronized.
Requirements Requirements
------------ ------------
The primary (local) and secondary (remote) Ceph clusters version should be Pacific or later. The primary (local) and secondary (remote) Ceph clusters version should be
Pacific or later.
Key Idea Key Idea
-------- --------
For a given snapshot pair in a directory, `cephfs-mirror` daemon will rely on readdir diff For a given snapshot pair in a directory, `cephfs-mirror` daemon will rely on
to identify changes in a directory tree. The diffs are applied to directory in the remote readdir diff to identify changes in a directory tree. The diffs are applied to
file system thereby only synchronizing files that have changed between two snapshots. directory in the remote file system thereby only synchronizing files that have
changed between two snapshots.
This feature is tracked here: https://tracker.ceph.com/issues/47034. This feature is tracked here: https://tracker.ceph.com/issues/47034.
Currently, snapshot data is synchronized by bulk copying to the remote filesystem. Currently, snapshot data is synchronized by bulk copying to the remote
filesystem.
.. note:: Synchronizing hardlinks is not supported -- hardlinked files get synchronized .. note:: Synchronizing hardlinks is not supported -- hardlinked files get
as separate files. synchronized as separate files.
Creating Users Creating Users
-------------- --------------
Start by creating a user (on the primary/local cluster) for the mirror daemon. This user Start by creating a user (on the primary/local cluster) for the mirror daemon.
requires write capability on the metadata pool to create RADOS objects (index objects) This user requires write capability on the metadata pool to create RADOS
for watch/notify operation and read capability on the data pool(s). objects (index objects) for watch/notify operation and read capability on the
data pool(s).
$ ceph auth get-or-create client.mirror mon 'profile cephfs-mirror' mds 'allow r' osd 'allow rw tag cephfs metadata=*, allow r tag cephfs data=*' mgr 'allow r' .. prompt:: bash $
ceph auth get-or-create client.mirror mon 'profile cephfs-mirror' mds 'allow r' osd 'allow rw tag cephfs metadata=*, allow r tag cephfs data=*' mgr 'allow r'
Create a user for each file system peer (on the secondary/remote cluster). This user needs Create a user for each file system peer (on the secondary/remote cluster). This user needs
to have full capabilities on the MDS (to take snapshots) and the OSDs:: to have full capabilities on the MDS (to take snapshots) and the OSDs::
@ -371,7 +377,7 @@ information. To check which mirror daemon a directory has been mapped to use::
"state": "mapped" "state": "mapped"
} }
.. note:: `instance_id` is the RAODS instance-id associated with a mirror daemon. .. note:: `instance_id` is the RADOS instance-id associated with a mirror daemon.
Other information such as `state` and `last_shuffled` are interesting when running Other information such as `state` and `last_shuffled` are interesting when running
multiple mirror daemons. multiple mirror daemons.

View File

@ -0,0 +1,426 @@
===============
Deduplication
===============
Introduction
============
Applying data deduplication on an existing software stack is not easy
due to the additional metadata management and the original data processing
that it requires.
In a typical deduplication system, the input source, as a data
object, is split into multiple chunks by a chunking algorithm.
The deduplication system then compares each chunk with
the existing data chunks previously stored in the storage.
To this end, a fingerprint index that stores the hash value
of each chunk is employed by the deduplication system
in order to easily find the existing chunks by comparing
hash values rather than searching all contents that reside in
the underlying storage.
There are many challenges to implementing deduplication on top
of Ceph. Among them, two issues are essential.
The first is managing the scalability of the fingerprint index; the second is
the complexity of ensuring compatibility between newly applied
deduplication metadata and existing metadata.
Key Idea
========
1. Content hashing (double hashing): Each client can find an object's data
for an object ID using CRUSH. With CRUSH, a client knows the object's location
in the Base tier.
By hashing the object's content at the Base tier, a new OID (chunk ID) is generated.
The Chunk tier stores the data under the new OID, which holds a part of the original object's content.
Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
CRUSH(K) -> chunk's location
2. Self-contained object: An external metadata design
makes integration with storage feature support difficult,
since existing storage features cannot recognize the
additional external data structures. If we can design a data
deduplication system without any external component, the
original storage features can be reused.
More details in https://ieeexplore.ieee.org/document/8416369
Design
======
.. ditaa::
+-------------+
| Ceph Client |
+------+------+
^
Tiering is |
Transparent | Metadata
to Ceph | +---------------+
Client Ops | | |
| +----->+ Base Pool |
| | | |
| | +-----+---+-----+
| | | ^
v v | | Dedup metadata in Base Pool
+------+----+--+ | | (Dedup metadata contains chunk offsets
| Objecter | | | and fingerprints)
+-----------+--+ | |
^ | | Data in Chunk Pool
| v |
| +-----+---+-----+
| | |
+----->| Chunk Pool |
| |
+---------------+
Data
Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since these two pools are divided based on
the purpose and usage, each pool can be managed more
efficiently according to its different characteristics. The base
pool and the chunk pool can separately select a redundancy
scheme between replication and erasure coding depending on
their usage, and each pool can be placed in a different storage
location depending on the required performance.
For details on how to use it, please see ``osd_internals/manifest.rst``
Usage Patterns
==============
Each Ceph interface layer presents unique opportunities and costs for
deduplication and tiering in general.
RadosGW
-------
S3 big data workloads seem like a good opportunity for deduplication. These
objects tend to be write once, read mostly objects which don't see partial
overwrites. As such, it makes sense to fingerprint and dedup up front.
Unlike cephfs and rbd, radosgw has a system for storing
explicit metadata in the head object of a logical s3 object for
locating the remaining pieces. As such, radosgw could use the
refcounting machinery (``osd_internals/refcount.rst``) directly without
needing direct support from rados for manifests.
RBD/Cephfs
----------
RBD and CephFS both use deterministic naming schemes to partition
block devices/file data over rados objects. As such, the redirection
metadata would need to be included as part of rados, presumably
transparently.
Moreover, unlike radosgw, rbd/cephfs rados objects can see overwrites.
For those objects, we don't really want to perform dedup, and we don't
want to pay a write latency penalty in the hot path to do so anyway.
As such, performing tiering and dedup on cold objects in the background
is likely to be preferred.
One important wrinkle, however, is that both rbd and cephfs workloads
often feature usage of snapshots. This means that the rados manifest
support needs robust support for snapshots.
RADOS Machinery
===============
For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
For more information on rados refcount support, see ``osd_internals/refcount.rst``.
Status and Future Work
======================
At the moment, there exists some preliminary support for manifest
objects within the OSD as well as a dedup tool.
RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add
direct support for fingerprinting and redirects into the refcount pool
to radosgw.
Aside from radosgw, completing work on manifest object support in the
OSD particularly as it relates to snapshots would be the next step for
rbd and cephfs workloads.
How to use deduplication
========================
* This feature is highly experimental and is subject to change or removal.
Ceph provides deduplication using RADOS machinery.
Below we explain how to perform deduplication.
Prerequisite
------------
If the Ceph cluster is started from Ceph mainline, users need to check
that the ``ceph-test`` package, which includes ceph-dedup-tool, is installed.
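A quick way to confirm that the tool is available (a sketch; use the package
query appropriate for your distribution):

.. code:: bash

   which ceph-dedup-tool || rpm -q ceph-test || dpkg -s ceph-test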
Detailed Instructions
---------------------
Users can use ceph-dedup-tool with ``estimate``, ``sample-dedup``,
``chunk-scrub``, and ``chunk-repair`` operations. For convenience, the
necessary operations are exposed through ceph-dedup-tool, and they can be
invoked freely from any type of script.
1. Estimate space saving ratio of a target pool using ``ceph-dedup-tool``.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
ceph-dedup-tool --op estimate
--pool [BASE_POOL]
--chunk-size [CHUNK_SIZE]
--chunk-algorithm [fixed|fastcdc]
--fingerprint-algorithm [sha1|sha256|sha512]
--max-thread [THREAD_COUNT]
This CLI command will show how much storage space can be saved when deduplication
is applied to the pool. If the amount of saved space is higher than the user's expectation,
the pool is probably worth deduplicating.
Users should specify the ``BASE_POOL``, within which the objects targeted for deduplication
are stored. The users also need to run ceph-dedup-tool multiple times
with varying ``chunk_size`` to find the optimal chunk size. Note that the
optimal value probably differs depending on the content of each object in the case of the fastcdc
chunk algorithm (as opposed to fixed).
Example output:
.. code:: bash
{
"chunk_algo": "fastcdc",
"chunk_sizes": [
{
"target_chunk_size": 8192,
"dedup_bytes_ratio": 0.4897049
"dedup_object_ratio": 34.567315
"chunk_size_average": 64439,
"chunk_size_stddev": 33620
}
],
"summary": {
"examined_objects": 95,
"examined_bytes": 214968649
}
}
The above is an example output when executing ``estimate``. ``target_chunk_size`` is the same as
``chunk_size`` given by the user. ``dedup_bytes_ratio`` shows how many bytes are redundant from
examined bytes. For instance, 1 - ``dedup_bytes_ratio`` means the percentage of saved storage space.
``dedup_object_ratio`` is the number of generated chunk objects divided by ``examined_objects``. ``chunk_size_average``
is the average size of the chunks produced when performing CDC---this may differ from ``target_chunk_size``
because CDC generates different chunk boundaries depending on the content. ``chunk_size_stddev``
represents the standard deviation of the chunk size.
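To compare chunk sizes as suggested above, a simple sweep can be scripted. This
is only a sketch; the pool name and candidate sizes are illustrative:

.. code:: bash

   # run the estimate for several candidate chunk sizes against a pool named "base"
   for cs in 8192 16384 32768 65536; do
     ceph-dedup-tool --op estimate --pool base \
       --chunk-size $cs --chunk-algorithm fastcdc \
       --fingerprint-algorithm sha1 --max-thread 4
   done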
2. Create chunk pool.
^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
ceph osd pool create [CHUNK_POOL]
3. Run dedup command (there are two ways).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- **sample-dedup**
.. code:: bash
ceph-dedup-tool --op sample-dedup
--pool [BASE_POOL]
--chunk-pool [CHUNK_POOL]
--chunk-size [CHUNK_SIZE]
--chunk-algorithm [fastcdc]
--fingerprint-algorithm [sha1|sha256|sha512]
--chunk-dedup-threshold [THRESHOLD]
--max-thread [THREAD_COUNT]
--sampling-ratio [SAMPLE_RATIO]
--wakeup-period [WAKEUP_PERIOD]
--loop
--snap
The ``sample-dedup`` command spawns threads specified by ``THREAD_COUNT`` to deduplicate objects on
the ``BASE_POOL``. According to the sampling ratio---a full search is done if ``SAMPLE_RATIO`` is 100---the threads selectively
perform deduplication if a chunk is found to be redundant more than ``THRESHOLD`` times during iteration.
If ``--loop`` is set, the threads will wake up after ``WAKEUP_PERIOD`` and run again. If not, the threads will exit after one iteration.
Example output:
.. code:: bash
$ bin/ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 303 GiB 294 GiB 9.0 GiB 9.0 GiB 2.99
TOTAL 303 GiB 294 GiB 9.0 GiB 9.0 GiB 2.99
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 577 KiB 2 1.7 MiB 0 97 GiB
base 2 32 2.0 GiB 517 6.0 GiB 2.02 97 GiB
chunk 3 32 0 B 0 0 B 0 97 GiB
$ bin/ceph-dedup-tool --op sample-dedup --pool base --chunk-pool chunk
--fingerprint-algorithm sha1 --chunk-algorithm fastcdc --loop --sampling-ratio 100
--chunk-dedup-threshold 2 --chunk-size 8192 --max-thread 4 --wakeup-period 60
$ bin/ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 303 GiB 298 GiB 5.4 GiB 5.4 GiB 1.80
TOTAL 303 GiB 298 GiB 5.4 GiB 5.4 GiB 1.80
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 577 KiB 2 1.7 MiB 0 98 GiB
base 2 32 452 MiB 262 1.3 GiB 0.50 98 GiB
chunk 3 32 258 MiB 25.91k 938 MiB 0.31 98 GiB
- **object dedup**
.. code:: bash
ceph-dedup-tool --op object-dedup
--pool [BASE_POOL]
--object [OID]
--chunk-pool [CHUNK_POOL]
--fingerprint-algorithm [sha1|sha256|sha512]
--dedup-cdc-chunk-size [CHUNK_SIZE]
The ``object-dedup`` command triggers deduplication on the RADOS object specified by ``OID``.
All parameters shown above must be specified. ``CHUNK_SIZE`` should be taken from
the results of step 1 above.
Note that when this command is executed, ``fastcdc`` will be set by default and other parameters
such as ``fingerprint-algorithm`` and ``CHUNK_SIZE`` will be set as defaults for the pool.
Deduplicated objects will appear in the chunk pool. If the object is mutated over time, the user needs to re-run
``object-dedup`` because the chunk boundaries must be recalculated based on the updated contents.
The user needs to specify ``snap`` if the target object is snapshotted. After deduplication is done, the target
object size in ``BASE_POOL`` is zero (evicted) and chunk objects are generated---these appear in ``CHUNK_POOL``.
4. Read/write I/Os
^^^^^^^^^^^^^^^^^^
After step 3, the users don't need to consider anything about I/Os. Deduplicated objects are
completely compatible with existing RADOS operations.
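For example, a deduplicated object can still be read back and overwritten with
the ordinary ``rados`` CLI (the pool and object names here are illustrative):

.. code:: bash

   rados -p base get testfile2 /tmp/testfile2
   rados -p base put testfile2 /tmp/testfile2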
5. Run scrub to fix reference count
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Reference mismatches can occur on rare occasions due to false positives when handling reference counts for
deduplicated RADOS objects. These mismatches can be fixed by periodically scrubbing the pool:
.. code:: bash
ceph-dedup-tool --op chunk-scrub
--chunk-pool [CHUNK_POOL]
--pool [POOL]
--max-thread [THREAD_COUNT]
The ``chunk-scrub`` command identifies reference mismatches between a
metadata object and a chunk object. The ``chunk-pool`` parameter tells
ceph-dedup-tool where the target chunk objects are located.
Example output:
A reference mismatch is intentionally created by inserting a reference (dummy-obj) into a chunk object (2ac67f70d3dd187f8f332bb1391f61d4e5c9baae) by using chunk-get-ref.
.. code:: bash
$ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
{
"type": "by_object",
"count": 2,
"refs": [
{
"oid": "testfile2",
"key": "",
"snapid": -2,
"hash": 2905889452,
"max": 0,
"pool": 2,
"namespace": ""
},
{
"oid": "dummy-obj",
"key": "",
"snapid": -2,
"hash": 1203585162,
"max": 0,
"pool": 2,
"namespace": ""
}
]
}
$ bin/ceph-dedup-tool --op chunk-scrub --chunk-pool chunk --max-thread 10
10 seconds is set as report period by default
join
join
2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
--done--
2ac67f70d3dd187f8f332bb1391f61d4e5c9baae ref 10:5102bde2:::dummy-obj:head: referencing pool does not exist
--done--
Total object : 1
Examined object : 1
Damaged object : 1
6. Repair a mismatched chunk reference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any reference mismatches occur after the ``chunk-scrub``, it is
recommended to perform the ``chunk-repair`` operation to fix reference
mismatches. The ``chunk-repair`` operation helps in resolving the
reference mismatch and restoring consistency.
.. code:: bash
ceph-dedup-tool --op chunk-repair
--chunk-pool [CHUNK_POOL_NAME]
--object [CHUNK_OID]
--target-ref [TARGET_OID]
--target-ref-pool-id [TARGET_POOL_ID]
``chunk-repair`` fixes the ``target-ref``, which is an incorrect reference held by
an ``object``. To fix it correctly, the user must enter the correct
``TARGET_OID`` and ``TARGET_POOL_ID``.
.. code:: bash
$ bin/ceph-dedup-tool --op chunk-repair --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae --target-ref dummy-obj --target-ref-pool-id 10
2ac67f70d3dd187f8f332bb1391f61d4e5c9baae has 1 references for dummy-obj
dummy-obj has 0 references for 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
fix dangling reference from 1 to 0
$ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
{
"type": "by_object",
"count": 1,
"refs": [
{
"oid": "testfile2",
"key": "",
"snapid": -2,
"hash": 2905889452,
"max": 0,
"pool": 2,
"namespace": ""
}
]
}

View File

@ -1,3 +1,5 @@
.. _dev_deploying_a_development_cluster:
================================= =================================
Deploying a development cluster Deploying a development cluster
================================= =================================

View File

@ -50,11 +50,10 @@ optional Ceph internal services are started automatically when it is used to
start a Ceph cluster. vstart is the basis for the three most commonly used start a Ceph cluster. vstart is the basis for the three most commonly used
development environments in Ceph Dashboard. development environments in Ceph Dashboard.
You can read more about vstart in `Deploying a development cluster`_. You can read more about vstart in :ref:`Deploying a development cluster
Additional information for developers can also be found in the `Developer <dev_deploying_a_development_cluster>`. Additional information for developers
Guide`_. can also be found in the `Developer Guide`_.
.. _Deploying a development cluster: https://docs.ceph.com/docs/master/dev/dev_cluster_deployement/
.. _Developer Guide: https://docs.ceph.com/docs/master/dev/quick_guide/ .. _Developer Guide: https://docs.ceph.com/docs/master/dev/quick_guide/
Host-based vs Docker-based Development Environments Host-based vs Docker-based Development Environments
@ -1269,7 +1268,6 @@ Tests can be found under the `a11y folder <./src/pybind/mgr/dashboard/frontend/c
beforeEach(() => { beforeEach(() => {
cy.login(); cy.login();
Cypress.Cookies.preserveOnce('token');
shared.navigateTo(); shared.navigateTo();
}); });

View File

@ -55,7 +55,7 @@ using `vstart_runner.py`_. To do that, you'd need `teuthology`_ installed::
$ virtualenv --python=python3 venv $ virtualenv --python=python3 venv
$ source venv/bin/activate $ source venv/bin/activate
$ pip install 'setuptools >= 12' $ pip install 'setuptools >= 12'
$ pip install git+https://github.com/ceph/teuthology#egg=teuthology[test] $ pip install teuthology[test]@git+https://github.com/ceph/teuthology
$ deactivate $ deactivate
The above steps installs teuthology in a virtual environment. Before running The above steps installs teuthology in a virtual environment. Before running

View File

@ -3,9 +3,74 @@ Serialization (encode/decode)
============================= =============================
When a structure is sent over the network or written to disk, it is When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Serializable structures have encoded into a string of bytes. Usually (but not always -- multiple
``encode`` and ``decode`` methods that write and read from ``bufferlist`` serialization facilities coexist in Ceph) serializable structures
objects representing byte strings. have ``encode`` and ``decode`` methods that write and read from
``bufferlist`` objects representing byte strings.
Terminology
-----------
It is best to think not in the domain of daemons and clients but of
encoders and decoders. An encoder serializes a structure into a bufferlist
while a decoder does the opposite.
Encoders and decoders can be referred to collectively as dencoders.
Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder was used by the client while the decoder was used by the OSD.
However, these roles can swap -- just imagine the handling of the response: the OSD encodes
the ``MOSDOpReply`` while the RBD client decodes it.
Encoders and decoders operate according to a format which is defined
by the programmer by implementing the ``encode`` and ``decode`` methods.
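The ``ceph-dencoder`` utility is a convenient way to exercise dencoders from the
command line; for example (a sketch -- the set of registered types and the exact
invocation may vary by release)::

    # list the types that have registered dencoders
    ceph-dencoder list_types
    # decode a previously exported object_info_t blob and dump it as JSON
    ceph-dencoder type object_info_t import /tmp/oi.bin decode dump_json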
Principles for format change
----------------------------
It is not unusual for the serialization format to change. This
process requires careful attention during both development
and review.
The general rule is that a decoder must understand what has been
encoded by an encoder. Most of the problems come from ensuring
that compatibility continues between old decoders and new encoders
as well as between new decoders and old encoders. One should assume
that -- unless stated otherwise -- any mix (old/new) is
possible in a cluster. There are two main reasons for that:
1. Upgrades. Although there are recommendations related to the order
of entity types (mons/osds/clients), it is not mandatory and
no assumption should be made about it.
2. Huge variability of client versions. It was always the case
that kernel (and thus kernel clients) upgrades are decoupled
from Ceph upgrades. Moreover, the proliferation of containerization
brings this variability even to e.g. ``librbd`` -- now user-space
libraries live in the container itself.
With this being said, there are a few rules limiting the degree
of interoperability between dencoders:
* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client-involved scenarios,
* ``n-3..`` soft requirement for client-involved scenarios. Ideally
every client should be able to talk to any version of daemons.
As the underlying reasons are the same, the rules dencoders
follow are virtually the same as for deprecations of our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.
Frameworks
----------
Currently multiple genres of dencoding helpers co-exist.
* encoding.h (the most widely used one),
* denc.h (performance optimized, seen mostly in ``BlueStore``),
* the `Message` hierarchy.
Although details vary, the interoperability rules stay the same.
Adding a field to a structure Adding a field to a structure
----------------------------- -----------------------------
@ -93,3 +158,69 @@ because we might still be passed older-versioned messages that do not
have the field. The ``struct_v`` variable is a local set by the ``DECODE_START`` have the field. The ``struct_v`` variable is a local set by the ``DECODE_START``
macro. macro.
Into the weeds
--------------
The append-extendability of our dencoders is a result of the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros bring.
They implement extensibility facilities. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.
.. code-block:: cpp
/**
* start encoding block
*
* @param v current (code) version of the encoding
* @param compat oldest code version that can decode it
* @param bl bufferlist to encode to
*
*/
#define ENCODE_START(v, compat, bl) \
__u8 struct_v = v; \
__u8 struct_compat = compat; \
ceph_le32 struct_len; \
auto filler = (bl).append_hole(sizeof(struct_v) + \
sizeof(struct_compat) + sizeof(struct_len)); \
const auto starting_bl_len = (bl).length(); \
using ::ceph::encode; \
do {
The ``struct_len`` field allows the decoder to eat all the bytes that were
left undecoded in the user-provided ``decode`` implementation.
Analogously, decoders track how much input has been decoded in the
user-provided ``decode`` methods.
.. code-block:: cpp
#define DECODE_START(bl) \
unsigned struct_end = 0; \
__u32 struct_len; \
decode(struct_len, bl); \
... \
struct_end = bl.get_off() + struct_len; \
} \
do {
The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist is critical, as dencoders tend to be nested;
just leaving it intact would work only for the very last ``decode`` call
in a nested structure.
.. code-block:: cpp
#define DECODE_FINISH(bl) \
} while (false); \
if (struct_end) { \
... \
if (bl.get_off() < struct_end) \
bl += struct_end - bl.get_off(); \
}
This entire, cooperative mechanism allows an encoder (in its later revisions)
to generate a longer byte stream (e.g. due to adding a new field at the end)
without worrying that the residue will crash older decoder revisions.

View File

@ -16,32 +16,6 @@ mgr module
The following diagrams outline the involved parties and how they interact when the clients The following diagrams outline the involved parties and how they interact when the clients
query for the reports: query for the reports:
.. seqdiag::
seqdiag {
default_note_color = lightblue;
osd; mon; ceph-cli;
osd => mon [ label = "update osdmap service" ];
osd => mon [ label = "update osdmap service" ];
ceph-cli -> mon [ label = "send 'health' command" ];
mon -> mon [ leftnote = "gather checks from services" ];
ceph-cli <-- mon [ label = "checks and mutes" ];
}
.. seqdiag::
seqdiag {
default_note_color = lightblue;
osd; mon; mgr; mgr-module;
mgr -> mon [ label = "subscribe for 'mgrdigest'" ];
osd => mon [ label = "update osdmap service" ];
osd => mon [ label = "update osdmap service" ];
mon -> mgr [ label = "send MMgrDigest" ];
mgr -> mgr [ note = "update cluster state" ];
mon <-- mgr;
mgr-module -> mgr [ label = "mgr.get('health')" ];
mgr-module <-- mgr [ label = "heath reports in json" ];
}
Where are the Reports Generated Where are the Reports Generated
=============================== ===============================
@ -68,19 +42,6 @@ later loaded and decoded, so they can be collected on demand. When it comes to
``MDSMonitor``, it persists the health metrics in the beacon sent by the MDS daemons, ``MDSMonitor``, it persists the health metrics in the beacon sent by the MDS daemons,
and prepares health reports when storing the pending changes. and prepares health reports when storing the pending changes.
.. seqdiag::
seqdiag {
default_note_color = lightblue;
mds; mon-mds; mon-health; ceph-cli;
mds -> mon-mds [ label = "send beacon" ];
mon-mds -> mon-mds [ note = "store health metrics in beacon" ];
mds <-- mon-mds;
mon-mds -> mon-mds [ note = "encode_health(checks)" ];
ceph-cli -> mon-health [ label = "send 'health' command" ];
mon-health => mon-mds [ label = "gather health checks" ];
ceph-cli <-- mon-health [ label = "checks and mutes" ];
}
So, if we want to add a new warning related to cephfs, probably the best place to So, if we want to add a new warning related to cephfs, probably the best place to
start is ``MDSMonitor::encode_pending()``, where health reports are collected from start is ``MDSMonitor::encode_pending()``, where health reports are collected from
@ -106,23 +67,3 @@ metrics and status to mgr using ``MMgrReport``. On the mgr side, it periodically
an aggregated report to the ``MgrStatMonitor`` service on mon. As explained earlier, an aggregated report to the ``MgrStatMonitor`` service on mon. As explained earlier,
this service just persists the health reports in the aggregated report to the monstore. this service just persists the health reports in the aggregated report to the monstore.
.. seqdiag::
seqdiag {
default_note_color = lightblue;
service; mgr; mon-mgr-stat; mon-health;
service -> mgr [ label = "send(open)" ];
mgr -> mgr [ note = "register the new service" ];
service <-- mgr;
mgr => service [ label = "send(configure)" ];
service -> mgr [ label = "send(report)" ];
mgr -> mgr [ note = "update/aggregate service metrics" ];
service <-- mgr;
service => mgr [ label = "send(report)" ];
mgr -> mon-mgr-stat [ label = "send(mgr-report)" ];
mon-mgr-stat -> mon-mgr-stat [ note = "store health checks in the report" ];
mgr <-- mon-mgr-stat;
mon-health => mon-mgr-stat [ label = "gather health checks" ];
service => mgr [ label = "send(report)" ];
service => mgr [ label = "send(close)" ];
}

View File

@ -87,7 +87,8 @@ Optionals are represented as a presence byte, followed by the item if it exists.
T element[present? 1 : 0]; // Only if present is non-zero. T element[present? 1 : 0]; // Only if present is non-zero.
} }
Optionals are used to encode ``boost::optional``. Optionals are used to encode ``boost::optional`` and, since introducing
C++17 to Ceph, ``std::optional``.
Pair Pair
---- ----

View File

@ -5,7 +5,7 @@ jerasure plugin
Introduction Introduction
------------ ------------
The parameters interpreted by the jerasure plugin are: The parameters interpreted by the ``jerasure`` plugin are:
:: ::
@ -31,3 +31,5 @@ upstream repositories `http://jerasure.org/jerasure/jerasure
`http://jerasure.org/jerasure/gf-complete `http://jerasure.org/jerasure/gf-complete
<http://jerasure.org/jerasure/gf-complete>`_ . The difference <http://jerasure.org/jerasure/gf-complete>`_ . The difference
between the two, if any, should match pull requests against upstream. between the two, if any, should match pull requests against upstream.
Note that as of 2023, the ``jerasure.org`` web site may no longer be
legitimate and/or associated with the original project.

View File

@ -114,29 +114,6 @@ baseline throughput for each device type was determined:
256 KiB. For HDDs, it was 40MiB. The above throughput was obtained 256 KiB. For HDDs, it was 40MiB. The above throughput was obtained
by running 4 KiB random writes at a queue depth of 64 for 300 secs. by running 4 KiB random writes at a queue depth of 64 for 300 secs.
Factoring I/O Cost in mClock
============================
The services using mClock have a cost associated with them. The cost can be
different for each service type. The mClock scheduler factors in the cost
during calculations for parameters like *reservation*, *weight* and *limit*.
The calculations determine when the next op for the service type can be
dequeued from the operation queue. In general, the higher the cost, the longer
an op remains in the operation queue.
A cost modeling study was performed to determine the cost per I/O and the cost
per byte for SSD and HDD device types. The following cost specific options are
used under the hood by mClock,
- :confval:`osd_mclock_cost_per_io_usec`
- :confval:`osd_mclock_cost_per_io_usec_hdd`
- :confval:`osd_mclock_cost_per_io_usec_ssd`
- :confval:`osd_mclock_cost_per_byte_usec`
- :confval:`osd_mclock_cost_per_byte_usec_hdd`
- :confval:`osd_mclock_cost_per_byte_usec_ssd`
See :doc:`/rados/configuration/mclock-config-ref` for more details.
MClock Profile Allocations MClock Profile Allocations
========================== ==========================

View File

@ -0,0 +1,93 @@
=============
PastIntervals
=============
Purpose
-------
There are two situations where we need to consider the set of all acting-set
OSDs for a PG back to some epoch ``e``:
* During peering, we need to consider the acting set for every epoch back to
``last_epoch_started``, the last epoch in which the PG completed peering and
became active.
(see :doc:`/dev/osd_internals/last_epoch_started` for a detailed explanation)
* During recovery, we need to consider the acting set for every epoch back to
``last_epoch_clean``, the last epoch at which all of the OSDs in the acting
set were fully recovered, and the acting set was full.
For either of these purposes, we could build such a set by iterating backwards
from the current OSDMap to the relevant epoch. Instead, we maintain a structure
PastIntervals for each PG.
An ``interval`` is a contiguous sequence of OSDMap epochs where the PG mapping
didn't change. This includes changes to the acting set, the up set, the
primary, and several other parameters fully spelled out in
PastIntervals::check_new_interval.
Maintenance and Trimming
------------------------
The PastIntervals structure stores a record for each ``interval`` back to
last_epoch_clean. On each new ``interval`` (See AdvMap reactions,
PeeringState::should_restart_peering, and PeeringState::start_peering_interval)
each OSD with the PG will add the new ``interval`` to its local PastIntervals.
Activation messages to OSDs which do not already have the PG contain the
sender's PastIntervals so that the recipient needn't rebuild it. (See
PeeringState::activate needs_past_intervals).
PastIntervals are trimmed in two places. First, when the primary marks the
PG clean, it clears its past_intervals instance
(PeeringState::try_mark_clean()). The replicas will do the same thing when
they receive the info (See PeeringState::update_history).
The second, more complex, case is in PeeringState::start_peering_interval. In
the event of a "map gap", we assume that the PG actually has gone clean, but we
haven't received a pg_info_t with the updated ``last_epoch_clean`` value yet.
To explain this behavior, we need to discuss OSDMap trimming.
OSDMap Trimming
---------------
OSDMaps are created by the Monitor quorum and gossiped out to the OSDs. The
Monitor cluster also determines when OSDs (and the Monitors) are allowed to
trim old OSDMap epochs. For the reasons explained above in this document, the
primary constraint is that we must retain all OSDMaps back to some epoch such
that all PGs have been clean at that or a later epoch (min_last_epoch_clean).
(See OSDMonitor::get_trim_to).
The Monitor quorum determines min_last_epoch_clean through MOSDBeacon messages
sent periodically by each OSDs. Each message contains a set of PGs for which
the OSD is primary at that moment as well as the min_last_epoch_clean across
that set. The Monitors track these values in OSDMonitor::last_epoch_clean.
There is a subtlety in the min_last_epoch_clean value used by the OSD to
populate the MOSDBeacon. OSD::collect_pg_stats invokes PG::with_pg_stats to
obtain the lec value, which actually uses
pg_stat_t::get_effective_last_epoch_clean() rather than
info.history.last_epoch_clean. If the PG is currently clean,
pg_stat_t::get_effective_last_epoch_clean() is the current epoch rather than
last_epoch_clean -- this works because the PG is clean at that epoch and it
allows OSDMaps to be trimmed during periods where OSDMaps are being created
(due to snapshot activity, perhaps), but no PGs are undergoing ``interval``
changes.
Back to PastIntervals
---------------------
We can now understand our second trimming case above. If OSDMaps have been
trimmed up to epoch ``e``, we know that the PG must have been clean at some epoch
>= ``e`` (indeed, **all** PGs must have been), so we can drop our PastIntervals.
This dependency also pops up in PeeringState::check_past_interval_bounds().
PeeringState::get_required_past_interval_bounds takes as a parameter
oldest_epoch, which comes from OSDSuperblock::cluster_osdmap_trim_lower_bound.
We use cluster_osdmap_trim_lower_bound rather than a specific osd's oldest_map
because we don't necessarily trim all MOSDMap::cluster_osdmap_trim_lower_bound.
In order to avoid doing too much work at once we limit the amount of osdmaps
trimmed using ``osd_target_transaction_size`` in OSD::trim_maps().
For this reason, a specific OSD's oldest_map can lag behind
OSDSuperblock::cluster_osdmap_trim_lower_bound
for a while.
See https://tracker.ceph.com/issues/49689 for an example.

View File

@ -28,8 +28,8 @@ Premier
------- -------
* `Bloomberg <https://bloomberg.com>`_ * `Bloomberg <https://bloomberg.com>`_
* `China Mobile <https://www.chinamobileltd.com/>`_ * `Clyso <https://www.clyso.com/en/>`_
* `DigitalOcean <https://www.digitalocean.com/>`_ * `IBM <https://ibm.com>`_
* `Intel <http://www.intel.com/>`_ * `Intel <http://www.intel.com/>`_
* `OVH <https://www.ovh.com/>`_ * `OVH <https://www.ovh.com/>`_
* `Red Hat <https://www.redhat.com/>`_ * `Red Hat <https://www.redhat.com/>`_
@ -37,16 +37,16 @@ Premier
* `SoftIron <https://www.softiron.com/>`_ * `SoftIron <https://www.softiron.com/>`_
* `SUSE <https://www.suse.com/>`_ * `SUSE <https://www.suse.com/>`_
* `Western Digital <https://www.wdc.com/>`_ * `Western Digital <https://www.wdc.com/>`_
* `XSKY <https://www.xsky.com/en/>`_
* `ZTE <https://www.zte.com.cn/global/>`_
General General
------- -------
* `42on <https://www.42on.com/>`_
* `Akamai <https://www.akamai.com/>`_
* `ARM <http://www.arm.com/>`_ * `ARM <http://www.arm.com/>`_
* `Canonical <https://www.canonical.com/>`_ * `Canonical <https://www.canonical.com/>`_
* `Cloudbase Solutions <https://cloudbase.it/>`_ * `Cloudbase Solutions <https://cloudbase.it/>`_
* `Clyso <https://www.clyso.com/en/>`_ * `CloudFerro <https://cloudferro.com/>`_
* `croit <http://www.croit.io/>`_ * `croit <http://www.croit.io/>`_
* `EasyStack <https://www.easystack.io/>`_ * `EasyStack <https://www.easystack.io/>`_
* `ISS <http://iss-integration.com/>`_ * `ISS <http://iss-integration.com/>`_
@ -97,22 +97,17 @@ Members
------- -------
* Anjaneya "Reddy" Chagam (Intel) * Anjaneya "Reddy" Chagam (Intel)
* Dan van der Ster (CERN) - Associate member representative * Carlos Maltzahn (UCSC) - Associate member representative
* Haomai Wang (XSKY) * Dan van der Ster (Clyso) - Ceph Council representative
* James Page (Canonical) * Joachim Kraftmayer (Clyso)
* Lenz Grimmer (SUSE) - Ceph Leadership Team representative * Josh Durgin (IBM) - Ceph Council representative
* Lars Marowsky-Bree (SUSE)
* Matias Bjorling (Western Digital) * Matias Bjorling (Western Digital)
* Matthew Leonard (Bloomberg) * Matthew Leonard (Bloomberg)
* Mike Perez (Red Hat) - Ceph community manager * Mike Perez (Red Hat) - Ceph community manager
* Myoungwon Oh (Samsung Electronics) * Myoungwon Oh (Samsung Electronics)
* Martin Verges (croit) - General member representative * Martin Verges (croit) - General member representative
* Pawel Sadowski (OVH) * Pawel Sadowski (OVH)
* Phil Straw (SoftIron) * Vincent Hsu (IBM)
* Robin Johnson (DigitalOcean)
* Sage Weil (Red Hat) - Ceph project leader
* Xie Xingguo (ZTE)
* Zhang Shaowen (China Mobile)
Joining Joining
======= =======

View File

@ -12,12 +12,13 @@
:ref:`BlueStore<rados_config_storage_devices_bluestore>` :ref:`BlueStore<rados_config_storage_devices_bluestore>`
OSD BlueStore is a storage back end used by OSD daemons, and OSD BlueStore is a storage back end used by OSD daemons, and
was designed specifically for use with Ceph. BlueStore was was designed specifically for use with Ceph. BlueStore was
introduced in the Ceph Kraken release. In the Ceph Luminous introduced in the Ceph Kraken release. The Luminous release of
release, BlueStore became Ceph's default storage back end, Ceph promoted BlueStore to the default OSD back end,
supplanting FileStore. Unlike :term:`filestore`, BlueStore supplanting FileStore. As of the Reef release, FileStore is no
stores objects directly on Ceph block devices without any file longer available as a storage backend.
system interface. Since Luminous (12.2), BlueStore has been
Ceph's default and recommended storage back end. BlueStore stores objects directly on Ceph block devices without
a mounted file system.
Bucket Bucket
In the context of :term:`RGW`, a bucket is a group of objects. In the context of :term:`RGW`, a bucket is a group of objects.
@ -187,9 +188,13 @@
applications, Ceph Users, and :term:`Ceph Client`\s. Ceph applications, Ceph Users, and :term:`Ceph Client`\s. Ceph
Storage Clusters receive data from :term:`Ceph Client`\s. Storage Clusters receive data from :term:`Ceph Client`\s.
cephx CephX
The Ceph authentication protocol. Cephx operates like Kerberos, The Ceph authentication protocol. CephX authenticates users and
but it has no single point of failure. daemons. CephX operates like Kerberos, but it has no single
point of failure. See the :ref:`High-availability
Authentication section<arch_high_availability_authentication>`
of the Architecture document and the :ref:`CephX Configuration
Reference<rados-cephx-config-ref>`.
Client Client
A client is any program external to Ceph that uses a Ceph A client is any program external to Ceph that uses a Ceph
@ -248,6 +253,9 @@
Any single machine or server in a Ceph Cluster. See :term:`Ceph Any single machine or server in a Ceph Cluster. See :term:`Ceph
Node`. Node`.
Hybrid OSD
Refers to an OSD that has both HDD and SSD drives.
LVM tags LVM tags
Extensible metadata for LVM volumes and groups. It is used to Extensible metadata for LVM volumes and groups. It is used to
store Ceph-specific information about devices and its store Ceph-specific information about devices and its
@ -302,12 +310,33 @@
state of a multi-site configuration. When the period is updated, state of a multi-site configuration. When the period is updated,
the "epoch" is said thereby to have been changed. the "epoch" is said thereby to have been changed.
Placement Groups (PGs)
Placement groups (PGs) are subsets of each logical Ceph pool.
Placement groups perform the function of placing objects (as a
group) into OSDs. Ceph manages data internally at
placement-group granularity: this scales better than would
managing individual (and therefore more numerous) RADOS
objects. A cluster that has a larger number of placement groups
(for example, 100 per OSD) is better balanced than an otherwise
identical cluster with a smaller number of placement groups.
Ceph's internal RADOS objects are each mapped to a specific
placement group, and each placement group belongs to exactly
one Ceph pool.
:ref:`Pool<rados_pools>` :ref:`Pool<rados_pools>`
A pool is a logical partition used to store objects. A pool is a logical partition used to store objects.
Pools Pools
See :term:`pool`. See :term:`pool`.
:ref:`Primary Affinity <rados_ops_primary_affinity>`
The characteristic of an OSD that governs the likelihood that
a given OSD will be selected as the primary OSD (or "lead
OSD") in an acting set. Primary affinity was introduced in
Firefly (v. 0.80). See :ref:`Primary Affinity
<rados_ops_primary_affinity>`.
RADOS RADOS
**R**\eliable **A**\utonomic **D**\istributed **O**\bject **R**\eliable **A**\utonomic **D**\istributed **O**\bject
**S**\tore. RADOS is the object store that provides a scalable **S**\tore. RADOS is the object store that provides a scalable
@ -370,6 +399,28 @@
Amazon S3 RESTful API and the OpenStack Swift API. Also called Amazon S3 RESTful API and the OpenStack Swift API. Also called
"RADOS Gateway" and "Ceph Object Gateway". "RADOS Gateway" and "Ceph Object Gateway".
scrubs
The processes by which Ceph ensures data integrity. During the
process of scrubbing, Ceph generates a catalog of all objects
in a placement group, then ensures that none of the objects are
missing or mismatched by comparing each primary object against
its replicas, which are stored across other OSDs. Any PG
that is determined to have a copy of an object that is different
from the other copies, or that is missing a copy entirely, is marked
"inconsistent" (that is, the PG is marked "inconsistent").
There are two kinds of scrubbing: light scrubbing and deep
scrubbing (also called "normal scrubbing" and "deep scrubbing",
respectively). Light scrubbing is performed daily and does
nothing more than confirm that a given object exists and that
its metadata is correct. Deep scrubbing is performed weekly and
reads the data and uses checksums to ensure data integrity.
See :ref:`Scrubbing <rados_config_scrubbing>` in the RADOS OSD
Configuration Reference Guide and page 141 of *Mastering Ceph,
second edition* (Fisk, Nick. 2019).
secrets secrets
Secrets are credentials used to perform digital authentication Secrets are credentials used to perform digital authentication
whenever privileged users must access systems that require whenever privileged users must access systems that require
@ -387,6 +438,12 @@
Teuthology Teuthology
The collection of software that performs scripted tests on Ceph. The collection of software that performs scripted tests on Ceph.
User
An individual or a system actor (for example, an application)
that uses Ceph clients to interact with the :term:`Ceph Storage
Cluster`. See :ref:`User<rados-ops-user>` and :ref:`User
Management<user-management>`.
Zone Zone
In the context of :term:`RGW`, a zone is a logical group that In the context of :term:`RGW`, a zone is a logical group that
consists of one or more :term:`RGW` instances. A zone's consists of one or more :term:`RGW` instances. A zone's

View File

@ -53,9 +53,8 @@ the CLT itself.
Current CLT members are: Current CLT members are:
* Casey Bodley <cbodley@redhat.com> * Casey Bodley <cbodley@redhat.com>
* Dan van der Ster <daniel.vanderster@cern.ch> * Dan van der Ster <dan.vanderster@clyso.com>
* David Galloway <dgallowa@redhat.com> * David Orman <ormandj@1111systems.com>
* David Orman <ormandj@iland.com>
* Ernesto Puerta <epuerta@redhat.com> * Ernesto Puerta <epuerta@redhat.com>
* Gregory Farnum <gfarnum@redhat.com> * Gregory Farnum <gfarnum@redhat.com>
* Haomai Wang <haomai@xsky.com> * Haomai Wang <haomai@xsky.com>

View File

@ -11,6 +11,12 @@ Ceph delivers **object, block, and file storage in one unified system**.
Ceph project. (Click anywhere in this paragraph to read the "Basic Ceph project. (Click anywhere in this paragraph to read the "Basic
Workflow" page of the Ceph Developer Guide.) <basic workflow dev guide>`. Workflow" page of the Ceph Developer Guide.) <basic workflow dev guide>`.
.. note::
:ref:`If you want to make a commit to the documentation but you don't
know how to get started, read the "Documenting Ceph" page. (Click anywhere
in this paragraph to read the "Documenting Ceph" page.) <documenting_ceph>`.
.. container:: columns-3 .. container:: columns-3
.. container:: column .. container:: column
@ -104,6 +110,7 @@ about Ceph, see our `Architecture`_ section.
radosgw/index radosgw/index
mgr/index mgr/index
mgr/dashboard mgr/dashboard
monitoring/index
api/index api/index
architecture architecture
Developer Guide <dev/developer_guide/index> Developer Guide <dev/developer_guide/index>

View File

@ -36,6 +36,22 @@ Options
Perform a selftest. This mode performs a sanity check of ``stats`` module. Perform a selftest. This mode performs a sanity check of ``stats`` module.
.. option:: --conffile [CONFFILE]
Path to cluster configuration file
.. option:: -d [DELAY], --delay [DELAY]
Refresh interval in seconds (default: 1)
.. option:: --dump
Dump the metrics to stdout
.. option:: --dumpfs <fs_name>
Dump the metrics of the given filesystem to stdout
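For example, the following hypothetical invocation (assuming the utility documented here is ``cephfs-top`` and that a file system named ``cephfs`` exists) dumps that file system's metrics to stdout::

    cephfs-top --dumpfs cephfs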
Descriptions of fields Descriptions of fields
====================== ======================

View File

@ -15,15 +15,15 @@ Synopsis
Description Description
=========== ===========
:program:`radosgw-admin` is a RADOS gateway user administration utility. It :program:`radosgw-admin` is a Ceph Object Gateway user administration utility. It
allows creating and modifying users. is used to create and modify users.
Commands Commands
======== ========
:program:`radosgw-admin` utility uses many commands for administration purpose :program:`radosgw-admin` utility provides commands for administration purposes
which are as follows: as follows:
:command:`user create` :command:`user create`
Create a new user. Create a new user.
@ -32,8 +32,7 @@ which are as follows:
Modify a user. Modify a user.
:command:`user info` :command:`user info`
Display information of a user, and any potentially available Display information for a user including any subusers and keys.
subusers and keys.
:command:`user rename` :command:`user rename`
Renames a user. Renames a user.
@ -51,7 +50,7 @@ which are as follows:
Check user info. Check user info.
:command:`user stats` :command:`user stats`
Show user stats as accounted by quota subsystem. Show user stats as accounted by the quota subsystem.
:command:`user list` :command:`user list`
List all users. List all users.
@ -78,10 +77,10 @@ which are as follows:
Remove access key. Remove access key.
:command:`bucket list` :command:`bucket list`
List buckets, or, if bucket specified with --bucket=<bucket>, List buckets, or, if a bucket is specified with --bucket=<bucket>,
list its objects. If bucket specified adding --allow-unordered list its objects. Adding --allow-unordered
removes ordering requirement, possibly generating results more removes the ordering requirement, possibly generating results more
quickly in buckets with large number of objects. quickly for buckets with large number of objects.
:command:`bucket limit check` :command:`bucket limit check`
Show bucket sharding stats. Show bucket sharding stats.
@ -93,8 +92,8 @@ which are as follows:
Unlink bucket from specified user. Unlink bucket from specified user.
:command:`bucket chown` :command:`bucket chown`
Link bucket to specified user and update object ACLs. Change bucket ownership to the specified user and update object ACLs.
Use --marker to resume if command gets interrupted. Invoke with --marker to resume if the command is interrupted.
:command:`bucket stats` :command:`bucket stats`
Returns bucket statistics. Returns bucket statistics.
@ -109,12 +108,13 @@ which are as follows:
Rewrite all objects in the specified bucket. Rewrite all objects in the specified bucket.
:command:`bucket radoslist` :command:`bucket radoslist`
List the rados objects that contain the data for all objects is List the RADOS objects that contain the data for all objects in
the designated bucket, if --bucket=<bucket> is specified, or the designated bucket, if --bucket=<bucket> is specified.
otherwise all buckets. Otherwise, list the RADOS objects that contain data for all
buckets.
:command:`bucket reshard` :command:`bucket reshard`
Reshard a bucket. Reshard a bucket's index.
:command:`bucket sync disable` :command:`bucket sync disable`
Disable bucket sync. Disable bucket sync.
@ -306,16 +306,16 @@ which are as follows:
Run data sync for the specified source zone. Run data sync for the specified source zone.
:command:`sync error list` :command:`sync error list`
list sync error. List sync errors.
:command:`sync error trim` :command:`sync error trim`
trim sync error. Trim sync errors.
:command:`zone rename` :command:`zone rename`
Rename a zone. Rename a zone.
:command:`zone placement list` :command:`zone placement list`
List zone's placement targets. List a zone's placement targets.
:command:`zone placement add` :command:`zone placement add`
Add a zone placement target. Add a zone placement target.
@ -365,7 +365,7 @@ which are as follows:
List all bucket lifecycle progress. List all bucket lifecycle progress.
:command:`lc process` :command:`lc process`
Manually process lifecycle. If a bucket is specified (e.g., via Manually process lifecycle transitions. If a bucket is specified (e.g., via
--bucket_id or via --bucket and optional --tenant), only that bucket --bucket_id or via --bucket and optional --tenant), only that bucket
is processed. is processed.
@ -385,7 +385,7 @@ which are as follows:
List metadata log which is needed for multi-site deployments. List metadata log which is needed for multi-site deployments.
:command:`mdlog trim` :command:`mdlog trim`
Trim metadata log manually instead of relying on RGWs integrated log sync. Trim metadata log manually instead of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync. complete, otherwise it can reinitiate a sync.
@ -397,7 +397,7 @@ which are as follows:
:command:`bilog trim` :command:`bilog trim`
Trim bucket index log (use start-marker, end-marker) manually instead Trim bucket index log (use start-marker, end-marker) manually instead
of relying on RGWs integrated log sync. of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync. complete, otherwise it can reinitiate a sync.
@ -405,7 +405,7 @@ which are as follows:
List data log which is needed for multi-site deployments. List data log which is needed for multi-site deployments.
:command:`datalog trim` :command:`datalog trim`
Trim data log manually instead of relying on RGWs integrated log sync. Trim data log manually instead of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync. complete, otherwise it can reinitiate a sync.
@ -413,19 +413,19 @@ which are as follows:
Read data log status. Read data log status.
:command:`orphans find` :command:`orphans find`
Init and run search for leaked rados objects. Init and run search for leaked RADOS objects.
DEPRECATED. See the "rgw-orphan-list" tool. DEPRECATED. See the "rgw-orphan-list" tool.
:command:`orphans finish` :command:`orphans finish`
Clean up search for leaked rados objects. Clean up search for leaked RADOS objects.
DEPRECATED. See the "rgw-orphan-list" tool. DEPRECATED. See the "rgw-orphan-list" tool.
:command:`orphans list-jobs` :command:`orphans list-jobs`
List the current job-ids for the orphans search. List the current orphans search job IDs.
DEPRECATED. See the "rgw-orphan-list" tool. DEPRECATED. See the "rgw-orphan-list" tool.
:command:`role create` :command:`role create`
create a new AWS role for use with STS. Create a new role for use with STS (Security Token Service).
:command:`role rm` :command:`role rm`
Remove a role. Remove a role.
@ -485,7 +485,7 @@ which are as follows:
Show events in a pubsub subscription Show events in a pubsub subscription
:command:`subscription ack` :command:`subscription ack`
Ack (remove) an events in a pubsub subscription Acknowledge (remove) events in a pubsub subscription
Options Options
@ -499,7 +499,8 @@ Options
.. option:: -m monaddress[:port] .. option:: -m monaddress[:port]
Connect to specified monitor (instead of looking through ceph.conf). Connect to specified monitor (instead of selecting one
from ceph.conf).
.. option:: --tenant=<tenant> .. option:: --tenant=<tenant>
@ -507,11 +508,11 @@ Options
.. option:: --uid=uid .. option:: --uid=uid
The radosgw user ID. The user on which to operate.
.. option:: --new-uid=uid .. option:: --new-uid=uid
ID of the new user. Used with 'user rename' command. The new ID of the user. Used with 'user rename' command.
.. option:: --subuser=<name> .. option:: --subuser=<name>
@ -533,26 +534,27 @@ Options
Generate random access key (for S3). Generate random access key (for S3).
.. option:: --gen-secret .. option:: --gen-secret
Generate random secret key. Generate random secret key.
.. option:: --key-type=<type> .. option:: --key-type=<type>
key type, options are: swift, s3. Key type, options are: swift, s3.
.. option:: --temp-url-key[-2]=<key> .. option:: --temp-url-key[-2]=<key>
Temporary url key. Temporary URL key.
.. option:: --max-buckets .. option:: --max-buckets
max number of buckets for a user (0 for no limit, negative value to disable bucket creation). Maximum number of buckets for a user (0 for no limit, negative value to disable bucket creation).
Default is 1000. Default is 1000.
.. option:: --access=<access> .. option:: --access=<access>
Set the access permissions for the sub-user. Set the access permissions for the subuser.
Available access permissions are read, write, readwrite and full. Available access permissions are read, write, readwrite and full.
.. option:: --display-name=<name> .. option:: --display-name=<name>
@ -600,8 +602,8 @@ Options
.. option:: --bucket-new-name=[tenant-id/]<bucket> .. option:: --bucket-new-name=[tenant-id/]<bucket>
Optional for `bucket link`; use to rename a bucket. Optional for `bucket link`; use to rename a bucket.
While tenant-id/ can be specified, this is never While the tenant-id can be specified, this is not
necessary for normal operation. necessary in normal operation.
.. option:: --shard-id=<shard-id> .. option:: --shard-id=<shard-id>
@ -613,11 +615,11 @@ Options
.. option:: --purge-data .. option:: --purge-data
When specified, user removal will also purge all the user data. When specified, user removal will also purge the user's data.
.. option:: --purge-keys .. option:: --purge-keys
When specified, subuser removal will also purge all the subuser keys. When specified, subuser removal will also purge the subuser's keys.
.. option:: --purge-objects .. option:: --purge-objects
@ -625,7 +627,7 @@ Options
.. option:: --metadata-key=<key> .. option:: --metadata-key=<key>
Key to retrieve metadata from with ``metadata get``. Key from which to retrieve metadata, used with ``metadata get``.
.. option:: --remote=<remote> .. option:: --remote=<remote>
@ -633,11 +635,11 @@ Options
.. option:: --period=<id> .. option:: --period=<id>
Period id. Period ID.
.. option:: --url=<url> .. option:: --url=<url>
url for pushing/pulling period or realm. URL for pushing/pulling period or realm.
.. option:: --epoch=<number> .. option:: --epoch=<number>
@ -657,7 +659,7 @@ Options
.. option:: --master-zone=<id> .. option:: --master-zone=<id>
Master zone id. Master zone ID.
.. option:: --rgw-realm=<name> .. option:: --rgw-realm=<name>
@ -665,11 +667,11 @@ Options
.. option:: --realm-id=<id> .. option:: --realm-id=<id>
The realm id. The realm ID.
.. option:: --realm-new-name=<name> .. option:: --realm-new-name=<name>
New name of realm. New name for the realm.
.. option:: --rgw-zonegroup=<name> .. option:: --rgw-zonegroup=<name>
@ -677,7 +679,7 @@ Options
.. option:: --zonegroup-id=<id> .. option:: --zonegroup-id=<id>
The zonegroup id. The zonegroup ID.
.. option:: --zonegroup-new-name=<name> .. option:: --zonegroup-new-name=<name>
@ -685,11 +687,11 @@ Options
.. option:: --rgw-zone=<zone> .. option:: --rgw-zone=<zone>
Zone in which radosgw is running. Zone in which the gateway is running.
.. option:: --zone-id=<id> .. option:: --zone-id=<id>
The zone id. The zone ID.
.. option:: --zone-new-name=<name> .. option:: --zone-new-name=<name>
@ -709,7 +711,7 @@ Options
.. option:: --placement-id .. option:: --placement-id
Placement id for the zonegroup placement commands. Placement ID for the zonegroup placement commands.
.. option:: --tags=<list> .. option:: --tags=<list>
@ -737,7 +739,7 @@ Options
.. option:: --data-extra-pool=<pool> .. option:: --data-extra-pool=<pool>
The placement target data extra (non-ec) pool. The placement target data extra (non-EC) pool.
.. option:: --placement-index-type=<type> .. option:: --placement-index-type=<type>
@ -765,11 +767,11 @@ Options
.. option:: --sync-from=[zone-name][,...] .. option:: --sync-from=[zone-name][,...]
Set the list of zones to sync from. Set the list of zones from which to sync.
.. option:: --sync-from-rm=[zone-name][,...] .. option:: --sync-from-rm=[zone-name][,...]
Remove the zones from list of zones to sync from. Remove zone(s) from list of zones from which to sync.
.. option:: --bucket-index-max-shards .. option:: --bucket-index-max-shards
@ -780,11 +782,11 @@ Options
.. option:: --fix .. option:: --fix
Besides checking bucket index, will also fix it. Fix the bucket index in addition to checking it.
.. option:: --check-objects .. option:: --check-objects
bucket check: Rebuilds bucket index according to actual objects state. Bucket check: Rebuilds the bucket index according to actual object state.
.. option:: --format=<format> .. option:: --format=<format>
@ -792,8 +794,8 @@ Options
.. option:: --sync-stats .. option:: --sync-stats
Option for 'user stats' command. When specified, it will update user stats with Option for the 'user stats' command. When specified, it will update user stats with
the current stats reported by user's buckets indexes. the current stats reported by the user's bucket indexes.
.. option:: --show-config .. option:: --show-config
@ -801,7 +803,7 @@ Options
.. option:: --show-log-entries=<flag> .. option:: --show-log-entries=<flag>
Enable/disable dump of log entries on log show. Enable/disable dumping of log entries on log show.
.. option:: --show-log-sum=<flag> .. option:: --show-log-sum=<flag>
@ -814,7 +816,7 @@ Options
.. option:: --infile .. option:: --infile
Specify a file to read in when setting data. Specify a file to read when setting data.
.. option:: --categories=<list> .. option:: --categories=<list>
@ -822,29 +824,29 @@ Options
.. option:: --caps=<caps> .. option:: --caps=<caps>
List of caps (e.g., "usage=read, write; user=read"). List of capabilities (e.g., "usage=read, write; user=read").
.. option:: --compression=<compression-algorithm> .. option:: --compression=<compression-algorithm>
Placement target compression algorithm (lz4|snappy|zlib|zstd) Placement target compression algorithm (lz4|snappy|zlib|zstd).
.. option:: --yes-i-really-mean-it .. option:: --yes-i-really-mean-it
Required for certain operations. Required as a guardrail for certain destructive operations.
.. option:: --min-rewrite-size .. option:: --min-rewrite-size
Specify the min object size for bucket rewrite (default 4M). Specify the minimum object size for bucket rewrite (default 4M).
.. option:: --max-rewrite-size .. option:: --max-rewrite-size
Specify the max object size for bucket rewrite (default ULLONG_MAX). Specify the maximum object size for bucket rewrite (default ULLONG_MAX).
.. option:: --min-rewrite-stripe-size .. option:: --min-rewrite-stripe-size
Specify the min stripe size for object rewrite (default 0). If the value Specify the minimum stripe size for object rewrite (default 0). If the value
is set to 0, then the specified object will always be is set to 0, then the specified object will always be
rewritten for restriping. rewritten when restriping.
.. option:: --warnings-only .. option:: --warnings-only
@ -854,7 +856,7 @@ Options
.. option:: --bypass-gc .. option:: --bypass-gc
When specified with bucket deletion, When specified with bucket deletion,
triggers object deletions by not involving GC. triggers object deletion without involving GC.
.. option:: --inconsistent-index .. option:: --inconsistent-index
@ -863,21 +865,21 @@ Options
.. option:: --max-concurrent-ios .. option:: --max-concurrent-ios
Maximum concurrent ios for bucket operations. Affects operations that Maximum concurrent bucket operations. Affects operations that
scan the bucket index, e.g., listing, deletion, and all scan/search scan the bucket index, e.g., listing, deletion, and all scan/search
operations such as finding orphans or checking the bucket index. operations such as finding orphans or checking the bucket index.
Default is 32. The default is 32.
Quota Options Quota Options
============= =============
.. option:: --max-objects .. option:: --max-objects
Specify max objects (negative value to disable). Specify the maximum number of objects (negative value to disable).
.. option:: --max-size .. option:: --max-size
Specify max size (in B/K/M/G/T, negative value to disable). Specify the maximum size (in B/K/M/G/T, negative value to disable).
.. option:: --quota-scope .. option:: --quota-scope
@ -889,12 +891,12 @@ Orphans Search Options
.. option:: --num-shards .. option:: --num-shards
Number of shards to use for keeping the temporary scan info Number of shards to use for temporary scan info
.. option:: --orphan-stale-secs .. option:: --orphan-stale-secs
Number of seconds to wait before declaring an object to be an orphan. Number of seconds to wait before declaring an object to be an orphan.
Default is 86400 (24 hours). The default is 86400 (24 hours).
.. option:: --job-id .. option:: --job-id

View File

@ -53,10 +53,6 @@ Options
Run in foreground, log to usual location Run in foreground, log to usual location
.. option:: --rgw-socket-path=path
Specify a unix domain socket path.
.. option:: --rgw-region=region .. option:: --rgw-region=region
The region where radosgw runs The region where radosgw runs
@ -80,30 +76,24 @@ and ``mod_proxy_fcgi`` have to be present in the server. Unlike ``mod_fastcgi``,
or process management may be available in the FastCGI application framework or process management may be available in the FastCGI application framework
in use. in use.
``Apache`` can be configured in a way that enables ``mod_proxy_fcgi`` to be used ``Apache`` must be configured in a way that enables ``mod_proxy_fcgi`` to be
with localhost tcp or through unix domain socket. ``mod_proxy_fcgi`` that doesn't used with localhost tcp.
support unix domain socket such as the ones in Apache 2.2 and earlier versions of
Apache 2.4, needs to be configured for use with localhost tcp. Later versions of
Apache like Apache 2.4.9 or later support unix domain socket and as such they
allow for the configuration with unix domain socket instead of localhost tcp.
The following steps show the configuration in Ceph's configuration file i.e., The following steps show the configuration in Ceph's configuration file i.e.,
``/etc/ceph/ceph.conf`` and the gateway configuration file i.e., ``/etc/ceph/ceph.conf`` and the gateway configuration file i.e.,
``/etc/httpd/conf.d/rgw.conf`` (RPM-based distros) or ``/etc/httpd/conf.d/rgw.conf`` (RPM-based distros) or
``/etc/apache2/conf-available/rgw.conf`` (Debian-based distros) with localhost ``/etc/apache2/conf-available/rgw.conf`` (Debian-based distros) with localhost
tcp and through unix domain socket: tcp:
#. For distros with Apache 2.2 and early versions of Apache 2.4 that use #. For distros with Apache 2.2 and early versions of Apache 2.4 that use
localhost TCP and do not support Unix Domain Socket, append the following localhost TCP, append the following contents to ``/etc/ceph/ceph.conf``::
contents to ``/etc/ceph/ceph.conf``::
[client.radosgw.gateway] [client.radosgw.gateway]
host = {hostname} host = {hostname}
keyring = /etc/ceph/ceph.client.radosgw.keyring keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = "" log_file = /var/log/ceph/client.radosgw.gateway.log
log file = /var/log/ceph/client.radosgw.gateway.log rgw_frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
rgw frontends = fastcgi socket_port=9000 socket_host=0.0.0.0 rgw_print_continue = false
rgw print continue = false
#. Add the following content in the gateway configuration file: #. Add the following content in the gateway configuration file:
@ -149,16 +139,6 @@ tcp and through unix domain socket:
</VirtualHost> </VirtualHost>
#. For distros with Apache 2.4.9 or later that support Unix Domain Socket,
append the following configuration to ``/etc/ceph/ceph.conf``::
[client.radosgw.gateway]
host = {hostname}
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
log file = /var/log/ceph/client.radosgw.gateway.log
rgw print continue = false
#. Add the following content in the gateway configuration file: #. Add the following content in the gateway configuration file:
For CentOS/RHEL add in ``/etc/httpd/conf.d/rgw.conf``:: For CentOS/RHEL add in ``/etc/httpd/conf.d/rgw.conf``::
@ -182,10 +162,6 @@ tcp and through unix domain socket:
</VirtualHost> </VirtualHost>
Please note, ``Apache 2.4.7`` does not have Unix Domain Socket support in
it and as such it has to be configured with localhost tcp. The Unix Domain
Socket support is available in ``Apache 2.4.9`` and later versions.
#. Generate a key for radosgw to use for authentication with the cluster. :: #. Generate a key for radosgw to use for authentication with the cluster. ::
ceph-authtool -C -n client.radosgw.gateway --gen-key /etc/ceph/keyring.radosgw.gateway ceph-authtool -C -n client.radosgw.gateway --gen-key /etc/ceph/keyring.radosgw.gateway

Binary file not shown.

After

Width:  |  Height:  |  Size: 17 KiB

View File

@ -41,13 +41,15 @@ So, prior to start consuming the Ceph API, a valid JSON Web Token (JWT) has to
be obtained, and it may then be reused for subsequent requests. The be obtained, and it may then be reused for subsequent requests. The
``/api/auth`` endpoint will provide the valid token: ``/api/auth`` endpoint will provide the valid token:
.. code-block:: sh .. prompt:: bash $
$ curl -X POST "https://example.com:8443/api/auth" \ curl -X POST "https://example.com:8443/api/auth" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \ -H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Content-Type: application/json" \ -H "Content-Type: application/json" \
-d '{"username": <username>, "password": <password>}' -d '{"username": <username>, "password": <password>}'
::
{ "token": "<redacted_token>", ...} { "token": "<redacted_token>", ...}
The token obtained must be passed together with every API request in the The token obtained must be passed together with every API request in the
@ -74,9 +76,9 @@ purpose, Ceph API is built upon the following principles:
An example: An example:
.. code-block:: bash .. prompt:: bash $
$ curl -X GET "https://example.com:8443/api/osd" \ curl -X GET "https://example.com:8443/api/osd" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \ -H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Authorization: Bearer <token>" -H "Authorization: Bearer <token>"

Binary file not shown.

After

Width:  |  Height:  |  Size: 67 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 139 KiB

View File

@ -127,62 +127,67 @@ The Ceph Dashboard offers the following monitoring and management capabilities:
Overview of the Dashboard Landing Page Overview of the Dashboard Landing Page
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Displays overall cluster status, performance, and capacity metrics. Shows instant The landing page of Ceph Dashboard serves as the home page and features metrics
feedback for changes in the cluster and provides easy access to subpages of the such as the overall cluster status, performance, and capacity. It provides real-time
dashboard. updates on any changes in the cluster and allows quick access to other sections of the dashboard.
.. image:: dashboard-landing-page.png
.. note::
You can change the landing page to the previous version from:
``Cluster >> Manager Modules >> Dashboard >> Edit``.
Editing the ``FEATURE_TOGGLE_DASHBOARD`` option will change the landing page from one view to another.
Note that the previous version of the landing page will be disabled in future releases.
.. _dashboard-landing-page-details:
Details
"""""""
Provides an overview of the cluster configuration, displaying various critical aspects of the cluster.
.. image:: details-card.png
.. _dashboard-landing-page-status: .. _dashboard-landing-page-status:
Status Status
"""""" """"""
Provides a visual indication of cluster health, and displays cluster alerts grouped by severity.
* **Cluster Status**: Displays overall cluster health. In case of any error it .. image:: status-card-open.png
displays a short description of the error and provides a link to the logs.
* **Hosts**: Displays the total number of hosts associated to the cluster and
links to a subpage that lists and describes each.
* **Monitors**: Displays mons and their quorum status and
open sessions. Links to a subpage that lists and describes each.
* **OSDs**: Displays object storage daemons (ceph-osds) and
the numbers of OSDs running (up), in service
(in), and out of the cluster (out). Provides links to
subpages providing a list of all OSDs and related management actions.
* **Managers**: Displays active and standby Ceph Manager
daemons (ceph-mgr).
* **Object Gateway**: Displays active object gateways (RGWs) and
provides links to subpages that list all object gateway daemons.
* **Metadata Servers**: Displays active and standby CephFS metadata
service daemons (ceph-mds).
* **iSCSI Gateways**: Display iSCSI gateways available,
active (up), and inactive (down). Provides a link to a subpage
showing a list of all iSCSI Gateways.
.. _dashboard-landing-page-capacity: .. _dashboard-landing-page-capacity:
Capacity Capacity
"""""""" """"""""
* **Used**: Displays the used capacity out of the total physical capacity provided by storage nodes (OSDs)
* **Warning**: Displays the `nearfull` threshold of the OSDs
* **Danger**: Displays the `full` threshold of the OSDs
* **Raw Capacity**: Displays the capacity used out of the total .. image:: capacity-card.png
physical capacity provided by storage nodes (OSDs).
* **Objects**: Displays the number and status of RADOS objects .. _dashboard-landing-page-inventory:
including the percentages of healthy, misplaced, degraded, and unfound
objects. Inventory
* **PG Status**: Displays the total number of placement groups and """""""""
their status, including the percentage clean, working, An inventory for all assets within the cluster.
warning, and unknown. Provides direct access to subpages of the dashboard from each item of this card.
* **Pools**: Displays pools and links to a subpage listing details.
* **PGs per OSD**: Displays the number of placement groups assigned to .. image:: inventory-card.png
object storage daemons.
.. _dashboard-landing-page-performance: .. _dashboard-landing-page-performance:
Performance Cluster Utilization
""""""""""" """""""""""""""""""
* **Used Capacity**: Total used capacity of the cluster. The maximum value of the chart is the total capacity of the cluster.
* **IOPS (Input/Output Operations Per Second)**: Number of read and write operations.
* **Latency**: Amount of time that it takes to process a read or a write request.
* **Client Throughput**: Amount of data that clients read or write to the cluster.
* **Recovery Throughput**: Rate at which recovery and rebalancing data is read or written within the cluster.
* **Client READ/Write**: Displays an overview of
client input and output operations. .. image:: cluster-utilization-card.png
* **Client Throughput**: Displays the data transfer rates to and from Ceph clients.
* **Recovery throughput**: Displays rate of cluster healing and balancing operations.
* **Scrubbing**: Displays light and deep scrub status.
Supported Browsers Supported Browsers
^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^

Binary file not shown.

After

Width:  |  Height:  |  Size: 16 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 22 KiB

View File

@ -24,12 +24,14 @@ see :ref:`nfs-ganesha-config`.
NFS Cluster management NFS Cluster management
====================== ======================
.. _nfs-module-cluster-create:
Create NFS Ganesha Cluster Create NFS Ganesha Cluster
-------------------------- --------------------------
.. code:: bash .. code:: bash
$ ceph nfs cluster create <cluster_id> [<placement>] [--port <port>] [--ingress --virtual-ip <ip>] $ ceph nfs cluster create <cluster_id> [<placement>] [--ingress] [--virtual_ip <value>] [--ingress-mode {default|keepalive-only|haproxy-standard|haproxy-protocol}] [--port <int>]
This creates a common recovery pool for all NFS Ganesha daemons, new user based on This creates a common recovery pool for all NFS Ganesha daemons, new user based on
``cluster_id``, and a common NFS Ganesha config RADOS object. ``cluster_id``, and a common NFS Ganesha config RADOS object.
@ -94,6 +96,18 @@ of the details of NFS redirecting traffic on the virtual IP to the
appropriate backend NFS servers, and redeploying NFS servers when they appropriate backend NFS servers, and redeploying NFS servers when they
fail. fail.
If a user additionally supplies ``--ingress-mode keepalive-only``, a
partial *ingress* service will be deployed that still provides a virtual
IP, but with NFS binding directly to that virtual IP and without any
load balancing or traffic redirection. This setup restricts users to
deploying only one NFS daemon, as multiple daemons cannot bind to the
same port on the virtual IP.
Providing ``--ingress-mode default`` instead will result in the same setup
as not providing the ``--ingress-mode`` flag at all. In this setup, keepalived
will be deployed to form the virtual IP and haproxy will be deployed to
handle load balancing and traffic redirection, as in the example below.
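For illustration, a hypothetical invocation that creates a single-daemon
cluster behind a keepalive-only ingress (the cluster name, host, and virtual
IP below are assumptions, not defaults):

.. code:: bash

    $ ceph nfs cluster create mynfs "1 host1" --ingress --virtual_ip 10.0.0.42/24 --ingress-mode keepalive-only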
Enabling ingress via the ``ceph nfs cluster create`` command deploys a Enabling ingress via the ``ceph nfs cluster create`` command deploys a
simple ingress configuration with the most common configuration simple ingress configuration with the most common configuration
options. Ingress can also be added to an existing NFS service (e.g., options. Ingress can also be added to an existing NFS service (e.g.,

View File

@ -18,7 +18,9 @@ for all reporting entities are returned in text exposition format.
Enabling prometheus output Enabling prometheus output
========================== ==========================
The *prometheus* module is enabled with:: The *prometheus* module is enabled with:
.. prompt:: bash $
ceph mgr module enable prometheus ceph mgr module enable prometheus
@ -47,9 +49,9 @@ configurable with ``ceph config set``, with keys
is registered with Prometheus's `registry is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_. <https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.
:: .. prompt:: bash $
ceph config set mgr mgr/prometheus/server_addr 0.0.0.0 ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
ceph config set mgr mgr/prometheus/server_port 9283 ceph config set mgr mgr/prometheus/server_port 9283
.. warning:: .. warning::
@ -65,7 +67,9 @@ recommended to use 15 seconds as scrape interval, though, in some cases it
might be useful to increase the scrape interval. might be useful to increase the scrape interval.
To set a different scrape interval in the Prometheus module, set To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value:: ``scrape_interval`` to the desired value:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/scrape_interval 20 ceph config set mgr mgr/prometheus/scrape_interval 20
@ -75,7 +79,7 @@ in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:``mgr/prometheus/scrape_interval``. Ceph exceeds the configured :confval:`mgr/prometheus/scrape_interval`.
If that is the case, **a warning will be logged** and the module will either If that is the case, **a warning will be logged** and the module will either
@ -86,33 +90,45 @@ This behavior can be configured. By default, it will return a 503 HTTP status
code (service unavailable). You can set other options using the ``ceph config code (service unavailable). You can set other options using the ``ceph config
set`` commands. set`` commands.
To tell the module to respond with possibly stale data, set it to ``return``:: To tell the module to respond with possibly stale data, set it to ``return``:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/stale_cache_strategy return ceph config set mgr mgr/prometheus/stale_cache_strategy return
To tell the module to respond with "service unavailable", set it to ``fail``:: To tell the module to respond with "service unavailable", set it to ``fail``:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/stale_cache_strategy fail ceph config set mgr mgr/prometheus/stale_cache_strategy fail
If you are confident that you don't require the cache, you can disable it:: If you are confident that you don't require the cache, you can disable it:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/cache false ceph config set mgr mgr/prometheus/cache false
If you are using the prometheus module behind some kind of reverse proxy or If you are using the prometheus module behind some kind of reverse proxy or
loadbalancer, you can simplify discovering the active instance by switching loadbalancer, you can simplify discovering the active instance by switching
to ``error``-mode:: to ``error``-mode:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/standby_behaviour error ceph config set mgr mgr/prometheus/standby_behaviour error
If set, the prometheus module will respond with an HTTP error when requesting ``/`` If set, the prometheus module will respond with an HTTP error when requesting ``/``
from the standby instance. The default error code is 500, but you can configure from the standby instance. The default error code is 500, but you can configure
the HTTP response code with:: the HTTP response code with:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/standby_error_status_code 503 ceph config set mgr mgr/prometheus/standby_error_status_code 503
Valid error codes are between 400-599. Valid error codes are between 400-599.
To switch back to the default behaviour, simply set the config key to ``default``:: To switch back to the default behaviour, simply set the config key to ``default``:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/standby_behaviour default ceph config set mgr mgr/prometheus/standby_behaviour default
@ -165,10 +181,18 @@ configuration parameter. The parameter is a comma or space separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool. statistics are collected for all namespaces in the pool.
Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:: Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN" ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
The wildcard can be used to indicate all pools or namespaces:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/rbd_stats_pools "*"
The module makes the list of all available images scanning the specified The module makes the list of all available images scanning the specified
pools and namespaces and refreshes it periodically. The period is pools and namespaces and refreshes it periodically. The period is
configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval`` configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
@ -176,10 +200,23 @@ parameter (in sec) and is 300 sec (5 minutes) by default. The module will
force refresh earlier if it detects statistics from a previously unknown force refresh earlier if it detects statistics from a previously unknown
RBD image. RBD image.
Example to turn up the sync interval to 10 minutes:: Example to turn up the sync interval to 10 minutes:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600 ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
Ceph daemon performance counters metrics
-----------------------------------------
With the introduction of the ``ceph-exporter`` daemon, the prometheus module will no longer export Ceph daemon
perf counters as prometheus metrics by default. However, one may re-enable exporting these metrics by setting
the module option ``exclude_perf_counters`` to ``false``:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/exclude_perf_counters false
Statistic names and labels Statistic names and labels
========================== ==========================

View File

@ -2,8 +2,9 @@
RGW Module RGW Module
============ ============
The rgw module helps with bootstraping and configuring RGW realm The rgw module provides a simple interface to deploy RGW multisite.
and the different related entities. It helps with bootstrapping and configuring RGW realm, zonegroup and
the different related entities.
Enabling Enabling
-------- --------
@ -18,57 +19,120 @@ RGW Realm Operations
Bootstrapping RGW realm creates a new RGW realm entity, a new zonegroup, Bootstrapping RGW realm creates a new RGW realm entity, a new zonegroup,
and a new zone. It configures a new system user that can be used for and a new zone. It configures a new system user that can be used for
multisite sync operations, and returns a corresponding token. It sets multisite sync operations. Under the hood this module instructs the
up new RGW instances via the orchestrator. orchestrator to create and deploy the corresponding RGW daemons. The module
supports both passing the arguments through the cmd line or as a spec file:
It is also possible to create a new zone that connects to the master .. prompt:: bash #
zone and synchronizes data to/from it.
ceph rgw realm bootstrap [--realm-name] [--zonegroup-name] [--zone-name] [--port] [--placement] [--start-radosgw]
The command supports providing the configuration through a spec file (`-i option`):
.. prompt:: bash #
ceph rgw realm bootstrap -i myrgw.yaml
Following is an example of RGW multisite spec file:
.. code-block:: yaml
rgw_realm: myrealm
rgw_zonegroup: myzonegroup
rgw_zone: myzone
placement:
hosts:
- ceph-node-1
- ceph-node-2
spec:
rgw_frontend_port: 5500
.. note:: The spec file used by RGW has the same format as the one used by the orchestrator. Thus,
the user can provide any orchestrator-supported RGW parameters, including advanced
configuration features such as SSL certificates, etc.
Users can also specify custom zone endpoints in the spec (or through the cmd line). In this case, no
cephadm daemons will be launched. Following is an example RGW spec file with zone endpoints:
.. code-block:: yaml
rgw_realm: myrealm
rgw_zonegroup: myzonegroup
rgw_zone: myzone
zone_endpoints: http://<rgw_host1>:<rgw_port1>, http://<rgw_host2>:<rgw_port2>
Realm Credentials Token Realm Credentials Token
----------------------- -----------------------
A new token is created when bootstrapping a new realm, and also
when creating one explicitly. The token encapsulates
the master zone endpoint, and a set of credentials that are associated
with a system user.
Removal of this token would remove the credentials, and if the corresponding
system user has no more access keys, it is removed.
Users can list the available tokens for the created (or already existing) realms.
The token is a base64 string that encapsulates the realm information and its
master zone endpoint authentication data. Following is an example of
the `ceph rgw realm tokens` output:
.. prompt:: bash #
ceph rgw realm tokens | jq
.. code-block:: json
[
{
"realm": "myrealm1",
"token": "ewogICAgInJlYWxtX25hbWUiOiAibXlyZWFs....NHlBTFhoIgp9"
},
{
"realm": "myrealm2",
"token": "ewogICAgInJlYWxtX25hbWUiOiAibXlyZWFs....RUU12ZDB0Igp9"
}
]
Users can use the token to pull a realm and create a secondary zone on a
different cluster that syncs with the master zone on the primary cluster,
by using the `ceph rgw zone create` command and providing the corresponding token.
Following is an example of zone spec file:
.. code-block:: yaml
rgw_zone: my-secondary-zone
rgw_realm_token: <token>
placement:
hosts:
- ceph-node-1
- ceph-node-2
spec:
rgw_frontend_port: 5500
.. prompt:: bash #
ceph rgw zone create -i zone-spec.yaml
.. note:: The spec file used by RGW has the same format as the one used by the orchestrator. Thus,
the user can provide any orchestrator-supported RGW parameters, including advanced
configuration features such as SSL certificates, etc.
Commands Commands
-------- --------
:: ::
ceph rgw realm bootstrap ceph rgw realm bootstrap -i spec.yaml
Create a new realm + zonegroup + zone and deploy rgw daemons via the Create a new realm + zonegroup + zone and deploy rgw daemons via the
orchestrator. Command returns a realm token that allows new zones to easily orchestrator using the information specified in the YAML file.
join this realm
:: ::
ceph rgw zone create ceph rgw realm tokens
Create a new zone and join existing realm (using the realm token) List the tokens of all the available realms
:: ::
ceph rgw zone-creds create ceph rgw zone create -i spec.yaml
Create new credentials and return a token for new zone connection Join an existing realm by creating a new secondary zone (using the realm token)
::
ceph rgw zone-creds remove
Remove credentials and/or user that are associated with the specified
token
::
ceph rgw realm reconcile
Update the realm configuration to match the orchestrator deployment
:: ::

Binary file not shown.

After

Width:  |  Height:  |  Size: 6.4 KiB

View File

@ -269,3 +269,24 @@ completely optional, and disabled by default.::
ceph config set mgr mgr/telemetry/description 'My first Ceph cluster' ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
ceph config set mgr mgr/telemetry/channel_ident true ceph config set mgr mgr/telemetry/channel_ident true
Leaderboard
-----------
To participate in a leaderboard in the `public dashboards
<https://telemetry-public.ceph.com/>`_, run the following command:
.. prompt:: bash $
ceph config set mgr mgr/telemetry/leaderboard true
The leaderboard displays basic information about the cluster. This includes the
total storage capacity and the number of OSDs. To add a description of the
cluster, run a command of the following form:
.. prompt:: bash $
ceph config set mgr mgr/telemetry/leaderboard_description 'Ceph cluster for Computational Biology at the University of XYZ'
If the ``ident`` channel is enabled, its details will not be displayed in the
leaderboard.

View File

@ -0,0 +1,474 @@
.. _monitoring:
===================
Monitoring overview
===================
The aim of this part of the documentation is to explain the Ceph monitoring
stack and the meaning of the main Ceph metrics.
With a good understanding of the Ceph monitoring stack and metrics, users can
create customized monitoring tools, like Prometheus queries, Grafana
dashboards, or scripts.
Ceph Monitoring stack
=====================
Ceph provides a default monitoring stack which is installed by cephadm and
explained in the :ref:`Monitoring Services <mgr-cephadm-monitoring>` section of
the cephadm documentation.
Ceph metrics
============
The main source of Ceph metrics is the set of performance counters exposed by each
Ceph daemon. The :doc:`../dev/perf_counters` are native Ceph monitoring data.
Performance counters are transformed into standard Prometheus metrics by the
Ceph exporter daemon. This daemon runs on every Ceph cluster host and exposes a
metrics endpoint where all the performance counters exposed by all the Ceph
daemons running on the host are published in the form of Prometheus metrics.
In addition to the Ceph exporter, there is another agent to expose Ceph
metrics. It is the Prometheus manager module, which exposes metrics related to
the whole cluster, basically metrics that are not produced by individual Ceph
daemons.
The main source for obtaining Ceph metrics is the metrics endpoint exposed by
the Cluster Prometheus server. Ceph can provide you with the Prometheus
endpoint where you can obtain the complete list of metrics (coming from Ceph
exporter daemons and the Prometheus manager module) and execute queries.
Use the following command to obtain the Prometheus server endpoint in your
cluster:
Example:
.. code-block:: bash
# ceph orch ps --service_name prometheus
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
prometheus.cephtest-node-00 cephtest-node-00.cephlab.com *:9095 running (103m) 50s ago 5w 142M - 2.33.4 514e6a882f6e efe3cbc2e521
With this information you can connect to
``http://cephtest-node-00.cephlab.com:9095`` to access the Prometheus server
interface.
And the complete list of metrics (with help) for your cluster will be available
in:
``http://cephtest-node-00.cephlab.com:9095/api/v1/targets/metadata``
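Metrics can also be queried directly over the Prometheus HTTP API. The
following is a sketch (reusing the hypothetical endpoint above and the
``ceph_health_status`` metric exposed by the Prometheus manager module):

.. code-block:: bash

    # Instant query for the cluster health metric
    curl -s 'http://cephtest-node-00.cephlab.com:9095/api/v1/query?query=ceph_health_status'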
Note that the main tool that allows users to observe and monitor a Ceph cluster is the **Ceph Dashboard**. It provides graphs in which the most important cluster and service metrics are represented. Most of the examples in this document are taken from the dashboard graphs or extrapolated from the metrics exposed by the Ceph dashboard.
Performance metrics
===================
The following are the main metrics used to measure Ceph cluster performance.
All metrics have the following labels:
``ceph_daemon``: identifier of the OSD daemon generating the metric
``instance``: the IP address of the ceph exporter instance exposing the metric.
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_osd_op_r{ceph_daemon="osd.0", instance="192.168.122.7:9283", job="ceph"} = 73981
*Cluster I/O (throughput):*
Use ``ceph_osd_op_r_out_bytes`` and ``ceph_osd_op_w_in_bytes`` to obtain the cluster throughput generated by clients.
Example:
.. code-block:: bash
Writes (B/s):
sum(irate(ceph_osd_op_w_in_bytes[1m]))
Reads (B/s):
sum(irate(ceph_osd_op_r_out_bytes[1m]))
*Cluster I/O (operations):*
Use ``ceph_osd_op_r`` and ``ceph_osd_op_w`` to obtain the number of operations generated by clients.
Example:
.. code-block:: bash
Writes (ops/s):
sum(irate(ceph_osd_op_w[1m]))
Reads (ops/s):
sum(irate(ceph_osd_op_r[1m]))
*Latency:*
Use ``ceph_osd_op_latency_sum``, which represents the delay before an OSD data transfer begins following a client's request.
Example:
.. code-block:: bash
sum(irate(ceph_osd_op_latency_sum[1m]))
OSD performance
===============
The cluster performance metrics explained above are based on OSD metrics. By selecting the right label, we can obtain for a single OSD the same performance information shown above for the whole cluster:
Example:
.. code-block:: bash
OSD 0 read latency
irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"osd.0"}[1m]) / on (ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])
OSD 0 write IOPS
irate(ceph_osd_op_w{ceph_daemon=~"osd.0"}[1m])
OSD 0 write throughput (bytes)
irate(ceph_osd_op_w_in_bytes{ceph_daemon=~"osd.0"}[1m])
OSD.0 total raw capacity available
ceph_osd_stat_bytes{ceph_daemon="osd.0", instance="cephtest-node-00.cephlab.com:9283", job="ceph"} = 536451481
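A related sketch (assuming the companion metric ``ceph_osd_stat_bytes_used`` is also exported, as in recent releases) derives the utilization ratio of a single OSD from the raw capacity counters:

.. code-block:: bash

    OSD 0 utilization ratio (0.0 - 1.0)
      ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} / ceph_osd_stat_bytes{ceph_daemon="osd.0"}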
Physical disk performance
==========================
Combining Prometheus ``node_exporter`` metrics with Ceph metrics, we can obtain
information about the performance of the physical disks used by OSDs.
Example:
.. code-block:: bash
Read latency of device used by OSD 0:
label_replace(irate(node_disk_read_time_seconds_total[1m]) / irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Write latency of device used by OSD 0
label_replace(irate(node_disk_write_time_seconds_total[1m]) / irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
IOPS (device used by OSD.0)
reads:
label_replace(irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
writes:
label_replace(irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Throughput (device used by OSD.0)
reads:
label_replace(irate(node_disk_read_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
writes:
label_replace(irate(node_disk_written_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Physical Device Utilization (%) for OSD.0 in the last 5 minutes
label_replace(irate(node_disk_io_time_seconds_total[5m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Pool metrics
============
These metrics have the following labels:
``instance``: the IP address of the Ceph exporter daemon producing the metric.
``pool_id``: identifier of the pool
``job``: prometheus scrape job
- ``ceph_pool_metadata``: Information about the pool. It can be used together
with other metrics to provide more contextual information in queries and
graphs. Apart from the three common labels, this metric provides the following
extra labels:
- ``compression_mode``: compression used in the pool (lz4, snappy, zlib,
zstd, none). Example: compression_mode="none"
- ``description``: brief description of the pool type (replica:number of
replicas or Erasure code: ec profile). Example: description="replica:3"
- ``name``: name of the pool. Example: name=".mgr"
- ``type``: type of pool (replicated/erasure code). Example: type="replicated"
- ``ceph_pool_bytes_used``: Total raw capacity consumed by user data and associated overheads per pool (metadata + redundancy).
- ``ceph_pool_stored``: Total of CLIENT data stored in the pool
- ``ceph_pool_compress_under_bytes``: Data eligible to be compressed in the pool
- ``ceph_pool_compress_bytes_used``: Data compressed in the pool
- ``ceph_pool_rd``: CLIENT read operations per pool (reads per second)
- ``ceph_pool_rd_bytes``: CLIENT read operations in bytes per pool
- ``ceph_pool_wr``: CLIENT write operations per pool (writes per second)
- ``ceph_pool_wr_bytes``: CLIENT write operations in bytes per pool
**Useful queries**:
.. code-block:: bash
Total raw capacity available in the cluster:
sum(ceph_osd_stat_bytes)
Total raw capacity consumed in the cluster (including metadata + redundancy):
sum(ceph_pool_bytes_used)
Total of CLIENT data stored in the cluster:
sum(ceph_pool_stored)
Compression savings:
sum(ceph_pool_compress_under_bytes - ceph_pool_compress_bytes_used)
CLIENT IOPS for a pool (testrbdpool)
reads: irate(ceph_pool_rd[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
writes: irate(ceph_pool_wr[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
CLIENT Throughput for a pool
reads: irate(ceph_pool_rd_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
writes: irate(ceph_pool_wr_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
Object metrics
==============
These metrics have the following labels:
``instance``: the IP address of the Ceph exporter daemon providing the metric
``instance_id``: identifier of the rgw daemon
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_rgw_req{instance="192.168.122.7:9283", instance_id="154247", job="ceph"} = 12345
Generic metrics
---------------
- ``ceph_rgw_metadata``: Provides generic information about the RGW daemon. It
can be used together with other metrics to provide more contextual
information in queries and graphs. Apart from the three common labels, this
metric provides the following extra labels:
- ``ceph_daemon``: Name of the Ceph daemon. Example:
ceph_daemon="rgw.rgwtest.cephtest-node-00.sxizyq",
- ``ceph_version``: Version of Ceph daemon. Example: ceph_version="ceph
version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)",
- ``hostname``: Name of the host where the daemon runs. Example:
hostname:"cephtest-node-00.cephlab.com",
- ``ceph_rgw_req``: Total number of requests for the daemon (GET+PUT+DELETE)
Useful to detect bottlenecks and optimize load distribution.
- ``ceph_rgw_qlen``: RGW operations queue length for the daemon.
Useful to detect bottlenecks and optimize load distribution.
- ``ceph_rgw_failed_req``: Aborted requests.
Useful to detect daemon errors
GET operations: related metrics
-------------------------------
- ``ceph_rgw_get_initial_lat_count``: Number of get operations
- ``ceph_rgw_get_initial_lat_sum``: Total latency time for the GET operations
- ``ceph_rgw_get``: Total number of GET requests
- ``ceph_rgw_get_b``: Total bytes transferred in GET operations
Put operations: related metrics
-------------------------------
- ``ceph_rgw_put_initial_lat_count``: Number of PUT operations
- ``ceph_rgw_put_initial_lat_sum``: Total latency time for the PUT operations
- ``ceph_rgw_put``: Total number of PUT operations
- ``ceph_rgw_put_b``: Total bytes transferred in PUT operations
Useful queries
--------------
.. code-block:: bash
The average of get latencies:
rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
The average of put latencies:
rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Total requests per second:
rate(ceph_rgw_req[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Total number of "other" operations (LIST, DELETE)
rate(ceph_rgw_req[30s]) - (rate(ceph_rgw_get[30s]) + rate(ceph_rgw_put[30s]))
GET latencies
rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
PUT latencies
rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Bandwidth consumed by GET operations
sum(rate(ceph_rgw_get_b[30s]))
Bandwidth consumed by PUT operations
sum(rate(ceph_rgw_put_b[30s]))
Bandwidth consumed by RGW instance (PUTs + GETs)
sum by (instance_id) (rate(ceph_rgw_get_b[30s]) + rate(ceph_rgw_put_b[30s])) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Http errors:
rate(ceph_rgw_failed_req[30s])
Filesystem Metrics
==================
These metrics have the following labels:
``ceph_daemon``: The name of the MDS daemon
``instance``: the IP address (and port) of the Ceph exporter daemon exposing the metric
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_mds_request{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", instance="192.168.122.7:9283", job="ceph"} = 1452
Main metrics
------------
- ``ceph_mds_metadata``: Provides general information about the MDS daemon. It
can be used together with other metrics to provide more contextual
information in queries and graphs. It provides the following extra labels:
- ``ceph_version``: MDS daemon Ceph version
- ``fs_id``: filesystem cluster id
- ``hostname``: Host name where the MDS daemon runs
- ``public_addr``: Public address where the MDS daemon runs
- ``rank``: Rank of the MDS daemon
Example:
.. code-block:: bash
ceph_mds_metadata{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", ceph_version="ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)", fs_id="-1", hostname="cephtest-node-00.cephlab.com", instance="cephtest-node-00.cephlab.com:9283", job="ceph", public_addr="192.168.122.145:6801/118896446", rank="-1"}
- ``ceph_mds_request``: Total number of requests for the MDS daemon
- ``ceph_mds_reply_latency_sum``: Reply latency total
- ``ceph_mds_reply_latency_count``: Reply latency count
- ``ceph_mds_server_handle_client_request``: Number of client requests
- ``ceph_mds_sessions_session_count``: Session count
- ``ceph_mds_sessions_total_load``: Total load
- ``ceph_mds_sessions_sessions_open``: Sessions currently open
- ``ceph_mds_sessions_sessions_stale``: Sessions currently stale
- ``ceph_objecter_op_r``: Number of read operations
- ``ceph_objecter_op_w``: Number of write operations
- ``ceph_mds_root_rbytes``: Total number of bytes managed by the daemon
- ``ceph_mds_root_rfiles``: Total number of files managed by the daemon
Useful queries
---------------
.. code-block:: bash
Total MDS daemons read workload:
sum(rate(ceph_objecter_op_r[1m]))
Total MDS daemons write workload:
sum(rate(ceph_objecter_op_w[1m]))
MDS daemon read workload: (daemon name is "mdstest")
sum(rate(ceph_objecter_op_r{ceph_daemon=~"mdstest"}[1m]))
MDS daemon write workload: (daemon name is "mdstest")
sum(rate(ceph_objecter_op_w{ceph_daemon=~"mdstest"}[1m]))
The average of reply latencies:
rate(ceph_mds_reply_latency_sum[30s]) / rate(ceph_mds_reply_latency_count[30s])
Total requests per second:
rate(ceph_mds_request[30s]) * on (instance) group_right (ceph_daemon) ceph_mds_metadata
Block metrics
=============
By default, RBD image metrics are not available, in order to provide the best
performance in the prometheus manager module.
To produce metrics for RBD images, the manager option
``mgr/prometheus/rbd_stats_pools`` must be configured properly. For more
information, see :ref:`prometheus-rbd-io-statistics`.
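For example, the following sketch reuses the ``testrbdpool`` pool name that appears in the queries above; substitute your own comma-separated list of pools:
.. code-block:: bash
ceph config set mgr mgr/prometheus/rbd_stats_pools "testrbdpool"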
These metrics have the following labels:
``image``: Name of the image which produces the metric value.
``instance``: Node where the RBD metric is produced (it points to the Ceph exporter daemon).
``job``: Name of the Prometheus scrape job.
``pool``: Image pool name.
Example:
.. code-block:: bash
ceph_rbd_read_bytes{image="test2", instance="cephtest-node-00.cephlab.com:9283", job="ceph", pool="testrbdpool"}
Main metrics
------------
- ``ceph_rbd_read_bytes``: RBD image bytes read
- ``ceph_rbd_read_latency_count``: RBD image read latency count
- ``ceph_rbd_read_latency_sum``: RBD image read latency total
- ``ceph_rbd_read_ops``: RBD image read operations count
- ``ceph_rbd_write_bytes``: RBD image bytes written
- ``ceph_rbd_write_latency_count``: RBD image write latency count
- ``ceph_rbd_write_latency_sum``: RBD image write latency total
- ``ceph_rbd_write_ops``: RBD image write operations count
Useful queries
--------------
.. code-block:: bash
The average of read latencies:
rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s]) * on (instance) group_left (ceph_daemon) ceph_rgw_metadata
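By analogy with the read-latency query above, a sketch of the corresponding write-latency query (using only the write metrics listed in this section):
.. code-block:: bash
The average of write latencies:
rate(ceph_rbd_write_latency_sum[30s]) / rate(ceph_rbd_write_latency_count[30s])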
View File
@ -1,107 +1,110 @@
.. _rados-cephx-config-ref:
======================== ========================
Cephx Config Reference CephX Config Reference
======================== ========================
The ``cephx`` protocol is enabled by default. Cryptographic authentication has The CephX protocol is enabled by default. The cryptographic authentication that
some computational costs, though they should generally be quite low. If the CephX provides has some computational costs, though they should generally be
network environment connecting your client and server hosts is very safe and quite low. If the network environment connecting your client and server hosts
you cannot afford authentication, you can turn it off. **This is not generally is very safe and you cannot afford authentication, you can disable it.
recommended**. **Disabling authentication is not generally recommended**.
.. note:: If you disable authentication, you are at risk of a man-in-the-middle .. note:: If you disable authentication, you will be at risk of a
attack altering your client/server messages, which could lead to disastrous man-in-the-middle attack that alters your client/server messages, which
security effects. could have disastrous security effects.
For creating users, see `User Management`_. For details on the architecture For information about creating users, see `User Management`_. For details on
of Cephx, see `Architecture - High Availability Authentication`_. the architecture of CephX, see `Architecture - High Availability
Authentication`_.
Deployment Scenarios Deployment Scenarios
==================== ====================
There are two main scenarios for deploying a Ceph cluster, which impact How you initially configure CephX depends on your scenario. There are two
how you initially configure Cephx. Most first time Ceph users use common strategies for deploying a Ceph cluster. If you are a first-time Ceph
``cephadm`` to create a cluster (easiest). For clusters using user, you should probably take the easiest approach: using ``cephadm`` to
other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need deploy a cluster. But if your cluster uses other deployment tools (for example,
to use the manual procedures or configure your deployment tool to Ansible, Chef, Juju, or Puppet), you will need either to use the manual
deployment procedures or to configure your deployment tool so that it will
bootstrap your monitor(s). bootstrap your monitor(s).
Manual Deployment Manual Deployment
----------------- -----------------
When you deploy a cluster manually, you have to bootstrap the monitor manually When you deploy a cluster manually, it is necessary to bootstrap the monitors
and create the ``client.admin`` user and keyring. To bootstrap monitors, follow manually and to create the ``client.admin`` user and keyring. To bootstrap
the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are monitors, follow the steps in `Monitor Bootstrapping`_. Follow these steps when
the logical steps you must perform when using third party deployment tools like using third-party deployment tools (for example, Chef, Puppet, and Juju).
Chef, Puppet, Juju, etc.
Enabling/Disabling Cephx Enabling/Disabling CephX
======================== ========================
Enabling Cephx requires that you have deployed keys for your monitors, Enabling CephX is possible only if the keys for your monitors, OSDs, and
OSDs and metadata servers. If you are simply toggling Cephx on / off, metadata servers have already been deployed. If you are simply toggling CephX
you do not have to repeat the bootstrapping procedures. on or off, it is not necessary to repeat the bootstrapping procedures.
Enabling CephX
Enabling Cephx
-------------- --------------
When ``cephx`` is enabled, Ceph will look for the keyring in the default search When CephX is enabled, Ceph will look for the keyring in the default search
path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override path: this path includes ``/etc/ceph/$cluster.$name.keyring``. It is possible
this location by adding a ``keyring`` option in the ``[global]`` section of to override this search-path location by adding a ``keyring`` option in the
your `Ceph configuration`_ file, but this is not recommended. ``[global]`` section of your `Ceph configuration`_ file, but this is not
recommended.
Execute the following procedures to enable ``cephx`` on a cluster with To enable CephX on a cluster for which authentication has been disabled, carry
authentication disabled. If you (or your deployment utility) have already out the following procedure. If you (or your deployment utility) have already
generated the keys, you may skip the steps related to generating keys. generated the keys, you may skip the steps related to generating keys.
#. Create a ``client.admin`` key, and save a copy of the key for your client #. Create a ``client.admin`` key, and save a copy of the key for your client
host host:
.. prompt:: bash $ .. prompt:: bash $
ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring
**Warning:** This will clobber any existing **Warning:** This step will clobber any existing
``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a ``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a
deployment tool has already done it for you. Be careful! deployment tool has already generated a keyring file for you. Be careful!
#. Create a keyring for your monitor cluster and generate a monitor #. Create a monitor keyring and generate a monitor secret key:
secret key.
.. prompt:: bash $ .. prompt:: bash $
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *' ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's #. For each monitor, copy the monitor keyring into a ``ceph.mon.keyring`` file
``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``, in the monitor's ``mon data`` directory. For example, to copy the monitor
use the following keyring to ``mon.a`` in a cluster called ``ceph``, run the following
command:
.. prompt:: bash $ .. prompt:: bash $
cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring
#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter #. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter:
.. prompt:: bash $ .. prompt:: bash $
ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring
#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number #. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:
.. prompt:: bash $ .. prompt:: bash $
ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring
#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter #. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:
.. prompt:: bash $ .. prompt:: bash $
ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring
#. Enable ``cephx`` authentication by setting the following options in the #. Enable CephX authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file ``[global]`` section of your `Ceph configuration`_ file:
.. code-block:: ini .. code-block:: ini
@ -109,23 +112,23 @@ generated the keys, you may skip the steps related to generating keys.
auth_service_required = cephx auth_service_required = cephx
auth_client_required = cephx auth_client_required = cephx
#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_.
#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details.
For details on bootstrapping a monitor manually, see `Manual Deployment`_. For details on bootstrapping a monitor manually, see `Manual Deployment`_.
Disabling Cephx Disabling CephX
--------------- ---------------
The following procedure describes how to disable Cephx. If your cluster The following procedure describes how to disable CephX. If your cluster
environment is relatively safe, you can offset the computation expense of environment is safe, you might want to disable CephX in order to offset the
running authentication. **We do not recommend it.** However, it may be easier computational expense of running authentication. **We do not recommend doing
during setup and/or troubleshooting to temporarily disable authentication. so.** However, setup and troubleshooting might be easier if authentication is
temporarily disabled and subsequently re-enabled.
#. Disable ``cephx`` authentication by setting the following options in the #. Disable CephX authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file ``[global]`` section of your `Ceph configuration`_ file:
.. code-block:: ini .. code-block:: ini
@ -133,8 +136,7 @@ during setup and/or troubleshooting to temporarily disable authentication.
auth_service_required = none auth_service_required = none
auth_client_required = none auth_client_required = none
#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_.
#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details.
Configuration Settings Configuration Settings
@ -144,70 +146,230 @@ Enablement
---------- ----------
.. confval:: auth_cluster_required ``auth_cluster_required``
.. confval:: auth_service_required
.. confval:: auth_client_required :Description: If this configuration setting is enabled, the Ceph Storage
Cluster daemons (that is, ``ceph-mon``, ``ceph-osd``,
``ceph-mds``, and ``ceph-mgr``) are required to authenticate with
each other. Valid settings are ``cephx`` or ``none``.
:Type: String
:Required: No
:Default: ``cephx``.
``auth_service_required``
:Description: If this configuration setting is enabled, then Ceph clients can
access Ceph services only if those clients authenticate with the
Ceph Storage Cluster. Valid settings are ``cephx`` or ``none``.
:Type: String
:Required: No
:Default: ``cephx``.
``auth_client_required``
:Description: If this configuration setting is enabled, then communication
between the Ceph client and Ceph Storage Cluster can be
established only if the Ceph Storage Cluster authenticates
against the Ceph client. Valid settings are ``cephx`` or
``none``.
:Type: String
:Required: No
:Default: ``cephx``.
.. index:: keys; keyring .. index:: keys; keyring
Keys Keys
---- ----
When you run Ceph with authentication enabled, ``ceph`` administrative commands When Ceph is run with authentication enabled, ``ceph`` administrative commands
and Ceph Clients require authentication keys to access the Ceph Storage Cluster. and Ceph clients can access the Ceph Storage Cluster only if they use
authentication keys.
The most common way to provide these keys to the ``ceph`` administrative The most common way to make these keys available to ``ceph`` administrative
commands and clients is to include a Ceph keyring under the ``/etc/ceph`` commands and Ceph clients is to include a Ceph keyring under the ``/etc/ceph``
directory. For Octopus and later releases using ``cephadm``, the filename directory. For Octopus and later releases that use ``cephadm``, the filename is
is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``). usually ``ceph.client.admin.keyring``. If the keyring is included in the
If you include the keyring under the ``/etc/ceph`` directory, you don't need to ``/etc/ceph`` directory, then it is unnecessary to specify a ``keyring`` entry
specify a ``keyring`` entry in your Ceph configuration file. in the Ceph configuration file.
We recommend copying the Ceph Storage Cluster's keyring file to nodes where you Because the Ceph Storage Cluster's keyring file contains the ``client.admin``
will run administrative commands, because it contains the ``client.admin`` key. key, we recommend copying the keyring file to nodes from which you run
administrative commands.
To perform this step manually, execute the following: To perform this step manually, run the following command:
.. prompt:: bash $ .. prompt:: bash $
sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set .. tip:: Make sure that the ``ceph.keyring`` file has appropriate permissions
(e.g., ``chmod 644``) on your client machine. (for example, ``chmod 644``) set on your client machine.
You may specify the key itself in the Ceph configuration file using the ``key`` You can specify the key itself by using the ``key`` setting in the Ceph
setting (not recommended), or a path to a keyfile using the ``keyfile`` setting. configuration file (this approach is not recommended), or instead specify a
path to a keyfile by using the ``keyfile`` setting in the Ceph configuration
file.
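As an illustrative sketch only (the keyfile path shown here is hypothetical), a client section that points at a keyfile instead of embedding the key itself might look like this in the Ceph configuration file:
.. code-block:: ini
[client.admin]
keyfile = /etc/ceph/client.admin.secret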
``keyring``
:Description: The path to the keyring file.
:Type: String
:Required: No
:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin``
``keyfile``
:Description: The path to a keyfile (that is, a file containing only the key).
:Type: String
:Required: No
:Default: None
``key``
:Description: The key (that is, the text string of the key itself). We do not
recommend that you use this setting unless you know what you're
doing.
:Type: String
:Required: No
:Default: None
Daemon Keyrings
---------------
Administrative users or deployment tools (for example, ``cephadm``) generate
daemon keyrings in the same way that they generate user keyrings. By default,
Ceph stores the keyring of a daemon inside that daemon's data directory. The
default keyring locations and the capabilities that are necessary for the
daemon to function are shown below.
``ceph-mon``
:Location: ``$mon_data/keyring``
:Capabilities: ``mon 'allow *'``
``ceph-osd``
:Location: ``$osd_data/keyring``
:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'``
``ceph-mds``
:Location: ``$mds_data/keyring``
:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'``
``ceph-mgr``
:Location: ``$mgr_data/keyring``
:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'``
``radosgw``
:Location: ``$rgw_data/keyring``
:Capabilities: ``mon 'allow rwx' osd 'allow rwx'``
.. note:: The monitor keyring (that is, ``mon.``) contains a key but no
capabilities, and this keyring is not part of the cluster ``auth`` database.
The daemon's data-directory locations default to directories of the form::
/var/lib/ceph/$type/$cluster-$id
For example, ``osd.12`` would have the following data directory::
/var/lib/ceph/osd/ceph-12
It is possible to override these locations, but it is not recommended.
.. confval:: keyring
:default: /etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
.. confval:: keyfile
.. confval:: key
.. index:: signatures .. index:: signatures
Signatures Signatures
---------- ----------
Ceph performs a signature check that provides some limited protection Ceph performs a signature check that provides some limited protection against
against messages being tampered with in flight (e.g., by a "man in the messages being tampered with in flight (for example, by a "man in the middle"
middle" attack). attack).
Like other parts of Ceph authentication, Ceph provides fine-grained control so As with other parts of Ceph authentication, signatures admit of fine-grained
you can enable/disable signatures for service messages between clients and control. You can enable or disable signatures for service messages between
Ceph, and so you can enable/disable signatures for messages between Ceph daemons. clients and Ceph, and for messages between Ceph daemons.
Note that even with signatures enabled data is not encrypted in Note that even when signatures are enabled data is not encrypted in flight.
flight.
``cephx_require_signatures``
:Description: If this configuration setting is set to ``true``, Ceph requires
signatures on all message traffic between the Ceph client and the
Ceph Storage Cluster, and between daemons within the Ceph Storage
Cluster.
.. note::
**ANTIQUATED NOTE:**
Neither Ceph Argonaut nor Linux kernel versions prior to 3.19
support signatures; if one of these clients is in use, ``cephx_require_signatures``
can be disabled in order to allow the client to connect.
:Type: Boolean
:Required: No
:Default: ``false``
``cephx_cluster_require_signatures``
:Description: If this configuration setting is set to ``true``, Ceph requires
signatures on all message traffic between Ceph daemons within the
Ceph Storage Cluster.
:Type: Boolean
:Required: No
:Default: ``false``
``cephx_service_require_signatures``
:Description: If this configuration setting is set to ``true``, Ceph requires
signatures on all message traffic between Ceph clients and the
Ceph Storage Cluster.
:Type: Boolean
:Required: No
:Default: ``false``
``cephx_sign_messages``
:Description: If this configuration setting is set to ``true``, and if the Ceph
version supports message signing, then Ceph will sign all
messages so that they are more difficult to spoof.
:Type: Boolean
:Default: ``true``
.. confval:: cephx_require_signatures
.. confval:: cephx_cluster_require_signatures
.. confval:: cephx_service_require_signatures
.. confval:: cephx_sign_messages
Time to Live Time to Live
------------ ------------
.. confval:: auth_service_ticket_ttl ``auth_service_ticket_ttl``
:Description: When the Ceph Storage Cluster sends a ticket for authentication
to a Ceph client, the Ceph Storage Cluster assigns that ticket a
Time To Live (TTL).
:Type: Double
:Default: ``60*60``
.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping .. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping
.. _Operating a Cluster: ../../operations/operating .. _Operating a Cluster: ../../operations/operating
View File
@ -1,84 +1,95 @@
========================== ==================================
BlueStore Config Reference BlueStore Configuration Reference
========================== ==================================
Devices Devices
======= =======
BlueStore manages either one, two, or (in certain cases) three storage BlueStore manages either one, two, or in certain cases three storage devices.
devices. These *devices* are "devices" in the Linux/Unix sense. This means that they are
assets listed under ``/dev`` or ``/devices``. Each of these devices may be an
entire storage drive, or a partition of a storage drive, or a logical volume.
BlueStore does not create or mount a conventional file system on devices that
it uses; BlueStore reads and writes to the devices directly in a "raw" fashion.
In the simplest case, BlueStore consumes a single (primary) storage device. In the simplest case, BlueStore consumes all of a single storage device. This
The storage device is normally used as a whole, occupying the full device that device is known as the *primary device*. The primary device is identified by
is managed directly by BlueStore. This *primary device* is normally identified the ``block`` symlink in the data directory.
by a ``block`` symlink in the data directory.
The data directory is a ``tmpfs`` mount which gets populated (at boot time, or The data directory is a ``tmpfs`` mount. When this data directory is booted or
when ``ceph-volume`` activates it) with all the common OSD files that hold activated by ``ceph-volume``, it is populated with metadata files and links
information about the OSD, like: its identifier, which cluster it belongs to, that hold information about the OSD: for example, the OSD's identifier, the
and its private keyring. name of the cluster that the OSD belongs to, and the OSD's private keyring.
It is also possible to deploy BlueStore across one or two additional devices: In more complicated cases, BlueStore is deployed across one or two additional
devices:
* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be * A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
used for BlueStore's internal journal or write-ahead log. It is only useful directory) can be used to separate out BlueStore's internal journal or
to use a WAL device if the device is faster than the primary device (e.g., write-ahead log. Using a WAL device is advantageous only if the WAL device
when it is on an SSD and the primary device is an HDD). is faster than the primary device (for example, if the WAL device is an SSD
and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used * A *DB device* (identified as ``block.db`` in the data directory) can be used
for storing BlueStore's internal metadata. BlueStore (or rather, the to store BlueStore's internal metadata. BlueStore (or more precisely, the
embedded RocksDB) will put as much metadata as it can on the DB device to embedded RocksDB) will put as much metadata as it can on the DB device in
improve performance. If the DB device fills up, metadata will spill back order to improve performance. If the DB device becomes full, metadata will
onto the primary device (where it would have been otherwise). Again, it is spill back onto the primary device (where it would have been located in the
only helpful to provision a DB device if it is faster than the primary absence of the DB device). Again, it is advantageous to provision a DB device
device. only if it is faster than the primary device.
If there is only a small amount of fast storage available (e.g., less If there is only a small amount of fast storage available (for example, less
than a gigabyte), we recommend using it as a WAL device. If there is than a gigabyte), we recommend using the available space as a WAL device. But
more, provisioning a DB device makes more sense. The BlueStore if more fast storage is available, it makes more sense to provision a DB
journal will always be placed on the fastest device available, so device. Because the BlueStore journal is always placed on the fastest device
using a DB device will provide the same benefit that the WAL device available, using a DB device provides the same benefit that using a WAL device
would while *also* allowing additional metadata to be stored there (if would, while *also* allowing additional metadata to be stored off the primary
it will fit). This means that if a DB device is specified but an explicit device (provided that it fits). DB devices make this possible because whenever
WAL device is not, the WAL will be implicitly colocated with the DB on the faster a DB device is specified but an explicit WAL device is not, the WAL will be
device. implicitly colocated with the DB on the faster device.
A single-device (colocated) BlueStore OSD can be provisioned with: To provision a single-device (colocated) BlueStore OSD, run the following
command:
.. prompt:: bash $ .. prompt:: bash $
ceph-volume lvm prepare --bluestore --data <device> ceph-volume lvm prepare --bluestore --data <device>
To specify a WAL device and/or DB device: To specify a WAL device or DB device, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device> ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other .. note:: The option ``--data`` can take as its argument any of the
devices can be existing logical volumes or GPT partitions. following devices: logical volumes specified using *vg/lv* notation,
existing logical volumes, and GPT partitions.
Provisioning strategies Provisioning strategies
----------------------- -----------------------
Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore
which had just one), there are two common arrangements that should help clarify BlueStore differs from Filestore in that there are several ways to deploy a
the deployment strategy: BlueStore OSD. However, the overall deployment strategy for BlueStore can be
clarified by examining just these two common arrangements:
.. _bluestore-single-type-device-config: .. _bluestore-single-type-device-config:
**block (data) only** **block (data) only**
^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^
If all devices are the same type, for example all rotational drives, and If all devices are of the same type (for example, they are all HDDs), and if
there are no fast devices to use for metadata, it makes sense to specify the there are no fast devices available for the storage of metadata, then it makes
block device only and to not separate ``block.db`` or ``block.wal``. The sense to specify the block device only and to leave ``block.db`` and
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like: ``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
``/dev/sda`` device is as follows:
.. prompt:: bash $ .. prompt:: bash $
ceph-volume lvm create --bluestore --data /dev/sda ceph-volume lvm create --bluestore --data /dev/sda
If logical volumes have already been created for each device, (a single LV If the devices to be used for a BlueStore OSD are pre-created logical volumes,
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named then the :ref:`ceph-volume-lvm` call for a logical volume named
``ceph-vg/block-lv`` would look like: ``ceph-vg/block-lv`` is as follows:
.. prompt:: bash $ .. prompt:: bash $
@ -88,15 +99,18 @@ using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
**block and block.db** **block and block.db**
^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^
If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower (spinning drive).
You must create these volume groups and logical volumes manually as If you have a mix of fast and slow devices (for example, SSD or HDD), then we
the ``ceph-volume`` tool is currently not able to do so automatically. recommend placing ``block.db`` on the faster device while ``block`` (that is,
the data) is stored on the slower device (that is, the rotational drive).
For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``) You must create these volume groups and these logical volumes manually, as the
and one (fast) solid state drive (``sdx``). First create the volume groups: ``ceph-volume`` tool is currently unable to do so automatically.
The following procedure illustrates the manual creation of volume groups and
logical volumes. For this example, we shall assume four rotational drives
(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
to create the volume groups, run the following commands:
.. prompt:: bash $ .. prompt:: bash $
@ -105,7 +119,7 @@ and one (fast) solid state drive (``sdx``). First create the volume groups:
vgcreate ceph-block-2 /dev/sdc vgcreate ceph-block-2 /dev/sdc
vgcreate ceph-block-3 /dev/sdd vgcreate ceph-block-3 /dev/sdd
Now create the logical volumes for ``block``: Next, to create the logical volumes for ``block``, run the following commands:
.. prompt:: bash $ .. prompt:: bash $
@ -114,8 +128,9 @@ Now create the logical volumes for ``block``:
lvcreate -l 100%FREE -n block-2 ceph-block-2 lvcreate -l 100%FREE -n block-2 ceph-block-2
lvcreate -l 100%FREE -n block-3 ceph-block-3 lvcreate -l 100%FREE -n block-3 ceph-block-3
We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB Because there are four HDDs, there will be four OSDs. Supposing that there is a
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB: 200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
the following commands:
.. prompt:: bash $ .. prompt:: bash $
@ -125,7 +140,7 @@ SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:
lvcreate -L 50GB -n db-2 ceph-db-0 lvcreate -L 50GB -n db-2 ceph-db-0
lvcreate -L 50GB -n db-3 ceph-db-0 lvcreate -L 50GB -n db-3 ceph-db-0
Finally, create the 4 OSDs with ``ceph-volume``: Finally, to create the four OSDs, run the following commands:
.. prompt:: bash $ .. prompt:: bash $
@ -134,54 +149,57 @@ Finally, create the 4 OSDs with ``ceph-volume``:
ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2 ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3 ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
These operations should end up creating four OSDs, with ``block`` on the slower After this procedure is finished, there should be four OSDs, ``block`` should
rotational drives with a 50 GB logical volume (DB) for each on the solid state be on the four HDDs, and each HDD should have a 50GB logical volume
drive. (specifically, a DB device) on the shared SSD.
Sizing Sizing
====== ======
When using a :ref:`mixed spinning and solid drive setup When using a :ref:`mixed spinning-and-solid-drive setup
<bluestore-mixed-device-config>` it is important to make a large enough <bluestore-mixed-device-config>`, it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have ``block.db`` logical volume for BlueStore. The logical volumes associated with
*as large as possible* logical volumes. ``block.db`` should be *as large as possible*.
The general recommendation is to have ``block.db`` size in between 1% to 4% It is generally recommended that the size of ``block.db`` be somewhere between
of ``block`` size. For RGW workloads, it is recommended that the ``block.db`` 1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
size isn't smaller than 4% of ``block``, because RGW heavily uses it to store the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't use of ``block.db`` to store metadata (in particular, omap keys). For example,
be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough. if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
2% of the ``block`` size.
In older releases, internal level sizes mean that the DB can fully utilize only In older releases, internal level sizes are such that the DB can fully utilize
specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2, only those specific partition / logical volume sizes that correspond to sums of
etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly
so forth. Most deployments will not substantially benefit from sizing to 3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
accommodate L3 and higher, though DB compaction can be facilitated by doubling sizing that accommodates L3 and higher, though DB compaction can be facilitated
these figures to 6GB, 60GB, and 600GB. by doubling these figures to 6GB, 60GB, and 600GB.
Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6 Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow
enable better utilization of arbitrary DB device sizes, and the Pacific for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific
release brings experimental dynamic level support. Users of older releases may release brings experimental dynamic-level support. Because of these advances,
thus wish to plan ahead by provisioning larger DB devices today so that their users of older releases might want to plan ahead by provisioning larger DB
benefits may be realized with future upgrades. devices today so that the benefits of scale can be realized when upgrades are
made in the future.
When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.
When *not* using a mix of fast and slow devices, there is no requirement to
create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore
will automatically colocate these devices within the space of ``block``.
Automatic Cache Sizing Automatic Cache Sizing
====================== ======================
BlueStore can be configured to automatically resize its caches when TCMalloc BlueStore can be configured to automatically resize its caches, provided that
is configured as the memory allocator and the ``bluestore_cache_autotune`` certain conditions are met: TCMalloc must be configured as the memory allocator
setting is enabled. This option is currently enabled by default. BlueStore and the ``bluestore_cache_autotune`` configuration option must be enabled (note
will attempt to keep OSD heap memory usage under a designated target size via that it is currently enabled by default). When automatic cache sizing is in
the ``osd_memory_target`` configuration option. This is a best effort effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
algorithm and caches will not shrink smaller than the amount specified by size (as determined by ``osd_memory_target``). This approach makes use of a
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy best-effort algorithm and caches do not shrink smaller than the size defined by
of priorities. If priority information is not available, the the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are with a hierarchy of priorities. But if priority information is not available,
used as fallbacks. the values specified in the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options are used as fallback cache ratios.
.. confval:: bluestore_cache_autotune .. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target .. confval:: osd_memory_target
@ -195,34 +213,33 @@ used as fallbacks.
Manual Cache Sizing Manual Cache Sizing
=================== ===================
The amount of memory consumed by each OSD for BlueStore caches is The amount of memory consumed by each OSD to be used for its BlueStore cache is
determined by the ``bluestore_cache_size`` configuration option. If determined by the ``bluestore_cache_size`` configuration option. If that option
that config option is not set (i.e., remains at 0), there is a has not been specified (that is, if it remains at 0), then Ceph uses a
different default value that is used depending on whether an HDD or different configuration option to determine the default memory budget:
SSD is used for the primary device (set by the ``bluestore_cache_size_hdd`` if the primary device is an HDD, or
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config ``bluestore_cache_size_ssd`` if the primary device is an SSD.
options).
BlueStore and the rest of the Ceph OSD daemon do the best they can BlueStore and the rest of the Ceph OSD daemon make every effort to work within
to work within this memory budget. Note that on top of the configured this memory budget. Note that in addition to the configured cache size, there
cache size, there is also memory consumed by the OSD itself, and is also memory consumed by the OSD itself. There is additional utilization due
some additional utilization due to memory fragmentation and other to memory fragmentation and other allocator overhead.
allocator overhead.
The configured cache memory budget can be used in a few different ways: The configured cache-memory budget can be used to store the following types of
things:
* Key/Value metadata (i.e., RocksDB's internal cache) * Key/Value metadata (that is, RocksDB's internal cache)
* BlueStore metadata * BlueStore metadata
* BlueStore data (i.e., recently read or written object data) * BlueStore data (that is, recently read or recently written object data)
Cache memory usage is governed by the following options: Cache memory usage is governed by the configuration options
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. ``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction
The fraction of the cache devoted to data of the cache that is reserved for data is governed by both the effective
is governed by the effective bluestore cache size (depending on BlueStore cache size (which depends on the relevant
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary ``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary
device) as well as the meta and kv ratios. device) and the "meta" and "kv" ratios. This data fraction can be calculated
The data fraction can be calculated by with the following formula: ``<effective_cache_size> * (1 -
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)`` bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.
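As a worked illustration of this formula (the values below are hypothetical, not the defaults):
.. code-block:: bash
# effective cache size           = 3 GiB
# bluestore_cache_meta_ratio     = 0.4
# bluestore_cache_kv_ratio       = 0.4
# data fraction = 3 GiB * (1 - 0.4 - 0.4) = 0.6 GiB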
.. confval:: bluestore_cache_size .. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd .. confval:: bluestore_cache_size_hdd
@ -233,29 +250,28 @@ The data fraction can be calculated by
Checksums Checksums
========= =========
BlueStore checksums all metadata and data written to disk. Metadata BlueStore checksums all metadata and all data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data checksumming is handled by RocksDB and uses the `crc32c` algorithm. By
checksumming is done by BlueStore and can make use of `crc32c`, contrast, data checksumming is handled by BlueStore and can use either
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be `crc32c`, `xxhash32`, or `xxhash64`. Nonetheless, `crc32c` is the default
suitable for most purposes. checksum algorithm and it is suitable for most purposes.
Full data checksumming does increase the amount of metadata that Full data checksumming increases the amount of metadata that BlueStore must
BlueStore must store and manage. When possible, e.g., when clients store and manage. Whenever possible (for example, when clients hint that data
hint that data is written and read sequentially, BlueStore will is written and read sequentially), BlueStore will checksum larger blocks. In
checksum larger blocks, but in many cases it must store a checksum many cases, however, it must store a checksum value (usually 4 bytes) for every
value (usually 4 bytes) for every 4 kilobyte block of data. 4 KB block of data.
It is possible to use a smaller checksum value by truncating the It is possible to obtain a smaller checksum value by truncating the checksum to
checksum to two or one byte, reducing the metadata overhead. The one or two bytes and reducing the metadata overhead. A drawback of this
trade-off is that the probability that a random error will not be approach is that it increases the probability of a random error going
detected is higher with a smaller checksum, going from about one in undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a 65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte)
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum. checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8`
The smaller checksum values can be used by selecting `crc32c_16` or as the checksum algorithm.
`crc32c_8` as the checksum algorithm.
The *checksum algorithm* can be set either via a per-pool The *checksum algorithm* can be specified either via a per-pool ``csum_type``
``csum_type`` property or the global config option. For example: configuration option or via the global configuration option. For example:
.. prompt:: bash $ .. prompt:: bash $
@ -266,34 +282,35 @@ The *checksum algorithm* can be set either via a per-pool
Inline Compression Inline Compression
================== ==================
BlueStore supports inline compression using `snappy`, `zlib`, or BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`.
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.
Whether data in BlueStore is compressed is determined by a combination Whether data in BlueStore is compressed is determined by two factors: (1) the
of the *compression mode* and any hints associated with a write *compression mode* and (2) any client hints associated with a write operation.
operation. The modes are: The compression modes are as follows:
* **none**: Never compress data. * **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a * **passive**: Do not compress data unless the write operation has a
*compressible* hint set. *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an * **aggressive**: Do compress data unless the write operation has an
*incompressible* hint set. *incompressible* hint set.
* **force**: Try to compress data no matter what. * **force**: Try to compress data no matter what.
For more information about the *compressible* and *incompressible* IO For more information about the *compressible* and *incompressible* I/O hints,
hints, see :c:func:`rados_set_alloc_hint`. see :c:func:`rados_set_alloc_hint`.
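As noted later in this section, these settings can also be applied per pool. A sketch (using a hypothetical pool name) of setting the compression mode and algorithm for one pool:
.. prompt:: bash $
ceph osd pool set testpool compression_mode aggressive
ceph osd pool set testpool compression_algorithm snappy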
Note that regardless of the mode, if the size of the data chunk is not Note that data in Bluestore will be compressed only if the data chunk will be
reduced sufficiently it will not be used and the original sufficiently reduced in size (as determined by the ``bluestore compression
(uncompressed) data will be stored. For example, if the ``bluestore required ratio`` setting). No matter which compression modes have been used, if
compression required ratio`` is set to ``.7`` then the compressed data the data chunk is too big, then it will be discarded and the original
must be 70% of the size of the original (or smaller). (uncompressed) data will be stored instead. For example, if ``bluestore
compression required ratio`` is set to ``.7``, then data compression will take
place only if the size of the compressed data is no more than 70% of the size
of the original data.
The *compression mode*, *compression algorithm*, *compression required The *compression mode*, *compression algorithm*, *compression required ratio*,
ratio*, *min blob size*, and *max blob size* can be set either via a *min blob size*, and *max blob size* settings can be specified either via a
per-pool property or a global config option. Pool properties can be per-pool property or via a global config option. To specify pool properties,
set with: run the following commands:
.. prompt:: bash $ .. prompt:: bash $
@ -318,27 +335,30 @@ set with:
RocksDB Sharding RocksDB Sharding
================ ================
Internally BlueStore uses multiple types of key-value data, BlueStore maintains several types of internal key-value data, all of which are
stored in RocksDB. Each data type in BlueStore is assigned a stored in RocksDB. Each data type in BlueStore is assigned a unique prefix.
unique prefix. Until Pacific all key-value data was stored in Prior to the Pacific release, all key-value data was stored in a single RocksDB
single RocksDB column family: 'default'. Since Pacific, column family: 'default'. In Pacific and later releases, however, BlueStore can
BlueStore can divide this data into multiple RocksDB column divide key-value data into several RocksDB column families. BlueStore achieves
families. When keys have similar access frequency, modification better caching and more precise compaction when keys are similar: specifically,
frequency and lifetime, BlueStore benefits from better caching when keys have similar access frequency, similar modification frequency, and a
and more precise compaction. This improves performance, and also similar lifetime. Under such conditions, performance is improved and less disk
requires less disk space during compaction, since each column space is required during compaction (because each column family is smaller and
family is smaller and can compact independent of others. is able to compact independently of the others).
OSDs deployed in Pacific or later use RocksDB sharding by default. OSDs deployed in Pacific or later releases use RocksDB sharding by default.
If Ceph is upgraded to Pacific from a previous version, sharding is off. However, if Ceph has been upgraded to Pacific or a later version from a
previous version, sharding is disabled on any OSDs that were created before
Pacific.
To enable sharding and apply the Pacific defaults, stop an OSD and run To enable sharding and apply the Pacific defaults to a specific OSD, stop the
OSD and run the following command:
.. prompt:: bash # .. prompt:: bash #
ceph-bluestore-tool \ ceph-bluestore-tool \
--path <data path> \ --path <data path> \
--sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \ --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \
reshard reshard
.. confval:: bluestore_rocksdb_cf .. confval:: bluestore_rocksdb_cf
@ -354,165 +374,179 @@ Throttling
.. confval:: bluestore_throttle_cost_per_io_ssd .. confval:: bluestore_throttle_cost_per_io_ssd
SPDK Usage SPDK Usage
================== ==========
If you want to use the SPDK driver for NVMe devices, you must prepare your system. To use the SPDK driver for NVMe devices, you must first prepare your system.
Refer to `SPDK document`__ for more details. See `SPDK document`__.
.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples .. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples
SPDK offers a script to configure the device automatically. Users can run the SPDK offers a script that will configure the device automatically. Run this
script as root: script with root permissions:
.. prompt:: bash $ .. prompt:: bash $
sudo src/spdk/scripts/setup.sh sudo src/spdk/scripts/setup.sh
You will need to specify the subject NVMe device's device selector with You will need to specify the subject NVMe device's device selector with the
the "spdk:" prefix for ``bluestore_block_path``. "spdk:" prefix for ``bluestore_block_path``.
For example, you can find the device selector of an Intel PCIe SSD with: In the following example, you first find the device selector of an Intel NVMe
SSD by running the following command:
.. prompt:: bash $ .. prompt:: bash $
lspci -mm -n -D -d 8086:0953 lspci -mm -n -D -d 8086:0953
The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``. The form of the device selector is either ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``.
and then set:: Next, supposing that ``0000:01:00.0`` is the device selector found in the
output of the ``lspci`` command, you can specify the device selector by running
the following command::
bluestore_block_path = spdk:0000:01:00.0 bluestore_block_path = spdk:0000:01:00.0
Where ``0000:01:00.0`` is the device selector found in the output of ``lspci`` You may also specify a remote NVMeoF target over the TCP transport, as in the
command above. following example::
To run multiple SPDK instances per node, you must specify the bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"
amount of dpdk memory in MB that each instance will use, to make sure each
instance uses its own dpdk memory To run multiple SPDK instances per node, you must make sure each instance uses
its own DPDK memory by specifying for each instance the amount of DPDK memory
(in MB) that the instance will use.
In most cases, a single device can be used for data, DB, and WAL. We describe In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all IOs are issued through SPDK.:: settings to ensure that all I/Os are issued through SPDK::
bluestore_block_db_path = "" bluestore_block_db_path = ""
bluestore_block_db_size = 0 bluestore_block_db_size = 0
bluestore_block_wal_path = "" bluestore_block_wal_path = ""
bluestore_block_wal_size = 0 bluestore_block_wal_size = 0
Otherwise, the current implementation will populate the SPDK map files with If these settings are not entered, then the current implementation will
kernel file system symbols and will use the kernel driver to issue DB/WAL IO. populate the SPDK map files with kernel file system symbols and will use the
kernel driver to issue DB/WAL I/Os.
Minimum Allocation Size
=======================

There is a configured minimum amount of storage that BlueStore allocates on an
underlying storage device. In practice, this is the least amount of capacity
that even a tiny RADOS object can consume on each OSD's primary device. The
value of the configuration option in question, :confval:`bluestore_min_alloc_size`,
is derived from the value of either :confval:`bluestore_min_alloc_size_hdd` or
:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs
(including NVMe devices), BlueStore is initialized with the current value of
:confval:`bluestore_min_alloc_size_ssd`.

In Mimic and earlier releases, the default values were 64KB for rotational
media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
changed the default value for non-rotational media (SSD) to 4KB, and the
Pacific release changed the default value for rotational media (HDD) to 4KB.

These changes were driven by space amplification that was experienced by Ceph
RADOS Gateway (RGW) deployments that hosted large numbers of small files
(S3/Swift objects).
For example, when an RGW client stores a 1 KB S3 object, that object is written
to a single RADOS object. In accordance with the default
:confval:`min_alloc_size` value, 4 KB of underlying drive space is allocated.
This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated but never
used: this corresponds to 300% overhead or 25% efficiency. Similarly, a 5 KB
user object will be stored as two RADOS objects, a 4 KB RADOS object and a 1 KB
RADOS object, with the result that 4 KB of device capacity is stranded. In this
case, however, the overhead percentage is much smaller. Think of this in terms
of the remainder from a modulus operation. The overhead *percentage* thus
decreases rapidly as object size increases.

There is an additional subtlety that is easily missed: the amplification
phenomenon just described takes place for *each* replica. For example, when
using the default of three copies of data (3R), a 1 KB S3 object actually
consumes roughly 9 KB of storage device capacity. If erasure coding (EC) is
used instead of replication, the amplification might be even higher: for a
``k=4,m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied
by 6) of device capacity.

When an RGW bucket pool contains many relatively large user objects, the effect
of this phenomenon is often negligible. However, with deployments that can
expect a significant fraction of relatively small user objects, the effect
should be taken into consideration.
The 4KB default value aligns well with conventional HDD and SSD devices.
However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear
best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation
to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel
storage drives can achieve read performance that is competitive with that of
conventional TLC SSDs and write performance that is faster than that of HDDs,
with higher density and lower cost than TLC SSDs.

Note that when creating OSDs on these novel devices, one must be careful to
apply the non-default value only to appropriate devices, and not to
conventional HDD and SSD devices. Error can be avoided through careful ordering
of OSD creation, with custom OSD device classes, and especially by the use of
central configuration *masks*.
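As a sketch of the mask-based approach, a central configuration setting scoped
to a single host can confine the non-default value to the OSDs you are about to
create on that host. The hostname and the IU value shown here are hypothetical:

.. prompt:: bash #

   ceph config set osd/host:qlc-node-01 bluestore_min_alloc_size_ssd 16384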
In Quincy and later releases, you can use the
:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow
automatic discovery of the correct value as each OSD is created. Note that the
use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, ``iSCSI``,
or other device-layering and abstraction technologies might confound the
determination of correct values. Moreover, OSDs deployed on top of VMware
storage have sometimes been found to report a ``rotational`` attribute that
does not match the underlying hardware.

We suggest inspecting such OSDs at startup via logs and admin sockets in order
to ensure that their behavior is correct. Be aware that this kind of inspection
might not work as expected with older kernels. To check for this issue,
examine the presence and value of ``/sys/block/<drive>/queue/optimal_io_size``.
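For instance, the value can be read directly from sysfs. The device name
``sda`` below is a placeholder for the backing device of the OSD in question:

.. prompt:: bash #

   cat /sys/block/sda/queue/optimal_io_size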
.. note:: When running Reef or a later Ceph release, the ``min_alloc_size``
   baked into each OSD is conveniently reported by ``ceph osd metadata``.

To inspect a specific OSD, run the following command:

.. prompt:: bash #

   ceph osd metadata osd.1701 | grep -E 'rotational|alloc'
This space amplification might manifest as an unusually high ratio of raw to
stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR``
values reported by ``ceph osd df`` that are unusually high in comparison to
other, ostensibly identical, OSDs. Finally, there might be unexpected balancer
behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values.
This BlueStore attribute takes effect *only* at OSD creation; if the attribute
is changed later, a specific OSD's behavior will not change unless and until
the OSD is destroyed and redeployed with the appropriate option value(s).
Upgrading to a later Ceph release will *not* change the value used by OSDs that
were deployed under older releases or with other settings.
.. confval:: bluestore_min_alloc_size
.. confval:: bluestore_min_alloc_size_hdd
.. confval:: bluestore_min_alloc_size_ssd
.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size
DSA (Data Streaming Accelerator) Usage
======================================

If you want to use the DML library to drive the DSA device for offloading
read/write operations on persistent memory (PMEM) in BlueStore, you need to
install `DML`_ and the `idxd-config`_ library. This will work only on machines
that have an SPR (Sapphire Rapids) CPU.

.. _DML: https://github.com/intel/DML
.. _idxd-config: https://github.com/intel/idxd-config

After installing the DML software, configure the shared work queues (WQs) with
reference to the following WQ configuration example:

.. prompt:: bash $

   accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
   accel-config config-engine dsa0/engine0.1 --group-id=1
   accel-config enable-device dsa0
   accel-config enable-wq dsa0/wq0.1
@ -4,116 +4,116 @@
Configuring Ceph
================

When Ceph services start, the initialization process activates a series of
daemons that run in the background. A :term:`Ceph Storage Cluster` runs at
least three types of daemons:

- :term:`Ceph Monitor` (``ceph-mon``)
- :term:`Ceph Manager` (``ceph-mgr``)
- :term:`Ceph OSD Daemon` (``ceph-osd``)

Ceph Storage Clusters that support the :term:`Ceph File System` also run at
least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that support
:term:`Ceph Object Storage` run Ceph RADOS Gateway daemons (``radosgw``).

Each daemon has a number of configuration options, each of which has a default
value. You may adjust the behavior of the system by changing these
configuration options. Be careful to understand the consequences before
overriding default values, as it is possible to significantly degrade the
performance and stability of your cluster. Note too that default values
sometimes change between releases. For this reason, it is best to review the
version of this documentation that applies to your Ceph release.
Option names
============

Each of the Ceph configuration options has a unique name that consists of words
formed with lowercase characters and connected with underscore characters
(``_``).

When option names are specified on the command line, underscore (``_``) and
dash (``-``) characters can be used interchangeably (for example,
``--mon-host`` is equivalent to ``--mon_host``).

When option names appear in configuration files, spaces can also be used in
place of underscores or dashes. However, for the sake of clarity and
convenience, we suggest that you consistently use underscores, as we do
throughout this documentation.
Config sources
==============

Each Ceph daemon, process, and library pulls its configuration from one or more
of the several sources listed below. Sources that occur later in the list
override those that occur earlier in the list (when both are present).

- the compiled-in default value
- the monitor cluster's centralized configuration database
- a configuration file stored on the local host
- environment variables
- command-line arguments
- runtime overrides that are set by an administrator

One of the first things a Ceph process does on startup is parse the
configuration options provided via the command line, via the environment, and
via the local configuration file. Next, the process contacts the monitor
cluster to retrieve centrally-stored configuration for the entire cluster.
After a complete view of the configuration is available, the startup of the
daemon or process will commence.
.. _bootstrap-options:

Bootstrap options
-----------------

Bootstrap options are configuration options that affect the process's ability
to contact the monitors, to authenticate, and to retrieve the cluster-stored
configuration. For this reason, these options might need to be stored locally
on the node, and set by means of a local configuration file. These options
include the following:

.. confval:: mon_host
.. confval:: mon_host_override

- :confval:`mon_dns_srv_name`
- :confval:`mon_data`, :confval:`osd_data`, :confval:`mds_data`,
  :confval:`mgr_data`, and similar options that define which local directory
  the daemon stores its data in.
- :confval:`keyring`, :confval:`keyfile`, and/or :confval:`key`, which can be
  used to specify the authentication credential to use to authenticate with the
  monitor. Note that in most cases the default keyring location is in the data
  directory specified above.

In most cases, there is no reason to modify the default values of these
options. However, there is one exception to this: the :confval:`mon_host`
option that identifies the addresses of the cluster's monitors. But when
:ref:`DNS is used to identify monitors<mon-dns-lookup>`, a local Ceph
configuration file can be avoided entirely.
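As a minimal sketch (the addresses shown are placeholders), a local bootstrap
configuration file often needs nothing more than the monitor addresses:

.. code-block:: ini

   [global]
   # Placeholder addresses; list your own cluster's monitors here.
   mon_host = 10.0.0.1,10.0.0.2,10.0.0.3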
Skipping monitor config
-----------------------

The option ``--no-mon-config`` can be passed in any command in order to skip
the step that retrieves configuration information from the cluster's monitors.
Skipping this retrieval step can be useful in cases where configuration is
managed entirely via configuration files, or when maintenance activity needs to
be done but the monitor cluster is down.
.. _ceph-conf-file:

Configuration sections
======================

Each of the configuration options associated with a single process or daemon
has a single value. However, the values for a configuration option can vary
across daemon types, and can vary even across different daemons of the same
type. Ceph options that are stored in the monitor configuration database or in
local configuration files are grouped into sections |---| so-called
"configuration sections" |---| to indicate which daemons or clients they apply
to.

These sections include the following:
.. confsec:: global

@ -156,41 +156,40 @@

.. confsec:: client

   Settings under ``client`` affect all Ceph clients
   (for example, mounted Ceph File Systems, mounted Ceph Block Devices)
   as well as RADOS Gateway (RGW) daemons.

   :example: ``objecter_inflight_ops = 512``
Configuration sections can also specify an individual daemon or client name.
For example, ``mon.foo``, ``osd.123``, and ``client.smith`` are all valid
section names.

Any given daemon will draw its settings from the global section, the daemon- or
client-type section, and the section sharing its name. Settings in the
most-specific section take precedence: for example, if the same option is
specified in :confsec:`global`, :confsec:`mon`, and ``mon.foo`` in the same
source (that is, in the same configuration file), the ``mon.foo`` setting will
be used.

If multiple values of the same configuration option are specified in the same
section, the last value specified takes precedence.

Note that values from the local configuration file always take precedence over
values from the monitor configuration database, regardless of the section in
which they appear.
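For instance, given the following (hypothetical) configuration file, the daemon
``mon.foo`` would run with ``debug_ms = 10``, other monitors with
``debug_ms = 5``, and all other daemons and clients with ``debug_ms = 0``:

.. code-block:: ini

   [global]
   debug_ms = 0

   [mon]
   debug_ms = 5

   [mon.foo]
   debug_ms = 10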
.. _ceph-metavariables:

Metavariables
=============

Metavariables dramatically simplify Ceph storage cluster configuration. When a
metavariable is set in a configuration value, Ceph expands the metavariable at
the time the configuration value is used. In this way, Ceph metavariables
behave similarly to the way that variable expansion works in the Bash shell.

Ceph supports the following metavariables:
@ -204,7 +203,7 @@

.. describe:: $type

   Expands to a daemon or process type (for example, ``mds``, ``osd``, or ``mon``)

   :example: ``/var/lib/ceph/$type``
@ -233,31 +232,30 @@

   :example: ``/var/run/ceph/$cluster-$name-$pid.asok``
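To sketch how such expansion works in practice (the option and path shown here
are illustrative), an admin-socket path written with metavariables resolves
differently for each daemon:

.. code-block:: ini

   [global]
   # For daemon osd.0 in a cluster named "ceph", this expands to
   # /var/run/ceph/ceph-osd.0.asok
   admin_socket = /var/run/ceph/$cluster-$name.asok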
Ceph configuration file
=======================

On startup, Ceph processes search for a configuration file in the
following locations:

#. ``$CEPH_CONF`` (that is, the path following the ``$CEPH_CONF``
   environment variable)
#. ``-c path/path`` (that is, the ``-c`` command line argument)
#. ``/etc/ceph/$cluster.conf``
#. ``~/.ceph/$cluster.conf``
#. ``./$cluster.conf`` (that is, in the current working directory)
#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf``

Here ``$cluster`` is the cluster's name (default: ``ceph``).
The Ceph configuration file uses an ``ini`` style syntax. You can add "comment
text" after a pound sign (#) or a semicolon (;). For example:

.. code-block:: ini

   # <--A number sign (#) precedes a comment.
   ; A comment may be anything.
   # Comments always follow a semicolon (;) or a pound sign (#) on each line.
   # The end of the line terminates a comment.
   # We recommend that you provide comments in your configuration file(s).
@ -268,8 +266,8 @@

Config file section names
-------------------------

The configuration file is divided into sections. Each section must begin with a
valid configuration section name (see `Configuration sections`_, above) that is
surrounded by square brackets. For example:

.. code-block:: ini

@ -285,23 +283,24 @@

   [osd.2]
   debug_ms = 10
Config file option values
-------------------------

The value of a configuration option is a string. If the string is too long to
fit on a single line, you can put a backslash (``\``) at the end of the line
and the backslash will act as a line continuation marker. In such a case, the
value of the option will be the string after ``=`` in the current line,
combined with the string in the next line. Here is an example::

   [global]
   foo = long long ago\
   long ago

In this example, the value of the "``foo``" option is "``long long ago long
ago``".
An option value typically ends with either a newline or a comment. For
example:

.. code-block:: ini

   obscure_one = difficult to explain # I will try harder in next release
   simpler_one = nothing to explain

In this example, the value of the "``obscure_one``" option is "``difficult to
explain``" and the value of the "``simpler_one``" option is "``nothing to
explain``".

When an option value contains spaces, it can be enclosed within single quotes
or double quotes in order to make its scope clear and in order to make sure
that the first space in the value is not interpreted as the end of the value.
For example:
.. code-block:: ini

   [global]
   line = "to be, or not to be"

In option values, there are four characters that are treated as escape
characters: ``=``, ``#``, ``;`` and ``[``. They are permitted to occur in an
option value only if they are immediately preceded by the backslash character
(``\``). For example:

.. code-block:: ini

   [global]
   secret = "i love \# and \["
Each configuration option falls under one of the following types:

.. describe:: int

   64-bit signed integer. Some SI suffixes are supported, such as "K", "M",
   "G", "T", "P", and "E" (meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`,
   10\ :sup:`9`, etc.). "B" is the only supported unit string. Thus "1K", "1M",
   "128B" and "-1" are all valid option values. When a negative value is
   assigned to a threshold option, this can indicate that the option is
   "unlimited", that is, that there is no threshold or limit in effect.

   :example: ``42``, ``-1``

.. describe:: uint

   This differs from ``integer`` only in that negative values are not
   permitted.

   :example: ``256``, ``0``

.. describe:: str

   A string encoded in UTF-8. Certain characters are not permitted. Reference
   the above notes for the details.

   :example: ``"hello world"``, ``"i love \#"``, ``yet-another-name``

.. describe:: boolean

   Typically either of the two values ``true`` or ``false``. However, any
   integer is permitted: "0" implies ``false``, and any non-zero value implies
   ``true``.

   :example: ``true``, ``false``, ``1``, ``0``

.. describe:: addr

   A single address, optionally prefixed with ``v1``, ``v2`` or ``any`` for the
   messenger protocol. If no prefix is specified, the ``v2`` protocol is used.
   For more details, see :ref:`address_formats`.

   :example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789``

.. describe:: addrvec

   A set of addresses separated by ",". The addresses can be optionally quoted
   with ``[`` and ``]``.

   :example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567`` ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]``

.. describe:: uuid

   The string format of a uuid defined by `RFC4122
   <https://www.ietf.org/rfc/rfc4122.txt>`_. Certain variants are also
   supported: for more details, see `Boost document
   <https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_.

   :example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6``

.. describe:: size

   64-bit unsigned integer. Both SI prefixes and IEC prefixes are supported.
   "B" is the only supported unit string. Negative values are not permitted.

   :example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``.

.. describe:: secs

   Denotes a duration of time. The default unit of time is the second.
   The following units of time are supported:

   * second: ``s``, ``sec``, ``second``, ``seconds``
   * minute: ``m``, ``min``, ``minute``, ``minutes``
   * hour: ``hs``, ``hr``, ``hour``, ``hours``
   * day: ``d``, ``day``, ``days``
   * week: ``w``, ``wk``, ``week``, ``weeks``
   * month: ``mo``, ``month``, ``months``
   * year: ``y``, ``yr``, ``year``, ``years``

   :example: ``1 m``, ``1m`` and ``1 week``
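As a brief, illustrative sketch (the options shown are real but arbitrarily
chosen), several of these value types can appear together in a configuration
file like this:

.. code-block:: ini

   [osd]
   # size: SI and IEC suffixes are accepted (4Gi = 4 * 2^30 bytes)
   osd_memory_target = 4Gi
   # boolean: "true"/"false" or 0/1 are accepted
   bluestore_default_buffered_read = true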
@ -411,93 +418,94 @@

Monitor configuration database
==============================

The monitor cluster manages a database of configuration options that can be
consumed by the entire cluster. This allows for streamlined central
configuration management of the entire system. For ease of administration and
transparency, the vast majority of configuration options can and should be
stored in this database.

Some settings might need to be stored in local configuration files because they
affect the ability of the process to connect to the monitors, to authenticate,
and to fetch configuration information. In most cases this applies only to the
``mon_host`` option. This issue can be avoided by using :ref:`DNS SRV
records<mon-dns-lookup>`.
Sections and masks
------------------

Configuration options stored by the monitor can be stored in a global section,
in a daemon-type section, or in a specific daemon section. In this, they are
no different from the options in a configuration file.

In addition, options may have a *mask* associated with them to further restrict
which daemons or clients the option applies to. Masks take two forms:

#. ``type:location`` where ``type`` is a CRUSH property like ``rack`` or
   ``host``, and ``location`` is a value for that property. For example,
   ``host:foo`` would limit the option only to daemons or clients
   running on a particular host.

#. ``class:device-class`` where ``device-class`` is the name of a CRUSH
   device class (for example, ``hdd`` or ``ssd``). For example,
   ``class:ssd`` would limit the option only to OSDs backed by SSDs.
   (This mask has no effect on non-OSD daemons or clients.)

In commands that specify a configuration option, the argument of the option (in
the following examples, this is the "who" string) may be a section name, a
mask, or a combination of both separated by a slash character (``/``). For
example, ``osd/rack:foo`` would refer to all OSD daemons in the ``foo`` rack.

When configuration options are shown, the section name and mask are presented
in separate fields or columns to make them more readable.
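For example (the option values here are purely illustrative), the same option
can be set for all SSD-backed OSDs, or for all OSDs in one rack:

.. prompt:: bash $

   ceph config set osd/class:ssd osd_max_backfills 3
   ceph config set osd/rack:foo osd_max_backfills 2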
Commands
--------

The following CLI commands are used to configure the cluster:

* ``ceph config dump`` dumps the entire monitor configuration
  database for the cluster.

* ``ceph config get <who>`` dumps the configuration options stored in
  the monitor configuration database for a specific daemon or client
  (for example, ``mds.a``).

* ``ceph config get <who> <option>`` shows either a configuration value
  stored in the monitor configuration database for a specific daemon or client
  (for example, ``mds.a``), or, if that value is not present in the monitor
  configuration database, the compiled-in default value.

* ``ceph config set <who> <option> <value>`` specifies a configuration
  option in the monitor configuration database.

* ``ceph config show <who>`` shows the configuration for a running daemon.
  These settings might differ from those stored by the monitors if there are
  also local configuration files in use or if options have been overridden on
  the command line or at run time. The source of the values of the options is
  displayed in the output.

* ``ceph config assimilate-conf -i <input file> -o <output file>`` ingests a
  configuration file from *input file* and moves any valid options into the
  monitor configuration database. Any settings that are unrecognized, are
  invalid, or cannot be controlled by the monitor will be returned in an
  abbreviated configuration file stored in *output file*. This command is
  useful for transitioning from legacy configuration files to centralized
  monitor-based configuration.

Note that ``ceph config set <who> <option> <value>`` and ``ceph config get
<who> <option>`` will not necessarily return the same values. The latter
command will show compiled-in default values. In order to determine whether a
configuration option is present in the monitor configuration database, run
``ceph config dump``.
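The following short sequence (the daemon name and value are illustrative) shows
how these commands are typically combined:

.. prompt:: bash $

   ceph config set osd.0 debug_osd 10
   ceph config get osd.0 debug_osd
   ceph config dump | grep debug_osd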
Help
====

To get help for a particular option, run the following command:

.. prompt:: bash $

   ceph config help <option>

For example:

.. prompt:: bash $

@ -543,20 +551,29 @@

       "can_update_at_runtime": false
   }
The ``level`` property can be ``basic``, ``advanced``, or ``dev``. The ``dev``
options are intended for use by developers, generally for testing purposes, and
are not recommended for use by operators.

.. note:: This command uses the configuration schema that is compiled into the
   running monitors. If you have a mixed-version cluster (as might exist, for
   example, during an upgrade), you might want to query the option schema from
   a specific running daemon by running a command of the following form:

   .. prompt:: bash $

      ceph daemon <name> config help [option]
Runtime Changes
===============

In most cases, Ceph permits changes to the configuration of a daemon at
run time. This can be used for increasing or decreasing the amount of logging
output, for enabling or disabling debug settings, and for runtime optimization.

Use the ``ceph config set`` command to update configuration options. For
example, to enable the most verbose debug log level on a specific OSD, run a
command of the following form:

.. prompt:: bash $

@ -565,16 +582,18 @@

.. note:: If an option has been customized in a local configuration file, the
   `central config
   <https://ceph.io/en/news/blog/2018/new-mimic-centralized-configuration-management/>`_
   setting will be ignored because it has a lower priority than the local
   configuration file.

.. note:: Log levels range from 0 to 20.
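For instance (the OSD id here is arbitrary), you might raise the debug level
for a single OSD and later remove the setting from the monitor configuration
database again:

.. prompt:: bash $

   ceph config set osd.0 debug_osd 20
   ceph config rm osd.0 debug_osd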
Override values
---------------

Options can be set temporarily by using the ``tell`` or ``daemon`` interfaces
on the Ceph CLI. These *override* values are ephemeral, which means that they
affect only the current instance of the daemon and revert to persistently
configured values when the daemon restarts.

Override values can be set in two ways:

@ -593,14 +612,14 @@

   The ``tell`` command can also accept a wildcard as the daemon identifier.
   For example, to adjust the debug level on all OSD daemons, run a command of
   the following form:

   .. prompt:: bash $

      ceph tell osd.* config set debug_osd 20

#. On the host where the daemon is running, connect to the daemon via a socket
   in ``/var/run/ceph`` by running a command of the following form:

   .. prompt:: bash $

@ -613,81 +632,83 @@

      ceph daemon osd.4 config set debug_osd 20

.. note:: In the output of the ``ceph config show`` command, these temporary
   values are shown to have a source of ``override``.
Viewing runtime settings
========================

You can see the current settings specified for a running daemon with the
``ceph config show`` command. For example, to see the (non-default) settings
for the daemon ``osd.0``, run the following command:

.. prompt:: bash $

   ceph config show osd.0

To see a specific setting, run the following command:

.. prompt:: bash $

   ceph config show osd.0 debug_osd

To see all settings (including those with default values), run the following
command:

.. prompt:: bash $

   ceph config show-with-defaults osd.0

You can see all settings for a daemon that is currently running by connecting
to it on the local host via the admin socket. For example, to dump all
current settings, run the following command:

.. prompt:: bash $

   ceph daemon osd.0 config show

To see non-default settings and to see where each value came from (for example,
a config file, the monitor, or an override), run the following command:

.. prompt:: bash $

   ceph daemon osd.0 config diff

To see the value of a single setting, run the following command:

.. prompt:: bash $

   ceph daemon osd.0 config get debug_osd
Changes since Nautilus
======================

With the Octopus release, we changed the way the configuration file is parsed.
These changes are as follows:

- Repeated configuration options are allowed, and no warnings will be
  displayed. This means that the setting that comes last in the file is the one
  that takes effect. Prior to this change, Ceph displayed warning messages when
  lines containing duplicate options were encountered, such as::

    warning line 42: 'foo' in section 'bar' redefined

- Prior to Octopus, options containing invalid UTF-8 characters were ignored
  with warning messages. But in Octopus, they are treated as fatal errors.

- The backslash character ``\`` is used as the line-continuation marker that
  combines the next line with the current one. Prior to Octopus, there was a
  requirement that any end-of-line backslash be followed by a non-empty line.
  But in Octopus, an empty line following a backslash is allowed.

- In the configuration file, each line specifies an individual configuration
  option. The option's name and its value are separated with ``=``, and the
  value may be enclosed within single or double quotes. If an invalid
  configuration is specified, we will treat it as an invalid configuration
  file::

    bad option ==== bad value

- Prior to Octopus, if no section name was specified in the configuration file,
  all options would be set as though they were within the :confsec:`global`
  section. This approach is discouraged. Since Octopus, any configuration
  file that has no section name must contain only a single option.

.. |---| unicode:: U+2014 .. EM DASH
   :trim:
@ -1,4 +1,3 @@
.. _ceph-conf-common-settings:

Common Settings
===============
The `Hardware Recommendations`_ section provides some hardware guidelines for
configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph
Node` to run multiple daemons. For example, a single node with multiple drives
usually runs one ``ceph-osd`` for each drive. Ideally, each node will be
assigned to a particular type of process. For example, some nodes might run
``ceph-osd`` daemons, other nodes might run ``ceph-mds`` daemons, and still
other nodes might run ``ceph-mon`` daemons.

Each node has a name. The name of a node can be found in its ``host`` setting.
Monitors also specify a network address and port (that is, a domain name or IP
address) that can be found in the ``addr`` setting. A basic configuration file
typically specifies only minimal settings for each instance of monitor daemons.
For example:
.. code-block:: ini

@ -23,14 +26,13 @@

   mon_initial_members = ceph1
   mon_host = 10.0.0.1
.. important:: The ``host`` setting's value is the short name of the node. It
   is not an FQDN. It is **NOT** an IP address. To retrieve the name of the
   node, enter ``hostname -s`` on the command line. Unless you are deploying
   Ceph manually, do not use ``host`` settings for anything other than initial
   monitor setup. **DO NOT** specify the ``host`` setting under individual
   daemons when using deployment tools like ``chef`` or ``cephadm``. Such tools
   are designed to enter the appropriate values for you in the cluster map.
.. _ceph-network-config:

@ -38,32 +40,33 @@

Networks
========

For more about configuring a network for use with Ceph, see the `Network
Configuration Reference`_.
Monitors
========

Ceph production clusters typically provision at least three :term:`Ceph
Monitor` daemons to ensure availability in the event of a monitor instance
crash. A minimum of three :term:`Ceph Monitor` daemons ensures that the Paxos
algorithm is able to determine which version of the :term:`Ceph Cluster Map` is
the most recent. It makes this determination by consulting a majority of Ceph
Monitors in the quorum.

.. note:: You may deploy Ceph with a single monitor, but if the instance fails,
   the lack of other monitors might interrupt data-service availability.

Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and on
port ``6789`` for the old v1 protocol.
By default, Ceph expects to store monitor data on the following path::

   /var/lib/ceph/mon/$cluster-$id

You or a deployment tool (for example, ``cephadm``) must create the
corresponding directory. With metavariables fully expressed and a cluster named
"ceph", the path specified in the above example evaluates to::

   /var/lib/ceph/mon/ceph-a
@@ -74,14 +77,13 @@ For additional details, see the `Monitor Config Reference`_.

.. _ceph-osd-config:

Authentication
==============

.. versionadded:: Bobtail 0.56

Authentication is explicitly enabled or disabled in the ``[global]`` section of
the Ceph configuration file, as shown here:

.. code-block:: ini

@@ -89,7 +91,8 @@ authentication in the ``[global]`` section of your Ceph configuration file.

    auth_service_required = cephx
    auth_client_required = cephx

In addition, you should enable message signing. For details, see `Cephx Config
Reference`_.
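One way to do that, assuming every daemon and client in the cluster is recent
enough to support signatures, is to require them explicitly. This is only a
sketch; verify compatibility before enforcing it on an existing cluster:

.. code-block:: ini

   [global]
   cephx_require_signatures = true
   cephx_cluster_require_signatures = true
   cephx_service_require_signatures = true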
.. _Cephx Config Reference: ../auth-config-ref

@@ -100,9 +103,10 @@ Additionally, you should enable message signing. See `Cephx Config Reference`_ f

OSDs
====

When Ceph production clusters deploy :term:`Ceph OSD Daemons`, the typical
arrangement is that one node has one OSD daemon running Filestore on one
storage device. BlueStore is now the default back end, but when using Filestore
you must specify a journal size. For example:

.. code-block:: ini

@@ -113,52 +117,54 @@ end is now default, but when using Filestore you specify a journal size. For exa

    host = {hostname} #manual deployments only.
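For reference, a fuller sketch of such a section with an explicit Filestore
journal size might look like the following. The 10 GB figure and the section
names are illustrative only:

.. code-block:: ini

   [osd]
   osd_journal_size = 10240

   [osd.0]
   host = {hostname}   # manual deployments only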
By default, Ceph expects to store a Ceph OSD Daemon's data on the following
path::

    /var/lib/ceph/osd/$cluster-$id

You or a deployment tool (for example, ``cephadm``) must create the
corresponding directory. With metavariables fully expressed and a cluster named
"ceph", the path specified in the above example evaluates to::

    /var/lib/ceph/osd/ceph-0

You can override this path using the ``osd_data`` setting. We recommend that
you do not change the default location. To create the default directory on your
OSD host, run the following commands:

.. prompt:: bash $

   ssh {osd-host}
   sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}

The ``osd_data`` path ought to lead to a mount point that has mounted on it a
device that is distinct from the device that contains the operating system and
the daemons. To use a device distinct from the device that contains the
operating system and the daemons, prepare it for use with Ceph and mount it on
the directory you just created by running the following commands:

.. prompt:: bash $

   ssh {new-osd-host}
   sudo mkfs -t {fstype} /dev/{disk}
   sudo mount -o user_xattr /dev/{disk} /var/lib/ceph/osd/ceph-{osd-number}

We recommend using the ``xfs`` file system when running :command:`mkfs`. (The
``btrfs`` and ``ext4`` file systems are not recommended and are no longer
tested.)
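As a concrete illustration (the hostname, device name, and OSD number below are
hypothetical), preparing an XFS data device for ``osd.0`` could look like this.
XFS supports extended attributes by default, so no special mount option is
needed in that case:

.. prompt:: bash $

   ssh osd-host-1
   sudo mkfs -t xfs /dev/sdb
   sudo mount /dev/sdb /var/lib/ceph/osd/ceph-0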
For additional configuration details, see `OSD Config Reference`_.

Heartbeats
==========

During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons
and report their findings to the Ceph Monitor. This process does not require
you to provide any settings. However, if you have network latency issues, you
might want to modify the default settings.

For additional details, see `Configuring Monitor/OSD Interaction`_.
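For example, on a network with occasional latency spikes, the heartbeat checks
can be relaxed slightly. The option names below are real, but the values are
only an illustration of the kind of adjustment involved:

.. code-block:: ini

   [osd]
   osd_heartbeat_interval = 12   # seconds between peer checks (default 6)
   osd_heartbeat_grace = 40      # seconds without a reply before reporting a peer down (default 20)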
.. _ceph-logging-and-debugging:

@@ -166,9 +172,9 @@ See `Configuring Monitor/OSD Interaction`_ for additional details.

Logs / Debugging
================

You might sometimes encounter issues with Ceph that require you to use Ceph's
logging and debugging features. For details on log rotation, see `Debugging and
Logging`_.
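For instance, to temporarily raise the OSD debug level on one daemon while
investigating a problem, and then drop the override again, a workflow along the
following lines can be used (illustrative only; running with elevated logging
for long periods is not recommended):

.. prompt:: bash $

   ceph config set osd.0 debug_osd 10/10
   # ... reproduce the issue and inspect the logs ...
   ceph config rm osd.0 debug_osd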
.. _Debugging and Logging: ../../troubleshooting/log-and-debug

@@ -186,32 +192,29 @@ Example ceph.conf

Naming Clusters (deprecated)
============================

Each Ceph cluster has an internal name. This internal name is used as part of
configuration, and as part of "log file" names as well as part of directory
names and as part of mountpoint names. This name defaults to "ceph". Previous
releases of Ceph allowed one to specify a custom name instead, for example
"ceph2". This option was intended to facilitate the running of multiple logical
clusters on the same physical hardware, but in practice it was rarely
exploited. Custom cluster names should no longer be attempted. Old
documentation might lead readers to wrongly think that unique cluster names are
required to use ``rbd-mirror``. They are not required.

Custom cluster names are now considered deprecated and the ability to deploy
them has already been removed from some tools, although existing custom-name
deployments continue to operate. The ability to run and manage clusters with
custom names might be progressively removed by future Ceph releases, so **it is
strongly recommended to deploy all new clusters with the default name "ceph"**.

Some Ceph CLI commands accept a ``--cluster`` (cluster name) option. This
option is present only for the sake of backward compatibility. New tools and
deployments cannot be relied upon to accommodate this option.

If you need to allow multiple clusters to exist on the same host, use
:ref:`cephadm`, which uses containers to fully isolate each cluster.

.. _Hardware Recommendations: ../../../start/hardware-recommendations
.. _Network Configuration Reference: ../network-config-ref
.. _OSD Config Reference: ../osd-config-ref
View File
@@ -2,8 +2,14 @@

Filestore Config Reference
============================

.. note:: Since the Luminous release of Ceph, Filestore has not been Ceph's
   default storage back end. Since the Luminous release of Ceph, BlueStore has
   been Ceph's default storage back end. However, Filestore OSDs are still
   supported. See :ref:`OSD Back Ends
   <rados_config_storage_devices_osd_backends>`. See :ref:`BlueStore Migration
   <rados_operations_bluestore_migration>` for instructions explaining how to
   replace an existing Filestore back end with a BlueStore back end.

``filestore debug omap check``

@@ -18,26 +24,31 @@ though Filestore OSDs are still supported.

Extended Attributes
===================

Extended Attributes (XATTRs) are important for Filestore OSDs. However, certain
disadvantages can occur when the underlying file system is used for the storage
of XATTRs: some file systems have limits on the number of bytes that can be
stored in XATTRs, and your file system might in some cases therefore run slower
than would an alternative method of storing XATTRs. For this reason, a method
of storing XATTRs extrinsic to the underlying file system might improve
performance. To implement such an extrinsic method, refer to the following
settings.

If the underlying file system has no size limit, then Ceph XATTRs are stored as
``inline xattr``, using the XATTRs provided by the file system. But if there is
a size limit (for example, ext4 imposes a limit of 4 KB total), then some Ceph
XATTRs will be stored in a key/value database when the limit is reached. More
precisely, this begins to occur when either the
``filestore_max_inline_xattr_size`` or ``filestore_max_inline_xattrs``
threshold is reached.
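If you need to cap inline XATTR usage explicitly rather than relying on the
per-file-system defaults, the limits can be set in the ``[osd]`` section of the
configuration file. The values below are illustrative only:

.. code-block:: ini

   [osd]
   # store at most 10 XATTRs of up to 64 KB each inline per object;
   # anything beyond these limits goes to the key/value database
   filestore_max_inline_xattrs = 10
   filestore_max_inline_xattr_size = 65536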
``filestore_max_inline_xattr_size``

:Description: Defines the maximum size per object of an XATTR that can be
              stored in the file system (for example, XFS, Btrfs, ext4). The
              specified size should not be larger than the file system can
              handle. Using the default value of 0 instructs Filestore to use
              the value specific to the file system.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``0``

@@ -45,8 +56,9 @@ threshold is reached.

``filestore_max_inline_xattr_size_xfs``

:Description: Defines the maximum size of an XATTR that can be stored in the
              XFS file system. This setting is used only if
              ``filestore_max_inline_xattr_size`` == 0.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``65536``

@@ -54,8 +66,9 @@ threshold is reached.

``filestore_max_inline_xattr_size_btrfs``

:Description: Defines the maximum size of an XATTR that can be stored in the
              Btrfs file system. This setting is used only if
              ``filestore_max_inline_xattr_size`` == 0.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``2048``

@@ -63,8 +76,8 @@ threshold is reached.

``filestore_max_inline_xattr_size_other``

:Description: Defines the maximum size of an XATTR that can be stored in other file systems.
              This setting is used only if ``filestore_max_inline_xattr_size`` == 0.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``512``

@@ -72,9 +85,8 @@ threshold is reached.

``filestore_max_inline_xattrs``

:Description: Defines the maximum number of XATTRs per object that can be stored in the file system.
              Using the default value of 0 instructs Filestore to use the value specific to the file system.
:Type: 32-bit Integer
:Required: No
:Default: ``0``

@@ -82,8 +94,8 @@ threshold is reached.

``filestore_max_inline_xattrs_xfs``

:Description: Defines the maximum number of XATTRs per object that can be stored in the XFS file system.
              This setting is used only if ``filestore_max_inline_xattrs`` == 0.
:Type: 32-bit Integer
:Required: No
:Default: ``10``

@@ -91,8 +103,8 @@ threshold is reached.

``filestore_max_inline_xattrs_btrfs``

:Description: Defines the maximum number of XATTRs per object that can be stored in the Btrfs file system.
              This setting is used only if ``filestore_max_inline_xattrs`` == 0.
:Type: 32-bit Integer
:Required: No
:Default: ``10``

@@ -100,8 +112,8 @@ threshold is reached.

``filestore_max_inline_xattrs_other``

:Description: Defines the maximum number of XATTRs per object that can be stored in other file systems.
              This setting is used only if ``filestore_max_inline_xattrs`` == 0.
:Type: 32-bit Integer
:Required: No
:Default: ``2``

@@ -111,18 +123,19 @@ threshold is reached.

Synchronization Intervals
=========================

Filestore must periodically quiesce writes and synchronize the file system.
Each synchronization creates a consistent commit point. When the commit point
is created, Filestore is able to free all journal entries up to that point.
More-frequent synchronization tends to reduce both synchronization time and
the amount of data that needs to remain in the journal. Less-frequent
synchronization allows the backing file system to coalesce small writes and
metadata updates, potentially increasing synchronization
efficiency but also potentially increasing tail latency.

``filestore_max_sync_interval``

:Description: Defines the maximum interval (in seconds) for synchronizing Filestore.
:Type: Double
:Required: No
:Default: ``5``

@@ -130,7 +143,7 @@ expense of potentially increasing tail latency.

``filestore_min_sync_interval``

:Description: Defines the minimum interval (in seconds) for synchronizing Filestore.
:Type: Double
:Required: No
:Default: ``.01``
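If less-frequent synchronization is acceptable for a given workload, the
maximum interval can be raised in the ``[osd]`` section. The value below is an
illustration, not a recommendation:

.. code-block:: ini

   [osd]
   # allow up to 10 seconds between syncs so the backing file system
   # can coalesce more small writes (illustrative value)
   filestore_max_sync_interval = 10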
@@ -142,14 +155,14 @@ Flusher

Flusher
=======

The Filestore flusher forces data from large writes to be written out using
``sync_file_range`` prior to the synchronization.
Ideally, this action reduces the cost of the eventual synchronization. In practice, however, disabling
'filestore_flusher' seems in some cases to improve performance.

``filestore_flusher``

:Description: Enables the Filestore flusher.
:Type: Boolean
:Required: No
:Default: ``false``

@@ -158,7 +171,7 @@ performance in some cases.

``filestore_flusher_max_fds``

:Description: Defines the maximum number of file descriptors for the flusher.
:Type: Integer
:Required: No
:Default: ``512``

@@ -176,7 +189,7 @@ performance in some cases.

``filestore_fsync_flushes_journal_data``

:Description: Flushes journal data during file-system synchronization.
:Type: Boolean
:Required: No
:Default: ``false``

@@ -187,11 +200,11 @@ performance in some cases.

Queue
=====

The following settings define limits on the size of the Filestore queue:

``filestore_queue_max_ops``

:Description: Defines the maximum number of in-progress operations that Filestore accepts before it blocks the queueing of any new operations.
:Type: Integer
:Required: No. Minimal impact on performance.
:Default: ``50``

@@ -199,23 +212,20 @@ The following settings provide limits on the size of the Filestore queue.

``filestore_queue_max_bytes``

:Description: Defines the maximum number of bytes permitted per operation.
:Type: Integer
:Required: No
:Default: ``100 << 20``
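On hardware that can absorb deeper queues, both limits can be raised together
in the ``[osd]`` section. The figures below are only an illustration:

.. code-block:: ini

   [osd]
   filestore_queue_max_ops = 500
   filestore_queue_max_bytes = 209715200   # 200 MB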
.. index:: filestore; timeouts

Timeouts
========

``filestore_op_threads``

:Description: Defines the number of file-system operation threads that execute in parallel.
:Type: Integer
:Required: No
:Default: ``2``

@@ -223,7 +233,7 @@ Timeouts

``filestore_op_thread_timeout``

:Description: Defines the timeout (in seconds) for a file-system operation thread.
:Type: Integer
:Required: No
:Default: ``60``

@@ -231,7 +241,7 @@ Timeouts

``filestore_op_thread_suicide_timeout``

:Description: Defines the timeout (in seconds) for a commit operation before the commit is cancelled.
:Type: Integer
:Required: No
:Default: ``180``

@@ -245,17 +255,17 @@ B-Tree Filesystem

``filestore_btrfs_snap``

:Description: Enables snapshots for a ``btrfs`` Filestore.
:Type: Boolean
:Required: No. Used only for ``btrfs``.
:Default: ``true``

``filestore_btrfs_clone_range``

:Description: Enables cloning ranges for a ``btrfs`` Filestore.
:Type: Boolean
:Required: No. Used only for ``btrfs``.
:Default: ``true``

@@ -267,7 +277,7 @@ Journal

``filestore_journal_parallel``

:Description: Enables parallel journaling, default for ``btrfs``.
:Type: Boolean
:Required: No
:Default: ``false``

@@ -275,7 +285,7 @@ Journal

``filestore_journal_writeahead``

:Description: Enables write-ahead journaling, default for XFS.
:Type: Boolean
:Required: No
:Default: ``false``

@@ -283,7 +293,7 @@ Journal

``filestore_journal_trailing``

:Description: Deprecated. **Never use.**
:Type: Boolean
:Required: No
:Default: ``false``

@@ -295,8 +305,8 @@ Misc

``filestore_merge_threshold``

:Description: Defines the minimum number of files permitted in a subdirectory before the subdirectory is merged into its parent directory.
              NOTE: A negative value means that subdirectory merging is disabled.
:Type: Integer
:Required: No
:Default: ``-10``

@@ -305,8 +315,8 @@ Misc

``filestore_split_multiple``

:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16``
              is the maximum number of files permitted in a subdirectory
              before the subdirectory is split into child directories.
:Type: Integer
:Required: No

@@ -316,10 +326,10 @@ Misc

``filestore_split_rand_factor``

:Description: A random factor added to the split threshold to avoid
              too many (expensive) Filestore splits occurring at the same time.
              For details, see ``filestore_split_multiple``.
              To change this setting for an existing OSD, it is necessary to take the OSD
              offline before running the ``ceph-objectstore-tool apply-layout-settings`` command.
:Type: Unsigned 32-bit Integer
:Required: No

@@ -328,7 +338,7 @@ Misc

``filestore_update_to``

:Description: Limits automatic upgrades to a specified version of Filestore. Useful in cases in which you want to avoid upgrading to a specific version.
:Type: Integer
:Required: No
:Default: ``1000``

@@ -336,7 +346,7 @@ Misc

``filestore_blackhole``

:Description: Drops any new transactions on the floor, similar to redirecting to NULL.
:Type: Boolean
:Required: No
:Default: ``false``

@@ -344,7 +354,7 @@ Misc

``filestore_dump_file``

:Description: Defines the file that transaction dumps are stored on.
:Type: Boolean
:Required: No
:Default: ``false``

@@ -352,7 +362,7 @@ Misc

``filestore_kill_at``

:Description: Injects a failure at the *n*\th opportunity.
:Type: String
:Required: No
:Default: ``false``

@@ -360,8 +370,7 @@ Misc

``filestore_fail_eio``

:Description: Fail/Crash on EIO.
:Type: Boolean
:Required: No
:Default: ``true``
View File
@@ -21,6 +21,9 @@ the QoS related parameters:

* total capacity (IOPS) of each OSD (determined automatically -
  See `OSD Capacity Determination (Automated)`_)
* the max sequential bandwidth capacity (MiB/s) of each OSD -
  See *osd_mclock_max_sequential_bandwidth_[hdd|ssd]* option
* an mclock profile type to enable

Using the settings in the specified profile, an OSD determines and applies the

@@ -39,15 +42,15 @@ Each service can be considered as a type of client from mclock's perspective.

Depending on the type of requests handled, mclock clients are classified into
the buckets as shown in the table below,

+------------------------+--------------------------------------------------------------+
| Client Type            | Request Types                                                |
+========================+==============================================================+
| Client                 | I/O requests issued by external clients of Ceph              |
+------------------------+--------------------------------------------------------------+
| Background recovery    | Internal recovery requests                                   |
+------------------------+--------------------------------------------------------------+
| Background best-effort | Internal backfill, scrub, snap trim and PG deletion requests |
+------------------------+--------------------------------------------------------------+
The mclock profiles allocate parameters like reservation, weight and limit
(see :ref:`dmclock-qos`) differently for each client type. The next sections

@@ -85,32 +88,54 @@ Built-in Profiles

Built-in Profiles
-----------------

Users can choose between the following built-in profile types:

.. note:: The values mentioned in the tables below represent the proportion
   of the total IOPS capacity of the OSD allocated for the service type.

* balanced (default)
* high_client_ops
* high_recovery_ops

balanced (*default*)
^^^^^^^^^^^^^^^^^^^^

The *balanced* profile is the default mClock profile. This profile allocates
equal reservation/priority to client operations and background recovery
operations. Background best-effort ops are given lower reservation and therefore
take a longer time to complete when there are competing operations. This profile
helps meet the normal/steady-state requirements of the cluster. This is the
case when external client performance requirement is not critical and there are
other background operations that still need attention within the OSD.

But there might be instances that necessitate giving higher allocations to either
client ops or recovery ops. In order to deal with such a situation, the alternate
built-in profiles may be enabled by following the steps mentioned in next sections.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 50%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 50%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background best-effort | MIN         | 1      | 90%   |
+------------------------+-------------+--------+-------+
high_client_ops
^^^^^^^^^^^^^^^

This profile optimizes client performance over background activities by
allocating more reservation and limit to client operations as compared to
background operations in the OSD. This profile, for example, may be enabled
to provide the needed performance for I/O intensive applications for a
sustained period of time at the cost of slower recoveries. The table shows
the resource control parameters set by the profile:

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 60%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 40%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background best-effort | MIN         | 1      | 70%   |
+------------------------+-------------+--------+-------+

high_recovery_ops
^^^^^^^^^^^^^^^^^

@@ -124,34 +149,16 @@ parameters set by the profile:

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 30%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 70%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background best-effort | MIN         | 1      | MAX   |
+------------------------+-------------+--------+-------+

.. note:: Across the built-in profiles, internal background best-effort clients
   of mclock include "backfill", "scrub", "snap trim", and "pg deletion"
   operations.
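For example, to prioritize client I/O across all OSDs during a period of heavy
external load, the active profile can be switched at runtime (the target
profile shown here is just one possible choice):

.. prompt:: bash #

   ceph config set osd osd_mclock_profile high_client_ops

Switching back is a matter of setting the option to ``balanced`` again, either
globally (``osd``) or for an individual daemon such as ``osd.0``.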
Custom Profile
^^^^^^^^^^^^^^

@@ -170,6 +177,11 @@ in order to ensure mClock scheduler is able to provide predictable QoS.

mClock Config Options
---------------------

.. important:: These defaults cannot be changed using any of the config
   subsystem commands like *config set* or via the *config daemon* or *config
   tell* interfaces. Although the above command(s) report success, the mclock
   QoS parameters are reverted to their respective built-in profile defaults.

When a built-in profile is enabled, the mClock scheduler calculates the low
level mclock parameters [*reservation*, *weight*, *limit*] based on the profile
enabled for each client type. The mclock parameters are calculated based on
@@ -188,30 +200,35 @@ config parameters cannot be modified when using any of the built-in profiles:

Recovery/Backfill Options
-------------------------

.. warning:: The recommendation is to not change these options as the built-in
   profiles are optimized based on them. Changing these defaults can result in
   unexpected performance outcomes.

The following recovery and backfill related Ceph options are overridden to
mClock defaults:

- :confval:`osd_max_backfills`
- :confval:`osd_recovery_max_active`
- :confval:`osd_recovery_max_active_hdd`
- :confval:`osd_recovery_max_active_ssd`

The following table shows the mClock defaults, which are the same as the current
defaults. This is done to maximize the performance of the foreground (client)
operations:
+----------------------------------------+------------------+----------------+
| Config Option                           | Original Default | mClock Default |
+========================================+==================+================+
| :confval:`osd_max_backfills`            | 1                | 1              |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active`      | 0                | 0              |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_hdd`  | 3                | 3              |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_ssd`  | 10               | 10             |
+----------------------------------------+------------------+----------------+

The above mClock defaults can be modified only if necessary by enabling
:confval:`osd_mclock_override_recovery_settings` (default: false). The
steps for this are discussed in the
`Steps to Modify mClock Max Backfills/Recovery Limits`_ section.
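As a sketch of that workflow (the backfill value below is illustrative),
overriding one of these limits involves first enabling the override flag and
then setting the desired option:

.. prompt:: bash #

   ceph config set osd osd_mclock_override_recovery_settings true
   ceph config set osd osd_max_backfills 3

Without the override flag set, changes to these options are not applied by the
mClock scheduler.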
@@ -246,8 +263,8 @@ all its clients.

Steps to Enable mClock Profile
==============================

As already mentioned, the default mclock profile is set to *balanced*.
The other values for the built-in profiles include *high_client_ops* and
*high_recovery_ops*.

If there is a requirement to change the default profile, then the option

@@ -297,15 +314,17 @@ command can be used:

After switching to the *custom* profile, the desired mClock configuration
option may be modified. For example, to change the client reservation IOPS
ratio for a specific OSD (say osd.0) to 0.5 (or 50%), the following
command can be used:
.. prompt:: bash #

   ceph config set osd.0 osd_mclock_scheduler_client_res 0.5

.. important:: Care must be taken to change the reservations of other services
   like recovery and background best effort accordingly to ensure that the sum
   of the reservations does not exceed the maximum proportion (1.0) of the IOPS
   capacity of the OSD.
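For instance, if the client reservation is raised to 0.5 as above, the
background recovery reservation might be lowered so that the reservations
together stay within 1.0. A hypothetical adjustment:

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_scheduler_background_recovery_res 0.4

The values currently in effect can be inspected with
``ceph config show osd.0 | grep osd_mclock_scheduler``.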
.. tip:: The reservation and limit parameter allocations are per-shard based on
   the type of backing device (HDD/SSD) under the OSD. See
@@ -673,12 +692,8 @@ mClock Config Options

.. confval:: osd_mclock_profile
.. confval:: osd_mclock_max_capacity_iops_hdd
.. confval:: osd_mclock_max_capacity_iops_ssd
.. confval:: osd_mclock_max_sequential_bandwidth_hdd
.. confval:: osd_mclock_max_sequential_bandwidth_ssd
.. confval:: osd_mclock_force_run_benchmark_on_init
.. confval:: osd_mclock_skip_benchmark
.. confval:: osd_mclock_override_recovery_settings
View File
@@ -16,24 +16,27 @@ consistent, but you can add, remove or replace a monitor in a cluster. See

Background
==========

Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`.

The :term:`Cluster Map` makes it possible for :term:`Ceph client`\s to
determine the location of all Ceph Monitors, Ceph OSD Daemons, and Ceph
Metadata Servers. Clients do this by connecting to one Ceph Monitor and
retrieving a current cluster map. Ceph clients must connect to a Ceph Monitor
before they can read from or write to Ceph OSD Daemons or Ceph Metadata
Servers. A Ceph client that has a current copy of the cluster map and the CRUSH
algorithm can compute the location of any RADOS object within the cluster. This
makes it possible for Ceph clients to talk directly to Ceph OSD Daemons. Direct
communication between clients and Ceph OSD Daemons improves upon traditional
storage architectures that required clients to communicate with a central
component. See `Scalability and High Availability`_ for more on this subject.

The Ceph Monitor's primary function is to maintain a master copy of the cluster
map. Monitors also provide authentication and logging services. All changes in
the monitor services are written by the Ceph Monitor to a single Paxos
instance, and Paxos writes the changes to a key/value store. This provides
strong consistency. Ceph Monitors are able to query the most recent version of
the cluster map during sync operations, and they use the key/value store's
snapshots and iterators (using RocksDB) to perform store-wide synchronization.
.. ditaa::

   (diagram omitted: only fragments of the ditaa source were captured in this excerpt)

@@ -56,12 +59,6 @@ operations. Ceph Monitors leverage the key/value store's snapshots and iterators
.. index:: Ceph Monitor; cluster map

Cluster Maps

@@ -541,6 +538,8 @@ Trimming requires that the placement groups are ``active+clean``.

.. index:: Ceph Monitor; clock

.. _mon-config-ref-clock:

Clock
-----
View File
@@ -1,16 +1,22 @@

.. _mon-dns-lookup:

===============================
Looking up Monitors through DNS
===============================

Since Ceph version 11.0.0 (Kraken), RADOS has supported looking up monitors
through DNS.

The addition of the ability to look up monitors through DNS means that daemons
and clients do not require a *mon host* configuration directive in their
``ceph.conf`` configuration file.

With a DNS update, clients and daemons can be made aware of changes
in the monitor topology. To be more precise and technical, clients look up the
monitors by using ``DNS SRV TCP`` records.

By default, clients and daemons look for the TCP service called *ceph-mon*,
which is configured by the *mon_dns_srv_name* configuration directive.

.. confval:: mon_dns_srv_name
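As an illustration (the domain, hostnames, and addresses below are
hypothetical), the corresponding records in a BIND-style zone for
``example.com`` might advertise both the v2 (3300) and v1 (6789) ports for
each monitor:

.. code-block:: none

   mon1.example.com.             IN A    192.168.1.11
   mon2.example.com.             IN A    192.168.1.12

   _ceph-mon._tcp.example.com.   IN SRV  10 60 3300 mon1.example.com.
   _ceph-mon._tcp.example.com.   IN SRV  10 60 6789 mon1.example.com.
   _ceph-mon._tcp.example.com.   IN SRV  10 60 3300 mon2.example.com.
   _ceph-mon._tcp.example.com.   IN SRV  10 60 6789 mon2.example.com.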
View File
@@ -91,9 +91,8 @@ Similarly, two options control whether IPv4 and IPv6 addresses are used:

  to an IPv6 address

.. note:: The ability to bind to multiple ports has paved the way for
   dual-stack IPv4 and IPv6 support. That said, dual-stack operation is
   not yet supported as of Quincy v17.2.0.
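The two bind options referred to above are ``ms_bind_ipv4`` and
``ms_bind_ipv6``. For an IPv6-only cluster, for example, they might be set as
follows (a sketch; adjust to your environment):

.. code-block:: ini

   [global]
   ms_bind_ipv6 = true
   ms_bind_ipv4 = false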
Connection modes
----------------
View File
@@ -140,6 +140,8 @@ See `Pool & PG Config Reference`_ for details.

.. index:: OSD; scrubbing

.. _rados_config_scrubbing:

Scrubbing
=========
View File
@@ -1,3 +1,5 @@

.. _rados_config_pool_pg_crush_ref:

======================================
 Pool, PG and CRUSH Config Reference
======================================
View File
@@ -25,6 +25,7 @@ There are several Ceph daemons in a storage cluster:

  additional monitoring and providing interfaces to external
  monitoring and management systems.

.. _rados_config_storage_devices_osd_backends:

OSD Back Ends
=============
View File
@@ -4,74 +4,70 @@

Adding/Removing Monitors
==========================

It is possible to add monitors to a running cluster as long as redundancy is
maintained. To bootstrap a monitor, see `Manual Deployment`_ or `Monitor
Bootstrap`_.

.. _adding-monitors:

Adding Monitors
===============

Ceph monitors serve as the single source of truth for the cluster map. It is
possible to run a cluster with only one monitor, but for a production cluster
it is recommended to have at least three monitors provisioned and in quorum.
Ceph monitors use a variation of the `Paxos`_ algorithm to maintain consensus
about maps and about other critical information across the cluster. Due to the
nature of Paxos, Ceph is able to maintain quorum (and thus establish
consensus) only if a majority of the monitors are ``active``.

It is best to run an odd number of monitors. This is because a cluster that is
running an odd number of monitors is more resilient than a cluster running an
even number. For example, in a two-monitor deployment, no failures can be
tolerated if quorum is to be maintained; in a three-monitor deployment, one
failure can be tolerated; in a four-monitor deployment, one failure can be
tolerated; and in a five-monitor deployment, two failures can be tolerated. In
general, a cluster running an odd number of monitors is best because it avoids
what is called the *split brain* phenomenon. In short, Ceph is able to operate
only if a majority of monitors are ``active`` and able to communicate with each
other (for example: there must be a single monitor, two out of two monitors,
two out of three monitors, three out of five monitors, or the like).

For small or non-critical deployments of multi-node Ceph clusters, it is
recommended to deploy three monitors. For larger clusters or for clusters that
are intended to survive a double failure, it is recommended to deploy five
monitors. Only in rare circumstances is there any justification for deploying
seven or more monitors.

It is possible to run a monitor on the same host that is running an OSD.
However, this approach has disadvantages: for example, `fsync` issues with the
kernel might weaken performance, and monitor and OSD daemons might be inactive
at the same time and cause disruption if the node crashes, is rebooted, or is
taken down for maintenance. Because of these risks, it is instead
recommended to run monitors and managers on dedicated hosts.

.. note:: A *majority* of monitors in your cluster must be able to
   reach each other in order for quorum to be established.
Deploying your Hardware
-----------------------

Some operators choose to add a new monitor host at the same time that they add
a new monitor. For details on the minimum recommendations for monitor hardware,
see `Hardware Recommendations`_. Before adding a monitor host to the cluster,
make sure that there is an up-to-date version of Linux installed.

Add the newly installed monitor host to a rack in your cluster, connect the
host to the network, and make sure that the host has network connectivity.

.. _Hardware Recommendations: ../../../start/hardware-recommendations

Installing the Required Software
--------------------------------

In manually deployed clusters, it is necessary to install Ceph packages
manually. For details, see `Installing Packages`_. Configure SSH so that it can
be used by a user that has passwordless authentication and root permissions.

.. _Installing Packages: ../../../install/install-storage-cluster
@ -81,65 +77,63 @@ and root permissions.
Adding a Monitor (Manual) Adding a Monitor (Manual)
------------------------- -------------------------
This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map The procedure in this section creates a ``ceph-mon`` data directory, retrieves
and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If both the monitor map and the monitor keyring, and adds a ``ceph-mon`` daemon to
this results in only two monitor daemons, you may add more monitors by the cluster. The procedure might result in a Ceph cluster that contains only
repeating this procedure until you have a sufficient number of ``ceph-mon`` two monitor daemons. To add more monitors until there are enough ``ceph-mon``
daemons to achieve a quorum. daemons to establish quorum, repeat the procedure.
At this point you should define your monitor's id. Traditionally, monitors This is a good point at which to define the new monitor's ``id``. Monitors have
have been named with single letters (``a``, ``b``, ``c``, ...), but you are often been named with single letters (``a``, ``b``, ``c``, etc.), but you are
free to define the id as you see fit. For the purpose of this document, free to define the ``id`` however you see fit. In this document, ``{mon-id}``
please take into account that ``{mon-id}`` should be the id you chose, refers to the ``id`` exclusive of the ``mon.`` prefix: for example, if
without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a`` ``mon.a`` has been chosen as the ``id`` of a monitor, then ``{mon-id}`` is
on ``mon.a``). ``a``.
#. Create the default directory on the machine that will host your #. Create a data directory on the machine that will host the new monitor:
new monitor:
.. prompt:: bash $ .. prompt:: bash $
ssh {new-mon-host} ssh {new-mon-host}
sudo mkdir /var/lib/ceph/mon/ceph-{mon-id} sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}
#. Create a temporary directory ``{tmp}`` to keep the files needed during #. Create a temporary directory ``{tmp}`` that will contain the files needed
this process. This directory should be different from the monitor's default during this procedure. This directory should be different from the data
directory created in the previous step, and can be removed after all the directory created in the previous step. Because this is a temporary
steps are executed: directory, it can be removed after the procedure is complete:
.. prompt:: bash $ .. prompt:: bash $
mkdir {tmp} mkdir {tmp}
#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to #. Retrieve the keyring for your monitors (``{tmp}`` is the path to the
the retrieved keyring, and ``{key-filename}`` is the name of the file retrieved keyring and ``{key-filename}`` is the name of the file that
containing the retrieved monitor key: contains the retrieved monitor key):
.. prompt:: bash $ .. prompt:: bash $
ceph auth get mon. -o {tmp}/{key-filename} ceph auth get mon. -o {tmp}/{key-filename}
#. Retrieve the monitor map, where ``{tmp}`` is the path to #. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map
the retrieved monitor map, and ``{map-filename}`` is the name of the file and ``{map-filename}`` is the name of the file that contains the retrieved
containing the retrieved monitor map: monitor map):
.. prompt:: bash $ .. prompt:: bash $
ceph mon getmap -o {tmp}/{map-filename} ceph mon getmap -o {tmp}/{map-filename}
#. Prepare the monitor's data directory created in the first step. You must #. Prepare the monitor's data directory, which was created in the first step.
specify the path to the monitor map so that you can retrieve the The following command must specify the path to the monitor map (so that
information about a quorum of monitors and their ``fsid``. You must also information about a quorum of monitors and their ``fsid``\s can be
specify a path to the monitor keyring: retrieved) and specify the path to the monitor keyring:
.. prompt:: bash $ .. prompt:: bash $
sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename} sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}
#. Start the new monitor. It will automatically join the cluster. To provide
#. Start the new monitor and it will automatically join the cluster. information to the daemon about which address to bind to, use either the
The daemon needs to know which address to bind to, via either the ``--public-addr {ip}`` option or the ``--public-network {network}`` option.
``--public-addr {ip}`` or ``--public-network {network}`` argument.
For example: For example:
.. prompt:: bash $ .. prompt:: bash $
@ -151,20 +145,24 @@ on ``mon.a``).
Removing Monitors Removing Monitors
================= =================
When you remove monitors from a cluster, consider that Ceph monitors use When monitors are removed from a cluster, it is important to remember
Paxos to establish consensus about the master cluster map. You must have that Ceph monitors use Paxos to maintain consensus about the cluster
a sufficient number of monitors to establish a quorum for consensus about map. Such consensus is possible only if the number of monitors is sufficient
the cluster map. to establish quorum.
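Before removing anything, it can help to confirm how many monitors are currently in the cluster and in quorum. The following standard status commands are shown only as an illustration:

.. prompt:: bash $

   ceph mon stat
   ceph quorum_status --format json-pretty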
.. _Removing a Monitor (Manual): .. _Removing a Monitor (Manual):
Removing a Monitor (Manual) Removing a Monitor (Manual)
--------------------------- ---------------------------
This procedure removes a ``ceph-mon`` daemon from your cluster. If this The procedure in this section removes a ``ceph-mon`` daemon from the cluster.
procedure results in only two monitor daemons, you may add or remove another The procedure might result in a Ceph cluster that contains a number of monitors
monitor until you have a number of ``ceph-mon`` daemons that can achieve a insufficient to maintain quorum, so plan carefully. When replacing an old
quorum. monitor with a new monitor, add the new monitor first, wait for quorum to be
established, and then remove the old monitor. This ensures that quorum is not
lost.
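For a rough idea of the whole procedure, here is a minimal sketch that assumes the monitor to be removed is ``mon.c`` running on a hypothetical host ``host03`` (the exact service name used to stop the daemon can vary between releases and distributions); the individual steps are described below:

.. prompt:: bash $

   ssh host03
   sudo systemctl stop ceph-mon@c
   ceph mon remove c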
#. Stop the monitor: #. Stop the monitor:
@ -178,17 +176,16 @@ quorum.
ceph mon remove {mon-id} ceph mon remove {mon-id}
#. Remove the monitor entry from ``ceph.conf``. #. Remove the monitor entry from the ``ceph.conf`` file:
.. _rados-mon-remove-from-unhealthy: .. _rados-mon-remove-from-unhealthy:
Removing Monitors from an Unhealthy Cluster Removing Monitors from an Unhealthy Cluster
------------------------------------------- -------------------------------------------
This procedure removes a ``ceph-mon`` daemon from an unhealthy The procedure in this section removes a ``ceph-mon`` daemon from an unhealthy
cluster, for example a cluster where the monitors cannot form a cluster (for example, a cluster whose monitors are unable to form a quorum).
quorum.
#. Stop all ``ceph-mon`` daemons on all monitor hosts: #. Stop all ``ceph-mon`` daemons on all monitor hosts:
@ -197,63 +194,68 @@ quorum.
ssh {mon-host} ssh {mon-host}
systemctl stop ceph-mon.target systemctl stop ceph-mon.target
Repeat for all monitor hosts. Repeat this step on every monitor host.
#. Identify a surviving monitor and log in to that host: #. Identify a surviving monitor and log in to the monitor's host:
.. prompt:: bash $ .. prompt:: bash $
ssh {mon-host} ssh {mon-host}
#. Extract a copy of the monmap file: #. Extract a copy of the ``monmap`` file by running a command of the following
form:
.. prompt:: bash $ .. prompt:: bash $
ceph-mon -i {mon-id} --extract-monmap {map-path} ceph-mon -i {mon-id} --extract-monmap {map-path}
In most cases, this command will be: Here is a more concrete example. In this example, ``hostname`` is the
``{mon-id}`` and ``/tmp/monmap`` is the ``{map-path}``:
.. prompt:: bash $ .. prompt:: bash $
ceph-mon -i `hostname` --extract-monmap /tmp/monmap ceph-mon -i `hostname` --extract-monmap /tmp/monmap
#. Remove the non-surviving or problematic monitors. For example, if #. Remove the non-surviving or otherwise problematic monitors:
you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where
only ``mon.a`` will survive, follow the example below:
.. prompt:: bash $ .. prompt:: bash $
monmaptool {map-path} --rm {mon-id} monmaptool {map-path} --rm {mon-id}
For example, For example, suppose that there are three monitors |---| ``mon.a``, ``mon.b``,
and ``mon.c`` |---| and that only ``mon.a`` will survive:
.. prompt:: bash $ .. prompt:: bash $
monmaptool /tmp/monmap --rm b monmaptool /tmp/monmap --rm b
monmaptool /tmp/monmap --rm c monmaptool /tmp/monmap --rm c
#. Inject the surviving map with the removed monitors into the #. Inject the surviving map that includes the removed monitors into the
surviving monitor(s). For example, to inject a map into monitor monmap of the surviving monitor(s):
``mon.a``, follow the example below:
.. prompt:: bash $ .. prompt:: bash $
ceph-mon -i {mon-id} --inject-monmap {map-path} ceph-mon -i {mon-id} --inject-monmap {map-path}
For example: Continuing with the above example, inject a map into monitor ``mon.a`` by
running the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph-mon -i a --inject-monmap /tmp/monmap ceph-mon -i a --inject-monmap /tmp/monmap
#. Start only the surviving monitors. #. Start only the surviving monitors.
#. Verify the monitors form a quorum (``ceph -s``). #. Verify that the monitors form a quorum by running the command ``ceph -s``.
#. You may wish to archive the removed monitors' data directory in #. The data directory of the removed monitors is in ``/var/lib/ceph/mon``:
``/var/lib/ceph/mon`` in a safe location, or delete it if you are either archive this data directory in a safe location or delete this data
confident the remaining monitors are healthy and are sufficiently directory. However, do not delete it unless you are confident that the
redundant. remaining monitors are healthy and sufficiently redundant. Make sure that
there is enough room for the live DB to expand and compact, and make sure
that there is also room for an archived copy of the DB. The archived copy
can be compressed.
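For example, if ``mon.a`` is the only surviving monitor, the final steps above might look like this (the service name is illustrative and depends on how the daemon is managed on your hosts):

.. prompt:: bash $

   sudo systemctl start ceph-mon@a
   ceph -s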
.. _Changing a Monitor's IP address: .. _Changing a Monitor's IP address:
@ -262,60 +264,65 @@ Changing a Monitor's IP Address
.. important:: Existing monitors are not supposed to change their IP addresses. .. important:: Existing monitors are not supposed to change their IP addresses.
Monitors are critical components of a Ceph cluster, and they need to maintain a Monitors are critical components of a Ceph cluster. The entire system can work
quorum for the whole system to work properly. To establish a quorum, the properly only if the monitors maintain quorum, and quorum can be established
monitors need to discover each other. Ceph has strict requirements for only if the monitors have discovered each other by means of their IP addresses.
discovering monitors. Ceph has strict requirements on the discovery of monitors.
Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors. Although the ``ceph.conf`` file is used by Ceph clients and other Ceph daemons
However, monitors discover each other using the monitor map, not ``ceph.conf``. to discover monitors, the monitor map is used by monitors to discover each
For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you other. This is why it is necessary to obtain the current ``monmap`` at the time
need to obtain the current monmap for the cluster when creating a new monitor, a new monitor is created: as can be seen above in `Adding a Monitor (Manual)`_,
as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The the ``monmap`` is one of the arguments required by the ``ceph-mon -i {mon-id}
following sections explain the consistency requirements for Ceph monitors, and a --mkfs`` command. The following sections explain the consistency requirements
few safe ways to change a monitor's IP address. for Ceph monitors, and also explain a number of safe ways to change a monitor's
IP address.
Consistency Requirements Consistency Requirements
------------------------ ------------------------
A monitor always refers to the local copy of the monmap when discovering other When a monitor discovers other monitors in the cluster, it always refers to the
monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids local copy of the monitor map. Using the monitor map instead of using the
errors that could break the cluster (e.g., typos in ``ceph.conf`` when ``ceph.conf`` file avoids errors that could break the cluster (for example,
specifying a monitor address or port). Since monitors use monmaps for discovery typos or other slight errors in ``ceph.conf`` when a monitor address or port is
and they share monmaps with clients and other Ceph daemons, the monmap provides specified). Because monitors use monitor maps for discovery and because they
monitors with a strict guarantee that their consensus is valid. share monitor maps with Ceph clients and other Ceph daemons, the monitor map
provides monitors with a strict guarantee that their consensus is valid.
Strict consistency also applies to updates to the monmap. As with any other Strict consistency also applies to updates to the monmap. As with any other
updates on the monitor, changes to the monmap always run through a distributed updates on the monitor, changes to the monmap always run through a distributed
consensus algorithm called `Paxos`_. The monitors must agree on each update to consensus algorithm called `Paxos`_. The monitors must agree on each update to
the monmap, such as adding or removing a monitor, to ensure that each monitor in the monmap, such as adding or removing a monitor, to ensure that each monitor
the quorum has the same version of the monmap. Updates to the monmap are in the quorum has the same version of the monmap. Updates to the monmap are
incremental so that monitors have the latest agreed upon version, and a set of incremental so that monitors have the latest agreed upon version, and a set of
previous versions, allowing a monitor that has an older version of the monmap to previous versions, allowing a monitor that has an older version of the monmap
catch up with the current state of the cluster. to catch up with the current state of the cluster.
If monitors discovered each other through the Ceph configuration file instead of There are additional advantages to using the monitor map rather than
through the monmap, it would introduce additional risks because the Ceph ``ceph.conf`` when monitors discover each other. Because ``ceph.conf`` is not
configuration files are not updated and distributed automatically. Monitors automatically updated and distributed, its use would bring certain risks:
might inadvertently use an older ``ceph.conf`` file, fail to recognize a monitors might use an outdated ``ceph.conf`` file, might fail to recognize a
monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able specific monitor, might fall out of quorum, and might develop a situation in
to determine the current state of the system accurately. Consequently, making which `Paxos`_ is unable to accurately ascertain the current state of the
changes to an existing monitor's IP address must be done with great care. system. Because of these risks, any changes to an existing monitor's IP address
must be made with great care.
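As a quick way to see the monmap that the monitors have agreed on, without touching any daemon's on-disk copy, you can dump it from a running cluster (a standard command, shown here only as a sanity check):

.. prompt:: bash $

   ceph mon dump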
.. _operations_add_or_rm_mons_changing_mon_ip:
Changing a Monitor's IP address (The Right Way) Changing a Monitor's IP address (Preferred Method)
----------------------------------------------- --------------------------------------------------
Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to If a monitor's IP address is changed only in the ``ceph.conf`` file, there is
ensure that other monitors in the cluster will receive the update. To change a no guarantee that the other monitors in the cluster will receive the update.
monitor's IP address, you must add a new monitor with the IP address you want For this reason, the preferred method to change a monitor's IP address is as
to use (as described in `Adding a Monitor (Manual)`_), ensure that the new follows: add a new monitor with the desired IP address (as described in `Adding
monitor successfully joins the quorum; then, remove the monitor that uses the a Monitor (Manual)`_), make sure that the new monitor successfully joins the
old IP address. Then, update the ``ceph.conf`` file to ensure that clients and quorum, remove the monitor that is using the old IP address, and update the
other daemons know the IP address of the new monitor. ``ceph.conf`` file to ensure that clients and other daemons are made aware of
the new monitor's IP address.
For example, lets assume there are three monitors in place, such as :: For example, suppose that there are three monitors in place::
[mon.a] [mon.a]
host = host01 host = host01
@ -327,41 +334,44 @@ For example, lets assume there are three monitors in place, such as ::
host = host03 host = host03
addr = 10.0.0.3:6789 addr = 10.0.0.3:6789
To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the To change ``mon.c`` so that its name is ``host04`` and its IP address is
steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure ``10.0.0.4``: (1) follow the steps in `Adding a Monitor (Manual)`_ to add a new
that ``mon.d`` is running before removing ``mon.c``, or it will break the monitor ``mon.d``, (2) make sure that ``mon.d`` is running before removing
quorum. Remove ``mon.c`` as described on `Removing a Monitor (Manual)`_. Moving ``mon.c`` or else quorum will be broken, and (3) follow the steps in `Removing
all three monitors would thus require repeating this process as many times as a Monitor (Manual)`_ to remove ``mon.c``. To move all three monitors to new IP
needed. addresses, repeat this process.
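Sketched concretely, and assuming the cluster name is ``ceph`` and that ``/tmp`` is used as the temporary directory, the add-then-remove sequence for the example above might look like this; each command is explained in `Adding a Monitor (Manual)`_ and `Removing a Monitor (Manual)`_:

.. prompt:: bash $

   ssh host04
   sudo mkdir /var/lib/ceph/mon/ceph-d
   ceph auth get mon. -o /tmp/keyring
   ceph mon getmap -o /tmp/monmap
   sudo ceph-mon -i d --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
   sudo ceph-mon -i d --public-addr 10.0.0.4:6789
   ceph mon stat                 # wait until mon.d is shown in the quorum
   ceph mon remove c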
Changing a Monitor's IP address (Advanced Method)
-------------------------------------------------
Changing a Monitor's IP address (The Messy Way) There are cases in which the method outlined in :ref:`Changing a Monitor's IP
----------------------------------------------- Address (Preferred Method) <operations_add_or_rm_mons_changing_mon_ip>` cannot
be used. For example, it might be necessary to move the cluster's monitors to a
different network, to a different part of the datacenter, or to a different
datacenter altogether. It is still possible to change the monitors' IP
addresses, but a different method must be used.
There may come a time when the monitors must be moved to a different network, a For such cases, a new monitor map with updated IP addresses for every monitor
different part of the datacenter or a different datacenter altogether. While it in the cluster must be generated and injected on each monitor. Although this
is possible to do it, the process becomes a bit more hazardous. method is not particularly easy, such a major migration is unlikely to be a
routine task. As stated at the beginning of this section, existing monitors are
not supposed to change their IP addresses.
In such a case, the solution is to generate a new monmap with updated IP Continue with the monitor configuration in the example from :ref:`Changing a
addresses for all the monitors in the cluster, and inject the new map on each Monitor's IP Address (Preferred Method)
individual monitor. This is not the most user-friendly approach, but we do not <operations_add_or_rm_mons_changing_mon_ip>`. Suppose that all of the monitors
expect this to be something that needs to be done every other week. As it is are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that
clearly stated on the top of this section, monitors are not supposed to change these networks are unable to communicate. Carry out the following procedure:
IP addresses.
Using the previous monitor configuration as an example, assume you want to move #. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor
all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these map, and ``{filename}`` is the name of the file that contains the retrieved
networks are unable to communicate. Use the following procedure: monitor map):
#. Retrieve the monitor map, where ``{tmp}`` is the path to
the retrieved monitor map, and ``{filename}`` is the name of the file
containing the retrieved monitor map:
.. prompt:: bash $ .. prompt:: bash $
ceph mon getmap -o {tmp}/{filename} ceph mon getmap -o {tmp}/{filename}
#. The following example demonstrates the contents of the monmap: #. Check the contents of the monitor map:
.. prompt:: bash $ .. prompt:: bash $
@ -378,13 +388,12 @@ networks are unable to communicate. Use the following procedure:
1: 10.0.0.2:6789/0 mon.b 1: 10.0.0.2:6789/0 mon.b
2: 10.0.0.3:6789/0 mon.c 2: 10.0.0.3:6789/0 mon.c
#. Remove the existing monitors: #. Remove the existing monitors from the monitor map:
.. prompt:: bash $ .. prompt:: bash $
monmaptool --rm a --rm b --rm c {tmp}/{filename} monmaptool --rm a --rm b --rm c {tmp}/{filename}
:: ::
monmaptool: monmap file {tmp}/{filename} monmaptool: monmap file {tmp}/{filename}
@ -393,19 +402,18 @@ networks are unable to communicate. Use the following procedure:
monmaptool: removing c monmaptool: removing c
monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors) monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
#. Add the new monitor locations: #. Add the new monitor locations to the monitor map:
.. prompt:: bash $ .. prompt:: bash $
monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename} monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename}
:: ::
monmaptool: monmap file {tmp}/{filename} monmaptool: monmap file {tmp}/{filename}
monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors) monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors)
#. Check new contents: #. Check the new contents of the monitor map:
.. prompt:: bash $ .. prompt:: bash $
@ -422,25 +430,29 @@ networks are unable to communicate. Use the following procedure:
1: 10.1.0.2:6789/0 mon.b 1: 10.1.0.2:6789/0 mon.b
2: 10.1.0.3:6789/0 mon.c 2: 10.1.0.3:6789/0 mon.c
At this point, we assume the monitors (and stores) are installed at the new At this point, we assume that the monitors (and stores) have been installed at
location. The next step is to propagate the modified monmap to the new the new location. Next, propagate the modified monitor map to the new monitors,
monitors, and inject the modified monmap into each new monitor. and inject the modified monitor map into each new monitor.
#. First, make sure to stop all your monitors. Injection must be done while #. Make sure all of your monitors have been stopped. Never inject into a
the daemon is not running. monitor while the monitor daemon is running.
#. Inject the monmap: #. Inject the monitor map:
.. prompt:: bash $ .. prompt:: bash $
ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename} ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename}
#. Restart the monitors. #. Restart all of the monitors.
Migration to the new location is now complete. The monitors should operate
successfully.
After this step, migration to the new location is complete and
the monitors should operate successfully.
.. _Manual Deployment: ../../../install/manual-deployment .. _Manual Deployment: ../../../install/manual-deployment
.. _Monitor Bootstrap: ../../../dev/mon-bootstrap .. _Monitor Bootstrap: ../../../dev/mon-bootstrap
.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) .. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
.. |---| unicode:: U+2014 .. EM DASH
:trim:

View File

@ -2,49 +2,51 @@
Adding/Removing OSDs Adding/Removing OSDs
====================== ======================
When you have a cluster up and running, you may add OSDs or remove OSDs When a cluster is up and running, it is possible to add or remove OSDs.
from the cluster at runtime.
Adding OSDs Adding OSDs
=========== ===========
When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an OSDs can be added to a cluster in order to expand the cluster's capacity and
OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a resilience. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on one
host machine. If your host has multiple storage drives, you may map one storage drive within a host machine. But if your host machine has multiple
``ceph-osd`` daemon for each drive. storage drives, you may map one ``ceph-osd`` daemon for each drive on the
machine.
Generally, it's a good idea to check the capacity of your cluster to see if you It's a good idea to check the capacity of your cluster so that you know when it
are reaching the upper end of its capacity. As your cluster reaches its ``near approaches its capacity limits. If your cluster has reached its ``near full``
full`` ratio, you should add one or more OSDs to expand your cluster's capacity. ratio, then you should add OSDs to expand your cluster's capacity.
.. warning:: Do not let your cluster reach its ``full ratio`` before .. warning:: Do not add an OSD after your cluster has reached its ``full
adding an OSD. OSD failures that occur after the cluster reaches ratio``. OSD failures that occur after the cluster reaches its ``near full
its ``near full`` ratio may cause the cluster to exceed its ratio`` might cause the cluster to exceed its ``full ratio``.
``full ratio``.
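To check how close the cluster currently is to those ratios before adding (or removing) OSDs, the standard utilization commands are usually sufficient, for example:

.. prompt:: bash $

   ceph df
   ceph osd dump | grep -i ratio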
Deploy your Hardware
--------------------
If you are adding a new host when adding a new OSD, see `Hardware Deploying your Hardware
-----------------------
If you are also adding a new host when adding a new OSD, see `Hardware
Recommendations`_ for details on minimum recommendations for OSD hardware. To Recommendations`_ for details on minimum recommendations for OSD hardware. To
add an OSD host to your cluster, first make sure you have an up-to-date version add an OSD host to your cluster, begin by making sure that an appropriate
of Linux installed, and you have made some initial preparations for your version of Linux has been installed on the host machine and that all initial
storage drives. See `Filesystem Recommendations`_ for details. preparations for your storage drives have been carried out. For details, see
`Filesystem Recommendations`_.
Next, add your OSD host to a rack in your cluster, connect the host to the
network, and ensure that the host has network connectivity. For details, see
`Network Configuration Reference`_.
Add your OSD host to a rack in your cluster, connect it to the network
and ensure that it has network connectivity. See the `Network Configuration
Reference`_ for details.
.. _Hardware Recommendations: ../../../start/hardware-recommendations .. _Hardware Recommendations: ../../../start/hardware-recommendations
.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations .. _Filesystem Recommendations: ../../configuration/filesystem-recommendations
.. _Network Configuration Reference: ../../configuration/network-config-ref .. _Network Configuration Reference: ../../configuration/network-config-ref
Install the Required Software Installing the Required Software
----------------------------- --------------------------------
For manually deployed clusters, you must install Ceph packages If your cluster has been manually deployed, you will need to install Ceph
manually. See `Installing Ceph (Manual)`_ for details. software packages manually. For details, see `Installing Ceph (Manual)`_.
You should configure SSH to a user with password-less authentication Configure SSH for the appropriate user to have both passwordless authentication
and root permissions. and root permissions.
.. _Installing Ceph (Manual): ../../../install .. _Installing Ceph (Manual): ../../../install
@ -53,48 +55,56 @@ and root permissions.
Adding an OSD (Manual) Adding an OSD (Manual)
---------------------- ----------------------
This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive, The following procedure sets up a ``ceph-osd`` daemon, configures this OSD to
and configures the cluster to distribute data to the OSD. If your host has use one drive, and configures the cluster to distribute data to the OSD. If
multiple drives, you may add an OSD for each drive by repeating this procedure. your host machine has multiple drives, you may add an OSD for each drive on the
host by repeating this procedure.
To add an OSD, create a data directory for it, mount a drive to that directory, As the following procedure will demonstrate, adding an OSD involves creating a
add the OSD to the cluster, and then add it to the CRUSH map. metadata directory for it, configuring a data storage drive, adding the OSD to
the cluster, and then adding it to the CRUSH map.
When you add the OSD to the CRUSH map, consider the weight you give to the new When you add the OSD to the CRUSH map, you will need to consider the weight you
OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger assign to the new OSD. Since storage drive capacities increase over time, newer
hard drives than older hosts in the cluster (i.e., they may have greater OSD hosts are likely to have larger hard drives than the older hosts in the
weight). cluster have and therefore might have greater weight as well.
.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives .. tip:: Ceph works best with uniform hardware across pools. It is possible to
of dissimilar size, you can adjust their weights. However, for best add drives of dissimilar size and then adjust their weights accordingly.
performance, consider a CRUSH hierarchy with drives of the same type/size. However, for best performance, consider a CRUSH hierarchy that has drives of
the same type and size. It is better to add larger drives uniformly to
existing hosts. This can be done incrementally, replacing smaller drives
each time the new drives are added.
#. Create the OSD. If no UUID is given, it will be set automatically when the #. Create the new OSD by running a command of the following form. If you opt
OSD starts up. The following command will output the OSD number, which you not to specify a UUID in this command, the UUID will be set automatically
will need for subsequent steps: when the OSD starts up. The OSD number, which is needed for subsequent
steps, is found in the command's output:
.. prompt:: bash $ .. prompt:: bash $
ceph osd create [{uuid} [{id}]] ceph osd create [{uuid} [{id}]]
If the optional parameter {id} is given it will be used as the OSD id. If the optional parameter {id} is specified it will be used as the OSD ID.
Note, in this case the command may fail if the number is already in use. However, if the ID number is already in use, the command will fail.
.. warning:: In general, explicitly specifying {id} is not recommended. .. warning:: Explicitly specifying the ``{id}`` parameter is not
IDs are allocated as an array, and skipping entries consumes some extra recommended. IDs are allocated as an array, and any skipping of entries
memory. This can become significant if there are large gaps and/or consumes extra memory. This memory consumption can become significant if
clusters are large. If {id} is not specified, the smallest available is there are large gaps or if clusters are large. By leaving the ``{id}``
used. parameter unspecified, we ensure that Ceph uses the smallest ID number
available and that these problems are avoided.
#. Create the default directory on your new OSD: #. Create the default directory for your new OSD by running commands of the
following form:
.. prompt:: bash $ .. prompt:: bash $
ssh {new-osd-host} ssh {new-osd-host}
sudo mkdir /var/lib/ceph/osd/ceph-{osd-number} sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
#. If the OSD is for a drive other than the OS drive, prepare it #. If the OSD will be created on a drive other than the OS drive, prepare it
for use with Ceph, and mount it to the directory you just created: for use with Ceph. Run commands of the following form:
.. prompt:: bash $ .. prompt:: bash $
@ -102,41 +112,49 @@ weight).
sudo mkfs -t {fstype} /dev/{drive} sudo mkfs -t {fstype} /dev/{drive}
sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number} sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number}
#. Initialize the OSD data directory: #. Initialize the OSD data directory by running commands of the following form:
.. prompt:: bash $ .. prompt:: bash $
ssh {new-osd-host} ssh {new-osd-host}
ceph-osd -i {osd-num} --mkfs --mkkey ceph-osd -i {osd-num} --mkfs --mkkey
The directory must be empty before you can run ``ceph-osd``. Make sure that the directory is empty before running ``ceph-osd``.
#. Register the OSD authentication key. The value of ``ceph`` for #. Register the OSD authentication key by running a command of the following
``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your form:
cluster name differs from ``ceph``, use your cluster name instead:
.. prompt:: bash $ .. prompt:: bash $
ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring
#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The This presentation of the command has ``ceph-{osd-num}`` in the listed path
``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy because many clusters have the name ``ceph``. However, if your cluster name
wherever you wish. If you specify at least one bucket, the command is not ``ceph``, then the string ``ceph`` in ``ceph-{osd-num}`` needs to be
will place the OSD into the most specific bucket you specify, *and* it will replaced with your cluster name. For example, if your cluster name is
move that bucket underneath any other buckets you specify. **Important:** If ``cluster1``, then the path in the command should be
you specify only the root bucket, the command will attach the OSD directly ``/var/lib/ceph/osd/cluster1-{osd-num}/keyring``.
to the root, but CRUSH rules expect OSDs to be inside of hosts.
Execute the following: #. Add the OSD to the CRUSH map by running the following command. This allows
the OSD to begin receiving data. The ``ceph osd crush add`` command can add
OSDs to the CRUSH hierarchy wherever you want. If you specify one or more
buckets, the command places the OSD in the most specific of those buckets,
and it moves that bucket underneath any other buckets that you have
specified. **Important:** If you specify only the root bucket, the command
will attach the OSD directly to the root, but CRUSH rules expect OSDs to be
inside of hosts. If the OSDs are not inside hosts, the OSDS will likely not
receive any data.
.. prompt:: bash $ .. prompt:: bash $
ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...] ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...]
You may also decompile the CRUSH map, add the OSD to the device list, add the Note that there is another way to add a new OSD to the CRUSH map: decompile
host as a bucket (if it's not already in the CRUSH map), add the device as an the CRUSH map, add the OSD to the device list, add the host as a bucket (if
item in the host, assign it a weight, recompile it and set it. See it is not already in the CRUSH map), add the device as an item in the host,
`Add/Move an OSD`_ for details. assign the device a weight, recompile the CRUSH map, and set the CRUSH map.
For details, see `Add/Move an OSD`_. This is rarely necessary with recent
releases (as of the Reef release).
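For example (hypothetical values), to place a newly created ``osd.12`` with a weight of ``1.819`` (roughly a 2 TB drive) under the host bucket ``node4`` in the ``default`` root:

.. prompt:: bash $

   ceph osd crush add osd.12 1.819 host=node4 root=default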
.. _rados-replacing-an-osd: .. _rados-replacing-an-osd:
@ -144,193 +162,206 @@ weight).
Replacing an OSD Replacing an OSD
---------------- ----------------
.. note:: If the instructions in this section do not work for you, try the .. note:: If the procedure in this section does not work for you, try the
instructions in the cephadm documentation: :ref:`cephadm-replacing-an-osd`. instructions in the ``cephadm`` documentation:
:ref:`cephadm-replacing-an-osd`.
When disks fail, or if an administrator wants to reprovision OSDs with a new Sometimes OSDs need to be replaced: for example, when a disk fails, or when an
backend, for instance, for switching from FileStore to BlueStore, OSDs need to administrator wants to reprovision OSDs with a new back end (perhaps when
be replaced. Unlike `Removing the OSD`_, replaced OSD's id and CRUSH map entry switching from Filestore to BlueStore). Replacing an OSD differs from `Removing
need to be keep intact after the OSD is destroyed for replacement. the OSD`_ in that the replaced OSD's ID and CRUSH map entry must be kept intact
after the OSD is destroyed for replacement.
#. Make sure it is safe to destroy the OSD:
#. Make sure that it is safe to destroy the OSD:
.. prompt:: bash $ .. prompt:: bash $
while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done
#. Destroy the OSD first: #. Destroy the OSD:
.. prompt:: bash $ .. prompt:: bash $
ceph osd destroy {id} --yes-i-really-mean-it ceph osd destroy {id} --yes-i-really-mean-it
#. Zap a disk for the new OSD, if the disk was used before for other purposes. #. *Optional*: If the disk that you plan to use is not a new disk and has been
It's not necessary for a new disk: used before for other purposes, zap the disk:
.. prompt:: bash $ .. prompt:: bash $
ceph-volume lvm zap /dev/sdX ceph-volume lvm zap /dev/sdX
#. Prepare the disk for replacement by using the previously destroyed OSD id: #. Prepare the disk for replacement by using the ID of the OSD that was
destroyed in previous steps:
.. prompt:: bash $ .. prompt:: bash $
ceph-volume lvm prepare --osd-id {id} --data /dev/sdX ceph-volume lvm prepare --osd-id {id} --data /dev/sdX
#. And activate the OSD: #. Finally, activate the OSD:
.. prompt:: bash $ .. prompt:: bash $
ceph-volume lvm activate {id} {fsid} ceph-volume lvm activate {id} {fsid}
Alternatively, instead of preparing and activating, the device can be recreated Alternatively, instead of carrying out the final two steps (preparing the disk
in one call, like: and activating the OSD), you can re-create the OSD by running a single command
of the following form:
.. prompt:: bash $ .. prompt:: bash $
ceph-volume lvm create --osd-id {id} --data /dev/sdX ceph-volume lvm create --osd-id {id} --data /dev/sdX
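Putting the steps together, replacing a failed ``osd.7`` whose replacement device is ``/dev/sdb`` (both values hypothetical) might look like this:

.. prompt:: bash $

   while ! ceph osd safe-to-destroy osd.7 ; do sleep 10 ; done
   ceph osd destroy 7 --yes-i-really-mean-it
   ceph-volume lvm zap /dev/sdb
   ceph-volume lvm create --osd-id 7 --data /dev/sdb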
Starting the OSD Starting the OSD
---------------- ----------------
After you add an OSD to Ceph, the OSD is in your configuration. However, After an OSD is added to Ceph, the OSD is in the cluster. However, until it is
it is not yet running. The OSD is ``down`` and ``in``. You must start started, the OSD is considered ``down`` and ``in``. The OSD is not running and
your new OSD before it can begin receiving data. You may use will be unable to receive data. To start an OSD, either run ``service ceph``
``service ceph`` from your admin host or start the OSD from its host from your admin host or run a command of the following form to start the OSD
machine: from its host machine:
.. prompt:: bash $ .. prompt:: bash $
sudo systemctl start ceph-osd@{osd-num} sudo systemctl start ceph-osd@{osd-num}
After the OSD is started, it is considered ``up`` and ``in``.
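On systemd-managed hosts you will usually also want the OSD daemon to come back automatically after a reboot; a typical (illustrative) way to arrange this for, say, ``osd.12`` is:

.. prompt:: bash $

   sudo systemctl enable ceph-osd@12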
Once you start your OSD, it is ``up`` and ``in``. Observing the Data Migration
----------------------------
After the new OSD has been added to the CRUSH map, Ceph begins rebalancing the
Observe the Data Migration cluster by migrating placement groups (PGs) to the new OSD. To observe this
-------------------------- process by using the `ceph`_ tool, run the following command:
Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing
the server by migrating placement groups to your new OSD. You can observe this
process with the `ceph`_ tool. :
.. prompt:: bash $ .. prompt:: bash $
ceph -w ceph -w
You should see the placement group states change from ``active+clean`` to Or:
``active, some degraded objects``, and finally ``active+clean`` when migration
completes. (Control-c to exit.) .. prompt:: bash $
watch ceph status
The PG states will first change from ``active+clean`` to ``active, some
degraded objects`` and then return to ``active+clean`` when migration
completes. When you are finished observing, press Ctrl-C to exit.
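Another way to follow the rebalancing (not shown above, but standard Ceph tooling) is to watch the per-OSD placement-group counts and utilization converge:

.. prompt:: bash $

   ceph osd df tree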
.. _Add/Move an OSD: ../crush-map#addosd .. _Add/Move an OSD: ../crush-map#addosd
.. _ceph: ../monitoring .. _ceph: ../monitoring
Removing OSDs (Manual) Removing OSDs (Manual)
====================== ======================
When you want to reduce the size of a cluster or replace hardware, you may It is possible to remove an OSD manually while the cluster is running: you
remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd`` might want to do this in order to reduce the size of the cluster or when
daemon for one storage drive within a host machine. If your host has multiple replacing hardware. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on
storage drives, you may need to remove one ``ceph-osd`` daemon for each drive. one storage drive within a host machine. Alternatively, if your host machine
Generally, it's a good idea to check the capacity of your cluster to see if you has multiple storage drives, you might need to remove multiple ``ceph-osd``
are reaching the upper end of its capacity. Ensure that when you remove an OSD daemons: one daemon for each drive on the machine.
that your cluster is not at its ``near full`` ratio.
.. warning:: Do not let your cluster reach its ``full ratio`` when .. warning:: Before you begin the process of removing an OSD, make sure that
removing an OSD. Removing OSDs could cause the cluster to reach your cluster is not near its ``full ratio``. Otherwise the act of removing
or exceed its ``full ratio``. OSDs might cause the cluster to reach or exceed its ``full ratio``.
Take the OSD out of the Cluster Taking the OSD ``out`` of the Cluster
----------------------------------- -------------------------------------
Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it OSDs are typically ``up`` and ``in`` before they are removed from the cluster.
out of the cluster so that Ceph can begin rebalancing and copying its data to Before the OSD can be removed from the cluster, the OSD must be taken ``out``
other OSDs. : of the cluster so that Ceph can begin rebalancing and copying its data to other
OSDs. To take an OSD ``out`` of the cluster, run a command of the following
form:
.. prompt:: bash $ .. prompt:: bash $
ceph osd out {osd-num} ceph osd out {osd-num}
Observe the Data Migration Observing the Data Migration
-------------------------- ----------------------------
Once you have taken your OSD ``out`` of the cluster, Ceph will begin After the OSD has been taken ``out`` of the cluster, Ceph begins rebalancing
rebalancing the cluster by migrating placement groups out of the OSD you the cluster by migrating placement groups out of the OSD that was removed. To
removed. You can observe this process with the `ceph`_ tool. : observe this process by using the `ceph`_ tool, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph -w ceph -w
You should see the placement group states change from ``active+clean`` to The PG states will change from ``active+clean`` to ``active, some degraded
``active, some degraded objects``, and finally ``active+clean`` when migration objects`` and will then return to ``active+clean`` when migration completes.
completes. (Control-c to exit.) When you are finished observing, press Ctrl-C to exit.
.. note:: Sometimes, typically in a "small" cluster with few hosts (for .. note:: Under certain conditions, the action of taking ``out`` an OSD
instance with a small testing cluster), the fact to take ``out`` the might lead CRUSH to encounter a corner case in which some PGs remain stuck
OSD can spawn a CRUSH corner case where some PGs remain stuck in the in the ``active+remapped`` state. This problem sometimes occurs in small
``active+remapped`` state. If you are in this case, you should mark clusters with few hosts (for example, in a small testing cluster). To
the OSD ``in`` with: address this problem, mark the OSD ``in`` by running a command of the
following form:
.. prompt:: bash $ .. prompt:: bash $
ceph osd in {osd-num} ceph osd in {osd-num}
to come back to the initial state and then, instead of marking ``out`` After the OSD has come back to its initial state, do not mark the OSD
the OSD, set its weight to 0 with: ``out`` again. Instead, set the OSD's weight to ``0`` by running a command
of the following form:
.. prompt:: bash $ .. prompt:: bash $
ceph osd crush reweight osd.{osd-num} 0 ceph osd crush reweight osd.{osd-num} 0
After that, you can observe the data migration which should come to its After the OSD has been reweighted, observe the data migration and confirm
end. The difference between marking ``out`` the OSD and reweighting it that it has completed successfully. The difference between marking an OSD
to 0 is that in the first case the weight of the bucket which contains ``out`` and reweighting the OSD to ``0`` has to do with the bucket that
the OSD is not changed whereas in the second case the weight of the bucket contains the OSD. When an OSD is marked ``out``, the weight of the bucket is
is updated (and decreased of the OSD weight). The reweight command could not changed. But when an OSD is reweighted to ``0``, the weight of the
be sometimes favoured in the case of a "small" cluster. bucket is updated (namely, the weight of the OSD is subtracted from the
overall weight of the bucket). When operating small clusters, it can
sometimes be preferable to use the above reweight command.
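For example, to drain a hypothetical ``osd.3`` this way and confirm the change:

.. prompt:: bash $

   ceph osd crush reweight osd.3 0
   ceph osd tree        # osd.3 should now show a CRUSH weight of 0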
Stopping the OSD Stopping the OSD
---------------- ----------------
After you take an OSD out of the cluster, it may still be running. After you take an OSD ``out`` of the cluster, the OSD might still be running.
That is, the OSD may be ``up`` and ``out``. You must stop In such a case, the OSD is ``up`` and ``out``. Before it is removed from the
your OSD before you remove it from the configuration: cluster, the OSD must be stopped by running commands of the following form:
.. prompt:: bash $ .. prompt:: bash $
ssh {osd-host} ssh {osd-host}
sudo systemctl stop ceph-osd@{osd-num} sudo systemctl stop ceph-osd@{osd-num}
Once you stop your OSD, it is ``down``. After the OSD has been stopped, it is ``down``.
Removing the OSD Removing the OSD
---------------- ----------------
This procedure removes an OSD from a cluster map, removes its authentication The following procedure removes an OSD from the cluster map, removes the OSD's
key, removes the OSD from the OSD map, and removes the OSD from the authentication key, removes the OSD from the OSD map, and removes the OSD from
``ceph.conf`` file. If your host has multiple drives, you may need to remove an the ``ceph.conf`` file. If your host has multiple drives, it might be necessary
OSD for each drive by repeating this procedure. to remove an OSD from each drive by repeating this procedure.
#. Let the cluster forget the OSD first. This step removes the OSD from the CRUSH #. Begin by having the cluster forget the OSD. This step removes the OSD from
map, removes its authentication key. And it is removed from the OSD map as the CRUSH map, removes the OSD's authentication key, and removes the OSD
well. Please note the :ref:`purge subcommand <ceph-admin-osd>` is introduced in Luminous, for older from the OSD map. (The :ref:`purge subcommand <ceph-admin-osd>` was
versions, please see below: introduced in Luminous. For older releases, see :ref:`the procedure linked
here <ceph_osd_purge_procedure_pre_luminous>`.):
.. prompt:: bash $ .. prompt:: bash $
ceph osd purge {id} --yes-i-really-mean-it ceph osd purge {id} --yes-i-really-mean-it
#. Navigate to the host where you keep the master copy of the cluster's
``ceph.conf`` file: #. Navigate to the host where the master copy of the cluster's
``ceph.conf`` file is kept:
.. prompt:: bash $ .. prompt:: bash $
@ -338,46 +369,48 @@ OSD for each drive by repeating this procedure.
cd /etc/ceph cd /etc/ceph
vim ceph.conf vim ceph.conf
#. Remove the OSD entry from your ``ceph.conf`` file (if it exists):: #. Remove the OSD entry from your ``ceph.conf`` file (if such an entry
exists)::
[osd.1] [osd.1]
host = {hostname} host = {hostname}
#. From the host where you keep the master copy of the cluster's ``ceph.conf`` #. Copy the updated ``ceph.conf`` file from the location on the host where the
file, copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of master copy of the cluster's ``ceph.conf`` is kept to the ``/etc/ceph``
other hosts in your cluster. directory of the other hosts in your cluster.
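For example, assuming the OSD being removed is ``osd.3`` and that it has already been taken ``out`` and stopped as described above, the purge step and a quick check might look like this:

.. prompt:: bash $

   ceph osd purge 3 --yes-i-really-mean-it
   ceph osd tree        # osd.3 should no longer be listed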
If your Ceph cluster is older than Luminous, instead of using ``ceph osd .. _ceph_osd_purge_procedure_pre_luminous:
purge``, you need to perform this step manually:
If your Ceph cluster is older than Luminous, you will be unable to use the
``ceph osd purge`` command. Instead, carry out the following procedure:
#. Remove the OSD from the CRUSH map so that it no longer receives data. You may #. Remove the OSD from the CRUSH map so that it no longer receives data (for
also decompile the CRUSH map, remove the OSD from the device list, remove the more details, see `Remove an OSD`_):
device as an item in the host bucket or remove the host bucket (if it's in the
CRUSH map and you intend to remove the host), recompile the map and set it.
See `Remove an OSD`_ for details:
.. prompt:: bash $ .. prompt:: bash $
ceph osd crush remove {name} ceph osd crush remove {name}
Instead of removing the OSD from the CRUSH map, you might opt for one of two
alternatives: (1) decompile the CRUSH map, remove the OSD from the device
list, and remove the device from the host bucket; (2) remove the host bucket
from the CRUSH map (provided that it is in the CRUSH map and that you intend
to remove the host), recompile the map, and set it:
#. Remove the OSD authentication key: #. Remove the OSD authentication key:
.. prompt:: bash $ .. prompt:: bash $
ceph auth del osd.{osd-num} ceph auth del osd.{osd-num}
The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the
``$cluster-$id``. If your cluster name differs from ``ceph``, use your
cluster name instead.
#. Remove the OSD: #. Remove the OSD:
.. prompt:: bash $ .. prompt:: bash $
ceph osd rm {osd-num} ceph osd rm {osd-num}
for example: For example:
.. prompt:: bash $ .. prompt:: bash $

View File

@ -3,14 +3,15 @@
Balancer Balancer
======== ========
The *balancer* can optimize the placement of PGs across OSDs in The *balancer* can optimize the allocation of placement groups (PGs) across
order to achieve a balanced distribution, either automatically or in a OSDs in order to achieve a balanced distribution. The balancer can operate
supervised fashion. either automatically or in a supervised fashion.
Status Status
------ ------
The current status of the balancer can be checked at any time with: To check the current status of the balancer, run the following command:
.. prompt:: bash $ .. prompt:: bash $
@ -20,70 +21,78 @@ The current status of the balancer can be checked at any time with:
Automatic balancing Automatic balancing
------------------- -------------------
The automatic balancing feature is enabled by default in ``upmap`` When the balancer is in ``upmap`` mode, the automatic balancing feature is
mode. Please refer to :ref:`upmap` for more details. The balancer can be enabled by default. For more details, see :ref:`upmap`. To disable the
turned off with: balancer, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph balancer off ceph balancer off
The balancer mode can be changed to ``crush-compat`` mode, which is The balancer mode can be changed from ``upmap`` mode to ``crush-compat`` mode.
backward compatible with older clients, and will make small changes to ``crush-compat`` mode is backward compatible with older clients. In
the data distribution over time to ensure that OSDs are equally utilized. ``crush-compat`` mode, the balancer automatically makes small changes to the
data distribution in order to ensure that OSDs are utilized equally.
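For example, to check the balancer, switch it to ``crush-compat`` mode, and turn it back on (all standard ``ceph balancer`` subcommands, shown here only as an illustration):

.. prompt:: bash $

   ceph balancer status
   ceph balancer mode crush-compat
   ceph balancer on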
Throttling Throttling
---------- ----------
No adjustments will be made to the PG distribution if the cluster is If the cluster is degraded (that is, if an OSD has failed and the system hasn't
degraded (e.g., because an OSD has failed and the system has not yet healed itself yet), then the balancer will not make any adjustments to the PG
healed itself). distribution.
When the cluster is healthy, the balancer will throttle its changes When the cluster is healthy, the balancer will incrementally move a small
such that the percentage of PGs that are misplaced (i.e., that need to fraction of unbalanced PGs in order to improve distribution. This fraction
be moved) is below a threshold of (by default) 5%. The will not exceed a certain threshold that defaults to 5%. To adjust this
``target_max_misplaced_ratio`` threshold can be adjusted with: ``target_max_misplaced_ratio`` threshold setting, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph config set mgr target_max_misplaced_ratio .07 # 7% ceph config set mgr target_max_misplaced_ratio .07 # 7%
Set the number of seconds to sleep in between runs of the automatic balancer: The balancer sleeps between runs. To set the number of seconds for this
interval of sleep, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph config set mgr mgr/balancer/sleep_interval 60 ceph config set mgr mgr/balancer/sleep_interval 60
Set the time of day to begin automatic balancing in HHMM format: To set the time of day (in HHMM format) at which automatic balancing begins,
run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph config set mgr mgr/balancer/begin_time 0000 ceph config set mgr mgr/balancer/begin_time 0000
Set the time of day to finish automatic balancing in HHMM format: To set the time of day (in HHMM format) at which automatic balancing ends, run
the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph config set mgr mgr/balancer/end_time 2359 ceph config set mgr mgr/balancer/end_time 2359
Restrict automatic balancing to this day of the week or later. Automatic balancing can be restricted to certain days of the week. To restrict
Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on: it to a specific day of the week or later (as with crontab, ``0`` is Sunday,
``1`` is Monday, and so on), run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph config set mgr mgr/balancer/begin_weekday 0 ceph config set mgr mgr/balancer/begin_weekday 0
Restrict automatic balancing to this day of the week or earlier. To restrict automatic balancing to a specific day of the week or earlier
Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on: (again, ``0`` is Sunday, ``1`` is Monday, and so on), run the following
command:
.. prompt:: bash $ .. prompt:: bash $
ceph config set mgr mgr/balancer/end_weekday 6 ceph config set mgr mgr/balancer/end_weekday 6
Pool IDs to which the automatic balancing will be limited. Automatic balancing can be restricted to certain pools. By default, the value
The default for this is an empty string, meaning all pools will be balanced. of this setting is an empty string, so that all pools are automatically
The numeric pool IDs can be gotten with the :command:`ceph osd pool ls detail` command: balanced. To restrict automatic balancing to specific pools, retrieve their
numeric pool IDs (by running the :command:`ceph osd pool ls detail` command),
and then run the following command:
.. prompt:: bash $ .. prompt:: bash $
@ -93,43 +102,41 @@ The numeric pool IDs can be gotten with the :command:`ceph osd pool ls detail` c
Modes
-----

There are two supported balancer modes:

#. **crush-compat**. This mode uses the compat weight-set feature (introduced
   in Luminous) to manage an alternative set of weights for devices in the
   CRUSH hierarchy. When the balancer is operating in this mode, the normal
   weights should remain set to the size of the device in order to reflect the
   target amount of data intended to be stored on the device. The balancer will
   then optimize the weight-set values, adjusting them up or down in small
   increments, in order to achieve a distribution that matches the target
   distribution as closely as possible. (Because PG placement is a pseudorandom
   process, it is subject to a natural amount of variation; optimizing the
   weights serves to counteract that natural variation.)

   Note that this mode is *fully backward compatible* with older clients: when
   an OSD Map and CRUSH map are shared with older clients, Ceph presents the
   optimized weights as the "real" weights.

   The primary limitation of this mode is that the balancer cannot handle
   multiple CRUSH hierarchies with different placement rules if the subtrees of
   the hierarchy share any OSDs. (Such sharing of OSDs is not typical and,
   because of the difficulty of managing the space utilization on the shared
   OSDs, is generally not recommended.)

#. **upmap**. In Luminous and later releases, the OSDMap can store explicit
   mappings for individual OSDs as exceptions to the normal CRUSH placement
   calculation. These ``upmap`` entries provide fine-grained control over the
   PG mapping. This balancer mode optimizes the placement of individual PGs in
   order to achieve a balanced distribution. In most cases, the resulting
   distribution is nearly perfect: that is, there is an equal number of PGs on
   each OSD (±1 PG, since the total number might not divide evenly).

   To use ``upmap``, all clients must be Luminous or newer.

The default mode is ``upmap``. The mode can be changed to ``crush-compat`` by
running the following command:

.. prompt:: bash $
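   # Assumed completion (not in the original excerpt): switch the balancer
   # to crush-compat mode; "ceph balancer mode upmap" switches back
   ceph balancer mode crush-compat

Note that the default ``upmap`` mode can only be used once the cluster no
longer permits pre-Luminous clients. A commonly used command for enforcing
this (mentioned here as a hedged sketch, not as part of the original text) is:

.. prompt:: bash $

   ceph osd set-require-min-compat-client luminous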
Supervised optimization
-----------------------

Supervised use of the balancer can be understood in terms of three distinct
phases:

#. building a plan
#. evaluating the quality of the data distribution, either for the current PG
   distribution or for the PG distribution that would result after executing a
   plan
#. executing the plan

To evaluate the current distribution, run the following command:

.. prompt:: bash $

   ceph balancer eval

To evaluate the distribution for a single pool, run the following command:

.. prompt:: bash $

   ceph balancer eval <pool-name>

To see the evaluation in greater detail, run the following command:

.. prompt:: bash $

   ceph balancer eval-verbose ...

To instruct the balancer to generate a plan (using the currently configured
mode), make up a name (any useful identifying string) for the plan, and run the
following command:

.. prompt:: bash $

   ceph balancer optimize <plan-name>

To see the contents of a plan, run the following command:

.. prompt:: bash $

   ceph balancer show <plan-name>

To display all plans, run the following command:

.. prompt:: bash $

   ceph balancer ls

To discard an old plan, run the following command:

.. prompt:: bash $

   ceph balancer rm <plan-name>

To see currently recorded plans, examine the output of the following status
command:

.. prompt:: bash $

   ceph balancer status

To evaluate the distribution that would result from executing a specific plan,
run the following command:

.. prompt:: bash $

   ceph balancer eval <plan-name>

If a plan is expected to improve the distribution (that is, the plan's score is
lower than the current cluster state's score), you can execute that plan by
running the following command:

.. prompt:: bash $

   ceph balancer execute <plan-name>
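Putting the phases together, a minimal supervised cycle might look like the
following sketch. The plan name ``myplan`` is a hypothetical placeholder, and
the scores reported by ``eval`` will differ from cluster to cluster:

.. prompt:: bash $

   ceph balancer eval              # note the current score (lower is better)
   ceph balancer optimize myplan   # build a plan using the configured mode
   ceph balancer show myplan       # inspect the proposed changes
   ceph balancer eval myplan       # compare the plan's score to the current one
   ceph balancer execute myplan    # apply the plan only if its score is lower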

.. _rados_operations_bluestore_migration:
=====================
 BlueStore Migration
=====================

Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph
cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs.
Because BlueStore is superior to Filestore in performance and robustness, and
because Filestore is not supported by Ceph releases beginning with Reef, users
deploying Filestore OSDs should transition to BlueStore. There are several
strategies for making the transition to BlueStore.

BlueStore is so different from Filestore that an individual OSD cannot be
converted in place. Instead, the conversion process must use either (1) the
cluster's normal replication and healing support, or (2) tools and strategies
that copy OSD content from an old (Filestore) device to a new (BlueStore) one.

Deploying new OSDs with BlueStore
=================================

Use BlueStore when deploying new OSDs (for example, when the cluster is
expanded). Because this is the default behavior, no specific change is
needed.

Similarly, use BlueStore for any OSDs that have been reprovisioned after
a failed drive was replaced.

Converting existing OSDs
========================

"Mark-``out``" replacement
--------------------------

The simplest approach is to verify that the cluster is healthy and
then follow these steps for each Filestore OSD in succession: mark the OSD
``out``, wait for the data to replicate across the cluster, reprovision the OSD,
mark the OSD back ``in``, and wait for recovery to complete before proceeding
to the next OSD. This approach is easy to automate, but it entails unnecessary
data migration that carries costs in time and SSD wear.
#. Identify a Filestore OSD to replace::

     ID=<osd-id-number>
     DEVICE=<disk-device>

#. Determine whether a given OSD is Filestore or BlueStore:

   .. prompt:: bash $

      ceph osd metadata $ID | grep osd_objectstore

#. Get a current count of Filestore and BlueStore OSDs:

   .. prompt:: bash $

      ceph osd count-metadata osd_objectstore

#. Mark a Filestore OSD ``out``:

   .. prompt:: bash $

      ceph osd out $ID

#. Wait for the data to migrate off this OSD:

   .. prompt:: bash $

      while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done

#. Stop the OSD:

   .. prompt:: bash $

      systemctl kill ceph-osd@$ID

   .. _osd_id_retrieval:

#. Note which device the OSD is using:

   .. prompt:: bash $

      mount | grep /var/lib/ceph/osd/ceph-$ID

#. Unmount the OSD:

   .. prompt:: bash $

      umount /var/lib/ceph/osd/ceph-$ID

#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! These commands will destroy
   the contents of the device; you must be certain that the data on the device
   is not needed (in other words, that the cluster is healthy) before
   proceeding:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
   reprovisioned with the same OSD ID):

   .. prompt:: bash $

      ceph osd destroy $ID --yes-i-really-mean-it

#. Provision a BlueStore OSD in place by using the same OSD ID. This requires
   you to identify which device to wipe, and to make certain that you target
   the correct and intended device, using the information that was retrieved in
   the :ref:`"Note which device the OSD is using" <osd_id_retrieval>` step. BE
   CAREFUL! Note that you may need to modify these commands when dealing with
   hybrid OSDs:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID

#. Repeat.

You may opt to (1) have the balancing of the replacement BlueStore OSD take
place concurrently with the draining of the next Filestore OSD, or instead
(2) follow the same procedure for multiple OSDs in parallel. In either case,
however, you must ensure that the cluster is fully clean (in other words, that
all data has all replicas) before destroying any OSDs. If you opt to reprovision
multiple OSDs in parallel, be **very** careful to destroy OSDs only within a
single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to
satisfy this requirement will reduce the redundancy and availability of your
data and increase the risk of data loss (or even guarantee data loss).
Advantages:

Disadvantages:

* Data is copied over the network twice: once to another OSD in the cluster (to
  maintain the specified number of replicas), and again back to the
  reprovisioned BlueStore OSD.
"Whole host" replacement
------------------------
Whole host replacement If you have a spare host in the cluster, or sufficient free space to evacuate
---------------------- an entire host for use as a spare, then the conversion can be done on a
host-by-host basis so that each stored copy of the data is migrated only once.
If you have a spare host in the cluster, or have sufficient free space To use this approach, you need an empty host that has no OSDs provisioned.
to evacuate an entire host in order to use it as a spare, then the There are two ways to do this: either by using a new, empty host that is not
conversion can be done on a host-by-host basis with each stored copy of yet part of the cluster, or by offloading data from an existing host that is
the data migrating only once. already part of the cluster.
First, you need an empty host that has no OSDs provisioned. There are two Using a new, empty host
ways to do this: either by starting with a new, empty host that isn't yet ^^^^^^^^^^^^^^^^^^^^^^^
part of the cluster, or by offloading data from an existing host in the cluster.
Use a new, empty host Ideally the host will have roughly the same capacity as each of the other hosts
^^^^^^^^^^^^^^^^^^^^^ you will be converting. Add the host to the CRUSH hierarchy, but do not attach
it to the root:
Ideally the host should have roughly the
same capacity as other hosts you will be converting.
Add the host to the CRUSH hierarchy, but do not attach it to the root:
.. prompt:: bash $ .. prompt:: bash $
@ -162,14 +165,13 @@ Add the host to the CRUSH hierarchy, but do not attach it to the root:
Make sure that Ceph packages are installed on the new host.

Using an existing host
^^^^^^^^^^^^^^^^^^^^^^

If you would like to use an existing host that is already part of the cluster,
and if there is sufficient free space on that host so that all of its data can
be migrated off to other cluster hosts, you can do the following (instead of
using a new, empty host):

.. prompt:: bash $

   ceph osd crush unlink $OLDHOST default
where "default" is the immediate ancestor in the CRUSH map. (For where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now be "default", but it might instead be a rack name.) You should now
see the host at the top of the OSD tree output with no parent: see the host at the top of the OSD tree output with no parent:
.. prompt:: bash $ .. prompt:: bash $
@ -199,15 +201,18 @@ see the host at the top of the OSD tree output with no parent:
2 ssd 1.00000 osd.2 up 1.00000 1.00000 2 ssd 1.00000 osd.2 up 1.00000 1.00000
... ...
If everything looks good, jump directly to the "Wait for data If everything looks good, jump directly to the :ref:`"Wait for the data
migration to complete" step below and proceed from there to clean up migration to complete" <bluestore_data_migration_step>` step below and proceed
the old OSDs. from there to clean up the old OSDs.
Migration process
^^^^^^^^^^^^^^^^^

If you're using a new host, start at :ref:`the first step
<bluestore_migration_process_first_step>`. If you're using an existing host,
jump to :ref:`this step <bluestore_data_migration_step>`.

.. _bluestore_migration_process_first_step:

#. Provision new BlueStore OSDs for all devices:

   .. prompt:: bash $

      ceph-volume lvm create --bluestore --data /dev/$DEVICE

#. Verify that the new OSDs have joined the cluster:

   .. prompt:: bash $

      ceph osd tree

   You should see the new host ``$NEWHOST`` with all of the OSDs beneath
   it, but the host should *not* be nested beneath any other node in the
   hierarchy (like ``root default``). For example, if ``newhost`` is
   the empty host, you might see something like::
#. Swap the contents of the old and new hosts:

   .. prompt:: bash $

      ceph osd crush swap-bucket $NEWHOST $OLDHOST

   At this point all data on ``$OLDHOST`` will begin migrating to the OSDs on
   ``$NEWHOST``. If there is a difference between the total capacity of the
   old hosts and the total capacity of the new hosts, you may also see some
   data migrate to or from other nodes in the cluster. Provided that the hosts
   are similarly sized, however, this will be a relatively small amount of
   data.

   .. _bluestore_data_migration_step:

#. Wait for the data migration to complete:

   .. prompt:: bash $
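      # Assumed completion (not in the original excerpt): one plausible way to
      # wait, using the standard "ceph osd safe-to-destroy" and
      # "ceph osd ls-tree" commands available in recent releases
      while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done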
#. Destroy and purge the old OSDs:

   .. prompt:: bash $

      for osd in `ceph osd ls-tree $OLDHOST`; do
          ceph osd purge $osd --yes-i-really-mean-it
      done
#. Wipe the old OSDs. This requires you to identify which devices are to be
   wiped manually. BE CAREFUL! For each device:

   .. prompt:: bash $

      ceph-volume lvm zap $DEVICE

#. Use the now-empty host as the new host, and repeat:

   .. prompt:: bash $
Advantages:

* Data is copied over the network only once.
* An entire host's OSDs are converted at once.
* Can be parallelized, to make possible the conversion of multiple hosts at
  the same time.
* No host involved in this process needs to have a spare device.

Disadvantages:
* Migrating an entire host's worth of OSDs at the same time is likely to
  impact overall cluster performance.
* All migrated data still makes one full hop over the network.
Per-OSD device copy
-------------------

A single logical OSD can be converted by using the ``copy`` function included
in ``ceph-objectstore-tool``. This requires that the host have one or more free
devices to provision a new, empty BlueStore OSD. For example, if each host in
your cluster has twelve OSDs, then you need a thirteenth unused device so that
each OSD can be converted before the previous OSD's device is reclaimed to
convert the next OSD.

Caveats:

* This approach requires that we prepare an empty BlueStore OSD but that we do
  not allocate a new OSD ID to it. The ``ceph-volume`` tool does not support
  such an operation. **IMPORTANT:** because the setup of *dmcrypt* is closely
  tied to the identity of the OSD, this approach does not work with encrypted
  OSDs.

* The device must be manually partitioned.

* An unsupported user-contributed script that demonstrates this process may be
  found here:
  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

Advantages:

* Provided that the ``noout`` or the ``norecover``/``norebalance`` flags are
  set on the OSD or the cluster while the conversion process is underway,
  little or no data migrates over the network during the conversion.

Disadvantages:

* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
  with the OSD in their acting set may not be ideally redundant until the
  subject OSD comes up and recovers. This increases the risk of data
  loss due to an overlapping failure. However, if another OSD fails before
  conversion and startup have completed, the original Filestore OSD can be
  started to provide access to its original data.
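As a concrete illustration of the flags mentioned in the advantages above, the
cluster-wide ``noout`` flag can be set for the duration of the conversion and
cleared afterwards. This is a minimal sketch using standard commands; the
middle step stands in for the per-OSD copy itself:

.. prompt:: bash $

   ceph osd set noout      # prevent down OSDs from being marked out during the copy
   # ... perform the ceph-objectstore-tool copy while the OSD is down ...
   ceph osd unset noout    # restore normal behavior once the OSD is back up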

===============
 Cache Tiering
===============

.. warning:: Cache tiering has been deprecated in the Reef release as it
   has lacked a maintainer for a very long time. This does not mean
   it will certainly be removed, but we may choose to remove it
   without much further notice.

A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a

.. _changing_monitor_elections:

=========================================
 Configuring Monitor Election Strategies
=========================================

By default, the monitors are in ``classic`` mode. We recommend staying in this
mode unless you have a very specific reason.

If you want to switch modes BEFORE constructing the cluster, change the ``mon
election default strategy`` option. This option takes an integer value:

* ``1`` for ``classic``
* ``2`` for ``disallow``
* ``3`` for ``connectivity``

After your cluster has started running, you can change strategies by running a
command of the following form::

   $ ceph mon set election_strategy {classic|disallow|connectivity}

Choosing a mode
===============

The modes other than ``classic`` provide specific features. We recommend
staying in ``classic`` mode if you don't need these extra features because it
is the simplest mode.

.. _rados_operations_disallow_mode:

Disallow Mode
=============

The ``disallow`` mode allows you to mark monitors as disallowed. Disallowed
monitors participate in the quorum and serve clients, but cannot be elected
leader. You might want to use this mode for monitors that are far away from
clients.

To disallow a monitor from being elected leader, run a command of the following
form:

.. prompt:: bash $

   ceph mon add disallowed_leader {name}

To remove a monitor from the disallowed list and allow it to be elected leader,
run a command of the following form:

.. prompt:: bash $

   ceph mon rm disallowed_leader {name}

To see the list of disallowed leaders, examine the output of the following
command:

.. prompt:: bash $

   ceph mon dump

Connectivity Mode
=================

The ``connectivity`` mode evaluates connection scores that are provided by each
monitor for its peers and elects the monitor with the highest score. This mode
is designed to handle network partitioning (also called *net-splits*): network
partitioning might occur if your cluster is stretched across multiple data
centers or otherwise has a non-uniform or unbalanced network topology.

The ``connectivity`` mode also supports disallowing monitors from being elected
leader by using the same commands that were presented in :ref:`Disallow Mode
<rados_operations_disallow_mode>`.

Examining connectivity scores
=============================

The monitors maintain connection scores even if they aren't in ``connectivity``
mode. To examine a specific monitor's connection scores, run a command of the
following form:

.. prompt:: bash $

   ceph daemon mon.{name} connection scores dump

Scores for an individual connection range from ``0`` to ``1`` inclusive and
include whether the connection is considered alive or dead (as determined by
whether it returned its latest ping before timeout).

Connectivity scores are expected to remain valid. However, if during
troubleshooting you determine that these scores have for some reason become
invalid, drop the history and reset the scores by running a command of the
following form:

.. prompt:: bash $

   ceph daemon mon.{name} connection scores reset

Resetting connectivity scores carries little risk: monitors will still quickly
determine whether a connection is alive or dead and trend back to the previous
scores if those scores were accurate. Nevertheless, resetting scores ought to
be unnecessary and it is not recommended unless advised by your support team
or by a developer.
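To confirm which election strategy is currently in effect, the monitor map can
be inspected. A minimal sketch follows; it assumes a release recent enough to
report the ``election_strategy`` field in the ``ceph mon dump`` output:

.. prompt:: bash $

   ceph mon dump | grep election_strategy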

Monitor Commands
================

To issue monitor commands, use the ``ceph`` utility:

.. prompt:: bash $

   ceph [-m monhost] {command}

In most cases, monitor commands have the following form:

.. prompt:: bash $

   ceph {subsystem} {command}

System Commands
===============

To display the current cluster status, run the following commands:

.. prompt:: bash $

   ceph -s
   ceph status

To display a running summary of cluster status and major events, run the
following command:

.. prompt:: bash $

   ceph -w

To display the monitor quorum, including which monitors are participating and
which one is the leader, run the following commands:

.. prompt:: bash $

   ceph mon stat
   ceph quorum_status

To query the status of a single monitor, including whether it is in the quorum,
run the following command:

.. prompt:: bash $

   ceph tell mon.[id] mon_status

Here the value of ``[id]`` can be found by consulting the output of ``ceph
-s``.

Authentication Subsystem
========================

To add an OSD keyring for a specific OSD, run the following command:

.. prompt:: bash $

   ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring}

To list the cluster's keys and their capabilities, run the following command:

.. prompt:: bash $

   ceph auth ls
Placement Group Subsystem
=========================

To display the statistics for all placement groups (PGs), run the following
command:

.. prompt:: bash $

   ceph pg dump [--format {format}]

Here the valid formats are ``plain`` (default), ``json``, ``json-pretty``,
``xml``, and ``xml-pretty``. When implementing monitoring tools and other
tools, it is best to use the ``json`` format. JSON parsing is more
deterministic than the ``plain`` format (which is more human readable), and the
layout is much more consistent from release to release. The ``jq`` utility is
very useful for extracting data from JSON output.
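For example, a quick way to pull out just the PG IDs and their states from the
JSON dump might look like the following sketch. Note that the exact location of
the ``pg_stats`` array in the JSON output varies between Ceph releases, so
treat the ``jq`` filter as an assumption to verify against your own output
rather than a fixed recipe:

.. prompt:: bash $

   ceph pg dump --format json | jq -r '.pg_map.pg_stats[] | "\(.pgid) \(.state)"'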
To display the statistics for all PGs stuck in a specified state, run the
following command:

.. prompt:: bash $

   ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}]

Here ``--format`` may be ``plain`` (default), ``json``, ``json-pretty``,
``xml``, or ``xml-pretty``.

The ``--threshold`` argument determines the time interval (in seconds) for a PG
to be considered ``stuck`` (default: 300).

PGs might be stuck in any of the following states:

**Inactive**
    PGs are unable to process reads or writes because they are waiting for an
    OSD that has the most up-to-date data to return to an ``up`` state.

**Unclean**
    PGs contain objects that have not been replicated the desired number of
    times. These PGs have not yet completed the process of recovering.

**Stale**
    PGs are in an unknown state, because the OSDs that host them have not
    reported to the monitor cluster for a certain period of time (specified by
    the ``mon_osd_report_timeout`` configuration setting).

To delete a ``lost`` object or revert an object to its prior state, either by
reverting it to its previous version or by deleting it because it was just
created and has no previous version, run the following command:

.. prompt:: bash $

   ceph pg {pgid} mark_unfound_lost revert|delete
OSD Subsystem
=============

To query OSD subsystem status, run the following command:

.. prompt:: bash $

   ceph osd stat

To write a copy of the most recent OSD map to a file (see :ref:`osdmaptool
<osdmaptool>`), run the following command:

.. prompt:: bash $

   ceph osd getmap -o file

To write a copy of the CRUSH map from the most recent OSD map to a file, run
the following command:

.. prompt:: bash $

   ceph osd getcrushmap -o file

Note that this command is functionally equivalent to the following two
commands:

.. prompt:: bash $

   ceph osd getmap -o /tmp/osdmap
   osdmaptool /tmp/osdmap --export-crush file

To dump the OSD map, run the following command:

.. prompt:: bash $

   ceph osd dump [--format {format}]

The ``--format`` option accepts the following arguments: ``plain`` (default),
``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON is
the recommended format for tools, scripting, and other forms of automation.

To dump the OSD map as a tree that lists one OSD per line and displays
information about the weights and states of the OSDs, run the following
command:

.. prompt:: bash $

   ceph osd tree [--format {format}]

To find out where a specific RADOS object is stored in the system, run a
command of the following form:

.. prompt:: bash $

   ceph osd map <pool-name> <object-name>

To add or move a new OSD (specified by its ID, name, or weight) to a specific
CRUSH location, run the following command:

.. prompt:: bash $

   ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]]

To remove an existing OSD from the CRUSH map, run the following command:

.. prompt:: bash $

   ceph osd crush remove {name}

To remove an existing bucket from the CRUSH map, run the following command:

.. prompt:: bash $

   ceph osd crush remove {bucket-name}

To move an existing bucket from one position in the CRUSH hierarchy to another,
run the following command:

.. prompt:: bash $

   ceph osd crush move {id} {loc1} [{loc2} ...]

To set the CRUSH weight of a specific OSD (specified by ``{name}``) to
``{weight}``, run the following command:

.. prompt:: bash $

   ceph osd crush reweight {name} {weight}

To mark an OSD as ``lost``, run the following command:

.. prompt:: bash $

   ceph osd lost {id} [--yes-i-really-mean-it]

.. warning::
   This could result in permanent data loss. Use with caution!

To create a new OSD, run the following command:

.. prompt:: bash $

   ceph osd create [{uuid}]

If no UUID is given as part of this command, the UUID will be set automatically
when the OSD starts up.

To remove one or more specific OSDs, run the following command:

.. prompt:: bash $

   ceph osd rm [{id}...]

To display the current ``max_osd`` parameter in the OSD map, run the following
command:

.. prompt:: bash $

   ceph osd getmaxosd

To import a specific CRUSH map, run the following command:

.. prompt:: bash $

   ceph osd setcrushmap -i file

To set the ``max_osd`` parameter in the OSD map, run the following command:

.. prompt:: bash $

   ceph osd setmaxosd

The parameter has a default value of 10000. Most operators will never need to
adjust it.

To mark a specific OSD ``down``, run the following command:

.. prompt:: bash $

   ceph osd down {osd-num}

To mark a specific OSD ``out`` (so that no data will be allocated to it), run
the following command:

.. prompt:: bash $

   ceph osd out {osd-num}

To mark a specific OSD ``in`` (so that data will be allocated to it), run the
following command:

.. prompt:: bash $

   ceph osd in {osd-num}

By using the "pause flags" in the OSD map, you can pause or unpause I/O
requests. If the flags are set, then no I/O requests will be sent to any OSD.
When the flags are cleared, then pending I/O requests will be resent. To set or
clear pause flags, run one of the following commands:

.. prompt:: bash $

   ceph osd pause
   ceph osd unpause
You can assign an override or ``reweight`` weight value to a specific OSD if
the normal CRUSH distribution seems to be suboptimal. The weight of an OSD
helps determine the extent of its I/O requests and data storage: two OSDs with
the same weight will receive approximately the same number of I/O requests and
store approximately the same amount of data. The ``ceph osd reweight`` command
assigns an override weight to an OSD. The weight value is in the range 0 to 1,
and the command forces CRUSH to relocate a certain amount (1 - ``weight``) of
the data that would otherwise be on this OSD. The command does not change the
weights of the buckets above the OSD in the CRUSH map. Using the command is
merely a corrective measure: for example, if one of your OSDs is at 90% and the
others are at 50%, you could reduce the outlier weight to correct this
imbalance. To assign an override weight to a specific OSD, run the following
command:

.. prompt:: bash $

   ceph osd reweight {osd-num} {weight}
.. note:: Any assigned override reweight value will conflict with the balancer.
   This means that if the balancer is in use, all override reweight values
   should be ``1.0000`` in order to avoid suboptimal cluster behavior.

A cluster's OSDs can be reweighted in order to maintain balance if some OSDs
are being disproportionately utilized. Note that override or ``reweight``
weights have values relative to one another that default to 1.00000; their
values are not absolute, and these weights must be distinguished from CRUSH
weights (which reflect the absolute capacity of a bucket, as measured in TiB).
To reweight OSDs by utilization, run the following command:

.. prompt:: bash $

   ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing]
By default, this command adjusts the override weight of OSDs that have ±20% of
the average utilization, but you can specify a different percentage in the
``threshold`` argument.

To limit the increment by which any OSD's reweight is to be changed, use the
``max_change`` argument (default: 0.05). To limit the number of OSDs that are
to be adjusted, use the ``max_osds`` argument (default: 4). Increasing these
variables can accelerate the reweighting process, but perhaps at the cost of
slower client operations (as a result of the increase in data movement).

You can test the ``osd reweight-by-utilization`` command before running it. To
find out which and how many PGs and OSDs will be affected by a specific use of
the ``osd reweight-by-utilization`` command, run the following command:

.. prompt:: bash $

   ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing]

The ``--no-increasing`` option can be added to the ``reweight-by-utilization``
and ``test-reweight-by-utilization`` commands in order to prevent any override
weights that are currently less than 1.00000 from being increased. This option
can be useful in certain circumstances: for example, when you are hastily
balancing in order to remedy ``full`` or ``nearfull`` OSDs, or when there are
OSDs being evacuated or slowly brought into service.
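For example, to preview an adjustment that treats OSDs more than 10% above the
average utilization as overloaded, limits each change to 0.05, and touches at
most 8 OSDs, you could run the following command. The values shown are
illustrative, not recommendations:

.. prompt:: bash $

   ceph osd test-reweight-by-utilization 110 0.05 8 --no-increasing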
Operators of deployments that utilize Nautilus or newer (or later revisions of
Luminous and Mimic) and that have no pre-Luminous clients might instead want to
enable the ``balancer`` module for ``ceph-mgr``.
The blocklist can be modified by adding or removing an IP address or a CIDR
range. If an address is blocklisted, it will be unable to connect to any OSD.
If an OSD is contained within an IP address or CIDR range that has been
blocklisted, the OSD will be unable to perform operations on its peers when it
acts as a client: such blocked operations include tiering and copy-from
functionality. To add or remove an IP address or CIDR range to or from the
blocklist, run one of the following commands:

.. prompt:: bash $

   ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME]
   ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits]

If you add something to the blocklist with the above ``add`` command, you can
use the ``TIME`` keyword to specify the length of time (in seconds) that it
will remain on the blocklist (default: one hour). To add or remove a CIDR
range, use the ``range`` keyword in the above commands.

Note that these commands are useful primarily in failure testing. Under normal
conditions, blocklists are maintained automatically and do not need any manual
intervention.
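For example, to blocklist a single client address for ten minutes and to
blocklist an entire subnet, you might run commands like the following. The
addresses shown are hypothetical placeholders:

.. prompt:: bash $

   ceph osd blocklist add 192.168.0.45 600
   ceph osd blocklist range add 192.168.1.0/24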
To create or delete a snapshot of a specific storage pool, run one of the
following commands:

.. prompt:: bash $

   ceph osd pool mksnap {pool-name} {snap-name}
   ceph osd pool rmsnap {pool-name} {snap-name}

To create, delete, or rename a specific storage pool, run one of the following
commands:

.. prompt:: bash $

   ceph osd pool create {pool-name} [pg_num [pgp_num]]
   ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
   ceph osd pool rename {old-name} {new-name}
To change a pool setting, run the following command:

.. prompt:: bash $

   ceph osd pool set {pool-name} {field} {value}

The following are valid fields:

* ``size``: The number of copies of data in the pool.
* ``pg_num``: The PG number.
* ``pgp_num``: The effective number of PGs when calculating placement.
* ``crush_rule``: The rule number for mapping placement.
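For example, to set the replica count of a hypothetical pool named ``mypool``
to 3, you could run:

.. prompt:: bash $

   ceph osd pool set mypool size 3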
To retrieve the value of a pool setting, run the following command:

.. prompt:: bash $

   ceph osd pool get {pool-name} {field}

Valid fields are:

* ``pg_num``: The PG number.
* ``pgp_num``: The effective number of PGs when calculating placement.
To send a scrub command to a specific OSD, or to all OSDs (by using ``*``), run
the following command:

.. prompt:: bash $

   ceph osd scrub {osd-num}

To send a repair command to a specific OSD, or to all OSDs (by using ``*``),
run the following command:

.. prompt:: bash $

   ceph osd repair N

You can run a simple throughput benchmark test against a specific OSD. This
test writes a total size of ``TOTAL_DATA_BYTES`` (default: 1 GB) incrementally,
in multiple write requests that each have a size of ``BYTES_PER_WRITE``
(default: 4 MB). The test is not destructive and it will not overwrite existing
live OSD data, but it might temporarily affect the performance of clients that
are concurrently accessing the OSD. To launch this benchmark test, run the
following command:

.. prompt:: bash $

   ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
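For example, to write 1 GB to ``osd.0`` in 4 MB requests (these happen to be
the defaults, written out explicitly here for illustration), you could run:

.. prompt:: bash $

   ceph tell osd.0 bench 1073741824 4194304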
To clear the caches of a specific OSD during the interval between one benchmark
run and another, run the following command:

.. prompt:: bash $

   ceph tell osd.N cache drop

To retrieve the cache statistics of a specific OSD, run the following command:

.. prompt:: bash $

   ceph tell osd.N cache status
MDS Subsystem
=============

To change the configuration parameters of a running metadata server, run a
command of the following form:

.. prompt:: bash $

   ceph tell mds.{mds-id} config set {setting} {value}

Example:

.. prompt:: bash $

   ceph tell mds.0 config set debug_ms 1

This example enables debug messages on the metadata server.

To display the status of all metadata servers, run the following command:

.. prompt:: bash $

   ceph mds stat

To mark the active metadata server as failed (and to trigger failover to a
standby if a standby is present), run the following command:

.. prompt:: bash $

   ceph mds fail 0

.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
Mon Subsystem
=============

To display monitor statistics, run the following command:

.. prompt:: bash $

   ceph mon stat

This command returns output similar to the following:

::

  e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c

There is a ``quorum`` list at the end of the output. It lists those monitor
nodes that are part of the current quorum.

To retrieve this information in a more direct way, run the following command:

.. prompt:: bash $

   ceph quorum_status -f json-pretty

This command returns output similar to the following:

.. code-block:: javascript

  {
The above will block until a quorum is reached. The above will block until a quorum is reached.
For a status of just a single monitor: To see the status of a specific monitor, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph tell mon.[name] mon_status ceph tell mon.[name] mon_status
where the value of ``[name]`` can be taken from ``ceph quorum_status``. Sample Here the value of ``[name]`` can be found by consulting the output of the
output:: ``ceph quorum_status`` command. This command returns output similar to the
following:
::
{ {
"name": "b", "name": "b",
@ -582,12 +645,14 @@ output::
} }
} }
A dump of the monitor state: To see a dump of the monitor state, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph mon dump ceph mon dump
This command returns output similar to the following:
:: ::
dumped monmap epoch 2 dumped monmap epoch 2
@ -598,4 +663,3 @@ A dump of the monitor state:
0: 127.0.0.1:40000/0 mon.a 0: 127.0.0.1:40000/0 mon.a
1: 127.0.0.1:40001/0 mon.b 1: 127.0.0.1:40001/0 mon.b
2: 127.0.0.1:40002/0 mon.c 2: 127.0.0.1:40002/0 mon.c
View File
@ -1,25 +1,24 @@
Manually editing a CRUSH Map Manually editing the CRUSH Map
============================ ==============================
.. note:: Manually editing the CRUSH map is an advanced .. note:: Manually editing the CRUSH map is an advanced administrator
administrator operation. All CRUSH changes that are operation. For the majority of installations, CRUSH changes can be
necessary for the overwhelming majority of installations are implemented via the Ceph CLI and do not require manual CRUSH map edits. If
possible via the standard ceph CLI and do not require manual you have identified a use case where manual edits *are* necessary with a
CRUSH map edits. If you have identified a use case where recent Ceph release, consider contacting the Ceph developers at dev@ceph.io
manual edits *are* necessary with recent Ceph releases, consider so that future versions of Ceph do not have this problem.
contacting the Ceph developers so that future versions of Ceph
can obviate your corner case.
To edit an existing CRUSH map: To edit an existing CRUSH map, carry out the following procedure:
#. `Get the CRUSH map`_. #. `Get the CRUSH map`_.
#. `Decompile`_ the CRUSH map. #. `Decompile`_ the CRUSH map.
#. Edit at least one of `Devices`_, `Buckets`_ and `Rules`_. #. Edit at least one of the following sections: `Devices`_, `Buckets`_, and
`Rules`_. Use a text editor for this task.
#. `Recompile`_ the CRUSH map. #. `Recompile`_ the CRUSH map.
#. `Set the CRUSH map`_. #. `Set the CRUSH map`_.
For details on setting the CRUSH map rule for a specific pool, see `Set For details on setting the CRUSH map rule for a specific pool, see `Set Pool
Pool Values`_. Values`_.
.. _Get the CRUSH map: #getcrushmap .. _Get the CRUSH map: #getcrushmap
.. _Decompile: #decompilecrushmap .. _Decompile: #decompilecrushmap
@ -32,25 +31,25 @@ Pool Values`_.
.. _getcrushmap: .. _getcrushmap:
Get a CRUSH Map Get the CRUSH Map
--------------- -----------------
To get the CRUSH map for your cluster, execute the following: To get the CRUSH map for your cluster, run a command of the following form:
.. prompt:: bash $ .. prompt:: bash $
ceph osd getcrushmap -o {compiled-crushmap-filename} ceph osd getcrushmap -o {compiled-crushmap-filename}
Ceph will output (-o) a compiled CRUSH map to the filename you specified. Since Ceph outputs (``-o``) a compiled CRUSH map to the filename that you have
the CRUSH map is in a compiled form, you must decompile it first before you can specified. Because the CRUSH map is in a compiled form, you must first
edit it. decompile it before you can edit it.
.. _decompilecrushmap: .. _decompilecrushmap:
Decompile a CRUSH Map Decompile the CRUSH Map
--------------------- -----------------------
To decompile a CRUSH map, execute the following: To decompile the CRUSH map, run a command of the following form:
.. prompt:: bash $ .. prompt:: bash $
@ -58,10 +57,10 @@ To decompile a CRUSH map, execute the following:
.. _compilecrushmap: .. _compilecrushmap:
Recompile a CRUSH Map Recompile the CRUSH Map
--------------------- -----------------------
To compile a CRUSH map, execute the following: To compile the CRUSH map, run a command of the following form:
.. prompt:: bash $ .. prompt:: bash $
@ -72,63 +71,73 @@ To compile a CRUSH map, execute the following:
Set the CRUSH Map Set the CRUSH Map
----------------- -----------------
To set the CRUSH map for your cluster, execute the following: To set the CRUSH map for your cluster, run a command of the following form:
.. prompt:: bash $ .. prompt:: bash $
ceph osd setcrushmap -i {compiled-crushmap-filename} ceph osd setcrushmap -i {compiled-crushmap-filename}
Ceph will load (-i) a compiled CRUSH map from the filename you specified. Ceph loads (``-i``) a compiled CRUSH map from the filename that you have
specified.
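Taken together, a minimal get/decompile/edit/recompile/set round trip might look like the following sketch (the file names ``crushmap.bin``, ``crushmap.txt``, and ``crushmap.new.bin`` are arbitrary examples):

.. prompt:: bash $

   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   # edit crushmap.txt with a text editor
   crushtool -c crushmap.txt -o crushmap.new.bin
   ceph osd setcrushmap -i crushmap.new.bin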
Sections Sections
-------- --------
There are six main sections to a CRUSH Map. A CRUSH map has six main sections:
#. **tunables:** The preamble at the top of the map describes any *tunables* #. **tunables:** The preamble at the top of the map describes any *tunables*
that differ from the historical / legacy CRUSH behavior. These that are not a part of legacy CRUSH behavior. These tunables correct for old
correct for old bugs, optimizations, or other changes that have bugs, optimizations, or other changes that have been made over the years to
been made over the years to improve CRUSH's behavior. improve CRUSH's behavior.
#. **devices:** Devices are individual OSDs that store data. #. **devices:** Devices are individual OSDs that store data.
#. **types**: Bucket ``types`` define the types of buckets used in #. **types**: Bucket ``types`` define the types of buckets that are used in
your CRUSH hierarchy. Buckets consist of a hierarchical aggregation your CRUSH hierarchy.
of storage locations (e.g., rows, racks, chassis, hosts, etc.) and
their assigned weights.
#. **buckets:** Once you define bucket types, you must define each node #. **buckets:** Buckets consist of a hierarchical aggregation of storage
in the hierarchy, its type, and which devices or other nodes it locations (for example, rows, racks, chassis, hosts) and their assigned
weights. After the bucket ``types`` have been defined, the CRUSH map defines
each node in the hierarchy, its type, and which devices or other nodes it
contains. contains.
#. **rules:** Rules define policy about how data is distributed across #. **rules:** Rules define policy about how data is distributed across
devices in the hierarchy. devices in the hierarchy.
#. **choose_args:** Choose_args are alternative weights associated with #. **choose_args:** ``choose_args`` are alternative weights associated with
the hierarchy that have been adjusted to optimize data placement. A single the hierarchy that have been adjusted in order to optimize data placement. A
choose_args map can be used for the entire cluster, or one can be single ``choose_args`` map can be used for the entire cluster, or a number
created for each individual pool. of ``choose_args`` maps can be created such that each map is crafted for a
particular pool.
.. _crushmapdevices: .. _crushmapdevices:
CRUSH Map Devices CRUSH-Map Devices
----------------- -----------------
Devices are individual OSDs that store data. Usually one is defined here for each Devices are individual OSDs that store data. In this section, there is usually
OSD daemon in your one device defined for each OSD daemon in your cluster. Devices are identified
cluster. Devices are identified by an ``id`` (a non-negative integer) and by an ``id`` (a non-negative integer) and a ``name`` (usually ``osd.N``, where
a ``name``, normally ``osd.N`` where ``N`` is the device id. ``N`` is the device's ``id``).
.. _crush-map-device-class: .. _crush-map-device-class:
Devices may also have a *device class* associated with them (e.g., A device can also have a *device class* associated with it: for example,
``hdd`` or ``ssd``), allowing them to be conveniently targeted by a ``hdd`` or ``ssd``. Device classes make it possible for devices to be targeted
crush rule. by CRUSH rules. This means that device classes allow CRUSH rules to select only
OSDs that match certain characteristics. For example, you might want an RBD
pool associated only with SSDs and a different RBD pool associated only with
HDDs.
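For example, instead of editing the map by hand, a replicated rule restricted to a single device class can be created from the CLI and assigned to a pool. The rule name ``fast_ssd`` and the pool name ``rbd_ssd`` below are only illustrative:

.. prompt:: bash $

   ceph osd crush rule create-replicated fast_ssd default host ssd
   ceph osd pool set rbd_ssd crush_rule fast_ssd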
To see a list of devices, run the following command:
.. prompt:: bash # .. prompt:: bash #
devices ceph device ls
The output of this command takes the following form:
:: ::
@ -138,7 +147,7 @@ For example:
.. prompt:: bash # .. prompt:: bash #
devices ceph device ls
:: ::
@ -147,29 +156,31 @@ For example:
device 2 osd.2 device 2 osd.2
device 3 osd.3 device 3 osd.3
In most cases, each device maps to a single ``ceph-osd`` daemon. This In most cases, each device maps to a corresponding ``ceph-osd`` daemon. This
is normally a single storage device, a pair of devices (for example, daemon might map to a single storage device, a pair of devices (for example,
one for data and one for a journal or metadata), or in some cases a one for data and one for a journal or metadata), or in some cases a small RAID
small RAID device. device or a partition of a larger storage device.
CRUSH Map Bucket Types
CRUSH-Map Bucket Types
---------------------- ----------------------
The second list in the CRUSH map defines 'bucket' types. Buckets facilitate The second list in the CRUSH map defines 'bucket' types. Buckets facilitate a
a hierarchy of nodes and leaves. Node (or non-leaf) buckets typically represent hierarchy of nodes and leaves. Node buckets (also known as non-leaf buckets)
physical locations in a hierarchy. Nodes aggregate other nodes or leaves. typically represent physical locations in a hierarchy. Nodes aggregate other
Leaf buckets represent ``ceph-osd`` daemons and their corresponding storage nodes or leaves. Leaf buckets represent ``ceph-osd`` daemons and their
media. corresponding storage media.
.. tip:: The term "bucket" used in the context of CRUSH means a node in .. tip:: In the context of CRUSH, the term "bucket" is used to refer to
the hierarchy, i.e. a location or a piece of physical hardware. It a node in the hierarchy (that is, to a location or a piece of physical
is a different concept from the term "bucket" when used in the hardware). In the context of RADOS Gateway APIs, however, the term
context of RADOS Gateway APIs. "bucket" has a different meaning.
To add a bucket type to the CRUSH map, create a new line under your list of To add a bucket type to the CRUSH map, create a new line under the list of
bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name. bucket types. Enter ``type`` followed by a unique numeric ID and a bucket name.
By convention, there is one leaf bucket and it is ``type 0``; however, you may By convention, there is exactly one leaf bucket type and it is ``type 0``;
give it any name you like (e.g., osd, disk, drive, storage):: however, you may give the leaf bucket any name you like (for example: ``osd``,
``disk``, ``drive``, ``storage``)::
# types # types
type {num} {bucket-name} type {num} {bucket-name}
@ -190,35 +201,34 @@ For example::
type 10 region type 10 region
type 11 root type 11 root
.. _crushmapbuckets: .. _crushmapbuckets:
CRUSH Map Bucket Hierarchy CRUSH-Map Bucket Hierarchy
-------------------------- --------------------------
The CRUSH algorithm distributes data objects among storage devices according The CRUSH algorithm distributes data objects among storage devices according to
to a per-device weight value, approximating a uniform probability distribution. a per-device weight value, approximating a uniform probability distribution.
CRUSH distributes objects and their replicas according to the hierarchical CRUSH distributes objects and their replicas according to the hierarchical
cluster map you define. Your CRUSH map represents the available storage cluster map you define. The CRUSH map represents the available storage devices
devices and the logical elements that contain them. and the logical elements that contain them.
To map placement groups to OSDs across failure domains, a CRUSH map defines a To map placement groups (PGs) to OSDs across failure domains, a CRUSH map
hierarchical list of bucket types (i.e., under ``#types`` in the generated CRUSH defines a hierarchical list of bucket types under ``#types`` in the generated
map). The purpose of creating a bucket hierarchy is to segregate the CRUSH map. The purpose of creating a bucket hierarchy is to segregate the leaf
leaf nodes by their failure domains, such as hosts, chassis, racks, power nodes according to their failure domains (for example: hosts, chassis, racks,
distribution units, pods, rows, rooms, and data centers. With the exception of power distribution units, pods, rows, rooms, and data centers). With the
the leaf nodes representing OSDs, the rest of the hierarchy is arbitrary, and exception of the leaf nodes that represent OSDs, the hierarchy is arbitrary and
you may define it according to your own needs. you may define it according to your own needs.
We recommend adapting your CRUSH map to your firm's hardware naming conventions We recommend adapting your CRUSH map to your preferred hardware-naming
and using instance names that reflect the physical hardware. Your naming conventions and using bucket names that clearly reflect the physical
practice can make it easier to administer the cluster and troubleshoot hardware. Clear naming practice can make it easier to administer the cluster
problems when an OSD and/or other hardware malfunctions and the administrator and easier to troubleshoot problems when OSDs malfunction (or other hardware
need access to physical hardware. malfunctions) and the administrator needs access to physical hardware.
In the following example, the bucket hierarchy has a leaf bucket named ``osd``,
and two node buckets named ``host`` and ``rack`` respectively. In the following example, the bucket hierarchy has a leaf bucket named ``osd``
and two node buckets named ``host`` and ``rack``:
.. ditaa:: .. ditaa::
+-----------+ +-----------+
@ -240,28 +250,32 @@ and two node buckets named ``host`` and ``rack`` respectively.
| Bucket | | Bucket | | Bucket | | Bucket | | Bucket | | Bucket | | Bucket | | Bucket |
+-----------+ +-----------+ +-----------+ +-----------+ +-----------+ +-----------+ +-----------+ +-----------+
.. note:: The higher numbered ``rack`` bucket type aggregates the lower .. note:: The higher-numbered ``rack`` bucket type aggregates the
numbered ``host`` bucket type. lower-numbered ``host`` bucket type.
Since leaf nodes reflect storage devices declared under the ``#devices`` list Because leaf nodes reflect storage devices that have already been declared
at the beginning of the CRUSH map, you do not need to declare them as bucket under the ``#devices`` list at the beginning of the CRUSH map, there is no need
instances. The second lowest bucket type in your hierarchy usually aggregates to declare them as bucket instances. The second-lowest bucket type in your
the devices (i.e., it's usually the computer containing the storage media, and hierarchy is typically used to aggregate the devices (that is, the
uses whatever term you prefer to describe it, such as "node", "computer", second-lowest bucket type is usually the computer that contains the storage
"server," "host", "machine", etc.). In high density environments, it is media and, such as ``node``, ``computer``, ``server``, ``host``, or
increasingly common to see multiple hosts/nodes per chassis. You should account ``machine``). In high-density environments, it is common to have multiple hosts
for chassis failure too--e.g., the need to pull a chassis if a node fails may or nodes in a single chassis (for example, in the cases of blades or twins). It
result in bringing down numerous hosts/nodes and their OSDs. is important to anticipate the potential consequences of chassis failure -- for
example, during the replacement of a chassis in case of a node failure, the
chassis's hosts or nodes (and their associated OSDs) will be in a ``down``
state.
When declaring a bucket instance, you must specify its type, give it a unique To declare a bucket instance, do the following: specify its type, give it a
name (string), assign it a unique ID expressed as a negative integer (optional), unique name (an alphanumeric string), assign it a unique ID expressed as a
specify a weight relative to the total capacity/capability of its item(s), negative integer (this is optional), assign it a weight relative to the total
specify the bucket algorithm (usually ``straw2``), and the hash (usually ``0``, capacity and capability of the item(s) in the bucket, assign it a bucket
reflecting hash algorithm ``rjenkins1``). A bucket may have one or more items. algorithm (usually ``straw2``), and specify the bucket algorithm's hash
The items may consist of node buckets or leaves. Items may have a weight that (usually ``0``, a setting that reflects the hash algorithm ``rjenkins1``). A
reflects the relative weight of the item. bucket may have one or more items. The items may consist of node buckets or
leaves. Items may have a weight that reflects the relative weight of the item.
You may declare a node bucket with the following syntax:: To declare a node bucket, use the following syntax::
[bucket-type] [bucket-name] { [bucket-type] [bucket-name] {
id [a unique negative numeric ID] id [a unique negative numeric ID]
@ -271,8 +285,10 @@ You may declare a node bucket with the following syntax::
item [item-name] weight [weight] item [item-name] weight [weight]
} }
For example, using the diagram above, we would define two host buckets For example, in the above diagram, two host buckets (referred to in the
and one rack bucket. The OSDs are declared as items within the host buckets:: declaration below as ``node1`` and ``node2``) and one rack bucket (referred to
in the declaration below as ``rack1``) are defined. The OSDs are declared as
items within the host buckets::
host node1 { host node1 {
id -1 id -1
@ -298,84 +314,85 @@ and one rack bucket. The OSDs are declared as items within the host buckets::
item node2 weight 2.00 item node2 weight 2.00
} }
.. note:: In the foregoing example, note that the rack bucket does not contain .. note:: In this example, the rack bucket does not contain any OSDs. Instead,
any OSDs. Rather it contains lower level host buckets, and includes the it contains lower-level host buckets and includes the sum of their weight in
sum total of their weight in the item entry. the item entry.
.. topic:: Bucket Types .. topic:: Bucket Types
Ceph supports five bucket types, each representing a tradeoff between Ceph supports five bucket types. Each bucket type provides a balance between
performance and reorganization efficiency. If you are unsure of which bucket performance and reorganization efficiency, and each is different from the
type to use, we recommend using a ``straw2`` bucket. For a detailed others. If you are unsure of which bucket type to use, use the ``straw2``
discussion of bucket types, refer to bucket. For a more technical discussion of bucket types than is offered
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, here, see **Section 3.4** of `CRUSH - Controlled, Scalable, Decentralized
and more specifically to **Section 3.4**. The bucket types are: Placement of Replicated Data`_.
#. **uniform**: Uniform buckets aggregate devices with **exactly** the same The bucket types are as follows:
weight. For example, when firms commission or decommission hardware, they
typically do so with many machines that have exactly the same physical
configuration (e.g., bulk purchases). When storage devices have exactly
the same weight, you may use the ``uniform`` bucket type, which allows
CRUSH to map replicas into uniform buckets in constant time. With
non-uniform weights, you should use another bucket algorithm.
#. **list**: List buckets aggregate their content as linked lists. Based on #. **uniform**: Uniform buckets aggregate devices that have **exactly**
the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`P` algorithm, the same weight. For example, when hardware is commissioned or
a list is a natural and intuitive choice for an **expanding cluster**: decommissioned, it is often done in sets of machines that have exactly
either an object is relocated to the newest device with some appropriate the same physical configuration (this can be the case, for example,
probability, or it remains on the older devices as before. The result is after bulk purchases). When storage devices have exactly the same
optimal data migration when items are added to the bucket. Items removed weight, you may use the ``uniform`` bucket type, which allows CRUSH to
from the middle or tail of the list, however, can result in a significant map replicas into uniform buckets in constant time. If your devices have
amount of unnecessary movement, making list buckets most suitable for non-uniform weights, you should not use the uniform bucket algorithm.
circumstances in which they **never (or very rarely) shrink**.
#. **list**: List buckets aggregate their content as linked lists. The
behavior of list buckets is governed by the :abbr:`RUSH (Replication
Under Scalable Hashing)`:sub:`P` algorithm. In the behavior of this
bucket type, an object is either relocated to the newest device in
accordance with an appropriate probability, or it remains on the older
devices as before. This results in optimal data migration when items are
added to the bucket. The removal of items from the middle or the tail of
the list, however, can result in a significant amount of unnecessary
data movement. This means that list buckets are most suitable for
circumstances in which they **never shrink or very rarely shrink**.
#. **tree**: Tree buckets use a binary search tree. They are more efficient #. **tree**: Tree buckets use a binary search tree. They are more efficient
than list buckets when a bucket contains a larger set of items. Based on at dealing with buckets that contain many items than are list buckets.
the :abbr:`RUSH (Replication Under Scalable Hashing)` :sub:`R` algorithm, The behavior of tree buckets is governed by the :abbr:`RUSH (Replication
tree buckets reduce the placement time to O(log :sub:`n`), making them Under Scalable Hashing)`:sub:`R` algorithm. Tree buckets reduce the
suitable for managing much larger sets of devices or nested buckets. placement time to 0(log\ :sub:`n`). This means that tree buckets are
suitable for managing large sets of devices or nested buckets.
#. **straw**: List and Tree buckets use a divide and conquer strategy #. **straw**: Straw buckets allow all items in the bucket to "compete"
in a way that either gives certain items precedence (e.g., those against each other for replica placement through a process analogous to
at the beginning of a list) or obviates the need to consider entire drawing straws. This is different from the behavior of list buckets and
subtrees of items at all. That improves the performance of the replica tree buckets, which use a divide-and-conquer strategy that either gives
placement process, but can also introduce suboptimal reorganization certain items precedence (for example, those at the beginning of a list)
behavior when the contents of a bucket change due an addition, removal, or obviates the need to consider entire subtrees of items. Such an
or re-weighting of an item. The straw bucket type allows all items to approach improves the performance of the replica placement process, but
fairly “compete” against each other for replica placement through a can also introduce suboptimal reorganization behavior when the contents
process analogous to a draw of straws. of a bucket change due to an addition, a removal, or the re-weighting of an
item.
#. **straw2**: Straw2 buckets improve Straw to correctly avoid any data * **straw2**: Straw2 buckets improve on Straw by correctly avoiding
movement between items when neighbor weights change. any data movement between items when neighbor weights change. For
example, if the weight of a given item changes (including during the
operations of adding it to the cluster or removing it from the
cluster), there will be data movement to or from only that item.
Neighbor weights are not taken into account.
For example the weight of item A including adding it anew or removing
it completely, there will be data movement only to or from item A.
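If a cluster still contains legacy ``straw`` buckets, they can usually be converted to ``straw2`` in place. One possible way to do this (assuming that all clients and daemons are recent enough to understand ``straw2`` buckets) is to run the following command:

.. prompt:: bash $

   ceph osd crush set-all-straw-buckets-to-straw2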
.. topic:: Hash .. topic:: Hash
Each bucket uses a hash algorithm. Currently, Ceph supports ``rjenkins1``. Each bucket uses a hash algorithm. As of Reef, Ceph supports the
Enter ``0`` as your hash setting to select ``rjenkins1``. ``rjenkins1`` algorithm. To select ``rjenkins1`` as the hash algorithm,
enter ``0`` as your hash setting.
.. _weightingbucketitems: .. _weightingbucketitems:
.. topic:: Weighting Bucket Items .. topic:: Weighting Bucket Items
Ceph expresses bucket weights as doubles, which allows for fine Ceph expresses bucket weights as doubles, which allows for fine-grained
weighting. A weight is the relative difference between device capacities. We weighting. A weight is the relative difference between device capacities. We
recommend using ``1.00`` as the relative weight for a 1TB storage device. recommend using ``1.00`` as the relative weight for a 1 TB storage device.
In such a scenario, a weight of ``0.5`` would represent approximately 500GB, In such a scenario, a weight of ``0.50`` would represent approximately 500
and a weight of ``3.00`` would represent approximately 3TB. Higher level GB, and a weight of ``3.00`` would represent approximately 3 TB. Buckets
buckets have a weight that is the sum total of the leaf items aggregated by higher in the CRUSH hierarchy have a weight that is the sum of the weight of
the bucket. the leaf items aggregated by the bucket.
A bucket item weight is one dimensional, but you may also calculate your
item weights to reflect the performance of the storage drive. For example,
if you have many 1TB drives where some have relatively low data transfer
rate and the others have a relatively high data transfer rate, you may
weight them differently, even though they have the same capacity (e.g.,
a weight of 0.80 for the first set of drives with lower total throughput,
and 1.20 for the second set of drives with higher total throughput).
.. _crushmaprules: .. _crushmaprules:
@ -383,32 +400,31 @@ and one rack bucket. The OSDs are declared as items within the host buckets::
CRUSH Map Rules CRUSH Map Rules
--------------- ---------------
CRUSH maps support the notion of 'CRUSH rules', which are the rules that CRUSH maps have rules that include data placement for a pool: these are
determine data placement for a pool. The default CRUSH map has a rule for each called "CRUSH rules". The default CRUSH map has one rule for each pool. If you
pool. For large clusters, you will likely create many pools where each pool may are running a large cluster, you might create many pools and each of those
have its own non-default CRUSH rule. pools might have its own non-default CRUSH rule.
.. note:: In most cases, you will not need to modify the default rule. When
you create a new pool, by default the rule will be set to ``0``.
CRUSH rules define placement and replication strategies or distribution policies .. note:: In most cases, there is no need to modify the default rule. When a
that allow you to specify exactly how CRUSH places object replicas. For new pool is created, by default the rule will be set to the value ``0``
example, you might create a rule selecting a pair of targets for 2-way (which indicates the default CRUSH rule, which has the numeric ID ``0``).
mirroring, another rule for selecting three targets in two different data
centers for 3-way mirroring, and yet another rule for erasure coding over six CRUSH rules define policy that governs how data is distributed across the devices in
storage devices. For a detailed discussion of CRUSH rules, refer to the hierarchy. The rules define placement as well as replication strategies or
`CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, distribution policies that allow you to specify exactly how CRUSH places data
and more specifically to **Section 3.2**. replicas. For example, you might create one rule selecting a pair of targets for
two-way mirroring, another rule for selecting three targets in two different data
centers for three-way replication, and yet another rule for erasure coding across
six storage devices. For a detailed discussion of CRUSH rules, see **Section 3.2**
of `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_.
A rule takes the following form:: A rule takes the following form::
rule <rulename> { rule <rulename> {
id [a unique whole numeric ID] id [a unique integer ID]
type [ replicated | erasure ] type [replicated|erasure]
min_size <min-size>
max_size <max-size>
step take <bucket-name> [class <device-class>] step take <bucket-name> [class <device-class>]
step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type> step [choose|chooseleaf] [firstn|indep] <N> type <bucket-type>
step emit step emit
@ -416,153 +432,128 @@ A rule takes the following form::
``id`` ``id``
:Description: A unique integer that identifies the rule.
:Description: A unique whole number for identifying the rule. :Purpose: A component of the rule mask.
:Type: Integer
:Purpose: A component of the rule mask. :Required: Yes
:Type: Integer :Default: 0
:Required: Yes
:Default: 0
``type`` ``type``
:Description: Denotes the type of replication strategy to be enforced by the
:Description: Describes a rule for either a storage drive (replicated) rule.
or a RAID. :Purpose: A component of the rule mask.
:Type: String
:Purpose: A component of the rule mask. :Required: Yes
:Type: String :Default: ``replicated``
:Required: Yes :Valid Values: ``replicated`` or ``erasure``
:Default: ``replicated``
:Valid Values: Currently only ``replicated`` and ``erasure``
``min_size``
:Description: If a pool makes fewer replicas than this number, CRUSH will
**NOT** select this rule.
:Type: Integer
:Purpose: A component of the rule mask.
:Required: Yes
:Default: ``1``
``max_size``
:Description: If a pool makes more replicas than this number, CRUSH will
**NOT** select this rule.
:Type: Integer
:Purpose: A component of the rule mask.
:Required: Yes
:Default: 10
``step take <bucket-name> [class <device-class>]`` ``step take <bucket-name> [class <device-class>]``
:Description: Takes a bucket name and iterates down the tree. If
the ``device-class`` argument is specified, the argument must
match a class assigned to OSDs within the cluster. Only
devices belonging to the class are included.
:Purpose: A component of the rule.
:Required: Yes
:Example: ``step take data``
:Description: Takes a bucket name, and begins iterating down the tree.
If the ``device-class`` is specified, it must match
a class previously used when defining a device. All
devices that do not belong to the class are excluded.
:Purpose: A component of the rule.
:Required: Yes
:Example: ``step take data``
``step choose firstn {num} type {bucket-type}`` ``step choose firstn {num} type {bucket-type}``
:Description: Selects ``num`` buckets of the given type from within the
current bucket. ``{num}`` is usually the number of replicas in
the pool (in other words, the pool size).
:Description: Selects the number of buckets of the given type from within the - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
current bucket. The number is usually the number of replicas in - If ``pool-num-replicas > {num} > 0``, choose that many buckets.
the pool (i.e., pool size). - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets.
- If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). :Purpose: A component of the rule.
- If ``{num} > 0 && < pool-num-replicas``, choose that many buckets. :Prerequisite: Follows ``step take`` or ``step choose``.
- If ``{num} < 0``, it means ``pool-num-replicas - {num}``. :Example: ``step choose firstn 1 type row``
:Purpose: A component of the rule.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step choose firstn 1 type row``
``step chooseleaf firstn {num} type {bucket-type}`` ``step chooseleaf firstn {num} type {bucket-type}``
:Description: Selects a set of buckets of the given type and chooses a leaf
node (that is, an OSD) from the subtree of each bucket in that set of buckets. The
number of buckets in the set is usually the number of replicas in
the pool (in other words, the pool size).
:Description: Selects a set of buckets of ``{bucket-type}`` and chooses a leaf - If ``{num} == 0``, choose ``pool-num-replicas`` buckets (as many buckets as are available).
node (that is, an OSD) from the subtree of each bucket in the set of buckets. - If ``pool-num-replicas > {num} > 0``, choose that many buckets.
The number of buckets in the set is usually the number of replicas in - If ``{num} < 0``, choose ``pool-num-replicas - {num}`` buckets.
the pool (i.e., pool size). :Purpose: A component of the rule. Using ``chooseleaf`` obviates the need to select a device in a separate step.
:Prerequisite: Follows ``step take`` or ``step choose``.
- If ``{num} == 0``, choose ``pool-num-replicas`` buckets (all available). :Example: ``step chooseleaf firstn 0 type row``
- If ``{num} > 0 && < pool-num-replicas``, choose that many buckets.
- If ``{num} < 0``, it means ``pool-num-replicas - {num}``.
:Purpose: A component of the rule. Usage removes the need to select a device using two steps.
:Prerequisite: Follows ``step take`` or ``step choose``.
:Example: ``step chooseleaf firstn 0 type row``
``step emit`` ``step emit``
:Description: Outputs the current value on the top of the stack and empties
:Description: Outputs the current value and empties the stack. Typically used the stack. Typically used
at the end of a rule, but may also be used to pick from different at the end of a rule, but may also be used to choose from different
trees in the same rule. trees in the same rule.
:Purpose: A component of the rule. :Purpose: A component of the rule.
:Prerequisite: Follows ``step choose``. :Prerequisite: Follows ``step choose``.
:Example: ``step emit`` :Example: ``step emit``
.. important:: A given CRUSH rule may be assigned to multiple pools, but it .. important:: A single CRUSH rule can be assigned to multiple pools, but
is not possible for a single pool to have multiple CRUSH rules. a single pool cannot have multiple CRUSH rules.
``firstn`` versus ``indep`` ``firstn`` or ``indep``
:Description: Controls the replacement strategy CRUSH uses when items (OSDs) :Description: Determines which replacement strategy CRUSH uses when items (OSDs)
are marked down in the CRUSH map. If this rule is to be used with are marked ``down`` in the CRUSH map. When this rule is used
replicated pools it should be ``firstn`` and if it's for with replicated pools, ``firstn`` is used. When this rule is
erasure-coded pools it should be ``indep``. used with erasure-coded pools, ``indep`` is used.
The reason has to do with how they behave when a Suppose that a PG is stored on OSDs 1, 2, 3, 4, and 5 and then
previously-selected device fails. Let's say you have a PG stored OSD 3 goes down.
on OSDs 1, 2, 3, 4, 5. Then 3 goes down.
With the "firstn" mode, CRUSH simply adjusts its calculation to When in ``firstn`` mode, CRUSH simply adjusts its calculation
select 1 and 2, then selects 3 but discovers it's down, so it to select OSDs 1 and 2, then selects 3 and discovers that 3 is
retries and selects 4 and 5, and then goes on to select a new down, retries and selects 4 and 5, and finally goes on to
OSD 6. So the final CRUSH mapping change is select a new OSD: OSD 6. The final CRUSH mapping
1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6. transformation is therefore 1, 2, 3, 4, 5 → 1, 2, 4, 5, 6.
But if you're storing an EC pool, that means you just changed the However, if you were storing an erasure-coded pool, the above
data mapped to OSDs 4, 5, and 6! So the "indep" mode attempts to sequence would have changed the data that is mapped to OSDs 4,
not do that. You can instead expect it, when it selects the failed 5, and 6. The ``indep`` mode attempts to avoid this unwanted
OSD 3, to try again and pick out 6, for a final transformation of: consequence. When in ``indep`` mode, CRUSH can be expected to
1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5 select 3, discover that 3 is down, retry, and select 6. The
final CRUSH mapping transformation is therefore 1, 2, 3, 4, 5
→ 1, 2, 6, 4, 5.
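As a sketch that ties the above fields together, a replicated rule that restricts placement to the ``ssd`` device class and puts each replica on a different host might look like the following (the rule name and ID are only examples)::

    rule replicated_ssd {
        id 1
        type replicated
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
    }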
.. _crush-reclassify: .. _crush-reclassify:
Migrating from a legacy SSD rule to device classes Migrating from a legacy SSD rule to device classes
-------------------------------------------------- --------------------------------------------------
It used to be necessary to manually edit your CRUSH map and maintain a Prior to the Luminous release's introduction of the *device class* feature, in
parallel hierarchy for each specialized device type (e.g., SSD) in order to order to write rules that applied to a specialized device type (for example,
write rules that apply to those devices. Since the Luminous release, SSD), it was necessary to manually edit the CRUSH map and maintain a parallel
the *device class* feature has enabled this transparently. hierarchy for each device type. The device class feature provides a more
transparent way to achieve this end.
However, migrating from an existing, manually customized per-device map to However, if your cluster is migrated from an existing manually-customized
the new device class rules in the trivial way will cause all data in the per-device map to new device class-based rules, all data in the system will be
system to be reshuffled. reshuffled.
The ``crushtool`` has a few commands that can transform a legacy rule The ``crushtool`` utility has several commands that can transform a legacy rule
and hierarchy so that you can start using the new class-based rules. and hierarchy and allow you to start using the new device class rules. There
There are three types of transformations possible: are three possible types of transformation:
#. ``--reclassify-root <root-name> <device-class>`` #. ``--reclassify-root <root-name> <device-class>``
This will take everything in the hierarchy beneath root-name and This command examines everything under ``root-name`` in the hierarchy and
adjust any rules that reference that root via a ``take rewrites any rules that reference the specified root and that have the
<root-name>`` to instead ``take <root-name> class <device-class>``. form ``take <root-name>`` so that they instead have the
It renumbers the buckets in such a way that the old IDs are instead form ``take <root-name> class <device-class>``. The command also renumbers
used for the specified class's "shadow tree" so that no data the buckets in such a way that the old IDs are used for the specified
movement takes place. class's "shadow tree" and as a result no data movement takes place.
For example, imagine you have an existing rule like:: For example, suppose you have the following as an existing rule::
rule replicated_rule { rule replicated_rule {
id 0 id 0
@ -572,8 +563,8 @@ There are three types of transformations possible:
step emit step emit
} }
If you reclassify the root `default` as class `hdd`, the rule will If the root ``default`` is reclassified as class ``hdd``, the new rule will
become:: be as follows::
rule replicated_rule { rule replicated_rule {
id 0 id 0
@ -585,23 +576,26 @@ There are three types of transformations possible:
#. ``--set-subtree-class <bucket-name> <device-class>`` #. ``--set-subtree-class <bucket-name> <device-class>``
This will mark every device in the subtree rooted at *bucket-name* This command marks every device in the subtree that is rooted at *bucket-name*
with the specified device class. with the specified device class.
This is normally used in conjunction with the ``--reclassify-root`` This command is typically used in conjunction with the ``--reclassify-root`` option
option to ensure that all devices in that root are labeled with the in order to ensure that all devices in that root are labeled with the
correct class. In some situations, however, some of those devices correct class. In certain circumstances, however, some of those devices
(correctly) have a different class and we do not want to relabel are correctly labeled with a different class and must not be relabeled. To
them. In such cases, one can exclude the ``--set-subtree-class`` manage this difficulty, one can exclude the ``--set-subtree-class``
option. This means that the remapping process will not be perfect, option. The remapping process will not be perfect, because the previous rule
since the previous rule distributed across devices of multiple had an effect on devices of multiple classes but the adjusted rules will map
classes but the adjusted rules will only map to devices of the only to devices of the specified device class. However, when there are not many
specified *device-class*, but that often is an accepted level of outlier devices, the resulting level of data movement is often within tolerable
data movement when the number of outlier devices is small. limits.
#. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>`` #. ``--reclassify-bucket <match-pattern> <device-class> <default-parent>``
This will allow you to merge a parallel type-specific hierarchy with the normal hierarchy. For example, many users have maps like:: This command allows you to merge a parallel type-specific hierarchy with the
normal hierarchy. For example, many users have maps that resemble the
following::
host node1 { host node1 {
id -2 # do not change unnecessarily id -2 # do not change unnecessarily
@ -643,29 +637,31 @@ There are three types of transformations possible:
... ...
} }
This function will reclassify each bucket that matches a This command reclassifies each bucket that matches a certain
pattern. The pattern can look like ``%suffix`` or ``prefix%``. pattern. The pattern can be of the form ``%suffix`` or ``prefix%``. For
For example, in the above example, we would use the pattern example, in the above example, we would use the pattern
``%-ssd``. For each matched bucket, the remaining portion of the ``%-ssd``. For each matched bucket, the remaining portion of the
name (that matches the ``%`` wildcard) specifies the *base bucket*. name (corresponding to the ``%`` wildcard) specifies the *base bucket*. All
All devices in the matched bucket are labeled with the specified devices in the matched bucket are labeled with the specified
device class and then moved to the base bucket. If the base bucket device class and then moved to the base bucket. If the base bucket
does not exist (e.g., ``node12-ssd`` exists but ``node12`` does does not exist (for example, ``node12-ssd`` exists but ``node12`` does
not), then it is created and linked underneath the specified not), then it is created and linked under the specified
*default parent* bucket. In each case, we are careful to preserve *default parent* bucket. In each case, care is taken to preserve
the old bucket IDs for the new shadow buckets to prevent data the old bucket IDs for the new shadow buckets in order to prevent data
movement. Any rules with ``take`` steps referencing the old movement. Any rules with ``take`` steps that reference the old
buckets are adjusted. buckets are adjusted accordingly.
#. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>`` #. ``--reclassify-bucket <bucket-name> <device-class> <base-bucket>``
The same command can also be used without a wildcard to map a The same command can also be used without a wildcard in order to map a
single bucket. For example, in the previous example, we want the single bucket. For example, in the previous example, we want the
``ssd`` bucket to be mapped to the ``default`` bucket. ``ssd`` bucket to be mapped to the ``default`` bucket.
The final command to convert the map comprising the above fragments would be something like: #. The final command to convert the map that consists of the above fragments
resembles the following:
.. prompt:: bash $ .. prompt:: bash $
ceph osd getcrushmap -o original ceph osd getcrushmap -o original
crushtool -i original --reclassify \ crushtool -i original --reclassify \
@ -675,7 +671,16 @@ The final command to convert the map comprising the above fragments would be som
--reclassify-bucket ssd ssd default \ --reclassify-bucket ssd ssd default \
-o adjusted -o adjusted
In order to ensure that the conversion is correct, there is a ``--compare`` command that will test a large sample of inputs against the CRUSH map and check that the same result is output. These inputs are controlled by the same options that apply to the ``--test`` command. For the above example,: ``--compare`` flag
------------------
A ``--compare`` flag is available to make sure that the conversion performed in
:ref:`Migrating from a legacy SSD rule to device classes <crush-reclassify>` is
correct. This flag tests a large sample of inputs against the CRUSH map and
checks that the expected result is output. The options that control these
inputs are the same as the options that apply to the ``--test`` command. For an
illustration of how this ``--compare`` command applies to the above example,
see the following:
.. prompt:: bash $ .. prompt:: bash $
@ -687,39 +692,39 @@ In order to ensure that the conversion is correct, there is a ``--compare`` comm
rule 1 had 0/10240 mismatched mappings (0) rule 1 had 0/10240 mismatched mappings (0)
maps appear equivalent maps appear equivalent
If there were differences, the ratio of remapped inputs would be reported in If the command finds any differences, the ratio of remapped inputs is reported
the parentheses. in the parentheses.
When you are satisfied with the adjusted map, apply it to the cluster with a command of the form: When you are satisfied with the adjusted map, apply it to the cluster by
running the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph osd setcrushmap -i adjusted ceph osd setcrushmap -i adjusted
Tuning CRUSH, the hard way Manually Tuning CRUSH
-------------------------- ---------------------
If you can ensure that all clients are running recent code, you can If you have verified that all clients are running recent code, you can adjust
adjust the tunables by extracting the CRUSH map, modifying the values, the CRUSH tunables by extracting the CRUSH map, modifying the values, and
and reinjecting it into the cluster. reinjecting the map into the cluster. The procedure is carried out as follows:
* Extract the latest CRUSH map: #. Extract the latest CRUSH map:
.. prompt:: bash $ .. prompt:: bash $
ceph osd getcrushmap -o /tmp/crush ceph osd getcrushmap -o /tmp/crush
* Adjust tunables. These values appear to offer the best behavior #. Adjust tunables. In our tests, the following values appear to result in the
for both large and small clusters we tested with. You will need to best behavior for both large and small clusters. The procedure requires that
additionally specify the ``--enable-unsafe-tunables`` argument to you specify the ``--enable-unsafe-tunables`` flag in the ``crushtool``
``crushtool`` for this to work. Please use this option with command. Use this option with **extreme care**:
extreme care.:
.. prompt:: bash $ .. prompt:: bash $
crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
* Reinject modified map: #. Reinject the modified map:
.. prompt:: bash $ .. prompt:: bash $
@ -728,16 +733,14 @@ and reinjecting it into the cluster.
Legacy values Legacy values
------------- -------------
For reference, the legacy values for the CRUSH tunables can be set To set the legacy values of the CRUSH tunables, run the following command:
with:
.. prompt:: bash $ .. prompt:: bash $
crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
Again, the special ``--enable-unsafe-tunables`` option is required. The special ``--enable-unsafe-tunables`` flag is required. Be careful when
Further, as noted above, be careful running old versions of the running old versions of the ``ceph-osd`` daemon after reverting to legacy
``ceph-osd`` daemon after reverting to legacy values as the feature values, because the feature bit is not perfectly enforced.
bit is not perfectly enforced.
.. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.io/assets/pdfs/weil-crush-sc06.pdf
File diff suppressed because it is too large
View File
@ -2,40 +2,45 @@
Data Placement Overview Data Placement Overview
========================= =========================
Ceph stores, replicates and rebalances data objects across a RADOS cluster Ceph stores, replicates, and rebalances data objects across a RADOS cluster
dynamically. With many different users storing objects in different pools for dynamically. Because different users store objects in different pools for
different purposes on countless OSDs, Ceph operations require some data different purposes on many OSDs, Ceph operations require a certain amount of
placement planning. The main data placement planning concepts in Ceph include: data-placement planning. The main data-placement planning concepts in Ceph
include:
- **Pools:** Ceph stores data within pools, which are logical groups for storing - **Pools:** Ceph stores data within pools, which are logical groups used for
objects. Pools manage the number of placement groups, the number of replicas, storing objects. Pools manage the number of placement groups, the number of
and the CRUSH rule for the pool. To store data in a pool, you must have replicas, and the CRUSH rule for the pool. To store data in a pool, it is
an authenticated user with permissions for the pool. Ceph can snapshot pools. necessary to be an authenticated user with permissions for the pool. Ceph is
See `Pools`_ for additional details. able to make snapshots of pools. For additional details, see `Pools`_.
- **Placement Groups:** Ceph maps objects to placement groups (PGs). - **Placement Groups:** Ceph maps objects to placement groups. Placement
Placement groups (PGs) are shards or fragments of a logical object pool groups (PGs) are shards or fragments of a logical object pool that place
that place objects as a group into OSDs. Placement groups reduce the amount objects as a group into OSDs. Placement groups reduce the amount of
of per-object metadata when Ceph stores the data in OSDs. A larger number of per-object metadata that is necessary for Ceph to store the data in OSDs. A
placement groups (e.g., 100 per OSD) leads to better balancing. See greater number of placement groups (for example, 100 PGs per OSD as compared
:ref:`placement groups` for additional details. with 50 PGs per OSD) leads to better balancing. For additional details, see
:ref:`placement groups`.
- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without - **CRUSH Maps:** CRUSH plays a major role in allowing Ceph to scale while
performance bottlenecks, without limitations to scalability, and without a avoiding certain pitfalls, such as performance bottlenecks, limitations to
single point of failure. CRUSH maps provide the physical topology of the scalability, and single points of failure. CRUSH maps provide the physical
cluster to the CRUSH algorithm to determine where the data for an object topology of the cluster to the CRUSH algorithm, so that it can determine both
and its replicas should be stored, and how to do so across failure domains (1) where the data for an object and its replicas should be stored and (2)
for added data safety among other things. See `CRUSH Maps`_ for additional how to store that data across failure domains so as to improve data safety.
details. For additional details, see `CRUSH Maps`_.
- **Balancer:** The balancer is a feature that will automatically optimize the - **Balancer:** The balancer is a feature that automatically optimizes the
distribution of PGs across devices to achieve a balanced data distribution, distribution of placement groups across devices in order to achieve a
maximizing the amount of data that can be stored in the cluster and evenly balanced data distribution, in order to maximize the amount of data that can
distributing the workload across OSDs. be stored in the cluster, and in order to evenly distribute the workload
across OSDs.
When you initially set up a test cluster, you can use the default values. Once It is possible to use the default values for each of the above components.
you begin planning for a large Ceph cluster, refer to pools, placement groups Default values are recommended for a test cluster's initial setup. However,
and CRUSH for data placement operations. when planning a large Ceph cluster, values should be customized for
data-placement operations with reference to the different roles played by
pools, placement groups, and CRUSH.
.. _Pools: ../pools .. _Pools: ../pools
.. _CRUSH Maps: ../crush-map .. _CRUSH Maps: ../crush-map
View File
@ -3,28 +3,32 @@
Device Management Device Management
================= =================
Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by Device management allows Ceph to address hardware failure. Ceph tracks hardware
which daemons, and collects health metrics about those devices in order to storage devices (HDDs, SSDs) to see which devices are managed by which daemons.
provide tools to predict and/or automatically respond to hardware failure. Ceph also collects health metrics about these devices. By doing so, Ceph can
provide tools that predict hardware failure and can automatically respond to
hardware failure.
Device tracking Device tracking
--------------- ---------------
You can query which storage devices are in use with: To see a list of the storage devices that are in use, run the following
command:
.. prompt:: bash $ .. prompt:: bash $
ceph device ls ceph device ls
You can also list devices by daemon or by host: Alternatively, to list devices by daemon or by host, run a command of one of
the following forms:
.. prompt:: bash $ .. prompt:: bash $
ceph device ls-by-daemon <daemon> ceph device ls-by-daemon <daemon>
ceph device ls-by-host <host> ceph device ls-by-host <host>
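For example (the daemon and host names here are hypothetical):

.. prompt:: bash $

   ceph device ls-by-daemon osd.0
   ceph device ls-by-host node1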
For any individual device, you can query information about its To see information about the location of a specific device and about how the
location and how it is being consumed with: device is being consumed, run a command of the following form:
.. prompt:: bash $ .. prompt:: bash $
@ -33,103 +37,107 @@ location and how it is being consumed with:
Identifying physical devices
----------------------------

To make the replacement of failed disks easier and less error-prone, you can
(in some cases) "blink" the drive's LEDs on hardware enclosures by running a
command of the following form::

   device light on|off <devid> [ident|fault] [--force]

.. note:: Using this command to blink the lights might not work. Whether it
   works will depend upon such factors as your kernel revision, your SES
   firmware, or the setup of your HBA.

The ``<devid>`` parameter is the device identification. To retrieve this
information, run the following command:

.. prompt:: bash $

   ceph device ls

The ``[ident|fault]`` parameter determines which kind of light will blink. By
default, the `identification` light is used.

.. note:: This command works only if the Cephadm or the Rook `orchestrator
   <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_
   module is enabled. To see which orchestrator module is enabled, run the
   following command:

.. prompt:: bash $

   ceph orch status

The command that makes the drive's LEDs blink is `lsmcli`. To customize this
command, configure it via a Jinja2 template by running commands of the
following forms::

   ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
   ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"

The following arguments can be used to customize the Jinja2 template:

* ``on``
  A boolean value.
* ``ident_fault``
  A string that contains `ident` or `fault`.
* ``dev``
  A string that contains the device ID: for example, `SanDisk_X400_M.2_2280_512GB_162924424784`.
* ``path``
  A string that contains the device path: for example, `/dev/sda`.
.. _enabling-monitoring:

Enabling monitoring
-------------------

Ceph can also monitor the health metrics associated with your device. For
example, SATA drives implement a standard called SMART that provides a wide
range of internal metrics about the device's usage and health (for example: the
number of hours powered on, the number of power cycles, the number of
unrecoverable read errors). Other device types such as SAS and NVMe present a
similar set of metrics (via slightly different standards). All of these
metrics can be collected by Ceph via the ``smartctl`` tool.

You can enable or disable health monitoring by running one of the following
commands:

.. prompt:: bash $

   ceph device monitoring on

or:

.. prompt:: bash $

   ceph device monitoring off

Scraping
--------

If monitoring is enabled, device metrics will be scraped automatically at
regular intervals. To configure that interval, run a command of the following
form:

.. prompt:: bash $

   ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>

By default, device metrics are scraped once every 24 hours.

To manually scrape all devices, run the following command:

.. prompt:: bash $

   ceph device scrape-health-metrics

To scrape a single device, run a command of the following form:

.. prompt:: bash $

   ceph device scrape-health-metrics <device-id>

To scrape a single daemon's devices, run a command of the following form:

.. prompt:: bash $

   ceph device scrape-daemon-health-metrics <who>

To retrieve the stored health metrics for a device (optionally for a specific
timestamp), run a command of the following form:

.. prompt:: bash $
@ -138,71 +146,82 @@ for a specific timestamp) with:
Failure prediction
------------------

Ceph can predict drive life expectancy and device failures by analyzing the
health metrics that it collects. The prediction modes are as follows:

* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ``ceph-mgr`` daemon.

To configure the prediction mode, run a command of the following form:

.. prompt:: bash $

   ceph config set global device_failure_prediction_mode <mode>

Under normal conditions, failure prediction runs periodically in the
background. For this reason, life expectancy values might be populated only
after a significant amount of time has passed. The life expectancy of all
devices is displayed in the output of the following command:

.. prompt:: bash $

   ceph device ls

To see the metadata of a specific device, run a command of the following form:

.. prompt:: bash $

   ceph device info <devid>

To explicitly force prediction of a specific device's life expectancy, run a
command of the following form:

.. prompt:: bash $

   ceph device predict-life-expectancy <devid>

In addition to Ceph's internal device failure prediction, you might have an
external source of information about device failures. To inform Ceph of a
specific device's life expectancy, run a command of the following form:

.. prompt:: bash $

   ceph device set-life-expectancy <devid> <from> [<to>]

Life expectancies are expressed as a time interval. This means that the
uncertainty of the life expectancy can be expressed in the form of a range of
time, and perhaps a wide range of time. The interval's end can be left
unspecified.
Health alerts
-------------

The ``mgr/devicehealth/warn_threshold`` configuration option controls the
health check for an expected device failure. If the device is expected to fail
within the specified time interval, an alert is raised.

To check the stored life expectancy of all devices and generate any appropriate
health alert, run the following command:

.. prompt:: bash $

   ceph device check-health
Automatic Migration
-------------------

The ``mgr/devicehealth/self_heal`` option (enabled by default) automatically
migrates data away from devices that are expected to fail soon. If this option
is enabled, the module marks such devices ``out`` so that automatic migration
will occur.

.. note:: The ``mon_osd_min_up_ratio`` configuration option can help prevent
   this process from cascading to total failure. If the "self heal" module
   marks ``out`` so many OSDs that the ratio value of ``mon_osd_min_up_ratio``
   is exceeded, then the cluster raises the ``DEVICE_HEALTH_TOOMANY`` health
   check. For instructions on what to do in this situation, see
   :ref:`DEVICE_HEALTH_TOOMANY<rados_health_checks_device_health_toomany>`.

The ``mgr/devicehealth/mark_out_threshold`` configuration option specifies the
time interval for automatic migration. If a device is expected to fail within
the specified time interval, it will be automatically marked ``out``.
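Both thresholds are ordinary manager-module options and can be adjusted with
``ceph config set``. The following sketch is illustrative only; the specific
values (10 weeks and 6 weeks, expressed in seconds) are assumptions, not
recommendations:

.. prompt:: bash $

   # warn 10 weeks (6048000 seconds) before an expected device failure
   ceph config set mgr mgr/devicehealth/warn_threshold 6048000
   # automatically mark devices "out" 6 weeks (3628800 seconds) before failure
   ceph config set mgr mgr/devicehealth/mark_out_threshold 3628800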
View File
@ -6,9 +6,11 @@ The *jerasure* plugin is the most generic and flexible plugin, it is
also the default for Ceph erasure coded pools.

The *jerasure* plugin encapsulates the `Jerasure
<https://github.com/ceph/jerasure>`_ library. It is
recommended to read the ``jerasure`` documentation to
understand the parameters. Note that the ``jerasure.org``
web site as of 2023 may no longer be connected to the original
project or legitimate.

Create a jerasure profile
=========================
View File
@ -110,6 +110,8 @@ To remove an erasure code profile::
If the profile is referenced by a pool, the deletion will fail.

.. warning:: Removing an erasure code profile using ``osd erasure-code-profile rm`` does not
   automatically delete the CRUSH rule associated with the erasure code profile. It is
   recommended to manually remove the associated CRUSH rule using
   ``ceph osd crush rule remove {rule-name}`` to avoid unexpected behavior.
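As a minimal sketch, assuming the leftover CRUSH rule was created with the same
name as the profile (the profile name ``myprofile`` is a hypothetical example;
check ``ceph osd crush rule ls`` for the actual rule name in your cluster):

.. prompt:: bash $

   ceph osd erasure-code-profile rm myprofile
   # list the CRUSH rules to find the one left behind, then remove it
   ceph osd crush rule ls
   ceph osd crush rule remove myprofile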
osd erasure-code-profile get
============================
View File
@ -1,14 +1,14 @@
.. _ecpool:

==============
 Erasure code
==============

By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the method of data protection known as "replication".

By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
@ -16,17 +16,17 @@ used to rebuild the data. At scale, erasure coding saves space relative to
replication.

In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "coding chunks".

Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.

Creating a sample erasure-coded pool
------------------------------------

The simplest erasure-coded pool is similar to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:
@ -47,12 +47,13 @@ requires at least three hosts:
   ABCDEFGHI

Erasure-code profiles
---------------------

The default erasure-code profile can sustain the overlapping loss of two OSDs
without losing data. This erasure-code profile is equivalent to a replicated
pool of size three, but with different storage requirements: instead of
requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default
profile can be displayed with this command:

.. prompt:: bash $
@ -68,26 +69,27 @@ profile can be displayed with this command:
   technique=reed_sol_van

.. note::
   The profile just displayed is for the *default* erasure-coded pool, not the
   *simplest* erasure-coded pool. These two pools are not the same:

   The default erasure-coded pool has two data chunks (K) and two coding chunks
   (M). The profile of the default erasure-coded pool is "k=2 m=2".

   The simplest erasure-coded pool has two data chunks (K) and one coding chunk
   (M). The profile of the simplest erasure-coded pool is "k=2 m=1".

Choosing the right profile is important because the profile cannot be modified
after the pool is created. If you find that you need an erasure-coded pool with
a profile different than the one you have created, you must create a new pool
with a different (and presumably more carefully considered) profile. When the
new pool is created, all objects from the wrongly configured pool must be moved
to the newly created pool. There is no way to alter the profile of a pool after
the pool has been created.

The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain* because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:

.. prompt:: bash $
@ -106,7 +108,7 @@ the following profile can be defined:
The *NYAN* object will be divided in three (*K=3*) and two additional
*chunks* will be created (*M=2*). The value of *M* defines how many
OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.
@ -155,19 +157,19 @@ no two *chunks* are stored in the same rack.
   +------+

More information can be found in the `erasure-code profiles
<../erasure-code-profile>`_ documentation.


Erasure Coding with Overwrites
------------------------------

By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).

Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure-coded pool:

.. prompt:: bash $
@ -175,31 +177,33 @@ data in an erasure coded pool:
This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.

Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure-coded pool as the ``--data-pool`` during image creation:

.. prompt:: bash $

   rbd create --size 1G --data-pool ec_pool replicated_pool/image_name

For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.


Erasure-coded pools and cache tiering
-------------------------------------

Erasure-coded pools require more resources than replicated pools and
lack some of the functionality supported by replicated pools (for example, omap).
To overcome these limitations, one can set up a `cache tier <../cache-tiering>`_
before setting up the erasure-coded pool.

For example, if the pool *hot-storage* is made of fast storage, the following commands
will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode:

.. prompt:: bash $
@ -207,51 +211,53 @@ For instance, if the pool *hot-storage* is made of fast storage:
   ceph osd tier cache-mode hot-storage writeback
   ceph osd tier set-overlay ecpool hot-storage

The result is that every write and read to the *ecpool* actually uses
the *hot-storage* pool and benefits from its flexibility and speed.

More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.

Erasure-coded pool recovery
---------------------------

If an erasure-coded pool loses any data shards, it must recover them from others.
This recovery involves reading from the remaining shards, reconstructing the data, and
writing new shards.

In Octopus and later releases, erasure-coded pools can recover as long as there are at least *K* shards
available. (With fewer than *K* shards, you have actually lost data!)
Prior to Octopus, erasure-coded pools required that at least ``min_size`` shards be
available, even if ``min_size`` was greater than ``K``. This was a conservative
decision made out of an abundance of caution when designing the new pool
mode. As a result, however, pools with lost OSDs but without complete data loss were
unable to recover and go active without manual intervention to temporarily change
the ``min_size`` setting.

We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes and
loss of data.
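As a quick sketch of applying this recommendation (the pool name ``ecpool`` is
the example pool used earlier, and the value ``4`` assumes a k=3, m=2 profile,
so that ``min_size`` equals K+1):

.. prompt:: bash $

   ceph osd pool get ecpool min_size
   # raise min_size to K+1 (3+1=4 for a k=3, m=2 profile)
   ceph osd pool set ecpool min_size 4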
Glossary
--------

*chunk*
   When the encoding function is called, it returns chunks of the same size as each other. There are two
   kinds of chunks: (1) *data chunks*, which can be concatenated to reconstruct the original object, and
   (2) *coding chunks*, which can be used to rebuild a lost chunk.

*K*
   The number of data chunks into which an object is divided. For example, if *K* = 2, then a 10KB object
   is divided into two objects of 5KB each.

*M*
   The number of coding chunks computed by the encoding function. *M* is equal to the number of OSDs that can
   be missing from the cluster without the cluster suffering data loss. For example, if there are two coding
   chunks, then two OSDs can be missing without data loss.
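As a small worked example of how *K* and *M* translate into raw-space overhead
(the formula follows from the definitions above rather than being stated
explicitly in this section)::

   raw space required ~= (K + M) / K x usable data
   k=3, m=2  ->  (3 + 2) / 3 ~= 1.67x, i.e. the 67% overhead mentioned above
   k=2, m=2  ->  (2 + 2) / 2  = 2.00x, i.e. 2TB of raw space to store 1TB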
Table of contents
-----------------

.. toctree::
   :maxdepth: 1
File diff suppressed because it is too large

View File
@ -3,35 +3,38 @@
=========================

High availability and high reliability require a fault-tolerant approach to
managing hardware and software issues. Ceph has no single point of failure and
it can service requests for data even when in a "degraded" mode. Ceph's `data
placement`_ introduces a layer of indirection to ensure that data doesn't bind
directly to specific OSDs. For this reason, tracking system faults
requires finding the `placement group`_ (PG) and the underlying OSDs at the
root of the problem.

.. tip:: A fault in one part of the cluster might prevent you from accessing a
   particular object, but that doesn't mean that you are prevented from
   accessing other objects. When you run into a fault, don't panic. Just
   follow the steps for monitoring your OSDs and placement groups, and then
   begin troubleshooting.

Ceph is self-repairing. However, when problems persist, monitoring OSDs and
placement groups will help you identify the problem.


Monitoring OSDs
===============
An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is
either running and reachable (``up``), or it is not running and not
reachable (``down``).

If an OSD is ``up``, it may be either ``in`` service (clients can read and
write data) or it is ``out`` of service. If the OSD was ``in`` but then due to
a failure or a manual action was set to the ``out`` state, Ceph will migrate
placement groups to the other OSDs to maintain the configured redundancy.

If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
If an OSD is ``down``, it will also be ``out``.

.. note:: If an OSD is ``down`` and ``in``, there is a problem and this
   indicates that the cluster is not in a healthy state.

.. ditaa::
@ -50,33 +53,34 @@ not assign placement groups to the OSD. If an OSD is ``down``, it should also be
   |                |  |                |
   +----------------+  +----------------+
If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. Don't
panic. There are certain circumstances in which it is expected and normal that
the cluster will **NOT** show ``HEALTH OK``:

#. You haven't started the cluster yet.
#. You have just started or restarted the cluster and it's not ready to show
   health statuses yet, because the PGs are in the process of being created and
   the OSDs are in the process of peering.
#. You have just added or removed an OSD.
#. You have just modified your cluster map.

Checking to see if OSDs are ``up`` and running is an important aspect of monitoring them:
whenever the cluster is up and running, every OSD that is ``in`` the cluster should also
be ``up`` and running. To see if all of the cluster's OSDs are running, run the following
command:

.. prompt:: bash $

   ceph osd stat

The output provides the following information: the total number of OSDs (x),
how many OSDs are ``up`` (y), how many OSDs are ``in`` (z), and the map epoch (eNNNN). ::

   x osds: y up, z in; epoch: eNNNN

If the number of OSDs that are ``in`` the cluster is greater than the number of
OSDs that are ``up``, run the following command to identify the ``ceph-osd``
daemons that are not running:

.. prompt:: bash $
@ -92,87 +96,85 @@ daemons that are not running:
   0 ssd 1.00000 osd.0 up   1.00000 1.00000
   1 ssd 1.00000 osd.1 down 1.00000 1.00000

.. tip:: Searching through a well-designed CRUSH hierarchy to identify the physical
   locations of particular OSDs might help you troubleshoot your cluster.

If an OSD is ``down``, start it by running the following command:

.. prompt:: bash $

   sudo systemctl start ceph-osd@1

For problems associated with OSDs that have stopped or won't restart, see `OSD Not Running`_.


PG Sets
=======

When CRUSH assigns a PG to OSDs, it takes note of how many replicas of the PG
are required by the pool and then assigns each replica to a different OSD.
For example, if the pool requires three replicas of a PG, CRUSH might assign
them individually to ``osd.1``, ``osd.2`` and ``osd.3``. CRUSH seeks a
pseudo-random placement that takes into account the failure domains that you
have set in your `CRUSH map`_; for this reason, PGs are rarely assigned to
immediately adjacent OSDs in a large cluster.
Ceph processes client requests with the **Acting Set** of OSDs: this is the set
of OSDs that currently have a full and working version of a PG shard and that
are therefore responsible for handling requests. By contrast, the **Up Set** is
the set of OSDs that contain a shard of a specific PG. Data is moved or copied
(or planned to be moved or copied) to the **Up Set**. See
:ref:`Placement Group Concepts <rados_operations_pg_concepts>`.
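One way to see the Up Set and the Acting Set for every PG at once is the brief
form of the PG dump (a sketch; the exact output columns may vary slightly
between releases):

.. prompt:: bash $

   ceph pg dump pgs_brief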
Sometimes an OSD in the Acting Set is ``down`` or otherwise unable to
service requests for objects in the PG. When this kind of situation
arises, don't panic. Common examples of such a situation include:

- You added or removed an OSD, CRUSH reassigned the PG to
  other OSDs, and this reassignment changed the composition of the Acting Set and triggered
  the migration of data by means of a "backfill" process.
- An OSD was ``down``, was restarted, and is now ``recovering``.
- An OSD in the Acting Set is ``down`` or unable to service requests,
  and another OSD has temporarily assumed its duties.

Typically, the Up Set and the Acting Set are identical. When they are not, it
might indicate that Ceph is migrating the PG (in other words, that the PG has
been remapped), that an OSD is recovering, or that there is a problem with the
cluster (in such scenarios, Ceph usually shows a "HEALTH WARN" state with a
"stuck stale" message).
To retrieve a list of PGs, run the following command:

.. prompt:: bash $

   ceph pg dump

To see which OSDs are within the Acting Set and the Up Set for a specific PG, run the following command:

.. prompt:: bash $

   ceph pg map {pg-num}

The output provides the following information: the osdmap epoch (eNNN), the PG number
({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set
(acting[])::

   osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2]

.. note:: If the Up Set and the Acting Set do not match, this might indicate
   that the cluster is rebalancing itself or that there is a problem with
   the cluster.


Peering
=======

Before you can write data to a PG, it must be in an ``active`` state and it
will preferably be in a ``clean`` state. For Ceph to determine the current
state of a PG, peering must take place. That is, the primary OSD of the PG
(that is, the first OSD in the Acting Set) must peer with the secondary and
tertiary OSDs so that consensus on the current state of the PG can be
established. In the following diagram, we assume a pool with three replicas
of the PG:

.. ditaa::
@ -192,86 +194,85 @@ OSDs to establish agreement on the current state of the placement group
           |<------------------------------|
           |            Peering            |
The OSDs also report their status to the monitor. For details, see `Configuring Monitor/OSD
Interaction`_. To troubleshoot peering issues, see `Peering
Failure`_.
Monitoring PG States
====================

If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. After
first checking to see if the OSDs are running, you should also check PG
states. There are certain PG-peering-related circumstances in which it is expected
and normal that the cluster will **NOT** show ``HEALTH OK``:

#. You have just created a pool and the PGs haven't peered yet.
#. The PGs are recovering.
#. You have just added an OSD to or removed an OSD from the cluster.
#. You have just modified your CRUSH map and your PGs are migrating.
#. There is inconsistent data in different replicas of a PG.
#. Ceph is scrubbing a PG's replicas.
#. Ceph doesn't have enough storage capacity to complete backfilling operations.

If one of these circumstances causes Ceph to show ``HEALTH WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases, however, you
might need to take action. An important aspect of monitoring PGs is to check their
status as ``active`` and ``clean``: that is, it is important to ensure that, when the
cluster is up and running, all PGs are ``active`` and (preferably) ``clean``.
To see the status of every PG, run the following command:

.. prompt:: bash $

   ceph pg stat

The output provides the following information: the total number of PGs (x), how many
PGs are in a particular state such as ``active+clean`` (y), and the
amount of data stored (z). ::

   x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail

.. note:: It is common for Ceph to report multiple states for PGs (for example,
   ``active+clean``, ``active+clean+remapped``, ``active+clean+scrubbing``).

Here Ceph shows not only the PG states, but also storage capacity used (aa),
the amount of storage capacity remaining (bb), and the total storage capacity
of the PG. These values can be important in a few cases:

- The cluster is reaching its ``near full ratio`` or ``full ratio``.
- Data is not being distributed across the cluster due to an error in the
  CRUSH configuration.

.. topic:: Placement Group IDs

   PG IDs consist of the pool number (not the pool name) followed by a period
   (.) and a hexadecimal number. You can view pool numbers and their names in
   the output of ``ceph osd lspools``. For example, the first pool that was
   created corresponds to pool number ``1``. A fully qualified PG ID has the
   following form::

      {pool-num}.{pg-id}

   It typically resembles the following::

      1.1701b
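   As a quick illustration of reading a PG ID (the ID ``1.1701b`` is just the
   example above and might not exist in your cluster), you can map it back to
   its pool and OSDs:

   .. prompt:: bash $

      ceph osd lspools      # shows which pool has pool number 1
      ceph pg map 1.1701b   # shows the Up Set and Acting Set for that PG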
To retrieve a list of PGs, run the following command:

.. prompt:: bash $

   ceph pg dump

To format the output in JSON format and save it to a file, run the following command:

.. prompt:: bash $

   ceph pg dump -o {filename} --format=json

To query a specific PG, run the following command:

.. prompt:: bash $
@ -279,17 +280,19 @@ To query a particular placement group, execute the following:
Ceph will output the query in JSON format.

The following subsections describe the most common PG states in detail.

Creating
--------

PGs are created when you create a pool: the command that creates a pool
specifies the total number of PGs for that pool, and when the pool is created
all of those PGs are created as well. Ceph will echo ``creating`` while it is
creating PGs. After the PG(s) are created, the OSDs that are part of a PG's
Acting Set will peer. Once peering is complete, the PG status should be
``active+clean``. This status means that Ceph clients can begin writing to the
PG.

.. ditaa::
@ -300,43 +303,38 @@ to the placement group.
Peering
-------

When a PG peers, the OSDs that store the replicas of its data converge on an
agreed state of the data and metadata within that PG. When peering is complete,
those OSDs agree about the state of that PG. However, completion of the peering
process does **NOT** mean that each replica has the latest contents.

.. topic:: Authoritative History

   Ceph will **NOT** acknowledge a write operation to a client until that write
   operation is persisted by every OSD in the Acting Set. This practice ensures
   that at least one member of the Acting Set will have a record of every
   acknowledged write operation since the last successful peering operation.

   Given an accurate record of each acknowledged write operation, Ceph can
   construct a new authoritative history of the PG--that is, a complete and
   fully ordered set of operations that, if performed, would bring an OSD's
   copy of the PG up to date.

Active
------

After Ceph has completed the peering process, a PG should become ``active``.
The ``active`` state means that the data in the PG is generally available for
read and write operations in the primary and replica OSDs.

Clean
-----

When a PG is in the ``clean`` state, all OSDs holding its data and metadata
have successfully peered and there are no stray replicas. Ceph has replicated
all objects in the PG the correct number of times.

Degraded
@ -344,143 +342,147 @@ Degraded
When a client writes an object to the primary OSD, the primary OSD is
responsible for writing the replicas to the replica OSDs. After the primary OSD
writes the object to storage, the PG will remain in a ``degraded``
state until the primary OSD has received an acknowledgement from the replica
OSDs that Ceph created the replica objects successfully.

The reason that a PG can be ``active+degraded`` is that an OSD can be
``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes
``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must
peer again when the OSD comes back online. However, a client can still write a
new object to a ``degraded`` PG if it is ``active``.

If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark the
``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
to another OSD. The time between being marked ``down`` and being marked ``out``
is determined by ``mon_osd_down_out_interval``, which is set to ``600`` seconds
by default.
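As a sketch, this interval can be inspected and changed with the config
commands shown below (the value ``900`` is an arbitrary illustration, not a
recommendation):

.. prompt:: bash $

   ceph config get mon mon_osd_down_out_interval
   ceph config set mon mon_osd_down_out_interval 900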
A PG can also be in the ``degraded`` state because there are one or more
objects that Ceph expects to find in the PG but that Ceph cannot find. Although
you cannot read or write to unfound objects, you can still access all of the other
objects in the ``degraded`` PG.
Recovering
----------

Ceph was designed for fault-tolerance, because hardware and other server
problems are expected or even routine. When an OSD goes ``down``, its contents
might fall behind the current state of other replicas in the PGs. When the OSD
has returned to the ``up`` state, the contents of the PGs must be updated to
reflect that current state. During that time period, the OSD might be in a
``recovering`` state.

Recovery is not always trivial, because a hardware failure might cause a
cascading failure of multiple OSDs. For example, a network switch for a rack or
cabinet might fail, which can cause the OSDs of a number of host machines to
fall behind the current state of the cluster. In such a scenario, general
recovery is possible only if each of the OSDs recovers after the fault has been
resolved.
Ceph provides a number of settings that determine how the cluster balances the
resource contention between the need to process new service requests and the
need to recover data objects and restore the PGs to the current state. The
``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and
even process some replay requests before starting the recovery process. The
``osd_recovery_thread_timeout`` setting determines the duration of a thread
timeout, because multiple OSDs might fail, restart, and re-peer at staggered
rates. The ``osd_recovery_max_active`` setting limits the number of recovery
requests an OSD can entertain simultaneously, in order to prevent the OSD from
failing to serve. The ``osd_recovery_max_chunk`` setting limits the size of
the recovered data chunks, in order to prevent network congestion.
Back Filling
------------

When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are
already in the cluster to the newly added OSD. It can put excessive load on the
new OSD to force it to immediately accept the reassigned PGs. Back filling the
OSD with the PGs allows this process to begin in the background. After the
backfill operations have completed, the new OSD will begin serving requests as
soon as it is ready.

During the backfill operations, you might see one of several states:
``backfill_wait`` indicates that a backfill operation is pending, but is not
yet underway; ``backfilling`` indicates that a backfill operation is currently
underway; and ``backfill_toofull`` indicates that a backfill operation was
requested but couldn't be completed due to insufficient storage capacity. When
a PG cannot be backfilled, it might be considered ``incomplete``.

The ``backfill_toofull`` state might be transient. It might happen that, as PGs
are moved around, space becomes available. The ``backfill_toofull`` state is
similar to ``backfill_wait`` in that backfill operations can proceed as soon as
conditions change.
Ceph provides a number of settings to manage the load spike associated with Ceph provides a number of settings to manage the load spike associated with the
reassigning placement groups to an OSD (especially a new OSD). By default, reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills``
``osd_max_backfills`` sets the maximum number of concurrent backfills to and from setting specifies the maximum number of concurrent backfills to and from an OSD
an OSD to 1. The ``backfill full ratio`` enables an OSD to refuse a (default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a
backfill request if the OSD is approaching its full ratio (90%, by default) and backfill request if the OSD is approaching its full ratio (default: 90%). This
change with ``ceph osd set-backfillfull-ratio`` command. setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If
If an OSD refuses a backfill request, the ``osd backfill retry interval`` an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting
enables an OSD to retry the request (after 30 seconds, by default). OSDs can allows an OSD to retry the request after a certain interval (default: 30
also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan seconds). OSDs can also set ``osd_backfill_scan_min`` and
intervals (64 and 512, by default). ``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and
512, respectively).
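For example, the backfill concurrency and the backfillfull ratio can be adjusted at runtime as follows (the values shown are illustrative only):
.. prompt:: bash $
ceph config set osd osd_max_backfills 2
ceph osd set-backfillfull-ratio 0.92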
Remapped Remapped
-------- --------
When the Acting Set that services a placement group changes, the data migrates When the Acting Set that services a PG changes, the data migrates from the old
from the old acting set to the new acting set. It may take some time for a new Acting Set to the new Acting Set. Because it might take time for the new
primary OSD to service requests. So it may ask the old primary to continue to primary OSD to begin servicing requests, the old primary OSD might be required
service requests until the placement group migration is complete. Once data to continue servicing requests until the PG data migration is complete. After
migration completes, the mapping uses the primary OSD of the new acting set. data migration has completed, the mapping uses the primary OSD of the new
Acting Set.
Stale Stale
----- -----
While Ceph uses heartbeats to ensure that hosts and daemons are running, the Although Ceph uses heartbeats in order to ensure that hosts and daemons are
``ceph-osd`` daemons may also get into a ``stuck`` state where they are not running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are
reporting statistics in a timely manner (e.g., a temporary network fault). By not reporting statistics in a timely manner (for example, there might be a
default, OSD daemons report their placement group, up through, boot and failure temporary network fault). By default, OSD daemons report their PG, up through,
statistics every half second (i.e., ``0.5``), which is more frequent than the boot, and failure statistics every half second (that is, in accordance with a
heartbeat thresholds. If the **Primary OSD** of a placement group's acting set value of ``0.5``), which is more frequent than the reports defined by the
fails to report to the monitor or if other OSDs have reported the primary OSD heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report
``down``, the monitors will mark the placement group ``stale``. to the monitor or if other OSDs have reported the primary OSD ``down``, the
monitors will mark the PG ``stale``.
When you start your cluster, it is common to see the ``stale`` state until When you start your cluster, it is common to see the ``stale`` state until the
the peering process completes. After your cluster has been running for awhile, peering process completes. After your cluster has been running for a while,
seeing placement groups in the ``stale`` state indicates that the primary OSD however, seeing PGs in the ``stale`` state indicates that the primary OSD for
for those placement groups is ``down`` or not reporting placement group statistics those PGs is ``down`` or not reporting PG statistics to the monitor.
to the monitor.
Identifying Troubled PGs Identifying Troubled PGs
======================== ========================
As previously noted, a placement group is not necessarily problematic just As previously noted, a PG is not necessarily having problems just because its
because its state is not ``active+clean``. Generally, Ceph's ability to self state is not ``active+clean``. When PGs are stuck, this might indicate that
repair may not be working when placement groups get stuck. The stuck states Ceph cannot perform self-repairs. The stuck states include:
include:
- **Unclean**: Placement groups contain objects that are not replicated the - **Unclean**: PGs contain objects that have not been replicated the desired
desired number of times. They should be recovering. number of times. Under normal conditions, it can be assumed that these PGs
- **Inactive**: Placement groups cannot process reads or writes because they are recovering.
are waiting for an OSD with the most up-to-date data to come back ``up``. - **Inactive**: PGs cannot process reads or writes because they are waiting for
- **Stale**: Placement groups are in an unknown state, because the OSDs that an OSD that has the most up-to-date data to come back ``up``.
host them have not reported to the monitor cluster in a while (configured - **Stale**: PG are in an unknown state, because the OSDs that host them have
by ``mon osd report timeout``). not reported to the monitor cluster for a certain period of time (determined
by ``mon_osd_report_timeout``).
To identify stuck placement groups, execute the following: To identify stuck PGs, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded] ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
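To inspect a single stuck PG in more detail, you can query it by its ID (``1.4`` here is only an example PG ID):
.. prompt:: bash $
ceph pg 1.4 query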
See `Placement Group Subsystem`_ for additional details. To troubleshoot For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs,
stuck placement groups, see `Troubleshooting PG Errors`_. see `Troubleshooting PG Errors`_.
Finding an Object Location Finding an Object Location
@ -491,10 +493,10 @@ To store object data in the Ceph Object Store, a Ceph client must:
#. Set an object name #. Set an object name
#. Specify a `pool`_ #. Specify a `pool`_
The Ceph client retrieves the latest cluster map and the CRUSH algorithm The Ceph client retrieves the latest cluster map, the CRUSH algorithm
calculates how to map the object to a `placement group`_, and then calculates calculates how to map the object to a PG, and then the algorithm calculates how
how to assign the placement group to an OSD dynamically. To find the object to dynamically assign the PG to an OSD. To find the object location given only
location, all you need is the object name and the pool name. For example: the object name and the pool name, run a command of the following form:
.. prompt:: bash $ .. prompt:: bash $
@ -502,8 +504,8 @@ location, all you need is the object name and the pool name. For example:
.. topic:: Exercise: Locate an Object .. topic:: Exercise: Locate an Object
As an exercise, let's create an object. Specify an object name, a path As an exercise, let's create an object. We can specify an object name, a path
to a test file containing some object data and a pool name using the to a test file that contains some object data, and a pool name by using the
``rados put`` command on the command line. For example: ``rados put`` command on the command line. For example:
.. prompt:: bash $ .. prompt:: bash $
@ -511,14 +513,14 @@ location, all you need is the object name and the pool name. For example:
rados put {object-name} {file-path} --pool=data rados put {object-name} {file-path} --pool=data
rados put test-object-1 testfile.txt --pool=data rados put test-object-1 testfile.txt --pool=data
To verify that the Ceph Object Store stored the object, execute the To verify that the Ceph Object Store stored the object, run the
following: following command:
.. prompt:: bash $ .. prompt:: bash $
rados -p data ls rados -p data ls
Now, identify the object location: To identify the object location, run the following commands:
.. prompt:: bash $ .. prompt:: bash $
@ -529,17 +531,16 @@ location, all you need is the object name and the pool name. For example:
osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0) osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
To remove the test object, simply delete it using the ``rados rm`` To remove the test object, simply delete it by running the ``rados rm``
command. For example: command. For example:
.. prompt:: bash $ .. prompt:: bash $
rados rm test-object-1 --pool=data rados rm test-object-1 --pool=data
As the cluster evolves, the object location may change dynamically. One benefit As the cluster evolves, the object location may change dynamically. One benefit
of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually
the migration manually. See the `Architecture`_ section for details. performing the migration. For details, see the `Architecture`_ section.
.. _data placement: ../data-placement .. _data placement: ../data-placement
.. _pool: ../pools .. _pool: ../pools
View File
@ -2,9 +2,9 @@
Monitoring a Cluster Monitoring a Cluster
====================== ======================
Once you have a running cluster, you may use the ``ceph`` tool to monitor your After you have a running cluster, you can use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status and metadata server status. status, placement group status, and metadata server status.
Using the command line Using the command line
====================== ======================
@ -30,8 +30,9 @@ with no arguments. For example:
Non-default paths Non-default paths
----------------- -----------------
If you specified non-default locations for your configuration or keyring, If you specified non-default locations for your configuration or keyring when
you may specify their locations: you install the cluster, you may specify their locations to the ``ceph`` tool
by running the following command:
.. prompt:: bash $ .. prompt:: bash $
@ -40,30 +41,32 @@ you may specify their locations:
Checking a Cluster's Status Checking a Cluster's Status
=========================== ===========================
After you start your cluster, and before you start reading and/or After you start your cluster, and before you start reading and/or writing data,
writing data, check your cluster's status first. you should check your cluster's status.
To check a cluster's status, execute the following: To check a cluster's status, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph status ceph status
Or: Alternatively, you can run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph -s ceph -s
In interactive mode, type ``status`` and press **Enter**: In interactive mode, this operation is performed by typing ``status`` and
pressing **Enter**:
.. prompt:: ceph> .. prompt:: ceph>
:prompts: ceph> :prompts: ceph>
status status
Ceph will print the cluster status. For example, a tiny Ceph demonstration Ceph will print the cluster status. For example, a tiny Ceph "demonstration
cluster with one of each service may print the following: cluster" that is running one instance of each service (monitor, manager, and
OSD) might print the following:
:: ::
@ -84,33 +87,35 @@ cluster with one of each service may print the following:
pgs: 16 active+clean pgs: 16 active+clean
.. topic:: How Ceph Calculates Data Usage How Ceph Calculates Data Usage
------------------------------
The ``usage`` value reflects the *actual* amount of raw storage used. The The ``usage`` value reflects the *actual* amount of raw storage used. The ``xxx
``xxx GB / xxx GB`` value means the amount available (the lesser number) GB / xxx GB`` value means the amount available (the lesser number) of the
of the overall storage capacity of the cluster. The notional number reflects overall storage capacity of the cluster. The notional number reflects the size
the size of the stored data before it is replicated, cloned or snapshotted. of the stored data before it is replicated, cloned or snapshotted. Therefore,
Therefore, the amount of data actually stored typically exceeds the notional the amount of data actually stored typically exceeds the notional amount
amount stored, because Ceph creates replicas of the data and may also use stored, because Ceph creates replicas of the data and may also use storage
storage capacity for cloning and snapshotting. capacity for cloning and snapshotting.
Watching a Cluster Watching a Cluster
================== ==================
In addition to local logging by each daemon, Ceph clusters maintain Each daemon in the Ceph cluster maintains a log of events, and the Ceph cluster
a *cluster log* that records high level events about the whole system. itself maintains a *cluster log* that records high-level events about the
This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by entire Ceph cluster. These events are logged to disk on monitor servers (in
default), but can also be monitored via the command line. the default location ``/var/log/ceph/ceph.log``), and they can be monitored via
the command line.
To follow the cluster log, use the following command: To follow the cluster log, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph -w ceph -w
Ceph will print the status of the system, followed by each log message as it Ceph will print the status of the system, followed by each log message as it is
is emitted. For example: added. For example:
:: ::
@ -135,21 +140,20 @@ is emitted. For example:
2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x 2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available 2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available
Instead of printing log lines as they are added, you might want to print only
In addition to using ``ceph -w`` to print log lines as they are emitted, the most recent lines. Run ``ceph log last [n]`` to see the most recent ``n``
use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster lines from the cluster log.
log.
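For example, to print the ten most recent lines from the cluster log, run the following command:
.. prompt:: bash $
ceph log last 10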
Monitoring Health Checks Monitoring Health Checks
======================== ========================
Ceph continuously runs various *health checks* against its own status. When Ceph continuously runs various *health checks*. When
a health check fails, this is reflected in the output of ``ceph status`` (or a health check fails, this failure is reflected in the output of ``ceph status`` and
``ceph health``). In addition, messages are sent to the cluster log to ``ceph health``. The cluster log receives messages that
indicate when a check fails, and when the cluster recovers. indicate when a check has failed and when the cluster has recovered.
For example, when an OSD goes down, the ``health`` section of the status For example, when an OSD goes down, the ``health`` section of the status
output may be updated as follows: output is updated as follows:
:: ::
@ -157,7 +161,7 @@ output may be updated as follows:
1 osds down 1 osds down
Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded
At this time, cluster log messages are also emitted to record the failure of the At the same time, cluster log messages are emitted to record the failure of the
health checks: health checks:
:: ::
@ -166,7 +170,7 @@ health checks:
2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED) 2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)
When the OSD comes back online, the cluster log records the cluster's return When the OSD comes back online, the cluster log records the cluster's return
to a health state: to a healthy state:
:: ::
@ -177,21 +181,23 @@ to a health state:
Network Performance Checks Network Performance Checks
-------------------------- --------------------------
Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon availability. We Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon
also use the response times to monitor network performance. availability and network performance. If a single delayed response is detected,
While it is possible that a busy OSD could delay a ping response, we can assume this might indicate nothing more than a busy OSD. But if multiple delays
that if a network switch fails multiple delays will be detected between distinct pairs of OSDs. between distinct pairs of OSDs are detected, this might indicate a failed
network switch, a NIC failure, or a layer 1 failure.
By default we will warn about ping times which exceed 1 second (1000 milliseconds). By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a
health check (a ``HEALTH_WARN``). For example:
:: ::
HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms) HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms)
The health detail will add the combination of OSDs are seeing the delays and by how much. There is a limit of 10 In the output of the ``ceph health detail`` command, you can see which OSDs are
detail line items. experiencing delays and how long the delays are. The output of ``ceph health
detail`` is limited to ten lines. Here is an example of the output you can
:: expect from the ``ceph health detail`` command::
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms) [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms)
Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving
@ -199,11 +205,15 @@ detail line items.
Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec
Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec
To see even more detail and a complete dump of network performance information the ``dump_osd_network`` command can be used. Typically, this would be To see more detail and to collect a complete dump of network performance
sent to a mgr, but it can be limited to a particular OSD's interactions by issuing it to any OSD. The current threshold which defaults to 1 second information, use the ``dump_osd_network`` command. This command is usually sent
(1000 milliseconds) can be overridden as an argument in milliseconds. to a Ceph Manager Daemon, but it can be used to collect information about a
specific OSD's interactions by sending it to that OSD. The default threshold
for a slow heartbeat is 1 second (1000 milliseconds), but this can be
overridden by providing a number of milliseconds as an argument.
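For example, to limit the dump to a single OSD's interactions and show all of its gathered data, the command can be sent to that OSD with a threshold of 0 (``osd.2`` is only an example daemon ID):
.. prompt:: bash $
ceph daemon osd.2 dump_osd_network 0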
The following command will show all gathered network performance data by specifying a threshold of 0 and sending to the mgr. To show all network performance data with a specified threshold of 0, send the
following command to the mgr:
.. prompt:: bash $ .. prompt:: bash $
@ -287,26 +297,26 @@ The following command will show all gathered network performance data by specify
Muting health checks Muting Health Checks
-------------------- --------------------
Health checks can be muted so that they do not affect the overall Health checks can be muted so that they have no effect on the overall
reported status of the cluster. Alerts are specified using the health reported status of the cluster. For example, if the cluster has raised a
check code (see :ref:`health-checks`): single health check and you then mute that health check, the cluster will report a status of ``HEALTH_OK``.
To mute a specific health check, use the health check code that corresponds to that health check (see :ref:`health-checks`), and
run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph health mute <code> ceph health mute <code>
For example, if there is a health warning, muting it will make the For example, to mute an ``OSD_DOWN`` health check, run the following command:
cluster report an overall status of ``HEALTH_OK``. For example, to
mute an ``OSD_DOWN`` alert,:
.. prompt:: bash $ .. prompt:: bash $
ceph health mute OSD_DOWN ceph health mute OSD_DOWN
Mutes are reported as part of the short and long form of the ``ceph health`` command. Mutes are reported as part of the short and long form of the ``ceph health`` command's output.
For example, in the above scenario, the cluster would report: For example, in the above scenario, the cluster would report:
.. prompt:: bash $ .. prompt:: bash $
@ -327,7 +337,7 @@ For example, in the above scenario, the cluster would report:
(MUTED) OSD_DOWN 1 osds down (MUTED) OSD_DOWN 1 osds down
osd.1 is down osd.1 is down
A mute can be explicitly removed with: A mute can be removed by running the following command:
.. prompt:: bash $ .. prompt:: bash $
@ -339,56 +349,44 @@ For example:
ceph health unmute OSD_DOWN ceph health unmute OSD_DOWN
A health check mute may optionally have a TTL (time to live) A "health mute" can have a TTL (**T**\ime **T**\o **L**\ive)
associated with it, such that the mute will automatically expire associated with it: this means that the mute will automatically expire
after the specified period of time has elapsed. The TTL is specified as an optional after a specified period of time. The TTL is specified as an optional
duration argument, e.g.: duration argument, as seen in the following examples:
.. prompt:: bash $ .. prompt:: bash $
ceph health mute OSD_DOWN 4h # mute for 4 hours ceph health mute OSD_DOWN 4h # mute for 4 hours
ceph health mute MON_DOWN 15m # mute for 15 minutes ceph health mute MON_DOWN 15m # mute for 15 minutes
Normally, if a muted health alert is resolved (e.g., in the example Normally, if a muted health check is resolved (for example, if the OSD that raised the ``OSD_DOWN`` health check
above, the OSD comes back up), the mute goes away. If the alert comes in the example above has come back up), the mute goes away. If the health check comes
back later, it will be reported in the usual way. back later, it will be reported in the usual way.
It is possible to make a mute "sticky" such that the mute will remain even if the It is possible to make a health mute "sticky": this means that the mute will remain even if the
alert clears. For example: health check clears. For example, to make a health mute "sticky", you might run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour
Most health mutes also disappear if the extent of an alert gets worse. For example, Most health mutes disappear if the unhealthy condition that triggered the health check gets worse.
if there is one OSD down, and the alert is muted, the mute will disappear if one For example, suppose that there is one OSD down and the health check is muted. In that case, if
or more additional OSDs go down. This is true for any health alert that involves one or more additional OSDs go down, then the health mute disappears. This behavior occurs in any health check with a threshold value.
a count indicating how much or how many of something is triggering the warning or
error.
Detecting configuration issues
==============================
In addition to the health checks that Ceph continuously runs on its
own status, there are some configuration issues that may only be detected
by an external tool.
Use the `ceph-medic`_ tool to run these additional checks on your Ceph
cluster's configuration.
Checking a Cluster's Usage Stats Checking a Cluster's Usage Stats
================================ ================================
To check a cluster's data usage and data distribution among pools, you can To check a cluster's data usage and data distribution among pools, use the
use the ``df`` option. It is similar to Linux ``df``. Execute ``df`` command. This option is similar to Linux's ``df`` command. Run the
the following: following command:
.. prompt:: bash $ .. prompt:: bash $
ceph df ceph df
The output of ``ceph df`` looks like this:: The output of ``ceph df`` resembles the following::
CLASS SIZE AVAIL USED RAW USED %RAW USED CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00 ssd 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00
@ -401,51 +399,48 @@ The output of ``ceph df`` looks like this::
cephfs.a.data 3 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 99 GiB N/A N/A 0 0 B 0 B cephfs.a.data 3 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 99 GiB N/A N/A 0 0 B 0 B
test 4 32 22 MiB 22 MiB 50 KiB 248 19 MiB 19 MiB 50 KiB 0 297 GiB N/A N/A 248 0 B 0 B test 4 32 22 MiB 22 MiB 50 KiB 248 19 MiB 19 MiB 50 KiB 0 297 GiB N/A N/A 248 0 B 0 B
- **CLASS:** For example, "ssd" or "hdd".
- **CLASS:** for example, "ssd" or "hdd"
- **SIZE:** The amount of storage capacity managed by the cluster. - **SIZE:** The amount of storage capacity managed by the cluster.
- **AVAIL:** The amount of free space available in the cluster. - **AVAIL:** The amount of free space available in the cluster.
- **USED:** The amount of raw storage consumed by user data (excluding - **USED:** The amount of raw storage consumed by user data (excluding
BlueStore's database) BlueStore's database).
- **RAW USED:** The amount of raw storage consumed by user data, internal - **RAW USED:** The amount of raw storage consumed by user data, internal
overhead, or reserved capacity. overhead, and reserved capacity.
- **%RAW USED:** The percentage of raw storage used. Use this number in - **%RAW USED:** The percentage of raw storage used. Watch this number in
conjunction with the ``full ratio`` and ``near full ratio`` to ensure that conjunction with ``full ratio`` and ``near full ratio`` to be forewarned when
you are not reaching your cluster's capacity. See `Storage Capacity`_ for your cluster approaches the fullness thresholds. See `Storage Capacity`_.
additional details.
**POOLS:** **POOLS:**
The **POOLS** section of the output provides a list of pools and the notional The POOLS section of the output provides a list of pools and the *notional*
usage of each pool. The output from this section **DOES NOT** reflect replicas, usage of each pool. This section of the output **DOES NOT** reflect replicas,
clones or snapshots. For example, if you store an object with 1MB of data, the clones, or snapshots. For example, if you store an object with 1MB of data,
notional usage will be 1MB, but the actual usage may be 2MB or more depending then the notional usage will be 1MB, but the actual usage might be 2MB or more
on the number of replicas, clones and snapshots. depending on the number of replicas, clones, and snapshots.
- **ID:** The number of the node within the pool. - **ID:** The number of the specific node within the pool.
- **STORED:** actual amount of data user/Ceph has stored in a pool. This is - **STORED:** The actual amount of data that the user has stored in a pool.
similar to the USED column in earlier versions of Ceph but the calculations This is similar to the USED column in earlier versions of Ceph, but the
(for BlueStore!) are more precise (gaps are properly handled). calculations (for BlueStore!) are more precise (in that gaps are properly
handled).
- **(DATA):** usage for RBD (RADOS Block Device), CephFS file data, and RGW - **(DATA):** Usage for RBD (RADOS Block Device), CephFS file data, and RGW
(RADOS Gateway) object data. (RADOS Gateway) object data.
- **(OMAP):** key-value pairs. Used primarily by CephFS and RGW (RADOS - **(OMAP):** Key-value pairs. Used primarily by CephFS and RGW (RADOS
Gateway) for metadata storage. Gateway) for metadata storage.
- **OBJECTS:** The notional number of objects stored per pool. "Notional" is - **OBJECTS:** The notional number of objects stored per pool (that is, the
defined above in the paragraph immediately under "POOLS". number of objects other than replicas, clones, or snapshots).
- **USED:** The space allocated for a pool over all OSDs. This includes - **USED:** The space allocated for a pool over all OSDs. This includes space
replication, allocation granularity, and erasure-coding overhead. Compression for replication, space for allocation granularity, and space for the overhead
savings and object content gaps are also taken into account. BlueStore's associated with erasure-coding. Compression savings and object-content gaps
database is not included in this amount. are also taken into account. However, BlueStore's database is not included in
the amount reported under USED.
- **(DATA):** object usage for RBD (RADOS Block Device), CephFS file data, and RGW - **(DATA):** Object usage for RBD (RADOS Block Device), CephFS file data,
(RADOS Gateway) object data. and RGW (RADOS Gateway) object data.
- **(OMAP):** object key-value pairs. Used primarily by CephFS and RGW (RADOS - **(OMAP):** Object key-value pairs. Used primarily by CephFS and RGW (RADOS
Gateway) for metadata storage. Gateway) for metadata storage.
- **%USED:** The notional percentage of storage used per pool. - **%USED:** The notional percentage of storage used per pool.
@ -454,50 +449,51 @@ on the number of replicas, clones and snapshots.
- **QUOTA OBJECTS:** The number of quota objects. - **QUOTA OBJECTS:** The number of quota objects.
- **QUOTA BYTES:** The number of bytes in the quota objects. - **QUOTA BYTES:** The number of bytes in the quota objects.
- **DIRTY:** The number of objects in the cache pool that have been written to - **DIRTY:** The number of objects in the cache pool that have been written to
the cache pool but have not been flushed yet to the base pool. This field is the cache pool but have not yet been flushed to the base pool. This field is
only available when cache tiering is in use. available only when cache tiering is in use.
- **USED COMPR:** amount of space allocated for compressed data (i.e. this - **USED COMPR:** The amount of space allocated for compressed data. This
includes compressed data plus all the allocation, replication and erasure includes compressed data in addition to all of the space required for
coding overhead). replication, allocation granularity, and erasure-coding overhead.
- **UNDER COMPR:** amount of data passed through compression (summed over all - **UNDER COMPR:** The amount of data that has passed through compression
replicas) and beneficial enough to be stored in a compressed form. (summed over all replicas) and that is worth storing in a compressed form.
.. note:: The numbers in the POOLS section are notional. They are not .. note:: The numbers in the POOLS section are notional. They do not include
inclusive of the number of replicas, snapshots or clones. As a result, the the number of replicas, clones, or snapshots. As a result, the sum of the
sum of the USED and %USED amounts will not add up to the USED and %USED USED and %USED amounts in the POOLS section of the output will not be equal
amounts in the RAW section of the output. to the sum of the USED and %USED amounts in the RAW section of the output.
.. note:: The MAX AVAIL value is a complicated function of the replication .. note:: The MAX AVAIL value is a complicated function of the replication or
or erasure code used, the CRUSH rule that maps storage to devices, the the kind of erasure coding used, the CRUSH rule that maps storage to
utilization of those devices, and the configured ``mon_osd_full_ratio``. devices, the utilization of those devices, and the configured
``mon_osd_full_ratio`` setting.
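For additional per-pool columns (for example, quota and compression information), ``ceph df detail`` can be used instead of plain ``ceph df``:
.. prompt:: bash $
ceph df detail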
Checking OSD Status Checking OSD Status
=================== ===================
You can check OSDs to ensure they are ``up`` and ``in`` by executing the To check if OSDs are ``up`` and ``in``, run the
following command: following command:
.. prompt:: bash # .. prompt:: bash #
ceph osd stat ceph osd stat
Or: Alternatively, you can run the following command:
.. prompt:: bash # .. prompt:: bash #
ceph osd dump ceph osd dump
You can also check view OSDs according to their position in the CRUSH map by To view OSDs according to their position in the CRUSH map, run the following
using the following command: command:
.. prompt:: bash # .. prompt:: bash #
ceph osd tree ceph osd tree
Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up To print out a CRUSH tree that displays a host, its OSDs, whether the OSDs are
and their weight: ``up``, and the weight of the OSDs, run the following command:
.. code-block:: bash .. code-block:: bash
@ -509,36 +505,38 @@ and their weight:
1 ssd 1.00000 osd.1 up 1.00000 1.00000 1 ssd 1.00000 osd.1 up 1.00000 1.00000
2 ssd 1.00000 osd.2 up 1.00000 1.00000 2 ssd 1.00000 osd.2 up 1.00000 1.00000
For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. See `Monitoring OSDs and Placement Groups`_.
Checking Monitor Status Checking Monitor Status
======================= =======================
If your cluster has multiple monitors (likely), you should check the monitor If your cluster has multiple monitors, then you need to perform certain
quorum status after you start the cluster and before reading and/or writing data. A "monitor status" checks. After starting the cluster and before reading or
quorum must be present when multiple monitors are running. You should also check writing data, you should check quorum status. A quorum must be present when
monitor status periodically to ensure that they are running. multiple monitors are running to ensure proper functioning of your Ceph
cluster. Check monitor status regularly in order to ensure that all of the
monitors are running.
To see display the monitor map, execute the following: To display the monitor map, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph mon stat ceph mon stat
Or: Alternatively, you can run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph mon dump ceph mon dump
To check the quorum status for the monitor cluster, execute the following: To check the quorum status for the monitor cluster, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph quorum_status ceph quorum_status
Ceph will return the quorum status. For example, a Ceph cluster consisting of Ceph returns the quorum status. For example, a Ceph cluster that consists of
three monitors may return the following: three monitors might return the following:
.. code-block:: javascript .. code-block:: javascript
@ -583,14 +581,14 @@ Checking MDS Status
=================== ===================
Metadata servers provide metadata services for CephFS. Metadata servers have Metadata servers provide metadata services for CephFS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To ensure your two sets of states: ``up | down`` and ``active | inactive``. To check if your
metadata servers are ``up`` and ``active``, execute the following: metadata servers are ``up`` and ``active``, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph mds stat ceph mds stat
To display details of the metadata cluster, execute the following: To display details of the metadata servers, run the following command:
.. prompt:: bash $ .. prompt:: bash $
@ -600,9 +598,9 @@ To display details of the metadata cluster, execute the following:
Checking Placement Group States Checking Placement Group States
=============================== ===============================
Placement groups map objects to OSDs. When you monitor your Placement groups (PGs) map objects to OSDs. PGs are monitored in order to
placement groups, you will want them to be ``active`` and ``clean``. ensure that they are ``active`` and ``clean``. See `Monitoring OSDs and
For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_. Placement Groups`_.
.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg .. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg
@ -611,37 +609,36 @@ For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
Using the Admin Socket Using the Admin Socket
====================== ======================
The Ceph admin socket allows you to query a daemon via a socket interface. The Ceph admin socket allows you to query a daemon via a socket interface. By
By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon via
via the admin socket, login to the host running the daemon and use the the admin socket, log in to the host that is running the daemon and run one of
following command: the two following commands:
.. prompt:: bash $ .. prompt:: bash $
ceph daemon {daemon-name} ceph daemon {daemon-name}
ceph daemon {path-to-socket-file} ceph daemon {path-to-socket-file}
For example, the following are equivalent: For example, the following commands are equivalent to each other:
.. prompt:: bash $ .. prompt:: bash $
ceph daemon osd.0 foo ceph daemon osd.0 foo
ceph daemon /var/run/ceph/ceph-osd.0.asok foo ceph daemon /var/run/ceph/ceph-osd.0.asok foo
To view the available admin socket commands, execute the following command: To view the available admin-socket commands, run the following command:
.. prompt:: bash $ .. prompt:: bash $
ceph daemon {daemon-name} help ceph daemon {daemon-name} help
The admin socket command enables you to show and set your configuration at Admin-socket commands enable you to view and set your configuration at runtime.
runtime. See `Viewing a Configuration at Runtime`_ for details. For more on viewing your configuration, see `Viewing a Configuration at
Runtime`_. There are two methods of setting configuration value at runtime: (1)
Additionally, you can set configuration values at runtime directly (i.e., the using the admin socket, which bypasses the monitor and requires a direct login
admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id} to the host in question, and (2) using the ``ceph tell {daemon-type}.{id}
config set``, which relies on the monitor but doesn't require you to login config set`` command, which relies on the monitor and does not require a direct
directly to the host in question ). login.
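For example, both of the following commands set an option at runtime; the first uses the admin socket and the second uses the monitor-based ``ceph tell`` path (the option and value shown are illustrative only):
.. prompt:: bash $
ceph daemon osd.0 config set debug_osd 20
ceph tell osd.0 config set debug_osd 20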
.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime .. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime
.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity .. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/
View File
@ -6,50 +6,52 @@
Running Ceph with systemd Running Ceph with systemd
========================== =========================
For all distributions that support systemd (CentOS 7, Fedora, Debian In all distributions that support systemd (CentOS 7, Fedora, Debian
Jessie 8 and later, SUSE), ceph daemons are now managed using native Jessie 8 and later, and SUSE), systemd files (and NOT legacy SysVinit scripts)
systemd files instead of the legacy sysvinit scripts. For example: are used to manage Ceph daemons. Ceph daemons therefore behave like any other daemons
that can be controlled by the ``systemctl`` command, as in the following examples:
.. prompt:: bash $ .. prompt:: bash $
sudo systemctl start ceph.target # start all daemons sudo systemctl start ceph.target # start all daemons
sudo systemctl status ceph-osd@12 # check status of osd.12 sudo systemctl status ceph-osd@12 # check status of osd.12
To list the Ceph systemd units on a node, execute: To list all of the Ceph systemd units on a node, run the following command:
.. prompt:: bash $ .. prompt:: bash $
sudo systemctl status ceph\*.service ceph\*.target sudo systemctl status ceph\*.service ceph\*.target
Starting all Daemons
Starting all daemons
-------------------- --------------------
To start all daemons on a Ceph Node (irrespective of type), execute the To start all of the daemons on a Ceph node (regardless of their type), run the
following: following command:
.. prompt:: bash $ .. prompt:: bash $
sudo systemctl start ceph.target sudo systemctl start ceph.target
Stopping all Daemons Stopping all daemons
-------------------- --------------------
To stop all daemons on a Ceph Node (irrespective of type), execute the To stop all of the daemons on a Ceph node (regardless of their type), run the
following: following command:
.. prompt:: bash $ .. prompt:: bash $
sudo systemctl stop ceph\*.service ceph\*.target sudo systemctl stop ceph\*.service ceph\*.target
Starting all Daemons by Type Starting all daemons by type
---------------------------- ----------------------------
To start all daemons of a particular type on a Ceph Node, execute one of the To start all of the daemons of a particular type on a Ceph node, run one of the
following: following commands:
.. prompt:: bash $ .. prompt:: bash $
@ -58,24 +60,24 @@ following:
sudo systemctl start ceph-mds.target sudo systemctl start ceph-mds.target
Stopping all Daemons by Type Stopping all daemons by type
---------------------------- ----------------------------
To stop all daemons of a particular type on a Ceph Node, execute one of the To stop all of the daemons of a particular type on a Ceph node, run one of the
following: following commands:
.. prompt:: bash $ .. prompt:: bash $
sudo systemctl stop ceph-mon\*.service ceph-mon.target
sudo systemctl stop ceph-osd\*.service ceph-osd.target sudo systemctl stop ceph-osd\*.service ceph-osd.target
sudo systemctl stop ceph-mon\*.service ceph-mon.target
sudo systemctl stop ceph-mds\*.service ceph-mds.target sudo systemctl stop ceph-mds\*.service ceph-mds.target
Starting a Daemon Starting a daemon
----------------- -----------------
To start a specific daemon instance on a Ceph Node, execute one of the To start a specific daemon instance on a Ceph node, run one of the
following: following commands:
.. prompt:: bash $ .. prompt:: bash $
@ -92,11 +94,11 @@ For example:
sudo systemctl start ceph-mds@ceph-server sudo systemctl start ceph-mds@ceph-server
Stopping a Daemon Stopping a daemon
----------------- -----------------
To stop a specific daemon instance on a Ceph Node, execute one of the To stop a specific daemon instance on a Ceph node, run one of the
following: following commands:
.. prompt:: bash $ .. prompt:: bash $
@ -115,16 +117,15 @@ For example:
.. index:: sysvinit; operating a cluster .. index:: sysvinit; operating a cluster
Running Ceph with sysvinit Running Ceph with SysVinit
========================== ==========================
Each time you to **start**, **restart**, and **stop** Ceph daemons (or your Each time you start, restart, or stop Ceph daemons, you must specify at least one option and one command.
entire cluster) you must specify at least one option and one command. You may Likewise, each time you start, restart, or stop your entire cluster, you must specify at least one option and one command.
also specify a daemon type or a daemon instance. :: In both cases, you can also specify a daemon type or a daemon instance. ::
{commandline} [options] [commands] [daemons] {commandline} [options] [commands] [daemons]
The ``ceph`` options include: The ``ceph`` options include:
+-----------------+----------+-------------------------------------------------+ +-----------------+----------+-------------------------------------------------+
@ -134,12 +135,12 @@ The ``ceph`` options include:
+-----------------+----------+-------------------------------------------------+ +-----------------+----------+-------------------------------------------------+
| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. | | ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. |
+-----------------+----------+-------------------------------------------------+ +-----------------+----------+-------------------------------------------------+
| ``--allhosts`` | ``-a`` | Execute on all nodes in ``ceph.conf.`` | | ``--allhosts`` | ``-a`` | Execute on all nodes listed in ``ceph.conf``. |
| | | Otherwise, it only executes on ``localhost``. | | | | Otherwise, it only executes on ``localhost``. |
+-----------------+----------+-------------------------------------------------+ +-----------------+----------+-------------------------------------------------+
| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. | | ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. |
+-----------------+----------+-------------------------------------------------+ +-----------------+----------+-------------------------------------------------+
| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. | | ``--norestart`` | ``N/A`` | Do not restart a daemon if it core dumps. |
+-----------------+----------+-------------------------------------------------+ +-----------------+----------+-------------------------------------------------+
| ``--conf`` | ``-c`` | Use an alternate configuration file. | | ``--conf`` | ``-c`` | Use an alternate configuration file. |
+-----------------+----------+-------------------------------------------------+ +-----------------+----------+-------------------------------------------------+
@ -153,7 +154,7 @@ The ``ceph`` commands include:
+------------------+------------------------------------------------------------+ +------------------+------------------------------------------------------------+
| ``stop`` | Stop the daemon(s). | | ``stop`` | Stop the daemon(s). |
+------------------+------------------------------------------------------------+ +------------------+------------------------------------------------------------+
| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9`` | | ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9``. |
+------------------+------------------------------------------------------------+ +------------------+------------------------------------------------------------+
| ``killall`` | Kill all daemons of a particular type. | | ``killall`` | Kill all daemons of a particular type. |
+------------------+------------------------------------------------------------+ +------------------+------------------------------------------------------------+
@ -162,15 +163,12 @@ The ``ceph`` commands include:
| ``cleanalllogs`` | Cleans out **everything** in the log directory. | | ``cleanalllogs`` | Cleans out **everything** in the log directory. |
+------------------+------------------------------------------------------------+ +------------------+------------------------------------------------------------+
For subsystem operations, the ``ceph`` service can target specific daemon types The ``[daemons]`` option allows the ``ceph`` service to target specific daemon types
by adding a particular daemon type for the ``[daemons]`` option. Daemon types in order to perform subsystem operations. Daemon types include:
include:
- ``mon`` - ``mon``
- ``osd`` - ``osd``
- ``mds`` - ``mds``
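For example, assuming the SysVinit script is installed at ``/etc/init.d/ceph``, invocations take the following general shape (shown here only as an illustration):
.. prompt:: bash $
sudo /etc/init.d/ceph -a start
sudo /etc/init.d/ceph restart osd.0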
.. _Valgrind: http://www.valgrind.org/ .. _Valgrind: http://www.valgrind.org/
.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html .. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html
View File
@ -1,3 +1,5 @@
.. _rados_operations_pg_concepts:
========================== ==========================
Placement Group Concepts Placement Group Concepts
========================== ==========================
View File
@ -1,59 +1,60 @@
============================ ============================
Repairing PG inconsistencies Repairing PG Inconsistencies
============================ ============================
Sometimes a placement group might become "inconsistent". To return the Sometimes a Placement Group (PG) might become ``inconsistent``. To return the PG
placement group to an active+clean state, you must first determine which to an ``active+clean`` state, you must first determine which of the PGs has become
of the placement groups has become inconsistent and then run the "pg inconsistent and then run the ``pg repair`` command on it. This page contains
repair" command on it. This page contains commands for diagnosing placement commands for diagnosing PGs and the command for repairing PGs that have become
groups and the command for repairing placement groups that have become
inconsistent. inconsistent.
.. highlight:: console .. highlight:: console
Commands for Diagnosing Placement-group Problems Commands for Diagnosing PG Problems
================================================ ===================================
The commands in this section provide various ways of diagnosing broken placement groups. The commands in this section provide various ways of diagnosing broken PGs.
The following command provides a high-level (low detail) overview of the health of the ceph cluster: To see a high-level (low-detail) overview of Ceph cluster health, run the
following command:
.. prompt:: bash # .. prompt:: bash #
ceph health detail ceph health detail
The following command provides more detail on the status of the placement groups: To see more detail on the status of the PGs, run the following command:
.. prompt:: bash # .. prompt:: bash #
ceph pg dump --format=json-pretty ceph pg dump --format=json-pretty
The following command lists inconsistent placement groups: To see a list of inconsistent PGs, run the following command:
.. prompt:: bash # .. prompt:: bash #
rados list-inconsistent-pg {pool} rados list-inconsistent-pg {pool}
The following command lists inconsistent rados objects: To see a list of inconsistent RADOS objects, run the following command:
.. prompt:: bash # .. prompt:: bash #
rados list-inconsistent-obj {pgid} rados list-inconsistent-obj {pgid}
The following command lists inconsistent snapsets in the given placement group: To see a list of inconsistent snapsets in a specific PG, run the following
command:
.. prompt:: bash # .. prompt:: bash #
rados list-inconsistent-snapset {pgid} rados list-inconsistent-snapset {pgid}
Commands for Repairing Placement Groups Commands for Repairing PGs
======================================= ==========================
The form of the command to repair a broken placement group is: The form of the command to repair a broken PG is as follows:
.. prompt:: bash # .. prompt:: bash #
ceph pg repair {pgid} ceph pg repair {pgid}
Where ``{pgid}`` is the id of the affected placement group. Here ``{pgid}`` represents the id of the affected PG.
For example: For example:
@ -61,23 +62,57 @@ For example:
ceph pg repair 1.4 ceph pg repair 1.4
More Information on Placement Group Repair .. note:: PG IDs have the form ``N.xxxxx``, where ``N`` is the number of the
========================================== pool that contains the PG. The command ``ceph osd lspools`` and the
Ceph stores and updates the checksums of objects stored in the cluster. When a scrub is performed on a placement group, the OSD attempts to choose an authoritative copy from among its replicas. Among all of the possible cases, only one case is consistent. After a deep scrub, Ceph calculates the checksum of an object read from the disk and compares it to the checksum previously recorded. If the current checksum and the previously recorded checksums do not match, that is an inconsistency. In the case of replicated pools, any mismatch between the checksum of any replica of an object and the checksum of the authoritative copy means that there is an inconsistency. command ``ceph osd dump | grep pool`` return a list of pool numbers.
The ``pg repair`` command attempts to fix inconsistencies of various kinds. If ``pg repair`` finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If ``pg repair`` finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of ``pg repair``. More Information on PG Repair
=============================
Ceph stores and updates the checksums of objects stored in the cluster. When a
scrub is performed on a PG, the OSD attempts to choose an authoritative copy
from among its replicas. Only one of the possible cases is consistent. After
performing a deep scrub, Ceph calculates the checksum of an object that is read
from disk and compares it to the checksum that was previously recorded. If the
current checksum and the previously recorded checksum do not match, that
mismatch is considered to be an inconsistency. In the case of replicated pools,
any mismatch between the checksum of any replica of an object and the checksum
of the authoritative copy means that there is an inconsistency. The discovery
of these inconsistencies causes a PG's state to be set to ``inconsistent``.
For erasure coded and BlueStore pools, Ceph will automatically repair The ``pg repair`` command attempts to fix inconsistencies of various kinds. If
if ``osd_scrub_auto_repair`` (default ``false`) is set to ``true`` and ``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of
at most ``osd_scrub_auto_repair_num_errors`` (default ``5``) errors are found. the inconsistent copy with the digest of the authoritative copy. If ``pg
repair`` finds an inconsistent replicated pool, it marks the inconsistent copy
as missing. In the case of replicated pools, recovery is beyond the scope of
``pg repair``.
``pg repair`` will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them. In the case of erasure-coded and BlueStore pools, Ceph will automatically
perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to
``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default
``5``) errors are found.
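For example, automatic repair during scrubs can be enabled at runtime by running the following command:
.. prompt:: bash #
ceph config set osd osd_scrub_auto_repair true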
The checksum of a RADOS object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. "ceph pg repair" is able to repair things even when checksums are not available to it, as in the case of Filestore. When replicated Filestore pools are in play, users might prefer manual repair over ``ceph pg repair``. The ``pg repair`` command will not solve every problem. Ceph does not
automatically repair PGs when they are found to contain inconsistencies.
The material in this paragraph is relevant for Filestore, and BlueStore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, ``pg repair`` favors the data on the primary. this might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the ``eph-objectstore-tool``. The checksum of a RADOS object or an omap is not always available. Checksums
are calculated incrementally. If a replicated object is updated
non-sequentially, the write operation involved in the update changes the object
and invalidates its checksum. The whole object is not read while the checksum
is recalculated. The ``pg repair`` command is able to make repairs even when
checksums are not available to it, as in the case of Filestore. Users working
with replicated Filestore pools might prefer manual repair to ``ceph pg
repair``.
This material is relevant for Filestore, but not for BlueStore, which has its
own internal checksums. The matched-record checksum and the calculated checksum
cannot prove that any specific copy is in fact authoritative. If there is no
checksum available, ``pg repair`` favors the data on the primary, but this
might not be the uncorrupted replica. Because of this uncertainty, human
intervention is necessary when an inconsistency is discovered. This
intervention sometimes involves use of ``ceph-objectstore-tool``.
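When manual investigation is called for, a typical sequence is to locate the
inconsistent PG, list its inconsistent objects, and only then ask Ceph to
repair it. The pool name ``rbd`` and the PG ID ``1.4b`` below are placeholders
used for illustration only:

.. prompt:: bash $

ceph health detail
rados list-inconsistent-pg rbd
rados list-inconsistent-obj 1.4b --format=json-pretty
ceph pg repair 1.4b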
External Links
==============
https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page
contains a walkthrough of the repair of a PG. It is recommended reading if you
want to repair a PG but have never done so.
File diff suppressed because it is too large

File diff suppressed because it is too large

View File
@ -7,88 +7,123 @@ Stretch Clusters
Stretch Clusters
================
A stretch cluster is a cluster that has servers in geographically separated
data centers, distributed over a WAN. Stretch clusters have LAN-like high-speed
and low-latency connections, but limited links. Stretch clusters have a higher
likelihood of (possibly asymmetric) network splits, and a higher likelihood of
temporary or complete loss of an entire data center (which can represent
one-third to one-half of the total cluster).

Ceph is designed with the expectation that all parts of its network and cluster
will be reliable and that failures will be distributed randomly across the
CRUSH map. Even if a switch goes down and causes the loss of many OSDs, Ceph is
designed so that the remaining OSDs and monitors will route around such a loss.

Sometimes this cannot be relied upon. If you have a "stretched-cluster"
deployment in which much of your cluster is behind a single network component,
you might need to use **stretch mode** to ensure data integrity.

Here we will consider two standard configurations: a configuration with two
data centers (or, in clouds, two availability zones), and a configuration with
three data centers (or, in clouds, three availability zones).
In the two-site configuration, Ceph expects each of the sites to hold a copy of
the data, and Ceph also expects there to be a third site that has a tiebreaker
monitor. This tiebreaker monitor picks a winner if the network connection fails
and both data centers remain alive.
The tiebreaker monitor can be a VM. It can also have high latency relative to
the two main sites.
The standard Ceph configuration is able to survive MANY network failures or
data-center failures without ever compromising data availability. If enough
Ceph servers are brought back following a failure, the cluster *will* recover.
If you lose a data center but are still able to form a quorum of monitors and
still have all the data available, Ceph will maintain availability. (This
assumes that the cluster has enough copies to satisfy the pools' ``min_size``
configuration option, or (failing that) that the cluster has CRUSH rules in
place that will cause the cluster to re-replicate the data until the
``min_size`` configuration option has been met.)
Stretch Cluster Issues
======================
Ceph does not permit the compromise of data integrity and data consistency
under any circumstances. When service is restored after a network failure or a
loss of Ceph nodes, Ceph will restore itself to a state of normal functioning
without operator intervention.

Ceph does not permit the compromise of data integrity or data consistency, but
there are situations in which *data availability* is compromised. These
situations can occur even though there are enough servers available to satisfy
Ceph's consistency and sizing constraints. In some situations, you might
discover that your cluster does not satisfy those constraints.
The first category of these failures that we will discuss involves inconsistent
networks -- if there is a netsplit (a disconnection between two servers that
splits the network into two pieces), Ceph might be unable to mark OSDs ``down``
and remove them from the acting PG sets. This failure to mark OSDs ``down``
will occur, despite the fact that the primary PG is unable to replicate data (a
situation that, under normal non-netsplit circumstances, would result in the
marking of affected OSDs as ``down`` and their removal from the PG). If this
happens, Ceph will be unable to satisfy its durability guarantees and
consequently IO will not be permitted.
The second category of failures that we will discuss involves the situation in
which the constraints are not sufficient to guarantee the replication of data
across data centers, though it might seem that the data is correctly replicated
across data centers. For example, in a scenario in which there are two data
centers named Data Center A and Data Center B, and the CRUSH rule targets three
replicas and places a replica in each data center with a ``min_size`` of ``2``,
the PG might go active with two replicas in Data Center A and zero replicas in
Data Center B. In a situation of this kind, the loss of Data Center A means
that the data is lost and Ceph will not be able to operate on it. This
situation is surprisingly difficult to avoid using only standard CRUSH rules.
Stretch Mode
============
Stretch mode is designed to handle deployments in which you cannot guarantee the
replication of data across two data centers. This kind of situation can arise
when the cluster's CRUSH rule specifies that three copies are to be made, but
then a copy is placed in each data center with a ``min_size`` of 2. Under such
conditions, a placement group can become active with two copies in the first
data center and no copies in the second data center.
Entering Stretch Mode
---------------------
To enable stretch mode, you must set the location of each monitor, matching
your CRUSH map. This procedure shows how to do this.
#. Place ``mon.a`` in your first data center:
.. prompt:: bash $
ceph mon set_location a datacenter=site1
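The remaining monitors are assigned locations in the same way. As a sketch,
assuming a hypothetical five-monitor deployment with ``mon.b`` in the first
data center and ``mon.c`` and ``mon.d`` in the second (the tiebreaker,
``mon.e``, is handled in a later step):

.. prompt:: bash $

ceph mon set_location b datacenter=site1
ceph mon set_location c datacenter=site2
ceph mon set_location d datacenter=site2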
#. Generate a CRUSH rule that places two copies in each data center.
This requires editing the CRUSH map directly:

.. prompt:: bash $

ceph osd getcrushmap > crush.map.bin
crushtool -d crush.map.bin -o crush.map.txt

#. Edit the ``crush.map.txt`` file to add a new rule. Here there is only one
other rule (``id 1``), but you might need to use a different rule ID. We
have two data-center buckets named ``site1`` and ``site2``:

::
rule stretch_rule {
        id 1
        min_size 1
        max_size 10
        type replicated
        step take site1
        step chooseleaf firstn 2 type host
        step emit
        step take site2
        step chooseleaf firstn 2 type host
        step emit
}
#. Inject the CRUSH map to make the rule available to the cluster:

.. prompt:: bash $

crushtool -c crush.map.txt -o crush2.map.bin
ceph osd setcrushmap -i crush2.map.bin
#. Run the monitors in connectivity mode. See `Changing Monitor Elections`_.
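As a sketch, the connectivity election strategy can be selected with the
command below; see the linked page for details and caveats:

.. prompt:: bash $

ceph mon set election_strategy connectivity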
#. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the
tiebreaker monitor and we are splitting across data centers. The tiebreaker
monitor must be assigned a data center that is neither ``site1`` nor
``site2``. For this purpose you can create another data-center bucket named
``site3`` in your CRUSH and place ``mon.e`` there:
.. prompt:: bash $
ceph mon set_location e datacenter=site3
ceph mon enable_stretch_mode e stretch_rule datacenter
When stretch mode is enabled, PGs will become active only when they peer
across data centers (or across whichever CRUSH bucket type was specified),
assuming both are alive. Pools will increase in size from the default ``3`` to
``4``, and two copies will be expected in each site. OSDs will be allowed to
connect to monitors only if they are in the same data center as the monitors.
New monitors will not be allowed to join the cluster if they do not specify a
location.
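One way to confirm that stretch mode has taken effect is to check that
replicated pools now report ``size 4`` and ``min_size 2``, and to review the
monitor map. This is only an illustrative check; output formatting varies by
release:

.. prompt:: bash $

ceph osd pool ls detail
ceph mon dump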
If all OSDs and monitors in one of the data centers become inaccessible at once,
the surviving data center enters a "degraded stretch mode". A warning will be
issued, the ``min_size`` will be reduced to ``1``, and the cluster will be
allowed to go active with the data in the single remaining site. The pool size
does not change, so warnings will be generated that report that the pools are
too small -- but a special stretch mode flag will prevent the OSDs from
creating extra copies in the remaining data center. This means that the data
center will keep only two copies, just as before.
When the missing data center comes back, the cluster will enter a "recovery
stretch mode". This changes the warning and allows peering, but requires OSDs
only from the data center that was ``up`` throughout the duration of the
downtime. When all PGs are in a known state, and are neither degraded nor
incomplete, the cluster transitions back to regular stretch mode, ends the
warning, restores ``min_size`` to its original value (``2``), requires both
sites to peer, and no longer requires the site that was up throughout the
duration of the downtime when peering (which makes failover to the other site
possible, if needed).
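These transitions are reflected in the cluster's health output. No specific
warning names are assumed here; watching the overall status and health detail
is enough to see the cluster move between the modes:

.. prompt:: bash $

ceph -s
ceph health detail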
.. _Changing Monitor elections: ../change-mon-elections

Limitations of Stretch Mode
===========================
When using stretch mode, OSDs must be located at exactly two sites.
Two monitors should be run in each data center, plus a tiebreaker in a third
(or in the cloud) for a total of five monitors. While in stretch mode, OSDs
will connect only to monitors within the data center in which they are located.
OSDs *DO NOT* connect to the tiebreaker monitor.
Erasure-coded pools cannot be used with stretch mode. Attempts to use erasure
coded pools with stretch mode will fail. Erasure coded pools cannot be created
while in stretch mode.

To use stretch mode, you will need to create a CRUSH rule that provides two
replicas in each data center. Ensure that there are four total replicas: two in
each data center. If pools exist in the cluster that do not have the default
``size`` or ``min_size``, Ceph will not enter stretch mode. An example of such
a CRUSH rule is given above.
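Before enabling stretch mode, it may help to confirm that existing replicated
pools still use the default ``size 3`` and ``min_size 2``. A pool that has been
changed can be set back with commands like the following, where ``mypool`` is a
placeholder name:

.. prompt:: bash $

ceph osd pool ls detail
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2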
Because stretch mode runs with ``min_size`` set to ``1`` (or, more directly,
``min_size 1``), we recommend enabling stretch mode only when using OSDs on
SSDs (including NVMe OSDs). Hybrid HDD+SSD or HDD-only OSDs are not recommended
due to the long time it takes for them to recover after connectivity between
data centers has been restored. This reduces the potential for data loss.
In the future, stretch mode might support erasure-coded pools and might support
deployments that have more than two data centers.
Other commands
==============
Replacing a failed tiebreaker monitor
-------------------------------------
Turn on a new monitor and run the following command:
.. prompt:: bash $

ceph mon set_new_tiebreaker mon.<new_mon_name>
This command protests if the new monitor is in the same location as the
existing non-tiebreaker monitors. **This command WILL NOT remove the previous
tiebreaker monitor.** Remove the previous tiebreaker monitor yourself.
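As a sketch, after the new tiebreaker has been set and the old monitor daemon
has been stopped and decommissioned, the old monitor can be removed from the
monitor map with the standard removal command (``<old_mon_name>`` is a
placeholder):

.. prompt:: bash $

ceph mon remove <old_mon_name>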
Using "--set-crush-location" and not "ceph mon set_location"
------------------------------------------------------------
If you write your own tooling for deploying Ceph, use the
``--set-crush-location`` option when booting monitors instead of running ``ceph
mon set_location``. This option accepts only a single ``bucket=loc`` pair (for
example, ``ceph-mon --set-crush-location 'datacenter=a'``), and that pair must
match the bucket type that was specified when running ``enable_stretch_mode``.
Forcing recovery stretch mode
-----------------------------
When in stretch degraded mode, the cluster will go into "recovery" mode
automatically when the disconnected data center comes back. If that does not
happen or you want to enable recovery mode early, run the following command:
.. prompt:: bash $

ceph osd force_recovery_stretch_mode --yes-i-really-mean-it
Forcing normal stretch mode
---------------------------
When in recovery mode, the cluster should go back into normal stretch mode when
the PGs are healthy. If this fails to happen or if you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), run the following command:
.. prompt:: bash $

ceph osd force_healthy_stretch_mode --yes-i-really-mean-it
This command can be used to remove the ``HEALTH_WARN`` state, which recovery
mode generates.
Some files were not shown because too many files have changed in this diff