diff --git a/ceph/.github/pull_request_template.md b/ceph/.github/pull_request_template.md
index a5008307f..fd43d808f 100644
--- a/ceph/.github/pull_request_template.md
+++ b/ceph/.github/pull_request_template.md
@@ -21,6 +21,13 @@
- The Signed-off-by line in every git commit is important; see [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst)
-->
+## Contribution Guidelines
+- To sign and title your commits, please refer to [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/main/SubmittingPatches.rst).
+
+- If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to [Submitting Patches to Ceph - Backports](https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst) for the proper workflow.
+
+- When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.
+
## Checklist
- Tracker (select at least one)
- [ ] References tracker ticket
diff --git a/ceph/.github/workflows/pr-checklist.yml b/ceph/.github/workflows/pr-checklist.yml
index c916db33e..0b3f2d6e5 100644
--- a/ceph/.github/workflows/pr-checklist.yml
+++ b/ceph/.github/workflows/pr-checklist.yml
@@ -11,6 +11,9 @@ jobs:
runs-on: ubuntu-latest
name: Verify
steps:
+ - name: Sleep for 30 seconds
+ run: sleep 30s
+ shell: bash
- name: Action
id: checklist
uses: ceph/ceph-pr-checklist-action@32e92d1a2a7c9991ed51de5fccb2296551373d60
diff --git a/ceph/CMakeLists.txt b/ceph/CMakeLists.txt
index b52e8f82f..f37cdae2a 100644
--- a/ceph/CMakeLists.txt
+++ b/ceph/CMakeLists.txt
@@ -1,7 +1,7 @@
cmake_minimum_required(VERSION 3.16)
project(ceph
- VERSION 17.2.6
+ VERSION 17.2.7
LANGUAGES CXX C ASM)
cmake_policy(SET CMP0028 NEW)
diff --git a/ceph/PendingReleaseNotes b/ceph/PendingReleaseNotes
index be4de4ead..310c8561d 100644
--- a/ceph/PendingReleaseNotes
+++ b/ceph/PendingReleaseNotes
@@ -1,3 +1,49 @@
+>=17.2.7
+--------
+
+* `ceph mgr dump` command now displays the name of the mgr module that
+ registered a RADOS client in the `name` field added to elements of the
+ `active_clients` array. Previously, only the address of a module's RADOS
+ client was shown in the `active_clients` array.
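+  For example, the registered module names can be listed with a command such
+  as the following (a sketch that assumes the `jq` utility is available):
+
+      ceph mgr dump | jq '.active_clients[].name'
+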
+* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
+ undergone significant usability and design improvements to address the slow
+ backfill issue. Some important changes are:
+ * The 'balanced' profile is set as the default mClock profile because it
+    represents a compromise between prioritizing client IO and recovery IO. Users
+ can then choose either the 'high_client_ops' profile to prioritize client IO
+ or the 'high_recovery_ops' profile to prioritize recovery IO.
+ * QoS parameters like reservation and limit are now specified in terms of a
+ fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
+ * The cost parameters (osd_mclock_cost_per_io_usec_* and
+ osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
+ is now determined using the random IOPS and maximum sequential bandwidth
+ capability of the OSD's underlying device.
+ * Degraded object recovery is given higher priority when compared to misplaced
+ object recovery because degraded objects present a data safety issue not
+ present with objects that are merely misplaced. Therefore, backfilling
+ operations with the 'balanced' and 'high_client_ops' mClock profiles may
+ progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
+ scheduler.
+ * The QoS allocations in all the mClock profiles are optimized based on the above
+ fixes and enhancements.
+ * For more detailed information see:
+ https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
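+  For illustration, a different profile can be selected with a command of the
+  following form (a sketch; it can also be applied to individual OSDs rather
+  than all OSDs):
+
+      ceph config set osd osd_mclock_profile high_client_ops
+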
+* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
+ multi-site. Previously, the replicas of such objects were corrupted on decryption.
+ A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
+ identify these original multipart uploads. The ``LastModified`` timestamp of any
+ identified object is incremented by 1ns to cause peer zones to replicate it again.
+ For multi-site deployments that make any use of Server-Side Encryption, we
+  recommend running this command against every bucket in every zone after all
+ zones have upgraded.
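+  A sketch of the per-bucket invocation (the bucket name is a placeholder):
+
+      radosgw-admin bucket resync encrypted multipart --bucket=<bucket-name>
+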
+* CEPHFS: The MDS now evicts clients that are not advancing their request tids,
+  because such clients cause a large buildup of session metadata, which in turn
+  results in the MDS going read-only when the corresponding RADOS operation
+  exceeds the size threshold. The `mds_session_metadata_threshold` config option
+  controls the maximum size to which the (encoded) session metadata can grow.
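+  A sketch of adjusting the threshold (the value shown is illustrative only):
+
+      ceph config set mds mds_session_metadata_threshold 16777216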
+
+* CEPHFS: After a Ceph File System has been recovered by following the disaster
+  recovery procedure, the recovered files under the `lost+found` directory can
+  now be deleted.
+
>=17.2.6
--------
@@ -7,6 +53,60 @@
>=17.2.5
--------
+>=19.0.0
+--------
+
+* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
+ multi-site. Previously, the replicas of such objects were corrupted on decryption.
+ A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
+ identify these original multipart uploads. The ``LastModified`` timestamp of any
+ identified object is incremented by 1ns to cause peer zones to replicate it again.
+ For multi-site deployments that make any use of Server-Side Encryption, we
+  recommend running this command against every bucket in every zone after all
+ zones have upgraded.
+* CEPHFS: The MDS now evicts clients that are not advancing their request tids,
+  because such clients cause a large buildup of session metadata, which in turn
+  results in the MDS going read-only when the corresponding RADOS operation
+  exceeds the size threshold. The `mds_session_metadata_threshold` config option
+  controls the maximum size to which the (encoded) session metadata can grow.
+* CephFS: For clusters with multiple CephFS file systems, all the snap-schedule
+ commands now expect the '--fs' argument.
+* CephFS: The period specifier ``m`` now implies minutes and the period specifier
+ ``M`` now implies months. This has been made consistent with the rest
+ of the system.
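+  For example, a hypothetical schedule that passes the '--fs' argument and uses
+  a one-month period (the path and file system name are placeholders):
+
+      ceph fs snap-schedule add /some/path 1M --fs cephfs
+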
+* RGW: New tools have been added to radosgw-admin for identifying and
+ correcting issues with versioned bucket indexes. Historical bugs with the
+ versioned bucket index transaction workflow made it possible for the index
+ to accumulate extraneous "book-keeping" olh entries and plain placeholder
+ entries. In some specific scenarios where clients made concurrent requests
+ referencing the same object key, it was likely that a lot of extra index
+ entries would accumulate. When a significant number of these entries are
+ present in a single bucket index shard, they can cause high bucket listing
+ latencies and lifecycle processing failures. To check whether a versioned
+ bucket has unnecessary olh entries, users can now run ``radosgw-admin
+ bucket check olh``. If the ``--fix`` flag is used, the extra entries will
+  be safely removed. Separate from the issue described above, it is also
+  possible that some versioned buckets are maintaining extra unlinked objects
+  that are not listable through the S3/Swift APIs. These extra objects
+ are typically a result of PUT requests that exited abnormally, in the middle
+ of a bucket index transaction - so the client would not have received a
+ successful response. Bugs in prior releases made these unlinked objects easy
+ to reproduce with any PUT request that was made on a bucket that was actively
+ resharding. Besides the extra space that these hidden, unlinked objects
+ consume, there can be another side effect in certain scenarios, caused by
+ the nature of the failure mode that produced them, where a client of a bucket
+ that was a victim of this bug may find the object associated with the key to
+ be in an inconsistent state. To check whether a versioned bucket has unlinked
+ entries, users can now run ``radosgw-admin bucket check unlinked``. If the
+ ``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
+ a third issue made it possible for versioned bucket index stats to be
+ accounted inaccurately. The tooling for recalculating versioned bucket stats
+ also had a bug, and was not previously capable of fixing these inaccuracies.
+ This release resolves those issues and users can now expect that the existing
+ ``radosgw-admin bucket check`` command will produce correct results. We
+ recommend that users with versioned buckets, especially those that existed
+ on prior releases, use these new tools to check whether their buckets are
+ affected and to clean them up accordingly.
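+  A sketch of the per-bucket workflow (the bucket name is a placeholder; run
+  the commands without ``--fix`` first to see whether the bucket is affected):
+
+      radosgw-admin bucket check olh --bucket=<bucket-name>
+      radosgw-admin bucket check olh --bucket=<bucket-name> --fix
+      radosgw-admin bucket check unlinked --bucket=<bucket-name>
+      radosgw-admin bucket check unlinked --bucket=<bucket-name> --fix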
+
+>=18.0.0
+--------
* RBD: The semantics of compare-and-write C++ API (`Image::compare_and_write`
and `Image::aio_compare_and_write` methods) now match those of C API. Both
@@ -47,6 +147,100 @@
If that is the case, in OSD logs the "You can be hit by THE DUPS BUG" warning
will be visible.
Relevant tracker: https://tracker.ceph.com/issues/53729
+* RBD: The `rbd device unmap` command gained a `--namespace` option. Support
+  for namespaces was added to RBD in Nautilus 14.2.0, and since then it has
+  been possible to map and unmap images in namespaces using the `image-spec`
+  syntax, but `rbd device unmap` lacked the corresponding option that most
+  other commands already provide.
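+  A sketch, assuming the option is spelled as in other rbd commands (the pool,
+  namespace, and image names are placeholders):
+
+      rbd device unmap --pool mypool --namespace myns myimage
+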
+* RGW: Compression is now supported for objects uploaded with Server-Side Encryption.
+ When both are enabled, compression is applied before encryption.
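+  Compression is configured per placement target, for example (a sketch; the
+  zone and placement names are placeholders):
+
+      radosgw-admin zone placement modify --rgw-zone=default \
+          --placement-id=default-placement --compression=zlib
+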
+* RGW: the "pubsub" functionality for storing bucket notifications inside Ceph
+ is removed. Together with it, the "pubsub" zone should not be used anymore.
+ The REST operations, as well as radosgw-admin commands for manipulating
+ subscriptions, as well as fetching and acking the notifications are removed
+ as well.
+ In case that the endpoint to which the notifications are sent maybe down or
+ disconnected, it is recommended to use persistent notifications to guarantee
+ the delivery of the notifications. In case the system that consumes the
+ notifications needs to pull them (instead of the notifications be pushed
+ to it), an external message bus (e.g. rabbitmq, Kafka) should be used for
+ that purpose.
+* RGW: The serialized format of notification and topics has changed, so that
+ new/updated topics will be unreadable by old RGWs. We recommend completing
+ the RGW upgrades before creating or modifying any notification topics.
+* RBD: Trailing newline in passphrase files (`<passphrase-file>` argument in
+ `rbd encryption format` command and `--encryption-passphrase-file` option
+ in other commands) is no longer stripped.
+* RBD: Support for layered client-side encryption is added. Cloned images
+ can now be encrypted each with its own encryption format and passphrase,
+ potentially different from that of the parent image. The efficient
+ copy-on-write semantics intrinsic to unformatted (regular) cloned images
+ are retained.
+* CEPHFS: Rename the `mds_max_retries_on_remount_failure` option to
+ `client_max_retries_on_remount_failure` and move it from mds.yaml.in to
+  mds-client.yaml.in, because this option has only ever been used by the MDS
+  client.
+* The `perf dump` and `perf schema` commands are deprecated in favor of new
+ `counter dump` and `counter schema` commands. These new commands add support
+ for labeled perf counters and also emit existing unlabeled perf counters. Some
+ unlabeled perf counters became labeled in this release, with more to follow in
+ future releases; such converted perf counters are no longer emitted by the
+ `perf dump` and `perf schema` commands.
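+  For example, the labeled counters of a single OSD can be inspected with a
+  command such as the following (a sketch; the daemon name is a placeholder):
+
+      ceph daemon osd.0 counter dump
+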
+* `ceph mgr dump` command now outputs `last_failure_osd_epoch` and
+ `active_clients` fields at the top level. Previously, these fields were
+ output under `always_on_modules` field.
+* `ceph mgr dump` command now displays the name of the mgr module that
+ registered a RADOS client in the `name` field added to elements of the
+ `active_clients` array. Previously, only the address of a module's RADOS
+ client was shown in the `active_clients` array.
+* RBD: All rbd-mirror daemon perf counters became labeled and as such are now
+ emitted only by the new `counter dump` and `counter schema` commands. As part
+ of the conversion, many also got renamed to better disambiguate journal-based
+ and snapshot-based mirroring.
+* RBD: list-watchers C++ API (`Image::list_watchers`) now clears the passed
+ `std::list` before potentially appending to it, aligning with the semantics
+ of the corresponding C API (`rbd_watchers_list`).
+* Telemetry: Users who are opted-in to telemetry can also opt-in to
+ participating in a leaderboard in the telemetry public
+ dashboards (https://telemetry-public.ceph.com/). Users can now also add a
+ description of the cluster to publicly appear in the leaderboard.
+ For more details, see:
+ https://docs.ceph.com/en/latest/mgr/telemetry/#leaderboard
+ See a sample report with `ceph telemetry preview`.
+ Opt-in to telemetry with `ceph telemetry on`.
+ Opt-in to the leaderboard with
+ `ceph config set mgr mgr/telemetry/leaderboard true`.
+ Add leaderboard description with:
+  `ceph config set mgr mgr/telemetry/leaderboard_description 'Cluster description'`.
+* CEPHFS: After a Ceph File System has been recovered by following the disaster
+  recovery procedure, the recovered files under the `lost+found` directory can
+  now be deleted.
+* core: cache-tiering is now deprecated.
+* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
+ undergone significant usability and design improvements to address the slow
+ backfill issue. Some important changes are:
+ * The 'balanced' profile is set as the default mClock profile because it
+    represents a compromise between prioritizing client IO and recovery IO. Users
+ can then choose either the 'high_client_ops' profile to prioritize client IO
+ or the 'high_recovery_ops' profile to prioritize recovery IO.
+ * QoS parameters like reservation and limit are now specified in terms of a
+ fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
+ * The cost parameters (osd_mclock_cost_per_io_usec_* and
+ osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
+ is now determined using the random IOPS and maximum sequential bandwidth
+ capability of the OSD's underlying device.
+ * Degraded object recovery is given higher priority when compared to misplaced
+ object recovery because degraded objects present a data safety issue not
+ present with objects that are merely misplaced. Therefore, backfilling
+ operations with the 'balanced' and 'high_client_ops' mClock profiles may
+ progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
+ scheduler.
+ * The QoS allocations in all the mClock profiles are optimized based on the above
+ fixes and enhancements.
+ * For more detailed information see:
+ https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
+* mgr/snap_schedule: The snap-schedule mgr module now retains one snapshot
+  fewer than the number specified by the `mds_max_snaps_per_dir` config
+  tunable, so that a new snapshot can be created and retained during the next
+  scheduled run.
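+  A sketch of checking the current value of that tunable:
+
+      ceph config get mds mds_max_snaps_per_dir
+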
>=17.2.1
diff --git a/ceph/README.md b/ceph/README.md
index c4c691850..86f098fb7 100644
--- a/ceph/README.md
+++ b/ceph/README.md
@@ -1,81 +1,107 @@
# Ceph - a scalable distributed storage system
-Please see http://ceph.com/ for current info.
+See https://ceph.com/ for current information about Ceph.
## Contributing Code
-Most of Ceph is dual licensed under the LGPL version 2.1 or 3.0. Some
-miscellaneous code is under a BSD-style license or is public domain.
-The documentation is licensed under Creative Commons
-Attribution Share Alike 3.0 (CC-BY-SA-3.0). There are a handful of headers
-included here that are licensed under the GPL. Please see the file
-COPYING for a full inventory of licenses by file.
+Most of Ceph is dual-licensed under the LGPL version 2.1 or 3.0. Some
+miscellaneous code is either public domain or licensed under a BSD-style
+license.
-Code contributions must include a valid "Signed-off-by" acknowledging
-the license for the modified or contributed file. Please see the file
-SubmittingPatches.rst for details on what that means and on how to
-generate and submit patches.
+The Ceph documentation is licensed under Creative Commons Attribution Share
+Alike 3.0 (CC-BY-SA-3.0).
-We do not require assignment of copyright to contribute code; code is
+Some headers included in the `ceph/ceph` repository are licensed under the GPL.
+See the file `COPYING` for a full inventory of licenses by file.
+
+All code contributions must include a valid "Signed-off-by" line. See the file
+`SubmittingPatches.rst` for details on this and instructions on how to generate
+and submit patches.
+
+Assignment of copyright is not required to contribute code. Code is
contributed under the terms of the applicable license.
## Checking out the source
-You can clone from github with
+Clone the ceph/ceph repository from github by running the following command on
+a system that has git installed:
git clone git@github.com:ceph/ceph
-or, if you are not a github user,
+Alternatively, if you are not a github user, you should run the following
+command on a system that has git installed:
git clone git://github.com/ceph/ceph
-Ceph contains many git submodules that need to be checked out with
+When the `ceph/ceph` repository has been cloned to your system, run the
+following commands to move into the cloned `ceph/ceph` repository and to check
+out the git submodules associated with it:
+ cd ceph
git submodule update --init --recursive
## Build Prerequisites
-The list of Debian or RPM packages dependencies can be installed with:
+*section last updated 27 Jul 2023*
+
+Make sure that ``curl`` is installed. The Debian and Ubuntu ``apt`` command is
+shown here; if your system uses a different package manager, use the
+equivalent command for that package manager:
+
+ apt install curl
+
+Install Debian or RPM package dependencies by running the following command:
./install-deps.sh
+Install the ``python3-routes`` package:
+
+ apt install python3-routes
+
## Building Ceph
-Note that these instructions are meant for developers who are
-compiling the code for development and testing. To build binaries
-suitable for installation we recommend you build deb or rpm packages
-or refer to the `ceph.spec.in` or `debian/rules` to see which
-configuration options are specified for production builds.
+These instructions are meant for developers who are compiling the code for
+development and testing. To build binaries that are suitable for installation
+we recommend that you build `.deb` or `.rpm` packages, or refer to
+``ceph.spec.in`` or ``debian/rules`` to see which configuration options are
+specified for production builds.
-Build instructions:
+To build Ceph, make sure that you are in the top-level `ceph` directory that
+contains `do_cmake.sh` and `CONTRIBUTING.rst` and run the following commands:
./do_cmake.sh
cd build
ninja
-(do_cmake.sh now defaults to creating a debug build of ceph that can
-be up to 5x slower with some workloads. Please pass
-"-DCMAKE_BUILD_TYPE=RelWithDebInfo" to do_cmake.sh to create a non-debug
-release.
+``do_cmake.sh`` by default creates a "debug build" of Ceph, which can be up to
+five times slower than a non-debug build. Pass
+``-DCMAKE_BUILD_TYPE=RelWithDebInfo`` to ``do_cmake.sh`` to create a non-debug
+build.
-The number of jobs used by `ninja` is derived from the number of CPU cores of
-the building host if unspecified. Use the `-j` option to limit the job number
-if the build jobs are running out of memory. On average, each job takes around
-2.5GiB memory.)
+[Ninja](https://ninja-build.org/) is the build system used by the Ceph project
+for its development and test builds. The number of jobs used by `ninja` is
+derived from the
+number of CPU cores of the building host if unspecified. Use the `-j` option to
+limit the job number if the build jobs are running out of memory. If you
+attempt to run `ninja` and receive a message that reads `g++: fatal error:
+Killed signal terminated program cc1plus`, then you have run out of memory.
+Using the `-j` option with an argument appropriate to the hardware on which the
+`ninja` command is run is expected to result in a successful build. For example,
+to limit the job number to 3, run the command `ninja -j 3`. On average, each
+`ninja` job run in parallel needs approximately 2.5 GiB of RAM.
-This assumes you make your build dir a subdirectory of the ceph.git
-checkout. If you put it elsewhere, just point `CEPH_GIT_DIR` to the correct
-path to the checkout. Any additional CMake args can be specified by setting ARGS
-before invoking do_cmake. See [cmake options](#cmake-options)
-for more details. Eg.
+This documentation assumes that your build directory is a subdirectory of the
+`ceph.git` checkout. If the build directory is located elsewhere, point
+`CEPH_GIT_DIR` to the correct path of the checkout. Additional CMake args can
+be specified by setting ARGS before invoking ``do_cmake.sh``. See [cmake
+options](#cmake-options) for more details. For example:
ARGS="-DCMAKE_C_COMPILER=gcc-7" ./do_cmake.sh
-To build only certain targets use:
+To build only certain targets, run a command of the following form:
ninja [target name]
@@ -130,24 +156,25 @@ are committed to git.)
## Running a test cluster
-To run a functional test cluster,
+From the `ceph/` directory, run the following commands to launch a test Ceph
+cluster:
cd build
ninja vstart # builds just enough to run vstart
../src/vstart.sh --debug --new -x --localhost --bluestore
./bin/ceph -s
-Almost all of the usual commands are available in the bin/ directory.
-For example,
+Most Ceph commands are available in the `bin/` directory. For example:
- ./bin/rados -p rbd bench 30 write
./bin/rbd create foo --size 1000
+ ./bin/rados -p foo bench 30 write
-To shut down the test cluster,
+To shut down the test cluster, run the following command from the `build/`
+directory:
../src/stop.sh
-To start or stop individual daemons, the sysvinit script can be used:
+Use the sysvinit script to start or stop individual daemons:
./bin/init-ceph restart osd.0
./bin/init-ceph stop
diff --git a/ceph/ceph.spec b/ceph/ceph.spec
index afe6fe92f..accc10eb9 100644
--- a/ceph/ceph.spec
+++ b/ceph/ceph.spec
@@ -166,7 +166,7 @@
# main package definition
#################################################################################
Name: ceph
-Version: 17.2.6
+Version: 17.2.7
Release: 0%{?dist}
%if 0%{?fedora} || 0%{?rhel}
Epoch: 2
@@ -182,7 +182,7 @@ License: LGPL-2.1 and LGPL-3.0 and CC-BY-SA-3.0 and GPL-2.0 and BSL-1.0 and BSD-
Group: System/Filesystems
%endif
URL: http://ceph.com/
-Source0: %{?_remote_tarball_prefix}ceph-17.2.6.tar.bz2
+Source0: %{?_remote_tarball_prefix}ceph-17.2.7.tar.bz2
%if 0%{?suse_version}
# _insert_obs_source_lines_here
ExclusiveArch: x86_64 aarch64 ppc64le s390x
@@ -1274,7 +1274,7 @@ This package provides Ceph default alerts for Prometheus.
# common
#################################################################################
%prep
-%autosetup -p1 -n ceph-17.2.6
+%autosetup -p1 -n ceph-17.2.7
%build
# Disable lto on systems that do not support symver attribute
@@ -1863,6 +1863,7 @@ fi
%{_datadir}/ceph/mgr/prometheus
%{_datadir}/ceph/mgr/rbd_support
%{_datadir}/ceph/mgr/restful
+%{_datadir}/ceph/mgr/rgw
%{_datadir}/ceph/mgr/selftest
%{_datadir}/ceph/mgr/snap_schedule
%{_datadir}/ceph/mgr/stats
diff --git a/ceph/ceph.spec.in b/ceph/ceph.spec.in
index e1575c066..e4cbbef69 100644
--- a/ceph/ceph.spec.in
+++ b/ceph/ceph.spec.in
@@ -1863,6 +1863,7 @@ fi
%{_datadir}/ceph/mgr/prometheus
%{_datadir}/ceph/mgr/rbd_support
%{_datadir}/ceph/mgr/restful
+%{_datadir}/ceph/mgr/rgw
%{_datadir}/ceph/mgr/selftest
%{_datadir}/ceph/mgr/snap_schedule
%{_datadir}/ceph/mgr/stats
diff --git a/ceph/changelog.upstream b/ceph/changelog.upstream
index d4a23a078..6a3119fc4 100644
--- a/ceph/changelog.upstream
+++ b/ceph/changelog.upstream
@@ -1,3 +1,9 @@
+ceph (17.2.7-1) stable; urgency=medium
+
+ * New upstream release
+
+ -- Ceph Release Team Wed, 25 Oct 2023 23:46:13 +0000
+
ceph (17.2.6-1) stable; urgency=medium
* New upstream release
diff --git a/ceph/debian/cephfs-mirror.install b/ceph/debian/cephfs-mirror.install
index 19d2e483d..ca0807e27 100644
--- a/ceph/debian/cephfs-mirror.install
+++ b/ceph/debian/cephfs-mirror.install
@@ -1 +1,3 @@
+lib/systemd/system/cephfs-mirror*
usr/bin/cephfs-mirror
+usr/share/man/man8/cephfs-mirror.8
diff --git a/ceph/doc/architecture.rst b/ceph/doc/architecture.rst
index 46be74603..1c5af8d38 100644
--- a/ceph/doc/architecture.rst
+++ b/ceph/doc/architecture.rst
@@ -30,60 +30,52 @@ A Ceph Storage Cluster consists of multiple types of daemons:
- :term:`Ceph Manager`
- :term:`Ceph Metadata Server`
-.. ditaa::
-
- +---------------+ +---------------+ +---------------+ +---------------+
- | OSDs | | Monitors | | Managers | | MDS |
- +---------------+ +---------------+ +---------------+ +---------------+
-
-A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
-monitors ensures high availability should a monitor daemon fail. Storage cluster
-clients retrieve a copy of the cluster map from the Ceph Monitor.
+Ceph Monitors maintain the master copy of the cluster map, which they provide
+to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures
+availability in the event that one of the monitor daemons or its host fails.
+The Ceph monitor provides copies of the cluster map to storage cluster clients.
A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
back to monitors.
-A Ceph Manager acts as an endpoint for monitoring, orchestration, and plug-in
+A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in
modules.
A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
provide file services.
-Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
-to efficiently compute information about data location, instead of having to
-depend on a central lookup table. Ceph's high-level features include a
-native interface to the Ceph Storage Cluster via ``librados``, and a number of
-service interfaces built on top of ``librados``.
-
-
+Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm
+to compute information about data location. This means that clients and OSDs
+are not bottlenecked by a central lookup table. Ceph's high-level features
+include a native interface to the Ceph Storage Cluster via ``librados``, and a
+number of service interfaces built on top of ``librados``.
Storing Data
------------
The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
-:term:`Ceph File System` or a custom implementation you create using
-``librados``-- which is stored as RADOS objects. Each object is stored on an
-:term:`Object Storage Device`. Ceph OSD Daemons handle read, write, and
-replication operations on storage drives. With the older Filestore back end,
-each RADOS object was stored as a separate file on a conventional filesystem
-(usually XFS). With the new and default BlueStore back end, objects are
-stored in a monolithic database-like fashion.
+:term:`Ceph File System`, or a custom implementation that you create by using
+``librados``. The data received by the Ceph Storage Cluster is stored as RADOS
+objects. Each object is stored on an :term:`Object Storage Device` (this is
+also called an "OSD"). Ceph OSDs control read, write, and replication
+operations on storage drives. The default BlueStore back end stores objects
+in a monolithic, database-like fashion.
.. ditaa::
- /-----\ +-----+ +-----+
- | obj |------>| {d} |------>| {s} |
- \-----/ +-----+ +-----+
+ /------\ +-----+ +-----+
+ | obj |------>| {d} |------>| {s} |
+ \------/ +-----+ +-----+
Object OSD Drive
-Ceph OSD Daemons store data as objects in a flat namespace (e.g., no
-hierarchy of directories). An object has an identifier, binary data, and
-metadata consisting of a set of name/value pairs. The semantics are completely
-up to :term:`Ceph Client`\s. For example, CephFS uses metadata to store file
-attributes such as the file owner, created date, last modified date, and so
-forth.
+Ceph OSD Daemons store data as objects in a flat namespace. This means that
+objects are not stored in a hierarchy of directories. An object has an
+identifier, binary data, and metadata consisting of name/value pairs.
+:term:`Ceph Client`\s determine the semantics of the object data. For example,
+CephFS uses metadata to store file attributes such as the file owner, the
+created date, and the last modified date.
.. ditaa::
@@ -102,20 +94,23 @@ forth.
.. index:: architecture; high availability, scalability
+.. _arch_scalability_and_high_availability:
+
Scalability and High Availability
---------------------------------
-In traditional architectures, clients talk to a centralized component (e.g., a
-gateway, broker, API, facade, etc.), which acts as a single point of entry to a
-complex subsystem. This imposes a limit to both performance and scalability,
-while introducing a single point of failure (i.e., if the centralized component
-goes down, the whole system goes down, too).
+In traditional architectures, clients talk to a centralized component. This
+centralized component might be a gateway, a broker, an API, or a facade. A
+centralized component of this kind acts as a single point of entry to a complex
+subsystem. Architectures that rely upon such a centralized component have a
+single point of failure and incur limits to performance and scalability. If
+the centralized component goes down, the whole system becomes unavailable.
-Ceph eliminates the centralized gateway to enable clients to interact with
-Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
-Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster
-of monitors to ensure high availability. To eliminate centralization, Ceph
-uses an algorithm called CRUSH.
+Ceph eliminates this centralized component. This enables clients to interact
+with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes
+to ensure data safety and high availability. Ceph also uses a cluster of
+monitors to ensure high availability. To eliminate centralization, Ceph uses an
+algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`.
.. index:: CRUSH; architecture
@@ -124,15 +119,15 @@ CRUSH Introduction
~~~~~~~~~~~~~~~~~~
Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
-Replication Under Scalable Hashing)` algorithm to efficiently compute
-information about object location, instead of having to depend on a
-central lookup table. CRUSH provides a better data management mechanism compared
-to older approaches, and enables massive scale by cleanly distributing the work
-to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
-replication to ensure resiliency, which is better suited to hyper-scale storage.
-The following sections provide additional details on how CRUSH works. For a
-detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
-Placement of Replicated Data`_.
+Replication Under Scalable Hashing)` algorithm to compute information about
+object location instead of relying upon a central lookup table. CRUSH provides
+a better data management mechanism than do older approaches, and CRUSH enables
+massive scale by distributing the work to all the OSD daemons in the cluster
+and all the clients that communicate with them. CRUSH uses intelligent data
+replication to ensure resiliency, which is better suited to hyper-scale
+storage. The following sections provide additional details on how CRUSH works.
+For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable,
+Decentralized Placement of Replicated Data`_.
.. index:: architecture; cluster map
@@ -141,109 +136,130 @@ Placement of Replicated Data`_.
Cluster Map
~~~~~~~~~~~
-Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
-cluster topology, which is inclusive of 5 maps collectively referred to as the
-"Cluster Map":
+In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs
+must have current information about the cluster's topology. Current information
+is stored in the "Cluster Map", which is in fact a collection of five maps. The
+five maps that constitute the cluster map are:
-#. **The Monitor Map:** Contains the cluster ``fsid``, the position, name
- address and port of each monitor. It also indicates the current epoch,
- when the map was created, and the last time it changed. To view a monitor
- map, execute ``ceph mon dump``.
+#. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name,
+ the address, and the TCP port of each monitor. The monitor map specifies the
+ current epoch, the time of the monitor map's creation, and the time of the
+ monitor map's last modification. To view a monitor map, run ``ceph mon
+ dump``.
-#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
- last modified, a list of pools, replica sizes, PG numbers, a list of OSDs
- and their status (e.g., ``up``, ``in``). To view an OSD map, execute
- ``ceph osd dump``.
+#. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's
+ creation, the time of the OSD map's last modification, a list of pools, a
+ list of replica sizes, a list of PG numbers, and a list of OSDs and their
+ statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph
+ osd dump``.
-#. **The PG Map:** Contains the PG version, its time stamp, the last OSD
- map epoch, the full ratios, and details on each placement group such as
- the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g.,
- ``active + clean``), and data usage statistics for each pool.
+#. **The PG Map:** Contains the PG version, its time stamp, the last OSD map
+ epoch, the full ratios, and the details of each placement group. This
+ includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for
+ example, ``active + clean``), and data usage statistics for each pool.
#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
- hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
- traversing the hierarchy when storing data. To view a CRUSH map, execute
- ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
- ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
- You can view the decompiled map in a text editor or with ``cat``.
+ hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``),
+ and rules for traversing the hierarchy when storing data. To view a CRUSH
+ map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by
+ running ``crushtool -d {comp-crushmap-filename} -o
+ {decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the
+ decompiled map.
#. **The MDS Map:** Contains the current MDS map epoch, when the map was
created, and the last time it changed. It also contains the pool for
storing metadata, a list of metadata servers, and which metadata servers
are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.
-Each map maintains an iterative history of its operating state changes. Ceph
-Monitors maintain a master copy of the cluster map including the cluster
-members, state, changes, and the overall health of the Ceph Storage Cluster.
+Each map maintains a history of changes to its operating state. Ceph Monitors
+maintain a master copy of the cluster map. This master copy includes the
+cluster members, the state of the cluster, changes to the cluster, and
+information recording the overall health of the Ceph Storage Cluster.
.. index:: high availability; monitor architecture
High Availability Monitors
~~~~~~~~~~~~~~~~~~~~~~~~~~
-Before Ceph Clients can read or write data, they must contact a Ceph Monitor
-to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
-can operate with a single monitor; however, this introduces a single
-point of failure (i.e., if the monitor goes down, Ceph Clients cannot
-read or write data).
+A Ceph Client must contact a Ceph Monitor and obtain a current copy of the
+cluster map in order to read data from or to write data to the Ceph cluster.
-For added reliability and fault tolerance, Ceph supports a cluster of monitors.
-In a cluster of monitors, latency and other faults can cause one or more
-monitors to fall behind the current state of the cluster. For this reason, Ceph
-must have agreement among various monitor instances regarding the state of the
-cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.)
-and the `Paxos`_ algorithm to establish a consensus among the monitors about the
-current state of the cluster.
+It is possible for a Ceph cluster to function properly with only a single
+monitor, but a Ceph cluster that has only a single monitor has a single point
+of failure: if the monitor goes down, Ceph clients will be unable to read data
+from or write data to the cluster.
-For details on configuring monitors, see the `Monitor Config Reference`_.
+Ceph leverages a cluster of monitors in order to increase reliability and fault
+tolerance. When a cluster of monitors is used, however, one or more of the
+monitors in the cluster can fall behind due to latency or other faults. Ceph
+mitigates these negative effects by requiring multiple monitor instances to
+agree about the state of the cluster. To establish consensus among the monitors
+regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a
+majority of monitors (for example, one in a cluster that contains only one
+monitor, two in a cluster that contains three monitors, three in a cluster that
+contains five monitors, four in a cluster that contains six monitors, and so
+on).
+
+See the `Monitor Config Reference`_ for more detail on configuring monitors.
.. index:: architecture; high availability authentication
+.. _arch_high_availability_authentication:
+
High Availability Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-To identify users and protect against man-in-the-middle attacks, Ceph provides
-its ``cephx`` authentication system to authenticate users and daemons.
+The ``cephx`` authentication system is used by Ceph to authenticate users and
+daemons and to protect against man-in-the-middle attacks.
.. note:: The ``cephx`` protocol does not address data encryption in transport
- (e.g., SSL/TLS) or encryption at rest.
+ (for example, SSL/TLS) or encryption at rest.
-Cephx uses shared secret keys for authentication, meaning both the client and
-the monitor cluster have a copy of the client's secret key. The authentication
-protocol is such that both parties are able to prove to each other they have a
-copy of the key without actually revealing it. This provides mutual
-authentication, which means the cluster is sure the user possesses the secret
-key, and the user is sure that the cluster has a copy of the secret key.
+``cephx`` uses shared secret keys for authentication. This means that both the
+client and the monitor cluster keep a copy of the client's secret key.
-A key scalability feature of Ceph is to avoid a centralized interface to the
-Ceph object store, which means that Ceph clients must be able to interact with
-OSDs directly. To protect data, Ceph provides its ``cephx`` authentication
-system, which authenticates users operating Ceph clients. The ``cephx`` protocol
-operates in a manner with behavior similar to `Kerberos`_.
+The ``cephx`` protocol makes it possible for each party to prove to the other
+that it has a copy of the key without revealing it. This provides mutual
+authentication and allows the cluster to confirm (1) that the user has the
+secret key and (2) that the user can be confident that the cluster has a copy
+of the secret key.
-A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each
-monitor can authenticate users and distribute keys, so there is no single point
-of failure or bottleneck when using ``cephx``. The monitor returns an
-authentication data structure similar to a Kerberos ticket that contains a
-session key for use in obtaining Ceph services. This session key is itself
-encrypted with the user's permanent secret key, so that only the user can
-request services from the Ceph Monitor(s). The client then uses the session key
-to request its desired services from the monitor, and the monitor provides the
-client with a ticket that will authenticate the client to the OSDs that actually
-handle data. Ceph Monitors and OSDs share a secret, so the client can use the
-ticket provided by the monitor with any OSD or metadata server in the cluster.
-Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired
-ticket or session key obtained surreptitiously. This form of authentication will
-prevent attackers with access to the communications medium from either creating
-bogus messages under another user's identity or altering another user's
-legitimate messages, as long as the user's secret key is not divulged before it
-expires.
+As stated in :ref:`Scalability and High Availability
+<arch_scalability_and_high_availability>`, Ceph does not have any centralized
+interface between clients and the Ceph object store. By avoiding such a
+centralized interface, Ceph avoids the bottlenecks that attend such centralized
+interfaces. However, this means that clients must interact directly with OSDs.
+Direct interactions between Ceph clients and OSDs require authenticated
+connections. The ``cephx`` authentication system establishes and sustains these
+authenticated connections.
-To use ``cephx``, an administrator must set up users first. In the following
-diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
+The ``cephx`` protocol operates in a manner similar to `Kerberos`_.
+
+A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each
+monitor can authenticate users and distribute keys, which means that there is
+no single point of failure and no bottleneck when using ``cephx``. The monitor
+returns an authentication data structure that is similar to a Kerberos ticket.
+This authentication data structure contains a session key for use in obtaining
+Ceph services. The session key is itself encrypted with the user's permanent
+secret key, which means that only the user can request services from the Ceph
+Monitors. The client then uses the session key to request services from the
+monitors, and the monitors provide the client with a ticket that authenticates
+the client against the OSDs that actually handle data. Ceph Monitors and OSDs
+share a secret, which means that the clients can use the ticket provided by the
+monitors to authenticate against any OSD or metadata server in the cluster.
+
+Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an
+expired ticket or session key that has been obtained surreptitiously. This form
+of authentication prevents attackers who have access to the communications
+medium from creating bogus messages under another user's identity and prevents
+attackers from altering another user's legitimate messages, as long as the
+user's secret key is not divulged before it expires.
+
+An administrator must set up users before using ``cephx``. In the following
+diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
the command line to generate a username and secret key. Ceph's ``auth``
-subsystem generates the username and key, stores a copy with the monitor(s) and
-transmits the user's secret back to the ``client.admin`` user. This means that
+subsystem generates the username and key, stores a copy on the monitor(s), and
+transmits the user's secret back to the ``client.admin`` user. This means that
the client and the monitor share a secret key.
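+
+For example, a hypothetical invocation (the user name, capabilities, and pool
+are placeholders)::
+
+    ceph auth get-or-create-key client.alice mon 'allow r' osd 'allow rw pool=data'
+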
.. note:: The ``client.admin`` user must provide the user ID and
@@ -262,17 +278,16 @@ the client and the monitor share a secret key.
| transmit key |
| |
-
-To authenticate with the monitor, the client passes in the user name to the
-monitor, and the monitor generates a session key and encrypts it with the secret
-key associated to the user name. Then, the monitor transmits the encrypted
-ticket back to the client. The client then decrypts the payload with the shared
-secret key to retrieve the session key. The session key identifies the user for
-the current session. The client then requests a ticket on behalf of the user
-signed by the session key. The monitor generates a ticket, encrypts it with the
-user's secret key and transmits it back to the client. The client decrypts the
-ticket and uses it to sign requests to OSDs and metadata servers throughout the
-cluster.
+Here is how a client authenticates with a monitor. The client passes the user
+name to the monitor. The monitor generates a session key that is encrypted with
+the secret key associated with the ``username``. The monitor transmits the
+encrypted ticket to the client. The client uses the shared secret key to
+decrypt the payload. The session key identifies the user, and this act of
+identification will last for the duration of the session. The client requests
+a ticket for the user, and the ticket is signed with the session key. The
+monitor generates a ticket and uses the user's secret key to encrypt it. The
+encrypted ticket is transmitted to the client. The client decrypts the ticket
+and uses it to sign requests to OSDs and to metadata servers in the cluster.
.. ditaa::
@@ -302,10 +317,11 @@ cluster.
|<----+ |
-The ``cephx`` protocol authenticates ongoing communications between the client
-machine and the Ceph servers. Each message sent between a client and server,
-subsequent to the initial authentication, is signed using a ticket that the
-monitors, OSDs and metadata servers can verify with their shared secret.
+The ``cephx`` protocol authenticates ongoing communications between the clients
+and Ceph daemons. After initial authentication, each message sent between a
+client and a daemon is signed using a ticket that can be verified by monitors,
+OSDs, and metadata daemons. This ticket is verified by using the secret shared
+between the client and the daemon.
.. ditaa::
@@ -341,83 +357,93 @@ monitors, OSDs and metadata servers can verify with their shared secret.
|<-------------------------------------------|
receive response
-The protection offered by this authentication is between the Ceph client and the
-Ceph server hosts. The authentication is not extended beyond the Ceph client. If
-the user accesses the Ceph client from a remote host, Ceph authentication is not
+This authentication protects only the connections between Ceph clients and Ceph
+daemons. The authentication is not extended beyond the Ceph client. If a user
+accesses the Ceph client from a remote host, cephx authentication will not be
applied to the connection between the user's host and the client host.
+See `Cephx Config Guide`_ for more on configuration details.
-For configuration details, see `Cephx Config Guide`_. For user management
-details, see `User Management`_.
+See `User Management`_ for more on user management.
+See :ref:`A Detailed Description of the Cephx Authentication Protocol
+` for more on the distinction between authorization and
+authentication and for a step-by-step explanation of the setup of ``cephx``
+tickets and session keys.
.. index:: architecture; smart daemons and scalability
Smart Daemons Enable Hyperscale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A feature of many storage clusters is a centralized interface that keeps track
+of the nodes that clients are permitted to access. Such centralized
+architectures provide services to clients by means of a double dispatch. At the
+petabyte-to-exabyte scale, such double dispatches are a significant
+bottleneck.
-In many clustered architectures, the primary purpose of cluster membership is
-so that a centralized interface knows which nodes it can access. Then the
-centralized interface provides services to the client through a double
-dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.
+Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are
+cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph
+OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly
+with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being
+cluster-aware makes it possible for Ceph clients to interact directly with Ceph
+OSD Daemons.
-Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
-aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD
-Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with
-other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients
-to interact directly with Ceph OSD Daemons.
+Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one
+another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM
+resources of the nodes in the Ceph cluster. This means that a Ceph cluster can
+easily perform tasks that a cluster with a centralized interface would struggle
+to perform. The ability of Ceph nodes to make use of the computing power of
+the greater cluster provides several benefits:
-The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
-each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
-nodes to easily perform tasks that would bog down a centralized server. The
-ability to leverage this computing power leads to several major benefits:
+#. **OSDs Service Clients Directly:** Network devices can support only a
+ limited number of concurrent connections. Because Ceph clients contact
+ Ceph OSD daemons directly without first connecting to a central interface,
+   Ceph enjoys improved performance and increased system capacity relative to
+ storage redundancy strategies that include a central interface. Ceph clients
+ maintain sessions only when needed, and maintain those sessions with only
+ particular Ceph OSD daemons, not with a centralized interface.
-#. **OSDs Service Clients Directly:** Since any network device has a limit to
- the number of concurrent connections it can support, a centralized system
- has a low physical limit at high scales. By enabling Ceph Clients to contact
- Ceph OSD Daemons directly, Ceph increases both performance and total system
- capacity simultaneously, while removing a single point of failure. Ceph
- Clients can maintain a session when they need to, and with a particular Ceph
- OSD Daemon instead of a centralized server.
+#. **OSD Membership and Status**: When Ceph OSD Daemons join a cluster, they
+ report their status. At the lowest level, the Ceph OSD Daemon status is
+ ``up`` or ``down``: this reflects whether the Ceph OSD daemon is running and
+ able to service Ceph Client requests. If a Ceph OSD Daemon is ``down`` and
+ ``in`` the Ceph Storage Cluster, this status may indicate the failure of the
+ Ceph OSD Daemon. If a Ceph OSD Daemon is not running because it has crashed,
+ the Ceph OSD Daemon cannot notify the Ceph Monitor that it is ``down``. The
+ OSDs periodically send messages to the Ceph Monitor (in releases prior to
+ Luminous, this was done by means of ``MPGStats``, and beginning with the
+ Luminous release, this has been done with ``MOSDBeacon``). If the Ceph
+ Monitors receive no such message after a configurable period of time,
+ then they mark the OSD ``down``. This mechanism is a failsafe, however.
+ Normally, Ceph OSD Daemons determine if a neighboring OSD is ``down`` and
+ report it to the Ceph Monitors. This contributes to making Ceph Monitors
+ lightweight processes. See `Monitoring OSDs`_ and `Heartbeats`_ for
+ additional details.
-#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
- on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
- or ``down`` reflecting whether or not it is running and able to service
- Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
- Storage Cluster, this status may indicate the failure of the Ceph OSD
- Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
- Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs
- periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous,
- and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that
- message after a configurable period of time then it marks the OSD down.
- This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
- determine if a neighboring OSD is down and report it to the Ceph Monitor(s).
- This assures that Ceph Monitors are lightweight processes. See `Monitoring
- OSDs`_ and `Heartbeats`_ for additional details.
+#. **Data Scrubbing:** To maintain data consistency, Ceph OSD Daemons scrub
+ RADOS objects. Ceph OSD Daemons compare the metadata of their own local
+ objects against the metadata of the replicas of those objects, which are
+ stored on other OSDs. Scrubbing occurs on a per-Placement-Group basis, finds
+ mismatches in object size and finds metadata mismatches, and is usually
+ performed daily. Ceph OSD Daemons perform deeper scrubbing by comparing the
+ data in objects, bit-for-bit, against their checksums. Deep scrubbing finds
+ bad sectors on drives that are not detectable with light scrubs. See `Data
+ Scrubbing`_ for details on configuring scrubbing.
-#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
- Ceph OSD Daemons can scrub objects. That is, Ceph OSD Daemons can compare
- their local objects metadata with its replicas stored on other OSDs. Scrubbing
- happens on a per-Placement Group base. Scrubbing (usually performed daily)
- catches mismatches in size and other metadata. Ceph OSD Daemons also perform deeper
- scrubbing by comparing data in objects bit-for-bit with their checksums.
- Deep scrubbing (usually performed weekly) finds bad sectors on a drive that
- weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
- configuring scrubbing.
+#. **Replication:** Data replication involves a collaboration between Ceph
+ Clients and Ceph OSD Daemons. Ceph OSD Daemons use the CRUSH algorithm to
+ determine the storage location of object replicas. Ceph clients use the
+ CRUSH algorithm to determine the storage location of an object, then the
+ object is mapped to a pool and to a placement group, and then the client
+ consults the CRUSH map to identify the placement group's primary OSD.
-#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
- algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
- objects should be stored (and for rebalancing). In a typical write scenario,
- a client uses the CRUSH algorithm to compute where to store an object, maps
- the object to a pool and placement group, then looks at the CRUSH map to
- identify the primary OSD for the placement group.
-
- The client writes the object to the identified placement group in the
- primary OSD. Then, the primary OSD with its own copy of the CRUSH map
- identifies the secondary and tertiary OSDs for replication purposes, and
- replicates the object to the appropriate placement groups in the secondary
- and tertiary OSDs (as many OSDs as additional replicas), and responds to the
- client once it has confirmed the object was stored successfully.
+ After identifying the target placement group, the client writes the object
+ to the identified placement group's primary OSD. The primary OSD then
+ consults its own copy of the CRUSH map to identify secondary and tertiary
+ OSDS, replicates the object to the placement groups in those secondary and
+ tertiary OSDs, confirms that the object was stored successfully in the
+ secondary and tertiary OSDs, and reports to the client that the object
+ was stored successfully.
.. ditaa::
@@ -444,19 +470,18 @@ ability to leverage this computing power leads to several major benefits:
| | | |
+---------------+ +---------------+
-With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
-clients from that duty, while ensuring high data availability and data safety.
-
+By performing this act of data replication, Ceph OSD Daemons relieve Ceph
+clients of the burden of replicating data.
Dynamic Cluster Management
--------------------------
In the `Scalability and High Availability`_ section, we explained how Ceph uses
-CRUSH, cluster awareness and intelligent daemons to scale and maintain high
+CRUSH, cluster topology, and intelligent daemons to scale and maintain high
availability. Key to Ceph's design is the autonomous, self-healing, and
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
-enable modern cloud storage infrastructures to place data, rebalance the cluster
-and recover from faults dynamically.
+enable modern cloud storage infrastructures to adaptively place and balance
+data, and to recover from faults.
.. index:: architecture; pools
@@ -465,10 +490,11 @@ About Pools
The Ceph storage system supports the notion of 'Pools', which are logical
partitions for storing objects.
-
-Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
-pools. The pool's ``size`` or number of replicas, the CRUSH rule and the
-number of placement groups determine how Ceph will place the data.
+
+Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write RADOS
+objects to pools. The way that Ceph places the data in the pools is determined
+by the pool's ``size`` or number of replicas, the CRUSH rule, and the number of
+placement groups in the pool.
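+
+As a concrete illustration, the following minimal sketch uses the ``rados``
+Python binding to write one RADOS object to a pool. It assumes a reachable
+cluster, a valid ``ceph.conf`` and keyring, and a pool named "liverpool";
+error handling is omitted.
+
+.. code-block:: python
+
+   import rados
+
+   cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
+   cluster.connect()                        # obtains cluster maps from a monitor
+   ioctx = cluster.open_ioctx("liverpool")  # I/O context bound to one pool
+   ioctx.write_full("john", b"hello")       # store a RADOS object in the pool
+   ioctx.close()
+   cluster.shutdown()
+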
.. ditaa::
@@ -501,20 +527,23 @@ See `Set Pool Values`_ for details.
Mapping PGs to OSDs
~~~~~~~~~~~~~~~~~~~
-Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
-When a Ceph Client stores objects, CRUSH will map each object to a placement
-group.
+Each pool has a number of placement groups (PGs) within it. CRUSH dynamically
+maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS
+object to a PG.
-Mapping objects to placement groups creates a layer of indirection between the
-Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
-grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
-Client "knew" which Ceph OSD Daemon had which object, that would create a tight
-coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
-algorithm maps each object to a placement group and then maps each placement
-group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
-rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
-come online. The following diagram depicts how CRUSH maps objects to placement
-groups, and placement groups to OSDs.
+This mapping of RADOS objects to PGs implements an abstraction and indirection
+layer between Ceph OSD Daemons and Ceph Clients. The Ceph Storage Cluster must
+be able to grow (or shrink) and redistribute data adaptively when the internal
+topology changes.
+
+If the Ceph Client "knew" which Ceph OSD Daemons were storing which objects, a
+tight coupling would exist between the Ceph Client and the Ceph OSD Daemon.
+But Ceph avoids any such tight coupling. Instead, the CRUSH algorithm maps each
+RADOS object to a placement group and then maps each placement group to one or
+more Ceph OSD Daemons. This "layer of indirection" allows Ceph to rebalance
+dynamically when new Ceph OSD Daemons and their underlying OSD devices come
+online. The following diagram shows how the CRUSH algorithm maps objects to
+placement groups, and how it maps placement groups to OSDs.
.. ditaa::
@@ -540,44 +569,45 @@ groups, and placement groups to OSDs.
| | | | | | | |
\----------/ \----------/ \----------/ \----------/
-With a copy of the cluster map and the CRUSH algorithm, the client can compute
-exactly which OSD to use when reading or writing a particular object.
+The client uses its copy of the cluster map and the CRUSH algorithm to compute
+precisely which OSD it will use when reading or writing a particular object.
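+
+The indirection can be pictured with a small toy (not Ceph's real CRUSH; the
+hash function and placements below are made up for illustration): the
+object-to-PG step is a stable hash that clients can always recompute, while
+the PG-to-OSD step is what CRUSH recomputes when OSDs are added or removed.
+
+.. code-block:: python
+
+   import zlib
+
+   def object_to_pg(object_name, pg_num):
+       # Stable: depends only on the object name and the pool's PG count.
+       return zlib.crc32(object_name.encode()) % pg_num
+
+   # Stand-in for CRUSH output before and after a new OSD joins the cluster.
+   pg_to_osds_before = {0: [1, 4, 7], 1: [2, 5, 8]}
+   pg_to_osds_after = {0: [1, 4, 9], 1: [2, 5, 8]}   # PG 0 remapped to osd.9
+
+   pg = object_to_pg("john", pg_num=2)
+   print("before:", pg_to_osds_before[pg], "after:", pg_to_osds_after[pg])
+   # The object-to-PG mapping did not change; only the PG-to-OSD mapping did.
+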
.. index:: architecture; calculating PG IDs
Calculating PG IDs
~~~~~~~~~~~~~~~~~~
-When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
-`Cluster Map`_. With the cluster map, the client knows about all of the monitors,
-OSDs, and metadata servers in the cluster. **However, it doesn't know anything
-about object locations.**
+When a Ceph Client binds to a Ceph Monitor, it retrieves the latest version of
+the `Cluster Map`_. When a client has been equipped with a copy of the cluster
+map, it is aware of all the monitors, OSDs, and metadata servers in the
+cluster. **However, even equipped with a copy of the latest version of the
+cluster map, the client doesn't know anything about object locations.**
-.. epigraph::
+**Object locations must be computed.**
- Object locations get computed.
+The client requires only the object ID and the name of the pool in order to
+compute the object location.
+Ceph stores data in named pools (for example, "liverpool"). When a client
+stores a named object (for example, "john", "paul", "george", or "ringo"), it
+calculates a placement group by using the object name, a hash code, the number
+of PGs in the pool, and the pool name. Ceph clients use the following steps to
+compute PG IDs.
-The only input required by the client is the object ID and the pool.
-It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
-wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.)
-it calculates a placement group using the object name, a hash code, the
-number of PGs in the pool and the pool name. Ceph clients use the following
-steps to compute PG IDs.
+#. The client inputs the pool name and the object ID (for example: pool =
+   "liverpool" and object-id = "john").
+#. Ceph hashes the object ID.
+#. Ceph calculates the hash modulo the number of PGs to get a PG ID (for
+   example: ``58``).
+#. Ceph uses the pool name to retrieve the pool ID (for example: "liverpool" =
+   ``4``).
+#. Ceph prepends the pool ID to the PG ID (for example: ``4.58``).
-#. The client inputs the pool name and the object ID. (e.g., pool = "liverpool"
- and object-id = "john")
-#. Ceph takes the object ID and hashes it.
-#. Ceph calculates the hash modulo the number of PGs. (e.g., ``58``) to get
- a PG ID.
-#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``)
-#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
-
-Computing object locations is much faster than performing object location query
-over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
-Hashing)` algorithm allows a client to compute where objects *should* be stored,
-and enables the client to contact the primary OSD to store or retrieve the
-objects.
+It is much faster to compute object locations than to perform an object location
+query over a chatty session. The :abbr:`CRUSH (Controlled Replication Under
+Scalable Hashing)` algorithm allows a client to compute where objects are
+expected to be stored, and enables the client to contact the primary OSD to
+store or retrieve the objects.
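+
+The five steps above can be sketched in a few lines of Python. This is a
+simplified illustration only: Ceph itself uses the rjenkins hash and a
+"stable mod", so the numbers produced here will not match a real cluster, and
+the pool ID and PG count are hypothetical.
+
+.. code-block:: python
+
+   import zlib
+
+   pool_ids = {"liverpool": 4}                    # step 4: pool name -> pool ID
+
+   def pg_id(pool_name, object_id, pg_num):
+       obj_hash = zlib.crc32(object_id.encode())  # step 2: hash the object ID
+       pg = obj_hash % pg_num                     # step 3: hash modulo pg_num
+       return "{}.{:x}".format(pool_ids[pool_name], pg)  # step 5: prepend pool ID
+
+   # step 1: the client supplies the pool name and the object ID
+   print(pg_id("liverpool", "john", pg_num=128))
+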
.. index:: architecture; PG Peering
@@ -585,46 +615,51 @@ Peering and Sets
~~~~~~~~~~~~~~~~
In previous sections, we noted that Ceph OSD Daemons check each other's
-heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons
-do is called 'peering', which is the process of bringing all of the OSDs that
-store a Placement Group (PG) into agreement about the state of all of the
-objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
-Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve
-themselves; however, if the problem persists, you may need to refer to the
-`Troubleshooting Peering Failure`_ section.
+heartbeats and report back to Ceph Monitors. Ceph OSD daemons also 'peer',
+which is the process of bringing all of the OSDs that store a Placement Group
+(PG) into agreement about the state of all of the RADOS objects (and their
+metadata) in that PG. Ceph OSD Daemons `Report Peering Failure`_ to the Ceph
+Monitors. Peering issues usually resolve themselves; however, if the problem
+persists, you may need to refer to the `Troubleshooting Peering Failure`_
+section.
-.. Note:: Agreeing on the state does not mean that the PGs have the latest contents.
+.. Note:: Agreeing on the state of a PG does not mean that all of the OSDs
+   that store the PG already have the most recent contents of its objects.
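+
+When investigating peering, the standard CLI can dump a PG's peering state.
+The sketch below (the PG ID ``4.58`` is hypothetical, and the exact JSON
+fields can vary between releases) runs ``ceph pg <pgid> query`` from Python
+and prints a few of the returned fields.
+
+.. code-block:: python
+
+   import json
+   import subprocess
+
+   out = subprocess.run(["ceph", "pg", "4.58", "query"],
+                        capture_output=True, text=True, check=True).stdout
+   info = json.loads(out)
+   print(info.get("state"), info.get("up"), info.get("acting"))
+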
The Ceph Storage Cluster was designed to store at least two copies of an object
-(i.e., ``size = 2``), which is the minimum requirement for data safety. For high
-availability, a Ceph Storage Cluster should store more than two copies of an object
-(e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a
-``degraded`` state while maintaining data safety.
+(that is, ``size = 2``), which is the minimum requirement for data safety. For
+high availability, a Ceph Storage Cluster should store more than two copies of
+an object (for example, ``size = 3`` and ``min size = 2``) so that it can continue
+to run in a ``degraded`` state while maintaining data safety.
-Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
-name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
-rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
-the *Primary* is the first OSD in the *Acting Set*, and is responsible for
-coordinating the peering process for each placement group where it acts as
-the *Primary*, and is the **ONLY** OSD that that will accept client-initiated
-writes to objects for a given placement group where it acts as the *Primary*.
+.. warning:: Although we say here that R2 (replication with two copies) is the
+ minimum requirement for data safety, R3 (replication with three copies) is
+ recommended. On a long enough timeline, data stored with an R2 strategy will
+ be lost.
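+
+The replica counts discussed above are per-pool settings. The following
+minimal sketch (the pool name "liverpool" is hypothetical, and admin
+credentials are assumed) applies the recommended ``size = 3`` and
+``min_size = 2`` with the standard CLI invoked from Python.
+
+.. code-block:: python
+
+   import subprocess
+
+   for var, val in (("size", "3"), ("min_size", "2")):
+       subprocess.run(["ceph", "osd", "pool", "set", "liverpool", var, val],
+                      check=True)
+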
-When a series of OSDs are responsible for a placement group, that series of
-OSDs, we refer to them as an *Acting Set*. An *Acting Set* may refer to the Ceph
-OSD Daemons that are currently responsible for the placement group, or the Ceph
-OSD Daemons that were responsible for a particular placement group as of some
+As shown in the diagram in `Smart Daemons Enable Hyperscale`_, we do not
+name the Ceph OSD Daemons specifically (for example, ``osd.0``, ``osd.1``,
+etc.), but rather refer to them as *Primary*, *Secondary*, and so forth. By
+convention, the *Primary* is the first OSD in the *Acting Set*, and is
+responsible for orchestrating the peering process for each placement group
+where it acts as the *Primary*. The *Primary* is the **ONLY** OSD in a given
+placement group that accepts client-initiated writes to objects.
+
+The set of OSDs that is responsible for a placement group is called the
+*Acting Set*. The term "*Acting Set*" can refer either to the Ceph OSD Daemons
+that are currently responsible for the placement group, or to the Ceph OSD
+Daemons that were responsible for a particular placement group as of some
epoch.
-The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``.
-When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
-Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
-Daemons when an OSD fails.
-
-.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
- ``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
- the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
- removed from the *Up Set*.
+The Ceph OSD daemons that are part of an *Acting Set* might not always be
+``up``. When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*.
+The *Up Set* is an important distinction, because Ceph can remap PGs to other
+Ceph OSD Daemons when an OSD fails.
+.. note:: Consider a hypothetical *Acting Set* for a PG that contains
+   ``osd.25``, ``osd.32``, and ``osd.61``. The first OSD (``osd.25``) is the
+   *Primary*. If that OSD fails, the Secondary (``osd.32``) becomes the
+ *Primary*, and ``osd.25`` is removed from the *Up Set*.
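+
+The note above can be pictured with a tiny toy (not Ceph's peering logic; the
+OSD names are taken from the example): removing a failed OSD from the *Up
+Set* promotes the next OSD in the *Acting Set* to *Primary*.
+
+.. code-block:: python
+
+   acting_set = ["osd.25", "osd.32", "osd.61"]
+   down_osds = {"osd.25"}
+
+   up_set = [osd for osd in acting_set if osd not in down_osds]
+   primary = up_set[0]
+   print("Up Set:", up_set, "Primary:", primary)   # osd.32 becomes the Primary
+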
.. index:: architecture; Rebalancing
@@ -1467,11 +1502,11 @@ Ceph Clients
Ceph Clients include a number of service interfaces. These include:
-- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
- provides resizable, thin-provisioned block devices with snapshotting and
- cloning. Ceph stripes a block device across the cluster for high
- performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
- that uses ``librbd`` directly--avoiding the kernel object overhead for
+- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
+ provides resizable, thin-provisioned block devices that can be snapshotted
+ and cloned. Ceph stripes a block device across the cluster for high
+ performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
+ that uses ``librbd`` directly--avoiding the kernel object overhead for
virtualized systems.
- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service
diff --git a/ceph/doc/cephadm/host-management.rst b/ceph/doc/cephadm/host-management.rst
index b2c514c8c..779c80902 100644
--- a/ceph/doc/cephadm/host-management.rst
+++ b/ceph/doc/cephadm/host-management.rst
@@ -11,9 +11,9 @@ Run a command of this form to list hosts associated with the cluster:
.. prompt:: bash #
- ceph orch host ls [--format yaml] [--host-pattern ] [--label