Merge remote-tracking branch 'origin/quincy-stable-8' into quincy-stable-7

This commit is contained in:
Thomas Lamprecht 2023-11-02 17:21:22 +01:00
commit 692686d018
1290 changed files with 38032 additions and 16268 deletions

View File

@ -21,6 +21,13 @@
- The Signed-off-by line in every git commit is important; see [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/master/SubmittingPatches.rst)
-->
## Contribution Guidelines
- To sign and title your commits, please refer to [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/main/SubmittingPatches.rst).
- If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to [Submitting Patches to Ceph - Backports](https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst) for the proper workflow.
- When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.
## Checklist
- Tracker (select at least one)
- [ ] References tracker ticket

View File

@ -11,6 +11,9 @@ jobs:
runs-on: ubuntu-latest
name: Verify
steps:
- name: Sleep for 30 seconds
run: sleep 30s
shell: bash
- name: Action
id: checklist
uses: ceph/ceph-pr-checklist-action@32e92d1a2a7c9991ed51de5fccb2296551373d60

View File

@ -1,7 +1,7 @@
cmake_minimum_required(VERSION 3.16)
project(ceph
VERSION 17.2.6
VERSION 17.2.7
LANGUAGES CXX C ASM)
cmake_policy(SET CMP0028 NEW)

View File

@ -1,3 +1,49 @@
>=17.2.7
--------
* `ceph mgr dump` command now displays the name of the mgr module that
registered a RADOS client in the `name` field added to elements of the
`active_clients` array. Previously, only the address of a module's RADOS
client was shown in the `active_clients` array.
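As an illustration only (the use of `jq` and the exact output shape are assumptions, not part of the release note), the new field can be inspected with:

    ceph mgr dump | jq '.active_clients[].name'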
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
undergone significant usability and design improvements to address the slow
backfill issue. Some important changes are:
* The 'balanced' profile is set as the default mClock profile because it
represents a compromise between prioritizing client IO and recovery IO. Users
can then choose either the 'high_client_ops' profile to prioritize client IO
or the 'high_recovery_ops' profile to prioritize recovery IO.
* QoS parameters like reservation and limit are now specified in terms of a
fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
* The cost parameters (osd_mclock_cost_per_io_usec_* and
osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
is now determined using the random IOPS and maximum sequential bandwidth
capability of the OSD's underlying device.
* Degraded object recovery is given higher priority when compared to misplaced
object recovery because degraded objects present a data safety issue not
present with objects that are merely misplaced. Therefore, backfilling
operations with the 'balanced' and 'high_client_ops' mClock profiles may
progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
scheduler.
* The QoS allocations in all the mClock profiles are optimized based on the above
fixes and enhancements.
* For more detailed information see:
https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/
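For illustration, a minimal sketch of overriding the default 'balanced' profile (the option and profile names are those referenced above; see the linked documentation for the authoritative procedure):

    # prioritize client IO on all OSDs
    ceph config set osd osd_mclock_profile high_client_ops
    # verify the active profile on one OSD
    ceph config show osd.0 osd_mclock_profile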
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
multi-site. Previously, the replicas of such objects were corrupted on decryption.
A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
identify these original multipart uploads. The ``LastModified`` timestamp of any
identified object is incremented by 1ns to cause peer zones to replicate it again.
For multi-site deployments that make any use of Server-Side Encryption, we
recommend running this command against every bucket in every zone after all
zones have upgraded.
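A minimal sketch of running the new tool against a single bucket (the bucket name is a placeholder, and `--bucket` is assumed to follow the usual radosgw-admin convention):

    radosgw-admin bucket resync encrypted multipart --bucket=mybucket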
* CEPHFS: The MDS now evicts clients that are not advancing their request tids, because
such clients cause a large buildup of session metadata, which results in the MDS going
read-only when the RADOS operation exceeds the size threshold. The
`mds_session_metadata_threshold` config option controls the maximum size to which
(encoded) session metadata can grow.
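For example, the threshold could be adjusted with a command of the following form (the value shown is purely illustrative):

    ceph config set mds mds_session_metadata_threshold 16777216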
* CEPHFS: After recovering a Ceph File System following the disaster recovery
procedure, the recovered files under `lost+found` directory can now be deleted.
>=17.2.6
--------
@ -7,6 +53,60 @@
>=17.2.5
--------
>=19.0.0
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
multi-site. Previously, the replicas of such objects were corrupted on decryption.
A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
identify these original multipart uploads. The ``LastModified`` timestamp of any
identified object is incremented by 1ns to cause peer zones to replicate it again.
For multi-site deployments that make any use of Server-Side Encryption, we
recommend running this command against every bucket in every zone after all
zones have upgraded.
* CEPHFS: The MDS now evicts clients that are not advancing their request tids, because
such clients cause a large buildup of session metadata, which results in the MDS going
read-only when the RADOS operation exceeds the size threshold. The
`mds_session_metadata_threshold` config option controls the maximum size to which
(encoded) session metadata can grow.
* CephFS: For clusters with multiple CephFS file systems, all the snap-schedule
commands now expect the '--fs' argument.
* CephFS: The period specifier ``m`` now implies minutes and the period specifier
``M`` now implies months. This has been made consistent with the rest
of the system.
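A sketch of the new command form, assuming a file system named `cephfs` and an hourly schedule on the root path (both are placeholders):

    ceph fs snap-schedule add / 1h --fs cephfs
    ceph fs snap-schedule list / --fs cephfs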
* RGW: New tools have been added to radosgw-admin for identifying and
correcting issues with versioned bucket indexes. Historical bugs with the
versioned bucket index transaction workflow made it possible for the index
to accumulate extraneous "book-keeping" olh entries and plain placeholder
entries. In some specific scenarios where clients made concurrent requests
referencing the same object key, it was likely that a lot of extra index
entries would accumulate. When a significant number of these entries are
present in a single bucket index shard, they can cause high bucket listing
latencies and lifecycle processing failures. To check whether a versioned
bucket has unnecessary olh entries, users can now run ``radosgw-admin
bucket check olh``. If the ``--fix`` flag is used, the extra entries will
be safely removed. Separate from the issue described thus far, it is also
possible that some versioned buckets are maintaining extra unlinked objects
that are not listable from the S3/Swift APIs. These extra objects
are typically a result of PUT requests that exited abnormally, in the middle
of a bucket index transaction - so the client would not have received a
successful response. Bugs in prior releases made these unlinked objects easy
to reproduce with any PUT request that was made on a bucket that was actively
resharding. Besides the extra space that these hidden, unlinked objects
consume, there can be another side effect in certain scenarios, caused by
the nature of the failure mode that produced them, where a client of a bucket
that was a victim of this bug may find the object associated with the key to
be in an inconsistent state. To check whether a versioned bucket has unlinked
entries, users can now run ``radosgw-admin bucket check unlinked``. If the
``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
a third issue made it possible for versioned bucket index stats to be
accounted inaccurately. The tooling for recalculating versioned bucket stats
also had a bug, and was not previously capable of fixing these inaccuracies.
This release resolves those issues and users can now expect that the existing
``radosgw-admin bucket check`` command will produce correct results. We
recommend that users with versioned buckets, especially those that existed
on prior releases, use these new tools to check whether their buckets are
affected and to clean them up accordingly.
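For illustration, the checks described above can be run per bucket as follows (the bucket name is a placeholder; omit `--fix` to only report):

    radosgw-admin bucket check olh --bucket=mybucket --fix
    radosgw-admin bucket check unlinked --bucket=mybucket --fix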
>=18.0.0
* RBD: The semantics of compare-and-write C++ API (`Image::compare_and_write`
and `Image::aio_compare_and_write` methods) now match those of C API. Both
@ -47,6 +147,100 @@
If that is the case, in OSD logs the "You can be hit by THE DUPS BUG" warning
will be visible.
Relevant tracker: https://tracker.ceph.com/issues/53729
* RBD: The `rbd device unmap` command gained a `--namespace` option. Support for
namespaces was added to RBD in Nautilus 14.2.0, and since then it has been
possible to map and unmap images in namespaces using the `image-spec` syntax,
but the corresponding option that is available in most other commands was missing.
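A minimal sketch, with pool, namespace, and image names as placeholders; the first form uses the `image-spec` syntax that has worked since Nautilus, and the second uses the new option (its exact placement is assumed to mirror the other rbd commands):

    rbd device unmap mypool/mynamespace/myimage
    rbd device unmap --pool mypool --namespace mynamespace myimage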
* RGW: Compression is now supported for objects uploaded with Server-Side Encryption.
When both are enabled, compression is applied before encryption.
* RGW: the "pubsub" functionality for storing bucket notifications inside Ceph
is removed. Together with it, the "pubsub" zone should not be used anymore.
The REST operations, as well as radosgw-admin commands for manipulating
subscriptions, as well as fetching and acking the notifications are removed
as well.
In case that the endpoint to which the notifications are sent maybe down or
disconnected, it is recommended to use persistent notifications to guarantee
the delivery of the notifications. In case the system that consumes the
notifications needs to pull them (instead of the notifications be pushed
to it), an external message bus (e.g. rabbitmq, Kafka) should be used for
that purpose.
* RGW: The serialized format of notifications and topics has changed, so that
new/updated topics will be unreadable by old RGWs. We recommend completing
the RGW upgrades before creating or modifying any notification topics.
* RBD: Trailing newline in passphrase files (`<passphrase-file>` argument in
`rbd encryption format` command and `--encryption-passphrase-file` option
in other commands) is no longer stripped.
* RBD: Support for layered client-side encryption is added. Cloned images
can now be encrypted each with its own encryption format and passphrase,
potentially different from that of the parent image. The efficient
copy-on-write semantics intrinsic to unformatted (regular) cloned images
are retained.
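Both RBD encryption notes above can be illustrated with a short sketch, assuming `mypool/child` is an existing clone of `mypool/parent` (all names and passphrases are placeholders): `printf '%s'` writes the passphrase without a trailing newline, which now matters because the newline is no longer stripped, and the clone is formatted with its own passphrase independent of the parent's:

    # write passphrases without a trailing newline
    printf '%s' 'parent secret' > parent-pass.txt
    printf '%s' 'child secret' > child-pass.txt
    # format the parent image and, independently, its clone
    rbd encryption format mypool/parent luks2 parent-pass.txt
    rbd encryption format mypool/child luks2 child-pass.txt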
* CEPHFS: The `mds_max_retries_on_remount_failure` option has been renamed to
`client_max_retries_on_remount_failure` and moved from mds.yaml.in to
mds-client.yaml.in, because this option has only ever been used by the MDS
client.
* The `perf dump` and `perf schema` commands are deprecated in favor of new
`counter dump` and `counter schema` commands. These new commands add support
for labeled perf counters and also emit existing unlabeled perf counters. Some
unlabeled perf counters became labeled in this release, with more to follow in
future releases; such converted perf counters are no longer emitted by the
`perf dump` and `perf schema` commands.
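For example, assuming access to an OSD's admin socket (the invocation below mirrors how `perf dump` is typically run and is shown as a sketch, not the only form):

    ceph daemon osd.0 counter dump
    ceph daemon osd.0 counter schema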
* `ceph mgr dump` command now outputs `last_failure_osd_epoch` and
`active_clients` fields at the top level. Previously, these fields were
output under `always_on_modules` field.
* `ceph mgr dump` command now displays the name of the mgr module that
registered a RADOS client in the `name` field added to elements of the
`active_clients` array. Previously, only the address of a module's RADOS
client was shown in the `active_clients` array.
* RBD: All rbd-mirror daemon perf counters became labeled and as such are now
emitted only by the new `counter dump` and `counter schema` commands. As part
of the conversion, many also got renamed to better disambiguate journal-based
and snapshot-based mirroring.
* RBD: list-watchers C++ API (`Image::list_watchers`) now clears the passed
`std::list` before potentially appending to it, aligning with the semantics
of the corresponding C API (`rbd_watchers_list`).
* Telemetry: Users who are opted-in to telemetry can also opt-in to
participating in a leaderboard in the telemetry public
dashboards (https://telemetry-public.ceph.com/). Users can now also add a
description of the cluster to publicly appear in the leaderboard.
For more details, see:
https://docs.ceph.com/en/latest/mgr/telemetry/#leaderboard
See a sample report with `ceph telemetry preview`.
Opt-in to telemetry with `ceph telemetry on`.
Opt-in to the leaderboard with
`ceph config set mgr mgr/telemetry/leaderboard true`.
Add leaderboard description with:
`ceph config set mgr mgr/telemetry/leaderboard_description Cluster description`.
* CEPHFS: After recovering a Ceph File System following the disaster recovery
procedure, the recovered files under `lost+found` directory can now be deleted.
* core: cache-tiering is now deprecated.
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has
undergone significant usability and design improvements to address the slow
backfill issue. Some important changes are:
* The 'balanced' profile is set as the default mClock profile because it
represents a compromise between prioritizing client IO and recovery IO. Users
can then choose either the 'high_client_ops' profile to prioritize client IO
or the 'high_recovery_ops' profile to prioritize recovery IO.
* QoS parameters like reservation and limit are now specified in terms of a
fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
* The cost parameters (osd_mclock_cost_per_io_usec_* and
osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation
is now determined using the random IOPS and maximum sequential bandwidth
capability of the OSD's underlying device.
* Degraded object recovery is given higher priority when compared to misplaced
object recovery because degraded objects present a data safety issue not
present with objects that are merely misplaced. Therefore, backfilling
operations with the 'balanced' and 'high_client_ops' mClock profiles may
progress slower than what was seen with the 'WeightedPriorityQueue' (WPQ)
scheduler.
* The QoS allocations in all the mClock profiles are optimized based on the above
fixes and enhancements.
* For more detailed information see:
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
* mgr/snap_schedule: The snap-schedule mgr module now retains one snapshot fewer
than the number specified by the config tunable `mds_max_snaps_per_dir`, so that
a new snapshot can be created and retained during the next schedule run.
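For example, to inspect or raise the tunable that the retention count is derived from (the value shown is illustrative):

    ceph config get mds mds_max_snaps_per_dir
    ceph config set mds mds_max_snaps_per_dir 150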
>=17.2.1

View File

@ -1,81 +1,107 @@
# Ceph - a scalable distributed storage system
Please see http://ceph.com/ for current info.
See https://ceph.com/ for current information about Ceph.
## Contributing Code
Most of Ceph is dual licensed under the LGPL version 2.1 or 3.0. Some
miscellaneous code is under a BSD-style license or is public domain.
The documentation is licensed under Creative Commons
Attribution Share Alike 3.0 (CC-BY-SA-3.0). There are a handful of headers
included here that are licensed under the GPL. Please see the file
COPYING for a full inventory of licenses by file.
Most of Ceph is dual-licensed under the LGPL version 2.1 or 3.0. Some
miscellaneous code is either public domain or licensed under a BSD-style
license.
Code contributions must include a valid "Signed-off-by" acknowledging
the license for the modified or contributed file. Please see the file
SubmittingPatches.rst for details on what that means and on how to
generate and submit patches.
The Ceph documentation is licensed under Creative Commons Attribution Share
Alike 3.0 (CC-BY-SA-3.0).
We do not require assignment of copyright to contribute code; code is
Some headers included in the `ceph/ceph` repository are licensed under the GPL.
See the file `COPYING` for a full inventory of licenses by file.
All code contributions must include a valid "Signed-off-by" line. See the file
`SubmittingPatches.rst` for details on this and instructions on how to generate
and submit patches.
Assignment of copyright is not required to contribute code. Code is
contributed under the terms of the applicable license.
## Checking out the source
You can clone from github with
Clone the ceph/ceph repository from github by running the following command on
a system that has git installed:
git clone git@github.com:ceph/ceph
or, if you are not a github user,
Alternatively, if you are not a github user, you should run the following
command on a system that has git installed:
git clone git://github.com/ceph/ceph
Ceph contains many git submodules that need to be checked out with
When the `ceph/ceph` repository has been cloned to your system, run the
following commands to move into the cloned `ceph/ceph` repository and to check
out the git submodules associated with it:
cd ceph
git submodule update --init --recursive
## Build Prerequisites
The list of Debian or RPM packages dependencies can be installed with:
*section last updated 27 Jul 2023*
Make sure that ``curl`` is installed. The Debian and Ubuntu ``apt`` command is
provided here, but if you use a system with a different package manager, then
you must use whatever command is the proper counterpart of this one:
apt install curl
Install Debian or RPM package dependencies by running the following command:
./install-deps.sh
Install the ``python3-routes`` package:
apt install python3-routes
## Building Ceph
Note that these instructions are meant for developers who are
compiling the code for development and testing. To build binaries
suitable for installation we recommend you build deb or rpm packages
or refer to the `ceph.spec.in` or `debian/rules` to see which
configuration options are specified for production builds.
These instructions are meant for developers who are compiling the code for
development and testing. To build binaries that are suitable for installation
we recommend that you build `.deb` or `.rpm` packages, or refer to
``ceph.spec.in`` or ``debian/rules`` to see which configuration options are
specified for production builds.
Build instructions:
To build Ceph, make sure that you are in the top-level `ceph` directory that
contains `do_cmake.sh` and `CONTRIBUTING.rst` and run the following commands:
./do_cmake.sh
cd build
ninja
(do_cmake.sh now defaults to creating a debug build of ceph that can
be up to 5x slower with some workloads. Please pass
"-DCMAKE_BUILD_TYPE=RelWithDebInfo" to do_cmake.sh to create a non-debug
release.
``do_cmake.sh`` by default creates a "debug build" of Ceph, which can be up to
five times slower than a non-debug build. Pass
``-DCMAKE_BUILD_TYPE=RelWithDebInfo`` to ``do_cmake.sh`` to create a non-debug
build.
The number of jobs used by `ninja` is derived from the number of CPU cores of
the building host if unspecified. Use the `-j` option to limit the job number
if the build jobs are running out of memory. On average, each job takes around
2.5GiB memory.)
[Ninja](https://ninja-build.org/) is the buildsystem used by the Ceph project
to build test builds. The number of jobs used by `ninja` is derived from the
number of CPU cores of the building host if unspecified. Use the `-j` option to
limit the job number if the build jobs are running out of memory. If you
attempt to run `ninja` and receive a message that reads `g++: fatal error:
Killed signal terminated program cc1plus`, then you have run out of memory.
Using the `-j` option with an argument appropriate to the hardware on which the
`ninja` command is run is expected to result in a successful build. For example,
to limit the job number to 3, run the command `ninja -j 3`. On average, each
`ninja` job run in parallel needs approximately 2.5 GiB of RAM.
This assumes you make your build dir a subdirectory of the ceph.git
checkout. If you put it elsewhere, just point `CEPH_GIT_DIR` to the correct
path to the checkout. Any additional CMake args can be specified by setting ARGS
before invoking do_cmake. See [cmake options](#cmake-options)
for more details. Eg.
This documentation assumes that your build directory is a subdirectory of the
`ceph.git` checkout. If the build directory is located elsewhere, point
`CEPH_GIT_DIR` to the correct path of the checkout. Additional CMake args can
be specified by setting ARGS before invoking ``do_cmake.sh``. See [cmake
options](#cmake-options) for more details. For example:
ARGS="-DCMAKE_C_COMPILER=gcc-7" ./do_cmake.sh
To build only certain targets use:
To build only certain targets, run a command of the following form:
ninja [target name]
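For example, to build only the OSD daemon with three parallel jobs (`ceph-osd` is given here as a typical target name):

    ninja -j 3 ceph-osd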
@ -130,24 +156,25 @@ are committed to git.)
## Running a test cluster
To run a functional test cluster,
From the `ceph/` directory, run the following commands to launch a test Ceph
cluster:
cd build
ninja vstart # builds just enough to run vstart
../src/vstart.sh --debug --new -x --localhost --bluestore
./bin/ceph -s
Almost all of the usual commands are available in the bin/ directory.
For example,
Most Ceph commands are available in the `bin/` directory. For example:
./bin/rados -p rbd bench 30 write
./bin/rbd create foo --size 1000
./bin/rados -p foo bench 30 write
To shut down the test cluster,
To shut down the test cluster, run the following command from the `build/`
directory:
../src/stop.sh
To start or stop individual daemons, the sysvinit script can be used:
Use the sysvinit script to start or stop individual daemons:
./bin/init-ceph restart osd.0
./bin/init-ceph stop

View File

@ -166,7 +166,7 @@
# main package definition
#################################################################################
Name: ceph
Version: 17.2.6
Version: 17.2.7
Release: 0%{?dist}
%if 0%{?fedora} || 0%{?rhel}
Epoch: 2
@ -182,7 +182,7 @@ License: LGPL-2.1 and LGPL-3.0 and CC-BY-SA-3.0 and GPL-2.0 and BSL-1.0 and BSD-
Group: System/Filesystems
%endif
URL: http://ceph.com/
Source0: %{?_remote_tarball_prefix}ceph-17.2.6.tar.bz2
Source0: %{?_remote_tarball_prefix}ceph-17.2.7.tar.bz2
%if 0%{?suse_version}
# _insert_obs_source_lines_here
ExclusiveArch: x86_64 aarch64 ppc64le s390x
@ -1274,7 +1274,7 @@ This package provides Ceph default alerts for Prometheus.
# common
#################################################################################
%prep
%autosetup -p1 -n ceph-17.2.6
%autosetup -p1 -n ceph-17.2.7
%build
# Disable lto on systems that do not support symver attribute
@ -1863,6 +1863,7 @@ fi
%{_datadir}/ceph/mgr/prometheus
%{_datadir}/ceph/mgr/rbd_support
%{_datadir}/ceph/mgr/restful
%{_datadir}/ceph/mgr/rgw
%{_datadir}/ceph/mgr/selftest
%{_datadir}/ceph/mgr/snap_schedule
%{_datadir}/ceph/mgr/stats

View File

@ -1863,6 +1863,7 @@ fi
%{_datadir}/ceph/mgr/prometheus
%{_datadir}/ceph/mgr/rbd_support
%{_datadir}/ceph/mgr/restful
%{_datadir}/ceph/mgr/rgw
%{_datadir}/ceph/mgr/selftest
%{_datadir}/ceph/mgr/snap_schedule
%{_datadir}/ceph/mgr/stats

View File

@ -1,3 +1,9 @@
ceph (17.2.7-1) stable; urgency=medium
* New upstream release
-- Ceph Release Team <ceph-maintainers@ceph.io> Wed, 25 Oct 2023 23:46:13 +0000
ceph (17.2.6-1) stable; urgency=medium
* New upstream release

View File

@ -1 +1,3 @@
lib/systemd/system/cephfs-mirror*
usr/bin/cephfs-mirror
usr/share/man/man8/cephfs-mirror.8

View File

@ -30,60 +30,52 @@ A Ceph Storage Cluster consists of multiple types of daemons:
- :term:`Ceph Manager`
- :term:`Ceph Metadata Server`
.. ditaa::
+---------------+ +---------------+ +---------------+ +---------------+
| OSDs | | Monitors | | Managers | | MDS |
+---------------+ +---------------+ +---------------+ +---------------+
A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
monitors ensures high availability should a monitor daemon fail. Storage cluster
clients retrieve a copy of the cluster map from the Ceph Monitor.
Ceph Monitors maintain the master copy of the cluster map, which they provide
to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures
availability in the event that one of the monitor daemons or its host fails.
The Ceph monitor provides copies of the cluster map to storage cluster clients.
A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
back to monitors.
A Ceph Manager acts as an endpoint for monitoring, orchestration, and plug-in
A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in
modules.
A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
provide file services.
Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
to efficiently compute information about data location, instead of having to
depend on a central lookup table. Ceph's high-level features include a
native interface to the Ceph Storage Cluster via ``librados``, and a number of
service interfaces built on top of ``librados``.
Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm
to compute information about data location. This means that clients and OSDs
are not bottlenecked by a central lookup table. Ceph's high-level features
include a native interface to the Ceph Storage Cluster via ``librados``, and a
number of service interfaces built on top of ``librados``.
Storing Data
------------
The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph File System` or a custom implementation you create using
``librados``-- which is stored as RADOS objects. Each object is stored on an
:term:`Object Storage Device`. Ceph OSD Daemons handle read, write, and
replication operations on storage drives. With the older Filestore back end,
each RADOS object was stored as a separate file on a conventional filesystem
(usually XFS). With the new and default BlueStore back end, objects are
stored in a monolithic database-like fashion.
:term:`Ceph File System`, or a custom implementation that you create by using
``librados``. The data received by the Ceph Storage Cluster is stored as RADOS
objects. Each object is stored on an :term:`Object Storage Device` (this is
also called an "OSD"). Ceph OSDs control read, write, and replication
operations on storage drives. The default BlueStore back end stores objects
in a monolithic, database-like fashion.
.. ditaa::
/-----\ +-----+ +-----+
| obj |------>| {d} |------>| {s} |
\-----/ +-----+ +-----+
/------\ +-----+ +-----+
| obj |------>| {d} |------>| {s} |
\------/ +-----+ +-----+
Object OSD Drive
Ceph OSD Daemons store data as objects in a flat namespace (e.g., no
hierarchy of directories). An object has an identifier, binary data, and
metadata consisting of a set of name/value pairs. The semantics are completely
up to :term:`Ceph Client`\s. For example, CephFS uses metadata to store file
attributes such as the file owner, created date, last modified date, and so
forth.
Ceph OSD Daemons store data as objects in a flat namespace. This means that
objects are not stored in a hierarchy of directories. An object has an
identifier, binary data, and metadata consisting of name/value pairs.
:term:`Ceph Client`\s determine the semantics of the object data. For example,
CephFS uses metadata to store file attributes such as the file owner, the
created date, and the last modified date.
.. ditaa::
@ -102,20 +94,23 @@ forth.
.. index:: architecture; high availability, scalability
.. _arch_scalability_and_high_availability:
Scalability and High Availability
---------------------------------
In traditional architectures, clients talk to a centralized component (e.g., a
gateway, broker, API, facade, etc.), which acts as a single point of entry to a
complex subsystem. This imposes a limit to both performance and scalability,
while introducing a single point of failure (i.e., if the centralized component
goes down, the whole system goes down, too).
In traditional architectures, clients talk to a centralized component. This
centralized component might be a gateway, a broker, an API, or a facade. A
centralized component of this kind acts as a single point of entry to a complex
subsystem. Architectures that rely upon such a centralized component have a
single point of failure and incur limits to performance and scalability. If
the centralized component goes down, the whole system becomes unavailable.
Ceph eliminates the centralized gateway to enable clients to interact with
Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster
of monitors to ensure high availability. To eliminate centralization, Ceph
uses an algorithm called CRUSH.
Ceph eliminates this centralized component. This enables clients to interact
with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes
to ensure data safety and high availability. Ceph also uses a cluster of
monitors to ensure high availability. To eliminate centralization, Ceph uses an
algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`.
.. index:: CRUSH; architecture
@ -124,15 +119,15 @@ CRUSH Introduction
~~~~~~~~~~~~~~~~~~
Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
Replication Under Scalable Hashing)` algorithm to efficiently compute
information about object location, instead of having to depend on a
central lookup table. CRUSH provides a better data management mechanism compared
to older approaches, and enables massive scale by cleanly distributing the work
to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
replication to ensure resiliency, which is better suited to hyper-scale storage.
The following sections provide additional details on how CRUSH works. For a
detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
Placement of Replicated Data`_.
Replication Under Scalable Hashing)` algorithm to compute information about
object location instead of relying upon a central lookup table. CRUSH provides
a better data management mechanism than do older approaches, and CRUSH enables
massive scale by distributing the work to all the OSD daemons in the cluster
and all the clients that communicate with them. CRUSH uses intelligent data
replication to ensure resiliency, which is better suited to hyper-scale
storage. The following sections provide additional details on how CRUSH works.
For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable,
Decentralized Placement of Replicated Data`_.
.. index:: architecture; cluster map
@ -141,109 +136,130 @@ Placement of Replicated Data`_.
Cluster Map
~~~~~~~~~~~
Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
cluster topology, which is inclusive of 5 maps collectively referred to as the
"Cluster Map":
In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs
must have current information about the cluster's topology. Current information
is stored in the "Cluster Map", which is in fact a collection of five maps. The
five maps that constitute the cluster map are:
#. **The Monitor Map:** Contains the cluster ``fsid``, the position, name
address and port of each monitor. It also indicates the current epoch,
when the map was created, and the last time it changed. To view a monitor
map, execute ``ceph mon dump``.
#. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name,
the address, and the TCP port of each monitor. The monitor map specifies the
current epoch, the time of the monitor map's creation, and the time of the
monitor map's last modification. To view a monitor map, run ``ceph mon
dump``.
#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
last modified, a list of pools, replica sizes, PG numbers, a list of OSDs
and their status (e.g., ``up``, ``in``). To view an OSD map, execute
``ceph osd dump``.
#. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's
creation, the time of the OSD map's last modification, a list of pools, a
list of replica sizes, a list of PG numbers, and a list of OSDs and their
statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph
osd dump``.
#. **The PG Map:** Contains the PG version, its time stamp, the last OSD
map epoch, the full ratios, and details on each placement group such as
the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g.,
``active + clean``), and data usage statistics for each pool.
#. **The PG Map:** Contains the PG version, its time stamp, the last OSD map
epoch, the full ratios, and the details of each placement group. This
includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for
example, ``active + clean``), and data usage statistics for each pool.
#. **The CRUSH Map:** Contains a list of storage devices, the failure domain
hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
traversing the hierarchy when storing data. To view a CRUSH map, execute
``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
You can view the decompiled map in a text editor or with ``cat``.
hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``),
and rules for traversing the hierarchy when storing data. To view a CRUSH
map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by
running ``crushtool -d {comp-crushmap-filename} -o
{decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the
decompiled map.
#. **The MDS Map:** Contains the current MDS map epoch, when the map was
created, and the last time it changed. It also contains the pool for
storing metadata, a list of metadata servers, and which metadata servers
are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.
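For example, the CRUSH map can be fetched and decompiled as described above (the file names are placeholders):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    cat crushmap.txt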
Each map maintains an iterative history of its operating state changes. Ceph
Monitors maintain a master copy of the cluster map including the cluster
members, state, changes, and the overall health of the Ceph Storage Cluster.
Each map maintains a history of changes to its operating state. Ceph Monitors
maintain a master copy of the cluster map. This master copy includes the
cluster members, the state of the cluster, changes to the cluster, and
information recording the overall health of the Ceph Storage Cluster.
.. index:: high availability; monitor architecture
High Availability Monitors
~~~~~~~~~~~~~~~~~~~~~~~~~~
Before Ceph Clients can read or write data, they must contact a Ceph Monitor
to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
can operate with a single monitor; however, this introduces a single
point of failure (i.e., if the monitor goes down, Ceph Clients cannot
read or write data).
A Ceph Client must contact a Ceph Monitor and obtain a current copy of the
cluster map in order to read data from or to write data to the Ceph cluster.
For added reliability and fault tolerance, Ceph supports a cluster of monitors.
In a cluster of monitors, latency and other faults can cause one or more
monitors to fall behind the current state of the cluster. For this reason, Ceph
must have agreement among various monitor instances regarding the state of the
cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.)
and the `Paxos`_ algorithm to establish a consensus among the monitors about the
current state of the cluster.
It is possible for a Ceph cluster to function properly with only a single
monitor, but a Ceph cluster that has only a single monitor has a single point
of failure: if the monitor goes down, Ceph clients will be unable to read data
from or write data to the cluster.
For details on configuring monitors, see the `Monitor Config Reference`_.
Ceph leverages a cluster of monitors in order to increase reliability and fault
tolerance. When a cluster of monitors is used, however, one or more of the
monitors in the cluster can fall behind due to latency or other faults. Ceph
mitigates these negative effects by requiring multiple monitor instances to
agree about the state of the cluster. To establish consensus among the monitors
regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a
majority of monitors (for example, one in a cluster that contains only one
monitor, two in a cluster that contains three monitors, three in a cluster that
contains five monitors, four in a cluster that contains six monitors, and so
on).
See the `Monitor Config Reference`_ for more detail on configuring monitors.
.. index:: architecture; high availability authentication
.. _arch_high_availability_authentication:
High Availability Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To identify users and protect against man-in-the-middle attacks, Ceph provides
its ``cephx`` authentication system to authenticate users and daemons.
The ``cephx`` authentication system is used by Ceph to authenticate users and
daemons and to protect against man-in-the-middle attacks.
.. note:: The ``cephx`` protocol does not address data encryption in transport
(e.g., SSL/TLS) or encryption at rest.
(for example, SSL/TLS) or encryption at rest.
Cephx uses shared secret keys for authentication, meaning both the client and
the monitor cluster have a copy of the client's secret key. The authentication
protocol is such that both parties are able to prove to each other they have a
copy of the key without actually revealing it. This provides mutual
authentication, which means the cluster is sure the user possesses the secret
key, and the user is sure that the cluster has a copy of the secret key.
``cephx`` uses shared secret keys for authentication. This means that both the
client and the monitor cluster keep a copy of the client's secret key.
A key scalability feature of Ceph is to avoid a centralized interface to the
Ceph object store, which means that Ceph clients must be able to interact with
OSDs directly. To protect data, Ceph provides its ``cephx`` authentication
system, which authenticates users operating Ceph clients. The ``cephx`` protocol
operates in a manner with behavior similar to `Kerberos`_.
The ``cephx`` protocol makes it possible for each party to prove to the other
that it has a copy of the key without revealing it. This provides mutual
authentication and allows the cluster to confirm (1) that the user has the
secret key and (2) that the user can be confident that the cluster has a copy
of the secret key.
A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each
monitor can authenticate users and distribute keys, so there is no single point
of failure or bottleneck when using ``cephx``. The monitor returns an
authentication data structure similar to a Kerberos ticket that contains a
session key for use in obtaining Ceph services. This session key is itself
encrypted with the user's permanent secret key, so that only the user can
request services from the Ceph Monitor(s). The client then uses the session key
to request its desired services from the monitor, and the monitor provides the
client with a ticket that will authenticate the client to the OSDs that actually
handle data. Ceph Monitors and OSDs share a secret, so the client can use the
ticket provided by the monitor with any OSD or metadata server in the cluster.
Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired
ticket or session key obtained surreptitiously. This form of authentication will
prevent attackers with access to the communications medium from either creating
bogus messages under another user's identity or altering another user's
legitimate messages, as long as the user's secret key is not divulged before it
expires.
As stated in :ref:`Scalability and High Availability
<arch_scalability_and_high_availability>`, Ceph does not have any centralized
interface between clients and the Ceph object store. By avoiding such a
centralized interface, Ceph avoids the bottlenecks that attend such centralized
interfaces. However, this means that clients must interact directly with OSDs.
Direct interactions between Ceph clients and OSDs require authenticated
connections. The ``cephx`` authentication system establishes and sustains these
authenticated connections.
To use ``cephx``, an administrator must set up users first. In the following
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
The ``cephx`` protocol operates in a manner similar to `Kerberos`_.
A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each
monitor can authenticate users and distribute keys, which means that there is
no single point of failure and no bottleneck when using ``cephx``. The monitor
returns an authentication data structure that is similar to a Kerberos ticket.
This authentication data structure contains a session key for use in obtaining
Ceph services. The session key is itself encrypted with the user's permanent
secret key, which means that only the user can request services from the Ceph
Monitors. The client then uses the session key to request services from the
monitors, and the monitors provide the client with a ticket that authenticates
the client against the OSDs that actually handle data. Ceph Monitors and OSDs
share a secret, which means that the clients can use the ticket provided by the
monitors to authenticate against any OSD or metadata server in the cluster.
Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an
expired ticket or session key that has been obtained surreptitiously. This form
of authentication prevents attackers who have access to the communications
medium from creating bogus messages under another user's identity and prevents
attackers from altering another user's legitimate messages, as long as the
user's secret key is not divulged before it expires.
An administrator must set up users before using ``cephx``. In the following
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
the command line to generate a username and secret key. Ceph's ``auth``
subsystem generates the username and key, stores a copy with the monitor(s) and
transmits the user's secret back to the ``client.admin`` user. This means that
subsystem generates the username and key, stores a copy on the monitor(s), and
transmits the user's secret back to the ``client.admin`` user. This means that
the client and the monitor share a secret key.
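As a sketch only (the user name and capabilities are placeholders; the command itself is the one named above):

    ceph auth get-or-create-key client.alice mon 'allow r' osd 'allow rw pool=mypool'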
.. note:: The ``client.admin`` user must provide the user ID and
@ -262,17 +278,16 @@ the client and the monitor share a secret key.
| transmit key |
| |
To authenticate with the monitor, the client passes in the user name to the
monitor, and the monitor generates a session key and encrypts it with the secret
key associated to the user name. Then, the monitor transmits the encrypted
ticket back to the client. The client then decrypts the payload with the shared
secret key to retrieve the session key. The session key identifies the user for
the current session. The client then requests a ticket on behalf of the user
signed by the session key. The monitor generates a ticket, encrypts it with the
user's secret key and transmits it back to the client. The client decrypts the
ticket and uses it to sign requests to OSDs and metadata servers throughout the
cluster.
Here is how a client authenticates with a monitor. The client passes the user
name to the monitor. The monitor generates a session key that is encrypted with
the secret key associated with the ``username``. The monitor transmits the
encrypted ticket to the client. The client uses the shared secret key to
decrypt the payload. The session key identifies the user, and this act of
identification will last for the duration of the session. The client requests
a ticket for the user, and the ticket is signed with the session key. The
monitor generates a ticket and uses the user's secret key to encrypt it. The
encrypted ticket is transmitted to the client. The client decrypts the ticket
and uses it to sign requests to OSDs and to metadata servers in the cluster.
.. ditaa::
@ -302,10 +317,11 @@ cluster.
|<----+ |
The ``cephx`` protocol authenticates ongoing communications between the client
machine and the Ceph servers. Each message sent between a client and server,
subsequent to the initial authentication, is signed using a ticket that the
monitors, OSDs and metadata servers can verify with their shared secret.
The ``cephx`` protocol authenticates ongoing communications between the clients
and Ceph daemons. After initial authentication, each message sent between a
client and a daemon is signed using a ticket that can be verified by monitors,
OSDs, and metadata daemons. This ticket is verified by using the secret shared
between the client and the daemon.
.. ditaa::
@ -341,83 +357,93 @@ monitors, OSDs and metadata servers can verify with their shared secret.
|<-------------------------------------------|
receive response
The protection offered by this authentication is between the Ceph client and the
Ceph server hosts. The authentication is not extended beyond the Ceph client. If
the user accesses the Ceph client from a remote host, Ceph authentication is not
This authentication protects only the connections between Ceph clients and Ceph
daemons. The authentication is not extended beyond the Ceph client. If a user
accesses the Ceph client from a remote host, cephx authentication will not be
applied to the connection between the user's host and the client host.
See `Cephx Config Guide`_ for more on configuration details.
For configuration details, see `Cephx Config Guide`_. For user management
details, see `User Management`_.
See `User Management`_ for more on user management.
See :ref:`A Detailed Description of the Cephx Authentication Protocol
<cephx_2012_peter>` for more on the distinction between authorization and
authentication and for a step-by-step explanation of the setup of ``cephx``
tickets and session keys.
.. index:: architecture; smart daemons and scalability
Smart Daemons Enable Hyperscale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A feature of many storage clusters is a centralized interface that keeps track
of the nodes that clients are permitted to access. Such centralized
architectures provide services to clients by means of a double dispatch. At the
petabyte-to-exabyte scale, such double dispatches are a significant
bottleneck.
In many clustered architectures, the primary purpose of cluster membership is
so that a centralized interface knows which nodes it can access. Then the
centralized interface provides services to the client through a double
dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.
Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are
cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph
OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly
with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being
cluster-aware makes it possible for Ceph clients to interact directly with Ceph
OSD Daemons.
Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD
Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with
other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients
to interact directly with Ceph OSD Daemons.
Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one
another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM
resources of the nodes in the Ceph cluster. This means that a Ceph cluster can
easily perform tasks that a cluster with a centralized interface would struggle
to perform. The ability of Ceph nodes to make use of the computing power of
the greater cluster provides several benefits:
The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
nodes to easily perform tasks that would bog down a centralized server. The
ability to leverage this computing power leads to several major benefits:
#. **OSDs Service Clients Directly:** Network devices can support only a
limited number of concurrent connections. Because Ceph clients contact
Ceph OSD daemons directly without first connecting to a central interface,
Ceph enjoys improved performance and increased system capacity relative to
storage redundancy strategies that include a central interface. Ceph clients
maintain sessions only when needed, and maintain those sessions with only
particular Ceph OSD daemons, not with a centralized interface.
#. **OSDs Service Clients Directly:** Since any network device has a limit to
the number of concurrent connections it can support, a centralized system
has a low physical limit at high scales. By enabling Ceph Clients to contact
Ceph OSD Daemons directly, Ceph increases both performance and total system
capacity simultaneously, while removing a single point of failure. Ceph
Clients can maintain a session when they need to, and with a particular Ceph
OSD Daemon instead of a centralized server.
#. **OSD Membership and Status**: When Ceph OSD Daemons join a cluster, they
report their status. At the lowest level, the Ceph OSD Daemon status is
``up`` or ``down``: this reflects whether the Ceph OSD daemon is running and
able to service Ceph Client requests. If a Ceph OSD Daemon is ``down`` and
``in`` the Ceph Storage Cluster, this status may indicate the failure of the
Ceph OSD Daemon. If a Ceph OSD Daemon is not running because it has crashed,
the Ceph OSD Daemon cannot notify the Ceph Monitor that it is ``down``. The
OSDs periodically send messages to the Ceph Monitor (in releases prior to
Luminous, this was done by means of ``MPGStats``, and beginning with the
Luminous release, this has been done with ``MOSDBeacon``). If the Ceph
Monitors receive no such message after a configurable period of time,
then they mark the OSD ``down``. This mechanism is a failsafe, however.
Normally, Ceph OSD Daemons determine if a neighboring OSD is ``down`` and
report it to the Ceph Monitors. This contributes to making Ceph Monitors
lightweight processes. See `Monitoring OSDs`_ and `Heartbeats`_ for
additional details.
#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
or ``down`` reflecting whether or not it is running and able to service
Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
Storage Cluster, this status may indicate the failure of the Ceph OSD
Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs
periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous,
and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that
message after a configurable period of time then it marks the OSD down.
This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
determine if a neighboring OSD is down and report it to the Ceph Monitor(s).
This assures that Ceph Monitors are lightweight processes. See `Monitoring
OSDs`_ and `Heartbeats`_ for additional details.
#. **Data Scrubbing:** To maintain data consistency, Ceph OSD Daemons scrub
RADOS objects. Ceph OSD Daemons compare the metadata of their own local
objects against the metadata of the replicas of those objects, which are
stored on other OSDs. Scrubbing occurs on a per-Placement-Group basis, finds
mismatches in object size and finds metadata mismatches, and is usually
performed daily. Ceph OSD Daemons perform deeper scrubbing by comparing the
data in objects, bit-for-bit, against their checksums. Deep scrubbing finds
bad sectors on drives that are not detectable with light scrubs. See `Data
Scrubbing`_ for details on configuring scrubbing.
#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
Ceph OSD Daemons can scrub objects. That is, Ceph OSD Daemons can compare
their local objects metadata with its replicas stored on other OSDs. Scrubbing
happens on a per-Placement Group base. Scrubbing (usually performed daily)
catches mismatches in size and other metadata. Ceph OSD Daemons also perform deeper
scrubbing by comparing data in objects bit-for-bit with their checksums.
Deep scrubbing (usually performed weekly) finds bad sectors on a drive that
weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
configuring scrubbing.
#. **Replication:** Data replication involves a collaboration between Ceph
Clients and Ceph OSD Daemons. Ceph OSD Daemons use the CRUSH algorithm to
determine the storage location of object replicas. Ceph clients use the
CRUSH algorithm to determine the storage location of an object, then the
object is mapped to a pool and to a placement group, and then the client
consults the CRUSH map to identify the placement group's primary OSD.
#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
objects should be stored (and for rebalancing). In a typical write scenario,
a client uses the CRUSH algorithm to compute where to store an object, maps
the object to a pool and placement group, then looks at the CRUSH map to
identify the primary OSD for the placement group.
The client writes the object to the identified placement group in the
primary OSD. Then, the primary OSD with its own copy of the CRUSH map
identifies the secondary and tertiary OSDs for replication purposes, and
replicates the object to the appropriate placement groups in the secondary
and tertiary OSDs (as many OSDs as additional replicas), and responds to the
client once it has confirmed the object was stored successfully.
After identifying the target placement group, the client writes the object
to the identified placement group's primary OSD. The primary OSD then
consults its own copy of the CRUSH map to identify secondary and tertiary
OSDS, replicates the object to the placement groups in those secondary and
tertiary OSDs, confirms that the object was stored successfully in the
secondary and tertiary OSDs, and reports to the client that the object
was stored successfully.
.. ditaa::
@ -444,19 +470,18 @@ ability to leverage this computing power leads to several major benefits:
| | | |
+---------------+ +---------------+
With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
clients from that duty, while ensuring high data availability and data safety.
By performing this act of data replication, Ceph OSD Daemons relieve Ceph
clients of the burden of replicating data.
Dynamic Cluster Management
--------------------------
In the `Scalability and High Availability`_ section, we explained how Ceph uses
CRUSH, cluster awareness and intelligent daemons to scale and maintain high
CRUSH, cluster topology, and intelligent daemons to scale and maintain high
availability. Key to Ceph's design is the autonomous, self-healing, and
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
enable modern cloud storage infrastructures to place data, rebalance the cluster
and recover from faults dynamically.
enable modern cloud storage infrastructures to place data, rebalance the
cluster, and recover from faults adaptively.
.. index:: architecture; pools
@ -465,10 +490,11 @@ About Pools
The Ceph storage system supports the notion of 'Pools', which are logical
partitions for storing objects.
Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
pools. The pool's ``size`` or number of replicas, the CRUSH rule and the
number of placement groups determine how Ceph will place the data.
Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write RADOS
objects to pools. The way that Ceph places the data in the pools is determined
by the pool's ``size`` or number of replicas, the CRUSH rule, and the number of
placement groups in the pool.
.. ditaa::
@ -501,20 +527,23 @@ See `Set Pool Values`_ for details.
Mapping PGs to OSDs
~~~~~~~~~~~~~~~~~~~
Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
When a Ceph Client stores objects, CRUSH will map each object to a placement
group.
Each pool has a number of placement groups (PGs) within it. CRUSH dynamically
maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS
object to a PG.
Mapping objects to placement groups creates a layer of indirection between the
Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
Client "knew" which Ceph OSD Daemon had which object, that would create a tight
coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
algorithm maps each object to a placement group and then maps each placement
group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
come online. The following diagram depicts how CRUSH maps objects to placement
groups, and placement groups to OSDs.
This mapping of RADOS objects to PGs implements an abstraction and indirection
layer between Ceph OSD Daemons and Ceph Clients. The Ceph Storage Cluster must
be able to grow (or shrink) and redistribute data adaptively when the internal
topology changes.
If the Ceph Client "knew" which Ceph OSD Daemons were storing which objects, a
tight coupling would exist between the Ceph Client and the Ceph OSD Daemon.
But Ceph avoids any such tight coupling. Instead, the CRUSH algorithm maps each
RADOS object to a placement group and then maps each placement group to one or
more Ceph OSD Daemons. This "layer of indirection" allows Ceph to rebalance
dynamically when new Ceph OSD Daemons and their underlying OSD devices come
online. The following diagram shows how the CRUSH algorithm maps objects to
placement groups, and how it maps placement groups to OSDs.
.. ditaa::
@ -540,44 +569,45 @@ groups, and placement groups to OSDs.
| | | | | | | |
\----------/ \----------/ \----------/ \----------/
With a copy of the cluster map and the CRUSH algorithm, the client can compute
exactly which OSD to use when reading or writing a particular object.
The client uses its copy of the cluster map and the CRUSH algorithm to compute
precisely which OSD it will use when reading or writing a particular object.
.. index:: architecture; calculating PG IDs
Calculating PG IDs
~~~~~~~~~~~~~~~~~~
When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
`Cluster Map`_. With the cluster map, the client knows about all of the monitors,
OSDs, and metadata servers in the cluster. **However, it doesn't know anything
about object locations.**
When a Ceph Client binds to a Ceph Monitor, it retrieves the latest version of
the `Cluster Map`_. When a client has been equipped with a copy of the cluster
map, it is aware of all the monitors, OSDs, and metadata servers in the
cluster. **However, even equipped with a copy of the latest version of the
cluster map, the client doesn't know anything about object locations.**
.. epigraph::
**Object locations must be computed.**
Object locations get computed.
The client requires only the object ID and the name of the pool in order to
compute the object location.
Ceph stores data in named pools (for example, "liverpool"). When a client
stores a named object (for example, "john", "paul", "george", or "ringo") it
calculates a placement group by using the object name, a hash code, the number
of PGs in the pool, and the pool name. Ceph clients use the following steps to
compute PG IDs.
The only input required by the client is the object ID and the pool.
It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.)
it calculates a placement group using the object name, a hash code, the
number of PGs in the pool and the pool name. Ceph clients use the following
steps to compute PG IDs.
#. The client inputs the pool name and the object ID. (for example: pool =
"liverpool" and object-id = "john")
#. Ceph hashes the object ID.
#. Ceph calculates the hash, modulo the number of PGs (for example: ``58``), to
get a PG ID.
#. Ceph uses the pool name to retrieve the pool ID: (for example: "liverpool" =
``4``)
#. Ceph prepends the pool ID to the PG ID (for example: ``4.58``).
#. The client inputs the pool name and the object ID. (e.g., pool = "liverpool"
and object-id = "john")
#. Ceph takes the object ID and hashes it.
#. Ceph calculates the hash modulo the number of PGs (e.g., ``58``) to get
a PG ID.
#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``).
#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
Computing object locations is much faster than performing object location query
over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
Hashing)` algorithm allows a client to compute where objects *should* be stored,
and enables the client to contact the primary OSD to store or retrieve the
objects.
It is much faster to compute object locations than to perform object location
query over a chatty session. The :abbr:`CRUSH (Controlled Replication Under
Scalable Hashing)` algorithm allows a client to compute where objects are
expected to be stored, and enables the client to contact the primary OSD to
store or retrieve the objects.
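This client-side computation can be cross-checked from the command line. For example,
assuming a hypothetical pool named ``liverpool`` and an object named ``john``, the
following command shows the placement group and the OSDs that the object maps to::

   ceph osd map liverpool john

The output includes the PG ID (for example, ``4.58``) along with the up and acting
sets of OSDs that CRUSH computed for that PG.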
.. index:: architecture; PG Peering
@ -585,46 +615,51 @@ Peering and Sets
~~~~~~~~~~~~~~~~
In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons
do is called 'peering', which is the process of bringing all of the OSDs that
store a Placement Group (PG) into agreement about the state of all of the
objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report
Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve
themselves; however, if the problem persists, you may need to refer to the
`Troubleshooting Peering Failure`_ section.
heartbeats and report back to Ceph Monitors. Ceph OSD daemons also 'peer',
which is the process of bringing all of the OSDs that store a Placement Group
(PG) into agreement about the state of all of the RADOS objects (and their
metadata) in that PG. Ceph OSD Daemons `Report Peering Failure`_ to the Ceph
Monitors. Peering issues usually resolve themselves; however, if the problem
persists, you may need to refer to the `Troubleshooting Peering Failure`_
section.
.. Note:: Agreeing on the state does not mean that the PGs have the latest contents.
.. Note:: PGs that have agreed on the state of their objects do not
necessarily have the current data yet.
The Ceph Storage Cluster was designed to store at least two copies of an object
(i.e., ``size = 2``), which is the minimum requirement for data safety. For high
availability, a Ceph Storage Cluster should store more than two copies of an object
(e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a
``degraded`` state while maintaining data safety.
(that is, ``size = 2``), which is the minimum requirement for data safety. For
high availability, a Ceph Storage Cluster should store more than two copies of
an object (that is, ``size = 3`` and ``min size = 2``) so that it can continue
to run in a ``degraded`` state while maintaining data safety.
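As a brief illustration of these settings (the pool name ``mypool`` is hypothetical),
the replica counts of a replicated pool can be adjusted with commands of the following
form::

   ceph osd pool set mypool size 3
   ceph osd pool set mypool min_size 2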
Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but
rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
the *Primary* is the first OSD in the *Acting Set*, and is responsible for
coordinating the peering process for each placement group where it acts as
the *Primary*, and is the **ONLY** OSD that that will accept client-initiated
writes to objects for a given placement group where it acts as the *Primary*.
.. warning:: Although we say here that R2 (replication with two copies) is the
minimum requirement for data safety, R3 (replication with three copies) is
recommended. On a long enough timeline, data stored with an R2 strategy will
be lost.
When a series of OSDs are responsible for a placement group, that series of
OSDs, we refer to them as an *Acting Set*. An *Acting Set* may refer to the Ceph
OSD Daemons that are currently responsible for the placement group, or the Ceph
OSD Daemons that were responsible for a particular placement group as of some
As explained in the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (for example, ``osd.0``, ``osd.1``,
etc.), but rather refer to them as *Primary*, *Secondary*, and so forth. By
convention, the *Primary* is the first OSD in the *Acting Set*, and is
responsible for orchestrating the peering process for each placement group
where it acts as the *Primary*. The *Primary* is the **ONLY** OSD in a given
placement group that accepts client-initiated writes to objects.
The set of OSDs that is responsible for a placement group is called the
*Acting Set*. The term "*Acting Set*" can refer either to the Ceph OSD Daemons
that are currently responsible for the placement group, or to the Ceph OSD
Daemons that were responsible for a particular placement group as of some
epoch.
The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``.
When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up
Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD
Daemons when an OSD fails.
.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
removed from the *Up Set*.
The Ceph OSD daemons that are part of an *Acting Set* might not always be
``up``. When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*.
The *Up Set* is an important distinction, because Ceph can remap PGs to other
Ceph OSD Daemons when an OSD fails.
.. note:: Consider a hypothetical *Acting Set* for a PG that contains
``osd.25``, ``osd.32`` and ``osd.61``. The first OSD (``osd.25``) is the
*Primary*. If that OSD fails, the Secondary (``osd.32``) becomes the
*Primary*, and ``osd.25`` is removed from the *Up Set*.
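The up set and acting set of any PG can be inspected directly. For example, for a
hypothetical PG ``4.58``::

   ceph pg map 4.58

The output reports the OSD map epoch together with the up set and the acting set,
which are identical during normal operation and diverge temporarily while a PG is
being remapped.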
.. index:: architecture; Rebalancing
@ -1467,11 +1502,11 @@ Ceph Clients
Ceph Clients include a number of service interfaces. These include:
- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
provides resizable, thin-provisioned block devices with snapshotting and
cloning. Ceph stripes a block device across the cluster for high
performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
that uses ``librbd`` directly--avoiding the kernel object overhead for
- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
provides resizable, thin-provisioned block devices that can be snapshotted
and cloned. Ceph stripes a block device across the cluster for high
performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
that uses ``librbd`` directly--avoiding the kernel object overhead for
virtualized systems.
- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service

View File

@ -11,9 +11,9 @@ Run a command of this form to list hosts associated with the cluster:
.. prompt:: bash #
ceph orch host ls [--format yaml] [--host-pattern <name>] [--label <label>] [--host-status <status>]
ceph orch host ls [--format yaml] [--host-pattern <name>] [--label <label>] [--host-status <status>] [--detail]
In commands of this form, the arguments "host-pattern", "label" and
In commands of this form, the arguments "host-pattern", "label", and
"host-status" are optional and are used for filtering.
- "host-pattern" is a regex that matches against hostnames and returns only
@ -25,6 +25,16 @@ In commands of this form, the arguments "host-pattern", "label" and
against name, label and status simultaneously, or to filter against any
proper subset of name, label and status.
The "detail" parameter provides more host related information for cephadm based
clusters. For example:
.. prompt:: bash #
# ceph orch host ls --detail
HOSTNAME ADDRESS LABELS STATUS VENDOR/MODEL CPU HDD SSD NIC
ceph-master 192.168.122.73 _admin QEMU (Standard PC (Q35 + ICH9, 2009)) 4C/4T 4/1.6TB - 1
1 hosts in cluster
.. _cephadm-adding-hosts:
Adding Hosts
@ -193,10 +203,18 @@ Place a host in and out of maintenance mode (stops all Ceph daemons on host):
.. prompt:: bash #
ceph orch host maintenance enter <hostname> [--force]
ceph orch host maintenance enter <hostname> [--force] [--yes-i-really-mean-it]
ceph orch host maintenance exit <hostname>
Where the force flag when entering maintenance allows the user to bypass warnings (but not alerts)
The ``--force`` flag allows the user to bypass warnings (but not alerts). The ``--yes-i-really-mean-it``
flag bypasses all safety checks and will attempt to force the host into maintenance mode no
matter what.
.. warning:: Using the --yes-i-really-mean-it flag to force the host to enter maintenance
mode can potentially cause loss of data availability, the mon quorum to break down due
to too few running monitors, mgr module commands (such as ``ceph orch ...`` commands)
to become unresponsive, and a number of other possible issues. Please only use this
flag if you're absolutely certain you know what you're doing.
See also :ref:`cephadm-fqdn`
@ -269,7 +287,7 @@ create a new CRUSH host located in the specified hierarchy.
.. note::
The ``location`` attribute will only affect the initial CRUSH location. Subsequent
changes of the ``location`` property will be ignored. Also, removing a host will no remove
changes of the ``location`` property will be ignored. Also, removing a host will not remove
any CRUSH buckets.
See also :ref:`crush_map_default_types`.

View File

@ -142,6 +142,9 @@ cluster's first "monitor daemon", and that monitor daemon needs an IP address.
You must pass the IP address of the Ceph cluster's first host to the ``ceph
bootstrap`` command, so you'll need to know the IP address of that host.
.. important:: ``ssh`` must be installed and running in order for the
bootstrapping procedure to succeed.
.. note:: If there are multiple networks and interfaces, be sure to choose one
that will be accessible by any host accessing the Ceph cluster.
@ -288,18 +291,21 @@ its status with:
Adding Hosts
============
Next, add all hosts to the cluster by following :ref:`cephadm-adding-hosts`.
Add all hosts to the cluster by following the instructions in
:ref:`cephadm-adding-hosts`.
By default, a ``ceph.conf`` file and a copy of the ``client.admin`` keyring
are maintained in ``/etc/ceph`` on all hosts with the ``_admin`` label, which is initially
applied only to the bootstrap host. We usually recommend that one or more other hosts be
given the ``_admin`` label so that the Ceph CLI (e.g., via ``cephadm shell``) is easily
accessible on multiple hosts. To add the ``_admin`` label to additional host(s):
By default, a ``ceph.conf`` file and a copy of the ``client.admin`` keyring are
maintained in ``/etc/ceph`` on all hosts that have the ``_admin`` label. This
label is initially applied only to the bootstrap host. We usually recommend
that one or more other hosts be given the ``_admin`` label so that the Ceph CLI
(for example, via ``cephadm shell``) is easily accessible on multiple hosts. To add
the ``_admin`` label to additional host(s), run a command of the following form:
.. prompt:: bash #
ceph orch host label add *<host>* _admin
Adding additional MONs
======================

View File

@ -676,6 +676,22 @@ To disable the automatic management of dameons, set ``unmanaged=True`` in the
ceph orch apply -i mgr.yaml
Cephadm also supports setting the unmanaged parameter to true or false
using the ``ceph orch set-unmanaged`` and ``ceph orch set-managed`` commands.
The commands take the service name (as reported in ``ceph orch ls``) as
the only argument. For example,
.. prompt:: bash #
ceph orch set-unmanaged mon
would set ``unmanaged: true`` for the mon service and
.. prompt:: bash #
ceph orch set-managed mon
would set ``unmanaged: false`` for the mon service.
.. note::
@ -683,6 +699,13 @@ To disable the automatic management of dameons, set ``unmanaged=True`` in the
longer deploy any new daemons (even if the placement specification matches
additional hosts).
.. note::
The "osd" service used to track OSDs that are not tied to any specific
service spec is special and will always be marked unmanaged. Attempting
to modify it with ``ceph orch set-unmanaged`` or ``ceph orch set-managed``
will result in a message ``No service of name osd found. Check "ceph orch ls" for all known services``
Deploying a daemon on a host manually
-------------------------------------

View File

@ -20,7 +20,18 @@ For example:
ceph fs volume create <fs_name> --placement="<placement spec>"
where ``fs_name`` is the name of the CephFS and ``placement`` is a
:ref:`orchestrator-cli-placement-spec`.
:ref:`orchestrator-cli-placement-spec`. For example, to place
MDS daemons for the new ``foo`` volume on hosts labeled with ``mds``:
.. prompt:: bash #
ceph fs volume create foo --placement="label:mds"
You can also update the placement after-the-fact via:
.. prompt:: bash #
ceph orch apply mds foo 'mds-[012]'
For manually deploying MDS daemons, use this specification:
@ -30,6 +41,7 @@ For manually deploying MDS daemons, use this specification:
service_id: fs_name
placement:
count: 3
label: mds
The specification can then be applied using:

View File

@ -4,8 +4,8 @@
MGR Service
===========
The cephadm MGR service is hosting different modules, like the :ref:`mgr-dashboard`
and the cephadm manager module.
The cephadm MGR service hosts multiple modules. These include the
:ref:`mgr-dashboard` and the cephadm manager module.
.. _cephadm-mgr-networks:

View File

@ -170,6 +170,64 @@ network ``10.1.2.0/24``, run the following commands:
ceph orch apply mon --placement="newhost1,newhost2,newhost3"
Setting Crush Locations for Monitors
------------------------------------
Cephadm supports setting CRUSH locations for mon daemons
using the mon service spec. The CRUSH locations are set
by hostname. When cephadm deploys a mon on a host that matches
a hostname specified in the CRUSH locations, it will add
``--set-crush-location <CRUSH-location>`` where the CRUSH location
is the first entry in the list of CRUSH locations for that
host. If multiple CRUSH locations are set for one host, cephadm
will attempt to set the additional locations using the
"ceph mon set_location" command.
.. note::
Setting the CRUSH location in the spec is the recommended way of
replacing tiebreaker mon daemons, as they require having a location
set when they are added.
.. note::
Tiebreaker mon daemons are a part of stretch mode clusters. For more
info on stretch mode clusters see :ref:`stretch_mode`
Example syntax for setting the CRUSH locations:
.. code-block:: yaml
service_type: mon
service_name: mon
placement:
count: 5
spec:
crush_locations:
host1:
- datacenter=a
host2:
- datacenter=b
- rack=2
host3:
- datacenter=a
.. note::
Sometimes, based on the timing of mon daemons being admitted to the mon
quorum, cephadm may fail to set the CRUSH location for some mon daemons
when multiple locations are specified. In this case, the recommended
action is to re-apply the same mon spec to retrigger the service action.
.. note::
Mon daemons will only get the ``--set-crush-location`` flag set when cephadm
actually deploys them. This means if a spec is applied that includes a CRUSH
location for a mon that is already deployed, the flag may not be set until
a redeploy command is issued for that mon daemon.
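For example, to trigger such a redeploy for a hypothetical mon daemon named
``mon.host1``, run a command of the following form:

.. prompt:: bash #

   ceph orch daemon redeploy mon.host1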
Further Reading
===============

View File

@ -197,12 +197,26 @@ configuration files for monitoring services.
Internally, cephadm already uses `Jinja2
<https://jinja.palletsprojects.com/en/2.11.x/>`_ templates to generate the
configuration files for all monitoring components. To be able to customize the
configuration of Prometheus, Grafana or the Alertmanager it is possible to store
a Jinja2 template for each service that will be used for configuration
generation instead. This template will be evaluated every time a service of that
kind is deployed or reconfigured. That way, the custom configuration is
preserved and automatically applied on future deployments of these services.
configuration files for all monitoring components. Starting from version 17.2.3,
cephadm supports Prometheus http service discovery, and uses this endpoint for the
definition and management of the embedded Prometheus service. The endpoint listens on
``https://<mgr-ip>:8765/sd/`` (the port is
configurable through the variable ``service_discovery_port``) and returns scrape target
information in `http_sd_config format
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config/>`_.
Users who run an external monitoring stack can use the `ceph-mgr` service discovery
endpoint to get the scraping configuration. The root certificate of the server can be
obtained by running the following command:
.. prompt:: bash #
ceph orch sd dump cert
The configuration of Prometheus, Grafana, or Alertmanager may be customized by storing
a Jinja2 template for each service. This template will be evaluated every time a service
of that kind is deployed or reconfigured. That way, the custom configuration is preserved
and automatically applied on future deployments of these services.
.. note::
@ -292,6 +306,21 @@ cluster.
By default, ceph-mgr presents prometheus metrics on port 9283 on each host
running a ceph-mgr daemon. Configure prometheus to scrape these.
To make this integration easier, cephadm provides a service discovery endpoint at
``https://<mgr-ip>:8765/sd/``. This endpoint can be used by an external
Prometheus server to retrieve target information for a specific service. Information returned
by this endpoint uses the format specified by the Prometheus `http_sd_config option
<https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config/>`_.

Here is an example Prometheus job definition that uses the cephadm service discovery endpoint:

.. code-block:: yaml
- job_name: 'ceph-exporter'
http_sd_configs:
- url: http://<mgr-ip>:8765/sd/prometheus/sd-config?service=ceph-exporter
* To enable the dashboard's prometheus-based alerting, see :ref:`dashboard-alerting`.
* To enable dashboard integration with Grafana, see :ref:`dashboard-grafana`.
@ -429,6 +458,28 @@ Then apply this specification:
Grafana will now create an admin user called ``admin`` with the
given password.
Turning off anonymous access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By default, cephadm allows anonymous users (users who have not provided any
login information) limited, viewer-only access to the Grafana dashboard. To
configure Grafana so that only logged-in users can view the dashboard, set
``anonymous_access: False`` in your Grafana spec.
.. code-block:: yaml
service_type: grafana
placement:
hosts:
- host1
spec:
anonymous_access: False
initial_admin_password: "mypassword"
Since deploying grafana with anonymous access set to false without an initial
admin password set would make the dashboard inaccessible, cephadm requires
setting the ``initial_admin_password`` when ``anonymous_access`` is set to false.
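As with the other service specifications on this page, the spec can then be applied
(assuming it was saved to a hypothetical file named ``grafana.yaml``):

.. prompt:: bash #

   ceph orch apply -i grafana.yaml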
Setting up Alertmanager
-----------------------

View File

@ -113,6 +113,54 @@ A few notes:
a *port* property that is not 2049 to avoid conflicting with the
ingress service, which could be placed on the same host(s).
NFS with virtual IP but no haproxy
----------------------------------
Cephadm also supports deploying NFS with keepalived but without haproxy. This
offers a virtual IP, managed by keepalived, that the NFS daemon can bind to
directly instead of having traffic go through haproxy.
In this setup, you'll either want to set up the service using the nfs module
(see :ref:`nfs-module-cluster-create`) or place the ingress service first, so
that the virtual IP is present for the NFS daemon to bind to. The ingress service
should include the attribute ``keepalive_only`` set to true. For example:
.. code-block:: yaml
service_type: ingress
service_id: nfs.foo
placement:
count: 1
hosts:
- host1
- host2
- host3
spec:
backend_service: nfs.foo
monitor_port: 9049
virtual_ip: 192.168.122.100/24
keepalive_only: true
Then, an nfs service could be created that specifies a ``virtual_ip`` attribute
that will tell it to bind to that specific IP.
.. code-block:: yaml
service_type: nfs
service_id: foo
placement:
count: 1
hosts:
- host1
- host2
- host3
spec:
port: 2049
virtual_ip: 192.168.122.100
Note that in these setups, one should make sure to include ``count: 1`` in the
nfs placement, as it's only possible for one nfs daemon to bind to the virtual IP.
Further Reading
===============

View File

@ -308,7 +308,7 @@ Replacing an OSD
.. prompt:: bash #
orch osd rm <osd_id(s)> --replace [--force]
ceph orch osd rm <osd_id(s)> --replace [--force]
Example:

View File

@ -83,23 +83,23 @@ To deploy RGWs serving the multisite *myorg* realm and the *us-east-1* zone on
.. prompt:: bash #
ceph orch apply rgw east --realm=myorg --zone=us-east-1 --placement="2 myhost1 myhost2"
ceph orch apply rgw east --realm=myorg --zonegroup=us-east-zg-1 --zone=us-east-1 --placement="2 myhost1 myhost2"
Note that in a multisite situation, cephadm only deploys the daemons. It does not create
or update the realm or zone configurations. To create a new realm and zone, you need to do
something like:
or update the realm or zone configurations. To create new realms, zones, and zonegroups,
you can use the :ref:`mgr-rgw-module` or create them manually with commands like the following:
.. prompt:: bash #
radosgw-admin realm create --rgw-realm=<realm-name> --default
.. prompt:: bash #
radosgw-admin zonegroup create --rgw-zonegroup=<zonegroup-name> --master --default
radosgw-admin realm create --rgw-realm=<realm-name>
.. prompt:: bash #
radosgw-admin zone create --rgw-zonegroup=<zonegroup-name> --rgw-zone=<zone-name> --master --default
radosgw-admin zonegroup create --rgw-zonegroup=<zonegroup-name> --master
.. prompt:: bash #
radosgw-admin zone create --rgw-zonegroup=<zonegroup-name> --rgw-zone=<zone-name> --master
.. prompt:: bash #
@ -212,12 +212,14 @@ It is a yaml format file with the following properties:
- host2
- host3
spec:
backend_service: rgw.something # adjust to match your existing RGW service
virtual_ip: <string>/<string> # ex: 192.168.20.1/24
frontend_port: <integer> # ex: 8080
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
ssl_cert: | # optional: SSL certificate and key
backend_service: rgw.something # adjust to match your existing RGW service
virtual_ip: <string>/<string> # ex: 192.168.20.1/24
frontend_port: <integer> # ex: 8080
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
use_keepalived_multicast: <bool> # optional: Default is False.
vrrp_interface_network: <string>/<string> # optional: ex: 192.168.20.0/24
ssl_cert: | # optional: SSL certificate and key
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
@ -243,6 +245,7 @@ It is a yaml format file with the following properties:
frontend_port: <integer> # ex: 8080
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
first_virtual_router_id: <integer> # optional: default 50
ssl_cert: | # optional: SSL certificate and key
-----BEGIN CERTIFICATE-----
...
@ -276,6 +279,21 @@ where the properties of this service specification are:
* ``ssl_cert``:
SSL certificate, if SSL is to be enabled. This must contain the both the certificate and
private key blocks in .pem format.
* ``use_keepalived_multicast``
Default is False. By default, cephadm deploys a keepalived configuration that uses
unicast IPs, namely the same IPs that cephadm uses to connect to the hosts. If
multicast is preferred, set ``use_keepalived_multicast`` to ``True`` and keepalived
will use the multicast IP (224.0.0.18) to communicate between instances, on the same
interfaces that carry the VIPs.
* ``vrrp_interface_network``
By default, cephadm configures keepalived to use the same interface that carries the
VIPs for VRRP communication. If another interface is needed, set
``vrrp_interface_network`` to a network that identifies which ethernet interface to use.
* ``first_virtual_router_id``
Default is 50. When deploying more than one ingress service, this parameter can be
used to ensure that each keepalived instance uses a different ``virtual_router_id``.
When ``virtual_ips_list`` is used, each IP creates its own virtual router, so the
first one uses ``first_virtual_router_id``, the second uses
``first_virtual_router_id`` + 1, and so on. Valid values range from 1 to 255.
.. _ingress-virtual-ip:

View File

@ -15,7 +15,7 @@ creation of multiple file systems use ``ceph fs flag set enable_multiple true``.
::
fs new <file system name> <metadata pool name> <data pool name>
ceph fs new <file system name> <metadata pool name> <data pool name>
This command creates a new file system. The file system name and metadata pool
name are self-explanatory. The specified data pool is the default data pool and
@ -25,19 +25,19 @@ to accommodate the new file system.
::
fs ls
ceph fs ls
List all file systems by name.
::
fs lsflags <file system name>
ceph fs lsflags <file system name>
List all the flags set on a file system.
::
fs dump [epoch]
ceph fs dump [epoch]
This dumps the FSMap at the given epoch (default: current) which includes all
file system settings, MDS daemons and the ranks they hold, and the list of
@ -46,7 +46,7 @@ standby MDS daemons.
::
fs rm <file system name> [--yes-i-really-mean-it]
ceph fs rm <file system name> [--yes-i-really-mean-it]
Destroy a CephFS file system. This wipes information about the state of the
file system from the FSMap. The metadata pool and data pools are untouched and
@ -54,28 +54,28 @@ must be destroyed separately.
::
fs get <file system name>
ceph fs get <file system name>
Get information about the named file system, including settings and ranks. This
is a subset of the same information from the ``fs dump`` command.
is a subset of the same information from the ``ceph fs dump`` command.
::
fs set <file system name> <var> <val>
ceph fs set <file system name> <var> <val>
Change a setting on a file system. These settings are specific to the named
file system and do not affect other file systems.
::
fs add_data_pool <file system name> <pool name/id>
ceph fs add_data_pool <file system name> <pool name/id>
Add a data pool to the file system. This pool can be used for file layouts
as an alternate location to store file data.
::
fs rm_data_pool <file system name> <pool name/id>
ceph fs rm_data_pool <file system name> <pool name/id>
This command removes the specified pool from the list of data pools for the
file system. If any files have layouts for the removed data pool, the file
@ -84,7 +84,7 @@ system) cannot be removed.
::
fs rename <file system name> <new file system name> [--yes-i-really-mean-it]
ceph fs rename <file system name> <new file system name> [--yes-i-really-mean-it]
Rename a Ceph file system. This also changes the application tags on the data
pools and metadata pool of the file system to the new file system name.
@ -98,7 +98,7 @@ Settings
::
fs set <fs name> max_file_size <size in bytes>
ceph fs set <fs name> max_file_size <size in bytes>
CephFS has a configurable maximum file size, and it's 1TB by default.
You may wish to set this limit higher if you expect to store large files
@ -132,13 +132,13 @@ Taking a CephFS cluster down is done by setting the down flag:
::
fs set <fs_name> down true
ceph fs set <fs_name> down true
To bring the cluster back online:
::
fs set <fs_name> down false
ceph fs set <fs_name> down false
This will also restore the previous value of max_mds. MDS daemons are brought
down in a way such that journals are flushed to the metadata pool and all
@ -149,11 +149,11 @@ Taking the cluster down rapidly for deletion or disaster recovery
-----------------------------------------------------------------
To allow rapidly deleting a file system (for testing) or to quickly bring the
file system and MDS daemons down, use the ``fs fail`` command:
file system and MDS daemons down, use the ``ceph fs fail`` command:
::
fs fail <fs_name>
ceph fs fail <fs_name>
This command sets a file system flag to prevent standbys from
activating on the file system (the ``joinable`` flag).
@ -162,7 +162,7 @@ This process can also be done manually by doing the following:
::
fs set <fs_name> joinable false
ceph fs set <fs_name> joinable false
Then the operator can fail all of the ranks which causes the MDS daemons to
respawn as standbys. The file system will be left in a degraded state.
@ -170,7 +170,7 @@ respawn as standbys. The file system will be left in a degraded state.
::
# For all ranks, 0-N:
mds fail <fs_name>:<n>
ceph mds fail <fs_name>:<n>
Once all ranks are inactive, the file system may also be deleted or left in
this state for other purposes (perhaps disaster recovery).
@ -179,7 +179,7 @@ To bring the cluster back up, simply set the joinable flag:
::
fs set <fs_name> joinable true
ceph fs set <fs_name> joinable true
Daemons
@ -198,34 +198,35 @@ Commands to manipulate MDS daemons:
::
mds fail <gid/name/role>
ceph mds fail <gid/name/role>
Mark an MDS daemon as failed. This is equivalent to what the cluster
would do if an MDS daemon had failed to send a message to the mon
for ``mds_beacon_grace`` seconds. If the daemon was active and a suitable
standby is available, using ``mds fail`` will force a failover to the standby.
standby is available, using ``ceph mds fail`` will force a failover to the
standby.
If the MDS daemon was in reality still running, then using ``mds fail``
If the MDS daemon was in reality still running, then using ``ceph mds fail``
will cause the daemon to restart. If it was active and a standby was
available, then the "failed" daemon will return as a standby.
::
tell mds.<daemon name> command ...
ceph tell mds.<daemon name> command ...
Send a command to the MDS daemon(s). Use ``mds.*`` to send a command to all
daemons. Use ``ceph tell mds.* help`` to learn available commands.
::
mds metadata <gid/name/role>
ceph mds metadata <gid/name/role>
Get metadata about the given MDS known to the Monitors.
::
mds repaired <role>
ceph mds repaired <role>
Mark the file system rank as repaired. Contrary to what the name suggests, this
command does not change an MDS; it manipulates the file system rank which has been
@ -244,14 +245,14 @@ Commands to manipulate required client features of a file system:
::
fs required_client_features <fs name> add reply_encoding
fs required_client_features <fs name> rm reply_encoding
ceph fs required_client_features <fs name> add reply_encoding
ceph fs required_client_features <fs name> rm reply_encoding
To list all CephFS features
::
fs feature ls
ceph fs feature ls
Clients that are missing newly added features will be evicted automatically.
@ -346,7 +347,7 @@ Global settings
::
fs flag set <flag name> <flag val> [<confirmation string>]
ceph fs flag set <flag name> <flag val> [<confirmation string>]
Sets a global CephFS flag (i.e. not specific to a particular file system).
Currently, the only flag setting is 'enable_multiple' which allows having
@ -368,13 +369,13 @@ file system.
::
mds rmfailed
ceph mds rmfailed
This removes a rank from the failed set.
::
fs reset <file system name>
ceph fs reset <file system name>
This command resets the file system state to defaults, except for the name and
pools. Non-zero ranks are saved in the stopped set.
@ -382,7 +383,7 @@ pools. Non-zero ranks are saved in the stopped set.
::
fs new <file system name> <metadata pool name> <data pool name> --fscid <fscid> --force
ceph fs new <file system name> <metadata pool name> <data pool name> --fscid <fscid> --force
This command creates a file system with a specific **fscid** (file system cluster ID).
You may want to do this when an application expects the file system's ID to be

View File

@ -154,14 +154,8 @@ readdir. The behavior of the decay counter is the same as for cache trimming or
caps recall. Each readdir call increments the counter by the number of files in
the result.
The ratio of ``mds_max_caps_per_client`` that a client must exceed before readdir
may be throttled by the cap acquisition throttle:
.. confval:: mds_session_max_caps_throttle_ratio
The timeout in seconds after which a client request is retried due to cap
acquisition throttling:
.. confval:: mds_cap_acquisition_throttle_retry_request_timeout
If the number of caps acquired by the client per session is greater than the

View File

@ -14,6 +14,8 @@ Requirements
The primary (local) and secondary (remote) Ceph clusters version should be Pacific or later.
.. _cephfs_mirroring_creating_users:
Creating Users
--------------
@ -42,80 +44,155 @@ Mirror daemon should be spawned using `systemctl(1)` unit files::
$ cephfs-mirror --id mirror --cluster site-a -f
.. note:: User used here is `mirror` created in the `Creating Users` section.
.. note:: The user specified here is `mirror`, the creation of which is
described in the :ref:`Creating Users<cephfs_mirroring_creating_users>`
section.
Multiple ``cephfs-mirror`` daemons may be deployed for concurrent
synchronization and high availability. Mirror daemons share the synchronization
load using a simple ``M/N`` policy, where ``M`` is the number of directories
and ``N`` is the number of ``cephfs-mirror`` daemons.
When ``cephadm`` is used to manage a Ceph cluster, ``cephfs-mirror`` daemons can be
deployed by running the following command:
.. prompt:: bash $
ceph orch apply cephfs-mirror
To deploy multiple mirror daemons, run a command of the following form:
.. prompt:: bash $
ceph orch apply cephfs-mirror --placement=<placement-spec>
For example, to deploy 3 `cephfs-mirror` daemons on different hosts, run a command of the following form:
.. prompt:: bash $
ceph orch apply cephfs-mirror --placement="3 host1,host2,host3"
Interface
---------
`Mirroring` module (manager plugin) provides interfaces for managing directory snapshot
mirroring. Manager interfaces are (mostly) wrappers around monitor commands for managing
file system mirroring and is the recommended control interface.
The `Mirroring` module (manager plugin) provides interfaces for managing
directory snapshot mirroring. These are (mostly) wrappers around monitor
commands for managing file system mirroring and are the recommended control
interface.
Mirroring Module
----------------
The mirroring module is responsible for assigning directories to mirror daemons for
synchronization. Multiple mirror daemons can be spawned to achieve concurrency in
directory snapshot synchronization. When mirror daemons are spawned (or terminated)
, the mirroring module discovers the modified set of mirror daemons and rebalances
the directory assignment amongst the new set thus providing high-availability.
The mirroring module is responsible for assigning directories to mirror daemons
for synchronization. Multiple mirror daemons can be spawned to achieve
concurrency in directory snapshot synchronization. When mirror daemons are
spawned (or terminated), the mirroring module discovers the modified set of
mirror daemons and rebalances directory assignments across the new set, thus
providing high-availability.
.. note:: Multiple mirror daemons is currently untested. Only a single mirror daemon
is recommended.
.. note:: Deploying a single mirror daemon is recommended. Running multiple
daemons is untested.
Mirroring module is disabled by default. To enable mirroring use::
The mirroring module is disabled by default. To enable the mirroring module,
run the following command:
$ ceph mgr module enable mirroring
.. prompt:: bash $
Mirroring module provides a family of commands to control mirroring of directory
snapshots. To add or remove directories, mirroring needs to be enabled for a given
file system. To enable mirroring use::
ceph mgr module enable mirroring
$ ceph fs snapshot mirror enable <fs_name>
The mirroring module provides a family of commands that can be used to control
the mirroring of directory snapshots. To add or remove directories, mirroring
must be enabled for a given file system. To enable mirroring for a given file
system, run a command of the following form:
.. note:: Mirroring module commands use `fs snapshot mirror` prefix as compared to
the monitor commands which `fs mirror` prefix. Make sure to use module
commands.
.. prompt:: bash $
To disable mirroring, use::
ceph fs snapshot mirror enable <fs_name>
$ ceph fs snapshot mirror disable <fs_name>
.. note:: "Mirroring module" commands are prefixed with ``fs snapshot mirror``.
This distinguishes them from "monitor commands", which are prefixed with ``fs
mirror``. Be sure (in this context) to use module commands.
Once mirroring is enabled, add a peer to which directory snapshots are to be mirrored.
Peers follow `<client>@<cluster>` specification and get assigned a unique-id (UUID)
when added. See `Creating Users` section on how to create Ceph users for mirroring.
To disable mirroring for a given file system, run a command of the following form:
To add a peer use::
.. prompt:: bash $
$ ceph fs snapshot mirror peer_add <fs_name> <remote_cluster_spec> [<remote_fs_name>] [<remote_mon_host>] [<cephx_key>]
ceph fs snapshot mirror disable <fs_name>
`<remote_fs_name>` is optional, and defaults to `<fs_name>` (on the remote cluster).
After mirroring is enabled, add a peer to which directory snapshots are to be
mirrored. Peers are specified by the ``<client>@<cluster>`` format, which is
referred to elsewhere in this document as the ``remote_cluster_spec``. Peers
are assigned a unique-id (UUID) when added. See the :ref:`Creating
Users<cephfs_mirroring_creating_users>` section for instructions that describe
how to create Ceph users for mirroring.
This requires the remote cluster ceph configuration and user keyring to be available in
the primary cluster. See `Bootstrap Peers` section to avoid this. `peer_add` additionally
supports passing the remote cluster monitor address and the user key. However, bootstrapping
a peer is the recommended way to add a peer.
To add a peer, run a command of the following form:
.. prompt:: bash $
ceph fs snapshot mirror peer_add <fs_name> <remote_cluster_spec> [<remote_fs_name>] [<remote_mon_host>] [<cephx_key>]
``<remote_cluster_spec>`` is of the format ``client.<id>@<cluster_name>``.
``<remote_fs_name>`` is optional, and defaults to `<fs_name>` (on the remote
cluster).
For this command to succeed, the remote cluster's Ceph configuration and user
keyring must be available in the primary cluster. For example, if a user named
``mirror_remote`` is created on the remote cluster which has ``rwps``
permissions for the remote file system named ``remote_fs`` (see `Creating
Users`) and the remote cluster is named ``remote_ceph`` (that is, the remote
cluster configuration file is named ``remote_ceph.conf`` on the primary
cluster), run the following command to add the remote filesystem as a peer to
the primary filesystem ``primary_fs``:
.. prompt:: bash $
ceph fs snapshot mirror peer_add primary_fs client.mirror_remote@remote_ceph remote_fs
To avoid having to maintain the remote cluster configuration file and remote
ceph user keyring in the primary cluster, users can bootstrap a peer (which
stores the relevant remote cluster details in the monitor config store on the
primary cluster). See the :ref:`Bootstrap
Peers<cephfs_mirroring_bootstrap_peers>` section.
The ``peer_add`` command supports passing the remote cluster monitor address
and the user key. However, bootstrapping a peer is the recommended way to add a
peer.
.. note:: Only a single peer is supported right now.
To remove a peer use::
To remove a peer, run a command of the following form:
$ ceph fs snapshot mirror peer_remove <fs_name> <peer_uuid>
.. prompt:: bash $
To list file system mirror peers use::
ceph fs snapshot mirror peer_remove <fs_name> <peer_uuid>
$ ceph fs snapshot mirror peer_list <fs_name>
To list file system mirror peers, run a command of the following form:
To configure a directory for mirroring, use::
.. prompt:: bash $
$ ceph fs snapshot mirror add <fs_name> <path>
ceph fs snapshot mirror peer_list <fs_name>
To stop a mirroring directory snapshots use::
To configure a directory for mirroring, run a command of the following form:
$ ceph fs snapshot mirror remove <fs_name> <path>
.. prompt:: bash $
Only absolute directory paths are allowed. Also, paths are normalized by the mirroring
module, therefore, `/a/b/../b` is equivalent to `/a/b`.
ceph fs snapshot mirror add <fs_name> <path>
To stop mirroring directory snapshots, run a command of the following form:
.. prompt:: bash $
ceph fs snapshot mirror remove <fs_name> <path>
Only absolute directory paths are allowed.
Paths are normalized by the mirroring module. This means that ``/a/b/../b`` is
equivalent to ``/a/b``. Paths always start from the CephFS file-system root and
not from the host system mount point.
For example::
$ mkdir -p /d0/d1/d2
$ ceph fs snapshot mirror add cephfs /d0/d1/d2
@ -123,16 +200,19 @@ module, therefore, `/a/b/../b` is equivalent to `/a/b`.
$ ceph fs snapshot mirror add cephfs /d0/d1/../d1/d2
Error EEXIST: directory /d0/d1/d2 is already tracked
Once a directory is added for mirroring, its subdirectory or ancestor directories are
disallowed to be added for mirroring::
After a directory is added for mirroring, the additional mirroring of
subdirectories or ancestor directories is disallowed::
$ ceph fs snapshot mirror add cephfs /d0/d1
Error EINVAL: /d0/d1 is a ancestor of tracked path /d0/d1/d2
$ ceph fs snapshot mirror add cephfs /d0/d1/d2/d3
Error EINVAL: /d0/d1/d2/d3 is a subtree of tracked path /d0/d1/d2
Commands to check directory mapping (to mirror daemons) and directory distribution are
detailed in `Mirroring Status` section.
The :ref:`Mirroring Status<cephfs_mirroring_mirroring_status>` section contains
information about the commands for checking the directory mapping (to mirror
daemons) and for checking the directory distribution.
.. _cephfs_mirroring_bootstrap_peers:
Bootstrap Peers
---------------
@ -160,6 +240,9 @@ e.g.::
$ ceph fs snapshot mirror peer_bootstrap import cephfs eyJmc2lkIjogIjBkZjE3MjE3LWRmY2QtNDAzMC05MDc5LTM2Nzk4NTVkNDJlZiIsICJmaWxlc3lzdGVtIjogImJhY2t1cF9mcyIsICJ1c2VyIjogImNsaWVudC5taXJyb3JfcGVlcl9ib290c3RyYXAiLCAic2l0ZV9uYW1lIjogInNpdGUtcmVtb3RlIiwgImtleSI6ICJBUUFhcDBCZ0xtRmpOeEFBVnNyZXozai9YYUV0T2UrbUJEZlJDZz09IiwgIm1vbl9ob3N0IjogIlt2MjoxOTIuMTY4LjAuNTo0MDkxOCx2MToxOTIuMTY4LjAuNTo0MDkxOV0ifQ==
.. _cephfs_mirroring_mirroring_status:
Mirroring Status
----------------

View File

@ -78,7 +78,15 @@ By default, `cephfs-top` connects to cluster name `ceph`. To use a non-default c
$ cephfs-top -d <seconds>
Interval should be greater than or equal to 0.5 seconds. Fractional seconds are honoured.
Refresh interval should be a positive integer.
To dump the metrics to stdout without creating a curses display use::
$ cephfs-top --dump
To dump the metrics of the given filesystem to stdout without creating a curses display use::
$ cephfs-top --dumpfs <fs_name>
Interactive Commands
--------------------
@ -104,3 +112,5 @@ The metrics display can be scrolled using the Arrow Keys, PgUp/PgDn, Home/End an
Sample screenshot running `cephfs-top` with 2 filesystems:
.. image:: cephfs-top.png
.. note:: The minimum compatible Python version for cephfs-top is 3.6.0. cephfs-top is supported on RHEL 8, Ubuntu 18.04, CentOS 8, and later distributions.

View File

@ -8,10 +8,17 @@ Creating pools
A Ceph file system requires at least two RADOS pools, one for data and one for metadata.
When configuring these pools, you might consider:
- Using a higher replication level for the metadata pool, as any data loss in
this pool can render the whole file system inaccessible.
- Using lower-latency storage such as SSDs for the metadata pool, as this will
directly affect the observed latency of file system operations on clients.
- We recommend configuring *at least* 3 replicas for the metadata pool,
as data loss in this pool can render the entire file system inaccessible.
Configuring 4 would not be extreme, especially since the metadata pool's
capacity requirements are quite modest.
- We recommend the fastest feasible low-latency storage devices (NVMe, Optane,
or at the very least SAS/SATA SSD) for the metadata pool, as this will
directly affect the latency of client file system operations.
- We strongly suggest that the CephFS metadata pool be provisioned on dedicated
SSD / NVMe OSDs. This ensures that high client workload does not adversely
impact metadata operations. See :ref:`device_classes` to configure pools this
way.
- The data pool used to create the file system is the "default" data pool and
the location for storing all inode backtrace information, used for hard link
management and disaster recovery. For this reason, all inodes created in

View File

@ -149,8 +149,8 @@ errors.
::
cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_extents [<data pool> [<extra data pool> ...]]
cephfs-data-scan scan_inodes [<data pool>]
cephfs-data-scan scan_links
'scan_extents' and 'scan_inodes' commands may take a *very long* time
@ -166,22 +166,22 @@ The example below shows how to run 4 workers simultaneously:
::
# Worker 0
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 0 --worker_m 4
# Worker 1
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 1 --worker_m 4
# Worker 2
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 2 --worker_m 4
# Worker 3
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data pool>
cephfs-data-scan scan_extents --worker_n 3 --worker_m 4
# Worker 0
cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4
# Worker 1
cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4
# Worker 2
cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4
# Worker 3
cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>
cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4
It is **important** to ensure that all workers have completed the
scan_extents phase before any workers enter the scan_inodes phase.
@ -191,8 +191,13 @@ operation to delete ancillary data generated during recovery.
::
cephfs-data-scan cleanup <data pool>
cephfs-data-scan cleanup [<data pool>]
Note that the data pool parameters for the 'scan_extents', 'scan_inodes' and
'cleanup' commands are optional, and usually the tool will be able to
detect the pools automatically. Still, you may override this. The
'scan_extents' command needs all data pools to be specified, while the
'scan_inodes' and 'cleanup' commands need only the main data pool.
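For example, with a hypothetical primary data pool named ``cephfs_data`` and an
additional data pool named ``cephfs_data_ec``, the pools could be passed explicitly
as follows:

::

   cephfs-data-scan scan_extents cephfs_data cephfs_data_ec
   cephfs-data-scan scan_inodes cephfs_data
   cephfs-data-scan cleanup cephfs_data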
Using an alternate metadata pool for recovery
@ -229,35 +234,29 @@ backed by the original data pool.
::
ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph osd pool create cephfs_recovery_meta
ceph fs new cephfs_recovery recovery <data_pool> --allow-dangerous-metadata-overlay
ceph fs new cephfs_recovery cephfs_recovery_meta <data_pool> --recover --allow-dangerous-metadata-overlay
.. note::
The recovery file system starts with an MDS rank that will initialize the new
metadata pool with some metadata. This is necessary to bootstrap recovery.
However, now we will take the MDS down as we do not want it interacting with
the metadata pool further.
The ``--recover`` flag prevents any MDS from joining the new file system.
Next, we will create the initial metadata for the fs:
::
ceph fs fail cephfs_recovery
Next, we will reset the initial metadata the MDS created:
::
cephfs-table-tool cephfs_recovery:all reset session
cephfs-table-tool cephfs_recovery:all reset snap
cephfs-table-tool cephfs_recovery:all reset inode
cephfs-table-tool cephfs_recovery:0 reset session
cephfs-table-tool cephfs_recovery:0 reset snap
cephfs-table-tool cephfs_recovery:0 reset inode
cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force
Now perform the recovery of the metadata pool from the data pool:
::
cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name> <data_pool>
cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt <data_pool>
cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <fs_name>
cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <fs_name> --force-corrupt
cephfs-data-scan scan_links --filesystem cephfs_recovery
.. note::
@ -272,7 +271,6 @@ with:
::
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries list --alternate-pool cephfs_recovery_meta
cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force
After recovery, some recovered directories will have incorrect statistics.
Ensure the parameters ``mds_verify_scatter`` and ``mds_debug_scatterstat`` are
@ -283,20 +281,22 @@ set to false (the default) to prevent the MDS from checking the statistics:
ceph config rm mds mds_verify_scatter
ceph config rm mds mds_debug_scatterstat
(Note, the config may also have been set globally or via a ceph.conf file.)
.. note::
Also verify the config has not been set globally or with a local ceph.conf file.
Now, allow an MDS to join the recovery file system:
::
ceph fs set cephfs_recovery joinable true
Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair the statistics.
Finally, run a forward :doc:`scrub </cephfs/scrub>` to repair recursive statistics.
Ensure you have an MDS running and issue:
::
ceph fs status # get active MDS
ceph tell mds.<id> scrub start / recursive repair
ceph tell mds.cephfs_recovery:0 scrub start / recursive,repair,force
.. note::

View File

@ -3,13 +3,13 @@
FS volumes and subvolumes
=========================
A single source of truth for CephFS exports is implemented in the volumes
module of the :term:`Ceph Manager` daemon (ceph-mgr). The OpenStack shared
file system service (manila_), Ceph Container Storage Interface (CSI_),
storage administrators among others can use the common CLI provided by the
ceph-mgr volumes module to manage the CephFS exports.
The volumes module of the :term:`Ceph Manager` daemon (ceph-mgr) provides a
single source of truth for CephFS exports. The OpenStack shared file system
service (manila_), the Ceph Container Storage Interface (CSI_), and storage
administrators, among others, use the common CLI provided by the ceph-mgr
``volumes`` module to manage CephFS exports.
The ceph-mgr volumes module implements the following file system export
The ceph-mgr ``volumes`` module implements the following file system export
abstractions:
* FS volumes, an abstraction for CephFS file systems
@ -17,87 +17,82 @@ abstractions:
* FS subvolumes, an abstraction for independent CephFS directory trees
* FS subvolume groups, an abstraction for a directory level higher than FS
subvolumes to effect policies (e.g., :doc:`/cephfs/file-layouts`) across a
set of subvolumes
subvolumes. Used to effect policies (e.g., :doc:`/cephfs/file-layouts`)
across a set of subvolumes
Some possible use-cases for the export abstractions:
* FS subvolumes used as manila shares or CSI volumes
* FS subvolumes used as Manila shares or CSI volumes
* FS subvolume groups used as manila share groups
* FS subvolume groups used as Manila share groups
Requirements
------------
* Nautilus (14.2.x) or a later version of Ceph
* Nautilus (14.2.x) or later Ceph release
* Cephx client user (see :doc:`/rados/operations/user-management`) with
the following minimum capabilities::
at least the following capabilities::
mon 'allow r'
mgr 'allow rw'
FS Volumes
----------
Create a volume using::
Create a volume by running the following command:
$ ceph fs volume create <vol_name> [<placement>]
.. prompt:: bash #
ceph fs volume create <vol_name> [placement]
This creates a CephFS file system and its data and metadata pools. It can also
try to create MDSes for the filesystem using the enabled ceph-mgr orchestrator
module (see :doc:`/mgr/orchestrator`), e.g. rook.
deploy MDS daemons for the filesystem using a ceph-mgr orchestrator module (for
example Rook). See :doc:`/mgr/orchestrator`.
<vol_name> is the volume name (an arbitrary string), and
``<vol_name>`` is the volume name (an arbitrary string). ``[placement]`` is an
optional string that specifies the :ref:`orchestrator-cli-placement-spec` for
the MDS. See also :ref:`orchestrator-cli-cephfs` for more examples on
placement.
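For example, to create a volume whose MDS daemons are placed on three hypothetical
hosts, run a command of the following form:

.. prompt:: bash #

   ceph fs volume create vol_a "3 host1,host2,host3"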
<placement> is an optional string signifying which hosts should have NFS Ganesha
daemon containers running on them and, optionally, the total number of NFS
Ganesha daemons the cluster (should you want to have more than one NFS Ganesha
daemon running per node). For example, the following placement string means
"deploy NFS Ganesha daemons on nodes host1 and host2 (one daemon per host):
.. note:: Specifying placement via a YAML file is not supported through the
volume interface.
"host1,host2"
and this placement specification says to deploy two NFS Ganesha daemons each
on nodes host1 and host2 (for a total of four NFS Ganesha daemons in the
cluster):
"4 host1,host2"
For more details on placement specification refer to the :ref:`orchestrator-cli-service-spec`,
but keep in mind that specifying the placement via a YAML file is not supported.
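For example, assuming a hypothetical volume name ``myfs`` and hosts ``host1`` and ``host2`` (illustrative values only), a creation command might look like this::

    $ ceph fs volume create myfs "host1,host2"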
Remove a volume using::
To remove a volume, run the following command:
$ ceph fs volume rm <vol_name> [--yes-i-really-mean-it]
This removes a file system and its data and metadata pools. It also tries to
remove MDSes using the enabled ceph-mgr orchestrator module.
remove MDS daemons using the enabled ceph-mgr orchestrator module.
List volumes using::
.. note:: After volume deletion, it is recommended to restart `ceph-mgr`
if a new file system is created on the same cluster and subvolume interface
is being used. Please see https://tracker.ceph.com/issues/49605#note-5
for more details.
List volumes by running the following command:
$ ceph fs volume ls
Rename a volume using::
Rename a volume by running the following command:
$ ceph fs volume rename <vol_name> <new_vol_name> [--yes-i-really-mean-it]
Renaming a volume can be an expensive operation. It does the following:
Renaming a volume can be an expensive operation that requires the following:
- renames the orchestrator managed MDS service to match the <new_vol_name>.
This involves launching a MDS service with <new_vol_name> and bringing down
the MDS service with <vol_name>.
- renames the file system matching <vol_name> to <new_vol_name>
- changes the application tags on the data and metadata pools of the file system
to <new_vol_name>
- renames the metadata and data pools of the file system.
- Renaming the orchestrator-managed MDS service to match the <new_vol_name>.
This involves launching a MDS service with ``<new_vol_name>`` and bringing
down the MDS service with ``<vol_name>``.
- Renaming the file system matching ``<vol_name>`` to ``<new_vol_name>``.
- Changing the application tags on the data and metadata pools of the file system
to ``<new_vol_name>``.
- Renaming the metadata and data pools of the file system.
The CephX IDs authorized to <vol_name> need to be reauthorized to <new_vol_name>. Any
on-going operations of the clients using these IDs may be disrupted. Mirroring is
expected to be disabled on the volume.
The CephX IDs that are authorized for ``<vol_name>`` must be reauthorized for
``<new_vol_name>``. Any ongoing operations of the clients using these IDs may
be disrupted. Ensure that mirroring is disabled on the volume.
Fetch the information of a CephFS volume using::
To fetch the information of a CephFS volume, run the following command:
$ ceph fs volume info vol_name [--human_readable]
@ -105,15 +100,15 @@ The ``--human_readable`` flag shows used and available pool capacities in KB/MB/
The output format is JSON and contains fields as follows:
* pools: Attributes of data and metadata pools
* avail: The amount of free space available in bytes
* used: The amount of storage consumed in bytes
* name: Name of the pool
* mon_addrs: List of monitor addresses
* used_size: Current used size of the CephFS volume in bytes
* pending_subvolume_deletions: Number of subvolumes pending deletion
* ``pools``: Attributes of data and metadata pools
* ``avail``: The amount of free space available in bytes
* ``used``: The amount of storage consumed in bytes
* ``name``: Name of the pool
* ``mon_addrs``: List of Ceph monitor addresses
* ``used_size``: Current used size of the CephFS volume in bytes
* ``pending_subvolume_deletions``: Number of subvolumes pending deletion
Sample output of volume info command::
Sample output of the ``volume info`` command::
$ ceph fs volume info vol_name
{
@ -143,88 +138,91 @@ Sample output of volume info command::
FS Subvolume groups
-------------------
Create a subvolume group using::
Create a subvolume group by running the following command:
$ ceph fs subvolumegroup create <vol_name> <group_name> [--size <size_in_bytes>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>]
The command succeeds even if the subvolume group already exists.
When creating a subvolume group you can specify its data pool layout (see
:doc:`/cephfs/file-layouts`), uid, gid, file mode in octal numerals and
:doc:`/cephfs/file-layouts`), uid, gid, file mode in octal numerals, and
size in bytes. The size of the subvolume group is specified by setting
a quota on it (see :doc:`/cephfs/quota`). By default, the subvolume group
is created with an octal file mode '755', uid '0', gid '0' and data pool
is created with octal file mode ``755``, uid ``0``, gid ``0`` and the data pool
layout of its parent directory.
Remove a subvolume group using::
Remove a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup rm <vol_name> <group_name> [--force]
The removal of a subvolume group fails if it is not empty or non-existent.
'--force' flag allows the non-existent subvolume group remove command to succeed.
The removal of a subvolume group fails if the subvolume group is not empty or
is non-existent. The ``--force`` flag allows the "subvolume group remove"
command to succeed when the subvolume group is non-existent.
Fetch the absolute path of a subvolume group using::
Fetch the absolute path of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup getpath <vol_name> <group_name>
List subvolume groups using::
List subvolume groups by running a command of the following form:
$ ceph fs subvolumegroup ls <vol_name>
.. note:: Subvolume group snapshot feature is no longer supported in mainline CephFS (existing group
snapshots can still be listed and deleted)
Fetch the metadata of a subvolume group using::
Fetch the metadata of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup info <vol_name> <group_name>
The output format is json and contains fields as follows.
The output format is JSON and contains fields as follows:
* atime: access time of subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* mtime: modification time of subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* ctime: change time of subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* uid: uid of subvolume group path
* gid: gid of subvolume group path
* mode: mode of subvolume group path
* mon_addrs: list of monitor addresses
* bytes_pcent: quota used in percentage if quota is set, else displays "undefined"
* bytes_quota: quota size in bytes if quota is set, else displays "infinite"
* bytes_used: current used size of the subvolume group in bytes
* created_at: time of creation of subvolume group in the format "YYYY-MM-DD HH:MM:SS"
* data_pool: data pool the subvolume group belongs to
* ``atime``: access time of the subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* ``mtime``: modification time of the subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* ``ctime``: change time of the subvolume group path in the format "YYYY-MM-DD HH:MM:SS"
* ``uid``: uid of the subvolume group path
* ``gid``: gid of the subvolume group path
* ``mode``: mode of the subvolume group path
* ``mon_addrs``: list of monitor addresses
* ``bytes_pcent``: quota used in percentage if quota is set, else displays "undefined"
* ``bytes_quota``: quota size in bytes if quota is set, else displays "infinite"
* ``bytes_used``: current used size of the subvolume group in bytes
* ``created_at``: creation time of the subvolume group in the format "YYYY-MM-DD HH:MM:SS"
* ``data_pool``: data pool to which the subvolume group belongs
Check the presence of any subvolume group using::
Check the presence of any subvolume group by running a command of the following form:
$ ceph fs subvolumegroup exist <vol_name>
The strings returned by the 'exist' command:
The ``exist`` command outputs:
* "subvolumegroup exists": if any subvolumegroup is present
* "no subvolumegroup exists": if no subvolumegroup is present
.. note:: It checks for the presence of custom groups and not the default one. To validate the emptiness of the volume, subvolumegroup existence check alone is not sufficient. The subvolume existence also needs to be checked as there might be subvolumes in the default group.
.. note:: This command checks for the presence of custom groups and not
presence of the default one. To validate the emptiness of the volume, a
subvolumegroup existence check alone is not sufficient. Subvolume existence
also needs to be checked as there might be subvolumes in the default group.
Resize a subvolume group using::
Resize a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup resize <vol_name> <group_name> <new_size> [--no_shrink]
The command resizes the subvolume group quota using the size specified by 'new_size'.
The '--no_shrink' flag prevents the subvolume group to shrink below the current used
size of the subvolume group.
The command resizes the subvolume group quota, using the size specified by
``new_size``. The ``--no_shrink`` flag prevents the subvolume group from
shrinking below the current used size.
The subvolume group can be resized to an infinite size by passing 'inf' or 'infinite'
as the new_size.
The subvolume group may be resized to an infinite size by passing ``inf`` or
``infinite`` as the ``new_size``.
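For example, with a hypothetical volume ``cephfs`` and group ``csi`` (illustrative names and sizes), the quota could be grown to 100 GiB or removed entirely::

    $ ceph fs subvolumegroup resize cephfs csi 107374182400
    $ ceph fs subvolumegroup resize cephfs csi inf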
Remove a snapshot of a subvolume group using::
Remove a snapshot of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup snapshot rm <vol_name> <group_name> <snap_name> [--force]
Using the '--force' flag allows the command to succeed that would otherwise
fail if the snapshot did not exist.
Supplying the ``--force`` flag allows the command to succeed when it would otherwise
fail due to the nonexistence of the snapshot.
List snapshots of a subvolume group using::
List snapshots of a subvolume group by running a command of the following form:
$ ceph fs subvolumegroup snapshot ls <vol_name> <group_name>
@ -232,7 +230,7 @@ List snapshots of a subvolume group using::
FS Subvolumes
-------------
Create a subvolume using::
Create a subvolume using:
$ ceph fs subvolume create <vol_name> <subvol_name> [--size <size_in_bytes>] [--group_name <subvol_group_name>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>] [--namespace-isolated]
@ -247,11 +245,10 @@ default a subvolume is created within the default subvolume group, and with an o
mode '755', uid of its subvolume group, gid of its subvolume group, data pool layout of
its parent directory and no size limit.
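For example, a subvolume might be created with a 10 GiB quota inside a hypothetical group ``csi`` (names and size are illustrative)::

    $ ceph fs subvolume create cephfs sub0 --size 10737418240 --group_name csi --namespace-isolated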
Remove a subvolume using::
Remove a subvolume using:
$ ceph fs subvolume rm <vol_name> <subvol_name> [--group_name <subvol_group_name>] [--force] [--retain-snapshots]
The command removes the subvolume and its contents. It does this in two steps.
First, it moves the subvolume to a trash folder, and then asynchronously purges
its contents.
@ -267,95 +264,95 @@ empty for all operations not involving the retained snapshots.
.. note:: Retained snapshots can be used as a clone source to recreate the subvolume, or clone to a newer subvolume.
Resize a subvolume using::
Resize a subvolume using:
$ ceph fs subvolume resize <vol_name> <subvol_name> <new_size> [--group_name <subvol_group_name>] [--no_shrink]
The command resizes the subvolume quota using the size specified by 'new_size'.
'--no_shrink' flag prevents the subvolume to shrink below the current used size of the subvolume.
The command resizes the subvolume quota using the size specified by ``new_size``.
The ``--no_shrink`` flag prevents the subvolume from shrinking below the current used size of the subvolume.
The subvolume can be resized to an infinite size by passing 'inf' or 'infinite' as the new_size.
The subvolume can be resized to an unlimited (but sparse) logical size by passing ``inf`` or ``infinite`` as ``new_size``.
Authorize cephx auth IDs, the read/read-write access to fs subvolumes::
Authorize cephx auth IDs with read or read-write access to fs subvolumes:
$ ceph fs subvolume authorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>] [--access_level=<access_level>]
The 'access_level' takes 'r' or 'rw' as value.
The ``access_level`` takes ``r`` or ``rw`` as value.
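For example, to grant a hypothetical auth ID ``user1`` read-write access to a subvolume ``sub0`` (illustrative names)::

    $ ceph fs subvolume authorize cephfs sub0 user1 --access_level=rw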
Deauthorize cephx auth IDs, the read/read-write access to fs subvolumes::
Deauthorize cephx auth IDs, removing their read or read-write access to fs subvolumes:
$ ceph fs subvolume deauthorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]
List cephx auth IDs authorized to access fs subvolume::
List cephx auth IDs authorized to access fs subvolume:
$ ceph fs subvolume authorized_list <vol_name> <sub_name> [--group_name=<group_name>]
Evict fs clients based on auth ID and subvolume mounted::
Evict fs clients based on auth ID and subvolume mounted:
$ ceph fs subvolume evict <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]
Fetch the absolute path of a subvolume using::
Fetch the absolute path of a subvolume using:
$ ceph fs subvolume getpath <vol_name> <subvol_name> [--group_name <subvol_group_name>]
Fetch the information of a subvolume using::
Fetch the information of a subvolume using:
$ ceph fs subvolume info <vol_name> <subvol_name> [--group_name <subvol_group_name>]
The output format is json and contains fields as follows.
The output format is JSON and contains fields as follows.
* atime: access time of subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* mtime: modification time of subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* ctime: change time of subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* uid: uid of subvolume path
* gid: gid of subvolume path
* mode: mode of subvolume path
* mon_addrs: list of monitor addresses
* bytes_pcent: quota used in percentage if quota is set, else displays "undefined"
* bytes_quota: quota size in bytes if quota is set, else displays "infinite"
* bytes_used: current used size of the subvolume in bytes
* created_at: time of creation of subvolume in the format "YYYY-MM-DD HH:MM:SS"
* data_pool: data pool the subvolume belongs to
* path: absolute path of a subvolume
* type: subvolume type indicating whether it's clone or subvolume
* pool_namespace: RADOS namespace of the subvolume
* features: features supported by the subvolume
* state: current state of the subvolume
* ``atime``: access time of the subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* ``mtime``: modification time of the subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* ``ctime``: change time of the subvolume path in the format "YYYY-MM-DD HH:MM:SS"
* ``uid``: uid of the subvolume path
* ``gid``: gid of the subvolume path
* ``mode``: mode of the subvolume path
* ``mon_addrs``: list of monitor addresses
* ``bytes_pcent``: quota used in percentage if quota is set, else displays ``undefined``
* ``bytes_quota``: quota size in bytes if quota is set, else displays ``infinite``
* ``bytes_used``: current used size of the subvolume in bytes
* ``created_at``: creation time of the subvolume in the format "YYYY-MM-DD HH:MM:SS"
* ``data_pool``: data pool to which the subvolume belongs
* ``path``: absolute path of a subvolume
* ``type``: subvolume type indicating whether it's clone or subvolume
* ``pool_namespace``: RADOS namespace of the subvolume
* ``features``: features supported by the subvolume
* ``state``: current state of the subvolume
If a subvolume has been removed retaining its snapshots, the output only contains fields as follows.
If a subvolume has been removed retaining its snapshots, the output contains only fields as follows.
* type: subvolume type indicating whether it's clone or subvolume
* features: features supported by the subvolume
* state: current state of the subvolume
* ``type``: subvolume type indicating whether it's clone or subvolume
* ``features``: features supported by the subvolume
* ``state``: current state of the subvolume
The subvolume "features" are based on the internal version of the subvolume and is a list containing
a subset of the following features,
A subvolume's ``features`` are based on the internal version of the subvolume and are
a subset of the following:
* "snapshot-clone": supports cloning using a subvolumes snapshot as the source
* "snapshot-autoprotect": supports automatically protecting snapshots, that are active clone sources, from deletion
* "snapshot-retention": supports removing subvolume contents, retaining any existing snapshots
* ``snapshot-clone``: supports cloning using a subvolumes snapshot as the source
* ``snapshot-autoprotect``: supports automatically protecting snapshots, that are active clone sources, from deletion
* ``snapshot-retention``: supports removing subvolume contents, retaining any existing snapshots
The subvolume "state" is based on the current state of the subvolume and contains one of the following values.
A subvolume's ``state`` is based on the current state of the subvolume and contains one of the following values.
* "complete": subvolume is ready for all operations
* "snapshot-retained": subvolume is removed but its snapshots are retained
* ``complete``: subvolume is ready for all operations
* ``snapshot-retained``: subvolume is removed but its snapshots are retained
List subvolumes using::
List subvolumes using:
$ ceph fs subvolume ls <vol_name> [--group_name <subvol_group_name>]
.. note:: Subvolumes that are removed but have retained snapshots are also listed.
Check the presence of any subvolume using::
Check the presence of any subvolume using:
$ ceph fs subvolume exist <vol_name> [--group_name <subvol_group_name>]
The strings returned by the 'exist' command:
These are the possible results of the ``exist`` command:
* "subvolume exists": if any subvolume of given group_name is present
* "no subvolume exists": if no subvolume of given group_name is present
* ``subvolume exists``: if any subvolume of given group_name is present
* ``no subvolume exists``: if no subvolume of given group_name is present
Set custom metadata on the subvolume as a key-value pair using::
Set custom metadata on the subvolume as a key-value pair using:
$ ceph fs subvolume metadata set <vol_name> <subvol_name> <key_name> <value> [--group_name <subvol_group_name>]
@ -365,52 +362,51 @@ Set custom metadata on the subvolume as a key-value pair using::
.. note:: Custom metadata on a subvolume is not preserved when snapshotting the subvolume, and hence, is also not preserved when cloning the subvolume snapshot.
Get custom metadata set on the subvolume using the metadata key::
Get custom metadata set on the subvolume using the metadata key:
$ ceph fs subvolume metadata get <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>]
List custom metadata (key-value pairs) set on the subvolume using::
List custom metadata (key-value pairs) set on the subvolume using:
$ ceph fs subvolume metadata ls <vol_name> <subvol_name> [--group_name <subvol_group_name>]
Remove custom metadata set on the subvolume using the metadata key::
Remove custom metadata set on the subvolume using the metadata key:
$ ceph fs subvolume metadata rm <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>] [--force]
Using the '--force' flag allows the command to succeed that would otherwise
Using the ``--force`` flag allows the command to succeed that would otherwise
fail if the metadata key did not exist.
Create a snapshot of a subvolume using::
Create a snapshot of a subvolume using:
$ ceph fs subvolume snapshot create <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
Remove a snapshot of a subvolume using::
Remove a snapshot of a subvolume using:
$ ceph fs subvolume snapshot rm <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>] [--force]
Using the '--force' flag allows the command to succeed that would otherwise
Using the ``--force`` flag allows the command to succeed that would otherwise
fail if the snapshot did not exist.
.. note:: if the last snapshot within a snapshot retained subvolume is removed, the subvolume is also removed
List snapshots of a subvolume using::
List snapshots of a subvolume using:
$ ceph fs subvolume snapshot ls <vol_name> <subvol_name> [--group_name <subvol_group_name>]
Fetch the information of a snapshot using::
Fetch the information of a snapshot using:
$ ceph fs subvolume snapshot info <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
The output format is json and contains fields as follows.
* created_at: time of creation of snapshot in the format "YYYY-MM-DD HH:MM:SS:ffffff"
* data_pool: data pool the snapshot belongs to
* has_pending_clones: "yes" if snapshot clone is in progress otherwise "no"
* pending_clones: list of in progress or pending clones and their target group if exist otherwise this field is not shown
* orphan_clones_count: count of orphan clones if snapshot has orphan clones otherwise this field is not shown
* ``created_at``: creation time of the snapshot in the format "YYYY-MM-DD HH:MM:SS:ffffff"
* ``data_pool``: data pool to which the snapshot belongs
* ``has_pending_clones``: ``yes`` if snapshot clone is in progress, otherwise ``no``
* ``pending_clones``: list of in-progress or pending clones and their target group if any exist, otherwise this field is not shown
* ``orphan_clones_count``: count of orphan clones if the snapshot has orphan clones, otherwise this field is not shown
Sample output if snapshot clones are in progress or pending state::
Sample output when snapshot clones are in progress or pending::
$ ceph fs subvolume snapshot info cephfs subvol snap
{
@ -432,7 +428,7 @@ Sample output if snapshot clones are in progress or pending state::
]
}
Sample output if no snapshot clone is in progress or pending state::
Sample output when no snapshot clone is in progress or pending::
$ ceph fs subvolume snapshot info cephfs subvol snap
{
@ -441,90 +437,93 @@ Sample output if no snapshot clone is in progress or pending state::
"has_pending_clones": "no"
}
Set custom metadata on the snapshot as a key-value pair using::
Set custom key-value metadata on the snapshot by running:
$ ceph fs subvolume snapshot metadata set <vol_name> <subvol_name> <snap_name> <key_name> <value> [--group_name <subvol_group_name>]
.. note:: If the key_name already exists then the old value will get replaced by the new value.
.. note:: The key_name and value should be a string of ASCII characters (as specified in python's string.printable). The key_name is case-insensitive and always stored in lower case.
.. note:: The key_name and value should be strings of ASCII characters (as specified in Python's ``string.printable``). The key_name is case-insensitive and always stored in lowercase.
.. note:: Custom metadata on a snapshots is not preserved when snapshotting the subvolume, and hence, is also not preserved when cloning the subvolume snapshot.
.. note:: Custom metadata on a snapshot is not preserved when snapshotting the subvolume, and hence is also not preserved when cloning the subvolume snapshot.
Get custom metadata set on the snapshot using the metadata key::
Get custom metadata set on the snapshot using the metadata key:
$ ceph fs subvolume snapshot metadata get <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>]
List custom metadata (key-value pairs) set on the snapshot using::
List custom metadata (key-value pairs) set on the snapshot using:
$ ceph fs subvolume snapshot metadata ls <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
Remove custom metadata set on the snapshot using the metadata key::
Remove custom metadata set on the snapshot using the metadata key:
$ ceph fs subvolume snapshot metadata rm <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>] [--force]
Using the '--force' flag allows the command to succeed that would otherwise
Using the ``--force`` flag allows the command to succeed that would otherwise
fail if the metadata key did not exist.
Cloning Snapshots
-----------------
Subvolumes can be created by cloning subvolume snapshots. Cloning is an asynchronous operation involving copying
data from a snapshot to a subvolume. Due to this bulk copy nature, cloning is currently inefficient for very huge
Subvolumes can be created by cloning subvolume snapshots. Cloning is an asynchronous operation that copies
data from a snapshot to a subvolume. Due to this bulk copying, cloning is inefficient for very large
data sets.
.. note:: Removing a snapshot (source subvolume) would fail if there are pending or in progress clone operations.
Protecting snapshots prior to cloning was a pre-requisite in the Nautilus release, and the commands to protect/unprotect
snapshots were introduced for this purpose. This pre-requisite, and hence the commands to protect/unprotect, is being
deprecated in mainline CephFS, and may be removed from a future release.
Protecting snapshots prior to cloning was a prerequisite in the Nautilus release, and the commands to protect/unprotect
snapshots were introduced for this purpose. This prerequisite, and hence the commands to protect/unprotect, is being
deprecated and may be removed from a future release.
The commands being deprecated are:
$ ceph fs subvolume snapshot protect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
$ ceph fs subvolume snapshot unprotect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
.. note:: Using the above commands would not result in an error, but they serve no useful function.
.. prompt:: bash #
.. note:: Use subvolume info command to fetch subvolume metadata regarding supported "features" to help decide if protect/unprotect of snapshots is required, based on the "snapshot-autoprotect" feature availability.
ceph fs subvolume snapshot protect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
ceph fs subvolume snapshot unprotect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
To initiate a clone operation use::
.. note:: Using the above commands will not result in an error, but they have no useful purpose.
.. note:: Use the ``subvolume info`` command to fetch subvolume metadata regarding supported ``features`` to help decide if protect/unprotect of snapshots is required, based on the availability of the ``snapshot-autoprotect`` feature.
To initiate a clone operation use:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name>
If a snapshot (source subvolume) is a part of non-default group, the group name needs to be specified as per::
If a snapshot (source subvolume) is a part of non-default group, the group name needs to be specified:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --group_name <subvol_group_name>
Cloned subvolumes can be a part of a different group than the source snapshot (by default, cloned subvolumes are created in default group). To clone to a particular group use::
Cloned subvolumes can be a part of a different group than the source snapshot (by default, cloned subvolumes are created in default group). To clone to a particular group use:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --target_group_name <subvol_group_name>
Similar to specifying a pool layout when creating a subvolume, pool layout can be specified when creating a cloned subvolume. To create a cloned subvolume with a specific pool layout use::
Similar to specifying a pool layout when creating a subvolume, pool layout can be specified when creating a cloned subvolume. To create a cloned subvolume with a specific pool layout use:
$ ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --pool_layout <pool_layout>
Configure maximum number of concurrent clones. The default is set to 4::
Configure the maximum number of concurrent clones. The default is 4:
$ ceph config set mgr mgr/volumes/max_concurrent_clones <value>
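For example, to allow up to eight concurrent clones (an illustrative value; tune it to the available CPU and I/O headroom)::

    $ ceph config set mgr mgr/volumes/max_concurrent_clones 8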
To check the status of a clone operation use::
To check the status of a clone operation use:
$ ceph fs clone status <vol_name> <clone_name> [--group_name <group_name>]
A clone can be in one of the following states:
#. `pending` : Clone operation has not started
#. `in-progress` : Clone operation is in progress
#. `complete` : Clone operation has successfully finished
#. `failed` : Clone operation has failed
#. `canceled` : Clone operation is cancelled by user
#. ``pending`` : Clone operation has not started
#. ``in-progress`` : Clone operation is in progress
#. ``complete`` : Clone operation has successfully finished
#. ``failed`` : Clone operation has failed
#. ``canceled`` : Clone operation is cancelled by user
The reason for a clone failure is shown below:
#. `errno` : error number
#. `error_msg` : failure error string
#. ``errno`` : error number
#. ``error_msg`` : failure error string
Sample output of an `in-progress` clone operation::
Here is an example of an ``in-progress`` clone::
$ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
$ ceph fs clone status cephfs clone1
@ -539,9 +538,9 @@ Sample output of an `in-progress` clone operation::
}
}
.. note:: The `failure` section will be shown only if the clone is in failed or cancelled state
.. note:: The ``failure`` section will be shown only if the clone's state is ``failed`` or ``cancelled``
Sample output of a `failed` clone operation::
Here is an example of a ``failed`` clone::
$ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
$ ceph fs clone status cephfs clone1
@ -561,11 +560,11 @@ Sample output of a `failed` clone operation::
}
}
(NOTE: since `subvol1` is in default group, `source` section in `clone status` does not include group name)
(NOTE: since ``subvol1`` is in the default group, the ``source`` object's ``clone status`` does not include the group name)
.. note:: Cloned subvolumes are accessible only after the clone operation has successfully completed.
For a successful clone operation, `clone status` would look like so::
After a successful clone operation, ``clone status`` will look like the below::
$ ceph fs clone status cephfs clone1
{
@ -574,21 +573,21 @@ For a successful clone operation, `clone status` would look like so::
}
}
or `failed` state when clone is unsuccessful.
If a clone operation is unsuccessful, the ``state`` value will be ``failed``.
On failure of a clone operation, the partial clone needs to be deleted and the clone operation needs to be retriggered.
To retry a failed clone operation, the incomplete clone must be deleted and the clone operation must be issued again.
To delete a partial clone use::
$ ceph fs subvolume rm <vol_name> <clone_name> [--group_name <group_name>] --force
.. note:: Cloning only synchronizes directories, regular files and symbolic links. Also, inode timestamps (access and
.. note:: Cloning synchronizes only directories, regular files and symbolic links. Inode timestamps (access and
modification times) are synchronized up to seconds granularity.
An `in-progress` or a `pending` clone operation can be canceled. To cancel a clone operation use the `clone cancel` command::
An ``in-progress`` or a ``pending`` clone operation may be canceled. To cancel a clone operation use the ``clone cancel`` command:
$ ceph fs clone cancel <vol_name> <clone_name> [--group_name <group_name>]
On successful cancellation, the cloned subvolume is moved to `canceled` state::
On successful cancellation, the cloned subvolume is moved to the ``canceled`` state::
$ ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
$ ceph fs clone cancel cephfs clone1
@ -604,7 +603,7 @@ On successful cancellation, the cloned subvolume is moved to `canceled` state::
}
}
.. note:: The canceled cloned can be deleted by using --force option in `fs subvolume rm` command.
.. note:: The canceled clone may be deleted by supplying the ``--force`` option to the ``fs subvolume rm`` command.
.. _subvol-pinning:
@ -612,17 +611,16 @@ On successful cancellation, the cloned subvolume is moved to `canceled` state::
Pinning Subvolumes and Subvolume Groups
---------------------------------------
Subvolumes and subvolume groups can be automatically pinned to ranks according
to policies. This can help distribute load across MDS ranks in predictable and
Subvolumes and subvolume groups may be automatically pinned to ranks according
to policies. This can distribute load across MDS ranks in predictable and
stable ways. Review :ref:`cephfs-pinning` and :ref:`cephfs-ephemeral-pinning`
for details on how pinning works.
Pinning is configured by::
Pinning is configured by:
$ ceph fs subvolumegroup pin <vol_name> <group_name> <pin_type> <pin_setting>
or for subvolumes::
or for subvolumes:
$ ceph fs subvolume pin <vol_name> <group_name> <pin_type> <pin_setting>
@ -631,7 +629,7 @@ one of ``export``, ``distributed``, or ``random``. The ``pin_setting``
corresponds to the extended attributed "value" as in the pinning documentation
referenced above.
So, for example, setting a distributed pinning strategy on a subvolume group::
So, for example, setting a distributed pinning strategy on a subvolume group:
$ ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1

View File

@ -130,7 +130,9 @@ other daemons, please see :ref:`health-checks`.
from properly cleaning up resources used by client requests. This message
appears if a client appears to have more than ``max_completed_requests``
(default 100000) requests that are complete on the MDS side but haven't
yet been accounted for in the client's *oldest tid* value.
yet been accounted for in the client's *oldest tid* value. The last tid
used by the MDS to trim completed client requests (or flush) is included
in the output of the ``session ls`` (or ``client ls``) command, as a debugging aid.
``MDS_DAMAGE``
--------------

View File

@ -57,6 +57,8 @@
.. confval:: mds_kill_import_at
.. confval:: mds_kill_link_at
.. confval:: mds_kill_rename_at
.. confval:: mds_inject_skip_replaying_inotable
.. confval:: mds_kill_skip_replaying_inotable
.. confval:: mds_wipe_sessions
.. confval:: mds_wipe_ino_prealloc
.. confval:: mds_skip_ino

View File

@ -225,3 +225,17 @@ For the reverse situation:
The ``home/patrick`` directory and its children will be pinned to rank 2
because its export pin overrides the policy on ``home``.
To remove a partitioning policy, remove the respective extended attribute
or set the value to 0.
.. code:: bash
$ setfattr -n ceph.dir.pin.distributed -v 0 home
# or
$ setfattr -x ceph.dir.pin.distributed home
For export pins, remove the extended attribute or set the extended attribute
value to `-1`.
.. code:: bash
$ setfattr -n ceph.dir.pin -v -1 home

View File

@ -56,6 +56,18 @@ in the sample conf. There are options to do the following:
- enable read delegations (need at least v13.0.1 ``libcephfs2`` package
and v2.6.0 stable ``nfs-ganesha`` and ``nfs-ganesha-ceph`` packages)
.. important::
Under certain conditions, NFS access using the CephFS FSAL fails. This
causes an error to be thrown that reads "Input/output error". Under these
circumstances, the application metadata must be set for the CephFS metadata
and CephFS data pools. Do this by running the following command:
.. prompt:: bash $
ceph osd pool application set <cephfs_metadata_pool> cephfs <cephfs_data_pool> cephfs
Configuration for libcephfs clients
-----------------------------------

View File

@ -143,3 +143,14 @@ The types of damage that can be reported and repaired by File System Scrub are:
* BACKTRACE : Inode's backtrace in the data pool is corrupted.
Evaluate strays using recursive scrub
=====================================
- To evaluate strays, i.e. to purge stray directories in ``~mdsdir``, use the following command::
ceph tell mds.<fsname>:0 scrub start ~mdsdir recursive
- ``~mdsdir`` is not enqueued by default when scrubbing at the CephFS root. In order to perform stray evaluation
at root, run scrub with flags ``scrub_mdsdir`` and ``recursive``::
ceph tell mds.<fsname>:0 scrub start / recursive,scrub_mdsdir

View File

@ -142,6 +142,24 @@ Examples::
ceph fs snap-schedule retention add / 24h4w # add 24 hourly and 4 weekly to retention
ceph fs snap-schedule retention remove / 7d4w # remove 7 daily and 4 weekly, leaves 24 hourly
.. note:: When adding a path to snap-schedule, remember to strip off the mount
point path prefix. Paths to snap-schedule should start at the appropriate
CephFS file system root and not at the host file system root.
For example, if the Ceph File System is mounted at ``/mnt`` and the path under which
snapshots need to be taken is ``/mnt/some/path``, then the actual path required
by snap-schedule is only ``/some/path``, as shown in the example below.
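For example, using the illustrative path above, an hourly schedule would be added with::

    ceph fs snap-schedule add /some/path 1h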
.. note:: The "created" field in the snap-schedule status command output is the
timestamp at which the schedule was created. The "created" timestamp has nothing
to do with the creation of actual snapshots. Actual snapshot creation is
accounted for in the "created_count" field, which is a cumulative count of the
total number of snapshots created so far.
.. note:: The maximum number of snapshots to retain per directory is limited by the
config tunable ``mds_max_snaps_per_dir``, which defaults to 100.
To ensure that a new snapshot can always be created, one snapshot less than this
is retained, so by default a maximum of 99 snapshots is retained. See the example
below for raising the limit.
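For example, the limit could be raised with an illustrative value of 150::

    ceph config set mds mds_max_snaps_per_dir 150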
Active and inactive schedules
-----------------------------
Snapshot schedules can be added for a path that doesn't exist yet in the

View File

@ -21,6 +21,133 @@ We can get hints about what's going on by dumping the MDS cache ::
If high logging levels are set on the MDS, that will almost certainly hold the
information we need to diagnose and solve the issue.
Stuck during recovery
=====================
Stuck in up:replay
------------------
If your MDS is stuck in ``up:replay`` then it is likely that the journal is
very long. Did you see ``MDS_HEALTH_TRIM`` cluster warnings saying the MDS is
behind on trimming its journal? If the journal has grown very large, it can
take hours to read the journal. There is no working around this but there
are things you can do to speed things along:
Reduce MDS debugging to 0. Even at the default settings, the MDS logs some
messages to memory for dumping if a fatal error is encountered. You can avoid
this:
.. code:: bash
ceph config set mds debug_mds 0
ceph config set mds debug_ms 0
ceph config set mds debug_monc 0
Note if the MDS fails then there will be virtually no information to determine
why. If you can calculate when ``up:replay`` will complete, you should restore
these configs just prior to entering the next state:
.. code:: bash
ceph config rm mds debug_mds
ceph config rm mds debug_ms
ceph config rm mds debug_monc
Once you've got replay moving along faster, you can calculate when the MDS will
complete. This is done by examining the journal replay status:
.. code:: bash
$ ceph tell mds.<fs_name>:0 status | jq .replay_status
{
"journal_read_pos": 4195244,
"journal_write_pos": 4195244,
"journal_expire_pos": 4194304,
"num_events": 2,
"num_segments": 2
}
Replay completes when the ``journal_read_pos`` reaches the
``journal_write_pos``. The write position will not change during replay. Track
the progression of the read position to compute the expected time to complete.
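As a rough sketch (assuming ``jq`` is installed, rank 0 is the replaying rank, and some progress is made between the two samples), the remaining time can be estimated by sampling the read position twice:

.. code:: bash

   # Sample the journal read position twice, 60 seconds apart, then
   # extrapolate how long the remaining journal will take to replay.
   READ1=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
   sleep 60
   READ2=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
   WRITE=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_write_pos)
   RATE=$(( (READ2 - READ1) / 60 ))   # bytes replayed per second (assumed > 0)
   echo "estimated seconds remaining: $(( (WRITE - READ2) / RATE ))"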
Avoiding recovery roadblocks
----------------------------
When trying to urgently restore your file system during an outage, here are some
things to do:
* **Deny all reconnect to clients.** This effectively blocklists all existing
CephFS sessions so all mounts will hang or become unavailable.
.. code:: bash
ceph config set mds mds_deny_all_reconnect true
Remember to undo this after the MDS becomes active.
.. note:: This does not prevent new sessions from connecting. For that, see the ``refuse_client_session`` file system setting.
* **Extend the MDS heartbeat grace period**. This avoids replacing an MDS that appears
"stuck" doing some operation. Sometimes recovery of an MDS may involve an
operation that may take longer than expected (from the programmer's
perspective). This is more likely when recovery is already taking a longer than
normal amount of time to complete (indicated by your reading this document).
Avoid unnecessary replacement loops by extending the heartbeat grace period:
.. code:: bash
ceph config set mds mds_heartbeat_reset_grace 3600
This has the effect of having the MDS continue to send beacons to the monitors
even when its internal "heartbeat" mechanism has not been reset (beat) in one
hour. Note the previous mechanism for achieving this was via the
`mds_beacon_grace` monitor setting.
* **Disable open file table prefetch.** Normally, the MDS will prefetch
directory contents during recovery to heat up its cache. During long
recovery, the cache is probably already hot **and large**. So this behavior
can be undesirable. Disable using:
.. code:: bash
ceph config set mds mds_oft_prefetch_dirfrags false
* **Turn off clients.** Clients reconnecting to the newly ``up:active`` MDS may
cause new load on the file system when it's just getting back on its feet.
There will likely be some general maintenance to do before workloads should be
resumed. For example, expediting journal trim may be advisable if the recovery
took a long time because replay was reading an overly large journal.
You can do this manually or use the new file system tunable:
.. code:: bash
ceph fs set <fs_name> refuse_client_session true
That prevents any clients from establishing new sessions with the MDS.
Expediting MDS journal trim
===========================
If your MDS journal grew too large (maybe your MDS was stuck in up:replay for a
long time!), you will want to have the MDS trim its journal more frequently.
You will know the journal is too large because of ``MDS_HEALTH_TRIM`` warnings.
The main tunable available to do this is to modify the MDS tick interval. The
"tick" interval drives several upkeep activities in the MDS. It is strongly
recommended no significant file system load be present when modifying this tick
interval. This setting only affects an MDS in ``up:active``. The MDS does not
trim its journal during recovery.
.. code:: bash
ceph config set mds mds_tick_interval 2
RADOS Health
============
@ -188,6 +315,98 @@ You can enable dynamic debug against the CephFS module.
Please see: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh
In-memory Log Dump
==================
In-memory logs can be dumped by setting ``mds_extraordinary_events_dump_interval``
when debugging at a lower log level (log level < 10). ``mds_extraordinary_events_dump_interval``
is the interval in seconds for dumping the recent in-memory logs when there is an Extra-Ordinary event.
The Extra-Ordinary events are classified as:
* Client Eviction
* Missed Beacon ACK from the monitors
* Missed Internal Heartbeats
In-memory Log Dump is disabled by default to prevent log file bloat in a production environment.
The following commands, run in sequence, enable it::
$ ceph config set mds debug_mds <log_level>/<gather_level>
$ ceph config set mds mds_extraordinary_events_dump_interval <seconds>
The ``log_level`` should be < 10 and ``gather_level`` should be >= 10 to enable in-memory log dump.
When it is enabled, the MDS checks for the extra-ordinary events every
``mds_extraordinary_events_dump_interval`` seconds and, if any of them occurs, the MDS dumps the
in-memory logs containing the relevant event details to the ceph-mds log.
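For example, the dump might be enabled with illustrative values (a debug level of 1 with a gather level of 20, checking every 60 seconds; tune these to your environment)::

    $ ceph config set mds debug_mds 1/20
    $ ceph config set mds mds_extraordinary_events_dump_interval 60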
.. note:: For higher log levels (log_level >= 10) there is no reason to dump the in-memory logs, and a
lower gather level (gather_level < 10) is insufficient to gather them. Thus a
log level >= 10 or a gather level < 10 in debug_mds prevents enabling the in-memory log dump.
In such cases, when there is a failure, reset the value of
``mds_extraordinary_events_dump_interval`` to 0 before enabling the dump using the above commands.
The In-memory Log Dump can be disabled using::
$ ceph config set mds mds_extraordinary_events_dump_interval 0
Filesystems Become Inaccessible After an Upgrade
================================================
.. note::
You can avoid ``operation not permitted`` errors by running this procedure
before an upgrade. As of May 2023, it seems that ``operation not permitted``
errors of the kind discussed here occur after upgrades after Nautilus
(inclusive).
IF
you have CephFS file systems that have data and metadata pools that were
created by a ``ceph fs new`` command (meaning that they were not created
with the defaults)
OR
you have an existing CephFS file system and are upgrading to a new post-Nautilus
major version of Ceph
THEN
in order for the documented ``ceph fs authorize...`` commands to function as
documented (and to avoid 'operation not permitted' errors when doing file I/O
or similar security-related problems for all users except the ``client.admin``
user), you must first run:
.. prompt:: bash $
ceph osd pool application set <your metadata pool name> cephfs metadata <your ceph fs filesystem name>
and
.. prompt:: bash $
ceph osd pool application set <your data pool name> cephfs data <your ceph fs filesystem name>
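For example, for a hypothetical file system named ``myfs`` whose pools use the newer default names, the commands might look like this (illustrative names only; substitute your actual pool and file system names):

.. prompt:: bash $

   ceph osd pool application set myfs.meta cephfs metadata myfs
   ceph osd pool application set myfs.data cephfs data myfs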
Otherwise, when the OSDs receive a request to read or write data (not the
directory info, but file data) they will not know which Ceph file system name
to look up. This is true also of pool names, because the 'defaults' themselves
changed in the major releases, from::
data pool=fsname
metadata pool=fsname_metadata
to::
data pool=fsname.data and
metadata pool=fsname.meta
Any setup that used ``client.admin`` for all mounts did not run into this
problem, because the admin key gave blanket permissions.
A temporary fix involves changing mount requests to the 'client.admin' user and
its associated key. A less drastic but only partial fix is to change the osd cap for
your user to just ``caps osd = "allow rw"`` and delete ``tag cephfs
data=....``
Reporting Issues
================

View File

@ -2,38 +2,44 @@
CephFS Mirroring
================
CephFS supports asynchronous replication of snapshots to a remote CephFS file system via
`cephfs-mirror` tool. Snapshots are synchronized by mirroring snapshot data followed by
creating a snapshot with the same name (for a given directory on the remote file system) as
the snapshot being synchronized.
CephFS supports asynchronous replication of snapshots to a remote CephFS file
system via `cephfs-mirror` tool. Snapshots are synchronized by mirroring
snapshot data followed by creating a snapshot with the same name (for a given
directory on the remote file system) as the snapshot being synchronized.
Requirements
------------
The primary (local) and secondary (remote) Ceph clusters version should be Pacific or later.
The primary (local) and secondary (remote) Ceph clusters version should be
Pacific or later.
Key Idea
--------
For a given snapshot pair in a directory, `cephfs-mirror` daemon will rely on readdir diff
to identify changes in a directory tree. The diffs are applied to directory in the remote
file system thereby only synchronizing files that have changed between two snapshots.
For a given snapshot pair in a directory, `cephfs-mirror` daemon will rely on
readdir diff to identify changes in a directory tree. The diffs are applied to
directory in the remote file system thereby only synchronizing files that have
changed between two snapshots.
This feature is tracked here: https://tracker.ceph.com/issues/47034.
Currently, snapshot data is synchronized by bulk copying to the remote filesystem.
Currently, snapshot data is synchronized by bulk copying to the remote
filesystem.
.. note:: Synchronizing hardlinks is not supported -- hardlinked files get synchronized
as separate files.
.. note:: Synchronizing hardlinks is not supported -- hardlinked files get
synchronized as separate files.
Creating Users
--------------
Start by creating a user (on the primary/local cluster) for the mirror daemon. This user
requires write capability on the metadata pool to create RADOS objects (index objects)
for watch/notify operation and read capability on the data pool(s).
Start by creating a user (on the primary/local cluster) for the mirror daemon.
This user requires write capability on the metadata pool to create RADOS
objects (index objects) for watch/notify operation and read capability on the
data pool(s).
$ ceph auth get-or-create client.mirror mon 'profile cephfs-mirror' mds 'allow r' osd 'allow rw tag cephfs metadata=*, allow r tag cephfs data=*' mgr 'allow r'
.. prompt:: bash $
ceph auth get-or-create client.mirror mon 'profile cephfs-mirror' mds 'allow r' osd 'allow rw tag cephfs metadata=*, allow r tag cephfs data=*' mgr 'allow r'
Create a user for each file system peer (on the secondary/remote cluster). This user needs
to have full capabilities on the MDS (to take snapshots) and the OSDs::
@ -371,7 +377,7 @@ information. To check which mirror daemon a directory has been mapped to use::
"state": "mapped"
}
.. note:: `instance_id` is the RAODS instance-id associated with a mirror daemon.
.. note:: `instance_id` is the RADOS instance-id associated with a mirror daemon.
Other information such as `state` and `last_shuffled` are interesting when running
multiple mirror daemons.

View File

@ -0,0 +1,426 @@
===============
Deduplication
===============
Introduction
============
Applying data deduplication to an existing software stack is not easy,
because it requires additional metadata management and changes to the
original data processing procedure.
In a typical deduplication system, the input source as a data
object is split into multiple chunks by a chunking algorithm.
The deduplication system then compares each chunk with
the existing chunks already stored in the underlying storage.
To avoid searching all stored contents for every comparison,
the deduplication system employs a fingerprint index that stores
the hash value of each chunk, so that existing chunks can be
found simply by comparing hash values.
There are many challenges in implementing deduplication on top
of Ceph. Two of them are essential: first, managing the scalability of
the fingerprint index; second, ensuring compatibility between the newly
introduced deduplication metadata and the existing metadata.
Key Idea
========
1. Content hashing (double hashing): Each client can find an object's data
for an object ID using CRUSH. With CRUSH, a client knows the object's location
in the Base tier.
By hashing the object's content at the Base tier, a new OID (chunk ID) is generated.
The Chunk tier stores the chunk under the new OID, which holds part of the original object's content.
Client 1 -> OID=1 -> HASH(1's content)=K -> OID=K ->
CRUSH(K) -> chunk's location
2. Self-contained object: An external metadata design
makes integration with existing storage features difficult,
since those features cannot recognize the
additional external data structures. If the data
deduplication system is designed without any external component, the
original storage features can be reused.
More details in https://ieeexplore.ieee.org/document/8416369
Design
======
.. ditaa::
+-------------+
| Ceph Client |
+------+------+
^
Tiering is |
Transparent | Metadata
to Ceph | +---------------+
Client Ops | | |
| +----->+ Base Pool |
| | | |
| | +-----+---+-----+
| | | ^
v v | | Dedup metadata in Base Pool
+------+----+--+ | | (Dedup metadata contains chunk offsets
| Objecter | | | and fingerprints)
+-----------+--+ | |
^ | | Data in Chunk Pool
| v |
| +-----+---+-----+
| | |
+----->| Chunk Pool |
| |
+---------------+
Data
Pool-based object management:
We define two pools.
The metadata pool stores metadata objects and the chunk pool stores
chunk objects. Since these two pools are divided by
purpose and usage, each pool can be managed more
efficiently according to its own characteristics. The base
pool and the chunk pool can separately select a redundancy
scheme (replication or erasure coding) depending on
their usage, and each pool can be placed in a different storage
location depending on the required performance.

For details on how to use this, please see ``osd_internals/manifest.rst``
Usage Patterns
==============
Each Ceph interface layer presents unique opportunities and costs for
deduplication and tiering in general.
RadosGW
-------
S3 big data workloads seem like a good opportunity for deduplication. These
objects tend to be write once, read mostly objects which don't see partial
overwrites. As such, it makes sense to fingerprint and dedup up front.
Unlike cephfs and rbd, radosgw has a system for storing
explicit metadata in the head object of a logical s3 object for
locating the remaining pieces. As such, radosgw could use the
refcounting machinery (``osd_internals/refcount.rst``) directly without
needing direct support from rados for manifests.
RBD/Cephfs
----------
RBD and CephFS both use deterministic naming schemes to partition
block devices/file data over rados objects. As such, the redirection
metadata would need to be included as part of rados, presumably
transparently.
Moreover, unlike radosgw, rbd/cephfs rados objects can see overwrites.
For those objects, we don't really want to perform dedup, and we don't
want to pay a write latency penalty in the hot path to do so anyway.
As such, performing tiering and dedup on cold objects in the background
is likely to be preferred.
One important wrinkle, however, is that both rbd and cephfs workloads
often feature usage of snapshots. This means that the rados manifest
support needs robust support for snapshots.
RADOS Machinery
===============
For more information on rados redirect/chunk/dedup support, see ``osd_internals/manifest.rst``.
For more information on rados refcount support, see ``osd_internals/refcount.rst``.
Status and Future Work
======================
At the moment, there exists some preliminary support for manifest
objects within the OSD as well as a dedup tool.
RadosGW data warehouse workloads probably represent the largest
opportunity for this feature, so the first priority is probably to add
direct support for fingerprinting and redirects into the refcount pool
to radosgw.
Aside from radosgw, completing work on manifest object support in the
OSD particularly as it relates to snapshots would be the next step for
rbd and cephfs workloads.
How to use deduplication
========================
* This feature is highly experimental and is subject to change or removal.
Ceph provides deduplication using RADOS machinery.
Below we explain how to perform deduplication.
Prerequisite
------------
If the Ceph cluster is started from Ceph mainline, users need to check that the
``ceph-test`` package, which includes ceph-dedup-tool, is installed.
Detailed Instructions
---------------------
Users can use ceph-dedup-tool with the ``estimate``, ``sample-dedup``,
``chunk-scrub``, and ``chunk-repair`` operations. For user convenience,
the necessary operations are exposed through ceph-dedup-tool, and they
can be driven freely from any kind of script.
1. Estimate space saving ratio of a target pool using ``ceph-dedup-tool``.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
ceph-dedup-tool --op estimate
--pool [BASE_POOL]
--chunk-size [CHUNK_SIZE]
--chunk-algorithm [fixed|fastcdc]
--fingerprint-algorithm [sha1|sha256|sha512]
--max-thread [THREAD_COUNT]
This CLI command will show how much storage space can be saved when deduplication
is applied to the pool. If the amount of saved space is higher than the user's expectation,
the pool is probably a good candidate for deduplication.

Users should specify the ``BASE_POOL``, within which the objects targeted for deduplication
are stored. Users also need to run ceph-dedup-tool multiple times
with varying ``chunk_size`` values to find the optimal chunk size. Note that the
optimal value probably differs depending on the content of each object when the fastcdc
chunk algorithm (rather than fixed) is used.
Example output:
.. code:: bash
{
"chunk_algo": "fastcdc",
"chunk_sizes": [
{
"target_chunk_size": 8192,
"dedup_bytes_ratio": 0.4897049
"dedup_object_ratio": 34.567315
"chunk_size_average": 64439,
"chunk_size_stddev": 33620
}
],
"summary": {
"examined_objects": 95,
"examined_bytes": 214968649
}
}
The above is an example output when executing ``estimate``. ``target_chunk_size`` is the same as
``chunk_size`` given by the user. ``dedup_bytes_ratio`` shows how many bytes are redundant among the
examined bytes. For instance, 1 - ``dedup_bytes_ratio`` is the fraction of storage space saved.
``dedup_object_ratio`` is the number of generated chunk objects / ``examined_objects``. ``chunk_size_average``
is the average chunk size produced when performing CDC---this may differ from ``target_chunk_size``
because CDC generates different chunk boundaries depending on the content. ``chunk_size_stddev``
represents the standard deviation of the chunk size.
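As a rough illustration using the sample output above and the formula just described, 1 - 0.4897 ≈ 51% of the 214,968,649 examined bytes (roughly 110 MB) would be the expected space saving for the 8192-byte target chunk size.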
2. Create chunk pool.
^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
ceph osd pool create [CHUNK_POOL]
3. Run dedup command (there are two ways).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- **sample-dedup**
.. code:: bash
ceph-dedup-tool --op sample-dedup
--pool [BASE_POOL]
--chunk-pool [CHUNK_POOL]
--chunk-size [CHUNK_SIZE]
--chunk-algorithm [fastcdc]
--fingerprint-algorithm [sha1|sha256|sha512]
--chunk-dedup-threshold [THRESHOLD]
--max-thread [THREAD_COUNT]
--sampling-ratio [SAMPLE_RATIO]
--wakeup-period [WAKEUP_PERIOD]
--loop
--snap
The ``sample-dedup`` command spawns ``THREAD_COUNT`` threads to deduplicate objects in
``BASE_POOL``. The threads sample objects according to ``SAMPLE_RATIO`` (a value of 100 performs a full search)
and deduplicate a chunk only if it is found to be redundant more than ``THRESHOLD`` times during the iteration.
If ``--loop`` is set, the threads wake up again after ``WAKEUP_PERIOD``; if not, the threads exit after one iteration.
Example output:
.. code:: bash
$ bin/ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 303 GiB 294 GiB 9.0 GiB 9.0 GiB 2.99
TOTAL 303 GiB 294 GiB 9.0 GiB 9.0 GiB 2.99
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 577 KiB 2 1.7 MiB 0 97 GiB
base 2 32 2.0 GiB 517 6.0 GiB 2.02 97 GiB
chunk 3 32 0 B 0 0 B 0 97 GiB
$ bin/ceph-dedup-tool --op sample-dedup --pool base --chunk-pool chunk
--fingerprint-algorithm sha1 --chunk-algorithm fastcdc --loop --sampling-ratio 100
--chunk-dedup-threshold 2 --chunk-size 8192 --max-thread 4 --wakeup-period 60
$ bin/ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 303 GiB 298 GiB 5.4 GiB 5.4 GiB 1.80
TOTAL 303 GiB 298 GiB 5.4 GiB 5.4 GiB 1.80
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 577 KiB 2 1.7 MiB 0 98 GiB
base 2 32 452 MiB 262 1.3 GiB 0.50 98 GiB
chunk 3 32 258 MiB 25.91k 938 MiB 0.31 98 GiB
- **object dedup**
.. code:: bash
ceph-dedup-tool --op object-dedup
--pool [BASE_POOL]
--object [OID]
--chunk-pool [CHUNK_POOL]
--fingerprint-algorithm [sha1|sha256|sha512]
--dedup-cdc-chunk-size [CHUNK_SIZE]
The ``object-dedup`` command triggers deduplication on the RADOS object specified by ``OID``.
All parameters shown above must be specified. ``CHUNK_SIZE`` should be taken from
the results of step 1 above.
Note that when this command is executed, ``fastcdc`` is set by default, and other parameters
such as the ``fingerprint-algorithm`` and ``CHUNK_SIZE`` are set as defaults for the pool.
Deduplicated objects will appear in the chunk pool. If the object is mutated over time, the user needs to re-run
``object-dedup`` because the chunk boundaries must be recalculated from the updated contents.
The user needs to specify ``snap`` if the target object is snapshotted. After deduplication is done, the target
object's size in ``BASE_POOL`` is zero (the object is evicted) and chunk objects are generated---these appear in ``CHUNK_POOL``.
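As a sketch, deduplicating a single object named ``testfile2`` in the ``base`` pool with the
chunk size chosen in step 1 (object, pool, and chunk-size values are illustrative) could look like:
.. code:: bash
   ceph-dedup-tool --op object-dedup \
     --pool base \
     --object testfile2 \
     --chunk-pool chunk \
     --fingerprint-algorithm sha1 \
     --dedup-cdc-chunk-size 8192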
4. Read/write I/Os
^^^^^^^^^^^^^^^^^^
After step 3, users do not need to do anything special for I/O. Deduplicated objects are
fully compatible with existing RADOS operations.
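For example, a plain ``rados`` read of a deduplicated object works exactly as it does for
any other object (pool and object names follow the earlier examples):
.. code:: bash
   rados -p base get testfile2 /tmp/testfile2.out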
5. Run scrub to fix reference count
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
On rare occasions, false positives in reference-count handling for deduplicated RADOS
objects can leave reference mismatches. These mismatches can be fixed by periodically scrubbing the pool:
.. code:: bash
ceph-dedup-tool --op chunk-scrub
--chunk-pool [CHUNK_POOL]
--pool [POOL]
--max-thread [THREAD_COUNT]
The ``chunk-scrub`` command identifies reference mismatches between a
metadata object and a chunk object. The ``chunk-pool`` parameter tells
ceph-dedup-tool where the target chunk objects are located.
Example output:
A reference mismatch is intentionally created by inserting a reference (dummy-obj) into a chunk object (2ac67f70d3dd187f8f332bb1391f61d4e5c9baae) by using chunk-get-ref.
.. code:: bash
$ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
{
"type": "by_object",
"count": 2,
"refs": [
{
"oid": "testfile2",
"key": "",
"snapid": -2,
"hash": 2905889452,
"max": 0,
"pool": 2,
"namespace": ""
},
{
"oid": "dummy-obj",
"key": "",
"snapid": -2,
"hash": 1203585162,
"max": 0,
"pool": 2,
"namespace": ""
}
]
}
$ bin/ceph-dedup-tool --op chunk-scrub --chunk-pool chunk --max-thread 10
10 seconds is set as report period by default
join
join
2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
--done--
2ac67f70d3dd187f8f332bb1391f61d4e5c9baae ref 10:5102bde2:::dummy-obj:head: referencing pool does not exist
--done--
Total object : 1
Examined object : 1
Damaged object : 1
6. Repair a mismatched chunk reference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If any reference mismatches are found by ``chunk-scrub``, it is
recommended to run the ``chunk-repair`` operation, which resolves the
mismatch and restores consistency.
.. code:: bash
ceph-dedup-tool --op chunk-repair
--chunk-pool [CHUNK_POOL_NAME]
--object [CHUNK_OID]
--target-ref [TARGET_OID]
--target-ref-pool-id [TARGET_POOL_ID]
``chunk-repair`` fixes the ``target-ref``, which is an incorrect reference held by
an ``object``. To fix it correctly, the user must supply the correct
``TARGET_OID`` and ``TARGET_POOL_ID``.
.. code:: bash
$ bin/ceph-dedup-tool --op chunk-repair --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae --target-ref dummy-obj --target-ref-pool-id 10
2ac67f70d3dd187f8f332bb1391f61d4e5c9baae has 1 references for dummy-obj
dummy-obj has 0 references for 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
fix dangling reference from 1 to 0
$ bin/ceph-dedup-tool --op dump-chunk-refs --chunk-pool chunk --object 2ac67f70d3dd187f8f332bb1391f61d4e5c9baae
{
"type": "by_object",
"count": 1,
"refs": [
{
"oid": "testfile2",
"key": "",
"snapid": -2,
"hash": 2905889452,
"max": 0,
"pool": 2,
"namespace": ""
}
]
}

View File

@ -1,3 +1,5 @@
.. _dev_deploying_a_development_cluster:
=================================
Deploying a development cluster
=================================

View File

@ -50,11 +50,10 @@ optional Ceph internal services are started automatically when it is used to
start a Ceph cluster. vstart is the basis for the three most commonly used
development environments in Ceph Dashboard.
You can read more about vstart in `Deploying a development cluster`_.
Additional information for developers can also be found in the `Developer
Guide`_.
You can read more about vstart in :ref:`Deploying a development cluster
<dev_deploying_a_development_cluster>`. Additional information for developers
can also be found in the `Developer Guide`_.
.. _Deploying a development cluster: https://docs.ceph.com/docs/master/dev/dev_cluster_deployement/
.. _Developer Guide: https://docs.ceph.com/docs/master/dev/quick_guide/
Host-based vs Docker-based Development Environments
@ -1269,7 +1268,6 @@ Tests can be found under the `a11y folder <./src/pybind/mgr/dashboard/frontend/c
beforeEach(() => {
cy.login();
Cypress.Cookies.preserveOnce('token');
shared.navigateTo();
});

View File

@ -55,7 +55,7 @@ using `vstart_runner.py`_. To do that, you'd need `teuthology`_ installed::
$ virtualenv --python=python3 venv
$ source venv/bin/activate
$ pip install 'setuptools >= 12'
$ pip install git+https://github.com/ceph/teuthology#egg=teuthology[test]
$ pip install teuthology[test]@git+https://github.com/ceph/teuthology
$ deactivate
The above steps installs teuthology in a virtual environment. Before running

View File

@ -3,9 +3,74 @@ Serialization (encode/decode)
=============================
When a structure is sent over the network or written to disk, it is
encoded into a string of bytes. Serializable structures have
``encode`` and ``decode`` methods that write and read from ``bufferlist``
objects representing byte strings.
encoded into a string of bytes. Usually (but not always -- multiple
serialization facilities coexist in Ceph) serializable structures
have ``encode`` and ``decode`` methods that write and read from
``bufferlist`` objects representing byte strings.
Terminology
-----------
It is best to think not in terms of daemons and clients but in terms of
encoders and decoders. An encoder serializes a structure into a bufferlist
while a decoder does the opposite.
Encoders and decoders are collectively referred to as dencoders.
Dencoders (both encoders and decoders) live within daemons and clients.
For instance, when an RBD client issues an IO operation, it prepares
an instance of the ``MOSDOp`` structure and encodes it into a bufferlist
that is put on the wire.
An OSD reads these bytes and decodes them back into an ``MOSDOp`` instance.
Here the encoder is used by the client and the decoder by the OSD. However,
these roles can swap -- consider the handling of the response: the OSD encodes
the ``MOSDOpReply`` while the RBD client decodes it.
Encoders and decoders operate according to a format that the programmer
defines by implementing the ``encode`` and ``decode`` methods.
Principles for format change
----------------------------
It is not unusual for the serialization format to change. This
process requires careful attention during both development
and review.
The general rule is that a decoder must understand what was
encoded by an encoder. Most of the problems come from ensuring
that compatibility is maintained between old decoders and new encoders
as well as between new decoders and old encoders. One should assume
that -- unless stated otherwise -- any mix of old and new is
possible in a cluster. There are two main reasons for that:
1. Upgrades. Although there are recommendations related to the order
of entity types (mons/osds/clients), it is not mandatory and
no assumption should be made about it.
2. Huge variability of client versions. Kernel (and thus kernel client)
upgrades have always been decoupled from Ceph upgrades. Moreover,
the proliferation of containerization brings this variability even to
user-space libraries such as ``librbd`` -- they now live inside their
own containers.
That said, there are a few rules limiting the degree
of interoperability between dencoders:
* ``n-2`` for dencoding between daemons,
* ``n-3`` hard requirement for client-involved scenarios,
* ``n-3..`` soft requirement for client-involved scenarios. Ideally,
every client should be able to talk to any version of the daemons.
As the underlying reasons are the same, the rules dencoders
follow are virtually the same as for deprecation of our feature
bits. See the ``Notes on deprecation`` in ``src/include/ceph_features.h``.
Frameworks
----------
Currently, multiple genres of dencoding helpers coexist:
* ``encoding.h`` (the most widespread one),
* ``denc.h`` (performance-optimized, seen mostly in ``BlueStore``),
* the ``Message`` hierarchy.
Although the details vary, the interoperability rules stay the same.
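Regardless of the framework used, encodings can be inspected with the
``ceph-dencoder`` utility that ships with Ceph. A rough sketch of such a session
(the type name and file path below are only illustrative):
.. code-block:: bash
   # list the structures ceph-dencoder knows how to decode
   ceph-dencoder list_types
   # decode a previously captured encoding of an object_info_t and dump it as JSON
   ceph-dencoder type object_info_t import /tmp/object_info.bin decode dump_json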
Adding a field to a structure
-----------------------------
@ -93,3 +158,69 @@ because we might still be passed older-versioned messages that do not
have the field. The ``struct_v`` variable is a local set by the ``DECODE_START``
macro.
Into the weeds
--------------
The append-extensibility of our dencoders is a result of the forward
compatibility that the ``ENCODE_START`` and ``DECODE_FINISH`` macros bring.
They implement the extensibility facilities. An encoder, when filling
the bufferlist, prepends three fields: the version of the current format,
the minimal version of a decoder compatible with it, and the total size of
all encoded fields.
.. code-block:: cpp
/**
* start encoding block
*
* @param v current (code) version of the encoding
* @param compat oldest code version that can decode it
* @param bl bufferlist to encode to
*
*/
#define ENCODE_START(v, compat, bl) \
__u8 struct_v = v; \
__u8 struct_compat = compat; \
ceph_le32 struct_len; \
auto filler = (bl).append_hole(sizeof(struct_v) + \
sizeof(struct_compat) + sizeof(struct_len)); \
const auto starting_bl_len = (bl).length(); \
using ::ceph::encode; \
do {
The ``struct_len`` field allows the decoder to consume all the bytes that were
left undecoded by the user-provided ``decode`` implementation.
Analogously, the decoder tracks how much input has been consumed by the
user-provided ``decode`` methods.
.. code-block:: cpp
#define DECODE_START(bl) \
unsigned struct_end = 0; \
__u32 struct_len; \
decode(struct_len, bl); \
... \
struct_end = bl.get_off() + struct_len; \
} \
do {
The decoder uses this information to discard the extra bytes it does not
understand. Advancing the bufferlist is critical because dencoders tend to be
nested; leaving it intact would work only for the very last ``decode`` call
in a nested structure.
.. code-block:: cpp
#define DECODE_FINISH(bl) \
} while (false); \
if (struct_end) { \
... \
if (bl.get_off() < struct_end) \
bl += struct_end - bl.get_off(); \
}
This cooperative mechanism allows newer encoder revisions to generate a longer
byte stream (e.g. by adding a new field at the end) without worrying that the
residue will crash older decoder revisions.

View File

@ -16,32 +16,6 @@ mgr module
The following diagrams outline the involved parties and how they interact when the clients
query for the reports:
.. seqdiag::
seqdiag {
default_note_color = lightblue;
osd; mon; ceph-cli;
osd => mon [ label = "update osdmap service" ];
osd => mon [ label = "update osdmap service" ];
ceph-cli -> mon [ label = "send 'health' command" ];
mon -> mon [ leftnote = "gather checks from services" ];
ceph-cli <-- mon [ label = "checks and mutes" ];
}
.. seqdiag::
seqdiag {
default_note_color = lightblue;
osd; mon; mgr; mgr-module;
mgr -> mon [ label = "subscribe for 'mgrdigest'" ];
osd => mon [ label = "update osdmap service" ];
osd => mon [ label = "update osdmap service" ];
mon -> mgr [ label = "send MMgrDigest" ];
mgr -> mgr [ note = "update cluster state" ];
mon <-- mgr;
mgr-module -> mgr [ label = "mgr.get('health')" ];
mgr-module <-- mgr [ label = "heath reports in json" ];
}
Where are the Reports Generated
===============================
@ -68,19 +42,6 @@ later loaded and decoded, so they can be collected on demand. When it comes to
``MDSMonitor``, it persists the health metrics in the beacon sent by the MDS daemons,
and prepares health reports when storing the pending changes.
.. seqdiag::
seqdiag {
default_note_color = lightblue;
mds; mon-mds; mon-health; ceph-cli;
mds -> mon-mds [ label = "send beacon" ];
mon-mds -> mon-mds [ note = "store health metrics in beacon" ];
mds <-- mon-mds;
mon-mds -> mon-mds [ note = "encode_health(checks)" ];
ceph-cli -> mon-health [ label = "send 'health' command" ];
mon-health => mon-mds [ label = "gather health checks" ];
ceph-cli <-- mon-health [ label = "checks and mutes" ];
}
So, if we want to add a new warning related to cephfs, probably the best place to
start is ``MDSMonitor::encode_pending()``, where health reports are collected from
@ -106,23 +67,3 @@ metrics and status to mgr using ``MMgrReport``. On the mgr side, it periodically
an aggregated report to the ``MgrStatMonitor`` service on mon. As explained earlier,
this service just persists the health reports in the aggregated report to the monstore.
.. seqdiag::
seqdiag {
default_note_color = lightblue;
service; mgr; mon-mgr-stat; mon-health;
service -> mgr [ label = "send(open)" ];
mgr -> mgr [ note = "register the new service" ];
service <-- mgr;
mgr => service [ label = "send(configure)" ];
service -> mgr [ label = "send(report)" ];
mgr -> mgr [ note = "update/aggregate service metrics" ];
service <-- mgr;
service => mgr [ label = "send(report)" ];
mgr -> mon-mgr-stat [ label = "send(mgr-report)" ];
mon-mgr-stat -> mon-mgr-stat [ note = "store health checks in the report" ];
mgr <-- mon-mgr-stat;
mon-health => mon-mgr-stat [ label = "gather health checks" ];
service => mgr [ label = "send(report)" ];
service => mgr [ label = "send(close)" ];
}

View File

@ -87,7 +87,8 @@ Optionals are represented as a presence byte, followed by the item if it exists.
T element[present? 1 : 0]; // Only if present is non-zero.
}
Optionals are used to encode ``boost::optional``.
Optionals are used to encode ``boost::optional`` and, since introducing
C++17 to Ceph, ``std::optional``.
Pair
----

View File

@ -5,7 +5,7 @@ jerasure plugin
Introduction
------------
The parameters interpreted by the jerasure plugin are:
The parameters interpreted by the ``jerasure`` plugin are:
::
@ -31,3 +31,5 @@ upstream repositories `http://jerasure.org/jerasure/jerasure
`http://jerasure.org/jerasure/gf-complete
<http://jerasure.org/jerasure/gf-complete>`_ . The difference
between the two, if any, should match pull requests against upstream.
Note that as of 2023, the ``jerasure.org`` web site may no longer be
legitimate and/or associated with the original project.

View File

@ -114,29 +114,6 @@ baseline throughput for each device type was determined:
256 KiB. For HDDs, it was 40MiB. The above throughput was obtained
by running 4 KiB random writes at a queue depth of 64 for 300 secs.
Factoring I/O Cost in mClock
============================
The services using mClock have a cost associated with them. The cost can be
different for each service type. The mClock scheduler factors in the cost
during calculations for parameters like *reservation*, *weight* and *limit*.
The calculations determine when the next op for the service type can be
dequeued from the operation queue. In general, the higher the cost, the longer
an op remains in the operation queue.
A cost modeling study was performed to determine the cost per I/O and the cost
per byte for SSD and HDD device types. The following cost specific options are
used under the hood by mClock,
- :confval:`osd_mclock_cost_per_io_usec`
- :confval:`osd_mclock_cost_per_io_usec_hdd`
- :confval:`osd_mclock_cost_per_io_usec_ssd`
- :confval:`osd_mclock_cost_per_byte_usec`
- :confval:`osd_mclock_cost_per_byte_usec_hdd`
- :confval:`osd_mclock_cost_per_byte_usec_ssd`
See :doc:`/rados/configuration/mclock-config-ref` for more details.
MClock Profile Allocations
==========================

View File

@ -0,0 +1,93 @@
=============
PastIntervals
=============
Purpose
-------
There are two situations where we need to consider the set of all acting-set
OSDs for a PG back to some epoch ``e``:
* During peering, we need to consider the acting set for every epoch back to
``last_epoch_started``, the last epoch in which the PG completed peering and
became active.
(see :doc:`/dev/osd_internals/last_epoch_started` for a detailed explanation)
* During recovery, we need to consider the acting set for every epoch back to
``last_epoch_clean``, the last epoch at which all of the OSDs in the acting
set were fully recovered, and the acting set was full.
For either of these purposes, we could build such a set by iterating backwards
from the current OSDMap to the relevant epoch. Instead, we maintain a structure
PastIntervals for each PG.
An ``interval`` is a contiguous sequence of OSDMap epochs where the PG mapping
didn't change. This includes changes to the acting set, the up set, the
primary, and several other parameters fully spelled out in
PastIntervals::check_new_interval.
Maintenance and Trimming
------------------------
The PastIntervals structure stores a record for each ``interval`` back to
last_epoch_clean. On each new ``interval`` (See AdvMap reactions,
PeeringState::should_restart_peering, and PeeringState::start_peering_interval)
each OSD with the PG will add the new ``interval`` to its local PastIntervals.
Activation messages to OSDs which do not already have the PG contain the
sender's PastIntervals so that the recipient needn't rebuild it. (See
PeeringState::activate needs_past_intervals).
PastIntervals are trimmed in two places. First, when the primary marks the
PG clean, it clears its past_intervals instance
(PeeringState::try_mark_clean()). The replicas will do the same thing when
they receive the info (See PeeringState::update_history).
The second, more complex, case is in PeeringState::start_peering_interval. In
the event of a "map gap", we assume that the PG actually has gone clean, but we
haven't received a pg_info_t with the updated ``last_epoch_clean`` value yet.
To explain this behavior, we need to discuss OSDMap trimming.
OSDMap Trimming
---------------
OSDMaps are created by the Monitor quorum and gossiped out to the OSDs. The
Monitor cluster also determines when OSDs (and the Monitors) are allowed to
trim old OSDMap epochs. For the reasons explained above in this document, the
primary constraint is that we must retain all OSDMaps back to some epoch such
that all PGs have been clean at that or a later epoch (min_last_epoch_clean).
(See OSDMonitor::get_trim_to).
The Monitor quorum determines min_last_epoch_clean through MOSDBeacon messages
sent periodically by each OSD. Each message contains a set of PGs for which
the OSD is primary at that moment as well as the min_last_epoch_clean across
that set. The Monitors track these values in OSDMonitor::last_epoch_clean.
There is a subtlety in the min_last_epoch_clean value used by the OSD to
populate the MOSDBeacon. OSD::collect_pg_stats invokes PG::with_pg_stats to
obtain the lec value, which actually uses
pg_stat_t::get_effective_last_epoch_clean() rather than
info.history.last_epoch_clean. If the PG is currently clean,
pg_stat_t::get_effective_last_epoch_clean() is the current epoch rather than
last_epoch_clean -- this works because the PG is clean at that epoch and it
allows OSDMaps to be trimmed during periods where OSDMaps are being created
(due to snapshot activity, perhaps), but no PGs are undergoing ``interval``
changes.
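One way to observe these bounds on a running cluster (a sketch only; field names
and output formats can vary between releases) is to compare the monitors' current
epoch with an individual OSD's superblock state:
.. code-block:: bash
   # current OSDMap epoch as seen by the monitors
   ceph osd dump | head -n 1
   # a specific OSD's view of the oldest and newest maps it still holds
   # (see the "oldest_map" and "newest_map" fields); run on that OSD's host
   ceph daemon osd.0 status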
Back to PastIntervals
---------------------
We can now understand our second trimming case above. If OSDMaps have been
trimmed up to epoch ``e``, we know that the PG must have been clean at some epoch
>= ``e`` (indeed, **all** PGs must have been), so we can drop our PastIntervals.
This dependency also pops up in PeeringState::check_past_interval_bounds().
PeeringState::get_required_past_interval_bounds takes as a parameter
oldest_epoch, which comes from OSDSuperblock::cluster_osdmap_trim_lower_bound.
We use cluster_osdmap_trim_lower_bound rather than a specific OSD's oldest_map
because an OSD does not necessarily trim all the way up to
MOSDMap::cluster_osdmap_trim_lower_bound immediately.
In order to avoid doing too much work at once we limit the amount of osdmaps
trimmed using ``osd_target_transaction_size`` in OSD::trim_maps().
For this reason, a specific OSD's oldest_map can lag behind
OSDSuperblock::cluster_osdmap_trim_lower_bound
for a while.
See https://tracker.ceph.com/issues/49689 for an example.

View File

@ -28,8 +28,8 @@ Premier
-------
* `Bloomberg <https://bloomberg.com>`_
* `China Mobile <https://www.chinamobileltd.com/>`_
* `DigitalOcean <https://www.digitalocean.com/>`_
* `Clyso <https://www.clyso.com/en/>`_
* `IBM <https://ibm.com>`_
* `Intel <http://www.intel.com/>`_
* `OVH <https://www.ovh.com/>`_
* `Red Hat <https://www.redhat.com/>`_
@ -37,16 +37,16 @@ Premier
* `SoftIron <https://www.softiron.com/>`_
* `SUSE <https://www.suse.com/>`_
* `Western Digital <https://www.wdc.com/>`_
* `XSKY <https://www.xsky.com/en/>`_
* `ZTE <https://www.zte.com.cn/global/>`_
General
-------
* `42on <https://www.42on.com/>`_
* `Akamai <https://www.akamai.com/>`_
* `ARM <http://www.arm.com/>`_
* `Canonical <https://www.canonical.com/>`_
* `Cloudbase Solutions <https://cloudbase.it/>`_
* `Clyso <https://www.clyso.com/en/>`_
* `CloudFerro <https://cloudferro.com/>`_
* `croit <http://www.croit.io/>`_
* `EasyStack <https://www.easystack.io/>`_
* `ISS <http://iss-integration.com/>`_
@ -97,22 +97,17 @@ Members
-------
* Anjaneya "Reddy" Chagam (Intel)
* Dan van der Ster (CERN) - Associate member representative
* Haomai Wang (XSKY)
* James Page (Canonical)
* Lenz Grimmer (SUSE) - Ceph Leadership Team representative
* Lars Marowsky-Bree (SUSE)
* Carlos Maltzahn (UCSC) - Associate member representative
* Dan van der Ster (Clyso) - Ceph Council representative
* Joachim Kraftmayer (Clyso)
* Josh Durgin (IBM) - Ceph Council representative
* Matias Bjorling (Western Digital)
* Matthew Leonard (Bloomberg)
* Mike Perez (Red Hat) - Ceph community manager
* Myoungwon Oh (Samsung Electronics)
* Martin Verges (croit) - General member representative
* Pawel Sadowski (OVH)
* Phil Straw (SoftIron)
* Robin Johnson (DigitalOcean)
* Sage Weil (Red Hat) - Ceph project leader
* Xie Xingguo (ZTE)
* Zhang Shaowen (China Mobile)
* Vincent Hsu (IBM)
Joining
=======

View File

@ -12,12 +12,13 @@
:ref:`BlueStore<rados_config_storage_devices_bluestore>`
OSD BlueStore is a storage back end used by OSD daemons, and
was designed specifically for use with Ceph. BlueStore was
introduced in the Ceph Kraken release. In the Ceph Luminous
release, BlueStore became Ceph's default storage back end,
supplanting FileStore. Unlike :term:`filestore`, BlueStore
stores objects directly on Ceph block devices without any file
system interface. Since Luminous (12.2), BlueStore has been
Ceph's default and recommended storage back end.
introduced in the Ceph Kraken release. The Luminous release of
Ceph promoted BlueStore to the default OSD back end,
supplanting FileStore. As of the Reef release, FileStore is no
longer available as a storage backend.
BlueStore stores objects directly on Ceph block devices without
a mounted file system.
Bucket
In the context of :term:`RGW`, a bucket is a group of objects.
@ -187,9 +188,13 @@
applications, Ceph Users, and :term:`Ceph Client`\s. Ceph
Storage Clusters receive data from :term:`Ceph Client`\s.
cephx
The Ceph authentication protocol. Cephx operates like Kerberos,
but it has no single point of failure.
CephX
The Ceph authentication protocol. CephX authenticates users and
daemons. CephX operates like Kerberos, but it has no single
point of failure. See the :ref:`High-availability
Authentication section<arch_high_availability_authentication>`
of the Architecture document and the :ref:`CephX Configuration
Reference<rados-cephx-config-ref>`.
Client
A client is any program external to Ceph that uses a Ceph
@ -248,6 +253,9 @@
Any single machine or server in a Ceph Cluster. See :term:`Ceph
Node`.
Hybrid OSD
Refers to an OSD that has both HDD and SSD drives.
LVM tags
Extensible metadata for LVM volumes and groups. It is used to
store Ceph-specific information about devices and its
@ -302,12 +310,33 @@
state of a multi-site configuration. When the period is updated,
the "epoch" is said thereby to have been changed.
Placement Groups (PGs)
Placement groups (PGs) are subsets of each logical Ceph pool.
Placement groups perform the function of placing objects (as a
group) into OSDs. Ceph manages data internally at
placement-group granularity: this scales better than would
managing individual (and therefore more numerous) RADOS
objects. A cluster that has a larger number of placement groups
(for example, 100 per OSD) is better balanced than an otherwise
identical cluster with a smaller number of placement groups.
Ceph's internal RADOS objects are each mapped to a specific
placement group, and each placement group belongs to exactly
one Ceph pool.
:ref:`Pool<rados_pools>`
A pool is a logical partition used to store objects.
Pools
See :term:`pool`.
:ref:`Primary Affinity <rados_ops_primary_affinity>`
The characteristic of an OSD that governs the likelihood that
a given OSD will be selected as the primary OSD (or "lead
OSD") in an acting set. Primary affinity was introduced in
Firefly (v. 0.80). See :ref:`Primary Affinity
<rados_ops_primary_affinity>`.
RADOS
**R**\eliable **A**\utonomic **D**\istributed **O**\bject
**S**\tore. RADOS is the object store that provides a scalable
@ -370,6 +399,28 @@
Amazon S3 RESTful API and the OpenStack Swift API. Also called
"RADOS Gateway" and "Ceph Object Gateway".
scrubs
The processes by which Ceph ensures data integrity. During the
process of scrubbing, Ceph generates a catalog of all objects
in a placement group, then ensures that none of the objects are
missing or mismatched by comparing each primary object against
its replicas, which are stored across other OSDs. Any PG
that is determined to have a copy of an object that differs
from the other copies, or that is missing a copy entirely, is marked
"inconsistent" (that is, the PG is marked "inconsistent").
There are two kinds of scrubbing: light scrubbing and deep
scrubbing (also called "normal scrubbing" and "deep scrubbing",
respectively). Light scrubbing is performed daily and does
nothing more than confirm that a given object exists and that
its metadata is correct. Deep scrubbing is performed weekly and
reads the data and uses checksums to ensure data integrity.
See :ref:`Scrubbing <rados_config_scrubbing>` in the RADOS OSD
Configuration Reference Guide and page 141 of *Mastering Ceph,
second edition* (Fisk, Nick. 2019).
secrets
Secrets are credentials used to perform digital authentication
whenever privileged users must access systems that require
@ -387,6 +438,12 @@
Teuthology
The collection of software that performs scripted tests on Ceph.
User
An individual or a system actor (for example, an application)
that uses Ceph clients to interact with the :term:`Ceph Storage
Cluster`. See :ref:`User<rados-ops-user>` and :ref:`User
Management<user-management>`.
Zone
In the context of :term:`RGW`, a zone is a logical group that
consists of one or more :term:`RGW` instances. A zone's

View File

@ -53,9 +53,8 @@ the CLT itself.
Current CLT members are:
* Casey Bodley <cbodley@redhat.com>
* Dan van der Ster <daniel.vanderster@cern.ch>
* David Galloway <dgallowa@redhat.com>
* David Orman <ormandj@iland.com>
* Dan van der Ster <dan.vanderster@clyso.com>
* David Orman <ormandj@1111systems.com>
* Ernesto Puerta <epuerta@redhat.com>
* Gregory Farnum <gfarnum@redhat.com>
* Haomai Wang <haomai@xsky.com>

View File

@ -4,13 +4,19 @@
Ceph delivers **object, block, and file storage in one unified system**.
.. warning::
.. warning::
:ref:`If this is your first time using Ceph, read the "Basic Workflow"
page in the Ceph Developer Guide to learn how to contribute to the
Ceph project. (Click anywhere in this paragraph to read the "Basic
:ref:`If this is your first time using Ceph, read the "Basic Workflow"
page in the Ceph Developer Guide to learn how to contribute to the
Ceph project. (Click anywhere in this paragraph to read the "Basic
Workflow" page of the Ceph Developer Guide.) <basic workflow dev guide>`.
.. note::
:ref:`If you want to make a commit to the documentation but you don't
know how to get started, read the "Documenting Ceph" page. (Click anywhere
in this paragraph to read the "Documenting Ceph" page.) <documenting_ceph>`.
.. container:: columns-3
.. container:: column
@ -104,6 +110,7 @@ about Ceph, see our `Architecture`_ section.
radosgw/index
mgr/index
mgr/dashboard
monitoring/index
api/index
architecture
Developer Guide <dev/developer_guide/index>

View File

@ -36,6 +36,22 @@ Options
Perform a selftest. This mode performs a sanity check of ``stats`` module.
.. option:: --conffile [CONFFILE]
Path to cluster configuration file
.. option:: -d [DELAY], --delay [DELAY]
Refresh interval in seconds (default: 1)
.. option:: --dump
Dump the metrics to stdout
.. option:: --dumpfs <fs_name>
Dump the metrics of the given filesystem to stdout
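For example, a one-shot metrics dump for a (hypothetical) file system named ``cephfs``
could be obtained with:
.. code-block:: bash
   cephfs-top --dumpfs cephfs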
Descriptions of fields
======================

View File

@ -15,15 +15,15 @@ Synopsis
Description
===========
:program:`radosgw-admin` is a RADOS gateway user administration utility. It
allows creating and modifying users.
:program:`radosgw-admin` is a Ceph Object Gateway user administration utility. It
is used to create and modify users.
Commands
========
:program:`radosgw-admin` utility uses many commands for administration purpose
which are as follows:
:program:`radosgw-admin` utility provides commands for administration purposes
as follows:
:command:`user create`
Create a new user.
@ -32,8 +32,7 @@ which are as follows:
Modify a user.
:command:`user info`
Display information of a user, and any potentially available
subusers and keys.
Display information for a user including any subusers and keys.
:command:`user rename`
Renames a user.
@ -51,7 +50,7 @@ which are as follows:
Check user info.
:command:`user stats`
Show user stats as accounted by quota subsystem.
Show user stats as accounted by the quota subsystem.
:command:`user list`
List all users.
@ -78,10 +77,10 @@ which are as follows:
Remove access key.
:command:`bucket list`
List buckets, or, if bucket specified with --bucket=<bucket>,
list its objects. If bucket specified adding --allow-unordered
removes ordering requirement, possibly generating results more
quickly in buckets with large number of objects.
List buckets, or, if a bucket is specified with --bucket=<bucket>,
list its objects. Adding --allow-unordered
removes the ordering requirement, possibly generating results more
quickly for buckets with large number of objects.
:command:`bucket limit check`
Show bucket sharding stats.
@ -93,8 +92,8 @@ which are as follows:
Unlink bucket from specified user.
:command:`bucket chown`
Link bucket to specified user and update object ACLs.
Use --marker to resume if command gets interrupted.
Change bucket ownership to the specified user and update object ACLs.
Invoke with --marker to resume if the command is interrupted.
:command:`bucket stats`
Returns bucket statistics.
@ -109,12 +108,13 @@ which are as follows:
Rewrite all objects in the specified bucket.
:command:`bucket radoslist`
List the rados objects that contain the data for all objects is
the designated bucket, if --bucket=<bucket> is specified, or
otherwise all buckets.
List the RADOS objects that contain the data for all objects in
the designated bucket, if --bucket=<bucket> is specified.
Otherwise, list the RADOS objects that contain data for all
buckets.
:command:`bucket reshard`
Reshard a bucket.
Reshard a bucket's index.
:command:`bucket sync disable`
Disable bucket sync.
@ -306,16 +306,16 @@ which are as follows:
Run data sync for the specified source zone.
:command:`sync error list`
list sync error.
List sync errors.
:command:`sync error trim`
trim sync error.
Trim sync errors.
:command:`zone rename`
Rename a zone.
:command:`zone placement list`
List zone's placement targets.
List a zone's placement targets.
:command:`zone placement add`
Add a zone placement target.
@ -365,7 +365,7 @@ which are as follows:
List all bucket lifecycle progress.
:command:`lc process`
Manually process lifecycle. If a bucket is specified (e.g., via
Manually process lifecycle transitions. If a bucket is specified (e.g., via
--bucket_id or via --bucket and optional --tenant), only that bucket
is processed.
@ -385,7 +385,7 @@ which are as follows:
List metadata log which is needed for multi-site deployments.
:command:`mdlog trim`
Trim metadata log manually instead of relying on RGWs integrated log sync.
Trim metadata log manually instead of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync.
@ -397,7 +397,7 @@ which are as follows:
:command:`bilog trim`
Trim bucket index log (use start-marker, end-marker) manually instead
of relying on RGWs integrated log sync.
of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync.
@ -405,7 +405,7 @@ which are as follows:
List data log which is needed for multi-site deployments.
:command:`datalog trim`
Trim data log manually instead of relying on RGWs integrated log sync.
Trim data log manually instead of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync.
@ -413,19 +413,19 @@ which are as follows:
Read data log status.
:command:`orphans find`
Init and run search for leaked rados objects.
Init and run search for leaked RADOS objects.
DEPRECATED. See the "rgw-orphan-list" tool.
:command:`orphans finish`
Clean up search for leaked rados objects.
Clean up search for leaked RADOS objects.
DEPRECATED. See the "rgw-orphan-list" tool.
:command:`orphans list-jobs`
List the current job-ids for the orphans search.
List the current orphans search job IDs.
DEPRECATED. See the "rgw-orphan-list" tool.
:command:`role create`
create a new AWS role for use with STS.
Create a new role for use with STS (Security Token Service).
:command:`role rm`
Remove a role.
@ -485,7 +485,7 @@ which are as follows:
Show events in a pubsub subscription
:command:`subscription ack`
Ack (remove) an events in a pubsub subscription
Acknowledge (remove) events in a pubsub subscription
Options
@ -499,7 +499,8 @@ Options
.. option:: -m monaddress[:port]
Connect to specified monitor (instead of looking through ceph.conf).
Connect to specified monitor (instead of selecting one
from ceph.conf).
.. option:: --tenant=<tenant>
@ -507,19 +508,19 @@ Options
.. option:: --uid=uid
The radosgw user ID.
The user on which to operate.
.. option:: --new-uid=uid
ID of the new user. Used with 'user rename' command.
The new ID of the user. Used with 'user rename' command.
.. option:: --subuser=<name>
Name of the subuser.
Name of the subuser.
.. option:: --access-key=<key>
S3 access key.
S3 access key.
.. option:: --email=email
@ -531,28 +532,29 @@ Options
.. option:: --gen-access-key
Generate random access key (for S3).
Generate random access key (for S3).
.. option:: --gen-secret
Generate random secret key.
Generate random secret key.
.. option:: --key-type=<type>
key type, options are: swift, s3.
Key type, options are: swift, s3.
.. option:: --temp-url-key[-2]=<key>
Temporary url key.
Temporary URL key.
.. option:: --max-buckets
max number of buckets for a user (0 for no limit, negative value to disable bucket creation).
Default is 1000.
Maximum number of buckets for a user (0 for no limit, negative value to disable bucket creation).
Default is 1000.
.. option:: --access=<access>
Set the access permissions for the sub-user.
Set the access permissions for the subuser.
Available access permissions are read, write, readwrite and full.
.. option:: --display-name=<name>
@ -600,24 +602,24 @@ Options
.. option:: --bucket-new-name=[tenant-id/]<bucket>
Optional for `bucket link`; use to rename a bucket.
While tenant-id/ can be specified, this is never
necessary for normal operation.
While the tenant-id can be specified, this is not
necessary in normal operation.
.. option:: --shard-id=<shard-id>
Optional for mdlog list, bi list, data sync status. Required for ``mdlog trim``.
Optional for mdlog list, bi list, data sync status. Required for ``mdlog trim``.
.. option:: --max-entries=<entries>
Optional for listing operations to specify the max entries.
Optional for listing operations to specify the max entries.
.. option:: --purge-data
When specified, user removal will also purge all the user data.
When specified, user removal will also purge the user's data.
.. option:: --purge-keys
When specified, subuser removal will also purge all the subuser keys.
When specified, subuser removal will also purge the subuser's keys.
.. option:: --purge-objects
@ -625,7 +627,7 @@ Options
.. option:: --metadata-key=<key>
Key to retrieve metadata from with ``metadata get``.
Key from which to retrieve metadata, used with ``metadata get``.
.. option:: --remote=<remote>
@ -633,11 +635,11 @@ Options
.. option:: --period=<id>
Period id.
Period ID.
.. option:: --url=<url>
url for pushing/pulling period or realm.
URL for pushing/pulling period or realm.
.. option:: --epoch=<number>
@ -657,7 +659,7 @@ Options
.. option:: --master-zone=<id>
Master zone id.
Master zone ID.
.. option:: --rgw-realm=<name>
@ -665,11 +667,11 @@ Options
.. option:: --realm-id=<id>
The realm id.
The realm ID.
.. option:: --realm-new-name=<name>
New name of realm.
New name for the realm.
.. option:: --rgw-zonegroup=<name>
@ -677,7 +679,7 @@ Options
.. option:: --zonegroup-id=<id>
The zonegroup id.
The zonegroup ID.
.. option:: --zonegroup-new-name=<name>
@ -685,11 +687,11 @@ Options
.. option:: --rgw-zone=<zone>
Zone in which radosgw is running.
Zone in which the gateway is running.
.. option:: --zone-id=<id>
The zone id.
The zone ID.
.. option:: --zone-new-name=<name>
@ -709,7 +711,7 @@ Options
.. option:: --placement-id
Placement id for the zonegroup placement commands.
Placement ID for the zonegroup placement commands.
.. option:: --tags=<list>
@ -737,7 +739,7 @@ Options
.. option:: --data-extra-pool=<pool>
The placement target data extra (non-ec) pool.
The placement target data extra (non-EC) pool.
.. option:: --placement-index-type=<type>
@ -765,11 +767,11 @@ Options
.. option:: --sync-from=[zone-name][,...]
Set the list of zones to sync from.
Set the list of zones from which to sync.
.. option:: --sync-from-rm=[zone-name][,...]
Remove the zones from list of zones to sync from.
Remove zone(s) from list of zones from which to sync.
.. option:: --bucket-index-max-shards
@ -780,71 +782,71 @@ Options
.. option:: --fix
Besides checking bucket index, will also fix it.
Fix the bucket index in addition to checking it.
.. option:: --check-objects
bucket check: Rebuilds bucket index according to actual objects state.
Bucket check: Rebuilds the bucket index according to actual object state.
.. option:: --format=<format>
Specify output format for certain operations. Supported formats: xml, json.
Specify output format for certain operations. Supported formats: xml, json.
.. option:: --sync-stats
Option for 'user stats' command. When specified, it will update user stats with
the current stats reported by user's buckets indexes.
Option for the 'user stats' command. When specified, it will update user stats with
the current stats reported by the user's buckets indexes.
.. option:: --show-config
Show configuration.
Show configuration.
.. option:: --show-log-entries=<flag>
Enable/disable dump of log entries on log show.
Enable/disable dumping of log entries on log show.
.. option:: --show-log-sum=<flag>
Enable/disable dump of log summation on log show.
Enable/disable dump of log summation on log show.
.. option:: --skip-zero-entries
Log show only dumps entries that don't have zero value in one of the numeric
field.
Log show only dumps entries that don't have zero value in one of the numeric
field.
.. option:: --infile
Specify a file to read in when setting data.
Specify a file to read when setting data.
.. option:: --categories=<list>
Comma separated list of categories, used in usage show.
Comma separated list of categories, used in usage show.
.. option:: --caps=<caps>
List of caps (e.g., "usage=read, write; user=read").
List of capabilities (e.g., "usage=read, write; user=read").
.. option:: --compression=<compression-algorithm>
Placement target compression algorithm (lz4|snappy|zlib|zstd)
Placement target compression algorithm (lz4|snappy|zlib|zstd).
.. option:: --yes-i-really-mean-it
Required for certain operations.
Required as a guardrail for certain destructive operations.
.. option:: --min-rewrite-size
Specify the min object size for bucket rewrite (default 4M).
Specify the minimum object size for bucket rewrite (default 4M).
.. option:: --max-rewrite-size
Specify the max object size for bucket rewrite (default ULLONG_MAX).
Specify the maximum object size for bucket rewrite (default ULLONG_MAX).
.. option:: --min-rewrite-stripe-size
Specify the min stripe size for object rewrite (default 0). If the value
Specify the minimum stripe size for object rewrite (default 0). If the value
is set to 0, then the specified object will always be
rewritten for restriping.
rewritten when restriping.
.. option:: --warnings-only
@ -854,7 +856,7 @@ Options
.. option:: --bypass-gc
When specified with bucket deletion,
triggers object deletions by not involving GC.
triggers object deletion without involving GC.
.. option:: --inconsistent-index
@ -863,25 +865,25 @@ Options
.. option:: --max-concurrent-ios
Maximum concurrent ios for bucket operations. Affects operations that
scan the bucket index, e.g., listing, deletion, and all scan/search
operations such as finding orphans or checking the bucket index.
Default is 32.
Maximum concurrent bucket operations. Affects operations that
scan the bucket index, e.g., listing, deletion, and all scan/search
operations such as finding orphans or checking the bucket index.
The default is 32.
Quota Options
=============
.. option:: --max-objects
Specify max objects (negative value to disable).
Specify the maximum number of objects (negative value to disable).
.. option:: --max-size
Specify max size (in B/K/M/G/T, negative value to disable).
Specify the maximum object size (in B/K/M/G/T, negative value to disable).
.. option:: --quota-scope
The scope of quota (bucket, user).
The scope of quota (bucket, user).
Orphans Search Options
@ -889,16 +891,16 @@ Orphans Search Options
.. option:: --num-shards
Number of shards to use for keeping the temporary scan info
Number of shards to use for temporary scan info
.. option:: --orphan-stale-secs
Number of seconds to wait before declaring an object to be an orphan.
Default is 86400 (24 hours).
Number of seconds to wait before declaring an object to be an orphan.
The default is 86400 (24 hours).
.. option:: --job-id
Set the job id (for orphans find)
Set the job id (for orphans find)
Orphans list-jobs options

View File

@ -53,10 +53,6 @@ Options
Run in foreground, log to usual location
.. option:: --rgw-socket-path=path
Specify a unix domain socket path.
.. option:: --rgw-region=region
The region where radosgw runs
@ -80,30 +76,24 @@ and ``mod_proxy_fcgi`` have to be present in the server. Unlike ``mod_fastcgi``,
or process management may be available in the FastCGI application framework
in use.
``Apache`` can be configured in a way that enables ``mod_proxy_fcgi`` to be used
with localhost tcp or through unix domain socket. ``mod_proxy_fcgi`` that doesn't
support unix domain socket such as the ones in Apache 2.2 and earlier versions of
Apache 2.4, needs to be configured for use with localhost tcp. Later versions of
Apache like Apache 2.4.9 or later support unix domain socket and as such they
allow for the configuration with unix domain socket instead of localhost tcp.
``Apache`` must be configured in a way that enables ``mod_proxy_fcgi`` to be
used with localhost tcp.
The following steps show the configuration in Ceph's configuration file i.e,
``/etc/ceph/ceph.conf`` and the gateway configuration file i.e,
``/etc/httpd/conf.d/rgw.conf`` (RPM-based distros) or
``/etc/apache2/conf-available/rgw.conf`` (Debian-based distros) with localhost
tcp and through unix domain socket:
tcp:
#. For distros with Apache 2.2 and early versions of Apache 2.4 that use
localhost TCP and do not support Unix Domain Socket, append the following
contents to ``/etc/ceph/ceph.conf``::
localhost TCP, append the following contents to ``/etc/ceph/ceph.conf``::
[client.radosgw.gateway]
host = {hostname}
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = ""
log file = /var/log/ceph/client.radosgw.gateway.log
rgw frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
rgw print continue = false
log_file = /var/log/ceph/client.radosgw.gateway.log
rgw_frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
rgw_print_continue = false
#. Add the following content in the gateway configuration file:
@ -149,16 +139,6 @@ tcp and through unix domain socket:
</VirtualHost>
#. For distros with Apache 2.4.9 or later that support Unix Domain Socket,
append the following configuration to ``/etc/ceph/ceph.conf``::
[client.radosgw.gateway]
host = {hostname}
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
log file = /var/log/ceph/client.radosgw.gateway.log
rgw print continue = false
#. Add the following content in the gateway configuration file:
For CentOS/RHEL add in ``/etc/httpd/conf.d/rgw.conf``::
@ -182,10 +162,6 @@ tcp and through unix domain socket:
</VirtualHost>
Please note, ``Apache 2.4.7`` does not have Unix Domain Socket support in
it and as such it has to be configured with localhost tcp. The Unix Domain
Socket support is available in ``Apache 2.4.9`` and later versions.
#. Generate a key for radosgw to use for authentication with the cluster. ::
ceph-authtool -C -n client.radosgw.gateway --gen-key /etc/ceph/keyring.radosgw.gateway

Binary file not shown.

After

Width:  |  Height:  |  Size: 17 KiB

View File

@ -41,14 +41,16 @@ So, prior to start consuming the Ceph API, a valid JSON Web Token (JWT) has to
be obtained, and it may then be reused for subsequent requests. The
``/api/auth`` endpoint will provide the valid token:
.. code-block:: sh
.. prompt:: bash $
$ curl -X POST "https://example.com:8443/api/auth" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Content-Type: application/json" \
-d '{"username": <username>, "password": <password>}'
curl -X POST "https://example.com:8443/api/auth" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Content-Type: application/json" \
-d '{"username": <username>, "password": <password>}'
{ "token": "<redacted_token>", ...}
::
{ "token": "<redacted_token>", ...}
The token obtained must be passed together with every API request in the
``Authorization`` HTTP header::
@ -74,11 +76,11 @@ purpose, Ceph API is built upon the following principles:
An example:
.. code-block:: bash
.. prompt:: bash $
$ curl -X GET "https://example.com:8443/api/osd" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Authorization: Bearer <token>"
curl -X GET "https://example.com:8443/api/osd" \
-H "Accept: application/vnd.ceph.api.v1.0+json" \
-H "Authorization: Bearer <token>"
Specification

Binary file not shown.

After

Width:  |  Height:  |  Size: 67 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 139 KiB

View File

@ -127,62 +127,67 @@ The Ceph Dashboard offers the following monitoring and management capabilities:
Overview of the Dashboard Landing Page
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Displays overall cluster status, performance, and capacity metrics. Shows instant
feedback for changes in the cluster and provides easy access to subpages of the
dashboard.
The landing page of Ceph Dashboard serves as the home page and features metrics
such as the overall cluster status, performance, and capacity. It provides real-time
updates on any changes in the cluster and allows quick access to other sections of the dashboard.
.. image:: dashboard-landing-page.png
.. note::
You can change the landing page to the previous version from:
``Cluster >> Manager Modules >> Dashboard >> Edit``.
Editing the ``FEATURE_TOGGLE_DASHBOARD`` option switches the landing page from one view to the other.
Note that the previous version of the landing page will be disabled in future releases.
.. _dashboard-landing-page-details:
Details
"""""""
Provides an overview of the cluster configuration, displaying various critical aspects of the cluster.
.. image:: details-card.png
.. _dashboard-landing-page-status:
Status
""""""
Provides a visual indication of cluster health, and displays cluster alerts grouped by severity.
* **Cluster Status**: Displays overall cluster health. In case of any error it
displays a short description of the error and provides a link to the logs.
* **Hosts**: Displays the total number of hosts associated to the cluster and
links to a subpage that lists and describes each.
* **Monitors**: Displays mons and their quorum status and
open sessions. Links to a subpage that lists and describes each.
* **OSDs**: Displays object storage daemons (ceph-osds) and
the numbers of OSDs running (up), in service
(in), and out of the cluster (out). Provides links to
subpages providing a list of all OSDs and related management actions.
* **Managers**: Displays active and standby Ceph Manager
daemons (ceph-mgr).
* **Object Gateway**: Displays active object gateways (RGWs) and
provides links to subpages that list all object gateway daemons.
* **Metadata Servers**: Displays active and standby CephFS metadata
service daemons (ceph-mds).
* **iSCSI Gateways**: Display iSCSI gateways available,
active (up), and inactive (down). Provides a link to a subpage
showing a list of all iSCSI Gateways.
.. image:: status-card-open.png
.. _dashboard-landing-page-capacity:
Capacity
""""""""
* **Used**: Displays the used capacity out of the total physical capacity provided by storage nodes (OSDs)
* **Warning**: Displays the `nearfull` threshold of the OSDs
* **Danger**: Displays the `full` threshold of the OSDs
* **Raw Capacity**: Displays the capacity used out of the total
physical capacity provided by storage nodes (OSDs).
* **Objects**: Displays the number and status of RADOS objects
including the percentages of healthy, misplaced, degraded, and unfound
objects.
* **PG Status**: Displays the total number of placement groups and
their status, including the percentage clean, working,
warning, and unknown.
* **Pools**: Displays pools and links to a subpage listing details.
* **PGs per OSD**: Displays the number of placement groups assigned to
object storage daemons.
.. image:: capacity-card.png
.. _dashboard-landing-page-inventory:
Inventory
"""""""""
An inventory for all assets within the cluster.
Provides direct access to subpages of the dashboard from each item of this card.
.. image:: inventory-card.png
.. _dashboard-landing-page-performance:
Performance
"""""""""""
Cluster Utilization
"""""""""""""""""""
* **Used Capacity**: Total capacity used of the cluster. The maximum value of the chart is the maximum capacity of the cluster.
* **IOPS (Input/Output Operations Per Second)**: Number of read and write operations.
* **Latency**: Amount of time that it takes to process a read or a write request.
* **Client Throughput**: Amount of data that clients read or write to the cluster.
* **Recovery Throughput**: Amount of recovery data that clients read or write to the cluster.
* **Client READ/Write**: Displays an overview of
client input and output operations.
* **Client Throughput**: Displays the data transfer rates to and from Ceph clients.
* **Recovery throughput**: Displays rate of cluster healing and balancing operations.
* **Scrubbing**: Displays light and deep scrub status.
.. image:: cluster-utilization-card.png
Supported Browsers
^^^^^^^^^^^^^^^^^^

Binary file not shown.

After

Width:  |  Height:  |  Size: 16 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 22 KiB

View File

@ -24,12 +24,14 @@ see :ref:`nfs-ganesha-config`.
NFS Cluster management
======================
.. _nfs-module-cluster-create:
Create NFS Ganesha Cluster
--------------------------
.. code:: bash
$ ceph nfs cluster create <cluster_id> [<placement>] [--port <port>] [--ingress --virtual-ip <ip>]
$ ceph nfs cluster create <cluster_id> [<placement>] [--ingress] [--virtual_ip <value>] [--ingress-mode {default|keepalive-only|haproxy-standard|haproxy-protocol}] [--port <int>]
This creates a common recovery pool for all NFS Ganesha daemons, a new user based on
``cluster_id``, and a common NFS Ganesha config RADOS object.
@ -94,6 +96,18 @@ of the details of NFS redirecting traffic on the virtual IP to the
appropriate backend NFS servers, and redeploying NFS servers when they
fail.
If a user additionally supplies ``--ingress-mode keepalive-only``, a
partial *ingress* service is deployed that still provides a virtual
IP, but the NFS server binds directly to that virtual IP, leaving out any
load balancing or traffic redirection. This setup restricts
users to deploying only one NFS daemon, because multiple daemons cannot bind
to the same port on the virtual IP.
Providing ``--ingress-mode default`` instead results in the same setup
as not providing the ``--ingress-mode`` flag at all. In this setup, keepalived is
deployed to manage the virtual IP and haproxy is deployed
to handle load balancing and traffic redirection.
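As a sketch, a keepalive-only cluster pinned to a single host might be created like
this (the cluster name, placement, and virtual IP below are placeholders):
.. code:: bash
   $ ceph nfs cluster create mynfs "1 host1" --ingress --virtual_ip 10.0.0.10/24 --ingress-mode keepalive-only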
Enabling ingress via the ``ceph nfs cluster create`` command deploys a
simple ingress configuration with the most common configuration
options. Ingress can also be added to an existing NFS service (e.g.,

View File

@ -18,9 +18,11 @@ for all reporting entities are returned in text exposition format.
Enabling prometheus output
==========================
The *prometheus* module is enabled with::
The *prometheus* module is enabled with:
ceph mgr module enable prometheus
.. prompt:: bash $
ceph mgr module enable prometheus
Configuration
-------------
@ -47,10 +49,10 @@ configurable with ``ceph config set``, with keys
is registered with Prometheus's `registry
<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.
::
ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
ceph config set mgr mgr/prometheus/server_port 9283
.. prompt:: bash $
ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
ceph config set mgr mgr/prometheus/server_port 9283
.. warning::
@ -65,9 +67,11 @@ recommended to use 15 seconds as scrape interval, though, in some cases it
might be useful to increase the scrape interval.
To set a different scrape interval in the Prometheus module, set
``scrape_interval`` to the desired value::
``scrape_interval`` to the desired value:
ceph config set mgr mgr/prometheus/scrape_interval 20
.. prompt:: bash $
ceph config set mgr mgr/prometheus/scrape_interval 20
On large clusters (>1000 OSDs), the time to fetch the metrics may become
significant. Without the cache, the Prometheus manager module could, especially
@ -75,7 +79,7 @@ in conjunction with multiple Prometheus instances, overload the manager and lead
to unresponsive or crashing Ceph manager instances. Hence, the cache is enabled
by default. This means that there is a possibility that the cache becomes
stale. The cache is considered stale when the time to fetch the metrics from
Ceph exceeds the configured :confval:``mgr/prometheus/scrape_interval``.
Ceph exceeds the configured :confval:`mgr/prometheus/scrape_interval`.
If that is the case, **a warning will be logged** and the module will either
@ -86,35 +90,47 @@ This behavior can be configured. By default, it will return a 503 HTTP status
code (service unavailable). You can set other options using the ``ceph config
set`` commands.
To tell the module to respond with possibly stale data, set it to ``return``::
To tell the module to respond with possibly stale data, set it to ``return``:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/stale_cache_strategy return
To tell the module to respond with "service unavailable", set it to ``fail``::
To tell the module to respond with "service unavailable", set it to ``fail``:
ceph config set mgr mgr/prometheus/stale_cache_strategy fail
.. prompt:: bash $
If you are confident that you don't require the cache, you can disable it::
ceph config set mgr mgr/prometheus/stale_cache_strategy fail
ceph config set mgr mgr/prometheus/cache false
If you are confident that you don't require the cache, you can disable it:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/cache false
If you are using the prometheus module behind some kind of reverse proxy or
load balancer, you can simplify discovering the active instance by switching
to ``error``-mode::
to ``error``-mode:
ceph config set mgr mgr/prometheus/standby_behaviour error
.. prompt:: bash $
ceph config set mgr mgr/prometheus/standby_behaviour error
If set, the prometheus module will respond with an HTTP error when requesting ``/``
from the standby instance. The default error code is 500, but you can configure
the HTTP response code with::
the HTTP response code with:
ceph config set mgr mgr/prometheus/standby_error_status_code 503
.. prompt:: bash $
ceph config set mgr mgr/prometheus/standby_error_status_code 503
Valid error codes are between 400-599.
To switch back to the default behaviour, simply set the config key to ``default``::
To switch back to the default behaviour, simply set the config key to ``default``:
ceph config set mgr mgr/prometheus/standby_behaviour default
.. prompt:: bash $
ceph config set mgr mgr/prometheus/standby_behaviour default
.. _prometheus-rbd-io-statistics:
@ -165,9 +181,17 @@ configuration parameter. The parameter is a comma or space separated list
of ``pool[/namespace]`` entries. If the namespace is not specified the
statistics are collected for all namespaces in the pool.
Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::
Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``:
ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
.. prompt:: bash $
ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
The wildcard can be used to indicate all pools or namespaces:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/rbd_stats_pools "*"
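A single namespace within a pool can also be targeted; for example (the pool and namespace names here are hypothetical):

.. prompt:: bash $

   ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1/namespace1 pool2"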
The module makes the list of all available images scanning the specified
pools and namespaces and refreshes it periodically. The period is
@ -176,9 +200,22 @@ parameter (in sec) and is 300 sec (5 minutes) by default. The module will
force refresh earlier if it detects statistics from a previously unknown
RBD image.
Example to turn up the sync interval to 10 minutes::
Example to turn up the sync interval to 10 minutes:
ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
.. prompt:: bash $
ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
Ceph daemon performance counters metrics
-----------------------------------------
With the introduction of the ``ceph-exporter`` daemon, the prometheus module will no longer export Ceph daemon
perf counters as prometheus metrics by default. However, one may re-enable exporting these metrics by setting
the module option ``exclude_perf_counters`` to ``false``:
.. prompt:: bash $
ceph config set mgr mgr/prometheus/exclude_perf_counters false
Statistic names and labels
==========================

View File

@ -2,8 +2,9 @@
RGW Module
============
The rgw module helps with bootstraping and configuring RGW realm
and the different related entities.
The rgw module provides a simple interface to deploy RGW multisite.
It helps with bootstrapping and configuring RGW realm, zonegroup and
the different related entities.
Enabling
--------
@ -18,57 +19,120 @@ RGW Realm Operations
Bootstrapping RGW realm creates a new RGW realm entity, a new zonegroup,
and a new zone. It configures a new system user that can be used for
multisite sync operations, and returns a corresponding token. It sets
up new RGW instances via the orchestrator.
multisite sync operations. Under the hood, this module instructs the
orchestrator to create and deploy the corresponding RGW daemons. The module
supports passing the arguments either on the command line or in a spec file:
It is also possible to create a new zone that connects to the master
zone and synchronizes data to/from it.
.. prompt:: bash #
ceph rgw realm bootstrap [--realm-name] [--zonegroup-name] [--zone-name] [--port] [--placement] [--start-radosgw]
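For example, a command-line-only bootstrap (the names and port below are hypothetical and mirror the spec file shown further down) might look like this:

.. prompt:: bash #

   ceph rgw realm bootstrap --realm-name myrealm --zonegroup-name myzonegroup --zone-name myzone --port 5500 --start-radosgw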
The command supports providing the configuration through a spec file (`-i option`):
.. prompt:: bash #
ceph rgw realm bootstrap -i myrgw.yaml
Following is an example of RGW multisite spec file:
.. code-block:: yaml
rgw_realm: myrealm
rgw_zonegroup: myzonegroup
rgw_zone: myzone
placement:
hosts:
- ceph-node-1
- ceph-node-2
spec:
rgw_frontend_port: 5500
.. note:: The spec file used by RGW has the same format as the one used by the orchestrator. Thus,
the user can provide any orchestrator-supported RGW parameters, including advanced
configuration features such as SSL certificates.
Users can also specify custom zone endpoints in the spec (or on the command line). In this case,
cephadm will not launch any RGW daemons. Following is an example RGW spec file with zone endpoints:
.. code-block:: yaml
rgw_realm: myrealm
rgw_zonegroup: myzonegroup
rgw_zone: myzone
zone_endpoints: http://<rgw_host1>:<rgw_port1>, http://<rgw_host2>:<rgw_port2>
Realm Credentials Token
-----------------------
A new token is created when bootstrapping a new realm, and also
when creating one explicitly. The token encapsulates
the master zone endpoint, and a set of credentials that are associated
with a system user.
Removal of this token would remove the credentials, and if the corresponding
system user has no more access keys, it is removed.
Users can list the available tokens for the created (or already existing) realms.
The token is a base64 string that encapsulates the realm information and its
master zone endpoint authentication data. Following is an example of
the `ceph rgw realm tokens` output:
.. prompt:: bash #
ceph rgw realm tokens | jq
.. code-block:: json
[
{
"realm": "myrealm1",
"token": "ewogICAgInJlYWxtX25hbWUiOiAibXlyZWFs....NHlBTFhoIgp9"
},
{
"realm": "myrealm2",
"token": "ewogICAgInJlYWxtX25hbWUiOiAibXlyZWFs....RUU12ZDB0Igp9"
}
]
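Because the output is JSON, a single realm's token can be extracted with ``jq``; for example (realm name as in the sample output above):

.. prompt:: bash #

   ceph rgw realm tokens | jq -r '.[] | select(.realm == "myrealm1") | .token'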
Users can use the token to pull a realm and create a secondary zone on a
different cluster that syncs with the master zone on the primary cluster,
by using the `ceph rgw zone create` command and providing the corresponding token.
Following is an example of zone spec file:
.. code-block:: yaml
rgw_zone: my-secondary-zone
rgw_realm_token: <token>
placement:
hosts:
- ceph-node-1
- ceph-node-2
spec:
rgw_frontend_port: 5500
.. prompt:: bash #
ceph rgw zone create -i zone-spec.yaml
.. note:: The spec file used by RGW has the same format as the one used by the orchestrator. Thus,
the user can provide any orchestrator-supported RGW parameters, including advanced
configuration features such as SSL certificates.
Commands
--------
::
ceph rgw realm bootstrap
ceph rgw realm bootstrap -i spec.yaml
Create a new realm + zonegroup + zone and deploy rgw daemons via the
orchestrator. Command returns a realm token that allows new zones to easily
join this realm
orchestrator using the information specified in the YAML file.
::
ceph rgw zone create
ceph rgw realm tokens
Create a new zone and join existing realm (using the realm token)
List the tokens of all the available realms
::
ceph rgw zone-creds create
ceph rgw zone create -i spec.yaml
Create new credentials and return a token for new zone connection
::
ceph rgw zone-creds remove
Remove credentials and/or user that are associated with the specified
token
::
ceph rgw realm reconcile
Update the realm configuration to match the orchestrator deployment
Join an existing realm by creating a new secondary zone (using the realm token)
::

Binary file not shown.


View File

@ -269,3 +269,24 @@ completely optional, and disabled by default.::
ceph config set mgr mgr/telemetry/description 'My first Ceph cluster'
ceph config set mgr mgr/telemetry/channel_ident true
Leaderboard
-----------
To participate in a leaderboard in the `public dashboards
<https://telemetry-public.ceph.com/>`_, run the following command:
.. prompt:: bash $
ceph config set mgr mgr/telemetry/leaderboard true
The leaderboard displays basic information about the cluster. This includes the
total storage capacity and the number of OSDs. To add a description of the
cluster, run a command of the following form:
.. prompt:: bash $
ceph config set mgr mgr/telemetry/leaderboard_description 'Ceph cluster for Computational Biology at the University of XYZ'
If the ``ident`` channel is enabled, its details will not be displayed in the
leaderboard.

View File

@ -0,0 +1,474 @@
.. _monitoring:
===================
Monitoring overview
===================
The aim of this part of the documentation is to explain the Ceph monitoring
stack and the meaning of the main Ceph metrics.
With a good understanding of the Ceph monitoring stack and metrics, users can
create customized monitoring tools, such as Prometheus queries, Grafana
dashboards, or scripts.
Ceph Monitoring stack
=====================
Ceph provides a default monitoring stack, which is installed by cephadm and
explained in the :ref:`Monitoring Services <mgr-cephadm-monitoring>` section of
the cephadm documentation.
Ceph metrics
============
The main source of Ceph metrics is the set of performance counters exposed by
each Ceph daemon. The :doc:`../dev/perf_counters` are the native Ceph monitoring data.
Performance counters are transformed into standard Prometheus metrics by the
Ceph exporter daemon. This daemon runs on every Ceph cluster host and exposes a
metrics endpoint where all the performance counters exposed by all the Ceph
daemons running on the host are published in the form of Prometheus metrics.
In addition to the Ceph exporter, there is another agent that exposes Ceph
metrics: the Prometheus manager module, which exposes metrics related to the
whole cluster, essentially metrics that are not produced by individual Ceph
daemons.
The main source for obtaining Ceph metrics is the metrics endpoint exposed by
the cluster's Prometheus server. Ceph can provide you with the Prometheus
endpoint where you can obtain the complete list of metrics (coming from the
Ceph exporter daemons and the Prometheus manager module) and execute queries.
Use the following command to obtain the Prometheus server endpoint in your
cluster:
.. code-block:: bash
# ceph orch ps --service_name prometheus
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
prometheus.cephtest-node-00 cephtest-node-00.cephlab.com *:9095 running (103m) 50s ago 5w 142M - 2.33.4 514e6a882f6e efe3cbc2e521
With this information you can connect to
``http://cephtest-node-00.cephlab.com:9095`` to access the Prometheus server
interface.
The complete list of metrics (with help text) for your cluster is available
at:
``http://cephtest-node-00.cephlab.com:9095/api/v1/targets/metadata``
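The same Prometheus HTTP API can also be queried from the command line; as a quick sketch (``ceph_health_status`` here is one of the cluster-level metrics exported by the Prometheus manager module):

.. code-block:: bash

   # Run an instant query against the Prometheus HTTP API
   curl -s 'http://cephtest-node-00.cephlab.com:9095/api/v1/query?query=ceph_health_status'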
Note that the main tool that allows users to observe and monitor a Ceph cluster is the **Ceph dashboard**. It provides graphs on which the most important cluster and service metrics are represented. Most of the examples in this document are extracted from the dashboard graphs or extrapolated from the metrics exposed by the Ceph dashboard.
Performance metrics
===================
The main metrics used to measure Ceph cluster performance are listed below.
All of these metrics have the following labels:
``ceph_daemon``: identifier of the OSD daemon generating the metric
``instance``: the IP address of the Ceph exporter instance exposing the metric
``job``: Prometheus scrape job
Example:
.. code-block:: bash
ceph_osd_op_r{ceph_daemon="osd.0", instance="192.168.122.7:9283", job="ceph"} = 73981
*Cluster I/O (throughput):*
Use ``ceph_osd_op_r_out_bytes`` and ``ceph_osd_op_w_in_bytes`` to obtain the cluster throughput generated by clients
Example:
.. code-block:: bash
Writes (B/s):
sum(irate(ceph_osd_op_w_in_bytes[1m]))
Reads (B/s):
sum(irate(ceph_osd_op_r_out_bytes[1m]))
*Cluster I/O (operations):*
Use ``ceph_osd_op_r``, ``ceph_osd_op_w`` to obtain the number of operations generated by clients
Example:
.. code-block:: bash
Writes (ops/s):
sum(irate(ceph_osd_op_w[1m]))
Reads (ops/s):
sum(irate(ceph_osd_op_r[1m]))
*Latency:*
Use ``ceph_osd_op_latency_sum``, which represents the delay before an OSD data transfer begins in response to a client request.
Example:
.. code-block:: bash
sum(irate(ceph_osd_op_latency_sum[1m]))
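To express this as an average latency per operation, divide the sum by the matching count counter (assuming ``ceph_osd_op_latency_count`` is exposed alongside the sum, as the per-OSD read/write latency counters below are):

.. code-block:: bash

   Average client operation latency (seconds):
   sum(irate(ceph_osd_op_latency_sum[1m])) / sum(irate(ceph_osd_op_latency_count[1m]))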
OSD performance
===============
The cluster performance metrics explained above are based on OSD metrics. By selecting the right label, we can obtain the same performance information for a single OSD:
Example:
.. code-block:: bash
OSD 0 read latency
irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"osd.0"}[1m]) / on (ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])
OSD 0 write IOPS
irate(ceph_osd_op_w{ceph_daemon=~"osd.0"}[1m])
OSD 0 write throughput (bytes)
irate(ceph_osd_op_w_in_bytes{ceph_daemon=~"osd.0"}[1m])
OSD.0 total raw capacity available
ceph_osd_stat_bytes{ceph_daemon="osd.0", instance="cephtest-node-00.cephlab.com:9283", job="ceph"} = 536451481
Physical disk performance
==========================
By combining Prometheus ``node_exporter`` metrics with Ceph metrics, we can obtain
information about the performance of the physical disks used by OSDs.
Example:
.. code-block:: bash
Read latency of device used by OSD 0:
label_replace(irate(node_disk_read_time_seconds_total[1m]) / irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Write latency of device used by OSD 0
label_replace(irate(node_disk_write_time_seconds_total[1m]) / irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
IOPS (device used by OSD.0)
reads:
label_replace(irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
writes:
label_replace(irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Throughput (device used by OSD.0)
reads:
label_replace(irate(node_disk_read_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
writes:
label_replace(irate(node_disk_written_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Physical Device Utilization (%) for OSD.0 in the last 5 minutes
label_replace(irate(node_disk_io_time_seconds_total[5m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Pool metrics
============
These metrics have the following labels:
``instance``: the ip address of the Ceph exporter daemon producing the metric.
``pool_id``: identifier of the pool
``job``: prometheus scrape job
- ``ceph_pool_metadata``: Information about the pool. It can be used together
with other metrics to provide more contextual information in queries and
graphs. Apart from the three common labels, this metric provides the following
extra labels:
- ``compression_mode``: compression used in the pool (lz4, snappy, zlib,
zstd, none). Example: compression_mode="none"
- ``description``: brief description of the pool type (replica:number of
replicas or Erasure code: ec profile). Example: description="replica:3"
- ``name``: name of the pool. Example: name=".mgr"
- ``type``: type of pool (replicated/erasure code). Example: type="replicated"
- ``ceph_pool_bytes_used``: Total raw capacity consumed by user data and associated overheads per pool (metadata + redundancy).
- ``ceph_pool_stored``: Total of CLIENT data stored in the pool
- ``ceph_pool_compress_under_bytes``: Data eligible to be compressed in the pool
- ``ceph_pool_compress_bytes_used``: Data compressed in the pool
- ``ceph_pool_rd``: CLIENT read operations per pool (reads per second)
- ``ceph_pool_rd_bytes``: CLIENT read operations in bytes per pool
- ``ceph_pool_wr``: CLIENT write operations per pool (writes per second)
- ``ceph_pool_wr_bytes``: CLIENT write operations in bytes per pool
**Useful queries**:
.. code-block:: bash
Total raw capacity available in the cluster:
sum(ceph_osd_stat_bytes)
Total raw capacity consumed in the cluster (including metadata + redundancy):
sum(ceph_pool_bytes_used)
Total of CLIENT data stored in the cluster:
sum(ceph_pool_stored)
Compression savings:
sum(ceph_pool_compress_under_bytes - ceph_pool_compress_bytes_used)
CLIENT IOPS for a pool (testrbdpool)
reads: irate(ceph_pool_rd[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
writes: irate(ceph_pool_wr[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
CLIENT Throughput for a pool
reads: irate(ceph_pool_rd_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
writes: irate(ceph_pool_wr_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
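The ``pool_id`` label can be translated into a pool name by joining with ``ceph_pool_metadata``, as in this sketch that reports stored client data per pool name:

.. code-block:: bash

   CLIENT data stored per pool (by pool name):
   sum by (name) (ceph_pool_stored * on(pool_id) group_left(name) ceph_pool_metadata)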
Object metrics
==============
These metrics have the following labels:
``instance``: the ip address of the ceph exporter daemon providing the metric
``instance_id``: identifier of the rgw daemon
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_rgw_req{instance="192.168.122.7:9283", instance_id="154247", job="ceph"} = 12345
Generic metrics
---------------
- ``ceph_rgw_metadata``: Provides generic information about the RGW daemon. It
can be used together with other metrics to provide more contextual
information in queries and graphs. Apart from the three common labels, this
metric provides the following extra labels:
- ``ceph_daemon``: Name of the Ceph daemon. Example:
ceph_daemon="rgw.rgwtest.cephtest-node-00.sxizyq",
- ``ceph_version``: Version of Ceph daemon. Example: ceph_version="ceph
version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)",
- ``hostname``: Name of the host where the daemon runs. Example:
hostname:"cephtest-node-00.cephlab.com",
- ``ceph_rgw_req``: Total number of requests for the daemon (GET + PUT + DELETE).
Useful to detect bottlenecks and optimize load distribution.
- ``ceph_rgw_qlen``: RGW operations queue length for the daemon.
Useful to detect bottlenecks and optimize load distribution.
- ``ceph_rgw_failed_req``: Aborted requests.
Useful to detect daemon errors
GET operations: related metrics
-------------------------------
- ``ceph_rgw_get_initial_lat_count``: Number of GET operations
- ``ceph_rgw_get_initial_lat_sum``: Total latency time for the GET operations
- ``ceph_rgw_get``: Total number of GET requests
- ``ceph_rgw_get_b``: Total bytes transferred in GET operations
Put operations: related metrics
-------------------------------
- ``ceph_rgw_put_initial_lat_count``: Number of PUT operations
- ``ceph_rgw_put_initial_lat_sum``: Total latency time for the PUT operations
- ``ceph_rgw_put``: Total number of PUT operations
- ``ceph_rgw_put_b``: Total bytes transferred in PUT operations
Useful queries
--------------
.. code-block:: bash
The average of get latencies:
rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
The average of put latencies:
rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Total requests per second:
rate(ceph_rgw_req[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Total number of "other" operations (LIST, DELETE)
rate(ceph_rgw_req[30s]) - (rate(ceph_rgw_get[30s]) + rate(ceph_rgw_put[30s]))
GET latencies
rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
PUT latencies
rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Bandwidth consumed by GET operations
sum(rate(ceph_rgw_get_b[30s]))
Bandwidth consumed by PUT operations
sum(rate(ceph_rgw_put_b[30s]))
Bandwidth consumed by RGW instance (PUTs + GETs)
sum by (instance_id) (rate(ceph_rgw_get_b[30s]) + rate(ceph_rgw_put_b[30s])) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
HTTP errors:
rate(ceph_rgw_failed_req[30s])
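Building on the counters listed above, the failed-request ratio per RGW daemon can be approximated as follows (a sketch using the same join pattern as the queries above):

.. code-block:: bash

   Ratio of failed requests per RGW daemon:
   rate(ceph_rgw_failed_req[30s]) / rate(ceph_rgw_req[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata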
Filesystem Metrics
==================
These metrics have the following labels:
``ceph_daemon``: The name of the MDS daemon
``instance``: the IP address (and port) of the Ceph exporter daemon exposing the metric
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_mds_request{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", instance="192.168.122.7:9283", job="ceph"} = 1452
Main metrics
------------
- ``ceph_mds_metadata``: Provides general information about the MDS daemon. It
can be used together with other metrics to provide more contextual
information in queries and graphs. It provides the following extra labels:
- ``ceph_version``: MDS daemon Ceph version
- ``fs_id``: filesystem cluster id
- ``hostname``: Host name where the MDS daemon runs
- ``public_addr``: Public address where the MDS daemon runs
- ``rank``: Rank of the MDS daemon
Example:
.. code-block:: bash
ceph_mds_metadata{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", ceph_version="ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)", fs_id="-1", hostname="cephtest-node-00.cephlab.com", instance="cephtest-node-00.cephlab.com:9283", job="ceph", public_addr="192.168.122.145:6801/118896446", rank="-1"}
- ``ceph_mds_request``: Total number of requests for the MDS daemon
- ``ceph_mds_reply_latency_sum``: Reply latency total
- ``ceph_mds_reply_latency_count``: Reply latency count
- ``ceph_mds_server_handle_client_request``: Number of client requests
- ``ceph_mds_sessions_session_count``: Session count
- ``ceph_mds_sessions_total_load``: Total load
- ``ceph_mds_sessions_sessions_open``: Sessions currently open
- ``ceph_mds_sessions_sessions_stale``: Sessions currently stale
- ``ceph_objecter_op_r``: Number of read operations
- ``ceph_objecter_op_w``: Number of write operations
- ``ceph_mds_root_rbytes``: Total number of bytes managed by the daemon
- ``ceph_mds_root_rfiles``: Total number of files managed by the daemon
Useful queries
---------------
.. code-block:: bash
Total MDS daemons read workload:
sum(rate(ceph_objecter_op_r[1m]))
Total MDS daemons write workload:
sum(rate(ceph_objecter_op_w[1m]))
MDS daemon read workload: (daemon name is "mdstest")
sum(rate(ceph_objecter_op_r{ceph_daemon=~"mdstest"}[1m]))
MDS daemon write workload: (daemon name is "mdstest")
sum(rate(ceph_objecter_op_w{ceph_daemon=~"mdstest"}[1m]))
The average of reply latencies:
rate(ceph_mds_reply_latency_sum[30s]) / rate(ceph_mds_reply_latency_count[30s])
Total requests per second:
rate(ceph_mds_request[30s]) * on (instance) group_right (ceph_daemon) ceph_mds_metadata
Block metrics
=============
By default, RBD metrics for images are not available, in order to provide the
best performance in the Prometheus manager module.
To generate metrics for RBD images, the manager option
``mgr/prometheus/rbd_stats_pools`` must be configured properly. For more
information, see :ref:`prometheus-rbd-io-statistics`.
These metrics have the following labels:
``image``: Name of the image which produces the metric value.
``instance``: Node where the rbd metric is produced. (It points to the Ceph exporter daemon)
``job``: Name of the Prometheus scrape job.
``pool``: Image pool name.
Example:
.. code-block:: bash
ceph_rbd_read_bytes{image="test2", instance="cephtest-node-00.cephlab.com:9283", job="ceph", pool="testrbdpool"}
Main metrics
------------
- ``ceph_rbd_read_bytes``: RBD image bytes read
- ``ceph_rbd_read_latency_count``: RBD image reads latency count
- ``ceph_rbd_read_latency_sum``: RBD image reads latency total
- ``ceph_rbd_read_ops``: RBD image reads count
- ``ceph_rbd_write_bytes``: RBD image bytes written
- ``ceph_rbd_write_latency_count``: RBD image writes latency count
- ``ceph_rbd_write_latency_sum``: RBD image writes latency total
- ``ceph_rbd_write_ops``: RBD image writes count
Useful queries
--------------
.. code-block:: bash
The average of read latencies:
rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s]) * on (instance) group_left (ceph_daemon) ceph_rgw_metadata

View File

@ -1,107 +1,110 @@
.. _rados-cephx-config-ref:
========================
Cephx Config Reference
CephX Config Reference
========================
The ``cephx`` protocol is enabled by default. Cryptographic authentication has
some computational costs, though they should generally be quite low. If the
network environment connecting your client and server hosts is very safe and
you cannot afford authentication, you can turn it off. **This is not generally
recommended**.
The CephX protocol is enabled by default. The cryptographic authentication that
CephX provides has some computational costs, though they should generally be
quite low. If the network environment connecting your client and server hosts
is very safe and you cannot afford authentication, you can disable it.
**Disabling authentication is not generally recommended**.
.. note:: If you disable authentication, you are at risk of a man-in-the-middle
attack altering your client/server messages, which could lead to disastrous
security effects.
.. note:: If you disable authentication, you will be at risk of a
man-in-the-middle attack that alters your client/server messages, which
could have disastrous security effects.
For creating users, see `User Management`_. For details on the architecture
of Cephx, see `Architecture - High Availability Authentication`_.
For information about creating users, see `User Management`_. For details on
the architecture of CephX, see `Architecture - High Availability
Authentication`_.
Deployment Scenarios
====================
There are two main scenarios for deploying a Ceph cluster, which impact
how you initially configure Cephx. Most first time Ceph users use
``cephadm`` to create a cluster (easiest). For clusters using
other deployment tools (e.g., Chef, Juju, Puppet, etc.), you will need
to use the manual procedures or configure your deployment tool to
How you initially configure CephX depends on your scenario. There are two
common strategies for deploying a Ceph cluster. If you are a first-time Ceph
user, you should probably take the easiest approach: using ``cephadm`` to
deploy a cluster. But if your cluster uses other deployment tools (for example,
Ansible, Chef, Juju, or Puppet), you will need either to use the manual
deployment procedures or to configure your deployment tool so that it will
bootstrap your monitor(s).
Manual Deployment
-----------------
When you deploy a cluster manually, you have to bootstrap the monitor manually
and create the ``client.admin`` user and keyring. To bootstrap monitors, follow
the steps in `Monitor Bootstrapping`_. The steps for monitor bootstrapping are
the logical steps you must perform when using third party deployment tools like
Chef, Puppet, Juju, etc.
When you deploy a cluster manually, it is necessary to bootstrap the monitors
manually and to create the ``client.admin`` user and keyring. To bootstrap
monitors, follow the steps in `Monitor Bootstrapping`_. Follow these steps when
using third-party deployment tools (for example, Chef, Puppet, and Juju).
Enabling/Disabling Cephx
Enabling/Disabling CephX
========================
Enabling Cephx requires that you have deployed keys for your monitors,
OSDs and metadata servers. If you are simply toggling Cephx on / off,
you do not have to repeat the bootstrapping procedures.
Enabling CephX is possible only if the keys for your monitors, OSDs, and
metadata servers have already been deployed. If you are simply toggling CephX
on or off, it is not necessary to repeat the bootstrapping procedures.
Enabling Cephx
Enabling CephX
--------------
When ``cephx`` is enabled, Ceph will look for the keyring in the default search
path, which includes ``/etc/ceph/$cluster.$name.keyring``. You can override
this location by adding a ``keyring`` option in the ``[global]`` section of
your `Ceph configuration`_ file, but this is not recommended.
When CephX is enabled, Ceph will look for the keyring in the default search
path: this path includes ``/etc/ceph/$cluster.$name.keyring``. It is possible
to override this search-path location by adding a ``keyring`` option in the
``[global]`` section of your `Ceph configuration`_ file, but this is not
recommended.
Execute the following procedures to enable ``cephx`` on a cluster with
authentication disabled. If you (or your deployment utility) have already
To enable CephX on a cluster for which authentication has been disabled, carry
out the following procedure. If you (or your deployment utility) have already
generated the keys, you may skip the steps related to generating keys.
#. Create a ``client.admin`` key, and save a copy of the key for your client
host
host:
.. prompt:: bash $
ceph auth get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *' -o /etc/ceph/ceph.client.admin.keyring
**Warning:** This will clobber any existing
**Warning:** This step will clobber any existing
``/etc/ceph/client.admin.keyring`` file. Do not perform this step if a
deployment tool has already done it for you. Be careful!
deployment tool has already generated a keyring file for you. Be careful!
#. Create a keyring for your monitor cluster and generate a monitor
secret key.
#. Create a monitor keyring and generate a monitor secret key:
.. prompt:: bash $
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
#. Copy the monitor keyring into a ``ceph.mon.keyring`` file in every monitor's
``mon data`` directory. For example, to copy it to ``mon.a`` in cluster ``ceph``,
use the following
#. For each monitor, copy the monitor keyring into a ``ceph.mon.keyring`` file
in the monitor's ``mon data`` directory. For example, to copy the monitor
keyring to ``mon.a`` in a cluster called ``ceph``, run the following
command:
.. prompt:: bash $
cp /tmp/ceph.mon.keyring /var/lib/ceph/mon/ceph-a/keyring
#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter
#. Generate a secret key for every MGR, where ``{$id}`` is the MGR letter:
.. prompt:: bash $
ceph auth get-or-create mgr.{$id} mon 'allow profile mgr' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mgr/ceph-{$id}/keyring
#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number
#. Generate a secret key for every OSD, where ``{$id}`` is the OSD number:
.. prompt:: bash $
ceph auth get-or-create osd.{$id} mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-{$id}/keyring
#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter
#. Generate a secret key for every MDS, where ``{$id}`` is the MDS letter:
.. prompt:: bash $
ceph auth get-or-create mds.{$id} mon 'allow rwx' osd 'allow *' mds 'allow *' mgr 'allow profile mds' -o /var/lib/ceph/mds/ceph-{$id}/keyring
#. Enable ``cephx`` authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file
#. Enable CephX authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file:
.. code-block:: ini
@ -109,23 +112,23 @@ generated the keys, you may skip the steps related to generating keys.
auth_service_required = cephx
auth_client_required = cephx
#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details.
#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_.
For details on bootstrapping a monitor manually, see `Manual Deployment`_.
Disabling Cephx
Disabling CephX
---------------
The following procedure describes how to disable Cephx. If your cluster
environment is relatively safe, you can offset the computation expense of
running authentication. **We do not recommend it.** However, it may be easier
during setup and/or troubleshooting to temporarily disable authentication.
The following procedure describes how to disable CephX. If your cluster
environment is safe, you might want to disable CephX in order to offset the
computational expense of running authentication. **We do not recommend doing
so.** However, setup and troubleshooting might be easier if authentication is
temporarily disabled and subsequently re-enabled.
#. Disable ``cephx`` authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file
#. Disable CephX authentication by setting the following options in the
``[global]`` section of your `Ceph configuration`_ file:
.. code-block:: ini
@ -133,8 +136,7 @@ during setup and/or troubleshooting to temporarily disable authentication.
auth_service_required = none
auth_client_required = none
#. Start or restart the Ceph cluster. See `Operating a Cluster`_ for details.
#. Start or restart the Ceph cluster. For details, see `Operating a Cluster`_.
Configuration Settings
@ -144,70 +146,230 @@ Enablement
----------
.. confval:: auth_cluster_required
.. confval:: auth_service_required
.. confval:: auth_client_required
``auth_cluster_required``
:Description: If this configuration setting is enabled, the Ceph Storage
Cluster daemons (that is, ``ceph-mon``, ``ceph-osd``,
``ceph-mds``, and ``ceph-mgr``) are required to authenticate with
each other. Valid settings are ``cephx`` or ``none``.
:Type: String
:Required: No
:Default: ``cephx``.
``auth_service_required``
:Description: If this configuration setting is enabled, then Ceph clients can
access Ceph services only if those clients authenticate with the
Ceph Storage Cluster. Valid settings are ``cephx`` or ``none``.
:Type: String
:Required: No
:Default: ``cephx``.
``auth_client_required``
:Description: If this configuration setting is enabled, then communication
between the Ceph client and Ceph Storage Cluster can be
established only if the Ceph Storage Cluster authenticates
against the Ceph client. Valid settings are ``cephx`` or
``none``.
:Type: String
:Required: No
:Default: ``cephx``.
.. index:: keys; keyring
Keys
----
When you run Ceph with authentication enabled, ``ceph`` administrative commands
and Ceph Clients require authentication keys to access the Ceph Storage Cluster.
When Ceph is run with authentication enabled, ``ceph`` administrative commands
and Ceph clients can access the Ceph Storage Cluster only if they use
authentication keys.
The most common way to provide these keys to the ``ceph`` administrative
commands and clients is to include a Ceph keyring under the ``/etc/ceph``
directory. For Octopus and later releases using ``cephadm``, the filename
is usually ``ceph.client.admin.keyring`` (or ``$cluster.client.admin.keyring``).
If you include the keyring under the ``/etc/ceph`` directory, you don't need to
specify a ``keyring`` entry in your Ceph configuration file.
The most common way to make these keys available to ``ceph`` administrative
commands and Ceph clients is to include a Ceph keyring under the ``/etc/ceph``
directory. For Octopus and later releases that use ``cephadm``, the filename is
usually ``ceph.client.admin.keyring``. If the keyring is included in the
``/etc/ceph`` directory, then it is unnecessary to specify a ``keyring`` entry
in the Ceph configuration file.
We recommend copying the Ceph Storage Cluster's keyring file to nodes where you
will run administrative commands, because it contains the ``client.admin`` key.
Because the Ceph Storage Cluster's keyring file contains the ``client.admin``
key, we recommend copying the keyring file to nodes from which you run
administrative commands.
To perform this step manually, execute the following:
To perform this step manually, run the following command:
.. prompt:: bash $
sudo scp {user}@{ceph-cluster-host}:/etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
.. tip:: Ensure the ``ceph.keyring`` file has appropriate permissions set
(e.g., ``chmod 644``) on your client machine.
.. tip:: Make sure that the ``ceph.keyring`` file has appropriate permissions
(for example, ``chmod 644``) set on your client machine.
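For example, a minimal permissions fix on the client machine (path as in the copy command above) might be:

.. prompt:: bash $

   sudo chmod 644 /etc/ceph/ceph.client.admin.keyring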
You may specify the key itself in the Ceph configuration file using the ``key``
setting (not recommended), or a path to a keyfile using the ``keyfile`` setting.
You can specify the key itself by using the ``key`` setting in the Ceph
configuration file (this approach is not recommended), or instead specify a
path to a keyfile by using the ``keyfile`` setting in the Ceph configuration
file.
``keyring``
:Description: The path to the keyring file.
:Type: String
:Required: No
:Default: ``/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin``
``keyfile``
:Description: The path to a keyfile (that is, a file containing only the key).
:Type: String
:Required: No
:Default: None
``key``
:Description: The key (that is, the text string of the key itself). We do not
recommend that you use this setting unless you know what you're
doing.
:Type: String
:Required: No
:Default: None
Daemon Keyrings
---------------
Administrative users or deployment tools (for example, ``cephadm``) generate
daemon keyrings in the same way that they generate user keyrings. By default,
Ceph stores the keyring of a daemon inside that daemon's data directory. The
default keyring locations and the capabilities that are necessary for the
daemon to function are shown below.
``ceph-mon``
:Location: ``$mon_data/keyring``
:Capabilities: ``mon 'allow *'``
``ceph-osd``
:Location: ``$osd_data/keyring``
:Capabilities: ``mgr 'allow profile osd' mon 'allow profile osd' osd 'allow *'``
``ceph-mds``
:Location: ``$mds_data/keyring``
:Capabilities: ``mds 'allow' mgr 'allow profile mds' mon 'allow profile mds' osd 'allow rwx'``
``ceph-mgr``
:Location: ``$mgr_data/keyring``
:Capabilities: ``mon 'allow profile mgr' mds 'allow *' osd 'allow *'``
``radosgw``
:Location: ``$rgw_data/keyring``
:Capabilities: ``mon 'allow rwx' osd 'allow rwx'``
.. note:: The monitor keyring (that is, ``mon.``) contains a key but no
capabilities, and this keyring is not part of the cluster ``auth`` database.
The daemon's data-directory locations default to directories of the form::
/var/lib/ceph/$type/$cluster-$id
For example, ``osd.12`` would have the following data directory::
/var/lib/ceph/osd/ceph-12
It is possible to override these locations, but it is not recommended.
.. confval:: keyring
:default: /etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
.. confval:: keyfile
.. confval:: key
.. index:: signatures
Signatures
----------
Ceph performs a signature check that provides some limited protection
against messages being tampered with in flight (e.g., by a "man in the
middle" attack).
Ceph performs a signature check that provides some limited protection against
messages being tampered with in flight (for example, by a "man in the middle"
attack).
Like other parts of Ceph authentication, Ceph provides fine-grained control so
you can enable/disable signatures for service messages between clients and
Ceph, and so you can enable/disable signatures for messages between Ceph daemons.
As with other parts of Ceph authentication, signatures admit of fine-grained
control. You can enable or disable signatures for service messages between
clients and Ceph, and for messages between Ceph daemons.
Note that even with signatures enabled data is not encrypted in
flight.
Note that even when signatures are enabled data is not encrypted in flight.
``cephx_require_signatures``
:Description: If this configuration setting is set to ``true``, Ceph requires
signatures on all message traffic between the Ceph client and the
Ceph Storage Cluster, and between daemons within the Ceph Storage
Cluster.
.. note::
**ANTIQUATED NOTE:**
Neither Ceph Argonaut nor Linux kernel versions prior to 3.19
support signatures; if one of these clients is in use, ``cephx_require_signatures``
can be disabled in order to allow the client to connect.
:Type: Boolean
:Required: No
:Default: ``false``
``cephx_cluster_require_signatures``
:Description: If this configuration setting is set to ``true``, Ceph requires
signatures on all message traffic between Ceph daemons within the
Ceph Storage Cluster.
:Type: Boolean
:Required: No
:Default: ``false``
``cephx_service_require_signatures``
:Description: If this configuration setting is set to ``true``, Ceph requires
signatures on all message traffic between Ceph clients and the
Ceph Storage Cluster.
:Type: Boolean
:Required: No
:Default: ``false``
``cephx_sign_messages``
:Description: If this configuration setting is set to ``true``, and if the Ceph
version supports message signing, then Ceph will sign all
messages so that they are more difficult to spoof.
:Type: Boolean
:Default: ``true``
.. confval:: cephx_require_signatures
.. confval:: cephx_cluster_require_signatures
.. confval:: cephx_service_require_signatures
.. confval:: cephx_sign_messages
Time to Live
------------
.. confval:: auth_service_ticket_ttl
``auth_service_ticket_ttl``
:Description: When the Ceph Storage Cluster sends a ticket for authentication
to a Ceph client, the Ceph Storage Cluster assigns that ticket a
Time To Live (TTL).
:Type: Double
:Default: ``60*60``
.. _Monitor Bootstrapping: ../../../install/manual-deployment#monitor-bootstrapping
.. _Operating a Cluster: ../../operations/operating

View File

@ -1,84 +1,95 @@
==========================
BlueStore Config Reference
==========================
==================================
BlueStore Configuration Reference
==================================
Devices
=======
BlueStore manages either one, two, or (in certain cases) three storage
devices.
BlueStore manages either one, two, or in certain cases three storage devices.
These *devices* are "devices" in the Linux/Unix sense. This means that they are
assets listed under ``/dev`` or ``/devices``. Each of these devices may be an
entire storage drive, or a partition of a storage drive, or a logical volume.
BlueStore does not create or mount a conventional file system on devices that
it uses; BlueStore reads and writes to the devices directly in a "raw" fashion.
In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, occupying the full device that
is managed directly by BlueStore. This *primary device* is normally identified
by a ``block`` symlink in the data directory.
In the simplest case, BlueStore consumes all of a single storage device. This
device is known as the *primary device*. The primary device is identified by
the ``block`` symlink in the data directory.
The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD, like: its identifier, which cluster it belongs to,
and its private keyring.
The data directory is a ``tmpfs`` mount. When this data directory is booted or
activated by ``ceph-volume``, it is populated with metadata files and links
that hold information about the OSD: for example, the OSD's identifier, the
name of the cluster that the OSD belongs to, and the OSD's private keyring.
It is also possible to deploy BlueStore across one or two additional devices:
In more complicated cases, BlueStore is deployed across one or two additional
devices:
* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data directory) can be
used for BlueStore's internal journal or write-ahead log. It is only useful
to use a WAL device if the device is faster than the primary device (e.g.,
when it is on an SSD and the primary device is an HDD).
* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
directory) can be used to separate out BlueStore's internal journal or
write-ahead log. Using a WAL device is advantageous only if the WAL device
is faster than the primary device (for example, if the WAL device is an SSD
and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
for storing BlueStore's internal metadata. BlueStore (or rather, the
embedded RocksDB) will put as much metadata as it can on the DB device to
improve performance. If the DB device fills up, metadata will spill back
onto the primary device (where it would have been otherwise). Again, it is
only helpful to provision a DB device if it is faster than the primary
device.
to store BlueStore's internal metadata. BlueStore (or more precisely, the
embedded RocksDB) will put as much metadata as it can on the DB device in
order to improve performance. If the DB device becomes full, metadata will
spill back onto the primary device (where it would have been located in the
absence of the DB device). Again, it is advantageous to provision a DB device
only if it is faster than the primary device.
If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit). This means that if a DB device is specified but an explicit
WAL device is not, the WAL will be implicitly colocated with the DB on the faster
device.
If there is only a small amount of fast storage available (for example, less
than a gigabyte), we recommend using the available space as a WAL device. But
if more fast storage is available, it makes more sense to provision a DB
device. Because the BlueStore journal is always placed on the fastest device
available, using a DB device provides the same benefit that using a WAL device
would, while *also* allowing additional metadata to be stored off the primary
device (provided that it fits). DB devices make this possible because whenever
a DB device is specified but an explicit WAL device is not, the WAL will be
implicitly colocated with the DB on the faster device.
A single-device (colocated) BlueStore OSD can be provisioned with:
To provision a single-device (colocated) BlueStore OSD, run the following
command:
.. prompt:: bash $
ceph-volume lvm prepare --bluestore --data <device>
To specify a WAL device and/or DB device:
To specify a WAL device or DB device, run the following command:
.. prompt:: bash $
ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
devices can be existing logical volumes or GPT partitions.
.. note:: The option ``--data`` can take as its argument any of the
following devices: logical volumes specified using *vg/lv* notation,
existing logical volumes, and GPT partitions.
Provisioning strategies
-----------------------
Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore
which had just one), there are two common arrangements that should help clarify
the deployment strategy:
BlueStore differs from Filestore in that there are several ways to deploy a
BlueStore OSD. However, the overall deployment strategy for BlueStore can be
clarified by examining just these two common arrangements:
.. _bluestore-single-type-device-config:
**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are the same type, for example all rotational drives, and
there are no fast devices to use for metadata, it makes sense to specify the
block device only and to not separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like:
If all devices are of the same type (for example, they are all HDDs), and if
there are no fast devices available for the storage of metadata, then it makes
sense to specify the block device only and to leave ``block.db`` and
``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
``/dev/sda`` device is as follows:
.. prompt:: bash $
ceph-volume lvm create --bluestore --data /dev/sda
If logical volumes have already been created for each device, (a single LV
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
``ceph-vg/block-lv`` would look like:
If the devices to be used for a BlueStore OSD are pre-created logical volumes,
then the :ref:`ceph-volume-lvm` call for a logical volume named
``ceph-vg/block-lv`` is as follows:
.. prompt:: bash $
@ -88,15 +99,18 @@ using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^
If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower (spinning drive).
You must create these volume groups and logical volumes manually as
the ``ceph-volume`` tool is currently not able to do so automatically.
If you have a mix of fast and slow devices (for example, SSD or HDD), then we
recommend placing ``block.db`` on the faster device while ``block`` (that is,
the data) is stored on the slower device (that is, the rotational drive).
For the below example, let us assume four rotational (``sda``, ``sdb``, ``sdc``, and ``sdd``)
and one (fast) solid state drive (``sdx``). First create the volume groups:
You must create these volume groups and logical volumes manually, because the
``ceph-volume`` tool is currently unable to create them automatically.
The following procedure illustrates the manual creation of volume groups and
logical volumes. For this example, we shall assume four rotational drives
(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
to create the volume groups, run the following commands:
.. prompt:: bash $
@ -105,7 +119,7 @@ and one (fast) solid state drive (``sdx``). First create the volume groups:
vgcreate ceph-block-2 /dev/sdc
vgcreate ceph-block-3 /dev/sdd
Now create the logical volumes for ``block``:
Next, to create the logical volumes for ``block``, run the following commands:
.. prompt:: bash $
@ -114,8 +128,9 @@ Now create the logical volumes for ``block``:
lvcreate -l 100%FREE -n block-2 ceph-block-2
lvcreate -l 100%FREE -n block-3 ceph-block-3
We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:
Because there are four HDDs, there will be four OSDs. Supposing that there is a
200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
the following commands:
.. prompt:: bash $
@ -125,7 +140,7 @@ SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB:
lvcreate -L 50GB -n db-2 ceph-db-0
lvcreate -L 50GB -n db-3 ceph-db-0
Finally, create the 4 OSDs with ``ceph-volume``:
Finally, to create the four OSDs, run the following commands:
.. prompt:: bash $
@ -134,54 +149,57 @@ Finally, create the 4 OSDs with ``ceph-volume``:
ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3
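As an optional verification step (not part of the original procedure), the resulting logical volumes and their OSD associations can be listed with ``ceph-volume``:

.. prompt:: bash $

   ceph-volume lvm list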
These operations should end up creating four OSDs, with ``block`` on the slower
rotational drives with a 50 GB logical volume (DB) for each on the solid state
drive.
After this procedure is finished, there should be four OSDs, ``block`` should
be on the four HDDs, and each HDD should have a 50GB logical volume
(specifically, a DB device) on the shared SSD.
Sizing
======
When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>` it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, ``block.db`` should have
*as large as possible* logical volumes.
When using a :ref:`mixed spinning-and-solid-drive setup
<bluestore-mixed-device-config>`, it is important to make a large enough
``block.db`` logical volume for BlueStore. The logical volumes associated with
``block.db`` should be *as large as possible*.
The general recommendation is to have ``block.db`` size in between 1% to 4%
of ``block`` size. For RGW workloads, it is recommended that the ``block.db``
size isn't smaller than 4% of ``block``, because RGW heavily uses it to store
metadata (omap keys). For example, if the ``block`` size is 1TB, then ``block.db`` shouldn't
be less than 40GB. For RBD workloads, 1% to 2% of ``block`` size is usually enough.
It is generally recommended that the size of ``block.db`` be somewhere between
1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
use of ``block.db`` to store metadata (in particular, omap keys). For example,
if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
2% of the ``block`` size.
In older releases, internal level sizes mean that the DB can fully utilize only
specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
so forth. Most deployments will not substantially benefit from sizing to
accommodate L3 and higher, though DB compaction can be facilitated by doubling
these figures to 6GB, 60GB, and 600GB.
In older releases, internal level sizes are such that the DB can fully utilize
only those specific partition / logical volume sizes that correspond to sums of
L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly
3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
sizing that accommodates L3 and higher, though DB compaction can be facilitated
by doubling these figures to 6GB, 60GB, and 600GB.
Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
enable better utilization of arbitrary DB device sizes, and the Pacific
release brings experimental dynamic level support. Users of older releases may
thus wish to plan ahead by provisioning larger DB devices today so that their
benefits may be realized with future upgrades.
When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.
Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow
for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific
release brings experimental dynamic-level support. Because of these advances,
users of older releases might want to plan ahead by provisioning larger DB
devices today so that the benefits of scale can be realized when upgrades are
made in the future.
When *not* using a mix of fast and slow devices, there is no requirement to
create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore
will automatically colocate these devices within the space of ``block``.
Automatic Cache Sizing
======================
BlueStore can be configured to automatically resize its caches when TCMalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best effort
algorithm and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.
BlueStore can be configured to automatically resize its caches, provided that
certain conditions are met: TCMalloc must be configured as the memory allocator
and the ``bluestore_cache_autotune`` configuration option must be enabled (note
that it is currently enabled by default). When automatic cache sizing is in
effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
size (as determined by ``osd_memory_target``). This approach makes use of a
best-effort algorithm and caches do not shrink smaller than the size defined by
the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
with a hierarchy of priorities. But if priority information is not available,
the values specified in the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options are used as fallback cache ratios.
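For example, the memory target can be raised for all OSDs via the central
configuration database (the value ``6G`` is only illustrative; choose a value
appropriate to the host's RAM and OSD count):

.. prompt:: bash $

   ceph config set osd osd_memory_target 6G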
.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
@ -195,34 +213,33 @@ used as fallbacks.
Manual Cache Sizing
===================
The amount of memory consumed by each OSD for BlueStore caches is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), there is a
different default value that is used depending on whether an HDD or
SSD is used for the primary device (set by the
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
options).
The amount of memory consumed by each OSD to be used for its BlueStore cache is
determined by the ``bluestore_cache_size`` configuration option. If that option
has not been specified (that is, if it remains at 0), then Ceph uses a
different configuration option to determine the default memory budget:
``bluestore_cache_size_hdd`` if the primary device is an HDD, or
``bluestore_cache_size_ssd`` if the primary device is an SSD.
BlueStore and the rest of the Ceph OSD daemon do the best they can
to work within this memory budget. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
some additional utilization due to memory fragmentation and other
allocator overhead.
BlueStore and the rest of the Ceph OSD daemon make every effort to work within
this memory budget. Note that in addition to the configured cache size, there
is also memory consumed by the OSD itself. There is additional utilization due
to memory fragmentation and other allocator overhead.
The configured cache memory budget can be used in a few different ways:
The configured cache-memory budget can be used to store the following types of
things:
* Key/Value metadata (i.e., RocksDB's internal cache)
* Key/Value metadata (that is, RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)
* BlueStore data (that is, recently read or recently written object data)
Cache memory usage is governed by the following options:
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``.
The fraction of the cache devoted to data
is governed by the effective bluestore cache size (depending on
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the primary
device) as well as the meta and kv ratios.
The data fraction can be calculated by
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``
Cache memory usage is governed by the configuration options
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction
of the cache that is reserved for data is governed by both the effective
BlueStore cache size (which depends on the relevant
``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary
device) and the "meta" and "kv" ratios. This data fraction can be calculated
with the following formula: ``<effective_cache_size> * (1 -
bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.
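As a worked example (the numbers are illustrative, not recommendations): with
an effective cache size of 3GiB, ``bluestore_cache_meta_ratio`` set to ``0.4``,
and ``bluestore_cache_kv_ratio`` set to ``0.4``, the data fraction is
``3GiB * (1 - 0.4 - 0.4) = 0.6GiB``. Such settings might be applied as follows:

.. prompt:: bash $

   ceph config set osd bluestore_cache_size_ssd 3221225472
   ceph config set osd bluestore_cache_meta_ratio 0.4
   ceph config set osd bluestore_cache_kv_ratio 0.4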
.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
@ -233,29 +250,28 @@ The data fraction can be calculated by
Checksums
=========
BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data
checksumming is done by BlueStore and can make use of `crc32c`,
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
suitable for most purposes.
BlueStore checksums all metadata and all data written to disk. Metadata
checksumming is handled by RocksDB and uses the `crc32c` algorithm. By
contrast, data checksumming is handled by BlueStore and can use any of
`crc32c`, `xxhash32`, or `xxhash64`. The default checksum algorithm is
`crc32c`, which is suitable for most purposes.
Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.
Full data checksumming increases the amount of metadata that BlueStore must
store and manage. Whenever possible (for example, when clients hint that data
is written and read sequentially), BlueStore will checksum larger blocks. In
many cases, however, it must store a checksum value (usually 4 bytes) for every
4 KB block of data.
It is possible to use a smaller checksum value by truncating the
checksum to two or one byte, reducing the metadata overhead. The
trade-off is that the probability that a random error will not be
detected is higher with a smaller checksum, going from about one in
four billion with a 32-bit (4 byte) checksum to one in 65,536 for a
16-bit (2 byte) checksum or one in 256 for an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting `crc32c_16` or
`crc32c_8` as the checksum algorithm.
It is possible to obtain a smaller checksum value by truncating the checksum to
one or two bytes and reducing the metadata overhead. A drawback of this
approach is that it increases the probability of a random error going
undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in
65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte)
checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8`
as the checksum algorithm.
The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example:
The *checksum algorithm* can be specified either via a per-pool ``csum_type``
configuration option or via the global configuration option. For example:
.. prompt:: bash $
@ -266,34 +282,35 @@ The *checksum algorithm* can be set either via a per-pool
Inline Compression
==================
BlueStore supports inline compression using `snappy`, `zlib`, or
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.
BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`.
Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:
Whether data in BlueStore is compressed is determined by two factors: (1) the
*compression mode* and (2) any client hints associated with a write operation.
The compression modes are as follows:
* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
*compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
* **aggressive**: Do compress data unless the write operation has an
*incompressible* hint set.
* **force**: Try to compress data no matter what.
For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.
For more information about the *compressible* and *incompressible* I/O hints,
see :c:func:`rados_set_alloc_hint`.
Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently it will not be used and the original
(uncompressed) data will be stored. For example, if the ``bluestore
compression required ratio`` is set to ``.7`` then the compressed data
must be 70% of the size of the original (or smaller).
Note that data in BlueStore is stored in compressed form only if compression
reduces the size of the data chunk sufficiently (as determined by the
``bluestore compression required ratio`` setting). No matter which compression
mode is in use, if the compressed chunk is too big, it is discarded and the
original (uncompressed) data is stored instead. For example, if ``bluestore
compression required ratio`` is set to ``.7``, then data is stored in
compressed form only if the size of the compressed data is no more than 70% of
the size of the original data.
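For example, the required ratio could be applied to all OSDs via the central
configuration database (the per-pool mechanism is described next; the value
``0.7`` is only illustrative):

.. prompt:: bash $

   ceph config set osd bluestore_compression_required_ratio 0.7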
The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with:
The *compression mode*, *compression algorithm*, *compression required ratio*,
*min blob size*, and *max blob size* settings can be specified either via a
per-pool property or via a global config option. To specify pool properties,
run the following commands:
.. prompt:: bash $
@ -318,27 +335,30 @@ set with:
RocksDB Sharding
================
Internally BlueStore uses multiple types of key-value data,
stored in RocksDB. Each data type in BlueStore is assigned a
unique prefix. Until Pacific all key-value data was stored in
single RocksDB column family: 'default'. Since Pacific,
BlueStore can divide this data into multiple RocksDB column
families. When keys have similar access frequency, modification
frequency and lifetime, BlueStore benefits from better caching
and more precise compaction. This improves performance, and also
requires less disk space during compaction, since each column
family is smaller and can compact independent of others.
BlueStore maintains several types of internal key-value data, all of which are
stored in RocksDB. Each data type in BlueStore is assigned a unique prefix.
Prior to the Pacific release, all key-value data was stored in a single RocksDB
column family: 'default'. In Pacific and later releases, however, BlueStore can
divide key-value data into several RocksDB column families. BlueStore achieves
better caching and more precise compaction when keys are similar: specifically,
when keys have similar access frequency, similar modification frequency, and a
similar lifetime. Under such conditions, performance is improved and less disk
space is required during compaction (because each column family is smaller and
is able to compact independently of the others).
OSDs deployed in Pacific or later use RocksDB sharding by default.
If Ceph is upgraded to Pacific from a previous version, sharding is off.
OSDs deployed in Pacific or later releases use RocksDB sharding by default.
However, if Ceph has been upgraded to Pacific or a later version from a
previous version, sharding is disabled on any OSDs that were created before
Pacific.
To enable sharding and apply the Pacific defaults, stop an OSD and run
To enable sharding and apply the Pacific defaults to a specific OSD, stop the
OSD and run the following command:
.. prompt:: bash #
ceph-bluestore-tool \
ceph-bluestore-tool \
--path <data path> \
--sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
--sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \
reshard
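To check the sharding definition that a given (stopped) OSD is currently using,
the ``show-sharding`` command of ``ceph-bluestore-tool`` can be run against the
same data path, for example:

.. prompt:: bash #

   ceph-bluestore-tool --path <data path> show-sharding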
.. confval:: bluestore_rocksdb_cf
@ -354,165 +374,179 @@ Throttling
.. confval:: bluestore_throttle_cost_per_io_ssd
SPDK Usage
==================
==========
If you want to use the SPDK driver for NVMe devices, you must prepare your system.
Refer to `SPDK document`__ for more details.
To use the SPDK driver for NVMe devices, you must first prepare your system.
See `SPDK document`__.
.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples
SPDK offers a script to configure the device automatically. Users can run the
script as root:
SPDK offers a script that will configure the device automatically. Run this
script with root permissions:
.. prompt:: bash $
sudo src/spdk/scripts/setup.sh
You will need to specify the subject NVMe device's device selector with
the "spdk:" prefix for ``bluestore_block_path``.
You will need to specify the subject NVMe device's device selector with the
"spdk:" prefix for ``bluestore_block_path``.
For example, you can find the device selector of an Intel PCIe SSD with:
In the following example, you first find the device selector of an Intel NVMe
SSD by running the following command:
.. prompt:: bash $
lspci -mm -n -D -d 8086:0953
lspci -mm -n -D -d 8086:0953
The device selector always has the form of ``DDDD:BB:DD.FF`` or ``DDDD.BB.DD.FF``.
The form of the device selector is either ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``.
and then set::
Next, supposing that ``0000:01:00.0`` is the device selector found in the
output of the ``lspci`` command, you can specify the device selector by running
the following command::
bluestore_block_path = spdk:0000:01:00.0
Where ``0000:01:00.0`` is the device selector found in the output of ``lspci``
command above.
You may also specify a remote NVMeoF target over the TCP transport, as in the
following example::
To run multiple SPDK instances per node, you must specify the
amount of dpdk memory in MB that each instance will use, to make sure each
instance uses its own dpdk memory
bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"
In most cases, a single device can be used for data, DB, and WAL. We describe
To run multiple SPDK instances per node, you must make sure each instance uses
its own DPDK memory by specifying for each instance the amount of DPDK memory
(in MB) that the instance will use.
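One possible sketch of such a setup, assuming that the per-instance DPDK memory
is controlled by the ``bluestore_spdk_mem`` option (a value in MB), assigns
each OSD instance its own allocation:

.. prompt:: bash $

   ceph config set osd.0 bluestore_spdk_mem 512
   ceph config set osd.1 bluestore_spdk_mem 512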
In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all IOs are issued through SPDK.::
settings to ensure that all I/Os are issued through SPDK::
bluestore_block_db_path = ""
bluestore_block_db_size = 0
bluestore_block_wal_path = ""
bluestore_block_wal_size = 0
Otherwise, the current implementation will populate the SPDK map files with
kernel file system symbols and will use the kernel driver to issue DB/WAL IO.
If these settings are not entered, then the current implementation will
populate the SPDK map files with kernel file system symbols and will use the
kernel driver to issue DB/WAL I/Os.
Minimum Allocation Size
========================
=======================
There is a configured minimum amount of storage that BlueStore will allocate on
an OSD. In practice, this is the least amount of capacity that a RADOS object
can consume. The value of :confval:`bluestore_min_alloc_size` is derived from the
value of :confval:`bluestore_min_alloc_size_hdd` or :confval:`bluestore_min_alloc_size_ssd`
depending on the OSD's ``rotational`` attribute. This means that when an OSD
is created on an HDD, BlueStore will be initialized with the current value
of :confval:`bluestore_min_alloc_size_hdd`, and SSD OSDs (including NVMe devices)
with the value of :confval:`bluestore_min_alloc_size_ssd`.
There is a configured minimum amount of storage that BlueStore allocates on an
underlying storage device. In practice, this is the least amount of capacity
that even a tiny RADOS object can consume on each OSD's primary device. The
configuration option in question--:confval:`bluestore_min_alloc_size`--derives
its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or
:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs
(including NVMe devices), BlueStore is initialized with the current value of
:confval:`bluestore_min_alloc_size_ssd`.
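To check the values that will be applied to newly created OSDs, the central
configuration database can be queried, for example:

.. prompt:: bash $

   ceph config get osd bluestore_min_alloc_size_hdd
   ceph config get osd bluestore_min_alloc_size_ssd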
Through the Mimic release, the default values were 64KB and 16KB for rotational
(HDD) and non-rotational (SSD) media respectively. Octopus changed the default
for SSD (non-rotational) media to 4KB, and Pacific changed the default for HDD
(rotational) media to 4KB as well.
In Mimic and earlier releases, the default values were 64KB for rotational
media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
changed the default value for non-rotational media (SSD) to 4KB, and the
Pacific release changed the default value for rotational media (HDD) to 4KB.
These changes were driven by space amplification experienced by Ceph RADOS
GateWay (RGW) deployments that host large numbers of small files
These changes were driven by space amplification that was experienced by Ceph
RADOS Gateway (RGW) deployments that hosted large numbers of small files
(S3/Swift objects).
For example, when an RGW client stores a 1KB S3 object, it is written to a
single RADOS object. With the default :confval:`min_alloc_size` value, 4KB of
underlying drive space is allocated. This means that roughly
(4KB - 1KB) == 3KB is allocated but never used, which corresponds to 300%
overhead or 25% efficiency. Similarly, a 5KB user object will be stored
as one 4KB and one 1KB RADOS object, again stranding 4KB of device capacity,
though in this case the overhead is a much smaller percentage. Think of this
in terms of the remainder from a modulus operation. The overhead *percentage*
thus decreases rapidly as user object size increases.
For example, when an RGW client stores a 1 KB S3 object, that object is written
to a single RADOS object. In accordance with the default
:confval:`min_alloc_size` value, 4 KB of underlying drive space is allocated.
This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated but never
used: this corresponds to 300% overhead or 25% efficiency. Similarly, a 5 KB
user object will be stored as two RADOS objects, a 4 KB RADOS object and a 1 KB
RADOS object, with the result that 4KB of device capacity is stranded. In this
case, however, the overhead percentage is much smaller. Think of this in terms
of the remainder from a modulus operation. The overhead *percentage* thus
decreases rapidly as object size increases.
An easily missed additional subtlety is that this
takes place for *each* replica. So when using the default three copies of
data (3R), a 1KB S3 object actually consumes roughly 9KB of storage device
capacity. If erasure coding (EC) is used instead of replication, the
amplification may be even higher: for a ``k=4,m=2`` pool, our 1KB S3 object
will allocate (6 * 4KB) = 24KB of device capacity.
There is an additional subtlety that is easily missed: the amplification
phenomenon just described takes place for *each* replica. For example, when
using the default of three copies of data (3R), a 1 KB S3 object actually
strands roughly 9 KB of storage device capacity. If erasure coding (EC) is used
instead of replication, the amplification might be even higher: for a ``k=4,
m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied by 6)
of device capacity.
When an RGW bucket pool contains many relatively large user objects, the effect
of this phenomenon is often negligible, but should be considered for deployments
that expect a significant fraction of relatively small objects.
of this phenomenon is often negligible. However, with deployments that can
expect a significant fraction of relatively small user objects, the effect
should be taken into consideration.
The 4KB default value aligns well with conventional HDD and SSD devices. Some
new coarse-IU (Indirection Unit) QLC SSDs however perform and wear best
when :confval:`bluestore_min_alloc_size_ssd`
is set at OSD creation to match the device's IU:. 8KB, 16KB, or even 64KB.
These novel storage drives allow one to achieve read performance competitive
with conventional TLC SSDs and write performance faster than HDDs, with
high density and lower cost than TLC SSDs.
The 4KB default value aligns well with conventional HDD and SSD devices.
However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear
best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation
to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel
storage drives can achieve read performance that is competitive with that of
conventional TLC SSDs and write performance that is faster than that of HDDs,
with higher density and lower cost than TLC SSDs.
Note that when creating OSDs on these devices, one must carefully apply the
non-default value only to appropriate devices, and not to conventional SSD and
HDD devices. This may be done through careful ordering of OSD creation, custom
OSD device classes, and especially by the use of central configuration _masks_.
Note that when creating OSDs on these novel devices, one must be careful to
apply the non-default value only to appropriate devices, and not to
conventional HDD and SSD devices. Errors can be avoided through careful ordering
of OSD creation, with custom OSD device classes, and especially by the use of
central configuration *masks*.
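As one illustration of the mask-based approach (the host name and the value are
only examples), the non-default value could be confined to a host that carries
only coarse-IU QLC devices before its OSDs are created:

.. prompt:: bash $

   ceph config set osd/host:qlc-host-1 bluestore_min_alloc_size_ssd 16384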
Quincy and later releases add
the :confval:`bluestore_use_optimal_io_size_for_min_alloc_size`
option that enables automatic discovery of the appropriate value as each OSD is
created. Note that the use of ``bcache``, ``OpenCAS``, ``dmcrypt``,
``ATA over Ethernet``, `iSCSI`, or other device layering / abstraction
technologies may confound the determination of appropriate values. OSDs
deployed on top of VMware storage have been reported to also
sometimes report a ``rotational`` attribute that does not match the underlying
hardware.
In Quincy and later releases, you can use the
:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow
automatic discovery of the correct value as each OSD is created. Note that the
use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, `iSCSI`, or
other device-layering and abstraction technologies might confound the
determination of correct values. Moreover, OSDs deployed on top of VMware
storage have sometimes been found to report a ``rotational`` attribute that
does not match the underlying hardware.
We suggest inspecting such OSDs at startup via logs and admin sockets to ensure that
behavior is appropriate. Note that this also may not work as desired with
older kernels. You can check for this by examining the presence and value
of ``/sys/block/<drive>/queue/optimal_io_size``.
We suggest inspecting such OSDs at startup via logs and admin sockets in order
to ensure that their behavior is correct. Be aware that this kind of inspection
might not work as expected with older kernels. To check for this issue,
examine the presence and value of ``/sys/block/<drive>/queue/optimal_io_size``.
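For example, on the OSD node (``sda`` is a placeholder device name):

.. prompt:: bash #

   cat /sys/block/sda/queue/optimal_io_size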
You may also inspect a given OSD:
.. note:: When running Reef or a later Ceph release, the ``min_alloc_size``
baked into each OSD is conveniently reported by ``ceph osd metadata``.
To inspect a specific OSD, run the following command:
.. prompt:: bash #
ceph osd metadata osd.1701 | grep rotational
ceph osd metadata osd.1701 | egrep rotational\|alloc
This space amplification may manifest as an unusually high ratio of raw to
stored data reported by ``ceph df``. ``ceph osd df`` may also report
anomalously high ``%USE`` / ``VAR`` values when
compared to other, ostensibly identical OSDs. A pool using OSDs with
mismatched ``min_alloc_size`` values may experience unexpected balancer
behavior as well.
Note that this BlueStore attribute takes effect *only* at OSD creation; if
changed later, a given OSD's behavior will not change unless / until it is
destroyed and redeployed with the appropriate option value(s). Upgrading
to a later Ceph release will *not* change the value used by OSDs deployed
under older releases or with other settings.
This space amplification might manifest as an unusually high ratio of raw to
stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR``
values reported by ``ceph osd df`` that are unusually high in comparison to
other, ostensibly identical, OSDs. Finally, there might be unexpected balancer
behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values.
This BlueStore attribute takes effect *only* at OSD creation; if the attribute
is changed later, a specific OSD's behavior will not change unless and until
the OSD is destroyed and redeployed with the appropriate option value(s).
Upgrading to a later Ceph release will *not* change the value used by OSDs that
were deployed under older releases or with other settings.
.. confval:: bluestore_min_alloc_size
.. confval:: bluestore_min_alloc_size_hdd
.. confval:: bluestore_min_alloc_size_ssd
.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size
DSA (Data Streaming Accelerator Usage)
DSA (Data Streaming Accelerator) Usage
======================================
If you want to use the DML library to drive DSA device for offloading
read/write operations on Persist memory in Bluestore. You need to install
`DML`_ and `idxd-config`_ library in your machine with SPR (Sapphire Rapids) CPU.
If you want to use the DML library to drive the DSA device for offloading
read/write operations on persistent memory (PMEM) in BlueStore, you need to
install `DML`_ and the `idxd-config`_ library. This will work only on machines
that have an SPR (Sapphire Rapids) CPU.
.. _DML: https://github.com/intel/DML
.. _dml: https://github.com/intel/dml
.. _idxd-config: https://github.com/intel/idxd-config
After installing the DML software, you need to configure the shared
work queues (WQs) with the following WQ configuration example via accel-config tool:
After installing the DML software, configure the shared work queues (WQs) with
reference to the following WQ configuration example:
.. prompt:: bash $
accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="MyApp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
accel-config config-engine dsa0/engine0.1 --group-id=1
accel-config enable-device dsa0
accel-config enable-wq dsa0/wq0.1

View File

@ -4,116 +4,116 @@
Configuring Ceph
==================
When Ceph services start, the initialization process activates a series
of daemons that run in the background. A :term:`Ceph Storage Cluster` runs
at a minimum three types of daemons:
When Ceph services start, the initialization process activates a series of
daemons that run in the background. A :term:`Ceph Storage Cluster` runs at
least three types of daemons:
- :term:`Ceph Monitor` (``ceph-mon``)
- :term:`Ceph Manager` (``ceph-mgr``)
- :term:`Ceph OSD Daemon` (``ceph-osd``)
Ceph Storage Clusters that support the :term:`Ceph File System` also run at
least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that
support :term:`Ceph Object Storage` run Ceph RADOS Gateway daemons
(``radosgw``) as well.
least one :term:`Ceph Metadata Server` (``ceph-mds``). Clusters that support
:term:`Ceph Object Storage` run Ceph RADOS Gateway daemons (``radosgw``).
Each daemon has a number of configuration options, each of which has a
default value. You may adjust the behavior of the system by changing these
configuration options. Be careful to understand the consequences before
Each daemon has a number of configuration options, each of which has a default
value. You may adjust the behavior of the system by changing these
configuration options. Be careful to understand the consequences before
overriding default values, as it is possible to significantly degrade the
performance and stability of your cluster. Also note that default values
sometimes change between releases, so it is best to review the version of
this documentation that aligns with your Ceph release.
performance and stability of your cluster. Note too that default values
sometimes change between releases. For this reason, it is best to review the
version of this documentation that applies to your Ceph release.
Option names
============
All Ceph configuration options have a unique name consisting of words
formed with lower-case characters and connected with underscore
(``_``) characters.
Each of the Ceph configuration options has a unique name that consists of words
formed with lowercase characters and connected with underscore characters
(``_``).
When option names are specified on the command line, either underscore
(``_``) or dash (``-``) characters can be used interchangeable (e.g.,
When option names are specified on the command line, underscore (``_``) and
dash (``-``) characters can be used interchangeably (for example,
``--mon-host`` is equivalent to ``--mon_host``).
When option names appear in configuration files, spaces can also be
used in place of underscore or dash. We suggest, though, that for
clarity and convenience you consistently use underscores, as we do
When option names appear in configuration files, spaces can also be used in
place of underscores or dashes. However, for the sake of clarity and
convenience, we suggest that you consistently use underscores, as we do
throughout this documentation.
Config sources
==============
Each Ceph daemon, process, and library will pull its configuration
from several sources, listed below. Sources later in the list will
override those earlier in the list when both are present.
Each Ceph daemon, process, and library pulls its configuration from one or more
of the several sources listed below. Sources that occur later in the list
override those that occur earlier in the list (when both are present).
- the compiled-in default value
- the monitor cluster's centralized configuration database
- a configuration file stored on the local host
- environment variables
- command line arguments
- runtime overrides set by an administrator
- command-line arguments
- runtime overrides that are set by an administrator
One of the first things a Ceph process does on startup is parse the
configuration options provided via the command line, environment, and
local configuration file. The process will then contact the monitor
cluster to retrieve configuration stored centrally for the entire
cluster. Once a complete view of the configuration is available, the
daemon or process startup will proceed.
configuration options provided via the command line, via the environment, and
via the local configuration file. Next, the process contacts the monitor
cluster to retrieve centrally-stored configuration for the entire cluster.
After a complete view of the configuration is available, the startup of the
daemon or process will commence.
.. _bootstrap-options:
Bootstrap options
-----------------
Some configuration options affect the process's ability to contact the
monitors, to authenticate, and to retrieve the cluster-stored configuration.
For this reason, these options might need to be stored locally on the node, and
set by means of a local configuration file. These options include the
following:
Bootstrap options are configuration options that affect the process's ability
to contact the monitors, to authenticate, and to retrieve the cluster-stored
configuration. For this reason, these options might need to be stored locally
on the node, and set by means of a local configuration file. These options
include the following:
.. confval:: mon_host
.. confval:: mon_host_override
- :confval:`mon_dns_srv_name`
- ``mon_data``, ``osd_data``, ``mds_data``, ``mgr_data``, and
similar options that define which local directory the daemon
stores its data in.
- :confval:`keyring`, :confval:`keyfile`, and/or :confval:`key`, which can be used to
specify the authentication credential to use to authenticate with
the monitor. Note that in most cases the default keyring location
is in the data directory specified above.
- :confval:`mon_data`, :confval:`osd_data`, :confval:`mds_data`,
:confval:`mgr_data`, and similar options that define which local directory
the daemon stores its data in.
- :confval:`keyring`, :confval:`keyfile`, and/or :confval:`key`, which can be
used to specify the authentication credential to use to authenticate with the
monitor. Note that in most cases the default keyring location is in the data
directory specified above.
In most cases, the default values of these options are suitable. There is one
exception to this: the :confval:`mon_host` option that identifies the addresses
of the cluster's monitors. When DNS is used to identify monitors, a local Ceph
In most cases, there is no reason to modify the default values of these
options. However, there is one exception: the :confval:`mon_host` option,
which identifies the addresses of the cluster's monitors. When :ref:`DNS is
used to identify monitors<mon-dns-lookup>`, a local Ceph configuration file
can be avoided entirely.
Skipping monitor config
-----------------------
Pass the option ``--no-mon-config`` to any process to skip the step that
retrieves configuration information from the cluster monitors. This is useful
in cases where configuration is managed entirely via configuration files, or
when the monitor cluster is down and some maintenance activity needs to be
done.
The option ``--no-mon-config`` can be passed in any command in order to skip
the step that retrieves configuration information from the cluster's monitors.
Skipping this retrieval step can be useful in cases where configuration is
managed entirely via configuration files, or when maintenance activity needs to
be done but the monitor cluster is down.
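For example, a daemon might be started with configuration taken only from local
sources (the OSD id ``0`` is a placeholder):

.. prompt:: bash #

   ceph-osd -i 0 --no-mon-config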
.. _ceph-conf-file:
Configuration sections
======================
Any given process or daemon has a single value for each configuration
option. However, values for an option may vary across different
daemon types even daemons of the same type. Ceph options that are
stored in the monitor configuration database or in local configuration
files are grouped into sections to indicate which daemons or clients
they apply to.
Each of the configuration options associated with a single process or daemon
has a single value. However, the values for a configuration option can vary
across daemon types, and can vary even across different daemons of the same
type. Ceph options that are stored in the monitor configuration database or in
local configuration files are grouped into sections |---| so-called "configuration
sections" |---| to indicate which daemons or clients they apply to.
These sections include:
These sections include the following:
.. confsec:: global
@ -156,43 +156,42 @@ These sections include:
.. confsec:: client
Settings under ``client`` affect all Ceph Clients
(e.g., mounted Ceph File Systems, mounted Ceph Block Devices,
etc.) as well as Rados Gateway (RGW) daemons.
Settings under ``client`` affect all Ceph clients
(for example, mounted Ceph File Systems, mounted Ceph Block Devices)
as well as RADOS Gateway (RGW) daemons.
:example: ``objecter_inflight_ops = 512``
Sections may also specify an individual daemon or client name. For example,
Configuration sections can also specify an individual daemon or client name. For example,
``mon.foo``, ``osd.123``, and ``client.smith`` are all valid section names.
Any given daemon will draw its settings from the global section, the
daemon or client type section, and the section sharing its name.
Settings in the most-specific section take precedence, so for example
if the same option is specified in both :confsec:`global`, :confsec:`mon`, and
``mon.foo`` on the same source (i.e., in the same configurationfile),
the ``mon.foo`` value will be used.
Any given daemon will draw its settings from the global section, the daemon-
or client-type section, and the section sharing its name. Settings in the
most-specific section take precedence: for example, if the same option is
specified in :confsec:`global`, :confsec:`mon`, and ``mon.foo`` in the same
source (that is, in the same configuration file), the ``mon.foo`` setting will
be used.
If multiple values of the same configuration option are specified in the same
section, the last value wins.
Note that values from the local configuration file always take
precedence over values from the monitor configuration database,
regardless of which section they appear in.
section, the last value specified takes precedence.
Note that values from the local configuration file always take precedence over
values from the monitor configuration database, regardless of the section in
which they appear.
.. _ceph-metavariables:
Metavariables
=============
Metavariables simplify Ceph Storage Cluster configuration
dramatically. When a metavariable is set in a configuration value,
Ceph expands the metavariable into a concrete value at the time the
configuration value is used. Ceph metavariables are similar to variable expansion in the Bash shell.
Metavariables dramatically simplify Ceph storage cluster configuration. When a
metavariable is set in a configuration value, Ceph expands the metavariable at
the time the configuration value is used. In this way, Ceph metavariables
behave similarly to the way that variable expansion works in the Bash shell.
Ceph supports the following metavariables:
Ceph supports the following metavariables:
.. describe:: $cluster
@ -204,7 +203,7 @@ Ceph supports the following metavariables:
.. describe:: $type
Expands to a daemon or process type (e.g., ``mds``, ``osd``, or ``mon``)
Expands to a daemon or process type (for example, ``mds``, ``osd``, or ``mon``)
:example: ``/var/lib/ceph/$type``
@ -233,33 +232,32 @@ Ceph supports the following metavariables:
:example: ``/var/run/ceph/$cluster-$name-$pid.asok``
The Configuration File
======================
Ceph configuration file
=======================
On startup, Ceph processes search for a configuration file in the
following locations:
#. ``$CEPH_CONF`` (*i.e.,* the path following the ``$CEPH_CONF``
#. ``$CEPH_CONF`` (that is, the path following the ``$CEPH_CONF``
environment variable)
#. ``-c path/path`` (*i.e.,* the ``-c`` command line argument)
#. ``-c path/path`` (that is, the ``-c`` command line argument)
#. ``/etc/ceph/$cluster.conf``
#. ``~/.ceph/$cluster.conf``
#. ``./$cluster.conf`` (*i.e.,* in the current working directory)
#. ``./$cluster.conf`` (that is, in the current working directory)
#. On FreeBSD systems only, ``/usr/local/etc/ceph/$cluster.conf``
where ``$cluster`` is the cluster's name (default ``ceph``).
Here ``$cluster`` is the cluster's name (default: ``ceph``).
The Ceph configuration file uses an *ini* style syntax. You can add comment
text after a pound sign (#) or a semi-colon (;). For example:
The Ceph configuration file uses an ``ini`` style syntax. You can add comment
text after a pound sign (#) or a semicolon (;). For example:
.. code-block:: ini
# <--A number (#) sign precedes a comment.
; A comment may be anything.
# Comments always follow a semi-colon (;) or a pound (#) on each line.
# The end of the line terminates a comment.
# We recommend that you provide comments in your configuration file(s).
# <--A number sign (#) precedes a comment.
; A comment may be anything.
# Comments always follow a semicolon (;) or a pound sign (#) on each line.
# The end of the line terminates a comment.
# We recommend that you provide comments in your configuration file(s).
.. _ceph-conf-settings:
@ -268,40 +266,41 @@ Config file section names
-------------------------
The configuration file is divided into sections. Each section must begin with a
valid configuration section name (see `Configuration sections`_, above)
surrounded by square brackets. For example,
valid configuration section name (see `Configuration sections`_, above) that is
surrounded by square brackets. For example:
.. code-block:: ini
[global]
debug_ms = 0
[osd]
debug_ms = 1
[global]
debug_ms = 0
[osd.1]
debug_ms = 10
[osd]
debug_ms = 1
[osd.2]
debug_ms = 10
[osd.1]
debug_ms = 10
[osd.2]
debug_ms = 10
Config file option values
-------------------------
The value of a configuration option is a string. If it is too long to
fit in a single line, you can put a backslash (``\``) at the end of line
as the line continuation marker, so the value of the option will be
the string after ``=`` in current line combined with the string in the next
line::
The value of a configuration option is a string. If the string is too long to
fit on a single line, you can put a backslash (``\``) at the end of the line
and the backslash will act as a line continuation marker. In such a case, the
value of the option will be the string after ``=`` in the current line,
combined with the string in the next line. Here is an example::
[global]
foo = long long ago\
long ago
In the example above, the value of "``foo``" would be "``long long ago long ago``".
In this example, the value of the "``foo``" option is "``long long ago long
ago``".
Normally, the option value ends with a new line, or a comment, like
An option value typically ends with either a newline or a comment. For
example:
.. code-block:: ini
@ -309,100 +308,108 @@ Normally, the option value ends with a new line, or a comment, like
obscure_one = difficult to explain # I will try harder in next release
simpler_one = nothing to explain
In the example above, the value of "``obscure one``" would be "``difficult to explain``";
and the value of "``simpler one`` would be "``nothing to explain``".
In this example, the value of the "``obscure_one``" option is "``difficult to
explain``" and the value of the "``simpler_one``" option is "``nothing to
explain``".
If an option value contains spaces, and we want to make it explicit, we
could quote the value using single or double quotes, like
When an option value contains spaces, it can be enclosed within single quotes
or double quotes in order to make its scope clear and to ensure that the first
space in the value is not interpreted as the end of the value. For example:
.. code-block:: ini
[global]
line = "to be, or not to be"
Certain characters are not allowed to be present in the option values directly.
They are ``=``, ``#``, ``;`` and ``[``. If we have to, we need to escape them,
like
In option values, four characters have special meaning: ``=``, ``#``, ``;``,
and ``[``. These characters are permitted to occur in an option value only if
they are immediately preceded by the backslash character (``\``). For example:
.. code-block:: ini
[global]
secret = "i love \# and \["
Every configuration option is typed with one of the types below:
Each configuration option falls under one of the following types:
.. describe:: int
64-bit signed integer, Some SI prefixes are supported, like "K", "M", "G",
"T", "P", "E", meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`,
10\ :sup:`9`, etc. And "B" is the only supported unit. So, "1K", "1M", "128B" and "-1" are all valid
option values. Some times, a negative value implies "unlimited" when it comes to
an option for threshold or limit.
64-bit signed integer. Some SI suffixes are supported, such as "K", "M",
"G", "T", "P", and "E" (meaning, respectively, 10\ :sup:`3`, 10\ :sup:`6`,
10\ :sup:`9`, etc.). "B" is the only supported unit string. Thus "1K", "1M",
"128B" and "-1" are all valid option values. When a negative value is
assigned to a threshold option, this can indicate that the option is
"unlimited" -- that is, that there is no threshold or limit in effect.
:example: ``42``, ``-1``
.. describe:: uint
It is almost identical to ``integer``. But a negative value will be rejected.
This differs from ``integer`` only in that negative values are not
permitted.
:example: ``256``, ``0``
.. describe:: str
Free style strings encoded in UTF-8, but some characters are not allowed. Please
reference the above notes for the details.
A string encoded in UTF-8. Certain characters are not permitted. Reference
the above notes for the details.
:example: ``"hello world"``, ``"i love \#"``, ``yet-another-name``
.. describe:: boolean
one of the two values ``true`` or ``false``. But an integer is also accepted,
where "0" implies ``false``, and any non-zero values imply ``true``.
Typically either of the two values ``true`` or ``false``. However, any
integer is permitted: "0" implies ``false``, and any non-zero value implies
``true``.
:example: ``true``, ``false``, ``1``, ``0``
.. describe:: addr
a single address optionally prefixed with ``v1``, ``v2`` or ``any`` for the messenger
protocol. If the prefix is not specified, ``v2`` protocol is used. Please see
:ref:`address_formats` for more details.
A single address, optionally prefixed with ``v1``, ``v2`` or ``any`` for the
messenger protocol. If no prefix is specified, the ``v2`` protocol is used.
For more details, see :ref:`address_formats`.
:example: ``v1:1.2.3.4:567``, ``v2:1.2.3.4:567``, ``1.2.3.4:567``, ``2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567``, ``[::1]:6789``
.. describe:: addrvec
a set of addresses separated by ",". The addresses can be optionally quoted with ``[`` and ``]``.
A set of addresses separated by ",". The addresses can be optionally quoted
with ``[`` and ``]``.
:example: ``[v1:1.2.3.4:567,v2:1.2.3.4:568]``, ``v1:1.2.3.4:567,v1:1.2.3.14:567`` ``[2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::567], [2409:8a1e:8fb6:aa20:1260:4bff:fe92:18f5::568]``
.. describe:: uuid
the string format of a uuid defined by `RFC4122 <https://www.ietf.org/rfc/rfc4122.txt>`_.
And some variants are also supported, for more details, see
`Boost document <https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_.
The string format of a uuid defined by `RFC4122
<https://www.ietf.org/rfc/rfc4122.txt>`_. Certain variants are also
supported: for more details, see `Boost document
<https://www.boost.org/doc/libs/1_74_0/libs/uuid/doc/uuid.html#String%20Generator>`_.
:example: ``f81d4fae-7dec-11d0-a765-00a0c91e6bf6``
.. describe:: size
denotes a 64-bit unsigned integer. Both SI prefixes and IEC prefixes are
supported. And "B" is the only supported unit. A negative value will be
rejected.
64-bit unsigned integer. Both SI prefixes and IEC prefixes are supported.
"B" is the only supported unit string. Negative values are not permitted.
:example: ``1Ki``, ``1K``, ``1KiB`` and ``1B``.
.. describe:: secs
denotes a duration of time. By default the unit is second if not specified.
Following units of time are supported:
Denotes a duration of time. The default unit of time is the second.
The following units of time are supported:
* second: "s", "sec", "second", "seconds"
* minute: "m", "min", "minute", "minutes"
* hour: "hs", "hr", "hour", "hours"
* day: "d", "day", "days"
* week: "w", "wk", "week", "weeks"
* month: "mo", "month", "months"
* year: "y", "yr", "year", "years"
* second: ``s``, ``sec``, ``second``, ``seconds``
* minute: ``m``, ``min``, ``minute``, ``minutes``
* hour: ``hs``, ``hr``, ``hour``, ``hours``
* day: ``d``, ``day``, ``days``
* week: ``w``, ``wk``, ``week``, ``weeks``
* month: ``mo``, ``month``, ``months``
* year: ``y``, ``yr``, ``year``, ``years``
:example: ``1 m``, ``1m`` and ``1 week``
@ -411,102 +418,103 @@ Every configuration option is typed with one of the types below:
Monitor configuration database
==============================
The monitor cluster manages a database of configuration options that
can be consumed by the entire cluster, enabling streamlined central
configuration management for the entire system. The vast majority of
configuration options can and should be stored here for ease of
administration and transparency.
The monitor cluster manages a database of configuration options that can be
consumed by the entire cluster. This allows for streamlined central
configuration management of the entire system. For ease of administration and
transparency, the vast majority of configuration options can and should be
stored in this database.
A handful of settings may still need to be stored in local
configuration files because they affect the ability to connect to the
monitors, authenticate, and fetch configuration information. In most
cases this is limited to the ``mon_host`` option, although this can
also be avoided through the use of DNS SRV records.
Some settings might need to be stored in local configuration files because they
affect the ability of the process to connect to the monitors, to authenticate,
and to fetch configuration information. In most cases this applies only to the
``mon_host`` option. This issue can be avoided by using :ref:`DNS SRV
records<mon-dns-lookup>`.
Sections and masks
------------------
Configuration options stored by the monitor can live in a global
section, daemon type section, or specific daemon section, just like
options in a configuration file can.
Configuration options stored by the monitor can reside in a global section,
in a daemon-type section, or in a specific daemon section. In this, they are
no different from the options in a configuration file.
In addition, options may also have a *mask* associated with them to
further restrict which daemons or clients the option applies to.
Masks take two forms:
In addition, options may have a *mask* associated with them to further restrict
which daemons or clients the option applies to. Masks take two forms:
#. ``type:location`` where *type* is a CRUSH property like `rack` or
`host`, and *location* is a value for that property. For example,
#. ``type:location`` where ``type`` is a CRUSH property like ``rack`` or
``host``, and ``location`` is a value for that property. For example,
``host:foo`` would limit the option only to daemons or clients
running on a particular host.
#. ``class:device-class`` where *device-class* is the name of a CRUSH
device class (e.g., ``hdd`` or ``ssd``). For example,
#. ``class:device-class`` where ``device-class`` is the name of a CRUSH
device class (for example, ``hdd`` or ``ssd``). For example,
``class:ssd`` would limit the option only to OSDs backed by SSDs.
(This mask has no effect for non-OSD daemons or clients.)
(This mask has no effect on non-OSD daemons or clients.)
When setting a configuration option, the `who` may be a section name,
a mask, or a combination of both separated by a slash (``/``)
character. For example, ``osd/rack:foo`` would mean all OSD daemons
in the ``foo`` rack.
When viewing configuration options, the section name and mask are
generally separated out into separate fields or columns to ease readability.
In commands that specify a configuration option, the argument of the option (in
the following examples, this is the "who" string) may be a section name, a
mask, or a combination of both separated by a slash character (``/``). For
example, ``osd/rack:foo`` would refer to all OSD daemons in the ``foo`` rack.
When configuration options are shown, the section name and mask are presented
in separate fields or columns to make them more readable.
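For example, to apply one option only to OSDs backed by SSDs and another only
to OSDs on the host ``foo`` (the option values are only illustrative), commands
like the following could be used:

.. prompt:: bash $

   ceph config set osd/class:ssd osd_max_backfills 3
   ceph config set osd/host:foo debug_osd 10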
Commands
--------
The following CLI commands are used to configure the cluster:
* ``ceph config dump`` will dump the entire configuration database for
the cluster.
* ``ceph config dump`` dumps the entire monitor configuration
database for the cluster.
* ``ceph config get <who>`` will dump the configuration for a specific
daemon or client (e.g., ``mds.a``), as stored in the monitors'
configuration database.
* ``ceph config get <who>`` dumps the configuration options stored in
the monitor configuration database for a specific daemon or client
(for example, ``mds.a``).
* ``ceph config set <who> <option> <value>`` will set a configuration
option in the monitors' configuration database.
* ``ceph config get <who> <option>`` shows either a configuration value
stored in the monitor configuration database for a specific daemon or client
(for example, ``mds.a``), or, if that value is not present in the monitor
configuration database, the compiled-in default value.
* ``ceph config show <who>`` will show the reported running
configuration for a running daemon. These settings may differ from
those stored by the monitors if there are also local configuration
files in use or options have been overridden on the command line or
at run time. The source of the option values is reported as part
of the output.
* ``ceph config set <who> <option> <value>`` specifies a configuration
option in the monitor configuration database.
* ``ceph config assimilate-conf -i <input file> -o <output file>``
will ingest a configuration file from *input file* and move any
valid options into the monitors' configuration database. Any
settings that are unrecognized, invalid, or cannot be controlled by
the monitor will be returned in an abbreviated config file stored in
*output file*. This command is useful for transitioning from legacy
configuration files to centralized monitor-based configuration.
* ``ceph config show <who>`` shows the configuration for a running daemon.
These settings might differ from those stored by the monitors if there are
also local configuration files in use or if options have been overridden on
the command line or at run time. The source of the values of the options is
displayed in the output.
* ``ceph config assimilate-conf -i <input file> -o <output file>`` ingests a
configuration file from *input file* and moves any valid options into the
monitor configuration database. Any settings that are unrecognized, are
invalid, or cannot be controlled by the monitor will be returned in an
abbreviated configuration file stored in *output file*. This command is
useful for transitioning from legacy configuration files to centralized
monitor-based configuration.
Note that ``ceph config set <who> <option> <value>`` and ``ceph config get
<who> <option>`` will not necessarily return the same values. The latter
command will show compiled-in default values. In order to determine whether a
configuration option is present in the monitor configuration database, run
``ceph config dump``.
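As a brief illustration that ties these commands together (the daemon id and
the option are arbitrary examples):

.. prompt:: bash $

   ceph config set osd.123 debug_osd 10
   ceph config get osd.123 debug_osd
   ceph config dump | grep debug_osd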
Help
====
You can get help for a particular option with:
To get help for a particular option, run the following command:
.. prompt:: bash $
ceph config help <option>
Note that this will use the configuration schema that is compiled into the running monitors. If you have a mixed-version cluster (e.g., during an upgrade), you might also want to query the option schema from a specific running daemon:
.. prompt:: bash $
ceph daemon <name> config help [option]
For example:
.. prompt:: bash $
ceph config help log_file
::
::
log_file - path to log file
log_file - path to log file
(std::string, basic)
Default (non-daemon):
Default (daemon): /var/log/ceph/$cluster-$name.log
@ -543,20 +551,29 @@ or:
"can_update_at_runtime": false
}
The ``level`` property can be any of `basic`, `advanced`, or `dev`.
The `dev` options are intended for use by developers, generally for
testing purposes, and are not recommended for use by operators.
The ``level`` property can be ``basic``, ``advanced``, or ``dev``. The ``dev``
options are intended for use by developers, generally for testing purposes, and
are not recommended for use by operators.
.. note:: This command uses the configuration schema that is compiled into the
running monitors. If you have a mixed-version cluster (as might exist, for
example, during an upgrade), you might want to query the option schema from
a specific running daemon by running a command of the following form:
.. prompt:: bash $
ceph daemon <name> config help [option]
Runtime Changes
===============
In most cases, Ceph permits changes to the configuration of a daemon at
runtime. This can be used for increasing or decreasing the amount of logging
run time. This can be used for increasing or decreasing the amount of logging
output, for enabling or disabling debug settings, and for runtime optimization.
Configuration options can be updated via the ``ceph config set`` command. For
example, to enable the debug log level on a specific OSD, run a command of this form:
Use the ``ceph config set`` command to update configuration options. For
example, to enable the most verbose debug log level on a specific OSD, run a
command of the following form:
.. prompt:: bash $
@ -565,129 +582,133 @@ example, to enable the debug log level on a specific OSD, run a command of this
.. note:: If an option has been customized in a local configuration file, the
`central config
<https://ceph.io/en/news/blog/2018/new-mimic-centralized-configuration-management/>`_
setting will be ignored (it has a lower priority than the local
configuration file).
setting will be ignored because it has a lower priority than the local
configuration file.
.. note:: Log levels range from 0 to 20.
Override values
---------------
Options can be set temporarily by using the `tell` or `daemon` interfaces on
the Ceph CLI. These *override* values are ephemeral, which means that they
affect only the current instance of the daemon and revert to persistently
configured values when the daemon restarts.
Options can be set temporarily by using the ``tell`` or ``daemon``
interfaces of the Ceph CLI. These *override* values are ephemeral, which means
that they affect only the current instance of the daemon and revert to
persistently configured values when the daemon restarts.
Override values can be set in two ways:
#. From any host, send a message to a daemon with a command of the following
form:
.. prompt:: bash $
ceph tell <name> config set <option> <value>
For example:
.. prompt:: bash $
ceph tell osd.123 config set debug_osd 20
The ``tell`` command can also accept a wildcard as the daemon identifier.
For example, to adjust the debug level on all OSD daemons, run a command of
this form:
the following form:
.. prompt:: bash $
ceph tell osd.* config set debug_osd 20
#. On the host where the daemon is running, connect to the daemon via a socket
in ``/var/run/ceph`` by running a command of this form:
in ``/var/run/ceph`` by running a command of the following form:
.. prompt:: bash $
ceph daemon <name> config set <option> <value>
For example:
.. prompt:: bash $
ceph daemon osd.4 config set debug_osd 20
.. note:: In the output of the ``ceph config show`` command, these temporary
values are shown with a source of ``override``.
values are shown to have a source of ``override``.
Viewing runtime settings
========================
You can see the current options set for a running daemon with the ``ceph config show`` command. For example:
You can see the current settings specified for a running daemon with the ``ceph
config show`` command. For example, to see the (non-default) settings for the
daemon ``osd.0``, run the following command:
.. prompt:: bash $
ceph config show osd.0
will show you the (non-default) options for that daemon. You can also look at a specific option with:
To see a specific setting, run the following command:
.. prompt:: bash $
ceph config show osd.0 debug_osd
or view all options (even those with default values) with:
To see all settings (including those with default values), run the following
command:
.. prompt:: bash $
ceph config show-with-defaults osd.0
You can also observe settings for a running daemon by connecting to it from the local host via the admin socket. For example:
You can see all settings for a daemon that is currently running by connecting
to it on the local host via the admin socket. For example, to dump all
current settings, run the following command:
.. prompt:: bash $
ceph daemon osd.0 config show
will dump all current settings:
To see non-default settings and to see where each value came from (for example,
a config file, the monitor, or an override), run the following command:
.. prompt:: bash $
ceph daemon osd.0 config diff
will show only non-default settings (as well as where the value came from: a config file, the monitor, an override, etc.), and:
To see the value of a single setting, run the following command:
.. prompt:: bash $
ceph daemon osd.0 config get debug_osd
will report the value of a single option.
Changes since Nautilus
======================
Changes introduced in Octopus
=============================
With the Octopus release, we changed the way the configuration file is parsed.
These changes are as follows:
- Repeated configuration options are allowed, and no warnings will be printed.
The value of the last one is used, which means that the setting last in the file
is the one that takes effect. Before this change, we would print warning messages
when lines with duplicated options were encountered, like::
- Repeated configuration options are allowed, and no warnings will be
displayed. This means that the setting that comes last in the file is the one
that takes effect. Prior to this change, Ceph displayed warning messages when
lines containing duplicate options were encountered, such as::
warning line 42: 'foo' in section 'bar' redefined
- Invalid UTF-8 options were ignored with warning messages. But since Octopus,
they are treated as fatal errors.
- Backslash ``\`` is used as the line continuation marker to combine the next
line with current one. Before Octopus, it was required to follow a backslash with
a non-empty line. But in Octopus, an empty line following a backslash is now allowed.
- Prior to Octopus, options containing invalid UTF-8 characters were ignored
with warning messages. But in Octopus, they are treated as fatal errors.
- The backslash character ``\`` is used as the line-continuation marker that
combines the next line with the current one. Prior to Octopus, there was a
requirement that any end-of-line backslash be followed by a non-empty line.
But in Octopus, an empty line following a backslash is allowed.
- In the configuration file, each line specifies an individual configuration
option. The option's name and its value are separated with ``=``, and the
value may be quoted using single or double quotes. If an invalid
value may be enclosed within single or double quotes. If an invalid
configuration is specified, we will treat it as an invalid configuration
file ::
file::
bad option ==== bad value
- Prior to Octopus, if no section name was specified in the configuration file,
all options would be set as though they were within the :confsec:`global`
section. This approach is discouraged. Since Octopus, any configuration
file that has no section name must contain only a single option.
- Before Octopus, if no section name was specified in the configuration file,
all options would be set as though they were within the :confsec:`global` section. This is
now discouraged. Since Octopus, only a single option is allowed for
configuration files without a section name.
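The following hypothetical ``ceph.conf`` fragment illustrates the parsing rules
described above (the option values are illustrative only):

.. code-block:: ini

   [global]
   # Repeated options are allowed; the last occurrence wins, so the
   # effective value of "debug_ms" is 1/5.
   debug_ms = 0/0
   debug_ms = 1/5

   # A trailing backslash combines the next line with the current one.
   mon_host = 10.0.0.1, 10.0.0.2, \
   10.0.0.3

   # A value may be enclosed in single or double quotes.
   cluster_network = "10.0.0.0/24"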
.. |---| unicode:: U+2014 .. EM DASH :trim:

View File

@ -1,4 +1,3 @@
.. _ceph-conf-common-settings:
Common Settings
@ -7,30 +6,33 @@ Common Settings
The `Hardware Recommendations`_ section provides some hardware guidelines for
configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph
Node` to run multiple daemons. For example, a single node with multiple drives
may run one ``ceph-osd`` for each drive. Ideally, you will have a node for a
particular type of process. For example, some nodes may run ``ceph-osd``
daemons, other nodes may run ``ceph-mds`` daemons, and still other nodes may
run ``ceph-mon`` daemons.
usually runs one ``ceph-osd`` for each drive. Ideally, each node will be
assigned to a particular type of process. For example, some nodes might run
``ceph-osd`` daemons, other nodes might run ``ceph-mds`` daemons, and still
other nodes might run ``ceph-mon`` daemons.
Each node has a name. The name of a node can be found in its ``host`` setting.
Monitors also specify a network address and port (that is, a domain name or IP
address) that can be found in the ``addr`` setting. A basic configuration file
typically specifies only minimal settings for each instance of monitor daemons.
For example:
Each node has a name identified by the ``host`` setting. Monitors also specify
a network address and port (i.e., domain name or IP address) identified by the
``addr`` setting. A basic configuration file will typically specify only
minimal settings for each instance of monitor daemons. For example:
.. code-block:: ini
[global]
mon_initial_members = ceph1
mon_host = 10.0.0.1
[global]
mon_initial_members = ceph1
mon_host = 10.0.0.1
.. important:: The ``host`` setting is the short name of the node (i.e., not
an fqdn). It is **NOT** an IP address either. Enter ``hostname -s`` on
the command line to retrieve the name of the node. Do not use ``host``
settings for anything other than initial monitors unless you are deploying
Ceph manually. You **MUST NOT** specify ``host`` under individual daemons
when using deployment tools like ``chef`` or ``cephadm``, as those tools
will enter the appropriate values for you in the cluster map.
.. important:: The ``host`` setting's value is the short name of the node. It
is not an FQDN. It is **NOT** an IP address. To retrieve the name of the
node, enter ``hostname -s`` on the command line. Unless you are deploying
Ceph manually, do not use ``host`` settings for anything other than initial
monitor setup. **DO NOT** specify the ``host`` setting under individual
daemons when using deployment tools like ``chef`` or ``cephadm``. Such tools
are designed to enter the appropriate values for you in the cluster map.
.. _ceph-network-config:
@ -38,34 +40,35 @@ minimal settings for each instance of monitor daemons. For example:
Networks
========
See the `Network Configuration Reference`_ for a detailed discussion about
configuring a network for use with Ceph.
For more about configuring a network for use with Ceph, see the `Network
Configuration Reference`_ .
Monitors
========
Production Ceph clusters typically provision a minimum of three :term:`Ceph Monitor`
daemons to ensure availability should a monitor instance crash. A minimum of
three ensures that the Paxos algorithm can determine which version
of the :term:`Ceph Cluster Map` is the most recent from a majority of Ceph
Ceph production clusters typically provision at least three :term:`Ceph
Monitor` daemons to ensure availability in the event of a monitor instance
crash. A minimum of three :term:`Ceph Monitor` daemons ensures that the Paxos
algorithm is able to determine which version of the :term:`Ceph Cluster Map` is
the most recent. It makes this determination by consulting a majority of Ceph
Monitors in the quorum.
.. note:: You may deploy Ceph with a single monitor, but if the instance fails,
the lack of other monitors may interrupt data service availability.
the lack of other monitors might interrupt data-service availability.
Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and ``6789`` for the old v1 protocol.
Ceph Monitors normally listen on port ``3300`` for the new v2 protocol, and on
port ``6789`` for the old v1 protocol.
By default, Ceph expects to store monitor data under the
following path::
By default, Ceph expects to store monitor data on the following path::
/var/lib/ceph/mon/$cluster-$id
/var/lib/ceph/mon/$cluster-$id
You or a deployment tool (e.g., ``cephadm``) must create the corresponding
directory. With metavariables fully expressed and a cluster named "ceph", the
foregoing directory would evaluate to::
You or a deployment tool (for example, ``cephadm``) must create the
corresponding directory. With metavariables fully expressed and a cluster named
"ceph", the path specified in the above example evaluates to::
/var/lib/ceph/mon/ceph-a
/var/lib/ceph/mon/ceph-a
For additional details, see the `Monitor Config Reference`_.
@ -74,22 +77,22 @@ For additional details, see the `Monitor Config Reference`_.
.. _ceph-osd-config:
Authentication
==============
.. versionadded:: Bobtail 0.56
For Bobtail (v 0.56) and beyond, you should expressly enable or disable
authentication in the ``[global]`` section of your Ceph configuration file.
Authentication is explicitly enabled or disabled in the ``[global]`` section of
the Ceph configuration file, as shown here:
.. code-block:: ini
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
Additionally, you should enable message signing. See `Cephx Config Reference`_ for details.
In addition, you should enable message signing. For details, see `Cephx Config
Reference`_.
.. _Cephx Config Reference: ../auth-config-ref
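For example, message signing is typically enabled with settings along the
following lines in the ``[global]`` section; consult the Cephx Config Reference
for the authoritative option names and defaults:

.. code-block:: ini

   [global]
   cephx_require_signatures = true
   cephx_cluster_require_signatures = true
   cephx_service_require_signatures = true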
@ -100,65 +103,68 @@ Additionally, you should enable message signing. See `Cephx Config Reference`_ f
OSDs
====
Ceph production clusters typically deploy :term:`Ceph OSD Daemons` where one node
has one OSD daemon running a Filestore on one storage device. The BlueStore back
end is now default, but when using Filestore you specify a journal size. For example:
When Ceph production clusters deploy :term:`Ceph OSD Daemons`, the typical
arrangement is that one node has one OSD daemon running Filestore on one
storage device. BlueStore is now the default back end, but when using Filestore
you must specify a journal size. For example:
.. code-block:: ini
[osd]
osd_journal_size = 10000
[osd]
osd_journal_size = 10000
[osd.0]
host = {hostname} #manual deployments only.
[osd.0]
host = {hostname} #manual deployments only.
By default, Ceph expects to store a Ceph OSD Daemon's data at the
following path::
By default, Ceph expects to store a Ceph OSD Daemon's data on the following
path::
/var/lib/ceph/osd/$cluster-$id
/var/lib/ceph/osd/$cluster-$id
You or a deployment tool (e.g., ``cephadm``) must create the corresponding
directory. With metavariables fully expressed and a cluster named "ceph", this
example would evaluate to::
You or a deployment tool (for example, ``cephadm``) must create the
corresponding directory. With metavariables fully expressed and a cluster named
"ceph", the path specified in the above example evaluates to::
/var/lib/ceph/osd/ceph-0
/var/lib/ceph/osd/ceph-0
You may override this path using the ``osd_data`` setting. We recommend not
changing the default location. Create the default directory on your OSD host.
You can override this path using the ``osd_data`` setting. We recommend that
you do not change the default location. To create the default directory on your
OSD host, run the following commands:
.. prompt:: bash $
ssh {osd-host}
sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
ssh {osd-host}
sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
The ``osd_data`` path ideally leads to a mount point with a device that is
separate from the device that contains the operating system and
daemons. If an OSD is to use a device other than the OS device, prepare it for
use with Ceph, and mount it to the directory you just created
The ``osd_data`` path ought to lead to a mount point backed by a device that is
distinct from the device that contains the operating system and the daemons. To
use such a device, prepare it for use with Ceph and mount it on
the directory you just created by running the following commands:
.. prompt:: bash $
ssh {new-osd-host}
sudo mkfs -t {fstype} /dev/{disk}
sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number}
ssh {new-osd-host}
sudo mkfs -t {fstype} /dev/{disk}
sudo mount -o user_xattr /dev/{disk} /var/lib/ceph/osd/ceph-{osd-number}
We recommend using the ``xfs`` file system when running
:command:`mkfs`. (``btrfs`` and ``ext4`` are not recommended and are no
longer tested.)
We recommend using the ``xfs`` file system when running :command:`mkfs`. (The
``btrfs`` and ``ext4`` file systems are not recommended and are no longer
tested.)
See the `OSD Config Reference`_ for additional configuration details.
For additional configuration details, see `OSD Config Reference`_.
Heartbeats
==========
During runtime operations, Ceph OSD Daemons check up on other Ceph OSD Daemons
and report their findings to the Ceph Monitor. You do not have to provide any
settings. However, if you have network latency issues, you may wish to modify
the settings.
and report their findings to the Ceph Monitor. This process does not require
you to provide any settings. However, if you have network latency issues, you
might want to modify the default settings.
See `Configuring Monitor/OSD Interaction`_ for additional details.
For additional details, see `Configuring Monitor/OSD Interaction`_.
.. _ceph-logging-and-debugging:
@ -166,9 +172,9 @@ See `Configuring Monitor/OSD Interaction`_ for additional details.
Logs / Debugging
================
Sometimes you may encounter issues with Ceph that require
modifying logging output and using Ceph's debugging. See `Debugging and
Logging`_ for details on log rotation.
You might sometimes encounter issues with Ceph that require you to use Ceph's
logging and debugging features. For details on log rotation, see `Debugging and
Logging`_.
.. _Debugging and Logging: ../../troubleshooting/log-and-debug
@ -186,32 +192,29 @@ Example ceph.conf
Naming Clusters (deprecated)
============================
Each Ceph cluster has an internal name that is used as part of configuration
and log file names as well as directory and mountpoint names. This name
defaults to "ceph". Previous releases of Ceph allowed one to specify a custom
name instead, for example "ceph2". This was intended to facilitate running
multiple logical clusters on the same physical hardware, but in practice this
was rarely exploited and should no longer be attempted. Prior documentation
could also be misinterpreted as requiring unique cluster names in order to
use ``rbd-mirror``.
Each Ceph cluster has an internal name. This internal name is used as part of
the names of configuration files, log files, directories, and mount points.
This name defaults to "ceph". Previous
releases of Ceph allowed one to specify a custom name instead, for example
"ceph2". This option was intended to facilitate the running of multiple logical
clusters on the same physical hardware, but in practice it was rarely
exploited. Custom cluster names should no longer be attempted. Old
documentation might lead readers to wrongly think that unique cluster names are
required to use ``rbd-mirror``. They are not required.
Custom cluster names are now considered deprecated and the ability to deploy
them has already been removed from some tools, though existing custom name
deployments continue to operate. The ability to run and manage clusters with
custom names may be progressively removed by future Ceph releases, so it is
strongly recommended to deploy all new clusters with the default name "ceph".
them has already been removed from some tools, although existing custom-name
deployments continue to operate. The ability to run and manage clusters with
custom names might be progressively removed by future Ceph releases, so **it is
strongly recommended to deploy all new clusters with the default name "ceph"**.
Some Ceph CLI commands accept an optional ``--cluster`` (cluster name) option. This
option is present purely for backward compatibility and need not be accommodated
by new tools and deployments.
Some Ceph CLI commands accept a ``--cluster`` (cluster name) option. This
option is present only for the sake of backward compatibility. New tools and
deployments cannot be relied upon to accommodate this option.
If you do need to allow multiple clusters to exist on the same host, please use
If you need to allow multiple clusters to exist on the same host, use
:ref:`cephadm`, which uses containers to fully isolate each cluster.
.. _Hardware Recommendations: ../../../start/hardware-recommendations
.. _Network Configuration Reference: ../network-config-ref
.. _OSD Config Reference: ../osd-config-ref

View File

@ -2,8 +2,14 @@
Filestore Config Reference
============================
The Filestore back end is no longer the default when creating new OSDs,
though Filestore OSDs are still supported.
.. note:: Since the Luminous release of Ceph, BlueStore, rather than Filestore,
has been Ceph's default storage back end. However, Filestore OSDs are still
supported. See :ref:`OSD Back Ends
<rados_config_storage_devices_osd_backends>`. See :ref:`BlueStore Migration
<rados_operations_bluestore_migration>` for instructions explaining how to
replace an existing Filestore back end with a BlueStore back end.
``filestore debug omap check``
@ -18,26 +24,31 @@ though Filestore OSDs are still supported.
Extended Attributes
===================
Extended Attributes (XATTRs) are important for Filestore OSDs.
Some file systems have limits on the number of bytes that can be stored in XATTRs.
Additionally, in some cases, the file system may not be as fast as an alternative
method of storing XATTRs. The following settings may help improve performance
by using a method of storing XATTRs that is extrinsic to the underlying file system.
Extended Attributes (XATTRs) are important for Filestore OSDs. However, certain
disadvantages can occur when the underlying file system is used for the storage
of XATTRs: some file systems have limits on the number of bytes that can be
stored in XATTRs, and your file system might in some cases therefore run slower
than would an alternative method of storing XATTRs. For this reason, a method
of storing XATTRs extrinsic to the underlying file system might improve
performance. To implement such an extrinsic method, refer to the following
settings.
Ceph XATTRs are stored as ``inline xattr``, using the XATTRs provided
by the underlying file system, if it does not impose a size limit. If
there is a size limit (4KB total on ext4, for instance), some Ceph
XATTRs will be stored in a key/value database when either the
If the underlying file system has no size limit, then Ceph XATTRs are stored as
``inline xattr``, using the XATTRs provided by the file system. But if there is
a size limit (for example, ext4 imposes a limit of 4 KB total), then some Ceph
XATTRs will be stored in a key/value database when the limit is reached. More
precisely, this begins to occur when either the
``filestore_max_inline_xattr_size`` or ``filestore_max_inline_xattrs``
threshold is reached.
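For example, a hypothetical ``ceph.conf`` fragment that forces Ceph to spill
larger or more numerous XATTRs into the key/value database might look like this
(the values are illustrative only):

.. code-block:: ini

   [osd]
   # Store at most 2 KB of XATTR data per object inline in the file system.
   filestore_max_inline_xattr_size = 2048
   # Store at most 6 XATTRs per object inline in the file system.
   filestore_max_inline_xattrs = 6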
``filestore_max_inline_xattr_size``
:Description: The maximum size of an XATTR stored in the file system (i.e., XFS,
Btrfs, EXT4, etc.) per object. Should not be larger than the
file system can handle. Default value of 0 means to use the value
specific to the underlying file system.
:Description: Defines the maximum size per object of an XATTR that can be
stored in the file system (for example, XFS, Btrfs, ext4). The
specified size should not be larger than the file system can
handle. Using the default value of 0 instructs Filestore to use
the value specific to the file system.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``0``
@ -45,8 +56,9 @@ threshold is reached.
``filestore_max_inline_xattr_size_xfs``
:Description: The maximum size of an XATTR stored in the XFS file system.
Only used if ``filestore_max_inline_xattr_size`` == 0.
:Description: Defines the maximum size of an XATTR that can be stored in the
XFS file system. This setting is used only if
``filestore_max_inline_xattr_size`` == 0.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``65536``
@ -54,8 +66,9 @@ threshold is reached.
``filestore_max_inline_xattr_size_btrfs``
:Description: The maximum size of an XATTR stored in the Btrfs file system.
Only used if ``filestore_max_inline_xattr_size`` == 0.
:Description: Defines the maximum size of an XATTR that can be stored in the
Btrfs file system. This setting is used only if
``filestore_max_inline_xattr_size`` == 0.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``2048``
@ -63,8 +76,8 @@ threshold is reached.
``filestore_max_inline_xattr_size_other``
:Description: The maximum size of an XATTR stored in other file systems.
Only used if ``filestore_max_inline_xattr_size`` == 0.
:Description: Defines the maximum size of an XATTR that can be stored in other file systems.
This setting is used only if ``filestore_max_inline_xattr_size`` == 0.
:Type: Unsigned 32-bit Integer
:Required: No
:Default: ``512``
@ -72,9 +85,8 @@ threshold is reached.
``filestore_max_inline_xattrs``
:Description: The maximum number of XATTRs stored in the file system per object.
Default value of 0 means to use the value specific to the
underlying file system.
:Description: Defines the maximum number of XATTRs per object that can be stored in the file system.
Using the default value of 0 instructs Filestore to use the value specific to the file system.
:Type: 32-bit Integer
:Required: No
:Default: ``0``
@ -82,8 +94,8 @@ threshold is reached.
``filestore_max_inline_xattrs_xfs``
:Description: The maximum number of XATTRs stored in the XFS file system per object.
Only used if ``filestore_max_inline_xattrs`` == 0.
:Description: Defines the maximum number of XATTRs per object that can be stored in the XFS file system.
This setting is used only if ``filestore_max_inline_xattrs`` == 0.
:Type: 32-bit Integer
:Required: No
:Default: ``10``
@ -91,8 +103,8 @@ threshold is reached.
``filestore_max_inline_xattrs_btrfs``
:Description: The maximum number of XATTRs stored in the Btrfs file system per object.
Only used if ``filestore_max_inline_xattrs`` == 0.
:Description: Defines the maximum number of XATTRs per object that can be stored in the Btrfs file system.
This setting is used only if ``filestore_max_inline_xattrs`` == 0.
:Type: 32-bit Integer
:Required: No
:Default: ``10``
@ -100,8 +112,8 @@ threshold is reached.
``filestore_max_inline_xattrs_other``
:Description: The maximum number of XATTRs stored in other file systems per object.
Only used if ``filestore_max_inline_xattrs`` == 0.
:Description: Defines the maximum number of XATTRs per object that can be stored in other file systems.
This setting is used only if ``filestore_max_inline_xattrs`` == 0.
:Type: 32-bit Integer
:Required: No
:Default: ``2``
@ -111,18 +123,19 @@ threshold is reached.
Synchronization Intervals
=========================
Filestore needs to periodically quiesce writes and synchronize the
file system, which creates a consistent commit point. It can then free journal
entries up to the commit point. Synchronizing more frequently tends to reduce
the time required to perform synchronization, and reduces the amount of data
that needs to remain in the journal. Less frequent synchronization allows the
backing file system to coalesce small writes and metadata updates more
optimally, potentially resulting in more efficient synchronization at the
expense of potentially increasing tail latency.
Filestore must periodically quiesce writes and synchronize the file system.
Each synchronization creates a consistent commit point. When the commit point
is created, Filestore is able to free all journal entries up to that point.
More-frequent synchronization tends to reduce both synchronization time and
the amount of data that needs to remain in the journal. Less-frequent
synchronization allows the backing file system to coalesce small writes and
metadata updates, potentially increasing synchronization
efficiency but also potentially increasing tail latency.
``filestore_max_sync_interval``
:Description: The maximum interval in seconds for synchronizing Filestore.
:Description: Defines the maximum interval (in seconds) for synchronizing Filestore.
:Type: Double
:Required: No
:Default: ``5``
@ -130,7 +143,7 @@ expense of potentially increasing tail latency.
``filestore_min_sync_interval``
:Description: The minimum interval in seconds for synchronizing Filestore.
:Description: Defines the minimum interval (in seconds) for synchronizing Filestore.
:Type: Double
:Required: No
:Default: ``.01``
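For example, to make Filestore commit more frequently than the default, both
intervals could be lowered in ``ceph.conf``; the values below are purely
illustrative, not recommendations:

.. code-block:: ini

   [osd]
   # Never wait longer than 2 seconds between syncs (default: 5).
   filestore_max_sync_interval = 2
   # Never sync more often than every 0.01 seconds (default: .01).
   filestore_min_sync_interval = 0.01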
@ -142,14 +155,14 @@ Flusher
=======
The Filestore flusher forces data from large writes to be written out using
``sync_file_range`` before the sync in order to (hopefully) reduce the cost of
the eventual sync. In practice, disabling 'filestore_flusher' seems to improve
performance in some cases.
``sync_file_range`` prior to the synchronization.
Ideally, this action reduces the cost of the eventual synchronization. In practice, however, disabling
``filestore_flusher`` seems in some cases to improve performance.
``filestore_flusher``
:Description: Enables the filestore flusher.
:Description: Enables the Filestore flusher.
:Type: Boolean
:Required: No
:Default: ``false``
@ -158,7 +171,7 @@ performance in some cases.
``filestore_flusher_max_fds``
:Description: Sets the maximum number of file descriptors for the flusher.
:Description: Defines the maximum number of file descriptors for the flusher.
:Type: Integer
:Required: No
:Default: ``512``
@ -176,7 +189,7 @@ performance in some cases.
``filestore_fsync_flushes_journal_data``
:Description: Flush journal data during file system synchronization.
:Description: Flushes journal data during file-system synchronization.
:Type: Boolean
:Required: No
:Default: ``false``
@ -187,11 +200,11 @@ performance in some cases.
Queue
=====
The following settings provide limits on the size of the Filestore queue.
The following settings define limits on the size of the Filestore queue:
``filestore_queue_max_ops``
:Description: Defines the maximum number of in progress operations the file store accepts before blocking on queuing new operations.
:Description: Defines the maximum number of in-progress operations that Filestore accepts before it blocks the queueing of any new operations.
:Type: Integer
:Required: No. Minimal impact on performance.
:Default: ``50``
@ -199,23 +212,20 @@ The following settings provide limits on the size of the Filestore queue.
``filestore_queue_max_bytes``
:Description: The maximum number of bytes for an operation.
:Description: Defines the maximum number of bytes permitted per operation.
:Type: Integer
:Required: No
:Default: ``100 << 20``
.. index:: filestore; timeouts
Timeouts
========
``filestore_op_threads``
:Description: The number of file system operation threads that execute in parallel.
:Description: Defines the number of file-system operation threads that execute in parallel.
:Type: Integer
:Required: No
:Default: ``2``
@ -223,7 +233,7 @@ Timeouts
``filestore_op_thread_timeout``
:Description: The timeout for a file system operation thread (in seconds).
:Description: Defines the timeout (in seconds) for a file-system operation thread.
:Type: Integer
:Required: No
:Default: ``60``
@ -231,7 +241,7 @@ Timeouts
``filestore_op_thread_suicide_timeout``
:Description: The timeout for a commit operation before cancelling the commit (in seconds).
:Description: Defines the timeout (in seconds) for a commit operation before the commit is cancelled.
:Type: Integer
:Required: No
:Default: ``180``
@ -245,17 +255,17 @@ B-Tree Filesystem
``filestore_btrfs_snap``
:Description: Enable snapshots for a ``btrfs`` filestore.
:Description: Enables snapshots for a ``btrfs`` Filestore.
:Type: Boolean
:Required: No. Only used for ``btrfs``.
:Required: No. Used only for ``btrfs``.
:Default: ``true``
``filestore_btrfs_clone_range``
:Description: Enable cloning ranges for a ``btrfs`` filestore.
:Description: Enables cloning ranges for a ``btrfs`` Filestore.
:Type: Boolean
:Required: No. Only used for ``btrfs``.
:Required: No. Used only for ``btrfs``.
:Default: ``true``
@ -267,7 +277,7 @@ Journal
``filestore_journal_parallel``
:Description: Enables parallel journaling, default for Btrfs.
:Description: Enables parallel journaling, default for ``btrfs``.
:Type: Boolean
:Required: No
:Default: ``false``
@ -275,7 +285,7 @@ Journal
``filestore_journal_writeahead``
:Description: Enables writeahead journaling, default for XFS.
:Description: Enables write-ahead journaling, default for XFS.
:Type: Boolean
:Required: No
:Default: ``false``
@ -283,7 +293,7 @@ Journal
``filestore_journal_trailing``
:Description: Deprecated, never use.
:Description: Deprecated. **Never use.**
:Type: Boolean
:Required: No
:Default: ``false``
@ -295,8 +305,8 @@ Misc
``filestore_merge_threshold``
:Description: Min number of files in a subdir before merging into parent
NOTE: A negative value means to disable subdir merging
:Description: Defines the minimum number of files permitted in a subdirectory before the subdirectory is merged into its parent directory.
NOTE: A negative value means that subdirectory merging is disabled.
:Type: Integer
:Required: No
:Default: ``-10``
@ -305,8 +315,8 @@ Misc
``filestore_split_multiple``
:Description: ``(filestore_split_multiple * abs(filestore_merge_threshold) + (rand() % filestore_split_rand_factor)) * 16``
is the maximum number of files in a subdirectory before
splitting into child directories.
is the maximum number of files permitted in a subdirectory
before the subdirectory is split into child directories.
:Type: Integer
:Required: No
@ -316,10 +326,10 @@ Misc
``filestore_split_rand_factor``
:Description: A random factor added to the split threshold to avoid
too many (expensive) Filestore splits occurring at once. See
``filestore_split_multiple`` for details.
This can only be changed offline for an existing OSD,
via the ``ceph-objectstore-tool apply-layout-settings`` command.
too many (expensive) Filestore splits occurring at the same time.
For details, see ``filestore_split_multiple``.
To change this setting for an existing OSD, it is necessary to take the OSD
offline before running the ``ceph-objectstore-tool apply-layout-settings`` command.
:Type: Unsigned 32-bit Integer
:Required: No
@ -328,7 +338,7 @@ Misc
``filestore_update_to``
:Description: Limits Filestore auto upgrade to specified version.
:Description: Limits automatic upgrades to a specified version of Filestore. Useful in cases in which you want to avoid upgrading to a specific version.
:Type: Integer
:Required: No
:Default: ``1000``
@ -336,7 +346,7 @@ Misc
``filestore_blackhole``
:Description: Drop any new transactions on the floor.
:Description: Drops any new transactions on the floor, similar to redirecting to NULL.
:Type: Boolean
:Required: No
:Default: ``false``
@ -344,7 +354,7 @@ Misc
``filestore_dump_file``
:Description: File onto which store transaction dumps.
:Description: Defines the file to which transaction dumps are written.
:Type: Boolean
:Required: No
:Default: ``false``
@ -352,7 +362,7 @@ Misc
``filestore_kill_at``
:Description: inject a failure at the n'th opportunity
:Description: Injects a failure at the *n*\th opportunity.
:Type: String
:Required: No
:Default: ``false``
@ -360,8 +370,7 @@ Misc
``filestore_fail_eio``
:Description: Fail/Crash on eio.
:Description: Fail/Crash on EIO.
:Type: Boolean
:Required: No
:Default: ``true``

View File

@ -21,6 +21,9 @@ the QoS related parameters:
* total capacity (IOPS) of each OSD (determined automatically -
See `OSD Capacity Determination (Automated)`_)
* the max sequential bandwidth capacity (MiB/s) of each OSD -
See *osd_mclock_max_sequential_bandwidth_[hdd|ssd]* option
* an mclock profile type to enable
Using the settings in the specified profile, an OSD determines and applies the
@ -39,15 +42,15 @@ Each service can be considered as a type of client from mclock's perspective.
Depending on the type of requests handled, mclock clients are classified into
the buckets shown in the table below:
+------------------------+----------------------------------------------------+
| Client Type | Request Types |
+========================+====================================================+
| Client | I/O requests issued by external clients of Ceph |
+------------------------+----------------------------------------------------+
| Background recovery | Internal recovery/backfill requests |
+------------------------+----------------------------------------------------+
| Background best-effort | Internal scrub, snap trim and PG deletion requests |
+------------------------+----------------------------------------------------+
+------------------------+--------------------------------------------------------------+
| Client Type | Request Types |
+========================+==============================================================+
| Client | I/O requests issued by external clients of Ceph |
+------------------------+--------------------------------------------------------------+
| Background recovery | Internal recovery requests |
+------------------------+--------------------------------------------------------------+
| Background best-effort | Internal backfill, scrub, snap trim and PG deletion requests |
+------------------------+--------------------------------------------------------------+
The mclock profiles allocate parameters like reservation, weight and limit
(see :ref:`dmclock-qos`) differently for each client type. The next sections
@ -85,32 +88,54 @@ Built-in Profiles
-----------------
Users can choose between the following built-in profile types:
.. note:: The values mentioned in the tables below represent the percentage
.. note:: The values mentioned in the tables below represent the proportion
of the total IOPS capacity of the OSD allocated for the service type.
By default, the *high_client_ops* profile is enabled to ensure that a larger
chunk of the bandwidth allocation goes to client ops. Background recovery ops
are given lower allocation (and therefore take a longer time to complete). But
there might be instances that necessitate giving higher allocations to either
client ops or recovery ops. In order to deal with such a situation, the
alternate built-in profiles may be enabled by following the steps mentioned
in next sections.
* balanced (default)
* high_client_ops
* high_recovery_ops
high_client_ops (*default*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
This profile optimizes client performance over background activities by
allocating more reservation and limit to client operations as compared to
background operations in the OSD. This profile is enabled by default. The table
shows the resource control parameters set by the profile:
balanced (*default*)
^^^^^^^^^^^^^^^^^^^^
The *balanced* profile is the default mClock profile. This profile allocates
equal reservation/priority to client operations and background recovery
operations. Background best-effort ops are given lower reservation and therefore
take a longer time to complete when there are competing operations. This profile
helps meet the normal/steady-state requirements of the cluster. This is the
case when the external client performance requirement is not critical and there are
other background operations that still need attention within the OSD.
But there might be instances that necessitate giving higher allocations to either
client ops or recovery ops. In order to deal with such a situation, the alternate
built-in profiles may be enabled by following the steps mentioned in the next sections.
+------------------------+-------------+--------+-------+
| Service Type | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client | 50% | 2 | MAX |
| client | 50% | 1 | MAX |
+------------------------+-------------+--------+-------+
| background recovery | 25% | 1 | 100% |
| background recovery | 50% | 1 | MAX |
+------------------------+-------------+--------+-------+
| background best-effort | 25% | 2 | MAX |
| background best-effort | MIN | 1 | 90% |
+------------------------+-------------+--------+-------+
high_client_ops
^^^^^^^^^^^^^^^
This profile optimizes client performance over background activities by
allocating more reservation and limit to client operations as compared to
background operations in the OSD. This profile, for example, may be enabled
to provide the needed performance for I/O intensive applications for a
sustained period of time at the cost of slower recoveries. The table shows
the resource control parameters set by the profile:
+------------------------+-------------+--------+-------+
| Service Type | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client | 60% | 2 | MAX |
+------------------------+-------------+--------+-------+
| background recovery | 40% | 1 | MAX |
+------------------------+-------------+--------+-------+
| background best-effort | MIN | 1 | 70% |
+------------------------+-------------+--------+-------+
high_recovery_ops
@ -124,34 +149,16 @@ parameters set by the profile:
+------------------------+-------------+--------+-------+
| Service Type | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client | 30% | 1 | 80% |
| client | 30% | 1 | MAX |
+------------------------+-------------+--------+-------+
| background recovery | 60% | 2 | 200% |
| background recovery | 70% | 2 | MAX |
+------------------------+-------------+--------+-------+
| background best-effort | 1 (MIN) | 2 | MAX |
+------------------------+-------------+--------+-------+
balanced
^^^^^^^^
This profile allocates equal reservation to client I/O operations and background
recovery operations. This means that equal I/O resources are allocated to both
external and background recovery operations. This profile, for example, may be
enabled by an administrator when external client performance requirement is not
critical and there are other background operations that still need attention
within the OSD.
+------------------------+-------------+--------+-------+
| Service Type | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client | 40% | 1 | 100% |
+------------------------+-------------+--------+-------+
| background recovery | 40% | 1 | 150% |
+------------------------+-------------+--------+-------+
| background best-effort | 20% | 2 | MAX |
| background best-effort | MIN | 1 | MAX |
+------------------------+-------------+--------+-------+
.. note:: Across the built-in profiles, internal background best-effort clients
of mclock include "scrub", "snap trim", and "pg deletion" operations.
of mclock include "backfill", "scrub", "snap trim", and "pg deletion"
operations.
Custom Profile
@ -170,6 +177,11 @@ in order to ensure mClock scheduler is able to provide predictable QoS.
mClock Config Options
---------------------
.. important:: These defaults cannot be changed using any of the config
subsystem commands like *config set* or via the *config daemon* or *config
tell* interfaces. Although the above command(s) report success, the mclock
QoS parameters are reverted to their respective built-in profile defaults.
When a built-in profile is enabled, the mClock scheduler calculates the low
level mclock parameters [*reservation*, *weight*, *limit*] based on the profile
enabled for each client type. The mclock parameters are calculated based on
@ -188,30 +200,35 @@ config parameters cannot be modified when using any of the built-in profiles:
Recovery/Backfill Options
-------------------------
The following recovery and backfill related Ceph options are set to new defaults
for mClock:
.. warning:: The recommendation is to not change these options as the built-in
profiles are optimized based on them. Changing these defaults can result in
unexpected performance outcomes.
The following recovery and backfill related Ceph options are overridden to
mClock defaults:
- :confval:`osd_max_backfills`
- :confval:`osd_recovery_max_active`
- :confval:`osd_recovery_max_active_hdd`
- :confval:`osd_recovery_max_active_ssd`
The following table shows the new mClock defaults. This is done to maximize the
impact of the built-in profile:
The following table shows the mClock defaults, which are the same as the
current defaults. This is done to maximize the performance of the foreground
(client) operations:
+----------------------------------------+------------------+----------------+
| Config Option | Original Default | mClock Default |
+========================================+==================+================+
| :confval:`osd_max_backfills` | 1 | 10 |
| :confval:`osd_max_backfills` | 1 | 1 |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active` | 0 | 0 |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_hdd` | 3 | 10 |
| :confval:`osd_recovery_max_active_hdd` | 3 | 3 |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_ssd` | 10 | 20 |
| :confval:`osd_recovery_max_active_ssd` | 10 | 10 |
+----------------------------------------+------------------+----------------+
The above mClock defaults, can be modified if necessary by enabling
If necessary, the above mClock defaults can be modified by enabling
:confval:`osd_mclock_override_recovery_settings` (default: false). The
steps for this are discussed in the
`Steps to Modify mClock Max Backfills/Recovery Limits`_ section.
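For example, the sequence of steps for raising the backfill limit on a single
OSD might look like the following sketch (``osd.0`` and the value ``3`` are
illustrative; see the section referenced above for the authoritative
procedure):

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_override_recovery_settings true
   ceph config set osd.0 osd_max_backfills 3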
@ -246,8 +263,8 @@ all its clients.
Steps to Enable mClock Profile
==============================
As already mentioned, the default mclock profile is set to *high_client_ops*.
The other values for the built-in profiles include *balanced* and
As already mentioned, the default mclock profile is set to *balanced*.
The other values for the built-in profiles include *high_client_ops* and
*high_recovery_ops*.
If there is a requirement to change the default profile, then the option
@ -297,15 +314,17 @@ command can be used:
After switching to the *custom* profile, the desired mClock configuration
option may be modified. For example, to change the client reservation IOPS
allocation for a specific OSD (say osd.0), the following command can be used:
ratio for a specific OSD (say osd.0) to 0.5 (or 50%), the following
command can be used:
.. prompt:: bash #
ceph config set osd.0 osd_mclock_scheduler_client_res 3000
ceph config set osd.0 osd_mclock_scheduler_client_res 0.5
.. important:: Care must be taken to change the reservations of other services like
recovery and background best effort accordingly to ensure that the sum of the
reservations do not exceed the maximum IOPS capacity of the OSD.
.. important:: Care must be taken to change the reservations of other services
like recovery and background best effort accordingly to ensure that the sum
of the reservations does not exceed the maximum proportion (1.0) of the IOPS
capacity of the OSD.
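For example, continuing the sketch above in which the client reservation was
set to 0.5, the reservations of the remaining services might be adjusted so
that the three reservations sum to no more than 1.0 (the values shown are
illustrative only):

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_scheduler_background_recovery_res 0.3
   ceph config set osd.0 osd_mclock_scheduler_background_best_effort_res 0.2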
.. tip:: The reservation and limit parameter allocations are per-shard based on
the type of backing device (HDD/SSD) under the OSD. See
@ -673,12 +692,8 @@ mClock Config Options
.. confval:: osd_mclock_profile
.. confval:: osd_mclock_max_capacity_iops_hdd
.. confval:: osd_mclock_max_capacity_iops_ssd
.. confval:: osd_mclock_cost_per_io_usec
.. confval:: osd_mclock_cost_per_io_usec_hdd
.. confval:: osd_mclock_cost_per_io_usec_ssd
.. confval:: osd_mclock_cost_per_byte_usec
.. confval:: osd_mclock_cost_per_byte_usec_hdd
.. confval:: osd_mclock_cost_per_byte_usec_ssd
.. confval:: osd_mclock_max_sequential_bandwidth_hdd
.. confval:: osd_mclock_max_sequential_bandwidth_ssd
.. confval:: osd_mclock_force_run_benchmark_on_init
.. confval:: osd_mclock_skip_benchmark
.. confval:: osd_mclock_override_recovery_settings

View File

@ -16,24 +16,27 @@ consistent, but you can add, remove or replace a monitor in a cluster. See
Background
==========
Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`, which means a
:term:`Ceph Client` can determine the location of all Ceph Monitors, Ceph OSD
Daemons, and Ceph Metadata Servers just by connecting to one Ceph Monitor and
retrieving a current cluster map. Before Ceph Clients can read from or write to
Ceph OSD Daemons or Ceph Metadata Servers, they must connect to a Ceph Monitor
first. With a current copy of the cluster map and the CRUSH algorithm, a Ceph
Client can compute the location for any object. The ability to compute object
locations allows a Ceph Client to talk directly to Ceph OSD Daemons, which is a
very important aspect of Ceph's high scalability and performance. See
`Scalability and High Availability`_ for additional details.
Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`.
The primary role of the Ceph Monitor is to maintain a master copy of the cluster
map. Ceph Monitors also provide authentication and logging services. Ceph
Monitors write all changes in the monitor services to a single Paxos instance,
and Paxos writes the changes to a key/value store for strong consistency. Ceph
Monitors can query the most recent version of the cluster map during sync
operations. Ceph Monitors leverage the key/value store's snapshots and iterators
(using leveldb) to perform store-wide synchronization.
The :term:`Cluster Map` makes it possible for :term:`Ceph client`\s to
determine the location of all Ceph Monitors, Ceph OSD Daemons, and Ceph
Metadata Servers. Clients do this by connecting to one Ceph Monitor and
retrieving a current cluster map. Ceph clients must connect to a Ceph Monitor
before they can read from or write to Ceph OSD Daemons or Ceph Metadata
Servers. A Ceph client that has a current copy of the cluster map and the CRUSH
algorithm can compute the location of any RADOS object within the cluster. This
makes it possible for Ceph clients to talk directly to Ceph OSD Daemons. Direct
communication between clients and Ceph OSD Daemons improves upon traditional
storage architectures that required clients to communicate with a central
component. See `Scalability and High Availability`_ for more on this subject.
The Ceph Monitor's primary function is to maintain a master copy of the cluster
map. Monitors also provide authentication and logging services. All changes in
the monitor services are written by the Ceph Monitor to a single Paxos
instance, and Paxos writes the changes to a key/value store. This provides
strong consistency. Ceph Monitors are able to query the most recent version of
the cluster map during sync operations, and they use the key/value store's
snapshots and iterators (using RocksDB) to perform store-wide synchronization.
.. ditaa::
/-------------\ /-------------\
@ -56,12 +59,6 @@ operations. Ceph Monitors leverage the key/value store's snapshots and iterators
| cCCC |*---------------------+
\-------------/
.. deprecated:: version 0.58
In Ceph versions 0.58 and earlier, Ceph Monitors use a Paxos instance for
each service and store the map as a file.
.. index:: Ceph Monitor; cluster map
Cluster Maps
@ -541,6 +538,8 @@ Trimming requires that the placement groups are ``active+clean``.
.. index:: Ceph Monitor; clock
.. _mon-config-ref-clock:
Clock
-----

View File

@ -1,16 +1,22 @@
.. _mon-dns-lookup:
===============================
Looking up Monitors through DNS
===============================
Since version 11.0.0 RADOS supports looking up Monitors through DNS.
Since Ceph version 11.0.0 (Kraken), RADOS has supported looking up monitors
through DNS.
This way daemons and clients do not require a *mon host* configuration directive in their ceph.conf configuration file.
The addition of the ability to look up monitors through DNS means that daemons
and clients do not require a *mon host* configuration directive in their
``ceph.conf`` configuration file.
Using DNS SRV TCP records clients are able to look up the monitors.
With a DNS update, clients and daemons can be made aware of changes
in the monitor topology. To be more precise and technical, clients look up the
monitors by using ``DNS SRV TCP`` records.
This allows for less configuration on clients and monitors. Using a DNS update clients and daemons can be made aware of changes in the monitor topology.
By default clients and daemons will look for the TCP service called *ceph-mon* which is configured by the *mon_dns_srv_name* configuration directive.
By default, clients and daemons look for the TCP service called *ceph-mon*,
which is configured by the *mon_dns_srv_name* configuration directive.
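As an illustration, the SRV records for a cluster with three monitors listening
on the v2 port (3300) might look like the following in a BIND-style zone file;
the domain and host names here are hypothetical::

   _ceph-mon._tcp.example.com. 60 IN SRV 10 20 3300 mon1.example.com.
   _ceph-mon._tcp.example.com. 60 IN SRV 10 20 3300 mon2.example.com.
   _ceph-mon._tcp.example.com. 60 IN SRV 10 20 3300 mon3.example.com.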
.. confval:: mon_dns_srv_name

View File

@ -91,9 +91,8 @@ Similarly, two options control whether IPv4 and IPv6 addresses are used:
to an IPv6 address
.. note:: The ability to bind to multiple ports has paved the way for
dual-stack IPv4 and IPv6 support. That said, dual-stack support is
not yet tested as of Nautilus v14.2.0 and likely needs some
additional code changes to work correctly.
dual-stack IPv4 and IPv6 support. That said, dual-stack operation is
not yet supported as of Quincy v17.2.0.
Connection modes
----------------

View File

@ -140,6 +140,8 @@ See `Pool & PG Config Reference`_ for details.
.. index:: OSD; scrubbing
.. _rados_config_scrubbing:
Scrubbing
=========

View File

@ -1,3 +1,5 @@
.. _rados_config_pool_pg_crush_ref:
======================================
Pool, PG and CRUSH Config Reference
======================================

View File

@ -25,6 +25,7 @@ There are several Ceph daemons in a storage cluster:
additional monitoring and providing interfaces to external
monitoring and management systems.
.. _rados_config_storage_devices_osd_backends:
OSD Back Ends
=============

View File

@ -4,74 +4,70 @@
Adding/Removing Monitors
==========================
When you have a cluster up and running, you may add or remove monitors
from the cluster at runtime. To bootstrap a monitor, see `Manual Deployment`_
or `Monitor Bootstrap`_.
It is possible to add monitors to a running cluster as long as redundancy is
maintained. To bootstrap a monitor, see `Manual Deployment`_ or `Monitor
Bootstrap`_.
.. _adding-monitors:
Adding Monitors
===============
Ceph monitors are lightweight processes that are the single source of truth
for the cluster map. You can run a cluster with 1 monitor but we recommend at least 3
for a production cluster. Ceph monitors use a variation of the
`Paxos`_ algorithm to establish consensus about maps and other critical
information across the cluster. Due to the nature of Paxos, Ceph requires
a majority of monitors to be active to establish a quorum (thus establishing
consensus).
Ceph monitors serve as the single source of truth for the cluster map. It is
possible to run a cluster with only one monitor, but for a production cluster
it is recommended to have at least three monitors provisioned and in quorum.
Ceph monitors use a variation of the `Paxos`_ algorithm to maintain consensus
about maps and about other critical information across the cluster. Due to the
nature of Paxos, Ceph is able to maintain quorum (and thus establish
consensus) only if a majority of the monitors are ``active``.
It is advisable to run an odd number of monitors. An
odd number of monitors is more resilient than an
even number. For instance, with a two monitor deployment, no
failures can be tolerated and still maintain a quorum; with three monitors,
one failure can be tolerated; in a four monitor deployment, one failure can
be tolerated; with five monitors, two failures can be tolerated. This avoids
the dreaded *split brain* phenomenon, and is why an odd number is best.
In short, Ceph needs a majority of
monitors to be active (and able to communicate with each other), but that
majority can be achieved using a single monitor, or 2 out of 2 monitors,
2 out of 3, 3 out of 4, etc.
It is best to run an odd number of monitors. This is because a cluster that is
running an odd number of monitors is more resilient than a cluster running an
even number. For example, in a two-monitor deployment, no failures can be
tolerated if quorum is to be maintained; in a three-monitor deployment, one
failure can be tolerated; in a four-monitor deployment, one failure can be
tolerated; and in a five-monitor deployment, two failures can be tolerated. In
general, a cluster running an odd number of monitors is best because it avoids
what is called the *split brain* phenomenon. In short, Ceph is able to operate
only if a majority of monitors are ``active`` and able to communicate with each
other (for example, there must be a single monitor, two out of two monitors,
two out of three monitors, three out of five monitors, or the like).
For small or non-critical deployments of multi-node Ceph clusters, it is
advisable to deploy three monitors, and to increase the number of monitors
to five for larger clusters or to survive a double failure. There is rarely
justification for seven or more.
recommended to deploy three monitors. For larger clusters or for clusters that
are intended to survive a double failure, it is recommended to deploy five
monitors. Only in rare circumstances is there any justification for deploying
seven or more monitors.
Since monitors are lightweight, it is possible to run them on the same
host as OSDs; however, we recommend running them on separate hosts,
because `fsync` issues with the kernel may impair performance.
Dedicated monitor nodes also minimize disruption since monitor and OSD
daemons are not inactive at the same time when a node crashes or is
taken down for maintenance.
Dedicated
monitor nodes also make for cleaner maintenance by avoiding both OSDs and
a mon going down if a node is rebooted, taken down, or crashes.
It is possible to run a monitor on the same host that is running an OSD.
However, this approach has disadvantages: for example, `fsync` issues with the
kernel might weaken performance, and monitor and OSD daemons might be inactive
at the same time and cause disruption if the node crashes, is rebooted, or is
taken down for maintenance. Because of these risks, it is instead
recommended to run monitors and managers on dedicated hosts.
.. note:: A *majority* of monitors in your cluster must be able to
reach each other in order to establish a quorum.
reach each other in order for quorum to be established.
Deploy your Hardware
--------------------
Deploying your Hardware
-----------------------
If you are adding a new host when adding a new monitor, see `Hardware
Recommendations`_ for details on minimum recommendations for monitor hardware.
To add a monitor host to your cluster, first make sure you have an up-to-date
version of Linux installed (typically Ubuntu 16.04 or RHEL 7).
Some operators choose to add a new monitor host at the same time that they add
a new monitor. For details on the minimum recommendations for monitor hardware,
see `Hardware Recommendations`_. Before adding a monitor host to the cluster,
make sure that there is an up-to-date version of Linux installed.
Add your monitor host to a rack in your cluster, connect it to the network
and ensure that it has network connectivity.
Add the newly installed monitor host to a rack in your cluster, connect the
host to the network, and make sure that the host has network connectivity.
.. _Hardware Recommendations: ../../../start/hardware-recommendations
Install the Required Software
-----------------------------
Installing the Required Software
--------------------------------
For manually deployed clusters, you must install Ceph packages
manually. See `Installing Packages`_ for details.
You should configure SSH to a user with password-less authentication
and root permissions.
In manually deployed clusters, it is necessary to install Ceph packages
manually. For details, see `Installing Packages`_. Configure SSH so that it can
be used by a user that has passwordless authentication and root permissions.
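A minimal sketch of such an SSH setup follows (the user ``cephuser`` and the
host ``new-mon-host`` are hypothetical placeholders; adapt them to your
environment):
.. prompt:: bash $
ssh-keygen -t ed25519                # generate a key pair if one does not already exist
ssh-copy-id cephuser@new-mon-host    # install the public key on the new host
ssh cephuser@new-mon-host sudo true  # confirm passwordless SSH and sudo access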
.. _Installing Packages: ../../../install/install-storage-cluster
@ -81,67 +77,65 @@ and root permissions.
Adding a Monitor (Manual)
-------------------------
This procedure creates a ``ceph-mon`` data directory, retrieves the monitor map
and monitor keyring, and adds a ``ceph-mon`` daemon to your cluster. If
this results in only two monitor daemons, you may add more monitors by
repeating this procedure until you have a sufficient number of ``ceph-mon``
daemons to achieve a quorum.
The procedure in this section creates a ``ceph-mon`` data directory, retrieves
both the monitor map and the monitor keyring, and adds a ``ceph-mon`` daemon to
the cluster. The procedure might result in a Ceph cluster that contains only
two monitor daemons. To add more monitors until there are enough ``ceph-mon``
daemons to establish quorum, repeat the procedure.
At this point you should define your monitor's id. Traditionally, monitors
have been named with single letters (``a``, ``b``, ``c``, ...), but you are
free to define the id as you see fit. For the purpose of this document,
please take into account that ``{mon-id}`` should be the id you chose,
without the ``mon.`` prefix (i.e., ``{mon-id}`` should be the ``a``
on ``mon.a``).
This is a good point at which to define the new monitor's ``id``. Monitors have
often been named with single letters (``a``, ``b``, ``c``, etc.), but you are
free to define the ``id`` however you see fit. In this document, ``{mon-id}``
refers to the ``id`` exclusive of the ``mon.`` prefix: for example, if
``mon.a`` has been chosen as the ``id`` of a monitor, then ``{mon-id}`` is
``a``.
#. Create the default directory on the machine that will host your
new monitor:
#. Create a data directory on the machine that will host the new monitor:
.. prompt:: bash $
ssh {new-mon-host}
sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}
ssh {new-mon-host}
sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}
#. Create a temporary directory ``{tmp}`` to keep the files needed during
this process. This directory should be different from the monitor's default
directory created in the previous step, and can be removed after all the
steps are executed:
#. Create a temporary directory ``{tmp}`` that will contain the files needed
during this procedure. This directory should be different from the data
directory created in the previous step. Because this is a temporary
directory, it can be removed after the procedure is complete:
.. prompt:: bash $
mkdir {tmp}
mkdir {tmp}
#. Retrieve the keyring for your monitors, where ``{tmp}`` is the path to
the retrieved keyring, and ``{key-filename}`` is the name of the file
containing the retrieved monitor key:
#. Retrieve the keyring for your monitors (``{tmp}`` is the path to the
retrieved keyring and ``{key-filename}`` is the name of the file that
contains the retrieved monitor key):
.. prompt:: bash $
ceph auth get mon. -o {tmp}/{key-filename}
#. Retrieve the monitor map, where ``{tmp}`` is the path to
the retrieved monitor map, and ``{map-filename}`` is the name of the file
containing the retrieved monitor map:
#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map
and ``{map-filename}`` is the name of the file that contains the retrieved
monitor map):
.. prompt:: bash $
ceph mon getmap -o {tmp}/{map-filename}
#. Prepare the monitor's data directory created in the first step. You must
specify the path to the monitor map so that you can retrieve the
information about a quorum of monitors and their ``fsid``. You must also
specify a path to the monitor keyring:
#. Prepare the monitor's data directory, which was created in the first step.
The following command must specify the path to the monitor map (so that
information about a quorum of monitors and their ``fsid``\s can be
retrieved) and specify the path to the monitor keyring:
.. prompt:: bash $
sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}
#. Start the new monitor and it will automatically join the cluster.
The daemon needs to know which address to bind to, via either the
``--public-addr {ip}`` or ``--public-network {network}`` argument.
#. Start the new monitor. It will automatically join the cluster. To provide
information to the daemon about which address to bind to, use either the
``--public-addr {ip}`` option or the ``--public-network {network}`` option.
For example:
.. prompt:: bash $
ceph-mon -i {mon-id} --public-addr {ip:port}
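As a concrete illustration (the monitor id ``d`` and the address below are
hypothetical):
.. prompt:: bash $
ceph-mon -i d --public-addr 192.168.0.10:6789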
@ -151,44 +145,47 @@ on ``mon.a``).
Removing Monitors
=================
When you remove monitors from a cluster, consider that Ceph monitors use
Paxos to establish consensus about the master cluster map. You must have
a sufficient number of monitors to establish a quorum for consensus about
the cluster map.
When monitors are removed from a cluster, it is important to remember
that Ceph monitors use Paxos to maintain consensus about the cluster
map. Such consensus is possible only if the number of monitors is sufficient
to establish quorum.
.. _Removing a Monitor (Manual):
Removing a Monitor (Manual)
---------------------------
This procedure removes a ``ceph-mon`` daemon from your cluster. If this
procedure results in only two monitor daemons, you may add or remove another
monitor until you have a number of ``ceph-mon`` daemons that can achieve a
quorum.
The procedure in this section removes a ``ceph-mon`` daemon from the cluster.
The procedure might result in a Ceph cluster that contains a number of monitors
insufficient to maintain quorum, so plan carefully. When replacing an old
monitor with a new monitor, add the new monitor first, wait for quorum to be
established, and then remove the old monitor. This ensures that quorum is not
lost.
#. Stop the monitor:
.. prompt:: bash $
service ceph -a stop mon.{mon-id}
#. Remove the monitor from the cluster:
.. prompt:: bash $
ceph mon remove {mon-id}
#. Remove the monitor entry from ``ceph.conf``.
#. Remove the monitor entry from the ``ceph.conf`` file.
.. _rados-mon-remove-from-unhealthy:
Removing Monitors from an Unhealthy Cluster
-------------------------------------------
This procedure removes a ``ceph-mon`` daemon from an unhealthy
cluster, for example a cluster where the monitors cannot form a
quorum.
The procedure in this section removes a ``ceph-mon`` daemon from an unhealthy
cluster (for example, a cluster whose monitors are unable to form a quorum).
#. Stop all ``ceph-mon`` daemons on all monitor hosts:
@ -197,63 +194,68 @@ quorum.
ssh {mon-host}
systemctl stop ceph-mon.target
Repeat for all monitor hosts.
Repeat this step on every monitor host.
#. Identify a surviving monitor and log in to that host:
#. Identify a surviving monitor and log in to the monitor's host:
.. prompt:: bash $
ssh {mon-host}
#. Extract a copy of the monmap file:
#. Extract a copy of the ``monmap`` file by running a command of the following
form:
.. prompt:: bash $
ceph-mon -i {mon-id} --extract-monmap {map-path}
In most cases, this command will be:
Here is a more concrete example. In this example, ``hostname`` is the
``{mon-id}`` and ``/tmp/monmap`` is the ``{map-path}``:
.. prompt:: bash $
ceph-mon -i `hostname` --extract-monmap /tmp/monmap
#. Remove the non-surviving or problematic monitors. For example, if
you have three monitors, ``mon.a``, ``mon.b``, and ``mon.c``, where
only ``mon.a`` will survive, follow the example below:
#. Remove the non-surviving or otherwise problematic monitors:
.. prompt:: bash $
monmaptool {map-path} --rm {mon-id}
For example,
For example, suppose that there are three monitors |---| ``mon.a``, ``mon.b``,
and ``mon.c`` |---| and that only ``mon.a`` will survive:
.. prompt:: bash $
monmaptool /tmp/monmap --rm b
monmaptool /tmp/monmap --rm c
#. Inject the surviving map with the removed monitors into the
surviving monitor(s). For example, to inject a map into monitor
``mon.a``, follow the example below:
#. Inject the surviving map (that is, the map from which the non-surviving
monitors have been removed) into the surviving monitor(s):
.. prompt:: bash $
ceph-mon -i {mon-id} --inject-monmap {map-path}
For example:
Continuing with the above example, inject a map into monitor ``mon.a`` by
running the following command:
.. prompt:: bash $
ceph-mon -i a --inject-monmap /tmp/monmap
#. Start only the surviving monitors.
#. Verify the monitors form a quorum (``ceph -s``).
#. Verify that the monitors form a quorum by running the command ``ceph -s``.
#. You may wish to archive the removed monitors' data directory in
``/var/lib/ceph/mon`` in a safe location, or delete it if you are
confident the remaining monitors are healthy and are sufficiently
redundant.
#. The data directory of the removed monitors is in ``/var/lib/ceph/mon``:
either archive this data directory in a safe location or delete this data
directory. However, do not delete it unless you are confident that the
remaining monitors are healthy and sufficiently redundant. Make sure that
there is enough room for the live DB to expand and compact, and make sure
that there is also room for an archived copy of the DB. The archived copy
can be compressed.
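For example, a minimal archiving sketch (the monitor id ``b`` and the
destination path below are hypothetical):
.. prompt:: bash $
sudo tar czf /root/mon-b-store.tar.gz -C /var/lib/ceph/mon ceph-b
sudo rm -rf /var/lib/ceph/mon/ceph-b    # only after the archive and cluster health have been verified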
.. _Changing a Monitor's IP address:
@ -262,185 +264,195 @@ Changing a Monitor's IP Address
.. important:: Existing monitors are not supposed to change their IP addresses.
Monitors are critical components of a Ceph cluster, and they need to maintain a
quorum for the whole system to work properly. To establish a quorum, the
monitors need to discover each other. Ceph has strict requirements for
discovering monitors.
Monitors are critical components of a Ceph cluster. The entire system can work
properly only if the monitors maintain quorum, and quorum can be established
only if the monitors have discovered each other by means of their IP addresses.
Ceph has strict requirements on the discovery of monitors.
Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors.
However, monitors discover each other using the monitor map, not ``ceph.conf``.
For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you
need to obtain the current monmap for the cluster when creating a new monitor,
as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The
following sections explain the consistency requirements for Ceph monitors, and a
few safe ways to change a monitor's IP address.
Although the ``ceph.conf`` file is used by Ceph clients and other Ceph daemons
to discover monitors, the monitor map is used by monitors to discover each
other. This is why it is necessary to obtain the current ``monmap`` at the time
a new monitor is created: as can be seen above in `Adding a Monitor (Manual)`_,
the ``monmap`` is one of the arguments required by the ``ceph-mon -i {mon-id}
--mkfs`` command. The following sections explain the consistency requirements
for Ceph monitors, and also explain a number of safe ways to change a monitor's
IP address.
Consistency Requirements
------------------------
A monitor always refers to the local copy of the monmap when discovering other
monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids
errors that could break the cluster (e.g., typos in ``ceph.conf`` when
specifying a monitor address or port). Since monitors use monmaps for discovery
and they share monmaps with clients and other Ceph daemons, the monmap provides
monitors with a strict guarantee that their consensus is valid.
When a monitor discovers other monitors in the cluster, it always refers to the
local copy of the monitor map. Using the monitor map instead of using the
``ceph.conf`` file avoids errors that could break the cluster (for example,
typos or other slight errors in ``ceph.conf`` when a monitor address or port is
specified). Because monitors use monitor maps for discovery and because they
share monitor maps with Ceph clients and other Ceph daemons, the monitor map
provides monitors with a strict guarantee that their consensus is valid.
Strict consistency also applies to updates to the monmap. As with any other
updates on the monitor, changes to the monmap always run through a distributed
consensus algorithm called `Paxos`_. The monitors must agree on each update to
the monmap, such as adding or removing a monitor, to ensure that each monitor in
the quorum has the same version of the monmap. Updates to the monmap are
the monmap, such as adding or removing a monitor, to ensure that each monitor
in the quorum has the same version of the monmap. Updates to the monmap are
incremental so that monitors have the latest agreed upon version, and a set of
previous versions, allowing a monitor that has an older version of the monmap to
catch up with the current state of the cluster.
previous versions, allowing a monitor that has an older version of the monmap
to catch up with the current state of the cluster.
If monitors discovered each other through the Ceph configuration file instead of
through the monmap, it would introduce additional risks because the Ceph
configuration files are not updated and distributed automatically. Monitors
might inadvertently use an older ``ceph.conf`` file, fail to recognize a
monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able
to determine the current state of the system accurately. Consequently, making
changes to an existing monitor's IP address must be done with great care.
There are additional advantages to using the monitor map rather than
``ceph.conf`` when monitors discover each other. Because ``ceph.conf`` is not
automatically updated and distributed, its use would bring certain risks:
monitors might use an outdated ``ceph.conf`` file, might fail to recognize a
specific monitor, might fall out of quorum, and might develop a situation in
which `Paxos`_ is unable to accurately ascertain the current state of the
system. Because of these risks, any changes to an existing monitor's IP address
must be made with great care.
.. _operations_add_or_rm_mons_changing_mon_ip:
Changing a Monitor's IP address (The Right Way)
-----------------------------------------------
Changing a Monitor's IP address (Preferred Method)
--------------------------------------------------
Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to
ensure that other monitors in the cluster will receive the update. To change a
monitor's IP address, you must add a new monitor with the IP address you want
to use (as described in `Adding a Monitor (Manual)`_), ensure that the new
monitor successfully joins the quorum; then, remove the monitor that uses the
old IP address. Then, update the ``ceph.conf`` file to ensure that clients and
other daemons know the IP address of the new monitor.
If a monitor's IP address is changed only in the ``ceph.conf`` file, there is
no guarantee that the other monitors in the cluster will receive the update.
For this reason, the preferred method to change a monitor's IP address is as
follows: add a new monitor with the desired IP address (as described in `Adding
a Monitor (Manual)`_), make sure that the new monitor successfully joins the
quorum, remove the monitor that is using the old IP address, and update the
``ceph.conf`` file to ensure that clients and other daemons are made aware of
the new monitor's IP address.
For example, lets assume there are three monitors in place, such as ::
For example, suppose that there are three monitors in place::
[mon.a]
host = host01
addr = 10.0.0.1:6789
[mon.b]
host = host02
addr = 10.0.0.2:6789
[mon.c]
host = host03
addr = 10.0.0.3:6789
[mon.a]
host = host01
addr = 10.0.0.1:6789
[mon.b]
host = host02
addr = 10.0.0.2:6789
[mon.c]
host = host03
addr = 10.0.0.3:6789
To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the
steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure
that ``mon.d`` is running before removing ``mon.c``, or it will break the
quorum. Remove ``mon.c`` as described on `Removing a Monitor (Manual)`_. Moving
all three monitors would thus require repeating this process as many times as
needed.
To change ``mon.c`` so that its name is ``host04`` and its IP address is
``10.0.0.4``: (1) follow the steps in `Adding a Monitor (Manual)`_ to add a new
monitor ``mon.d``, (2) make sure that ``mon.d`` is running before removing
``mon.c`` or else quorum will be broken, and (3) follow the steps in `Removing
a Monitor (Manual)`_ to remove ``mon.c``. To move all three monitors to new IP
addresses, repeat this process.
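A minimal sketch of the verification and removal steps (continuing the
``mon.d``/``mon.c`` example above; the steps for adding ``mon.d`` are the ones
described in `Adding a Monitor (Manual)`_):
.. prompt:: bash $
ceph quorum_status --format json-pretty   # confirm that mon.d has joined the quorum
ceph mon remove c                         # only then remove the monitor at the old address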
Changing a Monitor's IP address (Advanced Method)
-------------------------------------------------
Changing a Monitor's IP address (The Messy Way)
-----------------------------------------------
There are cases in which the method outlined in :ref:`Changing a Monitor's IP
address (Preferred Method) <operations_add_or_rm_mons_changing_mon_ip>` cannot
be used. For example, it might be necessary to move the cluster's monitors to a
different network, to a different part of the datacenter, or to a different
datacenter altogether. It is still possible to change the monitors' IP
addresses, but a different method must be used.
There may come a time when the monitors must be moved to a different network, a
different part of the datacenter or a different datacenter altogether. While it
is possible to do it, the process becomes a bit more hazardous.
For such cases, a new monitor map with updated IP addresses for every monitor
in the cluster must be generated and injected on each monitor. Although this
method is not particularly easy, such a major migration is unlikely to be a
routine task. As stated at the beginning of this section, existing monitors are
not supposed to change their IP addresses.
In such a case, the solution is to generate a new monmap with updated IP
addresses for all the monitors in the cluster, and inject the new map on each
individual monitor. This is not the most user-friendly approach, but we do not
expect this to be something that needs to be done every other week. As it is
clearly stated on the top of this section, monitors are not supposed to change
IP addresses.
Continue with the monitor configuration in the example from :ref:`Changing a
Monitor's IP address (Preferred Method)
<operations_add_or_rm_mons_changing_mon_ip>`. Suppose that all of the monitors
are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that
these networks are unable to communicate. Carry out the following procedure:
Using the previous monitor configuration as an example, assume you want to move
all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these
networks are unable to communicate. Use the following procedure:
#. Retrieve the monitor map, where ``{tmp}`` is the path to
the retrieved monitor map, and ``{filename}`` is the name of the file
containing the retrieved monitor map:
#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor
map, and ``{filename}`` is the name of the file that contains the retrieved
monitor map):
.. prompt:: bash $
ceph mon getmap -o {tmp}/{filename}
#. The following example demonstrates the contents of the monmap:
#. Check the contents of the monitor map:
.. prompt:: bash $
monmaptool --print {tmp}/{filename}
::
::
monmaptool: monmap file {tmp}/{filename}
epoch 1
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
last_changed 2012-12-17 02:46:41.591248
created 2012-12-17 02:46:41.591248
0: 10.0.0.1:6789/0 mon.a
1: 10.0.0.2:6789/0 mon.b
2: 10.0.0.3:6789/0 mon.c
monmaptool: monmap file {tmp}/{filename}
epoch 1
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
last_changed 2012-12-17 02:46:41.591248
created 2012-12-17 02:46:41.591248
0: 10.0.0.1:6789/0 mon.a
1: 10.0.0.2:6789/0 mon.b
2: 10.0.0.3:6789/0 mon.c
#. Remove the existing monitors:
#. Remove the existing monitors from the monitor map:
.. prompt:: bash $
monmaptool --rm a --rm b --rm c {tmp}/{filename}
::
monmaptool: monmap file {tmp}/{filename}
monmaptool: removing a
monmaptool: removing b
monmaptool: removing c
monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
monmaptool: monmap file {tmp}/{filename}
monmaptool: removing a
monmaptool: removing b
monmaptool: removing c
monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
#. Add the new monitor locations:
#. Add the new monitor locations to the monitor map:
.. prompt:: bash $
monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename}
::
monmaptool: monmap file {tmp}/{filename}
monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors)
#. Check new contents:
#. Check the new contents of the monitor map:
.. prompt:: bash $
monmaptool --print {tmp}/{filename}
::
monmaptool: monmap file {tmp}/{filename}
epoch 1
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
last_changed 2012-12-17 02:46:41.591248
created 2012-12-17 02:46:41.591248
0: 10.1.0.1:6789/0 mon.a
1: 10.1.0.2:6789/0 mon.b
2: 10.1.0.3:6789/0 mon.c
monmaptool: monmap file {tmp}/{filename}
epoch 1
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
last_changed 2012-12-17 02:46:41.591248
created 2012-12-17 02:46:41.591248
0: 10.1.0.1:6789/0 mon.a
1: 10.1.0.2:6789/0 mon.b
2: 10.1.0.3:6789/0 mon.c
At this point, we assume the monitors (and stores) are installed at the new
location. The next step is to propagate the modified monmap to the new
monitors, and inject the modified monmap into each new monitor.
At this point, we assume that the monitors (and stores) have been installed at
the new location. Next, propagate the modified monitor map to the new monitors,
and inject the modified monitor map into each new monitor.
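One way to propagate the modified map is to copy it to each new monitor host
before injecting it, for example (the host names below are hypothetical):
.. prompt:: bash $
scp {tmp}/{filename} new-mon-host-01:/tmp/monmap
scp {tmp}/{filename} new-mon-host-02:/tmp/monmap
scp {tmp}/{filename} new-mon-host-03:/tmp/monmap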
#. First, make sure to stop all your monitors. Injection must be done while
the daemon is not running.
#. Make sure all of your monitors have been stopped. Never inject into a
monitor while the monitor daemon is running.
#. Inject the monmap:
#. Inject the monitor map:
.. prompt:: bash $
ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename}
#. Restart the monitors.
#. Restart all of the monitors.
Migration to the new location is now complete. The monitors should operate
successfully.
After this step, migration to the new location is complete and
the monitors should operate successfully.
.. _Manual Deployment: ../../../install/manual-deployment
.. _Monitor Bootstrap: ../../../dev/mon-bootstrap
.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
.. |---| unicode:: U+2014 .. EM DASH
:trim:

View File

@ -2,49 +2,51 @@
Adding/Removing OSDs
======================
When you have a cluster up and running, you may add OSDs or remove OSDs
from the cluster at runtime.
When a cluster is up and running, it is possible to add or remove OSDs.
Adding OSDs
===========
When you want to expand a cluster, you may add an OSD at runtime. With Ceph, an
OSD is generally one Ceph ``ceph-osd`` daemon for one storage drive within a
host machine. If your host has multiple storage drives, you may map one
``ceph-osd`` daemon for each drive.
OSDs can be added to a cluster in order to expand the cluster's capacity and
resilience. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on one
storage drive within a host machine. But if your host machine has multiple
storage drives, you may map one ``ceph-osd`` daemon for each drive on the
machine.
Generally, it's a good idea to check the capacity of your cluster to see if you
are reaching the upper end of its capacity. As your cluster reaches its ``near
full`` ratio, you should add one or more OSDs to expand your cluster's capacity.
It's a good idea to check the capacity of your cluster so that you know when it
approaches its capacity limits. If your cluster has reached its ``near full``
ratio, then you should add OSDs to expand your cluster's capacity.
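For example, cluster-wide and per-OSD utilization can be checked with commands
such as the following:
.. prompt:: bash $
ceph df      # cluster-wide and per-pool utilization
ceph osd df  # per-OSD utilization, weight, and variance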
.. warning:: Do not let your cluster reach its ``full ratio`` before
adding an OSD. OSD failures that occur after the cluster reaches
its ``near full`` ratio may cause the cluster to exceed its
``full ratio``.
.. warning:: Do not add an OSD after your cluster has reached its ``full
ratio``. OSD failures that occur after the cluster reaches its ``near full
ratio`` might cause the cluster to exceed its ``full ratio``.
Deploy your Hardware
--------------------
If you are adding a new host when adding a new OSD, see `Hardware
Deploying your Hardware
-----------------------
If you are also adding a new host when adding a new OSD, see `Hardware
Recommendations`_ for details on minimum recommendations for OSD hardware. To
add an OSD host to your cluster, first make sure you have an up-to-date version
of Linux installed, and you have made some initial preparations for your
storage drives. See `Filesystem Recommendations`_ for details.
add an OSD host to your cluster, begin by making sure that an appropriate
version of Linux has been installed on the host machine and that all initial
preparations for your storage drives have been carried out. For details, see
`Filesystem Recommendations`_.
Next, add your OSD host to a rack in your cluster, connect the host to the
network, and ensure that the host has network connectivity. For details, see
`Network Configuration Reference`_.
Add your OSD host to a rack in your cluster, connect it to the network
and ensure that it has network connectivity. See the `Network Configuration
Reference`_ for details.
.. _Hardware Recommendations: ../../../start/hardware-recommendations
.. _Filesystem Recommendations: ../../configuration/filesystem-recommendations
.. _Network Configuration Reference: ../../configuration/network-config-ref
Install the Required Software
-----------------------------
Installing the Required Software
--------------------------------
For manually deployed clusters, you must install Ceph packages
manually. See `Installing Ceph (Manual)`_ for details.
You should configure SSH to a user with password-less authentication
If your cluster has been manually deployed, you will need to install Ceph
software packages manually. For details, see `Installing Ceph (Manual)`_.
Configure SSH for the appropriate user to have both passwordless authentication
and root permissions.
.. _Installing Ceph (Manual): ../../../install
@ -53,48 +55,56 @@ and root permissions.
Adding an OSD (Manual)
----------------------
This procedure sets up a ``ceph-osd`` daemon, configures it to use one drive,
and configures the cluster to distribute data to the OSD. If your host has
multiple drives, you may add an OSD for each drive by repeating this procedure.
The following procedure sets up a ``ceph-osd`` daemon, configures this OSD to
use one drive, and configures the cluster to distribute data to the OSD. If
your host machine has multiple drives, you may add an OSD for each drive on the
host by repeating this procedure.
To add an OSD, create a data directory for it, mount a drive to that directory,
add the OSD to the cluster, and then add it to the CRUSH map.
As the following procedure will demonstrate, adding an OSD involves creating a
metadata directory for it, configuring a data storage drive, adding the OSD to
the cluster, and then adding it to the CRUSH map.
When you add the OSD to the CRUSH map, consider the weight you give to the new
OSD. Hard drive capacity grows 40% per year, so newer OSD hosts may have larger
hard drives than older hosts in the cluster (i.e., they may have greater
weight).
When you add the OSD to the CRUSH map, you will need to consider the weight you
assign to the new OSD. Since storage drive capacities increase over time, newer
OSD hosts are likely to have larger hard drives than the older hosts in the
cluster have and therefore might have greater weight as well.
.. tip:: Ceph prefers uniform hardware across pools. If you are adding drives
of dissimilar size, you can adjust their weights. However, for best
performance, consider a CRUSH hierarchy with drives of the same type/size.
.. tip:: Ceph works best with uniform hardware across pools. It is possible to
add drives of dissimilar size and then adjust their weights accordingly.
However, for best performance, consider a CRUSH hierarchy that has drives of
the same type and size. It is better to add larger drives uniformly to
existing hosts. This can be done incrementally, replacing smaller drives
each time the new drives are added.
#. Create the OSD. If no UUID is given, it will be set automatically when the
OSD starts up. The following command will output the OSD number, which you
will need for subsequent steps:
#. Create the new OSD by running a command of the following form. If you opt
not to specify a UUID in this command, the UUID will be set automatically
when the OSD starts up. The OSD number, which is needed for subsequent
steps, is found in the command's output:
.. prompt:: bash $
ceph osd create [{uuid} [{id}]]
If the optional parameter {id} is given it will be used as the OSD id.
Note, in this case the command may fail if the number is already in use.
If the optional parameter {id} is specified it will be used as the OSD ID.
However, if the ID number is already in use, the command will fail.
.. warning:: In general, explicitly specifying {id} is not recommended.
IDs are allocated as an array, and skipping entries consumes some extra
memory. This can become significant if there are large gaps and/or
clusters are large. If {id} is not specified, the smallest available is
used.
.. warning:: Explicitly specifying the ``{id}`` parameter is not
recommended. IDs are allocated as an array, and any skipping of entries
consumes extra memory. This memory consumption can become significant if
there are large gaps or if clusters are large. By leaving the ``{id}``
parameter unspecified, we ensure that Ceph uses the smallest ID number
available and that these problems are avoided.
#. Create the default directory on your new OSD:
#. Create the default directory for your new OSD by running commands of the
following form:
.. prompt:: bash $
ssh {new-osd-host}
sudo mkdir /var/lib/ceph/osd/ceph-{osd-number}
#. If the OSD is for a drive other than the OS drive, prepare it
for use with Ceph, and mount it to the directory you just created:
#. If the OSD will be created on a drive other than the OS drive, prepare it
for use with Ceph. Run commands of the following form:
.. prompt:: bash $
@ -102,41 +112,49 @@ weight).
sudo mkfs -t {fstype} /dev/{drive}
sudo mount -o user_xattr /dev/{hdd} /var/lib/ceph/osd/ceph-{osd-number}
#. Initialize the OSD data directory:
#. Initialize the OSD data directory by running commands of the following form:
.. prompt:: bash $
ssh {new-osd-host}
ceph-osd -i {osd-num} --mkfs --mkkey
The directory must be empty before you can run ``ceph-osd``.
Make sure that the directory is empty before running ``ceph-osd``.
#. Register the OSD authentication key. The value of ``ceph`` for
``ceph-{osd-num}`` in the path is the ``$cluster-$id``. If your
cluster name differs from ``ceph``, use your cluster name instead:
#. Register the OSD authentication key by running a command of the following
form:
.. prompt:: bash $
ceph auth add osd.{osd-num} osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-{osd-num}/keyring
#. Add the OSD to the CRUSH map so that the OSD can begin receiving data. The
``ceph osd crush add`` command allows you to add OSDs to the CRUSH hierarchy
wherever you wish. If you specify at least one bucket, the command
will place the OSD into the most specific bucket you specify, *and* it will
move that bucket underneath any other buckets you specify. **Important:** If
you specify only the root bucket, the command will attach the OSD directly
to the root, but CRUSH rules expect OSDs to be inside of hosts.
This presentation of the command has ``ceph-{osd-num}`` in the listed path
because many clusters have the name ``ceph``. However, if your cluster name
is not ``ceph``, then the string ``ceph`` in ``ceph-{osd-num}`` needs to be
replaced with your cluster name. For example, if your cluster name is
``cluster1``, then the path in the command should be
``/var/lib/ceph/osd/cluster1-{osd-num}/keyring``.
Execute the following:
#. Add the OSD to the CRUSH map by running the following command. This allows
the OSD to begin receiving data. The ``ceph osd crush add`` command can add
OSDs to the CRUSH hierarchy wherever you want. If you specify one or more
buckets, the command places the OSD in the most specific of those buckets,
and it moves that bucket underneath any other buckets that you have
specified. **Important:** If you specify only the root bucket, the command
will attach the OSD directly to the root, but CRUSH rules expect OSDs to be
inside of hosts. If the OSDs are not inside hosts, the OSDs will likely not
receive any data.
.. prompt:: bash $
ceph osd crush add {id-or-name} {weight} [{bucket-type}={bucket-name} ...]
You may also decompile the CRUSH map, add the OSD to the device list, add the
host as a bucket (if it's not already in the CRUSH map), add the device as an
item in the host, assign it a weight, recompile it and set it. See
`Add/Move an OSD`_ for details.
Note that there is another way to add a new OSD to the CRUSH map: decompile
the CRUSH map, add the OSD to the device list, add the host as a bucket (if
it is not already in the CRUSH map), add the device as an item in the host,
assign the device a weight, recompile the CRUSH map, and set the CRUSH map.
For details, see `Add/Move an OSD`_. This is rarely necessary with recent
releases (this sentence was written the month that Reef was released).
.. _rados-replacing-an-osd:
@ -144,193 +162,206 @@ weight).
Replacing an OSD
----------------
.. note:: If the instructions in this section do not work for you, try the
instructions in the cephadm documentation: :ref:`cephadm-replacing-an-osd`.
.. note:: If the procedure in this section does not work for you, try the
instructions in the ``cephadm`` documentation:
:ref:`cephadm-replacing-an-osd`.
When disks fail, or if an administrator wants to reprovision OSDs with a new
backend, for instance, for switching from FileStore to BlueStore, OSDs need to
be replaced. Unlike `Removing the OSD`_, replaced OSD's id and CRUSH map entry
need to be keep intact after the OSD is destroyed for replacement.
Sometimes OSDs need to be replaced: for example, when a disk fails, or when an
administrator wants to reprovision OSDs with a new back end (perhaps when
switching from Filestore to BlueStore). Replacing an OSD differs from `Removing
the OSD`_ in that the replaced OSD's ID and CRUSH map entry must be kept intact
after the OSD is destroyed for replacement.
#. Make sure it is safe to destroy the OSD:
#. Make sure that it is safe to destroy the OSD:
.. prompt:: bash $
while ! ceph osd safe-to-destroy osd.{id} ; do sleep 10 ; done
#. Destroy the OSD first:
#. Destroy the OSD:
.. prompt:: bash $
ceph osd destroy {id} --yes-i-really-mean-it
#. Zap a disk for the new OSD, if the disk was used before for other purposes.
It's not necessary for a new disk:
#. *Optional*: If the disk that you plan to use is not a new disk and has been
used before for other purposes, zap the disk:
.. prompt:: bash $
ceph-volume lvm zap /dev/sdX
#. Prepare the disk for replacement by using the previously destroyed OSD id:
#. Prepare the disk for replacement by using the ID of the OSD that was
destroyed in previous steps:
.. prompt:: bash $
ceph-volume lvm prepare --osd-id {id} --data /dev/sdX
#. And activate the OSD:
#. Finally, activate the OSD:
.. prompt:: bash $
ceph-volume lvm activate {id} {fsid}
Alternatively, instead of preparing and activating, the device can be recreated
in one call, like:
Alternatively, instead of carrying out the final two steps (preparing the disk
and activating the OSD), you can re-create the OSD by running a single command
of the following form:
.. prompt:: bash $
ceph-volume lvm create --osd-id {id} --data /dev/sdX
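For instance, a single-command re-creation might look like this (the OSD id
``12`` and the device ``/dev/sdb`` are hypothetical):
.. prompt:: bash $
ceph-volume lvm create --osd-id 12 --data /dev/sdb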
Starting the OSD
----------------
After you add an OSD to Ceph, the OSD is in your configuration. However,
it is not yet running. The OSD is ``down`` and ``in``. You must start
your new OSD before it can begin receiving data. You may use
``service ceph`` from your admin host or start the OSD from its host
machine:
After an OSD is added to Ceph, the OSD is in the cluster. However, until it is
started, the OSD is considered ``down`` and ``in``. The OSD is not running and
will be unable to receive data. To start an OSD, either run ``service ceph``
from your admin host or run a command of the following form to start the OSD
from its host machine:
.. prompt:: bash $
sudo systemctl start ceph-osd@{osd-num}
After the OSD is started, it is considered ``up`` and ``in``.
Once you start your OSD, it is ``up`` and ``in``.
Observing the Data Migration
----------------------------
Observe the Data Migration
--------------------------
Once you have added your new OSD to the CRUSH map, Ceph will begin rebalancing
the server by migrating placement groups to your new OSD. You can observe this
process with the `ceph`_ tool. :
After the new OSD has been added to the CRUSH map, Ceph begins rebalancing the
cluster by migrating placement groups (PGs) to the new OSD. To observe this
process by using the `ceph`_ tool, run the following command:
.. prompt:: bash $
ceph -w
You should see the placement group states change from ``active+clean`` to
``active, some degraded objects``, and finally ``active+clean`` when migration
completes. (Control-c to exit.)
Or:
.. prompt:: bash $
watch ceph status
The PG states will first change from ``active+clean`` to ``active, some
degraded objects`` and then return to ``active+clean`` when migration
completes. When you are finished observing, press Ctrl-C to exit.
.. _Add/Move an OSD: ../crush-map#addosd
.. _ceph: ../monitoring
Removing OSDs (Manual)
======================
When you want to reduce the size of a cluster or replace hardware, you may
remove an OSD at runtime. With Ceph, an OSD is generally one Ceph ``ceph-osd``
daemon for one storage drive within a host machine. If your host has multiple
storage drives, you may need to remove one ``ceph-osd`` daemon for each drive.
Generally, it's a good idea to check the capacity of your cluster to see if you
are reaching the upper end of its capacity. Ensure that when you remove an OSD
that your cluster is not at its ``near full`` ratio.
It is possible to remove an OSD manually while the cluster is running: you
might want to do this in order to reduce the size of the cluster or when
replacing hardware. Typically, an OSD is a Ceph ``ceph-osd`` daemon running on
one storage drive within a host machine. Alternatively, if your host machine
has multiple storage drives, you might need to remove multiple ``ceph-osd``
daemons: one daemon for each drive on the machine.
.. warning:: Do not let your cluster reach its ``full ratio`` when
removing an OSD. Removing OSDs could cause the cluster to reach
or exceed its ``full ratio``.
.. warning:: Before you begin the process of removing an OSD, make sure that
your cluster is not near its ``full ratio``. Otherwise the act of removing
OSDs might cause the cluster to reach or exceed its ``full ratio``.
Take the OSD out of the Cluster
-----------------------------------
Taking the OSD ``out`` of the Cluster
-------------------------------------
Before you remove an OSD, it is usually ``up`` and ``in``. You need to take it
out of the cluster so that Ceph can begin rebalancing and copying its data to
other OSDs. :
OSDs are typically ``up`` and ``in`` before they are removed from the cluster.
Before the OSD can be removed from the cluster, the OSD must be taken ``out``
of the cluster so that Ceph can begin rebalancing and copying its data to other
OSDs. To take an OSD ``out`` of the cluster, run a command of the following
form:
.. prompt:: bash $
ceph osd out {osd-num}
Observe the Data Migration
--------------------------
Observing the Data Migration
----------------------------
Once you have taken your OSD ``out`` of the cluster, Ceph will begin
rebalancing the cluster by migrating placement groups out of the OSD you
removed. You can observe this process with the `ceph`_ tool. :
After the OSD has been taken ``out`` of the cluster, Ceph begins rebalancing
the cluster by migrating placement groups out of the OSD that was removed. To
observe this process by using the `ceph`_ tool, run the following command:
.. prompt:: bash $
ceph -w
You should see the placement group states change from ``active+clean`` to
``active, some degraded objects``, and finally ``active+clean`` when migration
completes. (Control-c to exit.)
The PG states will change from ``active+clean`` to ``active, some degraded
objects`` and will then return to ``active+clean`` when migration completes.
When you are finished observing, press Ctrl-C to exit.
.. note:: Sometimes, typically in a "small" cluster with few hosts (for
instance with a small testing cluster), the fact to take ``out`` the
OSD can spawn a CRUSH corner case where some PGs remain stuck in the
``active+remapped`` state. If you are in this case, you should mark
the OSD ``in`` with:
.. note:: Under certain conditions, the action of taking ``out`` an OSD
might lead CRUSH to encounter a corner case in which some PGs remain stuck
in the ``active+remapped`` state. This problem sometimes occurs in small
clusters with few hosts (for example, in a small testing cluster). To
address this problem, mark the OSD ``in`` by running a command of the
following form:
.. prompt:: bash $
ceph osd in {osd-num}
to come back to the initial state and then, instead of marking ``out``
the OSD, set its weight to 0 with:
After the OSD has come back to its initial state, do not mark the OSD
``out`` again. Instead, set the OSD's weight to ``0`` by running a command
of the following form:
.. prompt:: bash $
ceph osd crush reweight osd.{osd-num} 0
After that, you can observe the data migration which should come to its
end. The difference between marking ``out`` the OSD and reweighting it
to 0 is that in the first case the weight of the bucket which contains
the OSD is not changed whereas in the second case the weight of the bucket
is updated (and decreased of the OSD weight). The reweight command could
be sometimes favoured in the case of a "small" cluster.
After the OSD has been reweighted, observe the data migration and confirm
that it has completed successfully. The difference between marking an OSD
``out`` and reweighting the OSD to ``0`` has to do with the bucket that
contains the OSD. When an OSD is marked ``out``, the weight of the bucket is
not changed. But when an OSD is reweighted to ``0``, the weight of the
bucket is updated (namely, the weight of the OSD is subtracted from the
overall weight of the bucket). When operating small clusters, it can
sometimes be preferable to use the above reweight command.
Stopping the OSD
----------------
After you take an OSD out of the cluster, it may still be running.
That is, the OSD may be ``up`` and ``out``. You must stop
your OSD before you remove it from the configuration:
After you take an OSD ``out`` of the cluster, the OSD might still be running.
In such a case, the OSD is ``up`` and ``out``. Before it is removed from the
cluster, the OSD must be stopped by running commands of the following form:
.. prompt:: bash $
ssh {osd-host}
sudo systemctl stop ceph-osd@{osd-num}
Once you stop your OSD, it is ``down``.
After the OSD has been stopped, it is ``down``.
Removing the OSD
----------------
This procedure removes an OSD from a cluster map, removes its authentication
key, removes the OSD from the OSD map, and removes the OSD from the
``ceph.conf`` file. If your host has multiple drives, you may need to remove an
OSD for each drive by repeating this procedure.
The following procedure removes an OSD from the cluster map, removes the OSD's
authentication key, removes the OSD from the OSD map, and removes the OSD from
the ``ceph.conf`` file. If your host has multiple drives, it might be necessary
to remove an OSD from each drive by repeating this procedure.
#. Let the cluster forget the OSD first. This step removes the OSD from the CRUSH
map, removes its authentication key. And it is removed from the OSD map as
well. Please note the :ref:`purge subcommand <ceph-admin-osd>` is introduced in Luminous, for older
versions, please see below:
#. Begin by having the cluster forget the OSD. This step removes the OSD from
the CRUSH map, removes the OSD's authentication key, and removes the OSD
from the OSD map. (The :ref:`purge subcommand <ceph-admin-osd>` was
introduced in Luminous. For older releases, see :ref:`the procedure linked
here <ceph_osd_purge_procedure_pre_luminous>`.):
.. prompt:: bash $
ceph osd purge {id} --yes-i-really-mean-it
#. Navigate to the host where you keep the master copy of the cluster's
``ceph.conf`` file:
#. Navigate to the host where the master copy of the cluster's
``ceph.conf`` file is kept:
.. prompt:: bash $
@ -338,46 +369,48 @@ OSD for each drive by repeating this procedure.
cd /etc/ceph
vim ceph.conf
#. Remove the OSD entry from your ``ceph.conf`` file (if it exists)::
#. Remove the OSD entry from your ``ceph.conf`` file (if such an entry
exists)::
[osd.1]
host = {hostname}
[osd.1]
host = {hostname}
#. From the host where you keep the master copy of the cluster's ``ceph.conf``
file, copy the updated ``ceph.conf`` file to the ``/etc/ceph`` directory of
other hosts in your cluster.
#. Copy the updated ``ceph.conf`` file from the location on the host where the
master copy of the cluster's ``ceph.conf`` is kept to the ``/etc/ceph``
directory of the other hosts in your cluster.
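For example (the host names below are hypothetical):
.. prompt:: bash $
scp /etc/ceph/ceph.conf osd-host-02:/etc/ceph/ceph.conf
scp /etc/ceph/ceph.conf osd-host-03:/etc/ceph/ceph.conf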
If your Ceph cluster is older than Luminous, instead of using ``ceph osd
purge``, you need to perform this step manually:
.. _ceph_osd_purge_procedure_pre_luminous:
If your Ceph cluster is older than Luminous, you will be unable to use the
``ceph osd purge`` command. Instead, carry out the following procedure:
#. Remove the OSD from the CRUSH map so that it no longer receives data. You may
also decompile the CRUSH map, remove the OSD from the device list, remove the
device as an item in the host bucket or remove the host bucket (if it's in the
CRUSH map and you intend to remove the host), recompile the map and set it.
See `Remove an OSD`_ for details:
#. Remove the OSD from the CRUSH map so that it no longer receives data (for
more details, see `Remove an OSD`_):
.. prompt:: bash $
ceph osd crush remove {name}
Instead of removing the OSD from the CRUSH map, you might opt for one of two
alternatives: (1) decompile the CRUSH map, remove the OSD from the device
list, and remove the device from the host bucket; (2) remove the host bucket
from the CRUSH map (provided that it is in the CRUSH map and that you intend
to remove the host), recompile the map, and set it.
#. Remove the OSD authentication key:
.. prompt:: bash $
ceph auth del osd.{osd-num}
The value of ``ceph`` for ``ceph-{osd-num}`` in the path is the
``$cluster-$id``. If your cluster name differs from ``ceph``, use your
cluster name instead.
#. Remove the OSD:
.. prompt:: bash $
ceph osd rm {osd-num}
for example:
For example:
.. prompt:: bash $

View File

@ -3,14 +3,15 @@
Balancer
========
The *balancer* can optimize the placement of PGs across OSDs in
order to achieve a balanced distribution, either automatically or in a
supervised fashion.
The *balancer* can optimize the allocation of placement groups (PGs) across
OSDs in order to achieve a balanced distribution. The balancer can operate
either automatically or in a supervised fashion.
Status
------
The current status of the balancer can be checked at any time with:
To check the current status of the balancer, run the following command:
.. prompt:: bash $
@ -20,70 +21,78 @@ The current status of the balancer can be checked at any time with:
Automatic balancing
-------------------
The automatic balancing feature is enabled by default in ``upmap``
mode. Please refer to :ref:`upmap` for more details. The balancer can be
turned off with:
When the balancer is in ``upmap`` mode, the automatic balancing feature is
enabled by default. For more details, see :ref:`upmap`. To disable the
balancer, run the following command:
.. prompt:: bash $
ceph balancer off
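If automatic balancing is needed again later, it can be re-enabled with the
counterpart command:
.. prompt:: bash $
ceph balancer on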
The balancer mode can be changed to ``crush-compat`` mode, which is
backward compatible with older clients, and will make small changes to
the data distribution over time to ensure that OSDs are equally utilized.
The balancer mode can be changed from ``upmap`` mode to ``crush-compat`` mode.
``crush-compat`` mode is backward compatible with older clients. In
``crush-compat`` mode, the balancer automatically makes small changes to the
data distribution in order to ensure that OSDs are utilized equally.
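For example, the mode can be switched with the following command (the same
command is covered in the Modes section below):
.. prompt:: bash $
ceph balancer mode crush-compat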
Throttling
----------
No adjustments will be made to the PG distribution if the cluster is
degraded (e.g., because an OSD has failed and the system has not yet
healed itself).
If the cluster is degraded (that is, if an OSD has failed and the system hasn't
healed itself yet), then the balancer will not make any adjustments to the PG
distribution.
When the cluster is healthy, the balancer will throttle its changes
such that the percentage of PGs that are misplaced (i.e., that need to
be moved) is below a threshold of (by default) 5%. The
``target_max_misplaced_ratio`` threshold can be adjusted with:
When the cluster is healthy, the balancer will incrementally move a small
fraction of unbalanced PGs in order to improve distribution. This fraction
will not exceed a certain threshold that defaults to 5%. To adjust this
``target_max_misplaced_ratio`` threshold setting, run the following command:
.. prompt:: bash $
ceph config set mgr target_max_misplaced_ratio .07 # 7%
Set the number of seconds to sleep in between runs of the automatic balancer:
The balancer sleeps between runs. To set the number of seconds for this
interval of sleep, run the following command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/sleep_interval 60
Set the time of day to begin automatic balancing in HHMM format:
To set the time of day (in HHMM format) at which automatic balancing begins,
run the following command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/begin_time 0000
Set the time of day to finish automatic balancing in HHMM format:
To set the time of day (in HHMM format) at which automatic balancing ends, run
the following command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/end_time 2359
Restrict automatic balancing to this day of the week or later.
Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on:
Automatic balancing can be restricted to certain days of the week. To restrict
it to a specific day of the week or later (as with crontab, ``0`` is Sunday,
``1`` is Monday, and so on), run the following command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/begin_weekday 0
Restrict automatic balancing to this day of the week or earlier.
Uses the same conventions as crontab, 0 is Sunday, 1 is Monday, and so on:
To restrict automatic balancing to a specific day of the week or earlier
(again, ``0`` is Sunday, ``1`` is Monday, and so on), run the following
command:
.. prompt:: bash $
ceph config set mgr mgr/balancer/end_weekday 6
Pool IDs to which the automatic balancing will be limited.
The default for this is an empty string, meaning all pools will be balanced.
The numeric pool IDs can be gotten with the :command:`ceph osd pool ls detail` command:
Automatic balancing can be restricted to certain pools. By default, the value
of this setting is an empty string, so that all pools are automatically
balanced. To restrict automatic balancing to specific pools, retrieve their
numeric pool IDs (by running the :command:`ceph osd pool ls detail` command),
and then run the following command:
.. prompt:: bash $
@ -93,43 +102,41 @@ The numeric pool IDs can be gotten with the :command:`ceph osd pool ls detail` c
Modes
-----
There are currently two supported balancer modes:
There are two supported balancer modes:
#. **crush-compat**. The CRUSH compat mode uses the compat weight-set
feature (introduced in Luminous) to manage an alternative set of
weights for devices in the CRUSH hierarchy. The normal weights
should remain set to the size of the device to reflect the target
amount of data that we want to store on the device. The balancer
then optimizes the weight-set values, adjusting them up or down in
small increments, in order to achieve a distribution that matches
the target distribution as closely as possible. (Because PG
placement is a pseudorandom process, there is a natural amount of
variation in the placement; by optimizing the weights we
counter-act that natural variation.)
#. **crush-compat**. This mode uses the compat weight-set feature (introduced
in Luminous) to manage an alternative set of weights for devices in the
CRUSH hierarchy. When the balancer is operating in this mode, the normal
weights should remain set to the size of the device in order to reflect the
target amount of data intended to be stored on the device. The balancer will
then optimize the weight-set values, adjusting them up or down in small
increments, in order to achieve a distribution that matches the target
distribution as closely as possible. (Because PG placement is a pseudorandom
process, it is subject to a natural amount of variation; optimizing the
weights serves to counteract that natural variation.)
Notably, this mode is *fully backwards compatible* with older
clients: when an OSDMap and CRUSH map is shared with older clients,
we present the optimized weights as the "real" weights.
Note that this mode is *fully backward compatible* with older clients: when
an OSD Map and CRUSH map are shared with older clients, Ceph presents the
optimized weights as the "real" weights.
The primary restriction of this mode is that the balancer cannot
handle multiple CRUSH hierarchies with different placement rules if
the subtrees of the hierarchy share any OSDs. (This is normally
not the case, and is generally not a recommended configuration
because it is hard to manage the space utilization on the shared
OSDs.)
The primary limitation of this mode is that the balancer cannot handle
multiple CRUSH hierarchies with different placement rules if the subtrees of
the hierarchy share any OSDs. (Such sharing of OSDs is not typical and,
because of the difficulty of managing the space utilization on the shared
OSDs, is generally not recommended.)
#. **upmap**. Starting with Luminous, the OSDMap can store explicit
mappings for individual OSDs as exceptions to the normal CRUSH
placement calculation. These `upmap` entries provide fine-grained
control over the PG mapping. This CRUSH mode will optimize the
placement of individual PGs in order to achieve a balanced
distribution. In most cases, this distribution is "perfect," which
an equal number of PGs on each OSD (+/-1 PG, since they might not
divide evenly).
#. **upmap**. In Luminous and later releases, the OSDMap can store explicit
mappings for individual OSDs as exceptions to the normal CRUSH placement
calculation. These ``upmap`` entries provide fine-grained control over the
PG mapping. This balancer mode optimizes the placement of individual PGs in
order to achieve a balanced distribution. In most cases, the resulting
distribution is nearly perfect: that is, there is an equal number of PGs on
each OSD (±1 PG, since the total number might not divide evenly).
Note that using upmap requires that all clients be Luminous or newer.
To use ``upmap``, all clients must be Luminous or newer.
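If you are not certain that all clients are recent enough, one way to have the
cluster enforce this requirement before enabling ``upmap`` (a sketch; confirm
first that no older clients still need to connect) is:

.. prompt:: bash $

   ceph osd set-require-min-compat-client luminous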
The default mode is ``upmap``. The mode can be adjusted with:
The default mode is ``upmap``. The mode can be changed to ``crush-compat`` by
running the following command:
.. prompt:: bash $
@ -138,69 +145,77 @@ The default mode is ``upmap``. The mode can be adjusted with:
Supervised optimization
-----------------------
The balancer operation is broken into a few distinct phases:
Supervised use of the balancer can be understood in terms of three distinct
phases:
#. building a *plan*
#. evaluating the quality of the data distribution, either for the current PG distribution, or the PG distribution that would result after executing a *plan*
#. executing the *plan*
#. building a plan
#. evaluating the quality of the data distribution, either for the current PG
distribution or for the PG distribution that would result after executing a
plan
#. executing the plan
To evaluate and score the current distribution:
To evaluate the current distribution, run the following command:
.. prompt:: bash $
ceph balancer eval
You can also evaluate the distribution for a single pool with:
To evaluate the distribution for a single pool, run the following command:
.. prompt:: bash $
ceph balancer eval <pool-name>
Greater detail for the evaluation can be seen with:
To see the evaluation in greater detail, run the following command:
.. prompt:: bash $
ceph balancer eval-verbose ...
The balancer can generate a plan, using the currently configured mode, with:
To instruct the balancer to generate a plan (using the currently configured
mode), make up a name (any useful identifying string) for the plan, and run the
following command:
.. prompt:: bash $
ceph balancer optimize <plan-name>
The name is provided by the user and can be any useful identifying string. The contents of a plan can be seen with:
To see the contents of a plan, run the following command:
.. prompt:: bash $
ceph balancer show <plan-name>
All plans can be shown with:
To display all plans, run the following command:
.. prompt:: bash $
ceph balancer ls
Old plans can be discarded with:
To discard an old plan, run the following command:
.. prompt:: bash $
ceph balancer rm <plan-name>
Currently recorded plans are shown as part of the status command:
To see currently recorded plans, examine the output of the following status
command:
.. prompt:: bash $
ceph balancer status
The quality of the distribution that would result after executing a plan can be calculated with:
To evaluate the distribution that would result from executing a specific plan,
run the following command:
.. prompt:: bash $
ceph balancer eval <plan-name>
Assuming the plan is expected to improve the distribution (i.e., it has a lower score than the current cluster state), the user can execute that plan with:
If a plan is expected to improve the distribution (that is, the plan's score is
lower than the current cluster state's score), you can execute that plan by
running the following command:
.. prompt:: bash $
ceph balancer execute <plan-name>
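Putting these commands together, a typical supervised balancing session might
look like the following sketch, where ``myplan`` is just an example plan name:

.. prompt:: bash $

   ceph balancer eval
   ceph balancer optimize myplan
   ceph balancer eval myplan
   ceph balancer show myplan
   ceph balancer execute myplan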

View File

@ -1,69 +1,68 @@
.. _rados_operations_bluestore_migration:
=====================
BlueStore Migration
=====================
Each OSD can run either BlueStore or Filestore, and a single Ceph
cluster can contain a mix of both. Users who have previously deployed
Filestore OSDs should transition to BlueStore in order to
take advantage of the improved performance and robustness. Moreover,
Ceph releases beginning with Reef do not support Filestore. There are
several strategies for making such a transition.
Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph
cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs.
Because BlueStore is superior to Filestore in performance and robustness, and
because Filestore is not supported by Ceph releases beginning with Reef, users
deploying Filestore OSDs should transition to BlueStore. There are several
strategies for making the transition to BlueStore.
An individual OSD cannot be converted in place;
BlueStore and Filestore are simply too different for that to be
feasible. The conversion process uses either the cluster's normal
replication and healing support or tools and strategies that copy OSD
content from an old (Filestore) device to a new (BlueStore) one.
BlueStore is so different from Filestore that an individual OSD cannot be
converted in place. Instead, the conversion process must use either (1) the
cluster's normal replication and healing support, or (2) tools and strategies
that copy OSD content from an old (Filestore) device to a new (BlueStore) one.
Deploying new OSDs with BlueStore
=================================
Deploy new OSDs with BlueStore
==============================
Use BlueStore when deploying new OSDs (for example, when the cluster is
expanded). Because this is the default behavior, no specific change is
needed.
New OSDs (e.g., when the cluster is expanded) should be deployed
using BlueStore. This is the default behavior so no specific change
is needed.
Similarly, use BlueStore for any OSDs that have been reprovisioned after
a failed drive was replaced.
Similarly, any OSDs that are reprovisioned after replacing a failed drive
should use BlueStore.
Converting existing OSDs
========================
Convert existing OSDs
=====================
"Mark-``out``" replacement
--------------------------
Mark out and replace
--------------------
The simplest approach is to ensure that the cluster is healthy,
then mark ``out`` each device in turn, wait for
data to replicate across the cluster, reprovision the OSD, and mark
it back ``in`` again. Proceed to the next OSD when recovery is complete.
This is easy to automate but results in more data migration than
is strictly necessary, which in turn presents additional wear to SSDs and takes
longer to complete.
The simplest approach is to verify that the cluster is healthy and
then follow these steps for each Filestore OSD in succession: mark the OSD
``out``, wait for the data to replicate across the cluster, reprovision the OSD,
mark the OSD back ``in``, and wait for recovery to complete before proceeding
to the next OSD. This approach is easy to automate, but it entails unnecessary
data migration that carries costs in time and SSD wear.
#. Identify a Filestore OSD to replace::
ID=<osd-id-number>
DEVICE=<disk-device>
You can tell whether a given OSD is Filestore or BlueStore with:
#. Determine whether a given OSD is Filestore or BlueStore:
.. prompt:: bash $
.. prompt:: bash $
ceph osd metadata $ID | grep osd_objectstore
ceph osd metadata $ID | grep osd_objectstore
You can get a current count of Filestore and BlueStore OSDs with:
#. Get a current count of Filestore and BlueStore OSDs:
.. prompt:: bash $
.. prompt:: bash $
ceph osd count-metadata osd_objectstore
ceph osd count-metadata osd_objectstore
#. Mark the Filestore OSD ``out``:
#. Mark a Filestore OSD ``out``:
.. prompt:: bash $
ceph osd out $ID
#. Wait for the data to migrate off the OSD in question:
#. Wait for the data to migrate off this OSD:
.. prompt:: bash $
@ -75,7 +74,9 @@ longer to complete.
systemctl kill ceph-osd@$ID
#. Note which device this OSD is using:
.. _osd_id_retrieval:
#. Note which device the OSD is using:
.. prompt:: bash $
@ -87,25 +88,27 @@ longer to complete.
umount /var/lib/ceph/osd/ceph-$ID
#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy
the contents of the device; be certain the data on the device is
not needed (i.e., that the cluster is healthy) before proceeding:
#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! These commands will destroy
the contents of the device; you must be certain that the data on the device is
not needed (in other words, that the cluster is healthy) before proceeding:
.. prompt:: bash $
ceph-volume lvm zap $DEVICE
#. Tell the cluster the OSD has been destroyed (and a new OSD can be
reprovisioned with the same ID):
#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
reprovisioned with the same OSD ID):
.. prompt:: bash $
ceph osd destroy $ID --yes-i-really-mean-it
#. Provision a BlueStore OSD in its place with the same OSD ID.
This requires you do identify which device to wipe based on what you saw
mounted above. BE CAREFUL! Also note that hybrid OSDs may require
adjustments to these commands:
#. Provision a BlueStore OSD in place by using the same OSD ID. This requires
you to identify which device to wipe, and to make certain that you target
the correct and intended device, using the information that was retrieved in
the :ref:`"Note which device the OSD is using" <osd_id_retrieval>` step. BE
CAREFUL! Note that you may need to modify these commands when dealing with
hybrid OSDs:
.. prompt:: bash $
@ -113,15 +116,15 @@ longer to complete.
#. Repeat.
You can allow balancing of the replacement OSD to happen
concurrently with the draining of the next OSD, or follow the same
procedure for multiple OSDs in parallel, as long as you ensure the
cluster is fully clean (all data has all replicas) before destroying
any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to
only zap / destroy OSDs within a single CRUSH failure domain, e.g. ``host`` or
``rack``. Failure to do so will reduce the redundancy and availability of
your data and increase the risk of (or even cause) data loss.
You may opt to (1) have the balancing of the replacement BlueStore OSD take
place concurrently with the draining of the next Filestore OSD, or instead
(2) follow the same procedure for multiple OSDs in parallel. In either case,
however, you must ensure that the cluster is fully clean (in other words, that
all data has all replicas) before destroying any OSDs. If you opt to reprovision
multiple OSDs in parallel, be **very** careful to destroy OSDs only within a
single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to
satisfy this requirement will reduce the redundancy and availability of your
data and increase the risk of data loss (or even guarantee data loss).
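Before destroying any OSD, one way to double-check that the cluster is clean
and that the OSD in question is no longer needed (a sketch, where ``$ID`` is
the OSD you are about to reprovision) is:

.. prompt:: bash $

   ceph -s
   while ! ceph osd safe-to-destroy osd.$ID ; do sleep 60 ; done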
Advantages:
@ -131,29 +134,29 @@ Advantages:
Disadvantages:
* Data is copied over the network twice: once to some other OSD in the
cluster (to maintain the desired number of replicas), and then again
back to the reprovisioned BlueStore OSD.
* Data is copied over the network twice: once to another OSD in the cluster (to
maintain the specified number of replicas), and again back to the
reprovisioned BlueStore OSD.
"Whole host" replacement
------------------------
Whole host replacement
----------------------
If you have a spare host in the cluster, or sufficient free space to evacuate
an entire host for use as a spare, then the conversion can be done on a
host-by-host basis so that each stored copy of the data is migrated only once.
If you have a spare host in the cluster, or have sufficient free space
to evacuate an entire host in order to use it as a spare, then the
conversion can be done on a host-by-host basis with each stored copy of
the data migrating only once.
To use this approach, you need an empty host that has no OSDs provisioned.
There are two ways to do this: either by using a new, empty host that is not
yet part of the cluster, or by offloading data from an existing host that is
already part of the cluster.
First, you need an empty host that has no OSDs provisioned. There are two
ways to do this: either by starting with a new, empty host that isn't yet
part of the cluster, or by offloading data from an existing host in the cluster.
Using a new, empty host
^^^^^^^^^^^^^^^^^^^^^^^
Use a new, empty host
^^^^^^^^^^^^^^^^^^^^^
Ideally the host will have roughly the same capacity as each of the other hosts
you will be converting. Add the host to the CRUSH hierarchy, but do not attach
it to the root:
Ideally the host should have roughly the
same capacity as other hosts you will be converting.
Add the host to the CRUSH hierarchy, but do not attach it to the root:
.. prompt:: bash $
@ -162,23 +165,22 @@ Add the host to the CRUSH hierarchy, but do not attach it to the root:
Make sure that Ceph packages are installed on the new host.
Use an existing host
^^^^^^^^^^^^^^^^^^^^
Using an existing host
^^^^^^^^^^^^^^^^^^^^^^
If you would like to use an existing host
that is already part of the cluster, and there is sufficient free
space on that host so that all of its data can be migrated off to
other cluster hosts, you can instead do::
If you would like to use an existing host that is already part of the cluster,
and if there is sufficient free space on that host so that all of its data can
be migrated off to other cluster hosts, you can do the following (instead of
using a new, empty host):
.. prompt:: bash $
.. prompt:: bash $
OLDHOST=<existing-cluster-host-to-offload>
ceph osd crush unlink $OLDHOST default
where "default" is the immediate ancestor in the CRUSH map. (For
smaller clusters with unmodified configurations this will normally
be "default", but it might also be a rack name.) You should now
be "default", but it might instead be a rack name.) You should now
see the host at the top of the OSD tree output with no parent:
.. prompt:: bash $
@ -199,15 +201,18 @@ see the host at the top of the OSD tree output with no parent:
2 ssd 1.00000 osd.2 up 1.00000 1.00000
...
If everything looks good, jump directly to the "Wait for data
migration to complete" step below and proceed from there to clean up
the old OSDs.
If everything looks good, jump directly to the :ref:`"Wait for the data
migration to complete" <bluestore_data_migration_step>` step below and proceed
from there to clean up the old OSDs.
Migration process
^^^^^^^^^^^^^^^^^
If you're using a new host, start at step #1. For an existing host,
jump to step #5 below.
If you're using a new host, start at :ref:`the first step
<bluestore_migration_process_first_step>`. If you're using an existing host,
jump to :ref:`this step <bluestore_data_migration_step>`.
.. _bluestore_migration_process_first_step:
#. Provision new BlueStore OSDs for all devices:
@ -215,14 +220,14 @@ jump to step #5 below.
ceph-volume lvm create --bluestore --data /dev/$DEVICE
#. Verify OSDs join the cluster with:
#. Verify that the new OSDs have joined the cluster:
.. prompt:: bash $
ceph osd tree
You should see the new host ``$NEWHOST`` with all of the OSDs beneath
it, but the host should *not* be nested beneath any other node in
it, but the host should *not* be nested beneath any other node in the
hierarchy (like ``root default``). For example, if ``newhost`` is
the empty host, you might see something like::
@ -251,13 +256,16 @@ jump to step #5 below.
ceph osd crush swap-bucket $NEWHOST $OLDHOST
At this point all data on ``$OLDHOST`` will start migrating to OSDs
on ``$NEWHOST``. If there is a difference in the total capacity of
the old and new hosts you may also see some data migrate to or from
other nodes in the cluster, but as long as the hosts are similarly
sized this will be a relatively small amount of data.
At this point all data on ``$OLDHOST`` will begin migrating to the OSDs on
``$NEWHOST``. If there is a difference between the total capacity of the
old hosts and the total capacity of the new hosts, you may also see some
data migrate to or from other nodes in the cluster. Provided that the hosts
are similarly sized, however, this will be a relatively small amount of
data.
#. Wait for data migration to complete:
.. _bluestore_data_migration_step:
#. Wait for the data migration to complete:
.. prompt:: bash $
@ -279,14 +287,14 @@ jump to step #5 below.
ceph osd purge $osd --yes-i-really-mean-it
done
#. Wipe the old OSD devices. This requires you do identify which
devices are to be wiped manually (BE CAREFUL!). For each device:
#. Wipe the old OSDs. This requires you to identify which devices are to be
wiped manually. BE CAREFUL! For each device:
.. prompt:: bash $
ceph-volume lvm zap $DEVICE
#. Use the now-empty host as the new host, and repeat::
#. Use the now-empty host as the new host, and repeat:
.. prompt:: bash $
@ -295,54 +303,53 @@ jump to step #5 below.
Advantages:
* Data is copied over the network only once.
* Converts an entire host's OSDs at once.
* Can parallelize to converting multiple hosts at a time.
* No spare devices are required on each host.
* An entire host's OSDs are converted at once.
* Can be parallelized, to make possible the conversion of multiple hosts at the same time.
* No host involved in this process needs to have a spare device.
Disadvantages:
* A spare host is required.
* An entire host's worth of OSDs will be migrating data at a time. This
* An entire host's worth of OSDs will be migrating data at a time. This
is likely to impact overall cluster performance.
* All migrated data still makes one full hop over the network.
Per-OSD device copy
-------------------
A single logical OSD can be converted by using the ``copy`` function
of ``ceph-objectstore-tool``. This requires that the host have a free
device (or devices) to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has twelve OSDs, then you'd need a
thirteenth unused device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.
included in ``ceph-objectstore-tool``. This requires that the host have one or more free
devices to provision a new, empty BlueStore OSD. For
example, if each host in your cluster has twelve OSDs, then you need a
thirteenth unused device so that each OSD can be converted in turn before the
old device is reclaimed to convert the next OSD.
Caveats:
* This strategy requires that an empty BlueStore OSD be prepared
without allocating a new OSD ID, something that the ``ceph-volume``
tool doesn't support. More importantly, the setup of *dmcrypt* is
closely tied to the OSD identity, which means that this approach
does not work with encrypted OSDs.
* This approach requires that we prepare an empty BlueStore OSD but that we do not allocate
a new OSD ID to it. The ``ceph-volume`` tool does not support such an operation. **IMPORTANT:**
because the setup of *dmcrypt* is closely tied to the identity of the OSD, this approach does not
work with encrypted OSDs.
* The device must be manually partitioned.
* An unsupported user-contributed script that shows this process may be found at
* An unsupported user-contributed script that demonstrates this process may be found here:
https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash
Advantages:
* Little or no data migrates over the network during the conversion, so long as
the `noout` or `norecover`/`norebalance` flags are set on the OSD or the cluster
while the process proceeds.
* Provided that the ``noout`` flag or the ``norecover``/``norebalance`` flags are set on the OSD or the
cluster while the conversion process is underway, little or no data migrates over the
network during the conversion.
Disadvantages:
* Tooling is not fully implemented, supported, or documented.
* Each host must have an appropriate spare or empty device for staging.
* The OSD is offline during the conversion, which means new writes to PGs
with the OSD in their acting set may not be ideally redundant until the
subject OSD comes up and recovers. This increases the risk of data
loss due to an overlapping failure. However, if another OSD fails before
conversion and start-up are complete, the original Filestore OSD can be
loss due to an overlapping failure. However, if another OSD fails before
conversion and startup have completed, the original Filestore OSD can be
started to provide access to its original data.

View File

@ -1,6 +1,10 @@
===============
Cache Tiering
===============
.. warning:: Cache tiering has been deprecated in the Reef release as it
has lacked a maintainer for a very long time. This does not mean
that it will certainly be removed, but we may choose to remove it
without much further notice.
A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a

View File

@ -1,88 +1,100 @@
.. _changing_monitor_elections:
=====================================
Configure Monitor Election Strategies
=====================================
=======================================
Configuring Monitor Election Strategies
=======================================
By default, the monitors will use the ``classic`` mode. We
recommend that you stay in this mode unless you have a very specific reason.
By default, the monitors are in ``classic`` mode. We recommend staying in this
mode unless you have a very specific reason.
If you want to switch modes BEFORE constructing the cluster, change
the ``mon election default strategy`` option. This option is an integer value:
If you want to switch modes BEFORE constructing the cluster, change the ``mon
election default strategy`` option. This option takes an integer value:
* 1 for "classic"
* 2 for "disallow"
* 3 for "connectivity"
* ``1`` for ``classic``
* ``2`` for ``disallow``
* ``3`` for ``connectivity``
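For example, to have a newly bootstrapped cluster start out with the
connectivity strategy, one approach (a sketch; adapt it to however your
deployment tooling manages ``ceph.conf``) is to set the option before the
first monitor is created::

    [mon]
        mon_election_default_strategy = 3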
Once your cluster is running, you can change strategies by running ::
After your cluster has started running, you can change strategies by running a
command of the following form:
$ ceph mon set election_strategy {classic|disallow|connectivity}
Choosing a mode
===============
The modes other than classic provide different features. We recommend
you stay in classic mode if you don't need the extra features as it is
the simplest mode.
The disallow Mode
=================
This mode lets you mark monitors as disallowed, in which case they will
participate in the quorum and serve clients, but cannot be elected leader. You
may wish to use this if you have some monitors which are known to be far away
from clients.
You can disallow a leader by running:
The modes other than ``classic`` provide specific features. We recommend staying
in ``classic`` mode if you don't need these extra features because it is the
simplest mode.
.. _rados_operations_disallow_mode:
Disallow Mode
=============
The ``disallow`` mode allows you to mark monitors as disallowed. Disallowed
monitors participate in the quorum and serve clients, but cannot be elected
leader. You might want to use this mode for monitors that are far away from
clients.
To disallow a monitor from being elected leader, run a command of the following
form:
.. prompt:: bash $
ceph mon add disallowed_leader {name}
You can remove a monitor from the disallowed list, and allow it to become
a leader again, by running:
To remove a monitor from the disallowed list and allow it to be elected leader,
run a command of the following form:
.. prompt:: bash $
ceph mon rm disallowed_leader {name}
The list of disallowed_leaders is included when you run:
To see the list of disallowed leaders, examine the output of the following
command:
.. prompt:: bash $
ceph mon dump
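The disallowed monitors appear in the monmap portion of the output; the
relevant field is typically named ``disallowed_leaders``, so (as a sketch) you
can filter for it directly:

.. prompt:: bash $

   ceph mon dump | grep disallowed_leaders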
The connectivity Mode
=====================
This mode evaluates connection scores provided by each monitor for its
peers and elects the monitor with the highest score. This mode is designed
to handle network partitioning or *net-splits*, which may happen if your cluster
is stretched across multiple data centers or otherwise has a non-uniform
or unbalanced network topology.
Connectivity Mode
=================
This mode also supports disallowing monitors from being the leader
using the same commands as above in disallow.
The ``connectivity`` mode evaluates connection scores that are provided by each
monitor for its peers and elects the monitor with the highest score. This mode
is designed to handle network partitioning (also called *net-splits*): network
partitioning might occur if your cluster is stretched across multiple data
centers or otherwise has a non-uniform or unbalanced network topology.
The ``connectivity`` mode also supports disallowing monitors from being elected
leader by using the same commands that were presented in :ref:`Disallow Mode <rados_operations_disallow_mode>`.
Examining connectivity scores
=============================
The monitors maintain connection scores even if they aren't in
the connectivity election mode. You can examine the scores a monitor
has by running:
The monitors maintain connection scores even if they aren't in ``connectivity``
mode. To examine a specific monitor's connection scores, run a command of the
following form:
.. prompt:: bash $
ceph daemon mon.{name} connection scores dump
Scores for individual connections range from 0-1 inclusive, and also
include whether the connection is considered alive or dead (determined by
whether it returned its latest ping within the timeout).
Scores for an individual connection range from ``0`` to ``1`` inclusive and
include whether the connection is considered alive or dead (as determined by
whether it returned its latest ping before timeout).
While this would be an unexpected occurrence, if for some reason you experience
problems and troubleshooting makes you think your scores have become invalid,
you can forget history and reset them by running:
Connectivity scores are expected to remain valid. However, if during
troubleshooting you determine that these scores have for some reason become
invalid, drop the history and reset the scores by running a command of the
following form:
.. prompt:: bash $
ceph daemon mon.{name} connection scores reset
While resetting scores has low risk (monitors will still quickly determine
if a connection is alive or dead, and trend back to the previous scores if they
were accurate!), it should also not be needed and is not recommended unless
requested by your support team or a developer.
Resetting connectivity scores carries little risk: monitors will still quickly
determine whether a connection is alive or dead and trend back to the previous
scores if those scores were accurate. Nevertheless, resetting scores ought to
be unnecessary and it is not recommended unless advised by your support team
or by a developer.

View File

@ -8,13 +8,13 @@
Monitor Commands
================
Monitor commands are issued using the ``ceph`` utility:
To issue monitor commands, use the ``ceph`` utility:
.. prompt:: bash $
ceph [-m monhost] {command}
The command is usually (though not always) of the form:
In most cases, monitor commands have the following form:
.. prompt:: bash $
@ -24,48 +24,49 @@ The command is usually (though not always) of the form:
System Commands
===============
Execute the following to display the current cluster status. :
To display the current cluster status, run the following commands:
.. prompt:: bash $
ceph -s
ceph status
Execute the following to display a running summary of cluster status
and major events. :
To display a running summary of cluster status and major events, run the
following command:
.. prompt:: bash $
ceph -w
Execute the following to show the monitor quorum, including which monitors are
participating and which one is the leader. :
To display the monitor quorum, including which monitors are participating and
which one is the leader, run the following commands:
.. prompt:: bash $
ceph mon stat
ceph quorum_status
Execute the following to query the status of a single monitor, including whether
or not it is in the quorum. :
To query the status of a single monitor, including whether it is in the quorum,
run the following command:
.. prompt:: bash $
ceph tell mon.[id] mon_status
where the value of ``[id]`` can be determined, e.g., from ``ceph -s``.
Here the value of ``[id]`` can be found by consulting the output of ``ceph
-s``.
Authentication Subsystem
========================
To add a keyring for an OSD, execute the following:
To add an OSD keyring for a specific OSD, run the following command:
.. prompt:: bash $
ceph auth add {osd} {--in-file|-i} {path-to-osd-keyring}
To list the cluster's keys and their capabilities, execute the following:
To list the cluster's keys and their capabilities, run the following command:
.. prompt:: bash $
@ -75,42 +76,57 @@ To list the cluster's keys and their capabilities, execute the following:
Placement Group Subsystem
=========================
To display the statistics for all placement groups (PGs), execute the following:
To display the statistics for all placement groups (PGs), run the following
command:
.. prompt:: bash $
ceph pg dump [--format {format}]
The valid formats are ``plain`` (default), ``json`` ``json-pretty``, ``xml``, and ``xml-pretty``.
When implementing monitoring and other tools, it is best to use ``json`` format.
JSON parsing is more deterministic than the human-oriented ``plain``, and the layout is much
less variable from release to release. The ``jq`` utility can be invaluable when extracting
data from JSON output.
Here the valid formats are ``plain`` (default), ``json``, ``json-pretty``,
``xml``, and ``xml-pretty``. When implementing monitoring tools and other
tools, it is best to use the ``json`` format. JSON parsing is more
deterministic than the ``plain`` format (which is more human readable), and the
layout is much more consistent from release to release. The ``jq`` utility is
very useful for extracting data from JSON output.
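As an illustration of the kind of extraction ``jq`` makes convenient, the
following sketch (the field names are taken from the OSD map dump described in
the OSD Subsystem section below and may vary between releases) lists each
OSD's ID together with its ``up`` and ``in`` flags:

.. prompt:: bash $

   ceph osd dump --format json | jq -r '.osds[] | "\(.osd) up=\(.up) in=\(.in)"'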
To display the statistics for all placement groups stuck in a specified state,
execute the following:
To display the statistics for all PGs stuck in a specified state, run the
following command:
.. prompt:: bash $
ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format {format}] [-t|--threshold {seconds}]
Here ``--format`` may be ``plain`` (default), ``json``, ``json-pretty``,
``xml``, or ``xml-pretty``.
``--format`` may be ``plain`` (default), ``json``, ``json-pretty``, ``xml``, or ``xml-pretty``.
The ``--threshold`` argument determines the time interval (in seconds) for a PG
to be considered ``stuck`` (default: 300).
``--threshold`` defines how many seconds "stuck" is (default: 300)
PGs might be stuck in any of the following states:
**Inactive** Placement groups cannot process reads or writes because they are waiting for an OSD
with the most up-to-date data to come back.
**Inactive**
**Unclean** Placement groups contain objects that are not replicated the desired number
of times. They should be recovering.
PGs are unable to process reads or writes because they are waiting for an
OSD that has the most up-to-date data to return to an ``up`` state.
**Stale** Placement groups are in an unknown state - the OSDs that host them have not
reported to the monitor cluster in a while (configured by
``mon_osd_report_timeout``).
Delete "lost" objects or revert them to their prior state, either a previous version
or delete them if they were just created. :
**Unclean**
PGs contain objects that have not been replicated the desired number of
times. These PGs have not yet completed the process of recovering.
**Stale**
PGs are in an unknown state, because the OSDs that host them have not
reported to the monitor cluster for a certain period of time (specified by
the ``mon_osd_report_timeout`` configuration setting).
To delete a ``lost`` object or revert an object to its prior state, either by
reverting it to its previous version or by deleting it because it was just
created and has no previous version, run the following command:
.. prompt:: bash $
@ -122,227 +138,262 @@ or delete them if they were just created. :
OSD Subsystem
=============
Query OSD subsystem status. :
To query OSD subsystem status, run the following command:
.. prompt:: bash $
ceph osd stat
Write a copy of the most recent OSD map to a file. See
:ref:`osdmaptool <osdmaptool>`. :
To write a copy of the most recent OSD map to a file (see :ref:`osdmaptool
<osdmaptool>`), run the following command:
.. prompt:: bash $
ceph osd getmap -o file
Write a copy of the crush map from the most recent OSD map to
file. :
To write a copy of the CRUSH map from the most recent OSD map to a file, run
the following command:
.. prompt:: bash $
ceph osd getcrushmap -o file
The foregoing is functionally equivalent to :
Note that this command is functionally equivalent to the following two
commands:
.. prompt:: bash $
ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --export-crush file
Dump the OSD map. Valid formats for ``-f`` are ``plain``, ``json``, ``json-pretty``,
``xml``, and ``xml-pretty``. If no ``--format`` option is given, the OSD map is
dumped as plain text. As above, JSON format is best for tools, scripting, and other automation. :
To dump the OSD map, run the following command:
.. prompt:: bash $
ceph osd dump [--format {format}]
Dump the OSD map as a tree with one line per OSD containing weight
and state. :
The ``--format`` option accepts the following arguments: ``plain`` (default),
``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON is
the recommended format for tools, scripting, and other forms of automation.
To dump the OSD map as a tree that lists one OSD per line and displays
information about the weights and states of the OSDs, run the following
command:
.. prompt:: bash $
ceph osd tree [--format {format}]
Find out where a specific object is or would be stored in the system:
To find out where a specific RADOS object is stored in the system, run a
command of the following form:
.. prompt:: bash $
ceph osd map <pool-name> <object-name>
Add or move a new item (OSD) with the given id/name/weight at the specified
location. :
To add or move a new OSD with a given ID or name and a given weight to a
specific CRUSH location, run the following command:
.. prompt:: bash $
ceph osd crush set {id} {weight} [{loc1} [{loc2} ...]]
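For example, the following command (which uses hypothetical bucket names;
substitute the bucket types and names of your own CRUSH hierarchy) places
``osd.7`` with a CRUSH weight of ``1.8`` under host ``node5`` in rack
``rack2``:

.. prompt:: bash $

   ceph osd crush set osd.7 1.8 root=default rack=rack2 host=node5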
Remove an existing item (OSD) from the CRUSH map. :
To remove an existing OSD from the CRUSH map, run the following command:
.. prompt:: bash $
ceph osd crush remove {name}
Remove an existing bucket from the CRUSH map. :
To remove an existing bucket from the CRUSH map, run the following command:
.. prompt:: bash $
ceph osd crush remove {bucket-name}
Move an existing bucket from one position in the hierarchy to another. :
To move an existing bucket from one position in the CRUSH hierarchy to another,
run the following command:
.. prompt:: bash $
ceph osd crush move {id} {loc1} [{loc2} ...]
Set the weight of the item given by ``{name}`` to ``{weight}``. :
To set the CRUSH weight of a specific OSD (specified by ``{name}``) to
``{weight}``, run the following command:
.. prompt:: bash $
ceph osd crush reweight {name} {weight}
Mark an OSD as ``lost``. This may result in permanent data loss. Use with caution. :
To mark an OSD as ``lost``, run the following command:
.. prompt:: bash $
ceph osd lost {id} [--yes-i-really-mean-it]
Create a new OSD. If no UUID is given, it will be set automatically when the OSD
starts up. :
.. warning::
This could result in permanent data loss. Use with caution!
To create a new OSD, run the following command:
.. prompt:: bash $
ceph osd create [{uuid}]
Remove the given OSD(s). :
If no UUID is given as part of this command, the UUID will be set automatically
when the OSD starts up.
To remove one or more specific OSDs, run the following command:
.. prompt:: bash $
ceph osd rm [{id}...]
Query the current ``max_osd`` parameter in the OSD map. :
To display the current ``max_osd`` parameter in the OSD map, run the following
command:
.. prompt:: bash $
ceph osd getmaxosd
Import the given crush map. :
To import a specific CRUSH map, run the following command:
.. prompt:: bash $
ceph osd setcrushmap -i file
Set the ``max_osd`` parameter in the OSD map. This defaults to 10000 now so
most admins will never need to adjust this. :
To set the ``max_osd`` parameter in the OSD map, run the following command:
.. prompt:: bash $
ceph osd setmaxosd
Mark OSD ``{osd-num}`` down. :
The parameter has a default value of 10000. Most operators will never need to
adjust it.
To mark a specific OSD ``down``, run the following command:
.. prompt:: bash $
ceph osd down {osd-num}
Mark OSD ``{osd-num}`` out of the distribution (i.e. allocated no data). :
To mark a specific OSD ``out`` (so that no data will be allocated to it), run
the following command:
.. prompt:: bash $
ceph osd out {osd-num}
Mark ``{osd-num}`` in the distribution (i.e. allocated data). :
To mark a specific OSD ``in`` (so that data will be allocated to it), run the
following command:
.. prompt:: bash $
ceph osd in {osd-num}
Set or clear the pause flags in the OSD map. If set, no IO requests
will be sent to any OSD. Clearing the flags via unpause results in
resending pending requests. :
By using the "pause flags" in the OSD map, you can pause or unpause I/O
requests. If the flags are set, then no I/O requests will be sent to any OSD.
When the flags are cleared, then pending I/O requests will be resent. To set or
clear pause flags, run one of the following commands:
.. prompt:: bash $
ceph osd pause
ceph osd unpause
Set the override weight (reweight) of ``{osd-num}`` to ``{weight}``. Two OSDs with the
same weight will receive roughly the same number of I/O requests and
store approximately the same amount of data. ``ceph osd reweight``
sets an override weight on the OSD. This value is in the range 0 to 1,
and forces CRUSH to re-place (1-weight) of the data that would
otherwise live on this drive. It does not change weights assigned
to the buckets above the OSD in the crush map, and is a corrective
measure in case the normal CRUSH distribution is not working out quite
right. For instance, if one of your OSDs is at 90% and the others are
at 50%, you could reduce this weight to compensate. :
You can assign an override or ``reweight`` weight value to a specific OSD if
the normal CRUSH distribution seems to be suboptimal. The weight of an OSD
helps determine the extent of its I/O requests and data storage: two OSDs with
the same weight will receive approximately the same number of I/O requests and
store approximately the same amount of data. The ``ceph osd reweight`` command
assigns an override weight to an OSD. The weight value is in the range 0 to 1,
and the command forces CRUSH to relocate a certain amount (1 - ``weight``) of
the data that would otherwise be on this OSD. The command does not change the
weights of the buckets above the OSD in the CRUSH map. Using the command is
merely a corrective measure: for example, if one of your OSDs is at 90% and the
others are at 50%, you could reduce the outlier weight to correct this
imbalance. To assign an override weight to a specific OSD, run the following
command:
.. prompt:: bash $
ceph osd reweight {osd-num} {weight}
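For example, if ``ceph osd df`` shows that a hypothetical ``osd.7`` is far
more utilized than its peers, and the balancer is not managing weights for
you, you might temporarily shift some data off that OSD:

.. prompt:: bash $

   ceph osd df
   ceph osd reweight 7 0.85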
Balance OSD fullness by reducing the override weight of OSDs which are
overly utilized. Note that these override aka ``reweight`` values
default to 1.00000 and are relative only to each other; they not absolute.
It is crucial to distinguish them from CRUSH weights, which reflect the
absolute capacity of a bucket in TiB. By default this command adjusts
override weight on OSDs which have + or - 20% of the average utilization,
but if you include a ``threshold`` that percentage will be used instead. :
.. note:: Any assigned override reweight value will conflict with the balancer.
This means that if the balancer is in use, all override reweight values
should be ``1.0000`` in order to avoid suboptimal cluster behavior.
A cluster's OSDs can be reweighted in order to maintain balance if some OSDs
are being disproportionately utilized. Note that override or ``reweight``
weights have values relative to one another that default to 1.00000; their
values are not absolute, and these weights must be distinguished from CRUSH
weights (which reflect the absolute capacity of a bucket, as measured in TiB).
To reweight OSDs by utilization, run the following command:
.. prompt:: bash $
ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing]
To limit the step by which any OSD's reweight will be changed, specify
``max_change`` which defaults to 0.05. To limit the number of OSDs that will
be adjusted, specify ``max_osds`` as well; the default is 4. Increasing these
parameters can speed leveling of OSD utilization, at the potential cost of
greater impact on client operations due to more data moving at once.
By default, this command adjusts the override weight of OSDs whose utilization
deviates from the average utilization by 20% or more, but you can specify a
different percentage in the ``threshold`` argument.
To determine which and how many PGs and OSDs will be affected by a given invocation
you can test before executing. :
To limit the increment by which any OSD's reweight is to be changed, use the
``max_change`` argument (default: 0.05). To limit the number of OSDs that are
to be adjusted, use the ``max_osds`` argument (default: 4). Increasing these
variables can accelerate the reweighting process, but perhaps at the cost of
slower client operations (as a result of the increase in data movement).
You can test the ``osd reweight-by-utilization`` command before running it. To
find out which and how many PGs and OSDs will be affected by a specific use of
the ``osd reweight-by-utilization`` command, run the following command:
.. prompt:: bash $
ceph osd test-reweight-by-utilization [threshold [max_change max_osds]] [--no-increasing]
Adding ``--no-increasing`` to either command prevents increasing any
override weights that are currently < 1.00000. This can be useful when
you are balancing in a hurry to remedy ``full`` or ``nearful`` OSDs or
when some OSDs are being evacuated or slowly brought into service.
The ``--no-increasing`` option can be added to the ``reweight-by-utilization``
and ``test-reweight-by-utilization`` commands in order to prevent any override
weights that are currently less than 1.00000 from being increased. This option
can be useful in certain circumstances: for example, when you are hastily
balancing in order to remedy ``full`` or ``nearfull`` OSDs, or when there are
OSDs being evacuated or slowly brought into service.
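For example, the following dry run (the numbers here are only illustrative)
reports what would change if OSDs more than 15% above the average utilization
were adjusted, changing each weight by at most 0.05, touching at most 8 OSDs,
and never increasing a weight:

.. prompt:: bash $

   ceph osd test-reweight-by-utilization 115 0.05 8 --no-increasing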
Deployments utilizing Nautilus (or later revisions of Luminous and Mimic)
that have no pre-Luminous cients may instead wish to instead enable the
`balancer`` module for ``ceph-mgr``.
Operators of deployments that run Nautilus or newer (or later revisions of
Luminous and Mimic) and that have no pre-Luminous clients may instead want to
enable the ``balancer`` module for ``ceph-mgr``.
Add/remove an IP address or CIDR range to/from the blocklist.
When adding to the blocklist,
you can specify how long it should be blocklisted in seconds; otherwise,
it will default to 1 hour. A blocklisted address is prevented from
connecting to any OSD. If you blocklist an IP or range containing an OSD, be aware
that OSD will also be prevented from performing operations on its peers where it
acts as a client. (This includes tiering and copy-from functionality.)
If you want to blocklist a range (in CIDR format), you may do so by
including the ``range`` keyword.
These commands are mostly only useful for failure testing, as
blocklists are normally maintained automatically and shouldn't need
manual intervention. :
The blocklist can be modified by adding or removing an IP address or a CIDR
range. If an address is blocklisted, it will be unable to connect to any OSD.
If an OSD is contained within an IP address or CIDR range that has been
blocklisted, the OSD will be unable to perform operations on its peers when it
acts as a client: such blocked operations include tiering and copy-from
functionality. To add an IP address or CIDR range to the blocklist, or to
remove one from the blocklist, run one of the following commands:
.. prompt:: bash $
ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME]
ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits]
Creates/deletes a snapshot of a pool. :
If you add something to the blocklist with the above ``add`` command, you can
use the ``TIME`` keyword to specify the length of time (in seconds) that it
will remain on the blocklist (default: one hour). To add or remove a CIDR
range, use the ``range`` keyword in the above commands.
Note that these commands are useful primarily in failure testing. Under normal
conditions, blocklists are maintained automatically and do not need any manual
intervention.
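For example (using a placeholder address), the following commands blocklist a
client address for ten minutes, list the current blocklist entries, and then
remove the entry:

.. prompt:: bash $

   ceph osd blocklist add 198.51.100.7 600
   ceph osd blocklist ls
   ceph osd blocklist rm 198.51.100.7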
To create or delete a snapshot of a specific storage pool, run one of the
following commands:
.. prompt:: bash $
ceph osd pool mksnap {pool-name} {snap-name}
ceph osd pool rmsnap {pool-name} {snap-name}
Creates/deletes/renames a storage pool. :
To create, delete, or rename a specific storage pool, run one of the following
commands:
.. prompt:: bash $
@ -350,20 +401,20 @@ Creates/deletes/renames a storage pool. :
ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
ceph osd pool rename {old-name} {new-name}
Changes a pool setting. :
To change a pool setting, run the following command:
.. prompt:: bash $
ceph osd pool set {pool-name} {field} {value}
Valid fields are:
The following are valid fields:
* ``size``: Sets the number of copies of data in the pool.
* ``pg_num``: The placement group number.
* ``pgp_num``: Effective number when calculating pg placement.
* ``crush_rule``: rule number for mapping placement.
* ``size``: The number of copies of data in the pool.
* ``pg_num``: The PG number.
* ``pgp_num``: The effective number of PGs when calculating placement.
* ``crush_rule``: The rule number for mapping placement.
Get the value of a pool setting. :
To retrieve the value of a pool setting, run the following command:
.. prompt:: bash $
@ -371,40 +422,43 @@ Get the value of a pool setting. :
Valid fields are:
* ``pg_num``: The placement group number.
* ``pgp_num``: Effective number of placement groups when calculating placement.
* ``pg_num``: The PG number.
* ``pgp_num``: The effective number of PGs when calculating placement.
Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. :
To send a scrub command to a specific OSD, or to all OSDs (by using ``*``), run
the following command:
.. prompt:: bash $
ceph osd scrub {osd-num}
Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. :
To send a repair command to a specific OSD, or to all OSDs (by using ``*``),
run the following command:
.. prompt:: bash $
ceph osd repair N
Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES``
in write requests of ``BYTES_PER_WRITE`` each. By default, the test
writes 1 GB in total in 4-MB increments.
The benchmark is non-destructive and will not overwrite existing live
OSD data, but might temporarily affect the performance of clients
concurrently accessing the OSD. :
You can run a simple throughput benchmark test against a specific OSD. This
test writes a total size of ``TOTAL_DATA_BYTES`` (default: 1 GB) incrementally,
in multiple write requests that each have a size of ``BYTES_PER_WRITE``
(default: 4 MB). The test is not destructive and it will not overwrite existing
live OSD data, but it might temporarily affect the performance of clients that
are concurrently accessing the OSD. To launch this benchmark test, run the
following command:
.. prompt:: bash $
ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
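For example, to benchmark a hypothetical ``osd.3`` with the defaults spelled
out explicitly (1 GB written in 4 MB requests):

.. prompt:: bash $

   ceph tell osd.3 bench 1073741824 4194304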
To clear an OSD's caches between benchmark runs, use the 'cache drop' command :
To clear the caches of a specific OSD during the interval between one benchmark
run and another, run the following command:
.. prompt:: bash $
ceph tell osd.N cache drop
To get the cache statistics of an OSD, use the 'cache status' command :
To retrieve the cache statistics of a specific OSD, run the following command:
.. prompt:: bash $
@ -413,7 +467,8 @@ To get the cache statistics of an OSD, use the 'cache status' command :
MDS Subsystem
=============
Change configuration parameters on a running mds. :
To change the configuration parameters of a running metadata server, run the
following command:
.. prompt:: bash $
@ -425,19 +480,20 @@ Example:
ceph tell mds.0 config set debug_ms 1
The above example enables debug messages on that MDS.

To display the status of all metadata servers, run the following command:
.. prompt:: bash $
ceph mds stat
To mark the active metadata server as failed (and to trigger failover to a
standby if a standby is present), run the following command:

.. prompt:: bash $

   ceph mds fail 0
.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
@ -445,157 +501,165 @@ Marks the active MDS as failed, triggering failover to a standby if present.
Mon Subsystem
=============
Show monitor stats:
To display monitor statistics, run the following command:
.. prompt:: bash $
ceph mon stat
This command returns output similar to the following:
::
e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
There is a ``quorum`` list at the end of the output. It lists those monitor
nodes that are part of the current quorum.
The ``quorum`` list at the end lists monitor nodes that are part of the current quorum.
This is also available more directly:
To retrieve this information in a more direct way, run the following command:
.. prompt:: bash $
ceph quorum_status -f json-pretty
.. code-block:: javascript
{
"election_epoch": 6,
"quorum": [
0,
1,
2
],
"quorum_names": [
"a",
"b",
"c"
],
"quorum_leader_name": "a",
"monmap": {
"epoch": 2,
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
"modified": "2016-12-26 14:42:09.288066",
"created": "2016-12-26 14:42:03.573585",
"features": {
"persistent": [
"kraken"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "a",
"addr": "127.0.0.1:40000\/0",
"public_addr": "127.0.0.1:40000\/0"
},
{
"rank": 1,
"name": "b",
"addr": "127.0.0.1:40001\/0",
"public_addr": "127.0.0.1:40001\/0"
},
{
"rank": 2,
"name": "c",
"addr": "127.0.0.1:40002\/0",
"public_addr": "127.0.0.1:40002\/0"
}
]
}
}
This command returns output similar to the following:
.. code-block:: javascript
{
"election_epoch": 6,
"quorum": [
0,
1,
2
],
"quorum_names": [
"a",
"b",
"c"
],
"quorum_leader_name": "a",
"monmap": {
"epoch": 2,
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
"modified": "2016-12-26 14:42:09.288066",
"created": "2016-12-26 14:42:03.573585",
"features": {
"persistent": [
"kraken"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "a",
"addr": "127.0.0.1:40000\/0",
"public_addr": "127.0.0.1:40000\/0"
},
{
"rank": 1,
"name": "b",
"addr": "127.0.0.1:40001\/0",
"public_addr": "127.0.0.1:40001\/0"
},
{
"rank": 2,
"name": "c",
"addr": "127.0.0.1:40002\/0",
"public_addr": "127.0.0.1:40002\/0"
}
]
}
}
The above will block until a quorum is reached.
For a status of just a single monitor:
To see the status of a specific monitor, run the following command:
.. prompt:: bash $
ceph tell mon.[name] mon_status
where the value of ``[name]`` can be taken from ``ceph quorum_status``. Sample
output::
{
"name": "b",
"rank": 1,
"state": "peon",
"election_epoch": 6,
"quorum": [
0,
1,
2
],
"features": {
"required_con": "9025616074522624",
"required_mon": [
"kraken"
],
"quorum_con": "1152921504336314367",
"quorum_mon": [
"kraken"
]
},
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 2,
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
"modified": "2016-12-26 14:42:09.288066",
"created": "2016-12-26 14:42:03.573585",
"features": {
"persistent": [
"kraken"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "a",
"addr": "127.0.0.1:40000\/0",
"public_addr": "127.0.0.1:40000\/0"
},
{
"rank": 1,
"name": "b",
"addr": "127.0.0.1:40001\/0",
"public_addr": "127.0.0.1:40001\/0"
},
{
"rank": 2,
"name": "c",
"addr": "127.0.0.1:40002\/0",
"public_addr": "127.0.0.1:40002\/0"
}
]
}
}
A dump of the monitor state:
Here the value of ``[name]`` can be found by consulting the output of the
``ceph quorum_status`` command. This command returns output similar to the
following:
::
{
"name": "b",
"rank": 1,
"state": "peon",
"election_epoch": 6,
"quorum": [
0,
1,
2
],
"features": {
"required_con": "9025616074522624",
"required_mon": [
"kraken"
],
"quorum_con": "1152921504336314367",
"quorum_mon": [
"kraken"
]
},
"outside_quorum": [],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 2,
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
"modified": "2016-12-26 14:42:09.288066",
"created": "2016-12-26 14:42:03.573585",
"features": {
"persistent": [
"kraken"
],
"optional": []
},
"mons": [
{
"rank": 0,
"name": "a",
"addr": "127.0.0.1:40000\/0",
"public_addr": "127.0.0.1:40000\/0"
},
{
"rank": 1,
"name": "b",
"addr": "127.0.0.1:40001\/0",
"public_addr": "127.0.0.1:40001\/0"
},
{
"rank": 2,
"name": "c",
"addr": "127.0.0.1:40002\/0",
"public_addr": "127.0.0.1:40002\/0"
}
]
}
}
To see a dump of the monitor state, run the following command:
.. prompt:: bash $
ceph mon dump
This command returns output similar to the following:
::
dumped monmap epoch 2
epoch 2
fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
last_changed 2016-12-26 14:42:09.288066
created 2016-12-26 14:42:03.573585
0: 127.0.0.1:40000/0 mon.a
1: 127.0.0.1:40001/0 mon.b
2: 127.0.0.1:40002/0 mon.c
dumped monmap epoch 2
epoch 2
fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
last_changed 2016-12-26 14:42:09.288066
created 2016-12-26 14:42:03.573585
0: 127.0.0.1:40000/0 mon.a
1: 127.0.0.1:40001/0 mon.b
2: 127.0.0.1:40002/0 mon.c

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@ -2,40 +2,45 @@
Data Placement Overview
=========================
Ceph stores, replicates and rebalances data objects across a RADOS cluster
dynamically. With many different users storing objects in different pools for
different purposes on countless OSDs, Ceph operations require some data
placement planning. The main data placement planning concepts in Ceph include:
Ceph stores, replicates, and rebalances data objects across a RADOS cluster
dynamically. Because different users store objects in different pools for
different purposes on many OSDs, Ceph operations require a certain amount of
data-placement planning. The main data-placement planning concepts in Ceph
include:
- **Pools:** Ceph stores data within pools, which are logical groups for storing
objects. Pools manage the number of placement groups, the number of replicas,
and the CRUSH rule for the pool. To store data in a pool, you must have
an authenticated user with permissions for the pool. Ceph can snapshot pools.
See `Pools`_ for additional details.
- **Pools:** Ceph stores data within pools, which are logical groups used for
storing objects. Pools manage the number of placement groups, the number of
replicas, and the CRUSH rule for the pool. To store data in a pool, it is
necessary to be an authenticated user with permissions for the pool. Ceph is
able to make snapshots of pools. For additional details, see `Pools`_.
- **Placement Groups:** Ceph maps objects to placement groups (PGs).
Placement groups (PGs) are shards or fragments of a logical object pool
that place objects as a group into OSDs. Placement groups reduce the amount
of per-object metadata when Ceph stores the data in OSDs. A larger number of
placement groups (e.g., 100 per OSD) leads to better balancing. See
:ref:`placement groups` for additional details.
- **Placement Groups:** Ceph maps objects to placement groups. Placement
groups (PGs) are shards or fragments of a logical object pool that place
objects as a group into OSDs. Placement groups reduce the amount of
per-object metadata that is necessary for Ceph to store the data in OSDs. A
greater number of placement groups (for example, 100 PGs per OSD as compared
with 50 PGs per OSD) leads to better balancing. For additional details, see
:ref:`placement groups`.
- **CRUSH Maps:** CRUSH is a big part of what allows Ceph to scale without
performance bottlenecks, without limitations to scalability, and without a
single point of failure. CRUSH maps provide the physical topology of the
cluster to the CRUSH algorithm to determine where the data for an object
and its replicas should be stored, and how to do so across failure domains
for added data safety among other things. See `CRUSH Maps`_ for additional
details.
- **CRUSH Maps:** CRUSH plays a major role in allowing Ceph to scale while
avoiding certain pitfalls, such as performance bottlenecks, limitations to
scalability, and single points of failure. CRUSH maps provide the physical
topology of the cluster to the CRUSH algorithm, so that it can determine both
(1) where the data for an object and its replicas should be stored and (2)
how to store that data across failure domains so as to improve data safety.
For additional details, see `CRUSH Maps`_.
- **Balancer:** The balancer is a feature that will automatically optimize the
distribution of PGs across devices to achieve a balanced data distribution,
maximizing the amount of data that can be stored in the cluster and evenly
distributing the workload across OSDs.
- **Balancer:** The balancer is a feature that automatically optimizes the
distribution of placement groups across devices in order to achieve a
balanced data distribution, in order to maximize the amount of data that can
be stored in the cluster, and in order to evenly distribute the workload
across OSDs.
When you initially set up a test cluster, you can use the default values. Once
you begin planning for a large Ceph cluster, refer to pools, placement groups
and CRUSH for data placement operations.
It is possible to use the default values for each of the above components.
Default values are recommended for a test cluster's initial setup. However,
when planning a large Ceph cluster, values should be customized for
data-placement operations with reference to the different roles played by
pools, placement groups, and CRUSH.
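As a quick orientation, the following commands show how these concepts surface on a running cluster: the pools that exist, the PG count and CRUSH rule of one pool, and the status of the balancer. This is a minimal sketch; the pool name ``mypool`` is an assumption.
.. prompt:: bash $
ceph osd lspools
ceph osd pool get mypool pg_num
ceph osd pool get mypool crush_rule
ceph balancer status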
.. _Pools: ../pools
.. _CRUSH Maps: ../crush-map

View File

@ -3,28 +3,32 @@
Device Management
=================
Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by
which daemons, and collects health metrics about those devices in order to
provide tools to predict and/or automatically respond to hardware failure.
Device management allows Ceph to address hardware failure. Ceph tracks hardware
storage devices (HDDs, SSDs) to see which devices are managed by which daemons.
Ceph also collects health metrics about these devices. By doing so, Ceph can
provide tools that predict hardware failure and can automatically respond to
hardware failure.
Device tracking
---------------
You can query which storage devices are in use with:
To see a list of the storage devices that are in use, run the following
command:
.. prompt:: bash $
ceph device ls
You can also list devices by daemon or by host:
Alternatively, to list devices by daemon or by host, run a command of one of
the following forms:
.. prompt:: bash $
ceph device ls-by-daemon <daemon>
ceph device ls-by-host <host>
For any individual device, you can query information about its
location and how it is being consumed with:
To see information about the location of a specific device and about how the
device is being consumed, run a command of the following form:
.. prompt:: bash $
@ -33,103 +37,107 @@ location and how it is being consumed with:
Identifying physical devices
----------------------------
You can blink the drive LEDs on hardware enclosures to make the replacement of
failed disks easy and less error-prone. Use the following command::
To make the replacement of failed disks easier and less error-prone, you can
(in some cases) "blink" the drive's LEDs on hardware enclosures by running a
command of the following form::
device light on|off <devid> [ident|fault] [--force]
The ``<devid>`` parameter is the device identification. You can obtain this
information using the following command:
.. note:: Using this command to blink the lights might not work. Whether it
works will depend upon such factors as your kernel revision, your SES
firmware, or the setup of your HBA.
The ``<devid>`` parameter is the device identification. To retrieve this
information, run the following command:
.. prompt:: bash $
ceph device ls
The ``[ident|fault]`` parameter is used to set the kind of light to blink.
By default, the `identification` light is used.
The ``[ident|fault]`` parameter determines which kind of light will blink. By
default, the `identification` light is used.
.. note::
This command needs the Cephadm or the Rook `orchestrator <https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_ module enabled.
The orchestrator module enabled is shown by executing the following command:
.. note:: This command works only if the Cephadm or the Rook `orchestrator
<https://docs.ceph.com/docs/master/mgr/orchestrator/#orchestrator-cli-module>`_
module is enabled. To see which orchestrator module is enabled, run the
following command:
.. prompt:: bash $
ceph orch status
The command behind the scene to blink the drive LEDs is `lsmcli`. If you need
to customize this command you can configure this via a Jinja2 template::
The command that makes the drive's LEDs blink is `lsmcli`. To customize this
command, configure it via a Jinja2 template by running commands of the
following forms::
ceph config-key set mgr/cephadm/blink_device_light_cmd "<template>"
ceph config-key set mgr/cephadm/<host>/blink_device_light_cmd "lsmcli local-disk-{{ ident_fault }}-led-{{'on' if on else 'off'}} --path '{{ path or dev }}'"
The Jinja2 template is rendered using the following arguments:
The following arguments can be used to customize the Jinja2 template:
* ``on``
A boolean value.
* ``ident_fault``
A string containing `ident` or `fault`.
A string that contains `ident` or `fault`.
* ``dev``
A string containing the device ID, e.g. `SanDisk_X400_M.2_2280_512GB_162924424784`.
A string that contains the device ID: for example, `SanDisk_X400_M.2_2280_512GB_162924424784`.
* ``path``
A string containing the device path, e.g. `/dev/sda`.
A string that contains the device path: for example, `/dev/sda`.
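For example, using the device ID shown above, the fault LED might be turned on and then off again. This is a sketch only, assuming the command is issued through the ``ceph`` CLI and that your enclosure supports it:
.. prompt:: bash $
ceph device light on SanDisk_X400_M.2_2280_512GB_162924424784 fault
ceph device light off SanDisk_X400_M.2_2280_512GB_162924424784 fault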
.. _enabling-monitoring:
Enabling monitoring
-------------------
Ceph can also monitor health metrics associated with your device. For
example, SATA hard disks implement a standard called SMART that
provides a wide range of internal metrics about the device's usage and
health, like the number of hours powered on, number of power cycles,
or unrecoverable read errors. Other device types like SAS and NVMe
implement a similar set of metrics (via slightly different standards).
All of these can be collected by Ceph via the ``smartctl`` tool.
Ceph can also monitor the health metrics associated with your device. For
example, SATA drives implement a standard called SMART that provides a wide
range of internal metrics about the device's usage and health (for example: the
number of hours powered on, the number of power cycles, the number of
unrecoverable read errors). Other device types such as SAS and NVMe present a
similar set of metrics (via slightly different standards). All of these
metrics can be collected by Ceph via the ``smartctl`` tool.
You can enable or disable health monitoring with:
You can enable or disable health monitoring by running one of the following
commands:
.. prompt:: bash $
ceph device monitoring on
or:
.. prompt:: bash $
ceph device monitoring off
Scraping
--------
If monitoring is enabled, metrics will automatically be scraped at regular intervals. That interval can be configured with:
If monitoring is enabled, device metrics will be scraped automatically at
regular intervals. To configure that interval, run a command of the following
form:
.. prompt:: bash $
ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>
The default is to scrape once every 24 hours.
By default, device metrics are scraped once every 24 hours.
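For example, to scrape twice a day instead, the interval might be halved as follows (a sketch; the value is expressed in seconds):
.. prompt:: bash $
ceph config set mgr mgr/devicehealth/scrape_frequency 43200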
You can manually trigger a scrape of all devices with:
To manually scrape all devices, run the following command:
.. prompt:: bash $
ceph device scrape-health-metrics
A single device can be scraped with:
To scrape a single device, run a command of the following form:
.. prompt:: bash $
ceph device scrape-health-metrics <device-id>
Or a single daemon's devices can be scraped with:
To scrape a single daemon's devices, run a command of the following form:
.. prompt:: bash $
ceph device scrape-daemon-health-metrics <who>
The stored health metrics for a device can be retrieved (optionally
for a specific timestamp) with:
To retrieve the stored health metrics for a device (optionally for a specific
timestamp), run a command of the following form:
.. prompt:: bash $
@ -138,71 +146,82 @@ for a specific timestamp) with:
Failure prediction
------------------
Ceph can predict life expectancy and device failures based on the
health metrics it collects. There are three modes:
Ceph can predict drive life expectancy and device failures by analyzing the
health metrics that it collects. The prediction modes are as follows:
* *none*: disable device failure prediction.
* *local*: use a pre-trained prediction model from the ceph-mgr daemon
* *local*: use a pre-trained prediction model from the ``ceph-mgr`` daemon.
The prediction mode can be configured with:
To configure the prediction mode, run a command of the following form:
.. prompt:: bash $
ceph config set global device_failure_prediction_mode <mode>
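For example, to use the pre-trained model shipped with the ``ceph-mgr`` daemon, set the mode to ``local`` (one of the modes listed above):
.. prompt:: bash $
ceph config set global device_failure_prediction_mode local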
Prediction normally runs in the background on a periodic basis, so it
may take some time before life expectancy values are populated. You
can see the life expectancy of all devices in output from:
Under normal conditions, failure prediction runs periodically in the
background. For this reason, life expectancy values might be populated only
after a significant amount of time has passed. The life expectancy of all
devices is displayed in the output of the following command:
.. prompt:: bash $
ceph device ls
You can also query the metadata for a specific device with:
To see the metadata of a specific device, run a command of the following form:
.. prompt:: bash $
ceph device info <devid>
You can explicitly force prediction of a device's life expectancy with:
To explicitly force prediction of a specific device's life expectancy, run a
command of the following form:
.. prompt:: bash $
ceph device predict-life-expectancy <devid>
If you are not using Ceph's internal device failure prediction but
have some external source of information about device failures, you
can inform Ceph of a device's life expectancy with:
In addition to Ceph's internal device failure prediction, you might have an
external source of information about device failures. To inform Ceph of a
specific device's life expectancy, run a command of the following form:
.. prompt:: bash $
ceph device set-life-expectancy <devid> <from> [<to>]
Life expectancies are expressed as a time interval so that
uncertainty can be expressed in the form of a wide interval. The
interval end can also be left unspecified.
Life expectancies are expressed as a time interval. This means that the
uncertainty of the life expectancy can be expressed in the form of a range of
time, and perhaps a wide range of time. The interval's end can be left
unspecified.
Health alerts
-------------
The ``mgr/devicehealth/warn_threshold`` controls how soon an expected
device failure must be before we generate a health warning.
The ``mgr/devicehealth/warn_threshold`` configuration option controls the
health check for an expected device failure. If the device is expected to fail
within the specified time interval, an alert is raised.
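For example, to raise the warning only when a device is predicted to fail within the next four weeks, the threshold might be set as follows. This is a sketch; the value is an assumption, expressed in seconds, and follows the same pattern as the other ``mgr/devicehealth`` options:
.. prompt:: bash $
ceph config set mgr mgr/devicehealth/warn_threshold 2419200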
The stored life expectancy of all devices can be checked, and any
appropriate health alerts generated, with:
To check the stored life expectancy of all devices and generate any appropriate
health alert, run the following command:
.. prompt:: bash $
ceph device check-health
Automatic Mitigation
--------------------
Automatic Migration
-------------------
If the ``mgr/devicehealth/self_heal`` option is enabled (it is by
default), then for devices that are expected to fail soon the module
will automatically migrate data away from them by marking the devices
"out".
The ``mgr/devicehealth/self_heal`` option (enabled by default) automatically
migrates data away from devices that are expected to fail soon. If this option
is enabled, the module marks such devices ``out`` so that automatic migration
will occur.
The ``mgr/devicehealth/mark_out_threshold`` controls how soon an
expected device failure must be before we automatically mark an osd
"out".
.. note:: The ``mon_osd_min_up_ratio`` configuration option can help prevent
this process from cascading to total failure. If the "self heal" module
marks ``out`` so many OSDs that the ratio value of ``mon_osd_min_up_ratio``
is exceeded, then the cluster raises the ``DEVICE_HEALTH_TOOMANY`` health
check. For instructions on what to do in this situation, see
:ref:`DEVICE_HEALTH_TOOMANY<rados_health_checks_device_health_toomany>`.
The ``mgr/devicehealth/mark_out_threshold`` configuration option specifies the
time interval for automatic migration. If a device is expected to fail within
the specified time interval, it will be automatically marked ``out``.
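A sketch of adjusting this interval to two weeks follows; the value is an assumption, expressed in seconds:
.. prompt:: bash $
ceph config set mgr mgr/devicehealth/mark_out_threshold 1209600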

View File

@ -6,9 +6,11 @@ The *jerasure* plugin is the most generic and flexible plugin, it is
also the default for Ceph erasure coded pools.
The *jerasure* plugin encapsulates the `Jerasure
<http://jerasure.org>`_ library. It is
recommended to read the *jerasure* documentation to get a better
understanding of the parameters.
<https://github.com/ceph/jerasure>`_ library. It is
recommended to read the ``jerasure`` documentation to
understand the parameters. Note that the ``jerasure.org``
web site as of 2023 may no longer be connected to the original
project or legitimate.
Create a jerasure profile
=========================

View File

@ -110,6 +110,8 @@ To remove an erasure code profile::
If the profile is referenced by a pool, the deletion will fail.
.. warning:: Removing an erasure code profile using ``osd erasure-code-profile rm`` does not automatically delete the CRUSH rule associated with the erasure code profile. It is recommended to manually remove the associated CRUSH rule using ``ceph osd crush rule remove {rule-name}`` to avoid unexpected behavior.
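A sketch of the full cleanup, assuming a profile named ``myprofile`` whose associated CRUSH rule happens to carry the same name (rule names do not always match profile names, so check with ``ceph osd crush rule ls`` first):
.. prompt:: bash $
ceph osd crush rule ls
ceph osd erasure-code-profile rm myprofile
ceph osd crush rule remove myprofile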
osd erasure-code-profile get
============================

View File

@ -1,14 +1,14 @@
.. _ecpool:
=============
==============
Erasure code
=============
==============
By default, Ceph `pools <../pools>`_ are created with the type "replicated". In
replicated-type pools, every object is copied to multiple disks (this
multiple copying is the "replication").
replicated-type pools, every object is copied to multiple disks. This
multiple copying is the method of data protection known as "replication".
In contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
By contrast, `erasure-coded <https://en.wikipedia.org/wiki/Erasure_code>`_
pools use a method of data protection that is different from replication. In
erasure coding, data is broken into fragments of two kinds: data blocks and
parity blocks. If a drive fails or becomes corrupted, the parity blocks are
@ -16,17 +16,17 @@ used to rebuild the data. At scale, erasure coding saves space relative to
replication.
In this documentation, data blocks are referred to as "data chunks"
and parity blocks are referred to as "encoding chunks".
and parity blocks are referred to as "coding chunks".
Erasure codes are also called "forward error correction codes". The
first forward error correction code was developed in 1950 by Richard
Hamming at Bell Laboratories.
Creating a sample erasure coded pool
Creating a sample erasure-coded pool
------------------------------------
The simplest erasure coded pool is equivalent to `RAID5
The simplest erasure-coded pool is similar to `RAID5
<https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5>`_ and
requires at least three hosts:
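A minimal sketch of creating and exercising such a pool follows; the pool and object names are assumptions chosen to match the example data shown below:
.. prompt:: bash $
ceph osd pool create ecpool erasure
echo ABCDEFGHI | rados --pool ecpool put NYAN -
rados --pool ecpool get NYAN -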
@ -47,12 +47,13 @@ requires at least three hosts:
ABCDEFGHI
Erasure code profiles
Erasure-code profiles
---------------------
The default erasure code profile can sustain the loss of two OSDs. This erasure
code profile is equivalent to a replicated pool of size three, but requires
2TB to store 1TB of data instead of 3TB to store 1TB of data. The default
The default erasure-code profile can sustain the overlapping loss of two OSDs
without losing data. This erasure-code profile is equivalent to a replicated
pool of size three, but with different storage requirements: instead of
requiring 3TB to store 1TB, it requires only 2TB to store 1TB. The default
profile can be displayed with this command:
.. prompt:: bash $
@ -68,26 +69,27 @@ profile can be displayed with this command:
technique=reed_sol_van
.. note::
The default erasure-coded pool, the profile of which is displayed here, is
not the same as the simplest erasure-coded pool.
The default erasure-coded pool has two data chunks (k) and two coding chunks
(m). The profile of the default erasure-coded pool is "k=2 m=2".
The profile just displayed is for the *default* erasure-coded pool, not the
*simplest* erasure-coded pool. These two pools are not the same:
The simplest erasure-coded pool has two data chunks (k) and one coding chunk
(m). The profile of the simplest erasure-coded pool is "k=2 m=1".
The default erasure-coded pool has two data chunks (K) and two coding chunks
(M). The profile of the default erasure-coded pool is "k=2 m=2".
The simplest erasure-coded pool has two data chunks (K) and one coding chunk
(M). The profile of the simplest erasure-coded pool is "k=2 m=1".
Choosing the right profile is important because the profile cannot be modified
after the pool is created. If you find that you need an erasure-coded pool with
a profile different than the one you have created, you must create a new pool
with a different (and presumably more carefully-considered) profile. When the
new pool is created, all objects from the wrongly-configured pool must be moved
to the newly-created pool. There is no way to alter the profile of a pool after its creation.
with a different (and presumably more carefully considered) profile. When the
new pool is created, all objects from the wrongly configured pool must be moved
to the newly created pool. There is no way to alter the profile of a pool after
the pool has been created.
The most important parameters of the profile are *K*, *M* and
The most important parameters of the profile are *K*, *M*, and
*crush-failure-domain* because they define the storage overhead and
the data durability. For example, if the desired architecture must
sustain the loss of two racks with a storage overhead of 67% overhead,
sustain the loss of two racks with a storage overhead of 67%,
the following profile can be defined:
.. prompt:: bash $
@ -106,7 +108,7 @@ the following profile can be defined:
The *NYAN* object will be divided in three (*K=3*) and two additional
*chunks* will be created (*M=2*). The value of *M* defines how many
OSD can be lost simultaneously without losing any data. The
OSDs can be lost simultaneously without losing any data. The
*crush-failure-domain=rack* will create a CRUSH rule that ensures
no two *chunks* are stored in the same rack.
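A sketch of defining such a profile and creating a pool that uses it; the profile and pool names are assumptions:
.. prompt:: bash $
ceph osd erasure-code-profile set myrackprofile k=3 m=2 crush-failure-domain=rack
ceph osd pool create ecpool_rack erasure myrackprofile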
@ -155,19 +157,19 @@ no two *chunks* are stored in the same rack.
+------+
More information can be found in the `erasure code profiles
More information can be found in the `erasure-code profiles
<../erasure-code-profile>`_ documentation.
Erasure Coding with Overwrites
------------------------------
By default, erasure coded pools only work with uses like RGW that
perform full object writes and appends.
By default, erasure-coded pools work only with operations that
perform full object writes and appends (for example, RGW).
Since Luminous, partial writes for an erasure coded pool may be
Since Luminous, partial writes for an erasure-coded pool may be
enabled with a per-pool setting. This lets RBD and CephFS store their
data in an erasure coded pool:
data in an erasure-coded pool:
.. prompt:: bash $
@ -175,31 +177,33 @@ data in an erasure coded pool:
This can be enabled only on a pool residing on BlueStore OSDs, since
BlueStore's checksumming is used during deep scrubs to detect bitrot
or other corruption. In addition to being unsafe, using Filestore with
EC overwrites results in lower performance compared to BlueStore.
or other corruption. Using Filestore with EC overwrites is not only
unsafe, but it also results in lower performance compared to BlueStore.
Erasure coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool, and
Erasure-coded pools do not support omap, so to use them with RBD and
CephFS you must instruct them to store their data in an EC pool and
their metadata in a replicated pool. For RBD, this means using the
erasure coded pool as the ``--data-pool`` during image creation:
erasure-coded pool as the ``--data-pool`` during image creation:
.. prompt:: bash $
rbd create --size 1G --data-pool ec_pool replicated_pool/image_name
For CephFS, an erasure coded pool can be set as the default data pool during
For CephFS, an erasure-coded pool can be set as the default data pool during
file system creation or via `file layouts <../../../cephfs/file-layouts>`_.
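A sketch of the file-layout approach, assuming a file system named ``cephfs``, an erasure-coded pool named ``ec_pool``, and a CephFS mount at ``/mnt/cephfs`` (all of these names are assumptions):
.. prompt:: bash $
ceph osd pool set ec_pool allow_ec_overwrites true
ceph fs add_data_pool cephfs ec_pool
mkdir /mnt/cephfs/ecdir
setfattr -n ceph.dir.layout.pool -v ec_pool /mnt/cephfs/ecdir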
Erasure coded pool and cache tiering
------------------------------------
Erasure-coded pools and cache tiering
-------------------------------------
Erasure coded pools require more resources than replicated pools and
lack some functionality such as omap. To overcome these
limitations, one can set up a `cache tier <../cache-tiering>`_
before the erasure coded pool.
Erasure-coded pools require more resources than replicated pools and
lack some of the functionality supported by replicated pools (for example, omap).
To overcome these limitations, one can set up a `cache tier <../cache-tiering>`_
before setting up the erasure-coded pool.
For instance, if the pool *hot-storage* is made of fast storage:
For example, if the pool *hot-storage* is made of fast storage, the following commands
will place the *hot-storage* pool as a tier of *ecpool* in *writeback*
mode:
.. prompt:: bash $
@ -207,58 +211,60 @@ For instance, if the pool *hot-storage* is made of fast storage:
ceph osd tier cache-mode hot-storage writeback
ceph osd tier set-overlay ecpool hot-storage
will place the *hot-storage* pool as tier of *ecpool* in *writeback*
mode so that every write and read to the *ecpool* are actually using
the *hot-storage* and benefit from its flexibility and speed.
The result is that every write and read to the *ecpool* actually uses
the *hot-storage* pool and benefits from its flexibility and speed.
More information can be found in the `cache tiering
<../cache-tiering>`_ documentation. Note however that cache tiering
<../cache-tiering>`_ documentation. Note, however, that cache tiering
is deprecated and may be removed completely in a future release.
Erasure coded pool recovery
Erasure-coded pool recovery
---------------------------
If an erasure coded pool loses some data shards, it must recover them from others.
This involves reading from the remaining shards, reconstructing the data, and
If an erasure-coded pool loses any data shards, it must recover them from others.
This recovery involves reading from the remaining shards, reconstructing the data, and
writing new shards.
In Octopus and later releases, erasure-coded pools can recover as long as there are at least *K* shards
available. (With fewer than *K* shards, you have actually lost data!)
Prior to Octopus, erasure coded pools required at least ``min_size`` shards to be
available, even if ``min_size`` is greater than ``K``. We recommend ``min_size``
be ``K+2`` or more to prevent loss of writes and data.
This conservative decision was made out of an abundance of caution when
designing the new pool mode. As a result pools with lost OSDs but without
complete loss of any data were unable to recover and go active
without manual intervention to temporarily change the ``min_size`` setting.
Prior to Octopus, erasure-coded pools required that at least ``min_size`` shards be
available, even if ``min_size`` was greater than ``K``. This was a conservative
decision made out of an abundance of caution when designing the new pool
mode. As a result, however, pools with lost OSDs but without complete data loss were
unable to recover and go active without manual intervention to temporarily change
the ``min_size`` setting.
We recommend that ``min_size`` be ``K+1`` or greater to prevent loss of writes and
loss of data.
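For example, for a pool created with ``k=4 m=2``, a ``min_size`` of ``K+1 = 5`` could be set as follows (a sketch; the pool name is an assumption):
.. prompt:: bash $
ceph osd pool set ecpool min_size 5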
Glossary
--------
*chunk*
when the encoding function is called, it returns chunks of the same
size. Data chunks which can be concatenated to reconstruct the original
object and coding chunks which can be used to rebuild a lost chunk.
When the encoding function is called, it returns chunks of the same size as each other. There are two
kinds of chunks: (1) *data chunks*, which can be concatenated to reconstruct the original object, and
(2) *coding chunks*, which can be used to rebuild a lost chunk.
*K*
the number of data *chunks*, i.e. the number of *chunks* in which the
original object is divided. For instance if *K* = 2 a 10KB object
will be divided into *K* objects of 5KB each.
The number of data chunks into which an object is divided. For example, if *K* = 2, then a 10KB object
is divided into two chunks of 5KB each.
*M*
the number of coding *chunks*, i.e. the number of additional *chunks*
computed by the encoding functions. If there are 2 coding *chunks*,
it means 2 OSDs can be out without losing data.
The number of coding chunks computed by the encoding function. *M* is equal to the number of OSDs that can
be missing from the cluster without the cluster suffering data loss. For example, if there are two coding
chunks, then two OSDs can be missing without data loss.
Table of content
----------------
Table of contents
-----------------
.. toctree::
:maxdepth: 1
:maxdepth: 1
erasure-code-profile
erasure-code-jerasure
erasure-code-isa
erasure-code-lrc
erasure-code-shec
erasure-code-clay
erasure-code-profile
erasure-code-jerasure
erasure-code-isa
erasure-code-lrc
erasure-code-shec
erasure-code-clay

File diff suppressed because it is too large Load Diff

View File

@ -3,35 +3,38 @@
=========================
High availability and high reliability require a fault-tolerant approach to
managing hardware and software issues. Ceph has no single point-of-failure, and
can service requests for data in a "degraded" mode. Ceph's `data placement`_
introduces a layer of indirection to ensure that data doesn't bind directly to
particular OSD addresses. This means that tracking down system faults requires
finding the `placement group`_ and the underlying OSDs at root of the problem.
managing hardware and software issues. Ceph has no single point of failure and
it can service requests for data even when in a "degraded" mode. Ceph's `data
placement`_ introduces a layer of indirection to ensure that data doesn't bind
directly to specific OSDs. For this reason, tracking system faults
requires finding the `placement group`_ (PG) and the underlying OSDs at the
root of the problem.
.. tip:: A fault in one part of the cluster may prevent you from accessing a
particular object, but that doesn't mean that you cannot access other objects.
When you run into a fault, don't panic. Just follow the steps for monitoring
your OSDs and placement groups. Then, begin troubleshooting.
.. tip:: A fault in one part of the cluster might prevent you from accessing a
particular object, but that doesn't mean that you are prevented from
accessing other objects. When you run into a fault, don't panic. Just
follow the steps for monitoring your OSDs and placement groups, and then
begin troubleshooting.
Ceph is generally self-repairing. However, when problems persist, monitoring
OSDs and placement groups will help you identify the problem.
Ceph is self-repairing. However, when problems persist, monitoring OSDs and
placement groups will help you identify the problem.
Monitoring OSDs
===============
An OSD's status is either in the cluster (``in``) or out of the cluster
(``out``); and, it is either up and running (``up``), or it is down and not
running (``down``). If an OSD is ``up``, it may be either ``in`` the cluster
(you can read and write data) or it is ``out`` of the cluster. If it was
``in`` the cluster and recently moved ``out`` of the cluster, Ceph will migrate
placement groups to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will
not assign placement groups to the OSD. If an OSD is ``down``, it should also be
``out``.
An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is
either running and reachable (``up``), or it is not running and not
reachable (``down``).
.. note:: If an OSD is ``down`` and ``in``, there is a problem and the cluster
will not be in a healthy state.
If an OSD is ``up``, it may be either ``in`` service (clients can read and
write data) or it is ``out`` of service. If the OSD was ``in`` but then, due to a failure or a manual action, was set to the ``out`` state, Ceph will migrate placement groups to other OSDs in order to maintain the configured redundancy.
If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
If an OSD is ``down``, it will also be ``out``.
.. note:: If an OSD is ``down`` and ``in``, there is a problem and this
indicates that the cluster is not in a healthy state.
.. ditaa::
@ -50,129 +53,128 @@ not assign placement groups to the OSD. If an OSD is ``down``, it should also be
| | | |
+----------------+ +----------------+
If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
you may notice that the cluster does not always echo back ``HEALTH OK``. Don't
panic. With respect to OSDs, you should expect that the cluster will **NOT**
echo ``HEALTH OK`` in a few expected circumstances:
If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. Don't
panic. There are certain circumstances in which it is expected and normal that
the cluster will **NOT** show ``HEALTH OK``:
#. You haven't started the cluster yet (it won't respond).
#. You have just started or restarted the cluster and it's not ready yet,
because the placement groups are getting created and the OSDs are in
the process of peering.
#. You just added or removed an OSD.
#. You just have modified your cluster map.
#. You haven't started the cluster yet.
#. You have just started or restarted the cluster and it's not ready to show
health statuses yet, because the PGs are in the process of being created and
the OSDs are in the process of peering.
#. You have just added or removed an OSD.
#. You have just modified your cluster map.
An important aspect of monitoring OSDs is to ensure that when the cluster
is up and running that all OSDs that are ``in`` the cluster are ``up`` and
running, too. To see if all OSDs are running, execute:
Checking to see if OSDs are ``up`` and running is an important aspect of monitoring them:
whenever the cluster is up and running, every OSD that is ``in`` the cluster should also
be ``up`` and running. To see if all of the cluster's OSDs are running, run the following
command:
.. prompt:: bash $
ceph osd stat
ceph osd stat
The result should tell you the total number of OSDs (x),
how many are ``up`` (y), how many are ``in`` (z) and the map epoch (eNNNN). ::
The output provides the following information: the total number of OSDs (x),
how many OSDs are ``up`` (y), how many OSDs are ``in`` (z), and the map epoch (eNNNN). ::
x osds: y up, z in; epoch: eNNNN
x osds: y up, z in; epoch: eNNNN
If the number of OSDs that are ``in`` the cluster is more than the number of
OSDs that are ``up``, execute the following command to identify the ``ceph-osd``
If the number of OSDs that are ``in`` the cluster is greater than the number of
OSDs that are ``up``, run the following command to identify the ``ceph-osd``
daemons that are not running:
.. prompt:: bash $
ceph osd tree
ceph osd tree
::
#ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 2.00000 pool openstack
-3 2.00000 rack dell-2950-rack-A
-2 2.00000 host dell-2950-A1
0 ssd 1.00000 osd.0 up 1.00000 1.00000
1 ssd 1.00000 osd.1 down 1.00000 1.00000
#ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 2.00000 pool openstack
-3 2.00000 rack dell-2950-rack-A
-2 2.00000 host dell-2950-A1
0 ssd 1.00000 osd.0 up 1.00000 1.00000
1 ssd 1.00000 osd.1 down 1.00000 1.00000
.. tip:: The ability to search through a well-designed CRUSH hierarchy may help
you troubleshoot your cluster by identifying the physical locations faster.
.. tip:: Searching through a well-designed CRUSH hierarchy to identify the physical
locations of particular OSDs might help you troubleshoot your cluster.
If an OSD is ``down``, start it:
If an OSD is ``down``, start it by running the following command:
.. prompt:: bash $
sudo systemctl start ceph-osd@1
sudo systemctl start ceph-osd@1
For problems associated with OSDs that have stopped or won't restart, see `OSD Not Running`_.
See `OSD Not Running`_ for problems associated with OSDs that have stopped or
won't restart.
PG Sets
=======
When CRUSH assigns placement groups to OSDs, it looks at the number of replicas
for the pool and assigns the placement group to OSDs such that each replica of
the placement group gets assigned to a different OSD. For example, if the pool
requires three replicas of a placement group, CRUSH may assign them to
``osd.1``, ``osd.2`` and ``osd.3`` respectively. CRUSH actually seeks a
pseudo-random placement that will take into account failure domains you set in
your `CRUSH map`_, so you will rarely see placement groups assigned to nearest
neighbor OSDs in a large cluster.
When CRUSH assigns a PG to OSDs, it takes note of how many replicas of the PG
are required by the pool and then assigns each replica to a different OSD.
For example, if the pool requires three replicas of a PG, CRUSH might assign
them individually to ``osd.1``, ``osd.2`` and ``osd.3``. CRUSH seeks a
pseudo-random placement that takes into account the failure domains that you
have set in your `CRUSH map`_; for this reason, PGs are rarely assigned to
immediately adjacent OSDs in a large cluster.
Ceph processes a client request using the **Acting Set**, which is the set of
OSDs that will actually handle the requests since they have a full and working
version of a placement group shard. The set of OSDs that should contain a shard
of a particular placement group as the **Up Set**, i.e. where data is
moved/copied to (or planned to be).
Ceph processes client requests with the **Acting Set** of OSDs: this is the set
of OSDs that currently have a full and working version of a PG shard and that
are therefore responsible for handling requests. By contrast, the **Up Set** is
the set of OSDs that contain a shard of a specific PG. Data is moved or copied
to the **Up Set**, or is planned to be moved or copied there. See
:ref:`Placement Group Concepts <rados_operations_pg_concepts>`.
In some cases, an OSD in the Acting Set is ``down`` or otherwise not able to
service requests for objects in the placement group. When these situations
arise, don't panic. Common examples include:
Sometimes an OSD in the Acting Set is ``down`` or otherwise unable to
service requests for objects in the PG. When this kind of situation
arises, don't panic. Common examples of such a situation include:
- You added or removed an OSD. Then, CRUSH reassigned the placement group to
other OSDs--thereby changing the composition of the Acting Set and spawning
the migration of data with a "backfill" process.
- You added or removed an OSD, CRUSH reassigned the PG to
other OSDs, and this reassignment changed the composition of the Acting Set and triggered
the migration of data by means of a "backfill" process.
- An OSD was ``down``, was restarted, and is now ``recovering``.
- An OSD in the Acting Set is ``down`` or unable to service requests,
- An OSD in the Acting Set is ``down`` or unable to service requests,
and another OSD has temporarily assumed its duties.
In most cases, the Up Set and the Acting Set are identical. When they are not,
it may indicate that Ceph is migrating the PG (it's remapped), an OSD is
recovering, or that there is a problem (i.e., Ceph usually echoes a "HEALTH
WARN" state with a "stuck stale" message in such scenarios).
Typically, the Up Set and the Acting Set are identical. When they are not, it
might indicate that Ceph is migrating the PG (in other words, that the PG has
been remapped), that an OSD is recovering, or that there is a problem with the
cluster (in such scenarios, Ceph usually shows a "HEALTH WARN" state with a
"stuck stale" message).
To retrieve a list of placement groups, execute:
To retrieve a list of PGs, run the following command:
.. prompt:: bash $
ceph pg dump
To view which OSDs are within the Acting Set or the Up Set for a given placement
group, execute:
ceph pg dump
To see which OSDs are within the Acting Set and the Up Set for a specific PG, run the following command:
.. prompt:: bash $
ceph pg map {pg-num}
ceph pg map {pg-num}
The result should tell you the osdmap epoch (eNNN), the placement group number
({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the acting set
The output provides the following information: the osdmap epoch (eNNN), the PG number
({pg-num}), the OSDs in the Up Set (up[]), and the OSDs in the Acting Set
(acting[])::
osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2]
osdmap eNNN pg {raw-pg-num} ({pg-num}) -> up [0,1,2] acting [0,1,2]
.. note:: If the Up Set and Acting Set do not match, this may be an indicator
that the cluster rebalancing itself or of a potential problem with
.. note:: If the Up Set and the Acting Set do not match, this might indicate
that the cluster is rebalancing itself or that there is a problem with
the cluster.
Peering
=======
Before you can write data to a placement group, it must be in an ``active``
state, and it **should** be in a ``clean`` state. For Ceph to determine the
current state of a placement group, the primary OSD of the placement group
(i.e., the first OSD in the acting set), peers with the secondary and tertiary
OSDs to establish agreement on the current state of the placement group
(assuming a pool with 3 replicas of the PG).
Before you can write data to a PG, it must be in an ``active`` state and it
will preferably be in a ``clean`` state. For Ceph to determine the current
state of a PG, peering must take place. That is, the primary OSD of the PG
(that is, the first OSD in the Acting Set) must peer with the secondary and
tertiary OSDs so that consensus on the current state of the PG can be established. In
the following diagram, we assume a pool with three replicas of the PG:
.. ditaa::
@ -187,109 +189,110 @@ OSDs to establish agreement on the current state of the placement group
| Peering |
| |
| Request To |
| Peer |
|----------------------------->|
| Peer |
|----------------------------->|
|<-----------------------------|
| Peering |
The OSDs also report their status to the monitor. See `Configuring Monitor/OSD
Interaction`_ for details. To troubleshoot peering issues, see `Peering
The OSDs also report their status to the monitor. For details, see `Configuring Monitor/OSD
Interaction`_. To troubleshoot peering issues, see `Peering
Failure`_.
Monitoring Placement Group States
=================================
Monitoring PG States
====================
If you execute a command such as ``ceph health``, ``ceph -s`` or ``ceph -w``,
you may notice that the cluster does not always echo back ``HEALTH OK``. After
you check to see if the OSDs are running, you should also check placement group
states. You should expect that the cluster will **NOT** echo ``HEALTH OK`` in a
number of placement group peering-related circumstances:
If you run the commands ``ceph health``, ``ceph -s``, or ``ceph -w``,
you might notice that the cluster does not always show ``HEALTH OK``. After
first checking to see if the OSDs are running, you should also check PG
states. There are certain PG-peering-related circumstances in which it is expected
and normal that the cluster will **NOT** show ``HEALTH OK``:
#. You have just created a pool and placement groups haven't peered yet.
#. The placement groups are recovering.
#. You have just created a pool and the PGs haven't peered yet.
#. The PGs are recovering.
#. You have just added an OSD to or removed an OSD from the cluster.
#. You have just modified your CRUSH map and your placement groups are migrating.
#. There is inconsistent data in different replicas of a placement group.
#. Ceph is scrubbing a placement group's replicas.
#. You have just modified your CRUSH map and your PGs are migrating.
#. There is inconsistent data in different replicas of a PG.
#. Ceph is scrubbing a PG's replicas.
#. Ceph doesn't have enough storage capacity to complete backfilling operations.
If one of the foregoing circumstances causes Ceph to echo ``HEALTH WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases, you
may need to take action. An important aspect of monitoring placement groups is
to ensure that when the cluster is up and running that all placement groups are
``active``, and preferably in the ``clean`` state. To see the status of all
placement groups, execute:
If one of these circumstances causes Ceph to show ``HEALTH WARN``, don't
panic. In many cases, the cluster will recover on its own. In some cases, however, you
might need to take action. An important aspect of monitoring PGs is to check their
status as ``active`` and ``clean``: that is, it is important to ensure that, when the
cluster is up and running, all PGs are ``active`` and (preferably) ``clean``.
To see the status of every PG, run the following command:
.. prompt:: bash $
ceph pg stat
ceph pg stat
The result should tell you the total number of placement groups (x), how many
placement groups are in a particular state such as ``active+clean`` (y) and the
The output provides the following information: the total number of PGs (x), how many
PGs are in a particular state such as ``active+clean`` (y), and the
amount of data stored (z). ::
x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail
x pgs: y active+clean; z bytes data, aa MB used, bb GB / cc GB avail
.. note:: It is common for Ceph to report multiple states for placement groups.
.. note:: It is common for Ceph to report multiple states for PGs (for example,
``active+clean``, ``active+clean+remapped``, and ``active+clean+scrubbing``).
In addition to the placement group states, Ceph will also echo back the amount of
storage capacity used (aa), the amount of storage capacity remaining (bb), and the total
storage capacity for the placement group. These numbers can be important in a
few cases:
Here Ceph shows not only the PG states, but also storage capacity used (aa),
the amount of storage capacity remaining (bb), and the total storage capacity
of the PG. These values can be important in a few cases:
- You are reaching your ``near full ratio`` or ``full ratio``.
- Your data is not getting distributed across the cluster due to an
error in your CRUSH configuration.
- The cluster is reaching its ``near full ratio`` or ``full ratio``.
- Data is not being distributed across the cluster due to an error in the
CRUSH configuration.
.. topic:: Placement Group IDs
Placement group IDs consist of the pool number (not pool name) followed
by a period (.) and the placement group ID--a hexadecimal number. You
can view pool numbers and their names from the output of ``ceph osd
lspools``. For example, the first pool created corresponds to
pool number ``1``. A fully qualified placement group ID has the
PG IDs consist of the pool number (not the pool name) followed by a period
(.) and a hexadecimal number. You can view pool numbers and their names in
the output of ``ceph osd lspools``. For example, the first pool that was
created corresponds to pool number ``1``. A fully qualified PG ID has the
following form::
{pool-num}.{pg-id}
And it typically looks like this::
1.1f
To retrieve a list of placement groups, execute the following:
{pool-num}.{pg-id}
It typically resembles the following::
1.1701b
To retrieve a list of PGs, run the following command:
.. prompt:: bash $
ceph pg dump
You can also format the output in JSON format and save it to a file:
ceph pg dump
To format the output in JSON format and save it to a file, run the following command:
.. prompt:: bash $
ceph pg dump -o {filename} --format=json
ceph pg dump -o {filename} --format=json
To query a particular placement group, execute the following:
To query a specific PG, run the following command:
.. prompt:: bash $
ceph pg {poolnum}.{pg-id} query
ceph pg {poolnum}.{pg-id} query
Ceph will output the query in JSON format.
The following subsections describe the common pg states in detail.
The following subsections describe the most common PG states in detail.
Creating
--------
When you create a pool, it will create the number of placement groups you
specified. Ceph will echo ``creating`` when it is creating one or more
placement groups. Once they are created, the OSDs that are part of a placement
group's Acting Set will peer. Once peering is complete, the placement group
status should be ``active+clean``, which means a Ceph client can begin writing
to the placement group.
PGs are created when you create a pool: the command that creates a pool
specifies the total number of PGs for that pool, and when the pool is created
all of those PGs are created as well. Ceph will echo ``creating`` while it is
creating PGs. After the PG(s) are created, the OSDs that are part of a PG's
Acting Set will peer. Once peering is complete, the PG status should be
``active+clean``. This status means that Ceph clients can begin writing to
the PG.
.. ditaa::
@ -300,43 +303,38 @@ to the placement group.
Peering
-------
When Ceph is Peering a placement group, Ceph is bringing the OSDs that
store the replicas of the placement group into **agreement about the state**
of the objects and metadata in the placement group. When Ceph completes peering,
this means that the OSDs that store the placement group agree about the current
state of the placement group. However, completion of the peering process does
**NOT** mean that each replica has the latest contents.
When a PG peers, the OSDs that store the replicas of its data converge on an
agreed state of the data and metadata within that PG. When peering is complete,
those OSDs agree about the state of that PG. However, completion of the peering
process does **NOT** mean that each replica has the latest contents.
.. topic:: Authoritative History
Ceph will **NOT** acknowledge a write operation to a client, until
all OSDs of the acting set persist the write operation. This practice
ensures that at least one member of the acting set will have a record
of every acknowledged write operation since the last successful
peering operation.
Ceph will **NOT** acknowledge a write operation to a client until that write
operation is persisted by every OSD in the Acting Set. This practice ensures
that at least one member of the Acting Set will have a record of every
acknowledged write operation since the last successful peering operation.
With an accurate record of each acknowledged write operation, Ceph can
construct and disseminate a new authoritative history of the placement
group--a complete, and fully ordered set of operations that, if performed,
would bring an OSDs copy of a placement group up to date.
Given an accurate record of each acknowledged write operation, Ceph can
construct a new authoritative history of the PG--that is, a complete and
fully ordered set of operations that, if performed, would bring an OSD's
copy of the PG up to date.
Active
------
Once Ceph completes the peering process, a placement group may become
``active``. The ``active`` state means that the data in the placement group is
generally available in the primary placement group and the replicas for read
and write operations.
After Ceph has completed the peering process, a PG should become ``active``.
The ``active`` state means that the data in the PG is generally available for
read and write operations in the primary and replica OSDs.
Clean
-----
When a placement group is in the ``clean`` state, the primary OSD and the
replica OSDs have successfully peered and there are no stray replicas for the
placement group. Ceph replicated all objects in the placement group the correct
number of times.
When a PG is in the ``clean`` state, all OSDs holding its data and metadata
have successfully peered and there are no stray replicas. Ceph has replicated
all objects in the PG the correct number of times.
Degraded
@ -344,143 +342,147 @@ Degraded
When a client writes an object to the primary OSD, the primary OSD is
responsible for writing the replicas to the replica OSDs. After the primary OSD
writes the object to storage, the placement group will remain in a ``degraded``
writes the object to storage, the PG will remain in a ``degraded``
state until the primary OSD has received an acknowledgement from the replica
OSDs that Ceph created the replica objects successfully.
The reason a placement group can be ``active+degraded`` is that an OSD may be
``active`` even though it doesn't hold all of the objects yet. If an OSD goes
``down``, Ceph marks each placement group assigned to the OSD as ``degraded``.
The OSDs must peer again when the OSD comes back online. However, a client can
still write a new object to a ``degraded`` placement group if it is ``active``.
The reason that a PG can be ``active+degraded`` is that an OSD can be
``active`` even if it doesn't yet hold all of the PG's objects. If an OSD goes
``down``, Ceph marks each PG assigned to the OSD as ``degraded``. The PGs must
peer again when the OSD comes back online. However, a client can still write a
new object to a ``degraded`` PG if it is ``active``.
If an OSD is ``down`` and the ``degraded`` condition persists, Ceph may mark the
If an OSD is ``down`` and the ``degraded`` condition persists, Ceph might mark the
``down`` OSD as ``out`` of the cluster and remap the data from the ``down`` OSD
to another OSD. The time between being marked ``down`` and being marked ``out``
is controlled by ``mon osd down out interval``, which is set to ``600`` seconds
is determined by ``mon_osd_down_out_interval``, which is set to ``600`` seconds
by default.
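To check or adjust this interval at runtime, commands of the following form can be used (shown here with the default value of 600 seconds):
.. prompt:: bash $
ceph config get mon mon_osd_down_out_interval
ceph config set mon mon_osd_down_out_interval 600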
A placement group can also be ``degraded``, because Ceph cannot find one or more
objects that Ceph thinks should be in the placement group. While you cannot
read or write to unfound objects, you can still access all of the other objects
in the ``degraded`` placement group.
A PG can also be in the ``degraded`` state because there are one or more
objects that Ceph expects to find in the PG but that Ceph cannot find. Although
you cannot read or write to unfound objects, you can still access all of the other
objects in the ``degraded`` PG.
Recovering
----------
Ceph was designed for fault-tolerance at a scale where hardware and software
problems are ongoing. When an OSD goes ``down``, its contents may fall behind
the current state of other replicas in the placement groups. When the OSD is
back ``up``, the contents of the placement groups must be updated to reflect the
current state. During that time period, the OSD may reflect a ``recovering``
state.
Ceph was designed for fault-tolerance, because hardware and other server
problems are expected or even routine. When an OSD goes ``down``, its contents
might fall behind the current state of other replicas in the PGs. When the OSD
has returned to the ``up`` state, the contents of the PGs must be updated to
reflect that current state. During that time period, the OSD might be in a
``recovering`` state.
Recovery is not always trivial, because a hardware failure might cause a
cascading failure of multiple OSDs. For example, a network switch for a rack or
cabinet may fail, which can cause the OSDs of a number of host machines to fall
behind the current state of the cluster. Each one of the OSDs must recover once
the fault is resolved.
cabinet might fail, which can cause the OSDs of a number of host machines to
fall behind the current state of the cluster. In such a scenario, general
recovery is possible only if each of the OSDs recovers after the fault has been
resolved.
Ceph provides a number of settings to balance the resource contention between
new service requests and the need to recover data objects and restore the
placement groups to the current state. The ``osd recovery delay start`` setting
allows an OSD to restart, re-peer and even process some replay requests before
starting the recovery process. The ``osd
recovery thread timeout`` sets a thread timeout, because multiple OSDs may fail,
restart and re-peer at staggered rates. The ``osd recovery max active`` setting
limits the number of recovery requests an OSD will entertain simultaneously to
prevent the OSD from failing to serve . The ``osd recovery max chunk`` setting
limits the size of the recovered data chunks to prevent network congestion.
Ceph provides a number of settings that determine how the cluster balances the
resource contention between the need to process new service requests and the
need to recover data objects and restore the PGs to the current state. The
``osd_recovery_delay_start`` setting allows an OSD to restart, re-peer, and
even process some replay requests before starting the recovery process. The
``osd_recovery_thread_timeout`` setting determines the duration of a thread
timeout, because multiple OSDs might fail, restart, and re-peer at staggered
rates. The ``osd_recovery_max_active`` setting limits the number of recovery
requests an OSD can entertain simultaneously, in order to prevent the OSD from
failing to serve requests. The ``osd_recovery_max_chunk`` setting limits the size of
the recovered data chunks, in order to prevent network congestion.
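A sketch of inspecting and tuning one of these options at runtime follows; the value shown is an assumption for illustration, not a recommendation:
.. prompt:: bash $
ceph config get osd osd_recovery_max_active
ceph config set osd osd_recovery_max_active 3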
Back Filling
------------
When a new OSD joins the cluster, CRUSH will reassign placement groups from OSDs
in the cluster to the newly added OSD. Forcing the new OSD to accept the
reassigned placement groups immediately can put excessive load on the new OSD.
Back filling the OSD with the placement groups allows this process to begin in
the background. Once backfilling is complete, the new OSD will begin serving
requests when it is ready.
When a new OSD joins the cluster, CRUSH will reassign PGs from OSDs that are
already in the cluster to the newly added OSD. It can put excessive load on the
new OSD to force it to immediately accept the reassigned PGs. Back filling the
OSD with the PGs allows this process to begin in the background. After the
backfill operations have completed, the new OSD will begin serving requests as
soon as it is ready.
During the backfill operations, you may see one of several states:
During the backfill operations, you might see one of several states:
``backfill_wait`` indicates that a backfill operation is pending, but is not
underway yet; ``backfilling`` indicates that a backfill operation is underway;
and, ``backfill_toofull`` indicates that a backfill operation was requested,
but couldn't be completed due to insufficient storage capacity. When a
placement group cannot be backfilled, it may be considered ``incomplete``.
yet underway; ``backfilling`` indicates that a backfill operation is currently
underway; and ``backfill_toofull`` indicates that a backfill operation was
requested but couldn't be completed due to insufficient storage capacity. When
a PG cannot be backfilled, it might be considered ``incomplete``.
The ``backfill_toofull`` state may be transient. It is possible that as PGs
are moved around, space may become available. The ``backfill_toofull`` is
similar to ``backfill_wait`` in that as soon as conditions change
backfill can proceed.
The ``backfill_toofull`` state might be transient. It might happen that, as PGs
are moved around, space becomes available. The ``backfill_toofull`` state is
similar to ``backfill_wait`` in that backfill operations can proceed as soon as
conditions change.
Ceph provides a number of settings to manage the load spike associated with
reassigning placement groups to an OSD (especially a new OSD). By default,
``osd_max_backfills`` sets the maximum number of concurrent backfills to and from
an OSD to 1. The ``backfill full ratio`` enables an OSD to refuse a
backfill request if the OSD is approaching its full ratio (90%, by default) and
change with ``ceph osd set-backfillfull-ratio`` command.
If an OSD refuses a backfill request, the ``osd backfill retry interval``
enables an OSD to retry the request (after 30 seconds, by default). OSDs can
also set ``osd backfill scan min`` and ``osd backfill scan max`` to manage scan
intervals (64 and 512, by default).
Ceph provides a number of settings to manage the load spike associated with the
reassignment of PGs to an OSD (especially a new OSD). The ``osd_max_backfills``
setting specifies the maximum number of concurrent backfills to and from an OSD
(default: 1). The ``backfill_full_ratio`` setting allows an OSD to refuse a
backfill request if the OSD is approaching its full ratio (default: 90%). This
setting can be changed with the ``ceph osd set-backfillfull-ratio`` command. If
an OSD refuses a backfill request, the ``osd_backfill_retry_interval`` setting
allows an OSD to retry the request after a certain interval (default: 30
seconds). OSDs can also set ``osd_backfill_scan_min`` and
``osd_backfill_scan_max`` in order to manage scan intervals (default: 64 and
512, respectively).
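These settings can also be inspected and adjusted at runtime. The following is
a sketch; the values shown are only examples:

.. prompt:: bash $

   ceph config set osd osd_max_backfills 2
   ceph osd set-backfillfull-ratio 0.9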
Remapped
--------
When the Acting Set that services a placement group changes, the data migrates
from the old acting set to the new acting set. It may take some time for a new
primary OSD to service requests. So it may ask the old primary to continue to
service requests until the placement group migration is complete. Once data
migration completes, the mapping uses the primary OSD of the new acting set.
When the Acting Set that services a PG changes, the data migrates from the old
Acting Set to the new Acting Set. Because it might take time for the new
primary OSD to begin servicing requests, the old primary OSD might be required
to continue servicing requests until the PG data migration is complete. After
data migration has completed, the mapping uses the primary OSD of the new
Acting Set.
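To see which PGs are currently remapped, you can filter the PG listing by
state. This is a sketch; ``ceph pg ls`` accepts a PG state as a filter:

.. prompt:: bash $

   ceph pg ls remapped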
Stale
-----
While Ceph uses heartbeats to ensure that hosts and daemons are running, the
``ceph-osd`` daemons may also get into a ``stuck`` state where they are not
reporting statistics in a timely manner (e.g., a temporary network fault). By
default, OSD daemons report their placement group, up through, boot and failure
statistics every half second (i.e., ``0.5``), which is more frequent than the
heartbeat thresholds. If the **Primary OSD** of a placement group's acting set
fails to report to the monitor or if other OSDs have reported the primary OSD
``down``, the monitors will mark the placement group ``stale``.
Although Ceph uses heartbeats in order to ensure that hosts and daemons are
running, the ``ceph-osd`` daemons might enter a ``stuck`` state where they are
not reporting statistics in a timely manner (for example, there might be a
temporary network fault). By default, OSD daemons report their PG, up through,
boot, and failure statistics every half second (that is, in accordance with a
value of ``0.5``), which is more frequent than the reports defined by the
heartbeat thresholds. If the primary OSD of a PG's Acting Set fails to report
to the monitor or if other OSDs have reported the primary OSD ``down``, the
monitors will mark the PG ``stale``.
When you start your cluster, it is common to see the ``stale`` state until
the peering process completes. After your cluster has been running for awhile,
seeing placement groups in the ``stale`` state indicates that the primary OSD
for those placement groups is ``down`` or not reporting placement group statistics
to the monitor.
When you start your cluster, it is common to see the ``stale`` state until the
peering process completes. After your cluster has been running for a while,
however, seeing PGs in the ``stale`` state indicates that the primary OSD for
those PGs is ``down`` or not reporting PG statistics to the monitor.
Identifying Troubled PGs
========================
As previously noted, a placement group is not necessarily problematic just
because its state is not ``active+clean``. Generally, Ceph's ability to self
repair may not be working when placement groups get stuck. The stuck states
include:
As previously noted, a PG is not necessarily having problems just because its
state is not ``active+clean``. When PGs are stuck, this might indicate that
Ceph cannot perform self-repairs. The stuck states include:
- **Unclean**: Placement groups contain objects that are not replicated the
desired number of times. They should be recovering.
- **Inactive**: Placement groups cannot process reads or writes because they
are waiting for an OSD with the most up-to-date data to come back ``up``.
- **Stale**: Placement groups are in an unknown state, because the OSDs that
host them have not reported to the monitor cluster in a while (configured
by ``mon osd report timeout``).
- **Unclean**: PGs contain objects that have not been replicated the desired
number of times. Under normal conditions, it can be assumed that these PGs
are recovering.
- **Inactive**: PGs cannot process reads or writes because they are waiting for
an OSD that has the most up-to-date data to come back ``up``.
- **Stale**: PGs are in an unknown state, because the OSDs that host them have
not reported to the monitor cluster for a certain period of time (determined
by ``mon_osd_report_timeout``).
To identify stuck placement groups, execute the following:
To identify stuck PGs, run the following command:
.. prompt:: bash $
ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
ceph pg dump_stuck [unclean|inactive|stale|undersized|degraded]
See `Placement Group Subsystem`_ for additional details. To troubleshoot
stuck placement groups, see `Troubleshooting PG Errors`_.
For more detail, see `Placement Group Subsystem`_. To troubleshoot stuck PGs,
see `Troubleshooting PG Errors`_.
Finding an Object Location
@ -491,55 +493,54 @@ To store object data in the Ceph Object Store, a Ceph client must:
#. Set an object name
#. Specify a `pool`_
The Ceph client retrieves the latest cluster map and the CRUSH algorithm
calculates how to map the object to a `placement group`_, and then calculates
how to assign the placement group to an OSD dynamically. To find the object
location, all you need is the object name and the pool name. For example:
The Ceph client retrieves the latest cluster map, the CRUSH algorithm
calculates how to map the object to a PG, and then the algorithm calculates how
to dynamically assign the PG to an OSD. To find the object location given only
the object name and the pool name, run a command of the following form:
.. prompt:: bash $
ceph osd map {poolname} {object-name} [namespace]
ceph osd map {poolname} {object-name} [namespace]
.. topic:: Exercise: Locate an Object
As an exercise, let's create an object. Specify an object name, a path
to a test file containing some object data and a pool name using the
As an exercise, let's create an object. We can specify an object name, a path
to a test file that contains some object data, and a pool name by using the
``rados put`` command on the command line. For example:
.. prompt:: bash $
rados put {object-name} {file-path} --pool=data
rados put test-object-1 testfile.txt --pool=data
rados put {object-name} {file-path} --pool=data
rados put test-object-1 testfile.txt --pool=data
To verify that the Ceph Object Store stored the object, execute the
following:
To verify that the Ceph Object Store stored the object, run the
following command:
.. prompt:: bash $
rados -p data ls
Now, identify the object location:
To identify the object location, run the following commands:
.. prompt:: bash $
ceph osd map {pool-name} {object-name}
ceph osd map data test-object-1
Ceph should output the object's location. For example::
osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
To remove the test object, simply delete it using the ``rados rm``
command. For example:
Ceph should output the object's location. For example::
osdmap e537 pool 'data' (1) object 'test-object-1' -> pg 1.d1743484 (1.4) -> up ([0,1], p0) acting ([0,1], p0)
To remove the test object, simply delete it by running the ``rados rm``
command. For example:
.. prompt:: bash $
rados rm test-object-1 --pool=data
As the cluster evolves, the object location may change dynamically. One benefit
of Ceph's dynamic rebalancing is that Ceph relieves you from having to perform
the migration manually. See the `Architecture`_ section for details.
of Ceph's dynamic rebalancing is that Ceph spares you the burden of manually
performing the migration. For details, see the `Architecture`_ section.
.. _data placement: ../data-placement
.. _pool: ../pools
View File
@ -2,9 +2,9 @@
Monitoring a Cluster
======================
Once you have a running cluster, you may use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status and metadata server status.
After you have a running cluster, you can use the ``ceph`` tool to monitor your
cluster. Monitoring a cluster typically involves checking OSD status, monitor
status, placement group status, and metadata server status.
Using the command line
======================
@ -13,11 +13,11 @@ Interactive mode
----------------
To run the ``ceph`` tool in interactive mode, type ``ceph`` at the command line
with no arguments. For example:
with no arguments. For example:
.. prompt:: bash $
ceph
ceph
.. prompt:: ceph>
:prompts: ceph>
@ -30,8 +30,9 @@ with no arguments. For example:
Non-default paths
-----------------
If you specified non-default locations for your configuration or keyring,
you may specify their locations:
If you specified non-default locations for your configuration or keyring when
you installed the cluster, you may specify their locations to the ``ceph`` tool
by running the following command:
.. prompt:: bash $
@ -40,30 +41,32 @@ you may specify their locations:
Checking a Cluster's Status
===========================
After you start your cluster, and before you start reading and/or
writing data, check your cluster's status first.
After you start your cluster, and before you start reading and/or writing data,
you should check your cluster's status.
To check a cluster's status, execute the following:
To check a cluster's status, run the following command:
.. prompt:: bash $
ceph status
Or:
Alternatively, you can run the following command:
.. prompt:: bash $
ceph -s
In interactive mode, type ``status`` and press **Enter**:
In interactive mode, this operation is performed by typing ``status`` and
pressing **Enter**:
.. prompt:: ceph>
:prompts: ceph>
status
Ceph will print the cluster status. For example, a tiny Ceph demonstration
cluster with one of each service may print the following:
Ceph will print the cluster status. For example, a tiny Ceph "demonstration
cluster" that is running one instance of each service (monitor, manager, and
OSD) might print the following:
::
@ -84,33 +87,35 @@ cluster with one of each service may print the following:
pgs: 16 active+clean
.. topic:: How Ceph Calculates Data Usage
How Ceph Calculates Data Usage
------------------------------
The ``usage`` value reflects the *actual* amount of raw storage used. The
``xxx GB / xxx GB`` value means the amount available (the lesser number)
of the overall storage capacity of the cluster. The notional number reflects
the size of the stored data before it is replicated, cloned or snapshotted.
Therefore, the amount of data actually stored typically exceeds the notional
amount stored, because Ceph creates replicas of the data and may also use
storage capacity for cloning and snapshotting.
The ``usage`` value reflects the *actual* amount of raw storage used. The ``xxx
GB / xxx GB`` value means the amount available (the lesser number) out of the
overall storage capacity of the cluster. The notional number reflects the size
of the stored data before it is replicated, cloned or snapshotted. Therefore,
the amount of data actually stored typically exceeds the notional amount
stored, because Ceph creates replicas of the data and may also use storage
capacity for cloning and snapshotting.
Watching a Cluster
==================
In addition to local logging by each daemon, Ceph clusters maintain
a *cluster log* that records high level events about the whole system.
This is logged to disk on monitor servers (as ``/var/log/ceph/ceph.log`` by
default), but can also be monitored via the command line.
Each daemon in the Ceph cluster maintains a log of events, and the Ceph cluster
itself maintains a *cluster log* that records high-level events about the
entire Ceph cluster. These events are logged to disk on monitor servers (in
the default location ``/var/log/ceph/ceph.log``), and they can be monitored via
the command line.
To follow the cluster log, use the following command:
To follow the cluster log, run the following command:
.. prompt:: bash $
ceph -w
Ceph will print the status of the system, followed by each log message as it
is emitted. For example:
Ceph will print the status of the system, followed by each log message as it is
added. For example:
::
@ -135,21 +140,20 @@ is emitted. For example:
2017-07-24 08:15:14.258143 mon.a mon.0 172.21.9.34:6789/0 39 : cluster [INF] Activating manager daemon x
2017-07-24 08:15:15.446025 mon.a mon.0 172.21.9.34:6789/0 47 : cluster [INF] Manager daemon x is now available
In addition to using ``ceph -w`` to print log lines as they are emitted,
use ``ceph log last [n]`` to see the most recent ``n`` lines from the cluster
log.
Instead of printing log lines as they are added, you might want to print only
the most recent lines. Run ``ceph log last [n]`` to see the most recent ``n``
lines from the cluster log.
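For example, to print the ten most recent cluster log entries, you might run:

.. prompt:: bash $

   ceph log last 10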
Monitoring Health Checks
========================
Ceph continuously runs various *health checks* against its own status. When
a health check fails, this is reflected in the output of ``ceph status`` (or
``ceph health``). In addition, messages are sent to the cluster log to
indicate when a check fails, and when the cluster recovers.
Ceph continuously runs various *health checks*. When
a health check fails, this failure is reflected in the output of ``ceph status`` and
``ceph health``. The cluster log receives messages that
indicate when a check has failed and when the cluster has recovered.
For example, when an OSD goes down, the ``health`` section of the status
output may be updated as follows:
output is updated as follows:
::
@ -157,7 +161,7 @@ output may be updated as follows:
1 osds down
Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded
At this time, cluster log messages are also emitted to record the failure of the
At the same time, cluster log messages are emitted to record the failure of the
health checks:
::
@ -166,7 +170,7 @@ health checks:
2017-07-25 10:09:01.302624 mon.a mon.0 172.21.9.34:6789/0 94 : cluster [WRN] Health check failed: Degraded data redundancy: 21/63 objects degraded (33.333%), 16 pgs unclean, 16 pgs degraded (PG_DEGRADED)
When the OSD comes back online, the cluster log records the cluster's return
to a health state:
to a healthy state:
::
@ -177,21 +181,23 @@ to a health state:
Network Performance Checks
--------------------------
Ceph OSDs send heartbeat ping messages amongst themselves to monitor daemon availability. We
also use the response times to monitor network performance.
While it is possible that a busy OSD could delay a ping response, we can assume
that if a network switch fails multiple delays will be detected between distinct pairs of OSDs.
Ceph OSDs send heartbeat ping messages to each other in order to monitor daemon
availability and network performance. If a single delayed response is detected,
this might indicate nothing more than a busy OSD. But if multiple delays
between distinct pairs of OSDs are detected, this might indicate a failed
network switch, a NIC failure, or a layer 1 failure.
By default we will warn about ping times which exceed 1 second (1000 milliseconds).
By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a
health check (a ``HEALTH_WARN``). For example:
::
HEALTH_WARN Slow OSD heartbeats on back (longest 1118.001ms)
The health detail will add the combination of OSDs are seeing the delays and by how much. There is a limit of 10
detail line items.
::
In the output of the ``ceph health detail`` command, you can see which OSDs are
experiencing delays and how long the delays are. The output of ``ceph health
detail`` is limited to ten lines. Here is an example of the output you can
expect from the ``ceph health detail`` command::
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 1118.001ms)
Slow OSD heartbeats on back from osd.0 [dc1,rack1] to osd.1 [dc1,rack1] 1118.001 msec possibly improving
@ -199,11 +205,15 @@ detail line items.
Slow OSD heartbeats on back from osd.2 [dc1,rack2] to osd.1 [dc1,rack1] 1015.321 msec
Slow OSD heartbeats on back from osd.1 [dc1,rack1] to osd.0 [dc1,rack1] 1010.456 msec
To see even more detail and a complete dump of network performance information the ``dump_osd_network`` command can be used. Typically, this would be
sent to a mgr, but it can be limited to a particular OSD's interactions by issuing it to any OSD. The current threshold which defaults to 1 second
(1000 milliseconds) can be overridden as an argument in milliseconds.
To see more detail and to collect a complete dump of network performance
information, use the ``dump_osd_network`` command. This command is usually sent
to a Ceph Manager Daemon, but it can be used to collect information about a
specific OSD's interactions by sending it to that OSD. The default threshold
for a slow heartbeat is 1 second (1000 milliseconds), but this can be
overridden by providing a number of milliseconds as an argument.
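For example, to query a single OSD's view of its heartbeat peers through the
admin socket, a sketch might look like the following (``osd.0`` and the
threshold of ``0`` milliseconds are illustrative choices):

.. prompt:: bash $

   ceph daemon osd.0 dump_osd_network 0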
The following command will show all gathered network performance data by specifying a threshold of 0 and sending to the mgr.
To show all network performance data with a specified threshold of 0, send the
following command to the mgr:
.. prompt:: bash $
@ -287,26 +297,26 @@ The following command will show all gathered network performance data by specify
Muting health checks
Muting Health Checks
--------------------
Health checks can be muted so that they do not affect the overall
reported status of the cluster. Alerts are specified using the health
check code (see :ref:`health-checks`):
Health checks can be muted so that they have no effect on the overall
reported status of the cluster. For example, if the cluster has raised a
single health check and you then mute that health check, the cluster will
report a status of ``HEALTH_OK``.
To mute a specific health check, use the health check code that corresponds to that health check (see :ref:`health-checks`), and
run the following command:
.. prompt:: bash $
ceph health mute <code>
For example, if there is a health warning, muting it will make the
cluster report an overall status of ``HEALTH_OK``. For example, to
mute an ``OSD_DOWN`` alert,:
For example, to mute an ``OSD_DOWN`` health check, run the following command:
.. prompt:: bash $
ceph health mute OSD_DOWN
Mutes are reported as part of the short and long form of the ``ceph health`` command.
Mutes are reported as part of the short and long form of the ``ceph health`` command's output.
For example, in the above scenario, the cluster would report:
.. prompt:: bash $
@ -327,7 +337,7 @@ For example, in the above scenario, the cluster would report:
(MUTED) OSD_DOWN 1 osds down
osd.1 is down
A mute can be explicitly removed with:
A mute can be removed by running the following command:
.. prompt:: bash $
@ -339,56 +349,44 @@ For example:
ceph health unmute OSD_DOWN
A health check mute may optionally have a TTL (time to live)
associated with it, such that the mute will automatically expire
after the specified period of time has elapsed. The TTL is specified as an optional
duration argument, e.g.:
A "health mute" can have a TTL (**T**\ime **T**\o **L**\ive)
associated with it: this means that the mute will automatically expire
after a specified period of time. The TTL is specified as an optional
duration argument, as seen in the following examples:
.. prompt:: bash $
ceph health mute OSD_DOWN 4h # mute for 4 hours
ceph health mute MON_DOWN 15m # mute for 15 minutes
ceph health mute MON_DOWN 15m # mute for 15 minutes
Normally, if a muted health alert is resolved (e.g., in the example
above, the OSD comes back up), the mute goes away. If the alert comes
Normally, if a muted health check is resolved (for example, if the OSD that raised the ``OSD_DOWN`` health check
in the example above has come back up), the mute goes away. If the health check comes
back later, it will be reported in the usual way.
It is possible to make a mute "sticky" such that the mute will remain even if the
alert clears. For example:
It is possible to make a health mute "sticky": this means that the mute will remain even if the
health check clears. For example, to make a health mute "sticky", you might run the following command:
.. prompt:: bash $
ceph health mute OSD_DOWN 1h --sticky # ignore any/all down OSDs for next hour
Most health mutes also disappear if the extent of an alert gets worse. For example,
if there is one OSD down, and the alert is muted, the mute will disappear if one
or more additional OSDs go down. This is true for any health alert that involves
a count indicating how much or how many of something is triggering the warning or
error.
Most health mutes disappear if the unhealthy condition that triggered the health check gets worse.
For example, suppose that there is one OSD down and the health check is muted. In that case, if
one or more additional OSDs go down, then the health mute disappears. This behavior occurs in any health check with a threshold value.
Detecting configuration issues
==============================
In addition to the health checks that Ceph continuously runs on its
own status, there are some configuration issues that may only be detected
by an external tool.
Use the `ceph-medic`_ tool to run these additional checks on your Ceph
cluster's configuration.
Checking a Cluster's Usage Stats
================================
To check a cluster's data usage and data distribution among pools, you can
use the ``df`` option. It is similar to Linux ``df``. Execute
the following:
To check a cluster's data usage and data distribution among pools, use the
``df`` command. This option is similar to Linux's ``df`` command. Run the
following command:
.. prompt:: bash $
ceph df
The output of ``ceph df`` looks like this::
The output of ``ceph df`` resembles the following::
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 202 GiB 200 GiB 2.0 GiB 2.0 GiB 1.00
@ -400,52 +398,49 @@ The output of ``ceph df`` looks like this::
cephfs.a.meta 2 32 6.8 KiB 6.8 KiB 0 B 22 96 KiB 96 KiB 0 B 0 297 GiB N/A N/A 22 0 B 0 B
cephfs.a.data 3 32 0 B 0 B 0 B 0 0 B 0 B 0 B 0 99 GiB N/A N/A 0 0 B 0 B
test 4 32 22 MiB 22 MiB 50 KiB 248 19 MiB 19 MiB 50 KiB 0 297 GiB N/A N/A 248 0 B 0 B
- **CLASS:** for example, "ssd" or "hdd"
- **CLASS:** For example, "ssd" or "hdd".
- **SIZE:** The amount of storage capacity managed by the cluster.
- **AVAIL:** The amount of free space available in the cluster.
- **USED:** The amount of raw storage consumed by user data (excluding
BlueStore's database)
BlueStore's database).
- **RAW USED:** The amount of raw storage consumed by user data, internal
overhead, or reserved capacity.
- **%RAW USED:** The percentage of raw storage used. Use this number in
conjunction with the ``full ratio`` and ``near full ratio`` to ensure that
you are not reaching your cluster's capacity. See `Storage Capacity`_ for
additional details.
overhead, and reserved capacity.
- **%RAW USED:** The percentage of raw storage used. Watch this number in
conjunction with ``full ratio`` and ``near full ratio`` to be forewarned when
your cluster approaches the fullness thresholds. See `Storage Capacity`_.
**POOLS:**
**POOLS:**
The **POOLS** section of the output provides a list of pools and the notional
usage of each pool. The output from this section **DOES NOT** reflect replicas,
clones or snapshots. For example, if you store an object with 1MB of data, the
notional usage will be 1MB, but the actual usage may be 2MB or more depending
on the number of replicas, clones and snapshots.
The POOLS section of the output provides a list of pools and the *notional*
usage of each pool. This section of the output **DOES NOT** reflect replicas,
clones, or snapshots. For example, if you store an object with 1MB of data,
then the notional usage will be 1MB, but the actual usage might be 2MB or more
depending on the number of replicas, clones, and snapshots.
- **ID:** The number of the node within the pool.
- **STORED:** actual amount of data user/Ceph has stored in a pool. This is
similar to the USED column in earlier versions of Ceph but the calculations
(for BlueStore!) are more precise (gaps are properly handled).
- **ID:** The number of the specific node within the pool.
- **STORED:** The actual amount of data that the user has stored in a pool.
This is similar to the USED column in earlier versions of Ceph, but the
calculations (for BlueStore!) are more precise (in that gaps are properly
handled).
- **(DATA):** usage for RBD (RADOS Block Device), CephFS file data, and RGW
- **(DATA):** Usage for RBD (RADOS Block Device), CephFS file data, and RGW
(RADOS Gateway) object data.
- **(OMAP):** key-value pairs. Used primarily by CephFS and RGW (RADOS
- **(OMAP):** Key-value pairs. Used primarily by CephFS and RGW (RADOS
Gateway) for metadata storage.
- **OBJECTS:** The notional number of objects stored per pool. "Notional" is
defined above in the paragraph immediately under "POOLS".
- **USED:** The space allocated for a pool over all OSDs. This includes
replication, allocation granularity, and erasure-coding overhead. Compression
savings and object content gaps are also taken into account. BlueStore's
database is not included in this amount.
- **OBJECTS:** The notional number of objects stored per pool (that is, the
number of objects other than replicas, clones, or snapshots).
- **USED:** The space allocated for a pool over all OSDs. This includes space
for replication, space for allocation granularity, and space for the overhead
associated with erasure-coding. Compression savings and object-content gaps
are also taken into account. However, BlueStore's database is not included in
the amount reported under USED.
- **(DATA):** object usage for RBD (RADOS Block Device), CephFS file data, and RGW
(RADOS Gateway) object data.
- **(OMAP):** object key-value pairs. Used primarily by CephFS and RGW (RADOS
- **(DATA):** Object usage for RBD (RADOS Block Device), CephFS file data,
and RGW (RADOS Gateway) object data.
- **(OMAP):** Object key-value pairs. Used primarily by CephFS and RGW (RADOS
Gateway) for metadata storage.
- **%USED:** The notional percentage of storage used per pool.
@ -454,50 +449,51 @@ on the number of replicas, clones and snapshots.
- **QUOTA OBJECTS:** The number of quota objects.
- **QUOTA BYTES:** The number of bytes in the quota objects.
- **DIRTY:** The number of objects in the cache pool that have been written to
the cache pool but have not been flushed yet to the base pool. This field is
only available when cache tiering is in use.
- **USED COMPR:** amount of space allocated for compressed data (i.e. this
includes compressed data plus all the allocation, replication and erasure
coding overhead).
- **UNDER COMPR:** amount of data passed through compression (summed over all
replicas) and beneficial enough to be stored in a compressed form.
the cache pool but have not yet been flushed to the base pool. This field is
available only when cache tiering is in use.
- **USED COMPR:** The amount of space allocated for compressed data. This
includes compressed data in addition to all of the space required for
replication, allocation granularity, and erasure-coding overhead.
- **UNDER COMPR:** The amount of data that has passed through compression
(summed over all replicas) and that is worth storing in a compressed form.
.. note:: The numbers in the POOLS section are notional. They are not
inclusive of the number of replicas, snapshots or clones. As a result, the
sum of the USED and %USED amounts will not add up to the USED and %USED
amounts in the RAW section of the output.
.. note:: The numbers in the POOLS section are notional. They do not include
the number of replicas, clones, or snapshots. As a result, the sum of the
USED and %USED amounts in the POOLS section of the output will not be equal
to the sum of the USED and %USED amounts in the RAW section of the output.
.. note:: The MAX AVAIL value is a complicated function of the replication
or erasure code used, the CRUSH rule that maps storage to devices, the
utilization of those devices, and the configured ``mon_osd_full_ratio``.
.. note:: The MAX AVAIL value is a complicated function of the replication or
the kind of erasure coding used, the CRUSH rule that maps storage to
devices, the utilization of those devices, and the configured
``mon_osd_full_ratio`` setting.
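To see the fullness thresholds that are currently in effect, you can inspect
the OSD map. This is a sketch only:

.. prompt:: bash $

   ceph osd dump | grep ratio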
Checking OSD Status
===================
You can check OSDs to ensure they are ``up`` and ``in`` by executing the
To check if OSDs are ``up`` and ``in``, run the
following command:
.. prompt:: bash #
ceph osd stat
Or:
Alternatively, you can run the following command:
.. prompt:: bash #
ceph osd dump
You can also check view OSDs according to their position in the CRUSH map by
using the following command:
To view OSDs according to their position in the CRUSH map, run the following
command:
.. prompt:: bash #
ceph osd tree
Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up
and their weight:
To print out a CRUSH tree that displays a host, its OSDs, whether the OSDs are
``up``, and the weight of the OSDs, run the following command:
.. code-block:: bash
@ -509,88 +505,90 @@ and their weight:
1 ssd 1.00000 osd.1 up 1.00000 1.00000
2 ssd 1.00000 osd.2 up 1.00000 1.00000
For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
See `Monitoring OSDs and Placement Groups`_.
Checking Monitor Status
=======================
If your cluster has multiple monitors (likely), you should check the monitor
quorum status after you start the cluster and before reading and/or writing data. A
quorum must be present when multiple monitors are running. You should also check
monitor status periodically to ensure that they are running.
If your cluster has multiple monitors, then you need to perform certain
"monitor status" checks. After starting the cluster and before reading or
writing data, you should check quorum status. A quorum must be present when
multiple monitors are running to ensure proper functioning of your Ceph
cluster. Check monitor status regularly in order to ensure that all of the
monitors are running.
To see display the monitor map, execute the following:
To display the monitor map, run the following command:
.. prompt:: bash $
ceph mon stat
Or:
Alternatively, you can run the following command:
.. prompt:: bash $
ceph mon dump
To check the quorum status for the monitor cluster, execute the following:
To check the quorum status for the monitor cluster, run the following command:
.. prompt:: bash $
ceph quorum_status
Ceph will return the quorum status. For example, a Ceph cluster consisting of
three monitors may return the following:
Ceph returns the quorum status. For example, a Ceph cluster that consists of
three monitors might return the following:
.. code-block:: javascript
{ "election_epoch": 10,
"quorum": [
0,
1,
2],
"quorum_names": [
"a",
"b",
"c"],
"quorum_leader_name": "a",
"monmap": { "epoch": 1,
"fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
"modified": "2011-12-12 13:28:27.505520",
"created": "2011-12-12 13:28:27.505520",
"features": {"persistent": [
"kraken",
"luminous",
"mimic"],
"optional": []
},
"mons": [
{ "rank": 0,
"name": "a",
"addr": "127.0.0.1:6789/0",
"public_addr": "127.0.0.1:6789/0"},
{ "rank": 1,
"name": "b",
"addr": "127.0.0.1:6790/0",
"public_addr": "127.0.0.1:6790/0"},
{ "rank": 2,
"name": "c",
"addr": "127.0.0.1:6791/0",
"public_addr": "127.0.0.1:6791/0"}
]
}
}
{ "election_epoch": 10,
"quorum": [
0,
1,
2],
"quorum_names": [
"a",
"b",
"c"],
"quorum_leader_name": "a",
"monmap": { "epoch": 1,
"fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
"modified": "2011-12-12 13:28:27.505520",
"created": "2011-12-12 13:28:27.505520",
"features": {"persistent": [
"kraken",
"luminous",
"mimic"],
"optional": []
},
"mons": [
{ "rank": 0,
"name": "a",
"addr": "127.0.0.1:6789/0",
"public_addr": "127.0.0.1:6789/0"},
{ "rank": 1,
"name": "b",
"addr": "127.0.0.1:6790/0",
"public_addr": "127.0.0.1:6790/0"},
{ "rank": 2,
"name": "c",
"addr": "127.0.0.1:6791/0",
"public_addr": "127.0.0.1:6791/0"}
]
}
}
Checking MDS Status
===================
Metadata servers provide metadata services for CephFS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To ensure your
metadata servers are ``up`` and ``active``, execute the following:
Metadata servers provide metadata services for CephFS. Metadata servers have
two sets of states: ``up | down`` and ``active | inactive``. To check if your
metadata servers are ``up`` and ``active``, run the following command:
.. prompt:: bash $
ceph mds stat
To display details of the metadata cluster, execute the following:
To display details of the metadata servers, run the following command:
.. prompt:: bash $
@ -600,9 +598,9 @@ To display details of the metadata cluster, execute the following:
Checking Placement Group States
===============================
Placement groups map objects to OSDs. When you monitor your
placement groups, you will want them to be ``active`` and ``clean``.
For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
Placement groups (PGs) map objects to OSDs. PGs are monitored in order to
ensure that they are ``active`` and ``clean``. See `Monitoring OSDs and
Placement Groups`_.
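For a quick summary of PG states across the cluster, you might run:

.. prompt:: bash $

   ceph pg stat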
.. _Monitoring OSDs and Placement Groups: ../monitoring-osd-pg
@ -611,37 +609,36 @@ For a detailed discussion, refer to `Monitoring OSDs and Placement Groups`_.
Using the Admin Socket
======================
The Ceph admin socket allows you to query a daemon via a socket interface.
By default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon
via the admin socket, login to the host running the daemon and use the
following command:
The Ceph admin socket allows you to query a daemon via a socket interface. By
default, Ceph sockets reside under ``/var/run/ceph``. To access a daemon via
the admin socket, log in to the host that is running the daemon and run one of
the two following commands:
.. prompt:: bash $
ceph daemon {daemon-name}
ceph daemon {path-to-socket-file}
For example, the following are equivalent:
For example, the following commands are equivalent to each other:
.. prompt:: bash $
ceph daemon osd.0 foo
ceph daemon /var/run/ceph/ceph-osd.0.asok foo
To view the available admin socket commands, execute the following command:
To view the available admin-socket commands, run the following command:
.. prompt:: bash $
ceph daemon {daemon-name} help
The admin socket command enables you to show and set your configuration at
runtime. See `Viewing a Configuration at Runtime`_ for details.
Additionally, you can set configuration values at runtime directly (i.e., the
admin socket bypasses the monitor, unlike ``ceph tell {daemon-type}.{id}
config set``, which relies on the monitor but doesn't require you to login
directly to the host in question ).
Admin-socket commands enable you to view and set your configuration at runtime.
For more on viewing your configuration, see `Viewing a Configuration at
Runtime`_. There are two methods of setting configuration values at runtime: (1)
using the admin socket, which bypasses the monitor and requires a direct login
to the host in question, and (2) using the ``ceph tell {daemon-type}.{id}
config set`` command, which relies on the monitor and does not require a direct
login.
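For example, the following two commands are a sketch of these two methods;
``osd.0`` and ``debug_osd`` stand in for whatever daemon and option you want to
change:

.. prompt:: bash $

   ceph daemon osd.0 config set debug_osd 20   # via the admin socket; run on the daemon's host
   ceph tell osd.0 config set debug_osd 20     # relies on the monitor; can be run from elsewhere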
.. _Viewing a Configuration at Runtime: ../../configuration/ceph-conf#viewing-a-configuration-at-runtime
.. _Storage Capacity: ../../configuration/mon-config-ref#storage-capacity
.. _ceph-medic: http://docs.ceph.com/ceph-medic/master/
View File
@ -6,50 +6,52 @@
Running Ceph with systemd
==========================
=========================
For all distributions that support systemd (CentOS 7, Fedora, Debian
Jessie 8 and later, SUSE), ceph daemons are now managed using native
systemd files instead of the legacy sysvinit scripts. For example:
In all distributions that support systemd (CentOS 7, Fedora, Debian
Jessie 8 and later, and SUSE), systemd files (and NOT legacy SysVinit scripts)
are used to manage Ceph daemons. Ceph daemons therefore behave like any other daemons
that can be controlled by the ``systemctl`` command, as in the following examples:
.. prompt:: bash $
sudo systemctl start ceph.target # start all daemons
sudo systemctl status ceph-osd@12 # check status of osd.12
To list the Ceph systemd units on a node, execute:
To list all of the Ceph systemd units on a node, run the following command:
.. prompt:: bash $
sudo systemctl status ceph\*.service ceph\*.target
Starting all Daemons
Starting all daemons
--------------------
To start all daemons on a Ceph Node (irrespective of type), execute the
following:
To start all of the daemons on a Ceph node (regardless of their type), run the
following command:
.. prompt:: bash $
sudo systemctl start ceph.target
Stopping all Daemons
Stopping all daemons
--------------------
To stop all daemons on a Ceph Node (irrespective of type), execute the
following:
To stop all of the daemons on a Ceph node (regardless of their type), run the
following command:
.. prompt:: bash $
sudo systemctl stop ceph\*.service ceph\*.target
Starting all Daemons by Type
Starting all daemons by type
----------------------------
To start all daemons of a particular type on a Ceph Node, execute one of the
following:
To start all of the daemons of a particular type on a Ceph node, run one of the
following commands:
.. prompt:: bash $
@ -58,24 +60,24 @@ following:
sudo systemctl start ceph-mds.target
Stopping all Daemons by Type
Stopping all daemons by type
----------------------------
To stop all daemons of a particular type on a Ceph Node, execute one of the
following:
To stop all of the daemons of a particular type on a Ceph node, run one of the
following commands:
.. prompt:: bash $
sudo systemctl stop ceph-mon\*.service ceph-mon.target
sudo systemctl stop ceph-osd\*.service ceph-osd.target
sudo systemctl stop ceph-mon\*.service ceph-mon.target
sudo systemctl stop ceph-mds\*.service ceph-mds.target
Starting a Daemon
Starting a daemon
-----------------
To start a specific daemon instance on a Ceph Node, execute one of the
following:
To start a specific daemon instance on a Ceph node, run one of the
following commands:
.. prompt:: bash $
@ -92,11 +94,11 @@ For example:
sudo systemctl start ceph-mds@ceph-server
Stopping a Daemon
Stopping a daemon
-----------------
To stop a specific daemon instance on a Ceph Node, execute one of the
following:
To stop a specific daemon instance on a Ceph node, run one of the
following commands:
.. prompt:: bash $
@ -115,15 +117,14 @@ For example:
.. index:: sysvinit; operating a cluster
Running Ceph with sysvinit
Running Ceph with SysVinit
==========================
Each time you to **start**, **restart**, and **stop** Ceph daemons (or your
entire cluster) you must specify at least one option and one command. You may
also specify a daemon type or a daemon instance. ::
{commandline} [options] [commands] [daemons]
Each time you start, restart, or stop Ceph daemons, you must specify at least one option and one command.
Likewise, each time you start, restart, or stop your entire cluster, you must specify at least one option and one command.
In both cases, you can also specify a daemon type or a daemon instance. ::
{commandline} [options] [commands] [daemons]
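For instance, assuming the legacy SysVinit script is installed at
``/etc/init.d/ceph``, a sketch of this syntax might look like this:

.. prompt:: bash $

   sudo /etc/init.d/ceph -a start        # start all daemons on all nodes listed in ceph.conf
   sudo /etc/init.d/ceph start osd.0     # start a single daemon instance on the local node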
The ``ceph`` options include:
@ -134,12 +135,12 @@ The ``ceph`` options include:
+-----------------+----------+-------------------------------------------------+
| ``--valgrind`` | ``N/A`` | (Dev and QA only) Use `Valgrind`_ debugging. |
+-----------------+----------+-------------------------------------------------+
| ``--allhosts`` | ``-a`` | Execute on all nodes in ``ceph.conf.`` |
| ``--allhosts`` | ``-a`` | Execute on all nodes listed in ``ceph.conf``. |
| | | Otherwise, it only executes on ``localhost``. |
+-----------------+----------+-------------------------------------------------+
| ``--restart`` | ``N/A`` | Automatically restart daemon if it core dumps. |
+-----------------+----------+-------------------------------------------------+
| ``--norestart`` | ``N/A`` | Don't restart a daemon if it core dumps. |
| ``--norestart`` | ``N/A`` | Do not restart a daemon if it core dumps. |
+-----------------+----------+-------------------------------------------------+
| ``--conf`` | ``-c`` | Use an alternate configuration file. |
+-----------------+----------+-------------------------------------------------+
@ -153,24 +154,21 @@ The ``ceph`` commands include:
+------------------+------------------------------------------------------------+
| ``stop`` | Stop the daemon(s). |
+------------------+------------------------------------------------------------+
| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9`` |
| ``forcestop`` | Force the daemon(s) to stop. Same as ``kill -9``. |
+------------------+------------------------------------------------------------+
| ``killall`` | Kill all daemons of a particular type. |
| ``killall`` | Kill all daemons of a particular type. |
+------------------+------------------------------------------------------------+
| ``cleanlogs`` | Cleans out the log directory. |
+------------------+------------------------------------------------------------+
| ``cleanalllogs`` | Cleans out **everything** in the log directory. |
+------------------+------------------------------------------------------------+
For subsystem operations, the ``ceph`` service can target specific daemon types
by adding a particular daemon type for the ``[daemons]`` option. Daemon types
include:
The ``[daemons]`` option allows the ``ceph`` service to target specific daemon types
in order to perform subsystem operations. Daemon types include:
- ``mon``
- ``osd``
- ``mds``
.. _Valgrind: http://www.valgrind.org/
.. _initctl: http://manpages.ubuntu.com/manpages/raring/en/man8/initctl.8.html
View File
@ -1,3 +1,5 @@
.. _rados_operations_pg_concepts:
==========================
Placement Group Concepts
==========================
View File
@ -1,59 +1,60 @@
============================
Repairing PG inconsistencies
Repairing PG Inconsistencies
============================
Sometimes a placement group might become "inconsistent". To return the
placement group to an active+clean state, you must first determine which
of the placement groups has become inconsistent and then run the "pg
repair" command on it. This page contains commands for diagnosing placement
groups and the command for repairing placement groups that have become
Sometimes a Placement Group (PG) might become ``inconsistent``. To return the PG
to an ``active+clean`` state, you must first determine which of the PGs has become
inconsistent and then run the ``pg repair`` command on it. This page contains
commands for diagnosing PGs and the command for repairing PGs that have become
inconsistent.
.. highlight:: console
Commands for Diagnosing Placement-group Problems
================================================
The commands in this section provide various ways of diagnosing broken placement groups.
Commands for Diagnosing PG Problems
===================================
The commands in this section provide various ways of diagnosing broken PGs.
The following command provides a high-level (low detail) overview of the health of the ceph cluster:
To see a high-level (low-detail) overview of Ceph cluster health, run the
following command:
.. prompt:: bash #
ceph health detail
The following command provides more detail on the status of the placement groups:
To see more detail on the status of the PGs, run the following command:
.. prompt:: bash #
ceph pg dump --format=json-pretty
The following command lists inconsistent placement groups:
To see a list of inconsistent PGs, run the following command:
.. prompt:: bash #
rados list-inconsistent-pg {pool}
The following command lists inconsistent rados objects:
To see a list of inconsistent RADOS objects, run the following command:
.. prompt:: bash #
rados list-inconsistent-obj {pgid}
The following command lists inconsistent snapsets in the given placement group:
To see a list of inconsistent snapsets in a specific PG, run the following
command:
.. prompt:: bash #
rados list-inconsistent-snapset {pgid}
Commands for Repairing Placement Groups
=======================================
The form of the command to repair a broken placement group is:
Commands for Repairing PGs
==========================
The form of the command to repair a broken PG is as follows:
.. prompt:: bash #
ceph pg repair {pgid}
Where ``{pgid}`` is the id of the affected placement group.
Here ``{pgid}`` represents the id of the affected PG.
For example:
@ -61,23 +62,57 @@ For example:
ceph pg repair 1.4
More Information on Placement Group Repair
==========================================
Ceph stores and updates the checksums of objects stored in the cluster. When a scrub is performed on a placement group, the OSD attempts to choose an authoritative copy from among its replicas. Among all of the possible cases, only one case is consistent. After a deep scrub, Ceph calculates the checksum of an object read from the disk and compares it to the checksum previously recorded. If the current checksum and the previously recorded checksums do not match, that is an inconsistency. In the case of replicated pools, any mismatch between the checksum of any replica of an object and the checksum of the authoritative copy means that there is an inconsistency.
.. note:: PG IDs have the form ``N.xxxxx``, where ``N`` is the number of the
pool that contains the PG. The command ``ceph osd lspools`` and the
command ``ceph osd dump | grep pool`` return a list of pool numbers.
The ``pg repair`` command attempts to fix inconsistencies of various kinds. If ``pg repair`` finds an inconsistent placement group, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy. If ``pg repair`` finds an inconsistent replicated pool, it marks the inconsistent copy as missing. Recovery, in the case of replicated pools, is beyond the scope of ``pg repair``.
More Information on PG Repair
=============================
Ceph stores and updates the checksums of objects stored in the cluster. When a
scrub is performed on a PG, the OSD attempts to choose an authoritative copy
from among its replicas. Only one of the possible cases is consistent. After
performing a deep scrub, Ceph calculates the checksum of an object that is read
from disk and compares it to the checksum that was previously recorded. If the
current checksum and the previously recorded checksum do not match, that
mismatch is considered to be an inconsistency. In the case of replicated pools,
any mismatch between the checksum of any replica of an object and the checksum
of the authoritative copy means that there is an inconsistency. The discovery
of these inconsistencies causes a PG's state to be set to ``inconsistent``.
For erasure coded and BlueStore pools, Ceph will automatically repair
if ``osd_scrub_auto_repair`` (default ``false`) is set to ``true`` and
at most ``osd_scrub_auto_repair_num_errors`` (default ``5``) errors are found.
The ``pg repair`` command attempts to fix inconsistencies of various kinds. If
``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of
the inconsistent copy with the digest of the authoritative copy. If ``pg
repair`` finds an inconsistent replicated pool, it marks the inconsistent copy
as missing. In the case of replicated pools, recovery is beyond the scope of
``pg repair``.
``pg repair`` will not solve every problem. Ceph does not automatically repair placement groups when inconsistencies are found in them.
In the case of erasure-coded and BlueStore pools, Ceph will automatically
perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to
``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default
``5``) errors are found.
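For example, to enable automatic repair with the default error threshold, you
might run the following commands (a sketch; enable this only if it is
appropriate for your cluster):

.. prompt:: bash $

   ceph config set osd osd_scrub_auto_repair true
   ceph config set osd osd_scrub_auto_repair_num_errors 5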
The checksum of a RADOS object or an omap is not always available. Checksums are calculated incrementally. If a replicated object is updated non-sequentially, the write operation involved in the update changes the object and invalidates its checksum. The whole object is not read while recalculating the checksum. "ceph pg repair" is able to repair things even when checksums are not available to it, as in the case of Filestore. When replicated Filestore pools are in play, users might prefer manual repair over ``ceph pg repair``.
The ``pg repair`` command will not solve every problem. Ceph does not
automatically repair PGs when they are found to contain inconsistencies.
The material in this paragraph is relevant for Filestore, and BlueStore has its own internal checksums. The matched-record checksum and the calculated checksum cannot prove that the authoritative copy is in fact authoritative. In the case that there is no checksum available, ``pg repair`` favors the data on the primary. this might or might not be the uncorrupted replica. This is why human intervention is necessary when an inconsistency is discovered. Human intervention sometimes means using the ``eph-objectstore-tool``.
The checksum of a RADOS object or an omap is not always available. Checksums
are calculated incrementally. If a replicated object is updated
non-sequentially, the write operation involved in the update changes the object
and invalidates its checksum. The whole object is not read while the checksum
is recalculated. The ``pg repair`` command is able to make repairs even when
checksums are not available to it, as in the case of Filestore. Users working
with replicated Filestore pools might prefer manual repair to ``ceph pg
repair``.
This material is relevant for Filestore, but not for BlueStore, which has its
own internal checksums. The matched-record checksum and the calculated checksum
cannot prove that any specific copy is in fact authoritative. If there is no
checksum available, ``pg repair`` favors the data on the primary, but this
might not be the uncorrupted replica. Because of this uncertainty, human
intervention is necessary when an inconsistency is discovered. This
intervention sometimes involves use of ``ceph-objectstore-tool``.
External Links
==============
https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page contains a walkthrough of the repair of a placement group, and is recommended reading if you want to repair a placement
group but have never done so.
https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page
contains a walkthrough of the repair of a PG. It is recommended reading if you
want to repair a PG but have never done so.
File diff suppressed because it is too large

File diff suppressed because it is too large

View File
@ -7,208 +7,256 @@ Stretch Clusters
Stretch Clusters
================
Ceph generally expects all parts of its network and overall cluster to be
equally reliable, with failures randomly distributed across the CRUSH map.
So you may lose a switch that knocks out a number of OSDs, but we expect
the remaining OSDs and monitors to route around that.
This is usually a good choice, but may not work well in some
stretched cluster configurations where a significant part of your cluster
is stuck behind a single network component. For instance, a single
cluster which is located in multiple data centers, and you want to
sustain the loss of a full DC.
A stretch cluster is a cluster that has servers in geographically separated
data centers, distributed over a WAN. Stretch clusters have LAN-like high-speed
and low-latency connections, but limited links. Stretch clusters have a higher
likelihood of (possibly asymmetric) network splits, and a higher likelihood of
temporary or complete loss of an entire data center (which can represent
one-third to one-half of the total cluster).
There are two standard configurations we've seen deployed, with either
two or three data centers (or, in clouds, availability zones). With two
zones, we expect each site to hold a copy of the data, and for a third
site to have a tiebreaker monitor (this can be a VM or high-latency compared
to the main sites) to pick a winner if the network connection fails and both
DCs remain alive. For three sites, we expect a copy of the data and an equal
number of monitors in each site.
Ceph is designed with the expectation that all parts of its network and cluster
will be reliable and that failures will be distributed randomly across the
CRUSH map. Even if a switch goes down and causes the loss of many OSDs, Ceph is
designed so that the remaining OSDs and monitors will route around such a loss.
Note that the standard Ceph configuration will survive MANY failures of the
network or data centers and it will never compromise data consistency. If you
bring back enough Ceph servers following a failure, it will recover. If you
lose a data center, but can still form a quorum of monitors and have all the data
available (with enough copies to satisfy pools' ``min_size``, or CRUSH rules
that will re-replicate to meet it), Ceph will maintain availability.
Sometimes this cannot be relied upon. If you have a "stretched-cluster"
deployment in which much of your cluster is behind a single network component,
you might need to use **stretch mode** to ensure data integrity.
What can't it handle?
Here we will consider two standard configurations: a configuration with two
data centers (or, in clouds, two availability zones), and a configuration with
three data centers (or, in clouds, three availability zones).
In the two-site configuration, Ceph expects each of the sites to hold a copy of
the data, and Ceph also expects there to be a third site that has a tiebreaker
monitor. This tiebreaker monitor picks a winner if the network connection fails
and both data centers remain alive.
The tiebreaker monitor can be a VM. It can also have high latency relative to
the two main sites.
The standard Ceph configuration is able to survive MANY network failures or
data-center failures without ever compromising data availability. If enough
Ceph servers are brought back following a failure, the cluster *will* recover.
If you lose a data center but are still able to form a quorum of monitors and
still have all the data available, Ceph will maintain availability. (This
assumes that the cluster has enough copies to satisfy the pools' ``min_size``
configuration option, or (failing that) that the cluster has CRUSH rules in
place that will cause the cluster to re-replicate the data until the
``min_size`` configuration option has been met.)
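To inspect or change the ``min_size`` of a pool, you can use commands of the
following form (``mypool`` is a placeholder for your pool name):

.. prompt:: bash $

   ceph osd pool get mypool min_size
   ceph osd pool set mypool min_size 2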
Stretch Cluster Issues
======================
No matter what happens, Ceph will not compromise on data integrity
and consistency. If there's a failure in your network or a loss of nodes and
you can restore service, Ceph will return to normal functionality on its own.
But there are scenarios where you lose data availability despite having
enough servers available to satisfy Ceph's consistency and sizing constraints, or
where you may be surprised to not satisfy Ceph's constraints.
The first important category of these failures resolve around inconsistent
networks -- if there's a netsplit, Ceph may be unable to mark OSDs down and kick
them out of the acting PG sets despite the primary being unable to replicate data.
If this happens, IO will not be permitted, because Ceph can't satisfy its durability
guarantees.
Ceph does not permit the compromise of data integrity and data consistency
under any circumstances. When service is restored after a network failure or a
loss of Ceph nodes, Ceph will restore itself to a state of normal functioning
without operator intervention.
Ceph does not permit the compromise of data integrity or data consistency, but
there are situations in which *data availability* is compromised. These
situations can occur even though there are enough servers available to satisfy
Ceph's consistency and sizing constraints. In some situations, you might
discover that your cluster does not satisfy those constraints.
The first category of these failures that we will discuss involves inconsistent
networks -- if there is a netsplit (a disconnection between two servers that
splits the network into two pieces), Ceph might be unable to mark OSDs ``down``
and remove them from the acting PG sets. This failure to mark OSDs ``down``
will occur, despite the fact that the primary PG is unable to replicate data (a
situation that, under normal non-netsplit circumstances, would result in the
marking of affected OSDs as ``down`` and their removal from the PG). If this
happens, Ceph will be unable to satisfy its durability guarantees and
consequently IO will not be permitted.
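
During a suspected netsplit, the usual inspection commands can show whether
OSDs have been marked ``down`` and whether any PGs have stopped serving IO.
This is only a sketch; none of these commands are specific to stretch clusters:

.. prompt:: bash $

   ceph health detail            # reports down OSDs and inactive PGs, if any
   ceph osd tree                 # shows which OSDs the monitors consider up or down
   ceph pg dump_stuck inactive   # lists PGs that are not currently serving IO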
The second category of failures that we will discuss involves the situation in
which the constraints are not sufficient to guarantee the replication of data
across data centers, though it might seem that the data is correctly replicated
across data centers. For example, in a scenario in which there are two data
centers named Data Center A and Data Center B, and the CRUSH rule targets three
replicas and places a replica in each data center with a ``min_size`` of ``2``,
the PG might go active with two replicas in Data Center A and zero replicas in
Data Center B. In a situation of this kind, the loss of Data Center A means
that the data is lost and Ceph will not be able to operate on it. This
situation is surprisingly difficult to avoid using only standard CRUSH rules.
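
To see how this can happen, it can help to look at where a PG's replicas have
actually been placed. For example (``mypool`` and the PG ID ``1.0`` are
placeholders):

.. prompt:: bash $

   ceph osd pool get mypool size       # e.g. 3 replicas requested
   ceph osd pool get mypool min_size   # e.g. 2
   ceph pg map 1.0                     # shows the OSDs in the PG's up and acting sets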
Stretch Mode
============
Stretch mode is designed to handle the two-site case: deployments in which you
cannot guarantee the replication of data across two data centers. (Three-site
deployments are just as susceptible to netsplit issues, but they are much more
tolerant of component availability outages than two-site clusters are.) The
problematic situation can arise when the cluster's CRUSH rule specifies that
three copies are to be made, but a copy is placed in each data center with a
``min_size`` of 2. Under such conditions, a placement group can become active
with two copies in the first data center and no copies in the second data
center.
Entering Stretch Mode
---------------------

To enable stretch mode, you must set the location of each monitor, matching
your CRUSH map. This procedure shows how to do this.

#. Place ``mon.a`` in your first data center:

   .. prompt:: bash $

      ceph mon set_location a datacenter=site1

#. Generate a CRUSH rule that places two copies in each data center.
   This requires editing the CRUSH map directly:

   .. prompt:: bash $

      ceph osd getcrushmap > crush.map.bin
      crushtool -d crush.map.bin -o crush.map.txt

#. Edit the ``crush.map.txt`` file to add a new rule. Here there is only one
   other rule (``id 1``), but you might need to use a different rule ID. We
   have two data-center buckets named ``site1`` and ``site2``:

   ::

       rule stretch_rule {
               id 1
               min_size 1
               max_size 10
               type replicated
               step take site1
               step chooseleaf firstn 2 type host
               step emit
               step take site2
               step chooseleaf firstn 2 type host
               step emit
       }

#. Inject the CRUSH map to make the rule available to the cluster:

   .. prompt:: bash $

      crushtool -c crush.map.txt -o crush2.map.bin
      ceph osd setcrushmap -i crush2.map.bin

#. Run the monitors in connectivity mode. See `Changing Monitor Elections`_.

#. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the
   tiebreaker monitor and we are splitting across data centers. The tiebreaker
   monitor must be assigned a data center that is neither ``site1`` nor
   ``site2``. For this purpose you can create another data-center bucket named
   ``site3`` in your CRUSH and place ``mon.e`` there:

   .. prompt:: bash $

      ceph mon set_location e datacenter=site3
      ceph mon enable_stretch_mode e stretch_rule datacenter
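
After running these steps, you may want to confirm that the rule and the
monitor locations look as expected. The following is only a sketch, and the
exact output varies by release; ``stretch_rule`` and the bucket names are the
ones used in the example above:

.. prompt:: bash $

   ceph osd crush rule ls   # the new stretch_rule should be listed
   ceph mon dump            # each monitor should report its data-center location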
When stretch mode is enabled, PGs will become active only when they peer
across data centers (or across whichever CRUSH bucket type was specified),
assuming both are alive. Pools will increase in size from the default ``3`` to
``4``, and two copies will be expected in each site. OSDs will be allowed to
connect to monitors only if they are in the same data center as the monitors.
New monitors will not be allowed to join the cluster if they do not specify a
location.
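
As a quick check (a sketch only; output formatting differs between releases),
you can confirm the new replication settings on the cluster's pools:

.. prompt:: bash $

   ceph osd pool ls detail   # replicated pools should now report size 4 and min_size 2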
If all OSDs and monitors in one of the data centers become inaccessible at once,
the surviving data center enters a "degraded stretch mode". A warning will be
issued, the ``min_size`` will be reduced to ``1``, and the cluster will be
allowed to go active with the data in the single remaining site. The pool size
does not change, so warnings will be generated that report that the pools are
too small -- but a special stretch mode flag will prevent the OSDs from
creating extra copies in the remaining data center. This means that the data
center will keep only two copies, just as before.
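
To observe this state, the standard health commands are sufficient (shown here
only as a sketch; the exact warning text depends on the release):

.. prompt:: bash $

   ceph status          # summary, including the stretch mode health warning
   ceph health detail   # details of the degraded stretch mode warning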
When the missing data center comes back, the cluster will enter a "recovery
stretch mode". This changes the warning and allows peering, but requires OSDs
only from the data center that was ``up`` throughout the duration of the
downtime. When all PGs are in a known state, and are neither degraded nor
incomplete, the cluster transitions back to regular stretch mode, ends the
warning, restores ``min_size`` to its original value (``2``), requires both
sites to peer, and no longer requires the site that was up throughout the
duration of the downtime when peering (which makes failover to the other site
possible, if needed).
.. _Changing Monitor elections: ../change-mon-elections
Limitations of Stretch Mode
===========================
When using stretch mode, OSDs must be located at exactly two sites.
Two monitors should be run in each data center, plus a tiebreaker in a third
(or in the cloud) for a total of five monitors. While in stretch mode, OSDs
will connect only to monitors within the data center in which they are located.
OSDs *DO NOT* connect to the tiebreaker monitor.
Erasure-coded pools cannot be used with stretch mode. Attempts to use
erasure-coded pools with stretch mode will fail, and erasure-coded pools cannot
be created while stretch mode is active.
To use stretch mode, you will need to create a CRUSH rule that provides two
replicas in each data center. Ensure that there are four total replicas: two in
each data center. If pools exist in the cluster that do not have the default
``size`` or ``min_size``, Ceph will not enter stretch mode. An example of such
a CRUSH rule is given above.
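
Before enabling stretch mode, it can be worth confirming that no pool carries a
non-default ``size`` or ``min_size``. This is only a sketch; ``rbd`` is an
example pool name:

.. prompt:: bash $

   ceph osd pool ls detail      # lists size and min_size for every pool
   ceph osd pool get rbd size   # expect the default of 3 before entering stretch mode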
Because stretch mode runs with ``min_size`` set to ``1`` when degraded, we
recommend enabling stretch mode only when using OSDs on SSDs (including NVMe
OSDs). Hybrid HDD+SSD or HDD-only OSDs are not recommended, because of the long
time they take to recover after connectivity between data centers has been
restored; minimizing that recovery time minimizes the potential for data loss.
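
One way to check which kinds of devices back your OSDs is to look at their
CRUSH device classes (a sketch; the class names are whatever your deployment
assigned):

.. prompt:: bash $

   ceph osd crush class ls   # e.g. ["hdd", "ssd", "nvme"]
   ceph osd tree             # the CLASS column shows the device class of each OSD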
In the future, stretch mode might support erasure-coded pools and might support
deployments that have more than two data centers.
Other commands
==============
Replacing a failed tiebreaker monitor
-------------------------------------
Turn on a new monitor and run the following command:
.. prompt:: bash $
ceph mon set_new_tiebreaker mon.<new_mon_name>
This command protests if the new monitor is in the same location as the
existing non-tiebreaker monitors. **This command WILL NOT remove the previous
tiebreaker monitor.** Remove the previous tiebreaker monitor yourself.
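
For example, if the failed tiebreaker was ``mon.e`` and the replacement is
``mon.f`` (both names are assumptions for illustration), the sequence might
look like this:

.. prompt:: bash $

   ceph mon set_new_tiebreaker mon.f   # promote the new tiebreaker
   ceph mon remove e                   # then remove the old tiebreaker yourself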
Using "--set-crush-location" and not "ceph mon set_location"
------------------------------------------------------------
If you write your own tooling for deploying Ceph, use the
``--set-crush-location`` option when booting monitors instead of running ``ceph
mon set_location``. This option accepts only a single ``bucket=loc`` pair (for
example, ``ceph-mon --set-crush-location 'datacenter=a'``), and that pair must
match the bucket type that was specified when running ``enable_stretch_mode``.
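
As a hypothetical example of what such tooling might run when starting a
monitor daemon (the monitor ID and bucket name here are assumptions):

.. prompt:: bash $

   ceph-mon -i e --set-crush-location 'datacenter=site3'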
Forcing recovery stretch mode
-----------------------------
When in stretch degraded mode, the cluster will go into "recovery" mode
automatically when the disconnected data center comes back. If that does not
happen or you want to enable recovery mode early, run the following command:
.. prompt:: bash $
ceph osd force_recovery_stretch_mode --yes-i-really-mean-it
This command should not be necessary; it is included to deal with
unanticipated situations.
Forcing normal stretch mode
---------------------------
When in recovery mode, the cluster should go back into normal stretch mode when
the PGs are healthy. If this fails to happen or if you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), run the following command:
.. prompt:: bash $
ceph osd force_healthy_stretch_mode --yes-i-really-mean-it
This command should not be necessary, but you might wish to invoke it to remove
the ``HEALTH_WARN`` state that recovery mode generates.
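
Afterwards, a quick health check (again, only a sketch) shows whether the
recovery-mode warning has cleared, assuming no unrelated warnings are present:

.. prompt:: bash $

   ceph health   # should return HEALTH_OK once normal stretch mode has resumed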