update ceph source to reef 18.2.1

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Thomas Lamprecht 2023-12-19 09:13:36 +01:00
parent 27f45121cc
commit aee94f6923
1362 changed files with 78535 additions and 33339 deletions

View File

@@ -22,7 +22,9 @@
## Contribution Guidelines
- To sign and title your commits, please refer to [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/main/SubmittingPatches.rst).
-- If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to [Submitting Patches to Ceph - Backports](https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst) for the proper workflow.
+- If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to [Submitting Patches to Ceph - Backports](https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst) for the proper workflow.
+- When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.

## Checklist
- Tracker (select at least one)

@@ -62,4 +64,5 @@
- `jenkins test ceph-volume all`
- `jenkins test ceph-volume tox`
- `jenkins test windows`
+- `jenkins test rook e2e`
</details>

View File

@@ -1,7 +1,7 @@
cmake_minimum_required(VERSION 3.16)
project(ceph
-  VERSION 18.2.0
+  VERSION 18.2.1
  LANGUAGES CXX C ASM)
cmake_policy(SET CMP0028 NEW)

View File

@ -1,3 +1,53 @@
>=19.0.0
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
multi-site. Previously, the replicas of such objects were corrupted on decryption.
A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
identify these original multipart uploads. The ``LastModified`` timestamp of any
identified object is incremented by 1ns to cause peer zones to replicate it again.
For multi-site deployments that make any use of Server-Side Encryption, we
recommend running this command against every bucket in every zone after all
zones have been upgraded.
* CEPHFS: The MDS now evicts clients that are not advancing their request tids, since
such clients cause a large buildup of session metadata, which in turn results in the
MDS going read-only because the RADOS operation exceeds the size threshold. The
`mds_session_metadata_threshold` config option controls the maximum size that the
(encoded) session metadata can grow to.
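  For example, a sketch of inspecting and adjusting the threshold with the
  standard config commands (the value shown is purely illustrative, not a
  recommended default):

      ceph config get mds mds_session_metadata_threshold
      ceph config set mds mds_session_metadata_threshold 16777216   # example value, in bytes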
* RGW: New tools have been added to radosgw-admin for identifying and
correcting issues with versioned bucket indexes. Historical bugs with the
versioned bucket index transaction workflow made it possible for the index
to accumulate extraneous "book-keeping" olh entries and plain placeholder
entries. In some specific scenarios where clients made concurrent requests
referencing the same object key, it was likely that a lot of extra index
entries would accumulate. When a significant number of these entries are
present in a single bucket index shard, they can cause high bucket listing
latencies and lifecycle processing failures. To check whether a versioned
bucket has unnecessary olh entries, users can now run ``radosgw-admin
bucket check olh``. If the ``--fix`` flag is used, the extra entries will
be safely removed. Separately from the issue described thus far, it is
also possible that some versioned buckets are maintaining extra unlinked
objects that are not listable from the S3/Swift APIs. These extra objects
are typically a result of PUT requests that exited abnormally, in the middle
of a bucket index transaction - so the client would not have received a
successful response. Bugs in prior releases made these unlinked objects easy
to reproduce with any PUT request that was made on a bucket that was actively
resharding. Besides the extra space that these hidden, unlinked objects
consume, there can be another side effect in certain scenarios, caused by
the nature of the failure mode that produced them, where a client of a bucket
that was a victim of this bug may find the object associated with the key to
be in an inconsistent state. To check whether a versioned bucket has unlinked
entries, users can now run ``radosgw-admin bucket check unlinked``. If the
``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
a third issue made it possible for versioned bucket index stats to be
accounted inaccurately. The tooling for recalculating versioned bucket stats
also had a bug, and was not previously capable of fixing these inaccuracies.
This release resolves those issues and users can now expect that the existing
``radosgw-admin bucket check`` command will produce correct results. We
recommend that users with versioned buckets, especially those that existed
on prior releases, use these new tools to check whether their buckets are
affected and to clean them up accordingly.
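  As an illustrative sketch (assuming both subcommands accept the usual
  ``--bucket`` option; consult ``radosgw-admin help`` for the exact flags on
  your release), a check-then-fix pass on a single, hypothetical bucket might
  look like:

      # report-only passes
      radosgw-admin bucket check olh --bucket=mybucket
      radosgw-admin bucket check unlinked --bucket=mybucket
      # remove the extraneous entries and unlinked objects
      radosgw-admin bucket check olh --bucket=mybucket --fix
      radosgw-admin bucket check unlinked --bucket=mybucket --fix
      # recalculate the bucket stats afterwards
      radosgw-admin bucket check --bucket=mybucket --fix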
* mgr/snap-schedule: For clusters with multiple CephFS file systems, all the
snap-schedule commands now expect the '--fs' argument.
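  For example (a sketch; the file system name and the path are placeholders):

      ceph fs snap-schedule add /volumes/group/subvol 1h --fs cephfs1
      ceph fs snap-schedule status /volumes/group/subvol --fs cephfs1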
>=18.0.0
* The RGW policy parser now rejects unknown principals by default. If you are

@@ -171,6 +221,11 @@
  https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/
* CEPHFS: After recovering a Ceph File System post following the disaster recovery
  procedure, the recovered files under `lost+found` directory can now be deleted.
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
* mgr/snap_schedule: The snap-schedule mgr module now retains one less snapshot
than the number mentioned against the config tunable `mds_max_snaps_per_dir`
so that a new snapshot can be created and retained during the next schedule
run.
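  For example (a sketch using the standard config command; the default shown is
  only illustrative):

      ceph config get mds mds_max_snaps_per_dir   # e.g. 100, so at most 99 scheduled snapshots are retained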
>=17.2.1

View File

@@ -1,23 +1,25 @@
# Ceph - a scalable distributed storage system

-Please see https://ceph.com/ for current info.
+See https://ceph.com/ for current information about Ceph.

## Contributing Code

-Most of Ceph is dual licensed under the LGPL version 2.1 or 3.0. Some
-miscellaneous code is under a BSD-style license or is public domain.
-The documentation is licensed under Creative Commons
-Attribution Share Alike 3.0 (CC-BY-SA-3.0). There are a handful of headers
-included here that are licensed under the GPL. Please see the file
-COPYING for a full inventory of licenses by file.
+Most of Ceph is dual-licensed under the LGPL version 2.1 or 3.0. Some
+miscellaneous code is either public domain or licensed under a BSD-style
+license.
+
+The Ceph documentation is licensed under Creative Commons Attribution Share
+Alike 3.0 (CC-BY-SA-3.0).
+
+Some headers included in the `ceph/ceph` repository are licensed under the GPL.
+See the file `COPYING` for a full inventory of licenses by file.

-Code contributions must include a valid "Signed-off-by" acknowledging
-the license for the modified or contributed file. Please see the file
-SubmittingPatches.rst for details on what that means and on how to
-generate and submit patches.
+All code contributions must include a valid "Signed-off-by" line. See the file
+`SubmittingPatches.rst` for details on this and instructions on how to generate
+and submit patches.

-We do not require assignment of copyright to contribute code; code is
+Assignment of copyright is not required to contribute code. Code is
contributed under the terms of the applicable license.
@@ -33,10 +35,11 @@ command on a system that has git installed:
    git clone https://github.com/ceph/ceph.git

-When the ceph/ceph repository has been cloned to your system, run the following
-command to check out the git submodules associated with the ceph/ceph
-repository:
+When the `ceph/ceph` repository has been cloned to your system, run the
+following commands to move into the cloned `ceph/ceph` repository and to check
+out the git submodules associated with it:

+    cd ceph
    git submodule update --init --recursive
@@ -63,34 +66,42 @@ Install the ``python3-routes`` package:
These instructions are meant for developers who are compiling the code for
development and testing. To build binaries that are suitable for installation
-we recommend that you build .deb or .rpm packages, or refer to ``ceph.spec.in``
-or ``debian/rules`` to see which configuration options are specified for
-production builds.
+we recommend that you build `.deb` or `.rpm` packages, or refer to
+``ceph.spec.in`` or ``debian/rules`` to see which configuration options are
+specified for production builds.

-Build instructions:
+To build Ceph, make sure that you are in the top-level `ceph` directory that
+contains `do_cmake.sh` and `CONTRIBUTING.rst` and run the following commands:

    ./do_cmake.sh
    cd build
    ninja

-``do_cmake.sh`` defaults to creating a debug build of Ceph that can be up to 5x
-slower with some workloads. Pass ``-DCMAKE_BUILD_TYPE=RelWithDebInfo`` to
-``do_cmake.sh`` to create a non-debug release.
+``do_cmake.sh`` by default creates a "debug build" of Ceph, which can be up to
+five times slower than a non-debug build. Pass
+``-DCMAKE_BUILD_TYPE=RelWithDebInfo`` to ``do_cmake.sh`` to create a non-debug
+build.

-The number of jobs used by `ninja` is derived from the number of CPU cores of
-the building host if unspecified. Use the `-j` option to limit the job number
-if the build jobs are running out of memory. On average, each job takes around
-2.5GiB memory.
+[Ninja](https://ninja-build.org/) is the buildsystem used by the Ceph project
+to build test builds. The number of jobs used by `ninja` is derived from the
+number of CPU cores of the building host if unspecified. Use the `-j` option to
+limit the job number if the build jobs are running out of memory. If you
+attempt to run `ninja` and receive a message that reads `g++: fatal error:
+Killed signal terminated program cc1plus`, then you have run out of memory.
+Using the `-j` option with an argument appropriate to the hardware on which the
+`ninja` command is run is expected to result in a successful build. For example,
+to limit the job number to 3, run the command `ninja -j 3`. On average, each
+`ninja` job run in parallel needs approximately 2.5 GiB of RAM.

-This assumes that you make your build directory a subdirectory of the ceph.git
-checkout. If you put it elsewhere, just point `CEPH_GIT_DIR` to the correct
-path to the checkout. Additional CMake args can be specified by setting ARGS
-before invoking ``do_cmake.sh``. See [cmake options](#cmake-options)
-for more details. For example:
+This documentation assumes that your build directory is a subdirectory of the
+`ceph.git` checkout. If the build directory is located elsewhere, point
+`CEPH_GIT_DIR` to the correct path of the checkout. Additional CMake args can
+be specified by setting ARGS before invoking ``do_cmake.sh``. See [cmake
+options](#cmake-options) for more details. For example:

    ARGS="-DCMAKE_C_COMPILER=gcc-7" ./do_cmake.sh

-To build only certain targets use:
+To build only certain targets, run a command of the following form:

    ninja [target name]
@@ -153,24 +164,25 @@ are committed to git.)
## Running a test cluster

-To run a functional test cluster,
+From the `ceph/` directory, run the following commands to launch a test Ceph
+cluster:

    cd build
    ninja vstart        # builds just enough to run vstart
    ../src/vstart.sh --debug --new -x --localhost --bluestore
    ./bin/ceph -s

-Almost all of the usual commands are available in the bin/ directory.
-For example,
+Most Ceph commands are available in the `bin/` directory. For example:

+    ./bin/rados -p rbd bench 30 write
    ./bin/rbd create foo --size 1000
-    ./bin/rados -p foo bench 30 write

-To shut down the test cluster,
+To shut down the test cluster, run the following command from the `build/`
+directory:

    ../src/stop.sh

-To start or stop individual daemons, the sysvinit script can be used:
+Use the sysvinit script to start or stop individual daemons:

    ./bin/init-ceph restart osd.0
    ./bin/init-ceph stop

View File

@@ -170,7 +170,7 @@
# main package definition
#################################################################################
Name: ceph
-Version: 18.2.0
+Version: 18.2.1
Release: 0%{?dist}
%if 0%{?fedora} || 0%{?rhel}
Epoch: 2

@@ -186,7 +186,7 @@ License: LGPL-2.1 and LGPL-3.0 and CC-BY-SA-3.0 and GPL-2.0 and BSL-1.0 and BSD-
Group: System/Filesystems
%endif
URL: http://ceph.com/
-Source0: %{?_remote_tarball_prefix}ceph-18.2.0.tar.bz2
+Source0: %{?_remote_tarball_prefix}ceph-18.2.1.tar.bz2
%if 0%{?suse_version}
# _insert_obs_source_lines_here
ExclusiveArch: x86_64 aarch64 ppc64le s390x

@@ -1292,7 +1292,7 @@ This package provides a Ceph MIB for SNMP traps.
# common
#################################################################################
%prep
-%autosetup -p1 -n ceph-18.2.0
+%autosetup -p1 -n ceph-18.2.1
%build
# Disable lto on systems that do not support symver attribute

View File

@@ -1,7 +1,13 @@
-ceph (18.2.0-1jammy) jammy; urgency=medium
+ceph (18.2.1-1jammy) jammy; urgency=medium

- -- Jenkins Build Slave User <jenkins-build@braggi17.front.sepia.ceph.com>  Thu, 03 Aug 2023 18:57:50 +0000
+ -- Jenkins Build Slave User <jenkins-build@braggi13.front.sepia.ceph.com>  Mon, 11 Dec 2023 22:07:48 +0000
+
+ceph (18.2.1-1) stable; urgency=medium
+
+  * New upstream release
+
+ -- Ceph Release Team <ceph-maintainers@ceph.io>  Mon, 11 Dec 2023 21:55:36 +0000

ceph (18.2.0-1) stable; urgency=medium

View File

@@ -1 +0,0 @@
-README

View File

@@ -1,3 +1,4 @@
+#!/bin/sh
# vim: set noet ts=8:
# postinst script for ceph-mon
#

View File

@@ -1,3 +1,4 @@
+#!/bin/sh
# vim: set noet ts=8:
# postinst script for ceph-osd
#

View File

@@ -1 +1 @@
-9
+12

View File

@@ -4,7 +4,7 @@ Priority: optional
Homepage: http://ceph.com/
Vcs-Git: git://github.com/ceph/ceph.git
Vcs-Browser: https://github.com/ceph/ceph
-Maintainer: Ceph Maintainers <ceph-maintainers@lists.ceph.com>
+Maintainer: Ceph Maintainers <ceph-maintainers@ceph.io>
Uploaders: Ken Dreyer <kdreyer@redhat.com>,
           Alfredo Deza <adeza@redhat.com>,
Build-Depends: automake,

@@ -20,8 +20,7 @@ Build-Depends: automake,
               git,
               golang,
               gperf,
-              g++ (>= 7),
-              hostname <pkg.ceph.check>,
+              g++ (>= 11),
               javahelper,
               jq <pkg.ceph.check>,
               jsonnet <pkg.ceph.check>,

@@ -135,9 +134,6 @@ Package: ceph-base
Architecture: linux-any
Depends: binutils,
         ceph-common (= ${binary:Version}),
-        debianutils,
-        findutils,
-        grep,
         logrotate,
         parted,
         psmisc,

@@ -187,8 +183,9 @@ Description: debugging symbols for ceph-base
Package: cephadm
Architecture: linux-any
-Recommends: podman (>= 2.0.2) | docker.io
+Recommends: podman (>= 2.0.2) | docker.io | docker-ce
Depends: lvm2,
+        python3,
        ${python3:Depends},
Description: cephadm utility to bootstrap ceph daemons with systemd and containers
 Ceph is a massively scalable, open-source, distributed

@@ -431,7 +428,6 @@ Depends: ceph-osd (= ${binary:Version}),
         e2fsprogs,
         lvm2,
         parted,
-        util-linux,
         xfsprogs,
         ${misc:Depends},
         ${python3:Depends}

@@ -759,7 +755,7 @@ Architecture: any
Section: debug
Priority: extra
Depends: libsqlite3-mod-ceph (= ${binary:Version}),
-        libsqlite3-0-dbgsym
+        libsqlite3-0-dbgsym,
        ${misc:Depends},
Description: debugging symbols for libsqlite3-mod-ceph
 A SQLite3 VFS for storing and manipulating databases stored on Ceph's RADOS

@@ -1207,14 +1203,14 @@ Description: Java Native Interface library for CephFS Java bindings
Package: rados-objclass-dev
Architecture: linux-any
Section: libdevel
-Depends: librados-dev (= ${binary:Version}) ${misc:Depends},
+Depends: librados-dev (= ${binary:Version}), ${misc:Depends},
Description: RADOS object class development kit.
 .
 This package contains development files needed for building RADOS object class plugins.

Package: cephfs-shell
Architecture: all
-Depends: ${misc:Depends}
+Depends: ${misc:Depends},
        ${python3:Depends}
Description: interactive shell for the Ceph distributed file system
 Ceph is a massively scalable, open-source, distributed

@@ -1227,7 +1223,7 @@ Description: interactive shell for the Ceph distributed file system
Package: cephfs-top
Architecture: all
-Depends: ${misc:Depends}
+Depends: ${misc:Depends},
        ${python3:Depends}
Description: This package provides a top(1) like utility to display various
 filesystem metrics in realtime.

View File

@@ -1,6 +1,6 @@
-Format-Specification: http://anonscm.debian.org/viewvc/dep/web/deps/dep5/copyright-format.xml?revision=279&view=markup
-Name: ceph
-Maintainer: Sage Weil <sage@newdream.net>
+Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
+Upstream-Name: ceph
+Upstream-Contact: Ceph Developers <dev@ceph.io>
Source: http://ceph.com/

Files: *

@@ -180,3 +180,553 @@ Files: src/include/timegm.h
Copyright (C) Copyright Howard Hinnant
Copyright (C) Copyright 2010-2011 Vicente J. Botet Escriba
License: Boost Software License, Version 1.0
License: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
.
http://www.apache.org/licenses/LICENSE-2.0
.
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
.
The complete text of the Apache License, Version 2.0
can be found in "/usr/share/common-licenses/Apache-2.0".
License: GPL-2
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
.
On Debian systems, the complete text of the GNU General Public License
version 2 can be found in `/usr/share/common-licenses/GPL-2' file.
License: GPL-2+
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 2 of the License, or
(at your option) any later version.
.
This package is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
.
On Debian systems, the complete text of the GNU General Public License
version 2 can be found in `/usr/share/common-licenses/GPL-2'.
License: GPL-3+
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.
.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
.
On Debian systems, the complete text of the GNU General Public
License version 3 can be found in `/usr/share/common-licenses/GPL-3'.
License: LGPL-2.1
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License version 2.1 as published by the Free Software Foundation.
.
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
.
You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
.
On Debian systems, the complete text of the GNU Lesser General
Public License can be found in `/usr/share/common-licenses/LGPL-2.1'.
License: LGPL-2.1+
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
.
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
.
You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
.
On Debian systems, the complete text of the GNU Lesser General
Public License can be found in `/usr/share/common-licenses/LGPL-2.1'.
License: LGPL-2+
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License version 2 (or later) as published by the Free Software Foundation.
.
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
.
You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
.
On Debian systems, the complete text of the GNU Lesser General
Public License 2 can be found in `/usr/share/common-licenses/LGPL-2'.
License: MIT
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
.
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
License: CC-BY-SA-3.0
Creative Commons Attribution-ShareAlike 3.0 Unported
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS LICENSE DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS INFORMATION
ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE
INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
ITS USE.
License
THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE
COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY
COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS
AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.
BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE
TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY
BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS
CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND
CONDITIONS.
1. Definitions
a. "Adaptation" means a work based upon the Work, or upon the Work and
other pre-existing works, such as a translation, adaptation, derivative
work, arrangement of music or other alterations of a literary or
artistic work, or phonogram or performance and includes cinematographic
adaptations or any other form in which the Work may be recast,
transformed, or adapted including in any form recognizably derived from
the original, except that a work that constitutes a Collection will not
be considered an Adaptation for the purpose of this License. For the
avoidance of doubt, where the Work is a musical work, performance or
phonogram, the synchronization of the Work in timed-relation with a
moving image ("synching") will be considered an Adaptation for the
purpose of this License.
b. "Collection" means a collection of literary or artistic works, such
as encyclopedias and anthologies, or performances, phonograms or
broadcasts, or other works or subject matter other than works listed in
Section 1(f) below, which, by reason of the selection and arrangement of
their contents, constitute intellectual creations, in which the Work is
included in its entirety in unmodified form along with one or more other
contributions, each constituting separate and independent works in
themselves, which together are assembled into a collective whole. A work
that constitutes a Collection will not be considered an Adaptation (as
defined below) for the purposes of this License.
c. "Creative Commons Compatible License" means a license that is listed
at http://creativecommons.org/compatiblelicenses that has been approved
by Creative Commons as being essentially equivalent to this License,
including, at a minimum, because that license: (i) contains terms that
have the same purpose, meaning and effect as the License Elements of
this License; and, (ii) explicitly permits the relicensing of
adaptations of works made available under that license under this
License or a Creative Commons jurisdiction license with the same License
Elements as this License.
d. "Distribute" means to make available to the public the original and
copies of the Work or Adaptation, as appropriate, through sale or other
transfer of ownership.
e. "License Elements" means the following high-level license attributes
as selected by Licensor and indicated in the title of this License:
Attribution, ShareAlike.
f. "Licensor" means the individual, individuals, entity or entities that
offer(s) the Work under the terms of this License.
g. "Original Author" means, in the case of a literary or artistic work,
the individual, individuals, entity or entities who created the Work or
if no individual or entity can be identified, the publisher; and in
addition (i) in the case of a performance the actors, singers,
musicians, dancers, and other persons who act, sing, deliver, declaim,
play in, interpret or otherwise perform literary or artistic works or
expressions of folklore; (ii) in the case of a phonogram the producer
being the person or legal entity who first fixes the sounds of a
performance or other sounds; and, (iii) in the case of broadcasts, the
organization that transmits the broadcast.
h. "Work" means the literary and/or artistic work offered under the
terms of this License including without limitation any production in the
literary, scientific and artistic domain, whatever may be the mode or
form of its expression including digital form, such as a book, pamphlet
and other writing; a lecture, address, sermon or other work of the same
nature; a dramatic or dramatico-musical work; a choreographic work or
entertainment in dumb show; a musical composition with or without words;
a cinematographic work to which are assimilated works expressed by a
process analogous to cinematography; a work of drawing, painting,
architecture, sculpture, engraving or lithography; a photographic work
to which are assimilated works expressed by a process analogous to
photography; a work of applied art; an illustration, map, plan, sketch
or three-dimensional work relative to geography, topography,
architecture or science; a performance; a broadcast; a phonogram; a
compilation of data to the extent it is protected as a copyrightable
work; or a work performed by a variety or circus performer to the extent
it is not otherwise considered a literary or artistic work.
i. "You" means an individual or entity exercising rights under this
License who has not previously violated the terms of this License with
respect to the Work, or who has received express permission from the
Licensor to exercise rights under this License despite a previous
violation.
j. "Publicly Perform" means to perform public recitations of the Work
and to communicate to the public those public recitations, by any means
or process, including by wire or wireless means or public digital
performances; to make available to the public Works in such a way that
members of the public may access these Works from a place and at a place
individually chosen by them; to perform the Work to the public by any
means or process and the communication to the public of the performances
of the Work, including by public digital performance; to broadcast and
rebroadcast the Work by any means including signs, sounds or images.
k. "Reproduce" means to make copies of the Work by any means including
without limitation by sound or visual recordings and the right of
fixation and reproducing fixations of the Work, including storage of a
protected performance or phonogram in digital form or other electronic
medium.
2. Fair Dealing Rights. Nothing in this License is intended to reduce,
limit, or restrict any uses free from copyright or rights arising from
limitations or exceptions that are provided for in connection with the
copyright protection under copyright law or other applicable laws.
3. License Grant. Subject to the terms and conditions of this License,
Licensor hereby grants You a worldwide, royalty-free, non-exclusive,
perpetual (for the duration of the applicable copyright) license to
exercise the rights in the Work as stated below:
a. to Reproduce the Work, to incorporate the Work into one or more
Collections, and to Reproduce the Work as incorporated in the
Collections;
b. to create and Reproduce Adaptations provided that any such
Adaptation, including any translation in any medium, takes reasonable
steps to clearly label, demarcate or otherwise identify that changes
were made to the original Work. For example, a translation could be
marked "The original work was translated from English to Spanish," or a
modification could indicate "The original work has been modified.";
c. to Distribute and Publicly Perform the Work including as incorporated
in Collections; and,
d. to Distribute and Publicly Perform Adaptations.
e. For the avoidance of doubt:
i. Non-waivable Compulsory License Schemes. In those jurisdictions in
which the right to collect royalties through any statutory or compulsory
licensing scheme cannot be waived, the Licensor reserves the exclusive
right to collect such royalties for any exercise by You of the rights
granted under this License;
ii. Waivable Compulsory License Schemes. In those jurisdictions in which
the right to collect royalties through any statutory or compulsory
licensing scheme can be waived, the Licensor waives the exclusive right
to collect such royalties for any exercise by You of the rights granted
under this License; and,
iii. Voluntary License Schemes. The Licensor waives the right to collect
royalties, whether individually or, in the event that the Licensor is a
member of a collecting society that administers voluntary licensing
schemes, via that society, from any exercise by You of the rights
granted under this License.
The above rights may be exercised in all media and formats whether now
known or hereafter devised. The above rights include the right to make
such modifications as are technically necessary to exercise the rights
in other media and formats. Subject to Section 8(f), all rights not
expressly granted by Licensor are hereby reserved.
4. Restrictions. The license granted in Section 3 above is expressly
made subject to and limited by the following restrictions:
a. You may Distribute or Publicly Perform the Work only under the terms
of this License. You must include a copy of, or the Uniform Resource
Identifier (URI) for, this License with every copy of the Work You
Distribute or Publicly Perform. You may not offer or impose any terms on
the Work that restrict the terms of this License or the ability of the
recipient of the Work to exercise the rights granted to that recipient
under the terms of the License. You may not sublicense the Work. You
must keep intact all notices that refer to this License and to the
disclaimer of warranties with every copy of the Work You Distribute or
Publicly Perform. When You Distribute or Publicly Perform the Work, You
may not impose any effective technological measures on the Work that
restrict the ability of a recipient of the Work from You to exercise the
rights granted to that recipient under the terms of the License. This
Section 4(a) applies to the Work as incorporated in a Collection, but
this does not require the Collection apart from the Work itself to be
made subject to the terms of this License. If You create a Collection,
upon notice from any Licensor You must, to the extent practicable,
remove from the Collection any credit as required by Section 4(c), as
requested. If You create an Adaptation, upon notice from any Licensor
You must, to the extent practicable, remove from the Adaptation any
credit as required by Section 4(c), as requested.
b. You may Distribute or Publicly Perform an Adaptation only under the
terms of: (i) this License; (ii) a later version of this License with
the same License Elements as this License; (iii) a Creative Commons
jurisdiction license (either this or a later license version) that
contains the same License Elements as this License (e.g.,
Attribution-ShareAlike 3.0 US)); (iv) a Creative Commons Compatible
License. If you license the Adaptation under one of the licenses
mentioned in (iv), you must comply with the terms of that license. If
you license the Adaptation under the terms of any of the licenses
mentioned in (i), (ii) or (iii) (the "Applicable License"), you must
comply with the terms of the Applicable License generally and the
following provisions: (I) You must include a copy of, or the URI for,
the Applicable License with every copy of each Adaptation You Distribute
or Publicly Perform; (II) You may not offer or impose any terms on the
Adaptation that restrict the terms of the Applicable License or the
ability of the recipient of the Adaptation to exercise the rights
granted to that recipient under the terms of the Applicable License;
(III) You must keep intact all notices that refer to the Applicable
License and to the disclaimer of warranties with every copy of the Work
as included in the Adaptation You Distribute or Publicly Perform; (IV)
when You Distribute or Publicly Perform the Adaptation, You may not
impose any effective technological measures on the Adaptation that
restrict the ability of a recipient of the Adaptation from You to
exercise the rights granted to that recipient under the terms of the
Applicable License. This Section 4(b) applies to the Adaptation as
incorporated in a Collection, but this does not require the Collection
apart from the Adaptation itself to be made subject to the terms of the
Applicable License.
c. If You Distribute, or Publicly Perform the Work or any Adaptations or
Collections, You must, unless a request has been made pursuant to
Section 4(a), keep intact all copyright notices for the Work and
provide, reasonable to the medium or means You are utilizing: (i) the
name of the Original Author (or pseudonym, if applicable) if supplied,
and/or if the Original Author and/or Licensor designate another party or
parties (e.g., a sponsor institute, publishing entity, journal) for
attribution ("Attribution Parties") in Licensor's copyright notice,
terms of service or by other reasonable means, the name of such party or
parties; (ii) the title of the Work if supplied; (iii) to the extent
reasonably practicable, the URI, if any, that Licensor specifies to be
associated with the Work, unless such URI does not refer to the
copyright notice or licensing information for the Work; and (iv) ,
consistent with Ssection 3(b), in the case of an Adaptation, a credit
identifying the use of the Work in the Adaptation (e.g., "French
translation of the Work by Original Author," or "Screenplay based on
original Work by Original Author"). The credit required by this Section
4(c) may be implemented in any reasonable manner; provided, however,
that in the case of a Adaptation or Collection, at a minimum such credit
will appear, if a credit for all contributing authors of the Adaptation
or Collection appears, then as part of these credits and in a manner at
least as prominent as the credits for the other contributing authors.
For the avoidance of doubt, You may only use the credit required by this
Section for the purpose of attribution in the manner set out above and,
by exercising Your rights under this License, You may not implicitly or
explicitly assert or imply any connection with, sponsorship or
endorsement by the Original Author, Licensor and/or Attribution Parties,
as appropriate, of You or Your use of the Work, without the separate,
express prior written permission of the Original Author, Licensor and/or
Attribution Parties.
d. Except as otherwise agreed in writing by the Licensor or as may be
otherwise permitted by applicable law, if You Reproduce, Distribute or
Publicly Perform the Work either by itself or as part of any Adaptations
or Collections, You must not distort, mutilate, modify or take other
derogatory action in relation to the Work which would be prejudicial to
the Original Author's honor or reputation. Licensor agrees that in those
jurisdictions (e.g. Japan), in which any exercise of the right granted
in Section 3(b) of this License (the right to make Adaptations) would be
deemed to be a distortion, mutilation, modification or other derogatory
action prejudicial to the Original Author's honor and reputation, the
Licensor will waive or not assert, as appropriate, this Section, to the
fullest extent permitted by the applicable national law, to enable You
to reasonably exercise Your right under Section 3(b) of this License
(right to make Adaptations) but not otherwise.
5. Representations, Warranties and Disclaimer
UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR
OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY
KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE,
INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY,
FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF
LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS,
WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE
EXCLUSION OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU.
6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE
LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR
ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES
ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS
BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
7. Termination
a. This License and the rights granted hereunder will terminate
automatically upon any breach by You of the terms of this License.
Individuals or entities who have received Adaptations or Collections
from You under this License, however, will not have their licenses
terminated provided such individuals or entities remain in full
compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 will
survive any termination of this License.
b. Subject to the above terms and conditions, the license granted here
is perpetual (for the duration of the applicable copyright in the Work).
Notwithstanding the above, Licensor reserves the right to release the
Work under different license terms or to stop distributing the Work at
any time; provided, however that any such election will not serve to
withdraw this License (or any other license that has been, or is
required to be, granted under the terms of this License), and this
License will continue in full force and effect unless terminated as
stated above.
8. Miscellaneous
a. Each time You Distribute or Publicly Perform the Work or a
Collection, the Licensor offers to the recipient a license to the Work
on the same terms and conditions as the license granted to You under
this License.
b. Each time You Distribute or Publicly Perform an Adaptation, Licensor
offers to the recipient a license to the original Work on the same terms
and conditions as the license granted to You under this License.
c. If any provision of this License is invalid or unenforceable under
applicable law, it shall not affect the validity or enforceability of
the remainder of the terms of this License, and without further action
by the parties to this agreement, such provision shall be reformed to
the minimum extent necessary to make such provision valid and
enforceable.
d. No term or provision of this License shall be deemed waived and no
breach consented to unless such waiver or consent shall be in writing
and signed by the party to be charged with such waiver or consent.
e. This License constitutes the entire agreement between the parties
with respect to the Work licensed here. There are no understandings,
agreements or representations with respect to the Work not specified
here. Licensor shall not be bound by any additional provisions that may
appear in any communication from You. This License may not be modified
without the mutual written agreement of the Licensor and You.
f. The rights granted under, and the subject matter referenced, in this
License were drafted utilizing the terminology of the Berne Convention
for the Protection of Literary and Artistic Works (as amended on
September 28, 1979), the Rome Convention of 1961, the WIPO Copyright
Treaty of 1996, the WIPO Performances and Phonograms Treaty of 1996 and
the Universal Copyright Convention (as revised on July 24, 1971). These
rights and subject matter take effect in the relevant jurisdiction in
which the License terms are sought to be enforced according to the
corresponding provisions of the implementation of those treaty
provisions in the applicable national law. If the standard suite of
rights granted under applicable copyright law includes additional rights
not granted under this License, such additional rights are deemed to be
included in the License; this License is not intended to restrict the
license of any rights under applicable law.
Creative Commons Notice
Creative Commons is not a party to this License, and makes no warranty
whatsoever in connection with the Work. Creative Commons will not be
liable to You or any party on any legal theory for any damages
whatsoever, including without limitation any general, special,
incidental or consequential damages arising in connection to this
license. Notwithstanding the foregoing two (2) sentences, if Creative
Commons has expressly identified itself as the Licensor hereunder, it
shall have all rights and obligations of Licensor.
Except for the limited purpose of indicating to the public that the Work
is licensed under the CCPL, Creative Commons does not authorize the use
by either party of the trademark "Creative Commons" or any related
trademark or logo of Creative Commons without the prior written consent
of Creative Commons. Any permitted use will be in compliance with
Creative Commons' then-current trademark usage guidelines, as may be
published on its website or otherwise made available upon request from
time to time. For the avoidance of doubt, this trademark restriction
does not form part of the License.
Creative Commons may be contacted at http://creativecommons.org/.
License: BSD-3-clause
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
.
1. Redistributions of source code must retain the above
copyright notice, this list of conditions and the following
disclaimer.
.
2. Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials
provided with the distribution.
.
3. Neither the name of the copyright holder nor the names of
its contributors may be used to endorse or promote products
derived from this software without specific prior written
permission.
.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@@ -5,7 +5,6 @@ export DESTDIR=$(CURDIR)/debian/tmp
include /usr/share/dpkg/default.mk

-extraopts += -DCMAKE_C_COMPILER=gcc-11 -DCMAKE_CXX_COMPILER=g++-11
ifneq (,$(findstring WITH_STATIC_LIBSTDCXX,$(CEPH_EXTRA_CMAKE_ARGS)))
# dh_auto_build sets LDFLAGS with `dpkg-buildflags --get LDFLAGS` on ubuntu,
# which makes the application aborts when the shared library throws

@@ -59,19 +58,15 @@ py3_overrides_packages := $(basename $(notdir $(wildcard debian/*.requires)))
py3_packages := cephfs-shell cephfs-top cephadm

%:
-	dh $@ --buildsystem=cmake --with javahelper,python3,systemd --parallel
+	dh $@ --buildsystem=cmake --with javahelper,python3 --parallel

override_dh_auto_configure:
	env | sort
	dh_auto_configure --buildsystem=cmake -- $(extraopts) $(CEPH_EXTRA_CMAKE_ARGS)

-override_dh_auto_build:
-	dh_auto_build --buildsystem=cmake
-	cp src/init-radosgw debian/radosgw.init

override_dh_auto_clean:
	dh_auto_clean --buildsystem=cmake
-	rm -f debian/radosgw.init debian/ceph.logrotate
+	rm -f debian/radosgw.init debian/ceph.logrotate debian/ceph-base.docs

override_dh_auto_install:
	dh_auto_install --buildsystem=cmake --destdir=$(DESTDIR)

@@ -87,13 +82,12 @@ override_dh_auto_install:
override_dh_installchangelogs:
	dh_installchangelogs --exclude doc/changelog

-override_dh_installdocs:

override_dh_installlogrotate:
	cp src/logrotate.conf debian/ceph-common.logrotate
	dh_installlogrotate -pceph-common

override_dh_installinit:
+	cp src/init-radosgw debian/radosgw.init
	# install the systemd stuff manually since we have funny service names
	install -d -m0755 debian/ceph-common/etc/default
	install -m0644 etc/default/ceph debian/ceph-common/etc/default/

@@ -103,15 +97,9 @@ override_dh_installinit:
	dh_installinit -p ceph-base --name ceph --no-start
	dh_installinit -p radosgw --no-start
-	# NOTE: execute systemd helpers so they pickup dh_install'ed units and targets
-	dh_systemd_enable
-	dh_systemd_start --no-restart-on-upgrade
-
-override_dh_systemd_enable:
-	# systemd enable done as part of dh_installinit
-
-override_dh_systemd_start:
-	# systemd start done as part of dh_installinit
+
+override_dh_installsystemd:
+	# Only enable and start systemd targets
+	dh_installsystemd --no-stop-on-upgrade --no-restart-after-upgrade -Xceph-mon.service -Xceph-osd.service -X ceph-mds.service

override_dh_strip:
	dh_strip -pceph-mds --dbg-package=ceph-mds-dbg

@@ -152,8 +140,12 @@ override_dh_python3:
	@for pkg in $(py3_packages); do \
		dh_python3 -p $$pkg; \
	done
+	dh_python3 -p ceph-base --shebang=/usr/bin/python3
+	dh_python3 -p ceph-common --shebang=/usr/bin/python3
+	dh_python3 -p ceph-fuse --shebang=/usr/bin/python3
+	dh_python3 -p ceph-volume --shebang=/usr/bin/python3

# do not run tests
override_dh_auto_test:

-.PHONY: override_dh_autoreconf override_dh_auto_configure override_dh_auto_build override_dh_auto_clean override_dh_auto_install override_dh_installdocs override_dh_installlogrotate override_dh_installinit override_dh_systemd_start override_dh_strip override_dh_auto_test
+.PHONY: override_dh_autoreconf override_dh_auto_configure override_dh_auto_clean override_dh_auto_install override_dh_installlogrotate override_dh_installinit override_dh_strip override_dh_auto_test

View File

@ -30,58 +30,54 @@ A Ceph Storage Cluster consists of multiple types of daemons:
- :term:`Ceph Manager` - :term:`Ceph Manager`
- :term:`Ceph Metadata Server` - :term:`Ceph Metadata Server`
.. ditaa:: .. _arch_monitor:
+---------------+ +---------------+ +---------------+ +---------------+ Ceph Monitors maintain the master copy of the cluster map, which they provide
| OSDs | | Monitors | | Managers | | MDS | to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures
+---------------+ +---------------+ +---------------+ +---------------+ availability in the event that one of the monitor daemons or its host fails.
The Ceph monitor provides copies of the cluster map to storage cluster clients.
A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
monitors ensures high availability should a monitor daemon fail. Storage cluster
clients retrieve a copy of the cluster map from the Ceph Monitor.
A Ceph OSD Daemon checks its own state and the state of other OSDs and reports A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
back to monitors. back to monitors.
A Ceph Manager acts as an endpoint for monitoring, orchestration, and plug-in A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in
modules. modules.
A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
provide file services. provide file services.
Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm
to efficiently compute information about data location, instead of having to to compute information about data location. This means that clients and OSDs
depend on a central lookup table. Ceph's high-level features include a are not bottlenecked by a central lookup table. Ceph's high-level features
native interface to the Ceph Storage Cluster via ``librados``, and a number of include a native interface to the Ceph Storage Cluster via ``librados``, and a
service interfaces built on top of ``librados``. number of service interfaces built on top of ``librados``.
Storing Data Storing Data
------------ ------------
The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it
comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
:term:`Ceph File System` or a custom implementation you create using :term:`Ceph File System`, or a custom implementation that you create by using
``librados``-- which is stored as RADOS objects. Each object is stored on an ``librados``. The data received by the Ceph Storage Cluster is stored as RADOS
:term:`Object Storage Device`. Ceph OSD Daemons handle read, write, and objects. Each object is stored on an :term:`Object Storage Device` (this is
replication operations on storage drives. With the default BlueStore back end, also called an "OSD"). Ceph OSDs control read, write, and replication
objects are stored in a monolithic database-like fashion. operations on storage drives. The default BlueStore back end stores objects
in a monolithic, database-like fashion.
.. ditaa:: .. ditaa::
/-----\ +-----+ +-----+ /------\ +-----+ +-----+
| obj |------>| {d} |------>| {s} | | obj |------>| {d} |------>| {s} |
\-----/ +-----+ +-----+ \------/ +-----+ +-----+
Object OSD Drive Object OSD Drive
Ceph OSD Daemons store data as objects in a flat namespace (e.g., no Ceph OSD Daemons store data as objects in a flat namespace. This means that
hierarchy of directories). An object has an identifier, binary data, and objects are not stored in a hierarchy of directories. An object has an
metadata consisting of a set of name/value pairs. The semantics are completely identifier, binary data, and metadata consisting of name/value pairs.
up to :term:`Ceph Client`\s. For example, CephFS uses metadata to store file :term:`Ceph Client`\s determine the semantics of the object data. For example,
attributes such as the file owner, created date, last modified date, and so CephFS uses metadata to store file attributes such as the file owner, the
forth. created date, and the last modified date.
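As a rough sketch (not a real RADOS data structure), such an object can be modeled as an identifier, opaque binary data, and a name/value metadata map whose meaning is left to the client:

.. code-block:: python

    from dataclasses import dataclass, field

    @dataclass
    class RadosObject:
        object_id: str                                 # the identifier
        data: bytes                                    # opaque binary payload
        metadata: dict = field(default_factory=dict)   # name/value pairs; semantics are up to the client

    obj = RadosObject(
        object_id="john",
        data=b"...file contents...",
        metadata={"owner": "paul", "mtime": "2023-12-19"},
    )
    print(obj.object_id, sorted(obj.metadata))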
.. ditaa:: .. ditaa::
@ -100,20 +96,23 @@ forth.
.. index:: architecture; high availability, scalability .. index:: architecture; high availability, scalability
.. _arch_scalability_and_high_availability:
Scalability and High Availability Scalability and High Availability
--------------------------------- ---------------------------------
In traditional architectures, clients talk to a centralized component (e.g., a In traditional architectures, clients talk to a centralized component. This
gateway, broker, API, facade, etc.), which acts as a single point of entry to a centralized component might be a gateway, a broker, an API, or a facade. A
complex subsystem. This imposes a limit to both performance and scalability, centralized component of this kind acts as a single point of entry to a complex
while introducing a single point of failure (i.e., if the centralized component subsystem. Architectures that rely upon such a centralized component have a
goes down, the whole system goes down, too). single point of failure and incur limits to performance and scalability. If
the centralized component goes down, the whole system becomes unavailable.
Ceph eliminates the centralized gateway to enable clients to interact with Ceph eliminates this centralized component. This enables clients to interact
Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes
Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster to ensure data safety and high availability. Ceph also uses a cluster of
of monitors to ensure high availability. To eliminate centralization, Ceph monitors to ensure high availability. To eliminate centralization, Ceph uses an
uses an algorithm called CRUSH. algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`.
.. index:: CRUSH; architecture .. index:: CRUSH; architecture
@ -122,15 +121,15 @@ CRUSH Introduction
~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~
Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
Replication Under Scalable Hashing)` algorithm to efficiently compute Replication Under Scalable Hashing)` algorithm to compute information about
information about object location, instead of having to depend on a object location instead of relying upon a central lookup table. CRUSH provides
central lookup table. CRUSH provides a better data management mechanism compared a better data management mechanism than do older approaches, and CRUSH enables
to older approaches, and enables massive scale by cleanly distributing the work massive scale by distributing the work to all the OSD daemons in the cluster
to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data and all the clients that communicate with them. CRUSH uses intelligent data
replication to ensure resiliency, which is better suited to hyper-scale storage. replication to ensure resiliency, which is better suited to hyper-scale
The following sections provide additional details on how CRUSH works. For a storage. The following sections provide additional details on how CRUSH works.
detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable,
Placement of Replicated Data`_. Decentralized Placement of Replicated Data`_.
.. index:: architecture; cluster map .. index:: architecture; cluster map
@ -139,61 +138,71 @@ Placement of Replicated Data`_.
Cluster Map Cluster Map
~~~~~~~~~~~ ~~~~~~~~~~~
Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs
cluster topology, which is inclusive of 5 maps collectively referred to as the must have current information about the cluster's topology. Current information
"Cluster Map": is stored in the "Cluster Map", which is in fact a collection of five maps. The
five maps that constitute the cluster map are:
#. **The Monitor Map:** Contains the cluster ``fsid``, the position, name #. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name,
address and port of each monitor. It also indicates the current epoch, the address, and the TCP port of each monitor. The monitor map specifies the
when the map was created, and the last time it changed. To view a monitor current epoch, the time of the monitor map's creation, and the time of the
map, execute ``ceph mon dump``. monitor map's last modification. To view a monitor map, run ``ceph mon
dump``.
#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and #. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's
last modified, a list of pools, replica sizes, PG numbers, a list of OSDs creation, the time of the OSD map's last modification, a list of pools, a
and their status (e.g., ``up``, ``in``). To view an OSD map, execute list of replica sizes, a list of PG numbers, and a list of OSDs and their
``ceph osd dump``. statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph
osd dump``.
#. **The PG Map:** Contains the PG version, its time stamp, the last OSD #. **The PG Map:** Contains the PG version, its time stamp, the last OSD map
map epoch, the full ratios, and details on each placement group such as epoch, the full ratios, and the details of each placement group. This
the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g., includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for
``active + clean``), and data usage statistics for each pool. example, ``active + clean``), and data usage statistics for each pool.
#. **The CRUSH Map:** Contains a list of storage devices, the failure domain #. **The CRUSH Map:** Contains a list of storage devices, the failure domain
hierarchy (e.g., device, host, rack, row, room, etc.), and rules for hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``),
traversing the hierarchy when storing data. To view a CRUSH map, execute and rules for traversing the hierarchy when storing data. To view a CRUSH
``ceph osd getcrushmap -o {filename}``; then, decompile it by executing map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by
``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``. running ``crushtool -d {comp-crushmap-filename} -o
You can view the decompiled map in a text editor or with ``cat``. {decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the
decompiled map.
#. **The MDS Map:** Contains the current MDS map epoch, when the map was #. **The MDS Map:** Contains the current MDS map epoch, when the map was
created, and the last time it changed. It also contains the pool for created, and the last time it changed. It also contains the pool for
storing metadata, a list of metadata servers, and which metadata servers storing metadata, a list of metadata servers, and which metadata servers
are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``. are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.
Each map maintains an iterative history of its operating state changes. Ceph Each map maintains a history of changes to its operating state. Ceph Monitors
Monitors maintain a master copy of the cluster map including the cluster maintain a master copy of the cluster map. This master copy includes the
members, state, changes, and the overall health of the Ceph Storage Cluster. cluster members, the state of the cluster, changes to the cluster, and
information recording the overall health of the Ceph Storage Cluster.
.. index:: high availability; monitor architecture .. index:: high availability; monitor architecture
High Availability Monitors High Availability Monitors
~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~
Before Ceph Clients can read or write data, they must contact a Ceph Monitor A Ceph Client must contact a Ceph Monitor and obtain a current copy of the
to obtain the most recent copy of the cluster map. A Ceph Storage Cluster cluster map in order to read data from or to write data to the Ceph cluster.
can operate with a single monitor; however, this introduces a single
point of failure (i.e., if the monitor goes down, Ceph Clients cannot
read or write data).
For added reliability and fault tolerance, Ceph supports a cluster of monitors. It is possible for a Ceph cluster to function properly with only a single
In a cluster of monitors, latency and other faults can cause one or more monitor, but a Ceph cluster that has only a single monitor has a single point
monitors to fall behind the current state of the cluster. For this reason, Ceph of failure: if the monitor goes down, Ceph clients will be unable to read data
must have agreement among various monitor instances regarding the state of the from or write data to the cluster.
cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.)
and the `Paxos`_ algorithm to establish a consensus among the monitors about the
current state of the cluster.
For details on configuring monitors, see the `Monitor Config Reference`_. Ceph leverages a cluster of monitors in order to increase reliability and fault
tolerance. When a cluster of monitors is used, however, one or more of the
monitors in the cluster can fall behind due to latency or other faults. Ceph
mitigates these negative effects by requiring multiple monitor instances to
agree about the state of the cluster. To establish consensus among the monitors
regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a
majority of monitors (for example, one in a cluster that contains only one
monitor, two in a cluster that contains three monitors, three in a cluster that
contains five monitors, four in a cluster that contains six monitors, and so
on).
See the `Monitor Config Reference`_ for more detail on configuring monitors.
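The majority rule can be expressed as a one-line calculation. The following sketch is an illustration only (not Ceph code) and reproduces the examples given in the paragraph above:

.. code-block:: python

    def monitor_quorum(num_monitors: int) -> int:
        # A majority is strictly more than half of the monitors.
        return num_monitors // 2 + 1

    for n in (1, 3, 5, 6):
        print(f"{n} monitor(s): {monitor_quorum(n)} needed for quorum")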
.. index:: architecture; high availability authentication .. index:: architecture; high availability authentication
@ -202,48 +211,57 @@ For details on configuring monitors, see the `Monitor Config Reference`_.
High Availability Authentication High Availability Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To identify users and protect against man-in-the-middle attacks, Ceph provides The ``cephx`` authentication system is used by Ceph to authenticate users and
its ``cephx`` authentication system to authenticate users and daemons. daemons and to protect against man-in-the-middle attacks.
.. note:: The ``cephx`` protocol does not address data encryption in transport .. note:: The ``cephx`` protocol does not address data encryption in transport
(e.g., SSL/TLS) or encryption at rest. (for example, SSL/TLS) or encryption at rest.
Cephx uses shared secret keys for authentication, meaning both the client and ``cephx`` uses shared secret keys for authentication. This means that both the
the monitor cluster have a copy of the client's secret key. The authentication client and the monitor cluster keep a copy of the client's secret key.
protocol is such that both parties are able to prove to each other they have a
copy of the key without actually revealing it. This provides mutual
authentication, which means the cluster is sure the user possesses the secret
key, and the user is sure that the cluster has a copy of the secret key.
A key scalability feature of Ceph is to avoid a centralized interface to the The ``cephx`` protocol makes it possible for each party to prove to the other
Ceph object store, which means that Ceph clients must be able to interact with that it has a copy of the key without revealing it. This provides mutual
OSDs directly. To protect data, Ceph provides its ``cephx`` authentication authentication and allows the cluster to confirm (1) that the user has the
system, which authenticates users operating Ceph clients. The ``cephx`` protocol secret key and (2) that the user can be confident that the cluster has a copy
operates in a manner with behavior similar to `Kerberos`_. of the secret key.
A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each As stated in :ref:`Scalability and High Availability
monitor can authenticate users and distribute keys, so there is no single point <arch_scalability_and_high_availability>`, Ceph does not have any centralized
of failure or bottleneck when using ``cephx``. The monitor returns an interface between clients and the Ceph object store. By avoiding such a
authentication data structure similar to a Kerberos ticket that contains a centralized interface, Ceph avoids the bottlenecks that attend such centralized
session key for use in obtaining Ceph services. This session key is itself interfaces. However, this means that clients must interact directly with OSDs.
encrypted with the user's permanent secret key, so that only the user can Direct interactions between Ceph clients and OSDs require authenticated
request services from the Ceph Monitor(s). The client then uses the session key connections. The ``cephx`` authentication system establishes and sustains these
to request its desired services from the monitor, and the monitor provides the authenticated connections.
client with a ticket that will authenticate the client to the OSDs that actually
handle data. Ceph Monitors and OSDs share a secret, so the client can use the
ticket provided by the monitor with any OSD or metadata server in the cluster.
Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired
ticket or session key obtained surreptitiously. This form of authentication will
prevent attackers with access to the communications medium from either creating
bogus messages under another user's identity or altering another user's
legitimate messages, as long as the user's secret key is not divulged before it
expires.
To use ``cephx``, an administrator must set up users first. In the following The ``cephx`` protocol operates in a manner similar to `Kerberos`_.
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each
monitor can authenticate users and distribute keys, which means that there is
no single point of failure and no bottleneck when using ``cephx``. The monitor
returns an authentication data structure that is similar to a Kerberos ticket.
This authentication data structure contains a session key for use in obtaining
Ceph services. The session key is itself encrypted with the user's permanent
secret key, which means that only the user can request services from the Ceph
Monitors. The client then uses the session key to request services from the
monitors, and the monitors provide the client with a ticket that authenticates
the client against the OSDs that actually handle data. Ceph Monitors and OSDs
share a secret, which means that the clients can use the ticket provided by the
monitors to authenticate against any OSD or metadata server in the cluster.
Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an
expired ticket or session key that has been obtained surreptitiously. This form
of authentication prevents attackers who have access to the communications
medium from creating bogus messages under another user's identity and prevents
attackers from altering another user's legitimate messages, as long as the
user's secret key is not divulged before it expires.
An administrator must set up users before using ``cephx``. In the following
diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
the command line to generate a username and secret key. Ceph's ``auth`` the command line to generate a username and secret key. Ceph's ``auth``
subsystem generates the username and key, stores a copy with the monitor(s) and subsystem generates the username and key, stores a copy on the monitor(s), and
transmits the user's secret back to the ``client.admin`` user. This means that transmits the user's secret back to the ``client.admin`` user. This means that
the client and the monitor share a secret key. the client and the monitor share a secret key.
.. note:: The ``client.admin`` user must provide the user ID and .. note:: The ``client.admin`` user must provide the user ID and
@ -262,17 +280,16 @@ the client and the monitor share a secret key.
| transmit key | | transmit key |
| | | |
Here is how a client authenticates with a monitor. The client passes the user
To authenticate with the monitor, the client passes in the user name to the name to the monitor. The monitor generates a session key that is encrypted with
monitor, and the monitor generates a session key and encrypts it with the secret the secret key associated with the ``username``. The monitor transmits the
key associated to the user name. Then, the monitor transmits the encrypted encrypted ticket to the client. The client uses the shared secret key to
ticket back to the client. The client then decrypts the payload with the shared decrypt the payload. The session key identifies the user, and this act of
secret key to retrieve the session key. The session key identifies the user for identification will last for the duration of the session. The client requests
the current session. The client then requests a ticket on behalf of the user a ticket for the user, and the ticket is signed with the session key. The
signed by the session key. The monitor generates a ticket, encrypts it with the monitor generates a ticket and uses the user's secret key to encrypt it. The
user's secret key and transmits it back to the client. The client decrypts the encrypted ticket is transmitted to the client. The client decrypts the ticket
ticket and uses it to sign requests to OSDs and metadata servers throughout the and uses it to sign requests to OSDs and to metadata servers in the cluster.
cluster.
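The exchange just described can be modeled in a few lines. This is a toy walk-through only: the XOR "cipher" below is a stand-in for real symmetric encryption, and nothing here reflects the actual cephx wire format.

.. code-block:: python

    import hashlib
    import os

    def toy_encrypt(key: bytes, payload: bytes) -> bytes:
        # Stand-in for real symmetric encryption: XOR with a keystream derived from the key.
        stream = hashlib.sha256(key).digest() * (len(payload) // 32 + 1)
        return bytes(a ^ b for a, b in zip(payload, stream))

    toy_decrypt = toy_encrypt       # XOR with the same keystream is its own inverse

    user_secret = os.urandom(32)    # shared in advance by the client and the monitors

    # Monitor side: generate a session key and return it encrypted with the user's secret key.
    session_key = os.urandom(32)
    reply = toy_encrypt(user_secret, session_key)

    # Client side: only a holder of the user's secret key can recover the session key.
    assert toy_decrypt(user_secret, reply) == session_key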
.. ditaa:: .. ditaa::
@ -302,10 +319,11 @@ cluster.
|<----+ | |<----+ |
The ``cephx`` protocol authenticates ongoing communications between the client The ``cephx`` protocol authenticates ongoing communications between the clients
machine and the Ceph servers. Each message sent between a client and server, and Ceph daemons. After initial authentication, each message sent between a
subsequent to the initial authentication, is signed using a ticket that the client and a daemon is signed using a ticket that can be verified by monitors,
monitors, OSDs and metadata servers can verify with their shared secret. OSDs, and metadata daemons. This ticket is verified by using the secret shared
between the client and the daemon.
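A minimal sketch of such per-message signing, assuming only a shared session secret (illustrative; it does not follow the real cephx message format):

.. code-block:: python

    import hashlib
    import hmac
    import os

    session_secret = os.urandom(32)     # shared secret established during authentication

    def sign(message: bytes, key: bytes) -> bytes:
        return hmac.new(key, message, hashlib.sha256).digest()

    request = b"example client request"
    signature = sign(request, session_secret)

    # The receiving daemon, holding the same secret, recomputes and compares the signature.
    assert hmac.compare_digest(signature, sign(request, session_secret))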
.. ditaa:: .. ditaa::
@ -341,83 +359,93 @@ monitors, OSDs and metadata servers can verify with their shared secret.
|<-------------------------------------------| |<-------------------------------------------|
receive response receive response
The protection offered by this authentication is between the Ceph client and the This authentication protects only the connections between Ceph clients and Ceph
Ceph server hosts. The authentication is not extended beyond the Ceph client. If daemons. The authentication is not extended beyond the Ceph client. If a user
the user accesses the Ceph client from a remote host, Ceph authentication is not accesses the Ceph client from a remote host, cephx authentication will not be
applied to the connection between the user's host and the client host. applied to the connection between the user's host and the client host.
See `Cephx Config Guide`_ for more on configuration details.
For configuration details, see `Cephx Config Guide`_. For user management See `User Management`_ for more on user management.
details, see `User Management`_.
See :ref:`A Detailed Description of the Cephx Authentication Protocol
<cephx_2012_peter>` for more on the distinction between authorization and
authentication and for a step-by-step explanation of the setup of ``cephx``
tickets and session keys.
.. index:: architecture; smart daemons and scalability .. index:: architecture; smart daemons and scalability
Smart Daemons Enable Hyperscale Smart Daemons Enable Hyperscale
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A feature of many storage clusters is a centralized interface that keeps track
of the nodes that clients are permitted to access. Such centralized
architectures provide services to clients by means of a double dispatch. At the
petabyte-to-exabyte scale, such double dispatches are a significant
bottleneck.
In many clustered architectures, the primary purpose of cluster membership is Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are
so that a centralized interface knows which nodes it can access. Then the cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph
centralized interface provides services to the client through a double OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly
dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale. with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being
cluster-aware makes it possible for Ceph clients to interact directly with Ceph
OSD Daemons.
Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one
aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM
Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with resources of the nodes in the Ceph cluster. This means that a Ceph cluster can
other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients easily perform tasks that a cluster with a centralized interface would struggle
to interact directly with Ceph OSD Daemons. to perform. The ability of Ceph nodes to make use of the computing power of
the greater cluster provides several benefits:
The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with #. **OSDs Service Clients Directly:** Network devices can support only a
each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph limited number of concurrent connections. Because Ceph clients contact
nodes to easily perform tasks that would bog down a centralized server. The Ceph OSD daemons directly without first connecting to a central interface,
ability to leverage this computing power leads to several major benefits: Ceph enjoys improved performance and increased system capacity relative to
storage redundancy strategies that include a central interface. Ceph clients
maintain sessions only when needed, and maintain those sessions with only
particular Ceph OSD daemons, not with a centralized interface.
#. **OSDs Service Clients Directly:** Since any network device has a limit to #. **OSD Membership and Status**: When Ceph OSD Daemons join a cluster, they
the number of concurrent connections it can support, a centralized system report their status. At the lowest level, the Ceph OSD Daemon status is
has a low physical limit at high scales. By enabling Ceph Clients to contact ``up`` or ``down``: this reflects whether the Ceph OSD daemon is running and
Ceph OSD Daemons directly, Ceph increases both performance and total system able to service Ceph Client requests. If a Ceph OSD Daemon is ``down`` and
capacity simultaneously, while removing a single point of failure. Ceph ``in`` the Ceph Storage Cluster, this status may indicate the failure of the
Clients can maintain a session when they need to, and with a particular Ceph Ceph OSD Daemon. If a Ceph OSD Daemon is not running because it has crashed,
OSD Daemon instead of a centralized server. the Ceph OSD Daemon cannot notify the Ceph Monitor that it is ``down``. The
OSDs periodically send messages to the Ceph Monitor (in releases prior to
Luminous, this was done by means of ``MPGStats``, and beginning with the
Luminous release, this has been done with ``MOSDBeacon``). If the Ceph
Monitors receive no such message after a configurable period of time,
then they mark the OSD ``down``. This mechanism is a failsafe, however.
Normally, Ceph OSD Daemons determine if a neighboring OSD is ``down`` and
report it to the Ceph Monitors. This contributes to making Ceph Monitors
lightweight processes. See `Monitoring OSDs`_ and `Heartbeats`_ for
additional details.
#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report #. **Data Scrubbing:** To maintain data consistency, Ceph OSD Daemons scrub
on their status. At the lowest level, the Ceph OSD Daemon status is ``up`` RADOS objects. Ceph OSD Daemons compare the metadata of their own local
or ``down`` reflecting whether or not it is running and able to service objects against the metadata of the replicas of those objects, which are
Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph stored on other OSDs. Scrubbing occurs on a per-Placement-Group basis, finds
Storage Cluster, this status may indicate the failure of the Ceph OSD mismatches in object size and finds metadata mismatches, and is usually
Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD performed daily. Ceph OSD Daemons perform deeper scrubbing by comparing the
Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs data in objects, bit-for-bit, against their checksums. Deep scrubbing finds
periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous, bad sectors on drives that are not detectable with light scrubs. See `Data
and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that Scrubbing`_ for details on configuring scrubbing.
message after a configurable period of time then it marks the OSD down.
This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
determine if a neighboring OSD is down and report it to the Ceph Monitor(s).
This assures that Ceph Monitors are lightweight processes. See `Monitoring
OSDs`_ and `Heartbeats`_ for additional details.
#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness, #. **Replication:** Data replication involves a collaboration between Ceph
Ceph OSD Daemons can scrub objects. That is, Ceph OSD Daemons can compare Clients and Ceph OSD Daemons. Ceph OSD Daemons use the CRUSH algorithm to
their local objects metadata with its replicas stored on other OSDs. Scrubbing determine the storage location of object replicas. Ceph clients use the
happens on a per-Placement Group base. Scrubbing (usually performed daily) CRUSH algorithm to determine the storage location of an object, then the
catches mismatches in size and other metadata. Ceph OSD Daemons also perform deeper object is mapped to a pool and to a placement group, and then the client
scrubbing by comparing data in objects bit-for-bit with their checksums. consults the CRUSH map to identify the placement group's primary OSD.
Deep scrubbing (usually performed weekly) finds bad sectors on a drive that
weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
configuring scrubbing.
#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH After identifying the target placement group, the client writes the object
algorithm, but the Ceph OSD Daemon uses it to compute where replicas of to the identified placement group's primary OSD. The primary OSD then
objects should be stored (and for rebalancing). In a typical write scenario, consults its own copy of the CRUSH map to identify secondary and tertiary
a client uses the CRUSH algorithm to compute where to store an object, maps OSDS, replicates the object to the placement groups in those secondary and
the object to a pool and placement group, then looks at the CRUSH map to tertiary OSDs, confirms that the object was stored successfully in the
identify the primary OSD for the placement group. secondary and tertiary OSDs, and reports to the client that the object
was stored successfully.
The client writes the object to the identified placement group in the
primary OSD. Then, the primary OSD with its own copy of the CRUSH map
identifies the secondary and tertiary OSDs for replication purposes, and
replicates the object to the appropriate placement groups in the secondary
and tertiary OSDs (as many OSDs as additional replicas), and responds to the
client once it has confirmed the object was stored successfully.
.. ditaa:: .. ditaa::
@ -444,19 +472,18 @@ ability to leverage this computing power leads to several major benefits:
| | | | | | | |
+---------------+ +---------------+ +---------------+ +---------------+
With the ability to perform data replication, Ceph OSD Daemons relieve Ceph By performing this act of data replication, Ceph OSD Daemons relieve Ceph
clients from that duty, while ensuring high data availability and data safety. clients of the burden of replicating data.
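The write path described above can be captured in a self-contained toy model. The "cluster" here is just dictionaries and the OSD and PG names are taken from the examples in this document; the point is the order of operations, not real networking or the librados API.

.. code-block:: python

    acting_sets = {"4.58": ["osd.25", "osd.32", "osd.61"]}   # PG -> [Primary, Secondary, Tertiary]
    osd_stores = {osd: {} for osds in acting_sets.values() for osd in osds}

    def client_write(pg: str, name: str, data: bytes) -> None:
        primary = acting_sets[pg][0]          # the client talks only to the Primary
        primary_write(primary, pg, name, data)

    def primary_write(primary: str, pg: str, name: str, data: bytes) -> None:
        osd_stores[primary][name] = data                  # the Primary stores its copy
        for replica in acting_sets[pg][1:]:               # then fans out to Secondary/Tertiary
            osd_stores[replica][name] = data
        # only after the replicas confirm does the Primary acknowledge the client

    client_write("4.58", "john", b"payload")
    assert all("john" in osd_stores[o] for o in acting_sets["4.58"])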
Dynamic Cluster Management Dynamic Cluster Management
-------------------------- --------------------------
In the `Scalability and High Availability`_ section, we explained how Ceph uses In the `Scalability and High Availability`_ section, we explained how Ceph uses
CRUSH, cluster awareness and intelligent daemons to scale and maintain high CRUSH, cluster topology, and intelligent daemons to scale and maintain high
availability. Key to Ceph's design is the autonomous, self-healing, and availability. Key to Ceph's design is the autonomous, self-healing, and
intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
enable modern cloud storage infrastructures to place data, rebalance the cluster enable modern cloud storage infrastructures to place data, rebalance the
and recover from faults dynamically. cluster, and adaptively place and balance data and recover from faults.
.. index:: architecture; pools .. index:: architecture; pools
@ -465,10 +492,11 @@ About Pools
The Ceph storage system supports the notion of 'Pools', which are logical The Ceph storage system supports the notion of 'Pools', which are logical
partitions for storing objects. partitions for storing objects.
Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write RADOS
pools. The pool's ``size`` or number of replicas, the CRUSH rule and the objects to pools. The way that Ceph places the data in the pools is determined
number of placement groups determine how Ceph will place the data. by the pool's ``size`` or number of replicas, the CRUSH rule, and the number of
placement groups in the pool.
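The placement-relevant attributes of a pool can be pictured as a small record. The values below are made up for illustration and are not the output of any Ceph command:

.. code-block:: python

    pool = {
        "name": "liverpool",
        "size": 3,                    # number of replicas
        "min_size": 2,                # replicas required to keep serving I/O
        "pg_num": 128,                # number of placement groups
        "crush_rule": "replicated_rule",
    }
    print(pool)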
.. ditaa:: .. ditaa::
@ -501,20 +529,23 @@ See `Set Pool Values`_ for details.
Mapping PGs to OSDs Mapping PGs to OSDs
~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~
Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically. Each pool has a number of placement groups (PGs) within it. CRUSH dynamically
When a Ceph Client stores objects, CRUSH will map each object to a placement maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS
group. object to a PG.
Mapping objects to placement groups creates a layer of indirection between the This mapping of RADOS objects to PGs implements an abstraction and indirection
Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to layer between Ceph OSD Daemons and Ceph Clients. The Ceph Storage Cluster must
grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph be able to grow (or shrink) and redistribute data adaptively when the internal
Client "knew" which Ceph OSD Daemon had which object, that would create a tight topology changes.
coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
algorithm maps each object to a placement group and then maps each placement If the Ceph Client "knew" which Ceph OSD Daemons were storing which objects, a
group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to tight coupling would exist between the Ceph Client and the Ceph OSD Daemon.
rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices But Ceph avoids any such tight coupling. Instead, the CRUSH algorithm maps each
come online. The following diagram depicts how CRUSH maps objects to placement RADOS object to a placement group and then maps each placement group to one or
groups, and placement groups to OSDs. more Ceph OSD Daemons. This "layer of indirection" allows Ceph to rebalance
dynamically when new Ceph OSD Daemons and their underlying OSD devices come
online. The following diagram shows how the CRUSH algorithm maps objects to
placement groups, and how it maps placement groups to OSDs.
.. ditaa:: .. ditaa::
@ -540,44 +571,45 @@ groups, and placement groups to OSDs.
| | | | | | | | | | | | | | | |
\----------/ \----------/ \----------/ \----------/ \----------/ \----------/ \----------/ \----------/
With a copy of the cluster map and the CRUSH algorithm, the client can compute The client uses its copy of the cluster map and the CRUSH algorithm to compute
exactly which OSD to use when reading or writing a particular object. precisely which OSD it will use when reading or writing a particular object.
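The two-level mapping just illustrated can be sketched in a few lines. This is an illustration, not CRUSH itself: the PG-to-OSD step is a plain dictionary here, whereas a real cluster computes it with CRUSH from the cluster map. The object-to-PG mapping never changes as the cluster grows; only the PG-to-OSD mapping is recomputed.

.. code-block:: python

    def object_to_pg(pool_id: int, pg_num: int, obj_hash: int) -> str:
        # The client-side mapping: it depends only on the pool and the object hash.
        return f"{pool_id}.{obj_hash % pg_num}"

    pg = object_to_pg(4, 58, 24364)                 # stays the same for the object's lifetime

    # The PG-to-OSD mapping is the only thing that changes when OSDs come and go.
    pg_to_osds = {pg: ["osd.1", "osd.2", "osd.3"]}
    pg_to_osds[pg] = ["osd.1", "osd.4", "osd.3"]    # a new OSD takes over one replica
    print(pg, pg_to_osds[pg])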
.. index:: architecture; calculating PG IDs .. index:: architecture; calculating PG IDs
Calculating PG IDs Calculating PG IDs
~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~
When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the When a Ceph Client binds to a Ceph Monitor, it retrieves the latest version of
`Cluster Map`_. With the cluster map, the client knows about all of the monitors, the `Cluster Map`_. When a client has been equipped with a copy of the cluster
OSDs, and metadata servers in the cluster. **However, it doesn't know anything map, it is aware of all the monitors, OSDs, and metadata servers in the
about object locations.** cluster. **However, even equipped with a copy of the latest version of the
cluster map, the client doesn't know anything about object locations.**
.. epigraph:: **Object locations must be computed.**
Object locations get computed. The client requires only the object ID and the name of the pool in order to
compute the object location.
Ceph stores data in named pools (for example, "liverpool"). When a client
stores a named object (for example, "john", "paul", "george", or "ringo") it
calculates a placement group by using the object name, a hash code, the number
of PGs in the pool, and the pool name. Ceph clients use the following steps to
compute PG IDs.
The only input required by the client is the object ID and the pool. #. The client inputs the pool name and the object ID. (for example: pool =
It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client "liverpool" and object-id = "john")
wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.) #. Ceph hashes the object ID.
it calculates a placement group using the object name, a hash code, the #. Ceph calculates the hash, modulo the number of PGs (for example: ``58``), to
number of PGs in the pool and the pool name. Ceph clients use the following get a PG ID.
steps to compute PG IDs. #. Ceph uses the pool name to retrieve the pool ID: (for example: "liverpool" =
``4``)
#. Ceph prepends the pool ID to the PG ID (for example: ``4.58``).
#. The client inputs the pool name and the object ID. (e.g., pool = "liverpool" It is much faster to compute object locations than to perform an object location
and object-id = "john") query over a chatty session. The :abbr:`CRUSH (Controlled Replication Under
#. Ceph takes the object ID and hashes it. Scalable Hashing)` algorithm allows a client to compute where objects are
#. Ceph calculates the hash modulo the number of PGs. (e.g., ``58``) to get expected to be stored, and enables the client to contact the primary OSD to
a PG ID. store or retrieve the objects.
#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``)
#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
Computing object locations is much faster than performing object location query
over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
Hashing)` algorithm allows a client to compute where objects *should* be stored,
and enables the client to contact the primary OSD to store or retrieve the
objects.
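The steps above map directly onto a short sketch. Note that this is illustrative only: real Ceph uses its own hash (rjenkins) and a "stable modulo", whereas the sketch substitutes ``zlib.crc32`` and a plain modulo.

.. code-block:: python

    import zlib

    def compute_pg_id(pool_id: int, pg_num: int, object_id: str) -> str:
        obj_hash = zlib.crc32(object_id.encode())    # step 2: hash the object ID
        pg = obj_hash % pg_num                       # step 3: hash modulo the number of PGs
        return f"{pool_id}.{pg:x}"                   # step 5: prepend the pool ID (PG IDs print in hex)

    # pool "liverpool" has ID 4; assume it has 58 PGs; object name "john"
    print(compute_pg_id(4, 58, "john"))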
.. index:: architecture; PG Peering .. index:: architecture; PG Peering
@ -585,46 +617,51 @@ Peering and Sets
~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~
In previous sections, we noted that Ceph OSD Daemons check each other's In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to the Ceph Monitor. Another thing Ceph OSD daemons heartbeats and report back to Ceph Monitors. Ceph OSD daemons also 'peer',
do is called 'peering', which is the process of bringing all of the OSDs that which is the process of bringing all of the OSDs that store a Placement Group
store a Placement Group (PG) into agreement about the state of all of the (PG) into agreement about the state of all of the RADOS objects (and their
objects (and their metadata) in that PG. In fact, Ceph OSD Daemons `Report metadata) in that PG. Ceph OSD Daemons `Report Peering Failure`_ to the Ceph
Peering Failure`_ to the Ceph Monitors. Peering issues usually resolve Monitors. Peering issues usually resolve themselves; however, if the problem
themselves; however, if the problem persists, you may need to refer to the persists, you may need to refer to the `Troubleshooting Peering Failure`_
`Troubleshooting Peering Failure`_ section. section.
.. Note:: Agreeing on the state does not mean that the PGs have the latest contents. .. Note:: PGs that agree on the state of the cluster do not necessarily have
the current data yet.
The Ceph Storage Cluster was designed to store at least two copies of an object The Ceph Storage Cluster was designed to store at least two copies of an object
(i.e., ``size = 2``), which is the minimum requirement for data safety. For high (that is, ``size = 2``), which is the minimum requirement for data safety. For
availability, a Ceph Storage Cluster should store more than two copies of an object high availability, a Ceph Storage Cluster should store more than two copies of
(e.g., ``size = 3`` and ``min size = 2``) so that it can continue to run in a an object (that is, ``size = 3`` and ``min size = 2``) so that it can continue
``degraded`` state while maintaining data safety. to run in a ``degraded`` state while maintaining data safety.
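As a quick illustration of how ``size`` and ``min_size`` interact (a sketch, not Ceph code): with ``size = 3`` and ``min_size = 2``, losing one replica leaves the PG degraded but still able to serve I/O.

.. code-block:: python

    size, min_size = 3, 2        # pool settings discussed above
    up_replicas = 2              # one OSD in the acting set is currently down

    degraded = up_replicas < size
    serving_io = up_replicas >= min_size
    print(f"degraded={degraded}, serving_io={serving_io}")   # degraded=True, serving_io=True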
Referring back to the diagram in `Smart Daemons Enable Hyperscale`_, we do not .. warning:: Although we say here that R2 (replication with two copies) is the
name the Ceph OSD Daemons specifically (e.g., ``osd.0``, ``osd.1``, etc.), but minimum requirement for data safety, R3 (replication with three copies) is
rather refer to them as *Primary*, *Secondary*, and so forth. By convention, recommended. On a long enough timeline, data stored with an R2 strategy will
the *Primary* is the first OSD in the *Acting Set*, and is responsible for be lost.
coordinating the peering process for each placement group where it acts as
the *Primary*, and is the **ONLY** OSD that will accept client-initiated
writes to objects for a given placement group where it acts as the *Primary*.
When a series of OSDs are responsible for a placement group, that series of As explained in the diagram in `Smart Daemons Enable Hyperscale`_, we do not
OSDs, we refer to them as an *Acting Set*. An *Acting Set* may refer to the Ceph name the Ceph OSD Daemons specifically (for example, ``osd.0``, ``osd.1``,
OSD Daemons that are currently responsible for the placement group, or the Ceph etc.), but rather refer to them as *Primary*, *Secondary*, and so forth. By
OSD Daemons that were responsible for a particular placement group as of some convention, the *Primary* is the first OSD in the *Acting Set*, and is
responsible for orchestrating the peering process for each placement group
where it acts as the *Primary*. The *Primary* is the **ONLY** OSD in a given
placement group that accepts client-initiated writes to objects.
The set of OSDs that is responsible for a placement group is called the
*Acting Set*. The term "*Acting Set*" can refer either to the Ceph OSD Daemons
that are currently responsible for the placement group, or to the Ceph OSD
Daemons that were responsible for a particular placement group as of some
epoch. epoch.
The Ceph OSD daemons that are part of an *Acting Set* may not always be ``up``. The Ceph OSD daemons that are part of an *Acting Set* might not always be
When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*. The *Up ``up``. When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*.
Set* is an important distinction, because Ceph can remap PGs to other Ceph OSD The *Up Set* is an important distinction, because Ceph can remap PGs to other
Daemons when an OSD fails. Ceph OSD Daemons when an OSD fails.
.. note:: In an *Acting Set* for a PG containing ``osd.25``, ``osd.32`` and
``osd.61``, the first OSD, ``osd.25``, is the *Primary*. If that OSD fails,
the Secondary, ``osd.32``, becomes the *Primary*, and ``osd.25`` will be
removed from the *Up Set*.
.. note:: Consider a hypothetical *Acting Set* for a PG that contains
``osd.25``, ``osd.32`` and ``osd.61``. The first OSD (``osd.25``), is the
*Primary*. If that OSD fails, the Secondary (``osd.32``), becomes the
*Primary*, and ``osd.25`` is removed from the *Up Set*.
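The failover in the note above can be sketched as follows (illustration only; the OSD names are those used in the note):

.. code-block:: python

    acting_set = ["osd.25", "osd.32", "osd.61"]          # osd.25 is the Primary
    up = {"osd.25": True, "osd.32": True, "osd.61": True}

    up["osd.25"] = False                                  # the Primary fails
    up_set = [osd for osd in acting_set if up[osd]]       # down OSDs drop out of the Up Set
    primary = up_set[0]                                   # osd.32 takes over as Primary
    print(up_set, primary)                                # ['osd.32', 'osd.61'] osd.32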
.. index:: architecture; Rebalancing .. index:: architecture; Rebalancing
@ -1469,11 +1506,11 @@ Ceph Clients
Ceph Clients include a number of service interfaces. These include: Ceph Clients include a number of service interfaces. These include:
- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service - **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
provides resizable, thin-provisioned block devices with snapshotting and provides resizable, thin-provisioned block devices that can be snapshotted
cloning. Ceph stripes a block device across the cluster for high and cloned. Ceph stripes a block device across the cluster for high
performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
that uses ``librbd`` directly--avoiding the kernel object overhead for that uses ``librbd`` directly--avoiding the kernel object overhead for
virtualized systems. virtualized systems.
- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service - **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service


@ -3,18 +3,20 @@
``activate`` ``activate``
============ ============
Once :ref:`ceph-volume-lvm-prepare` is completed, and all the various steps After :ref:`ceph-volume-lvm-prepare` has completed its run, the volume can be
that entails are done, the volume is ready to get "activated". activated.
This activation process enables a systemd unit that persists the OSD ID and its Activating the volume involves enabling a ``systemd`` unit that persists the
UUID (also called ``fsid`` in Ceph CLI tools), so that at boot time it can ``OSD ID`` and its ``UUID`` (which is also called the ``fsid`` in the Ceph CLI
understand what OSD is enabled and needs to be mounted. tools). After this information has been persisted, the cluster can determine
which OSD is enabled and must be mounted.
.. note:: The execution of this call is fully idempotent, and there is no .. note:: The execution of this call is fully idempotent. This means that the
side-effects when running multiple times call can be executed multiple times without changing the result of its first
successful execution.
For OSDs deployed by cephadm, please refer to :ref:`cephadm-osd-activate` For information about OSDs deployed by cephadm, refer to
instead. :ref:`cephadm-osd-activate`.
New OSDs New OSDs
-------- --------


@ -101,8 +101,19 @@ To drain all daemons from a host, run a command of the following form:
ceph orch host drain *<host>* ceph orch host drain *<host>*
The ``_no_schedule`` label will be applied to the host. See The ``_no_schedule`` and ``_no_conf_keyring`` labels will be applied to the
:ref:`cephadm-special-host-labels`. host. See :ref:`cephadm-special-host-labels`.
If you only want to drain daemons but leave managed ceph conf and keyring
files on the host, you may pass the ``--keep-conf-keyring`` flag to the
drain command.
.. prompt:: bash #
ceph orch host drain *<host>* --keep-conf-keyring
This will apply the ``_no_schedule`` label to the host but not the
``_no_conf_keyring`` label.
All OSDs on the host will be scheduled to be removed. You can check the progress of the OSD removal operation with the following command: All OSDs on the host will be scheduled to be removed. You can check the progress of the OSD removal operation with the following command:
@ -112,6 +123,14 @@ All OSDs on the host will be scheduled to be removed. You can check the progress
See :ref:`cephadm-osd-removal` for more details about OSD removal. See :ref:`cephadm-osd-removal` for more details about OSD removal.
The ``orch host drain`` command also supports a ``--zap-osd-devices``
flag. Setting this flag while draining a host will cause cephadm to zap
the devices of the OSDs it is removing as part of the drain process.
.. prompt:: bash #
ceph orch host drain *<host>* --zap-osd-devices
Use the following command to determine whether any daemons are still on the Use the following command to determine whether any daemons are still on the
host: host:
@ -183,6 +202,12 @@ The following host labels have a special meaning to cephadm. All start with ``_
an existing host that already contains Ceph daemons, it will cause cephadm to move an existing host that already contains Ceph daemons, it will cause cephadm to move
those daemons elsewhere (except OSDs, which are not removed automatically). those daemons elsewhere (except OSDs, which are not removed automatically).
* ``_no_conf_keyring``: *Do not deploy config files or keyrings on this host*.
This label is effectively the same as ``_no_schedule`` but instead of working for
daemons it works for client keyrings and ceph conf files that are being managed
by cephadm.
* ``_no_autotune_memory``: *Do not autotune memory on this host*. * ``_no_autotune_memory``: *Do not autotune memory on this host*.
This label will prevent daemon memory from being tuned even when the This label will prevent daemon memory from being tuned even when the
@ -290,7 +315,7 @@ create a new CRUSH host located in the specified hierarchy.
.. note:: .. note::
The ``location`` attribute will only affect the initial CRUSH location. Subsequent The ``location`` attribute will only affect the initial CRUSH location. Subsequent
changes of the ``location`` property will be ignored. Also, removing a host will no remove changes of the ``location`` property will be ignored. Also, removing a host will not remove
any CRUSH buckets. any CRUSH buckets.
See also :ref:`crush_map_default_types`. See also :ref:`crush_map_default_types`.
@ -505,7 +530,23 @@ There are two ways to customize this configuration for your environment:
manually distributed to the mgr data directory manually distributed to the mgr data directory
(``/var/lib/ceph/<cluster-fsid>/mgr.<id>`` on the host, visible at (``/var/lib/ceph/<cluster-fsid>/mgr.<id>`` on the host, visible at
``/var/lib/ceph/mgr/ceph-<id>`` from inside the container). ``/var/lib/ceph/mgr/ceph-<id>`` from inside the container).
Setting up CA signed keys for the cluster
-----------------------------------------
Cephadm also supports using CA signed keys for SSH authentication
across cluster nodes. In this setup, rather than a plain private/public
key pair, cephadm needs a private key and a certificate created by
signing that key with a CA key. For more information on setting up
nodes for authentication using a CA signed key, see
:ref:`cephadm-bootstrap-ca-signed-keys`. Once you have your private key
and signed cert, they can be set up for cephadm to use by running:
.. prompt:: bash #
ceph config-key set mgr/cephadm/ssh_identity_key -i <private-key-file>
ceph config-key set mgr/cephadm/ssh_identity_cert -i <signed-cert-file>
.. _cephadm-fqdn: .. _cephadm-fqdn:
Fully qualified domain names vs bare host names Fully qualified domain names vs bare host names


@ -50,8 +50,8 @@ There are two ways to install ``cephadm``:
distribution-specific installations distribution-specific installations
----------------------------------- -----------------------------------
Some Linux distributions may already include up-to-date Ceph packages. In that Some Linux distributions may already include up-to-date Ceph packages. In
case, you can install cephadm directly. For example: that case, you can install cephadm directly. For example:
In Ubuntu: In Ubuntu:
@ -87,7 +87,7 @@ curl-based installation
* First, determine what version of Ceph you will need. You can use the releases * First, determine what version of Ceph you will need. You can use the releases
page to find the `latest active releases <https://docs.ceph.com/en/latest/releases/#active-releases>`_. page to find the `latest active releases <https://docs.ceph.com/en/latest/releases/#active-releases>`_.
For example, we might look at that page and find that ``17.2.6`` is the latest For example, we might look at that page and find that ``18.2.0`` is the latest
active release. active release.
* Use ``curl`` to fetch a build of cephadm for that release. * Use ``curl`` to fetch a build of cephadm for that release.
@ -95,7 +95,7 @@ curl-based installation
.. prompt:: bash # .. prompt:: bash #
:substitutions: :substitutions:
CEPH_RELEASE=17.2.6 # replace this with the active release CEPH_RELEASE=18.2.0 # replace this with the active release
curl --silent --remote-name --location https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm curl --silent --remote-name --location https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm
Ensure the ``cephadm`` file is executable: Ensure the ``cephadm`` file is executable:
@ -121,28 +121,41 @@ curl-based installation
python3.8 ./cephadm <arguments...> python3.8 ./cephadm <arguments...>
* Although the standalone cephadm is sufficient to get a cluster started, it is .. _cephadm_update:
convenient to have the ``cephadm`` command installed on the host. To install
the packages that provide the ``cephadm`` command, run the following
commands:
.. prompt:: bash # update cephadm
:substitutions: --------------
./cephadm add-repo --release |stable-release| The cephadm binary can be used to bootstrap a cluster and for a variety
./cephadm install of other management and debugging tasks. The Ceph team strongly recommends
using an actively supported version of cephadm. Additionally, although
the standalone cephadm is sufficient to get a cluster started, it is
convenient to have the ``cephadm`` command installed on the host. Older or LTS
distros may also have ``cephadm`` packages that are out-of-date and
running the commands below can help install a more recent version
from the Ceph project's repositories.
Confirm that ``cephadm`` is now in your PATH by running ``which``: To install the packages provided by the Ceph project that provide the
``cephadm`` command, run the following commands:
.. prompt:: bash # .. prompt:: bash #
:substitutions:
which cephadm ./cephadm add-repo --release |stable-release|
./cephadm install
A successful ``which cephadm`` command will return this: Confirm that ``cephadm`` is now in your PATH by running ``which`` or
``command -v``:
.. code-block:: bash .. prompt:: bash #
/usr/sbin/cephadm which cephadm
A successful ``which cephadm`` command will return this:
.. code-block:: bash
/usr/sbin/cephadm
Bootstrap a new cluster Bootstrap a new cluster
======================= =======================
@ -157,6 +170,9 @@ cluster's first "monitor daemon", and that monitor daemon needs an IP address.
You must pass the IP address of the Ceph cluster's first host to the ``ceph You must pass the IP address of the Ceph cluster's first host to the ``ceph
bootstrap`` command, so you'll need to know the IP address of that host. bootstrap`` command, so you'll need to know the IP address of that host.
.. important:: ``ssh`` must be installed and running in order for the
bootstrapping procedure to succeed.
.. note:: If there are multiple networks and interfaces, be sure to choose one .. note:: If there are multiple networks and interfaces, be sure to choose one
that will be accessible by any host accessing the Ceph cluster. that will be accessible by any host accessing the Ceph cluster.
@ -184,6 +200,8 @@ This command will:
with this label will (also) get a copy of ``/etc/ceph/ceph.conf`` and with this label will (also) get a copy of ``/etc/ceph/ceph.conf`` and
``/etc/ceph/ceph.client.admin.keyring``. ``/etc/ceph/ceph.client.admin.keyring``.
.. _cephadm-bootstrap-further-info:
Further information about cephadm bootstrap Further information about cephadm bootstrap
------------------------------------------- -------------------------------------------
@ -303,18 +321,21 @@ its status with:
Adding Hosts Adding Hosts
============ ============
Next, add all hosts to the cluster by following :ref:`cephadm-adding-hosts`. Add all hosts to the cluster by following the instructions in
:ref:`cephadm-adding-hosts`.
By default, a ``ceph.conf`` file and a copy of the ``client.admin`` keyring By default, a ``ceph.conf`` file and a copy of the ``client.admin`` keyring are
are maintained in ``/etc/ceph`` on all hosts with the ``_admin`` label, which is initially maintained in ``/etc/ceph`` on all hosts that have the ``_admin`` label. This
applied only to the bootstrap host. We usually recommend that one or more other hosts be label is initially applied only to the bootstrap host. We usually recommend
given the ``_admin`` label so that the Ceph CLI (e.g., via ``cephadm shell``) is easily that one or more other hosts be given the ``_admin`` label so that the Ceph CLI
accessible on multiple hosts. To add the ``_admin`` label to additional host(s): (for example, via ``cephadm shell``) is easily accessible on multiple hosts. To add
the ``_admin`` label to additional host(s), run a command of the following form:
.. prompt:: bash # .. prompt:: bash #
ceph orch host label add *<host>* _admin ceph orch host label add *<host>* _admin
Adding additional MONs Adding additional MONs
====================== ======================
@ -454,3 +475,104 @@ have access to all hosts that you plan to add to the cluster.
cephadm --image *<hostname>*:5000/ceph/ceph bootstrap --mon-ip *<mon-ip>* cephadm --image *<hostname>*:5000/ceph/ceph bootstrap --mon-ip *<mon-ip>*
.. _cluster network: ../rados/configuration/network-config-ref#cluster-network .. _cluster network: ../rados/configuration/network-config-ref#cluster-network
.. _cephadm-bootstrap-custom-ssh-keys:
Deployment with custom SSH keys
-------------------------------
Bootstrap allows users to create their own private/public SSH key pair
rather than having cephadm generate them automatically.
To use custom SSH keys, pass the ``--ssh-private-key`` and ``--ssh-public-key``
fields to bootstrap. Both parameters require a path to the file where the
keys are stored:
.. prompt:: bash #
cephadm bootstrap --mon-ip <ip-addr> --ssh-private-key <private-key-filepath> --ssh-public-key <public-key-filepath>
This setup allows users to use a key that has already been distributed to the
hosts that will be added to the cluster before bootstrap.
.. note:: In order for cephadm to connect to other hosts you'd like to add
to the cluster, make sure the public key of the key pair provided is set up
as an authorized key for the ssh user being used, typically root. If you'd
like more info on using a non-root user as the ssh user, see :ref:`cephadm-bootstrap-further-info`.
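For example, one way to distribute the public key to the prospective cluster hosts
before bootstrapping is with ``ssh-copy-id``. This is only a sketch; the host names
are placeholders and any other distribution method works just as well:

.. prompt:: bash #

   ssh-copy-id -f -i <public-key-filepath> root@host2
   ssh-copy-id -f -i <public-key-filepath> root@host3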
.. _cephadm-bootstrap-ca-signed-keys:
Deployment with CA signed SSH keys
----------------------------------
As an alternative to standard public key authentication, cephadm also supports
deployment using CA signed keys. Before bootstrapping it's recommended to set up
the CA public key as a trusted CA key on hosts you'd like to eventually add to
the cluster. For example:
.. prompt:: bash
# we will act as our own CA, therefore we'll need to make a CA key
[root@host1 ~]# ssh-keygen -t rsa -f ca-key -N ""
# make the ca key trusted on the host we've generated it on
# this requires adding a line to our /etc/ssh/sshd_config
# to mark this key as trusted
[root@host1 ~]# cp ca-key.pub /etc/ssh
[root@host1 ~]# vi /etc/ssh/sshd_config
[root@host1 ~]# cat /etc/ssh/sshd_config | grep ca-key
TrustedUserCAKeys /etc/ssh/ca-key.pub
# now restart sshd so it picks up the config change
[root@host1 ~]# systemctl restart sshd
# now, on all other hosts we want in the cluster, also install the CA key
[root@host1 ~]# scp /etc/ssh/ca-key.pub host2:/etc/ssh/
# on other hosts, make the same changes to the sshd_config
[root@host2 ~]# vi /etc/ssh/sshd_config
[root@host2 ~]# cat /etc/ssh/sshd_config | grep ca-key
TrustedUserCAKeys /etc/ssh/ca-key.pub
# and restart sshd so it picks up the config change
[root@host2 ~]# systemctl restart sshd
Once the CA key has been installed and marked as a trusted key, you are ready
to use a private key/CA signed cert combination for SSH. Continuing with our
current example, we will create a new key pair for host access and then
sign it with our CA key:
.. prompt:: bash
# make a new key pair
[root@host1 ~]# ssh-keygen -t rsa -f cephadm-ssh-key -N ""
# sign the new key (its public part). This will create cephadm-ssh-key-cert.pub
# note here we're using user "root". If you'd like to use a non-root
# user the arguments to the -I and -n params would need to be adjusted
# Additionally, note the -V param indicates how long until the cert
# this creates will expire
[root@host1 ~]# ssh-keygen -s ca-key -I user_root -n root -V +52w cephadm-ssh-key
[root@host1 ~]# ls
ca-key ca-key.pub cephadm-ssh-key cephadm-ssh-key-cert.pub cephadm-ssh-key.pub
# verify our signed key is working. To do this, make sure the generated private
# key ("cephadm-ssh-key" in our example) and the newly signed cert are stored
# in the same directory. Then try to ssh using the private key
[root@host1 ~]# ssh -i cephadm-ssh-key host2
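If you want to confirm exactly what was signed (the key ID, principals, and
validity window), ``ssh-keygen`` can print the certificate contents. A quick
check using the file names from this example:

.. prompt:: bash

   # inspect the signed certificate
   [root@host1 ~]# ssh-keygen -L -f cephadm-ssh-key-cert.pub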
Once you have your private key and the corresponding CA signed cert, and you have
verified that SSH authentication with that key works, you can pass those files to
bootstrap so that cephadm uses them when SSHing between cluster hosts:
.. prompt:: bash
[root@host1 ~]# cephadm bootstrap --mon-ip <ip-addr> --ssh-private-key cephadm-ssh-key --ssh-signed-cert cephadm-ssh-key-cert.pub
Note that this setup does not require installing the corresponding public key
from the private key passed to bootstrap on other nodes. In fact, cephadm will
reject the ``--ssh-public-key`` argument when it is passed along with
``--ssh-signed-cert``. This is not because the public key would break anything,
but because it is not needed for this setup, and omitting it helps bootstrap
distinguish between the CA signed key setup and standard public key
authentication. As a result, SSH key rotation is simply a matter of getting
another key signed by the same CA and providing cephadm with the new private
key and signed cert. No additional distribution of keys to cluster nodes is
needed after the initial setup of the CA key as a trusted key, no matter how
many new private key/signed cert pairs are rotated in.

View File

@ -541,13 +541,60 @@ a spec like
which would cause each mon daemon to be deployed with `--cpus=2`. which would cause each mon daemon to be deployed with `--cpus=2`.
There are two ways to express arguments in the ``extra_container_args`` list.
To start, an item in the list can be a string. When passing an argument
as a string and the string contains spaces, Cephadm will automatically split it
into multiple arguments. For example, ``--cpus 2`` would become ``["--cpus",
"2"]`` when processed. Example:
.. code-block:: yaml
service_type: mon
service_name: mon
placement:
hosts:
- host1
- host2
- host3
extra_container_args:
- "--cpus 2"
As an alternative, an item in the list can be an object (mapping) containing
the required key "argument" and an optional key "split". The value associated
with the ``argument`` key must be a single string. The value associated with
the ``split`` key is a boolean value. The ``split`` key explicitly controls if
spaces in the argument value cause the value to be split into multiple
arguments. If ``split`` is true then Cephadm will automatically split the value
into multiple arguments. If ``split`` is false then spaces in the value will
be retained in the argument. The default, when ``split`` is not provided, is
false. Examples:
.. code-block:: yaml
service_type: mon
service_name: mon
placement:
hosts:
- tiebreaker
extra_container_args:
# No spaces, always treated as a single argument
- argument: "--timeout=3000"
# Splitting explicitly disabled, one single argument
- argument: "--annotation=com.example.name=my favorite mon"
split: false
# Splitting explicitly enabled, will become two arguments
- argument: "--cpuset-cpus 1-3,7-11"
split: true
# Splitting implicitly disabled, one single argument
- argument: "--annotation=com.example.note=a simple example"
Mounting Files with Extra Container Arguments Mounting Files with Extra Container Arguments
--------------------------------------------- ---------------------------------------------
A common use case for extra container arguments is to mount additional A common use case for extra container arguments is to mount additional
files within the container. However, some intuitive formats for doing files within the container. Older versions of Ceph did not support spaces
so can cause deployment to fail (see https://tracker.ceph.com/issues/57338). in arguments and therefore the examples below apply to the widest range
The recommended syntax for mounting a file with extra container arguments is: of Ceph versions.
.. code-block:: yaml .. code-block:: yaml
@ -587,6 +634,55 @@ the node-exporter service , one could apply a service spec like
extra_entrypoint_args: extra_entrypoint_args:
- "--collector.textfile.directory=/var/lib/node_exporter/textfile_collector2" - "--collector.textfile.directory=/var/lib/node_exporter/textfile_collector2"
There are two ways to express arguments in the ``extra_entrypoint_args`` list.
To start, an item in the list can be a string. When passing an argument as a
string and the string contains spaces, cephadm will automatically split it into
multiple arguments. For example, ``--debug_ms 10`` would become
``["--debug_ms", "10"]`` when processed. Example:
.. code-block:: yaml
service_type: mon
service_name: mon
placement:
hosts:
- host1
- host2
- host3
extra_entrypoint_args:
- "--debug_ms 2"
As an alternative, an item in the list can be an object (mapping) containing
the required key "argument" and an optional key "split". The value associated
with the ``argument`` key must be a single string. The value associated with
the ``split`` key is a boolean value. The ``split`` key explicitly controls if
spaces in the argument value cause the value to be split into multiple
arguments. If ``split`` is true then cephadm will automatically split the value
into multiple arguments. If ``split`` is false then spaces in the value will
be retained in the argument. The default, when ``split`` is not provided, is
false. Examples:
.. code-block:: yaml
# A theoretical data migration service
service_type: pretend
service_name: imagine1
placement:
hosts:
- host1
extra_entrypoint_args:
# No spaces, always treated as a single argument
- argument: "--timeout=30m"
# Splitting explicitly disabled, one single argument
- argument: "--import=/mnt/usb/My Documents"
split: false
# Splitting explicitly enabled, will become two arguments
- argument: "--tag documents"
split: true
# Splitting implicitly disabled, one single argument
- argument: "--title=Imported Documents"
Custom Config Files Custom Config Files
=================== ===================

View File

@ -20,7 +20,18 @@ For example:
ceph fs volume create <fs_name> --placement="<placement spec>" ceph fs volume create <fs_name> --placement="<placement spec>"
where ``fs_name`` is the name of the CephFS and ``placement`` is a where ``fs_name`` is the name of the CephFS and ``placement`` is a
:ref:`orchestrator-cli-placement-spec`. :ref:`orchestrator-cli-placement-spec`. For example, to place
MDS daemons for the new ``foo`` volume on hosts labeled with ``mds``:
.. prompt:: bash #
ceph fs volume create foo --placement="label:mds"
You can also update the placement after-the-fact via:
.. prompt:: bash #
ceph orch apply mds foo 'mds-[012]'
For manually deploying MDS daemons, use this specification: For manually deploying MDS daemons, use this specification:
@ -30,6 +41,7 @@ For manually deploying MDS daemons, use this specification:
service_id: fs_name service_id: fs_name
placement: placement:
count: 3 count: 3
label: mds
The specification can then be applied using: The specification can then be applied using:

View File

@ -4,8 +4,8 @@
MGR Service MGR Service
=========== ===========
The cephadm MGR service is hosting different modules, like the :ref:`mgr-dashboard` The cephadm MGR service hosts multiple modules. These include the
and the cephadm manager module. :ref:`mgr-dashboard` and the cephadm manager module.
.. _cephadm-mgr-networks: .. _cephadm-mgr-networks:

View File

@ -161,6 +161,53 @@ that will tell it to bind to that specific IP.
Note that in these setups, one should make sure to include ``count: 1`` in the Note that in these setups, one should make sure to include ``count: 1`` in the
nfs placement, as it's only possible for one nfs daemon to bind to the virtual IP. nfs placement, as it's only possible for one nfs daemon to bind to the virtual IP.
NFS with HAProxy Protocol Support
----------------------------------
Cephadm supports deploying NFS in High-Availability mode with additional
HAProxy protocol support. This works just like High-availability NFS but also
supports client IP level configuration on NFS Exports. This feature requires
`NFS-Ganesha v5.0`_ or later.
.. _NFS-Ganesha v5.0: https://github.com/nfs-ganesha/nfs-ganesha/wiki/ReleaseNotes_5
To use this mode, you'll either want to set up the service using the nfs module
(see :ref:`nfs-module-cluster-create`) or manually create services with the
extra parameter ``enable_haproxy_protocol`` set to true. Both the NFS service
and the ingress service must have ``enable_haproxy_protocol`` set to the same value.
For example:
.. code-block:: yaml
service_type: ingress
service_id: nfs.foo
placement:
count: 1
hosts:
- host1
- host2
- host3
spec:
backend_service: nfs.foo
monitor_port: 9049
virtual_ip: 192.168.122.100/24
enable_haproxy_protocol: true
.. code-block:: yaml
service_type: nfs
service_id: foo
placement:
count: 1
hosts:
- host1
- host2
- host3
spec:
port: 2049
enable_haproxy_protocol: true
Further Reading Further Reading
=============== ===============

View File

@ -15,10 +15,9 @@ To print a list of devices discovered by ``cephadm``, run this command:
.. prompt:: bash # .. prompt:: bash #
ceph orch device ls [--hostname=...] [--wide] [--refresh] ceph orch device ls [--hostname=...] [--wide] [--refresh]
Example Example::
::
Hostname Path Type Serial Size Health Ident Fault Available Hostname Path Type Serial Size Health Ident Fault Available
srv-01 /dev/sdb hdd 15P0A0YFFRD6 300G Unknown N/A N/A No srv-01 /dev/sdb hdd 15P0A0YFFRD6 300G Unknown N/A N/A No
@ -44,7 +43,7 @@ enable cephadm's "enhanced device scan" option as follows;
.. prompt:: bash # .. prompt:: bash #
ceph config set mgr mgr/cephadm/device_enhanced_scan true ceph config set mgr mgr/cephadm/device_enhanced_scan true
.. warning:: .. warning::
Although the libstoragemgmt library performs standard SCSI inquiry calls, Although the libstoragemgmt library performs standard SCSI inquiry calls,
@ -173,16 +172,16 @@ will happen without actually creating the OSDs.
For example: For example:
.. prompt:: bash # .. prompt:: bash #
ceph orch apply osd --all-available-devices --dry-run ceph orch apply osd --all-available-devices --dry-run
:: ::
NAME HOST DATA DB WAL NAME HOST DATA DB WAL
all-available-devices node1 /dev/vdb - - all-available-devices node1 /dev/vdb - -
all-available-devices node2 /dev/vdc - - all-available-devices node2 /dev/vdc - -
all-available-devices node3 /dev/vdd - - all-available-devices node3 /dev/vdd - -
.. _cephadm-osd-declarative: .. _cephadm-osd-declarative:
@ -197,9 +196,9 @@ command completes will be automatically found and added to the cluster.
We will examine the effects of the following command: We will examine the effects of the following command:
.. prompt:: bash # .. prompt:: bash #
ceph orch apply osd --all-available-devices ceph orch apply osd --all-available-devices
After running the above command: After running the above command:
@ -212,17 +211,17 @@ If you want to avoid this behavior (disable automatic creation of OSD on availab
.. prompt:: bash # .. prompt:: bash #
ceph orch apply osd --all-available-devices --unmanaged=true ceph orch apply osd --all-available-devices --unmanaged=true
.. note:: .. note::
Keep these three facts in mind: Keep these three facts in mind:
- The default behavior of ``ceph orch apply`` causes cephadm constantly to reconcile. This means that cephadm creates OSDs as soon as new drives are detected. - The default behavior of ``ceph orch apply`` causes cephadm constantly to reconcile. This means that cephadm creates OSDs as soon as new drives are detected.
- Setting ``unmanaged: True`` disables the creation of OSDs. If ``unmanaged: True`` is set, nothing will happen even if you apply a new OSD service. - Setting ``unmanaged: True`` disables the creation of OSDs. If ``unmanaged: True`` is set, nothing will happen even if you apply a new OSD service.
- ``ceph orch daemon add`` creates OSDs, but does not add an OSD service. - ``ceph orch daemon add`` creates OSDs, but does not add an OSD service.
* For cephadm, see also :ref:`cephadm-spec-unmanaged`. * For cephadm, see also :ref:`cephadm-spec-unmanaged`.
@ -250,7 +249,7 @@ Example:
Expected output:: Expected output::
Scheduled OSD(s) for removal Scheduled OSD(s) for removal
OSDs that are not safe to destroy will be rejected. OSDs that are not safe to destroy will be rejected.
@ -273,14 +272,14 @@ You can query the state of OSD operation with the following command:
.. prompt:: bash # .. prompt:: bash #
ceph orch osd rm status ceph orch osd rm status
Expected output:: Expected output::
OSD_ID HOST STATE PG_COUNT REPLACE FORCE STARTED_AT OSD_ID HOST STATE PG_COUNT REPLACE FORCE STARTED_AT
2 cephadm-dev done, waiting for purge 0 True False 2020-07-17 13:01:43.147684 2 cephadm-dev done, waiting for purge 0 True False 2020-07-17 13:01:43.147684
3 cephadm-dev draining 17 False True 2020-07-17 13:01:45.162158 3 cephadm-dev draining 17 False True 2020-07-17 13:01:45.162158
4 cephadm-dev started 42 False True 2020-07-17 13:01:45.162158 4 cephadm-dev started 42 False True 2020-07-17 13:01:45.162158
When no PGs are left on the OSD, it will be decommissioned and removed from the cluster. When no PGs are left on the OSD, it will be decommissioned and removed from the cluster.
@ -302,11 +301,11 @@ Example:
.. prompt:: bash # .. prompt:: bash #
ceph orch osd rm stop 4 ceph orch osd rm stop 4
Expected output:: Expected output::
Stopped OSD(s) removal Stopped OSD(s) removal
This resets the initial state of the OSD and takes it off the removal queue. This resets the initial state of the OSD and takes it off the removal queue.
@ -327,7 +326,7 @@ Example:
Expected output:: Expected output::
Scheduled OSD(s) for replacement Scheduled OSD(s) for replacement
This follows the same procedure as the procedure in the "Remove OSD" section, with This follows the same procedure as the procedure in the "Remove OSD" section, with
one exception: the OSD is not permanently removed from the CRUSH hierarchy, but is one exception: the OSD is not permanently removed from the CRUSH hierarchy, but is
@ -434,10 +433,10 @@ the ``ceph orch ps`` output in the ``MEM LIMIT`` column::
To exclude an OSD from memory autotuning, disable the autotune option To exclude an OSD from memory autotuning, disable the autotune option
for that OSD and also set a specific memory target. For example, for that OSD and also set a specific memory target. For example,
.. prompt:: bash # .. prompt:: bash #
ceph config set osd.123 osd_memory_target_autotune false ceph config set osd.123 osd_memory_target_autotune false
ceph config set osd.123 osd_memory_target 16G ceph config set osd.123 osd_memory_target 16G
.. _drivegroups: .. _drivegroups:
@ -500,7 +499,7 @@ Example
.. prompt:: bash [monitor.1]# .. prompt:: bash [monitor.1]#
ceph orch apply -i /path/to/osd_spec.yml --dry-run ceph orch apply -i /path/to/osd_spec.yml --dry-run
@ -510,9 +509,9 @@ Filters
------- -------
.. note:: .. note::
Filters are applied using an `AND` gate by default. This means that a drive Filters are applied using an `AND` gate by default. This means that a drive
must fulfill all filter criteria in order to get selected. This behavior can must fulfill all filter criteria in order to get selected. This behavior can
be adjusted by setting ``filter_logic: OR`` in the OSD specification. be adjusted by setting ``filter_logic: OR`` in the OSD specification.
Filters are used to assign disks to groups, using their attributes to group Filters are used to assign disks to groups, using their attributes to group
them. them.
@ -522,7 +521,7 @@ information about the attributes with this command:
.. code-block:: bash .. code-block:: bash
ceph-volume inventory </path/to/disk> ceph-volume inventory </path/to/disk>
Vendor or Model Vendor or Model
^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
@ -631,9 +630,9 @@ but want to use only the first two, you could use `limit`:
.. code-block:: yaml .. code-block:: yaml
data_devices: data_devices:
vendor: VendorA vendor: VendorA
limit: 2 limit: 2
.. note:: `limit` is a last resort and shouldn't be used if it can be avoided. .. note:: `limit` is a last resort and shouldn't be used if it can be avoided.
@ -856,8 +855,8 @@ See :ref:`orchestrator-cli-placement-spec`
.. note:: .. note::
Assuming each host has a unique disk layout, each OSD Assuming each host has a unique disk layout, each OSD
spec needs to have a different service id spec needs to have a different service id
Dedicated wal + db Dedicated wal + db
@ -987,7 +986,7 @@ activates all existing OSDs on a host.
.. prompt:: bash # .. prompt:: bash #
ceph cephadm osd activate <host>... ceph cephadm osd activate <host>...
This will scan all existing disks for OSDs and deploy corresponding daemons. This will scan all existing disks for OSDs and deploy corresponding daemons.

View File

@ -239,12 +239,14 @@ It is a yaml format file with the following properties:
- host2 - host2
- host3 - host3
spec: spec:
backend_service: rgw.something # adjust to match your existing RGW service backend_service: rgw.something # adjust to match your existing RGW service
virtual_ip: <string>/<string> # ex: 192.168.20.1/24 virtual_ip: <string>/<string> # ex: 192.168.20.1/24
frontend_port: <integer> # ex: 8080 frontend_port: <integer> # ex: 8080
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
virtual_interface_networks: [ ... ] # optional: list of CIDR networks virtual_interface_networks: [ ... ] # optional: list of CIDR networks
ssl_cert: | # optional: SSL certificate and key use_keepalived_multicast: <bool> # optional: Default is False.
vrrp_interface_network: <string>/<string> # optional: ex: 192.168.20.0/24
ssl_cert: | # optional: SSL certificate and key
-----BEGIN CERTIFICATE----- -----BEGIN CERTIFICATE-----
... ...
-----END CERTIFICATE----- -----END CERTIFICATE-----
@ -270,6 +272,7 @@ It is a yaml format file with the following properties:
frontend_port: <integer> # ex: 8080 frontend_port: <integer> # ex: 8080
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
virtual_interface_networks: [ ... ] # optional: list of CIDR networks virtual_interface_networks: [ ... ] # optional: list of CIDR networks
first_virtual_router_id: <integer> # optional: default 50
ssl_cert: | # optional: SSL certificate and key ssl_cert: | # optional: SSL certificate and key
-----BEGIN CERTIFICATE----- -----BEGIN CERTIFICATE-----
... ...
@ -303,6 +306,21 @@ where the properties of this service specification are:
* ``ssl_cert``: * ``ssl_cert``:
SSL certificate, if SSL is to be enabled. This must contain both the certificate and SSL certificate, if SSL is to be enabled. This must contain both the certificate and
private key blocks in .pem format. private key blocks in .pem format.
* ``use_keepalived_multicast``
Default is False. By default, cephadm deploys a keepalived configuration that uses
unicast IPs, namely the same IPs that cephadm uses to connect to the hosts. If
multicast is preferred, set ``use_keepalived_multicast`` to ``True`` and keepalived
will use the multicast IP (224.0.0.18) to communicate between instances, using the
same interfaces on which the VIPs are configured.
* ``vrrp_interface_network``
By default, cephadm configures keepalived to use the same interface on which the
VIPs are configured for VRRP communication. If another interface is needed, it can
be selected by setting ``vrrp_interface_network`` to a network that identifies
which Ethernet interface to use.
* ``first_virtual_router_id``
Default is 50. When deploying more than one ingress service, this parameter can be
used to ensure that each keepalived instance has a different virtual_router_id. When
``virtual_ips_list`` is used, each IP creates its own virtual router, so the first
one gets ``first_virtual_router_id``, the second gets ``first_virtual_router_id`` + 1,
and so on. Valid values range from 1 to 255. A combined example is shown below.
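For example, a hypothetical ingress spec that combines these options (all values
are illustrative only):

.. code-block:: yaml

    service_type: ingress
    service_id: rgw.something
    placement:
      count: 2
    spec:
      backend_service: rgw.something
      virtual_ip: 192.168.20.1/24
      frontend_port: 8080
      monitor_port: 1967
      vrrp_interface_network: 192.168.30.0/24
      first_virtual_router_id: 60

If multicast is preferred instead, drop ``vrrp_interface_network`` and set
``use_keepalived_multicast: true``.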
.. _ingress-virtual-ip: .. _ingress-virtual-ip:

View File

@ -1,60 +1,56 @@
Troubleshooting Troubleshooting
=============== ===============
You may wish to investigate why a cephadm command failed This section explains how to investigate why a cephadm command failed or why a
or why a certain service no longer runs properly. certain service no longer runs properly.
Cephadm deploys daemons within containers. This means that Cephadm deploys daemons within containers. Troubleshooting containerized
troubleshooting those containerized daemons will require daemons requires a different process than does troubleshooting traditional
a different process than traditional package-install daemons. daemons that were installed by means of packages.
Here are some tools and commands to help you troubleshoot Here are some tools and commands to help you troubleshoot your Ceph
your Ceph environment. environment.
.. _cephadm-pause: .. _cephadm-pause:
Pausing or Disabling cephadm Pausing or Disabling cephadm
---------------------------- ----------------------------
If something goes wrong and cephadm is behaving badly, you can If something goes wrong and cephadm is behaving badly, pause most of the Ceph
pause most of the Ceph cluster's background activity by running cluster's background activity by running the following command:
the following command:
.. prompt:: bash # .. prompt:: bash #
ceph orch pause ceph orch pause
This stops all changes in the Ceph cluster, but cephadm will This stops all changes in the Ceph cluster, but cephadm will still periodically
still periodically check hosts to refresh its inventory of check hosts to refresh its inventory of daemons and devices. Disable cephadm
daemons and devices. You can disable cephadm completely by completely by running the following commands:
running the following commands:
.. prompt:: bash # .. prompt:: bash #
ceph orch set backend '' ceph orch set backend ''
ceph mgr module disable cephadm ceph mgr module disable cephadm
These commands disable all of the ``ceph orch ...`` CLI commands. These commands disable all of the ``ceph orch ...`` CLI commands. All
All previously deployed daemon containers continue to exist and previously deployed daemon containers continue to run and will start just as
will start as they did before you ran these commands. they were before you ran these commands.
See :ref:`cephadm-spec-unmanaged` for information on disabling See :ref:`cephadm-spec-unmanaged` for more on disabling individual services.
individual services.
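If you later want to undo these steps, the same pieces are re-enabled in reverse.
This is a minimal sketch, assuming the standard cephadm backend:

.. prompt:: bash #

   ceph mgr module enable cephadm
   ceph orch set backend cephadm
   ceph orch resume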
Per-service and Per-daemon Events Per-service and Per-daemon Events
--------------------------------- ---------------------------------
In order to facilitate debugging failed daemons, To make it easier to debug failed daemons, cephadm stores events per service
cephadm stores events per service and per daemon. and per daemon. These events often contain information relevant to
These events often contain information relevant to the troubleshooting of your Ceph cluster.
troubleshooting your Ceph cluster.
Listing Service Events Listing Service Events
~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~
To see the events associated with a certain service, run a To see the events associated with a certain service, run a command of the
command of the and following form: following form:
.. prompt:: bash # .. prompt:: bash #
@ -81,8 +77,8 @@ This will return something in the following form:
Listing Daemon Events Listing Daemon Events
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
To see the events associated with a certain daemon, run a To see the events associated with a certain daemon, run a command of the
command of the and following form: following form:
.. prompt:: bash # .. prompt:: bash #
@ -105,32 +101,41 @@ This will return something in the following form:
Checking Cephadm Logs Checking Cephadm Logs
--------------------- ---------------------
To learn how to monitor cephadm logs as they are generated, read :ref:`watching_cephadm_logs`. To learn how to monitor cephadm logs as they are generated, read
:ref:`watching_cephadm_logs`.
If your Ceph cluster has been configured to log events to files, there will be a If your Ceph cluster has been configured to log events to files, there will be
``ceph.cephadm.log`` file on all monitor hosts (see a ``ceph.cephadm.log`` file on all monitor hosts. See :ref:`cephadm-logs` for a
:ref:`cephadm-logs` for a more complete explanation). more complete explanation.
Gathering Log Files Gathering Log Files
------------------- -------------------
Use journalctl to gather the log files of all daemons: Use ``journalctl`` to gather the log files of all daemons:
.. note:: By default cephadm now stores logs in journald. This means .. note:: By default cephadm now stores logs in journald. This means
that you will no longer find daemon logs in ``/var/log/ceph/``. that you will no longer find daemon logs in ``/var/log/ceph/``.
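If you prefer to query journald directly, the systemd unit naming scheme shown
later on this page can be used. A minimal sketch, in which ``<fsid>`` and
``<daemon-name>`` are placeholders:

.. prompt:: bash #

   journalctl -u ceph-<fsid>@<daemon-name>.service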
To read the log file of one specific daemon, run:: To read the log file of one specific daemon, run a command of the following
form:
cephadm logs --name <name-of-daemon> .. prompt:: bash
Note: this only works when run on the same host where the daemon is running. To cephadm logs --name <name-of-daemon>
get logs of a daemon running on a different host, give the ``--fsid`` option::
cephadm logs --fsid <fsid> --name <name-of-daemon> .. Note:: This works only when run on the same host that is running the daemon.
To get the logs of a daemon that is running on a different host, add the
``--fsid`` option to the command, as in the following example:
where the ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``. .. prompt:: bash
To fetch all log files of all daemons on a given host, run:: cephadm logs --fsid <fsid> --name <name-of-daemon>
In this example, ``<fsid>`` corresponds to the cluster ID returned by the
``ceph status`` command.
To fetch all log files of all daemons on a given host, run the following
for-loop::
for name in $(cephadm ls | jq -r '.[].name') ; do for name in $(cephadm ls | jq -r '.[].name') ; do
cephadm logs --fsid <fsid> --name "$name" > $name; cephadm logs --fsid <fsid> --name "$name" > $name;
@ -139,39 +144,41 @@ To fetch all log files of all daemons on a given host, run::
Collecting Systemd Status Collecting Systemd Status
------------------------- -------------------------
To print the state of a systemd unit, run:: To print the state of a systemd unit, run a command of the following form:
systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service"; .. prompt:: bash
systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";
To fetch all state of all daemons of a given host, run:: To fetch the state of all daemons of a given host, run the following shell
script::
fsid="$(cephadm shell ceph fsid)" fsid="$(cephadm shell ceph fsid)"
for name in $(cephadm ls | jq -r '.[].name') ; do for name in $(cephadm ls | jq -r '.[].name') ; do
systemctl status "ceph-$fsid@$name.service" > $name; systemctl status "ceph-$fsid@$name.service" > $name;
done done
List all Downloaded Container Images List all Downloaded Container Images
------------------------------------ ------------------------------------
To list all container images that are downloaded on a host: To list all container images that are downloaded on a host, run the following
commands:
.. note:: ``Image`` might also be called `ImageID` .. prompt:: bash #
:: podman ps -a --format json | jq '.[].Image' "docker.io/library/centos:8" "registry.opensuse.org/opensuse/leap:15.2"
podman ps -a --format json | jq '.[].Image' .. note:: ``Image`` might also be called ``ImageID``.
"docker.io/library/centos:8"
"registry.opensuse.org/opensuse/leap:15.2"
Manually Running Containers Manually Running Containers
--------------------------- ---------------------------
Cephadm uses small wrappers when running containers. Refer to Cephadm uses small wrappers when running containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the ``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the container
container execution command. execution command.
.. _cephadm-ssh-errors: .. _cephadm-ssh-errors:
@ -187,9 +194,10 @@ Error message::
Please make sure that the host is reachable and accepts connections using the cephadm SSH key Please make sure that the host is reachable and accepts connections using the cephadm SSH key
... ...
Things Ceph administrators can do: If you receive the above error message, try the following things to
troubleshoot the SSH connection between ``cephadm`` and the monitor:
1. Ensure cephadm has an SSH identity key:: 1. Ensure that ``cephadm`` has an SSH identity key::
[root@mon1~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key [root@mon1~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98 INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
@ -202,20 +210,21 @@ Things Ceph administrators can do:
or:: or::
[root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssk-key -i - [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -
2. Ensure that the SSH config is correct:: 2. Ensure that the SSH config is correct::
[root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config
3. Verify that we can connect to the host:: 3. Verify that it is possible to connect to the host::
[root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1 [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1
Verifying that the Public Key is Listed in the authorized_keys file Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To verify that the public key is in the authorized_keys file, run the following commands:: To verify that the public key is in the ``authorized_keys`` file, run the
following commands::
[root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
[root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys
@ -231,27 +240,33 @@ Or this error::
Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP
This means that you must run a command of this form:: This means that you must run a command of this form:
ceph config set mon public_network <mon_network> .. prompt:: bash
For more detail on operations of this kind, see :ref:`deploy_additional_monitors` ceph config set mon public_network <mon_network>
For more detail on operations of this kind, see
:ref:`deploy_additional_monitors`.
Accessing the Admin Socket Accessing the Admin Socket
-------------------------- --------------------------
Each Ceph daemon provides an admin socket that bypasses the Each Ceph daemon provides an admin socket that bypasses the MONs (See
MONs (See :ref:`rados-monitoring-using-admin-socket`). :ref:`rados-monitoring-using-admin-socket`).
To access the admin socket, first enter the daemon container on the host:: #. To access the admin socket, enter the daemon container on the host::
[root@mon1 ~]# cephadm enter --name <daemon-name> [root@mon1 ~]# cephadm enter --name <daemon-name>
[ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show
#. Run a command of the following form to see the admin socket's configuration::
[ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show
Running Various Ceph Tools Running Various Ceph Tools
-------------------------------- --------------------------------
To run Ceph tools like ``ceph-objectstore-tool`` or To run Ceph tools such as ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, invoke the cephadm CLI with ``ceph-monstore-tool``, invoke the cephadm CLI with
``cephadm shell --name <daemon-name>``. For example:: ``cephadm shell --name <daemon-name>``. For example::
@ -268,100 +283,232 @@ To run Ceph tools like ``ceph-objectstore-tool`` or
election_strategy: 1 election_strategy: 1
0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname 0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname
The cephadm shell sets up the environment in a way that is suitable The cephadm shell sets up the environment in a way that is suitable for
for extended daemon maintenance and running daemons interactively. extended daemon maintenance and for the interactive running of daemons.
.. _cephadm-restore-quorum: .. _cephadm-restore-quorum:
Restoring the Monitor Quorum Restoring the Monitor Quorum
---------------------------- ----------------------------
If the Ceph monitor daemons (mons) cannot form a quorum, cephadm will not be If the Ceph Monitor daemons (mons) cannot form a quorum, ``cephadm`` will not
able to manage the cluster until quorum is restored. be able to manage the cluster until quorum is restored.
In order to restore the quorum, remove unhealthy monitors In order to restore the quorum, remove unhealthy monitors
from the monmap by following these steps: from the monmap by following these steps:
1. Stop all mons. For each mon host:: 1. Stop all Monitors. Use ``ssh`` to connect to each Monitor's host, and then
while connected to the Monitor's host use ``cephadm`` to stop the Monitor
daemon:
ssh {mon-host} .. prompt:: bash
cephadm unit --name mon.`hostname` stop
ssh {mon-host}
cephadm unit --name {mon.hostname} stop
2. Identify a surviving monitor and log in to that host:: 2. Identify a surviving Monitor and log in to its host:
ssh {mon-host} .. prompt:: bash
cephadm enter --name mon.`hostname`
3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy` ssh {mon-host}
cephadm enter --name {mon.hostname}
3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`.
.. _cephadm-manually-deploy-mgr: .. _cephadm-manually-deploy-mgr:
Manually Deploying a Manager Daemon Manually Deploying a Manager Daemon
----------------------------------- -----------------------------------
At least one manager (mgr) daemon is required by cephadm in order to manage the At least one Manager (``mgr``) daemon is required by cephadm in order to manage
cluster. If the last mgr in a cluster has been removed, follow these steps in the cluster. If the last remaining Manager has been removed from the Ceph
order to deploy a manager called (for example) cluster, follow these steps in order to deploy a fresh Manager on an arbitrary
``mgr.hostname.smfvfd`` on a random host of your cluster manually. host in your cluster. In this example, the freshly-deployed Manager daemon is
called ``mgr.hostname.smfvfd``.
Disable the cephadm scheduler, in order to prevent cephadm from removing the new #. Disable the cephadm scheduler, in order to prevent ``cephadm`` from removing
manager. See :ref:`cephadm-enable-cli`:: the new Manager. See :ref:`cephadm-enable-cli`:
ceph config-key set mgr/cephadm/pause true .. prompt:: bash #
Then get or create the auth entry for the new manager:: ceph config-key set mgr/cephadm/pause true
ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *" #. Retrieve or create the "auth entry" for the new Manager:
Get the ceph.conf:: .. prompt:: bash #
ceph config generate-minimal-conf ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"
Get the container image:: #. Retrieve the Monitor's configuration:
ceph config get "mgr.hostname.smfvfd" container_image .. prompt:: bash #
Create a file ``config-json.json`` which contains the information necessary to deploy ceph config generate-minimal-conf
the daemon:
.. code-block:: json #. Retrieve the container image:
{ .. prompt:: bash #
"config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
"keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
}
Deploy the daemon:: ceph config get "mgr.hostname.smfvfd" container_image
cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json #. Create a file called ``config-json.json``, which contains the information
necessary to deploy the daemon:
Analyzing Core Dumps .. code-block:: json
{
"config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
"keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
}
#. Deploy the Manager daemon:
.. prompt:: bash #
cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json
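As an optional sanity check (not part of the upstream procedure), confirm that the
freshly deployed Manager shows up on the ``mgr:`` line of the cluster status:

.. prompt:: bash #

   ceph -s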
Capturing Core Dumps
--------------------- ---------------------
When a Ceph daemon crashes, cephadm supports analyzing core dumps. To enable core dumps, run A Ceph cluster that uses ``cephadm`` can be configured to capture core dumps.
The initial capture and processing of the coredump is performed by
`systemd-coredump
<https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.
To enable coredump handling, run the following command
.. prompt:: bash # .. prompt:: bash #
ulimit -c unlimited ulimit -c unlimited
Core dumps will now be written to ``/var/lib/systemd/coredump``.
.. note:: .. note::
Core dumps are not namespaced by the kernel, which means Core dumps are not namespaced by the kernel. This means that core dumps are
they will be written to ``/var/lib/systemd/coredump`` on written to ``/var/lib/systemd/coredump`` on the container host. The ``ulimit
the container host. -c unlimited`` setting will persist only until the system is rebooted.
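If you want an unlimited core size to survive reboots, one possible approach (a
host-level systemd setting, not something cephadm manages, and the drop-in file
name below is arbitrary) is to raise the default core limit for all services and
then restart the affected daemons:

.. prompt:: bash #

   mkdir -p /etc/systemd/system.conf.d
   printf '[Manager]\nDefaultLimitCORE=infinity\n' > /etc/systemd/system.conf.d/10-coredump.conf
   systemctl daemon-reexec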
Now, wait for the crash to happen again. To simulate the crash of a daemon, run e.g. ``killall -3 ceph-mon``. Wait for the crash to happen again. To simulate the crash of a daemon, run for
example ``killall -3 ceph-mon``.
Install debug packages including ``ceph-debuginfo`` by entering the cephadm shelll::
# cephadm shell --mount /var/lib/systemd/coredump Running the Debugger with cephadm
[ceph: root@host1 /]# dnf install ceph-debuginfo gdb zstd ----------------------------------
[ceph: root@host1 /]# unzstd /mnt/coredump/core.ceph-*.zst
[ceph: root@host1 /]# gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-... Running a single debugging session
(gdb) bt ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#0 0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6 Initiate a debugging session by using the ``cephadm shell`` command.
#2 0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2 From within the shell container we need to install the debugger and debuginfo
#3 0x0000563085ca3d7e in main () packages. To debug a core file captured by systemd, run the following:
#. Start the shell session:
.. prompt:: bash #
cephadm shell --mount /var/lib/systemd/coredump
#. From within the shell session, run the following commands:
.. prompt:: bash #
dnf install ceph-debuginfo gdb zstd
.. prompt:: bash #
unzstd /mnt/coredump/core.ceph-*.zst
.. prompt:: bash #
gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-*
#. Run debugger commands at gdb's prompt:
.. prompt:: bash (gdb)
bt
::
#0 0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2 0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
#3 0x0000563085ca3d7e in main ()
Running repeated debugging sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When using ``cephadm shell``, as in the example above, any changes made to the
container that is spawned by the shell command are ephemeral. After the shell
session exits, the files that were downloaded and installed cease to be
available. You can simply re-run the same commands every time ``cephadm
shell`` is invoked, but in order to save time and resources one can create a
new container image and use it for repeated debugging sessions.
In the following example, we create a simple file that will construct the
container image. The command below uses podman but it is expected to work
correctly even if ``podman`` is replaced with ``docker``::
cat >Containerfile <<EOF
ARG BASE_IMG=quay.io/ceph/ceph:v18
FROM \${BASE_IMG}
# install ceph debuginfo packages, gdb and other potentially useful packages
RUN dnf install --enablerepo='*debug*' -y ceph-debuginfo gdb zstd strace python3-debuginfo
EOF
podman build -t ceph:debugging -f Containerfile .
# pass --build-arg=BASE_IMG=<your image> to customize the base image
The above file creates a new local image named ``ceph:debugging``. This image
can be used on the same machine that built it. The image can also be pushed to
a container repository or saved and copied to a node running other Ceph
containers. Consult the ``podman`` or ``docker`` documentation for more
information about the container workflow.
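For example, one way to copy the image to another node without going through a
registry (``host2`` is a placeholder) is to pipe ``podman save`` into ``podman
load`` over SSH:

.. prompt:: bash #

   podman save ceph:debugging | ssh root@host2 podman load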
After the image has been built, it can be used to initiate repeated debugging
sessions. By using an image in this way, you avoid the trouble of having to
re-install the debug tools and debuginfo packages every time you need to run a
debug session. To debug a core file using this image, in the same way as
previously described, run:
.. prompt:: bash #
cephadm --image ceph:debugging shell --mount /var/lib/systemd/coredump
Debugging live processes
~~~~~~~~~~~~~~~~~~~~~~~~
The gdb debugger can attach to running processes to debug them. This can be
achieved with a containerized process by using the debug image and attaching it
to the same PID namespace in which the process to be debugged resides.
This requires running a container command with some custom arguments. We can
generate a script that can debug a process in a running container.
.. prompt:: bash #
cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh
This creates a script that includes the container command that ``cephadm``
would use to create a shell. Modify the script by removing the ``--init``
argument and replacing it with the argument that joins the namespace of a
running container. For example, assume we want to debug the Manager
and have determined that the Manager is running in a container named
``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In this case,
the argument
``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
should be used.
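For example, one way to make that edit non-interactively, using the container
name from this example (substitute your own):

.. prompt:: bash #

   sed -i 's/--init/--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk/' /tmp/debug.sh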
We can run our debugging container with ``sh /tmp/debug.sh``. Within the shell,
we can run commands such as ``ps`` to get the PID of the Manager process. In
the following example this is ``2``. While running gdb, we can attach to the
running process:
.. prompt:: bash (gdb)
attach 2
info threads
bt

View File

@ -15,7 +15,7 @@ creation of multiple file systems use ``ceph fs flag set enable_multiple true``.
:: ::
fs new <file system name> <metadata pool name> <data pool name> ceph fs new <file system name> <metadata pool name> <data pool name>
This command creates a new file system. The file system name and metadata pool This command creates a new file system. The file system name and metadata pool
name are self-explanatory. The specified data pool is the default data pool and name are self-explanatory. The specified data pool is the default data pool and
@ -25,19 +25,19 @@ to accommodate the new file system.
:: ::
fs ls ceph fs ls
List all file systems by name. List all file systems by name.
:: ::
fs lsflags <file system name> ceph fs lsflags <file system name>
List all the flags set on a file system. List all the flags set on a file system.
:: ::
fs dump [epoch] ceph fs dump [epoch]
This dumps the FSMap at the given epoch (default: current) which includes all This dumps the FSMap at the given epoch (default: current) which includes all
file system settings, MDS daemons and the ranks they hold, and the list of file system settings, MDS daemons and the ranks they hold, and the list of
@ -46,7 +46,7 @@ standby MDS daemons.
:: ::
fs rm <file system name> [--yes-i-really-mean-it] ceph fs rm <file system name> [--yes-i-really-mean-it]
Destroy a CephFS file system. This wipes information about the state of the Destroy a CephFS file system. This wipes information about the state of the
file system from the FSMap. The metadata pool and data pools are untouched and file system from the FSMap. The metadata pool and data pools are untouched and
@ -54,28 +54,28 @@ must be destroyed separately.
:: ::
fs get <file system name> ceph fs get <file system name>
Get information about the named file system, including settings and ranks. This Get information about the named file system, including settings and ranks. This
is a subset of the same information from the ``fs dump`` command. is a subset of the same information from the ``ceph fs dump`` command.
:: ::
fs set <file system name> <var> <val> ceph fs set <file system name> <var> <val>
Change a setting on a file system. These settings are specific to the named Change a setting on a file system. These settings are specific to the named
file system and do not affect other file systems. file system and do not affect other file systems.
:: ::
fs add_data_pool <file system name> <pool name/id> ceph fs add_data_pool <file system name> <pool name/id>
Add a data pool to the file system. This pool can be used for file layouts Add a data pool to the file system. This pool can be used for file layouts
as an alternate location to store file data. as an alternate location to store file data.
:: ::
fs rm_data_pool <file system name> <pool name/id> ceph fs rm_data_pool <file system name> <pool name/id>
This command removes the specified pool from the list of data pools for the This command removes the specified pool from the list of data pools for the
file system. If any files have layouts for the removed data pool, the file file system. If any files have layouts for the removed data pool, the file
@ -84,7 +84,7 @@ system) cannot be removed.
:: ::
fs rename <file system name> <new file system name> [--yes-i-really-mean-it] ceph fs rename <file system name> <new file system name> [--yes-i-really-mean-it]
Rename a Ceph file system. This also changes the application tags on the data Rename a Ceph file system. This also changes the application tags on the data
pools and metadata pool of the file system to the new file system name. pools and metadata pool of the file system to the new file system name.
@ -98,7 +98,7 @@ Settings
:: ::
fs set <fs name> max_file_size <size in bytes> ceph fs set <fs name> max_file_size <size in bytes>
CephFS has a configurable maximum file size, and it's 1TB by default. CephFS has a configurable maximum file size, and it's 1TB by default.
You may wish to set this limit higher if you expect to store large files You may wish to set this limit higher if you expect to store large files
@ -132,13 +132,13 @@ Taking a CephFS cluster down is done by setting the down flag:
:: ::
fs set <fs_name> down true ceph fs set <fs_name> down true
To bring the cluster back online: To bring the cluster back online:
:: ::
fs set <fs_name> down false ceph fs set <fs_name> down false
This will also restore the previous value of max_mds. MDS daemons are brought This will also restore the previous value of max_mds. MDS daemons are brought
down in a way such that journals are flushed to the metadata pool and all down in a way such that journals are flushed to the metadata pool and all
@ -149,11 +149,11 @@ Taking the cluster down rapidly for deletion or disaster recovery
----------------------------------------------------------------- -----------------------------------------------------------------
To allow rapidly deleting a file system (for testing) or to quickly bring the To allow rapidly deleting a file system (for testing) or to quickly bring the
file system and MDS daemons down, use the ``fs fail`` command: file system and MDS daemons down, use the ``ceph fs fail`` command:
:: ::
fs fail <fs_name> ceph fs fail <fs_name>
This command sets a file system flag to prevent standbys from This command sets a file system flag to prevent standbys from
activating on the file system (the ``joinable`` flag). activating on the file system (the ``joinable`` flag).
@ -162,7 +162,7 @@ This process can also be done manually by doing the following:
:: ::
fs set <fs_name> joinable false ceph fs set <fs_name> joinable false
Then the operator can fail all of the ranks which causes the MDS daemons to Then the operator can fail all of the ranks which causes the MDS daemons to
respawn as standbys. The file system will be left in a degraded state. respawn as standbys. The file system will be left in a degraded state.
@ -170,7 +170,7 @@ respawn as standbys. The file system will be left in a degraded state.
:: ::
# For all ranks, 0-N: # For all ranks, 0-N:
mds fail <fs_name>:<n> ceph mds fail <fs_name>:<n>
Once all ranks are inactive, the file system may also be deleted or left in Once all ranks are inactive, the file system may also be deleted or left in
this state for other purposes (perhaps disaster recovery). this state for other purposes (perhaps disaster recovery).
@ -179,7 +179,7 @@ To bring the cluster back up, simply set the joinable flag:
:: ::
fs set <fs_name> joinable true ceph fs set <fs_name> joinable true
Daemons Daemons
@ -198,34 +198,35 @@ Commands to manipulate MDS daemons:
:: ::
mds fail <gid/name/role> ceph mds fail <gid/name/role>
Mark an MDS daemon as failed. This is equivalent to what the cluster Mark an MDS daemon as failed. This is equivalent to what the cluster
would do if an MDS daemon had failed to send a message to the mon would do if an MDS daemon had failed to send a message to the mon
for ``mds_beacon_grace`` seconds. If the daemon was active and a suitable for ``mds_beacon_grace`` seconds. If the daemon was active and a suitable
standby is available, using ``mds fail`` will force a failover to the standby. standby is available, using ``ceph mds fail`` will force a failover to the
standby.
If the MDS daemon was in reality still running, then using ``mds fail`` If the MDS daemon was in reality still running, then using ``ceph mds fail``
will cause the daemon to restart. If it was active and a standby was will cause the daemon to restart. If it was active and a standby was
available, then the "failed" daemon will return as a standby. available, then the "failed" daemon will return as a standby.
:: ::
tell mds.<daemon name> command ... ceph tell mds.<daemon name> command ...
Send a command to the MDS daemon(s). Use ``mds.*`` to send a command to all Send a command to the MDS daemon(s). Use ``mds.*`` to send a command to all
daemons. Use ``ceph tell mds.* help`` to learn available commands. daemons. Use ``ceph tell mds.* help`` to learn available commands.
:: ::
mds metadata <gid/name/role> ceph mds metadata <gid/name/role>
Get metadata about the given MDS known to the Monitors. Get metadata about the given MDS known to the Monitors.
:: ::
mds repaired <role> ceph mds repaired <role>
Mark the file system rank as repaired. Unlike the name suggests, this command Mark the file system rank as repaired. Unlike the name suggests, this command
does not change an MDS; it manipulates the file system rank which has been does not change an MDS; it manipulates the file system rank which has been
@ -244,14 +245,14 @@ Commands to manipulate required client features of a file system:
:: ::
fs required_client_features <fs name> add reply_encoding ceph fs required_client_features <fs name> add reply_encoding
fs required_client_features <fs name> rm reply_encoding ceph fs required_client_features <fs name> rm reply_encoding
To list all CephFS features To list all CephFS features
:: ::
fs feature ls ceph fs feature ls
Clients that are missing newly added features will be evicted automatically. Clients that are missing newly added features will be evicted automatically.
@ -346,7 +347,7 @@ Global settings
:: ::
fs flag set <flag name> <flag val> [<confirmation string>] ceph fs flag set <flag name> <flag val> [<confirmation string>]
Sets a global CephFS flag (i.e. not specific to a particular file system). Sets a global CephFS flag (i.e. not specific to a particular file system).
Currently, the only flag setting is 'enable_multiple' which allows having Currently, the only flag setting is 'enable_multiple' which allows having
@ -368,13 +369,13 @@ file system.
:: ::
mds rmfailed ceph mds rmfailed
This removes a rank from the failed set. This removes a rank from the failed set.
:: ::
fs reset <file system name> ceph fs reset <file system name>
This command resets the file system state to defaults, except for the name and This command resets the file system state to defaults, except for the name and
pools. Non-zero ranks are saved in the stopped set. pools. Non-zero ranks are saved in the stopped set.
@ -382,7 +383,7 @@ pools. Non-zero ranks are saved in the stopped set.
:: ::
fs new <file system name> <metadata pool name> <data pool name> --fscid <fscid> --force ceph fs new <file system name> <metadata pool name> <data pool name> --fscid <fscid> --force
This command creates a file system with a specific **fscid** (file system cluster ID). This command creates a file system with a specific **fscid** (file system cluster ID).
You may want to do this when an application expects the file system's ID to be You may want to do this when an application expects the file system's ID to be

View File

@ -154,14 +154,8 @@ readdir. The behavior of the decay counter is the same as for cache trimming or
caps recall. Each readdir call increments the counter by the number of files in caps recall. Each readdir call increments the counter by the number of files in
the result. the result.
The ratio of ``mds_max_caps_per_client`` that client must exceed before readdir
maybe throttled by cap acquisition throttle:
.. confval:: mds_session_max_caps_throttle_ratio .. confval:: mds_session_max_caps_throttle_ratio
The timeout in seconds after which a client request is retried due to cap
acquisition throttling:
.. confval:: mds_cap_acquisition_throttle_retry_request_timeout .. confval:: mds_cap_acquisition_throttle_retry_request_timeout
If the number of caps acquired by the client per session is greater than the If the number of caps acquired by the client per session is greater than the

View File

@ -42,28 +42,21 @@ FS Volumes
Create a volume by running the following command: Create a volume by running the following command:
$ ceph fs volume create <vol_name> [<placement>] .. prompt:: bash #
ceph fs volume create <vol_name> [placement]
This creates a CephFS file system and its data and metadata pools. It can also This creates a CephFS file system and its data and metadata pools. It can also
deploy MDS daemons for the filesystem using a ceph-mgr orchestrator module (for deploy MDS daemons for the filesystem using a ceph-mgr orchestrator module (for
example Rook). See :doc:`/mgr/orchestrator`. example Rook). See :doc:`/mgr/orchestrator`.
``<vol_name>`` is the volume name (an arbitrary string). ``<placement>`` is an ``<vol_name>`` is the volume name (an arbitrary string). ``[placement]`` is an
optional string that specifies the hosts that should have an MDS running on optional string that specifies the :ref:`orchestrator-cli-placement-spec` for
them and, optionally, the total number of MDS daemons that the cluster should the MDS. See also :ref:`orchestrator-cli-cephfs` for more examples on
have. For example, the following placement string means "deploy MDS on nodes placement.
``host1`` and ``host2`` (one MDS per host)::
"host1,host2" .. note:: Specifying placement via a YAML file is not supported through the
volume interface.
The following placement specification means "deploy two MDS daemons on each of
nodes ``host1`` and ``host2`` (for a total of four MDS daemons in the
cluster)"::
"4 host1,host2"
See :ref:`orchestrator-cli-service-spec` for more on placement specification.
Specifying placement via a YAML file is not supported.
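For instance, a hedged example (host names and the daemon count are placeholders; the count-plus-hosts form of the placement string follows the orchestrator placement spec)::

    ceph fs volume create cephfs "2 host1,host2"   # two MDS daemons spread across host1 and host2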
To remove a volume, run the following command: To remove a volume, run the following command:
@ -72,6 +65,11 @@ To remove a volume, run the following command:
This removes a file system and its data and metadata pools. It also tries to This removes a file system and its data and metadata pools. It also tries to
remove MDS daemons using the enabled ceph-mgr orchestrator module. remove MDS daemons using the enabled ceph-mgr orchestrator module.
.. note:: After volume deletion, it is recommended to restart `ceph-mgr`
if a new file system is created on the same cluster and the subvolume interface
is being used. See https://tracker.ceph.com/issues/49605#note-5
for more details.
List volumes by running the following command: List volumes by running the following command:
$ ceph fs volume ls $ ceph fs volume ls

View File

@ -28,7 +28,7 @@ To FUSE-mount the Ceph file system, use the ``ceph-fuse`` command::
mkdir /mnt/mycephfs mkdir /mnt/mycephfs
ceph-fuse --id foo /mnt/mycephfs ceph-fuse --id foo /mnt/mycephfs
Option ``-id`` passes the name of the CephX user whose keyring we intend to Option ``--id`` passes the name of the CephX user whose keyring we intend to
use for mounting CephFS. In the above command, it's ``foo``. You can also use use for mounting CephFS. In the above command, it's ``foo``. You can also use
``-n`` instead, although ``--id`` is evidently easier:: ``-n`` instead, although ``--id`` is evidently easier::
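    # A hedged example of the -n form (the CephX user client.foo is an assumption):
    ceph-fuse -n client.foo /mnt/mycephfs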

View File

@ -226,6 +226,20 @@ For the reverse situation:
The ``home/patrick`` directory and its children will be pinned to rank 2 The ``home/patrick`` directory and its children will be pinned to rank 2
because its export pin overrides the policy on ``home``. because its export pin overrides the policy on ``home``.
To remove a partitioning policy, remove the respective extended attribute
or set the value to 0.
.. code:: bash
$ setfattr -n ceph.dir.pin.distributed -v 0 home
# or
$ setfattr -x ceph.dir.pin.distributed home
For export pins, remove the extended attribute or set the extended attribute
value to `-1`.
.. code:: bash
$ setfattr -n ceph.dir.pin -v -1 home
Dynamic subtree partitioning with Balancer on specific ranks Dynamic subtree partitioning with Balancer on specific ranks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

View File

@ -143,3 +143,14 @@ The types of damage that can be reported and repaired by File System Scrub are:
* BACKTRACE : Inode's backtrace in the data pool is corrupted. * BACKTRACE : Inode's backtrace in the data pool is corrupted.
Evaluate strays using recursive scrub
=====================================
- In order to evaluate strays, i.e. to purge stray directories in ``~mdsdir``, use the following command::
ceph tell mds.<fsname>:0 scrub start ~mdsdir recursive
- ``~mdsdir`` is not enqueued by default when scrubbing at the CephFS root. In order to perform stray evaluation
at root, run scrub with flags ``scrub_mdsdir`` and ``recursive``::
ceph tell mds.<fsname>:0 scrub start / recursive,scrub_mdsdir
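To check the progress of a scrub started this way, something like the following can be used (the ``scrub status`` subcommand is assumed from the general scrub interface and is not shown in this hunk)::

    ceph tell mds.<fsname>:0 scrub status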

View File

@ -162,6 +162,13 @@ Examples::
snapshot creation is accounted for in the "created_count" field, which is a snapshot creation is accounted for in the "created_count" field, which is a
cumulative count of the total number of snapshots created so far. cumulative count of the total number of snapshots created so far.
.. note:: The maximum number of snapshots to retain per directory is limited by the
config tunable `mds_max_snaps_per_dir`. This tunable defaults to 100.
To ensure a new snapshot can always be created, one snapshot fewer than this
limit is retained, so by default a maximum of 99 snapshots is retained.
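For example, the limit can be raised with a standard config command (the value shown is only illustrative)::

    ceph config set mds mds_max_snaps_per_dir 150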
.. note:: The ``--fs`` argument is now required if there is more than one file system.
Active and inactive schedules Active and inactive schedules
----------------------------- -----------------------------
Snapshot schedules can be added for a path that doesn't exist yet in the Snapshot schedules can be added for a path that doesn't exist yet in the

View File

@ -98,7 +98,7 @@ things to do:
.. code:: bash .. code:: bash
ceph config set mds mds_heartbeat_reset_grace 3600 ceph config set mds mds_heartbeat_grace 3600
This has the effect of having the MDS continue to send beacons to the monitors This has the effect of having the MDS continue to send beacons to the monitors
even when its internal "heartbeat" mechanism has not been reset (beat) in one even when its internal "heartbeat" mechanism has not been reset (beat) in one

View File

@ -0,0 +1,58 @@
============================
Balancing in Ceph
============================
Introduction
============
In distributed storage systems like Ceph, it is important to balance write and read requests for optimal performance. Write balancing ensures fast storage
and replication of data across the cluster, while read balancing ensures quick access and retrieval of data. Both types of balancing are important
in distributed systems, for different reasons.
Upmap Balancing
==========================
Importance in a Cluster
-----------------------
Capacity balancing is a functional requirement. A system like Ceph is only as full as its fullest device: when one device is full, the system can no longer serve
write requests, and Ceph stops functioning. To avoid filling up devices, we want to balance capacity across the devices in a fair way. Each device should
get a capacity proportional to its size so all devices have the same fullness level. From a performance perspective, capacity balancing creates fair share
workloads on the OSDs for write requests.
Capacity balancing is expensive. The operation (changing the mapping of pgs) requires data movement by definition, which takes time. During this time, the
performance of the system is reduced.
In Ceph, we can balance the write performance if all devices are homogeneous (same size and performance).
How to Balance Capacity in Ceph
-------------------------------
See :ref:`upmap` for more information.
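As a rough sketch, capacity balancing is typically driven by the balancer manager module; the commands below come from that module's interface rather than from this page::

    ceph balancer status
    ceph balancer mode upmap
    ceph balancer on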
Read Balancing
==============
Unlike capacity balancing, read balancing is not a strict requirement for Ceph's functionality. Instead, it is a performance requirement, as it helps the system
“work” better. The overall goal is to ensure each device gets its fair share of primary OSDs so read requests are distributed evenly across OSDs in the cluster.
Unbalanced read requests lead to bad performance because of reduced overall cluster bandwidth.
Read balancing is cheap. Unlike capacity balancing, there is no data movement involved. It is just a metadata operation, where the osdmap is updated to change
which participating OSD in a pg is primary. This operation is fast and has no negative impact on cluster performance (the change takes effect almost
immediately, after which only the improved read performance is observed).
In Ceph, we can balance the read performance if all devices are homogeneous (same size and performance). In future versions, the read balancer may be improved
to optimize overall cluster performance in heterogeneous systems as well.
How to Balance Reads in Ceph
----------------------------
See :ref:`read_balancer` for more information.
Also, see the Cephalocon 2023 talk `New Read Balancer in Ceph <https://www.youtube.com/watch?v=AT_cKYaQzcU/>`_ for a demonstration of the offline version
of the read balancer.
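A hedged sketch of the offline read-balancing workflow, using the ``osdmaptool`` options documented elsewhere in this release (the pool name is a placeholder)::

    ceph osd getmap -o om                                    # export the current osdmap
    osdmaptool om --read read.out --read-pool cephfs.a.meta
    # read.out then contains the commands that would adjust primary placement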
Plans for the Next Version
--------------------------
1. Improve behavior for heterogeneous OSDs in a pool
2. Offer read balancing as an online option to the balancer manager module

View File

@ -1,200 +0,0 @@
Cache pool
==========
Purpose
-------
Use a pool of fast storage devices (probably SSDs) and use it as a
cache for an existing slower and larger pool.
Use a replicated pool as a front-end to service most I/O, and destage
cold data to a separate erasure coded pool that does not currently (and
cannot efficiently) handle the workload.
We should be able to create and add a cache pool to an existing pool
of data, and later remove it, without disrupting service or migrating
data around.
Use cases
---------
Read-write pool, writeback
~~~~~~~~~~~~~~~~~~~~~~~~~~
We have an existing data pool and put a fast cache pool "in front" of
it. Writes will go to the cache pool and immediately ack. We flush
them back to the data pool based on the defined policy.
Read-only pool, weak consistency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We have an existing data pool and add one or more read-only cache
pools. We copy data to the cache pool(s) on read. Writes are
forwarded to the original data pool. Stale data is expired from the
cache pools based on the defined policy.
This is likely only useful for specific applications with specific
data access patterns. It may be a match for rgw, for example.
Interface
---------
Set up a read/write cache pool foo-hot for pool foo::
ceph osd tier add foo foo-hot
ceph osd tier cache-mode foo-hot writeback
Direct all traffic for foo to foo-hot::
ceph osd tier set-overlay foo foo-hot
Set the target size and enable the tiering agent for foo-hot::
ceph osd pool set foo-hot hit_set_type bloom
ceph osd pool set foo-hot hit_set_count 1
ceph osd pool set foo-hot hit_set_period 3600 # 1 hour
ceph osd pool set foo-hot target_max_bytes 1000000000000 # 1 TB
ceph osd pool set foo-hot min_read_recency_for_promote 1
ceph osd pool set foo-hot min_write_recency_for_promote 1
Drain the cache in preparation for turning it off::
ceph osd tier cache-mode foo-hot forward
rados -p foo-hot cache-flush-evict-all
When cache pool is finally empty, disable it::
ceph osd tier remove-overlay foo
ceph osd tier remove foo foo-hot
Read-only pools with lazy consistency::
ceph osd tier add foo foo-east
ceph osd tier cache-mode foo-east readonly
ceph osd tier add foo foo-west
ceph osd tier cache-mode foo-west readonly
Tiering agent
-------------
The tiering policy is defined as properties on the cache pool itself.
HitSet metadata
~~~~~~~~~~~~~~~
First, the agent requires HitSet information to be tracked on the
cache pool in order to determine which objects in the pool are being
accessed. This is enabled with::
ceph osd pool set foo-hot hit_set_type bloom
ceph osd pool set foo-hot hit_set_count 1
ceph osd pool set foo-hot hit_set_period 3600 # 1 hour
The supported HitSet types include 'bloom' (a bloom filter, the
default), 'explicit_hash', and 'explicit_object'. The latter two
explicitly enumerate accessed objects and are less memory efficient.
They are there primarily for debugging and to demonstrate pluggability
for the infrastructure. For the bloom filter type, you can additionally
define the false positive probability for the bloom filter (default is 0.05)::
ceph osd pool set foo-hot hit_set_fpp 0.15
The hit_set_count and hit_set_period define how much time each HitSet
should cover, and how many such HitSets to store. Binning accesses
over time allows Ceph to independently determine whether an object was
accessed at least once and whether it was accessed more than once over
some time period ("age" vs "temperature").
The ``min_read_recency_for_promote`` defines how many HitSets to check for the
existence of an object when handling a read operation. The checking result is
used to decide whether to promote the object asynchronously. Its value should be
between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted.
If it's set to 1, the current HitSet is checked. And if this object is in the
current HitSet, it's promoted. Otherwise not. For the other values, the exact
number of archive HitSets are checked. The object is promoted if the object is
found in any of the most recent ``min_read_recency_for_promote`` HitSets.
A similar parameter can be set for the write operation, which is
``min_write_recency_for_promote``. ::
ceph osd pool set {cachepool} min_read_recency_for_promote 1
ceph osd pool set {cachepool} min_write_recency_for_promote 1
Note that the longer the ``hit_set_period`` and the higher the
``min_read_recency_for_promote``/``min_write_recency_for_promote`` the more RAM
will be consumed by the ceph-osd process. In particular, when the agent is active
to flush or evict cache objects, all hit_set_count HitSets are loaded into RAM.
Cache mode
~~~~~~~~~~
The most important policy is the cache mode:
ceph osd pool set foo-hot cache-mode writeback
The supported modes are 'none', 'writeback', 'forward', and
'readonly'. Most installations want 'writeback', which will write
into the cache tier and only later flush updates back to the base
tier. Similarly, any object that is read will be promoted into the
cache tier.
The 'forward' mode is intended for when the cache is being disabled
and needs to be drained. No new objects will be promoted or written
to the cache pool unless they are already present. A background
operation can then do something like::
rados -p foo-hot cache-try-flush-evict-all
rados -p foo-hot cache-flush-evict-all
to force all data to be flushed back to the base tier.
The 'readonly' mode is intended for read-only workloads that do not
require consistency to be enforced by the storage system. Writes will
be forwarded to the base tier, but objects that are read will get
promoted to the cache. No attempt is made by Ceph to ensure that the
contents of the cache tier(s) are consistent in the presence of object
updates.
Cache sizing
~~~~~~~~~~~~
The agent performs two basic functions: flushing (writing 'dirty'
cache objects back to the base tier) and evicting (removing cold and
clean objects from the cache).
The thresholds at which Ceph will flush or evict objects is specified
relative to a 'target size' of the pool. For example::
ceph osd pool set foo-hot cache_target_dirty_ratio .4
ceph osd pool set foo-hot cache_target_dirty_high_ratio .6
ceph osd pool set foo-hot cache_target_full_ratio .8
will begin flushing dirty objects when 40% of the pool is dirty and begin
evicting clean objects when we reach 80% of the target size.
The target size can be specified either in terms of objects or bytes::
ceph osd pool set foo-hot target_max_bytes 1000000000000 # 1 TB
ceph osd pool set foo-hot target_max_objects 1000000 # 1 million objects
Note that if both limits are specified, Ceph will begin flushing or
evicting when either threshold is triggered.
Other tunables
~~~~~~~~~~~~~~
You can specify a minimum object age before a recently updated object is
flushed to the base tier::
ceph osd pool set foo-hot cache_min_flush_age 600 # 10 minutes
You can specify the minimum age of an object before it will be evicted from
the cache tier::
ceph osd pool set foo-hot cache_min_evict_age 1800 # 30 minutes

View File

@ -377,7 +377,7 @@ information. To check which mirror daemon a directory has been mapped to use::
"state": "mapped" "state": "mapped"
} }
.. note:: `instance_id` is the RAODS instance-id associated with a mirror daemon. .. note:: `instance_id` is the RADOS instance-id associated with a mirror daemon.
Other information such as `state` and `last_shuffled` are interesting when running Other information such as `state` and `last_shuffled` are interesting when running
multiple mirror daemons. multiple mirror daemons.

View File

@ -243,7 +243,7 @@ object size in ``POOL`` is zero (evicted) and chunks objects are genereated---th
4. Read/write I/Os 4. Read/write I/Os
After step 3, the users don't need to consider anything about I/Os. Deduplicated objects are After step 3, the users don't need to consider anything about I/Os. Deduplicated objects are
completely compatible with existing RAODS operations. completely compatible with existing RADOS operations.
5. Run scrub to fix reference count 5. Run scrub to fix reference count

View File

@ -214,8 +214,8 @@ The build process is based on `Node.js <https://nodejs.org/>`_ and requires the
Prerequisites Prerequisites
~~~~~~~~~~~~~ ~~~~~~~~~~~~~
* Node 14.15.0 or higher * Node 18.17.0 or higher
* NPM 6.14.9 or higher * NPM 9.6.7 or higher
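A quick way to verify the toolchain (the version numbers are the documented minimums)::

    node --version   # expect >= 18.17.0
    npm --version    # expect >= 9.6.7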
nodeenv: nodeenv:
During Ceph's build we create a virtualenv with ``node`` and ``npm`` During Ceph's build we create a virtualenv with ``node`` and ``npm``

View File

@ -55,7 +55,7 @@ using `vstart_runner.py`_. To do that, you'd need `teuthology`_ installed::
$ virtualenv --python=python3 venv $ virtualenv --python=python3 venv
$ source venv/bin/activate $ source venv/bin/activate
$ pip install 'setuptools >= 12' $ pip install 'setuptools >= 12'
$ pip install git+https://github.com/ceph/teuthology#egg=teuthology[test] $ pip install teuthology[test]@git+https://github.com/ceph/teuthology
$ deactivate $ deactivate
The above steps installs teuthology in a virtual environment. Before running The above steps installs teuthology in a virtual environment. Before running

View File

@ -1,3 +1,5 @@
.. _dev_mon_elections:
================= =================
Monitor Elections Monitor Elections
================= =================

View File

@ -289,40 +289,6 @@ This seems complicated, but it gets us two valuable properties:
All clone operations will need to consider adjacent ``chunk_maps`` All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references. when adding or removing references.
Cache/Tiering
-------------
There already exists a cache/tiering mechanism based on whiteouts.
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.
See ``cache-pool.rst``
The manifest machinery already shares some code paths with the
existing cache/tiering code, mainly ``stat_flush``.
In no particular order, here's in incomplete list of things that need
to be wired up to provide feature parity:
* Online object access information: The osd already has pool configs
for maintaining bloom filters which provide estimates of access
recency for objects. We probably need to modify this to permit
hitset maintenance for a normal pool -- there are already
``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
* Tiering agent: The osd already has a background tiering agent which
would need to be modified to instead flush and evict using
manifests.
* Use exiting existing features regarding the cache flush policy such as
histset, age, ratio.
- hitset
- age, ratio, bytes
* Add tiering-mode to ``manifest-tiering``
- Writeback
- Read-only
Data Structures Data Structures
=============== ===============

View File

@ -114,29 +114,6 @@ baseline throughput for each device type was determined:
256 KiB. For HDDs, it was 40MiB. The above throughput was obtained 256 KiB. For HDDs, it was 40MiB. The above throughput was obtained
by running 4 KiB random writes at a queue depth of 64 for 300 secs. by running 4 KiB random writes at a queue depth of 64 for 300 secs.
Factoring I/O Cost in mClock
============================
The services using mClock have a cost associated with them. The cost can be
different for each service type. The mClock scheduler factors in the cost
during calculations for parameters like *reservation*, *weight* and *limit*.
The calculations determine when the next op for the service type can be
dequeued from the operation queue. In general, the higher the cost, the longer
an op remains in the operation queue.
A cost modeling study was performed to determine the cost per I/O and the cost
per byte for SSD and HDD device types. The following cost specific options are
used under the hood by mClock,
- :confval:`osd_mclock_cost_per_io_usec`
- :confval:`osd_mclock_cost_per_io_usec_hdd`
- :confval:`osd_mclock_cost_per_io_usec_ssd`
- :confval:`osd_mclock_cost_per_byte_usec`
- :confval:`osd_mclock_cost_per_byte_usec_hdd`
- :confval:`osd_mclock_cost_per_byte_usec_ssd`
See :doc:`/rados/configuration/mclock-config-ref` for more details.
MClock Profile Allocations MClock Profile Allocations
========================== ==========================

View File

@ -1,53 +0,0 @@
This document describes the requirements and high-level design of the primary
balancer for Ceph.
Introduction
============
In a distributed storage system such as Ceph, there are some requirements to keep the system balanced in order to make it perform well:
#. Balance the capacity - This is a functional requirement, a system like Ceph is "as full as its fullest device". When one device is full the system can not serve write requests anymore. In order to do this we want to balance the capacity across the devices in a fair way - that each device gets capacity proportionally to its size and therefore all the devices have the same fullness level. This is a functional requirement. From performance perspective, capacity balancing creates fair share workloads on the OSDs for *write* requests.
#. Balance the workload - This is a performance requirement, we want to make sure that all the devices will receive a workload according to their performance. Assuming all the devices in a pool use the same technology and have the same bandwidth (a strong recommendation for a well configured system), and all devices in a pool have the same capacity, this means that for each pool, each device gets its fair share of primary OSDs so that the *read* requests are distributed evenly across the OSDs in the cluster. Managing workload balancing for devices with different capacities is discussed in the future enhancements section.
Requirements
============
- For each pool, each OSD should have its fair share of PGs in which it is primary. For replicated pools, this would be the number of PGs mapped to this OSD divided by the replica size.
- This may be improved in future releases. (see below)
- Improve the existing capacity balancer code to improve its maintainability
- Primary balancing is performed without data movement (data is moved only when balancing the capacity)
- Fix the global +/-1 balancing issue that happens since the current balancer works on a single pool at a time (this is a stretch goal for the first version)
- Problem description: In a perfectly balanced system, for each pool, each OSD has a number of PGs that ideally would have mapped to it to create a perfect capacity balancing. This number is usually not an integer, so some OSDs get a bit more PGs mapped and some a bit less. If you have many pools and you balance on a pool-by-pool basis, it is possible that some OSDs always get the "a bit more" side. When this happens, even to a single OSD, the result is non-balanced system where one OSD is more full than the others. This may happen with the current capacity balancer.
First release (Quincy) assumptions
----------------------------------
- Optional - In the first version the feature will be optional and by default will be disabled
- CLI only - In the first version we will probably give access to the primary balancer only by ``osdmaptool`` CLI and will not enable it in the online balancer; this way, the use of the feature is more controlled for early adopters
- No data movement
Future possible enhancements
----------------------------
- Improve the behavior for non identical OSDs in a pool
- Improve the capacity balancing behavior in extreme cases
- Add workload balancing to the online balancer
- A more futuristic feature can be to improve workload balancing based on real load statistics of the OSDs.
High Level Design
=================
- The capacity balancing code will remain in one function ``OSDMap::calc_pg_upmaps`` (the signature might be changed)
- The workload (a.k.a primary) balancer will be implemented in a different function
- The workload balancer will do its best based on the current status of the system
- When called on a balanced system (capacity-wise) with pools with identical devices, it will create a near optimal workload split among the OSDs
- Calling the workload balancer on an unbalanced system (capacity-wise) may yield non optimal results, and in some cases may give worse performance than before the call
Helper functionality
--------------------
- Set a seed for random generation in ``osdmaptool`` (For regression tests)

View File

@ -131,7 +131,7 @@ First release candidate
======================= =======================
- [x] src/ceph_release: change type to `rc` - [x] src/ceph_release: change type to `rc`
- [ ] opt-in to all telemetry channels, generate telemetry reports, and verify no sensitive details (like pools names) are collected - [x] opt-in to all telemetry channels, generate telemetry reports, and verify no sensitive details (like pools names) are collected
First stable release First stable release

View File

@ -15,10 +15,12 @@
introduced in the Ceph Kraken release. The Luminous release of introduced in the Ceph Kraken release. The Luminous release of
Ceph promoted BlueStore to the default OSD back end, Ceph promoted BlueStore to the default OSD back end,
supplanting FileStore. As of the Reef release, FileStore is no supplanting FileStore. As of the Reef release, FileStore is no
longer available as a storage backend. longer available as a storage back end.
BlueStore stores objects directly on Ceph block devices without BlueStore stores objects directly on raw block devices or
a mounted file system. partitions, and does not interact with mounted file systems.
BlueStore uses RocksDB's key/value database to map object names
to block locations on disk.
Bucket Bucket
In the context of :term:`RGW`, a bucket is a group of objects. In the context of :term:`RGW`, a bucket is a group of objects.
@ -269,7 +271,7 @@
The Ceph manager software, which collects all the state from The Ceph manager software, which collects all the state from
the whole cluster in one place. the whole cluster in one place.
MON :ref:`MON<arch_monitor>`
The Ceph monitor software. The Ceph monitor software.
Node Node
@ -328,6 +330,19 @@
Pools Pools
See :term:`pool`. See :term:`pool`.
:ref:`Primary Affinity <rados_ops_primary_affinity>`
The characteristic of an OSD that governs the likelihood that
a given OSD will be selected as the primary OSD (or "lead
OSD") in an acting set. Primary affinity was introduced in
Firefly (v. 0.80). See :ref:`Primary Affinity
<rados_ops_primary_affinity>`.
Quorum
Quorum is the state that exists when a majority of the
:ref:`Monitors<arch_monitor>` in the cluster are ``up``. A
minimum of three :ref:`Monitors<arch_monitor>` must exist in
the cluster in order for Quorum to be possible.
RADOS RADOS
**R**\eliable **A**\utonomic **D**\istributed **O**\bject **R**\eliable **A**\utonomic **D**\istributed **O**\bject
**S**\tore. RADOS is the object store that provides a scalable **S**\tore. RADOS is the object store that provides a scalable

View File

@ -4,11 +4,11 @@
Ceph delivers **object, block, and file storage in one unified system**. Ceph delivers **object, block, and file storage in one unified system**.
.. warning:: .. warning::
:ref:`If this is your first time using Ceph, read the "Basic Workflow" :ref:`If this is your first time using Ceph, read the "Basic Workflow"
page in the Ceph Developer Guide to learn how to contribute to the page in the Ceph Developer Guide to learn how to contribute to the
Ceph project. (Click anywhere in this paragraph to read the "Basic Ceph project. (Click anywhere in this paragraph to read the "Basic
Workflow" page of the Ceph Developer Guide.) <basic workflow dev guide>`. Workflow" page of the Ceph Developer Guide.) <basic workflow dev guide>`.
.. note:: .. note::
@ -110,6 +110,7 @@ about Ceph, see our `Architecture`_ section.
radosgw/index radosgw/index
mgr/index mgr/index
mgr/dashboard mgr/dashboard
monitoring/index
api/index api/index
architecture architecture
Developer Guide <dev/developer_guide/index> Developer Guide <dev/developer_guide/index>

View File

@ -25,17 +25,17 @@ There are three ways to get packages:
Install packages with cephadm Install packages with cephadm
============================= =============================
#. Download the cephadm script #. Download cephadm
.. prompt:: bash $ .. prompt:: bash $
:substitutions: :substitutions:
curl --silent --remote-name --location https://github.com/ceph/ceph/raw/|stable-release|/src/cephadm/cephadm curl --silent --remote-name --location https://download.ceph.com/rpm-|stable-release|/el9/noarch/cephadm
chmod +x cephadm chmod +x cephadm
#. Configure the Ceph repository based on the release name:: #. Configure the Ceph repository based on the release name::
./cephadm add-repo --release nautilus ./cephadm add-repo --release |stable-release|
For Octopus (15.2.0) and later releases, you can also specify a specific For Octopus (15.2.0) and later releases, you can also specify a specific
version:: version::
@ -47,8 +47,8 @@ Install packages with cephadm
./cephadm add-repo --dev my-branch ./cephadm add-repo --dev my-branch
#. Install the appropriate packages. You can install them using your #. Install the appropriate packages. You can install them using your
package management tool (e.g., APT, Yum) directly, or you can also package management tool (e.g., APT, Yum) directly, or you can
use the cephadm wrapper. For example:: use the cephadm wrapper command. For example::
./cephadm install ceph-common ./cephadm install ceph-common

View File

@ -0,0 +1,90 @@
:orphan:
======================================================
ceph-monstore-tool -- ceph monstore manipulation tool
======================================================
.. program:: ceph-monstore-tool
Synopsis
========
| **ceph-monstore-tool** <store path> <cmd> [args|options]
Description
===========
:program:`ceph-monstore-tool` is used to manipulate MonitorDBStore's data
(monmap, osdmap, etc.) offline. It is similar to `ceph-kvstore-tool`.
The default RocksDB debug level is `0`. This can be changed using `--debug`.
Note:
Ceph-specific options take the format `--option-name=VAL`
DO NOT FORGET THE EQUALS SIGN. ('=')
Command-specific options must be passed after a `--`
for example, `get monmap --debug -- --version 10 --out /tmp/foo`
Commands
========
:program:`ceph-monstore-tool` uses many commands for debugging purposes:
:command:`store-copy <path>`
Copy the store to PATH.
:command:`get monmap [-- options]`
Get monmap (version VER if specified) (default: last committed).
:command:`get osdmap [-- options]`
Get osdmap (version VER if specified) (default: last committed).
:command:`get mdsmap [-- options]`
Get mdsmap (version VER if specified) (default: last committed).
:command:`get mgr [-- options]`
Get mgrmap (version VER if specified) (default: last committed).
:command:`get crushmap [-- options]`
Get crushmap (version VER if specified) (default: last committed).
:command:`get osd_snap <key> [-- options]`
Get osd_snap key (`purged_snap` or `purged_epoch`).
:command:`dump-keys`
Dump store keys to FILE (default: stdout).
:command:`dump-paxos [-- options]`
Dump Paxos transactions (-- --help for more info).
:command:`dump-trace FILE [-- options]`
Dump contents of trace file FILE (-- --help for more info).
:command:`replay-trace FILE [-- options]`
Replay trace from FILE (-- --help for more info).
:command:`random-gen [-- options]`
Add randomly generated ops to the store (-- --help for more info).
:command:`rewrite-crush [-- options]`
Add a rewrite commit to the store.
:command:`rebuild`
Rebuild store.
:command:`rm <prefix> <key>`
Remove specified key from the store.
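Example (a hedged sketch; the store path shown is a typical monitor data directory and is only an assumption)::

    ceph-monstore-tool /var/lib/ceph/mon/ceph-a get monmap -- --out /tmp/monmap
    ceph-monstore-tool /var/lib/ceph/mon/ceph-a dump-keys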
Availability
============
**ceph-monstore-tool** is part of Ceph, a massively scalable, open-source,
distributed storage system. See the Ceph documentation at
https://docs.ceph.com for more information.
See also
========
:doc:`ceph <ceph>`\(8)

View File

@ -183,6 +183,18 @@ Options
write modified osdmap with upmap or crush-adjust changes write modified osdmap with upmap or crush-adjust changes
.. option:: --read <file>
calculate pg upmap entries to balance pg primaries
.. option:: --read-pool <poolname>
specify which pool the read balancer should adjust
.. option:: --vstart
prefix upmap and read output with './bin/'
Example Example
======= =======
@ -315,6 +327,31 @@ To simulate the active balancer in upmap mode::
osd.20 pgs 42 osd.20 pgs 42
Total time elapsed 0.0167765 secs, 5 rounds Total time elapsed 0.0167765 secs, 5 rounds
To simulate the active balancer in read mode, first make sure capacity is balanced
by running the balancer in upmap mode. Then, balance the reads on a replicated pool with::
osdmaptool osdmap --read read.out --read-pool <pool name>
./bin/osdmaptool: osdmap file 'om'
writing upmap command output to: read.out
---------- BEFORE ------------
osd.0 | primary affinity: 1 | number of prims: 3
osd.1 | primary affinity: 1 | number of prims: 10
osd.2 | primary affinity: 1 | number of prims: 3
read_balance_score of 'cephfs.a.meta': 1.88
---------- AFTER ------------
osd.0 | primary affinity: 1 | number of prims: 5
osd.1 | primary affinity: 1 | number of prims: 5
osd.2 | primary affinity: 1 | number of prims: 6
read_balance_score of 'cephfs.a.meta': 1.13
num changes: 5
Availability Availability
============ ============

View File

@ -15,15 +15,15 @@ Synopsis
Description Description
=========== ===========
:program:`radosgw-admin` is a RADOS gateway user administration utility. It :program:`radosgw-admin` is a Ceph Object Gateway user administration utility. It
allows creating and modifying users. is used to create and modify users.
Commands Commands
======== ========
:program:`radosgw-admin` utility uses many commands for administration purpose :program:`radosgw-admin` utility provides commands for administration purposes
which are as follows: as follows:
:command:`user create` :command:`user create`
Create a new user. Create a new user.
@ -32,8 +32,7 @@ which are as follows:
Modify a user. Modify a user.
:command:`user info` :command:`user info`
Display information of a user, and any potentially available Display information for a user including any subusers and keys.
subusers and keys.
:command:`user rename` :command:`user rename`
Renames a user. Renames a user.
@ -51,7 +50,7 @@ which are as follows:
Check user info. Check user info.
:command:`user stats` :command:`user stats`
Show user stats as accounted by quota subsystem. Show user stats as accounted by the quota subsystem.
:command:`user list` :command:`user list`
List all users. List all users.
@ -78,10 +77,10 @@ which are as follows:
Remove access key. Remove access key.
:command:`bucket list` :command:`bucket list`
List buckets, or, if bucket specified with --bucket=<bucket>, List buckets, or, if a bucket is specified with --bucket=<bucket>,
list its objects. If bucket specified adding --allow-unordered list its objects. Adding --allow-unordered
removes ordering requirement, possibly generating results more removes the ordering requirement, possibly generating results more
quickly in buckets with large number of objects. quickly for buckets with large number of objects.
:command:`bucket limit check` :command:`bucket limit check`
Show bucket sharding stats. Show bucket sharding stats.
@ -93,8 +92,8 @@ which are as follows:
Unlink bucket from specified user. Unlink bucket from specified user.
:command:`bucket chown` :command:`bucket chown`
Link bucket to specified user and update object ACLs. Change bucket ownership to the specified user and update object ACLs.
Use --marker to resume if command gets interrupted. Invoke with --marker to resume if the command is interrupted.
:command:`bucket stats` :command:`bucket stats`
Returns bucket statistics. Returns bucket statistics.
@ -109,12 +108,13 @@ which are as follows:
Rewrite all objects in the specified bucket. Rewrite all objects in the specified bucket.
:command:`bucket radoslist` :command:`bucket radoslist`
List the rados objects that contain the data for all objects is List the RADOS objects that contain the data for all objects in
the designated bucket, if --bucket=<bucket> is specified, or the designated bucket, if --bucket=<bucket> is specified.
otherwise all buckets. Otherwise, list the RADOS objects that contain data for all
buckets.
:command:`bucket reshard` :command:`bucket reshard`
Reshard a bucket. Reshard a bucket's index.
:command:`bucket sync disable` :command:`bucket sync disable`
Disable bucket sync. Disable bucket sync.
@ -306,16 +306,16 @@ which are as follows:
Run data sync for the specified source zone. Run data sync for the specified source zone.
:command:`sync error list` :command:`sync error list`
list sync error. List sync errors.
:command:`sync error trim` :command:`sync error trim`
trim sync error. Trim sync errors.
:command:`zone rename` :command:`zone rename`
Rename a zone. Rename a zone.
:command:`zone placement list` :command:`zone placement list`
List zone's placement targets. List a zone's placement targets.
:command:`zone placement add` :command:`zone placement add`
Add a zone placement target. Add a zone placement target.
@ -365,7 +365,7 @@ which are as follows:
List all bucket lifecycle progress. List all bucket lifecycle progress.
:command:`lc process` :command:`lc process`
Manually process lifecycle. If a bucket is specified (e.g., via Manually process lifecycle transitions. If a bucket is specified (e.g., via
--bucket_id or via --bucket and optional --tenant), only that bucket --bucket_id or via --bucket and optional --tenant), only that bucket
is processed. is processed.
@ -385,7 +385,7 @@ which are as follows:
List metadata log which is needed for multi-site deployments. List metadata log which is needed for multi-site deployments.
:command:`mdlog trim` :command:`mdlog trim`
Trim metadata log manually instead of relying on RGWs integrated log sync. Trim metadata log manually instead of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync. complete, otherwise it can reinitiate a sync.
@ -397,7 +397,7 @@ which are as follows:
:command:`bilog trim` :command:`bilog trim`
Trim bucket index log (use start-marker, end-marker) manually instead Trim bucket index log (use start-marker, end-marker) manually instead
of relying on RGWs integrated log sync. of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync. complete, otherwise it can reinitiate a sync.
@ -405,7 +405,7 @@ which are as follows:
List data log which is needed for multi-site deployments. List data log which is needed for multi-site deployments.
:command:`datalog trim` :command:`datalog trim`
Trim data log manually instead of relying on RGWs integrated log sync. Trim data log manually instead of relying on the gateway's integrated log sync.
Before trimming, compare the listings and make sure the last sync was Before trimming, compare the listings and make sure the last sync was
complete, otherwise it can reinitiate a sync. complete, otherwise it can reinitiate a sync.
@ -413,19 +413,19 @@ which are as follows:
Read data log status. Read data log status.
:command:`orphans find` :command:`orphans find`
Init and run search for leaked rados objects. Init and run search for leaked RADOS objects.
DEPRECATED. See the "rgw-orphan-list" tool. DEPRECATED. See the "rgw-orphan-list" tool.
:command:`orphans finish` :command:`orphans finish`
Clean up search for leaked rados objects. Clean up search for leaked RADOS objects.
DEPRECATED. See the "rgw-orphan-list" tool. DEPRECATED. See the "rgw-orphan-list" tool.
:command:`orphans list-jobs` :command:`orphans list-jobs`
List the current job-ids for the orphans search. List the current orphans search job IDs.
DEPRECATED. See the "rgw-orphan-list" tool. DEPRECATED. See the "rgw-orphan-list" tool.
:command:`role create` :command:`role create`
create a new AWS role for use with STS. Create a new role for use with STS (Security Token Service).
:command:`role rm` :command:`role rm`
Remove a role. Remove a role.
@ -485,7 +485,7 @@ which are as follows:
Show events in a pubsub subscription Show events in a pubsub subscription
:command:`subscription ack` :command:`subscription ack`
Ack (remove) an events in a pubsub subscription Acknowledge (remove) events in a pubsub subscription
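Typical invocations might look like the following (user and bucket names are placeholders)::

    radosgw-admin user create --uid=johndoe --display-name="John Doe" --email=john@example.com
    radosgw-admin bucket list --bucket=mybucket --allow-unordered
    radosgw-admin user stats --uid=johndoe --sync-stats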
Options Options
@ -499,7 +499,8 @@ Options
.. option:: -m monaddress[:port] .. option:: -m monaddress[:port]
Connect to specified monitor (instead of looking through ceph.conf). Connect to specified monitor (instead of selecting one
from ceph.conf).
.. option:: --tenant=<tenant> .. option:: --tenant=<tenant>
@ -507,19 +508,19 @@ Options
.. option:: --uid=uid .. option:: --uid=uid
The radosgw user ID. The user on which to operate.
.. option:: --new-uid=uid .. option:: --new-uid=uid
ID of the new user. Used with 'user rename' command. The new ID of the user. Used with 'user rename' command.
.. option:: --subuser=<name> .. option:: --subuser=<name>
Name of the subuser. Name of the subuser.
.. option:: --access-key=<key> .. option:: --access-key=<key>
S3 access key. S3 access key.
.. option:: --email=email .. option:: --email=email
@ -531,28 +532,29 @@ Options
.. option:: --gen-access-key .. option:: --gen-access-key
Generate random access key (for S3). Generate random access key (for S3).
.. option:: --gen-secret .. option:: --gen-secret
Generate random secret key. Generate random secret key.
.. option:: --key-type=<type> .. option:: --key-type=<type>
key type, options are: swift, s3. Key type, options are: swift, s3.
.. option:: --temp-url-key[-2]=<key> .. option:: --temp-url-key[-2]=<key>
Temporary url key. Temporary URL key.
.. option:: --max-buckets .. option:: --max-buckets
max number of buckets for a user (0 for no limit, negative value to disable bucket creation). Maximum number of buckets for a user (0 for no limit, negative value to disable bucket creation).
Default is 1000. Default is 1000.
.. option:: --access=<access> .. option:: --access=<access>
Set the access permissions for the sub-user. Set the access permissions for the subuser.
Available access permissions are read, write, readwrite and full. Available access permissions are read, write, readwrite and full.
.. option:: --display-name=<name> .. option:: --display-name=<name>
@ -600,24 +602,24 @@ Options
.. option:: --bucket-new-name=[tenant-id/]<bucket> .. option:: --bucket-new-name=[tenant-id/]<bucket>
Optional for `bucket link`; use to rename a bucket. Optional for `bucket link`; use to rename a bucket.
While tenant-id/ can be specified, this is never While the tenant-id can be specified, this is not
necessary for normal operation. necessary in normal operation.
.. option:: --shard-id=<shard-id> .. option:: --shard-id=<shard-id>
Optional for mdlog list, bi list, data sync status. Required for ``mdlog trim``. Optional for mdlog list, bi list, data sync status. Required for ``mdlog trim``.
.. option:: --max-entries=<entries> .. option:: --max-entries=<entries>
Optional for listing operations to specify the max entries. Optional for listing operations to specify the max entries.
.. option:: --purge-data .. option:: --purge-data
When specified, user removal will also purge all the user data. When specified, user removal will also purge the user's data.
.. option:: --purge-keys .. option:: --purge-keys
When specified, subuser removal will also purge all the subuser keys. When specified, subuser removal will also purge the subuser's keys.
.. option:: --purge-objects .. option:: --purge-objects
@ -625,7 +627,7 @@ Options
.. option:: --metadata-key=<key> .. option:: --metadata-key=<key>
Key to retrieve metadata from with ``metadata get``. Key from which to retrieve metadata, used with ``metadata get``.
.. option:: --remote=<remote> .. option:: --remote=<remote>
@ -633,11 +635,11 @@ Options
.. option:: --period=<id> .. option:: --period=<id>
Period id. Period ID.
.. option:: --url=<url> .. option:: --url=<url>
url for pushing/pulling period or realm. URL for pushing/pulling period or realm.
.. option:: --epoch=<number> .. option:: --epoch=<number>
@ -657,7 +659,7 @@ Options
.. option:: --master-zone=<id> .. option:: --master-zone=<id>
Master zone id. Master zone ID.
.. option:: --rgw-realm=<name> .. option:: --rgw-realm=<name>
@ -665,11 +667,11 @@ Options
.. option:: --realm-id=<id> .. option:: --realm-id=<id>
The realm id. The realm ID.
.. option:: --realm-new-name=<name> .. option:: --realm-new-name=<name>
New name of realm. New name for the realm.
.. option:: --rgw-zonegroup=<name> .. option:: --rgw-zonegroup=<name>
@ -677,7 +679,7 @@ Options
.. option:: --zonegroup-id=<id> .. option:: --zonegroup-id=<id>
The zonegroup id. The zonegroup ID.
.. option:: --zonegroup-new-name=<name> .. option:: --zonegroup-new-name=<name>
@ -685,11 +687,11 @@ Options
.. option:: --rgw-zone=<zone> .. option:: --rgw-zone=<zone>
Zone in which radosgw is running. Zone in which the gateway is running.
.. option:: --zone-id=<id> .. option:: --zone-id=<id>
The zone id. The zone ID.
.. option:: --zone-new-name=<name> .. option:: --zone-new-name=<name>
@ -709,7 +711,7 @@ Options
.. option:: --placement-id .. option:: --placement-id
Placement id for the zonegroup placement commands. Placement ID for the zonegroup placement commands.
.. option:: --tags=<list> .. option:: --tags=<list>
@ -737,7 +739,7 @@ Options
.. option:: --data-extra-pool=<pool> .. option:: --data-extra-pool=<pool>
The placement target data extra (non-ec) pool. The placement target data extra (non-EC) pool.
.. option:: --placement-index-type=<type> .. option:: --placement-index-type=<type>
@ -765,11 +767,11 @@ Options
.. option:: --sync-from=[zone-name][,...] .. option:: --sync-from=[zone-name][,...]
Set the list of zones to sync from. Set the list of zones from which to sync.
.. option:: --sync-from-rm=[zone-name][,...] .. option:: --sync-from-rm=[zone-name][,...]
Remove the zones from list of zones to sync from. Remove zone(s) from list of zones from which to sync.
.. option:: --bucket-index-max-shards .. option:: --bucket-index-max-shards
@ -780,71 +782,71 @@ Options
.. option:: --fix .. option:: --fix
Besides checking bucket index, will also fix it. Fix the bucket index in addition to checking it.
.. option:: --check-objects .. option:: --check-objects
bucket check: Rebuilds bucket index according to actual objects state. Bucket check: Rebuilds the bucket index according to actual object state.
.. option:: --format=<format> .. option:: --format=<format>
Specify output format for certain operations. Supported formats: xml, json. Specify output format for certain operations. Supported formats: xml, json.
.. option:: --sync-stats .. option:: --sync-stats
Option for 'user stats' command. When specified, it will update user stats with Option for the 'user stats' command. When specified, it will update user stats with
the current stats reported by user's buckets indexes. the current stats reported by the user's buckets indexes.
.. option:: --show-config .. option:: --show-config
Show configuration. Show configuration.
.. option:: --show-log-entries=<flag> .. option:: --show-log-entries=<flag>
Enable/disable dump of log entries on log show. Enable/disable dumping of log entries on log show.
.. option:: --show-log-sum=<flag> .. option:: --show-log-sum=<flag>
Enable/disable dump of log summation on log show. Enable/disable dump of log summation on log show.
.. option:: --skip-zero-entries .. option:: --skip-zero-entries
Log show only dumps entries that don't have zero value in one of the numeric Log show only dumps entries that don't have zero value in one of the numeric
field. field.
.. option:: --infile .. option:: --infile
Specify a file to read in when setting data. Specify a file to read when setting data.
.. option:: --categories=<list> .. option:: --categories=<list>
Comma separated list of categories, used in usage show. Comma separated list of categories, used in usage show.
.. option:: --caps=<caps> .. option:: --caps=<caps>
List of caps (e.g., "usage=read, write; user=read"). List of capabilities (e.g., "usage=read, write; user=read").
.. option:: --compression=<compression-algorithm> .. option:: --compression=<compression-algorithm>
Placement target compression algorithm (lz4|snappy|zlib|zstd) Placement target compression algorithm (lz4|snappy|zlib|zstd).
.. option:: --yes-i-really-mean-it .. option:: --yes-i-really-mean-it
Required for certain operations. Required as a guardrail for certain destructive operations.
.. option:: --min-rewrite-size .. option:: --min-rewrite-size
Specify the min object size for bucket rewrite (default 4M). Specify the minimum object size for bucket rewrite (default 4M).
.. option:: --max-rewrite-size .. option:: --max-rewrite-size
Specify the max object size for bucket rewrite (default ULLONG_MAX). Specify the maximum object size for bucket rewrite (default ULLONG_MAX).
.. option:: --min-rewrite-stripe-size .. option:: --min-rewrite-stripe-size
Specify the min stripe size for object rewrite (default 0). If the value Specify the minimum stripe size for object rewrite (default 0). If the value
is set to 0, then the specified object will always be is set to 0, then the specified object will always be
rewritten for restriping. rewritten when restriping.
.. option:: --warnings-only .. option:: --warnings-only
@ -854,7 +856,7 @@ Options
.. option:: --bypass-gc .. option:: --bypass-gc
When specified with bucket deletion, When specified with bucket deletion,
triggers object deletions by not involving GC. triggers object deletion without involving GC.
.. option:: --inconsistent-index .. option:: --inconsistent-index
@ -863,25 +865,25 @@ Options
.. option:: --max-concurrent-ios .. option:: --max-concurrent-ios
Maximum concurrent ios for bucket operations. Affects operations that Maximum concurrent bucket operations. Affects operations that
scan the bucket index, e.g., listing, deletion, and all scan/search scan the bucket index, e.g., listing, deletion, and all scan/search
operations such as finding orphans or checking the bucket index. operations such as finding orphans or checking the bucket index.
Default is 32. The default is 32.
Quota Options Quota Options
============= =============
.. option:: --max-objects .. option:: --max-objects
Specify max objects (negative value to disable). Specify the maximum number of objects (negative value to disable).
.. option:: --max-size .. option:: --max-size
Specify max size (in B/K/M/G/T, negative value to disable). Specify the maximum object size (in B/K/M/G/T, negative value to disable).
.. option:: --quota-scope .. option:: --quota-scope
The scope of quota (bucket, user). The scope of quota (bucket, user).
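As a hedged example of applying these quota options (the ``quota set`` and ``quota enable`` subcommands are assumed from the broader radosgw-admin interface and are not listed in this excerpt)::

    radosgw-admin quota set --quota-scope=bucket --uid=johndoe --max-objects=10000 --max-size=10G
    radosgw-admin quota enable --quota-scope=bucket --uid=johndoe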
Orphans Search Options Orphans Search Options
@ -889,16 +891,16 @@ Orphans Search Options
.. option:: --num-shards .. option:: --num-shards
Number of shards to use for keeping the temporary scan info Number of shards to use for temporary scan info
.. option:: --orphan-stale-secs .. option:: --orphan-stale-secs
Number of seconds to wait before declaring an object to be an orphan. Number of seconds to wait before declaring an object to be an orphan.
Default is 86400 (24 hours). The default is 86400 (24 hours).
.. option:: --job-id .. option:: --job-id
Set the job id (for orphans find) Set the job id (for orphans find)
Orphans list-jobs options Orphans list-jobs options

View File

@ -53,10 +53,6 @@ Options
Run in foreground, log to usual location Run in foreground, log to usual location
.. option:: --rgw-socket-path=path
Specify a unix domain socket path.
.. option:: --rgw-region=region .. option:: --rgw-region=region
The region where radosgw runs The region where radosgw runs
@ -80,30 +76,24 @@ and ``mod_proxy_fcgi`` have to be present in the server. Unlike ``mod_fastcgi``,
or process management may be available in the FastCGI application framework or process management may be available in the FastCGI application framework
in use. in use.
``Apache`` can be configured in a way that enables ``mod_proxy_fcgi`` to be used ``Apache`` must be configured in a way that enables ``mod_proxy_fcgi`` to be
with localhost tcp or through unix domain socket. ``mod_proxy_fcgi`` that doesn't used with localhost tcp.
support unix domain socket such as the ones in Apache 2.2 and earlier versions of
Apache 2.4, needs to be configured for use with localhost tcp. Later versions of
Apache like Apache 2.4.9 or later support unix domain socket and as such they
allow for the configuration with unix domain socket instead of localhost tcp.
The following steps show the configuration in Ceph's configuration file i.e, The following steps show the configuration in Ceph's configuration file i.e,
``/etc/ceph/ceph.conf`` and the gateway configuration file i.e, ``/etc/ceph/ceph.conf`` and the gateway configuration file i.e,
``/etc/httpd/conf.d/rgw.conf`` (RPM-based distros) or ``/etc/httpd/conf.d/rgw.conf`` (RPM-based distros) or
``/etc/apache2/conf-available/rgw.conf`` (Debian-based distros) with localhost ``/etc/apache2/conf-available/rgw.conf`` (Debian-based distros) with localhost
tcp and through unix domain socket: tcp:
#. For distros with Apache 2.2 and early versions of Apache 2.4 that use #. For distros with Apache 2.2 and early versions of Apache 2.4 that use
localhost TCP and do not support Unix Domain Socket, append the following localhost TCP, append the following contents to ``/etc/ceph/ceph.conf``::
contents to ``/etc/ceph/ceph.conf``::
[client.radosgw.gateway] [client.radosgw.gateway]
host = {hostname} host = {hostname}
keyring = /etc/ceph/ceph.client.radosgw.keyring keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = "" log_file = /var/log/ceph/client.radosgw.gateway.log
log file = /var/log/ceph/client.radosgw.gateway.log rgw_frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
rgw frontends = fastcgi socket_port=9000 socket_host=0.0.0.0 rgw_print_continue = false
rgw print continue = false
#. Add the following content in the gateway configuration file: #. Add the following content in the gateway configuration file:
@ -149,16 +139,6 @@ tcp and through unix domain socket:
</VirtualHost> </VirtualHost>
#. For distros with Apache 2.4.9 or later that support Unix Domain Socket,
append the following configuration to ``/etc/ceph/ceph.conf``::
[client.radosgw.gateway]
host = {hostname}
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
log file = /var/log/ceph/client.radosgw.gateway.log
rgw print continue = false
#. Add the following content in the gateway configuration file: #. Add the following content in the gateway configuration file:
For CentOS/RHEL add in ``/etc/httpd/conf.d/rgw.conf``:: For CentOS/RHEL add in ``/etc/httpd/conf.d/rgw.conf``::
@ -182,10 +162,6 @@ tcp and through unix domain socket:
</VirtualHost> </VirtualHost>
Please note, ``Apache 2.4.7`` does not have Unix Domain Socket support in
it and as such it has to be configured with localhost tcp. The Unix Domain
Socket support is available in ``Apache 2.4.9`` and later versions.
#. Generate a key for radosgw to use for authentication with the cluster. :: #. Generate a key for radosgw to use for authentication with the cluster. ::
ceph-authtool -C -n client.radosgw.gateway --gen-key /etc/ceph/keyring.radosgw.gateway ceph-authtool -C -n client.radosgw.gateway --gen-key /etc/ceph/keyring.radosgw.gateway

View File

@ -107,22 +107,29 @@ of the details of NFS redirecting traffic on the virtual IP to the
appropriate backend NFS servers, and redeploying NFS servers when they
fail.

An optional ``--ingress-mode`` parameter can be provided to choose
how the *ingress* service is configured (an example invocation is shown
after this list):

- Setting ``--ingress-mode keepalive-only`` deploys a simplified *ingress*
  service that provides a virtual IP with the nfs server directly binding to
  that virtual IP and leaves out any sort of load balancing or traffic
  redirection. This setup will restrict users to deploying only 1 nfs daemon
  as multiple cannot bind to the same port on the virtual IP.
- Setting ``--ingress-mode haproxy-standard`` deploys a full *ingress* service
  to provide load balancing and high-availability using HAProxy and keepalived.
  Client IP addresses are not visible to the back-end NFS server and IP level
  restrictions on NFS exports will not function.
- Setting ``--ingress-mode haproxy-protocol`` deploys a full *ingress* service
  to provide load balancing and high-availability using HAProxy and keepalived.
  Client IP addresses are visible to the back-end NFS server and IP level
  restrictions on NFS exports are usable. This mode requires NFS Ganesha version
  5.0 or later.
- Setting ``--ingress-mode default`` is equivalent to not providing any other
  ingress mode by name. When no other ingress mode is specified by name
  the default ingress mode used is ``haproxy-standard``.
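
As an illustration, a cluster with PROXY-protocol ingress might be requested at
creation time with a command along the following lines (a sketch only; the
cluster name, placement specification, and virtual IP are placeholders, and the
exact flags should be confirmed with ``ceph nfs cluster create -h`` on your
release):

.. prompt:: bash $

   ceph nfs cluster create mynfs "2 host1,host2" --ingress --virtual_ip 10.0.0.100/24 --ingress-mode haproxy-protocol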
Ingress can be added to an existing NFS service (e.g., one initially created
without the ``--ingress`` flag), and the basic NFS service can
also be modified after the fact to include non-default options, by modifying
the services directly. For more information, see :ref:`cephadm-ha-nfs`.

View File

@ -41,6 +41,7 @@ Configuration
.. confval:: rbd_stats_pools_refresh_interval .. confval:: rbd_stats_pools_refresh_interval
.. confval:: standby_behaviour .. confval:: standby_behaviour
.. confval:: standby_error_status_code .. confval:: standby_error_status_code
.. confval:: exclude_perf_counters
By default the module will accept HTTP requests on port ``9283`` on all IPv4 By default the module will accept HTTP requests on port ``9283`` on all IPv4
and IPv6 addresses on the host. The port and listen address are both and IPv6 addresses on the host. The port and listen address are both
@ -217,6 +218,15 @@ the module option ``exclude_perf_counters`` to ``false``:
ceph config set mgr mgr/prometheus/exclude_perf_counters false ceph config set mgr mgr/prometheus/exclude_perf_counters false
Ceph daemon performance counters metrics
-----------------------------------------
With the introduction of the ``ceph-exporter`` daemon, the prometheus module no
longer exports Ceph daemon perf counters as Prometheus metrics by default.
However, you can re-enable exporting these metrics by setting the module option
``exclude_perf_counters`` to ``false``::
ceph config set mgr mgr/prometheus/exclude_perf_counters false
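
To verify the current value of the option (a quick check using the same option
name as above), run::

    ceph config get mgr mgr/prometheus/exclude_perf_counters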
Statistic names and labels Statistic names and labels
========================== ==========================

View File

@ -0,0 +1,474 @@
.. _monitoring:
===================
Monitoring overview
===================
The aim of this part of the documentation is to explain the Ceph monitoring
stack and the meaning of the main Ceph metrics.
With a good understanding of the Ceph monitoring stack and metrics, users can
create customized monitoring tools such as Prometheus queries, Grafana
dashboards, or scripts.
Ceph Monitoring stack
=====================
Ceph provides a default monitoring stack, which is installed by cephadm and
explained in the :ref:`Monitoring Services <mgr-cephadm-monitoring>` section of
the cephadm documentation.
Ceph metrics
============
The main source of Ceph metrics is the performance counters exposed by each
Ceph daemon. The :doc:`../dev/perf_counters` are native Ceph monitoring data.
Performance counters are transformed into standard Prometheus metrics by the
Ceph exporter daemon. This daemon runs on every Ceph cluster host and exposes a
metrics endpoint where all the performance counters exposed by all the Ceph
daemons running on the host are published in the form of Prometheus metrics.
In addition to the Ceph exporter, there is another agent that exposes Ceph
metrics: the Prometheus manager module. It exposes metrics related to the
whole cluster, essentially metrics that are not produced by individual Ceph
daemons.
The main source for obtaining Ceph metrics is the metrics endpoint exposed by
the cluster Prometheus server. Ceph can provide you with the Prometheus
endpoint where you can obtain the complete list of metrics (coming from the
Ceph exporter daemons and the Prometheus manager module) and execute queries.
Use the following command to obtain the Prometheus server endpoint in your
cluster:
Example:
.. code-block:: bash
# ceph orch ps --service_name prometheus
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
prometheus.cephtest-node-00 cephtest-node-00.cephlab.com *:9095 running (103m) 50s ago 5w 142M - 2.33.4 514e6a882f6e efe3cbc2e521
With this information you can connect to
``http://cephtest-node-00.cephlab.com:9095`` to access the Prometheus server
interface.
The complete list of metrics (with help text) for your cluster is available
at:
``http://cephtest-node-00.cephlab.com:9095/api/v1/targets/metadata``
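
Prometheus also serves an HTTP query API on the same endpoint. As a quick check
from the command line (reusing the example hostname above; any metric name
exported by your cluster can be substituted for ``ceph_health_status``):

.. code-block:: bash

   curl -s 'http://cephtest-node-00.cephlab.com:9095/api/v1/query?query=ceph_health_status'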
It is worth noting that the main tool that allows users to observe and monitor
a Ceph cluster is the **Ceph dashboard**. It provides graphs on which the most
important cluster and service metrics are represented. Most of the examples in
this document are extracted from the dashboard graphs or extrapolated from the
metrics exposed by the Ceph dashboard.
Performance metrics
===================
Main metrics used to measure Ceph cluster performance:
All metrics have the following labels:
``ceph_daemon``: identifier of the OSD daemon generating the metric
``instance``: the IP address of the ceph exporter instance exposing the metric.
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_osd_op_r{ceph_daemon="osd.0", instance="192.168.122.7:9283", job="ceph"} = 73981
*Cluster I/O (throughput):*
Use ``ceph_osd_op_r_out_bytes`` and ``ceph_osd_op_w_in_bytes`` to obtain the cluster throughput generated by clients
Example:
.. code-block:: bash
Writes (B/s):
sum(irate(ceph_osd_op_w_in_bytes[1m]))
Reads (B/s):
sum(irate(ceph_osd_op_r_out_bytes[1m]))
*Cluster I/O (operations):*
Use ``ceph_osd_op_r``, ``ceph_osd_op_w`` to obtain the number of operations generated by clients
Example:
.. code-block:: bash
Writes (ops/s):
sum(irate(ceph_osd_op_w[1m]))
Reads (ops/s):
sum(irate(ceph_osd_op_r[1m]))
*Latency:*
Use ``ceph_osd_op_latency_sum``, which represents the delay before an OSD transfer of data begins following a client instruction for its transfer
Example:
.. code-block:: bash
sum(irate(ceph_osd_op_latency_sum[1m]))
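
The ``_sum`` counter grows with both load and latency, so for an average
latency figure it is normally divided by the matching ``_count`` counter (a
sketch, assuming the ``ceph_osd_op_latency_count`` series is scraped alongside
the ``_sum`` series):

.. code-block:: bash

   Average client operation latency (seconds):
   sum(irate(ceph_osd_op_latency_sum[1m])) / sum(irate(ceph_osd_op_latency_count[1m]))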
OSD performance
===============
The cluster performance metrics explained above are based on OSD metrics. By selecting the right label, we can obtain the same performance information for a single OSD:
Example:
.. code-block:: bash
OSD 0 read latency
irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"osd.0"}[1m]) / on (ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])
OSD 0 write IOPS
irate(ceph_osd_op_w{ceph_daemon=~"osd.0"}[1m])
OSD 0 write throughput (bytes)
irate(ceph_osd_op_w_in_bytes{ceph_daemon=~"osd.0"}[1m])
OSD.0 total raw capacity available
ceph_osd_stat_bytes{ceph_daemon="osd.0", instance="cephtest-node-00.cephlab.com:9283", job="ceph"} = 536451481
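
The same label-based selection can be used to watch how full an individual OSD
is (a sketch, assuming the ``ceph_osd_stat_bytes_used`` series exported by the
prometheus module is available):

.. code-block:: bash

   OSD 0 raw utilization (ratio between 0 and 1)
   ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} / ceph_osd_stat_bytes{ceph_daemon="osd.0"}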
Physical disk performance
=========================
By combining Prometheus ``node_exporter`` metrics with Ceph metrics, we can
obtain information about the performance of the physical disks used by OSDs.
Example:
.. code-block:: bash
Read latency of device used by OSD 0:
label_replace(irate(node_disk_read_time_seconds_total[1m]) / irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Write latency of device used by OSD 0
label_replace(irate(node_disk_write_time_seconds_total[1m]) / irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
IOPS (device used by OSD.0)
reads:
label_replace(irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
writes:
label_replace(irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Throughput (device used by OSD.0)
reads:
label_replace(irate(node_disk_read_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
writes:
label_replace(irate(node_disk_written_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Physical Device Utilization (%) for OSD.0 in the last 5 minutes
label_replace(irate(node_disk_io_time_seconds_total[5m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")
Pool metrics
============
These metrics have the following labels:
``instance``: the ip address of the Ceph exporter daemon producing the metric.
``pool_id``: identifier of the pool
``job``: prometheus scrape job
- ``ceph_pool_metadata``: Information about the pool. It can be used together
  with other metrics to provide more contextual information in queries and
  graphs. Apart from the three common labels, this metric provides the
  following extra labels:
- ``compression_mode``: compression used in the pool (lz4, snappy, zlib,
zstd, none). Example: compression_mode="none"
- ``description``: brief description of the pool type (replica:number of
replicas or Erasure code: ec profile). Example: description="replica:3"
- ``name``: name of the pool. Example: name=".mgr"
- ``type``: type of pool (replicated/erasure code). Example: type="replicated"
- ``ceph_pool_bytes_used``: Total raw capacity consumed by user data and associated overheads by pool (metadata + redundancy).
- ``ceph_pool_stored``: Total of CLIENT data stored in the pool
- ``ceph_pool_compress_under_bytes``: Data eligible to be compressed in the pool
- ``ceph_pool_compress_bytes_used``: Data compressed in the pool
- ``ceph_pool_rd``: CLIENT read operations per pool (reads per second)
- ``ceph_pool_rd_bytes``: CLIENT read operations in bytes per pool
- ``ceph_pool_wr``: CLIENT write operations per pool (writes per second)
- ``ceph_pool_wr_bytes``: CLIENT write operation in bytes per pool
**Useful queries**:
.. code-block:: bash
Total raw capacity available in the cluster:
sum(ceph_osd_stat_bytes)
Total raw capacity consumed in the cluster (including metadata + redundancy):
sum(ceph_pool_bytes_used)
Total of CLIENT data stored in the cluster:
sum(ceph_pool_stored)
Compression savings:
sum(ceph_pool_compress_under_bytes - ceph_pool_compress_bytes_used)
CLIENT IOPS for a pool (testrbdpool)
reads: irate(ceph_pool_rd[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
writes: irate(ceph_pool_wr[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
CLIENT Throughput for a pool
reads: irate(ceph_pool_rd_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
writes: irate(ceph_pool_wr_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
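
A per-pool fill ratio can be derived in the same way (a sketch, assuming the
``ceph_pool_max_avail`` series exported by the prometheus module is available):

.. code-block:: bash

   Fill ratio (0-1) for a pool (testrbdpool)
   (ceph_pool_stored / (ceph_pool_stored + ceph_pool_max_avail)) * on(pool_id) group_left(name) ceph_pool_metadata{name=~"testrbdpool"}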
Object metrics
==============
These metrics have the following labels:
``instance``: the ip address of the ceph exporter daemon providing the metric
``instance_id``: identifier of the rgw daemon
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_rgw_req{instance="192.168.122.7:9283", instance_id="154247", job="ceph"} = 12345
Generic metrics
---------------
- ``ceph_rgw_metadata``: Provides generic information about the RGW daemon. It
can be used together with other metrics to provide more contextual
information in queries and graphs. Apart from the three common labels, this
metric provides the following extra labels:
- ``ceph_daemon``: Name of the Ceph daemon. Example:
ceph_daemon="rgw.rgwtest.cephtest-node-00.sxizyq",
- ``ceph_version``: Version of Ceph daemon. Example: ceph_version="ceph
version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)",
- ``hostname``: Name of the host where the daemon runs. Example:
hostname:"cephtest-node-00.cephlab.com",
- ``ceph_rgw_req``: Total number of requests for the daemon (GET + PUT + DELETE).
Useful to detect bottlenecks and optimize load distribution.
- ``ceph_rgw_qlen``: RGW operations queue length for the daemon.
Useful to detect bottlenecks and optimize load distribution.
- ``ceph_rgw_failed_req``: Aborted requests.
Useful to detect daemon errors
GET operations: related metrics
-------------------------------
- ``ceph_rgw_get_initial_lat_count``: Number of GET operations
- ``ceph_rgw_get_initial_lat_sum``: Total latency time for the GET operations
- ``ceph_rgw_get``: Total number of GET requests
- ``ceph_rgw_get_b``: Total bytes transferred in GET operations
Put operations: related metrics
-------------------------------
- ``ceph_rgw_put_initial_lat_count``: Number of PUT operations
- ``ceph_rgw_put_initial_lat_sum``: Total latency time for the PUT operations
- ``ceph_rgw_put``: Total number of PUT operations
- ``ceph_rgw_put_b``: Total bytes transferred in PUT operations
Useful queries
--------------
.. code-block:: bash
The average of get latencies:
rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
The average of put latencies:
rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Total requests per second:
rate(ceph_rgw_req[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Total number of "other" operations (LIST, DELETE)
rate(ceph_rgw_req[30s]) - (rate(ceph_rgw_get[30s]) + rate(ceph_rgw_put[30s]))
GET latencies
rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
PUT latencies
rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Bandwidth consumed by GET operations
sum(rate(ceph_rgw_get_b[30s]))
Bandwidth consumed by PUT operations
sum(rate(ceph_rgw_put_b[30s]))
Bandwidth consumed by RGW instance (PUTs + GETs)
sum by (instance_id) (rate(ceph_rgw_get_b[30s]) + rate(ceph_rgw_put_b[30s])) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
Http errors:
rate(ceph_rgw_failed_req[30s])
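
For alerting, a failure ratio is often more useful than the raw error rate (a
sketch built from the two counters described above):

.. code-block:: bash

   Fraction of failed requests per RGW daemon
   (rate(ceph_rgw_failed_req[30s]) / rate(ceph_rgw_req[30s])) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata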
Filesystem Metrics
==================
These metrics have the following labels:
``ceph_daemon``: The name of the MDS daemon
``instance``: the IP address (and port) of the Ceph exporter daemon exposing the metric
``job``: prometheus scrape job
Example:
.. code-block:: bash
ceph_mds_request{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", instance="192.168.122.7:9283", job="ceph"} = 1452
Main metrics
------------
- ``ceph_mds_metadata``: Provides general information about the MDS daemon. It
can be used together with other metrics to provide more contextual
information in queries and graphs. It provides the following extra labels:
- ``ceph_version``: MDS daemon Ceph version
- ``fs_id``: filesystem cluster id
- ``hostname``: Host name where the MDS daemon runs
- ``public_addr``: Public address where the MDS daemon runs
- ``rank``: Rank of the MDS daemon
Example:
.. code-block:: bash
ceph_mds_metadata{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", ceph_version="ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)", fs_id="-1", hostname="cephtest-node-00.cephlab.com", instance="cephtest-node-00.cephlab.com:9283", job="ceph", public_addr="192.168.122.145:6801/118896446", rank="-1"}
- ``ceph_mds_request``: Total number of requests for the MDS daemon
- ``ceph_mds_reply_latency_sum``: Reply latency total
- ``ceph_mds_reply_latency_count``: Reply latency count
- ``ceph_mds_server_handle_client_request``: Number of client requests
- ``ceph_mds_sessions_session_count``: Session count
- ``ceph_mds_sessions_total_load``: Total load
- ``ceph_mds_sessions_sessions_open``: Sessions currently open
- ``ceph_mds_sessions_sessions_stale``: Sessions currently stale
- ``ceph_objecter_op_r``: Number of read operations
- ``ceph_objecter_op_w``: Number of write operations
- ``ceph_mds_root_rbytes``: Total number of bytes managed by the daemon
- ``ceph_mds_root_rfiles``: Total number of files managed by the daemon
Useful queries:
---------------
.. code-block:: bash
Total MDS daemons read workload:
sum(rate(ceph_objecter_op_r[1m]))
Total MDS daemons write workload:
sum(rate(ceph_objecter_op_w[1m]))
MDS daemon read workload: (daemon name is "mdstest")
sum(rate(ceph_objecter_op_r{ceph_daemon=~"mdstest"}[1m]))
MDS daemon write workload: (daemon name is "mdstest")
sum(rate(ceph_objecter_op_w{ceph_daemon=~"mdstest"}[1m]))
The average of reply latencies:
rate(ceph_mds_reply_latency_sum[30s]) / rate(ceph_mds_reply_latency_count[30s])
Total requests per second:
rate(ceph_mds_request[30s]) * on (instance) group_right (ceph_daemon) ceph_mds_metadata
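
To spot the busiest metadata servers at a glance, the request counter can be
ranked (a sketch; ``topk`` is a standard PromQL aggregation):

.. code-block:: bash

   Five busiest MDS daemons by request rate:
   topk(5, rate(ceph_mds_request[1m]))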
Block metrics
=============
By default, RBD metrics for images are not available, in order to provide the
best performance in the prometheus manager module.

To produce metrics for RBD images, the manager option
``mgr/prometheus/rbd_stats_pools`` must be configured properly. For more
information, see :ref:`prometheus-rbd-io-statistics`.
These metrics have the following labels:
``image``: Name of the image which produces the metric value.
``instance``: Node where the rbd metric is produced. (It points to the Ceph exporter daemon)
``job``: Name of the Prometheus scrape job.
``pool``: Image pool name.
Example:
.. code-block:: bash
ceph_rbd_read_bytes{image="test2", instance="cephtest-node-00.cephlab.com:9283", job="ceph", pool="testrbdpool"}
Main metrics
------------
- ``ceph_rbd_read_bytes``: RBD image bytes read
- ``ceph_rbd_read_latency_count``: RBD image reads latency count
- ``ceph_rbd_read_latency_sum``: RBD image reads latency total
- ``ceph_rbd_read_ops``: RBD image reads count
- ``ceph_rbd_write_bytes``: RBD image bytes written
- ``ceph_rbd_write_latency_count``: RBD image writes latency count
- ``ceph_rbd_write_latency_sum``: RBD image writes latency total
- ``ceph_rbd_write_ops``: RBD image writes count
Useful queries
--------------
.. code-block:: bash
The average of read latencies:
rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s])
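
The write-side latencies can be derived in the same way (a sketch using the
write counters listed above):

.. code-block:: bash

   The average of write latencies:
   rate(ceph_rbd_write_latency_sum[30s]) / rate(ceph_rbd_write_latency_count[30s])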

View File

@ -4,7 +4,7 @@
Configuring Ceph Configuring Ceph
================== ==================
When Ceph services start, the initialization process activates a set of
daemons that run in the background. A :term:`Ceph Storage Cluster` runs at
least three types of daemons:
@ -12,15 +12,16 @@ least three types of daemons:
- :term:`Ceph Manager` (``ceph-mgr``) - :term:`Ceph Manager` (``ceph-mgr``)
- :term:`Ceph OSD Daemon` (``ceph-osd``) - :term:`Ceph OSD Daemon` (``ceph-osd``)
Any Ceph Storage Cluster that supports the :term:`Ceph File System` also runs
at least one :term:`Ceph Metadata Server` (``ceph-mds``). Any cluster that
supports :term:`Ceph Object Storage` runs Ceph RADOS Gateway daemons
(``radosgw``).

Each daemon has a number of configuration options, and each of those options
has a default value. Adjust the behavior of the system by changing these
configuration options. Make sure to understand the consequences before
overriding the default values, as it is possible to significantly degrade the
performance and stability of your cluster. Remember that default values
sometimes change between releases. For this reason, it is best to review the
version of this documentation that applies to your Ceph release.
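
For example, an individual option can be inspected and overridden at runtime
with the ``ceph config`` command (the option shown here is only an example):

.. prompt:: bash $

   ceph config get osd osd_max_backfills
   ceph config set osd osd_max_backfills 2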

View File

@ -4,11 +4,12 @@
.. note:: Since the Luminous release of Ceph, Filestore has not been Ceph's
   default storage back end. Since the Luminous release of Ceph, BlueStore has
   been Ceph's default storage back end. However, Filestore OSDs are still
   supported up to Quincy. Filestore OSDs are not supported in Reef. See
   :ref:`OSD Back Ends <rados_config_storage_devices_osd_backends>`. See
   :ref:`BlueStore Migration <rados_operations_bluestore_migration>` for
   instructions explaining how to replace an existing Filestore back end with a
   BlueStore back end.
``filestore_debug_omap_check`` ``filestore_debug_omap_check``

View File

@ -18,27 +18,25 @@ Background
Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`.

The :term:`Cluster Map` makes it possible for :term:`Ceph client`\s to
determine the location of all Ceph Monitors, Ceph OSD Daemons, and Ceph
Metadata Servers. Clients do this by connecting to one Ceph Monitor and
retrieving a current cluster map. Ceph clients must connect to a Ceph Monitor
before they can read from or write to Ceph OSD Daemons or Ceph Metadata
Servers. A Ceph client that has a current copy of the cluster map and the CRUSH
algorithm can compute the location of any RADOS object within the cluster. This
makes it possible for Ceph clients to talk directly to Ceph OSD Daemons. Direct
communication between clients and Ceph OSD Daemons improves upon traditional
storage architectures that required clients to communicate with a central
component. See `Scalability and High Availability`_ for more on this subject.

The Ceph Monitor's primary function is to maintain a master copy of the cluster
map. Monitors also provide authentication and logging services. All changes in
the monitor services are written by the Ceph Monitor to a single Paxos
instance, and Paxos writes the changes to a key/value store. This provides
strong consistency. Ceph Monitors are able to query the most recent version of
the cluster map during sync operations, and they use the key/value store's
snapshots and iterators (using RocksDB) to perform store-wide synchronization.
.. ditaa:: .. ditaa::
/-------------\ /-------------\ /-------------\ /-------------\
@ -289,7 +287,6 @@ by setting it in the ``[mon]`` section of the configuration file.
.. confval:: mon_data_size_warn .. confval:: mon_data_size_warn
.. confval:: mon_data_avail_warn .. confval:: mon_data_avail_warn
.. confval:: mon_data_avail_crit .. confval:: mon_data_avail_crit
.. confval:: mon_warn_on_cache_pools_without_hit_sets
.. confval:: mon_warn_on_crush_straw_calc_version_zero .. confval:: mon_warn_on_crush_straw_calc_version_zero
.. confval:: mon_warn_on_legacy_crush_tunables .. confval:: mon_warn_on_legacy_crush_tunables
.. confval:: mon_crush_min_required_version .. confval:: mon_crush_min_required_version
@ -540,6 +537,8 @@ Trimming requires that the placement groups are ``active+clean``.
.. index:: Ceph Monitor; clock .. index:: Ceph Monitor; clock
.. _mon-config-ref-clock:
Clock Clock
----- -----

View File

@ -91,7 +91,7 @@ Similarly, two options control whether IPv4 and IPv6 addresses are used:
to an IPv6 address to an IPv6 address
.. note:: The ability to bind to multiple ports has paved the way for
   dual-stack IPv4 and IPv6 support. That said, dual-stack operation is
   not yet supported as of Quincy v17.2.0.
Connection modes Connection modes

View File

@ -145,17 +145,20 @@ See `Pool & PG Config Reference`_ for details.
Scrubbing Scrubbing
========= =========
One way that Ceph ensures data integrity is by "scrubbing" placement groups.
Ceph scrubbing is analogous to ``fsck`` on the object storage layer. Ceph
generates a catalog of all objects in each placement group and compares each
primary object to its replicas, ensuring that no objects are missing or
mismatched. Light scrubbing checks the object size and attributes, and is
usually done daily. Deep scrubbing reads the data and uses checksums to ensure
data integrity, and is usually done weekly. The frequencies of both light
scrubbing and deep scrubbing are determined by the cluster's configuration,
which is fully under your control and subject to the settings explained below
in this section.

Although scrubbing is important for maintaining data integrity, it can reduce
the performance of the Ceph cluster. You can adjust the following settings to
increase or decrease the frequency and depth of scrubbing operations.
.. confval:: osd_max_scrubs .. confval:: osd_max_scrubs

View File

@ -1,3 +1,5 @@
.. _rados_config_pool_pg_crush_ref:
====================================== ======================================
Pool, PG and CRUSH Config Reference Pool, PG and CRUSH Config Reference
====================================== ======================================

View File

@ -4,74 +4,70 @@
Adding/Removing Monitors
==========================

It is possible to add monitors to a running cluster as long as redundancy is
maintained. To bootstrap a monitor, see `Manual Deployment`_ or `Monitor
Bootstrap`_.

.. _adding-monitors:

Adding Monitors
===============

Ceph monitors serve as the single source of truth for the cluster map. It is
possible to run a cluster with only one monitor, but for a production cluster
it is recommended to have at least three monitors provisioned and in quorum.
Ceph monitors use a variation of the `Paxos`_ algorithm to maintain consensus
about maps and about other critical information across the cluster. Due to the
nature of Paxos, Ceph is able to maintain quorum (and thus establish
consensus) only if a majority of the monitors are ``active``.

It is best to run an odd number of monitors. This is because a cluster that is
running an odd number of monitors is more resilient than a cluster running an
even number. For example, in a two-monitor deployment, no failures can be
tolerated if quorum is to be maintained; in a three-monitor deployment, one
failure can be tolerated; in a four-monitor deployment, one failure can be
tolerated; and in a five-monitor deployment, two failures can be tolerated. In
general, a cluster running an odd number of monitors is best because it avoids
what is called the *split brain* phenomenon. In short, Ceph is able to operate
only if a majority of monitors are ``active`` and able to communicate with each
other (for example, there must be a single monitor, two out of two monitors,
two out of three monitors, three out of five monitors, or the like).

For small or non-critical deployments of multi-node Ceph clusters, it is
recommended to deploy three monitors. For larger clusters or for clusters that
are intended to survive a double failure, it is recommended to deploy five
monitors. Only in rare circumstances is there any justification for deploying
seven or more monitors.

It is possible to run a monitor on the same host that is running an OSD.
However, this approach has disadvantages: for example, `fsync` issues with the
kernel might weaken performance, and monitor and OSD daemons might be inactive
at the same time and cause disruption if the node crashes, is rebooted, or is
taken down for maintenance. Because of these risks, it is instead recommended
to run monitors and managers on dedicated hosts.

.. note:: A *majority* of monitors in your cluster must be able to
   reach each other in order for quorum to be established.

Deploying your Hardware
-----------------------

Some operators choose to add a new monitor host at the same time that they add
a new monitor. For details on the minimum recommendations for monitor hardware,
see `Hardware Recommendations`_. Before adding a monitor host to the cluster,
make sure that there is an up-to-date version of Linux installed.

Add the newly installed monitor host to a rack in your cluster, connect the
host to the network, and make sure that the host has network connectivity.

.. _Hardware Recommendations: ../../../start/hardware-recommendations

Installing the Required Software
--------------------------------

In manually deployed clusters, it is necessary to install Ceph packages
manually. For details, see `Installing Packages`_. Configure SSH so that it can
be used by a user that has passwordless authentication and root permissions.

.. _Installing Packages: ../../../install/install-storage-cluster
@ -81,67 +77,65 @@ and root permissions.
Adding a Monitor (Manual)
-------------------------

The procedure in this section creates a ``ceph-mon`` data directory, retrieves
the monitor map and the monitor keyring, and adds a ``ceph-mon`` daemon to
the cluster. The procedure might result in a Ceph cluster that contains only
two monitor daemons. To add more monitors until there are enough ``ceph-mon``
daemons to establish quorum, repeat the procedure.

This is a good point at which to define the new monitor's ``id``. Monitors have
often been named with single letters (``a``, ``b``, ``c``, etc.), but you are
free to define the ``id`` however you see fit. In this document, ``{mon-id}``
refers to the ``id`` exclusive of the ``mon.`` prefix: for example, if
``mon.a`` has been chosen as the ``id`` of a monitor, then ``{mon-id}`` is
``a``.

#. Create a data directory on the machine that will host the new monitor:

   .. prompt:: bash $

      ssh {new-mon-host}
      sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}

#. Create a temporary directory ``{tmp}`` that will contain the files needed
   during this procedure. This directory should be different from the data
   directory created in the previous step. Because this is a temporary
   directory, it can be removed after the procedure is complete:

   .. prompt:: bash $

      mkdir {tmp}

#. Retrieve the keyring for your monitors (``{tmp}`` is the path to the
   retrieved keyring and ``{key-filename}`` is the name of the file that
   contains the retrieved monitor key):

   .. prompt:: bash $

      ceph auth get mon. -o {tmp}/{key-filename}

#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map
   and ``{map-filename}`` is the name of the file that contains the retrieved
   monitor map):

   .. prompt:: bash $

      ceph mon getmap -o {tmp}/{map-filename}

#. Prepare the monitor's data directory, which was created in the first step.
   The following command must specify the path to the monitor map (so that
   information about a quorum of monitors and their ``fsid``\s can be
   retrieved) and specify the path to the monitor keyring:

   .. prompt:: bash $

      sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}

#. Start the new monitor. It will automatically join the cluster. To provide
   information to the daemon about which address to bind to, use either the
   ``--public-addr {ip}`` option or the ``--public-network {network}`` option.
   For example:

   .. prompt:: bash $

      ceph-mon -i {mon-id} --public-addr {ip:port}
@ -151,44 +145,47 @@ on ``mon.a``).
Removing Monitors
=================

When monitors are removed from a cluster, it is important to remember
that Ceph monitors use Paxos to maintain consensus about the cluster
map. Such consensus is possible only if the number of monitors is sufficient
to establish quorum.

.. _Removing a Monitor (Manual):

Removing a Monitor (Manual)
---------------------------

The procedure in this section removes a ``ceph-mon`` daemon from the cluster.
The procedure might result in a Ceph cluster that contains a number of monitors
insufficient to maintain quorum, so plan carefully. When replacing an old
monitor with a new monitor, add the new monitor first, wait for quorum to be
established, and then remove the old monitor. This ensures that quorum is not
lost.

#. Stop the monitor:

   .. prompt:: bash $

      service ceph -a stop mon.{mon-id}

#. Remove the monitor from the cluster:

   .. prompt:: bash $

      ceph mon remove {mon-id}

#. Remove the monitor entry from the ``ceph.conf`` file:
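
   The entry to delete looks similar to the following (the host name and
   address here are only placeholders)::

       [mon.{mon-id}]
           host = {hostname}
           addr = {ip}:6789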
.. _rados-mon-remove-from-unhealthy:

Removing Monitors from an Unhealthy Cluster
-------------------------------------------

The procedure in this section removes a ``ceph-mon`` daemon from an unhealthy
cluster (for example, a cluster whose monitors are unable to form a quorum).
#. Stop all ``ceph-mon`` daemons on all monitor hosts: #. Stop all ``ceph-mon`` daemons on all monitor hosts:
@ -197,63 +194,68 @@ quorum.
   .. prompt:: bash $

      ssh {mon-host}
      systemctl stop ceph-mon.target

   Repeat this step on every monitor host.

#. Identify a surviving monitor and log in to the monitor's host:

   .. prompt:: bash $

      ssh {mon-host}

#. Extract a copy of the ``monmap`` file by running a command of the following
   form:

   .. prompt:: bash $

      ceph-mon -i {mon-id} --extract-monmap {map-path}

   Here is a more concrete example. In this example, ``hostname`` is the
   ``{mon-id}`` and ``/tmp/monmap`` is the ``{map-path}``:

   .. prompt:: bash $

      ceph-mon -i `hostname` --extract-monmap /tmp/monmap

#. Remove the non-surviving or otherwise problematic monitors:

   .. prompt:: bash $

      monmaptool {map-path} --rm {mon-id}

   For example, suppose that there are three monitors |---| ``mon.a``,
   ``mon.b``, and ``mon.c`` |---| and that only ``mon.a`` will survive:

   .. prompt:: bash $

      monmaptool /tmp/monmap --rm b
      monmaptool /tmp/monmap --rm c

#. Inject the surviving map that includes the removed monitors into the
   monmap of the surviving monitor(s):

   .. prompt:: bash $

      ceph-mon -i {mon-id} --inject-monmap {map-path}

   Continuing with the above example, inject a map into monitor ``mon.a`` by
   running the following command:

   .. prompt:: bash $

      ceph-mon -i a --inject-monmap /tmp/monmap

#. Start only the surviving monitors.

#. Verify that the monitors form a quorum by running the command ``ceph -s``.

#. The data directory of the removed monitors is in ``/var/lib/ceph/mon``:
   either archive this data directory in a safe location or delete this data
   directory. However, do not delete it unless you are confident that the
   remaining monitors are healthy and sufficiently redundant. Make sure that
   there is enough room for the live DB to expand and compact, and make sure
   that there is also room for an archived copy of the DB. The archived copy
   can be compressed.
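
   One minimal way to archive the directory before deleting it (the monitor id
   and destination path below are only examples):

   .. prompt:: bash $

      tar czf /root/mon-{mon-id}-backup.tar.gz -C /var/lib/ceph/mon ceph-{mon-id}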
.. _Changing a Monitor's IP address: .. _Changing a Monitor's IP address:
@ -262,185 +264,195 @@ Changing a Monitor's IP Address
.. important:: Existing monitors are not supposed to change their IP addresses.

Monitors are critical components of a Ceph cluster. The entire system can work
properly only if the monitors maintain quorum, and quorum can be established
only if the monitors have discovered each other by means of their IP addresses.
Ceph has strict requirements on the discovery of monitors.

Although the ``ceph.conf`` file is used by Ceph clients and other Ceph daemons
to discover monitors, the monitor map is used by monitors to discover each
other. This is why it is necessary to obtain the current ``monmap`` at the time
a new monitor is created: as can be seen above in `Adding a Monitor (Manual)`_,
the ``monmap`` is one of the arguments required by the ``ceph-mon -i {mon-id}
--mkfs`` command. The following sections explain the consistency requirements
for Ceph monitors, and also explain a number of safe ways to change a monitor's
IP address.

Consistency Requirements
------------------------

When a monitor discovers other monitors in the cluster, it always refers to the
local copy of the monitor map. Using the monitor map instead of using the
``ceph.conf`` file avoids errors that could break the cluster (for example,
typos or other slight errors in ``ceph.conf`` when a monitor address or port is
specified). Because monitors use monitor maps for discovery and because they
share monitor maps with Ceph clients and other Ceph daemons, the monitor map
provides monitors with a strict guarantee that their consensus is valid.

Strict consistency also applies to updates to the monmap. As with any other
updates on the monitor, changes to the monmap always run through a distributed
consensus algorithm called `Paxos`_. The monitors must agree on each update to
the monmap, such as adding or removing a monitor, to ensure that each monitor
in the quorum has the same version of the monmap. Updates to the monmap are
incremental so that monitors have the latest agreed upon version, and a set of
previous versions, allowing a monitor that has an older version of the monmap
to catch up with the current state of the cluster.

There are additional advantages to using the monitor map rather than
``ceph.conf`` when monitors discover each other. Because ``ceph.conf`` is not
automatically updated and distributed, its use would bring certain risks:
monitors might use an outdated ``ceph.conf`` file, might fail to recognize a
specific monitor, might fall out of quorum, and might develop a situation in
which `Paxos`_ is unable to accurately ascertain the current state of the
system. Because of these risks, any changes to an existing monitor's IP address
must be made with great care.
.. _operations_add_or_rm_mons_changing_mon_ip:
Changing a Monitor's IP address (The Right Way) Changing a Monitor's IP address (Preferred Method)
----------------------------------------------- --------------------------------------------------
Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to If a monitor's IP address is changed only in the ``ceph.conf`` file, there is
ensure that other monitors in the cluster will receive the update. To change a no guarantee that the other monitors in the cluster will receive the update.
monitor's IP address, you must add a new monitor with the IP address you want For this reason, the preferred method to change a monitor's IP address is as
to use (as described in `Adding a Monitor (Manual)`_), ensure that the new follows: add a new monitor with the desired IP address (as described in `Adding
monitor successfully joins the quorum; then, remove the monitor that uses the a Monitor (Manual)`_), make sure that the new monitor successfully joins the
old IP address. Then, update the ``ceph.conf`` file to ensure that clients and quorum, remove the monitor that is using the old IP address, and update the
other daemons know the IP address of the new monitor. ``ceph.conf`` file to ensure that clients and other daemons are made aware of
the new monitor's IP address.
For example, lets assume there are three monitors in place, such as :: For example, suppose that there are three monitors in place::
[mon.a] [mon.a]
host = host01 host = host01
addr = 10.0.0.1:6789 addr = 10.0.0.1:6789
[mon.b] [mon.b]
host = host02 host = host02
addr = 10.0.0.2:6789 addr = 10.0.0.2:6789
[mon.c] [mon.c]
host = host03 host = host03
addr = 10.0.0.3:6789 addr = 10.0.0.3:6789
To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the To change ``mon.c`` so that its name is ``host04`` and its IP address is
steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure ``10.0.0.4``: (1) follow the steps in `Adding a Monitor (Manual)`_ to add a new
that ``mon.d`` is running before removing ``mon.c``, or it will break the monitor ``mon.d``, (2) make sure that ``mon.d`` is running before removing
quorum. Remove ``mon.c`` as described on `Removing a Monitor (Manual)`_. Moving ``mon.c`` or else quorum will be broken, and (3) follow the steps in `Removing
all three monitors would thus require repeating this process as many times as a Monitor (Manual)`_ to remove ``mon.c``. To move all three monitors to new IP
needed. addresses, repeat this process.
Changing a Monitor's IP address (Advanced Method)
-------------------------------------------------
Changing a Monitor's IP address (The Messy Way) There are cases in which the method outlined in :ref"`<Changing a Monitor's IP
----------------------------------------------- Address (Preferred Method)> operations_add_or_rm_mons_changing_mon_ip` cannot
be used. For example, it might be necessary to move the cluster's monitors to a
different network, to a different part of the datacenter, or to a different
datacenter altogether. It is still possible to change the monitors' IP
addresses, but a different method must be used.
For such cases, a new monitor map with updated IP addresses for every monitor
in the cluster must be generated and injected on each monitor. Although this
method is not particularly easy, such a major migration is unlikely to be a
routine task. As stated at the beginning of this section, existing monitors are
not supposed to change their IP addresses.
Continue with the monitor configuration in the example from
:ref:`Changing a Monitor's IP Address (Preferred Method)
<operations_add_or_rm_mons_changing_mon_ip>`. Suppose that all of the monitors
are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that
these networks are unable to communicate. Carry out the following procedure:

#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor
   map, and ``{filename}`` is the name of the file that contains the retrieved
   monitor map):
   .. prompt:: bash $

      ceph mon getmap -o {tmp}/{filename}

#. Check the contents of the monitor map:

   .. prompt:: bash $

      monmaptool --print {tmp}/{filename}

   ::

      monmaptool: monmap file {tmp}/{filename}
      epoch 1
      fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
      last_changed 2012-12-17 02:46:41.591248
      created 2012-12-17 02:46:41.591248
      0: 10.0.0.1:6789/0 mon.a
      1: 10.0.0.2:6789/0 mon.b
      2: 10.0.0.3:6789/0 mon.c

#. Remove the existing monitors from the monitor map:

   .. prompt:: bash $

      monmaptool --rm a --rm b --rm c {tmp}/{filename}

   ::

      monmaptool: monmap file {tmp}/{filename}
      monmaptool: removing a
      monmaptool: removing b
      monmaptool: removing c
      monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)

#. Add the new monitor locations to the monitor map:

   .. prompt:: bash $

      monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename}

   ::

      monmaptool: monmap file {tmp}/{filename}
      monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors)

#. Check the new contents of the monitor map:

   .. prompt:: bash $

      monmaptool --print {tmp}/{filename}

   ::

      monmaptool: monmap file {tmp}/{filename}
      epoch 1
      fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
      last_changed 2012-12-17 02:46:41.591248
      created 2012-12-17 02:46:41.591248
      0: 10.1.0.1:6789/0 mon.a
      1: 10.1.0.2:6789/0 mon.b
      2: 10.1.0.3:6789/0 mon.c
At this point, we assume that the monitors (and stores) have been installed at
the new location. Next, propagate the modified monitor map to the new monitors,
and inject the modified monitor map into each new monitor.
#. Make sure all of your monitors have been stopped. Never inject into a
   monitor while the monitor daemon is running.

#. Inject the monitor map:

   .. prompt:: bash $

      ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename}

#. Restart all of the monitors.
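   On systems that manage Ceph daemons with systemd (a common but not
   universal arrangement), this restart might look like the following; the
   unit name is an assumption and may differ on your deployment:

   .. prompt:: bash $

      sudo systemctl restart ceph-mon@{mon-id}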
Migration to the new location is now complete. The monitors should operate
successfully.
.. _Manual Deployment: ../../../install/manual-deployment
.. _Monitor Bootstrap: ../../../dev/mon-bootstrap
.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
.. |---| unicode:: U+2014 .. EM DASH
:trim:
@ -1,7 +1,7 @@
.. _balancer:

Balancer Module
===============

The *balancer* can optimize the allocation of placement groups (PGs) across
OSDs in order to achieve a balanced distribution. The balancer can operate
@ -106,22 +106,27 @@ to be considered ``stuck`` (default: 300).
PGs might be stuck in any of the following states:

**Inactive**
    PGs are unable to process reads or writes because they are waiting for an
    OSD that has the most up-to-date data to return to an ``up`` state.

**Unclean**
    PGs contain objects that have not been replicated the desired number of
    times. These PGs have not yet completed the process of recovering.

**Stale**
    PGs are in an unknown state, because the OSDs that host them have not
    reported to the monitor cluster for a certain period of time (specified by
    the ``mon_osd_report_timeout`` configuration setting).

To delete a ``lost`` object or revert an object to its prior state, either by
reverting it to its previous version or by deleting it because it was just
created and has no previous version, run the following command:

.. prompt:: bash $
@ -168,10 +173,8 @@ To dump the OSD map, run the following command:
   ceph osd dump [--format {format}]

The ``--format`` option accepts the following arguments: ``plain`` (default),
``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON is
the recommended format for tools, scripting, and other forms of automation.

To dump the OSD map as a tree that lists one OSD per line and displays
information about the weights and states of the OSDs, run the following
@ -230,7 +233,7 @@ To mark an OSD as ``lost``, run the following command:
.. warning::

   This could result in permanent data loss. Use with caution!

To create a new OSD, run the following command:

.. prompt:: bash $
@ -287,47 +290,51 @@ following command:
   ceph osd in {osd-num}

By using the "pause flags" in the OSD map, you can pause or unpause I/O
requests. If the flags are set, then no I/O requests will be sent to any OSD.
When the flags are cleared, then pending I/O requests will be resent. To set or
clear pause flags, run one of the following commands:

.. prompt:: bash $

   ceph osd pause
   ceph osd unpause
You can assign an override or ``reweight`` weight value to a specific OSD if
the normal CRUSH distribution seems to be suboptimal. The weight of an OSD
helps determine the extent of its I/O requests and data storage: two OSDs with
the same weight will receive approximately the same number of I/O requests and
store approximately the same amount of data. The ``ceph osd reweight`` command
assigns an override weight to an OSD. The weight value is in the range 0 to 1,
and the command forces CRUSH to relocate a certain amount (1 - ``weight``) of
the data that would otherwise be on this OSD. The command does not change the
weights of the buckets above the OSD in the CRUSH map. Using the command is
merely a corrective measure: for example, if one of your OSDs is at 90% and the
others are at 50%, you could reduce the outlier weight to correct this
imbalance. To assign an override weight to a specific OSD, run the following
command:

.. prompt:: bash $

   ceph osd reweight {osd-num} {weight}
.. note:: Any assigned override reweight value will conflict with the balancer.
This means that if the balancer is in use, all override reweight values
should be ``1.0000`` in order to avoid suboptimal cluster behavior.
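For example, to return a hypothetical ``osd.7`` to the default override weight
before enabling the balancer, you might run:

.. prompt:: bash $

   ceph osd reweight 7 1.0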
A cluster's OSDs can be reweighted in order to maintain balance if some OSDs
are being disproportionately utilized. Note that override or ``reweight``
weights have values relative to one another that default to 1.00000; their
values are not absolute, and these weights must be distinguished from CRUSH
weights (which reflect the absolute capacity of a bucket, as measured in TiB).
To reweight OSDs by utilization, run the following command:

.. prompt:: bash $

   ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing]

By default, this command adjusts the override weight of OSDs that have ±20% of
the average utilization, but you can specify a different percentage in the
``threshold`` argument.

To limit the increment by which any OSD's reweight is to be changed, use the
@ -355,42 +362,38 @@ Operators of deployments that utilize Nautilus or newer (or later revisions of
Luminous and Mimic) and that have no pre-Luminous clients might likely instead
want to enable the ``balancer`` module for ``ceph-mgr``.

The blocklist can be modified by adding or removing an IP address or a CIDR
range. If an address is blocklisted, it will be unable to connect to any OSD.
If an OSD is contained within an IP address or CIDR range that has been
blocklisted, the OSD will be unable to perform operations on its peers when it
acts as a client: such blocked operations include tiering and copy-from
functionality. To add or remove an IP address or CIDR range to the blocklist,
run one of the following commands:

.. prompt:: bash $

   ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME]
   ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits]
If you add something to the blocklist with the above ``add`` command, you can
use the ``TIME`` keyword to specify the length of time (in seconds) that it
will remain on the blocklist (default: one hour). To add or remove a CIDR
range, use the ``range`` keyword in the above commands.

Note that these commands are useful primarily in failure testing. Under normal
conditions, blocklists are maintained automatically and do not need any manual
intervention.
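As a concrete illustration (the address and duration below are arbitrary
examples), the following commands blocklist a client address for ten minutes
and then list the current blocklist entries:

.. prompt:: bash $

   ceph osd blocklist add 192.168.1.123 600
   ceph osd blocklist ls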
To create or delete a snapshot of a specific storage pool, run one of the
following commands:
.. prompt:: bash $

   ceph osd pool mksnap {pool-name} {snap-name}
   ceph osd pool rmsnap {pool-name} {snap-name}

To create, delete, or rename a specific storage pool, run one of the following
commands:

.. prompt:: bash $
@ -398,20 +401,20 @@ Creates/deletes/renames a storage pool. :
   ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
   ceph osd pool rename {old-name} {new-name}

To change a pool setting, run the following command:

.. prompt:: bash $

   ceph osd pool set {pool-name} {field} {value}

The following are valid fields:

* ``size``: The number of copies of data in the pool.
* ``pg_num``: The PG number.
* ``pgp_num``: The effective number of PGs when calculating placement.
* ``crush_rule``: The rule number for mapping placement.
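For example, using a hypothetical pool named ``mypool``, the following command
sets the replica count of the pool to 3:

.. prompt:: bash $

   ceph osd pool set mypool size 3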
To retrieve the value of a pool setting, run the following command:

.. prompt:: bash $
@ -419,40 +422,43 @@ Get the value of a pool setting. :
Valid fields are:

* ``pg_num``: The PG number.
* ``pgp_num``: The effective number of PGs when calculating placement.
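For example, to read back the ``pg_num`` of the same hypothetical pool:

.. prompt:: bash $

   ceph osd pool get mypool pg_num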
To send a scrub command to a specific OSD, or to all OSDs (by using ``*``), run
the following command:

.. prompt:: bash $

   ceph osd scrub {osd-num}

To send a repair command to a specific OSD, or to all OSDs (by using ``*``),
run the following command:

.. prompt:: bash $

   ceph osd repair N
You can run a simple throughput benchmark test against a specific OSD. This
test writes a total size of ``TOTAL_DATA_BYTES`` (default: 1 GB) incrementally,
in multiple write requests that each have a size of ``BYTES_PER_WRITE``
(default: 4 MB). The test is not destructive and it will not overwrite existing
live OSD data, but it might temporarily affect the performance of clients that
are concurrently accessing the OSD. To launch this benchmark test, run the
following command:

.. prompt:: bash $

   ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
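For example, to write 100 MB in 4 MB chunks to a hypothetical ``osd.0`` (these
byte counts are illustrative only):

.. prompt:: bash $

   ceph tell osd.0 bench 104857600 4194304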
To clear the caches of a specific OSD during the interval between one benchmark
run and another, run the following command:

.. prompt:: bash $

   ceph tell osd.N cache drop

To retrieve the cache statistics of a specific OSD, run the following command:

.. prompt:: bash $
@ -461,7 +467,8 @@ To get the cache statistics of an OSD, use the 'cache status' command :
MDS Subsystem
=============

To change the configuration parameters of a running metadata server, run the
following command:

.. prompt:: bash $
@ -473,19 +480,20 @@ Example:
   ceph tell mds.0 config set debug_ms 1

To enable debug messages, run the following command:

.. prompt:: bash $

   ceph mds stat

To display the status of all metadata servers, run the following command:

.. prompt:: bash $

   ceph mds fail 0

To mark the active metadata server as failed (and to trigger failover to a
standby if a standby is present), run the following command:

.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
@ -493,157 +501,165 @@ Marks the active MDS as failed, triggering failover to a standby if present.
Mon Subsystem
=============

To display monitor statistics, run the following command:

.. prompt:: bash $

   ceph mon stat

This command returns output similar to the following:

::

   e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
There is a ``quorum`` list at the end of the output. It lists those monitor
nodes that are part of the current quorum.

To retrieve this information in a more direct way, run the following command:

.. prompt:: bash $

   ceph quorum_status -f json-pretty
This command returns output similar to the following:

.. code-block:: javascript

   {
       "election_epoch": 6,
       "quorum": [
           0,
           1,
           2
       ],
       "quorum_names": [
           "a",
           "b",
           "c"
       ],
       "quorum_leader_name": "a",
       "monmap": {
           "epoch": 2,
           "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
           "modified": "2016-12-26 14:42:09.288066",
           "created": "2016-12-26 14:42:03.573585",
           "features": {
               "persistent": [
                   "kraken"
               ],
               "optional": []
           },
           "mons": [
               {
                   "rank": 0,
                   "name": "a",
                   "addr": "127.0.0.1:40000\/0",
                   "public_addr": "127.0.0.1:40000\/0"
               },
               {
                   "rank": 1,
                   "name": "b",
                   "addr": "127.0.0.1:40001\/0",
                   "public_addr": "127.0.0.1:40001\/0"
               },
               {
                   "rank": 2,
                   "name": "c",
                   "addr": "127.0.0.1:40002\/0",
                   "public_addr": "127.0.0.1:40002\/0"
               }
           ]
       }
   }
The above will block until a quorum is reached.

To see the status of a specific monitor, run the following command:

.. prompt:: bash $

   ceph tell mon.[name] mon_status
Here the value of ``[name]`` can be found by consulting the output of the
``ceph quorum_status`` command. This command returns output similar to the
following:

::

   {
       "name": "b",
       "rank": 1,
       "state": "peon",
       "election_epoch": 6,
       "quorum": [
           0,
           1,
           2
       ],
       "features": {
           "required_con": "9025616074522624",
           "required_mon": [
               "kraken"
           ],
           "quorum_con": "1152921504336314367",
           "quorum_mon": [
               "kraken"
           ]
       },
       "outside_quorum": [],
       "extra_probe_peers": [],
       "sync_provider": [],
       "monmap": {
           "epoch": 2,
           "fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
           "modified": "2016-12-26 14:42:09.288066",
           "created": "2016-12-26 14:42:03.573585",
           "features": {
               "persistent": [
                   "kraken"
               ],
               "optional": []
           },
           "mons": [
               {
                   "rank": 0,
                   "name": "a",
                   "addr": "127.0.0.1:40000\/0",
                   "public_addr": "127.0.0.1:40000\/0"
               },
               {
                   "rank": 1,
                   "name": "b",
                   "addr": "127.0.0.1:40001\/0",
                   "public_addr": "127.0.0.1:40001\/0"
               },
               {
                   "rank": 2,
                   "name": "c",
                   "addr": "127.0.0.1:40002\/0",
                   "public_addr": "127.0.0.1:40002\/0"
               }
           ]
       }
   }
To see a dump of the monitor state, run the following command:
.. prompt:: bash $

   ceph mon dump

This command returns output similar to the following:

::

   dumped monmap epoch 2
   epoch 2
   fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
   last_changed 2016-12-26 14:42:09.288066
   created 2016-12-26 14:42:03.573585
   0: 127.0.0.1:40000/0 mon.a
   1: 127.0.0.1:40001/0 mon.b
   2: 127.0.0.1:40002/0 mon.c
@ -1043,6 +1043,8 @@ operations are served from the primary OSD of each PG. For erasure-coded pools,
however, the speed of read operations can be increased by enabling **fast
read** (see :ref:`pool-settings`).
.. _rados_ops_primary_affinity:
Primary Affinity
----------------
@ -110,6 +110,8 @@ To remove an erasure code profile::
If the profile is referenced by a pool, the deletion will fail.
.. warning:: Removing an erasure code profile using ``osd erasure-code-profile rm`` does not automatically delete the associated CRUSH rule associated with the erasure code profile. It is recommended to manually remove the associated CRUSH rule using ``ceph osd crush rule remove {rule-name}`` to avoid unexpected behavior.
osd erasure-code-profile get
============================
@ -1226,8 +1226,8 @@ The health check will be silenced for a specific pool only if
POOL_APP_NOT_ENABLED
____________________

A pool exists but the pool has not been tagged for use by a particular
application.

To resolve this issue, tag the pool for use by an application. For
example, if the pool is used by RBD, run the following command:
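As a hedged illustration (the pool name ``foo`` is hypothetical), tagging a
pool for use by RBD typically looks like this:

.. prompt:: bash $

   ceph osd pool application enable foo rbd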
@ -1406,6 +1406,31 @@ other performance issue with the OSDs.
The exact size of the snapshot trim queue is reported by the ``snaptrimq_len``
field of ``ceph pg ls -f json-detail``.
Stretch Mode
------------
INCORRECT_NUM_BUCKETS_STRETCH_MODE
__________________________________
Stretch mode currently supports only two dividing buckets that contain OSDs.
This warning indicates that the number of dividing buckets is not equal to two
after stretch mode has been enabled. You can expect unpredictable failures and
MON assertions until the condition is fixed. We encourage you to fix this by
removing the additional dividing buckets or by adjusting the number of dividing
buckets so that it is exactly two.
UNEVEN_WEIGHTS_STRETCH_MODE
___________________________
The two dividing buckets must have equal weights when stretch mode is enabled.
This warning indicates that the two dividing buckets have uneven weights after
stretch mode has been enabled. This is not immediately fatal; however, you can
expect Ceph to be confused when trying to process transitions between dividing
buckets. We encourage you to fix this by making the weights even on both
dividing buckets. This can be done by making sure that the combined weight of
the OSDs in each dividing bucket is the same.
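To inspect the current CRUSH weights of the dividing buckets and of the OSDs
that they contain, you can run, for example:

.. prompt:: bash $

   ceph osd tree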
Miscellaneous
-------------
@ -39,8 +39,9 @@ CRUSH algorithm.
   erasure-code
   cache-tiering
   placement-groups
   upmap
   read-balancer
   balancer
   crush-map
   crush-map-edits
   stretch-mode
@ -10,10 +10,11 @@ directly to specific OSDs. For this reason, tracking system faults
requires finding the `placement group`_ (PG) and the underlying OSDs at the
root of the problem.

.. tip:: A fault in one part of the cluster might prevent you from accessing a
   particular object, but that doesn't mean that you are prevented from
   accessing other objects. When you run into a fault, don't panic. Just
   follow the steps for monitoring your OSDs and placement groups, and then
   begin troubleshooting.

Ceph is self-repairing. However, when problems persist, monitoring OSDs and
placement groups will help you identify the problem.
@ -22,17 +23,20 @@ placement groups will help you identify the problem.
Monitoring OSDs
===============

An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is
either running and reachable (``up``), or it is not running and not reachable
(``down``).

If an OSD is ``up``, it may be either ``in`` service (clients can read and
write data) or ``out`` of service. If the OSD was ``in`` but then, due to a
failure or a manual action, was set to the ``out`` state, Ceph will migrate
placement groups to the other OSDs to maintain the configured redundancy.

If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
If an OSD is ``down``, it will also be ``out``.

.. note:: If an OSD is ``down`` and ``in``, there is a problem and this
   indicates that the cluster is not in a healthy state.
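To check the ``up``/``down`` and ``in``/``out`` status of the OSDs in your
cluster, you can run, for example:

.. prompt:: bash $

   ceph osd stat
   ceph osd tree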
.. ditaa::
@ -210,6 +210,11 @@ process. We recommend constraining each pool so that it belongs to only one
root (that is, one OSD class) to silence the warning and ensure a successful
scaling process.
.. _managing_bulk_flagged_pools:
Managing pools that are flagged with ``bulk``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
complement of PGs and then scales down the number of PGs only if the usage
ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
@ -659,6 +664,7 @@ In releases of Ceph that are Nautilus and later (inclusive), when the
``pg_num``. This process manifests as periods of remapping of PGs and of
backfill, and is expected behavior and normal.
.. _rados_ops_pgs_get_pg_num:
Get the Number of PGs
=====================
@ -46,12 +46,49 @@ operations. Do not create or manipulate pools with these names.
List Pools
==========
There are multiple ways to get the list of pools in your cluster.

To list just your cluster's pool names (good for scripting), execute:
.. prompt:: bash $
ceph osd pool ls
::
.rgw.root
default.rgw.log
default.rgw.control
default.rgw.meta
To list your cluster's pools with the pool number, run the following command:
.. prompt:: bash $

   ceph osd lspools
::
1 .rgw.root
2 default.rgw.log
3 default.rgw.control
4 default.rgw.meta
To list your cluster's pools with additional information, execute:
.. prompt:: bash $
ceph osd pool ls detail
::
pool 1 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00
pool 2 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 21 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00
pool 3 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 23 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00
pool 4 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 25 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 4.00
To get even more information, you can execute this command with the ``--format`` (or ``-f``) option and the ``json``, ``json-pretty``, ``xml`` or ``xml-pretty`` value.
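For example, to get the detailed listing in JSON (convenient for parsing with
an external tool), you might run:

.. prompt:: bash $

   ceph osd pool ls detail --format json-pretty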
.. _createpool:

Creating a Pool
@ -462,82 +499,6 @@ You may set values for the following keys:
:Type: Integer
:Valid Range: ``1`` sets flag, ``0`` unsets flag
.. _hit_set_type:
.. describe:: hit_set_type
:Description: Enables HitSet tracking for cache pools.
For additional information, see `Bloom Filter`_.
:Type: String
:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object``
:Default: ``bloom``. Other values are for testing.
.. _hit_set_count:
.. describe:: hit_set_count
:Description: Determines the number of HitSets to store for cache pools. The
higher the value, the more RAM is consumed by the ``ceph-osd``
daemon.
:Type: Integer
:Valid Range: ``1``. Agent doesn't handle > ``1`` yet.
.. _hit_set_period:
.. describe:: hit_set_period
:Description: Determines the duration of a HitSet period (in seconds) for
cache pools. The higher the value, the more RAM is consumed
by the ``ceph-osd`` daemon.
:Type: Integer
:Example: ``3600`` (3600 seconds: one hour)
.. _hit_set_fpp:
.. describe:: hit_set_fpp
:Description: Determines the probability of false positives for the
``bloom`` HitSet type. For additional information, see `Bloom
Filter`_.
:Type: Double
:Valid Range: ``0.0`` - ``1.0``
:Default: ``0.05``
.. _cache_target_dirty_ratio:
.. describe:: cache_target_dirty_ratio
:Description: Sets a flush threshold for the percentage of the cache pool
containing modified (dirty) objects. When this threshold is
reached, the cache-tiering agent will flush these objects to
the backing storage pool.
:Type: Double
:Default: ``.4``
.. _cache_target_dirty_high_ratio:
.. describe:: cache_target_dirty_high_ratio
:Description: Sets a flush threshold for the percentage of the cache pool
containing modified (dirty) objects. When this threshold is
reached, the cache-tiering agent will flush these objects to
the backing storage pool with a higher speed (as compared with
``cache_target_dirty_ratio``).
:Type: Double
:Default: ``.6``
.. _cache_target_full_ratio:
.. describe:: cache_target_full_ratio
:Description: Sets an eviction threshold for the percentage of the cache
pool containing unmodified (clean) objects. When this
threshold is reached, the cache-tiering agent will evict
these objects from the cache pool.
:Type: Double
:Default: ``.8``
.. _target_max_bytes:

.. describe:: target_max_bytes
@ -556,41 +517,6 @@ You may set values for the following keys:
:Type: Integer
:Example: ``1000000`` #1M objects
.. describe:: hit_set_grade_decay_rate
:Description: Sets the temperature decay rate between two successive
HitSets.
:Type: Integer
:Valid Range: 0 - 100
:Default: ``20``
.. describe:: hit_set_search_last_n
:Description: Count at most N appearances in HitSets. Used for temperature
calculation.
:Type: Integer
:Valid Range: 0 - hit_set_count
:Default: ``1``
.. _cache_min_flush_age:
.. describe:: cache_min_flush_age
:Description: Sets the time (in seconds) before the cache-tiering agent
flushes an object from the cache pool to the storage pool.
:Type: Integer
:Example: ``600`` (600 seconds: ten minutes)
.. _cache_min_evict_age:
.. describe:: cache_min_evict_age
:Description: Sets the time (in seconds) before the cache-tiering agent
evicts an object from the cache pool.
:Type: Integer
:Example: ``1800`` (1800 seconds: thirty minutes)
.. _fast_read:

.. describe:: fast_read
@ -702,56 +628,6 @@ You may get values from the following keys:
:Description: See crush_rule_.
``hit_set_type``
:Description: See hit_set_type_.
:Type: String
:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object``
``hit_set_count``
:Description: See hit_set_count_.
:Type: Integer
``hit_set_period``
:Description: See hit_set_period_.
:Type: Integer
``hit_set_fpp``
:Description: See hit_set_fpp_.
:Type: Double
``cache_target_dirty_ratio``
:Description: See cache_target_dirty_ratio_.
:Type: Double
``cache_target_dirty_high_ratio``
:Description: See cache_target_dirty_high_ratio_.
:Type: Double
``cache_target_full_ratio``
:Description: See cache_target_full_ratio_.
:Type: Double
``target_max_bytes``

:Description: See target_max_bytes_.
@ -766,20 +642,6 @@ You may get values from the following keys:
:Type: Integer
``cache_min_flush_age``
:Description: See cache_min_flush_age_.
:Type: Integer
``cache_min_evict_age``
:Description: See cache_min_evict_age_.
:Type: Integer
``fast_read``

:Description: See fast_read_.
@ -876,6 +738,10 @@ Ceph will list pools and highlight the ``replicated size`` attribute. By
default, Ceph creates two replicas of an object (a total of three copies, for a
size of ``3``).
Managing pools that are flagged with ``--bulk``
===============================================
See :ref:`managing_bulk_flagged_pools`.
.. _pgcalc: https://old.ceph.com/pgcalc/
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
@ -0,0 +1,64 @@
.. _read_balancer:
=======================================
Operating the Read (Primary) Balancer
=======================================
You might be wondering: How can I improve performance in my Ceph cluster?
One important data point you can check is the ``read_balance_score`` on each
of your replicated pools.
This metric, available via ``ceph osd pool ls detail`` (see :ref:`rados_pools`
for more details) indicates read performance, or how balanced the primaries are
for each replicated pool. In most cases, if a ``read_balance_score`` is above 1
(for instance, 1.5), this means that your pool has unbalanced primaries and that
you may want to try improving your read performance with the read balancer.
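For example, one quick way to check the score of each replicated pool (the
exact output format may vary by release) is:

.. prompt:: bash $

   ceph osd pool ls detail | grep read_balance_score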
Online Optimization
===================
At present, there is no online option for the read balancer. However, we plan to add
the read balancer as an option to the :ref:`balancer` in the next Ceph version
so it can be enabled to run automatically in the background like the upmap balancer.
Offline Optimization
====================
Primaries are updated with an offline optimizer that is built into the
:ref:`osdmaptool`.
#. Grab the latest copy of your osdmap:
.. prompt:: bash $
ceph osd getmap -o om
#. Run the optimizer:
.. prompt:: bash $
osdmaptool om --read out.txt --read-pool <pool name> [--vstart]
It is highly recommended that you run the capacity balancer before running the
read balancer to ensure optimal results. See :ref:`upmap` for details on how
to balance capacity in a cluster.
#. Apply the changes:
.. prompt:: bash $
source out.txt
In the above example, the proposed changes are written to the output file
``out.txt``. The commands in this procedure are normal Ceph CLI commands
that can be run in order to apply the changes to the cluster.
If you are working in a vstart cluster, you may pass the ``--vstart`` parameter
as shown above so the CLI commands are formatted with the `./bin/` prefix.
Note that any time the number of pgs changes (for instance, if the pg autoscaler [:ref:`pg-autoscaler`]
kicks in), you should consider rechecking the scores and rerunning the balancer if needed.
To see some details about what the tool is doing, you can pass
``--debug-osd 10`` to ``osdmaptool``. To see even more details, pass
``--debug-osd 20`` to ``osdmaptool``.
@ -1,7 +1,8 @@
.. _upmap:

=======================================
Using pg-upmap
=======================================
In Luminous v12.2.z and later releases, there is a *pg-upmap* exception table
in the OSDMap that allows the cluster to explicitly map specific PGs to
@ -11,6 +12,9 @@ in most cases, uniformly distribute PGs across OSDs.
However, there is an important caveat when it comes to this new feature: it
requires all clients to understand the new *pg-upmap* structure in the OSDMap.
Online Optimization
===================
Enabling
--------
@ -40,17 +44,17 @@ command:
   ceph features

Balancer Module
---------------

The `balancer` module for ``ceph-mgr`` will automatically balance the number of
PGs per OSD. See :ref:`balancer`

Offline Optimization
====================

Upmap entries are updated with an offline optimizer that is built into the
:ref:`osdmaptool`.

#. Grab the latest copy of your osdmap:
@ -2,12 +2,18 @@
The Ceph Community
====================
Ceph-users email list
=====================

The Ceph community is an excellent source of information and help. For
operational issues with Ceph we recommend that you `subscribe to the ceph-users
email list`_. When you no longer want to receive emails, you can `unsubscribe
from the ceph-users email list`_.

Ceph-devel email list
=====================

You can also `subscribe to the ceph-devel email list`_. You should do so if
your issue is:
- Likely related to a bug
@ -16,11 +22,14 @@ your issue is:
- Related to your own builds

If you no longer want to receive emails from the ``ceph-devel`` email list, you
can `unsubscribe from the ceph-devel email list`_.

Ceph report
===========

.. tip:: Community members can help you if you provide them with detailed
   information about your problem. Attach the output of the ``ceph report``
   command to help people understand your issues.
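For example, you might capture the report to a file that can then be attached
to a tracker issue or an email (the file name is arbitrary):

.. prompt:: bash $

   ceph report > ceph-report.json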
.. _subscribe to the ceph-devel email list: mailto:dev-join@ceph.io
.. _unsubscribe from the ceph-devel email list: mailto:dev-leave@ceph.io
@ -9,59 +9,72 @@ you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details.
Initializing oprofile
=====================

``oprofile`` must be initialized the first time it is used. Locate the
``vmlinux`` image that corresponds to the kernel you are running:

.. prompt:: bash $

   ls /boot
   sudo opcontrol --init
   sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6
Starting oprofile
=================

Run the following command to start ``oprofile``:

.. prompt:: bash $

   opcontrol --start
Stopping oprofile
=================

Run the following command to stop ``oprofile``:

.. prompt:: bash $

   opcontrol --stop
Retrieving oprofile Results
===========================

Run the following command to retrieve the top ``cmon`` results:

.. prompt:: bash $

   opreport -gal ./cmon | less

Run the following command to retrieve the top ``cmon`` results, with call
graphs attached:

.. prompt:: bash $

   opreport -cal ./cmon | less

.. important:: After you have reviewed the results, reset ``oprofile`` before
   running it again. The act of resetting ``oprofile`` removes data from the
   session directory.
Resetting oprofile
==================

Run the following command to reset ``oprofile``:

.. prompt:: bash $

   sudo opcontrol --reset

.. important:: Reset ``oprofile`` after analyzing data. This ensures that
   results from prior tests do not get mixed in with the results of the
   current test.
.. _oprofile: http://oprofile.sourceforge.net/about/
.. _Installing Oprofile: ../../../dev/cpu-profiler
@ -2,10 +2,10 @@
Troubleshooting
=================

You may encounter situations that require you to examine your configuration,
consult the documentation, modify your logging output, troubleshoot monitors
and OSDs, profile memory and CPU usage, and, in the last resort, reach out to
the Ceph community for help.
.. toctree::
   :maxdepth: 1
@ -2,16 +2,23 @@
Memory Profiling
==================

Ceph Monitor, OSD, and MDS can report ``TCMalloc`` heap profiles. Install
``google-perftools`` if you want to generate these. Your OS distribution might
package this under a different name (for example, ``gperftools``), and your OS
distribution might use a different package manager. Run a command similar to
this one to install ``google-perftools``:

.. prompt:: bash

   sudo apt-get install google-perftools

The profiler dumps output to your ``log file`` directory (``/var/log/ceph``).
See `Logging and Debugging`_ for details.

To view the profiler logs with Google's performance tools, run the following
command:

.. prompt:: bash

   google-pprof --text {path-to-daemon} {log-path/filename}
@ -48,9 +55,9 @@ For example::
   0.0 0.4% 99.2% 0.0 0.6% decode_message
   ...
Performing another heap dump on the same daemon creates another file. It is
convenient to compare the new file to a file created by a previous heap dump to
show what has grown in the interval. For example::

   $ google-pprof --text --base out/osd.0.profile.0001.heap \
     ceph-osd out/osd.0.profile.0003.heap
@ -60,112 +67,137 @@ in the interval. For instance::
   0.0 0.9% 97.7% 0.0 26.1% ReplicatedPG::do_op
   0.0 0.8% 98.5% 0.0 0.8% __gnu_cxx::new_allocator::allocate
See `Google Heap Profiler`_ for additional details.

After you have installed the heap profiler, start your cluster and begin using
the heap profiler. You can enable or disable the heap profiler at runtime, or
ensure that it runs continuously. When running commands based on the examples
that follow, do the following:

#. replace ``{daemon-type}`` with ``mon``, ``osd`` or ``mds``
#. replace ``{daemon-id}`` with the OSD number or the MON ID or the MDS ID
Starting the Profiler
---------------------

To start the heap profiler, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}.{daemon-id} heap start_profiler

For example:

.. prompt:: bash

   ceph tell osd.1 heap start_profiler

Alternatively, if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in the
environment, the profile will be started when the daemon starts running.
Printing Stats
--------------

To print out statistics, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}.{daemon-id} heap stats

For example:

.. prompt:: bash

   ceph tell osd.0 heap stats

.. note:: The reporting of stats with this command does not require the
   profiler to be running and does not dump the heap allocation information to
   a file.
Dumping Heap Information
------------------------

To dump heap information, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}.{daemon-id} heap dump

For example:

.. prompt:: bash

   ceph tell mds.a heap dump

.. note:: Dumping heap information works only when the profiler is running.
Releasing Memory
----------------

To release memory that ``tcmalloc`` has allocated but which is not being used
by the Ceph daemon itself, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}{daemon-id} heap release

For example:

.. prompt:: bash

   ceph tell osd.2 heap release
Stopping the Profiler
---------------------

To stop the heap profiler, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}.{daemon-id} heap stop_profiler

For example:

.. prompt:: bash

   ceph tell osd.0 heap stop_profiler
.. _Logging and Debugging: ../log-and-debug .. _Logging and Debugging: ../log-and-debug
.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html .. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html
Alternative Methods of Memory Profiling
----------------------------------------

Running Massif heap profiler with Valgrind
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Massif heap profiler tool can be used with Valgrind to measure how much
heap memory is used. This method is well-suited to troubleshooting RadosGW.

See the `Massif documentation
<https://valgrind.org/docs/manual/ms-manual.html>`_ for more information.

Install Valgrind from the package manager for your distribution, then start the
Ceph daemon you want to troubleshoot:

.. prompt:: bash

   sudo -u ceph valgrind --max-threads=1024 --tool=massif /usr/bin/radosgw -f --cluster ceph --name NAME --setuser ceph --setgroup ceph

When this command has completed its run, a file with a name of the form
``massif.out.<pid>`` will be saved in your current working directory. To run
the command above, the user who runs it must have write permissions in the
current directory.

Run the ``ms_print`` command to get a graph and statistics from the collected
data in the ``massif.out.<pid>`` file:

.. prompt:: bash

   ms_print massif.out.12345

The output of this command is helpful when submitting a bug report.

View File

@ -6,70 +6,78 @@
.. index:: monitor, high availability

Even if a cluster experiences monitor-related problems, the cluster is not
necessarily in danger of going down. If a cluster has lost multiple monitors,
it can still remain up and running as long as there are enough surviving
monitors to form a quorum.

If your cluster is having monitor-related problems, we recommend that you
consult the following troubleshooting information.

Initial Troubleshooting
=======================
The first steps in the process of troubleshooting Ceph Monitors involve making
sure that the Monitors are running and that they are able to communicate with
the network and on the network. Follow the steps in this section to rule out
the simplest causes of Monitor malfunction.

#. **Make sure that the Monitors are running.**

   Make sure that the Monitor (*mon*) daemon processes (``ceph-mon``) are
   running. It might be the case that the mons have not been restarted after
   an upgrade. Checking for this simple oversight can save hours of
   painstaking troubleshooting.

   It is also important to make sure that the manager daemons (``ceph-mgr``)
   are running. Remember that typical cluster configurations provide one
   Manager (``ceph-mgr``) for each Monitor (``ceph-mon``).

   .. note:: In releases prior to v1.12.5, Rook will not run more than two
      managers.

#. **Make sure that you can reach the Monitor nodes.**

   In certain rare cases, ``iptables`` rules might be blocking access to
   Monitor nodes or TCP ports. These rules might be left over from earlier
   stress testing or rule development. To check for the presence of such
   rules, SSH into each Monitor node and use ``telnet`` or ``nc`` or a similar
   tool to attempt to connect to each of the other Monitor nodes on ports
   ``tcp/3300`` and ``tcp/6789``.

#. **Make sure that the "ceph status" command runs and receives a reply from the cluster.**

   If the ``ceph status`` command receives a reply from the cluster, then the
   cluster is up and running. Monitors answer to a ``status`` request only if
   there is a formed quorum. Confirm that one or more ``mgr`` daemons are
   reported as running. In a cluster with no deficiencies, ``ceph status``
   will report that all ``mgr`` daemons are running.

   If the ``ceph status`` command does not receive a reply from the cluster,
   then there are probably not enough Monitors ``up`` to form a quorum. If the
   ``ceph -s`` command is run with no further options specified, it connects
   to an arbitrarily selected Monitor. In certain cases, however, it might be
   helpful to connect to a specific Monitor (or to several specific Monitors
   in sequence) by adding the ``-m`` flag to the command: for example, ``ceph
   status -m mymon1``.

#. **None of this worked. What now?**

   If the above solutions have not resolved your problems, you might find it
   helpful to examine each individual Monitor in turn. Even if no quorum has
   been formed, it is possible to contact each Monitor individually and
   request its status by using the ``ceph tell mon.ID mon_status`` command
   (here ``ID`` is the Monitor's identifier).

   Run the ``ceph tell mon.ID mon_status`` command for each Monitor in the
   cluster. For more on this command's output, see :ref:`Understanding
   mon_status
   <rados_troubleshoting_troubleshooting_mon_understanding_mon_status>`.

   There is also an alternative method for contacting each individual Monitor:
   SSH into each Monitor node and query the daemon's admin socket. See
   :ref:`Using the Monitor's Admin
   Socket<rados_troubleshoting_troubleshooting_mon_using_admin_socket>`.
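
For illustration, the connectivity check described in step 2 above might look
like the following (a sketch; ``mon-host-2`` is a placeholder for one of your
Monitor host names, not a value taken from your cluster):

.. prompt:: bash

   nc -zv mon-host-2 3300
   nc -zv mon-host-2 6789

A successful connection shows only that the TCP port is reachable; it does not
by itself confirm that the Monitor on that node is healthy.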
.. _rados_troubleshoting_troubleshooting_mon_using_admin_socket:
@ -175,106 +183,136 @@ the quorum is formed by only two monitors, and *c* is in the quorum as a
``IP:PORT`` combination, the **lower** the rank. In this case, because
``127.0.0.1:6789`` is lower than the other two ``IP:PORT`` combinations,
``mon.a`` has the highest rank: namely, rank 0.

Most Common Monitor Issues
==========================

The Cluster Has Quorum but at Least One Monitor is Down
--------------------------------------------------------
When the cluster has quorum but at least one monitor is down, ``ceph health
detail`` returns a message similar to the following::

    $ ceph health detail
    [snip]
    mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

**How do I troubleshoot a Ceph cluster that has quorum but also has at least one monitor down?**

#. Make sure that ``mon.a`` is running.

#. Make sure that you can connect to ``mon.a``'s node from the
   other Monitor nodes. Check the TCP ports as well. Check ``iptables`` and
   ``nf_conntrack`` on all nodes and make sure that you are not
   dropping/rejecting connections.
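
As an illustration, one way to look for reject/drop rules and for conntrack
pressure on a node is the following (a sketch; it assumes root access and that
the ``nf_conntrack`` module is loaded):

.. prompt:: bash

   sudo iptables -L -n -v | grep -i -E 'reject|drop'
   sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max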
If this initial troubleshooting doesn't solve your problem, then further
investigation is necessary.

First, check the problematic monitor's ``mon_status`` via the admin
socket as explained in `Using the monitor's admin socket`_ and
`Understanding mon_status`_.

If the Monitor is out of the quorum, then its state will be one of the
following: ``probing``, ``electing`` or ``synchronizing``. If the state of
the Monitor is ``leader`` or ``peon``, then the Monitor believes itself to be
in quorum but the rest of the cluster believes that it is not in quorum. It
is possible that a Monitor that is in one of the ``probing``, ``electing``,
or ``synchronizing`` states has entered the quorum during the process of
troubleshooting. Check ``ceph status`` again to determine whether the Monitor
has entered quorum during your troubleshooting. If the Monitor remains out of
the quorum, then proceed with the investigations described in this section of
the documentation.
**What does it mean when a Monitor's state is ``probing``?**

If ``ceph health detail`` shows that a Monitor's state is
``probing``, then the Monitor is still looking for the other Monitors. Every
Monitor remains in this state for some time when it is started. When a
Monitor has connected to the other Monitors specified in the ``monmap``, it
ceases to be in the ``probing`` state. The amount of time that a Monitor is
in the ``probing`` state depends upon the parameters of the cluster of which
it is a part. For example, when a Monitor is a part of a single-monitor
cluster (never do this in production), the monitor passes through the probing
state almost instantaneously. In a multi-monitor cluster, the Monitors stay
in the ``probing`` state until they find enough monitors to form a quorum
|---| this means that if two out of three Monitors in the cluster are
``down``, the one remaining Monitor stays in the ``probing`` state
indefinitely until you bring one of the other monitors up.

If quorum has been established, then the Monitor daemon should be able to
find the other Monitors quickly, as long as they can be reached. If a Monitor
is stuck in the ``probing`` state and you have exhausted the procedures above
that describe the troubleshooting of communications between the Monitors,
then it is possible that the problem Monitor is trying to reach the other
Monitors at a wrong address. ``mon_status`` outputs the ``monmap`` that is
known to the monitor: determine whether the other Monitors' locations as
specified in the ``monmap`` match the locations of the Monitors in the
network. If they do not, see `Recovering a Monitor's Broken monmap`_.
If the locations of the Monitors as specified in the ``monmap`` match the
locations of the Monitors in the network, then the persistent
``probing`` state could be related to severe clock skews amongst the monitor
nodes. See `Clock Skews`_. If the information in `Clock Skews`_ does not
bring the Monitor out of the ``probing`` state, then prepare your system logs
and ask the Ceph community for help. See `Preparing your logs`_ for
information about the proper preparation of logs.
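
As an illustration, one way to pull the ``monmap`` section out of
``mon_status`` for comparison against the network is the following (a sketch;
it assumes that ``jq`` is installed, and the exact JSON layout can vary
between releases):

.. prompt:: bash

   ceph tell mon.ID mon_status | jq '.monmap.mons'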
**What does it mean when a Monitor's state is ``electing``?**

If ``ceph health detail`` shows that a Monitor's state is ``electing``, the
monitor is in the middle of an election. Elections typically complete
quickly, but sometimes the monitors can get stuck in what is known as an
*election storm*. See :ref:`Monitor Elections <dev_mon_elections>` for more
on monitor elections.

The presence of an election storm might indicate clock skew among the monitor
nodes. See `Clock Skews`_ for more information.

If your clocks are properly synchronized, search the mailing lists and bug
tracker for issues similar to your issue. The ``electing`` state is not
likely to persist. In versions of Ceph after the release of Cuttlefish, there
is no obvious reason other than clock skew that explains why an ``electing``
state would persist.

It is possible to investigate the cause of a persistent ``electing`` state if
you put the problematic Monitor into a ``down`` state while you investigate.
This is possible only if there are enough surviving Monitors to form quorum.
**What does it mean when a Monitor's state is ``synchronizing``?**

If ``ceph health detail`` shows that the Monitor is ``synchronizing``, the
monitor is catching up with the rest of the cluster so that it can join the
quorum. The amount of time that it takes for the Monitor to synchronize with
the rest of the quorum is a function of the size of the cluster's monitor
store, the cluster's size, and the state of the cluster. Larger and degraded
clusters generally keep Monitors in the ``synchronizing`` state longer than
do smaller, new clusters.

A Monitor that changes its state from ``synchronizing`` to ``electing`` and
then back to ``synchronizing`` indicates a problem: the cluster state may be
advancing (that is, generating new maps) too fast for the synchronization
process to keep up with the pace of the creation of the new maps. This issue
presented more frequently prior to the Cuttlefish release than it does in
more recent releases, because the synchronization process has since been
refactored and enhanced to avoid this dynamic. If you experience this in
later versions, report the issue in the `Ceph bug tracker
<https://tracker.ceph.com>`_. Prepare and provide logs to substantiate any
bug you raise. See `Preparing your logs`_ for information about the proper
preparation of logs.
**What does it mean when a Monitor's state is ``leader`` or ``peon``?**

If ``ceph health detail`` shows that the Monitor is in the ``leader`` state
or in the ``peon`` state, it is likely that clock skew is present. Follow the
instructions in `Clock Skews`_. If you have followed those instructions and
``ceph health detail`` still shows that the Monitor is in the ``leader``
state or the ``peon`` state, report the issue in the `Ceph bug tracker
<https://tracker.ceph.com>`_. If you raise an issue, provide logs to
substantiate it. See `Preparing your logs`_ for information about the
proper preparation of logs.
Recovering a Monitor's Broken ``monmap``
@ -317,18 +355,21 @@ Scrap the monitor and redeploy
Inject a monmap into the monitor

Retrieve the ``monmap`` from the surviving monitors and inject it into the
monitor whose ``monmap`` is corrupted or lost.

Implement this solution by carrying out the following procedure:

1. Is there a quorum of monitors? If so, retrieve the ``monmap`` from the
   quorum::

       $ ceph mon getmap -o /tmp/monmap

2. If there is no quorum, then retrieve the ``monmap`` directly from another
   monitor that has been stopped (in this example, the other monitor has
   the ID ``ID-FOO``)::

       $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap
@ -340,97 +381,105 @@ Inject a monmap into the monitor
5. Start the monitor

.. warning:: Injecting ``monmaps`` can cause serious problems because doing
   so will overwrite the latest existing ``monmap`` stored on the monitor. Be
   careful!
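
For reference, the injection step itself is typically performed with the
``--inject-monmap`` flag while the target monitor is stopped; a minimal
sketch, using the placeholder monitor ID ``ID-BAR``:

.. prompt:: bash

   ceph-mon -i ID-BAR --inject-monmap /tmp/monmap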
Clock Skews
-----------

The Paxos consensus algorithm requires close time synchronization, which means
that clock skew among the monitors in the quorum can have a serious effect on
monitor operation. The resulting behavior can be puzzling. To avoid this issue,
run a clock synchronization tool on your monitor nodes: for example, use
``Chrony`` or the legacy ``ntpd`` utility. Configure each monitor node so that
the ``iburst`` option is in effect and so that each monitor has multiple peers,
including the following:

* Each other
* Internal ``NTP`` servers
* Multiple external, public pool servers

.. note:: The ``iburst`` option sends a burst of eight packets instead of the
   usual single packet, and is used during the process of getting two peers
   into initial synchronization.

Furthermore, it is advisable to synchronize *all* nodes in your cluster against
internal and external servers, and perhaps even against your monitors. Run
``NTP`` servers on bare metal: VM-virtualized clocks are not suitable for
steady timekeeping. See `https://www.ntp.org <https://www.ntp.org>`_ for more
information about the Network Time Protocol (NTP). Your organization might
already have quality internal ``NTP`` servers available. Sources for ``NTP``
server appliances include the following:

* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_
* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_
* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_
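
As an illustration, a minimal ``chrony`` configuration that follows the
peering advice above might look like the following (a sketch; the host names
and the public pool are placeholders, not recommendations)::

    # /etc/chrony.conf fragment (placeholder sources)
    server ntp1.example.internal iburst
    server ntp2.example.internal iburst
    pool 2.pool.ntp.org iburst
    # peer with the other monitor nodes
    peer mon-host-2 iburst
    peer mon-host-3 iburst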
Clock Skew Questions and Answers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**What's the maximum tolerated clock skew?**

By default, monitors allow clocks to drift up to a maximum of 0.05 seconds
(50 milliseconds).

**Can I increase the maximum tolerated clock skew?**

Yes, but we strongly recommend against doing so. The maximum tolerated clock
skew is configurable via the ``mon-clock-drift-allowed`` option, but it is
almost certainly a bad idea to make changes to this option. The clock skew
maximum is in place because clock-skewed monitors cannot be relied upon. The
current default value has proven its worth at alerting the user before the
monitors encounter serious problems. Changing this value might cause
unforeseen effects on the stability of the monitors and overall cluster
health.

**How do I know whether there is a clock skew?**
The monitors will warn you via the cluster status ``HEALTH_WARN``. When clock
skew is present, the ``ceph health detail`` and ``ceph status`` commands
return an output resembling the following::

    mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

In this example, the monitor ``mon.c`` has been flagged as suffering from
clock skew.

In Luminous and later releases, it is possible to check for a clock skew by
running the ``ceph time-sync-status`` command. Note that the lead monitor
typically has the numerically lowest IP address. It will always show ``0``:
the reported offsets of other monitors are relative to the lead monitor, not
to any external reference source.
**What should I do if there is a clock skew?**

Synchronize your clocks. Using an NTP client might help. However, if you
are already using an NTP client and you still encounter clock skew problems,
determine whether the NTP server that you are using is remote to your network
or instead hosted on your network. Hosting your own NTP servers tends to
mitigate clock skew problems.
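
If ``chrony`` is the client in use, a quick way to verify that a monitor node
is actually synchronized is the following (a sketch; the rough equivalent for
the legacy ``ntpd`` would be ``ntpq -p``):

.. prompt:: bash

   chronyc tracking
   chronyc sources -v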
Client Can't Connect or Mount
-----------------------------

Check your IP tables. Some operating-system install utilities add a ``REJECT``
rule to ``iptables``. Such a rule rejects all clients other than ``ssh`` that
try to connect to the host. If your monitor host's IP tables have a ``REJECT``
rule in place, clients that are connecting from a separate node will fail and
will raise a timeout error. Any ``iptables`` rules that reject clients trying
to connect to Ceph daemons must be addressed. For example::

    REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

It might also be necessary to add rules to iptables on your Ceph hosts to
ensure that clients are able to access the TCP ports associated with your Ceph
monitors (default: port 6789) and Ceph OSDs (default: 6800 through 7300). For
example::

    iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT
Monitor Store Failures
======================
@ -438,9 +487,9 @@ Monitor Store Failures
Symptoms of store corruption
----------------------------

Ceph monitors store the :term:`Cluster Map` in a key-value store. If key-value
store corruption causes a monitor to fail, then the monitor log might contain
one of the following error messages::

    Corruption: error in middle of record
@ -451,21 +500,26 @@ or::
Recovery using healthy monitor(s)
---------------------------------

If there are surviving monitors, we can always :ref:`replace
<adding-and-removing-monitors>` the corrupted monitor with a new one. After the
new monitor boots, it will synchronize with a healthy peer. After the new
monitor is fully synchronized, it will be able to serve clients.

.. _mon-store-recovery-using-osds:
Recovery using OSDs
-------------------

Even if all monitors fail at the same time, it is possible to recover the
monitor store by using information stored in OSDs. You are encouraged to deploy
at least three (and preferably five) monitors in a Ceph cluster. In such a
deployment, complete monitor failure is unlikely. However, unplanned power loss
in a data center whose disk settings or filesystem settings are improperly
configured could cause the underlying filesystem to fail and this could kill
all of the monitors. In such a case, data in the OSDs can be used to recover
the monitors, as in the following script:
.. code-block:: bash
@ -516,124 +570,142 @@ information stored in OSDs.
   mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
   chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db
This script performs the following steps:

#. Collects the map from each OSD host.
#. Rebuilds the store.
#. Fills the entities in the keyring file with appropriate capabilities.
#. Replaces the corrupted store on ``mon.foo`` with the recovered copy.
Known limitations
~~~~~~~~~~~~~~~~~

The above recovery tool is unable to recover the following information:

- **Certain added keyrings**: All of the OSD keyrings added using the ``ceph
  auth add`` command are recovered from the OSD's copy, and the
  ``client.admin`` keyring is imported using ``ceph-monstore-tool``. However,
  the MDS keyrings and all other keyrings will be missing in the recovered
  monitor store. You might need to manually re-add them.

- **Creating pools**: If any RADOS pools were in the process of being created,
  that state is lost. The recovery tool operates on the assumption that all
  pools have already been created. If there are PGs that are stuck in the
  'unknown' state after the recovery for a partially created pool, you can
  force creation of the *empty* PG by running the ``ceph osd force-create-pg``
  command. Note that this will create an *empty* PG, so take this action only
  if you know the pool is empty.

- **MDS Maps**: The MDS maps are lost.
Everything Failed! Now What?
============================

Reaching out for help
---------------------

You can find help on IRC in #ceph and #ceph-devel on OFTC (server
irc.oftc.net), or at ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make
sure that you have prepared your logs and that you have them ready upon
request.

See https://ceph.io/en/community/connect/ for current (as of October 2023)
information on getting in contact with the upstream Ceph community.
Preparing your logs
-------------------

The default location for monitor logs is ``/var/log/ceph/ceph-mon.FOO.log*``.
However, if they are not there, you can find their current location by running
the following command:

.. prompt:: bash

   ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs is determined by the debug levels in the
cluster's configuration files. If Ceph is using the default debug levels, then
your logs might be missing important information that would help the upstream
Ceph community address your issue.

To make sure your monitor logs contain relevant information, you can raise
debug levels. Here we are interested in information from the monitors. As with
other components, the monitors have different parts that output their debug
information on different subsystems.

If you are an experienced Ceph troubleshooter, we recommend raising the debug
levels of the most relevant subsystems. Of course, this approach might not be
easy for beginners. In most cases, however, enough information to address the
issue will be secured if the following debug levels are entered::

    debug_mon = 10
    debug_ms = 1
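
If you prefer to set these levels in the cluster configuration file rather
than at runtime, they would typically be placed in the ``[mon]`` section; a
minimal sketch (the file path assumes a default installation)::

    # /etc/ceph/ceph.conf (fragment)
    [mon]
        debug_mon = 10
        debug_ms = 1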
Sometimes these debug levels do not yield enough information. In such cases,
members of the upstream Ceph community might ask you to make additional changes
to these or to other debug levels. In any case, it is better for us to receive
at least some useful information than to receive an empty log.
Do I need to restart a monitor to adjust debug levels?
------------------------------------------------------

No, restarting a monitor is not necessary. Debug levels may be adjusted by
using two different methods, depending on whether or not there is a quorum:

There is a quorum
  Either inject the debug option into the specific monitor that needs to
  be debugged::

      ceph tell mon.FOO config set debug_mon 10/10

  Or inject it into all monitors at once::

      ceph tell mon.* config set debug_mon 10/10

There is no quorum
  Use the admin socket of the specific monitor that needs to be debugged
  and directly adjust the monitor's configuration options::

      ceph daemon mon.FOO config set debug_mon 10/10

To return the debug levels to their default values, run the above commands
using the debug level ``1/10`` rather than ``10/10``. To check a monitor's
current values, use the admin socket and run either of the following commands:

.. prompt:: bash

   ceph daemon mon.FOO config show

or:

.. prompt:: bash

   ceph daemon mon.FOO config get 'OPTION_NAME'
Reproduced the problem with appropriate debug levels. Now what?
-----------------------------------------------------------------

We prefer that you send us only the portions of your logs that are relevant to
your monitor problems. Of course, it might not be easy for you to determine
which portions are relevant, so we are willing to accept complete and
unabridged logs. However, we request that you avoid sending logs containing
hundreds of thousands of lines with no additional clarifying information. One
common-sense way of making our task easier is to write down the current time
and date when you are reproducing the problem and then extract portions of your
logs based on that information.
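
As an illustration, on a systemd-based deployment where the monitor also logs
to the journal, one way to extract such a time window might be the following
(a sketch; the unit name and timestamps are placeholders, and containerized or
cephadm deployments use different unit names):

.. prompt:: bash

   journalctl -u ceph-mon@FOO --since "2023-12-19 09:00" --until "2023-12-19 09:30" > mon-foo-window.log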
Finally, reach out to us on the mailing lists or IRC or Slack, or by filing a
new issue on the `tracker`_.
.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new

File diff suppressed because it is too large

View File

@ -1,120 +1,128 @@
===================
Troubleshooting PGs
===================

Placement Groups Never Get Clean
================================

If, after you have created your cluster, any Placement Groups (PGs) remain in
the ``active`` status, the ``active+remapped`` status or the
``active+degraded`` status and never achieve an ``active+clean`` status, you
likely have a problem with your configuration.

In such a situation, it may be necessary to review the settings in the `Pool,
PG and CRUSH Config Reference`_ and make appropriate adjustments.

As a general rule, run your cluster with more than one OSD and a pool size
greater than two object replicas.
.. _one-node-cluster:

One Node Cluster
----------------

Ceph no longer provides documentation for operating on a single node. Systems
designed for distributed computing by definition do not run on a single node.
The mounting of client kernel modules on a single node that contains a Ceph
daemon may cause a deadlock due to issues with the Linux kernel itself (unless
VMs are used as clients). You can experiment with Ceph in a one-node
configuration, in spite of the limitations as described herein.

To create a cluster on a single node, you must change the
``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
file before you create your monitors and OSDs. This tells Ceph that an OSD is
permitted to peer with another OSD on the same host. If you are trying to set
up a single-node cluster and ``osd_crush_chooseleaf_type`` is greater than
``0``, Ceph will attempt to place the PGs of one OSD with the PGs of another
OSD on another node, chassis, rack, row, or datacenter depending on the
setting.
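
For example, a configuration-file fragment for such a single-node test cluster
might look like the following (a sketch; for experimentation only, never for
production)::

    [global]
        osd_crush_chooseleaf_type = 0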
.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph
   Storage Cluster. Kernel conflicts can arise. However, you can mount kernel
   clients within virtual machines (VMs) on a single node.

If you are creating OSDs using a single disk, you must manually create
directories for the data first.
Fewer OSDs than Replicas
------------------------

If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not
in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set
to greater than ``2``.

There are a few ways to address this situation. If you want to operate your
cluster in an ``active + degraded`` state with two replicas, you can set the
``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
``active + degraded`` state. You may also set the ``osd_pool_default_size``
setting to ``2`` so that you have only two stored replicas (the original and
one replica). In such a case, the cluster should achieve an ``active + clean``
state.

.. note:: You can make the changes while the cluster is running. If you make
   the changes in your Ceph configuration file, you might need to restart your
   cluster.
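
For example, on recent releases these defaults can be changed at runtime
through the centralized configuration store (a sketch; note that these options
affect only pools created afterwards, and an existing pool would be changed
with ``ceph osd pool set <pool-name> size``):

.. prompt:: bash

   ceph config set global osd_pool_default_size 2
   ceph config set global osd_pool_default_min_size 2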
Pool Size = 1
-------------

If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
of the object. OSDs rely on other OSDs to tell them which objects they should
have. If one OSD has a copy of an object and there is no second copy, then
there is no second OSD to tell the first OSD that it should have that copy. For
each placement group mapped to the first OSD (see ``ceph pg dump``), you can
force the first OSD to notice the placement groups it needs by running a
command of the following form:

.. prompt:: bash

   ceph osd force-create-pg <pgid>
CRUSH Map Errors
----------------

If any placement groups in your cluster are unclean, then there might be errors
in your CRUSH map.
Stuck Placement Groups
======================

It is normal for placement groups to enter "degraded" or "peering" states after
a component failure. Normally, these states reflect the expected progression
through the failure recovery process. However, a placement group that stays in
one of these states for a long time might be an indication of a larger problem.
For this reason, the Ceph Monitors will warn when placement groups get "stuck"
in a non-optimal state. Specifically, we check for:

* ``inactive`` - The placement group has not been ``active`` for too long (that
  is, it hasn't been able to service read/write requests).

* ``unclean`` - The placement group has not been ``clean`` for too long (that
  is, it hasn't been able to completely recover from a previous failure).

* ``stale`` - The placement group status has not been updated by a
  ``ceph-osd``. This indicates that all nodes storing this placement group may
  be ``down``.

List stuck placement groups by running one of the following commands:

.. prompt:: bash

   ceph pg dump_stuck stale
   ceph pg dump_stuck inactive
   ceph pg dump_stuck unclean

- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd``
  daemons are not running.
- Stuck ``inactive`` placement groups usually indicate a peering problem (see
  :ref:`failures-osd-peering`).
- Stuck ``unclean`` placement groups usually indicate that something is
  preventing recovery from completing, possibly unfound objects (see
  :ref:`failures-osd-unfound`).
@ -123,21 +131,28 @@ recovery from completing, like unfound objects (see
Placement Group Down - Peering Failure
======================================

In certain cases, the ``ceph-osd`` `peering` process can run into problems,
which can prevent a PG from becoming active and usable. In such a case, running
the command ``ceph health detail`` will report something similar to the following:

.. prompt:: bash

   ceph health detail

::

    HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
    ...
    pg 0.5 is down+peering
    pg 1.4 is down+peering
    ...
    osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form:

.. prompt:: bash

   ceph pg 0.5 query
.. code-block:: javascript
@ -164,21 +179,24 @@ We can query the cluster to determine exactly why the PG is marked ``down`` with
    ]
}
The ``recovery_state`` section tells us that peering is blocked due to down
``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that
particular ``ceph-osd`` and recovery will proceed.

Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
there has been a disk failure), the cluster can be informed that the OSD is
``lost`` and the cluster can be instructed that it must cope as best it can.

.. important:: Informing the cluster that an OSD has been lost is dangerous
   because the cluster cannot guarantee that the other copies of the data are
   consistent and up to date.

To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
anyway, run a command of the following form:

.. prompt:: bash

   ceph osd lost 1

Recovery will proceed.
@ -188,32 +206,43 @@ Recovery will proceed.
Unfound Objects
===============

Under certain combinations of failures, Ceph may complain about ``unfound``
objects, as in this example:

.. prompt:: bash

   ceph health detail

::

    HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
    pg 2.4 is active+degraded, 78 unfound

This means that the storage cluster knows that some objects (or newer copies of
existing objects) exist, but it hasn't found copies of them. Here is an
example of how this might come about for a PG whose data is on two OSDs, which
we will call "1" and "2":

* 1 goes down
* 2 handles some writes, alone
* 1 comes up
* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.

At this point, 1 knows that these objects exist, but there is no live
``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
will block, and the cluster will hope that the failed node comes back soon.
This is assumed to be preferable to returning an IO error to the user.

.. note:: The situation described immediately above is one reason that setting
   ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks
   data loss.

Identify which objects are unfound by running a command of the following form:

.. prompt:: bash

   ceph pg 2.4 list_unfound [starting offset, in json]

.. code-block:: javascript
@ -252,22 +281,24 @@ First, you can identify which objects are unfound with::
"more": false "more": false
} }
If there are too many objects to list in a single result, the ``more`` If there are too many objects to list in a single result, the ``more`` field
field will be true and you can query for more. (Eventually the will be true and you can query for more. (Eventually the command line tool
command line tool will hide this from you, but not yet.) will hide this from you, but not yet.)
Second, you can identify which OSDs have been probed or might contain Now you can identify which OSDs have been probed or might contain data.
data.
At the end of the listing (before ``more`` is false), ``might_have_unfound`` is provided At the end of the listing (before ``more: false``), ``might_have_unfound`` is
when ``available_might_have_unfound`` is true. This is equivalent to the output provided when ``available_might_have_unfound`` is true. This is equivalent to
of ``ceph pg #.# query``. This eliminates the need to use ``query`` directly. the output of ``ceph pg #.# query``. This eliminates the need to use ``query``
The ``might_have_unfound`` information given behaves the same way as described below for ``query``. directly. The ``might_have_unfound`` information given behaves the same way as
The only difference is that OSDs that have ``already probed`` status are ignored. that ``query`` does, which is described below. The only difference is that
OSDs that have the status of ``already probed`` are ignored.
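For example, to pull out just that array without paging through the rest of
the listing, something along these lines works, assuming ``jq`` is installed
(the PG id ``2.4`` is the one used in the example above):

.. prompt:: bash

   ceph pg 2.4 list_unfound | jq '.might_have_unfound'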
Use of ``query``:

.. prompt:: bash

   ceph pg 2.4 query

.. code-block:: javascript
@ -278,8 +309,8 @@ Use of ``query``::
{ "osd": 1, { "osd": 1,
"status": "osd is down"}]}, "status": "osd is down"}]},
In this case, for example, the cluster knows that ``osd.1`` might have In this case, the cluster knows that ``osd.1`` might have data, but it is
data, but it is ``down``. The full range of possible states include: ``down``. Here is the full range of possible states:
* already probed * already probed
* querying * querying
@ -289,106 +320,135 @@ data, but it is ``down``. The full range of possible states include:
Sometimes it simply takes some time for the cluster to query possible
locations.

It is possible that there are other locations where the object might exist that
are not listed. For example: if an OSD is stopped and taken out of the cluster
and then the cluster fully recovers, and then through a subsequent set of
failures the cluster ends up with an unfound object, the cluster will ignore
the removed OSD. (This scenario, however, is unlikely.)

If all possible locations have been queried and objects are still lost, you may
have to give up on the lost objects. This, again, is possible only when unusual
combinations of failures have occurred that allow the cluster to learn about
writes that were performed before the writes themselves have been recovered. To
mark the "unfound" objects as "lost", run a command of the following form:

.. prompt:: bash

   ceph pg 2.5 mark_unfound_lost revert|delete

Here the final argument (``revert|delete``) specifies how the cluster should
deal with lost objects.

The ``delete`` option will cause the cluster to forget about them entirely.

The ``revert`` option (which is not available for erasure coded pools) will
either roll back to a previous version of the object or (if it was a new
object) forget about the object entirely. Use ``revert`` with caution, as it
may confuse applications that expect the object to exist.

Homeless Placement Groups
=========================

It is possible that every OSD that has copies of a given placement group fails.
If this happens, then the subset of the object store that contains those
placement groups becomes unavailable and the monitor will receive no status
updates for those placement groups. The monitor marks as ``stale`` any
placement group whose primary OSD has failed. For example:

.. prompt:: bash

   ceph health

::

    HEALTH_WARN 24 pgs stale; 3/300 in osds are down

Identify which placement groups are ``stale`` and which were the last OSDs to
store the ``stale`` placement groups by running the following command:

.. prompt:: bash

   ceph health detail

::

    HEALTH_WARN 24 pgs stale; 3/300 in osds are down
    ...
    pg 2.5 is stuck stale+active+remapped, last acting [2,0]
    ...
    osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
    osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
    osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
that placement group.
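A quick way to see which OSDs are currently down (and therefore which daemons
need restarting) is to check the OSD summary and, on recent releases, filter
the OSD tree by state:

.. prompt:: bash

   ceph osd stat
   ceph osd tree down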
Only a Few OSDs Receive Data
============================

If only a few of the nodes in the cluster are receiving data, check the number
of placement groups in the pool as instructed in the :ref:`Placement Groups
<rados_ops_pgs_get_pg_num>` documentation. Since placement groups get mapped to
OSDs in an operation involving dividing the number of placement groups in the
cluster by the number of OSDs in the cluster, a small number of placement
groups (the remainder, in this operation) are sometimes not distributed across
the cluster. In situations like this, create a pool with a placement group
count that is a multiple of the number of OSDs. See `Placement Groups`_ for
details. See the :ref:`Pool, PG, and CRUSH Config Reference
<rados_config_pool_pg_crush_ref>` for instructions on changing the default
values used to determine how many placement groups are assigned to each pool.
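For example, to inspect and then raise the placement group count of a pool
(``mypool`` is a placeholder name; pick a multiple of your OSD count, or let
the autoscaler manage it):

.. prompt:: bash

   ceph osd pool get mypool pg_num
   ceph osd pool set mypool pg_num 128              # example value only
   ceph osd pool set mypool pg_autoscale_mode on    # alternatively, let Ceph choose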
Can't Write Data
================

If the cluster is up, but some OSDs are down and you cannot write data, make
sure that you have the minimum number of OSDs running in the pool. If you don't
have the minimum number of OSDs running in the pool, Ceph will not allow you to
write data to it because there is no guarantee that Ceph can replicate your
data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
Config Reference <rados_config_pool_pg_crush_ref>` for details.
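For example, to check a pool's replication and minimum-write requirements
(``mypool`` is a placeholder):

.. prompt:: bash

   ceph osd pool get mypool size
   ceph osd pool get mypool min_size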
PGs Inconsistent
================

If the command ``ceph health detail`` returns an ``active + clean +
inconsistent`` state, this might indicate an error during scrubbing. Identify
the inconsistent placement group or placement groups by running the following
command:

.. prompt:: bash

   $ ceph health detail

::

    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
    2 scrub errors

Alternatively, run this command if you prefer to inspect the output in a
programmatic way:

.. prompt:: bash

   $ rados list-inconsistent-pg rbd

::

    ["0.6"]

There is only one consistent state, but in the worst case, we could have
different inconsistencies in multiple perspectives found in more than one
object. If an object named ``foo`` in PG ``0.6`` is truncated, the output of
``rados list-inconsistent-obj`` will look something like this:

.. prompt:: bash

   rados list-inconsistent-obj 0.6 --format=json-pretty

.. code-block:: javascript
@ -442,82 +502,103 @@ objects. If an object named ``foo`` in PG ``0.6`` is truncated, we will have::
    ]
}

In this case, the output indicates the following:

* The only inconsistent object is named ``foo``, and its head has
  inconsistencies.

* The inconsistencies fall into two categories:

  #. ``errors``: these errors indicate inconsistencies between shards, without
     an indication of which shard(s) are bad. Check for the ``errors`` in the
     ``shards`` array, if available, to pinpoint the problem.

     * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
       is different from the digests of the replica reads of ``OSD.0`` and
       ``OSD.1``
     * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
       but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.

  #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
     ``shards`` array. The ``errors`` are set for the shard with the problem.
     These errors include ``read_error`` and other similar errors. The
     ``errors`` ending in ``oi`` indicate a comparison with
     ``selected_object_info``. Examine the ``shards`` array to determine
     which shard has which error or errors.

     * ``data_digest_mismatch_info``: the digest stored in the ``object-info``
       is not ``0xffffffff``, which is calculated from the shard read from
       ``OSD.2``
     * ``size_mismatch_info``: the size stored in the ``object-info`` is
       different from the size read from ``OSD.2``. The latter is ``0``.

.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
   inconsistency is likely due to physical storage errors. In cases like this,
   check the storage used by that OSD.

   Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
   repair.
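A minimal sketch of that check (the OSD id and device path are placeholders;
``ceph osd metadata`` reports the backing device of an OSD among other
details):

.. prompt:: bash

   ceph osd metadata 2                   # look up the device backing osd.2
   sudo dmesg -T | grep -i error         # kernel-level I/O errors, if any
   sudo smartctl -a /dev/sdX             # SMART health of the suspect drive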
To repair the inconsistent placement group, run a command of the following
form:

.. prompt:: bash

   ceph pg repair {placement-group-ID}

.. warning:: This command overwrites the "bad" copies with "authoritative"
   copies. In most cases, Ceph is able to choose authoritative copies from all
   the available replicas by using some predefined criteria. This, however,
   does not work in every case. For example, it might be the case that the
   stored data digest is missing, which means that the calculated digest is
   ignored when Ceph chooses the authoritative copies. Be aware of this, and
   use the above command with caution.

If you receive ``active + clean + inconsistent`` states periodically due to
clock skew, consider configuring the `NTP
<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.
Erasure Coded PGs are not active+clean
======================================

If CRUSH fails to find enough OSDs to map to a PG, it will show as a
``2147483647`` which is ``ITEM_NONE`` or ``no OSD found``. For example::

   [2,1,6,0,5,8,2147483647,7,4]

Not enough OSDs
---------------

If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
OSDs, the cluster will show "Not enough OSDs". In this case, you can either
create another erasure coded pool that requires fewer OSDs, by running commands
of the following form:

.. prompt:: bash

   ceph osd erasure-code-profile set myprofile k=5 m=3
   ceph osd pool create erasurepool erasure myprofile

or add new OSDs, and the PG will automatically use them.
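On a cephadm-managed cluster, for instance, new OSDs can be added on available
devices with commands along these lines (host and device names are
placeholders):

.. prompt:: bash

   ceph orch device ls
   ceph orch daemon add osd myhost:/dev/sdb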
CRUSH constraints cannot be satisfied
-------------------------------------

If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
constraints that cannot be satisfied. If there are ten OSDs on two hosts and
the CRUSH rule requires that no two OSDs from the same host are used in the
same PG, the mapping may fail because only two OSDs will be found. Check the
constraint by displaying ("dumping") the rule, as shown here:

.. prompt:: bash

   ceph osd crush rule ls

::

    [
        "replicated_rule",
        "erasurepool"]
@ -535,36 +616,43 @@ the rule::
{ "op": "emit"}]} { "op": "emit"}]}
You can resolve the problem by creating a new pool in which PGs are allowed Resolve this problem by creating a new pool in which PGs are allowed to have
to have OSDs residing on the same host with:: OSDs residing on the same host by running the following commands:
ceph osd erasure-code-profile set myprofile crush-failure-domain=osd .. prompt:: bash
ceph osd pool create erasurepool erasure myprofile
ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
ceph osd pool create erasurepool erasure myprofile
CRUSH gives up too soon CRUSH gives up too soon
----------------------- -----------------------
If the Ceph cluster has just enough OSDs to map the PG (for instance a If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
cluster with a total of 9 OSDs and an erasure coded pool that requires with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
9 OSDs per PG), it is possible that CRUSH gives up before finding a PG), it is possible that CRUSH gives up before finding a mapping. This problem
mapping. It can be resolved by: can be resolved by:
* lowering the erasure coded pool requirements to use less OSDs per PG * lowering the erasure coded pool requirements to use fewer OSDs per PG (this
(that requires the creation of another pool as erasure code profiles requires the creation of another pool, because erasure code profiles cannot
cannot be dynamically modified). be modified dynamically).
* adding more OSDs to the cluster (that does not require the erasure * adding more OSDs to the cluster (this does not require the erasure coded pool
coded pool to be modified, it will become clean automatically) to be modified, because it will become clean automatically)
* use a handmade CRUSH rule that tries more times to find a good * using a handmade CRUSH rule that tries more times to find a good mapping.
mapping. This can be done by setting ``set_choose_tries`` to a value This can be modified for an existing CRUSH rule by setting
greater than the default. ``set_choose_tries`` to a value greater than the default.
You should first verify the problem with ``crushtool`` after First, verify the problem by using ``crushtool`` after extracting the crushmap
extracting the crushmap from the cluster so your experiments do not from the cluster. This ensures that your experiments do not modify the Ceph
modify the Ceph cluster and only work on a local files:: cluster and that they operate only on local files:
.. prompt:: bash
ceph osd crush rule dump erasurepool
::
$ ceph osd crush rule dump erasurepool
{ "rule_id": 1, { "rule_id": 1,
"rule_name": "erasurepool", "rule_name": "erasurepool",
"type": 3, "type": 3,
@ -586,44 +674,54 @@ modify the Ceph cluster and only work on a local files::
    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
needs, and ``--rule`` is the value of the ``rule_id`` field that was displayed
by ``ceph osd crush rule dump``. This test will attempt to map one million
values (in this example, the range defined by ``[--min-x,--max-x]``) and must
display at least one bad mapping. If this test outputs nothing, all mappings
have been successful and you can be assured that the problem with your cluster
is not caused by bad mappings.
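As a sketch of how such a test can be run against your own cluster (the rule
id and replica count come from your ``ceph osd crush rule dump`` output, and
the ``--min-x``/``--max-x`` range below corresponds to the one million values
mentioned above):

.. prompt:: bash

   ceph osd getcrushmap -o crush.map
   crushtool -i crush.map --test --show-bad-mappings \
      --rule 1 --num-rep 9 --min-x 1 --max-x 1000000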
Changing the value of set_choose_tries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Decompile the CRUSH map to edit the CRUSH rule by running the following
   command:

   .. prompt:: bash

      crushtool --decompile crush.map > crush.txt

#. Add the following line to the rule::

      step set_choose_tries 100

   The relevant part of the ``crush.txt`` file will resemble this::

      rule erasurepool {
              id 1
              type erasure
              step set_chooseleaf_tries 5
              step set_choose_tries 100
              step take default
              step chooseleaf indep 0 type host
              step emit
      }

#. Recompile and retest the CRUSH rule:

   .. prompt:: bash

      crushtool --compile crush.txt -o better-crush.map

#. When all mappings succeed, display a histogram of the number of tries that
   were necessary to find all of the mappings by using the
   ``--show-choose-tries`` option of the ``crushtool`` command, as in the
   following example:

   .. prompt:: bash

      crushtool -i better-crush.map --test --show-bad-mappings \
         --show-choose-tries \
         --rule 1 \
         --num-rep 9 \
@ -673,14 +771,12 @@ were necessary to find all of them can be displayed with the
    104:  0
    ...

This output indicates that it took eleven tries to map forty-two PGs, twelve
tries to map forty-four PGs etc. The highest number of tries is the minimum
value of ``set_choose_tries`` that prevents bad mappings (for example, ``103``
in the above output, because it did not take more than 103 tries for any PG to
be mapped).

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _Placement Groups: ../../operations/placement-groups
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
View File
@ -476,23 +476,40 @@ commands. ::
Rate Limit Management
=====================

The Ceph Object Gateway makes it possible to set rate limits on users and
buckets. "Rate limit" includes the maximum number of read operations (read
ops) and write operations (write ops) per minute and the number of bytes per
minute that can be written or read per user or per bucket.

Operations that use the ``GET`` method or the ``HEAD`` method in their REST
requests are "read requests". All other requests are "write requests".

Each object gateway tracks per-user metrics separately from bucket metrics.
These metrics are not shared with other gateways. The configured limits should
be divided by the number of active object gateways. For example, if "user A" is
to be limited to 10 ops per minute and there are two object gateways in the
cluster, then the limit on "user A" should be ``5`` (10 ops per minute / 2
RGWs). If the requests are **not** balanced between RGWs, the rate limit might
be underutilized. For example: if the ops limit is ``5`` and there are two
RGWs, **but** the Load Balancer sends load to only one of those RGWs, the
effective limit is 5 ops, because this limit is enforced per RGW. If the rate
limit that has been set for the bucket has been reached but the rate limit that
has been set for the user has not been reached, then the request is cancelled.
The contrary holds as well: if the rate limit that has been set for the user
has been reached but the rate limit that has been set for the bucket has not
been reached, then the request is cancelled.

The accounting of bandwidth happens only after a request has been accepted.
This means that requests will proceed even if the bucket rate limit or user
rate limit is reached during the execution of the request. The RGW keeps track
of a "debt" consisting of bytes used in excess of the configured value; users
or buckets that incur this kind of debt are prevented from sending more
requests until the "debt" has been repaid. The maximum size of the "debt" is
twice the max-read/write-bytes per minute. If "user A" is subject to a 1-byte
read limit per minute and they attempt to GET an object that is 1 GB in size,
the ``GET`` operation is still allowed to complete. After "user A" has
completed this 1 GB operation, RGW blocks the user's requests for up to two
minutes. After this time has elapsed, "user A" will be able to send ``GET``
requests again.
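As a hedged sketch of how the per-gateway arithmetic above translates into
configuration, assuming the ``radosgw-admin ratelimit`` subcommands available
in recent releases (``userA`` is a placeholder user id, and ``5`` is the
per-RGW share computed above):

.. prompt:: bash

   radosgw-admin ratelimit set --ratelimit-scope=user --uid=userA --max-read-ops=5
   radosgw-admin ratelimit enable --ratelimit-scope=user --uid=userA
   radosgw-admin ratelimit get --ratelimit-scope=user --uid=userA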
- **Bucket:** The ``--bucket`` option allows you to specify a rate limit for a - **Bucket:** The ``--bucket`` option allows you to specify a rate limit for a
View File
@ -79,7 +79,7 @@ workload with a smaller number of buckets but higher number of objects (hundreds
per bucket you would consider decreasing :confval:`rgw_lc_max_wp_worker` from the default value of 3.

.. note:: When looking to tune either of these specific values please validate the
   current Cluster performance and Ceph Object Gateway utilization before increasing.

Garbage Collection Settings
===========================
@ -97,8 +97,9 @@ To view the queue of objects awaiting garbage collection, execute the following
   radosgw-admin gc list

.. note:: Specify ``--include-all`` to list all entries, including unexpired
   Garbage Collection objects.

Garbage collection is a background activity that may
execute continuously or during times of low loads, depending upon how the
administrator configures the Ceph Object Gateway. By default, the Ceph Object
@ -121,7 +122,9 @@ configuration parameters.
:Tuning Garbage Collection for Delete Heavy Workloads:

  As an initial step towards tuning Ceph Garbage Collection to be more
  aggressive the following options are suggested to be increased from their
  default configuration values::

      rgw_gc_max_concurrent_io = 20
      rgw_gc_max_trim_chunk = 64
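One way to apply these values at runtime, assuming the RGW daemons read their
options from the monitors' central config store (some options may still
require a gateway restart to take effect):

.. prompt:: bash

   ceph config set client.rgw rgw_gc_max_concurrent_io 20
   ceph config set client.rgw rgw_gc_max_trim_chunk 64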
@ -270,7 +273,7 @@ to support future methods of scheduling requests.
Currently the scheduler defaults to a throttler which throttles the active
connections to a configured limit. QoS based on mClock is currently in an
*experimental* phase and not recommended for production yet. Current
implementation of *dmclock_client* op queue divides RGW ops on admin, auth
(swift auth, sts) metadata & data requests.
View File
@ -6,38 +6,39 @@ RGW Dynamic Bucket Index Resharding
.. versionadded:: Luminous

A large bucket index can lead to performance problems, which can
be addressed by sharding bucket indexes.
Until Luminous, changing the number of bucket shards (resharding)
needed to be done offline, with RGW services disabled.
Since the Luminous release Ceph has supported online bucket resharding.

Each bucket index shard can handle its entries efficiently up until
reaching a certain threshold. If this threshold is
exceeded the system can suffer from performance issues. The dynamic
resharding feature detects this situation and automatically increases
the number of shards used by a bucket's index, resulting in a
reduction of the number of entries in each shard. This
process is transparent to the user. Writes to the target bucket
are blocked (but reads are not) briefly during the resharding process.

By default dynamic bucket index resharding can only increase the
number of bucket index shards to 1999, although this upper-bound is a
configuration parameter (see Configuration below). When
possible, the process chooses a prime number of shards in order to
spread the number of entries across the bucket index
shards more evenly.

Detection of resharding opportunities runs as a background process
that periodically scans all buckets. A bucket that requires resharding
is added to a queue. A thread runs in the background and processes the
queued resharding tasks, one at a time and in order.

Multisite
=========

With Ceph releases prior to Reef, the Ceph Object Gateway (RGW) does not
support dynamic resharding in a multisite environment. For information on
dynamic resharding, see :ref:`Resharding <feature_resharding>` in the RGW
multisite documentation.
@ -50,11 +51,11 @@ Enable/Disable dynamic bucket index resharding:
Configuration options that control the resharding process:

- ``rgw_max_objs_per_shard``: maximum number of objects per bucket index shard before resharding is triggered, default: 100000
- ``rgw_max_dynamic_shards``: maximum number of bucket index shards that dynamic resharding can increase to, default: 1999
- ``rgw_reshard_bucket_lock_duration``: duration, in seconds, that writes to the bucket are locked during resharding, default: 360 (i.e., 6 minutes)
- ``rgw_reshard_thread_interval``: maximum time, in seconds, between rounds of resharding queue processing, default: 600 (i.e., 10 minutes)
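These options can be inspected and overridden centrally; a minimal sketch,
assuming the RGW daemons consume the ``client.rgw`` section of the monitors'
config store (the value shown is illustrative only):

.. prompt:: bash

   ceph config get client.rgw rgw_max_objs_per_shard
   ceph config set client.rgw rgw_max_objs_per_shard 200000   # illustrative value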
@ -91,9 +92,9 @@ Bucket resharding status
   # radosgw-admin reshard status --bucket <bucket_name>

The output is a JSON array of 3 objects (reshard_status, new_bucket_instance_id, num_shards) per shard.

For example, the output at each dynamic resharding stage is shown below:

``1. Before resharding occurred:``
::
@ -122,7 +123,7 @@ For example, the output at different Dynamic Resharding stages is shown below:
    }
  ]

``3. After resharding completed:``
::

  [
@ -142,7 +143,7 @@ For example, the output at different Dynamic Resharding stages is shown below:
Cancel pending bucket resharding
--------------------------------

Note: Bucket resharding operations cannot be cancelled while executing. ::

   # radosgw-admin reshard cancel --bucket <bucket_name>
@ -153,25 +154,24 @@ Manual immediate bucket resharding
   # radosgw-admin bucket reshard --bucket <bucket_name> --num-shards <new number of shards>

When choosing a number of shards, the administrator must anticipate each
bucket's peak number of objects. Ideally one should aim for no
more than 100000 entries per shard at any given time.

Additionally, bucket index shards that are prime numbers are more effective
in evenly distributing bucket index entries.
For example, 7001 bucket index shards is better than 7000
since the former is prime. A variety of web sites have lists of prime
numbers; search for "list of prime numbers" with your favorite
search engine to locate some web sites.
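For instance, a manual reshard to a prime shard count might look like the
following (``mybucket`` is a placeholder; checking the bucket's current object
count via ``bucket stats`` first is a reasonable sanity check):

.. prompt:: bash

   radosgw-admin bucket stats --bucket mybucket | grep num_objects
   radosgw-admin bucket reshard --bucket mybucket --num-shards 7001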
Troubleshooting
===============

Clusters prior to Luminous 12.2.11 and Mimic 13.2.5 left behind stale bucket
instance entries, which were not automatically cleaned up. This issue also
affected LifeCycle policies, which were no longer applied to resharded buckets.
Both of these issues could be worked around by running ``radosgw-admin``
commands.

Stale instance management
-------------------------
@ -183,7 +183,7 @@ List the stale instances in a cluster that are ready to be cleaned up.
   # radosgw-admin reshard stale-instances list

Clean up the stale instances in a cluster. Note: cleanup of these
instances should only be done on a single-site cluster.

::
@ -193,11 +193,12 @@ instances should only be done on a single site cluster.
Lifecycle fixes
---------------

For clusters with resharded instances, it is highly likely that the old
lifecycle processes would have flagged and deleted lifecycle processing as the
bucket instance changed during a reshard. While this is fixed for buckets
deployed on newer Ceph releases (from Mimic 13.2.6 and Luminous 12.2.12),
older buckets that had lifecycle policies and that have undergone
resharding must be fixed manually.

The command to do so is:
@ -206,8 +207,8 @@ The command to do so is:
   # radosgw-admin lc reshard fix --bucket {bucketname}

If the ``--bucket`` argument is not provided, this
command will try to fix lifecycle policies for all the buckets in the cluster.

Object Expirer fixes
--------------------
@ -217,7 +218,7 @@ been dropped from the log pool and never deleted after the bucket was
resharded. This would happen if their expiration time was before the
cluster was upgraded, but if their expiration was after the upgrade
the objects would be correctly handled. To manage these expire-stale
objects, ``radosgw-admin`` provides two subcommands.

Listing:
View File
@ -770,7 +770,13 @@ to a multi-site system, follow these steps:
      radosgw-admin zonegroup rename --rgw-zonegroup default --zonegroup-new-name=<name>
      radosgw-admin zone rename --rgw-zone default --zone-new-name us-east-1 --rgw-zonegroup=<name>

3. Rename the default zonegroup's ``api_name``. Replace ``<name>`` with the
   zonegroup name:

   .. prompt:: bash #

      radosgw-admin zonegroup modify --api-name=<name> --rgw-zonegroup=<name>

4. Configure the master zonegroup. Replace ``<name>`` with the realm name or
   zonegroup name. Replace ``<fqdn>`` with the fully qualified domain name(s)
   in the zonegroup:
@ -778,7 +784,7 @@ to a multi-site system, follow these steps:
      radosgw-admin zonegroup modify --rgw-realm=<name> --rgw-zonegroup=<name> --endpoints http://<fqdn>:80 --master --default

5. Configure the master zone. Replace ``<name>`` with the realm name, zone
   name, or zonegroup name. Replace ``<fqdn>`` with the fully qualified domain
   name(s) in the zonegroup:
@ -789,7 +795,7 @@ to a multi-site system, follow these steps:
         --access-key=<access-key> --secret=<secret-key> \
         --master --default

6. Create a system user. Replace ``<user-id>`` with the username. Replace
   ``<display-name>`` with a display name. The display name is allowed to
   contain spaces:
@ -800,13 +806,13 @@ to a multi-site system, follow these steps:
         --access-key=<access-key> \
         --secret=<secret-key> --system

7. Commit the updated configuration:

   .. prompt:: bash #

      radosgw-admin period update --commit

8. Restart the Ceph Object Gateway:

   .. prompt:: bash #
@ -1588,7 +1594,7 @@ Zone Features
Some multisite features require support from all zones before they can be enabled. Each zone lists its ``supported_features``, and each zonegroup lists its ``enabled_features``. Before a feature can be enabled in the zonegroup, it must be supported by all of its zones.

On creation of new zones and zonegroups, all known features are supported and some features (see table below) are enabled by default. After upgrading an existing multisite configuration, however, new features must be enabled manually.

Supported Features
------------------
View File
@ -188,8 +188,7 @@ Request parameters:
  specified CA will be used to authenticate the broker. The default CA will
  not be used.
- amqp-exchange: The exchanges must exist and must be able to route messages
  based on topics. This parameter is mandatory.
- amqp-ack-level: No end2end acking is required. Messages may persist in the
  broker before being delivered to their final destinations. Three ack methods
  exist:
View File
@ -13,7 +13,7 @@ Supported Destination
---------------------

AWS supports: **SNS**, **SQS** and **Lambda** as possible destinations (AWS internal destinations).
Currently, we support: **HTTP/S**, **Kafka** and **AMQP**.

We are using the **SNS** ARNs to represent the **HTTP/S**, **Kafka** and **AMQP** destinations.
View File
@ -91,14 +91,8 @@ The following common request header fields are not supported:
+----------------------------+------------+
| Name                       | Type       |
+============================+============+
| **x-amz-id-2**             | Response   |
+----------------------------+------------+

.. _Amazon S3 API: http://docs.aws.amazon.com/AmazonS3/latest/API/APIRest.html
.. _S3 Notification Compatibility: ../s3-notification-compatibility
View File
@ -21,6 +21,7 @@ security fixes.
   :maxdepth: 1
   :hidden:

   Reef (v18.2.*) <reef>
   Quincy (v17.2.*) <quincy>
   Pacific (v16.2.*) <pacific>
@ -58,8 +59,11 @@ receive bug fixes or backports).
Release timeline
----------------

.. ceph_timeline_gantt:: releases.yml reef quincy
.. ceph_timeline:: releases.yml reef quincy

.. _Reef: reef
.. _18.2.0: reef#v18-2-0-reef
.. _Quincy: quincy
.. _17.2.0: quincy#v17-2-0-quincy
551
ceph/doc/releases/reef.rst Normal file
View File
@ -0,0 +1,551 @@
====
Reef
====
Reef is the 18th stable release of Ceph. It is named after the reef squid
(Sepioteuthis).
v18.2.0 Reef
============
This is the first stable release of Ceph Reef.
.. important::
We are unable to build Ceph on Debian stable (bookworm) for the 18.2.0
release because of Debian bug
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129. We will build as
soon as this bug is resolved in Debian stable.
*last updated 2023 Aug 04*
Major Changes from Quincy
--------------------------
Highlights
~~~~~~~~~~
See the relevant sections below for more details on these changes.
* **RADOS:** FileStore is not supported in Reef.
* **RADOS:** RocksDB has been upgraded to version 7.9.2.
* **RADOS:** There have been significant improvements to RocksDB iteration overhead and performance.
* **RADOS:** The ``perf dump`` and ``perf schema`` commands have been deprecated in
favor of the new ``counter dump`` and ``counter schema`` commands.
* **RADOS:** Cache tiering is now deprecated.
* **RADOS:** A new feature, the "read balancer", is now available, which allows users to balance primary PGs per pool on their clusters.
* **RGW:** Bucket resharding is now supported for multi-site configurations.
* **RGW:** There have been significant improvements to the stability and consistency of multi-site replication.
* **RGW:** Compression is now supported for objects uploaded with Server-Side Encryption.
* **Dashboard:** There is a new Dashboard page with improved layout. Active alerts and some important charts are now displayed inside cards.
* **RBD:** Support for layered client-side encryption has been added.
* **Telemetry**: Users can now opt in to participate in a leaderboard in the telemetry public dashboards.
CephFS
~~~~~~
* CephFS: The ``mds_max_retries_on_remount_failure`` option has been renamed to
``client_max_retries_on_remount_failure`` and moved from ``mds.yaml.in`` to
``mds-client.yaml.in``. This change was made because the option has always
been used only by the MDS client.
* CephFS: It is now possible to delete the recovered files in the
``lost+found`` directory after a CephFS post has been recovered in accordance
with disaster recovery procedures.
* The ``AT_NO_ATTR_SYNC`` macro has been deprecated in favor of the standard
``AT_STATX_DONT_SYNC`` macro. The ``AT_NO_ATTR_SYNC`` macro will be removed
in the future.
Dashboard
~~~~~~~~~
* There is a new Dashboard page with improved layout. Active alerts
and some important charts are now displayed inside cards.
* Cephx Auth Management: There is a new section dedicated to listing and
managing Ceph cluster users.
* RGW Server Side Encryption: The SSE-S3 and KMS encryption of rgw buckets can
now be configured at the time of bucket creation.
* RBD Snapshot mirroring: Snapshot mirroring can now be configured through UI.
Snapshots can now be scheduled.
* 1-Click OSD Creation Wizard: OSD creation has been broken into 3 options:
#. Cost/Capacity Optimized: Use all HDDs
#. Throughput Optimized: Combine HDDs and SSDs
#. IOPS Optimized: Use all NVMes
The current OSD-creation form has been moved to the Advanced section.
* Centralized Logging: There is now a view that collects all the logs from
the Ceph cluster.
* Accessibility WCAG-AA: Dashboard is WCAG 2.1 level A compliant and therefore
improved for blind and visually impaired Ceph users.
* Monitoring & Alerting
* Ceph-exporter: Performance metrics for Ceph daemons are now
exported by ceph-exporter, which is deployed on each host rather than
relying on the prometheus exporter. This reduces performance bottlenecks.
* Monitoring stacks updated:
* Prometheus 2.43.0
* Node-exporter 1.5.0
* Grafana 9.4.7
* Alertmanager 0.25.0
MGR
~~~
* mgr/snap_schedule: The snap-schedule manager module now retains one snapshot
less than the number mentioned against the config option
``mds_max_snaps_per_dir``. This means that a new snapshot can be created and
retained during the next schedule run.
* The ``ceph mgr dump`` command now outputs ``last_failure_osd_epoch`` and
``active_clients`` fields at the top level. Previously, these fields were
output under the ``always_on_modules`` field.
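As a quick way to confirm where these fields now live, you can inspect the dump output directly. The following is a minimal sketch, assuming the ``jq`` utility is available on the admin host; the exact JSON layout may vary slightly between releases.
.. prompt:: bash #
ceph mgr dump | jq '.last_failure_osd_epoch, .active_clients[].name'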
RADOS
~~~~~
* FileStore is not supported in Reef.
* RocksDB has been upgraded to version 7.9.2, which incorporates several
performance improvements and features. This is the first release that can
tune RocksDB settings per column family, which allows for more granular
tunings to be applied to different kinds of data stored in RocksDB. New
default settings have been used to optimize performance for most workloads, with a
slight penalty in some use cases. This slight penalty is outweighed by large
improvements in compactions and write amplification in use cases such as RGW
(up to a measured 13.59% improvement in 4K random write IOPs).
* Trimming of PGLog dups is now controlled by the size rather than the version.
This change fixes the PGLog inflation issue that was happening when the
online (in OSD) trimming got jammed after a PG split operation. Also, a new
offline mechanism has been added: ``ceph-objectstore-tool`` has a new
operation called ``trim-pg-log-dups`` that targets situations in which an OSD
is unable to boot because of the inflated dups. In such situations, the "You
can be hit by THE DUPS BUG" warning is visible in OSD logs. Relevant tracker:
https://tracker.ceph.com/issues/53729
* The RADOS Python bindings are now able to process (opt-in) omap keys as bytes
objects. This allows interacting with RADOS omap keys that are not
decodable as UTF-8 strings.
* mClock Scheduler: The mClock scheduler (the default scheduler in Quincy) has
undergone significant usability and design improvements to address the slow
backfill issue. The following is a list of some important changes:
* The ``balanced`` profile is set as the default mClock profile because it
represents a compromise between prioritizing client I/O and prioritizing
recovery I/O. Users can then choose either the ``high_client_ops`` profile
to prioritize client I/O or the ``high_recovery_ops`` profile to prioritize
recovery I/O.
* QoS parameters including ``reservation`` and ``limit`` are now specified in
terms of a fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
* The cost parameters (``osd_mclock_cost_per_io_usec_*`` and
``osd_mclock_cost_per_byte_usec_*``) have been removed. The cost of an
operation is now a function of the random IOPS and maximum sequential
bandwidth capability of the OSD's underlying device.
* Degraded object recovery is given higher priority than misplaced
object recovery because degraded objects present a data safety issue that
is not present with objects that are merely misplaced. As a result,
backfilling operations with the ``balanced`` and ``high_client_ops`` mClock
profiles might progress more slowly than in the past, when backfilling
operations used the 'WeightedPriorityQueue' (WPQ) scheduler.
* The QoS allocations in all the mClock profiles are optimized in
accordance with the above fixes and enhancements.
* For more details, see:
https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/
* A new feature, the "read balancer", is now available, which allows
users to balance primary PGs per pool on their clusters. The read balancer is
currently available as an offline option via the ``osdmaptool``. By providing
a copy of their osdmap and a pool they want balanced to the ``osdmaptool``, users
can generate a preview of optimal primary PG mappings that they can then choose to
apply to their cluster. For more details, see
https://docs.ceph.com/en/latest/dev/balancer-design/#read-balancing
* The ``active_clients`` array displayed by the ``ceph mgr dump`` command now
has a ``name`` field that shows the name of the manager module that
registered a RADOS client. Previously, the ``active_clients`` array showed
the address of a module's RADOS client, but not the name of the module.
* The ``perf dump`` and ``perf schema`` commands have been deprecated in
favor of the new ``counter dump`` and ``counter schema`` commands. These new
commands add support for labeled perf counters and also emit existing
unlabeled perf counters. Some unlabeled perf counters became labeled in this
release, and more will be labeled in future releases; such converted perf
counters are no longer emitted by the ``perf dump`` and ``perf schema``
commands.
* Cache tiering is now deprecated.
* The SPDK backend for BlueStore can now connect to an NVMeoF target. This
is not an officially supported feature.
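As a hands-on illustration of two of the changes above (the mClock profiles and the labeled ``counter`` commands), the sketch below switches the cluster-wide mClock profile and dumps labeled perf counters over one OSD's admin socket. ``osd.0`` and the profile name are examples only, and the ``counter`` commands must be run on the host where that daemon lives.
.. prompt:: bash #
ceph config set osd osd_mclock_profile high_client_ops
ceph config show osd.0 osd_mclock_profile
ceph daemon osd.0 counter dump
ceph daemon osd.0 counter schema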
RBD
~~~
* The semantics of compare-and-write C++ API (`Image::compare_and_write` and
`Image::aio_compare_and_write` methods) now match those of C API. Both
compare and write steps operate only on len bytes even if the buffers
associated with them are larger. The previous behavior of comparing up to the
size of the compare buffer was prone to subtle breakage upon straddling a
stripe unit boundary.
* The ``compare-and-write`` operation is no longer limited to 512-byte
sectors. Assuming proper alignment, it now allows operating on stripe units
(4MB by default).
* There is a new ``rbd_aio_compare_and_writev`` API method that supports
scatter/gather on compare buffers as well as on write buffers. This
complements the existing ``rbd_aio_readv`` and ``rbd_aio_writev`` methods.
* The ``rbd device unmap`` command now has a ``--namespace`` option.
Support for namespaces was added to RBD in Nautilus 14.2.0, and since then it
has been possible to map and unmap images in namespaces using the
``image-spec`` syntax. However, the corresponding option available in most
other commands was missing.
* All rbd-mirror daemon perf counters have become labeled and are now
emitted only by the new ``counter dump`` and ``counter schema`` commands. As
part of the conversion, many were also renamed in order to better
disambiguate journal-based and snapshot-based mirroring.
* The list-watchers C++ API (`Image::list_watchers`) now clears the passed
`std::list` before appending to it. This aligns with the semantics of the C
API (``rbd_watchers_list``).
* Trailing newline in passphrase files (for example: the
``<passphrase-file>`` argument of the ``rbd encryption format`` command and
the ``--encryption-passphrase-file`` option of other commands) is no longer
stripped.
* Support for layered client-side encryption has been added. It is now
possible to encrypt cloned images with a distinct encryption format and
passphrase, differing from that of the parent image and from that of every
other cloned image. The efficient copy-on-write semantics intrinsic to
unformatted (regular) cloned images have been retained.
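For example, encrypting a clone with its own passphrase, independent of the parent image, might look like the sketch below. The pool, image, snapshot, and passphrase file names are placeholders; the parent may itself use a different encryption format and passphrase, or none at all.
.. prompt:: bash #
rbd snap create rbd/golden@base
rbd snap protect rbd/golden@base
rbd clone rbd/golden@base rbd/clone1
rbd encryption format rbd/clone1 luks2 clone1-passphrase.bin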
RGW
~~~
* Bucket resharding is now supported for multi-site configurations. This
feature is enabled by default for new deployments. Existing deployments must
enable the ``resharding`` feature manually after all zones have upgraded.
See https://docs.ceph.com/en/reef/radosgw/multisite/#zone-features for
details.
* The RGW policy parser now rejects unknown principals by default. If you are
mirroring policies between RGW and AWS, you might want to set
``rgw_policy_reject_invalid_principals`` to ``false``. This change affects
only newly set policies, not policies that are already in place.
* RGW's default backend for ``rgw_enable_ops_log`` has changed from ``RADOS``
to ``file``. The default value of ``rgw_ops_log_rados`` is now ``false``, and
``rgw_ops_log_file_path`` now defaults to
``/var/log/ceph/ops-log-$cluster-$name.log``.
* RGW's pubsub interface now returns boolean fields using ``bool``. Before this
change, ``/topics/<topic-name>`` returned ``stored_secret`` and
``persistent`` using a string of ``"true"`` or ``"false"`` that contains
enclosing quotation marks. After this change, these fields are returned
without enclosing quotation marks so that the fields can be decoded as
boolean values in JSON. The same is true of the ``is_truncated`` field
returned by ``/subscriptions/<sub-name>``.
* RGW's response of ``Action=GetTopicAttributes&TopicArn=<topic-arn>`` REST
API now returns ``HasStoredSecret`` and ``Persistent`` as boolean in the JSON
string that is encoded in ``Attributes/EndPoint``.
* All boolean fields that were previously rendered as strings by the
``rgw-admin`` command when the JSON format was used are now rendered as
boolean. If your scripts and tools rely on this behavior, update them
accordingly. The following is a list of the field names impacted by this
change:
* ``absolute``
* ``add``
* ``admin``
* ``appendable``
* ``bucket_key_enabled``
* ``delete_marker``
* ``exists``
* ``has_bucket_info``
* ``high_precision_time``
* ``index``
* ``is_master``
* ``is_prefix``
* ``is_truncated``
* ``linked``
* ``log_meta``
* ``log_op``
* ``pending_removal``
* ``read_only``
* ``retain_head_object``
* ``rule_exist``
* ``start_with_full_sync``
* ``sync_from_all``
* ``syncstopped``
* ``system``
* ``truncated``
* ``user_stats_sync``
* The Beast front end's HTTP access log line now uses a new
``debug_rgw_access`` configurable. It has the same defaults as
``debug_rgw``, but it can be controlled independently.
* The pubsub functionality for storing bucket notifications inside Ceph
has been removed. As a result, the pubsub zone should not be used anymore.
The following have also been removed: the REST operations, ``radosgw-admin``
commands for manipulating subscriptions, fetching the notifications, and
acking the notifications.
If the endpoint to which the notifications are sent is down or disconnected,
we recommend that you use persistent notifications to guarantee their
delivery. If the system that consumes the notifications has to pull them
(instead of the notifications being pushed to the system), use an external
message bus (for example, RabbitMQ or Kafka) for that purpose.
* The serialized format of notification and topics has changed. This means
that new and updated topics will be unreadable by old RGWs. We recommend
completing the RGW upgrades before creating or modifying any notification
topics.
* Compression is now supported for objects uploaded with Server-Side
Encryption. When both compression and encryption are enabled, compression is
applied before encryption. Earlier releases of multisite do not replicate
such objects correctly, so all zones must upgrade to Reef before enabling the
`compress-encrypted` zonegroup feature: see
https://docs.ceph.com/en/reef/radosgw/multisite/#zone-features and note the
security considerations.
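As an illustration of the zone-feature workflow referenced above, once every zone in the zonegroup runs Reef, the ``resharding`` feature (and, if desired, ``compress-encrypted``) can be enabled roughly as sketched below. The zonegroup name is an example; consult the zone-features documentation linked above before enabling features in production.
.. prompt:: bash #
radosgw-admin zonegroup modify --rgw-zonegroup=default --enable-feature=resharding
radosgw-admin period update --commit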
Telemetry
~~~~~~~~~
* Users who have opted in to telemetry can also opt in to
participate in a leaderboard in the telemetry public dashboards
(https://telemetry-public.ceph.com/). In addition, users are now able to
provide a description of their cluster that will appear publicly in the
leaderboard. For more details, see:
https://docs.ceph.com/en/reef/mgr/telemetry/#leaderboard. To see a sample
report, run ``ceph telemetry preview``. To opt in to telemetry, run ``ceph
telemetry on``. To opt in to the leaderboard, run ``ceph config set mgr
mgr/telemetry/leaderboard true``. To add a leaderboard description, run
``ceph config set mgr mgr/telemetry/leaderboard_description Cluster
description`` (entering your own cluster description).
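Collected together, the opt-in steps described above look like the following; the description string is, of course, only an example.
.. prompt:: bash #
ceph telemetry preview
ceph telemetry on
ceph config set mgr mgr/telemetry/leaderboard true
ceph config set mgr mgr/telemetry/leaderboard_description 'My example lab cluster'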
Upgrading from Pacific or Quincy
--------------------------------
Before starting, make sure your cluster is stable and healthy (no down or recovering OSDs). Optionally, you can disable the autoscaler for all pools during the upgrade by setting the ``noautoscale`` flag, as shown in the example below; this is not required, but it is recommended.
.. note::
You can monitor the progress of your upgrade at each stage with the ``ceph versions`` command, which will tell you what ceph version(s) are running for each type of daemon.
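For example, before beginning you might record the running versions and, optionally, set the cluster-wide ``noautoscale`` flag. The flag commands below are a sketch and assume a release that provides the global flag.
.. prompt:: bash #
ceph versions
ceph osd pool set noautoscale # before the upgrade
ceph osd pool unset noautoscale # after the upgrade completes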
Upgrading cephadm clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~
If your cluster is deployed with cephadm (first introduced in Octopus), then the upgrade process is entirely automated. To initiate the upgrade,
.. prompt:: bash #
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.0
The same process is used to upgrade to future minor releases.
Upgrade progress can be monitored with
.. prompt:: bash #
ceph orch upgrade status
Upgrade progress can also be monitored with `ceph -s` (which provides a simple progress bar) or more verbosely with
.. prompt:: bash #
ceph -W cephadm
The upgrade can be paused or resumed with
.. prompt:: bash #
ceph orch upgrade pause # to pause
ceph orch upgrade resume # to resume
or canceled with
.. prompt:: bash #
ceph orch upgrade stop
Note that canceling the upgrade simply stops the process; there is no ability to downgrade back to Pacific or Quincy.
Upgrading non-cephadm clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::
1. If your cluster is running Pacific (16.2.x) or later, you might choose to first convert it to use cephadm so that the upgrade to Reef is automated (see above).
For more information, see https://docs.ceph.com/en/reef/cephadm/adoption/.
2. If your cluster is running Pacific (16.2.x) or later, systemd unit file names have changed to include the cluster fsid. To find the correct systemd unit file name for your cluster, run the following command:
```
systemctl -l | grep <daemon type>
```
Example:
```
$ systemctl -l | grep mon | grep active
ceph-6ce0347c-314a-11ee-9b52-000af7995d6c@mon.f28-h21-000-r630.service loaded active running Ceph mon.f28-h21-000-r630 for 6ce0347c-314a-11ee-9b52-000af7995d6c
```
#. Set the `noout` flag for the duration of the upgrade. (Optional, but recommended.)
.. prompt:: bash #
ceph osd set noout
#. Upgrade monitors by installing the new packages and restarting the monitor daemons. For example, on each monitor host
.. prompt:: bash #
systemctl restart ceph-mon.target
Once all monitors are up, verify that the monitor upgrade is complete by looking for the `reef` string in the mon map. The command
.. prompt:: bash #
ceph mon dump | grep min_mon_release
should report:
.. prompt:: bash #
min_mon_release 18 (reef)
If it does not, this implies that one or more monitors have not been upgraded and restarted, and/or that the quorum does not include all monitors.
#. Upgrade `ceph-mgr` daemons by installing the new packages and restarting all manager daemons. For example, on each manager host,
.. prompt:: bash #
systemctl restart ceph-mgr.target
Verify the `ceph-mgr` daemons are running by checking `ceph -s`:
.. prompt:: bash #
ceph -s
::
...
services:
mon: 3 daemons, quorum foo,bar,baz
mgr: foo(active), standbys: bar, baz
...
#. Upgrade all OSDs by installing the new packages and restarting the ceph-osd daemons on all OSD hosts
.. prompt:: bash #
systemctl restart ceph-osd.target
#. Upgrade all CephFS MDS daemons. For each CephFS file system,
#. Disable standby_replay:
.. prompt:: bash #
ceph fs set <fs_name> allow_standby_replay false
#. If upgrading from Pacific <=16.2.5:
.. prompt:: bash #
ceph config set mon mon_mds_skip_sanity true
#. Reduce the number of ranks to 1. (Make note of the original number of MDS daemons first if you plan to restore it later.)
.. prompt:: bash #
ceph status
ceph fs set <fs_name> max_mds 1
#. Wait for the cluster to deactivate any non-zero ranks by periodically checking the status
.. prompt:: bash #
ceph status
#. Take all standby MDS daemons offline on the appropriate hosts with
.. prompt:: bash #
systemctl stop ceph-mds@<daemon_name>
#. Confirm that only one MDS is online and is rank 0 for your FS
.. prompt:: bash #
ceph status
#. Upgrade the last remaining MDS daemon by installing the new packages and restarting the daemon
.. prompt:: bash #
systemctl restart ceph-mds.target
#. Restart all standby MDS daemons that were taken offline
.. prompt:: bash #
systemctl start ceph-mds.target
#. Restore the original value of `max_mds` for the volume
.. prompt:: bash #
ceph fs set <fs_name> max_mds <original_max_mds>
#. If upgrading from Pacific <=16.2.5 (followup to step 5.2):
.. prompt:: bash #
ceph config set mon mon_mds_skip_sanity false
#. Upgrade all radosgw daemons by upgrading packages and restarting daemons on all hosts
.. prompt:: bash #
systemctl restart ceph-radosgw.target
#. Complete the upgrade by disallowing pre-Reef OSDs and enabling all new Reef-only functionality
.. prompt:: bash #
ceph osd require-osd-release reef
#. If you set `noout` at the beginning, be sure to clear it with
.. prompt:: bash #
ceph osd unset noout
#. Consider transitioning your cluster to use the cephadm deployment and orchestration framework to simplify cluster management and future upgrades. For more information on converting an existing cluster to cephadm, see https://docs.ceph.com/en/reef/cephadm/adoption/.
Post-upgrade
~~~~~~~~~~~~
#. Verify the cluster is healthy with `ceph health`. If your cluster is running Filestore, and you are upgrading directly from Pacific to Reef, a deprecation warning is expected. This warning can be temporarily muted using the following command
.. prompt:: bash #
ceph health mute OSD_FILESTORE
#. Consider enabling the `telemetry module <https://docs.ceph.com/en/reef/mgr/telemetry/>`_ to send anonymized usage statistics and crash information to the Ceph upstream developers. To see what would be reported (without actually sending any information to anyone),
.. prompt:: bash #
ceph telemetry preview-all
If you are comfortable with the data that is reported, you can opt-in to automatically report the high-level cluster metadata with
.. prompt:: bash #
ceph telemetry on
The public dashboard that aggregates Ceph telemetry can be found at https://telemetry-public.ceph.com/.
Upgrading from pre-Pacific releases (like Octopus)
__________________________________________________
You **must** first upgrade to Pacific (16.2.z) or Quincy (17.2.z) before upgrading to Reef.

View File

@ -12,6 +12,11 @@
# If a version might represent an actual number (e.g. 0.80) quote it. # If a version might represent an actual number (e.g. 0.80) quote it.
# #
releases: releases:
reef:
target_eol: 2025-08-01
releases:
- version: 18.2.0
released: 2023-08-07
quincy: quincy:
target_eol: 2024-06-01 target_eol: 2024-06-01
releases: releases:
@ -29,7 +34,7 @@ releases:
released: 2022-04-19 released: 2022-04-19
pacific: pacific:
target_eol: 2023-06-01 target_eol: 2023-10-01
releases: releases:
- version: 16.2.11 - version: 16.2.11
released: 2023-01-26 released: 2023-01-26

View File

@ -973,6 +973,15 @@ convention was preferred because it made the documents more readable in a
command line interface. As of 2023, though, we have no preference for one over command line interface. As of 2023, though, we have no preference for one over
the other. Use whichever convention makes the text easier to read. the other. Use whichever convention makes the text easier to read.
Using a part of a sentence as a hyperlink, `like this <docs.ceph.com>`_, is
discouraged. The convention of writing "See X" is preferred. Here are some
preferred formulations:
#. For more information, see `docs.ceph.com <docs.ceph.com>`_.
#. See `docs.ceph.com <docs.ceph.com>`_.
Quirks of ReStructured Text Quirks of ReStructured Text
--------------------------- ---------------------------
@ -981,7 +990,8 @@ External Links
.. _external_link_with_inline_text: .. _external_link_with_inline_text:
This is the formula for links to addresses external to the Ceph documentation: Use the formula immediately below to render links that direct the reader to
addresses external to the Ceph documentation:
:: ::
@ -994,10 +1004,13 @@ This is the formula for links to addresses external to the Ceph documentation:
To link to addresses that are external to the Ceph documentation, include a To link to addresses that are external to the Ceph documentation, include a
space between the inline text and the angle bracket that precedes the space between the inline text and the angle bracket that precedes the
external address. This is precisely the opposite of :ref:`the convention for external address. This is precisely the opposite of the convention for
inline text that links to a location inside the Ceph inline text that links to a location inside the Ceph documentation. See
documentation<internal_link_with_inline_text>`. If this seems inconsistent :ref:`here <internal_link_with_inline_text>` for an exemplar of this
and confusing to you, then you're right. It is inconsistent and confusing. convention.
If this seems inconsistent and confusing to you, then you're right. It is
inconsistent and confusing.
See also ":ref:`External Hyperlink Example<start_external_hyperlink_example>`". See also ":ref:`External Hyperlink Example<start_external_hyperlink_example>`".

View File

@ -1,66 +1,83 @@
.. _hardware-recommendations: .. _hardware-recommendations:
========================== ==========================
Hardware Recommendations hardware recommendations
========================== ==========================
Ceph was designed to run on commodity hardware, which makes building and Ceph is designed to run on commodity hardware, which makes building and
maintaining petabyte-scale data clusters economically feasible. maintaining petabyte-scale data clusters flexible and economically feasible.
When planning out your cluster hardware, you will need to balance a number When planning your cluster's hardware, you will need to balance a number
of considerations, including failure domains and potential performance of considerations, including failure domains, cost, and performance.
issues. Hardware planning should include distributing Ceph daemons and Hardware planning should include distributing Ceph daemons and
other processes that use Ceph across many hosts. Generally, we recommend other processes that use Ceph across many hosts. Generally, we recommend
running Ceph daemons of a specific type on a host configured for that type running Ceph daemons of a specific type on a host configured for that type
of daemon. We recommend using other hosts for processes that utilize your of daemon. We recommend using separate hosts for processes that utilize your
data cluster (e.g., OpenStack, CloudStack, etc). data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc).
The requirements of one Ceph cluster are not the same as the requirements of
another, but below are some general guidelines.
.. tip:: Check out the `Ceph blog`_ too. .. tip:: check out the `ceph blog`_ too.
CPU CPU
=== ===
CephFS metadata servers (MDS) are CPU-intensive. CephFS metadata servers (MDS) CephFS Metadata Servers (MDS) are CPU-intensive. They are
should therefore have quad-core (or better) CPUs and high clock rates (GHz). OSD are single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
nodes need enough processing power to run the RADOS service, to calculate data servers do not need a large number of CPU cores unless they are also hosting other
services, such as SSD OSDs for the CephFS metadata pool.
OSD nodes need enough processing power to run the RADOS service, to calculate data
placement with CRUSH, to replicate data, and to maintain their own copies of the placement with CRUSH, to replicate data, and to maintain their own copies of the
cluster map. cluster map.
The requirements of one Ceph cluster are not the same as the requirements of With earlier releases of Ceph, we would make hardware recommendations based on
another, but here are some general guidelines. the number of cores per OSD, but this cores-per-osd metric is no longer as
useful a metric as the number of cycles per IOP and the number of IOPS per OSD.
In earlier versions of Ceph, we would make hardware recommendations based on For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
the number of cores per OSD, but this cores-per-OSD metric is no longer as
useful a metric as the number of cycles per IOP and the number of IOPs per OSD.
For example, for NVMe drives, Ceph can easily utilize five or six cores on real
clusters and up to about fourteen cores on single OSDs in isolation. So cores clusters and up to about fourteen cores on single OSDs in isolation. So cores
per OSD are no longer as pressing a concern as they were. When selecting per OSD are no longer as pressing a concern as they were. When selecting
hardware, select for IOPs per core. hardware, select for IOPS per core.
Monitor nodes and manager nodes have no heavy CPU demands and require only .. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
modest processors. If your host machines will run CPU-intensive processes in is enabled. Hyperthreading is usually beneficial for Ceph servers.
Monitor nodes and Manager nodes do not have heavy CPU demands and require only
modest processors. if your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one such example of a CPU-intensive process.) We recommend that you run one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your monitor and manager nodes) in order to avoid resource contention. not your Monitor and Manager nodes) in order to avoid resource contention.
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
with your Mon and Manager services if the nodes have sufficient resources.
RAM RAM
=== ===
Generally, more RAM is better. Monitor / manager nodes for a modest cluster Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
is a reasonable target. There is a memory target for BlueStore OSDs that is advised.
.. tip:: when we speak of RAM and storage requirements, we often describe
the needs of a single daemon of a given type. A given server as
a whole will thus need at least the sum of the needs of the
daemons that it hosts as well as resources for logs and other operating
system components. Keep in mind that a server's need for RAM
and storage will be greater at startup and when components
fail or are added and the cluster rebalances. In other words,
allow headroom past what you might see used during a calm period
on a small initial cluster footprint.
There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4GB. Factor in a prudent margin for the operating system and defaults to 4GB. Factor in a prudent margin for the operating system and
administrative tasks (like monitoring and metrics) as well as increased administrative tasks (like monitoring and metrics) as well as increased
consumption during recovery: provisioning ~8GB per BlueStore OSD consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
is advised. advised.
Monitors and managers (ceph-mon and ceph-mgr) Monitors and managers (ceph-mon and ceph-mgr)
--------------------------------------------- ---------------------------------------------
Monitor and manager daemon memory usage generally scales with the size of the Monitor and manager daemon memory usage scales with the size of the
cluster. Note that at boot-time and during topology changes and recovery these cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to, for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
@ -75,8 +92,8 @@ tuning the following settings:
Metadata servers (ceph-mds) Metadata servers (ceph-mds)
--------------------------- ---------------------------
The metadata daemon memory utilization depends on how much memory its cache is CephFS metadata daemon memory utilization depends on the configured size of
configured to consume. We recommend 1 GB as a minimum for most systems. See its cache. We recommend 1 GB as a minimum for most systems. See
:confval:`mds_cache_memory_limit`. :confval:`mds_cache_memory_limit`.
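For example, raising the MDS cache limit to 8 GiB might look like the following; the value is in bytes and the right size is workload-dependent, so treat this only as a sketch.
.. prompt:: bash #
ceph config set mds mds_cache_memory_limit 8589934592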
@ -88,23 +105,24 @@ operating system's page cache. In Bluestore you can adjust the amount of memory
that the OSD attempts to consume by changing the :confval:`osd_memory_target` that the OSD attempts to consume by changing the :confval:`osd_memory_target`
configuration option. configuration option.
- Setting the :confval:`osd_memory_target` below 2GB is typically not - Setting the :confval:`osd_memory_target` below 2GB is not
recommended (Ceph may fail to keep the memory consumption under 2GB and recommended. Ceph may fail to keep the memory consumption under 2GB and
this may cause extremely slow performance). extremely slow performance is likely.
- Setting the memory target between 2GB and 4GB typically works but may result - Setting the memory target between 2GB and 4GB typically works but may result
in degraded performance: metadata may be read from disk during IO unless the in degraded performance: metadata may need to be read from disk during IO
active data set is relatively small. unless the active data set is relatively small.
- 4GB is the current default :confval:`osd_memory_target` size. This default - 4GB is the current default value for :confval:`osd_memory_target`. This default
was chosen for typical use cases, and is intended to balance memory was chosen for typical use cases, and is intended to balance RAM cost and
requirements and OSD performance. OSD performance.
- Setting the :confval:`osd_memory_target` higher than 4GB can improve - Setting the :confval:`osd_memory_target` higher than 4GB can improve
performance when there are many (small) objects or when large (256GB/OSD performance when there are many (small) objects or when large (256GB/OSD
or more) data sets are processed. or more) data sets are processed. This is especially true with fast
NVMe OSDs.
.. important:: OSD memory autotuning is "best effort". Although the OSD may .. important:: OSD memory management is "best effort". Although the OSD may
unmap memory to allow the kernel to reclaim it, there is no guarantee that unmap memory to allow the kernel to reclaim it, there is no guarantee that
the kernel will actually reclaim freed memory within a specific time the kernel will actually reclaim freed memory within a specific time
frame. This applies especially in older versions of Ceph, where transparent frame. This applies especially in older versions of Ceph, where transparent
@ -113,14 +131,19 @@ configuration option.
pages at the application level to avoid this, but that does not pages at the application level to avoid this, but that does not
guarantee that the kernel will immediately reclaim unmapped memory. The OSD guarantee that the kernel will immediately reclaim unmapped memory. The OSD
may still at times exceed its memory target. We recommend budgeting may still at times exceed its memory target. We recommend budgeting
approximately 20% extra memory on your system to prevent OSDs from going OOM at least 20% extra memory on your system to prevent OSDs from going OOM
(**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in (**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
the kernel reclaiming freed pages. That 20% value might be more or less than the kernel reclaiming freed pages. That 20% value might be more or less than
needed, depending on the exact configuration of the system. needed, depending on the exact configuration of the system.
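A minimal sketch of checking and adjusting the target follows; the 6 GiB value is only an example, and headroom for the operating system and for recovery should still be budgeted as described above.
.. prompt:: bash #
ceph config get osd osd_memory_target
ceph config set osd osd_memory_target 6442450944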
When using the legacy FileStore back end, the page cache is used for caching .. tip:: Configuring the operating system with swap to provide additional
data, so no tuning is normally needed. When using the legacy FileStore backend, virtual memory for daemons is not advised for modern systems. Doing so
the OSD memory consumption is related to the number of PGs per daemon in the may result in lower performance, and your Ceph cluster may well be
happier with a daemon that crashes vs one that slows to a crawl.
When using the legacy FileStore back end, the OS page cache was used for caching
data, so tuning was not normally needed. When using the legacy FileStore backend,
the OSD memory consumption was related to the number of PGs per daemon in the
system. system.
@ -130,13 +153,34 @@ Data Storage
Plan your data storage configuration carefully. There are significant cost and Plan your data storage configuration carefully. There are significant cost and
performance tradeoffs to consider when planning for data storage. Simultaneous performance tradeoffs to consider when planning for data storage. Simultaneous
OS operations and simultaneous requests from multiple daemons for read and OS operations and simultaneous requests from multiple daemons for read and
write operations against a single drive can slow performance. write operations against a single drive can impact performance.
OSDs require substantial storage drive space for RADOS data. We recommend a
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
use a significant fraction of their capacity for metadata, and drives smaller
than 100 gigabytes will not be effective at all.
It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
be provisioned for bulk OSD data.
To get the best performance out of Ceph, provision the following on separate
drives:
* The operating systems
* OSD data
* BlueStore WAL+DB
For more
information on how to effectively use a mix of fast drives and slow drives in
your Ceph cluster, see the `block and block.db`_ section of the Bluestore
Configuration Reference.
Hard Disk Drives Hard Disk Drives
---------------- ----------------
OSDs should have plenty of storage drive space for object data. We recommend a Consider carefully the cost-per-gigabyte advantage
minimum disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage
of larger disks. We recommend dividing the price of the disk drive by the of larger disks. We recommend dividing the price of the disk drive by the
number of gigabytes to arrive at a cost per gigabyte, because larger drives may number of gigabytes to arrive at a cost per gigabyte, because larger drives may
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
@ -146,11 +190,10 @@ per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
1 terabyte disks would generally increase the cost per gigabyte by 1 terabyte disks would generally increase the cost per gigabyte by
40%--rendering your cluster substantially less cost efficient. 40%--rendering your cluster substantially less cost efficient.
.. tip:: Running multiple OSDs on a single SAS / SATA drive .. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
is **NOT** a good idea. NVMe drives, however, can achieve is **NOT** a good idea.
improved performance by being split into two or more OSDs.
.. tip:: Running an OSD and a monitor or a metadata server on a single .. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
drive is also **NOT** a good idea. drive is also **NOT** a good idea.
.. tip:: With spinning disks, the SATA and SAS interface increasingly .. tip:: With spinning disks, the SATA and SAS interface increasingly
@ -162,35 +205,36 @@ Storage drives are subject to limitations on seek time, access time, read and
write times, as well as total throughput. These physical limitations affect write times, as well as total throughput. These physical limitations affect
overall system performance--especially during recovery. We recommend using a overall system performance--especially during recovery. We recommend using a
dedicated (ideally mirrored) drive for the operating system and software, and dedicated (ideally mirrored) drive for the operating system and software, and
one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above). one drive for each Ceph OSD Daemon you run on the host.
Many "slow OSD" issues (when they are not attributable to hardware failure) Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive. arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 22TB HDD uses the same SATA interface as a
3TB HDD from ten years ago: more than seven times the data to squeeze
through the same interface. For this reason, when using HDDs for
OSDs, drives larger than 8TB may be best suited for storage of large
files / objects that are not at all performance-sensitive.
It is technically possible to run multiple Ceph OSD Daemons per SAS / SATA
drive, but this will lead to resource contention and diminish overall
throughput.
To get the best performance out of Ceph, run the following on separate drives:
(1) operating systems, (2) OSD data, and (3) BlueStore db. For more
information on how to effectively use a mix of fast drives and slow drives in
your Ceph cluster, see the `block and block.db`_ section of the Bluestore
Configuration Reference.
Solid State Drives Solid State Drives
------------------ ------------------
Ceph performance can be improved by using solid-state drives (SSDs). This Ceph performance is much improved when using solid-state drives (SSDs). This
reduces random access time and reduces latency while accelerating throughput. reduces random access time and reduces latency while increasing throughput.
SSDs cost more per gigabyte than do hard disk drives, but SSDs often offer SSDs cost more per gigabyte than do HDDs but SSDs often offer
access times that are, at a minimum, 100 times faster than hard disk drives. access times that are, at a minimum, 100 times faster than HDDs.
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
they may offer better economics when TCO is evaluated holistically. they may offer better economics when TCO is evaluated holistically. Notably,
the amortized drive cost for a given number of IOPS is much lower with SSDs
than with HDDs. SSDs do not suffer rotational or seek latency and in addition
to improved client performance, they substantially improve the speed and
client impact of cluster changes including rebalancing when OSDs or Monitors
are added, removed, or fail.
SSDs do not have moving mechanical parts, so they are not necessarily subject SSDs do not have moving mechanical parts, so they are not subject
to the same types of limitations as hard disk drives. SSDs do have significant to many of the limitations of HDDs. SSDs do have significant
limitations though. When evaluating SSDs, it is important to consider the limitations though. When evaluating SSDs, it is important to consider the
performance of sequential reads and writes. performance of sequential and random reads and writes.
.. important:: We recommend exploring the use of SSDs to improve performance. .. important:: We recommend exploring the use of SSDs to improve performance.
However, before making a significant investment in SSDs, we **strongly However, before making a significant investment in SSDs, we **strongly
@ -198,16 +242,36 @@ performance of sequential reads and writes.
SSD in a test configuration in order to gauge performance. SSD in a test configuration in order to gauge performance.
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution. Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS are not the only factor to consider when selecting an SSD for Acceptable IOPS are not the only factor to consider when selecting SSDs for
use with Ceph. use with Ceph. Bargain SSDs are often a false economy: they may experience
"cliffing", which means that after an initial burst, sustained performance
once a limited cache is filled declines considerably. Consider also durability:
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
OSDs dedicated to certain types of sequentially-written read-mostly data, but
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
for Ceph: they almost always feature power loss protection (PLP) and do
not suffer the dramatic cliffing that client (desktop) models may experience.
SSDs have historically been cost prohibitive for object storage, but emerging When using a single (or mirrored pair) SSD for both operating system boot
QLC drives are closing the gap, offering greater density with lower power and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
consumption and less power spent on cooling. HDD OSDs may see a significant and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
performance improvement by offloading WAL+DB onto an SSD. equivalent in TBW (TeraBytes Written)) is suggested. However, for a given write
workload, a larger drive than technically required will provide more endurance
because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles.
To get a better sense of the factors that determine the cost of storage, you SSDs have historically been cost prohibitive for object storage, but
might use the `Storage Networking Industry Association's Total Cost of QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Also, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Many Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
often overkill for this purpose and cost significantly more.
To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Ownership calculator`_ Ownership calculator`_
Partition Alignment Partition Alignment
@ -222,11 +286,11 @@ alignment and example commands that show how to align partitions properly, see
CephFS Metadata Segregation CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
One way that Ceph accelerates CephFS file system performance is by segregating One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents. the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
have to create a pool for CephFS metadata, but you can create a CRUSH map have to manually create a pool for CephFS metadata, but you can create a CRUSH map
hierarchy for your CephFS metadata pool that points only to SSD storage media. hierarchy for your CephFS metadata pool that includes only SSD storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details. See :ref:`CRUSH Device Class<crush-map-device-class>` for details.
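A sketch of steering the metadata pool onto SSD-class devices follows; the rule name and pool name are examples, and the CRUSH Device Class reference above covers the details.
.. prompt:: bash #
ceph osd crush rule create-replicated ssd-only default host ssd
ceph osd pool set cephfs_metadata crush_rule ssd-only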
@ -237,8 +301,20 @@ Disk controllers (HBAs) can have a significant impact on write throughput.
Carefully consider your selection of HBAs to ensure that they do not create a Carefully consider your selection of HBAs to ensure that they do not create a
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
backup can substantially increase hardware and maintenance costs. Some RAID backup can substantially increase hardware and maintenance costs. Many RAID
HBAs can be configured with an IT-mode "personality". HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
streamlined operation.
You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
serves well for boot volume durability. When using SAS or SATA data drives,
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US
dollars even after discounts - a sum that goes a long way toward SSD cost parity.
An HBA-free system may also cost hundreds of US dollars less every year if one
purchases an annual maintenance contract or extended warranty.
.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph .. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
@ -248,10 +324,10 @@ HBAs can be configured with an IT-mode "personality".
Benchmarking Benchmarking
------------ ------------
BlueStore opens block devices in O_DIRECT and uses fsync frequently to ensure BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
that data is safely persisted to media. You can evaluate a drive's low-level frequently to ensure that data is safely persisted to media. You can evaluate a
write performance using ``fio``. For example, 4kB random write performance is drive's low-level write performance using ``fio``. For example, 4kB random write
measured as follows: performance is measured as follows:
.. code-block:: console .. code-block:: console
@ -261,6 +337,7 @@ Write Caches
------------ ------------
Enterprise SSDs and HDDs normally include power loss protection features which Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and
use multi-level caches to speed up direct or synchronous writes. These devices use multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes -- a volatile cache flushed to can be toggled between two caching modes -- a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously. persistent media with fsync, or a non-volatile cache written synchronously.
@ -269,9 +346,9 @@ These two modes are selected by either "enabling" or "disabling" the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device in (volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through". "write back" mode, and when disabled, it uses "write through".
The default configuration (normally caching enabled) may not be optimal, and The default configuration (usually: caching is enabled) may not be optimal, and
OSD performance may be dramatically increased in terms of increased IOPS and OSD performance may be dramatically increased in terms of increased IOPS and
decreased commit_latency by disabling the write cache. decreased commit latency by disabling this write cache.
Users are therefore encouraged to benchmark their devices with ``fio`` as Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their described earlier and persist the optimal cache configuration for their
@ -319,11 +396,11 @@ The write cache can be disabled with those same tools:
=== START OF ENABLE/DISABLE COMMANDS SECTION === === START OF ENABLE/DISABLE COMMANDS SECTION ===
Write cache disabled Write cache disabled
Normally, disabling the cache using ``hdparm``, ``sdparm``, or ``smartctl`` In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
results in the cache_type changing automatically to "write through". If this is results in the cache_type changing automatically to "write through". If this is
not the case, you can try setting it directly as follows. (Users should note not the case, you can try setting it directly as follows. (Users should ensure
that setting cache_type also correctly persists the caching mode of the device that setting cache_type also correctly persists the caching mode of the device
until the next reboot): until the next reboot as some drives require this to be repeated at every boot):
.. code-block:: console .. code-block:: console
@ -367,13 +444,13 @@ until the next reboot):
Additional Considerations Additional Considerations
------------------------- -------------------------
You typically will run multiple OSDs per host, but you should ensure that the Ceph operators typically provision multiple OSDs per host, but you should
aggregate throughput of your OSD drives doesn't exceed the network bandwidth ensure that the aggregate throughput of your OSD drives doesn't exceed the
required to service a client's need to read or write data. You should also network bandwidth required to service a client's read and write operations.
consider what percentage of the overall data the cluster stores on each host. If You should also consider each host's percentage of the cluster's overall capacity. If
the percentage on a particular host is large and the host fails, it can lead to the percentage located on a particular host is large and the host fails, it
problems such as exceeding the ``full ratio``, which causes Ceph to halt can lead to problems such as recovery causing OSDs to exceed the ``full ratio``,
operations as a safety precaution that prevents data loss. which in turn causes Ceph to halt operations to prevent data loss.
When you run multiple OSDs per host, you also need to ensure that the kernel When you run multiple OSDs per host, you also need to ensure that the kernel
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
@ -384,7 +461,11 @@ multiple OSDs per host.
Networks Networks
======== ========
Provision at least 10 Gb/s networking in your racks. Provision at least 10 Gb/s networking in your datacenter, both among Ceph
hosts and between clients and your Ceph cluster. Network link active/active
bonding across separate network switches is strongly recommended both for
increased throughput and for tolerance of network failures and maintenance.
Take care that your bonding hash policy distributes traffic across links.
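On Linux, the negotiated bonding mode and transmit hash policy can be checked through the kernel's bonding interface; this sketch assumes a bond named ``bond0``.
.. prompt:: bash #
grep -E 'Bonding Mode|Transmit Hash Policy' /proc/net/bonding/bond0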
Speed Speed
----- -----
@ -392,13 +473,20 @@ Speed
It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
only one hour to replicate 10 TB across a 10 Gb/s network. only one hour to replicate 10 TB across a 10 Gb/s network.
Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
parallel, and that a 100Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.
Cost Cost
---- ----
The larger the Ceph cluster, the more common OSD failures will be. The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a ``degraded`` state to The faster that a placement group (PG) can recover from a degraded state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes an ``active + clean`` state, the better. Notably, fast recovery minimizes
the likelihood of multiple, overlapping failures that can cause data to become the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your temporarily unavailable or even lost. Of course, when provisioning your
@ -410,10 +498,10 @@ switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack, traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
25/50/100 Gb/s networking as of 2022 is common for production clusters. increasingly 25/50/100 Gb/s networking as of 2022 is common for production clusters.
Top-of-rack (TOR) switches also need fast and redundant uplinks to spind Top-of-rack (TOR) switches also need fast and redundant uplinks to
spine switches / routers, often at least 40 Gb/s. core / spine network switches or routers, often at least 40 Gb/s.
Baseboard Management Controller (BMC) Baseboard Management Controller (BMC)
@ -425,78 +513,103 @@ Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
network for security and administration. Hypervisor SSH access, VM image uploads, network for security and administration. Hypervisor SSH access, VM image uploads,
OS image installs, management sockets, etc. can impose significant loads on a network. OS image installs, management sockets, etc. can impose significant loads on a network.
Running three networks may seem like overkill, but each traffic path represents Running multiple networks may seem like overkill, but each traffic path represents
a potential capacity, throughput and/or performance bottleneck that you should a potential capacity, throughput and/or performance bottleneck that you should
carefully consider before deploying a large scale data cluster. carefully consider before deploying a large scale data cluster.
Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
may reduce costs by wasting fewer expensive ports on faster host switches.
Failure Domains Failure Domains
=============== ===============
A failure domain is any failure that prevents access to one or more OSDs. That A failure domain can be thought of as any component loss that prevents access to
could be a stopped daemon on a host; a disk failure, an OS crash, a one or more OSDs or other Ceph daemons. These could be a stopped daemon on a host;
malfunctioning NIC, a failed power supply, a network outage, a power outage, a storage drive failure, an OS crash, a malfunctioning NIC, a failed power supply,
and so forth. When planning out your hardware needs, you must balance the a network outage, a power outage, and so forth. When planning your hardware
temptation to reduce costs by placing too many responsibilities into too few deployment, you must balance the risk of reducing costs by placing too many
failure domains, and the added costs of isolating every potential failure responsibilities into too few failure domains against the added costs of
domain. isolating every potential failure domain.
Minimum Hardware Recommendations Minimum Hardware Recommendations
================================ ================================
Ceph can run on inexpensive commodity hardware. Small production clusters Ceph can run on inexpensive commodity hardware. Small production clusters
and development clusters can run successfully with modest hardware. and development clusters can run successfully with modest hardware. As
we noted above: when we speak of CPU *cores*, we mean *threads* when
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
provides two logical CPU threads; other CPU architectures may vary.
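Because the table below counts logical threads, it can be useful to check what a node actually presents to the operating system; exact output will vary by hardware and ``lscpu`` version::

    nproc                                        # number of logical CPUs (threads)
    lscpu | grep -E '(Thread|Core|Socket)\(s\)'  # threads per core, cores per socket, sockets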
Keep in mind that many factors influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
a trio of Raspberry Pis will get by with fewer resources than a production
deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher-Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough how critical it is to use enterprise-quality storage
media for production workloads.
Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.
+--------------+----------------+-----------------------------------------+ +--------------+----------------+-----------------------------------------+
| Process | Criteria | Minimum Recommended | | Process | Criteria | Bare Minimum and Recommended |
+==============+================+=========================================+ +==============+================+=========================================+
| ``ceph-osd`` | Processor | - 1 core minimum | | ``ceph-osd`` | Processor | - 1 core minimum, 2 recommended |
| | | - 1 core per 200-500 MB/s | | | | - 1 core per 200-500 MB/s throughput |
| | | - 1 core per 1000-3000 IOPS | | | | - 1 core per 1000-3000 IOPS |
| | | | | | | |
| | | * Results are before replication. | | | | * Results are before replication. |
| | | * Results may vary with different | | | | * Results may vary across CPU and drive |
| | | CPU models and Ceph features. | | | | models and Ceph configuration: |
| | | (erasure coding, compression, etc) | | | | (erasure coding, compression, etc) |
| | | * ARM processors specifically may | | | | * ARM processors specifically may |
| | | require additional cores. | | | | require more cores for performance. |
| | | * SSD OSDs, especially NVMe, will |
| | | benefit from additional cores per OSD.|
| | | * Actual performance depends on many | | | | * Actual performance depends on many |
| | | factors including drives, net, and | | | | factors including drives, net, and |
| | | client throughput and latency. | | | | client throughput and latency. |
| | | Benchmarking is highly recommended. | | | | Benchmarking is highly recommended. |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | RAM | - 4GB+ per daemon (more is better) | | | RAM | - 4GB+ per daemon (more is better) |
| | | - 2-4GB often functions (may be slow) | | | | - 2-4GB may function but may be slow |
| | | - Less than 2GB not recommended | | | | - Less than 2GB is not recommended |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | Volume Storage | 1x storage drive per daemon | | | Storage Drives | 1x storage drive per OSD |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | DB/WAL | 1x SSD partition per daemon (optional) | | | DB/WAL | 1x SSD partition per HDD OSD |
| | (optional) | 4-5x HDD OSDs per DB/WAL SATA SSD |
| | | <= 10 HDD OSDs per DB/WAL NVMe SSD |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs (10GbE+ recommended) | | | Network | 1x 1Gb/s (bonded 10+ Gb/s recommended) |
+--------------+----------------+-----------------------------------------+ +--------------+----------------+-----------------------------------------+
| ``ceph-mon`` | Processor | - 2 cores minimum | | ``ceph-mon`` | Processor | - 2 cores minimum |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | RAM | 2-4GB+ per daemon | | | RAM | 5GB+ per daemon (large / production |
| | | clusters need more) |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | Disk Space | 60 GB per daemon | | | Storage | 100 GB per daemon, SSD is recommended |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs | | | Network | 1x 1Gb/s (10+ Gb/s recommended) |
+--------------+----------------+-----------------------------------------+ +--------------+----------------+-----------------------------------------+
| ``ceph-mds`` | Processor | - 2 cores minimum | | ``ceph-mds`` | Processor | - 2 cores minimum |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | RAM | 2GB+ per daemon | | | RAM | 2GB+ per daemon (more for production) |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | Disk Space | 1 MB per daemon | | | Disk Space | 1 GB per daemon |
| +----------------+-----------------------------------------+ | +----------------+-----------------------------------------+
| | Network | 1x 1GbE+ NICs | | | Network | 1x 1Gb/s (10+ Gb/s recommended) |
+--------------+----------------+-----------------------------------------+ +--------------+----------------+-----------------------------------------+
.. tip:: If you are running an OSD with a single disk, create a .. tip:: If you are running an OSD node with a single storage drive, create a
partition for your volume storage that is separate from the partition partition for your OSD that is separate from the partition
containing the OS. Generally, we recommend separate disks for the containing the OS. We recommend separate drives for the
OS and the volume storage. OS and for OSD storage.
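Following the DB/WAL guidance in the table above, here is a sketch, not a prescription, of provisioning an HDD-backed OSD whose RocksDB/WAL lives on a faster shared device; the device paths are placeholders and are assumed to be blank, unused drives::

    # Single OSD: data on an HDD, DB/WAL on an NVMe partition
    ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1

    # Or let ceph-volume carve one NVMe device into DB slices for several HDD OSDs
    ceph-volume lvm batch /dev/sdb /dev/sdc /dev/sdd /dev/sde --db-devices /dev/nvme0n1

With cephadm-managed clusters the same intent is usually expressed through an OSD service specification rather than by invoking ceph-volume by hand.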

View File

@ -35,20 +35,38 @@ Linux Kernel
Platforms Platforms
========= =========
The charts below show how Ceph's requirements map onto various Linux The chart below shows which Linux platforms Ceph provides packages for, and
platforms. Generally speaking, there is very little dependence on which platforms Ceph has been tested on.
specific distributions outside of the kernel and system initialization
package (i.e., sysvinit, systemd).
+--------------+--------+------------------------+--------------------------------+-------------------+-----------------+ Ceph does not require a specific Linux distribution. Ceph can run on any
| Release Name | Tag | CentOS | Ubuntu | OpenSUSE :sup:`C` | Debian :sup:`C` | distribution that includes a supported kernel and supported system startup
+==============+========+========================+================================+===================+=================+ framework, for example ``sysvinit`` or ``systemd``. Ceph is sometimes ported to
| Quincy | 17.2.z | 8 :sup:`A` | 20.04 :sup:`A` | 15.3 | 11 | non-Linux systems but these are not supported by the core Ceph effort.
+--------------+--------+------------------------+--------------------------------+-------------------+-----------------+
| Pacific | 16.2.z | 8 :sup:`A` | 18.04 :sup:`C`, 20.04 :sup:`A` | 15.2 | 10, 11 |
+--------------+--------+------------------------+--------------------------------+-------------------+-----------------+ +---------------+---------------+-----------------+------------------+------------------+
| Octopus | 15.2.z | 7 :sup:`B` 8 :sup:`A` | 18.04 :sup:`C`, 20.04 :sup:`A` | 15.2 | 10 | | | Reef (18.2.z) | Quincy (17.2.z) | Pacific (16.2.z) | Octopus (15.2.z) |
+--------------+--------+------------------------+--------------------------------+-------------------+-----------------+ +===============+===============+=================+==================+==================+
| CentOS 7 | | | A | B |
+---------------+---------------+-----------------+------------------+------------------+
| CentOS 8 | A | A | A | A |
+---------------+---------------+-----------------+------------------+------------------+
| CentOS 9 | A | | | |
+---------------+---------------+-----------------+------------------+------------------+
| Debian 10 | C | | C | C |
+---------------+---------------+-----------------+------------------+------------------+
| Debian 11 | C | C | C | |
+---------------+---------------+-----------------+------------------+------------------+
| OpenSUSE 15.2 | C | | C | C |
+---------------+---------------+-----------------+------------------+------------------+
| OpenSUSE 15.3 | C | C | | |
+---------------+---------------+-----------------+------------------+------------------+
| Ubuntu 18.04 | | | C | C |
+---------------+---------------+-----------------+------------------+------------------+
| Ubuntu 20.04 | A | A | A | A |
+---------------+---------------+-----------------+------------------+------------------+
| Ubuntu 22.04 | A | | | |
+---------------+---------------+-----------------+------------------+------------------+
- **A**: Ceph provides packages and has done comprehensive tests on the software in them. - **A**: Ceph provides packages and has done comprehensive tests on the software in them.
- **B**: Ceph provides packages and has done basic tests on the software in them. - **B**: Ceph provides packages and has done basic tests on the software in them.
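Whatever distribution is chosen, the two prerequisites called out above, a supported kernel and a supported init framework, are quick to verify on a candidate node; this is only a sketch and the output will differ between systems::

    uname -r              # running kernel version
    ps -p 1 -o comm=      # prints "systemd" on systemd systems, "init" under sysvinit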

View File

@ -141,19 +141,51 @@ function install_pkg_on_ubuntu {
fi fi
} }
boost_ver=1.79
function clean_boost_on_ubuntu {
in_jenkins && echo "CI_DEBUG: Running clean_boost_on_ubuntu() in install-deps.sh"
# Find currently installed version. If there are multiple
# versions, they end up newline separated
local installed_ver=$(apt -qq list --installed ceph-libboost*-dev 2>/dev/null |
cut -d' ' -f2 |
cut -d'.' -f1,2 |
sort -u)
# If installed_ver contains whitespace, we can't really count on it,
# but otherwise, bail out if the version installed is the version
# we want.
if test -n "$installed_ver" &&
echo -n "$installed_ver" | tr '[:space:]' ' ' | grep -v -q ' '; then
if echo "$installed_ver" | grep -q "^$boost_ver"; then
return
fi
fi
# Historical packages
$SUDO rm -f /etc/apt/sources.list.d/ceph-libboost*.list
# Currently used
$SUDO rm -f /etc/apt/sources.list.d/libboost.list
# Refresh package list so things aren't in the available list.
$SUDO env DEBIAN_FRONTEND=noninteractive apt-get update -y || true
# Remove all ceph-libboost packages. We have an early return if
# the desired version is already (and the only) version installed,
# so no need to spare it.
if test -n "$installed_ver"; then
$SUDO env DEBIAN_FRONTEND=noninteractive apt-get -y --fix-missing remove "ceph-libboost*"
fi
}
function install_boost_on_ubuntu { function install_boost_on_ubuntu {
in_jenkins && echo "CI_DEBUG: Running install_boost_on_ubuntu() in install-deps.sh" in_jenkins && echo "CI_DEBUG: Running install_boost_on_ubuntu() in install-deps.sh"
local ver=1.79 # Once we get to this point, clean_boost_on_ubuntu() should ensure
# that there is no more than one installed version.
local installed_ver=$(apt -qq list --installed ceph-libboost*-dev 2>/dev/null | local installed_ver=$(apt -qq list --installed ceph-libboost*-dev 2>/dev/null |
grep -e 'libboost[0-9].[0-9]\+-dev' | grep -e 'libboost[0-9].[0-9]\+-dev' |
cut -d' ' -f2 | cut -d' ' -f2 |
cut -d'.' -f1,2) cut -d'.' -f1,2)
if test -n "$installed_ver"; then if test -n "$installed_ver"; then
if echo "$installed_ver" | grep -q "^$ver"; then if echo "$installed_ver" | grep -q "^$boost_ver"; then
return return
else
$SUDO env DEBIAN_FRONTEND=noninteractive apt-get -y remove "ceph-libboost.*${installed_ver}.*"
$SUDO rm -f /etc/apt/sources.list.d/ceph-libboost${installed_ver}.list
fi fi
fi fi
local codename=$1 local codename=$1
@ -164,22 +196,22 @@ function install_boost_on_ubuntu {
$sha1 \ $sha1 \
$codename \ $codename \
check \ check \
ceph-libboost-atomic$ver-dev \ ceph-libboost-atomic${boost_ver}-dev \
ceph-libboost-chrono$ver-dev \ ceph-libboost-chrono${boost_ver}-dev \
ceph-libboost-container$ver-dev \ ceph-libboost-container${boost_ver}-dev \
ceph-libboost-context$ver-dev \ ceph-libboost-context${boost_ver}-dev \
ceph-libboost-coroutine$ver-dev \ ceph-libboost-coroutine${boost_ver}-dev \
ceph-libboost-date-time$ver-dev \ ceph-libboost-date-time${boost_ver}-dev \
ceph-libboost-filesystem$ver-dev \ ceph-libboost-filesystem${boost_ver}-dev \
ceph-libboost-iostreams$ver-dev \ ceph-libboost-iostreams${boost_ver}-dev \
ceph-libboost-program-options$ver-dev \ ceph-libboost-program-options${boost_ver}-dev \
ceph-libboost-python$ver-dev \ ceph-libboost-python${boost_ver}-dev \
ceph-libboost-random$ver-dev \ ceph-libboost-random${boost_ver}-dev \
ceph-libboost-regex$ver-dev \ ceph-libboost-regex${boost_ver}-dev \
ceph-libboost-system$ver-dev \ ceph-libboost-system${boost_ver}-dev \
ceph-libboost-test$ver-dev \ ceph-libboost-test${boost_ver}-dev \
ceph-libboost-thread$ver-dev \ ceph-libboost-thread${boost_ver}-dev \
ceph-libboost-timer$ver-dev ceph-libboost-timer${boost_ver}-dev
} }
function install_libzbd_on_ubuntu { function install_libzbd_on_ubuntu {
@ -330,6 +362,9 @@ else
case "$ID" in case "$ID" in
debian|ubuntu|devuan|elementary|softiron) debian|ubuntu|devuan|elementary|softiron)
echo "Using apt-get to install dependencies" echo "Using apt-get to install dependencies"
# Put this before any other invocation of apt so it can clean
# up in a broken case.
clean_boost_on_ubuntu
$SUDO apt-get install -y devscripts equivs $SUDO apt-get install -y devscripts equivs
$SUDO apt-get install -y dpkg-dev $SUDO apt-get install -y dpkg-dev
ensure_python3_sphinx_on_ubuntu ensure_python3_sphinx_on_ubuntu

View File

@ -132,7 +132,7 @@ build_dashboard_frontend() {
$CURR_DIR/src/tools/setup-virtualenv.sh $TEMP_DIR $CURR_DIR/src/tools/setup-virtualenv.sh $TEMP_DIR
$TEMP_DIR/bin/pip install nodeenv $TEMP_DIR/bin/pip install nodeenv
$TEMP_DIR/bin/nodeenv --verbose -p --node=14.15.1 $TEMP_DIR/bin/nodeenv --verbose -p --node=18.17.0
cd src/pybind/mgr/dashboard/frontend cd src/pybind/mgr/dashboard/frontend
. $TEMP_DIR/bin/activate . $TEMP_DIR/bin/activate

View File

@ -3,3 +3,4 @@ log-rotate:
ceph-osd: 10G ceph-osd: 10G
tasks: tasks:
- ceph: - ceph:
create_rbd_pool: false

View File

@ -10,3 +10,4 @@ overrides:
- \(MDS_ALL_DOWN\) - \(MDS_ALL_DOWN\)
- \(MDS_UP_LESS_THAN_MAX\) - \(MDS_UP_LESS_THAN_MAX\)
- \(FS_INLINE_DATA_DEPRECATED\) - \(FS_INLINE_DATA_DEPRECATED\)
- \(POOL_APP_NOT_ENABLED\)

View File

@ -0,0 +1 @@
../all/centos_8.yaml

View File

@ -0,0 +1 @@
../all/centos_latest.yaml

View File

@ -0,0 +1 @@
../all/centos_latest.yaml

View File

@ -0,0 +1 @@
../all/ubuntu_20.04.yaml

View File

@ -1 +1 @@
../all/ubuntu_20.04.yaml ../all/ubuntu_latest.yaml

Some files were not shown because too many files have changed in this diff.