Mirror of https://git.proxmox.com/git/ceph.git (synced 2025-04-28 12:39:22 +00:00)

commit aee94f6923 (parent 27f45121cc)

update ceph source to reef 18.2.1

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
ceph/.github/pull_request_template.md (vendored)
@@ -22,7 +22,9 @@
 ## Contribution Guidelines

 - To sign and title your commits, please refer to [Submitting Patches to Ceph](https://github.com/ceph/ceph/blob/main/SubmittingPatches.rst).

-- If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to [Submitting Patches to Ceph - Backports](https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst) for the proper workflow.
+- If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to [Submitting Patches to Ceph - Backports](https://github.com/ceph/ceph/blob/master/SubmittingPatches-backports.rst) for the proper workflow.
+
+- When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an `x` between the brackets: `[x]`. Spaces and capitalization matter when checking off items this way.

 ## Checklist
 - Tracker (select at least one)

@@ -62,4 +64,5 @@
 - `jenkins test ceph-volume all`
 - `jenkins test ceph-volume tox`
 - `jenkins test windows`
+- `jenkins test rook e2e`
 </details>
ceph/CMakeLists.txt

@@ -1,7 +1,7 @@
 cmake_minimum_required(VERSION 3.16)

 project(ceph
-  VERSION 18.2.0
+  VERSION 18.2.1
   LANGUAGES CXX C ASM)

 cmake_policy(SET CMP0028 NEW)
ceph/PendingReleaseNotes

@@ -1,3 +1,53 @@
+>=19.0.0
+
+* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in
+  multi-site. Previously, the replicas of such objects were corrupted on decryption.
+  A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to
+  identify these original multipart uploads. The ``LastModified`` timestamp of any
+  identified object is incremented by 1ns to cause peer zones to replicate it again.
+  For multi-site deployments that make any use of Server-Side Encryption, we
+  recommended running this command against every bucket in every zone after all
+  zones have upgraded.
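A per-bucket invocation of the new tool might look like the following sketch; the `--bucket` flag and the bucket name `mybucket` are illustrative assumptions, since the note above only names the subcommand:

    # Re-identify SSE multipart uploads for one bucket so peers replicate them again.
    radosgw-admin bucket resync encrypted multipart --bucket=mybucket

Iterating over the output of `radosgw-admin bucket list` in every zone is one way to cover every bucket, as the note recommends.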
+* CEPHFS: MDS evicts clients which are not advancing their request tids which causes
+  a large buildup of session metadata resulting in the MDS going read-only due to
+  the RADOS operation exceeding the size threshold. `mds_session_metadata_threshold`
+  config controls the maximum size that a (encoded) session metadata can grow.
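The threshold can be inspected or adjusted through the usual `ceph config` interface; the value below is purely illustrative, not a recommended setting:

    # Show the current per-session metadata size limit, then raise it to 16 MiB (example value only).
    ceph config get mds mds_session_metadata_threshold
    ceph config set mds mds_session_metadata_threshold 16777216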
+* RGW: New tools have been added to radosgw-admin for identifying and
+  correcting issues with versioned bucket indexes. Historical bugs with the
+  versioned bucket index transaction workflow made it possible for the index
+  to accumulate extraneous "book-keeping" olh entries and plain placeholder
+  entries. In some specific scenarios where clients made concurrent requests
+  referencing the same object key, it was likely that a lot of extra index
+  entries would accumulate. When a significant number of these entries are
+  present in a single bucket index shard, they can cause high bucket listing
+  latencies and lifecycle processing failures. To check whether a versioned
+  bucket has unnecessary olh entries, users can now run ``radosgw-admin
+  bucket check olh``. If the ``--fix`` flag is used, the extra entries will
+  be safely removed. A distinct issue from the one described thus far, it is
+  also possible that some versioned buckets are maintaining extra unlinked
+  objects that are not listable from the S3/ Swift APIs. These extra objects
+  are typically a result of PUT requests that exited abnormally, in the middle
+  of a bucket index transaction - so the client would not have received a
+  successful response. Bugs in prior releases made these unlinked objects easy
+  to reproduce with any PUT request that was made on a bucket that was actively
+  resharding. Besides the extra space that these hidden, unlinked objects
+  consume, there can be another side effect in certain scenarios, caused by
+  the nature of the failure mode that produced them, where a client of a bucket
+  that was a victim of this bug may find the object associated with the key to
+  be in an inconsistent state. To check whether a versioned bucket has unlinked
+  entries, users can now run ``radosgw-admin bucket check unlinked``. If the
+  ``--fix`` flag is used, the unlinked objects will be safely removed. Finally,
+  a third issue made it possible for versioned bucket index stats to be
+  accounted inaccurately. The tooling for recalculating versioned bucket stats
+  also had a bug, and was not previously capable of fixing these inaccuracies.
+  This release resolves those issues and users can now expect that the existing
+  ``radosgw-admin bucket check`` command will produce correct results. We
+  recommend that users with versioned buckets, especially those that existed
+  on prior releases, use these new tools to check whether their buckets are
+  affected and to clean them up accordingly.
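A report-then-repair pass over a single versioned bucket might look like the sketch below; the subcommands and the `--fix` flag come from the note above, while the `--bucket` flag and the name `mybucket` are assumptions for illustration:

    # Report extraneous olh entries and unlinked objects.
    radosgw-admin bucket check olh --bucket=mybucket
    radosgw-admin bucket check unlinked --bucket=mybucket

    # Remove them once the report has been reviewed.
    radosgw-admin bucket check olh --bucket=mybucket --fix
    radosgw-admin bucket check unlinked --bucket=mybucket --fix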
+* mgr/snap-schedule: For clusters with multiple CephFS file systems, all the
+  snap-schedule commands now expect the '--fs' argument.
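For example, listing and adding schedules would now name the target file system explicitly; the path `/`, the `1h` interval and the file system name `cephfs` below are placeholders:

    ceph fs snap-schedule list / --fs cephfs
    ceph fs snap-schedule add / 1h --fs cephfs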
+
 >=18.0.0

 * The RGW policy parser now rejects unknown principals by default. If you are

@@ -171,6 +221,11 @@
   https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/
 * CEPHFS: After recovering a Ceph File System post following the disaster recovery
   procedure, the recovered files under `lost+found` directory can now be deleted.
+  https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
+* mgr/snap_schedule: The snap-schedule mgr module now retains one less snapshot
+  than the number mentioned against the config tunable `mds_max_snaps_per_dir`
+  so that a new snapshot can be created and retained during the next schedule
+  run.
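The retention behaviour follows the MDS-side limit named above, which can be checked with the standard config interface; the value in the comment is only an example:

    # If mds_max_snaps_per_dir is 100, the module now retains at most 99 snapshots per directory.
    ceph config get mds mds_max_snaps_per_dir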

 >=17.2.1

ceph/README.md

@@ -1,23 +1,25 @@
 # Ceph - a scalable distributed storage system

-Please see https://ceph.com/ for current info.
+See https://ceph.com/ for current information about Ceph.


 ## Contributing Code

-Most of Ceph is dual licensed under the LGPL version 2.1 or 3.0. Some
-miscellaneous code is under a BSD-style license or is public domain.
-The documentation is licensed under Creative Commons
-Attribution Share Alike 3.0 (CC-BY-SA-3.0). There are a handful of headers
-included here that are licensed under the GPL. Please see the file
-COPYING for a full inventory of licenses by file.
-
-Code contributions must include a valid "Signed-off-by" acknowledging
-the license for the modified or contributed file. Please see the file
-SubmittingPatches.rst for details on what that means and on how to
-generate and submit patches.
-
-We do not require assignment of copyright to contribute code; code is
+Most of Ceph is dual-licensed under the LGPL version 2.1 or 3.0. Some
+miscellaneous code is either public domain or licensed under a BSD-style
+license.
+
+The Ceph documentation is licensed under Creative Commons Attribution Share
+Alike 3.0 (CC-BY-SA-3.0).
+
+Some headers included in the `ceph/ceph` repository are licensed under the GPL.
+See the file `COPYING` for a full inventory of licenses by file.
+
+All code contributions must include a valid "Signed-off-by" line. See the file
+`SubmittingPatches.rst` for details on this and instructions on how to generate
+and submit patches.
+
+Assignment of copyright is not required to contribute code. Code is
 contributed under the terms of the applicable license.

@@ -33,10 +35,11 @@ command on a system that has git installed:

     git clone https://github.com/ceph/ceph.git

-When the ceph/ceph repository has been cloned to your system, run the following
-command to check out the git submodules associated with the ceph/ceph
-repository:
+When the `ceph/ceph` repository has been cloned to your system, run the
+following commands to move into the cloned `ceph/ceph` repository and to check
+out the git submodules associated with it:

+    cd ceph
     git submodule update --init --recursive


@@ -63,34 +66,42 @@ Install the ``python3-routes`` package:

 These instructions are meant for developers who are compiling the code for
 development and testing. To build binaries that are suitable for installation
-we recommend that you build .deb or .rpm packages, or refer to ``ceph.spec.in``
-or ``debian/rules`` to see which configuration options are specified for
-production builds.
+we recommend that you build `.deb` or `.rpm` packages, or refer to
+``ceph.spec.in`` or ``debian/rules`` to see which configuration options are
+specified for production builds.

-Build instructions:
+To build Ceph, make sure that you are in the top-level `ceph` directory that
+contains `do_cmake.sh` and `CONTRIBUTING.rst` and run the following commands:

     ./do_cmake.sh
     cd build
     ninja

-``do_cmake.sh`` defaults to creating a debug build of Ceph that can be up to 5x
-slower with some workloads. Pass ``-DCMAKE_BUILD_TYPE=RelWithDebInfo`` to
-``do_cmake.sh`` to create a non-debug release.
+``do_cmake.sh`` by default creates a "debug build" of Ceph, which can be up to
+five times slower than a non-debug build. Pass
+``-DCMAKE_BUILD_TYPE=RelWithDebInfo`` to ``do_cmake.sh`` to create a non-debug
+build.
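A non-debug build therefore differs from the default invocation above only in the build-type flag passed through `do_cmake.sh`; this sketch assumes the script forwards its arguments to CMake, as the paragraph above implies:

    ./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo
    cd build
    ninja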

-The number of jobs used by `ninja` is derived from the number of CPU cores of
-the building host if unspecified. Use the `-j` option to limit the job number
-if the build jobs are running out of memory. On average, each job takes around
-2.5GiB memory.
+[Ninja](https://ninja-build.org/) is the buildsystem used by the Ceph project
+to build test builds. The number of jobs used by `ninja` is derived from the
+number of CPU cores of the building host if unspecified. Use the `-j` option to
+limit the job number if the build jobs are running out of memory. If you
+attempt to run `ninja` and receive a message that reads `g++: fatal error:
+Killed signal terminated program cc1plus`, then you have run out of memory.
+Using the `-j` option with an argument appropriate to the hardware on which the
+`ninja` command is run is expected to result in a successful build. For example,
+to limit the job number to 3, run the command `ninja -j 3`. On average, each
+`ninja` job run in parallel needs approximately 2.5 GiB of RAM.

-This assumes that you make your build directory a subdirectory of the ceph.git
-checkout. If you put it elsewhere, just point `CEPH_GIT_DIR` to the correct
-path to the checkout. Additional CMake args can be specified by setting ARGS
-before invoking ``do_cmake.sh``. See [cmake options](#cmake-options)
-for more details. For example:
+This documentation assumes that your build directory is a subdirectory of the
+`ceph.git` checkout. If the build directory is located elsewhere, point
+`CEPH_GIT_DIR` to the correct path of the checkout. Additional CMake args can
+be specified by setting ARGS before invoking ``do_cmake.sh``. See [cmake
+options](#cmake-options) for more details. For example:

     ARGS="-DCMAKE_C_COMPILER=gcc-7" ./do_cmake.sh

-To build only certain targets use:
+To build only certain targets, run a command of the following form:

     ninja [target name]


@@ -153,24 +164,25 @@ are committed to git.)

 ## Running a test cluster

-To run a functional test cluster,
+From the `ceph/` directory, run the following commands to launch a test Ceph
+cluster:

     cd build
     ninja vstart        # builds just enough to run vstart
     ../src/vstart.sh --debug --new -x --localhost --bluestore
     ./bin/ceph -s

-Almost all of the usual commands are available in the bin/ directory.
-For example,
+Most Ceph commands are available in the `bin/` directory. For example:

-    ./bin/rados -p rbd bench 30 write
     ./bin/rbd create foo --size 1000
+    ./bin/rados -p foo bench 30 write

-To shut down the test cluster,
+To shut down the test cluster, run the following command from the `build/`
+directory:

     ../src/stop.sh

-To start or stop individual daemons, the sysvinit script can be used:
+Use the sysvinit script to start or stop individual daemons:

     ./bin/init-ceph restart osd.0
     ./bin/init-ceph stop
ceph/ceph.spec.in

@@ -170,7 +170,7 @@
 # main package definition
 #################################################################################
 Name: ceph
-Version: 18.2.0
+Version: 18.2.1
 Release: 0%{?dist}
 %if 0%{?fedora} || 0%{?rhel}
 Epoch: 2

@@ -186,7 +186,7 @@ License: LGPL-2.1 and LGPL-3.0 and CC-BY-SA-3.0 and GPL-2.0 and BSL-1.0 and BSD-
 Group: System/Filesystems
 %endif
 URL: http://ceph.com/
-Source0: %{?_remote_tarball_prefix}ceph-18.2.0.tar.bz2
+Source0: %{?_remote_tarball_prefix}ceph-18.2.1.tar.bz2
 %if 0%{?suse_version}
 # _insert_obs_source_lines_here
 ExclusiveArch: x86_64 aarch64 ppc64le s390x

@@ -1292,7 +1292,7 @@ This package provides a Ceph MIB for SNMP traps.
 # common
 #################################################################################
 %prep
-%autosetup -p1 -n ceph-18.2.0
+%autosetup -p1 -n ceph-18.2.1

 %build
 # Disable lto on systems that do not support symver attribute
ceph/debian/changelog

@@ -1,7 +1,13 @@
-ceph (18.2.0-1jammy) jammy; urgency=medium
+ceph (18.2.1-1jammy) jammy; urgency=medium


- -- Jenkins Build Slave User <jenkins-build@braggi17.front.sepia.ceph.com>  Thu, 03 Aug 2023 18:57:50 +0000
+ -- Jenkins Build Slave User <jenkins-build@braggi13.front.sepia.ceph.com>  Mon, 11 Dec 2023 22:07:48 +0000
+
+ceph (18.2.1-1) stable; urgency=medium
+
+  * New upstream release
+
+ -- Ceph Release Team <ceph-maintainers@ceph.io>  Mon, 11 Dec 2023 21:55:36 +0000

 ceph (18.2.0-1) stable; urgency=medium

(deleted file; name not shown in this view)

@@ -1 +0,0 @@
-README
ceph/debian/ceph-mon.postinst

@@ -1,3 +1,4 @@
+#!/bin/sh
 # vim: set noet ts=8:
 # postinst script for ceph-mon
 #
ceph/debian/ceph-osd.postinst

@@ -1,3 +1,4 @@
+#!/bin/sh
 # vim: set noet ts=8:
 # postinst script for ceph-osd
 #
ceph/debian/compat

@@ -1 +1 @@
-9
+12
ceph/debian/control

@@ -4,7 +4,7 @@ Priority: optional
 Homepage: http://ceph.com/
 Vcs-Git: git://github.com/ceph/ceph.git
 Vcs-Browser: https://github.com/ceph/ceph
-Maintainer: Ceph Maintainers <ceph-maintainers@lists.ceph.com>
+Maintainer: Ceph Maintainers <ceph-maintainers@ceph.io>
 Uploaders: Ken Dreyer <kdreyer@redhat.com>,
            Alfredo Deza <adeza@redhat.com>,
 Build-Depends: automake,

@@ -20,8 +20,7 @@ Build-Depends: automake,
                git,
                golang,
                gperf,
-               g++ (>= 7),
+               g++ (>= 11),
-               hostname <pkg.ceph.check>,
                javahelper,
                jq <pkg.ceph.check>,
                jsonnet <pkg.ceph.check>,

@@ -135,9 +134,6 @@ Package: ceph-base
 Architecture: linux-any
 Depends: binutils,
          ceph-common (= ${binary:Version}),
-         debianutils,
-         findutils,
-         grep,
          logrotate,
          parted,
          psmisc,

@@ -187,8 +183,9 @@ Description: debugging symbols for ceph-base

 Package: cephadm
 Architecture: linux-any
-Recommends: podman (>= 2.0.2) | docker.io
+Recommends: podman (>= 2.0.2) | docker.io | docker-ce
 Depends: lvm2,
+         python3,
          ${python3:Depends},
 Description: cephadm utility to bootstrap ceph daemons with systemd and containers
  Ceph is a massively scalable, open-source, distributed

@@ -431,7 +428,6 @@ Depends: ceph-osd (= ${binary:Version}),
          e2fsprogs,
          lvm2,
          parted,
-         util-linux,
          xfsprogs,
          ${misc:Depends},
          ${python3:Depends}

@@ -759,7 +755,7 @@ Architecture: any
 Section: debug
 Priority: extra
 Depends: libsqlite3-mod-ceph (= ${binary:Version}),
-         libsqlite3-0-dbgsym
+         libsqlite3-0-dbgsym,
+         ${misc:Depends},
 Description: debugging symbols for libsqlite3-mod-ceph
  A SQLite3 VFS for storing and manipulating databases stored on Ceph's RADOS

@@ -1207,14 +1203,14 @@ Description: Java Native Interface library for CephFS Java bindings
 Package: rados-objclass-dev
 Architecture: linux-any
 Section: libdevel
-Depends: librados-dev (= ${binary:Version}) ${misc:Depends},
+Depends: librados-dev (= ${binary:Version}), ${misc:Depends},
 Description: RADOS object class development kit.
 .
 This package contains development files needed for building RADOS object class plugins.

 Package: cephfs-shell
 Architecture: all
-Depends: ${misc:Depends}
+Depends: ${misc:Depends},
          ${python3:Depends}
 Description: interactive shell for the Ceph distributed file system
  Ceph is a massively scalable, open-source, distributed

@@ -1227,7 +1223,7 @@ Description: interactive shell for the Ceph distributed file system

 Package: cephfs-top
 Architecture: all
-Depends: ${misc:Depends}
+Depends: ${misc:Depends},
          ${python3:Depends}
 Description: This package provides a top(1) like utility to display various
  filesystem metrics in realtime.
ceph/debian/copyright

@@ -1,6 +1,6 @@
-Format-Specification: http://anonscm.debian.org/viewvc/dep/web/deps/dep5/copyright-format.xml?revision=279&view=markup
-Name: ceph
-Maintainer: Sage Weil <sage@newdream.net>
+Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
+Upstream-Name: ceph
+Upstream-Contact: Ceph Developers <dev@ceph.io>
 Source: http://ceph.com/

 Files: *

@@ -180,3 +180,553 @@ Files: src/include/timegm.h
 Copyright (C) Copyright Howard Hinnant
 Copyright (C) Copyright 2010-2011 Vicente J. Botet Escriba
 License: Boost Software License, Version 1.0
+
+License: Apache-2.0
+ [standard Apache License 2.0 notice; the complete text can be found in
+ "/usr/share/common-licenses/Apache-2.0"]
+
+License: GPL-2
+ [standard GNU GPL version 2 notice; on Debian systems the complete text can be
+ found in `/usr/share/common-licenses/GPL-2']
+
+License: GPL-2+
+ [standard GNU GPL version 2 or later notice; complete text in
+ `/usr/share/common-licenses/GPL-2']
+
+License: GPL-3+
+ [standard GNU GPL version 3 or later notice; complete text in
+ `/usr/share/common-licenses/GPL-3']
+
+License: LGPL-2.1
+ [standard GNU LGPL version 2.1 notice; complete text in
+ `/usr/share/common-licenses/LGPL-2.1']
+
+License: LGPL-2.1+
+ [standard GNU LGPL version 2.1 or later notice; complete text in
+ `/usr/share/common-licenses/LGPL-2.1']
+
+License: LGPL-2+
+ [standard GNU LGPL version 2 or later notice; complete text in
+ `/usr/share/common-licenses/LGPL-2']
+
+License: MIT
+ [standard MIT permission notice and warranty disclaimer]
+
+License: CC-BY-SA-3.0
+ [full text of the Creative Commons Attribution-ShareAlike 3.0 Unported license]
+
+License: BSD-3-clause
+ [standard 3-clause BSD redistribution conditions and warranty disclaimer]
@ -5,7 +5,6 @@ export DESTDIR=$(CURDIR)/debian/tmp

 include /usr/share/dpkg/default.mk

-extraopts += -DCMAKE_C_COMPILER=gcc-11 -DCMAKE_CXX_COMPILER=g++-11
 ifneq (,$(findstring WITH_STATIC_LIBSTDCXX,$(CEPH_EXTRA_CMAKE_ARGS)))
 # dh_auto_build sets LDFLAGS with `dpkg-buildflags --get LDFLAGS` on ubuntu,
 # which makes the application aborts when the shared library throws
@ -59,19 +58,15 @@ py3_overrides_packages := $(basename $(notdir $(wildcard debian/*.requires)))
 py3_packages := cephfs-shell cephfs-top cephadm

 %:
-        dh $@ --buildsystem=cmake --with javahelper,python3,systemd --parallel
+        dh $@ --buildsystem=cmake --with javahelper,python3 --parallel

 override_dh_auto_configure:
         env | sort
         dh_auto_configure --buildsystem=cmake -- $(extraopts) $(CEPH_EXTRA_CMAKE_ARGS)

-override_dh_auto_build:
-        dh_auto_build --buildsystem=cmake
-        cp src/init-radosgw debian/radosgw.init
-
 override_dh_auto_clean:
         dh_auto_clean --buildsystem=cmake
-        rm -f debian/radosgw.init debian/ceph.logrotate
+        rm -f debian/radosgw.init debian/ceph.logrotate debian/ceph-base.docs

 override_dh_auto_install:
         dh_auto_install --buildsystem=cmake --destdir=$(DESTDIR)
@ -87,13 +82,12 @@ override_dh_auto_install:
 override_dh_installchangelogs:
         dh_installchangelogs --exclude doc/changelog

-override_dh_installdocs:
-
 override_dh_installlogrotate:
         cp src/logrotate.conf debian/ceph-common.logrotate
         dh_installlogrotate -pceph-common

 override_dh_installinit:
+        cp src/init-radosgw debian/radosgw.init
         # install the systemd stuff manually since we have funny service names
         install -d -m0755 debian/ceph-common/etc/default
         install -m0644 etc/default/ceph debian/ceph-common/etc/default/
@ -103,15 +97,9 @@ override_dh_installinit:
         dh_installinit -p ceph-base --name ceph --no-start
         dh_installinit -p radosgw --no-start

-# NOTE: execute systemd helpers so they pickup dh_install'ed units and targets
-        dh_systemd_enable
-        dh_systemd_start --no-restart-on-upgrade
-
-override_dh_systemd_enable:
-        # systemd enable done as part of dh_installinit
-
-override_dh_systemd_start:
-        # systemd start done as part of dh_installinit
+override_dh_installsystemd:
+        # Only enable and start systemd targets
+        dh_installsystemd --no-stop-on-upgrade --no-restart-after-upgrade -Xceph-mon.service -Xceph-osd.service -X ceph-mds.service

 override_dh_strip:
         dh_strip -pceph-mds --dbg-package=ceph-mds-dbg
@ -152,8 +140,12 @@ override_dh_python3:
         @for pkg in $(py3_packages); do \
           dh_python3 -p $$pkg; \
         done
+        dh_python3 -p ceph-base --shebang=/usr/bin/python3
+        dh_python3 -p ceph-common --shebang=/usr/bin/python3
+        dh_python3 -p ceph-fuse --shebang=/usr/bin/python3
+        dh_python3 -p ceph-volume --shebang=/usr/bin/python3

 # do not run tests
 override_dh_auto_test:

-.PHONY: override_dh_autoreconf override_dh_auto_configure override_dh_auto_build override_dh_auto_clean override_dh_auto_install override_dh_installdocs override_dh_installlogrotate override_dh_installinit override_dh_systemd_start override_dh_strip override_dh_auto_test
+.PHONY: override_dh_autoreconf override_dh_auto_configure override_dh_auto_clean override_dh_auto_install override_dh_installlogrotate override_dh_installinit override_dh_strip override_dh_auto_test

@ -30,58 +30,54 @@ A Ceph Storage Cluster consists of multiple types of daemons:

 - :term:`Ceph Manager`
 - :term:`Ceph Metadata Server`

-.. ditaa::
-
-   [diagram: four boxes labeled "OSDs", "Monitors", "Managers", and "MDS"]
-
-A Ceph Monitor maintains a master copy of the cluster map. A cluster of Ceph
-monitors ensures high availability should a monitor daemon fail. Storage cluster
-clients retrieve a copy of the cluster map from the Ceph Monitor.
+.. _arch_monitor:
+
+Ceph Monitors maintain the master copy of the cluster map, which they provide
+to Ceph clients. Provisioning multiple monitors within the Ceph cluster ensures
+availability in the event that one of the monitor daemons or its host fails.
+The Ceph monitor provides copies of the cluster map to storage cluster clients.

 A Ceph OSD Daemon checks its own state and the state of other OSDs and reports
 back to monitors.

-A Ceph Manager acts as an endpoint for monitoring, orchestration, and plug-in
+A Ceph Manager serves as an endpoint for monitoring, orchestration, and plug-in
 modules.

 A Ceph Metadata Server (MDS) manages file metadata when CephFS is used to
 provide file services.

-Storage cluster clients and each :term:`Ceph OSD Daemon` use the CRUSH algorithm
-to efficiently compute information about data location, instead of having to
-depend on a central lookup table. Ceph's high-level features include a
-native interface to the Ceph Storage Cluster via ``librados``, and a number of
-service interfaces built on top of ``librados``.
+Storage cluster clients and :term:`Ceph OSD Daemon`\s use the CRUSH algorithm
+to compute information about data location. This means that clients and OSDs
+are not bottlenecked by a central lookup table. Ceph's high-level features
+include a native interface to the Ceph Storage Cluster via ``librados``, and a
+number of service interfaces built on top of ``librados``.


 Storing Data
 ------------

 The Ceph Storage Cluster receives data from :term:`Ceph Client`\s--whether it
 comes through a :term:`Ceph Block Device`, :term:`Ceph Object Storage`, the
-:term:`Ceph File System` or a custom implementation you create using
-``librados``-- which is stored as RADOS objects. Each object is stored on an
-:term:`Object Storage Device`. Ceph OSD Daemons handle read, write, and
-replication operations on storage drives. With the default BlueStore back end,
-objects are stored in a monolithic database-like fashion.
+:term:`Ceph File System`, or a custom implementation that you create by using
+``librados``. The data received by the Ceph Storage Cluster is stored as RADOS
+objects. Each object is stored on an :term:`Object Storage Device` (this is
+also called an "OSD"). Ceph OSDs control read, write, and replication
+operations on storage drives. The default BlueStore back end stores objects
+in a monolithic, database-like fashion.

 .. ditaa::

    [diagram: an object maps to an OSD, which stores it on a drive
    (Object -> OSD -> Drive)]

-Ceph OSD Daemons store data as objects in a flat namespace (e.g., no
-hierarchy of directories). An object has an identifier, binary data, and
-metadata consisting of a set of name/value pairs. The semantics are completely
-up to :term:`Ceph Client`\s. For example, CephFS uses metadata to store file
-attributes such as the file owner, created date, last modified date, and so
-forth.
+Ceph OSD Daemons store data as objects in a flat namespace. This means that
+objects are not stored in a hierarchy of directories. An object has an
+identifier, binary data, and metadata consisting of name/value pairs.
+:term:`Ceph Client`\s determine the semantics of the object data. For example,
+CephFS uses metadata to store file attributes such as the file owner, the
+created date, and the last modified date.

 .. ditaa::
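To make the object model concrete, the following sketch (illustrative only; it is not a Ceph data structure or API) shows the three things every RADOS object carries: an identifier in a flat namespace, opaque binary data, and name/value metadata whose meaning is left entirely to the client.

.. code-block:: python

   from dataclasses import dataclass, field

   @dataclass
   class RadosObjectSketch:
       """Conceptual model only: identifier, binary payload, name/value metadata."""
       object_id: str                                 # flat namespace, no directories
       data: bytes = b""                              # opaque binary payload
       metadata: dict = field(default_factory=dict)   # semantics defined by the client

   # Hypothetical CephFS-style metadata stored alongside the payload:
   obj = RadosObjectSketch(
       object_id="10000000001.00000000",
       data=b"...file contents...",
       metadata={"owner": "alice", "mtime": "2023-12-18T12:00:00Z"},
   )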
@ -100,20 +96,23 @@ forth.

 .. index:: architecture; high availability, scalability

+.. _arch_scalability_and_high_availability:
+
 Scalability and High Availability
 ---------------------------------

-In traditional architectures, clients talk to a centralized component (e.g., a
-gateway, broker, API, facade, etc.), which acts as a single point of entry to a
-complex subsystem. This imposes a limit to both performance and scalability,
-while introducing a single point of failure (i.e., if the centralized component
-goes down, the whole system goes down, too).
+In traditional architectures, clients talk to a centralized component. This
+centralized component might be a gateway, a broker, an API, or a facade. A
+centralized component of this kind acts as a single point of entry to a complex
+subsystem. Architectures that rely upon such a centralized component have a
+single point of failure and incur limits to performance and scalability. If
+the centralized component goes down, the whole system becomes unavailable.

-Ceph eliminates the centralized gateway to enable clients to interact with
-Ceph OSD Daemons directly. Ceph OSD Daemons create object replicas on other
-Ceph Nodes to ensure data safety and high availability. Ceph also uses a cluster
-of monitors to ensure high availability. To eliminate centralization, Ceph
-uses an algorithm called CRUSH.
+Ceph eliminates this centralized component. This enables clients to interact
+with Ceph OSDs directly. Ceph OSDs create object replicas on other Ceph Nodes
+to ensure data safety and high availability. Ceph also uses a cluster of
+monitors to ensure high availability. To eliminate centralization, Ceph uses an
+algorithm called :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)`.


 .. index:: CRUSH; architecture
@ -122,15 +121,15 @@ CRUSH Introduction
 ~~~~~~~~~~~~~~~~~~

 Ceph Clients and Ceph OSD Daemons both use the :abbr:`CRUSH (Controlled
-Replication Under Scalable Hashing)` algorithm to efficiently compute
-information about object location, instead of having to depend on a
-central lookup table. CRUSH provides a better data management mechanism compared
-to older approaches, and enables massive scale by cleanly distributing the work
-to all the clients and OSD daemons in the cluster. CRUSH uses intelligent data
-replication to ensure resiliency, which is better suited to hyper-scale storage.
-The following sections provide additional details on how CRUSH works. For a
-detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized
-Placement of Replicated Data`_.
+Replication Under Scalable Hashing)` algorithm to compute information about
+object location instead of relying upon a central lookup table. CRUSH provides
+a better data management mechanism than do older approaches, and CRUSH enables
+massive scale by distributing the work to all the OSD daemons in the cluster
+and all the clients that communicate with them. CRUSH uses intelligent data
+replication to ensure resiliency, which is better suited to hyper-scale
+storage. The following sections provide additional details on how CRUSH works.
+For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable,
+Decentralized Placement of Replicated Data`_.

 .. index:: architecture; cluster map

@ -139,61 +138,71 @@ Placement of Replicated Data`_.
 Cluster Map
 ~~~~~~~~~~~

-Ceph depends upon Ceph Clients and Ceph OSD Daemons having knowledge of the
-cluster topology, which is inclusive of 5 maps collectively referred to as the
-"Cluster Map":
+In order for a Ceph cluster to function properly, Ceph Clients and Ceph OSDs
+must have current information about the cluster's topology. Current information
+is stored in the "Cluster Map", which is in fact a collection of five maps. The
+five maps that constitute the cluster map are:

-#. **The Monitor Map:** Contains the cluster ``fsid``, the position, name
-   address and port of each monitor. It also indicates the current epoch,
-   when the map was created, and the last time it changed. To view a monitor
-   map, execute ``ceph mon dump``.
+#. **The Monitor Map:** Contains the cluster ``fsid``, the position, the name,
+   the address, and the TCP port of each monitor. The monitor map specifies the
+   current epoch, the time of the monitor map's creation, and the time of the
+   monitor map's last modification. To view a monitor map, run ``ceph mon
+   dump``.

-#. **The OSD Map:** Contains the cluster ``fsid``, when the map was created and
-   last modified, a list of pools, replica sizes, PG numbers, a list of OSDs
-   and their status (e.g., ``up``, ``in``). To view an OSD map, execute
-   ``ceph osd dump``.
+#. **The OSD Map:** Contains the cluster ``fsid``, the time of the OSD map's
+   creation, the time of the OSD map's last modification, a list of pools, a
+   list of replica sizes, a list of PG numbers, and a list of OSDs and their
+   statuses (for example, ``up``, ``in``). To view an OSD map, run ``ceph
+   osd dump``.

-#. **The PG Map:** Contains the PG version, its time stamp, the last OSD
-   map epoch, the full ratios, and details on each placement group such as
-   the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (e.g.,
-   ``active + clean``), and data usage statistics for each pool.
+#. **The PG Map:** Contains the PG version, its time stamp, the last OSD map
+   epoch, the full ratios, and the details of each placement group. This
+   includes the PG ID, the `Up Set`, the `Acting Set`, the state of the PG (for
+   example, ``active + clean``), and data usage statistics for each pool.

 #. **The CRUSH Map:** Contains a list of storage devices, the failure domain
-   hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
-   traversing the hierarchy when storing data. To view a CRUSH map, execute
-   ``ceph osd getcrushmap -o {filename}``; then, decompile it by executing
-   ``crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}``.
-   You can view the decompiled map in a text editor or with ``cat``.
+   hierarchy (for example, ``device``, ``host``, ``rack``, ``row``, ``room``),
+   and rules for traversing the hierarchy when storing data. To view a CRUSH
+   map, run ``ceph osd getcrushmap -o {filename}`` and then decompile it by
+   running ``crushtool -d {comp-crushmap-filename} -o
+   {decomp-crushmap-filename}``. Use a text editor or ``cat`` to view the
+   decompiled map.

 #. **The MDS Map:** Contains the current MDS map epoch, when the map was
    created, and the last time it changed. It also contains the pool for
    storing metadata, a list of metadata servers, and which metadata servers
    are ``up`` and ``in``. To view an MDS map, execute ``ceph fs dump``.

-Each map maintains an iterative history of its operating state changes. Ceph
-Monitors maintain a master copy of the cluster map including the cluster
-members, state, changes, and the overall health of the Ceph Storage Cluster.
+Each map maintains a history of changes to its operating state. Ceph Monitors
+maintain a master copy of the cluster map. This master copy includes the
+cluster members, the state of the cluster, changes to the cluster, and
+information recording the overall health of the Ceph Storage Cluster.

 .. index:: high availability; monitor architecture

 High Availability Monitors
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

-Before Ceph Clients can read or write data, they must contact a Ceph Monitor
-to obtain the most recent copy of the cluster map. A Ceph Storage Cluster
-can operate with a single monitor; however, this introduces a single
-point of failure (i.e., if the monitor goes down, Ceph Clients cannot
-read or write data).
+A Ceph Client must contact a Ceph Monitor and obtain a current copy of the
+cluster map in order to read data from or to write data to the Ceph cluster.

-For added reliability and fault tolerance, Ceph supports a cluster of monitors.
-In a cluster of monitors, latency and other faults can cause one or more
-monitors to fall behind the current state of the cluster. For this reason, Ceph
-must have agreement among various monitor instances regarding the state of the
-cluster. Ceph always uses a majority of monitors (e.g., 1, 2:3, 3:5, 4:6, etc.)
-and the `Paxos`_ algorithm to establish a consensus among the monitors about the
-current state of the cluster.
+It is possible for a Ceph cluster to function properly with only a single
+monitor, but a Ceph cluster that has only a single monitor has a single point
+of failure: if the monitor goes down, Ceph clients will be unable to read data
+from or write data to the cluster.

-For details on configuring monitors, see the `Monitor Config Reference`_.
+Ceph leverages a cluster of monitors in order to increase reliability and fault
+tolerance. When a cluster of monitors is used, however, one or more of the
+monitors in the cluster can fall behind due to latency or other faults. Ceph
+mitigates these negative effects by requiring multiple monitor instances to
+agree about the state of the cluster. To establish consensus among the monitors
+regarding the state of the cluster, Ceph uses the `Paxos`_ algorithm and a
+majority of monitors (for example, one in a cluster that contains only one
+monitor, two in a cluster that contains three monitors, three in a cluster that
+contains five monitors, four in a cluster that contains six monitors, and so
+on).
+
+See the `Monitor Config Reference`_ for more detail on configuring monitors.

 .. index:: architecture; high availability authentication

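The majority rule used for monitor consensus is easy to state in code. The sketch below is not part of Ceph; it is a plain Python illustration of how many monitors must agree before the cluster state is considered authoritative, matching the examples given above.

.. code-block:: python

   def monitors_needed_for_quorum(num_monitors: int) -> int:
       """Smallest majority of a monitor cluster: floor(n / 2) + 1."""
       return num_monitors // 2 + 1

   # Matches the examples in the text: 1 of 1, 2 of 3, 3 of 5, 4 of 6.
   assert [monitors_needed_for_quorum(n) for n in (1, 3, 5, 6)] == [1, 2, 3, 4]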
@ -202,48 +211,57 @@ For details on configuring monitors, see the `Monitor Config Reference`_.
 High Availability Authentication
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-To identify users and protect against man-in-the-middle attacks, Ceph provides
-its ``cephx`` authentication system to authenticate users and daemons.
+The ``cephx`` authentication system is used by Ceph to authenticate users and
+daemons and to protect against man-in-the-middle attacks.

 .. note:: The ``cephx`` protocol does not address data encryption in transport
-   (e.g., SSL/TLS) or encryption at rest.
+   (for example, SSL/TLS) or encryption at rest.

-Cephx uses shared secret keys for authentication, meaning both the client and
-the monitor cluster have a copy of the client's secret key. The authentication
-protocol is such that both parties are able to prove to each other they have a
-copy of the key without actually revealing it. This provides mutual
-authentication, which means the cluster is sure the user possesses the secret
-key, and the user is sure that the cluster has a copy of the secret key.
+``cephx`` uses shared secret keys for authentication. This means that both the
+client and the monitor cluster keep a copy of the client's secret key.
+
+The ``cephx`` protocol makes it possible for each party to prove to the other
+that it has a copy of the key without revealing it. This provides mutual
+authentication and allows the cluster to confirm (1) that the user has the
+secret key and (2) that the user can be confident that the cluster has a copy
+of the secret key.

-A key scalability feature of Ceph is to avoid a centralized interface to the
-Ceph object store, which means that Ceph clients must be able to interact with
-OSDs directly. To protect data, Ceph provides its ``cephx`` authentication
-system, which authenticates users operating Ceph clients. The ``cephx`` protocol
-operates in a manner with behavior similar to `Kerberos`_.
+As stated in :ref:`Scalability and High Availability
+<arch_scalability_and_high_availability>`, Ceph does not have any centralized
+interface between clients and the Ceph object store. By avoiding such a
+centralized interface, Ceph avoids the bottlenecks that attend such centralized
+interfaces. However, this means that clients must interact directly with OSDs.
+Direct interactions between Ceph clients and OSDs require authenticated
+connections. The ``cephx`` authentication system establishes and sustains these
+authenticated connections.

-A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each
-monitor can authenticate users and distribute keys, so there is no single point
-of failure or bottleneck when using ``cephx``. The monitor returns an
-authentication data structure similar to a Kerberos ticket that contains a
-session key for use in obtaining Ceph services. This session key is itself
-encrypted with the user's permanent secret key, so that only the user can
-request services from the Ceph Monitor(s). The client then uses the session key
-to request its desired services from the monitor, and the monitor provides the
-client with a ticket that will authenticate the client to the OSDs that actually
-handle data. Ceph Monitors and OSDs share a secret, so the client can use the
-ticket provided by the monitor with any OSD or metadata server in the cluster.
-Like Kerberos, ``cephx`` tickets expire, so an attacker cannot use an expired
-ticket or session key obtained surreptitiously. This form of authentication will
-prevent attackers with access to the communications medium from either creating
-bogus messages under another user's identity or altering another user's
-legitimate messages, as long as the user's secret key is not divulged before it
-expires.
+The ``cephx`` protocol operates in a manner similar to `Kerberos`_.
+
+A user invokes a Ceph client to contact a monitor. Unlike Kerberos, each
+monitor can authenticate users and distribute keys, which means that there is
+no single point of failure and no bottleneck when using ``cephx``. The monitor
+returns an authentication data structure that is similar to a Kerberos ticket.
+This authentication data structure contains a session key for use in obtaining
+Ceph services. The session key is itself encrypted with the user's permanent
+secret key, which means that only the user can request services from the Ceph
+Monitors. The client then uses the session key to request services from the
+monitors, and the monitors provide the client with a ticket that authenticates
+the client against the OSDs that actually handle data. Ceph Monitors and OSDs
+share a secret, which means that the clients can use the ticket provided by the
+monitors to authenticate against any OSD or metadata server in the cluster.
+
+Like Kerberos tickets, ``cephx`` tickets expire. An attacker cannot use an
+expired ticket or session key that has been obtained surreptitiously. This form
+of authentication prevents attackers who have access to the communications
+medium from creating bogus messages under another user's identity and prevents
+attackers from altering another user's legitimate messages, as long as the
+user's secret key is not divulged before it expires.

-To use ``cephx``, an administrator must set up users first. In the following
-diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
+An administrator must set up users before using ``cephx``. In the following
+diagram, the ``client.admin`` user invokes ``ceph auth get-or-create-key`` from
 the command line to generate a username and secret key. Ceph's ``auth``
-subsystem generates the username and key, stores a copy with the monitor(s) and
+subsystem generates the username and key, stores a copy on the monitor(s), and
 transmits the user's secret back to the ``client.admin`` user. This means that
 the client and the monitor share a secret key.

 .. note:: The ``client.admin`` user must provide the user ID and
@ -262,17 +280,16 @@ the client and the monitor share a secret key.
            | transmit key |
            |              |

-To authenticate with the monitor, the client passes in the user name to the
-monitor, and the monitor generates a session key and encrypts it with the secret
-key associated to the user name. Then, the monitor transmits the encrypted
-ticket back to the client. The client then decrypts the payload with the shared
-secret key to retrieve the session key. The session key identifies the user for
-the current session. The client then requests a ticket on behalf of the user
-signed by the session key. The monitor generates a ticket, encrypts it with the
-user's secret key and transmits it back to the client. The client decrypts the
-ticket and uses it to sign requests to OSDs and metadata servers throughout the
-cluster.
+Here is how a client authenticates with a monitor. The client passes the user
+name to the monitor. The monitor generates a session key that is encrypted with
+the secret key associated with the ``username``. The monitor transmits the
+encrypted ticket to the client. The client uses the shared secret key to
+decrypt the payload. The session key identifies the user, and this act of
+identification will last for the duration of the session. The client requests
+a ticket for the user, and the ticket is signed with the session key. The
+monitor generates a ticket and uses the user's secret key to encrypt it. The
+encrypted ticket is transmitted to the client. The client decrypts the ticket
+and uses it to sign requests to OSDs and to metadata servers in the cluster.

 .. ditaa::

@ -302,10 +319,11 @@ cluster.
             |<----+ |

-The ``cephx`` protocol authenticates ongoing communications between the client
-machine and the Ceph servers. Each message sent between a client and server,
-subsequent to the initial authentication, is signed using a ticket that the
-monitors, OSDs and metadata servers can verify with their shared secret.
+The ``cephx`` protocol authenticates ongoing communications between the clients
+and Ceph daemons. After initial authentication, each message sent between a
+client and a daemon is signed using a ticket that can be verified by monitors,
+OSDs, and metadata daemons. This ticket is verified by using the secret shared
+between the client and the daemon.

 .. ditaa::

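As a rough illustration of what it means to sign every post-authentication message with a shared secret, the toy sketch below uses an HMAC. This is not the actual cephx wire format or key handling; it only shows the general idea that any daemon holding the same secret can verify the sender without the secret ever crossing the wire.

.. code-block:: python

   import hashlib
   import hmac

   def sign_message(shared_secret: bytes, message: bytes) -> bytes:
       """Toy stand-in for a cephx-style message signature."""
       return hmac.new(shared_secret, message, hashlib.sha256).digest()

   def verify_message(shared_secret: bytes, message: bytes, signature: bytes) -> bool:
       return hmac.compare_digest(sign_message(shared_secret, message), signature)

   secret = b"session-key-shared-by-client-and-daemon"   # hypothetical session key
   msg = b"osd_op: write object 'john' to pg 4.58"
   sig = sign_message(secret, msg)
   assert verify_message(secret, msg, sig)                # untampered message verifies
   assert not verify_message(secret, msg + b"!", sig)     # altered message does not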
@ -341,83 +359,93 @@ monitors, OSDs and metadata servers can verify with their shared secret.
                 |<-------------------------------------------|
                          receive response

-The protection offered by this authentication is between the Ceph client and the
-Ceph server hosts. The authentication is not extended beyond the Ceph client. If
-the user accesses the Ceph client from a remote host, Ceph authentication is not
+This authentication protects only the connections between Ceph clients and Ceph
+daemons. The authentication is not extended beyond the Ceph client. If a user
+accesses the Ceph client from a remote host, cephx authentication will not be
 applied to the connection between the user's host and the client host.

-For configuration details, see `Cephx Config Guide`_. For user management
-details, see `User Management`_.
+See `Cephx Config Guide`_ for more on configuration details.
+
+See `User Management`_ for more on user management.
+
+See :ref:`A Detailed Description of the Cephx Authentication Protocol
+<cephx_2012_peter>` for more on the distinction between authorization and
+authentication and for a step-by-step explanation of the setup of ``cephx``
+tickets and session keys.

 .. index:: architecture; smart daemons and scalability

 Smart Daemons Enable Hyperscale
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+A feature of many storage clusters is a centralized interface that keeps track
+of the nodes that clients are permitted to access. Such centralized
+architectures provide services to clients by means of a double dispatch. At the
+petabyte-to-exabyte scale, such double dispatches are a significant
+bottleneck.
+
-In many clustered architectures, the primary purpose of cluster membership is
-so that a centralized interface knows which nodes it can access. Then the
-centralized interface provides services to the client through a double
-dispatch--which is a **huge** bottleneck at the petabyte-to-exabyte scale.
-
-Ceph eliminates the bottleneck: Ceph's OSD Daemons AND Ceph Clients are cluster
-aware. Like Ceph clients, each Ceph OSD Daemon knows about other Ceph OSD
-Daemons in the cluster. This enables Ceph OSD Daemons to interact directly with
-other Ceph OSD Daemons and Ceph Monitors. Additionally, it enables Ceph Clients
-to interact directly with Ceph OSD Daemons.
+Ceph obviates this bottleneck: Ceph's OSD Daemons AND Ceph clients are
+cluster-aware. Like Ceph clients, each Ceph OSD Daemon is aware of other Ceph
+OSD Daemons in the cluster. This enables Ceph OSD Daemons to interact directly
+with other Ceph OSD Daemons and to interact directly with Ceph Monitors. Being
+cluster-aware makes it possible for Ceph clients to interact directly with Ceph
+OSD Daemons.

-The ability of Ceph Clients, Ceph Monitors and Ceph OSD Daemons to interact with
-each other means that Ceph OSD Daemons can utilize the CPU and RAM of the Ceph
-nodes to easily perform tasks that would bog down a centralized server. The
-ability to leverage this computing power leads to several major benefits:
+Because Ceph clients, Ceph monitors, and Ceph OSD daemons interact with one
+another directly, Ceph OSD daemons can make use of the aggregate CPU and RAM
+resources of the nodes in the Ceph cluster. This means that a Ceph cluster can
+easily perform tasks that a cluster with a centralized interface would struggle
+to perform. The ability of Ceph nodes to make use of the computing power of
+the greater cluster provides several benefits:

-#. **OSDs Service Clients Directly:** Since any network device has a limit to
-   the number of concurrent connections it can support, a centralized system
-   has a low physical limit at high scales. By enabling Ceph Clients to contact
-   Ceph OSD Daemons directly, Ceph increases both performance and total system
-   capacity simultaneously, while removing a single point of failure. Ceph
-   Clients can maintain a session when they need to, and with a particular Ceph
-   OSD Daemon instead of a centralized server.
+#. **OSDs Service Clients Directly:** Network devices can support only a
+   limited number of concurrent connections. Because Ceph clients contact
+   Ceph OSD daemons directly without first connecting to a central interface,
+   Ceph enjoys improved performance and increased system capacity relative to
+   storage redundancy strategies that include a central interface. Ceph clients
+   maintain sessions only when needed, and maintain those sessions with only
+   particular Ceph OSD daemons, not with a centralized interface.

-#. **OSD Membership and Status**: Ceph OSD Daemons join a cluster and report
-   on their status. At the lowest level, the Ceph OSD Daemon status is ``up``
-   or ``down`` reflecting whether or not it is running and able to service
-   Ceph Client requests. If a Ceph OSD Daemon is ``down`` and ``in`` the Ceph
-   Storage Cluster, this status may indicate the failure of the Ceph OSD
-   Daemon. If a Ceph OSD Daemon is not running (e.g., it crashes), the Ceph OSD
-   Daemon cannot notify the Ceph Monitor that it is ``down``. The OSDs
-   periodically send messages to the Ceph Monitor (``MPGStats`` pre-luminous,
-   and a new ``MOSDBeacon`` in luminous). If the Ceph Monitor doesn't see that
-   message after a configurable period of time then it marks the OSD down.
-   This mechanism is a failsafe, however. Normally, Ceph OSD Daemons will
-   determine if a neighboring OSD is down and report it to the Ceph Monitor(s).
-   This assures that Ceph Monitors are lightweight processes. See `Monitoring
-   OSDs`_ and `Heartbeats`_ for additional details.
+#. **OSD Membership and Status**: When Ceph OSD Daemons join a cluster, they
+   report their status. At the lowest level, the Ceph OSD Daemon status is
+   ``up`` or ``down``: this reflects whether the Ceph OSD daemon is running and
+   able to service Ceph Client requests. If a Ceph OSD Daemon is ``down`` and
+   ``in`` the Ceph Storage Cluster, this status may indicate the failure of the
+   Ceph OSD Daemon. If a Ceph OSD Daemon is not running because it has crashed,
+   the Ceph OSD Daemon cannot notify the Ceph Monitor that it is ``down``. The
+   OSDs periodically send messages to the Ceph Monitor (in releases prior to
+   Luminous, this was done by means of ``MPGStats``, and beginning with the
+   Luminous release, this has been done with ``MOSDBeacon``). If the Ceph
+   Monitors receive no such message after a configurable period of time,
+   then they mark the OSD ``down``. This mechanism is a failsafe, however.
+   Normally, Ceph OSD Daemons determine if a neighboring OSD is ``down`` and
+   report it to the Ceph Monitors. This contributes to making Ceph Monitors
+   lightweight processes. See `Monitoring OSDs`_ and `Heartbeats`_ for
+   additional details.

-#. **Data Scrubbing:** As part of maintaining data consistency and cleanliness,
-   Ceph OSD Daemons can scrub objects. That is, Ceph OSD Daemons can compare
-   their local objects metadata with its replicas stored on other OSDs. Scrubbing
-   happens on a per-Placement Group base. Scrubbing (usually performed daily)
-   catches mismatches in size and other metadata. Ceph OSD Daemons also perform deeper
-   scrubbing by comparing data in objects bit-for-bit with their checksums.
-   Deep scrubbing (usually performed weekly) finds bad sectors on a drive that
-   weren't apparent in a light scrub. See `Data Scrubbing`_ for details on
-   configuring scrubbing.
+#. **Data Scrubbing:** To maintain data consistency, Ceph OSD Daemons scrub
+   RADOS objects. Ceph OSD Daemons compare the metadata of their own local
+   objects against the metadata of the replicas of those objects, which are
+   stored on other OSDs. Scrubbing occurs on a per-Placement-Group basis, finds
+   mismatches in object size and finds metadata mismatches, and is usually
+   performed daily. Ceph OSD Daemons perform deeper scrubbing by comparing the
+   data in objects, bit-for-bit, against their checksums. Deep scrubbing finds
+   bad sectors on drives that are not detectable with light scrubs. See `Data
+   Scrubbing`_ for details on configuring scrubbing.

-#. **Replication:** Like Ceph Clients, Ceph OSD Daemons use the CRUSH
-   algorithm, but the Ceph OSD Daemon uses it to compute where replicas of
-   objects should be stored (and for rebalancing). In a typical write scenario,
-   a client uses the CRUSH algorithm to compute where to store an object, maps
-   the object to a pool and placement group, then looks at the CRUSH map to
-   identify the primary OSD for the placement group.
-
-   The client writes the object to the identified placement group in the
-   primary OSD. Then, the primary OSD with its own copy of the CRUSH map
-   identifies the secondary and tertiary OSDs for replication purposes, and
-   replicates the object to the appropriate placement groups in the secondary
-   and tertiary OSDs (as many OSDs as additional replicas), and responds to the
-   client once it has confirmed the object was stored successfully.
+#. **Replication:** Data replication involves a collaboration between Ceph
+   Clients and Ceph OSD Daemons. Ceph OSD Daemons use the CRUSH algorithm to
+   determine the storage location of object replicas. Ceph clients use the
+   CRUSH algorithm to determine the storage location of an object, then the
+   object is mapped to a pool and to a placement group, and then the client
+   consults the CRUSH map to identify the placement group's primary OSD.
+
+   After identifying the target placement group, the client writes the object
+   to the identified placement group's primary OSD. The primary OSD then
+   consults its own copy of the CRUSH map to identify secondary and tertiary
+   OSDs, replicates the object to the placement groups in those secondary and
+   tertiary OSDs, confirms that the object was stored successfully in the
+   secondary and tertiary OSDs, and reports to the client that the object
+   was stored successfully.

 .. ditaa::

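The write path in the replication bullet can be sketched as a short sequence of calls. This is a simplified model with invented helper names that assumes a three-replica acting set; it only illustrates the ordering (client writes to the primary, the primary fans out to the secondary and tertiary OSDs, and the client is acknowledged only after all replicas confirm), not the real OSD code path.

.. code-block:: python

   def replicate_write(object_name: str, data: bytes, acting_set: list) -> bool:
       """Toy model of a replicated write; acting_set[0] plays the primary OSD."""
       primary, *replicas = acting_set
       primary.store(object_name, data)                            # client-facing write
       acks = [osd.store(object_name, data) for osd in replicas]   # primary fans out
       return all(acks)                                            # ack client only when all replicas confirm

   class FakeOSD:
       """Stand-in OSD that just keeps objects in a dict."""
       def __init__(self):
           self.objects = {}
       def store(self, name, data):
           self.objects[name] = data
           return True

   acting_set = [FakeOSD(), FakeOSD(), FakeOSD()]   # primary, secondary, tertiary
   assert replicate_write("john", b"payload", acting_set)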
@ -444,19 +472,18 @@ ability to leverage this computing power leads to several major benefits:
          |               |   |               |
          +---------------+   +---------------+

-With the ability to perform data replication, Ceph OSD Daemons relieve Ceph
-clients from that duty, while ensuring high data availability and data safety.
+By performing this act of data replication, Ceph OSD Daemons relieve Ceph
+clients of the burden of replicating data.


 Dynamic Cluster Management
 --------------------------

 In the `Scalability and High Availability`_ section, we explained how Ceph uses
-CRUSH, cluster awareness and intelligent daemons to scale and maintain high
+CRUSH, cluster topology, and intelligent daemons to scale and maintain high
 availability. Key to Ceph's design is the autonomous, self-healing, and
 intelligent Ceph OSD Daemon. Let's take a deeper look at how CRUSH works to
-enable modern cloud storage infrastructures to place data, rebalance the cluster
-and recover from faults dynamically.
+enable modern cloud storage infrastructures to place data, rebalance the
+cluster, and adaptively place and balance data and recover from faults.

 .. index:: architecture; pools

@ -465,10 +492,11 @@ About Pools

 The Ceph storage system supports the notion of 'Pools', which are logical
 partitions for storing objects.

-Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write objects to
-pools. The pool's ``size`` or number of replicas, the CRUSH rule and the
-number of placement groups determine how Ceph will place the data.
+Ceph Clients retrieve a `Cluster Map`_ from a Ceph Monitor, and write RADOS
+objects to pools. The way that Ceph places the data in the pools is determined
+by the pool's ``size`` or number of replicas, the CRUSH rule, and the number of
+placement groups in the pool.

 .. ditaa::

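For readers who want to exercise this from a client, the sketch below uses the ``rados`` Python binding (packaged as ``python3-rados``) to write one object into a pool. The pool name ``mypool``, the object name, and the config path are assumptions made for the example; placement of the object is decided by the cluster map and CRUSH exactly as described above.

.. code-block:: python

   import rados

   # Assumes a reachable cluster, a client keyring, and an existing pool "mypool".
   cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
   cluster.connect()
   try:
       ioctx = cluster.open_ioctx("mypool")
       try:
           ioctx.write_full("john", b"hello from librados")   # object name + binary payload
           ioctx.set_xattr("john", "owner", b"alice")         # name/value metadata
       finally:
           ioctx.close()
   finally:
       cluster.shutdown()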
@ -501,20 +529,23 @@ See `Set Pool Values`_ for details.
 Mapping PGs to OSDs
 ~~~~~~~~~~~~~~~~~~~

-Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically.
-When a Ceph Client stores objects, CRUSH will map each object to a placement
-group.
+Each pool has a number of placement groups (PGs) within it. CRUSH dynamically
+maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS
+object to a PG.

-Mapping objects to placement groups creates a layer of indirection between the
-Ceph OSD Daemon and the Ceph Client. The Ceph Storage Cluster must be able to
-grow (or shrink) and rebalance where it stores objects dynamically. If the Ceph
-Client "knew" which Ceph OSD Daemon had which object, that would create a tight
-coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH
-algorithm maps each object to a placement group and then maps each placement
-group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to
-rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices
-come online. The following diagram depicts how CRUSH maps objects to placement
-groups, and placement groups to OSDs.
+This mapping of RADOS objects to PGs implements an abstraction and indirection
+layer between Ceph OSD Daemons and Ceph Clients. The Ceph Storage Cluster must
+be able to grow (or shrink) and redistribute data adaptively when the internal
+topology changes.
+
+If the Ceph Client "knew" which Ceph OSD Daemons were storing which objects, a
+tight coupling would exist between the Ceph Client and the Ceph OSD Daemon.
+But Ceph avoids any such tight coupling. Instead, the CRUSH algorithm maps each
+RADOS object to a placement group and then maps each placement group to one or
+more Ceph OSD Daemons. This "layer of indirection" allows Ceph to rebalance
+dynamically when new Ceph OSD Daemons and their underlying OSD devices come
+online. The following diagram shows how the CRUSH algorithm maps objects to
+placement groups, and how it maps placement groups to OSDs.

 .. ditaa::

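The indirection can be pictured as two functions: one maps an object to a PG, and a second maps that PG to an ordered list of OSDs. The sketch below is only a stand-in for CRUSH and for Ceph's object hash (it uses SHA-1 and a seeded pseudo-random choice), but it shows why adding OSDs changes only the second mapping and never the object-to-PG step.

.. code-block:: python

   import hashlib
   import random

   def object_to_pg(object_name: str, pg_num: int) -> int:
       """Stable object -> PG mapping (stand-in hash, not Ceph's rjenkins)."""
       h = int.from_bytes(hashlib.sha1(object_name.encode()).digest()[:4], "little")
       return h % pg_num

   def pg_to_osds(pool_id: int, pg_id: int, osd_ids: list, size: int = 3) -> list:
       """Stand-in for CRUSH: deterministic choice of `size` distinct OSDs per PG."""
       rng = random.Random(pool_id * 100003 + pg_id)
       return rng.sample(sorted(osd_ids), size)

   osds = [0, 1, 2, 3, 4, 5]
   pg = object_to_pg("john", pg_num=128)
   print(pg, pg_to_osds(pool_id=4, pg_id=pg, osd_ids=osds))
   # Growing the cluster re-runs only pg_to_osds(); object_to_pg() is unchanged.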
@ -540,44 +571,45 @@ groups, and placement groups to OSDs.
   |          |  |          |  |          |  |          |
   \----------/  \----------/  \----------/  \----------/

-With a copy of the cluster map and the CRUSH algorithm, the client can compute
-exactly which OSD to use when reading or writing a particular object.
+The client uses its copy of the cluster map and the CRUSH algorithm to compute
+precisely which OSD it will use when reading or writing a particular object.

 .. index:: architecture; calculating PG IDs

 Calculating PG IDs
 ~~~~~~~~~~~~~~~~~~

-When a Ceph Client binds to a Ceph Monitor, it retrieves the latest copy of the
-`Cluster Map`_. With the cluster map, the client knows about all of the monitors,
-OSDs, and metadata servers in the cluster. **However, it doesn't know anything
-about object locations.**
+When a Ceph Client binds to a Ceph Monitor, it retrieves the latest version of
+the `Cluster Map`_. When a client has been equipped with a copy of the cluster
+map, it is aware of all the monitors, OSDs, and metadata servers in the
+cluster. **However, even equipped with a copy of the latest version of the
+cluster map, the client doesn't know anything about object locations.**

-.. epigraph::
-
-   Object locations get computed.
+**Object locations must be computed.**
+
+The client requires only the object ID and the name of the pool in order to
+compute the object location.

-The only input required by the client is the object ID and the pool.
-It's simple: Ceph stores data in named pools (e.g., "liverpool"). When a client
-wants to store a named object (e.g., "john," "paul," "george," "ringo", etc.)
-it calculates a placement group using the object name, a hash code, the
-number of PGs in the pool and the pool name. Ceph clients use the following
-steps to compute PG IDs.
+Ceph stores data in named pools (for example, "liverpool"). When a client
+stores a named object (for example, "john", "paul", "george", or "ringo") it
+calculates a placement group by using the object name, a hash code, the number
+of PGs in the pool, and the pool name. Ceph clients use the following steps to
+compute PG IDs.

-#. The client inputs the pool name and the object ID. (e.g., pool = "liverpool"
-   and object-id = "john")
-#. Ceph takes the object ID and hashes it.
-#. Ceph calculates the hash modulo the number of PGs. (e.g., ``58``) to get
-   a PG ID.
-#. Ceph gets the pool ID given the pool name (e.g., "liverpool" = ``4``)
-#. Ceph prepends the pool ID to the PG ID (e.g., ``4.58``).
+#. The client inputs the pool name and the object ID. (for example: pool =
+   "liverpool" and object-id = "john")
+#. Ceph hashes the object ID.
+#. Ceph calculates the hash, modulo the number of PGs (for example: ``58``), to
+   get a PG ID.
+#. Ceph uses the pool name to retrieve the pool ID: (for example: "liverpool" =
+   ``4``)
+#. Ceph prepends the pool ID to the PG ID (for example: ``4.58``).

-Computing object locations is much faster than performing object location query
-over a chatty session. The :abbr:`CRUSH (Controlled Replication Under Scalable
-Hashing)` algorithm allows a client to compute where objects *should* be stored,
-and enables the client to contact the primary OSD to store or retrieve the
-objects.
+It is much faster to compute object locations than to perform an object
+location query over a chatty session. The :abbr:`CRUSH (Controlled Replication
+Under Scalable Hashing)` algorithm allows a client to compute where objects are
+expected to be stored, and enables the client to contact the primary OSD to
+store or retrieve the objects.

 .. index:: architecture; PG Peering

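The numbered steps above translate almost directly into code. The sketch below follows the same sequence; the hash function and the pool table are stand-ins (Ceph uses its own object hash and stable-mod placement), so the IDs it prints are illustrative rather than what a real cluster would compute.

.. code-block:: python

   import hashlib

   POOLS_BY_NAME = {"liverpool": 4}   # assumption for the example: pool name -> pool ID

   def compute_pg_id(pool_name: str, object_id: str, pg_num: int) -> str:
       obj_hash = int.from_bytes(                       # 2. hash the object ID
           hashlib.sha1(object_id.encode()).digest()[:4], "little")
       pg = obj_hash % pg_num                           # 3. hash modulo the number of PGs
       pool_id = POOLS_BY_NAME[pool_name]               # 4. pool name -> pool ID
       return f"{pool_id}.{pg:x}"                       # 5. prepend the pool ID (e.g. "4.58")

   print(compute_pg_id("liverpool", "john", pg_num=128))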
@ -585,46 +617,51 @@ Peering and Sets
|
|||||||
~~~~~~~~~~~~~~~~

In previous sections, we noted that Ceph OSD Daemons check each other's
heartbeats and report back to Ceph Monitors. Ceph OSD daemons also 'peer',
which is the process of bringing all of the OSDs that store a Placement Group
(PG) into agreement about the state of all of the RADOS objects (and their
metadata) in that PG. Ceph OSD Daemons `Report Peering Failure`_ to the Ceph
Monitors. Peering issues usually resolve themselves; however, if the problem
persists, you may need to refer to the `Troubleshooting Peering Failure`_
section.

.. Note:: PGs that agree on the state of the cluster do not necessarily have
   the current data yet.

The Ceph Storage Cluster was designed to store at least two copies of an object
(that is, ``size = 2``), which is the minimum requirement for data safety. For
high availability, a Ceph Storage Cluster should store more than two copies of
an object (that is, ``size = 3`` and ``min size = 2``) so that it can continue
to run in a ``degraded`` state while maintaining data safety.

.. warning:: Although we say here that R2 (replication with two copies) is the
   minimum requirement for data safety, R3 (replication with three copies) is
   recommended. On a long enough timeline, data stored with an R2 strategy will
   be lost.

As explained in the diagram in `Smart Daemons Enable Hyperscale`_, we do not
name the Ceph OSD Daemons specifically (for example, ``osd.0``, ``osd.1``,
etc.), but rather refer to them as *Primary*, *Secondary*, and so forth. By
convention, the *Primary* is the first OSD in the *Acting Set*, and is
responsible for orchestrating the peering process for each placement group
where it acts as the *Primary*. The *Primary* is the **ONLY** OSD in a given
placement group that accepts client-initiated writes to objects.

The set of OSDs that is responsible for a placement group is called the
*Acting Set*. The term "*Acting Set*" can refer either to the Ceph OSD Daemons
that are currently responsible for the placement group, or to the Ceph OSD
Daemons that were responsible for a particular placement group as of some
epoch.

The Ceph OSD daemons that are part of an *Acting Set* might not always be
``up``. When an OSD in the *Acting Set* is ``up``, it is part of the *Up Set*.
The *Up Set* is an important distinction, because Ceph can remap PGs to other
Ceph OSD Daemons when an OSD fails.

.. note:: Consider a hypothetical *Acting Set* for a PG that contains
   ``osd.25``, ``osd.32`` and ``osd.61``. The first OSD (``osd.25``) is the
   *Primary*. If that OSD fails, the Secondary (``osd.32``) becomes the
   *Primary*, and ``osd.25`` is removed from the *Up Set*.
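You can inspect the *Up Set* and *Acting Set* of any PG directly. This is a
minimal sketch; ``1.6c`` stands in for a real PG ID from your cluster:

.. prompt:: bash #

   ceph pg map 1.6c

The command prints the current OSD map epoch together with the up and acting
sets for that PG.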
.. index:: architecture; Rebalancing

@@ -1469,11 +1506,11 @@ Ceph Clients

Ceph Clients include a number of service interfaces. These include:

- **Block Devices:** The :term:`Ceph Block Device` (a.k.a., RBD) service
  provides resizable, thin-provisioned block devices that can be snapshotted
  and cloned. Ceph stripes a block device across the cluster for high
  performance. Ceph supports both kernel objects (KO) and a QEMU hypervisor
  that uses ``librbd`` directly--avoiding the kernel object overhead for
  virtualized systems.

- **Object Storage:** The :term:`Ceph Object Storage` (a.k.a., RGW) service
@@ -3,18 +3,20 @@

``activate``
============

After :ref:`ceph-volume-lvm-prepare` has completed its run, the volume can be
activated.

Activating the volume involves enabling a ``systemd`` unit that persists the
``OSD ID`` and its ``UUID`` (which is also called the ``fsid`` in the Ceph CLI
tools). After this information has been persisted, the cluster can determine
which OSD is enabled and must be mounted.

.. note:: The execution of this call is fully idempotent. This means that the
   call can be executed multiple times without changing the result of its first
   successful execution.

For information about OSDs deployed by cephadm, refer to
:ref:`cephadm-osd-activate`.
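A minimal activation sketch looks like this; the OSD ID and UUID are
placeholders that must match the values reported by the prepare step:

.. prompt:: bash #

   ceph-volume lvm activate <osd-id> <osd-fsid>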
New OSDs
--------

@@ -101,8 +101,19 @@ To drain all daemons from a host, run a command of the following form:

   ceph orch host drain *<host>*

The ``_no_schedule`` and ``_no_conf_keyring`` labels will be applied to the
host. See :ref:`cephadm-special-host-labels`.

If you only want to drain daemons but leave managed ceph conf and keyring
files on the host, you may pass the ``--keep-conf-keyring`` flag to the
drain command.

.. prompt:: bash #

   ceph orch host drain *<host>* --keep-conf-keyring

This will apply the ``_no_schedule`` label to the host but not the
``_no_conf_keyring`` label.

All OSDs on the host will be scheduled to be removed. You can check the progress of the OSD removal operation with the following command:
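As a sketch, the status command (shown in full in the OSD management section
later in this document) is simply:

.. prompt:: bash #

   ceph orch osd rm status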
@@ -112,6 +123,14 @@ All OSDs on the host will be scheduled to be removed. You can check the progress

See :ref:`cephadm-osd-removal` for more details about OSD removal.

The ``orch host drain`` command also supports a ``--zap-osd-devices``
flag. Setting this flag while draining a host will cause cephadm to zap
the devices of the OSDs it is removing as part of the drain process:

.. prompt:: bash #

   ceph orch host drain *<host>* --zap-osd-devices

Use the following command to determine whether any daemons are still on the
host:
@@ -183,6 +202,12 @@ The following host labels have a special meaning to cephadm. All start with ``_

  an existing host that already contains Ceph daemons, it will cause cephadm to move
  those daemons elsewhere (except OSDs, which are not removed automatically).

* ``_no_conf_keyring``: *Do not deploy config files or keyrings on this host*.

  This label is effectively the same as ``_no_schedule`` but instead of working for
  daemons it works for client keyrings and ceph conf files that are being managed
  by cephadm.

* ``_no_autotune_memory``: *Do not autotune memory on this host*.

  This label will prevent daemon memory from being tuned even when the
@@ -290,7 +315,7 @@ create a new CRUSH host located in the specified hierarchy.

.. note::

  The ``location`` attribute will only affect the initial CRUSH location. Subsequent
  changes of the ``location`` property will be ignored. Also, removing a host will not remove
  any CRUSH buckets.

See also :ref:`crush_map_default_types`.
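For reference, a host specification that sets an initial CRUSH location might
look like the following sketch; the hostname, address, and bucket values are
only examples:

.. code-block:: yaml

   service_type: host
   hostname: node-00
   addr: 192.168.0.10
   location:
     rack: rack1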
@@ -505,7 +530,23 @@ There are two ways to customize this configuration for your environment:

   manually distributed to the mgr data directory
   (``/var/lib/ceph/<cluster-fsid>/mgr.<id>`` on the host, visible at
   ``/var/lib/ceph/mgr/ceph-<id>`` from inside the container).

Setting up CA signed keys for the cluster
-----------------------------------------

Cephadm also supports using CA signed keys for SSH authentication
across cluster nodes. In this setup, instead of needing a private
key and public key, we instead need a private key and a certificate
created by signing that private key with a CA key. For more information
on setting up nodes for authentication using a CA signed key, see
:ref:`cephadm-bootstrap-ca-signed-keys`. Once you have your private
key and signed cert, they can be set up for cephadm to use by running:

.. prompt:: bash #

   ceph config-key set mgr/cephadm/ssh_identity_key -i <private-key-file>
   ceph config-key set mgr/cephadm/ssh_identity_cert -i <signed-cert-file>

.. _cephadm-fqdn:

Fully qualified domain names vs bare host names
@@ -50,8 +50,8 @@ There are two ways to install ``cephadm``:

distribution-specific installations
-----------------------------------

Some Linux distributions may already include up-to-date Ceph packages. In
that case, you can install cephadm directly. For example:

In Ubuntu:
@@ -87,7 +87,7 @@ curl-based installation

* First, determine what version of Ceph you will need. You can use the releases
  page to find the `latest active releases <https://docs.ceph.com/en/latest/releases/#active-releases>`_.
  For example, we might look at that page and find that ``18.2.0`` is the latest
  active release.

* Use ``curl`` to fetch a build of cephadm for that release.

@@ -95,7 +95,7 @@ curl-based installation

  .. prompt:: bash #
     :substitutions:

     CEPH_RELEASE=18.2.0 # replace this with the active release
     curl --silent --remote-name --location https://download.ceph.com/rpm-${CEPH_RELEASE}/el9/noarch/cephadm

  Ensure the ``cephadm`` file is executable:
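A simple way to do this is with ``chmod``; this is a sketch that assumes the
file was downloaded into the current directory:

.. prompt:: bash #

   chmod +x cephadm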
@@ -121,28 +121,41 @@ curl-based installation

      python3.8 ./cephadm <arguments...>

.. _cephadm_update:

update cephadm
--------------

The cephadm binary can be used to bootstrap a cluster and for a variety
of other management and debugging tasks. The Ceph team strongly recommends
using an actively supported version of cephadm. Additionally, although
the standalone cephadm is sufficient to get a cluster started, it is
convenient to have the ``cephadm`` command installed on the host. Older or LTS
distributions may also ship ``cephadm`` packages that are out-of-date;
running the commands below installs a more recent version from the Ceph
project's repositories.

To install the packages provided by the Ceph project that provide the
``cephadm`` command, run the following commands:

.. prompt:: bash #
   :substitutions:

   ./cephadm add-repo --release |stable-release|
   ./cephadm install

Confirm that ``cephadm`` is now in your PATH by running ``which`` or
``command -v``:

.. prompt:: bash #

   which cephadm

A successful ``which cephadm`` command will return this:

.. code-block:: bash

   /usr/sbin/cephadm

Bootstrap a new cluster
=======================
@@ -157,6 +170,9 @@ cluster's first "monitor daemon", and that monitor daemon needs an IP address.

You must pass the IP address of the Ceph cluster's first host to the ``ceph
bootstrap`` command, so you'll need to know the IP address of that host.

.. important:: ``ssh`` must be installed and running in order for the
   bootstrapping procedure to succeed.

.. note:: If there are multiple networks and interfaces, be sure to choose one
   that will be accessible by any host accessing the Ceph cluster.
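As a minimal sketch, the bootstrap invocation looks like this; substitute the
real IP address of the first host:

.. prompt:: bash #

   cephadm bootstrap --mon-ip *<mon-ip>*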
@@ -184,6 +200,8 @@ This command will:

  with this label will (also) get a copy of ``/etc/ceph/ceph.conf`` and
  ``/etc/ceph/ceph.client.admin.keyring``.

.. _cephadm-bootstrap-further-info:

Further information about cephadm bootstrap
-------------------------------------------
@@ -303,18 +321,21 @@ its status with:

Adding Hosts
============

Add all hosts to the cluster by following the instructions in
:ref:`cephadm-adding-hosts`.

By default, a ``ceph.conf`` file and a copy of the ``client.admin`` keyring are
maintained in ``/etc/ceph`` on all hosts that have the ``_admin`` label. This
label is initially applied only to the bootstrap host. We usually recommend
that one or more other hosts be given the ``_admin`` label so that the Ceph CLI
(for example, via ``cephadm shell``) is easily accessible on multiple hosts. To add
the ``_admin`` label to additional host(s), run a command of the following form:

.. prompt:: bash #

   ceph orch host label add *<host>* _admin
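For example, a new host can be added and given the label in two steps; the
hostname and address below are only illustrative:

.. prompt:: bash #

   ceph orch host add host2 10.10.0.102
   ceph orch host label add host2 _admin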
Adding additional MONs
======================
@@ -454,3 +475,104 @@ have access to all hosts that you plan to add to the cluster.

   cephadm --image *<hostname>*:5000/ceph/ceph bootstrap --mon-ip *<mon-ip>*

.. _cluster network: ../rados/configuration/network-config-ref#cluster-network

.. _cephadm-bootstrap-custom-ssh-keys:

Deployment with custom SSH keys
-------------------------------

Bootstrap allows users to create their own private/public SSH key pair
rather than having cephadm generate them automatically.

To use custom SSH keys, pass the ``--ssh-private-key`` and ``--ssh-public-key``
fields to bootstrap. Both parameters require a path to the file where the
keys are stored:

.. prompt:: bash #

   cephadm bootstrap --mon-ip <ip-addr> --ssh-private-key <private-key-filepath> --ssh-public-key <public-key-filepath>

This setup allows users to use a key that has already been distributed to hosts
the user wants in the cluster before bootstrap.

.. note:: In order for cephadm to connect to other hosts you'd like to add
   to the cluster, make sure the public key of the key pair provided is set up
   as an authorized key for the ssh user being used, typically root. If you'd
   like more info on using a non-root user as the ssh user, see
   :ref:`cephadm-bootstrap-further-info`.

.. _cephadm-bootstrap-ca-signed-keys:

Deployment with CA signed SSH keys
----------------------------------

As an alternative to standard public key authentication, cephadm also supports
deployment using CA signed keys. Before bootstrapping, it's recommended to set up
the CA public key as a trusted CA key on hosts you'd like to eventually add to
the cluster. For example:

.. prompt:: bash

   # we will act as our own CA, therefore we'll need to make a CA key
   [root@host1 ~]# ssh-keygen -t rsa -f ca-key -N ""

   # make the ca key trusted on the host we've generated it on
   # this requires adding in a line in our /etc/sshd_config
   # to mark this key as trusted
   [root@host1 ~]# cp ca-key.pub /etc/ssh
   [root@host1 ~]# vi /etc/ssh/sshd_config
   [root@host1 ~]# cat /etc/ssh/sshd_config | grep ca-key
   TrustedUserCAKeys /etc/ssh/ca-key.pub
   # now restart sshd so it picks up the config change
   [root@host1 ~]# systemctl restart sshd

   # now, on all other hosts we want in the cluster, also install the CA key
   [root@host1 ~]# scp /etc/ssh/ca-key.pub host2:/etc/ssh/

   # on other hosts, make the same changes to the sshd_config
   [root@host2 ~]# vi /etc/ssh/sshd_config
   [root@host2 ~]# cat /etc/ssh/sshd_config | grep ca-key
   TrustedUserCAKeys /etc/ssh/ca-key.pub
   # and restart sshd so it picks up the config change
   [root@host2 ~]# systemctl restart sshd

Once the CA key has been installed and marked as a trusted key, you are ready
to use a private key/CA signed cert combination for SSH. Continuing with our
current example, we will create a new key pair for host access and then
sign it with our CA key:

.. prompt:: bash

   # make a new key pair
   [root@host1 ~]# ssh-keygen -t rsa -f cephadm-ssh-key -N ""
   # sign the private key. This will create a new cephadm-ssh-key-cert.pub
   # note here we're using user "root". If you'd like to use a non-root
   # user the arguments to the -I and -n params would need to be adjusted
   # Additionally, note the -V param indicates how long until the cert
   # this creates will expire
   [root@host1 ~]# ssh-keygen -s ca-key -I user_root -n root -V +52w cephadm-ssh-key
   [root@host1 ~]# ls
   ca-key  ca-key.pub  cephadm-ssh-key  cephadm-ssh-key-cert.pub  cephadm-ssh-key.pub

   # verify our signed key is working. To do this, make sure the generated private
   # key ("cephadm-ssh-key" in our example) and the newly signed cert are stored
   # in the same directory. Then try to ssh using the private key
   [root@host1 ~]# ssh -i cephadm-ssh-key host2

Once you have your private key and corresponding CA signed cert, and have tested
that SSH authentication using that key works, you can pass those keys to bootstrap
in order to have cephadm use them for SSHing between cluster hosts:

.. prompt:: bash

   [root@host1 ~]# cephadm bootstrap --mon-ip <ip-addr> --ssh-private-key cephadm-ssh-key --ssh-signed-cert cephadm-ssh-key-cert.pub

Note that this setup does not require installing the corresponding public key
from the private key passed to bootstrap on other nodes. In fact, cephadm will
reject the ``--ssh-public-key`` argument when it is passed along with
``--ssh-signed-cert``. This is not because having the public key breaks
anything, but because it is not needed for this setup, and rejecting it helps
bootstrap distinguish between the CA signed key setup and standard public key
authentication. As a result, SSH key rotation is simply a matter of getting
another key signed by the same CA and providing cephadm with the new private
key and signed cert. No additional distribution of keys to cluster nodes is
needed after the initial setup of the CA key as a trusted key, no matter how
many new private key/signed cert pairs are rotated in.
@@ -541,13 +541,60 @@ a spec like

which would cause each mon daemon to be deployed with `--cpus=2`.

There are two ways to express arguments in the ``extra_container_args`` list.
To start, an item in the list can be a string. When passing an argument
as a string and the string contains spaces, Cephadm will automatically split it
into multiple arguments. For example, ``--cpus 2`` would become ``["--cpus",
"2"]`` when processed. Example:

.. code-block:: yaml

   service_type: mon
   service_name: mon
   placement:
     hosts:
       - host1
       - host2
       - host3
   extra_container_args:
     - "--cpus 2"

As an alternative, an item in the list can be an object (mapping) containing
the required key "argument" and an optional key "split". The value associated
with the ``argument`` key must be a single string. The value associated with
the ``split`` key is a boolean value. The ``split`` key explicitly controls if
spaces in the argument value cause the value to be split into multiple
arguments. If ``split`` is true then Cephadm will automatically split the value
into multiple arguments. If ``split`` is false then spaces in the value will
be retained in the argument. The default, when ``split`` is not provided, is
false. Examples:

.. code-block:: yaml

   service_type: mon
   service_name: mon
   placement:
     hosts:
       - tiebreaker
   extra_container_args:
     # No spaces, always treated as a single argument
     - argument: "--timeout=3000"
     # Splitting explicitly disabled, one single argument
     - argument: "--annotation=com.example.name=my favorite mon"
       split: false
     # Splitting explicitly enabled, will become two arguments
     - argument: "--cpuset-cpus 1-3,7-11"
       split: true
     # Splitting implicitly disabled, one single argument
     - argument: "--annotation=com.example.note=a simple example"

Mounting Files with Extra Container Arguments
---------------------------------------------

A common use case for extra container arguments is to mount additional
files within the container. Older versions of Ceph did not support spaces
in arguments, and therefore the examples below apply to the widest range
of Ceph versions.

.. code-block:: yaml
@@ -587,6 +634,55 @@ the node-exporter service, one could apply a service spec like

     extra_entrypoint_args:
       - "--collector.textfile.directory=/var/lib/node_exporter/textfile_collector2"

There are two ways to express arguments in the ``extra_entrypoint_args`` list.
To start, an item in the list can be a string. When passing an argument as a
string and the string contains spaces, cephadm will automatically split it into
multiple arguments. For example, ``--debug_ms 10`` would become
``["--debug_ms", "10"]`` when processed. Example:

.. code-block:: yaml

   service_type: mon
   service_name: mon
   placement:
     hosts:
       - host1
       - host2
       - host3
   extra_entrypoint_args:
     - "--debug_ms 2"

As an alternative, an item in the list can be an object (mapping) containing
the required key "argument" and an optional key "split". The value associated
with the ``argument`` key must be a single string. The value associated with
the ``split`` key is a boolean value. The ``split`` key explicitly controls if
spaces in the argument value cause the value to be split into multiple
arguments. If ``split`` is true then cephadm will automatically split the value
into multiple arguments. If ``split`` is false then spaces in the value will
be retained in the argument. The default, when ``split`` is not provided, is
false. Examples:

.. code-block:: yaml

   # A theoretical data migration service
   service_type: pretend
   service_name: imagine1
   placement:
     hosts:
       - host1
   extra_entrypoint_args:
     # No spaces, always treated as a single argument
     - argument: "--timeout=30m"
     # Splitting explicitly disabled, one single argument
     - argument: "--import=/mnt/usb/My Documents"
       split: false
     # Splitting explicitly enabled, will become two arguments
     - argument: "--tag documents"
       split: true
     # Splitting implicitly disabled, one single argument
     - argument: "--title=Imported Documents"

Custom Config Files
===================
|
@ -20,7 +20,18 @@ For example:
|
|||||||
ceph fs volume create <fs_name> --placement="<placement spec>"
|
ceph fs volume create <fs_name> --placement="<placement spec>"
|
||||||
|
|
||||||
where ``fs_name`` is the name of the CephFS and ``placement`` is a
|
where ``fs_name`` is the name of the CephFS and ``placement`` is a
|
||||||
:ref:`orchestrator-cli-placement-spec`.
|
:ref:`orchestrator-cli-placement-spec`. For example, to place
|
||||||
|
MDS daemons for the new ``foo`` volume on hosts labeled with ``mds``:
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph fs volume create foo --placement="label:mds"
|
||||||
|
|
||||||
|
You can also update the placement after-the-fact via:
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph orch apply mds foo 'mds-[012]'
|
||||||
|
|
||||||
For manually deploying MDS daemons, use this specification:
|
For manually deploying MDS daemons, use this specification:
|
||||||
|
|
||||||
@ -30,6 +41,7 @@ For manually deploying MDS daemons, use this specification:
|
|||||||
service_id: fs_name
|
service_id: fs_name
|
||||||
placement:
|
placement:
|
||||||
count: 3
|
count: 3
|
||||||
|
label: mds
|
||||||
|
|
||||||
|
|
||||||
The specification can then be applied using:
|
The specification can then be applied using:
|
||||||
|
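As a sketch, saving the specification to a file and feeding it to the
orchestrator is enough; the filename ``mds.yaml`` is only an example:

.. prompt:: bash #

   ceph orch apply -i mds.yaml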
@@ -4,8 +4,8 @@

MGR Service
===========

The cephadm MGR service hosts multiple modules. These include the
:ref:`mgr-dashboard` and the cephadm manager module.
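To see which manager modules are available and enabled on a running cluster,
the standard module commands can be used; a minimal sketch:

.. prompt:: bash #

   ceph mgr module ls
   ceph mgr module enable dashboard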
.. _cephadm-mgr-networks:
@@ -161,6 +161,53 @@ that will tell it to bind to that specific IP.

Note that in these setups, one should make sure to include ``count: 1`` in the
nfs placement, as it's only possible for one nfs daemon to bind to the virtual IP.

NFS with HAProxy Protocol Support
---------------------------------

Cephadm supports deploying NFS in High-Availability mode with additional
HAProxy protocol support. This works just like High-availability NFS but also
supports client IP level configuration on NFS Exports. This feature requires
`NFS-Ganesha v5.0`_ or later.

.. _NFS-Ganesha v5.0: https://github.com/nfs-ganesha/nfs-ganesha/wiki/ReleaseNotes_5

To use this mode, you'll either want to set up the service using the nfs module
(see :ref:`nfs-module-cluster-create`) or manually create services with the
extra parameter ``enable_haproxy_protocol`` set to true. Both NFS Service and
Ingress service must have ``enable_haproxy_protocol`` set to the same value.
For example:

.. code-block:: yaml

   service_type: ingress
   service_id: nfs.foo
   placement:
     count: 1
     hosts:
       - host1
       - host2
       - host3
   spec:
     backend_service: nfs.foo
     monitor_port: 9049
     virtual_ip: 192.168.122.100/24
     enable_haproxy_protocol: true

.. code-block:: yaml

   service_type: nfs
   service_id: foo
   placement:
     count: 1
     hosts:
       - host1
       - host2
       - host3
   spec:
     port: 2049
     enable_haproxy_protocol: true

Further Reading
===============
|
@ -15,10 +15,9 @@ To print a list of devices discovered by ``cephadm``, run this command:
|
|||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph orch device ls [--hostname=...] [--wide] [--refresh]
|
ceph orch device ls [--hostname=...] [--wide] [--refresh]
|
||||||
|
|
||||||
Example
|
Example::
|
||||||
::
|
|
||||||
|
|
||||||
Hostname Path Type Serial Size Health Ident Fault Available
|
Hostname Path Type Serial Size Health Ident Fault Available
|
||||||
srv-01 /dev/sdb hdd 15P0A0YFFRD6 300G Unknown N/A N/A No
|
srv-01 /dev/sdb hdd 15P0A0YFFRD6 300G Unknown N/A N/A No
|
||||||
@ -44,7 +43,7 @@ enable cephadm's "enhanced device scan" option as follows;
|
|||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph config set mgr mgr/cephadm/device_enhanced_scan true
|
ceph config set mgr mgr/cephadm/device_enhanced_scan true
|
||||||
|
|
||||||
.. warning::
|
.. warning::
|
||||||
Although the libstoragemgmt library performs standard SCSI inquiry calls,
|
Although the libstoragemgmt library performs standard SCSI inquiry calls,
|
||||||
@ -173,16 +172,16 @@ will happen without actually creating the OSDs.
|
|||||||
|
|
||||||
For example:
|
For example:
|
||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph orch apply osd --all-available-devices --dry-run
|
ceph orch apply osd --all-available-devices --dry-run
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
NAME HOST DATA DB WAL
|
NAME HOST DATA DB WAL
|
||||||
all-available-devices node1 /dev/vdb - -
|
all-available-devices node1 /dev/vdb - -
|
||||||
all-available-devices node2 /dev/vdc - -
|
all-available-devices node2 /dev/vdc - -
|
||||||
all-available-devices node3 /dev/vdd - -
|
all-available-devices node3 /dev/vdd - -
|
||||||
|
|
||||||
.. _cephadm-osd-declarative:
|
.. _cephadm-osd-declarative:
|
||||||
|
|
||||||
@ -197,9 +196,9 @@ command completes will be automatically found and added to the cluster.
|
|||||||
|
|
||||||
We will examine the effects of the following command:
|
We will examine the effects of the following command:
|
||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph orch apply osd --all-available-devices
|
ceph orch apply osd --all-available-devices
|
||||||
|
|
||||||
After running the above command:
|
After running the above command:
|
||||||
|
|
||||||
@ -212,17 +211,17 @@ If you want to avoid this behavior (disable automatic creation of OSD on availab
|
|||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph orch apply osd --all-available-devices --unmanaged=true
|
ceph orch apply osd --all-available-devices --unmanaged=true
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
Keep these three facts in mind:
|
Keep these three facts in mind:
|
||||||
|
|
||||||
- The default behavior of ``ceph orch apply`` causes cephadm constantly to reconcile. This means that cephadm creates OSDs as soon as new drives are detected.
|
- The default behavior of ``ceph orch apply`` causes cephadm constantly to reconcile. This means that cephadm creates OSDs as soon as new drives are detected.
|
||||||
|
|
||||||
- Setting ``unmanaged: True`` disables the creation of OSDs. If ``unmanaged: True`` is set, nothing will happen even if you apply a new OSD service.
|
- Setting ``unmanaged: True`` disables the creation of OSDs. If ``unmanaged: True`` is set, nothing will happen even if you apply a new OSD service.
|
||||||
|
|
||||||
- ``ceph orch daemon add`` creates OSDs, but does not add an OSD service.
|
- ``ceph orch daemon add`` creates OSDs, but does not add an OSD service.
|
||||||
|
|
||||||
* For cephadm, see also :ref:`cephadm-spec-unmanaged`.
|
* For cephadm, see also :ref:`cephadm-spec-unmanaged`.
|
||||||
|
|
||||||
@ -250,7 +249,7 @@ Example:
|
|||||||
|
|
||||||
Expected output::
|
Expected output::
|
||||||
|
|
||||||
Scheduled OSD(s) for removal
|
Scheduled OSD(s) for removal
|
||||||
|
|
||||||
OSDs that are not safe to destroy will be rejected.
|
OSDs that are not safe to destroy will be rejected.
|
||||||
|
|
||||||
@ -273,14 +272,14 @@ You can query the state of OSD operation with the following command:
|
|||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph orch osd rm status
|
ceph orch osd rm status
|
||||||
|
|
||||||
Expected output::
|
Expected output::
|
||||||
|
|
||||||
OSD_ID HOST STATE PG_COUNT REPLACE FORCE STARTED_AT
|
OSD_ID HOST STATE PG_COUNT REPLACE FORCE STARTED_AT
|
||||||
2 cephadm-dev done, waiting for purge 0 True False 2020-07-17 13:01:43.147684
|
2 cephadm-dev done, waiting for purge 0 True False 2020-07-17 13:01:43.147684
|
||||||
3 cephadm-dev draining 17 False True 2020-07-17 13:01:45.162158
|
3 cephadm-dev draining 17 False True 2020-07-17 13:01:45.162158
|
||||||
4 cephadm-dev started 42 False True 2020-07-17 13:01:45.162158
|
4 cephadm-dev started 42 False True 2020-07-17 13:01:45.162158
|
||||||
|
|
||||||
|
|
||||||
When no PGs are left on the OSD, it will be decommissioned and removed from the cluster.
|
When no PGs are left on the OSD, it will be decommissioned and removed from the cluster.
|
||||||
@ -302,11 +301,11 @@ Example:
|
|||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph orch osd rm stop 4
|
ceph orch osd rm stop 4
|
||||||
|
|
||||||
Expected output::
|
Expected output::
|
||||||
|
|
||||||
Stopped OSD(s) removal
|
Stopped OSD(s) removal
|
||||||
|
|
||||||
This resets the initial state of the OSD and takes it off the removal queue.
|
This resets the initial state of the OSD and takes it off the removal queue.
|
||||||
|
|
||||||
@ -327,7 +326,7 @@ Example:
|
|||||||
|
|
||||||
Expected output::
|
Expected output::
|
||||||
|
|
||||||
Scheduled OSD(s) for replacement
|
Scheduled OSD(s) for replacement
|
||||||
|
|
||||||
This follows the same procedure as the procedure in the "Remove OSD" section, with
|
This follows the same procedure as the procedure in the "Remove OSD" section, with
|
||||||
one exception: the OSD is not permanently removed from the CRUSH hierarchy, but is
|
one exception: the OSD is not permanently removed from the CRUSH hierarchy, but is
|
||||||
@ -434,10 +433,10 @@ the ``ceph orch ps`` output in the ``MEM LIMIT`` column::
|
|||||||
To exclude an OSD from memory autotuning, disable the autotune option
|
To exclude an OSD from memory autotuning, disable the autotune option
|
||||||
for that OSD and also set a specific memory target. For example,
|
for that OSD and also set a specific memory target. For example,
|
||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph config set osd.123 osd_memory_target_autotune false
|
ceph config set osd.123 osd_memory_target_autotune false
|
||||||
ceph config set osd.123 osd_memory_target 16G
|
ceph config set osd.123 osd_memory_target 16G
|
||||||
|
|
||||||
|
|
||||||
.. _drivegroups:
|
.. _drivegroups:
|
||||||
@ -500,7 +499,7 @@ Example
|
|||||||
|
|
||||||
.. prompt:: bash [monitor.1]#
|
.. prompt:: bash [monitor.1]#
|
||||||
|
|
||||||
ceph orch apply -i /path/to/osd_spec.yml --dry-run
|
ceph orch apply -i /path/to/osd_spec.yml --dry-run
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@ -510,9 +509,9 @@ Filters
|
|||||||
-------
|
-------
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
Filters are applied using an `AND` gate by default. This means that a drive
|
Filters are applied using an `AND` gate by default. This means that a drive
|
||||||
must fulfill all filter criteria in order to get selected. This behavior can
|
must fulfill all filter criteria in order to get selected. This behavior can
|
||||||
be adjusted by setting ``filter_logic: OR`` in the OSD specification.
|
be adjusted by setting ``filter_logic: OR`` in the OSD specification.
|
||||||
|
|
||||||
Filters are used to assign disks to groups, using their attributes to group
|
Filters are used to assign disks to groups, using their attributes to group
|
||||||
them.
|
them.
|
||||||
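As a sketch of how ``filter_logic`` fits into a specification (the service id,
host pattern, and device attributes here are only examples):

.. code-block:: yaml

   service_type: osd
   service_id: osd_using_or_logic
   placement:
     host_pattern: '*'
   spec:
     data_devices:
       vendor: VendorA
       rotational: 1
     filter_logic: OR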
@@ -522,7 +521,7 @@ information about the attributes with this command:

.. code-block:: bash

  ceph-volume inventory </path/to/disk>

Vendor or Model
^^^^^^^^^^^^^^^

@@ -631,9 +630,9 @@ but want to use only the first two, you could use `limit`:

.. code-block:: yaml

    data_devices:
      vendor: VendorA
      limit: 2

.. note:: `limit` is a last resort and shouldn't be used if it can be avoided.

@@ -856,8 +855,8 @@ See :ref:`orchestrator-cli-placement-spec`

.. note::

   Assuming each host has a unique disk layout, each OSD
   spec needs to have a different service id.
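For instance, two hosts with different disk layouts might each get their own
spec; the service ids, host names, and device filters below are illustrative
only:

.. code-block:: yaml

   service_type: osd
   service_id: osd_host1_layout
   placement:
     hosts:
       - host1
   spec:
     data_devices:
       rotational: 1
   ---
   service_type: osd
   service_id: osd_host2_layout
   placement:
     hosts:
       - host2
   spec:
     data_devices:
       rotational: 0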
Dedicated wal + db

@@ -987,7 +986,7 @@ activates all existing OSDs on a host.

.. prompt:: bash #

   ceph cephadm osd activate <host>...

This will scan all existing disks for OSDs and deploy corresponding daemons.
|
@ -239,12 +239,14 @@ It is a yaml format file with the following properties:
|
|||||||
- host2
|
- host2
|
||||||
- host3
|
- host3
|
||||||
spec:
|
spec:
|
||||||
backend_service: rgw.something # adjust to match your existing RGW service
|
backend_service: rgw.something # adjust to match your existing RGW service
|
||||||
virtual_ip: <string>/<string> # ex: 192.168.20.1/24
|
virtual_ip: <string>/<string> # ex: 192.168.20.1/24
|
||||||
frontend_port: <integer> # ex: 8080
|
frontend_port: <integer> # ex: 8080
|
||||||
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
|
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
|
||||||
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
|
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
|
||||||
ssl_cert: | # optional: SSL certificate and key
|
use_keepalived_multicast: <bool> # optional: Default is False.
|
||||||
|
vrrp_interface_network: <string>/<string> # optional: ex: 192.168.20.0/24
|
||||||
|
ssl_cert: | # optional: SSL certificate and key
|
||||||
-----BEGIN CERTIFICATE-----
|
-----BEGIN CERTIFICATE-----
|
||||||
...
|
...
|
||||||
-----END CERTIFICATE-----
|
-----END CERTIFICATE-----
|
||||||
@ -270,6 +272,7 @@ It is a yaml format file with the following properties:
|
|||||||
frontend_port: <integer> # ex: 8080
|
frontend_port: <integer> # ex: 8080
|
||||||
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
|
monitor_port: <integer> # ex: 1967, used by haproxy for load balancer status
|
||||||
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
|
virtual_interface_networks: [ ... ] # optional: list of CIDR networks
|
||||||
|
first_virtual_router_id: <integer> # optional: default 50
|
||||||
ssl_cert: | # optional: SSL certificate and key
|
ssl_cert: | # optional: SSL certificate and key
|
||||||
-----BEGIN CERTIFICATE-----
|
-----BEGIN CERTIFICATE-----
|
||||||
...
|
...
|
||||||
@ -303,6 +306,21 @@ where the properties of this service specification are:
|
|||||||
* ``ssl_cert``:
|
* ``ssl_cert``:
|
||||||
SSL certificate, if SSL is to be enabled. This must contain the both the certificate and
|
SSL certificate, if SSL is to be enabled. This must contain the both the certificate and
|
||||||
private key blocks in .pem format.
|
private key blocks in .pem format.
|
||||||
|
* ``use_keepalived_multicast``
|
||||||
|
Default is False. By default, cephadm will deploy keepalived config to use unicast IPs,
|
||||||
|
using the IPs of the hosts. The IPs chosen will be the same IPs cephadm uses to connect
|
||||||
|
to the machines. But if multicast is prefered, we can set ``use_keepalived_multicast``
|
||||||
|
to ``True`` and Keepalived will use multicast IP (224.0.0.18) to communicate between instances,
|
||||||
|
using the same interfaces as where the VIPs are.
|
||||||
|
* ``vrrp_interface_network``
|
||||||
|
By default, cephadm will configure keepalived to use the same interface where the VIPs are
|
||||||
|
for VRRP communication. If another interface is needed, it can be set via ``vrrp_interface_network``
|
||||||
|
with a network to identify which ethernet interface to use.
|
||||||
|
* ``first_virtual_router_id``
|
||||||
|
Default is 50. When deploying more than 1 ingress, this parameter can be used to ensure each
|
||||||
|
keepalived will have different virtual_router_id. In the case of using ``virtual_ips_list``,
|
||||||
|
each IP will create its own virtual router. So the first one will have ``first_virtual_router_id``,
|
||||||
|
second one will have ``first_virtual_router_id`` + 1, etc. Valid values go from 1 to 255.
|
||||||
|
.. _ingress-virtual-ip:
@ -1,60 +1,56 @@
|
|||||||
Troubleshooting
|
Troubleshooting
|
||||||
===============
|
===============
|
||||||
|
|
||||||
You may wish to investigate why a cephadm command failed
|
This section explains how to investigate why a cephadm command failed or why a
|
||||||
or why a certain service no longer runs properly.
|
certain service no longer runs properly.
|
||||||
|
|
||||||
Cephadm deploys daemons within containers. This means that
|
Cephadm deploys daemons within containers. Troubleshooting containerized
|
||||||
troubleshooting those containerized daemons will require
|
daemons requires a different process than does troubleshooting traditional
|
||||||
a different process than traditional package-install daemons.
|
daemons that were installed by means of packages.
|
||||||
|
|
||||||
Here are some tools and commands to help you troubleshoot
|
Here are some tools and commands to help you troubleshoot your Ceph
|
||||||
your Ceph environment.
|
environment.
|
||||||
|
|
||||||
.. _cephadm-pause:
|
.. _cephadm-pause:
|
||||||
|
|
||||||
Pausing or Disabling cephadm
|
Pausing or Disabling cephadm
|
||||||
----------------------------
|
----------------------------
|
||||||
|
|
||||||
If something goes wrong and cephadm is behaving badly, you can
|
If something goes wrong and cephadm is behaving badly, pause most of the Ceph
|
||||||
pause most of the Ceph cluster's background activity by running
|
cluster's background activity by running the following command:
|
||||||
the following command:
|
|
||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph orch pause
|
ceph orch pause
|
||||||
|
|
||||||
This stops all changes in the Ceph cluster, but cephadm will
|
This stops all changes in the Ceph cluster, but cephadm will still periodically
|
||||||
still periodically check hosts to refresh its inventory of
|
check hosts to refresh its inventory of daemons and devices. Disable cephadm
|
||||||
daemons and devices. You can disable cephadm completely by
|
completely by running the following commands:
|
||||||
running the following commands:
|
|
||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
ceph orch set backend ''
|
ceph orch set backend ''
|
||||||
ceph mgr module disable cephadm
|
ceph mgr module disable cephadm
|
||||||
|
|
||||||
These commands disable all of the ``ceph orch ...`` CLI commands.
|
These commands disable all of the ``ceph orch ...`` CLI commands. All
|
||||||
All previously deployed daemon containers continue to exist and
|
previously deployed daemon containers continue to run and will start just as
|
||||||
will start as they did before you ran these commands.
|
they were before you ran these commands.
|
||||||
|
|
||||||
See :ref:`cephadm-spec-unmanaged` for information on disabling
|
See :ref:`cephadm-spec-unmanaged` for more on disabling individual services.
|
||||||
individual services.
|
|
||||||
|
|
||||||
|
|
||||||
Per-service and Per-daemon Events
|
Per-service and Per-daemon Events
|
||||||
---------------------------------
|
---------------------------------
|
||||||
|
|
||||||
In order to facilitate debugging failed daemons,
|
To make it easier to debug failed daemons, cephadm stores events per service
|
||||||
cephadm stores events per service and per daemon.
|
and per daemon. These events often contain information relevant to
|
||||||
These events often contain information relevant to
|
the troubleshooting of your Ceph cluster.
|
||||||
troubleshooting your Ceph cluster.
|
|
||||||
|
|
||||||
Listing Service Events
|
Listing Service Events
|
||||||
~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
To see the events associated with a certain service, run a
|
To see the events associated with a certain service, run a command of the
|
||||||
command of the and following form:
|
following form:
|
||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
@ -81,8 +77,8 @@ This will return something in the following form:
|
|||||||
Listing Daemon Events
|
Listing Daemon Events
|
||||||
~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
To see the events associated with a certain daemon, run a
|
To see the events associated with a certain daemon, run a command of the
|
||||||
command of the and following form:
|
following form:
|
||||||
|
|
||||||
.. prompt:: bash #
|
.. prompt:: bash #
|
||||||
|
|
||||||
@ -105,32 +101,41 @@ This will return something in the following form:
|
|||||||
Checking Cephadm Logs
|
Checking Cephadm Logs
|
||||||
---------------------
|
---------------------
|
||||||
|
|
||||||
To learn how to monitor cephadm logs as they are generated, read :ref:`watching_cephadm_logs`.
|
To learn how to monitor cephadm logs as they are generated, read
|
||||||
|
:ref:`watching_cephadm_logs`.
|
||||||
|
|
||||||
If your Ceph cluster has been configured to log events to files, there will be a
|
If your Ceph cluster has been configured to log events to files, there will be
|
||||||
``ceph.cephadm.log`` file on all monitor hosts (see
|
a ``ceph.cephadm.log`` file on all monitor hosts. See :ref:`cephadm-logs` for a
|
||||||
:ref:`cephadm-logs` for a more complete explanation).
|
more complete explanation.
|
||||||
|
|
||||||
Gathering Log Files
|
Gathering Log Files
|
||||||
-------------------
|
-------------------
|
||||||
|
|
||||||
Use journalctl to gather the log files of all daemons:
|
Use ``journalctl`` to gather the log files of all daemons:
|
||||||
|
|
||||||
.. note:: By default cephadm now stores logs in journald. This means
|
.. note:: By default cephadm now stores logs in journald. This means
|
||||||
that you will no longer find daemon logs in ``/var/log/ceph/``.
|
that you will no longer find daemon logs in ``/var/log/ceph/``.
|
||||||
|
|
||||||
To read the log file of one specific daemon, run::
|
To read the log file of one specific daemon, run a command of the following
|
||||||
|
form:
|
||||||
|
|
||||||
cephadm logs --name <name-of-daemon>
|
.. prompt:: bash
|
||||||
|
|
||||||
Note: this only works when run on the same host where the daemon is running. To
|
cephadm logs --name <name-of-daemon>
|
||||||
get logs of a daemon running on a different host, give the ``--fsid`` option::
|
|
||||||
|
|
||||||
cephadm logs --fsid <fsid> --name <name-of-daemon>
|
.. Note:: This works only when run on the same host that is running the daemon.
|
||||||
|
To get the logs of a daemon that is running on a different host, add the
|
||||||
|
``--fsid`` option to the command, as in the following example:
|
||||||
|
|
||||||
where the ``<fsid>`` corresponds to the cluster ID printed by ``ceph status``.
|
.. prompt:: bash
|
||||||
|
|
||||||
To fetch all log files of all daemons on a given host, run::
|
cephadm logs --fsid <fsid> --name <name-of-daemon>
|
||||||
|
|
||||||
|
In this example, ``<fsid>`` corresponds to the cluster ID returned by the
|
||||||
|
``ceph status`` command.
|
||||||
|
|
||||||
|
To fetch all log files of all daemons on a given host, run the following
|
||||||
|
for-loop::
|
||||||
|
|
||||||
for name in $(cephadm ls | jq -r '.[].name') ; do
|
for name in $(cephadm ls | jq -r '.[].name') ; do
|
||||||
cephadm logs --fsid <fsid> --name "$name" > $name;
|
cephadm logs --fsid <fsid> --name "$name" > $name;
|
||||||
@ -139,39 +144,41 @@ To fetch all log files of all daemons on a given host, run::

Collecting Systemd Status
-------------------------

To print the state of a systemd unit, run a command of the following form:

.. prompt:: bash

   systemctl status "ceph-$(cephadm shell ceph fsid)@<service name>.service";

To fetch the state of all daemons of a given host, run the following shell
script::

    fsid="$(cephadm shell ceph fsid)"
    for name in $(cephadm ls | jq -r '.[].name') ; do
      systemctl status "ceph-$fsid@$name.service" > $name;
    done


List all Downloaded Container Images
------------------------------------

To list all container images that are downloaded on a host, run the following
commands:

.. prompt:: bash #

   podman ps -a --format json | jq '.[].Image'

::

   "docker.io/library/centos:8"
   "registry.opensuse.org/opensuse/leap:15.2"

.. note:: ``Image`` might also be called ``ImageID``.


Manually Running Containers
---------------------------

Cephadm uses small wrappers when running containers. Refer to
``/var/lib/ceph/<cluster-fsid>/<service-name>/unit.run`` for the container
execution command.
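
For example, to see how a given daemon on the current host is actually
launched, you can print its wrapper. This is only a sketch; the
``mon.$(hostname)`` daemon directory is an example and will differ for other
daemon types::

    fsid="$(cephadm shell ceph fsid)"
    # print the container invocation used for this host's monitor daemon
    cat /var/lib/ceph/$fsid/mon.$(hostname)/unit.run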

.. _cephadm-ssh-errors:
@ -187,9 +194,10 @@ Error message::
|
|||||||
Please make sure that the host is reachable and accepts connections using the cephadm SSH key
|
Please make sure that the host is reachable and accepts connections using the cephadm SSH key
|
||||||
...
|
...
|
||||||
|
|
||||||
Things Ceph administrators can do:
|
If you receive the above error message, try the following things to
|
||||||
|
troubleshoot the SSH connection between ``cephadm`` and the monitor:
|
||||||
|
|
||||||
1. Ensure cephadm has an SSH identity key::
|
1. Ensure that ``cephadm`` has an SSH identity key::
|
||||||
|
|
||||||
[root@mon1~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
|
[root@mon1~]# cephadm shell -- ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
|
||||||
INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
|
INFO:cephadm:Inferring fsid f8edc08a-7f17-11ea-8707-000c2915dd98
|
||||||
@ -202,20 +210,21 @@ Things Ceph administrators can do:

   or::

     [root@mon1 ~]# cat ~/cephadm_private_key | cephadm shell -- ceph cephadm set-ssh-key -i -

2. Ensure that the SSH config is correct::

     [root@mon1 ~]# cephadm shell -- ceph cephadm get-ssh-config > config

3. Verify that it is possible to connect to the host::

     [root@mon1 ~]# ssh -F config -i ~/cephadm_private_key root@mon1

Verifying that the Public Key is Listed in the authorized_keys file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To verify that the public key is in the ``authorized_keys`` file, run the
following commands::

    [root@mon1 ~]# cephadm shell -- ceph cephadm get-pub-key > ~/ceph.pub
    [root@mon1 ~]# grep "`cat ~/ceph.pub`" /root/.ssh/authorized_keys
@ -231,27 +240,33 @@ Or this error::

   Must set public_network config option or specify a CIDR network, ceph addrvec, or plain IP

This means that you must run a command of this form:

.. prompt:: bash

   ceph config set mon public_network <mon_network>

For more detail on operations of this kind, see
:ref:`deploy_additional_monitors`.

Accessing the Admin Socket
--------------------------

Each Ceph daemon provides an admin socket that bypasses the MONs (See
:ref:`rados-monitoring-using-admin-socket`).

#. To access the admin socket, enter the daemon container on the host::

     [root@mon1 ~]# cephadm enter --name <daemon-name>

#. Run a command of the following form to see the admin socket's configuration::

     [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config show
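
#. Optionally, run other administrative commands through the same socket. The
   following is a quick sketch; the ``debug_ms`` option is only an example of a
   setting you might query::

     [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok config get debug_ms
     [ceph: root@mon1 /]# ceph --admin-daemon /var/run/ceph/ceph-<daemon-name>.asok perf dump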

Running Various Ceph Tools
--------------------------------

To run Ceph tools such as ``ceph-objectstore-tool`` or
``ceph-monstore-tool``, invoke the cephadm CLI with
``cephadm shell --name <daemon-name>``. For example::
@ -268,100 +283,232 @@ To run Ceph tools like ``ceph-objectstore-tool`` or

    election_strategy: 1
    0: [v2:127.0.0.1:3300/0,v1:127.0.0.1:6789/0] mon.myhostname

The cephadm shell sets up the environment in a way that is suitable for
extended daemon maintenance and for the interactive running of daemons.

.. _cephadm-restore-quorum:

Restoring the Monitor Quorum
----------------------------

If the Ceph Monitor daemons (mons) cannot form a quorum, ``cephadm`` will not
be able to manage the cluster until quorum is restored.

In order to restore the quorum, remove unhealthy monitors
from the monmap by following these steps:

1. Stop all Monitors. Use ``ssh`` to connect to each Monitor's host, and then
   while connected to the Monitor's host use ``cephadm`` to stop the Monitor
   daemon:

   .. prompt:: bash

      ssh {mon-host}
      cephadm unit --name {mon.hostname} stop

2. Identify a surviving Monitor and log in to its host:

   .. prompt:: bash

      ssh {mon-host}
      cephadm enter --name {mon.hostname}

3. Follow the steps in :ref:`rados-mon-remove-from-unhealthy`.

.. _cephadm-manually-deploy-mgr:

Manually Deploying a Manager Daemon
-----------------------------------
At least one Manager (``mgr``) daemon is required by cephadm in order to manage
the cluster. If the last remaining Manager has been removed from the Ceph
cluster, follow these steps in order to deploy a fresh Manager on an arbitrary
host in your cluster. In this example, the freshly-deployed Manager daemon is
called ``mgr.hostname.smfvfd``.

#. Disable the cephadm scheduler, in order to prevent ``cephadm`` from removing
   the new Manager. See :ref:`cephadm-enable-cli`:

   .. prompt:: bash #

      ceph config-key set mgr/cephadm/pause true

#. Retrieve or create the "auth entry" for the new Manager:

   .. prompt:: bash #

      ceph auth get-or-create mgr.hostname.smfvfd mon "profile mgr" osd "allow *" mds "allow *"

#. Retrieve the Monitor's configuration:

   .. prompt:: bash #

      ceph config generate-minimal-conf

#. Retrieve the container image:

   .. prompt:: bash #

      ceph config get "mgr.hostname.smfvfd" container_image

#. Create a file called ``config-json.json``, which contains the information
   necessary to deploy the daemon:

   .. code-block:: json

      {
        "config": "# minimal ceph.conf for 8255263a-a97e-4934-822c-00bfe029b28f\n[global]\n\tfsid = 8255263a-a97e-4934-822c-00bfe029b28f\n\tmon_host = [v2:192.168.0.1:40483/0,v1:192.168.0.1:40484/0]\n",
        "keyring": "[mgr.hostname.smfvfd]\n\tkey = V2VyIGRhcyBsaWVzdCBpc3QgZG9vZi4=\n"
      }

#. Deploy the Manager daemon:

   .. prompt:: bash #

      cephadm --image <container-image> deploy --fsid <fsid> --name mgr.hostname.smfvfd --config-json config-json.json

Capturing Core Dumps
---------------------

A Ceph cluster that uses ``cephadm`` can be configured to capture core dumps.
The initial capture and processing of the coredump is performed by
`systemd-coredump
<https://www.man7.org/linux/man-pages/man8/systemd-coredump.8.html>`_.

To enable coredump handling, run the following command:

.. prompt:: bash #

   ulimit -c unlimited

.. note::

   Core dumps are not namespaced by the kernel. This means that core dumps are
   written to ``/var/lib/systemd/coredump`` on the container host. The ``ulimit
   -c unlimited`` setting will persist only until the system is rebooted.

Wait for the crash to happen again. To simulate the crash of a daemon, run, for
example, ``killall -3 ceph-mon``.

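If ``coredumpctl`` is available on the host, it provides a convenient way to
confirm that a core was actually captured before starting a debugging session.
This is only a sketch; the ``ceph-mon`` process name is an example:

.. prompt:: bash #

   coredumpctl list ceph-mon
   coredumpctl info
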
Running the Debugger with cephadm
----------------------------------

Running a single debugging session
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Initiate a debugging session by using the ``cephadm shell`` command.
From within the shell container we need to install the debugger and debuginfo
packages. To debug a core file captured by systemd, run the following:

#. Start the shell session:

   .. prompt:: bash #

      cephadm shell --mount /var/lib/systemd/coredump

#. From within the shell session, run the following commands:

   .. prompt:: bash #

      dnf install ceph-debuginfo gdb zstd

   .. prompt:: bash #

      unzstd /var/lib/systemd/coredump/core.ceph-*.zst

   .. prompt:: bash #

      gdb /usr/bin/ceph-mon /mnt/coredump/core.ceph-*.zst

#. Run debugger commands at gdb's prompt:

   .. prompt:: bash (gdb)

      bt

   ::

     #0  0x00007fa9117383fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
     #1  0x00007fa910d7f8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
     #2  0x00007fa913d3f48f in AsyncMessenger::wait() () from /usr/lib64/ceph/libceph-common.so.2
     #3  0x0000563085ca3d7e in main ()


Running repeated debugging sessions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using ``cephadm shell``, as in the example above, any changes made to the
container that is spawned by the shell command are ephemeral. After the shell
session exits, the files that were downloaded and installed cease to be
available. You can simply re-run the same commands every time ``cephadm
shell`` is invoked, but in order to save time and resources one can create a
new container image and use it for repeated debugging sessions.

In the following example, we create a simple file that will construct the
container image. The command below uses podman but it is expected to work
correctly even if ``podman`` is replaced with ``docker``::

    cat >Containerfile <<EOF
    ARG BASE_IMG=quay.io/ceph/ceph:v18
    FROM \${BASE_IMG}
    # install ceph debuginfo packages, gdb and other potentially useful packages
    RUN dnf install --enablerepo='*debug*' -y ceph-debuginfo gdb zstd strace python3-debuginfo
    EOF
    podman build -t ceph:debugging -f Containerfile .
    # pass --build-arg=BASE_IMG=<your image> to customize the base image

The above file creates a new local image named ``ceph:debugging``. This image
can be used on the same machine that built it. The image can also be pushed to
a container repository or saved and copied to a node running other Ceph
containers. Consult the ``podman`` or ``docker`` documentation for more
information about the container workflow.

After the image has been built, it can be used to initiate repeat debugging
sessions. By using an image in this way, you avoid the trouble of having to
re-install the debug tools and debuginfo packages every time you need to run a
debug session. To debug a core file using this image, in the same way as
previously described, run:

.. prompt:: bash #

   cephadm --image ceph:debugging shell --mount /var/lib/systemd/coredump


Debugging live processes
~~~~~~~~~~~~~~~~~~~~~~~~

The gdb debugger can attach to running processes to debug them. This can be
achieved with a containerized process by using the debug image and attaching
it to the same PID namespace in which the process to be debugged resides.

This requires running a container command with some custom arguments. We can
generate a script that can debug a process in a running container.

.. prompt:: bash #

   cephadm --image ceph:debugging shell --dry-run > /tmp/debug.sh

This creates a script that includes the container command that ``cephadm``
would use to create a shell. Modify the script by removing the ``--init``
argument and replacing it with the argument that joins the namespace used for
a running container. For example, assume we want to debug the Manager
and have determined that the Manager is running in a container named
``ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``. In this case,
the argument
``--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk``
should be used.
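
The exact edit depends on how the generated script is laid out, so treat the
following as a sketch rather than a fixed recipe; the container name below is
the hypothetical one from the example above::

    # swap the --init flag for a --pid argument that joins the target
    # container's PID namespace (adjust the container name to your cluster)
    sed -i 's/--init/--pid=container:ceph-bc615290-685b-11ee-84a6-525400220000-mgr-ceph0-sluwsk/' /tmp/debug.sh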

We can run our debugging container with ``sh /tmp/debug.sh``. Within the shell,
we can run commands such as ``ps`` to get the PID of the Manager process. In
the following example this is ``2``. While running gdb, we can attach to the
running process:

.. prompt:: bash (gdb)

   attach 2
   info threads
   bt

@ -15,7 +15,7 @@ creation of multiple file systems use ``ceph fs flag set enable_multiple true``.

::

    ceph fs new <file system name> <metadata pool name> <data pool name>

This command creates a new file system. The file system name and metadata pool
name are self-explanatory. The specified data pool is the default data pool and
@ -25,19 +25,19 @@ to accommodate the new file system.

::

    ceph fs ls

List all file systems by name.

::

    ceph fs lsflags <file system name>

List all the flags set on a file system.

::

    ceph fs dump [epoch]

This dumps the FSMap at the given epoch (default: current) which includes all
file system settings, MDS daemons and the ranks they hold, and the list of
@ -46,7 +46,7 @@ standby MDS daemons.

::

    ceph fs rm <file system name> [--yes-i-really-mean-it]

Destroy a CephFS file system. This wipes information about the state of the
file system from the FSMap. The metadata pool and data pools are untouched and
@ -54,28 +54,28 @@ must be destroyed separately.

::

    ceph fs get <file system name>

Get information about the named file system, including settings and ranks. This
is a subset of the same information from the ``ceph fs dump`` command.

::

    ceph fs set <file system name> <var> <val>

Change a setting on a file system. These settings are specific to the named
file system and do not affect other file systems.

::

    ceph fs add_data_pool <file system name> <pool name/id>

Add a data pool to the file system. This pool can be used for file layouts
as an alternate location to store file data.
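
For example, a directory can be pointed at an added pool through its file
layout. The pool name, file system name, and mount point below are only
illustrative::

    ceph osd pool create cephfs_data_ssd
    ceph fs add_data_pool cephfs cephfs_data_ssd
    # direct new files under an existing directory of the mounted file system
    # to the added pool
    setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/fast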
::

    ceph fs rm_data_pool <file system name> <pool name/id>

This command removes the specified pool from the list of data pools for the
file system. If any files have layouts for the removed data pool, the file
@ -84,7 +84,7 @@ system) cannot be removed.

::

    ceph fs rename <file system name> <new file system name> [--yes-i-really-mean-it]

Rename a Ceph file system. This also changes the application tags on the data
pools and metadata pool of the file system to the new file system name.
@ -98,7 +98,7 @@ Settings

::

    ceph fs set <fs name> max_file_size <size in bytes>

CephFS has a configurable maximum file size, and it's 1TB by default.
You may wish to set this limit higher if you expect to store large files
@ -132,13 +132,13 @@ Taking a CephFS cluster down is done by setting the down flag:

::

    ceph fs set <fs_name> down true

To bring the cluster back online:

::

    ceph fs set <fs_name> down false

This will also restore the previous value of max_mds. MDS daemons are brought
down in a way such that journals are flushed to the metadata pool and all
@ -149,11 +149,11 @@ Taking the cluster down rapidly for deletion or disaster recovery
-----------------------------------------------------------------

To allow rapidly deleting a file system (for testing) or to quickly bring the
file system and MDS daemons down, use the ``ceph fs fail`` command:

::

    ceph fs fail <fs_name>

This command sets a file system flag to prevent standbys from
activating on the file system (the ``joinable`` flag).
@ -162,7 +162,7 @@ This process can also be done manually by doing the following:

::

    ceph fs set <fs_name> joinable false

Then the operator can fail all of the ranks which causes the MDS daemons to
respawn as standbys. The file system will be left in a degraded state.
@ -170,7 +170,7 @@ respawn as standbys. The file system will be left in a degraded state.

::

    # For all ranks, 0-N:
    ceph mds fail <fs_name>:<n>

Once all ranks are inactive, the file system may also be deleted or left in
this state for other purposes (perhaps disaster recovery).
@ -179,7 +179,7 @@ To bring the cluster back up, simply set the joinable flag:

::

    ceph fs set <fs_name> joinable true


Daemons
@ -198,34 +198,35 @@ Commands to manipulate MDS daemons:

::

    ceph mds fail <gid/name/role>

Mark an MDS daemon as failed. This is equivalent to what the cluster
would do if an MDS daemon had failed to send a message to the mon
for ``mds_beacon_grace`` seconds. If the daemon was active and a suitable
standby is available, using ``ceph mds fail`` will force a failover to the
standby.

If the MDS daemon was in reality still running, then using ``ceph mds fail``
will cause the daemon to restart. If it was active and a standby was
available, then the "failed" daemon will return as a standby.


::

    ceph tell mds.<daemon name> command ...

Send a command to the MDS daemon(s). Use ``mds.*`` to send a command to all
daemons. Use ``ceph tell mds.* help`` to learn available commands.
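
For example, to list the client sessions held by rank 0 of a file system, you
can run the following; the file system name ``cephfs`` is only an example::

    ceph tell mds.cephfs:0 session ls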
::

    ceph mds metadata <gid/name/role>

Get metadata about the given MDS known to the Monitors.

::

    ceph mds repaired <role>

Mark the file system rank as repaired. Unlike the name suggests, this command
does not change an MDS; it manipulates the file system rank which has been
@ -244,14 +245,14 @@ Commands to manipulate required client features of a file system:

::

    ceph fs required_client_features <fs name> add reply_encoding
    ceph fs required_client_features <fs name> rm reply_encoding

To list all CephFS features

::

    ceph fs feature ls

Clients that are missing newly added features will be evicted automatically.

@ -346,7 +347,7 @@ Global settings

::

    ceph fs flag set <flag name> <flag val> [<confirmation string>]

Sets a global CephFS flag (i.e. not specific to a particular file system).
Currently, the only flag setting is 'enable_multiple' which allows having
@ -368,13 +369,13 @@ file system.

::

    ceph mds rmfailed

This removes a rank from the failed set.

::

    ceph fs reset <file system name>

This command resets the file system state to defaults, except for the name and
pools. Non-zero ranks are saved in the stopped set.
@ -382,7 +383,7 @@ pools. Non-zero ranks are saved in the stopped set.

::

    ceph fs new <file system name> <metadata pool name> <data pool name> --fscid <fscid> --force

This command creates a file system with a specific **fscid** (file system cluster ID).
You may want to do this when an application expects the file system's ID to be
@ -154,14 +154,8 @@ readdir. The behavior of the decay counter is the same as for cache trimming or

caps recall. Each readdir call increments the counter by the number of files in
the result.

.. confval:: mds_session_max_caps_throttle_ratio

.. confval:: mds_cap_acquisition_throttle_retry_request_timeout

If the number of caps acquired by the client per session is greater than the
@ -42,28 +42,21 @@ FS Volumes

Create a volume by running the following command:

.. prompt:: bash #

   ceph fs volume create <vol_name> [placement]

This creates a CephFS file system and its data and metadata pools. It can also
deploy MDS daemons for the filesystem using a ceph-mgr orchestrator module (for
example Rook). See :doc:`/mgr/orchestrator`.

``<vol_name>`` is the volume name (an arbitrary string). ``[placement]`` is an
optional string that specifies the :ref:`orchestrator-cli-placement-spec` for
the MDS. See also :ref:`orchestrator-cli-cephfs` for more examples on
placement.

.. note:: Specifying placement via a YAML file is not supported through the
   volume interface.
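
As a concrete sketch (the volume name and host names are only examples), the
following creates a volume whose MDS daemons are spread across two hosts, four
daemons in total::

    ceph fs volume create vol_a "4 host1,host2"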

To remove a volume, run the following command:
@ -72,6 +65,11 @@ To remove a volume, run the following command:
This removes a file system and its data and metadata pools. It also tries to
remove MDS daemons using the enabled ceph-mgr orchestrator module.

.. note:: After volume deletion, it is recommended to restart `ceph-mgr`
   if a new file system is created on the same cluster and subvolume interface
   is being used. Please see https://tracker.ceph.com/issues/49605#note-5
   for more details.

List volumes by running the following command:

    $ ceph fs volume ls
@ -28,7 +28,7 @@ To FUSE-mount the Ceph file system, use the ``ceph-fuse`` command::

    mkdir /mnt/mycephfs
    ceph-fuse --id foo /mnt/mycephfs

Option ``--id`` passes the name of the CephX user whose keyring we intend to
use for mounting CephFS. In the above command, it's ``foo``. You can also use
``-n`` instead, although ``--id`` is evidently easier::
|
@ -226,6 +226,20 @@ For the reverse situation:
|
|||||||
The ``home/patrick`` directory and its children will be pinned to rank 2
|
The ``home/patrick`` directory and its children will be pinned to rank 2
|
||||||
because its export pin overrides the policy on ``home``.
|
because its export pin overrides the policy on ``home``.
|
||||||
|
|
||||||
|
To remove a partitioning policy, remove the respective extended attribute
|
||||||
|
or set the value to 0.
|
||||||
|
|
||||||
|
.. code::bash
|
||||||
|
$ setfattr -n ceph.dir.pin.distributed -v 0 home
|
||||||
|
# or
|
||||||
|
$ setfattr -x ceph.dir.pin.distributed home
|
||||||
|
|
||||||
|
For export pins, remove the extended attribute or set the extended attribute
|
||||||
|
value to `-1`.
|
||||||
|
|
||||||
|
.. code::bash
|
||||||
|
$ setfattr -n ceph.dir.pin -v -1 home
|
||||||
|
|
||||||
|
|
||||||
Dynamic subtree partitioning with Balancer on specific ranks
|
Dynamic subtree partitioning with Balancer on specific ranks
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
@ -143,3 +143,14 @@ The types of damage that can be reported and repaired by File System Scrub are:

* BACKTRACE : Inode's backtrace in the data pool is corrupted.


Evaluate strays using recursive scrub
=====================================

- In order to evaluate strays, i.e. purge stray directories in ``~mdsdir``, use the following command::

    ceph tell mds.<fsname>:0 scrub start ~mdsdir recursive

- ``~mdsdir`` is not enqueued by default when scrubbing at the CephFS root. In order to perform stray evaluation
  at root, run scrub with flags ``scrub_mdsdir`` and ``recursive``::

    ceph tell mds.<fsname>:0 scrub start / recursive,scrub_mdsdir
@ -162,6 +162,13 @@ Examples::

snapshot creation is accounted for in the "created_count" field, which is a
cumulative count of the total number of snapshots created so far.

.. note:: The maximum number of snapshots to retain per directory is limited by the
   config tunable ``mds_max_snaps_per_dir``. This tunable defaults to 100.
   To ensure a new snapshot can be created, one snapshot less than this will be
   retained. So by default, a maximum of 99 snapshots will be retained.

.. note:: The ``--fs`` argument is now required if there is more than one file system.
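
If that retention ceiling is too low for a directory's schedule, the tunable
can be raised; the value below is only an example::

    ceph config set mds mds_max_snaps_per_dir 150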

Active and inactive schedules
-----------------------------
Snapshot schedules can be added for a path that doesn't exist yet in the
@ -98,7 +98,7 @@ things to do:

.. code:: bash

    ceph config set mds mds_heartbeat_grace 3600

This has the effect of having the MDS continue to send beacons to the monitors
even when its internal "heartbeat" mechanism has not been reset (beat) in one
58 ceph/doc/dev/balancer-design.rst Normal file
@ -0,0 +1,58 @@

============================
Balancing in Ceph
============================

Introduction
============

In distributed storage systems like Ceph, it is important to balance write and read requests for optimal performance. Write balancing ensures fast storage
and replication of data in a cluster, while read balancing ensures quick access and retrieval of data in a cluster. Both types of balancing are important
in distributed systems for different reasons.

Upmap Balancing
==========================

Importance in a Cluster
-----------------------

Capacity balancing is a functional requirement. A system like Ceph is as full as its fullest device: When one device is full, the system can not serve write
requests anymore, and Ceph loses its function. To avoid filling up devices, we want to balance capacity across the devices in a fair way. Each device should
get a capacity proportional to its size so all devices have the same fullness level. From a performance perspective, capacity balancing creates fair share
workloads on the OSDs for write requests.

Capacity balancing is expensive. The operation (changing the mapping of pgs) requires data movement by definition, which takes time. During this time, the
performance of the system is reduced.

In Ceph, we can balance the write performance if all devices are homogeneous (same size and performance).

How to Balance Capacity in Ceph
-------------------------------

See :ref:`upmap` for more information.
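
In practice, capacity balancing is typically driven by the ``balancer`` manager
module in ``upmap`` mode; the following commands are a sketch of how one might
enable it and check the result::

    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status
    ceph osd df    # compare the fullness (%USE) of the OSDs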

Read Balancing
==============

Unlike capacity balancing, read balancing is not a strict requirement for Ceph’s functionality. Instead, it is a performance requirement, as it helps the system
“work” better. The overall goal is to ensure each device gets its fair share of primary OSDs so read requests are distributed evenly across OSDs in the cluster.
Unbalanced read requests lead to bad performance because of reduced overall cluster bandwidth.

Read balancing is cheap. Unlike capacity balancing, there is no data movement involved. It is just a metadata operation, where the osdmap is updated to change
which participating OSD in a pg is primary. This operation is fast and has no impact on the cluster performance (except improved performance when the operation
completes – almost immediately).

In Ceph, we can balance the read performance if all devices are homogeneous (same size and performance). However, in future versions, the read balancer can be improved
to achieve overall cluster performance in heterogeneous systems.

How to Balance Reads in Ceph
----------------------------
See :ref:`read_balancer` for more information.

Also, see the Cephalocon 2023 talk `New Read Balancer in Ceph <https://www.youtube.com/watch?v=AT_cKYaQzcU/>`_ for a demonstration of the offline version
of the read balancer.

Plans for the Next Version
--------------------------

1. Improve behavior for heterogeneous OSDs in a pool
2. Offer read balancing as an online option to the balancer manager module
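
As an illustration of the offline read balancer mentioned above, a typical
session looks roughly like the following; treat the exact ``osdmaptool`` flags
and output-file handling as assumptions and verify them against
``osdmaptool --help`` on your build::

    ceph osd getmap -o om                              # export the current osdmap
    osdmaptool om --read out.txt --read-pool <pool>    # compute primary changes
    source out.txt                                     # apply the suggested pg-upmap-primary commands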
@ -1,200 +0,0 @@
Cache pool
==========

Purpose
-------

Use a pool of fast storage devices (probably SSDs) and use it as a
cache for an existing slower and larger pool.

Use a replicated pool as a front-end to service most I/O, and destage
cold data to a separate erasure coded pool that does not currently (and
cannot efficiently) handle the workload.

We should be able to create and add a cache pool to an existing pool
of data, and later remove it, without disrupting service or migrating
data around.

Use cases
---------

Read-write pool, writeback
~~~~~~~~~~~~~~~~~~~~~~~~~~

We have an existing data pool and put a fast cache pool "in front" of
it. Writes will go to the cache pool and immediately ack. We flush
them back to the data pool based on the defined policy.

Read-only pool, weak consistency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We have an existing data pool and add one or more read-only cache
pools. We copy data to the cache pool(s) on read. Writes are
forwarded to the original data pool. Stale data is expired from the
cache pools based on the defined policy.

This is likely only useful for specific applications with specific
data access patterns. It may be a match for rgw, for example.


Interface
---------

Set up a read/write cache pool foo-hot for pool foo::

 ceph osd tier add foo foo-hot
 ceph osd tier cache-mode foo-hot writeback

Direct all traffic for foo to foo-hot::

 ceph osd tier set-overlay foo foo-hot

Set the target size and enable the tiering agent for foo-hot::

 ceph osd pool set foo-hot hit_set_type bloom
 ceph osd pool set foo-hot hit_set_count 1
 ceph osd pool set foo-hot hit_set_period 3600   # 1 hour
 ceph osd pool set foo-hot target_max_bytes 1000000000000  # 1 TB
 ceph osd pool set foo-hot min_read_recency_for_promote 1
 ceph osd pool set foo-hot min_write_recency_for_promote 1

Drain the cache in preparation for turning it off::

 ceph osd tier cache-mode foo-hot forward
 rados -p foo-hot cache-flush-evict-all

When cache pool is finally empty, disable it::

 ceph osd tier remove-overlay foo
 ceph osd tier remove foo foo-hot

Read-only pools with lazy consistency::

 ceph osd tier add foo foo-east
 ceph osd tier cache-mode foo-east readonly
 ceph osd tier add foo foo-west
 ceph osd tier cache-mode foo-west readonly


Tiering agent
-------------

The tiering policy is defined as properties on the cache pool itself.

HitSet metadata
~~~~~~~~~~~~~~~

First, the agent requires HitSet information to be tracked on the
cache pool in order to determine which objects in the pool are being
accessed. This is enabled with::

 ceph osd pool set foo-hot hit_set_type bloom
 ceph osd pool set foo-hot hit_set_count 1
 ceph osd pool set foo-hot hit_set_period 3600   # 1 hour

The supported HitSet types include 'bloom' (a bloom filter, the
default), 'explicit_hash', and 'explicit_object'. The latter two
explicitly enumerate accessed objects and are less memory efficient.
They are there primarily for debugging and to demonstrate pluggability
for the infrastructure. For the bloom filter type, you can additionally
define the false positive probability for the bloom filter (default is 0.05)::

 ceph osd pool set foo-hot hit_set_fpp 0.15

The hit_set_count and hit_set_period define how much time each HitSet
should cover, and how many such HitSets to store. Binning accesses
over time allows Ceph to independently determine whether an object was
accessed at least once and whether it was accessed more than once over
some time period ("age" vs "temperature").

The ``min_read_recency_for_promote`` defines how many HitSets to check for the
existence of an object when handling a read operation. The checking result is
used to decide whether to promote the object asynchronously. Its value should be
between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted.
If it's set to 1, the current HitSet is checked. And if this object is in the
current HitSet, it's promoted. Otherwise not. For the other values, the exact
number of archive HitSets are checked. The object is promoted if the object is
found in any of the most recent ``min_read_recency_for_promote`` HitSets.

A similar parameter can be set for the write operation, which is
``min_write_recency_for_promote``. ::

 ceph osd pool set {cachepool} min_read_recency_for_promote 1
 ceph osd pool set {cachepool} min_write_recency_for_promote 1

Note that the longer the ``hit_set_period`` and the higher the
``min_read_recency_for_promote``/``min_write_recency_for_promote`` the more RAM
will be consumed by the ceph-osd process. In particular, when the agent is active
to flush or evict cache objects, all hit_set_count HitSets are loaded into RAM.

Cache mode
~~~~~~~~~~

The most important policy is the cache mode:

 ceph osd pool set foo-hot cache-mode writeback

The supported modes are 'none', 'writeback', 'forward', and
'readonly'. Most installations want 'writeback', which will write
into the cache tier and only later flush updates back to the base
tier. Similarly, any object that is read will be promoted into the
cache tier.

The 'forward' mode is intended for when the cache is being disabled
and needs to be drained. No new objects will be promoted or written
to the cache pool unless they are already present. A background
operation can then do something like::

 rados -p foo-hot cache-try-flush-evict-all
 rados -p foo-hot cache-flush-evict-all

to force all data to be flushed back to the base tier.

The 'readonly' mode is intended for read-only workloads that do not
require consistency to be enforced by the storage system. Writes will
be forwarded to the base tier, but objects that are read will get
promoted to the cache. No attempt is made by Ceph to ensure that the
contents of the cache tier(s) are consistent in the presence of object
updates.

Cache sizing
~~~~~~~~~~~~

The agent performs two basic functions: flushing (writing 'dirty'
cache objects back to the base tier) and evicting (removing cold and
clean objects from the cache).

The thresholds at which Ceph will flush or evict objects is specified
relative to a 'target size' of the pool. For example::

 ceph osd pool set foo-hot cache_target_dirty_ratio .4
 ceph osd pool set foo-hot cache_target_dirty_high_ratio .6
 ceph osd pool set foo-hot cache_target_full_ratio .8

will begin flushing dirty objects when 40% of the pool is dirty and begin
evicting clean objects when we reach 80% of the target size.

The target size can be specified either in terms of objects or bytes::

 ceph osd pool set foo-hot target_max_bytes 1000000000000  # 1 TB
 ceph osd pool set foo-hot target_max_objects 1000000      # 1 million objects

Note that if both limits are specified, Ceph will begin flushing or
evicting when either threshold is triggered.

Other tunables
~~~~~~~~~~~~~~

You can specify a minimum object age before a recently updated object is
flushed to the base tier::

 ceph osd pool set foo-hot cache_min_flush_age 600   # 10 minutes

You can specify the minimum age of an object before it will be evicted from
the cache tier::

 ceph osd pool set foo-hot cache_min_evict_age 1800   # 30 minutes
@ -377,7 +377,7 @@ information. To check which mirror daemon a directory has been mapped to use::

        "state": "mapped"
    }

.. note:: `instance_id` is the RADOS instance-id associated with a mirror daemon.

Other information such as `state` and `last_shuffled` are interesting when running
multiple mirror daemons.
@ -243,7 +243,7 @@ object size in ``POOL`` is zero (evicted) and chunks objects are genereated---th

4. Read/write I/Os

After step 3, the users don't need to consider anything about I/Os. Deduplicated objects are
completely compatible with existing RADOS operations.


5. Run scrub to fix reference count
@ -214,8 +214,8 @@ The build process is based on `Node.js <https://nodejs.org/>`_ and requires the

Prerequisites
~~~~~~~~~~~~~

* Node 18.17.0 or higher
* NPM 9.6.7 or higher

nodeenv:
  During Ceph's build we create a virtualenv with ``node`` and ``npm``
@ -55,7 +55,7 @@ using `vstart_runner.py`_. To do that, you'd need `teuthology`_ installed::

    $ virtualenv --python=python3 venv
    $ source venv/bin/activate
    $ pip install 'setuptools >= 12'
    $ pip install teuthology[test]@git+https://github.com/ceph/teuthology
    $ deactivate

The above steps install teuthology in a virtual environment. Before running
@ -1,3 +1,5 @@

.. _dev_mon_elections:

=================
Monitor Elections
=================
@ -289,40 +289,6 @@ This seems complicated, but it gets us two valuable properties:
|
|||||||
All clone operations will need to consider adjacent ``chunk_maps``
|
All clone operations will need to consider adjacent ``chunk_maps``
|
||||||
when adding or removing references.
|
when adding or removing references.
|
||||||
|
|
||||||
Cache/Tiering
|
|
||||||
-------------
|
|
||||||
|
|
||||||
There already exists a cache/tiering mechanism based on whiteouts.
|
|
||||||
One goal here should ultimately be for this manifest machinery to
|
|
||||||
provide a complete replacement.
|
|
||||||
|
|
||||||
See ``cache-pool.rst``
|
|
||||||
|
|
||||||
The manifest machinery already shares some code paths with the
|
|
||||||
existing cache/tiering code, mainly ``stat_flush``.
|
|
||||||
|
|
||||||
In no particular order, here's in incomplete list of things that need
|
|
||||||
to be wired up to provide feature parity:
|
|
||||||
|
|
||||||
* Online object access information: The osd already has pool configs
|
|
||||||
for maintaining bloom filters which provide estimates of access
|
|
||||||
recency for objects. We probably need to modify this to permit
|
|
||||||
hitset maintenance for a normal pool -- there are already
|
|
||||||
``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
|
|
||||||
* Tiering agent: The osd already has a background tiering agent which
|
|
||||||
would need to be modified to instead flush and evict using
|
|
||||||
manifests.
|
|
||||||
|
|
||||||
* Use existing features regarding the cache flush policy such as
|
|
||||||
hitset, age, ratio.
|
|
||||||
- hitset
|
|
||||||
- age, ratio, bytes
|
|
||||||
|
|
||||||
* Add tiering-mode to ``manifest-tiering``
|
|
||||||
- Writeback
|
|
||||||
- Read-only
|
|
||||||
|
|
||||||
|
|
||||||
Data Structures
|
Data Structures
|
||||||
===============
|
===============
|
||||||
|
|
||||||
|
@ -114,29 +114,6 @@ baseline throughput for each device type was determined:
|
|||||||
256 KiB. For HDDs, it was 40MiB. The above throughput was obtained
|
256 KiB. For HDDs, it was 40MiB. The above throughput was obtained
|
||||||
by running 4 KiB random writes at a queue depth of 64 for 300 secs.
|
by running 4 KiB random writes at a queue depth of 64 for 300 secs.
|
||||||
|
|
||||||
Factoring I/O Cost in mClock
|
|
||||||
============================
|
|
||||||
|
|
||||||
The services using mClock have a cost associated with them. The cost can be
|
|
||||||
different for each service type. The mClock scheduler factors in the cost
|
|
||||||
during calculations for parameters like *reservation*, *weight* and *limit*.
|
|
||||||
The calculations determine when the next op for the service type can be
|
|
||||||
dequeued from the operation queue. In general, the higher the cost, the longer
|
|
||||||
an op remains in the operation queue.
|
|
||||||
|
|
||||||
A cost modeling study was performed to determine the cost per I/O and the cost
|
|
||||||
per byte for SSD and HDD device types. The following cost specific options are
|
|
||||||
used under the hood by mClock,
|
|
||||||
|
|
||||||
- :confval:`osd_mclock_cost_per_io_usec`
|
|
||||||
- :confval:`osd_mclock_cost_per_io_usec_hdd`
|
|
||||||
- :confval:`osd_mclock_cost_per_io_usec_ssd`
|
|
||||||
- :confval:`osd_mclock_cost_per_byte_usec`
|
|
||||||
- :confval:`osd_mclock_cost_per_byte_usec_hdd`
|
|
||||||
- :confval:`osd_mclock_cost_per_byte_usec_ssd`
|
|
||||||
|
|
||||||
See :doc:`/rados/configuration/mclock-config-ref` for more details.
|
|
||||||
|
|
||||||
MClock Profile Allocations
|
MClock Profile Allocations
|
||||||
==========================
|
==========================
|
||||||
|
|
||||||
|
@ -1,53 +0,0 @@
|
|||||||
|
|
||||||
This document describes the requirements and high-level design of the primary
|
|
||||||
balancer for Ceph.
|
|
||||||
|
|
||||||
Introduction
|
|
||||||
============
|
|
||||||
|
|
||||||
In a distributed storage system such as Ceph, there are some requirements to keep the system balanced in order to make it perform well:
|
|
||||||
|
|
||||||
#. Balance the capacity - This is a functional requirement, a system like Ceph is "as full as its fullest device". When one device is full the system can not serve write requests anymore. In order to do this we want to balance the capacity across the devices in a fair way - that each device gets capacity proportionally to its size and therefore all the devices have the same fullness level. This is a functional requirement. From performance perspective, capacity balancing creates fair share workloads on the OSDs for *write* requests.
|
|
||||||
|
|
||||||
#. Balance the workload - This is a performance requirement, we want to make sure that all the devices will receive a workload according to their performance. Assuming all the devices in a pool use the same technology and have the same bandwidth (a strong recommendation for a well configured system), and all devices in a pool have the same capacity, this means that for each pool, each device gets its fair share of primary OSDs so that the *read* requests are distributed evenly across the OSDs in the cluster. Managing workload balancing for devices with different capacities is discussed in the future enhancements section.
|
|
||||||
|
|
||||||
Requirements
|
|
||||||
============
|
|
||||||
|
|
||||||
- For each pool, each OSD should have its fair share of PGs in which it is primary. For replicated pools, this would be the number of PGs mapped to this OSD divided by the replica size.
|
|
||||||
- This may be improved in future releases. (see below)
|
|
||||||
- Improve the existing capacity balancer code to improve its maintainability
|
|
||||||
- Primary balancing is performed without data movement (data is moved only when balancing the capacity)
|
|
||||||
- Fix the global +/-1 balancing issue that happens since the current balancer works on a single pool at a time (this is a stretch goal for the first version)
|
|
||||||
|
|
||||||
- Problem description: In a perfectly balanced system, for each pool, each OSD has a number of PGs that ideally would have mapped to it to create a perfect capacity balancing. This number is usually not an integer, so some OSDs get a bit more PGs mapped and some a bit less. If you have many pools and you balance on a pool-by-pool basis, it is possible that some OSDs always get the "a bit more" side. When this happens, even to a single OSD, the result is non-balanced system where one OSD is more full than the others. This may happen with the current capacity balancer.
|
|
||||||
|
|
||||||
First release (Quincy) assumptions
|
|
||||||
----------------------------------
|
|
||||||
|
|
||||||
- Optional - In the first version the feature will be optional and by default will be disabled
|
|
||||||
- CLI only - In the first version we will probably give access to the primary balancer only by ``osdmaptool`` CLI and will not enable it in the online balancer; this way, the use of the feature is more controlled for early adopters
|
|
||||||
- No data movement
|
|
||||||
|
|
||||||
Future possible enhancements
|
|
||||||
----------------------------
|
|
||||||
|
|
||||||
- Improve the behavior for non identical OSDs in a pool
|
|
||||||
- Improve the capacity balancing behavior in extreme cases
|
|
||||||
- Add workload balancing to the online balancer
|
|
||||||
- A more futuristic feature can be to improve workload balancing based on real load statistics of the OSDs.
|
|
||||||
|
|
||||||
High Level Design
|
|
||||||
=================
|
|
||||||
|
|
||||||
- The capacity balancing code will remain in one function ``OSDMap::calc_pg_upmaps`` (the signature might be changed)
|
|
||||||
- The workload (a.k.a primary) balancer will be implemented in a different function
|
|
||||||
- The workload balancer will do its best based on the current status of the system
|
|
||||||
|
|
||||||
- When called on a balanced system (capacity-wise) with pools with identical devices, it will create a near optimal workload split among the OSDs
|
|
||||||
- Calling the workload balancer on an unbalanced system (capacity-wise) may yield non optimal results, and in some cases may give worse performance than before the call
|
|
||||||
|
|
||||||
Helper functionality
|
|
||||||
--------------------
|
|
||||||
|
|
||||||
- Set a seed for random generation in ``osdmaptool`` (For regression tests)
|
|
@ -131,7 +131,7 @@ First release candidate
|
|||||||
=======================
|
=======================
|
||||||
|
|
||||||
- [x] src/ceph_release: change type to `rc`
|
- [x] src/ceph_release: change type to `rc`
|
||||||
- [ ] opt-in to all telemetry channels, generate telemetry reports, and verify no sensitive details (like pools names) are collected
|
- [x] opt-in to all telemetry channels, generate telemetry reports, and verify no sensitive details (like pools names) are collected
|
||||||
|
|
||||||
|
|
||||||
First stable release
|
First stable release
|
||||||
|
@ -15,10 +15,12 @@
|
|||||||
introduced in the Ceph Kraken release. The Luminous release of
|
introduced in the Ceph Kraken release. The Luminous release of
|
||||||
Ceph promoted BlueStore to the default OSD back end,
|
Ceph promoted BlueStore to the default OSD back end,
|
||||||
supplanting FileStore. As of the Reef release, FileStore is no
|
supplanting FileStore. As of the Reef release, FileStore is no
|
||||||
longer available as a storage backend.
|
longer available as a storage back end.
|
||||||
|
|
||||||
BlueStore stores objects directly on Ceph block devices without
|
BlueStore stores objects directly on raw block devices or
|
||||||
a mounted file system.
|
partitions, and does not interact with mounted file systems.
|
||||||
|
BlueStore uses RocksDB's key/value database to map object names
|
||||||
|
to block locations on disk.
|
||||||
|
|
||||||
Bucket
|
Bucket
|
||||||
In the context of :term:`RGW`, a bucket is a group of objects.
|
In the context of :term:`RGW`, a bucket is a group of objects.
|
||||||
@ -269,7 +271,7 @@
|
|||||||
The Ceph manager software, which collects all the state from
|
The Ceph manager software, which collects all the state from
|
||||||
the whole cluster in one place.
|
the whole cluster in one place.
|
||||||
|
|
||||||
MON
|
:ref:`MON<arch_monitor>`
|
||||||
The Ceph monitor software.
|
The Ceph monitor software.
|
||||||
|
|
||||||
Node
|
Node
|
||||||
@ -328,6 +330,19 @@
|
|||||||
Pools
|
Pools
|
||||||
See :term:`pool`.
|
See :term:`pool`.
|
||||||
|
|
||||||
|
:ref:`Primary Affinity <rados_ops_primary_affinity>`
|
||||||
|
The characteristic of an OSD that governs the likelihood that
|
||||||
|
a given OSD will be selected as the primary OSD (or "lead
|
||||||
|
OSD") in an acting set. Primary affinity was introduced in
|
||||||
|
Firefly (v. 0.80). See :ref:`Primary Affinity
|
||||||
|
<rados_ops_primary_affinity>`.
|
||||||
|
|
||||||
|
Quorum
|
||||||
|
Quorum is the state that exists when a majority of the
|
||||||
|
:ref:`Monitors<arch_monitor>` in the cluster are ``up``. A
|
||||||
|
minimum of three :ref:`Monitors<arch_monitor>` must exist in
|
||||||
|
the cluster in order for Quorum to be possible.
|
||||||
|
|
||||||
RADOS
|
RADOS
|
||||||
**R**\eliable **A**\utonomic **D**\istributed **O**\bject
|
**R**\eliable **A**\utonomic **D**\istributed **O**\bject
|
||||||
**S**\tore. RADOS is the object store that provides a scalable
|
**S**\tore. RADOS is the object store that provides a scalable
|
||||||
|
@ -4,11 +4,11 @@
|
|||||||
|
|
||||||
Ceph delivers **object, block, and file storage in one unified system**.
|
Ceph delivers **object, block, and file storage in one unified system**.
|
||||||
|
|
||||||
.. warning::
|
.. warning::
|
||||||
|
|
||||||
:ref:`If this is your first time using Ceph, read the "Basic Workflow"
|
:ref:`If this is your first time using Ceph, read the "Basic Workflow"
|
||||||
page in the Ceph Developer Guide to learn how to contribute to the
|
page in the Ceph Developer Guide to learn how to contribute to the
|
||||||
Ceph project. (Click anywhere in this paragraph to read the "Basic
|
Ceph project. (Click anywhere in this paragraph to read the "Basic
|
||||||
Workflow" page of the Ceph Developer Guide.) <basic workflow dev guide>`.
|
Workflow" page of the Ceph Developer Guide.) <basic workflow dev guide>`.
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
@ -110,6 +110,7 @@ about Ceph, see our `Architecture`_ section.
|
|||||||
radosgw/index
|
radosgw/index
|
||||||
mgr/index
|
mgr/index
|
||||||
mgr/dashboard
|
mgr/dashboard
|
||||||
|
monitoring/index
|
||||||
api/index
|
api/index
|
||||||
architecture
|
architecture
|
||||||
Developer Guide <dev/developer_guide/index>
|
Developer Guide <dev/developer_guide/index>
|
||||||
|
@ -25,17 +25,17 @@ There are three ways to get packages:
|
|||||||
Install packages with cephadm
|
Install packages with cephadm
|
||||||
=============================
|
=============================
|
||||||
|
|
||||||
#. Download the cephadm script
|
#. Download cephadm
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
:substitutions:
|
:substitutions:
|
||||||
|
|
||||||
curl --silent --remote-name --location https://github.com/ceph/ceph/raw/|stable-release|/src/cephadm/cephadm
|
curl --silent --remote-name --location https://download.ceph.com/rpm-|stable-release|/el9/noarch/cephadm
|
||||||
chmod +x cephadm
|
chmod +x cephadm
|
||||||
|
|
||||||
#. Configure the Ceph repository based on the release name::
|
#. Configure the Ceph repository based on the release name::
|
||||||
|
|
||||||
./cephadm add-repo --release nautilus
|
./cephadm add-repo --release |stable-release|
|
||||||
|
|
||||||
For Octopus (15.2.0) and later releases, you can also specify a specific
|
For Octopus (15.2.0) and later releases, you can also specify a specific
|
||||||
version::
|
version::
|
||||||
@ -47,8 +47,8 @@ Install packages with cephadm
|
|||||||
./cephadm add-repo --dev my-branch
|
./cephadm add-repo --dev my-branch
|
||||||
|
|
||||||
#. Install the appropriate packages. You can install them using your
|
#. Install the appropriate packages. You can install them using your
|
||||||
package management tool (e.g., APT, Yum) directly, or you can also
|
package management tool (e.g., APT, Yum) directly, or you can
|
||||||
use the cephadm wrapper. For example::
|
use the cephadm wrapper command. For example::
|
||||||
|
|
||||||
./cephadm install ceph-common
|
./cephadm install ceph-common
|
||||||
|
|
||||||
|
90
ceph/doc/man/8/ceph-monstore-tool.rst
Normal file
90
ceph/doc/man/8/ceph-monstore-tool.rst
Normal file
@ -0,0 +1,90 @@
|
|||||||
|
:orphan:
|
||||||
|
|
||||||
|
======================================================
|
||||||
|
ceph-monstore-tool -- ceph monstore manipulation tool
|
||||||
|
======================================================
|
||||||
|
|
||||||
|
.. program:: ceph-monstore-tool
|
||||||
|
|
||||||
|
Synopsis
|
||||||
|
========
|
||||||
|
|
||||||
|
| **ceph-monstore-tool** <store path> <cmd> [args|options]
|
||||||
|
|
||||||
|
|
||||||
|
Description
|
||||||
|
===========
|
||||||
|
|
||||||
|
:program:`ceph-monstore-tool` is used to manipulate MonitorDBStore's data
|
||||||
|
(monmap, osdmap, etc.) offline. It is similar to `ceph-kvstore-tool`.
|
||||||
|
|
||||||
|
The default RocksDB debug level is `0`. This can be changed using `--debug`.
|
||||||
|
|
||||||
|
Note:
|
||||||
|
Ceph-specific options take the format `--option-name=VAL`
|
||||||
|
DO NOT FORGET THE EQUALS SIGN. ('=')
|
||||||
|
Command-specific options must be passed after a `--`
|
||||||
|
for example, `get monmap --debug -- --version 10 --out /tmp/foo`
|
||||||
|
|
||||||
|
Commands
|
||||||
|
========
|
||||||
|
|
||||||
|
:program:`ceph-monstore-tool` provides the following commands for debugging purposes:
|
||||||
|
|
||||||
|
:command:`store-copy <path>`
|
||||||
|
Copy the store to PATH.
|
||||||
|
|
||||||
|
:command:`get monmap [-- options]`
|
||||||
|
Get monmap (version VER if specified) (default: last committed).
|
||||||
|
|
||||||
|
:command:`get osdmap [-- options]`
|
||||||
|
Get osdmap (version VER if specified) (default: last committed).
|
||||||
|
|
||||||
|
:command:`get mdsmap [-- options]`
|
||||||
|
Get mdsmap (version VER if specified) (default: last committed).
|
||||||
|
|
||||||
|
:command:`get mgr [-- options]`
|
||||||
|
Get mgrmap (version VER if specified) (default: last committed).
|
||||||
|
|
||||||
|
:command:`get crushmap [-- options]`
|
||||||
|
Get crushmap (version VER if specified) (default: last committed).
|
||||||
|
|
||||||
|
:command:`get osd_snap <key> [-- options]`
|
||||||
|
Get osd_snap key (`purged_snap` or `purged_epoch`).
|
||||||
|
|
||||||
|
:command:`dump-keys`
|
||||||
|
Dump store keys to FILE (default: stdout).
|
||||||
|
|
||||||
|
:command:`dump-paxos [-- options]`
|
||||||
|
Dump Paxos transactions (-- --help for more info).
|
||||||
|
|
||||||
|
:command:`dump-trace FILE [-- options]`
|
||||||
|
Dump contents of trace file FILE (-- --help for more info).
|
||||||
|
|
||||||
|
:command:`replay-trace FILE [-- options]`
|
||||||
|
Replay trace from FILE (-- --help for more info).
|
||||||
|
|
||||||
|
:command:`random-gen [-- options]`
|
||||||
|
Add randomly generated ops to the store (-- --help for more info).
|
||||||
|
|
||||||
|
:command:`rewrite-crush [-- options]`
|
||||||
|
Add a rewrite commit to the store.
|
||||||
|
|
||||||
|
:command:`rebuild`
|
||||||
|
Rebuild store.
|
||||||
|
|
||||||
|
:command:`rm <prefix> <key>`
|
||||||
|
Remove specified key from the store.
|
||||||
|
|
||||||
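As an illustrative sketch, a typical offline inspection session extracts the
latest committed monmap and prints it (the monitor name and store path below
are assumptions; adjust them to your deployment)::

   # stop the monitor first; the store must not be in use
   systemctl stop ceph-mon@mon-a

   # command-specific options go after the '--' separator
   ceph-monstore-tool /var/lib/ceph/mon/ceph-mon-a get monmap -- --out /tmp/monmap

   # inspect the extracted monmap
   monmaptool --print /tmp/monmap

   systemctl start ceph-mon@mon-a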
|
Availability
|
||||||
|
============
|
||||||
|
|
||||||
|
**ceph-monstore-tool** is part of Ceph, a massively scalable, open-source,
|
||||||
|
distributed storage system. See the Ceph documentation at
|
||||||
|
https://docs.ceph.com for more information.
|
||||||
|
|
||||||
|
|
||||||
|
See also
|
||||||
|
========
|
||||||
|
|
||||||
|
:doc:`ceph <ceph>`\(8)
|
@ -183,6 +183,18 @@ Options
|
|||||||
|
|
||||||
write modified osdmap with upmap or crush-adjust changes
|
write modified osdmap with upmap or crush-adjust changes
|
||||||
|
|
||||||
|
.. option:: --read <file>
|
||||||
|
|
||||||
|
calculate pg upmap entries to balance pg primaries
|
||||||
|
|
||||||
|
.. option:: --read-pool <poolname>
|
||||||
|
|
||||||
|
specify which pool the read balancer should adjust
|
||||||
|
|
||||||
|
.. option:: --vstart
|
||||||
|
|
||||||
|
prefix upmap and read output with './bin/'
|
||||||
|
|
||||||
Example
|
Example
|
||||||
=======
|
=======
|
||||||
|
|
||||||
@ -315,6 +327,31 @@ To simulate the active balancer in upmap mode::
|
|||||||
osd.20 pgs 42
|
osd.20 pgs 42
|
||||||
Total time elapsed 0.0167765 secs, 5 rounds
|
Total time elapsed 0.0167765 secs, 5 rounds
|
||||||
|
|
||||||
|
To simulate the active balancer in read mode, first make sure capacity is balanced
|
||||||
|
by running the balancer in upmap mode. Then, balance the reads on a replicated pool with::
|
||||||
|
|
||||||
|
osdmaptool osdmap --read read.out --read-pool <pool name>
|
||||||
|
|
||||||
|
./bin/osdmaptool: osdmap file 'om'
|
||||||
|
writing upmap command output to: read.out
|
||||||
|
|
||||||
|
---------- BEFORE ------------
|
||||||
|
osd.0 | primary affinity: 1 | number of prims: 3
|
||||||
|
osd.1 | primary affinity: 1 | number of prims: 10
|
||||||
|
osd.2 | primary affinity: 1 | number of prims: 3
|
||||||
|
|
||||||
|
read_balance_score of 'cephfs.a.meta': 1.88
|
||||||
|
|
||||||
|
|
||||||
|
---------- AFTER ------------
|
||||||
|
osd.0 | primary affinity: 1 | number of prims: 5
|
||||||
|
osd.1 | primary affinity: 1 | number of prims: 5
|
||||||
|
osd.2 | primary affinity: 1 | number of prims: 6
|
||||||
|
|
||||||
|
read_balance_score of 'cephfs.a.meta': 1.13
|
||||||
|
|
||||||
|
|
||||||
|
num changes: 5
|
||||||
|
|
||||||
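The generated ``read.out`` file contains ``ceph osd pg-upmap-primary``
commands. Assuming the target cluster is reachable and its capacity is already
balanced (as noted above), the commands can be applied by sourcing the file::

   source read.out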
Availability
|
Availability
|
||||||
============
|
============
|
||||||
|
@ -15,15 +15,15 @@ Synopsis
|
|||||||
Description
|
Description
|
||||||
===========
|
===========
|
||||||
|
|
||||||
:program:`radosgw-admin` is a RADOS gateway user administration utility. It
|
:program:`radosgw-admin` is a Ceph Object Gateway user administration utility. It
|
||||||
allows creating and modifying users.
|
is used to create and modify users.
|
||||||
|
|
||||||
|
|
||||||
Commands
|
Commands
|
||||||
========
|
========
|
||||||
|
|
||||||
:program:`radosgw-admin` utility uses many commands for administration purpose
|
The :program:`radosgw-admin` utility provides commands for administration purposes
|
||||||
which are as follows:
|
as follows:
|
||||||
|
|
||||||
:command:`user create`
|
:command:`user create`
|
||||||
Create a new user.
|
Create a new user.
|
||||||
@ -32,8 +32,7 @@ which are as follows:
|
|||||||
Modify a user.
|
Modify a user.
|
||||||
|
|
||||||
:command:`user info`
|
:command:`user info`
|
||||||
Display information of a user, and any potentially available
|
Display information for a user including any subusers and keys.
|
||||||
subusers and keys.
|
|
||||||
|
|
||||||
:command:`user rename`
|
:command:`user rename`
|
||||||
Renames a user.
|
Renames a user.
|
||||||
@ -51,7 +50,7 @@ which are as follows:
|
|||||||
Check user info.
|
Check user info.
|
||||||
|
|
||||||
:command:`user stats`
|
:command:`user stats`
|
||||||
Show user stats as accounted by quota subsystem.
|
Show user stats as accounted by the quota subsystem.
|
||||||
|
|
||||||
:command:`user list`
|
:command:`user list`
|
||||||
List all users.
|
List all users.
|
||||||
@ -78,10 +77,10 @@ which are as follows:
|
|||||||
Remove access key.
|
Remove access key.
|
||||||
|
|
||||||
:command:`bucket list`
|
:command:`bucket list`
|
||||||
List buckets, or, if bucket specified with --bucket=<bucket>,
|
List buckets, or, if a bucket is specified with --bucket=<bucket>,
|
||||||
list its objects. If bucket specified adding --allow-unordered
|
list its objects. Adding --allow-unordered
|
||||||
removes ordering requirement, possibly generating results more
|
removes the ordering requirement, possibly generating results more
|
||||||
quickly in buckets with large number of objects.
|
quickly for buckets with a large number of objects.
|
||||||
|
|
||||||
:command:`bucket limit check`
|
:command:`bucket limit check`
|
||||||
Show bucket sharding stats.
|
Show bucket sharding stats.
|
||||||
@ -93,8 +92,8 @@ which are as follows:
|
|||||||
Unlink bucket from specified user.
|
Unlink bucket from specified user.
|
||||||
|
|
||||||
:command:`bucket chown`
|
:command:`bucket chown`
|
||||||
Link bucket to specified user and update object ACLs.
|
Change bucket ownership to the specified user and update object ACLs.
|
||||||
Use --marker to resume if command gets interrupted.
|
Invoke with --marker to resume if the command is interrupted.
|
||||||
|
|
||||||
:command:`bucket stats`
|
:command:`bucket stats`
|
||||||
Returns bucket statistics.
|
Returns bucket statistics.
|
||||||
@ -109,12 +108,13 @@ which are as follows:
|
|||||||
Rewrite all objects in the specified bucket.
|
Rewrite all objects in the specified bucket.
|
||||||
|
|
||||||
:command:`bucket radoslist`
|
:command:`bucket radoslist`
|
||||||
List the rados objects that contain the data for all objects is
|
List the RADOS objects that contain the data for all objects in
|
||||||
the designated bucket, if --bucket=<bucket> is specified, or
|
the designated bucket, if --bucket=<bucket> is specified.
|
||||||
otherwise all buckets.
|
Otherwise, list the RADOS objects that contain data for all
|
||||||
|
buckets.
|
||||||
|
|
||||||
:command:`bucket reshard`
|
:command:`bucket reshard`
|
||||||
Reshard a bucket.
|
Reshard a bucket's index.
|
||||||
|
|
||||||
:command:`bucket sync disable`
|
:command:`bucket sync disable`
|
||||||
Disable bucket sync.
|
Disable bucket sync.
|
||||||
@ -306,16 +306,16 @@ which are as follows:
|
|||||||
Run data sync for the specified source zone.
|
Run data sync for the specified source zone.
|
||||||
|
|
||||||
:command:`sync error list`
|
:command:`sync error list`
|
||||||
list sync error.
|
List sync errors.
|
||||||
|
|
||||||
:command:`sync error trim`
|
:command:`sync error trim`
|
||||||
trim sync error.
|
Trim sync errors.
|
||||||
|
|
||||||
:command:`zone rename`
|
:command:`zone rename`
|
||||||
Rename a zone.
|
Rename a zone.
|
||||||
|
|
||||||
:command:`zone placement list`
|
:command:`zone placement list`
|
||||||
List zone's placement targets.
|
List a zone's placement targets.
|
||||||
|
|
||||||
:command:`zone placement add`
|
:command:`zone placement add`
|
||||||
Add a zone placement target.
|
Add a zone placement target.
|
||||||
@ -365,7 +365,7 @@ which are as follows:
|
|||||||
List all bucket lifecycle progress.
|
List all bucket lifecycle progress.
|
||||||
|
|
||||||
:command:`lc process`
|
:command:`lc process`
|
||||||
Manually process lifecycle. If a bucket is specified (e.g., via
|
Manually process lifecycle transitions. If a bucket is specified (e.g., via
|
||||||
--bucket_id or via --bucket and optional --tenant), only that bucket
|
--bucket_id or via --bucket and optional --tenant), only that bucket
|
||||||
is processed.
|
is processed.
|
||||||
|
|
||||||
@ -385,7 +385,7 @@ which are as follows:
|
|||||||
List metadata log which is needed for multi-site deployments.
|
List metadata log which is needed for multi-site deployments.
|
||||||
|
|
||||||
:command:`mdlog trim`
|
:command:`mdlog trim`
|
||||||
Trim metadata log manually instead of relying on RGWs integrated log sync.
|
Trim metadata log manually instead of relying on the gateway's integrated log sync.
|
||||||
Before trimming, compare the listings and make sure the last sync was
|
Before trimming, compare the listings and make sure the last sync was
|
||||||
complete, otherwise it can reinitiate a sync.
|
complete, otherwise it can reinitiate a sync.
|
||||||
|
|
||||||
@ -397,7 +397,7 @@ which are as follows:
|
|||||||
|
|
||||||
:command:`bilog trim`
|
:command:`bilog trim`
|
||||||
Trim bucket index log (use start-marker, end-marker) manually instead
|
Trim bucket index log (use start-marker, end-marker) manually instead
|
||||||
of relying on RGWs integrated log sync.
|
of relying on the gateway's integrated log sync.
|
||||||
Before trimming, compare the listings and make sure the last sync was
|
Before trimming, compare the listings and make sure the last sync was
|
||||||
complete, otherwise it can reinitiate a sync.
|
complete, otherwise it can reinitiate a sync.
|
||||||
|
|
||||||
@ -405,7 +405,7 @@ which are as follows:
|
|||||||
List data log which is needed for multi-site deployments.
|
List data log which is needed for multi-site deployments.
|
||||||
|
|
||||||
:command:`datalog trim`
|
:command:`datalog trim`
|
||||||
Trim data log manually instead of relying on RGWs integrated log sync.
|
Trim data log manually instead of relying on the gateway's integrated log sync.
|
||||||
Before trimming, compare the listings and make sure the last sync was
|
Before trimming, compare the listings and make sure the last sync was
|
||||||
complete, otherwise it can reinitiate a sync.
|
complete, otherwise it can reinitiate a sync.
|
||||||
|
|
||||||
@ -413,19 +413,19 @@ which are as follows:
|
|||||||
Read data log status.
|
Read data log status.
|
||||||
|
|
||||||
:command:`orphans find`
|
:command:`orphans find`
|
||||||
Init and run search for leaked rados objects.
|
Init and run search for leaked RADOS objects.
|
||||||
DEPRECATED. See the "rgw-orphan-list" tool.
|
DEPRECATED. See the "rgw-orphan-list" tool.
|
||||||
|
|
||||||
:command:`orphans finish`
|
:command:`orphans finish`
|
||||||
Clean up search for leaked rados objects.
|
Clean up search for leaked RADOS objects.
|
||||||
DEPRECATED. See the "rgw-orphan-list" tool.
|
DEPRECATED. See the "rgw-orphan-list" tool.
|
||||||
|
|
||||||
:command:`orphans list-jobs`
|
:command:`orphans list-jobs`
|
||||||
List the current job-ids for the orphans search.
|
List the current orphans search job IDs.
|
||||||
DEPRECATED. See the "rgw-orphan-list" tool.
|
DEPRECATED. See the "rgw-orphan-list" tool.
|
||||||
|
|
||||||
:command:`role create`
|
:command:`role create`
|
||||||
create a new AWS role for use with STS.
|
Create a new role for use with STS (Security Token Service).
|
||||||
|
|
||||||
:command:`role rm`
|
:command:`role rm`
|
||||||
Remove a role.
|
Remove a role.
|
||||||
@ -485,7 +485,7 @@ which are as follows:
|
|||||||
Show events in a pubsub subscription
|
Show events in a pubsub subscription
|
||||||
|
|
||||||
:command:`subscription ack`
|
:command:`subscription ack`
|
||||||
Ack (remove) an events in a pubsub subscription
|
Acknowledge (remove) events in a pubsub subscription
|
||||||
|
|
||||||
|
|
||||||
Options
|
Options
|
||||||
@ -499,7 +499,8 @@ Options
|
|||||||
|
|
||||||
.. option:: -m monaddress[:port]
|
.. option:: -m monaddress[:port]
|
||||||
|
|
||||||
Connect to specified monitor (instead of looking through ceph.conf).
|
Connect to specified monitor (instead of selecting one
|
||||||
|
from ceph.conf).
|
||||||
|
|
||||||
.. option:: --tenant=<tenant>
|
.. option:: --tenant=<tenant>
|
||||||
|
|
||||||
@ -507,19 +508,19 @@ Options
|
|||||||
|
|
||||||
.. option:: --uid=uid
|
.. option:: --uid=uid
|
||||||
|
|
||||||
The radosgw user ID.
|
The user on which to operate.
|
||||||
|
|
||||||
.. option:: --new-uid=uid
|
.. option:: --new-uid=uid
|
||||||
|
|
||||||
ID of the new user. Used with 'user rename' command.
|
The new ID of the user. Used with 'user rename' command.
|
||||||
|
|
||||||
.. option:: --subuser=<name>
|
.. option:: --subuser=<name>
|
||||||
|
|
||||||
Name of the subuser.
|
Name of the subuser.
|
||||||
|
|
||||||
.. option:: --access-key=<key>
|
.. option:: --access-key=<key>
|
||||||
|
|
||||||
S3 access key.
|
S3 access key.
|
||||||
|
|
||||||
.. option:: --email=email
|
.. option:: --email=email
|
||||||
|
|
||||||
@ -531,28 +532,29 @@ Options
|
|||||||
|
|
||||||
.. option:: --gen-access-key
|
.. option:: --gen-access-key
|
||||||
|
|
||||||
Generate random access key (for S3).
|
Generate random access key (for S3).
|
||||||
|
|
||||||
|
|
||||||
.. option:: --gen-secret
|
.. option:: --gen-secret
|
||||||
|
|
||||||
Generate random secret key.
|
Generate random secret key.
|
||||||
|
|
||||||
.. option:: --key-type=<type>
|
.. option:: --key-type=<type>
|
||||||
|
|
||||||
key type, options are: swift, s3.
|
Key type, options are: swift, s3.
|
||||||
|
|
||||||
.. option:: --temp-url-key[-2]=<key>
|
.. option:: --temp-url-key[-2]=<key>
|
||||||
|
|
||||||
Temporary url key.
|
Temporary URL key.
|
||||||
|
|
||||||
.. option:: --max-buckets
|
.. option:: --max-buckets
|
||||||
|
|
||||||
max number of buckets for a user (0 for no limit, negative value to disable bucket creation).
|
Maximum number of buckets for a user (0 for no limit, negative value to disable bucket creation).
|
||||||
Default is 1000.
|
Default is 1000.
|
||||||
|
|
||||||
.. option:: --access=<access>
|
.. option:: --access=<access>
|
||||||
|
|
||||||
Set the access permissions for the sub-user.
|
Set the access permissions for the subuser.
|
||||||
Available access permissions are read, write, readwrite and full.
|
Available access permissions are read, write, readwrite and full.
|
||||||
|
|
||||||
.. option:: --display-name=<name>
|
.. option:: --display-name=<name>
|
||||||
@ -600,24 +602,24 @@ Options
|
|||||||
.. option:: --bucket-new-name=[tenant-id/]<bucket>
|
.. option:: --bucket-new-name=[tenant-id/]<bucket>
|
||||||
|
|
||||||
Optional for `bucket link`; use to rename a bucket.
|
Optional for `bucket link`; use to rename a bucket.
|
||||||
While tenant-id/ can be specified, this is never
|
While the tenant-id can be specified, this is not
|
||||||
necessary for normal operation.
|
necessary in normal operation.
|
||||||
|
|
||||||
.. option:: --shard-id=<shard-id>
|
.. option:: --shard-id=<shard-id>
|
||||||
|
|
||||||
Optional for mdlog list, bi list, data sync status. Required for ``mdlog trim``.
|
Optional for mdlog list, bi list, data sync status. Required for ``mdlog trim``.
|
||||||
|
|
||||||
.. option:: --max-entries=<entries>
|
.. option:: --max-entries=<entries>
|
||||||
|
|
||||||
Optional for listing operations to specify the max entries.
|
Optional for listing operations to specify the max entries.
|
||||||
|
|
||||||
.. option:: --purge-data
|
.. option:: --purge-data
|
||||||
|
|
||||||
When specified, user removal will also purge all the user data.
|
When specified, user removal will also purge the user's data.
|
||||||
|
|
||||||
.. option:: --purge-keys
|
.. option:: --purge-keys
|
||||||
|
|
||||||
When specified, subuser removal will also purge all the subuser keys.
|
When specified, subuser removal will also purge the subuser's keys.
|
||||||
|
|
||||||
.. option:: --purge-objects
|
.. option:: --purge-objects
|
||||||
|
|
||||||
@ -625,7 +627,7 @@ Options
|
|||||||
|
|
||||||
.. option:: --metadata-key=<key>
|
.. option:: --metadata-key=<key>
|
||||||
|
|
||||||
Key to retrieve metadata from with ``metadata get``.
|
Key from which to retrieve metadata, used with ``metadata get``.
|
||||||
|
|
||||||
.. option:: --remote=<remote>
|
.. option:: --remote=<remote>
|
||||||
|
|
||||||
@ -633,11 +635,11 @@ Options
|
|||||||
|
|
||||||
.. option:: --period=<id>
|
.. option:: --period=<id>
|
||||||
|
|
||||||
Period id.
|
Period ID.
|
||||||
|
|
||||||
.. option:: --url=<url>
|
.. option:: --url=<url>
|
||||||
|
|
||||||
url for pushing/pulling period or realm.
|
URL for pushing/pulling period or realm.
|
||||||
|
|
||||||
.. option:: --epoch=<number>
|
.. option:: --epoch=<number>
|
||||||
|
|
||||||
@ -657,7 +659,7 @@ Options
|
|||||||
|
|
||||||
.. option:: --master-zone=<id>
|
.. option:: --master-zone=<id>
|
||||||
|
|
||||||
Master zone id.
|
Master zone ID.
|
||||||
|
|
||||||
.. option:: --rgw-realm=<name>
|
.. option:: --rgw-realm=<name>
|
||||||
|
|
||||||
@ -665,11 +667,11 @@ Options
|
|||||||
|
|
||||||
.. option:: --realm-id=<id>
|
.. option:: --realm-id=<id>
|
||||||
|
|
||||||
The realm id.
|
The realm ID.
|
||||||
|
|
||||||
.. option:: --realm-new-name=<name>
|
.. option:: --realm-new-name=<name>
|
||||||
|
|
||||||
New name of realm.
|
New name for the realm.
|
||||||
|
|
||||||
.. option:: --rgw-zonegroup=<name>
|
.. option:: --rgw-zonegroup=<name>
|
||||||
|
|
||||||
@ -677,7 +679,7 @@ Options
|
|||||||
|
|
||||||
.. option:: --zonegroup-id=<id>
|
.. option:: --zonegroup-id=<id>
|
||||||
|
|
||||||
The zonegroup id.
|
The zonegroup ID.
|
||||||
|
|
||||||
.. option:: --zonegroup-new-name=<name>
|
.. option:: --zonegroup-new-name=<name>
|
||||||
|
|
||||||
@ -685,11 +687,11 @@ Options
|
|||||||
|
|
||||||
.. option:: --rgw-zone=<zone>
|
.. option:: --rgw-zone=<zone>
|
||||||
|
|
||||||
Zone in which radosgw is running.
|
Zone in which the gateway is running.
|
||||||
|
|
||||||
.. option:: --zone-id=<id>
|
.. option:: --zone-id=<id>
|
||||||
|
|
||||||
The zone id.
|
The zone ID.
|
||||||
|
|
||||||
.. option:: --zone-new-name=<name>
|
.. option:: --zone-new-name=<name>
|
||||||
|
|
||||||
@ -709,7 +711,7 @@ Options
|
|||||||
|
|
||||||
.. option:: --placement-id
|
.. option:: --placement-id
|
||||||
|
|
||||||
Placement id for the zonegroup placement commands.
|
Placement ID for the zonegroup placement commands.
|
||||||
|
|
||||||
.. option:: --tags=<list>
|
.. option:: --tags=<list>
|
||||||
|
|
||||||
@ -737,7 +739,7 @@ Options
|
|||||||
|
|
||||||
.. option:: --data-extra-pool=<pool>
|
.. option:: --data-extra-pool=<pool>
|
||||||
|
|
||||||
The placement target data extra (non-ec) pool.
|
The placement target data extra (non-EC) pool.
|
||||||
|
|
||||||
.. option:: --placement-index-type=<type>
|
.. option:: --placement-index-type=<type>
|
||||||
|
|
||||||
@ -765,11 +767,11 @@ Options
|
|||||||
|
|
||||||
.. option:: --sync-from=[zone-name][,...]
|
.. option:: --sync-from=[zone-name][,...]
|
||||||
|
|
||||||
Set the list of zones to sync from.
|
Set the list of zones from which to sync.
|
||||||
|
|
||||||
.. option:: --sync-from-rm=[zone-name][,...]
|
.. option:: --sync-from-rm=[zone-name][,...]
|
||||||
|
|
||||||
Remove the zones from list of zones to sync from.
|
Remove zone(s) from list of zones from which to sync.
|
||||||
|
|
||||||
.. option:: --bucket-index-max-shards
|
.. option:: --bucket-index-max-shards
|
||||||
|
|
||||||
@ -780,71 +782,71 @@ Options
|
|||||||
|
|
||||||
.. option:: --fix
|
.. option:: --fix
|
||||||
|
|
||||||
Besides checking bucket index, will also fix it.
|
Fix the bucket index in addition to checking it.
|
||||||
|
|
||||||
.. option:: --check-objects
|
.. option:: --check-objects
|
||||||
|
|
||||||
bucket check: Rebuilds bucket index according to actual objects state.
|
Bucket check: Rebuilds the bucket index according to actual object state.
|
||||||
|
|
||||||
.. option:: --format=<format>
|
.. option:: --format=<format>
|
||||||
|
|
||||||
Specify output format for certain operations. Supported formats: xml, json.
|
Specify output format for certain operations. Supported formats: xml, json.
|
||||||
|
|
||||||
.. option:: --sync-stats
|
.. option:: --sync-stats
|
||||||
|
|
||||||
Option for 'user stats' command. When specified, it will update user stats with
|
Option for the 'user stats' command. When specified, it will update user stats with
|
||||||
the current stats reported by user's buckets indexes.
|
the current stats reported by the user's buckets indexes.
|
||||||
|
|
||||||
.. option:: --show-config
|
.. option:: --show-config
|
||||||
|
|
||||||
Show configuration.
|
Show configuration.
|
||||||
|
|
||||||
.. option:: --show-log-entries=<flag>
|
.. option:: --show-log-entries=<flag>
|
||||||
|
|
||||||
Enable/disable dump of log entries on log show.
|
Enable/disable dumping of log entries on log show.
|
||||||
|
|
||||||
.. option:: --show-log-sum=<flag>
|
.. option:: --show-log-sum=<flag>
|
||||||
|
|
||||||
Enable/disable dump of log summation on log show.
|
Enable/disable dump of log summation on log show.
|
||||||
|
|
||||||
.. option:: --skip-zero-entries
|
.. option:: --skip-zero-entries
|
||||||
|
|
||||||
Log show only dumps entries that don't have zero value in one of the numeric
|
Log show only dumps entries that don't have zero value in one of the numeric
|
||||||
fields.
|
fields.
|
||||||
|
|
||||||
.. option:: --infile
|
.. option:: --infile
|
||||||
|
|
||||||
Specify a file to read in when setting data.
|
Specify a file to read when setting data.
|
||||||
|
|
||||||
.. option:: --categories=<list>
|
.. option:: --categories=<list>
|
||||||
|
|
||||||
Comma separated list of categories, used in usage show.
|
Comma separated list of categories, used in usage show.
|
||||||
|
|
||||||
.. option:: --caps=<caps>
|
.. option:: --caps=<caps>
|
||||||
|
|
||||||
List of caps (e.g., "usage=read, write; user=read").
|
List of capabilities (e.g., "usage=read, write; user=read").
|
||||||
|
|
||||||
.. option:: --compression=<compression-algorithm>
|
.. option:: --compression=<compression-algorithm>
|
||||||
|
|
||||||
Placement target compression algorithm (lz4|snappy|zlib|zstd)
|
Placement target compression algorithm (lz4|snappy|zlib|zstd).
|
||||||
|
|
||||||
.. option:: --yes-i-really-mean-it
|
.. option:: --yes-i-really-mean-it
|
||||||
|
|
||||||
Required for certain operations.
|
Required as a guardrail for certain destructive operations.
|
||||||
|
|
||||||
.. option:: --min-rewrite-size
|
.. option:: --min-rewrite-size
|
||||||
|
|
||||||
Specify the min object size for bucket rewrite (default 4M).
|
Specify the minimum object size for bucket rewrite (default 4M).
|
||||||
|
|
||||||
.. option:: --max-rewrite-size
|
.. option:: --max-rewrite-size
|
||||||
|
|
||||||
Specify the max object size for bucket rewrite (default ULLONG_MAX).
|
Specify the maximum object size for bucket rewrite (default ULLONG_MAX).
|
||||||
|
|
||||||
.. option:: --min-rewrite-stripe-size
|
.. option:: --min-rewrite-stripe-size
|
||||||
|
|
||||||
Specify the min stripe size for object rewrite (default 0). If the value
|
Specify the minimum stripe size for object rewrite (default 0). If the value
|
||||||
is set to 0, then the specified object will always be
|
is set to 0, then the specified object will always be
|
||||||
rewritten for restriping.
|
rewritten when restriping.
|
||||||
|
|
||||||
.. option:: --warnings-only
|
.. option:: --warnings-only
|
||||||
|
|
||||||
@ -854,7 +856,7 @@ Options
|
|||||||
.. option:: --bypass-gc
|
.. option:: --bypass-gc
|
||||||
|
|
||||||
When specified with bucket deletion,
|
When specified with bucket deletion,
|
||||||
triggers object deletions by not involving GC.
|
triggers object deletion without involving GC.
|
||||||
|
|
||||||
.. option:: --inconsistent-index
|
.. option:: --inconsistent-index
|
||||||
|
|
||||||
@ -863,25 +865,25 @@ Options
|
|||||||
|
|
||||||
.. option:: --max-concurrent-ios
|
.. option:: --max-concurrent-ios
|
||||||
|
|
||||||
Maximum concurrent ios for bucket operations. Affects operations that
|
Maximum concurrent bucket operations. Affects operations that
|
||||||
scan the bucket index, e.g., listing, deletion, and all scan/search
|
scan the bucket index, e.g., listing, deletion, and all scan/search
|
||||||
operations such as finding orphans or checking the bucket index.
|
operations such as finding orphans or checking the bucket index.
|
||||||
Default is 32.
|
The default is 32.
|
||||||
|
|
||||||
Quota Options
|
Quota Options
|
||||||
=============
|
=============
|
||||||
|
|
||||||
.. option:: --max-objects
|
.. option:: --max-objects
|
||||||
|
|
||||||
Specify max objects (negative value to disable).
|
Specify the maximum number of objects (negative value to disable).
|
||||||
|
|
||||||
.. option:: --max-size
|
.. option:: --max-size
|
||||||
|
|
||||||
Specify max size (in B/K/M/G/T, negative value to disable).
|
Specify the maximum total size (in B/K/M/G/T, negative value to disable).
|
||||||
|
|
||||||
.. option:: --quota-scope
|
.. option:: --quota-scope
|
||||||
|
|
||||||
The scope of quota (bucket, user).
|
The scope of quota (bucket, user).
|
||||||
|
|
||||||
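As a sketch of how the quota options above are typically combined (the user ID
and limits here are hypothetical), a user quota can be configured and enabled
with::

   radosgw-admin quota set --quota-scope=user --uid=johndoe --max-objects=10000 --max-size=1G
   radosgw-admin quota enable --quota-scope=user --uid=johndoe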
|
|
||||||
Orphans Search Options
|
Orphans Search Options
|
||||||
@ -889,16 +891,16 @@ Orphans Search Options
|
|||||||
|
|
||||||
.. option:: --num-shards
|
.. option:: --num-shards
|
||||||
|
|
||||||
Number of shards to use for keeping the temporary scan info
|
Number of shards to use for temporary scan info.
|
||||||
|
|
||||||
.. option:: --orphan-stale-secs
|
.. option:: --orphan-stale-secs
|
||||||
|
|
||||||
Number of seconds to wait before declaring an object to be an orphan.
|
Number of seconds to wait before declaring an object to be an orphan.
|
||||||
Default is 86400 (24 hours).
|
The default is 86400 (24 hours).
|
||||||
|
|
||||||
.. option:: --job-id
|
.. option:: --job-id
|
||||||
|
|
||||||
Set the job ID (for orphans find).
|
Set the job ID (for orphans find).
|
||||||
|
|
||||||
|
|
||||||
Orphans list-jobs options
|
Orphans list-jobs options
|
||||||
|
@ -53,10 +53,6 @@ Options
|
|||||||
|
|
||||||
Run in foreground, log to usual location
|
Run in foreground, log to usual location
|
||||||
|
|
||||||
.. option:: --rgw-socket-path=path
|
|
||||||
|
|
||||||
Specify a unix domain socket path.
|
|
||||||
|
|
||||||
.. option:: --rgw-region=region
|
.. option:: --rgw-region=region
|
||||||
|
|
||||||
The region where radosgw runs
|
The region where radosgw runs
|
||||||
@ -80,30 +76,24 @@ and ``mod_proxy_fcgi`` have to be present in the server. Unlike ``mod_fastcgi``,
|
|||||||
or process management may be available in the FastCGI application framework
|
or process management may be available in the FastCGI application framework
|
||||||
in use.
|
in use.
|
||||||
|
|
||||||
``Apache`` can be configured in a way that enables ``mod_proxy_fcgi`` to be used
|
``Apache`` must be configured in a way that enables ``mod_proxy_fcgi`` to be
|
||||||
with localhost tcp or through unix domain socket. ``mod_proxy_fcgi`` that doesn't
|
used with localhost tcp.
|
||||||
support unix domain socket such as the ones in Apache 2.2 and earlier versions of
|
|
||||||
Apache 2.4, needs to be configured for use with localhost tcp. Later versions of
|
|
||||||
Apache like Apache 2.4.9 or later support unix domain socket and as such they
|
|
||||||
allow for the configuration with unix domain socket instead of localhost tcp.
|
|
||||||
|
|
||||||
The following steps show the configuration in Ceph's configuration file i.e,
|
The following steps show the configuration in Ceph's configuration file i.e,
|
||||||
``/etc/ceph/ceph.conf`` and the gateway configuration file i.e,
|
``/etc/ceph/ceph.conf`` and the gateway configuration file i.e,
|
||||||
``/etc/httpd/conf.d/rgw.conf`` (RPM-based distros) or
|
``/etc/httpd/conf.d/rgw.conf`` (RPM-based distros) or
|
||||||
``/etc/apache2/conf-available/rgw.conf`` (Debian-based distros) with localhost
|
``/etc/apache2/conf-available/rgw.conf`` (Debian-based distros) with localhost
|
||||||
tcp and through unix domain socket:
|
tcp:
|
||||||
|
|
||||||
#. For distros with Apache 2.2 and early versions of Apache 2.4 that use
|
#. For distros with Apache 2.2 and early versions of Apache 2.4 that use
|
||||||
localhost TCP and do not support Unix Domain Socket, append the following
|
localhost TCP, append the following contents to ``/etc/ceph/ceph.conf``::
|
||||||
contents to ``/etc/ceph/ceph.conf``::
|
|
||||||
|
|
||||||
[client.radosgw.gateway]
|
[client.radosgw.gateway]
|
||||||
host = {hostname}
|
host = {hostname}
|
||||||
keyring = /etc/ceph/ceph.client.radosgw.keyring
|
keyring = /etc/ceph/ceph.client.radosgw.keyring
|
||||||
rgw socket path = ""
|
log_file = /var/log/ceph/client.radosgw.gateway.log
|
||||||
log file = /var/log/ceph/client.radosgw.gateway.log
|
rgw_frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
|
||||||
rgw frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
|
rgw_print_continue = false
|
||||||
rgw print continue = false
|
|
||||||
|
|
||||||
#. Add the following content in the gateway configuration file:
|
#. Add the following content in the gateway configuration file:
|
||||||
|
|
||||||
@ -149,16 +139,6 @@ tcp and through unix domain socket:
|
|||||||
|
|
||||||
</VirtualHost>
|
</VirtualHost>
|
||||||
|
|
||||||
#. For distros with Apache 2.4.9 or later that support Unix Domain Socket,
|
|
||||||
append the following configuration to ``/etc/ceph/ceph.conf``::
|
|
||||||
|
|
||||||
[client.radosgw.gateway]
|
|
||||||
host = {hostname}
|
|
||||||
keyring = /etc/ceph/ceph.client.radosgw.keyring
|
|
||||||
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
|
|
||||||
log file = /var/log/ceph/client.radosgw.gateway.log
|
|
||||||
rgw print continue = false
|
|
||||||
|
|
||||||
#. Add the following content in the gateway configuration file:
|
#. Add the following content in the gateway configuration file:
|
||||||
|
|
||||||
For CentOS/RHEL add in ``/etc/httpd/conf.d/rgw.conf``::
|
For CentOS/RHEL add in ``/etc/httpd/conf.d/rgw.conf``::
|
||||||
@ -182,10 +162,6 @@ tcp and through unix domain socket:
|
|||||||
|
|
||||||
</VirtualHost>
|
</VirtualHost>
|
||||||
|
|
||||||
Please note, ``Apache 2.4.7`` does not have Unix Domain Socket support in
|
|
||||||
it and as such it has to be configured with localhost tcp. The Unix Domain
|
|
||||||
Socket support is available in ``Apache 2.4.9`` and later versions.
|
|
||||||
|
|
||||||
#. Generate a key for radosgw to use for authentication with the cluster. ::
|
#. Generate a key for radosgw to use for authentication with the cluster. ::
|
||||||
|
|
||||||
ceph-authtool -C -n client.radosgw.gateway --gen-key /etc/ceph/keyring.radosgw.gateway
|
ceph-authtool -C -n client.radosgw.gateway --gen-key /etc/ceph/keyring.radosgw.gateway
|
||||||
|
@ -107,22 +107,29 @@ of the details of NFS redirecting traffic on the virtual IP to the
|
|||||||
appropriate backend NFS servers, and redeploying NFS servers when they
|
appropriate backend NFS servers, and redeploying NFS servers when they
|
||||||
fail.
|
fail.
|
||||||
|
|
||||||
If a user additionally supplies ``--ingress-mode keepalive-only`` a
|
An optional ``--ingress-mode`` parameter can be provided to choose
|
||||||
partial *ingress* service will be deployed that still provides a virtual
|
how the *ingress* service is configured (see the example after this list):
|
||||||
IP, but has nfs directly binding to that virtual IP and leaves out any
|
|
||||||
sort of load balancing or traffic redirection. This setup will restrict
|
|
||||||
users to deploying only 1 nfs daemon as multiple cannot bind to the same
|
|
||||||
port on the virtual IP.
|
|
||||||
|
|
||||||
Instead providing ``--ingress-mode default`` will result in the same setup
|
- Setting ``--ingress-mode keepalive-only`` deploys a simplified *ingress*
|
||||||
as not providing the ``--ingress-mode`` flag. In this setup keepalived will be
|
service that provides a virtual IP with the nfs server directly binding to
|
||||||
deployed to handle forming the virtual IP and haproxy will be deployed
|
that virtual IP and leaves out any sort of load balancing or traffic
|
||||||
to handle load balancing and traffic redirection.
|
redirection. This setup will restrict users to deploying only 1 nfs daemon
|
||||||
|
as multiple cannot bind to the same port on the virtual IP.
|
||||||
|
- Setting ``--ingress-mode haproxy-standard`` deploys a full *ingress* service
|
||||||
|
to provide load balancing and high-availability using HAProxy and keepalived.
|
||||||
|
Client IP addresses are not visible to the back-end NFS server and IP level
|
||||||
|
restrictions on NFS exports will not function.
|
||||||
|
- Setting ``--ingress-mode haproxy-protocol`` deploys a full *ingress* service
|
||||||
|
to provide load balancing and high-availability using HAProxy and keepalived.
|
||||||
|
Client IP addresses are visible to the back-end NFS server and IP level
|
||||||
|
restrictions on NFS exports are usable. This mode requires NFS Ganesha version
|
||||||
|
5.0 or later.
|
||||||
|
- Setting ``--ingress-mode default`` is equivalent to not providing any other
|
||||||
|
ingress mode by name. When no other ingress mode is specified by name
|
||||||
|
the default ingress mode used is ``haproxy-standard``.
|
||||||
|
|
||||||
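For illustration, a highly available NFS cluster that keeps client IP
addresses visible to the NFS server could be created as follows (the cluster
name, placement hosts, and virtual IP are assumptions for the example)::

   ceph nfs cluster create mynfs "2 host1,host2" --ingress --virtual-ip 10.0.0.123/24 --ingress-mode haproxy-protocol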
Enabling ingress via the ``ceph nfs cluster create`` command deploys a
|
Ingress can be added to an existing NFS service (e.g., one initially created
|
||||||
simple ingress configuration with the most common configuration
|
without the ``--ingress`` flag), and the basic NFS service can
|
||||||
options. Ingress can also be added to an existing NFS service (e.g.,
|
|
||||||
one created without the ``--ingress`` flag), and the basic NFS service can
|
|
||||||
also be modified after the fact to include non-default options, by modifying
|
also be modified after the fact to include non-default options, by modifying
|
||||||
the services directly. For more information, see :ref:`cephadm-ha-nfs`.
|
the services directly. For more information, see :ref:`cephadm-ha-nfs`.
|
||||||
|
|
||||||
|
@ -41,6 +41,7 @@ Configuration
|
|||||||
.. confval:: rbd_stats_pools_refresh_interval
|
.. confval:: rbd_stats_pools_refresh_interval
|
||||||
.. confval:: standby_behaviour
|
.. confval:: standby_behaviour
|
||||||
.. confval:: standby_error_status_code
|
.. confval:: standby_error_status_code
|
||||||
|
.. confval:: exclude_perf_counters
|
||||||
|
|
||||||
By default the module will accept HTTP requests on port ``9283`` on all IPv4
|
By default the module will accept HTTP requests on port ``9283`` on all IPv4
|
||||||
and IPv6 addresses on the host. The port and listen address are both
|
and IPv6 addresses on the host. The port and listen address are both
|
||||||
@ -217,6 +218,15 @@ the module option ``exclude_perf_counters`` to ``false``:
|
|||||||
|
|
||||||
ceph config set mgr mgr/prometheus/exclude_perf_counters false
|
ceph config set mgr mgr/prometheus/exclude_perf_counters false
|
||||||
|
|
||||||
|
Ceph daemon performance counters metrics
|
||||||
|
-----------------------------------------
|
||||||
|
|
||||||
|
With the introduction of ``ceph-exporter`` daemon, the prometheus module will no longer export Ceph daemon
|
||||||
|
perf counters as prometheus metrics by default. However, one may re-enable exporting these metrics by setting
|
||||||
|
the module option ``exclude_perf_counters`` to ``false``::
|
||||||
|
|
||||||
|
ceph config set mgr mgr/prometheus/exclude_perf_counters false
|
||||||
|
|
||||||
Statistic names and labels
|
Statistic names and labels
|
||||||
==========================
|
==========================
|
||||||
|
|
||||||
|
474
ceph/doc/monitoring/index.rst
Normal file
474
ceph/doc/monitoring/index.rst
Normal file
@ -0,0 +1,474 @@
|
|||||||
|
.. _monitoring:
|
||||||
|
|
||||||
|
===================
|
||||||
|
Monitoring overview
|
||||||
|
===================
|
||||||
|
|
||||||
|
The aim of this part of the documentation is to explain the Ceph monitoring
|
||||||
|
stack and the meaning of the main Ceph metrics.
|
||||||
|
|
||||||
|
With a good understanding of the Ceph monitoring stack and metrics, users can
|
||||||
|
create customized monitoring tools, like Prometheus queries, Grafana
|
||||||
|
dashboards, or scripts.
|
||||||
|
|
||||||
|
|
||||||
|
Ceph Monitoring stack
|
||||||
|
=====================
|
||||||
|
|
||||||
|
Ceph provides a default monitoring stack which is installed by cephadm and
|
||||||
|
explained in the :ref:`Monitoring Services <mgr-cephadm-monitoring>` section of
|
||||||
|
the cephadm documentation.
|
||||||
|
|
||||||
|
|
||||||
|
Ceph metrics
|
||||||
|
============
|
||||||
|
|
||||||
|
The main source of Ceph metrics is the set of performance counters exposed by each
|
||||||
|
Ceph daemon. The :doc:`../dev/perf_counters` are the native Ceph monitoring data.
|
||||||
|
|
||||||
|
Performance counters are transformed into standard Prometheus metrics by the
|
||||||
|
Ceph exporter daemon. This daemon runs on every Ceph cluster host and exposes a
|
||||||
|
metrics endpoint where all the performance counters exposed by all the Ceph
|
||||||
|
daemons running in the host are published in the form of Prometheus metrics.
|
||||||
|
|
||||||
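For example, the metrics endpoint of a single host can be queried directly; a
minimal sketch, assuming a hypothetical host name and the usual ceph-exporter
port deployed by cephadm (typically 9926; adjust if your deployment differs):

.. code-block:: bash

   # list a few of the daemon performance counter metrics exposed on this host
   curl -s http://cephtest-node-00.cephlab.com:9926/metrics | head -n 20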
|
In addition to the Ceph exporter, there is another agent to expose Ceph
|
||||||
|
metrics: the Prometheus manager module, which exposes metrics related to
|
||||||
|
the whole cluster, basically metrics that are not produced by individual Ceph
|
||||||
|
daemons.
|
||||||
|
|
||||||
|
The main source for obtaining Ceph metrics is the metrics endpoint exposed by
|
||||||
|
the cluster's Prometheus server. Ceph can provide you with the Prometheus
|
||||||
|
endpoint where you can obtain the complete list of metrics (coming from Ceph
|
||||||
|
exporter daemons and the Prometheus manager module) and execute queries.
|
||||||
|
|
||||||
|
Use the following command to obtain the Prometheus server endpoint in your
|
||||||
|
cluster:
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
# ceph orch ps --service_name prometheus
|
||||||
|
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
|
||||||
|
prometheus.cephtest-node-00 cephtest-node-00.cephlab.com *:9095 running (103m) 50s ago 5w 142M - 2.33.4 514e6a882f6e efe3cbc2e521
|
||||||
|
|
||||||
|
With this information you can connect to
|
||||||
|
``http://cephtest-node-00.cephlab.com:9095`` to access the Prometheus server
|
||||||
|
interface.
|
||||||
|
|
||||||
|
The complete list of metrics (with help text) for your cluster is available
|
||||||
|
at:
|
||||||
|
|
||||||
|
``http://cephtest-node-00.cephlab.com:9095/api/v1/targets/metadata``
|
||||||
|
|
||||||
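Individual queries can also be run against the Prometheus HTTP API. A minimal
sketch, reusing the hypothetical host name from the example above:

.. code-block:: bash

   # instant query: current cluster-wide client read throughput (B/s)
   curl -s 'http://cephtest-node-00.cephlab.com:9095/api/v1/query' \
        --data-urlencode 'query=sum(irate(ceph_osd_op_r_out_bytes[1m]))'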
|
|
||||||
|
Note that the main tool allowing users to observe and monitor a Ceph cluster is the **Ceph dashboard**, which provides graphs of the most important cluster and service metrics. Most of the examples in this document are extracted from the dashboard graphs or extrapolated from the metrics exposed by the Ceph dashboard.
|
||||||
|
|
||||||
|
|
||||||
|
Performance metrics
|
||||||
|
===================
|
||||||
|
|
||||||
|
The main metrics used to measure Ceph cluster performance:
|
||||||
|
|
||||||
|
All metrics have the following labels:
|
||||||
|
``ceph_daemon``: identifier of the OSD daemon generating the metric
|
||||||
|
``instance``: the IP address of the ceph exporter instance exposing the metric.
|
||||||
|
``job``: prometheus scrape job
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
ceph_osd_op_r{ceph_daemon="osd.0", instance="192.168.122.7:9283", job="ceph"} = 73981
|
||||||
|
|
||||||
|
*Cluster I/O (throughput):*
|
||||||
|
Use ``ceph_osd_op_r_out_bytes`` and ``ceph_osd_op_w_in_bytes`` to obtain the cluster throughput generated by clients
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
Writes (B/s):
|
||||||
|
sum(irate(ceph_osd_op_w_in_bytes[1m]))
|
||||||
|
|
||||||
|
Reads (B/s):
|
||||||
|
sum(irate(ceph_osd_op_r_out_bytes[1m]))
|
||||||
|
|
||||||
|
|
||||||
|
*Cluster I/O (operations):*
|
||||||
|
Use ``ceph_osd_op_r``, ``ceph_osd_op_w`` to obtain the number of operations generated by clients
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
Writes (ops/s):
|
||||||
|
sum(irate(ceph_osd_op_w[1m]))
|
||||||
|
|
||||||
|
Reads (ops/s):
|
||||||
|
sum(irate(ceph_osd_op_r[1m]))
|
||||||
|
|
||||||
|
*Latency:*
|
||||||
|
Use ``ceph_osd_op_latency_sum``, which represents the delay before an OSD data transfer begins following a client request
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
sum(irate(ceph_osd_op_latency_sum[1m]))
|
||||||
|
|
||||||
|
|
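Because ``ceph_osd_op_latency_sum`` only accumulates time, dividing it by the
matching operation counter yields an average latency per operation. A short
sketch, assuming the companion ``ceph_osd_op_latency_count`` counter is present
on your cluster:

.. code-block:: bash

   Average operation latency (seconds) across the cluster:
   sum(irate(ceph_osd_op_latency_sum[1m])) / sum(irate(ceph_osd_op_latency_count[1m]))
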
OSD performance
===============

The cluster performance metrics explained above are built from OSD metrics. By
selecting the right label, we can obtain the same performance information for a
single OSD that was shown above for the cluster:

Example:

.. code-block:: bash

   OSD 0 read latency
   irate(ceph_osd_op_r_latency_sum{ceph_daemon=~"osd.0"}[1m]) / on (ceph_daemon) irate(ceph_osd_op_r_latency_count[1m])

   OSD 0 write IOPS
   irate(ceph_osd_op_w{ceph_daemon=~"osd.0"}[1m])

   OSD 0 write throughput (bytes)
   irate(ceph_osd_op_w_in_bytes{ceph_daemon=~"osd.0"}[1m])

   OSD.0 total raw capacity available
   ceph_osd_stat_bytes{ceph_daemon="osd.0", instance="cephtest-node-00.cephlab.com:9283", job="ceph"} = 536451481

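To spot outliers instead of inspecting OSDs one by one, the same per-OSD
expressions can be combined with standard PromQL aggregation operators. A short
sketch, assuming the write-latency counterparts
(``ceph_osd_op_w_latency_sum`` / ``ceph_osd_op_w_latency_count``) of the
read-latency metrics shown above are available; the limit of 10 is an arbitrary
example:

.. code-block:: bash

   The 10 OSDs with the highest average write latency:
   topk(10, irate(ceph_osd_op_w_latency_sum[1m]) / on (ceph_daemon) irate(ceph_osd_op_w_latency_count[1m]))
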
Physical disk performance
=========================

By combining Prometheus ``node_exporter`` metrics with Ceph metrics, we can
obtain information about the performance provided by the physical disks used by
the OSDs.

Example:

.. code-block:: bash

   Read latency of device used by OSD 0:
   label_replace(irate(node_disk_read_time_seconds_total[1m]) / irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

   Write latency of device used by OSD 0:
   label_replace(irate(node_disk_write_time_seconds_total[1m]) / irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

   IOPS (device used by OSD.0)
   reads:
   label_replace(irate(node_disk_reads_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

   writes:
   label_replace(irate(node_disk_writes_completed_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

   Throughput (device used by OSD.0)
   reads:
   label_replace(irate(node_disk_read_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

   writes:
   label_replace(irate(node_disk_written_bytes_total[1m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

   Physical device utilization (%) for OSD.0 over the last 5 minutes:
   label_replace(irate(node_disk_io_time_seconds_total[5m]), "instance", "$1", "instance", "([^:.]*).*") and on (instance, device) label_replace(label_replace(ceph_disk_occupation_human{ceph_daemon=~"osd.0"}, "device", "$1", "device", "/dev/(.*)"), "instance", "$1", "instance", "([^:.]*).*")

Pool metrics
============

These metrics have the following labels:

``instance``: the IP address of the Ceph exporter daemon producing the metric

``pool_id``: identifier of the pool

``job``: the Prometheus scrape job

- ``ceph_pool_metadata``: Information about the pool. It can be used together
  with other metrics to provide more contextual information in queries and
  graphs. Apart from the three common labels, this metric provides the
  following extra labels:

  - ``compression_mode``: compression used in the pool (lz4, snappy, zlib,
    zstd, none). Example: compression_mode="none"

  - ``description``: brief description of the pool type (replica:number of
    replicas or Erasure code: ec profile). Example: description="replica:3"

  - ``name``: name of the pool. Example: name=".mgr"

  - ``type``: type of pool (replicated/erasure code). Example: type="replicated"

- ``ceph_pool_bytes_used``: Total raw capacity consumed by user data and
  associated overheads per pool (metadata + redundancy)

- ``ceph_pool_stored``: Total of CLIENT data stored in the pool

- ``ceph_pool_compress_under_bytes``: Data eligible to be compressed in the pool

- ``ceph_pool_compress_bytes_used``: Data compressed in the pool

- ``ceph_pool_rd``: CLIENT read operations per pool (reads per second)

- ``ceph_pool_rd_bytes``: CLIENT read operations in bytes per pool

- ``ceph_pool_wr``: CLIENT write operations per pool (writes per second)

- ``ceph_pool_wr_bytes``: CLIENT write operations in bytes per pool

**Useful queries**:

.. code-block:: bash

   Total raw capacity available in the cluster:
   sum(ceph_osd_stat_bytes)

   Total raw capacity consumed in the cluster (including metadata + redundancy):
   sum(ceph_pool_bytes_used)

   Total of CLIENT data stored in the cluster:
   sum(ceph_pool_stored)

   Compression savings:
   sum(ceph_pool_compress_under_bytes - ceph_pool_compress_bytes_used)

   CLIENT IOPS for a pool (testrbdpool)
   reads: irate(ceph_pool_rd[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
   writes: irate(ceph_pool_wr[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}

   CLIENT throughput for a pool
   reads: irate(ceph_pool_rd_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}
   writes: irate(ceph_pool_wr_bytes[1m]) * on(pool_id) group_left(instance,name) ceph_pool_metadata{name=~"testrbdpool"}

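Because most pool metrics carry only ``pool_id``, joining them with
``ceph_pool_metadata`` (as in the IOPS queries above) is the usual way to get
human-readable pool names into a graph. A small sketch that lists the CLIENT
data stored per pool, labelled by pool name:

.. code-block:: bash

   CLIENT data stored, per pool name:
   ceph_pool_stored * on (pool_id) group_left(name) ceph_pool_metadata
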
Object metrics
==============

These metrics have the following labels:

``instance``: the IP address of the Ceph exporter daemon providing the metric

``instance_id``: identifier of the RGW daemon

``job``: the Prometheus scrape job

Example:

.. code-block:: bash

   ceph_rgw_req{instance="192.168.122.7:9283", instance_id="154247", job="ceph"} = 12345

Generic metrics
---------------

- ``ceph_rgw_metadata``: Provides generic information about the RGW daemon. It
  can be used together with other metrics to provide more contextual
  information in queries and graphs. Apart from the three common labels, this
  metric provides the following extra labels:

  - ``ceph_daemon``: Name of the Ceph daemon. Example:
    ceph_daemon="rgw.rgwtest.cephtest-node-00.sxizyq"
  - ``ceph_version``: Version of the Ceph daemon. Example: ceph_version="ceph
    version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)"
  - ``hostname``: Name of the host where the daemon runs. Example:
    hostname="cephtest-node-00.cephlab.com"

- ``ceph_rgw_req``: Total number of requests for the daemon (GET + PUT + DELETE).
  Useful to detect bottlenecks and optimize load distribution.

- ``ceph_rgw_qlen``: RGW operations queue length for the daemon.
  Useful to detect bottlenecks and optimize load distribution.

- ``ceph_rgw_failed_req``: Aborted requests.
  Useful to detect daemon errors.

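The same join pattern used in the "Useful queries" subsection below applies to
any of these counters. For example, here is a short sketch that attaches the
daemon name and host name to the queue-length metric so that a busy gateway can
be identified at a glance:

.. code-block:: bash

   RGW operations queue length, labelled by daemon and host:
   ceph_rgw_qlen * on (instance_id) group_left (ceph_daemon, hostname) ceph_rgw_metadata
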
GET operations: related metrics
-------------------------------

- ``ceph_rgw_get_initial_lat_count``: Number of GET operations

- ``ceph_rgw_get_initial_lat_sum``: Total latency time for the GET operations

- ``ceph_rgw_get``: Total number of GET requests

- ``ceph_rgw_get_b``: Total bytes transferred in GET operations


PUT operations: related metrics
-------------------------------

- ``ceph_rgw_put_initial_lat_count``: Number of PUT operations

- ``ceph_rgw_put_initial_lat_sum``: Total latency time for the PUT operations

- ``ceph_rgw_put``: Total number of PUT operations

- ``ceph_rgw_put_b``: Total bytes transferred in PUT operations


Useful queries
--------------

.. code-block:: bash

   The average of GET latencies:
   rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

   The average of PUT latencies:
   rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

   Total requests per second:
   rate(ceph_rgw_req[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

   Total number of "other" operations (LIST, DELETE):
   rate(ceph_rgw_req[30s]) - (rate(ceph_rgw_get[30s]) + rate(ceph_rgw_put[30s]))

   GET latencies:
   rate(ceph_rgw_get_initial_lat_sum[30s]) / rate(ceph_rgw_get_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

   PUT latencies:
   rate(ceph_rgw_put_initial_lat_sum[30s]) / rate(ceph_rgw_put_initial_lat_count[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

   Bandwidth consumed by GET operations:
   sum(rate(ceph_rgw_get_b[30s]))

   Bandwidth consumed by PUT operations:
   sum(rate(ceph_rgw_put_b[30s]))

   Bandwidth consumed by an RGW instance (PUTs + GETs):
   sum by (instance_id) (rate(ceph_rgw_get_b[30s]) + rate(ceph_rgw_put_b[30s])) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata

   HTTP errors:
   rate(ceph_rgw_failed_req[30s])

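Absolute error counts are hard to interpret on a busy gateway, so it is often
more useful to graph failures as a fraction of all requests. A minimal sketch
built only from the metrics listed above:

.. code-block:: bash

   Fraction of failed requests per RGW daemon:
   rate(ceph_rgw_failed_req[30s]) / rate(ceph_rgw_req[30s]) * on (instance_id) group_left (ceph_daemon) ceph_rgw_metadata
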
Filesystem metrics
==================

These metrics have the following labels:

``ceph_daemon``: the name of the MDS daemon

``instance``: the IP address (and port) of the Ceph exporter daemon exposing the metric

``job``: the Prometheus scrape job

Example:

.. code-block:: bash

   ceph_mds_request{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", instance="192.168.122.7:9283", job="ceph"} = 1452


Main metrics
------------

- ``ceph_mds_metadata``: Provides general information about the MDS daemon. It
  can be used together with other metrics to provide more contextual
  information in queries and graphs. It provides the following extra labels:

  - ``ceph_version``: MDS daemon Ceph version
  - ``fs_id``: filesystem cluster id
  - ``hostname``: name of the host where the MDS daemon runs
  - ``public_addr``: public address where the MDS daemon runs
  - ``rank``: rank of the MDS daemon

Example:

.. code-block:: bash

   ceph_mds_metadata{ceph_daemon="mds.test.cephtest-node-00.hmhsoh", ceph_version="ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)", fs_id="-1", hostname="cephtest-node-00.cephlab.com", instance="cephtest-node-00.cephlab.com:9283", job="ceph", public_addr="192.168.122.145:6801/118896446", rank="-1"}


- ``ceph_mds_request``: Total number of requests for the MDS daemon

- ``ceph_mds_reply_latency_sum``: Reply latency total

- ``ceph_mds_reply_latency_count``: Reply latency count

- ``ceph_mds_server_handle_client_request``: Number of client requests

- ``ceph_mds_sessions_session_count``: Session count

- ``ceph_mds_sessions_total_load``: Total load

- ``ceph_mds_sessions_sessions_open``: Sessions currently open

- ``ceph_mds_sessions_sessions_stale``: Sessions currently stale

- ``ceph_objecter_op_r``: Number of read operations

- ``ceph_objecter_op_w``: Number of write operations

- ``ceph_mds_root_rbytes``: Total number of bytes managed by the daemon

- ``ceph_mds_root_rfiles``: Total number of files managed by the daemon


Useful queries
--------------

.. code-block:: bash

   Total read workload of all MDS daemons:
   sum(rate(ceph_objecter_op_r[1m]))

   Total write workload of all MDS daemons:
   sum(rate(ceph_objecter_op_w[1m]))

   Read workload of a single MDS daemon (daemon name is "mdstest"):
   sum(rate(ceph_objecter_op_r{ceph_daemon=~"mdstest"}[1m]))

   Write workload of a single MDS daemon (daemon name is "mdstest"):
   sum(rate(ceph_objecter_op_w{ceph_daemon=~"mdstest"}[1m]))

   The average of reply latencies:
   rate(ceph_mds_reply_latency_sum[30s]) / rate(ceph_mds_reply_latency_count[30s])

   Total requests per second:
   rate(ceph_mds_request[30s]) * on (instance) group_right (ceph_daemon) ceph_mds_metadata

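To break the reply latency down per MDS daemon and attach the host name, the
latency ratio can be joined with ``ceph_mds_metadata``, in the same way that the
RGW queries join with ``ceph_rgw_metadata``. A short sketch:

.. code-block:: bash

   Average reply latency per MDS daemon, labelled by host:
   (rate(ceph_mds_reply_latency_sum[30s]) / rate(ceph_mds_reply_latency_count[30s])) * on (ceph_daemon) group_left (hostname) ceph_mds_metadata
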
Block metrics
=============

By default, the Prometheus manager module does not expose per-image RBD
metrics, in order to provide the best performance.

To produce metrics for RBD images, configure the manager option
``mgr/prometheus/rbd_stats_pools`` appropriately. For more information, see
:ref:`prometheus-rbd-io-statistics`.

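The option is normally set with ``ceph config``; the pool name below is only an
example, and :ref:`prometheus-rbd-io-statistics` describes the full syntax:

.. code-block:: bash

   # Enable per-image RBD statistics for selected pools (comma-separated list).
   ceph config set mgr mgr/prometheus/rbd_stats_pools "testrbdpool"
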
These metrics have the following labels:

``image``: name of the image producing the metric value

``instance``: node where the RBD metric is produced (it points to the Ceph exporter daemon)

``job``: name of the Prometheus scrape job

``pool``: image pool name

Example:

.. code-block:: bash

   ceph_rbd_read_bytes{image="test2", instance="cephtest-node-00.cephlab.com:9283", job="ceph", pool="testrbdpool"}


Main metrics
------------

- ``ceph_rbd_read_bytes``: RBD image bytes read

- ``ceph_rbd_read_latency_count``: RBD image reads latency count

- ``ceph_rbd_read_latency_sum``: RBD image reads latency total

- ``ceph_rbd_read_ops``: RBD image reads count

- ``ceph_rbd_write_bytes``: RBD image bytes written

- ``ceph_rbd_write_latency_count``: RBD image writes latency count

- ``ceph_rbd_write_latency_sum``: RBD image writes latency total

- ``ceph_rbd_write_ops``: RBD image writes count


Useful queries
--------------

.. code-block:: bash

   The average of read latencies:
   rate(ceph_rbd_read_latency_sum[30s]) / rate(ceph_rbd_read_latency_count[30s])

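With the per-image counters above it is straightforward to rank images by
activity. A short sketch that lists the busiest RBD images by combined read and
write IOPS (the limit of 10 is an arbitrary example):

.. code-block:: bash

   The 10 busiest RBD images by IOPS:
   topk(10, rate(ceph_rbd_read_ops[30s]) + rate(ceph_rbd_write_ops[30s]))
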
@@ -4,7 +4,7 @@

Configuring Ceph
==================

When Ceph services start, the initialization process activates a set of
daemons that run in the background. A :term:`Ceph Storage Cluster` runs at
least three types of daemons:

@@ -12,15 +12,16 @@ least three types of daemons:

- :term:`Ceph Manager` (``ceph-mgr``)
- :term:`Ceph OSD Daemon` (``ceph-osd``)

Any Ceph Storage Cluster that supports the :term:`Ceph File System` also runs
at least one :term:`Ceph Metadata Server` (``ceph-mds``). Any cluster that
supports :term:`Ceph Object Storage` runs Ceph RADOS Gateway daemons
(``radosgw``).

Each daemon has a number of configuration options, and each of those options
has a default value. Adjust the behavior of the system by changing these
configuration options. Make sure to understand the consequences before
overriding the default values, as it is possible to significantly degrade the
performance and stability of your cluster. Remember that default values
sometimes change between releases. For this reason, it is best to review the
version of this documentation that applies to your Ceph release.

@@ -4,11 +4,12 @@

.. note:: Since the Luminous release of Ceph, BlueStore, not Filestore, has
   been Ceph's default storage back end. Filestore OSDs are still supported up
   to Quincy, but they are not supported in Reef. See :ref:`OSD Back Ends
   <rados_config_storage_devices_osd_backends>`. See :ref:`BlueStore Migration
   <rados_operations_bluestore_migration>` for instructions explaining how to
   replace an existing Filestore back end with a BlueStore back end.


``filestore_debug_omap_check``

@@ -18,27 +18,25 @@ Background

Ceph Monitors maintain a "master copy" of the :term:`Cluster Map`.

The :term:`Cluster Map` makes it possible for :term:`Ceph client`\s to
determine the location of all Ceph Monitors, Ceph OSD Daemons, and Ceph
Metadata Servers. Clients do this by connecting to one Ceph Monitor and
retrieving a current cluster map. Ceph clients must connect to a Ceph Monitor
before they can read from or write to Ceph OSD Daemons or Ceph Metadata
Servers. A Ceph client that has a current copy of the cluster map and the CRUSH
algorithm can compute the location of any RADOS object within the cluster. This
makes it possible for Ceph clients to talk directly to Ceph OSD Daemons. Direct
communication between clients and Ceph OSD Daemons improves upon traditional
storage architectures that required clients to communicate with a central
component. See `Scalability and High Availability`_ for more on this subject.

The Ceph Monitor's primary function is to maintain a master copy of the cluster
map. Monitors also provide authentication and logging services. All changes in
the monitor services are written by the Ceph Monitor to a single Paxos
instance, and Paxos writes the changes to a key/value store. This provides
strong consistency. Ceph Monitors are able to query the most recent version of
the cluster map during sync operations, and they use the key/value store's
snapshots and iterators (using RocksDB) to perform store-wide synchronization.

.. ditaa::
   /-------------\               /-------------\

@@ -289,7 +287,6 @@ by setting it in the ``[mon]`` section of the configuration file.

.. confval:: mon_data_size_warn
.. confval:: mon_data_avail_warn
.. confval:: mon_data_avail_crit
.. confval:: mon_warn_on_crush_straw_calc_version_zero
.. confval:: mon_warn_on_legacy_crush_tunables
.. confval:: mon_crush_min_required_version

@@ -540,6 +537,8 @@ Trimming requires that the placement groups are ``active+clean``.

.. index:: Ceph Monitor; clock

.. _mon-config-ref-clock:

Clock
-----

@@ -91,7 +91,7 @@ Similarly, two options control whether IPv4 and IPv6 addresses are used:

   to an IPv6 address

.. note:: The ability to bind to multiple ports has paved the way for
   dual-stack IPv4 and IPv6 support. That said, dual-stack operation is
   not yet supported as of Quincy v17.2.0.

Connection modes

@@ -145,17 +145,20 @@ See `Pool & PG Config Reference`_ for details.

Scrubbing
=========

One way that Ceph ensures data integrity is by "scrubbing" placement groups.
Ceph scrubbing is analogous to ``fsck`` on the object storage layer. Ceph
generates a catalog of all objects in each placement group and compares each
primary object to its replicas, ensuring that no objects are missing or
mismatched. Light scrubbing checks the object size and attributes, and is
usually done daily. Deep scrubbing reads the data and uses checksums to ensure
data integrity, and is usually done weekly. The frequencies of both light
scrubbing and deep scrubbing are determined by the cluster's configuration,
which is fully under your control and subject to the settings explained below
in this section.

Although scrubbing is important for maintaining data integrity, it can reduce
the performance of the Ceph cluster. You can adjust the following settings to
increase or decrease the frequency and depth of scrubbing operations.


.. confval:: osd_max_scrubs

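The scrub options referenced here (for example ``osd_max_scrubs``) can be
inspected and changed at runtime with the ``ceph config`` command. A minimal
sketch; the value shown is only an illustration, not a recommendation:

.. prompt:: bash $

   ceph config get osd osd_max_scrubs
   ceph config set osd osd_max_scrubs 1
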
@@ -1,3 +1,5 @@

.. _rados_config_pool_pg_crush_ref:

======================================
 Pool, PG and CRUSH Config Reference
======================================

@@ -4,74 +4,70 @@

Adding/Removing Monitors
==========================

It is possible to add monitors to a running cluster as long as redundancy is
maintained. To bootstrap a monitor, see `Manual Deployment`_ or `Monitor
Bootstrap`_.

.. _adding-monitors:

Adding Monitors
===============

Ceph monitors serve as the single source of truth for the cluster map. It is
possible to run a cluster with only one monitor, but for a production cluster
it is recommended to have at least three monitors provisioned and in quorum.
Ceph monitors use a variation of the `Paxos`_ algorithm to maintain consensus
about maps and about other critical information across the cluster. Due to the
nature of Paxos, Ceph is able to maintain quorum (and thus establish
consensus) only if a majority of the monitors are ``active``.

It is best to run an odd number of monitors. This is because a cluster that is
running an odd number of monitors is more resilient than a cluster running an
even number. For example, in a two-monitor deployment, no failures can be
tolerated if quorum is to be maintained; in a three-monitor deployment, one
failure can be tolerated; in a four-monitor deployment, one failure can be
tolerated; and in a five-monitor deployment, two failures can be tolerated. In
general, a cluster running an odd number of monitors is best because it avoids
what is called the *split brain* phenomenon. In short, Ceph is able to operate
only if a majority of monitors are ``active`` and able to communicate with each
other (for example: there must be a single monitor, two out of two monitors,
two out of three monitors, three out of five monitors, or the like).

For small or non-critical deployments of multi-node Ceph clusters, it is
recommended to deploy three monitors. For larger clusters or for clusters that
are intended to survive a double failure, it is recommended to deploy five
monitors. Only in rare circumstances is there any justification for deploying
seven or more monitors.

It is possible to run a monitor on the same host that is running an OSD.
However, this approach has disadvantages: for example, `fsync` issues with the
kernel might weaken performance, and monitor and OSD daemons might be inactive
at the same time and cause disruption if the node crashes, is rebooted, or is
taken down for maintenance. Because of these risks, it is instead recommended
to run monitors and managers on dedicated hosts.

.. note:: A *majority* of monitors in your cluster must be able to
   reach each other in order for quorum to be established.

Deploying your Hardware
-----------------------

Some operators choose to add a new monitor host at the same time that they add
a new monitor. For details on the minimum recommendations for monitor hardware,
see `Hardware Recommendations`_. Before adding a monitor host to the cluster,
make sure that there is an up-to-date version of Linux installed.

Add the newly installed monitor host to a rack in your cluster, connect the
host to the network, and make sure that the host has network connectivity.

.. _Hardware Recommendations: ../../../start/hardware-recommendations

Installing the Required Software
--------------------------------

In manually deployed clusters, it is necessary to install Ceph packages
manually. For details, see `Installing Packages`_. Configure SSH so that it can
be used by a user that has passwordless authentication and root permissions.

.. _Installing Packages: ../../../install/install-storage-cluster

@@ -81,67 +77,65 @@ and root permissions.

Adding a Monitor (Manual)
-------------------------

The procedure in this section creates a ``ceph-mon`` data directory, retrieves
both the monitor map and the monitor keyring, and adds a ``ceph-mon`` daemon to
the cluster. The procedure might result in a Ceph cluster that contains only
two monitor daemons. To add more monitors until there are enough ``ceph-mon``
daemons to establish quorum, repeat the procedure.

This is a good point at which to define the new monitor's ``id``. Monitors have
often been named with single letters (``a``, ``b``, ``c``, etc.), but you are
free to define the ``id`` however you see fit. In this document, ``{mon-id}``
refers to the ``id`` exclusive of the ``mon.`` prefix: for example, if
``mon.a`` has been chosen as the ``id`` of a monitor, then ``{mon-id}`` is
``a``.

#. Create a data directory on the machine that will host the new monitor:

   .. prompt:: bash $

      ssh {new-mon-host}
      sudo mkdir /var/lib/ceph/mon/ceph-{mon-id}

#. Create a temporary directory ``{tmp}`` that will contain the files needed
   during this procedure. This directory should be different from the data
   directory created in the previous step. Because this is a temporary
   directory, it can be removed after the procedure is complete:

   .. prompt:: bash $

      mkdir {tmp}

#. Retrieve the keyring for your monitors (``{tmp}`` is the path to the
   retrieved keyring and ``{key-filename}`` is the name of the file that
   contains the retrieved monitor key):

   .. prompt:: bash $

      ceph auth get mon. -o {tmp}/{key-filename}

#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor map
   and ``{map-filename}`` is the name of the file that contains the retrieved
   monitor map):

   .. prompt:: bash $

      ceph mon getmap -o {tmp}/{map-filename}

#. Prepare the monitor's data directory, which was created in the first step.
   The following command must specify the path to the monitor map (so that
   information about a quorum of monitors and their ``fsid``\s can be
   retrieved) and specify the path to the monitor keyring:

   .. prompt:: bash $

      sudo ceph-mon -i {mon-id} --mkfs --monmap {tmp}/{map-filename} --keyring {tmp}/{key-filename}

#. Start the new monitor. It will automatically join the cluster. To provide
   information to the daemon about which address to bind to, use either the
   ``--public-addr {ip}`` option or the ``--public-network {network}`` option.
   For example:

   .. prompt:: bash $

      ceph-mon -i {mon-id} --public-addr {ip:port}

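Putting these steps together, a hypothetical end-to-end run for a new monitor
with the ``id`` ``x``, hosted on a machine named ``newmon01`` with the address
``192.168.0.10``, might look like the following. The host name, ``id``, and
address are illustrative placeholders only, and the ``ceph`` commands assume
that an admin keyring is available on the new host:

.. prompt:: bash $

   ssh newmon01
   sudo mkdir /var/lib/ceph/mon/ceph-x
   mkdir /tmp/mon-bootstrap
   ceph auth get mon. -o /tmp/mon-bootstrap/mon.keyring
   ceph mon getmap -o /tmp/mon-bootstrap/monmap
   sudo ceph-mon -i x --mkfs --monmap /tmp/mon-bootstrap/monmap --keyring /tmp/mon-bootstrap/mon.keyring
   ceph-mon -i x --public-addr 192.168.0.10:6789
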
@@ -151,44 +145,47 @@ on ``mon.a``).

Removing Monitors
=================

When monitors are removed from a cluster, it is important to remember
that Ceph monitors use Paxos to maintain consensus about the cluster
map. Such consensus is possible only if the number of monitors is sufficient
to establish quorum.


.. _Removing a Monitor (Manual):

Removing a Monitor (Manual)
---------------------------

The procedure in this section removes a ``ceph-mon`` daemon from the cluster.
The procedure might result in a Ceph cluster that contains a number of monitors
insufficient to maintain quorum, so plan carefully. When replacing an old
monitor with a new monitor, add the new monitor first, wait for quorum to be
established, and then remove the old monitor. This ensures that quorum is not
lost.


#. Stop the monitor:

   .. prompt:: bash $

      service ceph -a stop mon.{mon-id}

#. Remove the monitor from the cluster:

   .. prompt:: bash $

      ceph mon remove {mon-id}

#. Remove the monitor entry from the ``ceph.conf`` file.

.. _rados-mon-remove-from-unhealthy:


Removing Monitors from an Unhealthy Cluster
-------------------------------------------

The procedure in this section removes a ``ceph-mon`` daemon from an unhealthy
cluster (for example, a cluster whose monitors are unable to form a quorum).


#. Stop all ``ceph-mon`` daemons on all monitor hosts:

@@ -197,63 +194,68 @@ quorum.

      ssh {mon-host}
      systemctl stop ceph-mon.target

   Repeat this step on every monitor host.

#. Identify a surviving monitor and log in to the monitor's host:

   .. prompt:: bash $

      ssh {mon-host}

#. Extract a copy of the ``monmap`` file by running a command of the following
   form:

   .. prompt:: bash $

      ceph-mon -i {mon-id} --extract-monmap {map-path}

   Here is a more concrete example. In this example, ``hostname`` is the
   ``{mon-id}`` and ``/tmp/monmap`` is the ``{map-path}``:

   .. prompt:: bash $

      ceph-mon -i `hostname` --extract-monmap /tmp/monmap

#. Remove the non-surviving or otherwise problematic monitors:

   .. prompt:: bash $

      monmaptool {map-path} --rm {mon-id}

   For example, suppose that there are three monitors |---| ``mon.a``, ``mon.b``,
   and ``mon.c`` |---| and that only ``mon.a`` will survive:

   .. prompt:: bash $

      monmaptool /tmp/monmap --rm b
      monmaptool /tmp/monmap --rm c

#. Inject the surviving map (from which the problematic monitors have been
   removed) into the monmap of the surviving monitor(s):

   .. prompt:: bash $

      ceph-mon -i {mon-id} --inject-monmap {map-path}

   Continuing with the above example, inject a map into monitor ``mon.a`` by
   running the following command:

   .. prompt:: bash $

      ceph-mon -i a --inject-monmap /tmp/monmap

#. Start only the surviving monitors.

#. Verify that the monitors form a quorum by running the command ``ceph -s``.

#. The data directory of the removed monitors is in ``/var/lib/ceph/mon``:
   either archive this data directory in a safe location or delete this data
   directory. However, do not delete it unless you are confident that the
   remaining monitors are healthy and sufficiently redundant. Make sure that
   there is enough room for the live DB to expand and compact, and make sure
   that there is also room for an archived copy of the DB. The archived copy
   can be compressed.

.. _Changing a Monitor's IP address:

@ -262,185 +264,195 @@ Changing a Monitor's IP Address
|
|||||||
|
|
||||||
.. important:: Existing monitors are not supposed to change their IP addresses.
|
.. important:: Existing monitors are not supposed to change their IP addresses.
|
||||||
|
|
||||||
Monitors are critical components of a Ceph cluster, and they need to maintain a
|
Monitors are critical components of a Ceph cluster. The entire system can work
|
||||||
quorum for the whole system to work properly. To establish a quorum, the
|
properly only if the monitors maintain quorum, and quorum can be established
|
||||||
monitors need to discover each other. Ceph has strict requirements for
|
only if the monitors have discovered each other by means of their IP addresses.
|
||||||
discovering monitors.
|
Ceph has strict requirements on the discovery of monitors.
|
||||||
|
|
||||||
Ceph clients and other Ceph daemons use ``ceph.conf`` to discover monitors.
|
Although the ``ceph.conf`` file is used by Ceph clients and other Ceph daemons
|
||||||
However, monitors discover each other using the monitor map, not ``ceph.conf``.
|
to discover monitors, the monitor map is used by monitors to discover each
|
||||||
For example, if you refer to `Adding a Monitor (Manual)`_ you will see that you
|
other. This is why it is necessary to obtain the current ``monmap`` at the time
|
||||||
need to obtain the current monmap for the cluster when creating a new monitor,
|
a new monitor is created: as can be seen above in `Adding a Monitor (Manual)`_,
|
||||||
as it is one of the required arguments of ``ceph-mon -i {mon-id} --mkfs``. The
|
the ``monmap`` is one of the arguments required by the ``ceph-mon -i {mon-id}
|
||||||
following sections explain the consistency requirements for Ceph monitors, and a
|
--mkfs`` command. The following sections explain the consistency requirements
|
||||||
few safe ways to change a monitor's IP address.
|
for Ceph monitors, and also explain a number of safe ways to change a monitor's
|
||||||
|
IP address.
|
||||||
|
|
||||||
|
|
||||||
Consistency Requirements
|
Consistency Requirements
|
||||||
------------------------
|
------------------------
|
||||||
|
|
||||||
A monitor always refers to the local copy of the monmap when discovering other
|
When a monitor discovers other monitors in the cluster, it always refers to the
|
||||||
monitors in the cluster. Using the monmap instead of ``ceph.conf`` avoids
|
local copy of the monitor map. Using the monitor map instead of using the
|
||||||
errors that could break the cluster (e.g., typos in ``ceph.conf`` when
|
``ceph.conf`` file avoids errors that could break the cluster (for example,
|
||||||
specifying a monitor address or port). Since monitors use monmaps for discovery
|
typos or other slight errors in ``ceph.conf`` when a monitor address or port is
|
||||||
and they share monmaps with clients and other Ceph daemons, the monmap provides
|
specified). Because monitors use monitor maps for discovery and because they
|
||||||
monitors with a strict guarantee that their consensus is valid.
|
share monitor maps with Ceph clients and other Ceph daemons, the monitor map
|
||||||
|
provides monitors with a strict guarantee that their consensus is valid.
|
||||||
|
|
||||||
Strict consistency also applies to updates to the monmap. As with any other
|
Strict consistency also applies to updates to the monmap. As with any other
|
||||||
updates on the monitor, changes to the monmap always run through a distributed
|
updates on the monitor, changes to the monmap always run through a distributed
|
||||||
consensus algorithm called `Paxos`_. The monitors must agree on each update to
|
consensus algorithm called `Paxos`_. The monitors must agree on each update to
|
||||||
the monmap, such as adding or removing a monitor, to ensure that each monitor in
|
the monmap, such as adding or removing a monitor, to ensure that each monitor
|
||||||
the quorum has the same version of the monmap. Updates to the monmap are
|
in the quorum has the same version of the monmap. Updates to the monmap are
|
||||||
incremental so that monitors have the latest agreed upon version, and a set of
|
incremental so that monitors have the latest agreed upon version, and a set of
|
||||||
previous versions, allowing a monitor that has an older version of the monmap to
|
previous versions, allowing a monitor that has an older version of the monmap
|
||||||
catch up with the current state of the cluster.
|
to catch up with the current state of the cluster.
|
||||||
|
|
||||||
If monitors discovered each other through the Ceph configuration file instead of
|
There are additional advantages to using the monitor map rather than
|
||||||
through the monmap, it would introduce additional risks because the Ceph
|
``ceph.conf`` when monitors discover each other. Because ``ceph.conf`` is not
|
||||||
configuration files are not updated and distributed automatically. Monitors
|
automatically updated and distributed, its use would bring certain risks:
|
||||||
might inadvertently use an older ``ceph.conf`` file, fail to recognize a
|
monitors might use an outdated ``ceph.conf`` file, might fail to recognize a
|
||||||
monitor, fall out of a quorum, or develop a situation where `Paxos`_ is not able
|
specific monitor, might fall out of quorum, and might develop a situation in
|
||||||
to determine the current state of the system accurately. Consequently, making
|
which `Paxos`_ is unable to accurately ascertain the current state of the
|
||||||
changes to an existing monitor's IP address must be done with great care.
|
system. Because of these risks, any changes to an existing monitor's IP address
|
||||||
|
must be made with great care.
|
||||||
|
|
||||||
|
.. _operations_add_or_rm_mons_changing_mon_ip:
|
||||||
|
|
||||||
Changing a Monitor's IP address (The Right Way)
|
Changing a Monitor's IP address (Preferred Method)
|
||||||
-----------------------------------------------
|
--------------------------------------------------
|
||||||
|
|
||||||
Changing a monitor's IP address in ``ceph.conf`` only is not sufficient to
|
If a monitor's IP address is changed only in the ``ceph.conf`` file, there is
|
||||||
ensure that other monitors in the cluster will receive the update. To change a
|
no guarantee that the other monitors in the cluster will receive the update.
|
||||||
monitor's IP address, you must add a new monitor with the IP address you want
|
For this reason, the preferred method to change a monitor's IP address is as
|
||||||
to use (as described in `Adding a Monitor (Manual)`_), ensure that the new
|
follows: add a new monitor with the desired IP address (as described in `Adding
|
||||||
monitor successfully joins the quorum; then, remove the monitor that uses the
|
a Monitor (Manual)`_), make sure that the new monitor successfully joins the
|
||||||
old IP address. Then, update the ``ceph.conf`` file to ensure that clients and
|
quorum, remove the monitor that is using the old IP address, and update the
|
||||||
other daemons know the IP address of the new monitor.
|
``ceph.conf`` file to ensure that clients and other daemons are made aware of
|
||||||
|
the new monitor's IP address.
|
||||||
|
|
||||||
For example, lets assume there are three monitors in place, such as ::
|
For example, suppose that there are three monitors in place::
|
||||||
|
|
||||||
[mon.a]
|
[mon.a]
|
||||||
host = host01
|
host = host01
|
||||||
addr = 10.0.0.1:6789
|
addr = 10.0.0.1:6789
|
||||||
[mon.b]
|
[mon.b]
|
||||||
host = host02
|
host = host02
|
||||||
addr = 10.0.0.2:6789
|
addr = 10.0.0.2:6789
|
||||||
[mon.c]
|
[mon.c]
|
||||||
host = host03
|
host = host03
|
||||||
addr = 10.0.0.3:6789
|
addr = 10.0.0.3:6789
|
||||||
|
|
||||||
To change ``mon.c`` to ``host04`` with the IP address ``10.0.0.4``, follow the
|
To change ``mon.c`` so that its name is ``host04`` and its IP address is
|
||||||
steps in `Adding a Monitor (Manual)`_ by adding a new monitor ``mon.d``. Ensure
|
``10.0.0.4``: (1) follow the steps in `Adding a Monitor (Manual)`_ to add a new
|
||||||
that ``mon.d`` is running before removing ``mon.c``, or it will break the
|
monitor ``mon.d``, (2) make sure that ``mon.d`` is running before removing
|
||||||
quorum. Remove ``mon.c`` as described on `Removing a Monitor (Manual)`_. Moving
|
``mon.c`` or else quorum will be broken, and (3) follow the steps in `Removing
|
||||||
all three monitors would thus require repeating this process as many times as
|
a Monitor (Manual)`_ to remove ``mon.c``. To move all three monitors to new IP
|
||||||
needed.
|
addresses, repeat this process.
|
||||||
|
|
||||||
|
Changing a Monitor's IP address (Advanced Method)
|
||||||
|
-------------------------------------------------
|
||||||
|
|
||||||
Changing a Monitor's IP address (The Messy Way)
|
There are cases in which the method outlined in :ref"`<Changing a Monitor's IP
|
||||||
-----------------------------------------------
|
Address (Preferred Method) <operations_add_or_rm_mons_changing_mon_ip>` cannot
|
||||||
|
be used. For example, it might be necessary to move the cluster's monitors to a
|
||||||
|
different network, to a different part of the datacenter, or to a different
|
||||||
|
datacenter altogether. It is still possible to change the monitors' IP
|
||||||
|
addresses, but a different method must be used.
|
||||||
|
|
||||||
There may come a time when the monitors must be moved to a different network, a
|
For such cases, a new monitor map with updated IP addresses for every monitor
|
||||||
different part of the datacenter or a different datacenter altogether. While it
|
in the cluster must be generated and injected into each monitor. Although this
|
||||||
is possible to do it, the process becomes a bit more hazardous.
|
method is not particularly easy, such a major migration is unlikely to be a
|
||||||
|
routine task. As stated at the beginning of this section, existing monitors are
|
||||||
|
not supposed to change their IP addresses.
|
||||||
|
|
||||||
In such a case, the solution is to generate a new monmap with updated IP
|
Continue with the monitor configuration in the example from :ref:`Changing a
|
||||||
addresses for all the monitors in the cluster, and inject the new map on each
|
Monitor's IP Address (Preferred Method)
|
||||||
individual monitor. This is not the most user-friendly approach, but we do not
|
<operations_add_or_rm_mons_changing_mon_ip>`. Suppose that all of the monitors
|
||||||
expect this to be something that needs to be done every other week. As it is
|
are to be moved from the ``10.0.0.x`` range to the ``10.1.0.x`` range, and that
|
||||||
clearly stated on the top of this section, monitors are not supposed to change
|
these networks are unable to communicate. Carry out the following procedure:
|
||||||
IP addresses.
|
|
||||||
|
|
||||||
Using the previous monitor configuration as an example, assume you want to move
|
#. Retrieve the monitor map (``{tmp}`` is the path to the retrieved monitor
|
||||||
all the monitors from the ``10.0.0.x`` range to ``10.1.0.x``, and these
|
map, and ``{filename}`` is the name of the file that contains the retrieved
|
||||||
networks are unable to communicate. Use the following procedure:
|
monitor map):
|
||||||
|
|
||||||
#. Retrieve the monitor map, where ``{tmp}`` is the path to
|
|
||||||
the retrieved monitor map, and ``{filename}`` is the name of the file
|
|
||||||
containing the retrieved monitor map:
|
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph mon getmap -o {tmp}/{filename}
|
ceph mon getmap -o {tmp}/{filename}
|
||||||
|
|
||||||
#. The following example demonstrates the contents of the monmap:
|
#. Check the contents of the monitor map:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
monmaptool --print {tmp}/{filename}
|
monmaptool --print {tmp}/{filename}
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
monmaptool: monmap file {tmp}/{filename}
|
monmaptool: monmap file {tmp}/{filename}
|
||||||
epoch 1
|
epoch 1
|
||||||
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
|
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
|
||||||
last_changed 2012-12-17 02:46:41.591248
|
last_changed 2012-12-17 02:46:41.591248
|
||||||
created 2012-12-17 02:46:41.591248
|
created 2012-12-17 02:46:41.591248
|
||||||
0: 10.0.0.1:6789/0 mon.a
|
0: 10.0.0.1:6789/0 mon.a
|
||||||
1: 10.0.0.2:6789/0 mon.b
|
1: 10.0.0.2:6789/0 mon.b
|
||||||
2: 10.0.0.3:6789/0 mon.c
|
2: 10.0.0.3:6789/0 mon.c
|
||||||
|
|
||||||
#. Remove the existing monitors:
|
#. Remove the existing monitors from the monitor map:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
monmaptool --rm a --rm b --rm c {tmp}/{filename}
|
monmaptool --rm a --rm b --rm c {tmp}/{filename}
|
||||||
|
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
monmaptool: monmap file {tmp}/{filename}
|
monmaptool: monmap file {tmp}/{filename}
|
||||||
monmaptool: removing a
|
monmaptool: removing a
|
||||||
monmaptool: removing b
|
monmaptool: removing b
|
||||||
monmaptool: removing c
|
monmaptool: removing c
|
||||||
monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
|
monmaptool: writing epoch 1 to {tmp}/{filename} (0 monitors)
|
||||||
|
|
||||||
#. Add the new monitor locations:
|
#. Add the new monitor locations to the monitor map:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename}
|
monmaptool --add a 10.1.0.1:6789 --add b 10.1.0.2:6789 --add c 10.1.0.3:6789 {tmp}/{filename}
|
||||||
|
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
monmaptool: monmap file {tmp}/{filename}
|
monmaptool: monmap file {tmp}/{filename}
|
||||||
monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors)
|
monmaptool: writing epoch 1 to {tmp}/{filename} (3 monitors)
|
||||||
|
|
||||||
#. Check new contents:
|
#. Check the new contents of the monitor map:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
monmaptool --print {tmp}/{filename}
|
monmaptool --print {tmp}/{filename}
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
monmaptool: monmap file {tmp}/{filename}
|
monmaptool: monmap file {tmp}/{filename}
|
||||||
epoch 1
|
epoch 1
|
||||||
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
|
fsid 224e376d-c5fe-4504-96bb-ea6332a19e61
|
||||||
last_changed 2012-12-17 02:46:41.591248
|
last_changed 2012-12-17 02:46:41.591248
|
||||||
created 2012-12-17 02:46:41.591248
|
created 2012-12-17 02:46:41.591248
|
||||||
0: 10.1.0.1:6789/0 mon.a
|
0: 10.1.0.1:6789/0 mon.a
|
||||||
1: 10.1.0.2:6789/0 mon.b
|
1: 10.1.0.2:6789/0 mon.b
|
||||||
2: 10.1.0.3:6789/0 mon.c
|
2: 10.1.0.3:6789/0 mon.c
|
||||||
|
|
||||||
At this point, we assume the monitors (and stores) are installed at the new
|
At this point, we assume that the monitors (and stores) have been installed at
|
||||||
location. The next step is to propagate the modified monmap to the new
|
the new location. Next, propagate the modified monitor map to the new monitors,
|
||||||
monitors, and inject the modified monmap into each new monitor.
|
and inject the modified monitor map into each new monitor.
|
||||||
|
|
||||||
#. First, make sure to stop all your monitors. Injection must be done while
|
#. Make sure all of your monitors have been stopped. Never inject into a
|
||||||
the daemon is not running.
|
monitor while the monitor daemon is running.
|
||||||
|
|
||||||
#. Inject the monmap:
|
#. Inject the monitor map:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename}
|
ceph-mon -i {mon-id} --inject-monmap {tmp}/{filename}
|
||||||
|
|
||||||
#. Restart the monitors.
|
#. Restart all of the monitors.
|
||||||
|
|
||||||
|
Migration to the new location is now complete. The monitors should operate
|
||||||
|
successfully.
|
||||||
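As a hedged verification sketch, you can confirm that the monitors came back up
with their new addresses (the expected ``10.1.0.x`` addresses come from the
example above):

.. prompt:: bash $

   ceph mon dump
   ceph -s

The monitor map printed by ``ceph mon dump`` should now list the ``10.1.0.x``
addresses.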
|
|
||||||
After this step, migration to the new location is complete and
|
|
||||||
the monitors should operate successfully.
|
|
||||||
|
|
||||||
|
|
||||||
.. _Manual Deployment: ../../../install/manual-deployment
|
.. _Manual Deployment: ../../../install/manual-deployment
|
||||||
.. _Monitor Bootstrap: ../../../dev/mon-bootstrap
|
.. _Monitor Bootstrap: ../../../dev/mon-bootstrap
|
||||||
.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
|
.. _Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)
|
||||||
|
|
||||||
|
.. |---| unicode:: U+2014 .. EM DASH
|
||||||
|
:trim:
|
||||||
|
@ -1,7 +1,7 @@
|
|||||||
.. _balancer:
|
.. _balancer:
|
||||||
|
|
||||||
Balancer
|
Balancer Module
|
||||||
========
|
=======================
|
||||||
|
|
||||||
The *balancer* can optimize the allocation of placement groups (PGs) across
|
The *balancer* can optimize the allocation of placement groups (PGs) across
|
||||||
OSDs in order to achieve a balanced distribution. The balancer can operate
|
OSDs in order to achieve a balanced distribution. The balancer can operate
|
||||||
|
@ -106,22 +106,27 @@ to be considered ``stuck`` (default: 300).
|
|||||||
PGs might be stuck in any of the following states:
|
PGs might be stuck in any of the following states:
|
||||||
|
|
||||||
**Inactive**
|
**Inactive**
|
||||||
|
|
||||||
PGs are unable to process reads or writes because they are waiting for an
|
PGs are unable to process reads or writes because they are waiting for an
|
||||||
OSD that has the most up-to-date data to return to an ``up`` state.
|
OSD that has the most up-to-date data to return to an ``up`` state.
|
||||||
|
|
||||||
|
|
||||||
**Unclean**
|
**Unclean**
|
||||||
|
|
||||||
PGs contain objects that have not been replicated the desired number of
|
PGs contain objects that have not been replicated the desired number of
|
||||||
times. These PGs have not yet completed the process of recovering.
|
times. These PGs have not yet completed the process of recovering.
|
||||||
|
|
||||||
|
|
||||||
**Stale**
|
**Stale**
|
||||||
|
|
||||||
PGs are in an unknown state, because the OSDs that host them have not
|
PGs are in an unknown state, because the OSDs that host them have not
|
||||||
reported to the monitor cluster for a certain period of time (specified by
|
reported to the monitor cluster for a certain period of time (specified by
|
||||||
the ``mon_osd_report_timeout`` configuration setting).
|
the ``mon_osd_report_timeout`` configuration setting).
|
||||||
|
|
||||||
|
|
||||||
To delete a ``lost`` RADOS object or revert an object to its prior state
|
To delete a ``lost`` object or revert an object to its prior state, either by
|
||||||
(either by reverting it to its previous version or by deleting it because it
|
reverting it to its previous version or by deleting it because it was just
|
||||||
was just created and has no previous version), run the following command:
|
created and has no previous version, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
@ -168,10 +173,8 @@ To dump the OSD map, run the following command:
|
|||||||
ceph osd dump [--format {format}]
|
ceph osd dump [--format {format}]
|
||||||
|
|
||||||
The ``--format`` option accepts the following arguments: ``plain`` (default),
|
The ``--format`` option accepts the following arguments: ``plain`` (default),
|
||||||
``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON
|
``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. As noted above, JSON is
|
||||||
format is the recommended format for consumption by tools, scripting, and other
|
the recommended format for tools, scripting, and other forms of automation.
|
||||||
forms of automation.
|
|
||||||
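For example, to capture the OSD map as JSON for consumption by a script (the
redirection target is illustrative):

.. prompt:: bash $

   ceph osd dump --format json-pretty > osdmap.json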
|
|
||||||
|
|
||||||
To dump the OSD map as a tree that lists one OSD per line and displays
|
To dump the OSD map as a tree that lists one OSD per line and displays
|
||||||
information about the weights and states of the OSDs, run the following
|
information about the weights and states of the OSDs, run the following
|
||||||
@ -230,7 +233,7 @@ To mark an OSD as ``lost``, run the following command:
|
|||||||
.. warning::
|
.. warning::
|
||||||
This could result in permanent data loss. Use with caution!
|
This could result in permanent data loss. Use with caution!
|
||||||
|
|
||||||
To create an OSD in the CRUSH map, run the following command:
|
To create a new OSD, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
@ -287,47 +290,51 @@ following command:
|
|||||||
|
|
||||||
ceph osd in {osd-num}
|
ceph osd in {osd-num}
|
||||||
|
|
||||||
By using the ``pause`` and ``unpause`` flags in the OSD map, you can pause or
|
By using the "pause flags" in the OSD map, you can pause or unpause I/O
|
||||||
unpause I/O requests. If the flags are set, then no I/O requests will be sent
|
requests. If the flags are set, then no I/O requests will be sent to any OSD.
|
||||||
to any OSD. If the flags are cleared, then pending I/O requests will be resent.
|
When the flags are cleared, then pending I/O requests will be resent. To set or
|
||||||
To set or clear these flags, run one of the following commands:
|
clear pause flags, run one of the following commands:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph osd pause
|
ceph osd pause
|
||||||
ceph osd unpause
|
ceph osd unpause
|
||||||
|
|
||||||
You can assign an override or ``reweight`` weight value to a specific OSD
|
You can assign an override or ``reweight`` weight value to a specific OSD if
|
||||||
if the normal CRUSH distribution seems to be suboptimal. The weight of an
|
the normal CRUSH distribution seems to be suboptimal. The weight of an OSD
|
||||||
OSD helps determine the extent of its I/O requests and data storage: two
|
helps determine the extent of its I/O requests and data storage: two OSDs with
|
||||||
OSDs with the same weight will receive approximately the same number of
|
the same weight will receive approximately the same number of I/O requests and
|
||||||
I/O requests and store approximately the same amount of data. The ``ceph
|
store approximately the same amount of data. The ``ceph osd reweight`` command
|
||||||
osd reweight`` command assigns an override weight to an OSD. The weight
|
assigns an override weight to an OSD. The weight value is in the range 0 to 1,
|
||||||
value is in the range 0 to 1, and the command forces CRUSH to relocate a
|
and the command forces CRUSH to relocate a certain amount (1 - ``weight``) of
|
||||||
certain amount (1 - ``weight``) of the data that would otherwise be on
|
the data that would otherwise be on this OSD. The command does not change the
|
||||||
this OSD. The command does not change the weights of the buckets above
|
weights of the buckets above the OSD in the CRUSH map. Using the command is
|
||||||
the OSD in the CRUSH map. Using the command is merely a corrective
|
merely a corrective measure: for example, if one of your OSDs is at 90% and the
|
||||||
measure: for example, if one of your OSDs is at 90% and the others are at
|
others are at 50%, you could reduce the outlier weight to correct this
|
||||||
50%, you could reduce the outlier weight to correct this imbalance. To
|
imbalance. To assign an override weight to a specific OSD, run the following
|
||||||
assign an override weight to a specific OSD, run the following command:
|
command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph osd reweight {osd-num} {weight}
|
ceph osd reweight {osd-num} {weight}
|
||||||
|
|
||||||
|
.. note:: Any assigned override reweight value will conflict with the balancer.
|
||||||
|
This means that if the balancer is in use, all override reweight values
|
||||||
|
should be ``1.0000`` in order to avoid suboptimal cluster behavior.
|
||||||
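For example, to lower the override weight of a single over-full OSD and later
converge it back to the neutral value before enabling the balancer (the OSD ID
and weight are illustrative):

.. prompt:: bash $

   ceph osd reweight 7 0.85
   ceph osd reweight 7 1.0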
|
|
||||||
A cluster's OSDs can be reweighted in order to maintain balance if some OSDs
|
A cluster's OSDs can be reweighted in order to maintain balance if some OSDs
|
||||||
are being disproportionately utilized. Note that override or ``reweight``
|
are being disproportionately utilized. Note that override or ``reweight``
|
||||||
weights have relative values that default to 1.00000. Their values are not
|
weights have values relative to one another that default to 1.00000; their
|
||||||
absolute, and these weights must be distinguished from CRUSH weights (which
|
values are not absolute, and these weights must be distinguished from CRUSH
|
||||||
reflect the absolute capacity of a bucket, as measured in TiB). To reweight
|
weights (which reflect the absolute capacity of a bucket, as measured in TiB).
|
||||||
OSDs by utilization, run the following command:
|
To reweight OSDs by utilization, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing]
|
ceph osd reweight-by-utilization [threshold [max_change [max_osds]]] [--no-increasing]
|
||||||
|
|
||||||
By default, this command adjusts the override weight of OSDs that have ±20%
|
By default, this command adjusts the override weight of OSDs whose utilization
|
||||||
of the average utilization, but you can specify a different percentage in the
|
deviates from the average by ±20%, but you can specify a different percentage in the
|
||||||
``threshold`` argument.
|
``threshold`` argument.
|
||||||
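For example, to adjust only OSDs that deviate from the average utilization by
more than 10%, you might run the following (the threshold value is
illustrative; many releases also provide ``ceph osd test-reweight-by-utilization``
for a dry run first):

.. prompt:: bash $

   ceph osd test-reweight-by-utilization 110
   ceph osd reweight-by-utilization 110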
|
|
||||||
To limit the increment by which any OSD's reweight is to be changed, use the
|
To limit the increment by which any OSD's reweight is to be changed, use the
|
||||||
@ -355,42 +362,38 @@ Operators of deployments that utilize Nautilus or newer (or later revisions of
|
|||||||
Luminous and Mimic) and that have no pre-Luminous clients might likely instead
|
Luminous and Mimic) and that have no pre-Luminous clients will likely instead
|
||||||
want to enable the `balancer`` module for ``ceph-mgr``.
|
want to enable the ``balancer`` module for ``ceph-mgr``.
|
||||||
|
|
||||||
.. note:: The ``balancer`` module does the work for you and achieves a more
|
The blocklist can be modified by adding or removing an IP address or a CIDR
|
||||||
uniform result, shuffling less data along the way. When enabling the
|
range. If an address is blocklisted, it will be unable to connect to any OSD.
|
||||||
``balancer`` module, you will want to converge any changed override weights
|
If an OSD is contained within an IP address or CIDR range that has been
|
||||||
back to 1.00000 so that the balancer can do an optimal job. If your cluster
|
blocklisted, the OSD will be unable to perform operations on its peers when it
|
||||||
is very full, reverting these override weights before enabling the balancer
|
acts as a client: such blocked operations include tiering and copy-from
|
||||||
may cause some OSDs to become full. This means that a phased approach may
|
functionality. To add an IP address or CIDR range to the blocklist or to remove one from it,
|
||||||
needed.
|
run one of the following commands:
|
||||||
|
|
||||||
Add/remove an IP address or CIDR range to/from the blocklist.
|
|
||||||
When adding to the blocklist,
|
|
||||||
you can specify how long it should be blocklisted in seconds; otherwise,
|
|
||||||
it will default to 1 hour. A blocklisted address is prevented from
|
|
||||||
connecting to any OSD. If you blocklist an IP or range containing an OSD, be aware
|
|
||||||
that OSD will also be prevented from performing operations on its peers where it
|
|
||||||
acts as a client. (This includes tiering and copy-from functionality.)
|
|
||||||
|
|
||||||
If you want to blocklist a range (in CIDR format), you may do so by
|
|
||||||
including the ``range`` keyword.
|
|
||||||
|
|
||||||
These commands are mostly only useful for failure testing, as
|
|
||||||
blocklists are normally maintained automatically and shouldn't need
|
|
||||||
manual intervention. :
|
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME]
|
ceph osd blocklist ["range"] add ADDRESS[:source_port][/netmask_bits] [TIME]
|
||||||
ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits]
|
ceph osd blocklist ["range"] rm ADDRESS[:source_port][/netmask_bits]
|
||||||
|
|
||||||
Creates/deletes a snapshot of a pool. :
|
If you add something to the blocklist with the above ``add`` command, you can
|
||||||
|
use the ``TIME`` keyword to specify the length of time (in seconds) that it
|
||||||
|
will remain on the blocklist (default: one hour). To add or remove a CIDR
|
||||||
|
range, use the ``range`` keyword in the above commands.
|
||||||
|
|
||||||
|
Note that these commands are useful primarily in failure testing. Under normal
|
||||||
|
conditions, blocklists are maintained automatically and do not need any manual
|
||||||
|
intervention.
|
||||||
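For example, to blocklist one client address for ten minutes, blocklist a whole
subnet, review the result, and then remove the single entry (the addresses and
duration here are illustrative):

.. prompt:: bash $

   ceph osd blocklist add 192.168.1.123 600
   ceph osd blocklist range add 192.168.1.0/24 600
   ceph osd blocklist ls
   ceph osd blocklist rm 192.168.1.123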
|
|
||||||
|
To create or delete a snapshot of a specific storage pool, run one of the
|
||||||
|
following commands:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph osd pool mksnap {pool-name} {snap-name}
|
ceph osd pool mksnap {pool-name} {snap-name}
|
||||||
ceph osd pool rmsnap {pool-name} {snap-name}
|
ceph osd pool rmsnap {pool-name} {snap-name}
|
||||||
|
|
||||||
Creates/deletes/renames a storage pool. :
|
To create, delete, or rename a specific storage pool, run one of the following
|
||||||
|
commands:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
@ -398,20 +401,20 @@ Creates/deletes/renames a storage pool. :
|
|||||||
ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
|
ceph osd pool delete {pool-name} [{pool-name} --yes-i-really-really-mean-it]
|
||||||
ceph osd pool rename {old-name} {new-name}
|
ceph osd pool rename {old-name} {new-name}
|
||||||
|
|
||||||
Changes a pool setting. :
|
To change a pool setting, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph osd pool set {pool-name} {field} {value}
|
ceph osd pool set {pool-name} {field} {value}
|
||||||
|
|
||||||
Valid fields are:
|
The following are valid fields:
|
||||||
|
|
||||||
* ``size``: Sets the number of copies of data in the pool.
|
* ``size``: The number of copies of data in the pool.
|
||||||
* ``pg_num``: The placement group number.
|
* ``pg_num``: The PG number.
|
||||||
* ``pgp_num``: Effective number when calculating pg placement.
|
* ``pgp_num``: The effective number of PGs when calculating placement.
|
||||||
* ``crush_rule``: rule number for mapping placement.
|
* ``crush_rule``: The rule number for mapping placement.
|
||||||
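For example, to set the replica count of a pool to 3 (the pool name ``mypool``
is illustrative):

.. prompt:: bash $

   ceph osd pool set mypool size 3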
|
|
||||||
Get the value of a pool setting. :
|
To retrieve the value of a pool setting, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
@ -419,40 +422,43 @@ Get the value of a pool setting. :
|
|||||||
|
|
||||||
Valid fields are:
|
Valid fields are:
|
||||||
|
|
||||||
* ``pg_num``: The placement group number.
|
* ``pg_num``: The PG number.
|
||||||
* ``pgp_num``: Effective number of placement groups when calculating placement.
|
* ``pgp_num``: The effective number of PGs when calculating placement.
|
||||||
|
|
||||||
|
To send a scrub command to a specific OSD, or to all OSDs (by using ``*``), run
|
||||||
Sends a scrub command to OSD ``{osd-num}``. To send the command to all OSDs, use ``*``. :
|
the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph osd scrub {osd-num}
|
ceph osd scrub {osd-num}
|
||||||
|
|
||||||
Sends a repair command to OSD.N. To send the command to all OSDs, use ``*``. :
|
To send a repair command to a specific OSD, or to all OSDs (by using ``*``),
|
||||||
|
run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph osd repair N
|
ceph osd repair N
|
||||||
|
|
||||||
Runs a simple throughput benchmark against OSD.N, writing ``TOTAL_DATA_BYTES``
|
You can run a simple throughput benchmark test against a specific OSD. This
|
||||||
in write requests of ``BYTES_PER_WRITE`` each. By default, the test
|
test writes a total size of ``TOTAL_DATA_BYTES`` (default: 1 GB) incrementally,
|
||||||
writes 1 GB in total in 4-MB increments.
|
in multiple write requests that each have a size of ``BYTES_PER_WRITE``
|
||||||
The benchmark is non-destructive and will not overwrite existing live
|
(default: 4 MB). The test is not destructive and it will not overwrite existing
|
||||||
OSD data, but might temporarily affect the performance of clients
|
live OSD data, but it might temporarily affect the performance of clients that
|
||||||
concurrently accessing the OSD. :
|
are concurrently accessing the OSD. To launch this benchmark test, run the
|
||||||
|
following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
|
ceph tell osd.N bench [TOTAL_DATA_BYTES] [BYTES_PER_WRITE]
|
||||||
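For example, to benchmark ``osd.0`` with the defaults and then with an explicit
100 MB written in 4 MB requests (the OSD ID and sizes are illustrative):

.. prompt:: bash $

   ceph tell osd.0 bench
   ceph tell osd.0 bench 104857600 4194304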
|
|
||||||
To clear an OSD's caches between benchmark runs, use the 'cache drop' command :
|
To clear the caches of a specific OSD during the interval between one benchmark
|
||||||
|
run and another, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph tell osd.N cache drop
|
ceph tell osd.N cache drop
|
||||||
|
|
||||||
To get the cache statistics of an OSD, use the 'cache status' command :
|
To retrieve the cache statistics of a specific OSD, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
@ -461,7 +467,8 @@ To get the cache statistics of an OSD, use the 'cache status' command :
|
|||||||
MDS Subsystem
|
MDS Subsystem
|
||||||
=============
|
=============
|
||||||
|
|
||||||
Change configuration parameters on a running mds. :
|
To change the configuration parameters of a running metadata server, run the
|
||||||
|
following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
@ -473,19 +480,20 @@ Example:
|
|||||||
|
|
||||||
ceph tell mds.0 config set debug_ms 1
|
ceph tell mds.0 config set debug_ms 1
|
||||||
|
|
||||||
Enables debug messages. :
|
The above example enables debug messages. To display the status of all metadata servers, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph mds stat
|
ceph mds stat
|
||||||
|
|
||||||
Displays the status of all metadata servers. :
|
To mark the active metadata server as failed (and to trigger failover to a standby if a standby is present), run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph mds fail 0
|
ceph mds fail 0
|
||||||
|
|
||||||
Marks the active MDS as failed, triggering failover to a standby if present.
|
|
||||||
|
|
||||||
|
|
||||||
.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
|
.. todo:: ``ceph mds`` subcommands missing docs: set, dump, getmap, stop, setmap
|
||||||
|
|
||||||
@ -493,157 +501,165 @@ Marks the active MDS as failed, triggering failover to a standby if present.
|
|||||||
Mon Subsystem
|
Mon Subsystem
|
||||||
=============
|
=============
|
||||||
|
|
||||||
Show monitor stats:
|
To display monitor statistics, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph mon stat
|
ceph mon stat
|
||||||
|
|
||||||
|
This command returns output similar to the following:
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
|
e2: 3 mons at {a=127.0.0.1:40000/0,b=127.0.0.1:40001/0,c=127.0.0.1:40002/0}, election epoch 6, quorum 0,1,2 a,b,c
|
||||||
|
|
||||||
|
There is a ``quorum`` list at the end of the output. It lists those monitor
|
||||||
|
nodes that are part of the current quorum.
|
||||||
|
|
||||||
The ``quorum`` list at the end lists monitor nodes that are part of the current quorum.
|
To retrieve this information in a more direct way, run the following command:
|
||||||
|
|
||||||
This is also available more directly:
|
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph quorum_status -f json-pretty
|
ceph quorum_status -f json-pretty
|
||||||
|
|
||||||
.. code-block:: javascript
|
|
||||||
|
|
||||||
{
|
This command returns output similar to the following:
|
||||||
"election_epoch": 6,
|
|
||||||
"quorum": [
|
.. code-block:: javascript
|
||||||
0,
|
|
||||||
1,
|
{
|
||||||
2
|
"election_epoch": 6,
|
||||||
],
|
"quorum": [
|
||||||
"quorum_names": [
|
0,
|
||||||
"a",
|
1,
|
||||||
"b",
|
2
|
||||||
"c"
|
],
|
||||||
],
|
"quorum_names": [
|
||||||
"quorum_leader_name": "a",
|
"a",
|
||||||
"monmap": {
|
"b",
|
||||||
"epoch": 2,
|
"c"
|
||||||
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
|
],
|
||||||
"modified": "2016-12-26 14:42:09.288066",
|
"quorum_leader_name": "a",
|
||||||
"created": "2016-12-26 14:42:03.573585",
|
"monmap": {
|
||||||
"features": {
|
"epoch": 2,
|
||||||
"persistent": [
|
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
|
||||||
"kraken"
|
"modified": "2016-12-26 14:42:09.288066",
|
||||||
],
|
"created": "2016-12-26 14:42:03.573585",
|
||||||
"optional": []
|
"features": {
|
||||||
},
|
"persistent": [
|
||||||
"mons": [
|
"kraken"
|
||||||
{
|
],
|
||||||
"rank": 0,
|
"optional": []
|
||||||
"name": "a",
|
},
|
||||||
"addr": "127.0.0.1:40000\/0",
|
"mons": [
|
||||||
"public_addr": "127.0.0.1:40000\/0"
|
{
|
||||||
},
|
"rank": 0,
|
||||||
{
|
"name": "a",
|
||||||
"rank": 1,
|
"addr": "127.0.0.1:40000\/0",
|
||||||
"name": "b",
|
"public_addr": "127.0.0.1:40000\/0"
|
||||||
"addr": "127.0.0.1:40001\/0",
|
},
|
||||||
"public_addr": "127.0.0.1:40001\/0"
|
{
|
||||||
},
|
"rank": 1,
|
||||||
{
|
"name": "b",
|
||||||
"rank": 2,
|
"addr": "127.0.0.1:40001\/0",
|
||||||
"name": "c",
|
"public_addr": "127.0.0.1:40001\/0"
|
||||||
"addr": "127.0.0.1:40002\/0",
|
},
|
||||||
"public_addr": "127.0.0.1:40002\/0"
|
{
|
||||||
}
|
"rank": 2,
|
||||||
]
|
"name": "c",
|
||||||
}
|
"addr": "127.0.0.1:40002\/0",
|
||||||
}
|
"public_addr": "127.0.0.1:40002\/0"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
The above will block until a quorum is reached.
|
The above command will block until a quorum is reached.
|
||||||
|
|
||||||
For a status of just a single monitor:
|
To see the status of a specific monitor, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph tell mon.[name] mon_status
|
ceph tell mon.[name] mon_status
|
||||||
|
|
||||||
where the value of ``[name]`` can be taken from ``ceph quorum_status``. Sample
|
|
||||||
output::
|
|
||||||
|
|
||||||
{
|
|
||||||
"name": "b",
|
|
||||||
"rank": 1,
|
|
||||||
"state": "peon",
|
|
||||||
"election_epoch": 6,
|
|
||||||
"quorum": [
|
|
||||||
0,
|
|
||||||
1,
|
|
||||||
2
|
|
||||||
],
|
|
||||||
"features": {
|
|
||||||
"required_con": "9025616074522624",
|
|
||||||
"required_mon": [
|
|
||||||
"kraken"
|
|
||||||
],
|
|
||||||
"quorum_con": "1152921504336314367",
|
|
||||||
"quorum_mon": [
|
|
||||||
"kraken"
|
|
||||||
]
|
|
||||||
},
|
|
||||||
"outside_quorum": [],
|
|
||||||
"extra_probe_peers": [],
|
|
||||||
"sync_provider": [],
|
|
||||||
"monmap": {
|
|
||||||
"epoch": 2,
|
|
||||||
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
|
|
||||||
"modified": "2016-12-26 14:42:09.288066",
|
|
||||||
"created": "2016-12-26 14:42:03.573585",
|
|
||||||
"features": {
|
|
||||||
"persistent": [
|
|
||||||
"kraken"
|
|
||||||
],
|
|
||||||
"optional": []
|
|
||||||
},
|
|
||||||
"mons": [
|
|
||||||
{
|
|
||||||
"rank": 0,
|
|
||||||
"name": "a",
|
|
||||||
"addr": "127.0.0.1:40000\/0",
|
|
||||||
"public_addr": "127.0.0.1:40000\/0"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"rank": 1,
|
|
||||||
"name": "b",
|
|
||||||
"addr": "127.0.0.1:40001\/0",
|
|
||||||
"public_addr": "127.0.0.1:40001\/0"
|
|
||||||
},
|
|
||||||
{
|
|
||||||
"rank": 2,
|
|
||||||
"name": "c",
|
|
||||||
"addr": "127.0.0.1:40002\/0",
|
|
||||||
"public_addr": "127.0.0.1:40002\/0"
|
|
||||||
}
|
|
||||||
]
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
A dump of the monitor state:
|
Here the value of ``[name]`` can be found by consulting the output of the
|
||||||
|
``ceph quorum_status`` command. This command returns output similar to the
|
||||||
|
following:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
{
|
||||||
|
"name": "b",
|
||||||
|
"rank": 1,
|
||||||
|
"state": "peon",
|
||||||
|
"election_epoch": 6,
|
||||||
|
"quorum": [
|
||||||
|
0,
|
||||||
|
1,
|
||||||
|
2
|
||||||
|
],
|
||||||
|
"features": {
|
||||||
|
"required_con": "9025616074522624",
|
||||||
|
"required_mon": [
|
||||||
|
"kraken"
|
||||||
|
],
|
||||||
|
"quorum_con": "1152921504336314367",
|
||||||
|
"quorum_mon": [
|
||||||
|
"kraken"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"outside_quorum": [],
|
||||||
|
"extra_probe_peers": [],
|
||||||
|
"sync_provider": [],
|
||||||
|
"monmap": {
|
||||||
|
"epoch": 2,
|
||||||
|
"fsid": "ba807e74-b64f-4b72-b43f-597dfe60ddbc",
|
||||||
|
"modified": "2016-12-26 14:42:09.288066",
|
||||||
|
"created": "2016-12-26 14:42:03.573585",
|
||||||
|
"features": {
|
||||||
|
"persistent": [
|
||||||
|
"kraken"
|
||||||
|
],
|
||||||
|
"optional": []
|
||||||
|
},
|
||||||
|
"mons": [
|
||||||
|
{
|
||||||
|
"rank": 0,
|
||||||
|
"name": "a",
|
||||||
|
"addr": "127.0.0.1:40000\/0",
|
||||||
|
"public_addr": "127.0.0.1:40000\/0"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"rank": 1,
|
||||||
|
"name": "b",
|
||||||
|
"addr": "127.0.0.1:40001\/0",
|
||||||
|
"public_addr": "127.0.0.1:40001\/0"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"rank": 2,
|
||||||
|
"name": "c",
|
||||||
|
"addr": "127.0.0.1:40002\/0",
|
||||||
|
"public_addr": "127.0.0.1:40002\/0"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
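For example, to query the monitor named ``b`` from the sample quorum above (the
monitor name is illustrative):

.. prompt:: bash $

   ceph tell mon.b mon_status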
|
|
||||||
|
To see a dump of the monitor state, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph mon dump
|
ceph mon dump
|
||||||
|
|
||||||
|
This command returns output similar to the following:
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
dumped monmap epoch 2
|
dumped monmap epoch 2
|
||||||
epoch 2
|
epoch 2
|
||||||
fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
|
fsid ba807e74-b64f-4b72-b43f-597dfe60ddbc
|
||||||
last_changed 2016-12-26 14:42:09.288066
|
last_changed 2016-12-26 14:42:09.288066
|
||||||
created 2016-12-26 14:42:03.573585
|
created 2016-12-26 14:42:03.573585
|
||||||
0: 127.0.0.1:40000/0 mon.a
|
0: 127.0.0.1:40000/0 mon.a
|
||||||
1: 127.0.0.1:40001/0 mon.b
|
1: 127.0.0.1:40001/0 mon.b
|
||||||
2: 127.0.0.1:40002/0 mon.c
|
2: 127.0.0.1:40002/0 mon.c
|
||||||
|
|
||||||
|
@ -1043,6 +1043,8 @@ operations are served from the primary OSD of each PG. For erasure-coded pools,
|
|||||||
however, the speed of read operations can be increased by enabling **fast
|
however, the speed of read operations can be increased by enabling **fast
|
||||||
read** (see :ref:`pool-settings`).
|
read** (see :ref:`pool-settings`).
|
||||||
|
|
||||||
|
.. _rados_ops_primary_affinity:
|
||||||
|
|
||||||
Primary Affinity
|
Primary Affinity
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
|
@ -110,6 +110,8 @@ To remove an erasure code profile::
|
|||||||
|
|
||||||
If the profile is referenced by a pool, the deletion will fail.
|
If the profile is referenced by a pool, the deletion will fail.
|
||||||
|
|
||||||
|
.. warning:: Removing an erasure code profile using ``osd erasure-code-profile rm`` does not automatically delete the CRUSH rule associated with that profile. It is recommended to manually remove the associated CRUSH rule by running ``ceph osd crush rule remove {rule-name}`` in order to avoid unexpected behavior.
|
||||||
|
|
||||||
osd erasure-code-profile get
|
osd erasure-code-profile get
|
||||||
============================
|
============================
|
||||||
|
|
||||||
|
@ -1226,8 +1226,8 @@ The health check will be silenced for a specific pool only if
|
|||||||
POOL_APP_NOT_ENABLED
|
POOL_APP_NOT_ENABLED
|
||||||
____________________
|
____________________
|
||||||
|
|
||||||
A pool exists that contains one or more objects, but the pool has not been
|
A pool exists but the pool has not been tagged for use by a particular
|
||||||
tagged for use by a particular application.
|
application.
|
||||||
|
|
||||||
To resolve this issue, tag the pool for use by an application. For
|
To resolve this issue, tag the pool for use by an application. For
|
||||||
example, if the pool is used by RBD, run the following command:
|
example, if the pool is used by RBD, run the following command:
|
||||||
@ -1406,6 +1406,31 @@ other performance issue with the OSDs.
|
|||||||
The exact size of the snapshot trim queue is reported by the ``snaptrimq_len``
|
The exact size of the snapshot trim queue is reported by the ``snaptrimq_len``
|
||||||
field of ``ceph pg ls -f json-detail``.
|
field of ``ceph pg ls -f json-detail``.
|
||||||
|
|
||||||
|
Stretch Mode
|
||||||
|
------------
|
||||||
|
|
||||||
|
INCORRECT_NUM_BUCKETS_STRETCH_MODE
|
||||||
|
__________________________________
|
||||||
|
|
||||||
|
Stretch mode currently supports only 2 dividing buckets with OSDs. This warning indicates
|
||||||
|
that the number of dividing buckets is not equal to 2 after stretch mode is enabled.
|
||||||
|
You can expect unpredictable failures and MON assertions until the condition is fixed.
|
||||||
|
|
||||||
|
We encourage you to fix this by removing the extra dividing buckets or by bumping the
|
||||||
|
number of dividing buckets to 2.
|
||||||
|
|
||||||
|
UNEVEN_WEIGHTS_STRETCH_MODE
|
||||||
|
___________________________
|
||||||
|
|
||||||
|
The 2 dividing buckets must have equal weights when stretch mode is enabled.
|
||||||
|
This warning suggests that the 2 dividing buckets have uneven weights after
|
||||||
|
stretch mode is enabled. This is not immediately fatal; however, you can expect
|
||||||
|
Ceph to be confused when trying to process transitions between dividing buckets.
|
||||||
|
|
||||||
|
We encourage you to fix this by making the weights even on both dividing buckets.
|
||||||
|
This can be done by making sure the combined weight of the OSDs on each dividing
|
||||||
|
bucket is the same.
|
||||||
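As a hedged sketch, you can compare the combined CRUSH weights of the two
dividing buckets and then adjust an individual OSD's CRUSH weight if needed
(the OSD name and weight shown are illustrative):

.. prompt:: bash $

   ceph osd tree
   ceph osd crush reweight osd.3 1.8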
|
|
||||||
Miscellaneous
|
Miscellaneous
|
||||||
-------------
|
-------------
|
||||||
|
|
||||||
|
@ -39,8 +39,9 @@ CRUSH algorithm.
|
|||||||
erasure-code
|
erasure-code
|
||||||
cache-tiering
|
cache-tiering
|
||||||
placement-groups
|
placement-groups
|
||||||
balancer
|
|
||||||
upmap
|
upmap
|
||||||
|
read-balancer
|
||||||
|
balancer
|
||||||
crush-map
|
crush-map
|
||||||
crush-map-edits
|
crush-map-edits
|
||||||
stretch-mode
|
stretch-mode
|
||||||
|
@ -10,10 +10,11 @@ directly to specific OSDs. For this reason, tracking system faults
|
|||||||
requires finding the `placement group`_ (PG) and the underlying OSDs at the
|
requires finding the `placement group`_ (PG) and the underlying OSDs at the
|
||||||
root of the problem.
|
root of the problem.
|
||||||
|
|
||||||
.. tip:: A fault in one part of the cluster might prevent you from accessing a
|
.. tip:: A fault in one part of the cluster might prevent you from accessing a
|
||||||
particular object, but that doesn't mean that you are prevented from accessing other objects.
|
particular object, but that doesn't mean that you are prevented from
|
||||||
When you run into a fault, don't panic. Just follow the steps for monitoring
|
accessing other objects. When you run into a fault, don't panic. Just
|
||||||
your OSDs and placement groups, and then begin troubleshooting.
|
follow the steps for monitoring your OSDs and placement groups, and then
|
||||||
|
begin troubleshooting.
|
||||||
|
|
||||||
Ceph is self-repairing. However, when problems persist, monitoring OSDs and
|
Ceph is self-repairing. However, when problems persist, monitoring OSDs and
|
||||||
placement groups will help you identify the problem.
|
placement groups will help you identify the problem.
|
||||||
@ -22,17 +23,20 @@ placement groups will help you identify the problem.
|
|||||||
Monitoring OSDs
|
Monitoring OSDs
|
||||||
===============
|
===============
|
||||||
|
|
||||||
An OSD's status is as follows: it is either in the cluster (``in``) or out of the cluster
|
An OSD is either *in* service (``in``) or *out* of service (``out``). An OSD is
|
||||||
(``out``); likewise, it is either up and running (``up``) or down and not
|
either running and reachable (``up``), or it is not running and not reachable
|
||||||
running (``down``). If an OSD is ``up``, it can be either ``in`` the cluster
|
(``down``).
|
||||||
(if so, you can read and write data) or ``out`` of the cluster. If the OSD was previously
|
|
||||||
``in`` the cluster but was recently moved ``out`` of the cluster, Ceph will migrate its
|
|
||||||
PGs to other OSDs. If an OSD is ``out`` of the cluster, CRUSH will
|
|
||||||
not assign any PGs to that OSD. If an OSD is ``down``, it should also be
|
|
||||||
``out``.
|
|
||||||
|
|
||||||
.. note:: If an OSD is ``down`` and ``in``, then there is a problem and the cluster
|
If an OSD is ``up``, it may be either ``in`` service (clients can read and
|
||||||
is not in a healthy state.
|
write data) or it is ``out`` of service. If the OSD was ``in`` but was then
|
||||||
|
set to the ``out`` state due to a failure or a manual action, Ceph will migrate
|
||||||
|
placement groups to other OSDs to maintain the configured redundancy.
|
||||||
|
|
||||||
|
If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
|
||||||
|
If an OSD is ``down``, it will normally also be marked ``out`` after a grace period.
|
||||||
|
|
||||||
|
.. note:: If an OSD is ``down`` and ``in``, there is a problem and this
|
||||||
|
indicates that the cluster is not in a healthy state.
|
||||||
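A quick way to review the current ``up``/``down`` and ``in``/``out`` states of
your OSDs is shown below (the output depends on your cluster):

.. prompt:: bash $

   ceph osd stat
   ceph osd tree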
|
|
||||||
.. ditaa::
|
.. ditaa::
|
||||||
|
|
||||||
|
@ -210,6 +210,11 @@ process. We recommend constraining each pool so that it belongs to only one
|
|||||||
root (that is, one OSD class) to silence the warning and ensure a successful
|
root (that is, one OSD class) to silence the warning and ensure a successful
|
||||||
scaling process.
|
scaling process.
|
||||||
|
|
||||||
|
.. _managing_bulk_flagged_pools:
|
||||||
|
|
||||||
|
Managing pools that are flagged with ``bulk``
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
|
If a pool is flagged ``bulk``, then the autoscaler starts the pool with a full
|
||||||
complement of PGs and then scales down the number of PGs only if the usage
|
complement of PGs and then scales down the number of PGs only if the usage
|
||||||
ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
|
ratio across the pool is uneven. However, if a pool is not flagged ``bulk``,
|
||||||
@ -659,6 +664,7 @@ In releases of Ceph that are Nautilus and later (inclusive), when the
|
|||||||
``pg_num``. This process manifests as periods of remapping of PGs and of
|
``pg_num``. This process manifests as periods of remapping of PGs and of
|
||||||
backfill, and is expected behavior and normal.
|
backfill, and is expected behavior and normal.
|
||||||
|
|
||||||
|
.. _rados_ops_pgs_get_pg_num:
|
||||||
|
|
||||||
Get the Number of PGs
|
Get the Number of PGs
|
||||||
=====================
|
=====================
|
||||||
|
@ -46,12 +46,49 @@ operations. Do not create or manipulate pools with these names.
|
|||||||
List Pools
|
List Pools
|
||||||
==========
|
==========
|
||||||
|
|
||||||
To list your cluster's pools, run the following command:
|
There are multiple ways to get the list of pools in your cluster.
|
||||||
|
|
||||||
|
To list just your cluster's pool names (good for scripting), execute:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
ceph osd pool ls
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
.rgw.root
|
||||||
|
default.rgw.log
|
||||||
|
default.rgw.control
|
||||||
|
default.rgw.meta
|
||||||
|
|
||||||
|
To list your cluster's pools with the pool number, run the following command:
|
||||||
|
|
||||||
.. prompt:: bash $
|
.. prompt:: bash $
|
||||||
|
|
||||||
ceph osd lspools
|
ceph osd lspools
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
1 .rgw.root
|
||||||
|
2 default.rgw.log
|
||||||
|
3 default.rgw.control
|
||||||
|
4 default.rgw.meta
|
||||||
|
|
||||||
|
To list your cluster's pools with additional information, execute:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
ceph osd pool ls detail
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
pool 1 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00
|
||||||
|
pool 2 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 21 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00
|
||||||
|
pool 3 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 23 flags hashpspool stripe_width 0 application rgw read_balance_score 4.00
|
||||||
|
pool 4 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 25 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 4.00
|
||||||
|
|
||||||
|
To get even more information, you can execute this command with the ``--format`` (or ``-f``) option and the ``json``, ``json-pretty``, ``xml`` or ``xml-pretty`` value.
|
||||||
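For example (a sketch; the content is the same as above, only encoded
differently):

.. prompt:: bash $

   ceph osd pool ls detail --format json-pretty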
|
|
||||||
.. _createpool:
|
.. _createpool:
|
||||||
|
|
||||||
Creating a Pool
|
Creating a Pool
|
||||||
@ -462,82 +499,6 @@ You may set values for the following keys:
|
|||||||
:Type: Integer
|
:Type: Integer
|
||||||
:Valid Range: ``1`` sets flag, ``0`` unsets flag
|
:Valid Range: ``1`` sets flag, ``0`` unsets flag
|
||||||
|
|
||||||
.. _hit_set_type:
|
|
||||||
|
|
||||||
.. describe:: hit_set_type
|
|
||||||
|
|
||||||
:Description: Enables HitSet tracking for cache pools.
|
|
||||||
For additional information, see `Bloom Filter`_.
|
|
||||||
:Type: String
|
|
||||||
:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object``
|
|
||||||
:Default: ``bloom``. Other values are for testing.
|
|
||||||
|
|
||||||
.. _hit_set_count:
|
|
||||||
|
|
||||||
.. describe:: hit_set_count
|
|
||||||
|
|
||||||
:Description: Determines the number of HitSets to store for cache pools. The
|
|
||||||
higher the value, the more RAM is consumed by the ``ceph-osd``
|
|
||||||
daemon.
|
|
||||||
:Type: Integer
|
|
||||||
:Valid Range: ``1``. Agent doesn't handle > ``1`` yet.
|
|
||||||
|
|
||||||
.. _hit_set_period:
|
|
||||||
|
|
||||||
.. describe:: hit_set_period
|
|
||||||
|
|
||||||
:Description: Determines the duration of a HitSet period (in seconds) for
|
|
||||||
cache pools. The higher the value, the more RAM is consumed
|
|
||||||
by the ``ceph-osd`` daemon.
|
|
||||||
:Type: Integer
|
|
||||||
:Example: ``3600`` (3600 seconds: one hour)
|
|
||||||
|
|
||||||
.. _hit_set_fpp:
|
|
||||||
|
|
||||||
.. describe:: hit_set_fpp
|
|
||||||
|
|
||||||
:Description: Determines the probability of false positives for the
|
|
||||||
``bloom`` HitSet type. For additional information, see `Bloom
|
|
||||||
Filter`_.
|
|
||||||
:Type: Double
|
|
||||||
:Valid Range: ``0.0`` - ``1.0``
|
|
||||||
:Default: ``0.05``
|
|
||||||
|
|
||||||
.. _cache_target_dirty_ratio:
|
|
||||||
|
|
||||||
.. describe:: cache_target_dirty_ratio
|
|
||||||
|
|
||||||
:Description: Sets a flush threshold for the percentage of the cache pool
|
|
||||||
containing modified (dirty) objects. When this threshold is
|
|
||||||
reached, the cache-tiering agent will flush these objects to
|
|
||||||
the backing storage pool.
|
|
||||||
:Type: Double
|
|
||||||
:Default: ``.4``
|
|
||||||
|
|
||||||
.. _cache_target_dirty_high_ratio:
|
|
||||||
|
|
||||||
.. describe:: cache_target_dirty_high_ratio
|
|
||||||
|
|
||||||
:Description: Sets a flush threshold for the percentage of the cache pool
|
|
||||||
containing modified (dirty) objects. When this threshold is
|
|
||||||
reached, the cache-tiering agent will flush these objects to
|
|
||||||
the backing storage pool with a higher speed (as compared with
|
|
||||||
``cache_target_dirty_ratio``).
|
|
||||||
:Type: Double
|
|
||||||
:Default: ``.6``
|
|
||||||
|
|
||||||
.. _cache_target_full_ratio:
|
|
||||||
|
|
||||||
.. describe:: cache_target_full_ratio
|
|
||||||
|
|
||||||
:Description: Sets an eviction threshold for the percentage of the cache
|
|
||||||
pool containing unmodified (clean) objects. When this
|
|
||||||
threshold is reached, the cache-tiering agent will evict
|
|
||||||
these objects from the cache pool.
|
|
||||||
|
|
||||||
:Type: Double
|
|
||||||
:Default: ``.8``
|
|
||||||
|
|
||||||
.. _target_max_bytes:
|
.. _target_max_bytes:
|
||||||
|
|
||||||
.. describe:: target_max_bytes
|
.. describe:: target_max_bytes
|
||||||
@ -556,41 +517,6 @@ You may set values for the following keys:
|
|||||||
:Type: Integer
|
:Type: Integer
|
||||||
:Example: ``1000000`` #1M objects
|
:Example: ``1000000`` #1M objects
|
||||||
|
|
||||||
|
|
||||||
.. describe:: hit_set_grade_decay_rate
|
|
||||||
|
|
||||||
:Description: Sets the temperature decay rate between two successive
|
|
||||||
HitSets.
|
|
||||||
:Type: Integer
|
|
||||||
:Valid Range: 0 - 100
|
|
||||||
:Default: ``20``
|
|
||||||
|
|
||||||
.. describe:: hit_set_search_last_n
|
|
||||||
|
|
||||||
:Description: Count at most N appearances in HitSets. Used for temperature
|
|
||||||
calculation.
|
|
||||||
:Type: Integer
|
|
||||||
:Valid Range: 0 - hit_set_count
|
|
||||||
:Default: ``1``
|
|
||||||
|
|
||||||
.. _cache_min_flush_age:
|
|
||||||
|
|
||||||
.. describe:: cache_min_flush_age
|
|
||||||
|
|
||||||
:Description: Sets the time (in seconds) before the cache-tiering agent
|
|
||||||
flushes an object from the cache pool to the storage pool.
|
|
||||||
:Type: Integer
|
|
||||||
:Example: ``600`` (600 seconds: ten minutes)
|
|
||||||
|
|
||||||
.. _cache_min_evict_age:
|
|
||||||
|
|
||||||
.. describe:: cache_min_evict_age
|
|
||||||
|
|
||||||
:Description: Sets the time (in seconds) before the cache-tiering agent
|
|
||||||
evicts an object from the cache pool.
|
|
||||||
:Type: Integer
|
|
||||||
:Example: ``1800`` (1800 seconds: thirty minutes)
|
|
||||||
|
|
||||||
.. _fast_read:
|
.. _fast_read:
|
||||||
|
|
||||||
.. describe:: fast_read
|
.. describe:: fast_read
|
||||||
@ -702,56 +628,6 @@ You may get values from the following keys:
|
|||||||
:Description: See crush_rule_.
|
:Description: See crush_rule_.
|
||||||
|
|
||||||
|
|
||||||
``hit_set_type``
|
|
||||||
|
|
||||||
:Description: See hit_set_type_.
|
|
||||||
|
|
||||||
:Type: String
|
|
||||||
:Valid Settings: ``bloom``, ``explicit_hash``, ``explicit_object``
|
|
||||||
|
|
||||||
|
|
||||||
``hit_set_count``
|
|
||||||
|
|
||||||
:Description: See hit_set_count_.
|
|
||||||
|
|
||||||
:Type: Integer
|
|
||||||
|
|
||||||
|
|
||||||
``hit_set_period``
|
|
||||||
|
|
||||||
:Description: See hit_set_period_.
|
|
||||||
|
|
||||||
:Type: Integer
|
|
||||||
|
|
||||||
|
|
||||||
``hit_set_fpp``
|
|
||||||
|
|
||||||
:Description: See hit_set_fpp_.
|
|
||||||
|
|
||||||
:Type: Double
|
|
||||||
|
|
||||||
|
|
||||||
``cache_target_dirty_ratio``
|
|
||||||
|
|
||||||
:Description: See cache_target_dirty_ratio_.
|
|
||||||
|
|
||||||
:Type: Double
|
|
||||||
|
|
||||||
|
|
||||||
``cache_target_dirty_high_ratio``
|
|
||||||
|
|
||||||
:Description: See cache_target_dirty_high_ratio_.
|
|
||||||
|
|
||||||
:Type: Double
|
|
||||||
|
|
||||||
|
|
||||||
``cache_target_full_ratio``
|
|
||||||
|
|
||||||
:Description: See cache_target_full_ratio_.
|
|
||||||
|
|
||||||
:Type: Double
|
|
||||||
|
|
||||||
|
|
||||||
``target_max_bytes``
|
``target_max_bytes``
|
||||||
|
|
||||||
:Description: See target_max_bytes_.
|
:Description: See target_max_bytes_.
|
||||||
@ -766,20 +642,6 @@ You may get values from the following keys:
|
|||||||
:Type: Integer
|
:Type: Integer
|
||||||
|
|
||||||
|
|
||||||
``cache_min_flush_age``
|
|
||||||
|
|
||||||
:Description: See cache_min_flush_age_.
|
|
||||||
|
|
||||||
:Type: Integer
|
|
||||||
|
|
||||||
|
|
||||||
``cache_min_evict_age``
|
|
||||||
|
|
||||||
:Description: See cache_min_evict_age_.
|
|
||||||
|
|
||||||
:Type: Integer
|
|
||||||
|
|
||||||
|
|
||||||
``fast_read``
|
``fast_read``
|
||||||
|
|
||||||
:Description: See fast_read_.
|
:Description: See fast_read_.
|
||||||
@ -876,6 +738,10 @@ Ceph will list pools and highlight the ``replicated size`` attribute. By
|
|||||||
default, Ceph creates two replicas of an object (a total of three copies, for a
|
default, Ceph creates two replicas of an object (a total of three copies, for a
|
||||||
size of ``3``).
|
size of ``3``).
|
||||||
|
|
||||||
|
Managing pools that are flagged with ``--bulk``
|
||||||
|
===============================================
|
||||||
|
See :ref:`managing_bulk_flagged_pools`.
|
||||||
|
|
||||||
|
|
||||||
.. _pgcalc: https://old.ceph.com/pgcalc/
|
.. _pgcalc: https://old.ceph.com/pgcalc/
|
||||||
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
|
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref
|
||||||
|
64
ceph/doc/rados/operations/read-balancer.rst
Normal file
64
ceph/doc/rados/operations/read-balancer.rst
Normal file
@ -0,0 +1,64 @@
|
|||||||
|
.. _read_balancer:
|
||||||
|
|
||||||
|
=======================================
|
||||||
|
Operating the Read (Primary) Balancer
|
||||||
|
=======================================
|
||||||
|
|
||||||
|
You might be wondering: How can I improve performance in my Ceph cluster?
|
||||||
|
One important data point you can check is the ``read_balance_score`` on each
|
||||||
|
of your replicated pools.
|
||||||
|
|
||||||
|
This metric, available via ``ceph osd pool ls detail`` (see :ref:`rados_pools`
|
||||||
|
for more details), indicates read performance, or how balanced the primaries are
|
||||||
|
for each replicated pool. In most cases, if a ``read_balance_score`` is above 1
|
||||||
|
(for instance, 1.5), this means that your pool has unbalanced primaries and that
|
||||||
|
you may want to try improving your read performance with the read balancer.
|
||||||
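For example, a quick way to pull just the scores out of the pool listing (a
sketch; pool names and scores will differ on your cluster):

.. prompt:: bash $

   ceph osd pool ls detail | grep read_balance_score

Each matching line ends with the pool's ``read_balance_score``.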
|
|
||||||
|
Online Optimization
|
||||||
|
===================
|
||||||
|
|
||||||
|
At present, there is no online option for the read balancer. However, we plan to add
|
||||||
|
the read balancer as an option to the :ref:`balancer` in the next Ceph version
|
||||||
|
so it can be enabled to run automatically in the background like the upmap balancer.
|
||||||
|
|
||||||
|
Offline Optimization
|
||||||
|
====================
|
||||||
|
|
||||||
|
Primaries are updated with an offline optimizer that is built into the
|
||||||
|
:ref:`osdmaptool`.
|
||||||
|
|
||||||
|
#. Grab the latest copy of your osdmap:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
ceph osd getmap -o om
|
||||||
|
|
||||||
|
#. Run the optimizer:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
osdmaptool om --read out.txt --read-pool <pool name> [--vstart]
|
||||||
|
|
||||||
|
It is highly recommended that you run the capacity balancer before running the
|
||||||
|
read balancer to ensure optimal results. See :ref:`upmap` for details on how to balance
|
||||||
|
capacity in a cluster.
|
||||||
|
|
||||||
|
#. Apply the changes:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
source out.txt
|
||||||
|
|
||||||
|
In the above example, the proposed changes are written to the output file
|
||||||
|
``out.txt``. The commands in this procedure are normal Ceph CLI commands
|
||||||
|
that can be run in order to apply the changes to the cluster.
|
||||||
|
|
||||||
|
If you are working in a vstart cluster, you may pass the ``--vstart`` parameter
|
||||||
|
as shown above so that the CLI commands are formatted with the ``./bin/`` prefix.
|
||||||
|
|
||||||
|
Note that any time the number of PGs changes (for instance, if the PG autoscaler [:ref:`pg-autoscaler`]
|
||||||
|
kicks in), you should consider rechecking the scores and rerunning the balancer if needed.
|
||||||
|
|
||||||
|
To see some details about what the tool is doing, you can pass
|
||||||
|
``--debug-osd 10`` to ``osdmaptool``. To see even more details, pass
|
||||||
|
``--debug-osd 20`` to ``osdmaptool``.
|
@ -1,7 +1,8 @@

.. _upmap:

=======================================
Using pg-upmap
=======================================

In Luminous v12.2.z and later releases, there is a *pg-upmap* exception table
in the OSDMap that allows the cluster to explicitly map specific PGs to
@ -11,6 +12,9 @@ in most cases, uniformly distribute PGs across OSDs.

However, there is an important caveat when it comes to this new feature: it
requires all clients to understand the new *pg-upmap* structure in the OSDMap.

Online Optimization
===================

Enabling
--------

@ -40,17 +44,17 @@ command:

   ceph features

Balancer Module
---------------

The `balancer` module for ``ceph-mgr`` will automatically balance the number of
PGs per OSD. See :ref:`balancer`

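As a sketch of the typical sequence for turning on automatic balancing in
``upmap`` mode (other modes and the full details are covered in the balancer
documentation linked above):

.. prompt:: bash $

   ceph balancer mode upmap
   ceph balancer on
   ceph balancer status
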
Offline Optimization
====================

Upmap entries are updated with an offline optimizer that is built into the
:ref:`osdmaptool`.

#. Grab the latest copy of your osdmap:

@ -2,12 +2,18 @@
The Ceph Community
====================

Ceph-users email list
=====================

The Ceph community is an excellent source of information and help. For
operational issues with Ceph we recommend that you `subscribe to the ceph-users
email list`_. When you no longer want to receive emails, you can `unsubscribe
from the ceph-users email list`_.

Ceph-devel email list
=====================

You can also `subscribe to the ceph-devel email list`_. You should do so if
your issue is:

- Likely related to a bug
@ -16,11 +22,14 @@ your issue is:
- Related to your own builds

If you no longer want to receive emails from the ``ceph-devel`` email list, you
can `unsubscribe from the ceph-devel email list`_.

Ceph report
===========

.. tip:: Community members can help you if you provide them with detailed
   information about your problem. Attach the output of the ``ceph report``
   command to help people understand your issues.

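For example, one way to capture the report to a file that you can attach to a
mailing-list post or tracker issue (the file name is arbitrary):

.. prompt:: bash $

   ceph report > ceph-report.json
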
.. _subscribe to the ceph-devel email list: mailto:dev-join@ceph.io
.. _unsubscribe from the ceph-devel email list: mailto:dev-leave@ceph.io

@ -9,59 +9,72 @@ you can profile Ceph's CPU usage. See `Installing Oprofile`_ for details.
Initializing oprofile
=====================

``oprofile`` must be initialized the first time it is used. Locate the
``vmlinux`` image that corresponds to the kernel you are running:

.. prompt:: bash $

   ls /boot
   sudo opcontrol --init
   sudo opcontrol --setup --vmlinux={path-to-image} --separate=library --callgraph=6


Starting oprofile
=================

Run the following command to start ``oprofile``:

.. prompt:: bash $

   opcontrol --start


Stopping oprofile
=================

Run the following command to stop ``oprofile``:

.. prompt:: bash $

   opcontrol --stop


Retrieving oprofile Results
===========================

Run the following command to retrieve the top ``cmon`` results:

.. prompt:: bash $

   opreport -gal ./cmon | less

Run the following command to retrieve the top ``cmon`` results, with call
graphs attached:

.. prompt:: bash $

   opreport -cal ./cmon | less

.. important:: After you have reviewed the results, reset ``oprofile`` before
   running it again. The act of resetting ``oprofile`` removes data from the
   session directory.


Resetting oprofile
==================

Run the following command to reset ``oprofile``:

.. prompt:: bash $

   sudo opcontrol --reset

.. important:: Reset ``oprofile`` after analyzing data. This ensures that
   results from prior tests do not get mixed in with the results of the
   current test.

.. _oprofile: http://oprofile.sourceforge.net/about/
.. _Installing Oprofile: ../../../dev/cpu-profiler

@ -2,10 +2,10 @@
Troubleshooting
=================

You may encounter situations that require you to examine your configuration,
consult the documentation, modify your logging output, troubleshoot monitors
and OSDs, profile memory and CPU usage, and, in the last resort, reach out to
the Ceph community for help.

.. toctree::
   :maxdepth: 1

@ -2,16 +2,23 @@
Memory Profiling
==================

Ceph Monitor, OSD, and MDS can report ``TCMalloc`` heap profiles. Install
``google-perftools`` if you want to generate these. Your OS distribution might
package this under a different name (for example, ``gperftools``), and your OS
distribution might use a different package manager. Run a command similar to
this one to install ``google-perftools``:

.. prompt:: bash

   sudo apt-get install google-perftools

The profiler dumps output to your ``log file`` directory (``/var/log/ceph``).
See `Logging and Debugging`_ for details.

To view the profiler logs with Google's performance tools, run the following
command:

.. prompt:: bash

   google-pprof --text {path-to-daemon} {log-path/filename}

@ -48,9 +55,9 @@ For example::

   0.0   0.4%  99.2%   0.0   0.6% decode_message
   ...

Performing another heap dump on the same daemon creates another file. It is
convenient to compare the new file to a file created by a previous heap dump
to show what has grown in the interval. For example::

   $ google-pprof --text --base out/osd.0.profile.0001.heap \
      ceph-osd out/osd.0.profile.0003.heap

@ -60,112 +67,137 @@ in the interval. For instance::

   0.0   0.9%  97.7%   0.0  26.1% ReplicatedPG::do_op
   0.0   0.8%  98.5%   0.0   0.8% __gnu_cxx::new_allocator::allocate

See `Google Heap Profiler`_ for additional details.

After you have installed the heap profiler, start your cluster and begin using
the heap profiler. You can enable or disable the heap profiler at runtime, or
ensure that it runs continuously. When running commands based on the examples
that follow, do the following:

#. replace ``{daemon-type}`` with ``mon``, ``osd`` or ``mds``
#. replace ``{daemon-id}`` with the OSD number or the MON ID or the MDS ID


Starting the Profiler
---------------------

To start the heap profiler, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}.{daemon-id} heap start_profiler

For example:

.. prompt:: bash

   ceph tell osd.1 heap start_profiler

Alternatively, if the ``CEPH_HEAP_PROFILER_INIT=true`` variable is found in
the environment, the profile will be started when the daemon starts running.

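As a sketch only (assuming a package-based install on which you can start the
daemon manually in the foreground, and using ``osd.1`` purely as an example),
the variable can be set for a single run like this:

.. prompt:: bash

   sudo -u ceph env CEPH_HEAP_PROFILER_INIT=true ceph-osd -f --cluster ceph --id 1
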
Printing Stats
--------------

To print out statistics, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}.{daemon-id} heap stats

For example:

.. prompt:: bash

   ceph tell osd.0 heap stats

.. note:: The reporting of stats with this command does not require the
   profiler to be running and does not dump the heap allocation information
   to a file.


Dumping Heap Information
------------------------

To dump heap information, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}.{daemon-id} heap dump

For example:

.. prompt:: bash

   ceph tell mds.a heap dump

.. note:: Dumping heap information works only when the profiler is running.


Releasing Memory
----------------

To release memory that ``tcmalloc`` has allocated but which is not being used
by the Ceph daemon itself, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}.{daemon-id} heap release

For example:

.. prompt:: bash

   ceph tell osd.2 heap release


Stopping the Profiler
---------------------

To stop the heap profiler, run a command of the following form:

.. prompt:: bash

   ceph tell {daemon-type}.{daemon-id} heap stop_profiler

For example:

.. prompt:: bash

   ceph tell osd.0 heap stop_profiler

.. _Logging and Debugging: ../log-and-debug
.. _Google Heap Profiler: http://goog-perftools.sourceforge.net/doc/heap_profiler.html

Alternative Methods of Memory Profiling
----------------------------------------

Running Massif heap profiler with Valgrind
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Massif heap profiler tool can be used with Valgrind to measure how much
heap memory is used. This method is well-suited to troubleshooting RadosGW.

See the `Massif documentation
<https://valgrind.org/docs/manual/ms-manual.html>`_ for more information.

Install Valgrind from the package manager for your distribution, then start
the Ceph daemon you want to troubleshoot:

.. prompt:: bash

   sudo -u ceph valgrind --max-threads=1024 --tool=massif /usr/bin/radosgw -f --cluster ceph --name NAME --setuser ceph --setgroup ceph

When this command has completed its run, a file with a name of the form
``massif.out.<pid>`` will be saved in your current working directory. To run
the command above, the user who runs it must have write permissions in the
current directory.

Run the ``ms_print`` command to get a graph and statistics from the collected
data in the ``massif.out.<pid>`` file:

.. prompt:: bash

   ms_print massif.out.12345

The output of this command is helpful when submitting a bug report.

@ -6,70 +6,78 @@

.. index:: monitor, high availability

Even if a cluster experiences monitor-related problems, the cluster is not
necessarily in danger of going down. If a cluster has lost multiple monitors,
it can still remain up and running as long as there are enough surviving
monitors to form a quorum.

If your cluster is having monitor-related problems, we recommend that you
consult the following troubleshooting information.

Initial Troubleshooting
=======================

The first steps in the process of troubleshooting Ceph Monitors involve making
sure that the Monitors are running and that they can be reached over the
network. Follow the steps in this section to rule out the simplest causes of
Monitor malfunction.

#. **Make sure that the Monitors are running.**

   Make sure that the Monitor (*mon*) daemon processes (``ceph-mon``) are
   running. It might be the case that the mons have not been restarted after
   an upgrade. Checking for this simple oversight can save hours of
   painstaking troubleshooting.

   It is also important to make sure that the manager daemons (``ceph-mgr``)
   are running. Remember that typical cluster configurations provide one
   Manager (``ceph-mgr``) for each Monitor (``ceph-mon``).

   .. note:: In releases prior to v1.12.5, Rook will not run more than two
      managers.

#. **Make sure that you can reach the Monitor nodes.**

   In certain rare cases, ``iptables`` rules might be blocking access to
   Monitor nodes or TCP ports. These rules might be left over from earlier
   stress testing or rule development. To check for the presence of such
   rules, SSH into each Monitor node and use ``telnet`` or ``nc`` or a
   similar tool to attempt to connect to each of the other Monitor nodes on
   ports ``tcp/3300`` and ``tcp/6789``. (Example commands for these checks
   appear after this list.)

#. **Make sure that the "ceph status" command runs and receives a reply from the cluster.**

   If the ``ceph status`` command receives a reply from the cluster, then the
   cluster is up and running. Monitors answer to a ``status`` request only if
   there is a formed quorum. Confirm that one or more ``mgr`` daemons are
   reported as running. In a cluster with no deficiencies, ``ceph status``
   will report that all ``mgr`` daemons are running.

   If the ``ceph status`` command does not receive a reply from the cluster,
   then there are probably not enough Monitors ``up`` to form a quorum. If
   the ``ceph -s`` command is run with no further options specified, it
   connects to an arbitrarily selected Monitor. In certain cases, however, it
   might be helpful to connect to a specific Monitor (or to several specific
   Monitors in sequence) by adding the ``-m`` flag to the command: for
   example, ``ceph status -m mymon1``.

#. **None of this worked. What now?**

   If the above solutions have not resolved your problems, you might find it
   helpful to examine each individual Monitor in turn. Even if no quorum has
   been formed, it is possible to contact each Monitor individually and
   request its status by using the ``ceph tell mon.ID mon_status`` command
   (here ``ID`` is the Monitor's identifier).

   Run the ``ceph tell mon.ID mon_status`` command for each Monitor in the
   cluster. For more on this command's output, see :ref:`Understanding
   mon_status
   <rados_troubleshoting_troubleshooting_mon_understanding_mon_status>`.

   There is also an alternative method for contacting each individual
   Monitor: SSH into each Monitor node and query the daemon's admin socket.
   See :ref:`Using the Monitor's Admin
   Socket<rados_troubleshoting_troubleshooting_mon_using_admin_socket>`.

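The following commands are a minimal sketch of the checks described in steps 1
through 3 above. They assume a systemd-managed package installation in which
the Monitor ID is the node's short host name, and a hypothetical peer Monitor
host named ``mon2``; adapt the unit name, host names, and ports to your
deployment:

.. prompt:: bash

   sudo systemctl status ceph-mon@$(hostname -s)
   nc -zv mon2 3300
   nc -zv mon2 6789
   ceph status
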
.. _rados_troubleshoting_troubleshooting_mon_using_admin_socket:

@ -175,106 +183,136 @@ the quorum is formed by only two monitors, and *c* is in the quorum as a
``IP:PORT`` combination, the **lower** the rank. In this case, because
``127.0.0.1:6789`` is lower than the other two ``IP:PORT`` combinations,
``mon.a`` has the highest rank: namely, rank 0.


Most Common Monitor Issues
===========================

The Cluster Has Quorum but at Least One Monitor is Down
--------------------------------------------------------

When the cluster has quorum but at least one monitor is down, ``ceph health
detail`` returns a message similar to the following::

   $ ceph health detail
   [snip]
   mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

**How do I troubleshoot a Ceph cluster that has quorum but also has at least one monitor down?**

#. Make sure that ``mon.a`` is running.

#. Make sure that you can connect to ``mon.a``'s node from the other Monitor
   nodes. Check the TCP ports as well. Check ``iptables`` and
   ``nf_conntrack`` on all nodes and make sure that you are not
   dropping/rejecting connections.

If this initial troubleshooting doesn't solve your problem, then further
investigation is necessary.

First, check the problematic monitor's ``mon_status`` via the admin socket as
explained in `Using the monitor's admin socket`_ and
`Understanding mon_status`_.

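For example, for the Monitor ``mon.a`` from the health output above, either of
the following commands reports the Monitor's view of its own state; the second
form queries the admin socket and must therefore be run on the node that hosts
``mon.a``:

.. prompt:: bash

   ceph tell mon.a mon_status
   ceph daemon mon.a mon_status
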
If the Monitor is out of the quorum, then its state will be one of the
following: ``probing``, ``electing`` or ``synchronizing``. If the state of
the Monitor is ``leader`` or ``peon``, then the Monitor believes itself to be
in quorum but the rest of the cluster believes that it is not in quorum. It
is possible that a Monitor that is in one of the ``probing``, ``electing``,
or ``synchronizing`` states has entered the quorum during the process of
troubleshooting. Check ``ceph status`` again to determine whether the Monitor
has entered quorum during your troubleshooting. If the Monitor remains out of
the quorum, then proceed with the investigations described in this section of
the documentation.


**What does it mean when a Monitor's state is ``probing``?**

If ``ceph health detail`` shows that a Monitor's state is ``probing``, then
the Monitor is still looking for the other Monitors. Every Monitor remains in
this state for some time when it is started. When a Monitor has connected to
the other Monitors specified in the ``monmap``, it ceases to be in the
``probing`` state. The amount of time that a Monitor is in the ``probing``
state depends upon the parameters of the cluster of which it is a part. For
example, when a Monitor is a part of a single-monitor cluster (never do this
in production), the monitor passes through the probing state almost
instantaneously. In a multi-monitor cluster, the Monitors stay in the
``probing`` state until they find enough monitors to form a quorum |---| this
means that if two out of three Monitors in the cluster are ``down``, the one
remaining Monitor stays in the ``probing`` state indefinitely until you bring
one of the other monitors up.

If quorum has been established, then the Monitor daemon should be able to
find the other Monitors quickly, as long as they can be reached. If a Monitor
is stuck in the ``probing`` state and you have exhausted the procedures above
that describe the troubleshooting of communications between the Monitors,
then it is possible that the problem Monitor is trying to reach the other
Monitors at a wrong address. ``mon_status`` outputs the ``monmap`` that is
known to the monitor: determine whether the other Monitors' locations as
specified in the ``monmap`` match the locations of the Monitors in the
network. If they do not, see `Recovering a Monitor's Broken monmap`_.
If the locations of the Monitors as specified in the ``monmap`` match the
locations of the Monitors in the network, then the persistent ``probing``
state could be related to severe clock skews amongst the monitor nodes. See
`Clock Skews`_. If the information in `Clock Skews`_ does not bring the
Monitor out of the ``probing`` state, then prepare your system logs and ask
the Ceph community for help. See `Preparing your logs`_ for information about
the proper preparation of logs.


**What does it mean when a Monitor's state is ``electing``?**

If ``ceph health detail`` shows that a Monitor's state is ``electing``, the
monitor is in the middle of an election. Elections typically complete
quickly, but sometimes the monitors can get stuck in what is known as an
*election storm*. See :ref:`Monitor Elections <dev_mon_elections>` for more
on monitor elections.

The presence of an election storm might indicate clock skew among the monitor
nodes. See `Clock Skews`_ for more information.

If your clocks are properly synchronized, search the mailing lists and bug
tracker for issues similar to your issue. The ``electing`` state is not
likely to persist. In versions of Ceph after the release of Cuttlefish, there
is no obvious reason other than clock skew that explains why an ``electing``
state would persist.

It is possible to investigate the cause of a persistent ``electing`` state if
you put the problematic Monitor into a ``down`` state while you investigate.
This is possible only if there are enough surviving Monitors to form quorum.

**What does it mean when a Monitor's state is ``synchronizing``?**

If ``ceph health detail`` shows that the Monitor is ``synchronizing``, the
monitor is catching up with the rest of the cluster so that it can join the
quorum. The amount of time that it takes for the Monitor to synchronize with
the rest of the quorum is a function of the size of the cluster's monitor
store, the cluster's size, and the state of the cluster. Larger and degraded
clusters generally keep Monitors in the ``synchronizing`` state longer than
do smaller, new clusters.

A Monitor that changes its state from ``synchronizing`` to ``electing`` and
then back to ``synchronizing`` indicates a problem: the cluster state may be
advancing (that is, generating new maps) too fast for the synchronization
process to keep up with the pace of the creation of the new maps. This issue
presented more frequently prior to the Cuttlefish release than it does in
more recent releases, because the synchronization process has since been
refactored and enhanced to avoid this dynamic. If you experience this in
later versions, report the issue in the `Ceph bug tracker
<https://tracker.ceph.com>`_. Prepare and provide logs to substantiate any
bug you raise. See `Preparing your logs`_ for information about the proper
preparation of logs.

**What does it mean when a Monitor's state is ``leader`` or ``peon``?**

If ``ceph health detail`` shows that the Monitor is in the ``leader`` state
or in the ``peon`` state, it is likely that clock skew is present. Follow the
instructions in `Clock Skews`_. If you have followed those instructions and
``ceph health detail`` still shows that the Monitor is in the ``leader``
state or the ``peon`` state, report the issue in the `Ceph bug tracker
<https://tracker.ceph.com>`_. If you raise an issue, provide logs to
substantiate it. See `Preparing your logs`_ for information about the proper
preparation of logs.


Recovering a Monitor's Broken ``monmap``

@ -317,18 +355,21 @@ Scrap the monitor and redeploy

Inject a monmap into the monitor

   Retrieve the ``monmap`` from the surviving monitors and inject it into the
   monitor whose ``monmap`` is corrupted or lost.

   Implement this solution by carrying out the following procedure:

   1. Is there a quorum of monitors? If so, retrieve the ``monmap`` from the
      quorum::

         $ ceph mon getmap -o /tmp/monmap

   2. If there is no quorum, then retrieve the ``monmap`` directly from
      another monitor that has been stopped (in this example, the other
      monitor has the ID ``ID-FOO``)::

         $ ceph-mon -i ID-FOO --extract-monmap /tmp/monmap

@ -340,97 +381,105 @@ Inject a monmap into the monitor

   5. Start the monitor

   .. warning:: Injecting ``monmaps`` can cause serious problems because
      doing so will overwrite the latest existing ``monmap`` stored on the
      monitor. Be careful!


Clock Skews
-----------

The Paxos consensus algorithm requires close time synchronization, which means
that clock skew among the monitors in the quorum can have a serious effect on
monitor operation. The resulting behavior can be puzzling. To avoid this
issue, run a clock synchronization tool on your monitor nodes: for example,
use ``Chrony`` or the legacy ``ntpd`` utility. Configure each monitor node so
that the `iburst` option is in effect and so that each monitor has multiple
peers, including the following:

* Each other
* Internal ``NTP`` servers
* Multiple external, public pool servers

.. note:: The ``iburst`` option sends a burst of eight packets instead of the
   usual single packet, and is used during the process of getting two peers
   into initial synchronization.

Furthermore, it is advisable to synchronize *all* nodes in your cluster
against internal and external servers, and perhaps even against your
monitors. Run ``NTP`` servers on bare metal: VM-virtualized clocks are not
suitable for steady timekeeping. See `https://www.ntp.org
<https://www.ntp.org>`_ for more information about the Network Time Protocol
(NTP). Your organization might already have quality internal ``NTP`` servers
available. Sources for ``NTP`` server appliances include the following:

* Microsemi (formerly Symmetricom) `https://microsemi.com <https://www.microsemi.com/product-directory/3425-timing-synchronization>`_
* EndRun `https://endruntechnologies.com <https://endruntechnologies.com/products/ntp-time-servers>`_
* Netburner `https://www.netburner.com <https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server>`_

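Whichever time daemon you choose, verify on each monitor node that it is
tracking several reachable sources. With ``chrony``, for example (a sketch;
the output depends on your configuration):

.. prompt:: bash

   chronyc sources
   chronyc tracking
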
Clock Skew Questions and Answers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**What's the maximum tolerated clock skew?**

By default, monitors allow clocks to drift up to a maximum of 0.05 seconds
(50 milliseconds).

**Can I increase the maximum tolerated clock skew?**

Yes, but we strongly recommend against doing so. The maximum tolerated clock
skew is configurable via the ``mon-clock-drift-allowed`` option, but it is
almost certainly a bad idea to make changes to this option. The clock skew
maximum is in place because clock-skewed monitors cannot be relied upon. The
current default value has proven its worth at alerting the user before the
monitors encounter serious problems. Changing this value might cause
unforeseen effects on the stability of the monitors and overall cluster
health.

**How do I know whether there is a clock skew?**

The monitors will warn you via the cluster status ``HEALTH_WARN``. When clock
skew is present, the ``ceph health detail`` and ``ceph status`` commands
return an output resembling the following::

   mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

In this example, the monitor ``mon.c`` has been flagged as suffering from
clock skew.

In Luminous and later releases, it is possible to check for a clock skew by
running the ``ceph time-sync-status`` command. Note that the lead monitor
typically has the numerically lowest IP address. It will always show ``0``:
the reported offsets of other monitors are relative to the lead monitor, not
to any external reference source.

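For example, to view those relative offsets:

.. prompt:: bash

   ceph time-sync-status
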
**What should I do if there is a clock skew?**

Synchronize your clocks. Using an NTP client might help. However, if you are
already using an NTP client and you still encounter clock skew problems,
determine whether the NTP server that you are using is remote to your network
or instead hosted on your network. Hosting your own NTP servers tends to
mitigate clock skew problems.


Client Can't Connect or Mount
-----------------------------

Check your IP tables. Some operating-system install utilities add a ``REJECT``
rule to ``iptables``. Such a rule rejects all clients other than ``ssh`` that
try to connect to the host. If your monitor host's IP tables have a ``REJECT``
rule in place, clients that are connecting from a separate node will fail and
will raise a timeout error. Any ``iptables`` rules that reject clients trying
to connect to Ceph daemons must be addressed. For example::

   REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

It might also be necessary to add rules to iptables on your Ceph hosts to
ensure that clients are able to access the TCP ports associated with your Ceph
monitors (default: port 6789) and Ceph OSDs (default: 6800 through 7300). For
example::

   iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7300 -j ACCEPT


Monitor Store Failures
======================

@ -438,9 +487,9 @@ Monitor Store Failures
Symptoms of store corruption
----------------------------

Ceph monitors store the :term:`Cluster Map` in a key-value store. If key-value
store corruption causes a monitor to fail, then the monitor log might contain
one of the following error messages::

   Corruption: error in middle of record

@ -451,21 +500,26 @@ or::
Recovery using healthy monitor(s)
---------------------------------

If there are surviving monitors, we can always :ref:`replace
<adding-and-removing-monitors>` the corrupted monitor with a new one. After
the new monitor boots, it will synchronize with a healthy peer. After the new
monitor is fully synchronized, it will be able to serve clients.

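The full procedure is described in the section linked above. As a rough sketch
only (assuming the failed daemon is ``mon.b`` and that enough healthy Monitors
remain to keep quorum), removing the dead Monitor from the cluster map looks
like this; how the replacement Monitor is then deployed depends on your
orchestration tooling:

.. prompt:: bash

   ceph mon remove b
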
.. _mon-store-recovery-using-osds:

Recovery using OSDs
-------------------

Even if all monitors fail at the same time, it is possible to recover the
monitor store by using information stored in OSDs. You are encouraged to
deploy at least three (and preferably five) monitors in a Ceph cluster. In
such a deployment, complete monitor failure is unlikely. However, unplanned
power loss in a data center whose disk settings or filesystem settings are
improperly configured could cause the underlying filesystem to fail and this
could kill all of the monitors. In such a case, data in the OSDs can be used
to recover the monitors. The following script can be used to recover the
monitors:

.. code-block:: bash

@ -516,124 +570,142 @@ information stored in OSDs.
|
|||||||
mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
|
mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
|
||||||
chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db
|
chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db
|
||||||
|
|
||||||
The steps above
|
This script performs the following steps:
|
||||||
|
|
||||||
|
#. Collects the map from each OSD host.
|
||||||
|
#. Rebuilds the store.
|
||||||
|
#. Fills the entities in the keyring file with appropriate capabilities.
|
||||||
|
#. Replaces the corrupted store on ``mon.foo`` with the recovered copy.
|
||||||
|
|
||||||
#. collect the map from all OSD hosts,
|
|
||||||
#. then rebuild the store,
|
|
||||||
#. fill the entities in keyring file with appropriate caps
|
|
||||||
#. replace the corrupted store on ``mon.foo`` with the recovered copy.
|
|
||||||
|
|
||||||
Known limitations
|
Known limitations
|
||||||
~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
Following information are not recoverable using the steps above:
|
The above recovery tool is unable to recover the following information:
|
||||||
|
|
||||||
- **some added keyrings**: all the OSD keyrings added using ``ceph auth add`` command
|
- **Certain added keyrings**: All of the OSD keyrings added using the ``ceph
|
||||||
are recovered from the OSD's copy. And the ``client.admin`` keyring is imported
|
auth add`` command are recovered from the OSD's copy, and the
|
||||||
using ``ceph-monstore-tool``. But the MDS keyrings and other keyrings are missing
|
``client.admin`` keyring is imported using ``ceph-monstore-tool``. However,
|
||||||
in the recovered monitor store. You might need to re-add them manually.
|
the MDS keyrings and all other keyrings will be missing in the recovered
|
||||||
|
monitor store. You might need to manually re-add them.
|
||||||
|
|
||||||
- **creating pools**: If any RADOS pools were in the process of being creating, that state is lost. The recovery tool assumes that all pools have been created. If there are PGs that are stuck in the 'unknown' after the recovery for a partially created pool, you can force creation of the *empty* PG with the ``ceph osd force-create-pg`` command. Note that this will create an *empty* PG, so only do this if you know the pool is empty.
|
- **Creating pools**: If any RADOS pools were in the process of being created,
|
||||||
|
that state is lost. The recovery tool operates on the assumption that all
|
||||||
- **MDS Maps**: the MDS maps are lost.
|
pools have already been created. If there are PGs that are stuck in the
|
||||||
|
'unknown' state after the recovery for a partially created pool, you can
|
||||||
|
force creation of the *empty* PG by running the ``ceph osd force-create-pg``
|
||||||
|
command. Note that this will create an *empty* PG, so take this action only
|
||||||
|
if you know the pool is empty.
|
||||||
|
|
||||||
|
- **MDS Maps**: The MDS maps are lost.
|
||||||
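As an illustration of re-adding missing keyrings, the safest path is to re-import a
previously exported keyring file; entities without a backup have to be re-created with
``ceph auth get-or-create``. A minimal sketch, assuming a backup file at a hypothetical
path and an MDS named ``a``:

.. prompt:: bash

   # Re-import previously exported keyrings (the path is a placeholder):
   ceph auth import -i /root/keyring-backups/mds.a.keyring

   # Verify that the entity and its caps are present again:
   ceph auth get mds.a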
|
|
||||||
|
|
||||||
Everything Failed! Now What?
|
Everything Failed! Now What?
|
||||||
=============================
|
============================
|
||||||
|
|
||||||
Reaching out for help
|
Reaching out for help
|
||||||
----------------------
|
---------------------
|
||||||
|
|
||||||
You can find us on IRC at #ceph and #ceph-devel at OFTC (server irc.oftc.net)
|
You can find help on IRC in #ceph and #ceph-devel on OFTC (server
|
||||||
and on ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make
|
irc.oftc.net), or at ``dev@ceph.io`` and ``ceph-users@lists.ceph.com``. Make
|
||||||
sure you have grabbed your logs and have them ready if someone asks: the faster
|
sure that you have prepared your logs and that you have them ready upon
|
||||||
the interaction and lower the latency in response, the better chances everyone's
|
request.
|
||||||
time is optimized.
|
|
||||||
|
See https://ceph.io/en/community/connect/ for current (as of October 2023)
|
||||||
|
information on getting in contact with the upstream Ceph community.
|
||||||
|
|
||||||
|
|
||||||
Preparing your logs
|
Preparing your logs
|
||||||
---------------------
|
-------------------
|
||||||
|
|
||||||
Monitor logs are, by default, kept in ``/var/log/ceph/ceph-mon.FOO.log*``. We
|
The default location for monitor logs is ``/var/log/ceph/ceph-mon.FOO.log*``.
|
||||||
may want them. However, your logs may not have the necessary information. If
|
However, if they are not there, you can find their current location by running
|
||||||
you don't find your monitor logs at their default location, you can check
|
the following command:
|
||||||
where they should be by running::
|
|
||||||
|
|
||||||
ceph-conf --name mon.FOO --show-config-value log_file
|
.. prompt:: bash
|
||||||
|
|
||||||
The amount of information in the logs are subject to the debug levels being
|
ceph-conf --name mon.FOO --show-config-value log_file
|
||||||
enforced by your configuration files. If you have not enforced a specific
|
|
||||||
debug level then Ceph is using the default levels and your logs may not
|
|
||||||
contain important information to track down you issue.
|
|
||||||
A first step in getting relevant information into your logs will be to raise
|
|
||||||
debug levels. In this case we will be interested in the information from the
|
|
||||||
monitor.
|
|
||||||
Similarly to what happens on other components, different parts of the monitor
|
|
||||||
will output their debug information on different subsystems.
|
|
||||||
|
|
||||||
You will have to raise the debug levels of those subsystems more closely
|
The amount of information in the logs is determined by the debug levels in the
|
||||||
related to your issue. This may not be an easy task for someone unfamiliar
|
cluster's configuration files. If Ceph is using the default debug levels, then
|
||||||
with troubleshooting Ceph. For most situations, setting the following options
|
your logs might be missing important information that would help the upstream
|
||||||
on your monitors will be enough to pinpoint a potential source of the issue::
|
Ceph community address your issue.
|
||||||
|
|
||||||
|
To make sure your monitor logs contain relevant information, you can raise
|
||||||
|
debug levels. Here we are interested in information from the monitors. As with
|
||||||
|
other components, the monitors have different parts that output their debug
|
||||||
|
information on different subsystems.
|
||||||
|
|
||||||
|
If you are an experienced Ceph troubleshooter, we recommend raising the debug
|
||||||
|
levels of the most relevant subsystems. Of course, this approach might not be
|
||||||
|
easy for beginners. In most cases, however, enough information to address the
|
||||||
|
issue will be secured if the following debug levels are entered::
|
||||||
|
|
||||||
debug_mon = 10
|
debug_mon = 10
|
||||||
debug_ms = 1
|
debug_ms = 1
|
||||||
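In a configuration-file-based setup, these options would typically be placed in the
``[mon]`` section of ``ceph.conf`` (a sketch only; clusters that use the centralized
configuration database can achieve the same effect with ``ceph config set mon``)::

   [mon]
       debug_mon = 10
       debug_ms = 1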
|
|
||||||
If we find that these debug levels are not enough, there's a chance we may
|
Sometimes these debug levels do not yield enough information. In such cases,
|
||||||
ask you to raise them or even define other debug subsystems to obtain infos
|
members of the upstream Ceph community might ask you to make additional changes
|
||||||
from -- but at least we started off with some useful information, instead
|
to these or to other debug levels. In any case, it is better for us to receive
|
||||||
of a massively empty log without much to go on with.
|
at least some useful information than to receive an empty log.
|
||||||
|
|
||||||
|
|
||||||
Do I need to restart a monitor to adjust debug levels?
|
Do I need to restart a monitor to adjust debug levels?
|
||||||
------------------------------------------------------
|
------------------------------------------------------
|
||||||
|
|
||||||
No. You may do it in one of two ways:
|
No, restarting a monitor is not necessary. Debug levels may be adjusted by
|
||||||
|
using two different methods, depending on whether or not there is a quorum:
|
||||||
|
|
||||||
You have quorum
|
There is a quorum
|
||||||
|
|
||||||
Either inject the debug option into the monitor you want to debug::
|
Either inject the debug option into the specific monitor that needs to
|
||||||
|
be debugged::
|
||||||
|
|
||||||
ceph tell mon.FOO config set debug_mon 10/10
|
ceph tell mon.FOO config set debug_mon 10/10
|
||||||
|
|
||||||
or into all monitors at once::
|
Or inject it into all monitors at once::
|
||||||
|
|
||||||
ceph tell mon.* config set debug_mon 10/10
|
ceph tell mon.* config set debug_mon 10/10
|
||||||
|
|
||||||
No quorum
|
|
||||||
|
|
||||||
Use the monitor's admin socket and directly adjust the configuration
|
There is no quorum
|
||||||
options::
|
|
||||||
|
Use the admin socket of the specific monitor that needs to be debugged
|
||||||
|
and directly adjust the monitor's configuration options::
|
||||||
|
|
||||||
ceph daemon mon.FOO config set debug_mon 10/10
|
ceph daemon mon.FOO config set debug_mon 10/10
|
||||||
|
|
||||||
|
|
||||||
Going back to default values is as easy as rerunning the above commands
|
To return the debug levels to their default values, run the above commands
|
||||||
using the debug level ``1/10`` instead. You can check your current
|
using the debug level ``1/10`` rather than ``10/10``. To check a monitor's
|
||||||
values using the admin socket and the following commands::
|
current values, use the admin socket and run either of the following commands:
|
||||||
|
|
||||||
ceph daemon mon.FOO config show
|
.. prompt:: bash
|
||||||
|
|
||||||
or::
|
ceph daemon mon.FOO config show
|
||||||
|
|
||||||
ceph daemon mon.FOO config get 'OPTION_NAME'
|
or:
|
||||||
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
|
ceph daemon mon.FOO config get 'OPTION_NAME'
|
||||||
|
|
||||||
|
|
||||||
Reproduced the problem with appropriate debug levels. Now what?
|
|
||||||
----------------------------------------------------------------
|
|
||||||
|
|
||||||
Ideally you would send us only the relevant portions of your logs.
|
I Reproduced the problem with appropriate debug levels. Now what?
|
||||||
We realise that figuring out the corresponding portion may not be the
|
-----------------------------------------------------------------
|
||||||
easiest of tasks. Therefore, we won't hold it to you if you provide the
|
|
||||||
full log, but common sense should be employed. If your log has hundreds
|
|
||||||
of thousands of lines, it may get tricky to go through the whole thing,
|
|
||||||
specially if we are not aware at which point, whatever your issue is,
|
|
||||||
happened. For instance, when reproducing, keep in mind to write down
|
|
||||||
current time and date and to extract the relevant portions of your logs
|
|
||||||
based on that.
|
|
||||||
|
|
||||||
Finally, you should reach out to us on the mailing lists, on IRC or file
|
We prefer that you send us only the portions of your logs that are relevant to
|
||||||
a new issue on the `tracker`_.
|
your monitor problems. Of course, it might not be easy for you to determine
|
||||||
|
which portions are relevant so we are willing to accept complete and
|
||||||
|
unabridged logs. However, we request that you avoid sending logs containing
|
||||||
|
hundreds of thousands of lines with no additional clarifying information. One
|
||||||
|
common-sense way of making our task easier is to write down the current time
|
||||||
|
and date when you are reproducing the problem and then extract portions of your
|
||||||
|
logs based on that information.
|
||||||
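For example, one simple way to cut out such a time window (a sketch only; the
timestamps and the log path are placeholders and must match the format actually
used in your log):

.. prompt:: bash

   awk '/2023-10-05T14:10/,/2023-10-05T14:25/' /var/log/ceph/ceph-mon.foo.log > mon.foo-excerpt.log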
|
|
||||||
|
Finally, reach out to us on the mailing lists or IRC or Slack, or by filing a
|
||||||
|
new issue on the `tracker`_.
|
||||||
|
|
||||||
.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new
|
.. _tracker: http://tracker.ceph.com/projects/ceph/issues/new
|
||||||
|
|
||||||
|
File diff suppressed because it is too large
Load Diff
@ -1,120 +1,128 @@
|
|||||||
=====================
|
====================
|
||||||
Troubleshooting PGs
|
Troubleshooting PGs
|
||||||
=====================
|
====================
|
||||||
|
|
||||||
Placement Groups Never Get Clean
|
Placement Groups Never Get Clean
|
||||||
================================
|
================================
|
||||||
|
|
||||||
When you create a cluster and your cluster remains in ``active``,
|
If, after you have created your cluster, any Placement Groups (PGs) remain in
|
||||||
``active+remapped`` or ``active+degraded`` status and never achieves an
|
the ``active`` status, the ``active+remapped`` status or the
|
||||||
``active+clean`` status, you likely have a problem with your configuration.
|
``active+degraded`` status and never achieve an ``active+clean`` status, you
|
||||||
|
likely have a problem with your configuration.
|
||||||
|
|
||||||
You may need to review settings in the `Pool, PG and CRUSH Config Reference`_
|
In such a situation, it may be necessary to review the settings in the `Pool,
|
||||||
and make appropriate adjustments.
|
PG and CRUSH Config Reference`_ and make appropriate adjustments.
|
||||||
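A quick way to see which PGs are affected and which pool parameters are currently in
effect (this is only a status check and changes nothing):

.. prompt:: bash

   ceph health detail
   ceph osd pool ls detail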
|
|
||||||
As a general rule, you should run your cluster with more than one OSD and a
|
As a general rule, run your cluster with more than one OSD and a pool size
|
||||||
pool size greater than 1 object replica.
|
greater than two object replicas.
|
||||||
|
|
||||||
.. _one-node-cluster:
|
.. _one-node-cluster:
|
||||||
|
|
||||||
One Node Cluster
|
One Node Cluster
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
Ceph no longer provides documentation for operating on a single node, because
|
Ceph no longer provides documentation for operating on a single node. Systems
|
||||||
you would never deploy a system designed for distributed computing on a single
|
designed for distributed computing by definition do not run on a single node.
|
||||||
node. Additionally, mounting client kernel modules on a single node containing a
|
The mounting of client kernel modules on a single node that contains a Ceph
|
||||||
Ceph daemon may cause a deadlock due to issues with the Linux kernel itself
|
daemon may cause a deadlock due to issues with the Linux kernel itself (unless
|
||||||
(unless you use VMs for the clients). You can experiment with Ceph in a 1-node
|
VMs are used as clients). You can experiment with Ceph in a one-node
|
||||||
configuration, in spite of the limitations as described herein.
|
configuration, in spite of the limitations as described herein.
|
||||||
|
|
||||||
If you are trying to create a cluster on a single node, you must change the
|
To create a cluster on a single node, you must change the
|
||||||
default of the ``osd_crush_chooseleaf_type`` setting from ``1`` (meaning
|
``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
|
||||||
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
|
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
|
||||||
file before you create your monitors and OSDs. This tells Ceph that an OSD
|
file before you create your monitors and OSDs. This tells Ceph that an OSD is
|
||||||
can peer with another OSD on the same host. If you are trying to set up a
|
permitted to place another OSD on the same host. If you are trying to set up a
|
||||||
1-node cluster and ``osd_crush_chooseleaf_type`` is greater than ``0``,
|
single-node cluster and ``osd_crush_chooseleaf_type`` is greater than ``0``,
|
||||||
Ceph will try to peer the PGs of one OSD with the PGs of another OSD on
|
Ceph will attempt to place the PGs of one OSD with the PGs of another OSD on
|
||||||
another node, chassis, rack, row, or even datacenter depending on the setting.
|
another node, chassis, rack, row, or datacenter depending on the setting.
|
||||||
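As an illustration, the relevant ``ceph.conf`` snippet for such a single-node
experiment might look like this (a sketch only; it must be in place before the
monitors and OSDs are created)::

   [global]
       osd_crush_chooseleaf_type = 0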
|
|
||||||
.. tip:: DO NOT mount kernel clients directly on the same node as your
|
.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph
|
||||||
Ceph Storage Cluster, because kernel conflicts can arise. However, you
|
Storage Cluster. Kernel conflicts can arise. However, you can mount kernel
|
||||||
can mount kernel clients within virtual machines (VMs) on a single node.
|
clients within virtual machines (VMs) on a single node.
|
||||||
|
|
||||||
If you are creating OSDs using a single disk, you must create directories
|
If you are creating OSDs using a single disk, you must manually create
|
||||||
for the data manually first.
|
directories for the data first.
|
||||||
|
|
||||||
|
|
||||||
Fewer OSDs than Replicas
|
Fewer OSDs than Replicas
|
||||||
------------------------
|
------------------------
|
||||||
|
|
||||||
If you have brought up two OSDs to an ``up`` and ``in`` state, but you still
|
If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not
|
||||||
don't see ``active + clean`` placement groups, you may have an
|
in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set
|
||||||
``osd_pool_default_size`` set to greater than ``2``.
|
to greater than ``2``.
|
||||||
|
|
||||||
There are a few ways to address this situation. If you want to operate your
|
There are a few ways to address this situation. If you want to operate your
|
||||||
cluster in an ``active + degraded`` state with two replicas, you can set the
|
cluster in an ``active + degraded`` state with two replicas, you can set the
|
||||||
``osd_pool_default_min_size`` to ``2`` so that you can write objects in
|
``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
|
||||||
an ``active + degraded`` state. You may also set the ``osd_pool_default_size``
|
``active + degraded`` state. You may also set the ``osd_pool_default_size``
|
||||||
setting to ``2`` so that you only have two stored replicas (the original and
|
setting to ``2`` so that you have only two stored replicas (the original and
|
||||||
one replica), in which case the cluster should achieve an ``active + clean``
|
one replica). In such a case, the cluster should achieve an ``active + clean``
|
||||||
state.
|
state.
|
||||||
|
|
||||||
.. note:: You can make the changes at runtime. If you make the changes in
|
.. note:: You can make the changes while the cluster is running. If you make
|
||||||
your Ceph configuration file, you may need to restart your cluster.
|
the changes in your Ceph configuration file, you might need to restart your
|
||||||
|
cluster.
|
||||||
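For a pool that already exists, the equivalent per-pool values can be applied at
runtime (``data`` is only an example pool name):

.. prompt:: bash

   ceph osd pool set data size 2
   ceph osd pool set data min_size 2

Note that the ``osd_pool_default_*`` options affect only pools created after the
change; existing pools keep their own ``size`` and ``min_size`` values.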
|
|
||||||
|
|
||||||
Pool Size = 1
|
Pool Size = 1
|
||||||
-------------
|
-------------
|
||||||
|
|
||||||
If you have the ``osd_pool_default_size`` set to ``1``, you will only have
|
If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
|
||||||
one copy of the object. OSDs rely on other OSDs to tell them which objects
|
of the object. OSDs rely on other OSDs to tell them which objects they should
|
||||||
they should have. If a first OSD has a copy of an object and there is no
|
have. If one OSD has a copy of an object and there is no second copy, then
|
||||||
second copy, then no second OSD can tell the first OSD that it should have
|
there is no second OSD to tell the first OSD that it should have that copy. For
|
||||||
that copy. For each placement group mapped to the first OSD (see
|
each placement group mapped to the first OSD (see ``ceph pg dump``), you can
|
||||||
``ceph pg dump``), you can force the first OSD to notice the placement groups
|
force the first OSD to notice the placement groups it needs by running a
|
||||||
it needs by running::
|
command of the following form:
|
||||||
|
|
||||||
ceph osd force-create-pg <pgid>
|
.. prompt:: bash
|
||||||
|
|
||||||
|
ceph osd force-create-pg <pgid>
|
||||||
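As a convenience, the placement groups mapped to a particular OSD can also be listed
directly instead of being picked out of the full ``ceph pg dump`` output (``osd.0``
stands in for the OSD in question):

.. prompt:: bash

   ceph pg ls-by-osd osd.0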
|
|
||||||
|
|
||||||
CRUSH Map Errors
|
CRUSH Map Errors
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
Another candidate for placement groups remaining unclean involves errors
|
If any placement groups in your cluster are unclean, then there might be errors
|
||||||
in your CRUSH map.
|
in your CRUSH map.
|
||||||
|
|
||||||
|
|
||||||
Stuck Placement Groups
|
Stuck Placement Groups
|
||||||
======================
|
======================
|
||||||
|
|
||||||
It is normal for placement groups to enter states like "degraded" or "peering"
|
It is normal for placement groups to enter "degraded" or "peering" states after
|
||||||
following a failure. Normally these states indicate the normal progression
|
a component failure. Normally, these states reflect the expected progression
|
||||||
through the failure recovery process. However, if a placement group stays in one
|
through the failure recovery process. However, a placement group that stays in
|
||||||
of these states for a long time this may be an indication of a larger problem.
|
one of these states for a long time might be an indication of a larger problem.
|
||||||
For this reason, the monitor will warn when placement groups get "stuck" in a
|
For this reason, the Ceph Monitors will warn when placement groups get "stuck"
|
||||||
non-optimal state. Specifically, we check for:
|
in a non-optimal state. Specifically, we check for:
|
||||||
|
|
||||||
* ``inactive`` - The placement group has not been ``active`` for too long
|
* ``inactive`` - The placement group has not been ``active`` for too long (that
|
||||||
(i.e., it hasn't been able to service read/write requests).
|
is, it hasn't been able to service read/write requests).
|
||||||
|
|
||||||
* ``unclean`` - The placement group has not been ``clean`` for too long
|
* ``unclean`` - The placement group has not been ``clean`` for too long (that
|
||||||
(i.e., it hasn't been able to completely recover from a previous failure).
|
is, it hasn't been able to completely recover from a previous failure).
|
||||||
|
|
||||||
* ``stale`` - The placement group status has not been updated by a ``ceph-osd``,
|
* ``stale`` - The placement group status has not been updated by a
|
||||||
indicating that all nodes storing this placement group may be ``down``.
|
``ceph-osd``. This indicates that all nodes storing this placement group may
|
||||||
|
be ``down``.
|
||||||
|
|
||||||
You can explicitly list stuck placement groups with one of::
|
List stuck placement groups by running one of the following commands:
|
||||||
|
|
||||||
ceph pg dump_stuck stale
|
.. prompt:: bash
|
||||||
ceph pg dump_stuck inactive
|
|
||||||
ceph pg dump_stuck unclean
|
|
||||||
|
|
||||||
For stuck ``stale`` placement groups, it is normally a matter of getting the
|
ceph pg dump_stuck stale
|
||||||
right ``ceph-osd`` daemons running again. For stuck ``inactive`` placement
|
ceph pg dump_stuck inactive
|
||||||
groups, it is usually a peering problem (see :ref:`failures-osd-peering`). For
|
ceph pg dump_stuck unclean
|
||||||
stuck ``unclean`` placement groups, there is usually something preventing
|
|
||||||
recovery from completing, like unfound objects (see
|
- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd``
|
||||||
:ref:`failures-osd-unfound`);
|
daemons are not running.
|
||||||
|
- Stuck ``inactive`` placement groups usually indicate a peering problem (see
|
||||||
|
:ref:`failures-osd-peering`).
|
||||||
|
- Stuck ``unclean`` placement groups usually indicate that something is
|
||||||
|
preventing recovery from completing, possibly unfound objects (see
|
||||||
|
:ref:`failures-osd-unfound`);
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@ -123,21 +131,28 @@ recovery from completing, like unfound objects (see
|
|||||||
Placement Group Down - Peering Failure
|
Placement Group Down - Peering Failure
|
||||||
======================================
|
======================================
|
||||||
|
|
||||||
In certain cases, the ``ceph-osd`` `Peering` process can run into
|
In certain cases, the ``ceph-osd`` `peering` process can run into problems,
|
||||||
problems, preventing a PG from becoming active and usable. For
|
which can prevent a PG from becoming active and usable. In such a case, running
|
||||||
example, ``ceph health`` might report::
|
the command ``ceph health detail`` will report something similar to the following:
|
||||||
|
|
||||||
ceph health detail
|
.. prompt:: bash
|
||||||
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
|
|
||||||
...
|
|
||||||
pg 0.5 is down+peering
|
|
||||||
pg 1.4 is down+peering
|
|
||||||
...
|
|
||||||
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
|
|
||||||
|
|
||||||
We can query the cluster to determine exactly why the PG is marked ``down`` with::
|
ceph health detail
|
||||||
|
|
||||||
ceph pg 0.5 query
|
::
|
||||||
|
|
||||||
|
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
|
||||||
|
...
|
||||||
|
pg 0.5 is down+peering
|
||||||
|
pg 1.4 is down+peering
|
||||||
|
...
|
||||||
|
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
|
||||||
|
|
||||||
|
Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form:
|
||||||
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
|
ceph pg 0.5 query
|
||||||
|
|
||||||
.. code-block:: javascript
|
.. code-block:: javascript
|
||||||
|
|
||||||
@ -164,21 +179,24 @@ We can query the cluster to determine exactly why the PG is marked ``down`` with
|
|||||||
]
|
]
|
||||||
}
|
}
|
||||||
|
|
||||||
The ``recovery_state`` section tells us that peering is blocked due to
|
The ``recovery_state`` section tells us that peering is blocked due to down
|
||||||
down ``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that ``ceph-osd``
|
``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that
|
||||||
and things will recover.
|
particular ``ceph-osd`` and recovery will proceed.
|
||||||
|
|
||||||
Alternatively, if there is a catastrophic failure of ``osd.1`` (e.g., disk
|
Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
|
||||||
failure), we can tell the cluster that it is ``lost`` and to cope as
|
there has been a disk failure), the cluster can be informed that the OSD is
|
||||||
best it can.
|
``lost`` and the cluster can be instructed that it must cope as best it can.
|
||||||
|
|
||||||
.. important:: This is dangerous in that the cluster cannot
|
.. important:: Informing the cluster that an OSD has been lost is dangerous
|
||||||
guarantee that the other copies of the data are consistent
|
because the cluster cannot guarantee that the other copies of the data are
|
||||||
and up to date.
|
consistent and up to date.
|
||||||
|
|
||||||
To instruct Ceph to continue anyway::
|
To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
|
||||||
|
anyway, run a command of the following form:
|
||||||
|
|
||||||
ceph osd lost 1
|
.. prompt:: bash
|
||||||
|
|
||||||
|
ceph osd lost 1
|
||||||
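Because this operation can cause permanent data loss, the command refuses to run
without an explicit confirmation flag, so in practice the invocation looks like this:

.. prompt:: bash

   ceph osd lost 1 --yes-i-really-mean-it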
|
|
||||||
Recovery will proceed.
|
Recovery will proceed.
|
||||||
|
|
||||||
@ -188,32 +206,43 @@ Recovery will proceed.
|
|||||||
Unfound Objects
|
Unfound Objects
|
||||||
===============
|
===============
|
||||||
|
|
||||||
Under certain combinations of failures Ceph may complain about
|
Under certain combinations of failures, Ceph may complain about ``unfound``
|
||||||
``unfound`` objects::
|
objects, as in this example:
|
||||||
|
|
||||||
ceph health detail
|
.. prompt:: bash
|
||||||
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
|
|
||||||
pg 2.4 is active+degraded, 78 unfound
|
|
||||||
|
|
||||||
This means that the storage cluster knows that some objects (or newer
|
ceph health detail
|
||||||
copies of existing objects) exist, but it hasn't found copies of them.
|
|
||||||
One example of how this might come about for a PG whose data is on ceph-osds
|
::
|
||||||
1 and 2:
|
|
||||||
|
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
|
||||||
|
pg 2.4 is active+degraded, 78 unfound
|
||||||
|
|
||||||
|
This means that the storage cluster knows that some objects (or newer copies of
|
||||||
|
existing objects) exist, but it hasn't found copies of them. Here is an
|
||||||
|
example of how this might come about for a PG whose data is on two OSDs, which
|
||||||
|
we will call "1" and "2":
|
||||||
|
|
||||||
* 1 goes down
|
* 1 goes down
|
||||||
* 2 handles some writes, alone
|
* 2 handles some writes, alone
|
||||||
* 1 comes up
|
* 1 comes up
|
||||||
* 1 and 2 repeer, and the objects missing on 1 are queued for recovery.
|
* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
|
||||||
* Before the new objects are copied, 2 goes down.
|
* Before the new objects are copied, 2 goes down.
|
||||||
|
|
||||||
Now 1 knows that these object exist, but there is no live ``ceph-osd`` who
|
At this point, 1 knows that these objects exist, but there is no live
|
||||||
has a copy. In this case, IO to those objects will block, and the
|
``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
|
||||||
cluster will hope that the failed node comes back soon; this is
|
will block, and the cluster will hope that the failed node comes back soon.
|
||||||
assumed to be preferable to returning an IO error to the user.
|
This is assumed to be preferable to returning an IO error to the user.
|
||||||
|
|
||||||
First, you can identify which objects are unfound with::
|
.. note:: The situation described immediately above is one reason that setting
|
||||||
|
``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks
|
||||||
|
data loss.
|
||||||
|
|
||||||
ceph pg 2.4 list_unfound [starting offset, in json]
|
Identify which objects are unfound by running a command of the following form:
|
||||||
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
|
ceph pg 2.4 list_unfound [starting offset, in json]
|
||||||
|
|
||||||
.. code-block:: javascript
|
.. code-block:: javascript
|
||||||
|
|
||||||
@ -252,22 +281,24 @@ First, you can identify which objects are unfound with::
|
|||||||
"more": false
|
"more": false
|
||||||
}
|
}
|
||||||
|
|
||||||
If there are too many objects to list in a single result, the ``more``
|
If there are too many objects to list in a single result, the ``more`` field
|
||||||
field will be true and you can query for more. (Eventually the
|
will be true and you can query for more. (Eventually the command line tool
|
||||||
command line tool will hide this from you, but not yet.)
|
will hide this from you, but not yet.)
|
||||||
|
|
||||||
Second, you can identify which OSDs have been probed or might contain
|
Now you can identify which OSDs have been probed or might contain data.
|
||||||
data.
|
|
||||||
|
|
||||||
At the end of the listing (before ``more`` is false), ``might_have_unfound`` is provided
|
At the end of the listing (before ``more: false``), ``might_have_unfound`` is
|
||||||
when ``available_might_have_unfound`` is true. This is equivalent to the output
|
provided when ``available_might_have_unfound`` is true. This is equivalent to
|
||||||
of ``ceph pg #.# query``. This eliminates the need to use ``query`` directly.
|
the output of ``ceph pg #.# query``. This eliminates the need to use ``query``
|
||||||
The ``might_have_unfound`` information given behaves the same way as described below for ``query``.
|
directly. The ``might_have_unfound`` information given behaves the same way as
|
||||||
The only difference is that OSDs that have ``already probed`` status are ignored.
|
that ``query`` does, which is described below. The only difference is that
|
||||||
|
OSDs that have the status of ``already probed`` are ignored.
|
||||||
|
|
||||||
Use of ``query``::
|
Use of ``query``:
|
||||||
|
|
||||||
ceph pg 2.4 query
|
.. prompt:: bash
|
||||||
|
|
||||||
|
ceph pg 2.4 query
|
||||||
|
|
||||||
.. code-block:: javascript
|
.. code-block:: javascript
|
||||||
|
|
||||||
@ -278,8 +309,8 @@ Use of ``query``::
|
|||||||
{ "osd": 1,
|
{ "osd": 1,
|
||||||
"status": "osd is down"}]},
|
"status": "osd is down"}]},
|
||||||
|
|
||||||
In this case, for example, the cluster knows that ``osd.1`` might have
|
In this case, the cluster knows that ``osd.1`` might have data, but it is
|
||||||
data, but it is ``down``. The full range of possible states include:
|
``down``. Here is the full range of possible states:
|
||||||
|
|
||||||
* already probed
|
* already probed
|
||||||
* querying
|
* querying
|
||||||
@ -289,106 +320,135 @@ data, but it is ``down``. The full range of possible states include:
|
|||||||
Sometimes it simply takes some time for the cluster to query possible
|
Sometimes it simply takes some time for the cluster to query possible
|
||||||
locations.
|
locations.
|
||||||
|
|
||||||
It is possible that there are other locations where the object can
|
It is possible that there are other locations where the object might exist that
|
||||||
exist that are not listed. For example, if a ceph-osd is stopped and
|
are not listed. For example: if an OSD is stopped and taken out of the cluster
|
||||||
taken out of the cluster, the cluster fully recovers, and due to some
|
and then the cluster fully recovers, and then through a subsequent set of
|
||||||
future set of failures ends up with an unfound object, it won't
|
failures the cluster ends up with an unfound object, the cluster will ignore
|
||||||
consider the long-departed ceph-osd as a potential location to
|
the removed OSD. (This scenario, however, is unlikely.)
|
||||||
consider. (This scenario, however, is unlikely.)
|
|
||||||
|
|
||||||
If all possible locations have been queried and objects are still
|
If all possible locations have been queried and objects are still lost, you may
|
||||||
lost, you may have to give up on the lost objects. This, again, is
|
have to give up on the lost objects. This, again, is possible only when unusual
|
||||||
possible given unusual combinations of failures that allow the cluster
|
combinations of failures have occurred that allow the cluster to learn about
|
||||||
to learn about writes that were performed before the writes themselves
|
writes that were performed before the writes themselves have been recovered. To
|
||||||
are recovered. To mark the "unfound" objects as "lost"::
|
mark the "unfound" objects as "lost", run a command of the following form:
|
||||||
|
|
||||||
ceph pg 2.5 mark_unfound_lost revert|delete
|
.. prompt:: bash
|
||||||
|
|
||||||
This the final argument specifies how the cluster should deal with
|
ceph pg 2.5 mark_unfound_lost revert|delete
|
||||||
lost objects.
|
|
||||||
|
|
||||||
The "delete" option will forget about them entirely.
|
Here the final argument (``revert|delete``) specifies how the cluster should
|
||||||
|
deal with lost objects.
|
||||||
|
|
||||||
The "revert" option (not available for erasure coded pools) will
|
The ``delete`` option will cause the cluster to forget about them entirely.
|
||||||
either roll back to a previous version of the object or (if it was a
|
|
||||||
new object) forget about it entirely. Use this with caution, as it
|
|
||||||
may confuse applications that expected the object to exist.
|
|
||||||
|
|
||||||
|
The ``revert`` option (which is not available for erasure coded pools) will
|
||||||
|
either roll back to a previous version of the object or (if it was a new
|
||||||
|
object) forget about the object entirely. Use ``revert`` with caution, as it
|
||||||
|
may confuse applications that expect the object to exist.
|
||||||
|
|
||||||
Homeless Placement Groups
|
Homeless Placement Groups
|
||||||
=========================
|
=========================
|
||||||
|
|
||||||
It is possible for all OSDs that had copies of a given placement groups to fail.
|
It is possible that every OSD that has copies of a given placement group fails.
|
||||||
If that's the case, that subset of the object store is unavailable, and the
|
If this happens, then the subset of the object store that contains those
|
||||||
monitor will receive no status updates for those placement groups. To detect
|
placement groups becomes unavailable and the monitor will receive no status
|
||||||
this situation, the monitor marks any placement group whose primary OSD has
|
updates for those placement groups. The monitor marks as ``stale`` any
|
||||||
failed as ``stale``. For example::
|
placement group whose primary OSD has failed. For example:
|
||||||
|
|
||||||
ceph health
|
.. prompt:: bash
|
||||||
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
|
|
||||||
|
|
||||||
You can identify which placement groups are ``stale``, and what the last OSDs to
|
ceph health
|
||||||
store them were, with::
|
|
||||||
|
|
||||||
ceph health detail
|
::
|
||||||
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
|
|
||||||
...
|
|
||||||
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
|
|
||||||
...
|
|
||||||
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
|
|
||||||
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
|
|
||||||
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
|
|
||||||
|
|
||||||
If we want to get placement group 2.5 back online, for example, this tells us that
|
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
|
||||||
it was last managed by ``osd.0`` and ``osd.2``. Restarting those ``ceph-osd``
|
|
||||||
daemons will allow the cluster to recover that placement group (and, presumably,
|
Identify which placement groups are ``stale`` and which were the last OSDs to
|
||||||
many others).
|
store the ``stale`` placement groups by running the following command:
|
||||||
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
|
ceph health detail
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
|
||||||
|
...
|
||||||
|
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
|
||||||
|
...
|
||||||
|
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
|
||||||
|
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
|
||||||
|
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
|
||||||
|
|
||||||
|
This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
|
||||||
|
``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
|
||||||
|
that placement group.
|
||||||
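How those OSDs are restarted depends on how the cluster was deployed. Two common
variants, shown as a sketch (adapt the OSD ids to your own ``ceph health detail``
output):

.. prompt:: bash

   # On hosts with classic systemd-managed OSDs:
   systemctl restart ceph-osd@0
   systemctl restart ceph-osd@2

   # On cephadm-managed clusters:
   ceph orch daemon restart osd.0
   ceph orch daemon restart osd.2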
|
|
||||||
|
|
||||||
Only a Few OSDs Receive Data
|
Only a Few OSDs Receive Data
|
||||||
============================
|
============================
|
||||||
|
|
||||||
If you have many nodes in your cluster and only a few of them receive data,
|
If only a few of the nodes in the cluster are receiving data, check the number
|
||||||
`check`_ the number of placement groups in your pool. Since placement groups get
|
of placement groups in the pool as instructed in the :ref:`Placement Groups
|
||||||
mapped to OSDs, a small number of placement groups will not distribute across
|
<rados_ops_pgs_get_pg_num>` documentation. Since placement groups get mapped to
|
||||||
your cluster. Try creating a pool with a placement group count that is a
|
OSDs in an operation involving dividing the number of placement groups in the
|
||||||
multiple of the number of OSDs. See `Placement Groups`_ for details. The default
|
cluster by the number of OSDs in the cluster, a small number of placement
|
||||||
placement group count for pools is not useful, but you can change it `here`_.
|
groups (the remainder, in this operation) are sometimes not distributed across
|
||||||
|
the cluster. In situations like this, create a pool with a placement group
|
||||||
|
count that is a multiple of the number of OSDs. See `Placement Groups`_ for
|
||||||
|
details. See the :ref:`Pool, PG, and CRUSH Config Reference
|
||||||
|
<rados_config_pool_pg_crush_ref>` for instructions on changing the default
|
||||||
|
values used to determine how many placement groups are assigned to each pool.
|
||||||
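For example, to see how PGs are currently spread across OSDs, check a pool's PG
count, and create a new pool whose PG count is a multiple of the number of OSDs
(the pool names and the count of ``64`` are placeholders):

.. prompt:: bash

   ceph osd df                       # the PGS column shows the PG count per OSD
   ceph osd pool get mypool pg_num
   ceph osd pool create newpool 64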
|
|
||||||
|
|
||||||
Can't Write Data
|
Can't Write Data
|
||||||
================
|
================
|
||||||
|
|
||||||
If your cluster is up, but some OSDs are down and you cannot write data,
|
If the cluster is up, but some OSDs are down and you cannot write data, make
|
||||||
check to ensure that you have the minimum number of OSDs running for the
|
sure that you have the minimum number of OSDs running in the pool. If you don't
|
||||||
placement group. If you don't have the minimum number of OSDs running,
|
have the minimum number of OSDs running in the pool, Ceph will not allow you to
|
||||||
Ceph will not allow you to write data because there is no guarantee
|
write data to it because there is no guarantee that Ceph can replicate your
|
||||||
that Ceph can replicate your data. See ``osd_pool_default_min_size``
|
data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
|
||||||
in the `Pool, PG and CRUSH Config Reference`_ for details.
|
Config Reference <rados_config_pool_pg_crush_ref>` for details.
|
||||||
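To compare the number of OSDs that are ``up`` with the ``min_size`` currently
required by a given pool (the pool name is a placeholder):

.. prompt:: bash

   ceph osd stat
   ceph osd pool get mypool min_size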
|
|
||||||
|
|
||||||
PGs Inconsistent
|
PGs Inconsistent
|
||||||
================
|
================
|
||||||
|
|
||||||
If you receive an ``active + clean + inconsistent`` state, this may happen
|
If the command ``ceph health detail`` returns an ``active + clean +
|
||||||
due to an error during scrubbing. As always, we can identify the inconsistent
|
inconsistent`` state, this might indicate an error during scrubbing. Identify
|
||||||
placement group(s) with::
|
the inconsistent placement group or placement groups by running the following
|
||||||
|
command:
|
||||||
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
$ ceph health detail
|
$ ceph health detail
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
|
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
|
||||||
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
|
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
|
||||||
2 scrub errors
|
2 scrub errors
|
||||||
|
|
||||||
Or if you prefer inspecting the output in a programmatic way::
|
Alternatively, run this command if you prefer to inspect the output in a
|
||||||
|
programmatic way:
|
||||||
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
|
$ rados list-inconsistent-pg rbd
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
$ rados list-inconsistent-pg rbd
|
|
||||||
["0.6"]
|
["0.6"]
|
||||||
|
|
||||||
There is only one consistent state, but in the worst case, we could have
|
There is only one consistent state, but in the worst case, we could have
|
||||||
different inconsistencies in multiple perspectives found in more than one
|
several kinds of inconsistency, reported from different perspectives, in more than one
|
||||||
objects. If an object named ``foo`` in PG ``0.6`` is truncated, we will have::
|
object. If an object named ``foo`` in PG ``0.6`` is truncated, the output of
|
||||||
|
``rados list-inconsistent-obj`` will look something like this:
|
||||||
|
|
||||||
$ rados list-inconsistent-obj 0.6 --format=json-pretty
|
.. prompt:: bash
|
||||||
|
|
||||||
|
rados list-inconsistent-obj 0.6 --format=json-pretty
|
||||||
|
|
||||||
.. code-block:: javascript
|
.. code-block:: javascript
|
||||||
|
|
||||||
@ -442,82 +502,103 @@ objects. If an object named ``foo`` in PG ``0.6`` is truncated, we will have::
|
|||||||
]
|
]
|
||||||
}
|
}
|
||||||
|
|
||||||
In this case, we can learn from the output:
|
In this case, the output indicates the following:
|
||||||
|
|
||||||
* The only inconsistent object is named ``foo``, and it is its head that has
|
* The only inconsistent object is named ``foo``, and its head has
|
||||||
inconsistencies.
|
inconsistencies.
|
||||||
* The inconsistencies fall into two categories:
|
* The inconsistencies fall into two categories:
|
||||||
|
|
||||||
* ``errors``: these errors indicate inconsistencies between shards without a
|
#. ``errors``: these errors indicate inconsistencies between shards, without
|
||||||
determination of which shard(s) are bad. Check for the ``errors`` in the
|
an indication of which shard(s) are bad. Check for the ``errors`` in the
|
||||||
`shards` array, if available, to pinpoint the problem.
|
``shards`` array, if available, to pinpoint the problem.
|
||||||
|
|
||||||
* ``data_digest_mismatch``: the digest of the replica read from OSD.2 is
|
* ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
|
||||||
different from the ones of OSD.0 and OSD.1
|
is different from the digests of the replica reads of ``OSD.0`` and
|
||||||
* ``size_mismatch``: the size of the replica read from OSD.2 is 0, while
|
``OSD.1``
|
||||||
the size reported by OSD.0 and OSD.1 is 968.
|
* ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
|
||||||
* ``union_shard_errors``: the union of all shard specific ``errors`` in
|
but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.
|
||||||
``shards`` array. The ``errors`` are set for the given shard that has the
|
|
||||||
problem. They include errors like ``read_error``. The ``errors`` ending in
|
|
||||||
``oi`` indicate a comparison with ``selected_object_info``. Look at the
|
|
||||||
``shards`` array to determine which shard has which error(s).
|
|
||||||
|
|
||||||
* ``data_digest_mismatch_info``: the digest stored in the object-info is not
|
#. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
|
||||||
``0xffffffff``, which is calculated from the shard read from OSD.2
|
``shards`` array. The ``errors`` are set for the shard with the problem.
|
||||||
* ``size_mismatch_info``: the size stored in the object-info is different
|
These errors include ``read_error`` and other similar errors. The
|
||||||
from the one read from OSD.2. The latter is 0.
|
``errors`` ending in ``oi`` indicate a comparison with
|
||||||
|
``selected_object_info``. Examine the ``shards`` array to determine
|
||||||
|
which shard has which error or errors.
|
||||||
|
|
||||||
You can repair the inconsistent placement group by executing::
|
* ``data_digest_mismatch_info``: the digest stored in the ``object-info``
|
||||||
|
is not ``0xffffffff``, which is calculated from the shard read from
|
||||||
|
``OSD.2``
|
||||||
|
* ``size_mismatch_info``: the size stored in the ``object-info`` is
|
||||||
|
different from the size read from ``OSD.2``. The latter is ``0``.
|
||||||
|
|
||||||
ceph pg repair {placement-group-ID}
|
.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
|
||||||
|
inconsistency is likely due to physical storage errors. In cases like this,
|
||||||
|
check the storage used by that OSD.
|
||||||
|
|
||||||
|
Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
|
||||||
|
repair.
|
||||||
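For example (``/dev/sdc`` stands in for whatever device backs the affected OSD):

.. prompt:: bash

   dmesg -T | grep -i error
   smartctl -a /dev/sdc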
|
|
||||||
Which overwrites the `bad` copies with the `authoritative` ones. In most cases,
|
To repair the inconsistent placement group, run a command of the following
|
||||||
Ceph is able to choose authoritative copies from all available replicas using
|
form:
|
||||||
some predefined criteria. But this does not always work. For example, the stored
|
|
||||||
data digest could be missing, and the calculated digest will be ignored when
|
.. prompt:: bash
|
||||||
choosing the authoritative copies. So, please use the above command with caution.
|
|
||||||
|
ceph pg repair {placement-group-ID}
|
||||||
|
|
||||||
|
.. warning:: This command overwrites the "bad" copies with "authoritative"
|
||||||
|
copies. In most cases, Ceph is able to choose authoritative copies from all
|
||||||
|
the available replicas by using some predefined criteria. This, however,
|
||||||
|
does not work in every case. For example, it might be the case that the
|
||||||
|
stored data digest is missing, which means that the calculated digest is
|
||||||
|
ignored when Ceph chooses the authoritative copies. Be aware of this, and
|
||||||
|
use the above command with caution.
|
||||||
|
|
||||||
If ``read_error`` is listed in the ``errors`` attribute of a shard, the
|
|
||||||
inconsistency is likely due to disk errors. You might want to check your disk
|
|
||||||
used by that OSD.
|
|
||||||
|
|
||||||
If you receive ``active + clean + inconsistent`` states periodically due to
|
If you receive ``active + clean + inconsistent`` states periodically due to
|
||||||
clock skew, you may consider configuring your `NTP`_ daemons on your
|
clock skew, consider configuring the `NTP
|
||||||
monitor hosts to act as peers. See `The Network Time Protocol`_ and Ceph
|
<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
|
||||||
`Clock Settings`_ for additional details.
|
hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
|
||||||
|
and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.
|
||||||
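The monitors' own view of clock drift can also be checked directly, which helps
confirm whether skew really is the trigger:

.. prompt:: bash

   ceph time-sync-status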
|
|
||||||
|
|
||||||
Erasure Coded PGs are not active+clean
|
Erasure Coded PGs are not active+clean
|
||||||
======================================
|
======================================
|
||||||
|
|
||||||
When CRUSH fails to find enough OSDs to map to a PG, it will show as a
|
If CRUSH fails to find enough OSDs to map to a PG, it will show as a
|
||||||
``2147483647`` which is ITEM_NONE or ``no OSD found``. For instance::
|
``2147483647`` which is ``ITEM_NONE`` or ``no OSD found``. For example::
|
||||||
|
|
||||||
[2,1,6,0,5,8,2147483647,7,4]
|
[2,1,6,0,5,8,2147483647,7,4]
|
||||||
|
|
||||||
Not enough OSDs
|
Not enough OSDs
|
||||||
---------------
|
---------------
|
||||||
|
|
||||||
If the Ceph cluster only has 8 OSDs and the erasure coded pool needs
|
If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
|
||||||
9, that is what it will show. You can either create another erasure
|
OSDs, the cluster will show "Not enough OSDs". In this case, you can either create
|
||||||
coded pool that requires less OSDs::
|
another erasure coded pool that requires fewer OSDs, by running commands of the
|
||||||
|
following form:
|
||||||
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
ceph osd erasure-code-profile set myprofile k=5 m=3
|
ceph osd erasure-code-profile set myprofile k=5 m=3
|
||||||
ceph osd pool create erasurepool erasure myprofile
|
ceph osd pool create erasurepool erasure myprofile
|
||||||
|
|
||||||
or add a new OSDs and the PG will automatically use them.
|
or add new OSDs, and the PG will automatically use them.
|
||||||
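To compare what the pool's profile requires (``k + m`` OSDs per PG) with what the
cluster actually has (``myprofile`` is the example profile used above):

.. prompt:: bash

   ceph osd erasure-code-profile get myprofile
   ceph osd stat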
|
|
||||||
CRUSH constraints cannot be satisfied
|
CRUSH constraints cannot be satisfied
|
||||||
-------------------------------------
|
-------------------------------------
|
||||||
|
|
||||||
If the cluster has enough OSDs, it is possible that the CRUSH rule
|
If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
|
||||||
imposes constraints that cannot be satisfied. If there are 10 OSDs on
|
constraints that cannot be satisfied. If there are ten OSDs on two hosts and
|
||||||
two hosts and the CRUSH rule requires that no two OSDs from the
|
the CRUSH rule requires that no two OSDs from the same host are used in the
|
||||||
same host are used in the same PG, the mapping may fail because only
|
same PG, the mapping may fail because only two OSDs will be found. Check the
|
||||||
two OSDs will be found. You can check the constraint by displaying ("dumping")
|
constraint by displaying ("dumping") the rule, as shown here:
|
||||||
the rule::
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
|
ceph osd crush rule ls
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
$ ceph osd crush rule ls
|
|
||||||
[
|
[
|
||||||
"replicated_rule",
|
"replicated_rule",
|
||||||
"erasurepool"]
|
"erasurepool"]
|
||||||
@ -535,36 +616,43 @@ the rule::
|
|||||||
{ "op": "emit"}]}
|
{ "op": "emit"}]}
|
||||||
|
|
||||||
|
|
||||||
You can resolve the problem by creating a new pool in which PGs are allowed
|
Resolve this problem by creating a new pool in which PGs are allowed to have
|
||||||
to have OSDs residing on the same host with::
|
OSDs residing on the same host by running the following commands:
|
||||||
|
|
||||||
ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
|
.. prompt:: bash
|
||||||
ceph osd pool create erasurepool erasure myprofile
|
|
||||||
|
ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
|
||||||
|
ceph osd pool create erasurepool erasure myprofile
|
||||||
|
|
||||||
CRUSH gives up too soon
|
CRUSH gives up too soon
|
||||||
-----------------------
|
-----------------------
|
||||||
|
|
||||||
If the Ceph cluster has just enough OSDs to map the PG (for instance a
|
If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
|
||||||
cluster with a total of 9 OSDs and an erasure coded pool that requires
|
with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
|
||||||
9 OSDs per PG), it is possible that CRUSH gives up before finding a
|
PG), it is possible that CRUSH gives up before finding a mapping. This problem
|
||||||
mapping. It can be resolved by:
|
can be resolved by:
|
||||||
|
|
||||||
* lowering the erasure coded pool requirements to use less OSDs per PG
|
* lowering the erasure coded pool requirements to use fewer OSDs per PG (this
|
||||||
(that requires the creation of another pool as erasure code profiles
|
requires the creation of another pool, because erasure code profiles cannot
|
||||||
cannot be dynamically modified).
|
be modified dynamically).
|
||||||
|
|
||||||
* adding more OSDs to the cluster (that does not require the erasure
|
* adding more OSDs to the cluster (this does not require the erasure coded pool
|
||||||
coded pool to be modified, it will become clean automatically)
|
to be modified, because it will become clean automatically)
|
||||||
|
|
||||||
* use a handmade CRUSH rule that tries more times to find a good
|
* using a handmade CRUSH rule that tries more times to find a good mapping.
|
||||||
mapping. This can be done by setting ``set_choose_tries`` to a value
|
This can be modified for an existing CRUSH rule by setting
|
||||||
greater than the default.
|
``set_choose_tries`` to a value greater than the default.
|
||||||
|
|
||||||
You should first verify the problem with ``crushtool`` after
|
First, verify the problem by using ``crushtool`` after extracting the crushmap
|
||||||
extracting the crushmap from the cluster so your experiments do not
|
from the cluster. This ensures that your experiments do not modify the Ceph
|
||||||
modify the Ceph cluster and only work on a local files::
|
cluster and that they operate only on local files:
|
||||||
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
|
ceph osd crush rule dump erasurepool
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
$ ceph osd crush rule dump erasurepool
|
|
||||||
{ "rule_id": 1,
|
{ "rule_id": 1,
|
||||||
"rule_name": "erasurepool",
|
"rule_name": "erasurepool",
|
||||||
"type": 3,
|
"type": 3,
|
||||||
@ -586,44 +674,54 @@ modify the Ceph cluster and only work on a local files::
|
|||||||
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
|
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
|
||||||
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
|
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
|
||||||
|
|
||||||
Where ``--num-rep`` is the number of OSDs the erasure code CRUSH
|
Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
|
||||||
rule needs, ``--rule`` is the value of the ``rule_id`` field
|
needs, ``--rule`` is the value of the ``rule_id`` field that was displayed by
|
||||||
displayed by ``ceph osd crush rule dump``. The test will try mapping
|
``ceph osd crush rule dump``. This test will attempt to map one million values
|
||||||
one million values (i.e. the range defined by ``[--min-x,--max-x]``)
|
(in this example, the range defined by ``[--min-x,--max-x]``) and must display
|
||||||
and must display at least one bad mapping. If it outputs nothing it
|
at least one bad mapping. If this test outputs nothing, all mappings have been
|
||||||
means all mappings are successful and you can stop right there: the
|
successful and you can be assured that the problem with your cluster is not
|
||||||
problem is elsewhere.
|
caused by bad mappings.
|
||||||
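As a reminder, the crushmap mentioned above can be extracted from the live cluster
into a local file (the file name is arbitrary):

.. prompt:: bash

   ceph osd getcrushmap -o crush.map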
|
|
||||||
The CRUSH rule can be edited by decompiling the crush map::
|
Changing the value of set_choose_tries
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
$ crushtool --decompile crush.map > crush.txt
|
#. Decompile the CRUSH map to edit the CRUSH rule by running the following
|
||||||
|
command:
|
||||||
|
|
||||||
and adding the following line to the rule::
|
.. prompt:: bash
|
||||||
|
|
||||||
step set_choose_tries 100
|
crushtool --decompile crush.map > crush.txt
|
||||||
|
|
||||||
The relevant part of the ``crush.txt`` file should look something
|
#. Add the following line to the rule::
|
||||||
like::
|
|
||||||
|
|
||||||
rule erasurepool {
|
step set_choose_tries 100
|
||||||
id 1
|
|
||||||
type erasure
|
|
||||||
step set_chooseleaf_tries 5
|
|
||||||
step set_choose_tries 100
|
|
||||||
step take default
|
|
||||||
step chooseleaf indep 0 type host
|
|
||||||
step emit
|
|
||||||
}
|
|
||||||
|
|
||||||
It can then be compiled and tested again::
|
The relevant part of the ``crush.txt`` file will resemble this::
|
||||||
|
|
||||||
$ crushtool --compile crush.txt -o better-crush.map
|
rule erasurepool {
|
||||||
|
id 1
|
||||||
|
type erasure
|
||||||
|
step set_chooseleaf_tries 5
|
||||||
|
step set_choose_tries 100
|
||||||
|
step take default
|
||||||
|
step chooseleaf indep 0 type host
|
||||||
|
step emit
|
||||||
|
}
|
||||||
|
|
||||||
When all mappings succeed, an histogram of the number of tries that
|
#. Recompile and retest the CRUSH rule:
|
||||||
were necessary to find all of them can be displayed with the
|
|
||||||
``--show-choose-tries`` option of ``crushtool``::
|
|
||||||
|
|
||||||
$ crushtool -i better-crush.map --test --show-bad-mappings \
|
.. prompt:: bash
|
||||||
|
|
||||||
|
crushtool --compile crush.txt -o better-crush.map
|
||||||
|
|
||||||
|
#. When all mappings succeed, display a histogram of the number of tries that
|
||||||
|
were necessary to find all of the mappings by using the
|
||||||
|
``--show-choose-tries`` option of the ``crushtool`` command, as in the
|
||||||
|
following example:
|
||||||
|
|
||||||
|
.. prompt:: bash
|
||||||
|
|
||||||
|
crushtool -i better-crush.map --test --show-bad-mappings \
|
||||||
--show-choose-tries \
|
--show-choose-tries \
|
||||||
--rule 1 \
|
--rule 1 \
|
||||||
--num-rep 9 \
|
--num-rep 9 \
|
||||||
@ -673,14 +771,12 @@ were necessary to find all of them can be displayed with the
|
|||||||
104: 0
|
104: 0
|
||||||
...
|
...
|
||||||
|
|
||||||
It took 11 tries to map 42 PGs, 12 tries to map 44 PGs etc. The highest number of tries is the minimum value of ``set_choose_tries`` that prevents bad mappings (i.e. 103 in the above output because it did not take more than 103 tries for any PG to be mapped).
|
This output indicates that it took eleven tries to map forty-two PGs, twelve
|
||||||
|
tries to map forty-four PGs etc. The highest number of tries is the minimum
|
||||||
|
value of ``set_choose_tries`` that prevents bad mappings (for example,
|
||||||
|
``103`` in the above output, because it did not take more than 103 tries for
|
||||||
|
any PG to be mapped).
|
||||||
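
Once a value of ``set_choose_tries`` that produces no bad mappings has been
found, the adjusted map can be injected into the running cluster. This final
step is not part of the hunk above; it is a hedged follow-up that assumes the
recompiled map is the ``better-crush.map`` file produced earlier:

.. prompt:: bash

   # upload the recompiled CRUSH map; clients and OSDs pick up the new
   # rule with the next osdmap epoch
   ceph osd setcrushmap -i better-crush.map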

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _Placement Groups: ../../operations/placement-groups
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref

@ -476,23 +476,40 @@ commands. ::

Rate Limit Management
=====================

The Ceph Object Gateway makes it possible to set rate limits on users and
buckets. "Rate limit" includes the maximum number of read operations (read
ops) and write operations (write ops) per minute and the number of bytes per
minute that can be written or read per user or per bucket.

Operations that use the ``GET`` method or the ``HEAD`` method in their REST
requests are "read requests". All other requests are "write requests".

Each object gateway tracks per-user metrics separately from bucket metrics.
These metrics are not shared with other gateways. The configured limits should
be divided by the number of active object gateways. For example, if "user A" is
to be limited to 10 ops per minute and there are two object gateways in the
cluster, then the limit on "user A" should be ``5`` (10 ops per minute / 2
RGWs). If the requests are **not** balanced between RGWs, the rate limit might
be underutilized. For example: if the ops limit is ``5`` and there are two
RGWs, **but** the Load Balancer sends load to only one of those RGWs, the
effective limit is 5 ops, because this limit is enforced per RGW. If the rate
limit that has been set for the bucket has been reached but the rate limit that
has been set for the user has not been reached, then the request is cancelled.
The contrary holds as well: if the rate limit that has been set for the user
has been reached but the rate limit that has been set for the bucket has not
been reached, then the request is cancelled.

The accounting of bandwidth happens only after a request has been accepted.
This means that requests will proceed even if the bucket rate limit or user
rate limit is reached during the execution of the request. The RGW keeps track
of a "debt" consisting of bytes used in excess of the configured value; users
or buckets that incur this kind of debt are prevented from sending more
requests until the "debt" has been repaid. The maximum size of the "debt" is
twice the max-read/write-bytes per minute. If "user A" is subject to a 1-byte
read limit per minute and they attempt to GET an object that is 1 GB in size,
the request is allowed to proceed, because accounting happens only after
acceptance. After "user A" has completed this 1 GB operation, RGW blocks the
user's requests for up to two minutes. After this time has elapsed, "user A"
will be able to send ``GET`` requests again.

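As a hedged illustration only (the user ID and the numeric budgets below are
placeholders, not values taken from this document), a per-user limit is
typically defined and then enabled with ``radosgw-admin``:

.. prompt:: bash #

   # define a read/write ops budget for one user (example values)
   radosgw-admin ratelimit set --ratelimit-scope=user --uid=<user-id> \
      --max-read-ops=1024 --max-write-ops=1024
   # limits stay inactive until they are explicitly enabled
   radosgw-admin ratelimit enable --ratelimit-scope=user --uid=<user-id>

The scope-specific options are described below.
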
- **Bucket:** The ``--bucket`` option allows you to specify a rate limit for a
@ -79,7 +79,7 @@ workload with a smaller number of buckets but higher number of objects (hundreds
per bucket you would consider decreasing :confval:`rgw_lc_max_wp_worker` from the default value of 3.

.. note:: When looking to tune either of these specific values please validate the
   current Cluster performance and Ceph Object Gateway utilization before increasing.

Garbage Collection Settings
===========================
@ -97,8 +97,9 @@ To view the queue of objects awaiting garbage collection, execute the following

   radosgw-admin gc list

.. note:: Specify ``--include-all`` to list all entries, including unexpired
   Garbage Collection objects.

Garbage collection is a background activity that may
execute continuously or during times of low loads, depending upon how the
administrator configures the Ceph Object Gateway. By default, the Ceph Object
@ -121,7 +122,9 @@ configuration parameters.

:Tuning Garbage Collection for Delete Heavy Workloads:

As an initial step towards tuning Ceph Garbage Collection to be more
aggressive the following options are suggested to be increased from their
default configuration values::

      rgw_gc_max_concurrent_io = 20
      rgw_gc_max_trim_chunk = 64
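
Only the first of these options are shown in this hunk. As a sketch only (it
assumes a deployment in which the gateways read the centralized configuration
under the ``client.rgw`` section), such values can be applied at runtime with:

.. prompt:: bash #

   # persist the more aggressive GC settings in the cluster configuration
   ceph config set client.rgw rgw_gc_max_concurrent_io 20
   ceph config set client.rgw rgw_gc_max_trim_chunk 64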
@ -270,7 +273,7 @@ to support future methods of scheduling requests.

Currently the scheduler defaults to a throttler which throttles the active
connections to a configured limit. QoS based on mClock is currently in an
*experimental* phase and not recommended for production yet. Current
implementation of *dmclock_client* op queue divides RGW ops on admin, auth
(swift auth, sts) metadata & data requests.

@ -6,38 +6,39 @@ RGW Dynamic Bucket Index Resharding

.. versionadded:: Luminous

A large bucket index can lead to performance problems, which can
be addressed by sharding bucket indexes.
Until Luminous, changing the number of bucket shards (resharding)
needed to be done offline, with RGW services disabled.
Since the Luminous release Ceph has supported online bucket resharding.

Each bucket index shard can handle its entries efficiently up until
reaching a certain threshold. If this threshold is
exceeded the system can suffer from performance issues. The dynamic
resharding feature detects this situation and automatically increases
the number of shards used by a bucket's index, resulting in a
reduction of the number of entries in each shard. This
process is transparent to the user. Writes to the target bucket
are blocked (but reads are not) briefly during the resharding process.

By default dynamic bucket index resharding can only increase the
number of bucket index shards to 1999, although this upper-bound is a
configuration parameter (see Configuration below). When
possible, the process chooses a prime number of shards in order to
spread the number of entries across the bucket index
shards more evenly.

Detection of resharding opportunities runs as a background process
that periodically
scans all buckets. A bucket that requires resharding is added to
a queue. A thread runs in the background and processes the queued
resharding tasks, one at a time and in order.

Multisite
=========

With Ceph releases prior to Reef, the Ceph Object Gateway (RGW) does not support
dynamic resharding in a
multisite environment. For information on dynamic resharding, see
:ref:`Resharding <feature_resharding>` in the RGW multisite documentation.

@ -50,11 +51,11 @@ Enable/Disable dynamic bucket index resharding:

Configuration options that control the resharding process:

- ``rgw_max_objs_per_shard``: maximum number of objects per bucket index shard before resharding is triggered, default: 100000

- ``rgw_max_dynamic_shards``: maximum number of bucket index shards that dynamic resharding can increase to, default: 1999

- ``rgw_reshard_bucket_lock_duration``: duration, in seconds, that writes to the bucket are locked during resharding, default: 360 (i.e., 6 minutes)

- ``rgw_reshard_thread_interval``: maximum time, in seconds, between rounds of resharding queue processing, default: 600 seconds (i.e., 10 minutes)

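As a hedged example only (the value shown simply repeats the default, and the
``client.rgw`` section is an assumption about how the gateways consume the
centralized configuration), these options can be inspected or adjusted with:

.. prompt:: bash #

   # confirm that dynamic resharding is enabled, then set the per-shard threshold
   ceph config get client.rgw rgw_dynamic_resharding
   ceph config set client.rgw rgw_max_objs_per_shard 100000
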
@ -91,9 +92,9 @@ Bucket resharding status

   # radosgw-admin reshard status --bucket <bucket_name>

The output is a JSON array of 3 objects (reshard_status, new_bucket_instance_id, num_shards) per shard.

For example, the output at each dynamic resharding stage is shown below:

``1. Before resharding occurred:``
::
@ -122,7 +123,7 @@ For example, the output at different Dynamic Resharding stages is shown below:
    }
  ]

``3. After resharding completed:``
::

  [
@ -142,7 +143,7 @@ For example, the output at different Dynamic Resharding stages is shown below:
Cancel pending bucket resharding
--------------------------------

Note: Bucket resharding operations cannot be cancelled while executing. ::

   # radosgw-admin reshard cancel --bucket <bucket_name>

@ -153,25 +154,24 @@ Manual immediate bucket resharding

   # radosgw-admin bucket reshard --bucket <bucket_name> --num-shards <new number of shards>

When choosing a number of shards, the administrator must anticipate each
bucket's peak number of objects. Ideally one should aim for no
more than 100000 entries per shard at any given time.

Additionally, bucket index shards that are prime numbers are more effective
in evenly distributing bucket index entries.
For example, 7001 bucket index shards is better than 7000
since the former is prime. A variety of web sites have lists of prime
numbers; search for "list of prime numbers" with your favorite
search engine to locate some web sites.

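For instance (a hedged example only; the bucket name is a placeholder and 101
is chosen simply because it is prime, unlike 100):

.. prompt:: bash #

   radosgw-admin bucket reshard --bucket <bucket_name> --num-shards 101
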
Troubleshooting
===============

Clusters prior to Luminous 12.2.11 and Mimic 13.2.5 left behind stale bucket
instance entries, which were not automatically cleaned up. This issue also affected
LifeCycle policies, which were no longer applied to resharded buckets. Both of
these issues could be worked around by running ``radosgw-admin`` commands.

Stale instance management
-------------------------
@ -183,7 +183,7 @@ List the stale instances in a cluster that are ready to be cleaned up.

   # radosgw-admin reshard stale-instances list

Clean up the stale instances in a cluster. Note: cleanup of these
instances should only be done on a single-site cluster.

::

@ -193,11 +193,12 @@ instances should only be done on a single site cluster.

Lifecycle fixes
---------------

For clusters with resharded instances, it is highly likely that the old
lifecycle processes would have flagged and deleted lifecycle processing as the
bucket instance changed during a reshard. While this is fixed for buckets
deployed on newer Ceph releases (from Mimic 13.2.6 and Luminous 12.2.12),
older buckets that had lifecycle policies and that have undergone
resharding must be fixed manually.

The command to do so is:

@ -206,8 +207,8 @@ The command to do so is:

   # radosgw-admin lc reshard fix --bucket {bucketname}

If the ``--bucket`` argument is not provided, this
command will try to fix lifecycle policies for all the buckets in the cluster.

Object Expirer fixes
--------------------
@ -217,7 +218,7 @@ been dropped from the log pool and never deleted after the bucket was
resharded. This would happen if their expiration time was before the
cluster was upgraded, but if their expiration was after the upgrade
the objects would be correctly handled. To manage these expire-stale
objects, ``radosgw-admin`` provides two subcommands.

Listing:

@ -770,7 +770,13 @@ to a multi-site system, follow these steps:

      radosgw-admin zonegroup rename --rgw-zonegroup default --zonegroup-new-name=<name>
      radosgw-admin zone rename --rgw-zone default --zone-new-name us-east-1 --rgw-zonegroup=<name>

3. Rename the default zonegroup's ``api_name``. Replace ``<name>`` with the zonegroup name:

   .. prompt:: bash #

      radosgw-admin zonegroup modify --api-name=<name> --rgw-zonegroup=<name>

4. Configure the master zonegroup. Replace ``<name>`` with the realm name or
   zonegroup name. Replace ``<fqdn>`` with the fully qualified domain name(s)
   in the zonegroup:

@ -778,7 +784,7 @@ to a multi-site system, follow these steps:

      radosgw-admin zonegroup modify --rgw-realm=<name> --rgw-zonegroup=<name> --endpoints http://<fqdn>:80 --master --default

5. Configure the master zone. Replace ``<name>`` with the realm name, zone
   name, or zonegroup name. Replace ``<fqdn>`` with the fully qualified domain
   name(s) in the zonegroup:

@ -789,7 +795,7 @@ to a multi-site system, follow these steps:

         --access-key=<access-key> --secret=<secret-key> \
         --master --default

6. Create a system user. Replace ``<user-id>`` with the username. Replace
   ``<display-name>`` with a display name. The display name is allowed to
   contain spaces:

@ -800,13 +806,13 @@ to a multi-site system, follow these steps:

         --access-key=<access-key> \
         --secret=<secret-key> --system

7. Commit the updated configuration:

   .. prompt:: bash #

      radosgw-admin period update --commit

8. Restart the Ceph Object Gateway:

   .. prompt:: bash #

@ -1588,7 +1594,7 @@ Zone Features

Some multisite features require support from all zones before they can be enabled. Each zone lists its ``supported_features``, and each zonegroup lists its ``enabled_features``. Before a feature can be enabled in the zonegroup, it must be supported by all of its zones.

On creation of new zones and zonegroups, all known features are supported and some features (see table below) are enabled by default. After upgrading an existing multisite configuration, however, new features must be enabled manually.

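As a hedged sketch of that manual step (the zonegroup name is a placeholder and
``resharding`` is used only as an example feature; check the table below for
what your release actually supports), a feature is typically enabled on the
zonegroup and the change committed:

.. prompt:: bash #

   radosgw-admin zonegroup modify --rgw-zonegroup=<zonegroup-name> --enable-feature=resharding
   radosgw-admin period update --commit
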
Supported Features
------------------
@ -188,8 +188,7 @@ Request parameters:

   specified CA will be used to authenticate the broker. The default CA will
   not be used.
- amqp-exchange: The exchanges must exist and must be able to route messages
  based on topics. This parameter is mandatory.
- amqp-ack-level: No end2end acking is required. Messages may persist in the
  broker before being delivered to their final destinations. Three ack methods
  exist:

@ -13,7 +13,7 @@ Supported Destination
---------------------

AWS supports: **SNS**, **SQS** and **Lambda** as possible destinations (AWS internal destinations).
Currently, we support: **HTTP/S**, **Kafka** and **AMQP**.

We are using the **SNS** ARNs to represent the **HTTP/S**, **Kafka** and **AMQP** destinations.

@ -91,14 +91,8 @@ The following common request header fields are not supported:

+----------------------------+------------+
| Name                       | Type       |
+============================+============+
| **x-amz-id-2**             | Response   |
+----------------------------+------------+

.. _Amazon S3 API: http://docs.aws.amazon.com/AmazonS3/latest/API/APIRest.html
.. _S3 Notification Compatibility: ../s3-notification-compatibility
@ -21,6 +21,7 @@ security fixes.

   :maxdepth: 1
   :hidden:

   Reef (v18.2.*) <reef>
   Quincy (v17.2.*) <quincy>
   Pacific (v16.2.*) <pacific>

@ -58,8 +59,11 @@ receive bug fixes or backports).

Release timeline
----------------

.. ceph_timeline_gantt:: releases.yml reef quincy
.. ceph_timeline:: releases.yml reef quincy

.. _Reef: reef
.. _18.2.0: reef#v18-2-0-reef

.. _Quincy: quincy
.. _17.2.0: quincy#v17-2-0-quincy
551
ceph/doc/releases/reef.rst
Normal file
@ -0,0 +1,551 @@

====
Reef
====

Reef is the 18th stable release of Ceph. It is named after the reef squid
(Sepioteuthis).

v18.2.0 Reef
============

This is the first stable release of Ceph Reef.

.. important::

   We are unable to build Ceph on Debian stable (bookworm) for the 18.2.0
   release because of Debian bug
   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129. We will build as
   soon as this bug is resolved in Debian stable.

   *last updated 2023 Aug 04*

Major Changes from Quincy
--------------------------

Highlights
~~~~~~~~~~

See the relevant sections below for more details on these changes.

* **RADOS:** FileStore is not supported in Reef.
* **RADOS:** RocksDB has been upgraded to version 7.9.2.
* **RADOS:** There have been significant improvements to RocksDB iteration overhead and performance.
* **RADOS:** The ``perf dump`` and ``perf schema`` commands have been deprecated in
  favor of the new ``counter dump`` and ``counter schema`` commands.
* **RADOS:** Cache tiering is now deprecated.
* **RADOS:** A new feature, the "read balancer", is now available, which allows users to balance primary PGs per pool on their clusters.
* **RGW:** Bucket resharding is now supported for multi-site configurations.
* **RGW:** There have been significant improvements to the stability and consistency of multi-site replication.
* **RGW:** Compression is now supported for objects uploaded with Server-Side Encryption.
* **Dashboard:** There is a new Dashboard page with improved layout. Active alerts and some important charts are now displayed inside cards.
* **RBD:** Support for layered client-side encryption has been added.
* **Telemetry**: Users can now opt in to participate in a leaderboard in the telemetry public dashboards.

CephFS
|
||||||
|
~~~~~~
|
||||||
|
|
||||||
|
* CephFS: The ``mds_max_retries_on_remount_failure`` option has been renamed to
|
||||||
|
``client_max_retries_on_remount_failure`` and moved from ``mds.yaml.in`` to
|
||||||
|
``mds-client.yaml.in``. This change was made because the option has always
|
||||||
|
been used only by the MDS client.
|
||||||
|
* CephFS: It is now possible to delete the recovered files in the
|
||||||
|
``lost+found`` directory after a CephFS file system has been recovered in accordance
|
||||||
|
with disaster recovery procedures.
|
||||||
|
* The ``AT_NO_ATTR_SYNC`` macro has been deprecated in favor of the standard
|
||||||
|
``AT_STATX_DONT_SYNC`` macro. The ``AT_NO_ATTR_SYNC`` macro will be removed
|
||||||
|
in the future.
|
||||||
|
|
||||||
|
Dashboard
|
||||||
|
~~~~~~~~~
|
||||||
|
|
||||||
|
* There is a new Dashboard page with improved layout. Active alerts
|
||||||
|
and some important charts are now displayed inside cards.
|
||||||
|
|
||||||
|
* Cephx Auth Management: There is a new section dedicated to listing and
|
||||||
|
managing Ceph cluster users.
|
||||||
|
|
||||||
|
* RGW Server Side Encryption: The SSE-S3 and KMS encryption of rgw buckets can
|
||||||
|
now be configured at the time of bucket creation.
|
||||||
|
|
||||||
|
* RBD Snapshot mirroring: Snapshot mirroring can now be configured through UI.
|
||||||
|
Snapshots can now be scheduled.
|
||||||
|
|
||||||
|
* 1-Click OSD Creation Wizard: OSD creation has been broken into 3 options:
|
||||||
|
|
||||||
|
#. Cost/Capacity Optimized: Use all HDDs
|
||||||
|
|
||||||
|
#. Throughput Optimized: Combine HDDs and SSDs
|
||||||
|
|
||||||
|
#. IOPS Optimized: Use all NVMes
|
||||||
|
|
||||||
|
The current OSD-creation form has been moved to the Advanced section.
|
||||||
|
|
||||||
|
* Centralized Logging: There is now a view that collects all the logs from
|
||||||
|
the Ceph cluster.
|
||||||
|
|
||||||
|
* Accessibility WCAG-AA: Dashboard is WCAG 2.1 level A compliant and therefore
|
||||||
|
improved for blind and visually impaired Ceph users.
|
||||||
|
|
||||||
|
* Monitoring & Alerting
|
||||||
|
|
||||||
|
* Ceph-exporter: Now the performance metrics for Ceph daemons are
|
||||||
|
exported by ceph-exporter, which deploys on each daemon rather than
|
||||||
|
using prometheus exporter. This will reduce performance bottlenecks.
|
||||||
|
|
||||||
|
* Monitoring stacks updated:
|
||||||
|
|
||||||
|
* Prometheus 2.43.0
|
||||||
|
|
||||||
|
* Node-exporter 1.5.0
|
||||||
|
|
||||||
|
* Grafana 9.4.7
|
||||||
|
|
||||||
|
* Alertmanager 0.25.0
|
||||||
|
|
||||||
|
MGR
|
||||||
|
~~~
|
||||||
|
|
||||||
|
* mgr/snap_schedule: The snap-schedule manager module now retains one snapshot
|
||||||
|
less than the number mentioned against the config option
|
||||||
|
``mds_max_snaps_per_dir``. This means that a new snapshot can be created and
|
||||||
|
retained during the next schedule run.
|
||||||
|
* The ``ceph mgr dump`` command now outputs ``last_failure_osd_epoch`` and
|
||||||
|
``active_clients`` fields at the top level. Previously, these fields were
|
||||||
|
output under the ``always_on_modules`` field.
|
||||||
|
|
||||||
|
RADOS
|
||||||
|
~~~~~
|
||||||
|
|
||||||
|
* FileStore is not supported in Reef.
|
||||||
|
* RocksDB has been upgraded to version 7.9.2, which incorporates several
|
||||||
|
performance improvements and features. This is the first release that can
|
||||||
|
tune RocksDB settings per column family, which allows for more granular
|
||||||
|
tunings to be applied to different kinds of data stored in RocksDB. New
|
||||||
|
default settings have been used to optimize performance for most workloads, with a
|
||||||
|
slight penalty in some use cases. This slight penalty is outweighed by large
|
||||||
|
improvements in compactions and write amplification in use cases such as RGW
|
||||||
|
(up to a measured 13.59% improvement in 4K random write IOPs).
|
||||||
|
* Trimming of PGLog dups is now controlled by the size rather than the version.
|
||||||
|
This change fixes the PGLog inflation issue that was happening when the
|
||||||
|
online (in OSD) trimming got jammed after a PG split operation. Also, a new
|
||||||
|
offline mechanism has been added: ``ceph-objectstore-tool`` has a new
|
||||||
|
operation called ``trim-pg-log-dups`` that targets situations in which an OSD
|
||||||
|
is unable to boot because of the inflated dups. In such situations, the "You
|
||||||
|
can be hit by THE DUPS BUG" warning is visible in OSD logs. Relevant tracker:
|
||||||
|
https://tracker.ceph.com/issues/53729
|
||||||
|
* The RADOS Python bindings are now able to process (opt-in) omap keys as bytes
|
||||||
|
objects. This allows interacting with RADOS omap keys that are not
|
||||||
|
decodable as UTF-8 strings.
|
||||||
|
* mClock Scheduler: The mClock scheduler (the default scheduler in Quincy) has
|
||||||
|
undergone significant usability and design improvements to address the slow
|
||||||
|
backfill issue. The following is a list of some important changes:
|
||||||
|
|
||||||
|
* The ``balanced`` profile is set as the default mClock profile because it
|
||||||
|
represents a compromise between prioritizing client I/O and prioritizing
|
||||||
|
recovery I/O. Users can then choose either the ``high_client_ops`` profile
|
||||||
|
to prioritize client I/O or the ``high_recovery_ops`` profile to prioritize
|
||||||
|
recovery I/O.
|
||||||
|
* QoS parameters including ``reservation`` and ``limit`` are now specified in
|
||||||
|
terms of a fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
|
||||||
|
* The cost parameters (``osd_mclock_cost_per_io_usec_*`` and
|
||||||
|
``osd_mclock_cost_per_byte_usec_*``) have been removed. The cost of an
|
||||||
|
operation is now a function of the random IOPS and maximum sequential
|
||||||
|
bandwidth capability of the OSD's underlying device.
|
||||||
|
* Degraded object recovery is given higher priority than misplaced
|
||||||
|
object recovery because degraded objects present a data safety issue that
|
||||||
|
is not present with objects that are merely misplaced. As a result,
|
||||||
|
backfilling operations with the ``balanced`` and ``high_client_ops`` mClock
|
||||||
|
profiles might progress more slowly than in the past, when backfilling
|
||||||
|
operations used the 'WeightedPriorityQueue' (WPQ) scheduler.
|
||||||
|
* The QoS allocations in all the mClock profiles are optimized in
|
||||||
|
accordance with the above fixes and enhancements.
|
||||||
|
* For more details, see:
|
||||||
|
https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/
|
||||||
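  As a hedged illustration of switching between these profiles (not taken from
  the original notes; ``high_recovery_ops`` is one of the documented values):

  .. prompt:: bash #

     # inspect the active profile on one OSD, then change the cluster default
     ceph config show osd.0 osd_mclock_profile
     ceph config set osd osd_mclock_profile high_recovery_ops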
|
* A new feature, the "read balancer", is now available, which allows
|
||||||
|
users to balance primary PGs per pool on their clusters. The read balancer is
|
||||||
|
currently available as an offline option via the ``osdmaptool``. By providing
|
||||||
|
a copy of their osdmap and a pool they want balanced to the ``osdmaptool``, users
|
||||||
|
can generate a preview of optimal primary PG mappings that they can then choose to
|
||||||
|
apply to their cluster. For more details, see
|
||||||
|
https://docs.ceph.com/en/latest/dev/balancer-design/#read-balancing
|
||||||
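  A hedged sketch of that offline workflow (the pool name is a placeholder and
  the ``osdmaptool`` flags should be verified against the linked documentation):

  .. prompt:: bash #

     # export the current osdmap, compute primary remappings for one pool,
     # then run the generated commands
     ceph osd getmap -o om
     osdmaptool om --read out.txt --read-pool <pool-name>
     source out.txt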
|
* The ``active_clients`` array displayed by the ``ceph mgr dump`` command now
|
||||||
|
has a ``name`` field that shows the name of the manager module that
|
||||||
|
registered a RADOS client. Previously, the ``active_clients`` array showed
|
||||||
|
the address of a module's RADOS client, but not the name of the module.
|
||||||
|
* The ``perf dump`` and ``perf schema`` commands have been deprecated in
|
||||||
|
favor of the new ``counter dump`` and ``counter schema`` commands. These new
|
||||||
|
commands add support for labeled perf counters and also emit existing
|
||||||
|
unlabeled perf counters. Some unlabeled perf counters became labeled in this
|
||||||
|
release, and more will be labeled in future releases; such converted perf
|
||||||
|
counters are no longer emitted by the ``perf dump`` and ``perf schema``
|
||||||
|
commands.
|
||||||
|
* Cache tiering is now deprecated.
|
||||||
|
* The SPDK backend for BlueStore can now connect to an NVMeoF target. This
|
||||||
|
is not an officially supported feature.
|
||||||
|
|
||||||
|
RBD
|
||||||
|
~~~
|
||||||
|
|
||||||
|
* The semantics of compare-and-write C++ API (`Image::compare_and_write` and
|
||||||
|
`Image::aio_compare_and_write` methods) now match those of C API. Both
|
||||||
|
compare and write steps operate only on len bytes even if the buffers
|
||||||
|
associated with them are larger. The previous behavior of comparing up to the
|
||||||
|
size of the compare buffer was prone to subtle breakage upon straddling a
|
||||||
|
stripe unit boundary.
|
||||||
|
* The ``compare-and-write`` operation is no longer limited to 512-byte
|
||||||
|
sectors. Assuming proper alignment, it now allows operating on stripe units
|
||||||
|
(4MB by default).
|
||||||
|
* There is a new ``rbd_aio_compare_and_writev`` API method that supports
|
||||||
|
scatter/gather on compare buffers as well as on write buffers. This
|
||||||
|
complements the existing ``rbd_aio_readv`` and ``rbd_aio_writev`` methods.
|
||||||
|
* The ``rbd device unmap`` command now has a ``--namespace`` option.
|
||||||
|
Support for namespaces was added to RBD in Nautilus 14.2.0, and since then it
|
||||||
|
has been possible to map and unmap images in namespaces using the
|
||||||
|
``image-spec`` syntax. However, the corresponding option available in most
|
||||||
|
other commands was missing.
|
||||||
|
* All rbd-mirror daemon perf counters have become labeled and are now
|
||||||
|
emitted only by the new ``counter dump`` and ``counter schema`` commands. As
|
||||||
|
part of the conversion, many were also renamed in order to better
|
||||||
|
disambiguate journal-based and snapshot-based mirroring.
|
||||||
|
* The list-watchers C++ API (`Image::list_watchers`) now clears the passed
|
||||||
|
`std::list` before appending to it. This aligns with the semantics of the C
|
||||||
|
API (``rbd_watchers_list``).
|
||||||
|
* Trailing newline in passphrase files (for example: the
|
||||||
|
``<passphrase-file>`` argument of the ``rbd encryption format`` command and
|
||||||
|
the ``--encryption-passphrase-file`` option of other commands) is no longer
|
||||||
|
stripped.
|
||||||
|
* Support for layered client-side encryption has been added. It is now
|
||||||
|
possible to encrypt cloned images with a distinct encryption format and
|
||||||
|
passphrase, differing from that of the parent image and from that of every
|
||||||
|
other cloned image. The efficient copy-on-write semantics intrinsic to
|
||||||
|
unformatted (regular) cloned images have been retained.
|
||||||
|
|
||||||
|
RGW
|
||||||
|
~~~
|
||||||
|
|
||||||
|
* Bucket resharding is now supported for multi-site configurations. This
|
||||||
|
feature is enabled by default for new deployments. Existing deployments must
|
||||||
|
enable the ``resharding`` feature manually after all zones have upgraded.
|
||||||
|
See https://docs.ceph.com/en/reef/radosgw/multisite/#zone-features for
|
||||||
|
details.
|
||||||
|
* The RGW policy parser now rejects unknown principals by default. If you are
|
||||||
|
mirroring policies between RGW and AWS, you might want to set
|
||||||
|
``rgw_policy_reject_invalid_principals`` to ``false``. This change affects
|
||||||
|
only newly set policies, not policies that are already in place.
|
||||||
|
* RGW's default backend for ``rgw_enable_ops_log`` has changed from ``RADOS``
|
||||||
|
to ``file``. The default value of ``rgw_ops_log_rados`` is now ``false``, and
|
||||||
|
``rgw_ops_log_file_path`` now defaults to
|
||||||
|
``/var/log/ceph/ops-log-$cluster-$name.log``.
|
||||||
|
* RGW's pubsub interface now returns boolean fields using ``bool``. Before this
|
||||||
|
change, ``/topics/<topic-name>`` returned ``stored_secret`` and
|
||||||
|
``persistent`` using a string of ``"true"`` or ``"false"`` that contains
|
||||||
|
enclosing quotation marks. After this change, these fields are returned
|
||||||
|
without enclosing quotation marks so that the fields can be decoded as
|
||||||
|
boolean values in JSON. The same is true of the ``is_truncated`` field
|
||||||
|
returned by ``/subscriptions/<sub-name>``.
|
||||||
|
* RGW's response of ``Action=GetTopicAttributes&TopicArn=<topic-arn>`` REST
|
||||||
|
API now returns ``HasStoredSecret`` and ``Persistent`` as boolean in the JSON
|
||||||
|
string that is encoded in ``Attributes/EndPoint``.
|
||||||
|
* All boolean fields that were previously rendered as strings by the
|
||||||
|
``rgw-admin`` command when the JSON format was used are now rendered as
|
||||||
|
boolean. If your scripts and tools rely on this behavior, update them
|
||||||
|
accordingly. The following is a list of the field names impacted by this
|
||||||
|
change:
|
||||||
|
|
||||||
|
* ``absolute``
|
||||||
|
* ``add``
|
||||||
|
* ``admin``
|
||||||
|
* ``appendable``
|
||||||
|
* ``bucket_key_enabled``
|
||||||
|
* ``delete_marker``
|
||||||
|
* ``exists``
|
||||||
|
* ``has_bucket_info``
|
||||||
|
* ``high_precision_time``
|
||||||
|
* ``index``
|
||||||
|
* ``is_master``
|
||||||
|
* ``is_prefix``
|
||||||
|
* ``is_truncated``
|
||||||
|
* ``linked``
|
||||||
|
* ``log_meta``
|
||||||
|
* ``log_op``
|
||||||
|
* ``pending_removal``
|
||||||
|
* ``read_only``
|
||||||
|
* ``retain_head_object``
|
||||||
|
* ``rule_exist``
|
||||||
|
* ``start_with_full_sync``
|
||||||
|
* ``sync_from_all``
|
||||||
|
* ``syncstopped``
|
||||||
|
* ``system``
|
||||||
|
* ``truncated``
|
||||||
|
* ``user_stats_sync``
|
||||||
|
* The Beast front end's HTTP access log line now uses a new
|
||||||
|
``debug_rgw_access`` configurable. It has the same defaults as
|
||||||
|
``debug_rgw``, but it can be controlled independently.
|
||||||
|
* The pubsub functionality for storing bucket notifications inside Ceph
|
||||||
|
has been removed. As a result, the pubsub zone should not be used anymore.
|
||||||
|
The following have also been removed: the REST operations, ``radosgw-admin``
|
||||||
|
commands for manipulating subscriptions, fetching the notifications, and
|
||||||
|
acking the notifications.
|
||||||
|
|
||||||
|
If the endpoint to which the notifications are sent is down or disconnected,
|
||||||
|
we recommend that you use persistent notifications to guarantee their
|
||||||
|
delivery. If the system that consumes the notifications has to pull them
|
||||||
|
(instead of the notifications being pushed to the system), use an external
|
||||||
|
message bus (for example, RabbitMQ or Kafka) for that purpose.
|
||||||
|
* The serialized format of notification and topics has changed. This means
|
||||||
|
that new and updated topics will be unreadable by old RGWs. We recommend
|
||||||
|
completing the RGW upgrades before creating or modifying any notification
|
||||||
|
topics.
|
||||||
|
* Compression is now supported for objects uploaded with Server-Side
|
||||||
|
Encryption. When both compression and encryption are enabled, compression is
|
||||||
|
applied before encryption. Earlier releases of multisite do not replicate
|
||||||
|
such objects correctly, so all zones must upgrade to Reef before enabling the
|
||||||
|
`compress-encrypted` zonegroup feature: see
|
||||||
|
https://docs.ceph.com/en/reef/radosgw/multisite/#zone-features and note the
|
||||||
|
security considerations.
|
||||||
|
|
||||||
|
Telemetry
|
||||||
|
~~~~~~~~~
|
||||||
|
|
||||||
|
* Users who have opted in to telemetry can also opt in to
|
||||||
|
participate in a leaderboard in the telemetry public dashboards
|
||||||
|
(https://telemetry-public.ceph.com/). In addition, users are now able to
|
||||||
|
provide a description of their cluster that will appear publicly in the
|
||||||
|
leaderboard. For more details, see:
|
||||||
|
https://docs.ceph.com/en/reef/mgr/telemetry/#leaderboard. To see a sample
|
||||||
|
report, run ``ceph telemetry preview``. To opt in to telemetry, run ``ceph
|
||||||
|
telemetry on``. To opt in to the leaderboard, run ``ceph config set mgr
|
||||||
|
mgr/telemetry/leaderboard true``. To add a leaderboard description, run
|
||||||
|
``ceph config set mgr mgr/telemetry/leaderboard_description 'Cluster
|
||||||
|
description'`` (entering your own cluster description).
|
||||||
|
|
||||||
|
Upgrading from Pacific or Quincy
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
Before starting, make sure your cluster is stable and healthy (no down or recovering OSDs). (This is optional, but recommended.) You can disable the autoscaler for all pools during the upgrade using the noautoscale flag.
|
||||||
|
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
You can monitor the progress of your upgrade at each stage with the ``ceph versions`` command, which will tell you what ceph version(s) are running for each type of daemon.
|
||||||
|
|
||||||
|
Upgrading cephadm clusters
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
If your cluster is deployed with cephadm (first introduced in Octopus), then the upgrade process is entirely automated. To initiate the upgrade,
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.0
|
||||||
|
|
||||||
|
The same process is used to upgrade to future minor releases.
|
||||||
|
|
||||||
|
Upgrade progress can be monitored with
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph orch upgrade status
|
||||||
|
|
||||||
|
Upgrade progress can also be monitored with `ceph -s` (which provides a simple progress bar) or more verbosely with
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph -W cephadm
|
||||||
|
|
||||||
|
The upgrade can be paused or resumed with
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph orch upgrade pause # to pause
|
||||||
|
ceph orch upgrade resume # to resume
|
||||||
|
|
||||||
|
or canceled with
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph orch upgrade stop
|
||||||
|
|
||||||
|
Note that canceling the upgrade simply stops the process; there is no ability to downgrade back to Pacific or Quincy.
|
||||||
|
|
||||||
|
Upgrading non-cephadm clusters
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
1. If your cluster is running Pacific (16.2.x) or later, you might choose to first convert it to use cephadm so that the upgrade to Reef is automated (see above).
|
||||||
|
For more information, see https://docs.ceph.com/en/reef/cephadm/adoption/.
|
||||||
|
|
||||||
|
2. If your cluster is running Pacific (16.2.x) or later, systemd unit file names have changed to include the cluster fsid. To find the correct systemd unit file name for your cluster, run following command:
|
||||||
|
|
||||||
|
```
|
||||||
|
systemctl -l | grep <daemon type>
|
||||||
|
```
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```
|
||||||
|
$ systemctl -l | grep mon | grep active
|
||||||
|
ceph-6ce0347c-314a-11ee-9b52-000af7995d6c@mon.f28-h21-000-r630.service loaded active running Ceph mon.f28-h21-000-r630 for 6ce0347c-314a-11ee-9b52-000af7995d6c
|
||||||
|
```
|
||||||
|
|
||||||
|
#. Set the `noout` flag for the duration of the upgrade. (Optional, but recommended.)
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph osd set noout
|
||||||
|
|
||||||
|
#. Upgrade monitors by installing the new packages and restarting the monitor daemons. For example, on each monitor host
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
systemctl restart ceph-mon.target
|
||||||
|
|
||||||
|
Once all monitors are up, verify that the monitor upgrade is complete by looking for the `reef` string in the mon map. The command
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph mon dump | grep min_mon_release
|
||||||
|
|
||||||
|
should report:
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
min_mon_release 18 (reef)
|
||||||
|
|
||||||
|
If it does not, then one or more monitors haven't been upgraded and restarted, and/or the quorum does not include all monitors.
|
||||||
|
|
||||||
|
#. Upgrade `ceph-mgr` daemons by installing the new packages and restarting all manager daemons. For example, on each manager host,
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
systemctl restart ceph-mgr.target
|
||||||
|
|
||||||
|
Verify the `ceph-mgr` daemons are running by checking `ceph -s`:
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph -s
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
...
|
||||||
|
services:
|
||||||
|
mon: 3 daemons, quorum foo,bar,baz
|
||||||
|
mgr: foo(active), standbys: bar, baz
|
||||||
|
...
|
||||||
|
|
||||||
|
#. Upgrade all OSDs by installing the new packages and restarting the ceph-osd daemons on all OSD hosts
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
systemctl restart ceph-osd.target
|
||||||
|
|
||||||
|
#. Upgrade all CephFS MDS daemons. For each CephFS file system,
|
||||||
|
|
||||||
|
#. Disable standby_replay:
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph fs set <fs_name> allow_standby_replay false
|
||||||
|
|
||||||
|
#. If upgrading from Pacific <=16.2.5:
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph config set mon mon_mds_skip_sanity true
|
||||||
|
|
||||||
|
#. Reduce the number of ranks to 1. (Make note of the original number of MDS daemons first if you plan to restore it later.)
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph fs set <fs_name> max_mds 1
|
||||||
|
|
||||||
|
#. Wait for the cluster to deactivate any non-zero ranks by periodically checking the status
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph status
|
||||||
|
|
||||||
|
#. Take all standby MDS daemons offline on the appropriate hosts with
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
systemctl stop ceph-mds@<daemon_name>
|
||||||
|
|
||||||
|
#. Confirm that only one MDS is online and is rank 0 for your FS
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph status
|
||||||
|
|
||||||
|
#. Upgrade the last remaining MDS daemon by installing the new packages and restarting the daemon
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
systemctl restart ceph-mds.target
|
||||||
|
|
||||||
|
#. Restart all standby MDS daemons that were taken offline
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
systemctl start ceph-mds.target
|
||||||
|
|
||||||
|
#. Restore the original value of `max_mds` for the volume
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph fs set <fs_name> max_mds <original_max_mds>
|
||||||
|
|
||||||
|
#. If upgrading from Pacific <=16.2.5 (followup to step 5.2):
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph config set mon mon_mds_skip_sanity false
|
||||||
|
|
||||||
|
#. Upgrade all radosgw daemons by upgrading packages and restarting daemons on all hosts
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
systemctl restart ceph-radosgw.target
|
||||||
|
|
||||||
|
#. Complete the upgrade by disallowing pre-Reef OSDs and enabling all new Reef-only functionality
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph osd require-osd-release reef
|
||||||
|
|
||||||
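   As a quick sanity check (not part of the original steps; the
   ``require_osd_release`` flag appears in the OSD map dump), you can confirm
   that the setting took effect:

   .. prompt:: bash #

      ceph osd dump | grep require_osd_release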
|
#. If you set `noout` at the beginning, be sure to clear it with
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph osd unset noout
|
||||||
|
|
||||||
|
#. Consider transitioning your cluster to use the cephadm deployment and orchestration framework to simplify cluster management and future upgrades. For more information on converting an existing cluster to cephadm, see https://docs.ceph.com/en/reef/cephadm/adoption/.
|
||||||
|
|
||||||
|
Post-upgrade
|
||||||
|
~~~~~~~~~~~~
|
||||||
|
|
||||||
|
#. Verify the cluster is healthy with `ceph health`. If your cluster is running Filestore, and you are upgrading directly from Pacific to Reef, a deprecation warning is expected. This warning can be temporarily muted using the following command
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph health mute OSD_FILESTORE
|
||||||
|
|
||||||
|
#. Consider enabling the `telemetry module <https://docs.ceph.com/en/reef/mgr/telemetry/>`_ to send anonymized usage statistics and crash information to the Ceph upstream developers. To see what would be reported (without actually sending any information to anyone),
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph telemetry preview-all
|
||||||
|
|
||||||
|
If you are comfortable with the data that is reported, you can opt-in to automatically report the high-level cluster metadata with
|
||||||
|
|
||||||
|
.. prompt:: bash #
|
||||||
|
|
||||||
|
ceph telemetry on
|
||||||
|
|
||||||
|
The public dashboard that aggregates Ceph telemetry can be found at https://telemetry-public.ceph.com/.
|
||||||
|
|
||||||
|
Upgrading from pre-Pacific releases (like Octopus)
|
||||||
|
__________________________________________________
|
||||||
|
|
||||||
|
You **must** first upgrade to Pacific (16.2.z) or Quincy (17.2.z) before upgrading to Reef.
|
@ -12,6 +12,11 @@
|
|||||||
# If a version might represent an actual number (e.g. 0.80) quote it.
|
# If a version might represent an actual number (e.g. 0.80) quote it.
|
||||||
#
|
#
|
||||||
releases:
|
releases:
|
||||||
|
reef:
|
||||||
|
target_eol: 2025-08-01
|
||||||
|
releases:
|
||||||
|
- version: 18.2.0
|
||||||
|
released: 2023-08-07
|
||||||
quincy:
|
quincy:
|
||||||
target_eol: 2024-06-01
|
target_eol: 2024-06-01
|
||||||
releases:
|
releases:
|
||||||
@ -29,7 +34,7 @@ releases:
|
|||||||
released: 2022-04-19
|
released: 2022-04-19
|
||||||
|
|
||||||
pacific:
|
pacific:
|
||||||
target_eol: 2023-06-01
|
target_eol: 2023-10-01
|
||||||
releases:
|
releases:
|
||||||
- version: 16.2.11
|
- version: 16.2.11
|
||||||
released: 2023-01-26
|
released: 2023-01-26
|
||||||
|
@ -973,6 +973,15 @@ convention was preferred because it made the documents more readable in a
|
|||||||
command line interface. As of 2023, though, we have no preference for one over
|
command line interface. As of 2023, though, we have no preference for one over
|
||||||
the other. Use whichever convention makes the text easier to read.
|
the other. Use whichever convention makes the text easier to read.
|
||||||
|
|
||||||
|
Using a part of a sentence as a hyperlink, `like this <docs.ceph.com>`_, is
|
||||||
|
discouraged. The convention of writing "See X" is preferred. Here are some
|
||||||
|
preferred formulations:
|
||||||
|
|
||||||
|
#. For more information, see `docs.ceph.com <docs.ceph.com>`_.
|
||||||
|
|
||||||
|
#. See `docs.ceph.com <docs.ceph.com>`_.
|
||||||
|
|
||||||
|
|
||||||
Quirks of ReStructured Text
|
Quirks of ReStructured Text
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
||||||
@ -981,7 +990,8 @@ External Links
|
|||||||
|
|
||||||
.. _external_link_with_inline_text:
|
.. _external_link_with_inline_text:
|
||||||
|
|
||||||
This is the formula for links to addresses external to the Ceph documentation:
|
Use the formula immediately below to render links that direct the reader to
|
||||||
|
addresses external to the Ceph documentation:
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
@ -994,10 +1004,13 @@ This is the formula for links to addresses external to the Ceph documentation:
|
|||||||
|
|
||||||
To link to addresses that are external to the Ceph documentation, include a
|
To link to addresses that are external to the Ceph documentation, include a
|
||||||
space between the inline text and the angle bracket that precedes the
|
space between the inline text and the angle bracket that precedes the
|
||||||
external address. This is precisely the opposite of :ref:`the convention for
|
external address. This is precisely the opposite of the convention for
|
||||||
inline text that links to a location inside the Ceph
|
inline text that links to a location inside the Ceph documentation. See
|
||||||
documentation<internal_link_with_inline_text>`. If this seems inconsistent
|
:ref:`here <internal_link_with_inline_text>` for an exemplar of this
|
||||||
and confusing to you, then you're right. It is inconsistent and confusing.
|
convention.
|
||||||
|
|
||||||
|
If this seems inconsistent and confusing to you, then you're right. It is
|
||||||
|
inconsistent and confusing.
|
||||||
|
|
||||||
See also ":ref:`External Hyperlink Example<start_external_hyperlink_example>`".
|
See also ":ref:`External Hyperlink Example<start_external_hyperlink_example>`".
|
||||||
|
|
||||||
|
@ -1,66 +1,83 @@
|
|||||||
.. _hardware-recommendations:
|
.. _hardware-recommendations:
|
||||||
|
|
||||||
==========================
|
==========================
|
||||||
Hardware Recommendations
|
Hardware Recommendations
|
||||||
==========================
|
==========================
|
||||||
|
|
||||||
Ceph was designed to run on commodity hardware, which makes building and
|
Ceph is designed to run on commodity hardware, which makes building and
|
||||||
maintaining petabyte-scale data clusters economically feasible.
|
maintaining petabyte-scale data clusters flexible and economically feasible.
|
||||||
When planning out your cluster hardware, you will need to balance a number
|
When planning your cluster's hardware, you will need to balance a number
|
||||||
of considerations, including failure domains and potential performance
|
of considerations, including failure domains, cost, and performance.
|
||||||
issues. Hardware planning should include distributing Ceph daemons and
|
Hardware planning should include distributing Ceph daemons and
|
||||||
other processes that use Ceph across many hosts. Generally, we recommend
|
other processes that use Ceph across many hosts. Generally, we recommend
|
||||||
running Ceph daemons of a specific type on a host configured for that type
|
running Ceph daemons of a specific type on a host configured for that type
|
||||||
of daemon. We recommend using other hosts for processes that utilize your
|
of daemon. We recommend using separate hosts for processes that utilize your
|
||||||
data cluster (e.g., OpenStack, CloudStack, etc).
|
data cluster (e.g., OpenStack, CloudStack, Kubernetes, etc).
|
||||||
|
|
||||||
|
The requirements of one Ceph cluster are not the same as the requirements of
|
||||||
|
another, but below are some general guidelines.
|
||||||
|
|
||||||
.. tip:: Check out the `Ceph blog`_ too.
|
.. tip:: Check out the `Ceph blog`_ too.
|
||||||
|
|
||||||
|
|
||||||
CPU
|
CPU
|
||||||
===
|
===
|
||||||
|
|
||||||
CephFS metadata servers (MDS) are CPU-intensive. CephFS metadata servers (MDS)
|
CephFS Metadata Servers (MDS) are CPU-intensive. They are
|
||||||
should therefore have quad-core (or better) CPUs and high clock rates (GHz). OSD
|
single-threaded and perform best with CPUs with a high clock rate (GHz). MDS
|
||||||
nodes need enough processing power to run the RADOS service, to calculate data
|
servers do not need a large number of CPU cores unless they are also hosting other
|
||||||
|
services, such as SSD OSDs for the CephFS metadata pool.
|
||||||
|
OSD nodes need enough processing power to run the RADOS service, to calculate data
|
||||||
placement with CRUSH, to replicate data, and to maintain their own copies of the
|
placement with CRUSH, to replicate data, and to maintain their own copies of the
|
||||||
cluster map.
|
cluster map.
|
||||||
|
|
||||||
The requirements of one Ceph cluster are not the same as the requirements of
|
With earlier releases of Ceph, we would make hardware recommendations based on
|
||||||
another, but here are some general guidelines.
|
the number of cores per OSD, but this cores-per-OSD metric is no longer as
|
||||||
|
useful a metric as the number of cycles per IOP and the number of IOPS per OSD.
|
||||||
In earlier versions of Ceph, we would make hardware recommendations based on
|
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
|
||||||
the number of cores per OSD, but this cores-per-OSD metric is no longer as
|
|
||||||
useful a metric as the number of cycles per IOP and the number of IOPs per OSD.
|
|
||||||
For example, for NVMe drives, Ceph can easily utilize five or six cores on real
|
|
||||||
clusters and up to about fourteen cores on single OSDs in isolation. So cores
|
clusters and up to about fourteen cores on single OSDs in isolation. So cores
|
||||||
per OSD are no longer as pressing a concern as they were. When selecting
|
per OSD are no longer as pressing a concern as they were. When selecting
|
||||||
hardware, select for IOPs per core.
|
hardware, select for IOPS per core.
|
||||||
|
|
||||||
Monitor nodes and manager nodes have no heavy CPU demands and require only
|
.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
|
||||||
modest processors. If your host machines will run CPU-intensive processes in
|
is enabled. Hyperthreading is usually beneficial for Ceph servers.
|
||||||
|
|
||||||
|
Monitor nodes and Manager nodes do not have heavy CPU demands and require only
|
||||||
|
modest processors. If your hosts will run CPU-intensive processes in
|
||||||
addition to Ceph daemons, make sure that you have enough processing power to
|
addition to Ceph daemons, make sure that you have enough processing power to
|
||||||
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
|
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
|
||||||
one such example of a CPU-intensive process.) We recommend that you run
|
one example of a CPU-intensive process.) We recommend that you run
|
||||||
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
|
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
|
||||||
not your monitor and manager nodes) in order to avoid resource contention.
|
not your Monitor and Manager nodes) in order to avoid resource contention.
|
||||||
|
If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
|
||||||
|
with your Mon and Manager services if the nodes have sufficient resources.
|
||||||
|
|
||||||
RAM
|
RAM
|
||||||
===
|
===
|
||||||
|
|
||||||
Generally, more RAM is better. Monitor / manager nodes for a modest cluster
|
Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
|
||||||
might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
|
might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
|
||||||
is a reasonable target. There is a memory target for BlueStore OSDs that
|
is advised.
|
||||||
|
|
||||||
|
.. tip:: When we speak of RAM and storage requirements, we often describe
|
||||||
|
the needs of a single daemon of a given type. A given server as
|
||||||
|
a whole will thus need at least the sum of the needs of the
|
||||||
|
daemons that it hosts as well as resources for logs and other operating
|
||||||
|
system components. Keep in mind that a server's need for RAM
|
||||||
|
and storage will be greater at startup and when components
|
||||||
|
fail or are added and the cluster rebalances. In other words,
|
||||||
|
allow headroom past what you might see used during a calm period
|
||||||
|
on a small initial cluster footprint.
|
||||||
|
|
||||||
|
There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
|
||||||
defaults to 4GB. Factor in a prudent margin for the operating system and
|
defaults to 4GB. Factor in a prudent margin for the operating system and
|
||||||
administrative tasks (like monitoring and metrics) as well as increased
|
administrative tasks (like monitoring and metrics) as well as increased
|
||||||
consumption during recovery: provisioning ~8GB per BlueStore OSD
|
consumption during recovery: provisioning ~8GB *per BlueStore OSD* is thus
|
||||||
is advised.
|
advised.
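As an illustrative sketch (the value is an example only, not a recommendation
for every cluster), the target can be raised cluster-wide with a command of
this form:

.. prompt:: bash #

   ceph config set osd osd_memory_target 8589934592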
|
||||||
|
|
||||||
Monitors and managers (ceph-mon and ceph-mgr)
|
Monitors and managers (ceph-mon and ceph-mgr)
|
||||||
---------------------------------------------
|
---------------------------------------------
|
||||||
|
|
||||||
Monitor and manager daemon memory usage generally scales with the size of the
|
Monitor and manager daemon memory usage scales with the size of the
|
||||||
cluster. Note that at boot-time and during topology changes and recovery these
|
cluster. Note that at boot-time and during topology changes and recovery these
|
||||||
daemons will need more RAM than they do during steady-state operation, so plan
|
daemons will need more RAM than they do during steady-state operation, so plan
|
||||||
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
|
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
|
||||||
@ -75,8 +92,8 @@ tuning the following settings:
|
|||||||
Metadata servers (ceph-mds)
|
Metadata servers (ceph-mds)
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
||||||
The metadata daemon memory utilization depends on how much memory its cache is
|
CephFS metadata daemon memory utilization depends on the configured size of
|
||||||
configured to consume. We recommend 1 GB as a minimum for most systems. See
|
its cache. We recommend 1 GB as a minimum for most systems. See
|
||||||
:confval:`mds_cache_memory_limit`.
|
:confval:`mds_cache_memory_limit`.
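A minimal sketch of raising the cache limit (the 4GB figure is illustrative):

.. prompt:: bash #

   ceph config set mds mds_cache_memory_limit 4294967296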
|
||||||
|
|
||||||
|
|
||||||
@ -88,23 +105,24 @@ operating system's page cache. In Bluestore you can adjust the amount of memory
|
|||||||
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
|
that the OSD attempts to consume by changing the :confval:`osd_memory_target`
|
||||||
configuration option.
|
configuration option.
|
||||||
|
|
||||||
- Setting the :confval:`osd_memory_target` below 2GB is typically not
|
- Setting the :confval:`osd_memory_target` below 2GB is not
|
||||||
recommended (Ceph may fail to keep the memory consumption under 2GB and
|
recommended. Ceph may fail to keep the memory consumption under 2GB and
|
||||||
this may cause extremely slow performance).
|
extremely slow performance is likely.
|
||||||
|
|
||||||
- Setting the memory target between 2GB and 4GB typically works but may result
|
- Setting the memory target between 2GB and 4GB typically works but may result
|
||||||
in degraded performance: metadata may be read from disk during IO unless the
|
in degraded performance: metadata may need to be read from disk during IO
|
||||||
active data set is relatively small.
|
unless the active data set is relatively small.
|
||||||
|
|
||||||
- 4GB is the current default :confval:`osd_memory_target` size. This default
|
- 4GB is the current default value for :confval:`osd_memory_target`. This default
|
||||||
was chosen for typical use cases, and is intended to balance memory
|
was chosen for typical use cases, and is intended to balance RAM cost and
|
||||||
requirements and OSD performance.
|
OSD performance.
|
||||||
|
|
||||||
- Setting the :confval:`osd_memory_target` higher than 4GB can improve
|
- Setting the :confval:`osd_memory_target` higher than 4GB can improve
|
||||||
performance when there are many (small) objects or when large (256GB/OSD
|
performance when there are many (small) objects or when large (256GB/OSD
|
||||||
or more) data sets are processed.
|
or more) data sets are processed. This is especially true with fast
|
||||||
|
NVMe OSDs.
|
||||||
|
|
||||||
.. important:: OSD memory autotuning is "best effort". Although the OSD may
|
.. important:: OSD memory management is "best effort". Although the OSD may
|
||||||
unmap memory to allow the kernel to reclaim it, there is no guarantee that
|
unmap memory to allow the kernel to reclaim it, there is no guarantee that
|
||||||
the kernel will actually reclaim freed memory within a specific time
|
the kernel will actually reclaim freed memory within a specific time
|
||||||
frame. This applies especially in older versions of Ceph, where transparent
|
frame. This applies especially in older versions of Ceph, where transparent
|
||||||
@ -113,14 +131,19 @@ configuration option.
|
|||||||
pages at the application level to avoid this, but that does not
|
pages at the application level to avoid this, but that does not
|
||||||
guarantee that the kernel will immediately reclaim unmapped memory. The OSD
|
guarantee that the kernel will immediately reclaim unmapped memory. The OSD
|
||||||
may still at times exceed its memory target. We recommend budgeting
|
may still at times exceed its memory target. We recommend budgeting
|
||||||
approximately 20% extra memory on your system to prevent OSDs from going OOM
|
at least 20% extra memory on your system to prevent OSDs from going OOM
|
||||||
(**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
|
(**O**\ut **O**\f **M**\emory) during temporary spikes or due to delay in
|
||||||
the kernel reclaiming freed pages. That 20% value might be more or less than
|
the kernel reclaiming freed pages. That 20% value might be more or less than
|
||||||
needed, depending on the exact configuration of the system.
|
needed, depending on the exact configuration of the system.
|
||||||
|
|
||||||
When using the legacy FileStore back end, the page cache is used for caching
|
.. tip:: Configuring the operating system with swap to provide additional
|
||||||
data, so no tuning is normally needed. When using the legacy FileStore backend,
|
virtual memory for daemons is not advised for modern systems. Doing so
|
||||||
the OSD memory consumption is related to the number of PGs per daemon in the
|
may result in lower performance, and your Ceph cluster may well be
|
||||||
|
happier with a daemon that crashes vs one that slows to a crawl.
|
||||||
|
|
||||||
|
When using the legacy FileStore back end, the OS page cache was used for caching
|
||||||
|
data, so tuning was not normally needed. When using the legacy FileStore backend,
|
||||||
|
the OSD memory consumption was related to the number of PGs per daemon in the
|
||||||
system.
|
system.
|
||||||
|
|
||||||
|
|
||||||
@ -130,13 +153,34 @@ Data Storage
|
|||||||
Plan your data storage configuration carefully. There are significant cost and
|
Plan your data storage configuration carefully. There are significant cost and
|
||||||
performance tradeoffs to consider when planning for data storage. Simultaneous
|
performance tradeoffs to consider when planning for data storage. Simultaneous
|
||||||
OS operations and simultaneous requests from multiple daemons for read and
|
OS operations and simultaneous requests from multiple daemons for read and
|
||||||
write operations against a single drive can slow performance.
|
write operations against a single drive can impact performance.
|
||||||
|
|
||||||
|
OSDs require substantial storage drive space for RADOS data. We recommend a
|
||||||
|
minimum drive size of 1 terabyte. OSD drives much smaller than one terabyte
|
||||||
|
use a significant fraction of their capacity for metadata, and drives smaller
|
||||||
|
than 100 gigabytes will not be effective at all.
|
||||||
|
|
||||||
|
It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
|
||||||
|
minimum, Ceph Monitor and Ceph Manager hosts, as well as CephFS Metadata Server
|
||||||
|
metadata pools and Ceph Object Gateway (RGW) index pools, even if HDDs are to
|
||||||
|
be provisioned for bulk OSD data.
|
||||||
|
|
||||||
|
To get the best performance out of Ceph, provision the following on separate
|
||||||
|
drives:
|
||||||
|
|
||||||
|
* The operating systems
|
||||||
|
* OSD data
|
||||||
|
* BlueStore WAL+DB
|
||||||
|
|
||||||
|
For more
|
||||||
|
information on how to effectively use a mix of fast drives and slow drives in
|
||||||
|
your Ceph cluster, see the `block and block.db`_ section of the Bluestore
|
||||||
|
Configuration Reference.
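As a rough sketch (the device paths are hypothetical, and cephadm-managed
clusters would express the same layout in an OSD service spec instead), a
manually prepared OSD that keeps its DB/WAL on a faster device might be
created with:

.. prompt:: bash #

   ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1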
|
||||||
|
|
||||||
Hard Disk Drives
|
Hard Disk Drives
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
OSDs should have plenty of storage drive space for object data. We recommend a
|
Consider carefully the cost-per-gigabyte advantage
|
||||||
minimum disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage
|
|
||||||
of larger disks. We recommend dividing the price of the disk drive by the
|
of larger disks. We recommend dividing the price of the disk drive by the
|
||||||
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
|
number of gigabytes to arrive at a cost per gigabyte, because larger drives may
|
||||||
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
|
have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte
|
||||||
@ -146,11 +190,10 @@ per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the
|
|||||||
1 terabyte disks would generally increase the cost per gigabyte by
|
1 terabyte disks would generally increase the cost per gigabyte by
|
||||||
40%--rendering your cluster substantially less cost efficient.
|
40%--rendering your cluster substantially less cost efficient.
|
||||||
|
|
||||||
.. tip:: Running multiple OSDs on a single SAS / SATA drive
|
.. tip:: Hosting multiple OSDs on a single SAS / SATA HDD
|
||||||
is **NOT** a good idea. NVMe drives, however, can achieve
|
is **NOT** a good idea.
|
||||||
improved performance by being split into two or more OSDs.
|
|
||||||
|
|
||||||
.. tip:: Running an OSD and a monitor or a metadata server on a single
|
.. tip:: Hosting an OSD with monitor, manager, or MDS data on a single
|
||||||
drive is also **NOT** a good idea.
|
drive is also **NOT** a good idea.
|
||||||
|
|
||||||
.. tip:: With spinning disks, the SATA and SAS interface increasingly
|
.. tip:: With spinning disks, the SATA and SAS interface increasingly
|
||||||
@ -162,35 +205,36 @@ Storage drives are subject to limitations on seek time, access time, read and
|
|||||||
write times, as well as total throughput. These physical limitations affect
|
write times, as well as total throughput. These physical limitations affect
|
||||||
overall system performance--especially during recovery. We recommend using a
|
overall system performance--especially during recovery. We recommend using a
|
||||||
dedicated (ideally mirrored) drive for the operating system and software, and
|
dedicated (ideally mirrored) drive for the operating system and software, and
|
||||||
one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above).
|
one drive for each Ceph OSD Daemon you run on the host.
|
||||||
Many "slow OSD" issues (when they are not attributable to hardware failure)
|
Many "slow OSD" issues (when they are not attributable to hardware failure)
|
||||||
arise from running an operating system and multiple OSDs on the same drive.
|
arise from running an operating system and multiple OSDs on the same drive.
|
||||||
|
Also be aware that today's 22TB HDD uses the same SATA interface as a
|
||||||
|
3TB HDD from ten years ago: more than seven times the data to squeeze
|
||||||
|
through the same interface. For this reason, when using HDDs for
|
||||||
|
OSDs, drives larger than 8TB may be best suited for storage of large
|
||||||
|
files / objects that are not at all performance-sensitive.
|
||||||
|
|
||||||
It is technically possible to run multiple Ceph OSD Daemons per SAS / SATA
|
|
||||||
drive, but this will lead to resource contention and diminish overall
|
|
||||||
throughput.
|
|
||||||
|
|
||||||
To get the best performance out of Ceph, run the following on separate drives:
|
|
||||||
(1) operating systems, (2) OSD data, and (3) BlueStore db. For more
|
|
||||||
information on how to effectively use a mix of fast drives and slow drives in
|
|
||||||
your Ceph cluster, see the `block and block.db`_ section of the Bluestore
|
|
||||||
Configuration Reference.
|
|
||||||
|
|
||||||
Solid State Drives
|
Solid State Drives
|
||||||
------------------
|
------------------
|
||||||
|
|
||||||
Ceph performance can be improved by using solid-state drives (SSDs). This
|
Ceph performance is much improved when using solid-state drives (SSDs). This
|
||||||
reduces random access time and reduces latency while accelerating throughput.
|
reduces random access time and reduces latency while increasing throughput.
|
||||||
|
|
||||||
SSDs cost more per gigabyte than do hard disk drives, but SSDs often offer
|
SSDs cost more per gigabyte than do HDDs but SSDs often offer
|
||||||
access times that are, at a minimum, 100 times faster than hard disk drives.
|
access times that are, at a minimum, 100 times faster than HDDs.
|
||||||
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
|
SSDs avoid hotspot issues and bottleneck issues within busy clusters, and
|
||||||
they may offer better economics when TCO is evaluated holistically.
|
they may offer better economics when TCO is evaluated holistically. Notably,
|
||||||
|
the amortized drive cost for a given number of IOPS is much lower with SSDs
|
||||||
|
than with HDDs. SSDs do not suffer rotational or seek latency and in addition
|
||||||
|
to improved client performance, they substantially improve the speed and
|
||||||
|
client impact of cluster changes including rebalancing when OSDs or Monitors
|
||||||
|
are added, removed, or fail.
|
||||||
|
|
||||||
SSDs do not have moving mechanical parts, so they are not necessarily subject
|
SSDs do not have moving mechanical parts, so they are not subject
|
||||||
to the same types of limitations as hard disk drives. SSDs do have significant
|
to many of the limitations of HDDs. SSDs do have significant
|
||||||
limitations though. When evaluating SSDs, it is important to consider the
|
limitations though. When evaluating SSDs, it is important to consider the
|
||||||
performance of sequential reads and writes.
|
performance of sequential and random reads and writes.
|
||||||
|
|
||||||
.. important:: We recommend exploring the use of SSDs to improve performance.
|
.. important:: We recommend exploring the use of SSDs to improve performance.
|
||||||
However, before making a significant investment in SSDs, we **strongly
|
However, before making a significant investment in SSDs, we **strongly
|
||||||
@ -198,16 +242,36 @@ performance of sequential reads and writes.
|
|||||||
SSD in a test configuration in order to gauge performance.
|
SSD in a test configuration in order to gauge performance.
|
||||||
|
|
||||||
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
|
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
|
||||||
Acceptable IOPS are not the only factor to consider when selecting an SSD for
|
Acceptable IOPS are not the only factor to consider when selecting SSDs for
|
||||||
use with Ceph.
|
use with Ceph. Bargain SSDs are often a false economy: they may experience
|
||||||
|
"cliffing", which means that after an initial burst, sustained performance
|
||||||
|
declines considerably once a limited cache fills. Consider also durability:
|
||||||
|
a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
|
||||||
|
OSDs dedicated to certain types of sequentially-written read-mostly data, but
|
||||||
|
is not a good choice for Ceph Monitor duty. Enterprise-class SSDs are best
|
||||||
|
for Ceph: they almost always feature power loss protection (PLP) and do
|
||||||
|
not suffer the dramatic cliffing that client (desktop) models may experience.
|
||||||
|
|
||||||
SSDs have historically been cost prohibitive for object storage, but emerging
|
When using a single SSD (or mirrored pair of SSDs) for both operating system boot
|
||||||
QLC drives are closing the gap, offering greater density with lower power
|
and Ceph Monitor / Manager purposes, a minimum capacity of 256GB is advised
|
||||||
consumption and less power spent on cooling. HDD OSDs may see a significant
|
and at least 480GB is recommended. A drive model rated at 1+ DWPD (or the
|
||||||
performance improvement by offloading WAL+DB onto an SSD.
|
equivalent in TBW (TeraBytes Written)) is suggested. However, for a given write
|
||||||
|
workload, a larger drive than technically required will provide more endurance
|
||||||
|
because it effectively has greater overprovisioning. We stress that
|
||||||
|
enterprise-class drives are best for production use, as they feature power
|
||||||
|
loss protection and increased durability compared to client (desktop) SKUs
|
||||||
|
that are intended for much lighter and intermittent duty cycles.
|
||||||
|
|
||||||
To get a better sense of the factors that determine the cost of storage, you
|
SSDs have historically been cost prohibitive for object storage, but
|
||||||
might use the `Storage Networking Industry Association's Total Cost of
|
QLC SSDs are closing the gap, offering greater density with lower power
|
||||||
|
consumption and less power spent on cooling. Also, HDD OSDs may see a
|
||||||
|
significant write latency improvement by offloading WAL+DB onto an SSD.
|
||||||
|
Many Ceph OSD deployments do not require an SSD with greater endurance than
|
||||||
|
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
|
||||||
|
often overkill for this purpose and cost significantly more.
|
||||||
|
|
||||||
|
To get a better sense of the factors that determine the total cost of storage,
|
||||||
|
you might use the `Storage Networking Industry Association's Total Cost of
|
||||||
Ownership calculator`_
|
Ownership calculator`_
|
||||||
|
|
||||||
Partition Alignment
|
Partition Alignment
|
||||||
@ -222,11 +286,11 @@ alignment and example commands that show how to align partitions properly, see
|
|||||||
CephFS Metadata Segregation
|
CephFS Metadata Segregation
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
One way that Ceph accelerates CephFS file system performance is by segregating
|
One way that Ceph accelerates CephFS file system performance is by separating
|
||||||
the storage of CephFS metadata from the storage of the CephFS file contents.
|
the storage of CephFS metadata from the storage of the CephFS file contents.
|
||||||
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
|
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
|
||||||
have to create a pool for CephFS metadata, but you can create a CRUSH map
|
have to manually create a pool for CephFS metadata, but you can create a CRUSH map
|
||||||
hierarchy for your CephFS metadata pool that points only to SSD storage media.
|
hierarchy for your CephFS metadata pool that includes only SSD storage media.
|
||||||
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.
|
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.
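A brief sketch of such a setup (the rule and pool names are illustrative):

.. prompt:: bash #

   ceph osd crush rule create-replicated metadata-ssd default host ssd
   ceph osd pool set cephfs.mycephfs.meta crush_rule metadata-ssd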
|
||||||
|
|
||||||
|
|
||||||
@ -237,8 +301,20 @@ Disk controllers (HBAs) can have a significant impact on write throughput.
|
|||||||
Carefully consider your selection of HBAs to ensure that they do not create a
|
Carefully consider your selection of HBAs to ensure that they do not create a
|
||||||
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
|
performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency
|
||||||
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
|
than simpler "JBOD" (IT) mode HBAs. The RAID SoC, write cache, and battery
|
||||||
backup can substantially increase hardware and maintenance costs. Some RAID
|
backup can substantially increase hardware and maintenance costs. Many RAID
|
||||||
HBAs can be configured with an IT-mode "personality".
|
HBAs can be configured with an IT-mode "personality" or "JBOD mode" for
|
||||||
|
streamlined operation.
|
||||||
|
|
||||||
|
You do not need an RoC (RAID-capable) HBA. ZFS or Linux MD software mirroring
|
||||||
|
serves well for boot volume durability. When using SAS or SATA data drives,
|
||||||
|
forgoing HBA RAID capabilities can reduce the gap between HDD and SSD
|
||||||
|
media cost. Moreover, when using NVMe SSDs, you do not need *any* HBA. This
|
||||||
|
additionally reduces the HDD vs SSD cost gap when the system as a whole is
|
||||||
|
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
|
||||||
|
battery backup (BBU or supercapacitor) can easily exceed 1000 US
|
||||||
|
dollars even after discounts - a sum that goes a long way toward SSD cost parity.
|
||||||
|
An HBA-free system may also cost hundreds of US dollars less every year if one
|
||||||
|
purchases an annual maintenance contract or extended warranty.
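A minimal sketch of such a boot/OS mirror with Linux MD (the device and array
names are hypothetical):

.. prompt:: bash #

   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2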
|
||||||
|
|
||||||
.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
|
.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
|
||||||
performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
|
performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
|
||||||
@ -248,10 +324,10 @@ HBAs can be configured with an IT-mode "personality".
|
|||||||
Benchmarking
|
Benchmarking
|
||||||
------------
|
------------
|
||||||
|
|
||||||
BlueStore opens block devices in O_DIRECT and uses fsync frequently to ensure
|
BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
|
||||||
that data is safely persisted to media. You can evaluate a drive's low-level
|
frequently to ensure that data is safely persisted to media. You can evaluate a
|
||||||
write performance using ``fio``. For example, 4kB random write performance is
|
drive's low-level write performance using ``fio``. For example, 4kB random write
|
||||||
measured as follows:
|
performance is measured as follows:
|
||||||
|
|
||||||
.. code-block:: console
|
.. code-block:: console
|
||||||
|
|
||||||
@ -261,6 +337,7 @@ Write Caches
|
|||||||
------------
|
------------
|
||||||
|
|
||||||
Enterprise SSDs and HDDs normally include power loss protection features which
|
Enterprise SSDs and HDDs normally include power loss protection features which
|
||||||
|
ensure data durability when power is lost while operating, and
|
||||||
use multi-level caches to speed up direct or synchronous writes. These devices
|
use multi-level caches to speed up direct or synchronous writes. These devices
|
||||||
can be toggled between two caching modes -- a volatile cache flushed to
|
can be toggled between two caching modes -- a volatile cache flushed to
|
||||||
persistent media with fsync, or a non-volatile cache written synchronously.
|
persistent media with fsync, or a non-volatile cache written synchronously.
|
||||||
@ -269,9 +346,9 @@ These two modes are selected by either "enabling" or "disabling" the write
|
|||||||
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
|
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
|
||||||
"write back" mode, and when disabled, it uses "write through".
|
"write back" mode, and when disabled, it uses "write through".
|
||||||
|
|
||||||
The default configuration (normally caching enabled) may not be optimal, and
|
The default configuration (usually: caching is enabled) may not be optimal, and
|
||||||
OSD performance may be dramatically increased in terms of increased IOPS and
|
OSD performance may be dramatically increased in terms of increased IOPS and
|
||||||
decreased commit_latency by disabling the write cache.
|
decreased commit latency by disabling this write cache.
|
||||||
|
|
||||||
Users are therefore encouraged to benchmark their devices with ``fio`` as
|
Users are therefore encouraged to benchmark their devices with ``fio`` as
|
||||||
described earlier and persist the optimal cache configuration for their
|
described earlier and persist the optimal cache configuration for their
|
||||||
@ -319,11 +396,11 @@ The write cache can be disabled with those same tools:
|
|||||||
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
|
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
|
||||||
Write cache disabled
|
Write cache disabled
|
||||||
|
|
||||||
Normally, disabling the cache using ``hdparm``, ``sdparm``, or ``smartctl``
|
In most cases, disabling this cache using ``hdparm``, ``sdparm``, or ``smartctl``
|
||||||
results in the cache_type changing automatically to "write through". If this is
|
results in the cache_type changing automatically to "write through". If this is
|
||||||
not the case, you can try setting it directly as follows. (Users should note
|
not the case, you can try setting it directly as follows. (Users should ensure
|
||||||
that setting cache_type also correctly persists the caching mode of the device
|
that setting cache_type also correctly persists the caching mode of the device
|
||||||
until the next reboot):
|
until the next reboot as some drives require this to be repeated at every boot):
|
||||||
|
|
||||||
.. code-block:: console
|
.. code-block:: console
|
||||||
|
|
||||||
@ -367,13 +444,13 @@ until the next reboot):
|
|||||||
Additional Considerations
|
Additional Considerations
|
||||||
-------------------------
|
-------------------------
|
||||||
|
|
||||||
You typically will run multiple OSDs per host, but you should ensure that the
|
Ceph operators typically provision multiple OSDs per host, but you should
|
||||||
aggregate throughput of your OSD drives doesn't exceed the network bandwidth
|
ensure that the aggregate throughput of your OSD drives doesn't exceed the
|
||||||
required to service a client's need to read or write data. You should also
|
network bandwidth required to service a client's read and write operations.
|
||||||
consider what percentage of the overall data the cluster stores on each host. If
|
You should also consider each host's percentage of the cluster's overall capacity. If
|
||||||
the percentage on a particular host is large and the host fails, it can lead to
|
the percentage located on a particular host is large and the host fails, it
|
||||||
problems such as exceeding the ``full ratio``, which causes Ceph to halt
|
can lead to problems such as recovery causing OSDs to exceed the ``full ratio``,
|
||||||
operations as a safety precaution that prevents data loss.
|
which in turn causes Ceph to halt operations to prevent data loss.
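The relevant thresholds can be inspected, and adjusted if needed; a sketch
(0.95 is simply the default value, shown for illustration):

.. prompt:: bash #

   ceph osd dump | grep ratio
   ceph osd set-full-ratio 0.95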
|
||||||
|
|
||||||
When you run multiple OSDs per host, you also need to ensure that the kernel
|
When you run multiple OSDs per host, you also need to ensure that the kernel
|
||||||
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
|
is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
|
||||||
@ -384,7 +461,11 @@ multiple OSDs per host.
|
|||||||
Networks
|
Networks
|
||||||
========
|
========
|
||||||
|
|
||||||
Provision at least 10 Gb/s networking in your racks.
|
Provision at least 10 Gb/s networking in your datacenter, both among Ceph
|
||||||
|
hosts and between clients and your Ceph cluster. Network link active/active
|
||||||
|
bonding across separate network switches is strongly recommended both for
|
||||||
|
increased throughput and for tolerance of network failures and maintenance.
|
||||||
|
Take care that your bonding hash policy distributes traffic across links.
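A quick sanity check on an existing bond (the interface name is hypothetical):

.. prompt:: bash #

   grep -i "hash policy" /proc/net/bonding/bond0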
|
||||||
|
|
||||||
Speed
|
Speed
|
||||||
-----
|
-----
|
||||||
@ -392,13 +473,20 @@ Speed
|
|||||||
It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
|
It takes three hours to replicate 1 TB of data across a 1 Gb/s network and it
|
||||||
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
|
takes thirty hours to replicate 10 TB across a 1 Gb/s network. But it takes only
|
||||||
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
|
twenty minutes to replicate 1 TB across a 10 Gb/s network, and it takes
|
||||||
only about three hours to replicate 10 TB across a 10 Gb/s network.
|
only about three hours to replicate 10 TB across a 10 Gb/s network.
|
||||||
|
|
||||||
|
Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
|
||||||
|
parallel, and that a 100Gb/s network link is effectively four 25 Gb/s channels
|
||||||
|
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
|
||||||
|
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
|
||||||
|
network.
|
||||||
|
|
||||||
|
|
||||||
Cost
|
Cost
|
||||||
----
|
----
|
||||||
|
|
||||||
The larger the Ceph cluster, the more common OSD failures will be.
|
The larger the Ceph cluster, the more common OSD failures will be.
|
||||||
The faster that a placement group (PG) can recover from a ``degraded`` state to
|
The faster that a placement group (PG) can recover from a degraded state to
|
||||||
an ``active + clean`` state, the better. Notably, fast recovery minimizes
|
an ``active + clean`` state, the better. Notably, fast recovery minimizes
|
||||||
the likelihood of multiple, overlapping failures that can cause data to become
|
the likelihood of multiple, overlapping failures that can cause data to become
|
||||||
temporarily unavailable or even lost. Of course, when provisioning your
|
temporarily unavailable or even lost. Of course, when provisioning your
|
||||||
@ -410,10 +498,10 @@ switches. The added expense of this hardware may be offset by the operational
|
|||||||
cost savings on network setup and maintenance. When using VLANs to handle VM
|
cost savings on network setup and maintenance. When using VLANs to handle VM
|
||||||
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
|
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
|
||||||
etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
|
etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
|
||||||
25/50/100 Gb/s networking as of 2022 is common for production clusters.
|
increasingly 25/50/100 Gb/s networking as of 2022 is common for production clusters.
|
||||||
|
|
||||||
Top-of-rack (TOR) switches also need fast and redundant uplinks to spind
|
Top-of-rack (TOR) switches also need fast and redundant uplinks to
|
||||||
spine switches / routers, often at least 40 Gb/s.
|
core / spine network switches or routers, often at least 40 Gb/s.
|
||||||
|
|
||||||
|
|
||||||
Baseboard Management Controller (BMC)
|
Baseboard Management Controller (BMC)
|
||||||
@ -425,78 +513,103 @@ Administration and deployment tools may also use BMCs extensively, especially
|
|||||||
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
|
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
|
||||||
network for security and administration. Hypervisor SSH access, VM image uploads,
|
network for security and administration. Hypervisor SSH access, VM image uploads,
|
||||||
OS image installs, management sockets, etc. can impose significant loads on a network.
|
OS image installs, management sockets, etc. can impose significant loads on a network.
|
||||||
Running three networks may seem like overkill, but each traffic path represents
|
Running multiple networks may seem like overkill, but each traffic path represents
|
||||||
a potential capacity, throughput and/or performance bottleneck that you should
|
a potential capacity, throughput and/or performance bottleneck that you should
|
||||||
carefully consider before deploying a large scale data cluster.
|
carefully consider before deploying a large scale data cluster.
|
||||||
|
|
||||||
|
Additionally, BMCs as of 2023 rarely sport network connections faster than 1 Gb/s,
|
||||||
|
so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
|
||||||
|
may reduce costs by wasting fewer expensive ports on faster host switches.
|
||||||
|
|
||||||
|
|
||||||
Failure Domains
|
Failure Domains
|
||||||
===============
|
===============
|
||||||
|
|
||||||
A failure domain is any failure that prevents access to one or more OSDs. That
|
A failure domain can be thought of as any component loss that prevents access to
|
||||||
could be a stopped daemon on a host; a disk failure, an OS crash, a
|
one or more OSDs or other Ceph daemons. Examples include a stopped daemon on a host,
|
||||||
malfunctioning NIC, a failed power supply, a network outage, a power outage,
|
a storage drive failure, an OS crash, a malfunctioning NIC, a failed power supply,
|
||||||
and so forth. When planning out your hardware needs, you must balance the
|
a network outage, a power outage, and so forth. When planning your hardware
|
||||||
temptation to reduce costs by placing too many responsibilities into too few
|
deployment, you must balance the risk of reducing costs by placing too many
|
||||||
failure domains, and the added costs of isolating every potential failure
|
responsibilities into too few failure domains against the added costs of
|
||||||
domain.
|
isolating every potential failure domain.
|
||||||
|
|
||||||
|
|
||||||
Minimum Hardware Recommendations
|
Minimum Hardware Recommendations
|
||||||
================================
|
================================
|
||||||
|
|
||||||
Ceph can run on inexpensive commodity hardware. Small production clusters
|
Ceph can run on inexpensive commodity hardware. Small production clusters
|
||||||
and development clusters can run successfully with modest hardware.
|
and development clusters can run successfully with modest hardware. As
|
||||||
|
we noted above: when we speak of CPU *cores*, we mean *threads* when
|
||||||
|
hyperthreading (HT) is enabled. Each modern physical x64 CPU core typically
|
||||||
|
provides two logical CPU threads; other CPU architectures may vary.
|
||||||
|
|
||||||
|
Be aware that many factors influence resource choices. The
|
||||||
|
minimum resources that suffice for one purpose will not necessarily suffice for
|
||||||
|
another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
|
||||||
|
a trio of Raspberry Pis will get by with fewer resources than a production
|
||||||
|
deployment with a thousand OSDs serving five thousand RBD clients. The
|
||||||
|
classic Fisher Price PXL 2000 captures video, as does an IMAX or RED camera.
|
||||||
|
One would not expect the former to do the job of the latter. We especially
|
||||||
|
cannot stress enough the criticality of using enterprise-quality storage
|
||||||
|
media for production workloads.
|
||||||
|
|
||||||
|
Additional insights into resource planning for production clusters are
|
||||||
|
found above and elsewhere within this documentation.
|
||||||
|
|
||||||
+--------------+----------------+-----------------------------------------+
|
+--------------+----------------+-----------------------------------------+
|
||||||
| Process | Criteria | Minimum Recommended |
|
| Process | Criteria | Bare Minimum and Recommended |
|
||||||
+==============+================+=========================================+
|
+==============+================+=========================================+
|
||||||
| ``ceph-osd`` | Processor | - 1 core minimum |
|
| ``ceph-osd`` | Processor | - 1 core minimum, 2 recommended |
|
||||||
| | | - 1 core per 200-500 MB/s |
|
| | | - 1 core per 200-500 MB/s throughput |
|
||||||
| | | - 1 core per 1000-3000 IOPS |
|
| | | - 1 core per 1000-3000 IOPS |
|
||||||
| | | |
|
| | | |
|
||||||
| | | * Results are before replication. |
|
| | | * Results are before replication. |
|
||||||
| | | * Results may vary with different |
|
| | | * Results may vary across CPU and drive |
|
||||||
| | | CPU models and Ceph features. |
|
| | | models and Ceph configuration: |
|
||||||
| | | (erasure coding, compression, etc) |
|
| | | (erasure coding, compression, etc) |
|
||||||
| | | * ARM processors specifically may |
|
| | | * ARM processors specifically may |
|
||||||
| | | require additional cores. |
|
| | | require more cores for performance. |
|
||||||
|
| | | * SSD OSDs, especially NVMe, will |
|
||||||
|
| | | benefit from additional cores per OSD.|
|
||||||
| | | * Actual performance depends on many |
|
| | | * Actual performance depends on many |
|
||||||
| | | factors including drives, net, and |
|
| | | factors including drives, net, and |
|
||||||
| | | client throughput and latency. |
|
| | | client throughput and latency. |
|
||||||
| | | Benchmarking is highly recommended. |
|
| | | Benchmarking is highly recommended. |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | RAM | - 4GB+ per daemon (more is better) |
|
| | RAM | - 4GB+ per daemon (more is better) |
|
||||||
| | | - 2-4GB often functions (may be slow) |
|
| | | - 2-4GB may function but may be slow |
|
||||||
| | | - Less than 2GB not recommended |
|
| | | - Less than 2GB is not recommended |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | Volume Storage | 1x storage drive per daemon |
|
| | Storage Drives | 1x storage drive per OSD |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | DB/WAL | 1x SSD partition per daemon (optional) |
|
| | DB/WAL | 1x SSD partition per HDD OSD |
|
||||||
|
| | (optional) | 4-5x HDD OSDs per DB/WAL SATA SSD |
|
||||||
|
| | | <= 10 HDD OSDs per DB/WAL NVMe SSD |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | Network | 1x 1GbE+ NICs (10GbE+ recommended) |
|
| | Network | 1x 1Gb/s (bonded 10+ Gb/s recommended) |
|
||||||
+--------------+----------------+-----------------------------------------+
|
+--------------+----------------+-----------------------------------------+
|
||||||
| ``ceph-mon`` | Processor | - 2 cores minimum |
|
| ``ceph-mon`` | Processor | - 2 cores minimum |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | RAM | 2-4GB+ per daemon |
|
| | RAM | 5GB+ per daemon (large / production |
|
||||||
|
| | | clusters need more) |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | Disk Space | 60 GB per daemon |
|
| | Storage | 100 GB per daemon, SSD is recommended |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | Network | 1x 1GbE+ NICs |
|
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
|
||||||
+--------------+----------------+-----------------------------------------+
|
+--------------+----------------+-----------------------------------------+
|
||||||
| ``ceph-mds`` | Processor | - 2 cores minimum |
|
| ``ceph-mds`` | Processor | - 2 cores minimum |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | RAM | 2GB+ per daemon |
|
| | RAM | 2GB+ per daemon (more for production) |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | Disk Space | 1 MB per daemon |
|
| | Disk Space | 1 GB per daemon |
|
||||||
| +----------------+-----------------------------------------+
|
| +----------------+-----------------------------------------+
|
||||||
| | Network | 1x 1GbE+ NICs |
|
| | Network | 1x 1Gb/s (10+ Gb/s recommended) |
|
||||||
+--------------+----------------+-----------------------------------------+
|
+--------------+----------------+-----------------------------------------+
|
||||||
|
|
||||||
.. tip:: If you are running an OSD with a single disk, create a
|
.. tip:: If you are running an OSD node with a single storage drive, create a
|
||||||
partition for your volume storage that is separate from the partition
|
partition for your OSD that is separate from the partition
|
||||||
containing the OS. Generally, we recommend separate disks for the
|
containing the OS. We recommend separate drives for the
|
||||||
OS and the volume storage.
|
OS and for OSD storage.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
@ -35,20 +35,38 @@ Linux Kernel
|
|||||||
Platforms
|
Platforms
|
||||||
=========
|
=========
|
||||||
|
|
||||||
The charts below show how Ceph's requirements map onto various Linux
|
The chart below shows which Linux platforms Ceph provides packages for, and
|
||||||
platforms. Generally speaking, there is very little dependence on
|
which platforms Ceph has been tested on.
|
||||||
specific distributions outside of the kernel and system initialization
|
|
||||||
package (i.e., sysvinit, systemd).
|
|
||||||
|
|
||||||
+--------------+--------+------------------------+--------------------------------+-------------------+-----------------+
|
Ceph does not require a specific Linux distribution. Ceph can run on any
|
||||||
| Release Name | Tag | CentOS | Ubuntu | OpenSUSE :sup:`C` | Debian :sup:`C` |
|
distribution that includes a supported kernel and supported system startup
|
||||||
+==============+========+========================+================================+===================+=================+
|
framework, for example ``sysvinit`` or ``systemd``. Ceph is sometimes ported to
|
||||||
| Quincy | 17.2.z | 8 :sup:`A` | 20.04 :sup:`A` | 15.3 | 11 |
|
non-Linux systems but these are not supported by the core Ceph effort.
|
||||||
+--------------+--------+------------------------+--------------------------------+-------------------+-----------------+
|
|
||||||
| Pacific | 16.2.z | 8 :sup:`A` | 18.04 :sup:`C`, 20.04 :sup:`A` | 15.2 | 10, 11 |
|
|
||||||
+--------------+--------+------------------------+--------------------------------+-------------------+-----------------+
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
| Octopus | 15.2.z | 7 :sup:`B` 8 :sup:`A` | 18.04 :sup:`C`, 20.04 :sup:`A` | 15.2 | 10 |
|
| | Reef (18.2.z) | Quincy (17.2.z) | Pacific (16.2.z) | Octopus (15.2.z) |
|
||||||
+--------------+--------+------------------------+--------------------------------+-------------------+-----------------+
|
+===============+===============+=================+==================+==================+
|
||||||
|
| CentOS 7 | | | A | B |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
| Centos 8 | A | A | A | A |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
| Centos 9 | A | | | |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
| Debian 10 | C | | C | C |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
| Debian 11 | C | C | C | |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
| OpenSUSE 15.2 | C | | C | C |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
| OpenSUSE 15.3 | C | C | | |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
| Ubuntu 18.04 | | | C | C |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
| Ubuntu 20.04 | A | A | A | A |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
| Ubuntu 22.04 | A | | | |
|
||||||
|
+---------------+---------------+-----------------+------------------+------------------+
|
||||||
|
|
||||||
- **A**: Ceph provides packages and has done comprehensive tests on the software in them.
|
- **A**: Ceph provides packages and has done comprehensive tests on the software in them.
|
||||||
- **B**: Ceph provides packages and has done basic tests on the software in them.
|
- **B**: Ceph provides packages and has done basic tests on the software in them.
|
||||||
|
@@ -141,19 +141,51 @@ function install_pkg_on_ubuntu {
     fi
 }
 
+boost_ver=1.79
+
+function clean_boost_on_ubuntu {
+    in_jenkins && echo "CI_DEBUG: Running clean_boost_on_ubuntu() in install-deps.sh"
+    # Find currently installed version. If there are multiple
+    # versions, they end up newline separated
+    local installed_ver=$(apt -qq list --installed ceph-libboost*-dev 2>/dev/null |
+                              cut -d' ' -f2 |
+                              cut -d'.' -f1,2 |
+                              sort -u)
+    # If installed_ver contains whitespace, we can't really count on it,
+    # but otherwise, bail out if the version installed is the version
+    # we want.
+    if test -n "$installed_ver" &&
+           echo -n "$installed_ver" | tr '[:space:]' ' ' | grep -v -q ' '; then
+        if echo "$installed_ver" | grep -q "^$boost_ver"; then
+            return
+        fi
+    fi
+
+    # Historical packages
+    $SUDO rm -f /etc/apt/sources.list.d/ceph-libboost*.list
+    # Currently used
+    $SUDO rm -f /etc/apt/sources.list.d/libboost.list
+    # Refresh package list so things aren't in the available list.
+    $SUDO env DEBIAN_FRONTEND=noninteractive apt-get update -y || true
+    # Remove all ceph-libboost packages. We have an early return if
+    # the desired version is already (and the only) version installed,
+    # so no need to spare it.
+    if test -n "$installed_ver"; then
+        $SUDO env DEBIAN_FRONTEND=noninteractive apt-get -y --fix-missing remove "ceph-libboost*"
+    fi
+}
+
 function install_boost_on_ubuntu {
     in_jenkins && echo "CI_DEBUG: Running install_boost_on_ubuntu() in install-deps.sh"
-    local ver=1.79
+    # Once we get to this point, clean_boost_on_ubuntu() should ensure
+    # that there is no more than one installed version.
     local installed_ver=$(apt -qq list --installed ceph-libboost*-dev 2>/dev/null |
                               grep -e 'libboost[0-9].[0-9]\+-dev' |
                               cut -d' ' -f2 |
                               cut -d'.' -f1,2)
     if test -n "$installed_ver"; then
-        if echo "$installed_ver" | grep -q "^$ver"; then
+        if echo "$installed_ver" | grep -q "^$boost_ver"; then
             return
-        else
-            $SUDO env DEBIAN_FRONTEND=noninteractive apt-get -y remove "ceph-libboost.*${installed_ver}.*"
-            $SUDO rm -f /etc/apt/sources.list.d/ceph-libboost${installed_ver}.list
         fi
     fi
     local codename=$1
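The version probe added in clean_boost_on_ubuntu can be exercised on its own. The sketch below is illustrative only: it replays the same cut/sort pipeline against a made-up apt listing (the package names and versions are assumptions, not output captured from a real system) to show how several installed ceph-libboost packages collapse to a single major.minor version string:

    #!/usr/bin/env bash
    # Hypothetical `apt -qq list --installed ceph-libboost*-dev` output;
    # package names and versions are invented for this sketch.
    sample=$(printf '%s\n' \
        'ceph-libboost-atomic1.79-dev/focal,now 1.79.0-1focal amd64 [installed]' \
        'ceph-libboost-chrono1.79-dev/focal,now 1.79.0-1focal amd64 [installed]')

    # Same parsing steps as clean_boost_on_ubuntu: take the version field,
    # keep only major.minor, and de-duplicate.
    installed_ver=$(echo "$sample" |
                        cut -d' ' -f2 |
                        cut -d'.' -f1,2 |
                        sort -u)

    echo "detected: $installed_ver"   # prints: detected: 1.79
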
@@ -164,22 +196,22 @@ function install_boost_on_ubuntu {
         $sha1 \
         $codename \
         check \
-        ceph-libboost-atomic$ver-dev \
-        ceph-libboost-chrono$ver-dev \
-        ceph-libboost-container$ver-dev \
-        ceph-libboost-context$ver-dev \
-        ceph-libboost-coroutine$ver-dev \
-        ceph-libboost-date-time$ver-dev \
-        ceph-libboost-filesystem$ver-dev \
-        ceph-libboost-iostreams$ver-dev \
-        ceph-libboost-program-options$ver-dev \
-        ceph-libboost-python$ver-dev \
-        ceph-libboost-random$ver-dev \
-        ceph-libboost-regex$ver-dev \
-        ceph-libboost-system$ver-dev \
-        ceph-libboost-test$ver-dev \
-        ceph-libboost-thread$ver-dev \
-        ceph-libboost-timer$ver-dev
+        ceph-libboost-atomic${boost_ver}-dev \
+        ceph-libboost-chrono${boost_ver}-dev \
+        ceph-libboost-container${boost_ver}-dev \
+        ceph-libboost-context${boost_ver}-dev \
+        ceph-libboost-coroutine${boost_ver}-dev \
+        ceph-libboost-date-time${boost_ver}-dev \
+        ceph-libboost-filesystem${boost_ver}-dev \
+        ceph-libboost-iostreams${boost_ver}-dev \
+        ceph-libboost-program-options${boost_ver}-dev \
+        ceph-libboost-python${boost_ver}-dev \
+        ceph-libboost-random${boost_ver}-dev \
+        ceph-libboost-regex${boost_ver}-dev \
+        ceph-libboost-system${boost_ver}-dev \
+        ceph-libboost-test${boost_ver}-dev \
+        ceph-libboost-thread${boost_ver}-dev \
+        ceph-libboost-timer${boost_ver}-dev
 }
 
 function install_libzbd_on_ubuntu {
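With every ceph-libboost-*-dev package name now driven by the single top-level ${boost_ver}, the list can be thought of as being generated from one component list. A minimal sketch of that idea (the abbreviated component list here is an assumption for illustration, not the full set used by the script):

    #!/usr/bin/env bash
    boost_ver=1.79
    # Abbreviated component list for illustration only.
    components=(atomic chrono container context python system thread)

    pkgs=()
    for c in "${components[@]}"; do
        pkgs+=("ceph-libboost-${c}${boost_ver}-dev")
    done

    printf '%s\n' "${pkgs[@]}"
    # ceph-libboost-atomic1.79-dev
    # ceph-libboost-chrono1.79-dev
    # ...and so on for each component
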
@@ -330,6 +362,9 @@ else
     case "$ID" in
     debian|ubuntu|devuan|elementary|softiron)
         echo "Using apt-get to install dependencies"
+        # Put this before any other invocation of apt so it can clean
+        # up in a broken case.
+        clean_boost_on_ubuntu
         $SUDO apt-get install -y devscripts equivs
         $SUDO apt-get install -y dpkg-dev
         ensure_python3_sphinx_on_ubuntu
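Because clean_boost_on_ubuntu now runs before any other apt invocation in this branch, a quick sanity check after it returns is that at most one ceph-libboost major.minor version remains installed. A hedged sketch of such a check (not part of install-deps.sh itself):

    #!/usr/bin/env bash
    # Count distinct major.minor versions of ceph-libboost*-dev still installed.
    distinct=$(apt -qq list --installed ceph-libboost*-dev 2>/dev/null |
                   cut -d' ' -f2 | cut -d'.' -f1,2 | sort -u | grep -c .)
    if [ "$distinct" -gt 1 ]; then
        echo "warning: more than one ceph-libboost version is still installed" >&2
    fi
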
@@ -132,7 +132,7 @@ build_dashboard_frontend() {
 
     $CURR_DIR/src/tools/setup-virtualenv.sh $TEMP_DIR
     $TEMP_DIR/bin/pip install nodeenv
-    $TEMP_DIR/bin/nodeenv --verbose -p --node=14.15.1
+    $TEMP_DIR/bin/nodeenv --verbose -p --node=18.17.0
     cd src/pybind/mgr/dashboard/frontend
 
     . $TEMP_DIR/bin/activate
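After the nodeenv bump, one way to confirm that the virtualenv really provides the expected Node.js major version is to activate it and inspect `node --version`. This is an illustrative check only; TEMP_DIR is assumed to point at the virtualenv created by setup-virtualenv.sh and the fallback path below is a placeholder:

    #!/usr/bin/env bash
    # Assumption: TEMP_DIR points at the virtualenv the dist script created.
    TEMP_DIR=${TEMP_DIR:-/tmp/ceph-dashboard-venv}
    . "$TEMP_DIR/bin/activate"

    case "$(node --version)" in
        v18.*) echo "node $(node --version) matches the expected major version" ;;
        *)     echo "unexpected node version: $(node --version)" >&2; exit 1 ;;
    esac
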
@@ -3,3 +3,4 @@ log-rotate:
   ceph-osd: 10G
 tasks:
 - ceph:
+    create_rbd_pool: false
@@ -10,3 +10,4 @@ overrides:
       - \(MDS_ALL_DOWN\)
       - \(MDS_UP_LESS_THAN_MAX\)
       - \(FS_INLINE_DATA_DEPRECATED\)
+      - \(POOL_APP_NOT_ENABLED\)
1  ceph/qa/distros/crimson-supported-all-distro/centos_8.yaml  (symbolic link)
@@ -0,0 +1 @@
+../all/centos_8.yaml

1  ceph/qa/distros/crimson-supported-all-distro/centos_latest.yaml  (symbolic link)
@@ -0,0 +1 @@
+../all/centos_latest.yaml

1  ceph/qa/distros/supported-all-distro/centos_latest.yaml  (symbolic link)
@@ -0,0 +1 @@
+../all/centos_latest.yaml

1  ceph/qa/distros/supported-all-distro/ubuntu_20.04.yaml  (symbolic link)
@@ -0,0 +1 @@
+../all/ubuntu_20.04.yaml

@@ -1 +1 @@
-../all/ubuntu_20.04.yaml
+../all/ubuntu_latest.yaml
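For reference, the symlink layout above can be reproduced by hand with `ln`. The commands below are a sketch, assuming they are run from the `qa/distros` directory of a checkout; the final link name is a placeholder because the file owning the retargeting hunk is not named in this view:

    #!/usr/bin/env bash
    cd qa/distros

    # New symlinks pointing at the shared definitions under all/.
    ln -sfn ../all/centos_8.yaml      crimson-supported-all-distro/centos_8.yaml
    ln -sfn ../all/centos_latest.yaml crimson-supported-all-distro/centos_latest.yaml
    ln -sfn ../all/centos_latest.yaml supported-all-distro/centos_latest.yaml
    ln -sfn ../all/ubuntu_20.04.yaml  supported-all-distro/ubuntu_20.04.yaml

    # Retargeting an existing symlink: -f replaces it, -n keeps ln from
    # following the old link. RETARGETED_LINK is a placeholder name.
    RETARGETED_LINK=supported-all-distro/placeholder_link.yaml
    ln -sfn ../all/ubuntu_latest.yaml "$RETARGETED_LINK"
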