Commit Graph

154 Commits

Author SHA1 Message Date
Aaron Lauterer
38fa08d074 api: osd: destroy: remove mclock max iops settings
Ceph does a quick benchmark when creating a new OSD and stores the
osd_mclock_max_capacity_iops_{ssd,hdd} settings in the config DB.

When destroying the OSD, Ceph does not automatically remove these
settings. Keeping them can be problematic if a new OSD with potentially
more performance is added and ends up getting the same OSD ID.

Therefore, we remove these settings ourselves when destroying an OSD.
Removing both variants, hdd and ssd should be fine, as the MON does not
complain if the setting does not exist.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
Tested-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
2023-11-17 08:09:15 +01:00
Thomas Lamprecht
2f6467d8eb api: ceph osd: fix description line-wrapping style
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-11-06 18:26:53 +01:00
Aaron Lauterer
ad1677d221 fix #4631: ceph: osd: create: add osds-per-device
Allows to automatically create multiple OSDs per physical device. The
main use case are fast NVME drives that would be bottlenecked by a
single OSD service.

By using the 'ceph-volume lvm batch' command instead of the 'ceph-volume
lvm create' for multiple OSDs / device, we don't have to deal with the
split of the drive ourselves.

But this means that the parameters to specify a DB or WAL device won't
work as the 'batch' command doesn't use them. Dedicated DB and WAL
devices don't make much sense anyway if we place the OSDs on fast NVME
drives.

Some other changes to how the command is built were needed as well, as
the 'batch' command needs the path to the disk as a positional argument,
not as '--data /dev/sdX'.
We drop the '--cluster-fsid' parameter because the 'batch' command
doesn't accept it. The 'create' will fall back to reading it from the
ceph.conf file.

Removal of OSDs works as expected without any code changes. As long as
there are other OSDs on a disk, the VG & PV won't be removed, even if
'cleanup' is enabled.

The '--no-auto' parameter is used to avoid the following deprecation
warning:
```
--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 0.3333333333333333
```

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2023-11-06 18:23:28 +01:00
Maximiliano Sandoval
ab70343982 fix #4808: ceph: mds create: use snake_case when setting options
As suggested in [1], it is recommended to use `_` in all cases when
dealing with config files. Note that this is for creation only, and we
enforce that there cannot be an existing MDS with the same ID, so we
do not have to bother how ceph would handle the case where both exist.

[1] https://docs.ceph.com/en/reef/rados/configuration/ceph-conf/#option-names

Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-09-06 16:57:10 +02:00
Thomas Lamprecht
b4b39b55f8 api: ceph osd: factor out getting PSS stat & improve error handling
Do not crowd the higher level API endpoint handler code directly with
some rather low level procfs parsing code, rather factor that out in a
helper. Make said helper private for now so that anybody wanting to
use cannot do so, and thus increase the chance that said dev will
actually think about if this makes sense as is as a general interface.

Avoid fatal die's for the odd case that the smaps_rollup file cannot
be opened, or the even less likely case where PSS stats cannot be
found in the content.

The former could happen due to the general TOCTOU race here, i.e., the
PID we get from systemctl service status parsing isn't guaranteed to
exist anymore when we read from procfs, and if, it's actually not
guaranteed to still be the OSD - but we cannot easily use pidfd's
here and OSD stops are not something that happens frequently, but in
anyway avoid that such a thing fails the whole API call only because a
single metric is affected.

In the long rung it might be better to add a "errors" array to the
response, so that the user can be informed about such an (odd) thing
happening.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-09-04 14:44:33 +02:00
Thomas Lamprecht
f7b7e942a7 api: ceph osd: drop unused variable and useless intermediate code
$raw isn't used anywhere here and probably just a left over from copy
pasting, and the "int cast ternary" can be avoided by just directly
casting to int when assigning the variable in the first place.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-09-04 14:22:31 +02:00
Stefan Hanreich
808eb12f8c api: ceph: improve reporting of ceph OSD memory usage
Currently we are using the MemoryCurrent property of the OSD service
to determine the used memory of a Ceph OSD. This includes, among other
things, the memory used by buffers [1]. Since BlueFS uses buffered
I/O, this can lead to extremely high values shown in the UI.

Instead we are now reading the PSS value from the proc filesystem,
which should more accurately reflect the amount of memory currently
used by the Ceph OSD.

Aaron and I decided on PSS over RSS, since this should give a better
idea of used memory - particularly when using a large amount of OSDs
on one host, since the OSDs share some of the pages.

[1] https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

Signed-off-by: Stefan Hanreich <s.hanreich@proxmox.com>
Tested-by: Aaron Lauterer <a.lauterer@proxmox.com>
2023-09-04 13:53:35 +02:00
Aaron Lauterer
27f6d19848 api: ceph: remove deprecrated Pools path
The replacement is Pool (singular).

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2023-06-21 09:21:04 +02:00
Thomas Lamprecht
147d67c495 makefile: convert to use simple parenthesis
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-05-29 18:24:00 +02:00
Fiona Ebner
16f3482b34 api: ceph: mon create: remove superfluous verification call
The pve_verify_cidr{,v4,v6} functions were originally intended for
the /etc/network/interfaces API endpoints and thus are a bit
restrictive. For example, as reported in the community forum[0],
pve_verify_cidr() does not consider '0::/0' and '0::/1' to be valid.

The error message in this scenario being
> value does not look like a valid CIDR network
is also confusing, as the first thought of users will be that it comes
from the passed-in monitor address.

The public networks are not written here and read from the Ceph config
and via a RADOS mon command, so no need to try and verify them. If
something really would go wrong during parsing, the
get_local_ip_from_cidr() call would complain afterwards.

[0]: https://forum.proxmox.com/threads/125226/

Suggested-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
2023-04-12 13:26:49 +02:00
Thomas Lamprecht
9c7c61d06c api: ceph cfg: drop double definition of permission property for index
Reported-by: Dominik Csapak <d.csapak@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-03-20 15:34:46 +01:00
Aaron Lauterer
26cf20bc0f api: ceph: add ceph/cfg path, deprecate ceph/config and ceph/configdb
Consolidating the different config paths lets us add more as needed
without polluting our API with too many 'configxxx' endpoints.

The config and configdb paths are renamed under the ceph/cfg path:
* config -> raw (returns the ceph.conf file as is)
* configdb -> db (returns the ceph config db contents)

The old paths are still available and need to be dropped at some point.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by:  Dominik Csapak <d.csapak@proxmox.com>
2023-03-20 15:31:04 +01:00
Aaron Lauterer
b2e005d76a api: ceph: deprecate pools in favor of pool
/nodes/{node}/ceph/pools/{pool} returns the pool details right away on a
GET.  This makes it bad practice to add additional sub API endpoints.

By deprecating it and replacing it with /nodes/{node}/ceph/pool/{pool}
(singular instead of plural) we can turn that into an index GET
response, making it possible to expand it more in the future.

The GET call returning the pool details is moved into
/nodes/{node}/ceph/pool/{pool}/status

The code in the new Pool.pm is basically a copy of Pools.pm to avoid
a close coupling with the old code as it is possible that it will divert
until we can entirely remove the old code.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by:  Dominik Csapak <d.csapak@proxmox.com>
2023-03-20 15:31:04 +01:00
Aaron Lauterer
e907f822d7 api ceph osd: add OSD index, metadata and lv-info
To get more details for a single OSD, we add two new endpoints:
* nodes/{node}/ceph/osd/{osdid}/metadata
* nodes/{node}/ceph/osd/{osdid}/lv-info

The {osdid} endpoint itself gets a new GET handler to return the index.

The metadata one provides various metadata regarding the OSD.

Such as
* process id
* memory usage
* info about devices used (bdev/block, db, wal)
    * size
    * disks used (sdX)
    ...
* network addresses and ports used
...

Memory usage and PID are retrieved from systemd while the rest can be
retrieved from the metadata provided by Ceph.

The second one (lv-info) returns the following infos for a logical
volume:
* creation time
* lv name
* lv path
* lv size
* lv uuid
* vg name

Possible volumes are:
* block (default value if not provided)
* db
* wal

'ceph-volume' is used to gather the infos, except for the creation time
of the LV which is retrieved via 'lvs'.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by:  Dominik Csapak <d.csapak@proxmox.com>
2023-03-15 18:24:27 +01:00
Aaron Lauterer
c4368cf6d6 ceph osd: return PGs per OSD and show in UI
By switching from 'ceph osd tree' to the 'ceph osd df tree' mon API
equivalent , we get the same data structure with more information per
OSD. One of them is the number of PGs stored on that OSD.

The number of PGs per OSD is an important number, for example when
trying to figure out why the performance is not as good as expected.
Therefore, adding it to the OSD overview visible by default should
reduce the number of times, one needs to access the CLI.

Comparing runtime cost on a 3 node ceph cluster with 4 OSDs each doing 50k
iterations gives:

               Rate osd-df-tree    osd-tree
osd-df-tree  9141/s          --        -25%
osd-tree    12136/s         33%          --

So, while definitively a bit slower, but it's still in the µs range,
and as such below HTTP in TLS in TCP connection setup for most users,
so worth the extra useful information.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
 [ TL: slight rewording of subject and add benchmark data ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2023-02-15 10:16:50 +01:00
Aaron Lauterer
b62ba85ad7 api: ceph: update return schemas
to include a more complete description of the returned data.
Sort properties in alphabetical order if the list is longer.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2023-01-16 14:32:00 +01:00
Stefan Sterz
7818083008 api: ceph: add applications of each pool to the lspools endpoint
since ceph luminous (ceph 12) pools need to be associated with at
least one applicaton. expose this information here too so that clients
of this endpoint can use it.

Signed-off-by: Stefan Sterz <s.sterz@proxmox.com>
Tested-By:  Aaron Lauterer <a.lauterer@proxmox.com>
2022-11-16 20:24:12 +01:00
Aaron Lauterer
655080eeeb api: ceph: pools: get_storages: set pool name if missing
This avoids errors about the use of uninitialized values if the 'pool'
parameter is not present in the storage configuration.

The 'pool' property for an RBD storage config is not mandatory.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2022-10-14 15:43:53 +02:00
Fabian Ebner
e6f55a13b0 api: ceph: mon: make checking for duplicate addresses more robust
Because $mon->{addr} might come with a port attached (affects monitors
created with PVE 5.4 as reported in the community forum [0]), or even
be a hostname (according to the code in Ceph/Services.pm). Although
the latter shouldn't happen for configurations created by PVE.

[0]: https://forum.proxmox.com/threads/105904/

Fixes: 9e989449 ("api: ceph: mon: fix handling of IPv6 addresses in assert_mon_prerequisites")
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2022-06-08 08:49:34 +02:00
Thomas Lamprecht
ea9eea012a api: ceph pool: reword ec desc full textwidth and reword slightly
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-05-04 07:14:56 +02:00
Aaron Lauterer
1401cb4e43 ceph pools create: enhance erasure-code description
Mention which optional parameters will be used for the replicated
metadata pool but won't have an effect on the erasure coded data pool.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2022-05-04 07:12:00 +02:00
Aaron Lauterer
9569939a54 ceph pools create: remove crush_rule for ec pool data
The crush rule is an optional paramter which can be used for the
metadata pool, but the erasure coded data pool will always get its own
crush rule. Therefore this parameter can not be adapted.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2022-05-04 07:11:06 +02:00
Aaron Lauterer
fca1900c76 api: ceph pools: add type to returned properties
The osd dump already contains the pool type in numerical format.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
2022-05-02 15:43:11 +02:00
Thomas Lamprecht
ec63b237de api: ceph: fix description indentation style
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-04-29 14:28:12 +02:00
Thomas Lamprecht
07316e6d04 api: followup: code locality
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-04-29 14:26:32 +02:00
Aaron Lauterer
4605d6fd22 api: ceph ec pools: make add_storages overridable default
The behavior of always adding the storage config was lost in commit
23c407e. But it is more sensible to make it a default that can be
changed if needed.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2022-04-29 14:24:04 +02:00
Aaron Lauterer
136f761ba7 api: ceph ec pools: schema fixes and enhancements
Ceph has a min value for 'k' of 2. Adding default and description where
missing.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2022-04-29 14:24:04 +02:00
Thomas Lamprecht
f35e7fcd8e api: ceph ec pools: move to format-str, create ec in worker, reuse $rados
moved to a format string 'erasurce-coded', that allows also to drop
most of the param existence checking as we can set the correct
optional'ness in there.  Also avoids bloating the API to much for
just this.

Reuse the $rados connection more often to avoid to much
overhead/lingering sockets (the rados connection stays around in the
background to allow efficient reuse)

really should be three separate commits, but too intertwined and too
late for me to care tbh.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-04-28 20:26:38 +02:00
Aaron Lauterer
3bd128d7a0 ceph pools: allow to create erasure code pools
To use erasure coded (EC) pools for RBD storages, we need two pools. One
regular replicated pool that will hold the RBD omap and other metadata
and the EC pool which will hold the image data.

The coupling happens when an RBD image is created by adding the
--data-pool parameter. This is why we have the 'data-pool' parameter in
the storage configuration.

To follow already established semantics, we will create a 'X-metadata'
and 'X-data' pool. The storage configuration is always added as it is
the only thing that links the two together (besides naming schemes).

Different pg_num defaults are chosen for the replicated metadata pool as
it will not hold a lot of data.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2022-04-28 20:26:38 +02:00
Aaron Lauterer
29fe1eea7a api: ceph: $get_storages check if data-pool too
When removing a pool, we check against any storage that might have that
pool configured.
We need to check if that pool is used as data-pool too.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2022-04-28 20:26:38 +02:00
Fabian Ebner
cffeb11592 api: ceph: create osd: set correct partition type
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-11-11 21:50:51 +01:00
Fabian Ebner
5161a0c2f0 partially fix #2285: api: ceph: create osd: allow using partitions
Note that this does not only allow partitions to be used, but for DB
and WAL disks, one more type of disk, that wasn't allowed before.
Namely, GPT-partitioned disks with any partitions detected as used.
The reason is get_disks' behavior:
  * Without $include_partitions=1, the disk will have the same usage
    as it's first used partition, and thus wasn't allowed. (Except in
    the case that usage was LVM, where the check was bypassed, but
    luckily OSD creation just failed later because no Ceph volume
    group would be detected).
  * With $include_partitions=1, the disk will have usage 'partitions'
    and thus be allowed.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-11-11 21:50:38 +01:00
Fabian Ebner
46b1ccc357 api: ceph: create osd: set correct parttype for DB/WAL
The get_ceph_journals function in pve-storage uses this information.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-11-11 21:50:33 +01:00
Dominik Csapak
6ec3dc437b api: cephfs: add 'fs-name' for cephfs storage
so that we can uniquely identify the cephfs (in case of multiple)

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2021-11-11 17:52:08 +01:00
Dominik Csapak
352a0e5c93 api: cephfs: add fs_name to 'is mds active' check
so that we check the mds for the correct cephfs we just added

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2021-11-11 17:52:08 +01:00
Dominik Csapak
028af34e3d api: cephfs: more checks on fs create
namely if the fs is already existing, and if there is currently a
standby mds that can be used for the new fs
previosuly, only one cephfs was possible, so these checks were not
necessary. now with pacific, it is possible to have multiple cephfs'
and we should check for those.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2021-11-11 17:52:08 +01:00
Dominik Csapak
0ab69d6e88 api: cephfs: refactor {ls, create}_fs
no function change intended

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2021-11-11 17:52:08 +01:00
Dominik Csapak
d3eed3b4a8 api: ceph: fix getting ceph versions
Since commit: 8a3a300b ("ceph services: drop broadcasting legacy
version pmxcfs KV")

The 'ceph-version' kv is not broadcasted anymore, so we should not
query it, instead use get_ceph_versions

Also drop the other legacy keys for the versions

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-11-10 14:36:22 +01:00
Fabian Ebner
45d602f212 api: ceph: create osd: work around udev bug
There is a udev bug [0] which can ultimately lead to the udev database
for certain devices not being actively updated. The Diskmanage package
relies upon lsblk for certain info, and lsblk queries the udev
database. Ensure the information is updated by manually calling
'udevadm trigger' for the changed devices.

Without the fix, and a bit of bad luck, a cleaned up disk could still
show up as an 'LVM2_member' for example.

[0]: https://github.com/systemd/systemd/issues/18525

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-09-30 18:12:58 +02:00
Fabian Ebner
683a3563e7 api: check: create osd: use wipe_blockdev from the Diskmanage package
which is mostly a copy of the wipe_disks helper with the difference
that it also uses wipefs on the device and its partitions.

Remove the wipe_disks helper as no users remain.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-09-30 18:12:58 +02:00
Fabian Ebner
e256595683 api: ceph: create osd: re-check disk requirements after fork/lock
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-09-30 18:12:58 +02:00
Alwin Antreich
0b6a283801 fix #2422: allow multiple Ceph public networks
Multiple public networks can be defined in the ceph.conf. The networks need to
be routed to each other.

Support handling multiple IPs for a single monitor. By default, one address from
each public network is selected for monitor creation, but, as before, it can be
overwritten with the mon-address parameter, now taking a list of addresses.

On removal, make sure the all addresses are removed from the mon_host entry in
the ceph configuration.

Originally-by: Alwin Antreich <a.antreich@proxmox.com>
[handling of multiple addresses]
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-06-18 17:13:05 +02:00
Fabian Ebner
815325da0d api: ceph: mon: fix handling of IPv6 addresses in destroymon
by also comparing the canonical form to decide when to remove an address. When
getting the IP from the rados information, also drop eventual brackets, so our
existing function can handle it. Add the brackets back within the
remove_addr_from_mon_host function.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-06-18 17:13:05 +02:00
Fabian Ebner
3e10f0fcdb api: ceph: mon: factor out mon_host regex address removal
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-06-18 17:13:04 +02:00
Fabian Ebner
9e989449ae api: ceph: mon: fix handling of IPv6 addresses in assert_mon_prerequisites
by comparing their canonical forms.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-06-18 17:13:04 +02:00
Fabian Ebner
4be756f59c api: ceph: mon: add ips_from_mon_host helper
Partially based on pve-storage's CephConfig.pm get_monaddr_list, but the
interface is not the best for the use case here.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-06-18 17:13:04 +02:00
Fabian Ebner
396acb1577 api: ceph: mon: fix handling of IPv6 addresses in find_mon_ip
by comparing their canonical forms.

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-06-18 17:13:04 +02:00
Fabian Ebner
8ecaa0bfbe api: ceph: create mon: explicitly add subsequent monitors to the monmap
in preparation for supporting multiple addresses. The config section does not
allow more than one public_addr.

Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-06-18 17:13:04 +02:00
Fabian Ebner
57951fc78b api: ceph: create mon: factor out monmaptool command
so it's easier to re-use for a future variant.

Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-06-18 17:13:04 +02:00
Fabian Ebner
d3b899c144 api: ceph: create mon: handle ms_bind_ipv* options more generally
mostly relevant to prepare support for IPv4/IPv6 dual stack mode as a special
case of the planned support for mutliple public networks.

As before, only set the false value when we are dealing with the first address,
but also be explicit about the IPv4 case as the defaults might change in the
future.

Then, when an address of a different type comes along later, set the relevant
bind option to true.

Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-06-18 17:13:04 +02:00