proxmox-backup/pbs-datastore
Christian Ebner cb9814e331 garbage collection: fix rare race in chunk marking phase
During phase 1 of garbage collection referenced chunks are marked as
in use by iterating over all index files and updating the atime on
the chunks referenced by these.

In an edge case for long running garbage collection jobs, where a
newly added snapshot (created after the start of GC) reused known
chunks from a previous snapshot, but the previous snapshot index
referencing them disappeared before the marking phase could reach
that index (e.g. pruned because only 1 snapshot to be kept by
retention setting), known chunks from that previous index file might
not be marked (given that by none of the other index files it was
marked).

Since commit 74361da8 ("garbage collection: generate index file list
via datastore iterators") this is even less likely as now the
iteration reads also index files added during phase 1, and
therefore either the new or the previous index file will account for
these chunks (the previous backup snapshot can only be pruned after
the new one finished, since locked). There remains however a small
race window between the reading of the snapshots in the backup group
and the reading of the actual index files for marking.

Fix this race by:
1. Checking if the last snapshot of a group disappeared and if so
2. generate the list again, looking for new index files previously
   not accounted for
3. To avoid possible endless looping, lock the group if the snapshot
   list changed even after the 10th time (which will lead to
   concurrent operations to this group failing).

Signed-off-by: Christian Ebner <c.ebner@proxmox.com>
Acked-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Link: https://lore.proxmox.com/20250416105000.270166-3-c.ebner@proxmox.com
2025-04-16 14:17:24 +02:00
..
examples pbs-datastore: clippy fixes 2023-05-17 17:43:17 +02:00
src garbage collection: fix rare race in chunk marking phase 2025-04-16 14:17:24 +02:00
Cargo.toml fix #3935: datastore/api/backup: move datastore locking to '/run' 2025-03-26 16:21:43 +01:00