Commit Graph

6 Commits

Hannes Laimer
de2ddf940a replication: delete job even if it is disabled
Currently we skip all disabled jobs, including the ones that are up for
deletion, which does not make sense. This came up in support.

Signed-off-by: Hannes Laimer <h.laimer@proxmox.com>
Link: https://lore.proxmox.com/20250407085138.4653-1-h.laimer@proxmox.com
2025-04-07 14:04:13 +02:00
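
For illustration, a minimal Perl sketch of the changed check; the field
names 'disable' and 'remove_job' are assumptions here, not necessarily
the real config keys:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical job hash; 'disable' and 'remove_job' are assumed names.
    my $jobs = {
        '100-0' => { disable => 1, remove_job => 0 },  # disabled: still skipped
        '100-1' => { disable => 1, remove_job => 1 },  # disabled but up for deletion: now handled
        '101-0' => { disable => 0, remove_job => 0 },  # enabled: handled as usual
    };

    for my $jobid (sort keys %$jobs) {
        my $cfg = $jobs->{$jobid};
        # before the fix this was roughly: next if $cfg->{disable};
        next if $cfg->{disable} && !$cfg->{remove_job};
        print "handling job $jobid\n";
    }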
Dominik Csapak
f1fc7d6c61 ReplicationState: deterministically order replication jobs
If we have multiple jobs for the same vmid with the same schedule, then
last_sync, next_sync and vmid will always be the same, so the order
depends on the iteration order of the $jobs hash (which is random;
thanks, Perl).

To get a fixed order, also take the jobid into consideration.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Reviewed-by: Fabian Ebner <f.ebner@proxmox.com>
2022-06-08 08:48:04 +02:00
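
For illustration, a minimal Perl sketch of such a tie-break; the entry
fields mirror the names mentioned above (last_sync, next_sync, vmid),
but the real comparator in ReplicationState may differ:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Two jobs for the same vmid with the same schedule: all compared
    # fields are equal, so only the jobid can break the tie.
    my $jobs = {
        '100-1' => { vmid => 100, next_sync => 1000, last_sync => 900 },
        '100-0' => { vmid => 100, next_sync => 1000, last_sync => 900 },
    };

    my $cmp = sub {
        my ($ja, $jb) = @_;
        return $jobs->{$ja}->{next_sync} <=> $jobs->{$jb}->{next_sync}
            || $jobs->{$ja}->{last_sync} <=> $jobs->{$jb}->{last_sync}
            || $jobs->{$ja}->{vmid} <=> $jobs->{$jb}->{vmid}
            || $ja cmp $jb;    # jobid as the final, deterministic tie-breaker
    };

    my @ordered = sort { $cmp->($a, $b) } keys %$jobs;
    print join(', ', @ordered), "\n";    # always '100-0, 100-1'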
Dominik Csapak
1aa4d844a1 ReplicationState: purge state from non-local VMs
When running replication, we do not want to keep replication states for
non-local VMs. Normally this would not be a problem, since on migration
we transfer the states anyway, but when the ha-manager steals a VM, it
cannot do that. In that case, having an old state lying around is
harmful, since the code does not expect the state to be out of sync
with the actual snapshots on disk.

One such problem is the following:

Replicate VM 100 from node A to nodes B and C, and activate HA. When node
A dies, the VM is relocated to, e.g., node B and replication starts from
there. If node B still has an old state lying around for its sync to node
C, it might delete the common base snapshots of B and C and then be
unable to sync again.

Deleting the state for all non-local guests fixes that issue, since
replication then always starts fresh, and any potentially existing old
state cannot be valid anyway, because we just relocated the VM here
(from a dead node).

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Fabian Ebner <f.ebner@proxmox.com>
2022-06-08 08:48:04 +02:00
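
For illustration, a minimal Perl sketch of the purge step; reading and
writing the actual state file and determining the local guest list are
left out, and all names here are hypothetical:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Drop state entries for guests that are not local anymore; such a
    # state may be stale (e.g. after the ha-manager stole the VM).
    sub purge_non_local_states {
        my ($states, $local_vmids) = @_;
        my %is_local = map { $_ => 1 } @$local_vmids;
        for my $vmid (keys %$states) {
            delete $states->{$vmid} if !$is_local{$vmid};
        }
        return $states;
    }

    my $states = { 100 => { last_sync => 900 }, 101 => { last_sync => 800 } };
    purge_non_local_states($states, [100]);        # only VM 100 is local
    print join(', ', sort keys %$states), "\n";    # prints: 100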
Thomas Lamprecht
73a3e4cb23 replication config: retry the first three failures more quickly before going to 30m
So the repeat schedule for a stuck job is now:
t0 -> fails
t1 = t0 +  5m -> repeat
t2 = t1 + 10m = t0 + 15m -> repeat
t3 = t2 + 15m = t0 + 30m -> repeat
t4 = t3 + 30m = t0 + 60m -> repeat
then
tx = t(x-1) + 30m -> repeat

So we converge to the 30m interval more naturally and more stably than
before, where t3 would already have been at t0 + 45m.

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-04-27 09:59:26 +02:00
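
For illustration, a small Perl sketch that reproduces this schedule; the
function name and the way the fail count is tracked are assumptions, not
the actual replication code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Delay before the next retry: grows by 5 minutes per failure for
    # the first three failures, then stays at 30 minutes.
    sub next_retry_delay {
        my ($fail_count) = @_;    # consecutive failures so far
        return 0 if $fail_count <= 0;
        return $fail_count * 5 * 60 if $fail_count <= 3;    # 5m, 10m, 15m
        return 30 * 60;                                      # then 30m
    }

    # cumulative offsets from t0: 5m, 15m, 30m, 60m, 90m, ...
    my $t = 0;
    for my $fails (1 .. 5) {
        $t += next_retry_delay($fails);
        printf "after failure %d: retry at t0 + %dm\n", $fails, $t / 60;
    }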
Fabian Ebner
ff574bf8d2 replication: update last_sync before removing old replication snapshots
If pvesr was terminated after finishing the new sync and after removing
the old replication snapshots, but before it could write the new state,
the next replication would fail. It would wrongly interpret the actual
last replication snapshot as stale, remove it, and (if no other
snapshots were present) attempt a full sync, which would fail.

Reported in the community forum [0], this was brought to light by the
new pvescheduler before it learned graceful reload.

It's not possible to simply preserve a last remaining snapshot in
prepare(), because prepare() is also used for valid removals. Instead,
update last_sync early enough. Stale snapshots will still be removed
on the next run if there are any.

[0]: https://forum.proxmox.com/threads/100154

Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
2021-11-29 10:50:36 +01:00
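
For illustration, a minimal Perl sketch of the reordering;
write_job_state and remove_old_replication_snapshots are hypothetical
stand-ins, not the real pvesr functions:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub write_job_state { print "state written: last_sync=$_[0]->{last_sync}\n" }
    sub remove_old_replication_snapshots { print "old snapshots removed\n" }

    sub finish_replication {
        my ($state, $new_sync_time) = @_;

        # persist the new last_sync first, so an interruption after this
        # point can no longer make the next run treat the snapshot of
        # this sync as stale
        $state->{last_sync} = $new_sync_time;
        write_job_state($state);

        # any stale snapshots left over here are removed on the next run
        remove_old_replication_snapshots($state);
    }

    finish_replication({ last_sync => 0 }, time());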
Thomas Lamprecht
960c85be38 buildsys: split packaging and source build-systems
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-05-09 20:10:14 +02:00