If we have multiple jobs for the same vmid with the same schedule, then
last_sync, next_sync and vmid will always be the same, so the order
depends on the iteration order of the $jobs hash (which is random;
thanks, Perl). To get a fixed order, also take the jobid into
consideration.
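
A minimal sketch of such a tie-break (the field and hash names here are
assumptions for illustration, not necessarily the actual ones):

    my @sorted_ids = sort {
        my ($ja, $jb) = ($jobs->{$a}, $jobs->{$b});
        $ja->{next_sync} <=> $jb->{next_sync}
            || $ja->{last_sync} <=> $jb->{last_sync}
            || $ja->{vmid} <=> $jb->{vmid}
            || $a cmp $b    # jobid as the final tie-breaker for a stable order
    } keys %$jobs;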
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Reviewed-by: Fabian Ebner <f.ebner@proxmox.com>
When running replication, we don't want to keep replication states for
non-local VMs. Normally this would not be a problem, since on migration
we transfer the states anyway, but when the ha-manager steals a VM, it
cannot do that. In that case, having an old state lying around is
harmful, since the code does not expect the state to be out of sync
with the actual snapshots on disk.
One such problem is the following:
Replicate VM 100 from node A to nodes B and C, and activate HA. When
node A dies, the VM will be relocated to e.g. node B and start
replicating from there. If node B now has an old state lying around for
its sync to node C, it might delete the common base snapshots of B and
C and then be unable to sync again.
Deleting the state for all non-local guests fixes that issue, since
replication then always starts fresh, and a potentially existing old
state cannot be valid anyway, since we just relocated the VM here (from
a dead node).
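
A rough sketch of that cleanup (helper and variable names are made up
for illustration):

    # hypothetical helper returning a hash of vmids of guests local to this node
    my $local = get_local_guest_ids();
    for my $vmid (keys %$state) {
        # keep state only for guests that still run here; a guest stolen by
        # the HA manager must start with a fresh replication state
        delete $state->{$vmid} if !$local->{$vmid};
    }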
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Fabian Ebner <f.ebner@proxmox.com>
So the repeat frequency for a stuck job is now:
t0 -> fails
t1 = t0 + 5m -> repeat
t2 = t1 + 10m = t0 + 15m -> repeat
t3 = t2 + 15m = t0 + 30m -> repeat
t4 = t3 + 30m = t0 + 60m -> repeat
then
t(x) = t(x-1) + 30m -> repeat
So we converge more naturally and stably to the 30m intervals than
before, when t3 would have been t0 + 45m.
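
A small sketch of that schedule (simplified, not the actual
implementation):

    # delay (in minutes) before the next retry, based on the number of
    # consecutive failures so far (assumes $fail_count >= 1)
    sub retry_delay_minutes {
        my ($fail_count) = @_;
        my @delays = (5, 10, 15, 30);
        return 30 if $fail_count > scalar(@delays);
        return $delays[$fail_count - 1];
    }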
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
If pvesr was terminated after finishing the new sync and removing the
old replication snapshots, but before it could write the new state, the
next replication would fail. It would wrongly interpret the actual last
replication snapshot as stale, remove it, and (if no other snapshots
were present) attempt a full sync, which would fail.
Reported in the community forum [0], this was brought to light by the
new pvescheduler before it learned to reload gracefully.
It's not possible to simply preserve a last remaining snapshot in
prepare(), because prepare() is also used for valid removals. Instead,
update last_sync early enough, i.e. before the old replication
snapshots are removed. Stale snapshots will still be removed on the
next run if there are any.
[0]: https://forum.proxmox.com/threads/100154
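
A simplified sketch of the changed ordering (the helper names are
hypothetical, only the ordering matters):

    # write out the new state, with the updated last_sync, *before* removing
    # the old replication snapshots; if we get killed in between, the worst
    # case is a leftover stale snapshot that the next run cleans up
    $state->{last_sync} = $start_time;
    write_replication_state($jobcfg, $state);            # hypothetical helper
    remove_old_replication_snapshots($jobcfg, $state);   # hypothetical helper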
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>