Fiona Ebner <f.ebner@proxmox.com> says:
Record the created fleecing images in the VM configuration to be able
to remove left-overs after hard failures.
Adds a new special configuration section 'fleecing', making special
section handling more generic as preparation, as well as fixing some
corner cases in configuration parsing and adding tests.
Fiona Ebner (16):
migration: remove unused variable
test: avoid duplicate mock module in restore config test
test: add parse config tests
parse config: be precise about section type checks
test: add test case exposing issue with unknown sections
parse config: skip unknown sections and warn about their presence
vzdump: anchor matches for pending and special sections
vzdump: skip all special sections
config: make special section handling generic
test: parse config: test config with duplicate sections
parse config: warn about duplicate sections
check type: require schema as an argument
config: add fleecing section
fix #5440: vzdump: better cleanup fleecing images after hard errors
migration: attempt to clean up potential left-over fleecing images
destroy vm: clean up potential left-over fleecing images
Link: https://lore.proxmox.com/20250127112923.31703-1-f.ebner@proxmox.com
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Recording the allocated fleecing images in the VM config ensures they
are not immediately orphaned should a hard error occur during backup
that prevents cleanup.
Cleanup is attempted again during the next backup run.
In the cleanup helper, check whether fleecing images are still attached
in QEMU and detach them. This allows recovering from more failure
scenarios. However, to avoid a deadlock, a left-over backup job needs
to be canceled first. While a left-over backup is already canceled when
cleanup runs for a subsequent backup, this is also required for other
cases, like cleanup before migration (to be added in a following
commit).
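A rough sketch of that cleanup order; the helper name is hypothetical
and the detach/free steps are elided, so this is not the actual
implementation:

    use PVE::QemuServer::Monitor qw(mon_cmd);

    # volume IDs come from the recorded 'fleecing' section
    sub cleanup_fleecing_leftovers {
        my ($vmid, $volids) = @_;

        # cancel a possibly left-over backup job first, to avoid a
        # deadlock when detaching images the job still references
        eval { mon_cmd($vmid, 'backup-cancel') };

        for my $volid (@$volids) {
            ...; # detach in QEMU if still attached, then free the image
        }
    }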
Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Link: https://lore.proxmox.com/20250127112923.31703-15-f.ebner@proxmox.com
Collect special sections below a common 'special-sections' key in
preparation for introducing a new special section.
The special 'cloudinit' section was added at the top level of the
configuration structure, but it is cleaner to group special sections
in a way more similar to snapshots.
The 'cloudinit' key was already initialized, so always initializing
the new 'special-sections' key should not cause issues, after checking
and adapting all usages of 'cloudinit', which this patch attempts to
do.
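In terms of the config hash, the regrouping looks roughly like this
(a sketch; 'fleecing' is the new section added later in the series):

    # before: the special section lived at the top level of the hash
    my $cloudinit = $conf->{cloudinit};

    # after: special sections are grouped below a common key, similar
    # to how snapshots live below $conf->{snapshots}
    $cloudinit   = $conf->{'special-sections'}->{cloudinit};
    my $fleecing = $conf->{'special-sections'}->{fleecing};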
Add compat handling for remote migration, which might receive the
configuration hash from a node that does not yet have the changes.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Link: https://lore.proxmox.com/20250127112923.31703-10-f.ebner@proxmox.com
To be used in the context of template backup, where a minimized
temporary configuration is created to start the VM in 'prelaunch'
mode. Issue a warning, so that code paths where this happens will be
noted and can be evaluated and adapted.
Since the code currently doesn't use blessed config objects, special
dispatching is needed to potentially defer to the new child class in
the write_config() method.
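A minimal sketch of what such a child class could look like (the class
name is hypothetical, and whether the real implementation still
persists the config is not specified here):

    package PVE::QemuConfig::TempConfig;    # hypothetical name
    use parent 'PVE::QemuConfig';

    sub write_config {
        my ($class, $vmid, $conf) = @_;
        # writing out a minimized temporary configuration is almost
        # certainly unintended; warn so such code paths get noticed
        warn "write_config called for temporary configuration of VM $vmid\n";
        return;
    }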
Suggested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Suggested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
The function checks for resources that cannot be migrated, snapshotted,
or suspended.
To run this function while the snapshot lock is active, the
pve-guest-common patch 'AbstractConfig: add abstract method to check for
resources preventing a snapshot.' is required.
Signed-off-by: Markus Frank <m.frank@proxmox.com>
Convert the machine parameter to a property-string and use the machine
type as the default key for backward compatibility.
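With the machine type as the default key, an existing plain value stays
valid; e.g. these two lines are equivalent after the conversion
(assuming 'type' as the key name; machine version illustrative):

    machine: pc-q35-8.1
    machine: type=pc-q35-8.1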
Signed-off-by: Markus Frank <m.frank@proxmox.com>
since we always determine the deviceid, passing in a possibly wrong value makes
no sense and could actually re-introduce bugs.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Fiona Ebner <f.ebner@proxmox.com>
In preparation for adding more properties to the memory configuration,
like maximum hotpluggable memory and whether virtio-mem devices should
be used.
This also allows getting rid of the cyclic include of PVE::QemuServer
in the memory module.
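As a config-level sketch of the conversion (assuming 'current' as the
default key for the existing size value):

    # equivalent after the conversion:
    memory: 4096
    memory: current=4096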
Signed-off-by: Alexandre Derumier <aderumier@odiso.com>
[FE: also convert new usage in get_derived_property
remove cyclic include of PVE::QemuServer
add commit message]
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
All call sites except for QemuConfig.pm's get_replicatable_volumes()
already enabled it. Making it the non-configurable default results in a
change in VM replication: a disk image only referenced in the pending
section will now also be replicated.
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
Make it possible to optionally iterate over disks in the pending
section of VMs, similar to how snapshots are already handled.
This is useful, for example, during migration if we don't want to rely
on scanning all storages.
All call sites are adapted and enable it, except for
QemuConfig::get_replicatable_volumes, as including pending disks would
change replication behavior.
The following lists the call sites and whether they should be fine with
the change (source [0]); a sketch of the new opt-in follows the list:
1. QemuMigrate: scan_local_volumes(): needed to include pending disk
images
2. API2/Qemu.pm: check_vm_disks_local() for migration precondition:
related to migration, so more consistent with pending
3. QemuConfig.pm: get_replicatable_volumes(): would change the behavior
of replication, so it will not be used for now.
4. QemuServer.pm: get_vm_volumes(): is used multiple times by:
4a. vm_stop_cleanup() to deactivate/unmap: should also be fine with
including pending
4b. QemuMigrate.pm: in prepare(): part of migration, so more consistent
with pending
4c. QemuMigrate.pm: in phase3_cleanup() for deactivation: part of
migration, so more consistent with pending
[0] https://lists.proxmox.com/pipermail/pve-devel/2023-May/056868.html
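A sketch of the new opt-in; the exact signature and attribute names are
illustrative, not necessarily those from the series:

    my $include_pending = 1;
    my @volids;
    PVE::QemuServer::foreach_volid($conf, $include_pending, sub {
        my ($volid, $attr) = @_;    # $attr carries flags like 'is_vmstate'
        push @volids, $volid;
    });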
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
See the corresponding commit in guest-common for more information.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
If the config doesn't contain the cloud-init disk anymore after the
rollback, we have to clean it up; otherwise no further cloud-init disk
can be attached unless the one still existing on the storage is
deleted.
Signed-off-by: Mira Limbeck <m.limbeck@proxmox.com>
Reviewed-by: Stefan Hanreich <s.hanreich@proxmox.com>
Tested-by: Stefan Hanreich <s.hanreich@proxmox.com>
This will possibly break existing workflows like
1. add second cloud-init drive
2. remove first cloud-init drive
to change the cloud-init storage.
On the other hand, it avoids the unintended misconfiguration of having
multiple cloud-init drives with potentially different settings.
This is also in preparation for adding cloud-init-related API calls,
where not being able to assume that there's only one cloud-init
drive/state would complicate things quite a bit.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
So that there is a better chance to debug issues like in [0]. For
suspending, which uses the same QMP calls, this is already done.
[0]: https://forum.proxmox.com/threads/114203/
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
A "savevm" call (both our async variant and the upstream sync one) uses
migration code internally. As such, both expect migration capabilities
to be set.
This is usually not a problem, as the default set of capabilities is
OK. However, it leads to differing snapshot settings if one takes a
snapshot after a machine has been live-migrated (as the capabilities
persist from that), which could potentially lead to discrepancies
between snapshots (currently it seems to be fine, but it still makes
sense to set them to safeguard against future changes).
Note that we do set the "dirty-bitmaps" capability now (if
query-proxmox-support reports true), which has three effects:
1) PBS dirty-bitmaps are preserved in snapshots, enabling
fast-incremental backups to work after rollback (as long as no newer
backups exist), including for hibernate/resume
2) snapshots taken from now on, with a QEMU version supporting bitmap
migration, *might* lead to incompatibility of these snapshots with
QEMU versions that don't know about bitmaps at all (i.e. < 5.0 IIRC?)
- forward compatibility is still given, and all other capabilities we
set go back to very old versions
3) since we now explicitly disable bitmap saving if the version doesn't
report support, we avoid crashes even with not-updated QEMU versions
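A sketch of the relevant capability setup before "savevm", abbreviated
to the bitmap capability (the exact query-proxmox-support field name is
an assumption here):

    use JSON;
    use PVE::QemuServer::Monitor qw(mon_cmd);

    # only enable bitmap saving if the running QEMU reports support
    my $support = eval { mon_cmd($vmid, 'query-proxmox-support') } // {};
    mon_cmd($vmid, 'migrate-set-capabilities', capabilities => [
        {
            capability => 'dirty-bitmaps',
            state => $support->{'pbs-dirty-bitmap-savevm'}
                ? JSON::true : JSON::false,
        },
    ]);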
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
Move the logic determining which volumes are included in the backup job
to its own method and adapt the VZDump code accordingly. This makes it
possible to develop other features around backup jobs.
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
and refactor the test_volid closure. This way, get_replicatable_volumes
no longer needs a separate loop for unused volumes. get_vm_volumes,
which is used for activating/deactivating volumes during migration and
for deactivation in vm_stop_cleanup, now includes those volumes. For
migration this is an improvement, because those volumes might need to
be migrated; for vm_stop_cleanup it shouldn't hurt. The last user of
foreach_volid is check_vm_disks_local, used by migrate_vm_precondition,
where information about the additional volumes doesn't hurt either.
Note that 'replicate' is (still) set by default, so the behavior of
get_replicatable_volumes for unused volumes should not change.
Hibernation vmstate files are now also included and recognized as 'is_vmstate'.
The 'size' attribute will not be overwritten by subsequent iterations for the
same volid anymore (a volid may appear both in the config and in snapshots),
so the size from the current config is now preferred.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
It was necessary to move foreach_volid back to QemuServer.pm
In VZDump/QemuServer.pm and QemuMigrate.pm the dependency on
QemuConfig.pm was already there, just the explicit "use" was missing.
Signed-off-by: Fabian Ebner <f.ebner@proxmox.com>
Just like with live-migration, custom CPU models might change after a
snapshot has been taken (or a VM suspended), which would lead to a
different QEMU invocation on rollback/resume.
Save the "-cpu" argument as a new "runningcpu" option into the VM conf
akin to "runningmachine" and use as override during rollback/resume.
No functional change with non-custom CPU types intended.
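In the VM config this looks analogous to 'runningmachine' (values
illustrative):

    runningmachine: pc-i440fx-4.1
    runningcpu: host,+kvm_pv_eoi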
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
as preparation for refactoring it further. Remote migration will add
another 1-2 parameters, and it is already unwieldy enough as it is.
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
...PVE::QemuServer::Machine.
qemu_machine_feature_enabled is exported since it has a *lot* of users
in PVE::QemuServer and a long enough name as it is.
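Typical usage after the move; as with existing callers, the arguments
are the machine type, the running KVM version, and the minimum
major/minor version to check against:

    use PVE::QemuServer::Machine qw(qemu_machine_feature_enabled);

    # true if the pinned machine type (or, without pinning, the
    # running KVM version) is at least 4.0
    if (qemu_machine_feature_enabled($machine_type, $kvmver, 4, 0)) {
        ...;
    }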
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
QMP and monitor helpers are moved from QemuServer.pm.
By using only vm_running_locally instead of check_running, a cyclic
dependency on QemuConfig is avoided. This also means that the $nocheck
parameter no longer serves a purpose and has thus been removed along
with vm_mon_cmd_nocheck.
Care has been taken to avoid errors resulting from this, and
occasionally a manual check for a VM's existence was inserted at the
call site.
Methods have been renamed to avoid redundant naming:
* vm_qmp_command -> qmp_cmd
* vm_mon_cmd -> mon_cmd
* vm_human_monitor_command -> hmp_cmd
mon_cmd is exported since it has many users. This patch also changes all
non-package users of vm_qmp_command to use the mon_cmd helper. Includes
mocking for tests.
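For callers, the rename looks like this:

    use PVE::QemuServer::Monitor qw(mon_cmd);

    # old: my $status = PVE::QemuServer::vm_mon_cmd($vmid, 'query-status');
    my $status = mon_cmd($vmid, 'query-status');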
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
vm_exists_on_node in PVE::QemuConfig checks if a config file for a vmid
exists
vm_running_locally in PVE::QemuServer::Helpers checks if a VM is running
on the local machine by probing its pidfile and checking /proc/.../cmdline
check_running is left in QemuServer for compatibility, but changed to
simply call the two new helper functions.
Both methods are also correctly mocked for testing snapshots.
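A sketch of check_running as the resulting compatibility wrapper:

    sub check_running {
        my ($vmid, $nocheck, $node) = @_;

        # unless $nocheck is set, require a config file for the vmid
        die "unable to find configuration file for VM $vmid - no such machine\n"
            if !$nocheck && !PVE::QemuConfig::vm_exists_on_node($vmid, $node);

        return PVE::QemuServer::Helpers::vm_running_locally($vmid);
    }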
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
Also remove unused $confdir variable in QemuConfig, but leave it and
$lock_dir there, since those paths should only be used with
cfs_config_path anyway.
nodename() is still called in multiple places, but since it's cached by
INotify it doesn't really matter.
Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
So, while we could just make this a special case before the
config_to_command call and set $conf->{vmstate} to the statefile for
the case where it's a valid volume ID, the special-case handling gets
much easier when we do this outside of that method.
So it's basically a trade-off, and after looking far too long at all
the nice revisions Alwin made for me and for Fabian's requests, and
even trying out different approaches, it was never perfect.
But having slight code duplication over the movement mess I proposed
(as I did not have the full picture then, sorry Alwin) felt like the
slightly nicer trade-off. As all approaches worked, I just use this one
now: it has very clear semantics, is easy to understand, and that three
lines are now duplicated is IMO irrelevant.
Co-developed-by: Alwin Antreich <a.antreich@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
A VM generated with default values does not set the memory key in the
configuration. Hence, the state file for a suspend was only sized for
the state itself, without additionally accounting for twice the
configured memory.
The patch uses the static defaults from the JSON schema if the memory
key is not set.
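A sketch of the fix ($state_base stands in for the base size of the
state itself; variable names are illustrative):

    my $defaults = PVE::QemuServer::load_defaults();
    # fall back to the static default from the JSON schema if unset
    my $memory = $conf->{memory} // $defaults->{memory};
    # state size: base size of the state plus twice the memory
    my $size = $state_base + 2 * $memory;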
Signed-off-by: Alwin Antreich <a.antreich@proxmox.com>
this makes it possible to specify a storage for state saving, if one
wants to use a different storage than for snapshots, or does not want
to save this info into the config
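In the config this is a storage ID; assuming the option name
'vmstatestorage' (storage name illustrative):

    vmstatestorage: local-lvm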
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
the idea is to have the same logic as with snapshots, but without
snapshotting the disks; after saving the VM state (incl. memory), we
hard shut off the guest. This way, the disks will not be touched by the
guest anymore.
To prevent any alteration of the VM (incl. migration, hw changes,
etc.), we add a config lock 'suspend'.
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
instead of overwriting the 'machine' config in the snapshot, use a
separate 'runningmachine' config only for the snapshot.
This way, we do not lose the machine type if it was explicitly set
during the snapshot, but deleted afterwards.
We also have to adapt the tests for this.
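For example (snapshot name and machine versions illustrative), the
explicitly set 'machine' stays untouched while the type actually in
use is recorded separately:

    [snap1]
    machine: q35
    runningmachine: pc-q35-3.1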
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>