Dominik Csapak a672c578e0 mediated device pass-through: fix race condition on VM reboot
When rebooting a VM from PVE (via CLI/API), the reboot code is called
under a guest lock, which creates a reboot request, shuts down the VM
and then calls the regular cleanup code, which includes the mdev
cleanup.

In parallel, qmeventd observes that the VM process has gone and
starts `qm cleanup`, which (among other tasks) also starts the VM
again if a reboot from the PVE side is pending.
qmeventd synchronizes this through a lock on the guest, with a
default timeout of 10 seconds.

Since we currently also always wait 10 seconds for the NVIDIA driver
to clean up the mdev, this creates a race condition for the cleanup
lock. In other words, if `qm cleanup` starts before we begin the
10-second sleep, it cannot acquire its lock within its timeout and
does not start the VM again.
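A minimal Python sketch of this race, for illustration only. It assumes a plain `threading.Lock` stands in for the guest lock and that one "second" is scaled down to 0.02 s; the function name `simulate_reboot` is hypothetical and not part of qemu-server, which is written in Perl.

```python
import threading
import time

# Hypothetical scaling: one "second" from the commit message is 0.02 s here.
UNIT = 0.02


def simulate_reboot(cleanup_hold: float, waiter_timeout: float) -> bool:
    """Return True if the `qm cleanup` stand-in acquired the guest lock in time."""
    guest_lock = threading.Lock()
    holder_has_lock = threading.Event()

    def reboot_path() -> None:
        # Stand-in for the reboot code: it keeps holding the guest lock
        # while the mdev cleanup sleeps.
        with guest_lock:
            holder_has_lock.set()
            time.sleep(cleanup_hold * UNIT)

    holder = threading.Thread(target=reboot_path)
    holder.start()
    holder_has_lock.wait()  # make sure the reboot path got the lock first

    # Stand-in for `qm cleanup`: wait at most `waiter_timeout` for the lock.
    acquired = guest_lock.acquire(timeout=waiter_timeout * UNIT)
    if acquired:
        guest_lock.release()
    holder.join()
    return acquired
```

With a hold time well above the waiter's timeout, for example `simulate_reboot(20, 10)`, the waiter gives up and returns `False`; raising the timeout, as in `simulate_reboot(10, 60)`, lets it acquire the lock. When the two durations are about equal, the outcome depends on scheduling, which is exactly the race described above.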

To avoid the race condition in practice, do two things:
* increase the timeout in `qm cleanup` to 60 seconds.
  Technically, this can still run into the timeout, as up to 16
  mediated devices can be configured, each delaying up to 10 seconds in
  the worst case, but realistically most users won't configure more
  than two or three of them, if even that.

* change the hard-coded `sleep 10` to a loop that sleeps for 1 second
  at a time before checking the state again. This shortens the wait
  when the NVIDIA driver does not need the full 10 seconds to finish
  the clean-up.
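The two changes can be sketched in Python as follows. The helper names `wait_for_mdev_cleanup` and `worst_case_wait` and the `is_cleaned_up` callback are hypothetical; the actual change is in qemu-server's Perl code. This sketch also assumes the state is checked once before the first sleep.

```python
import time
from typing import Callable


def wait_for_mdev_cleanup(
    is_cleaned_up: Callable[[], bool],
    max_wait: int = 10,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until is_cleaned_up() returns True or max_wait intervals have passed.

    Returns the number of intervals slept, so an early clean-up
    returns before the full 10 seconds have elapsed.
    """
    slept = 0
    while slept < max_wait and not is_cleaned_up():
        sleep(interval)
        slept += 1
    return slept


def worst_case_wait(num_mdevs: int, per_device: int = 10) -> int:
    """Upper bound on the total wait if every mdev needs the full delay."""
    return num_mdevs * per_device
```

If the driver finishes after three checks, the loop returns 3 instead of always waiting 10. The worst-case arithmetic shows why the new 60-second timeout is a practical rather than absolute fix: `worst_case_wait(3)` is 30 seconds, well under 60, while `worst_case_wait(16)` is 160 seconds, which would still exceed it.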

Further, add a bit of logging, so one can properly see in the task log
what is happening at which point in time.

Fixes: 49c51a60 (pci: workaround nvidia driver issue on mdev cleanup)
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Reviewed-by: Mira Limbeck <m.limbeck@proxmox.com>
 [ TL: change warn to print, reword commit message ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-03-08 14:15:38 +01:00
Name                Date                        Last commit
debian              2023-11-22 14:12:52 +01:00  bump version to 8.0.10
PVE                 2024-03-08 14:15:38 +01:00  mediated device pass-through: fix race condition on VM reboot
qemu-configs        2019-09-24 18:59:35 +02:00  move qemu-configs to own directory
qmeventd            2023-07-17 11:30:49 +02:00  qmeventd: VMID from PID: avoid goto
test                2023-11-20 16:36:51 +01:00  tests: cfg2cmd: rename vnc-clipboard to lower-case and add description
vm-network-scripts  2023-11-21 20:51:56 +01:00  sdn: pass vmid and hostname to add_dhcp_mapping
.gitignore          2023-11-17 15:54:24 +01:00  gitignore: sort content
bootsplash.jpg      2016-09-08 12:22:01 +02:00  add seabios bootsplash and use it
bootsplash.xcf      2016-09-08 12:22:01 +02:00  add seabios bootsplash and use it
Makefile            2023-05-19 15:06:46 +02:00  buildsys: rework clean target, avoid doc-gen one
modules-load.conf   2015-02-27 16:09:41 +01:00  remove unnecessary init.d, postint, postrm and qmupdate scripts
qm                  2015-10-05 13:10:24 +02:00  convert qmrestore into a PVE::CLI class
qmextract           2017-08-23 10:03:37 +02:00  remove legacy sparsecp
qmrestore           2015-10-05 13:10:24 +02:00  convert qmrestore into a PVE::CLI class