Commit Graph

43 Commits

Author SHA1 Message Date
Dominik Csapak
458b487bed pci: don't hard require resetting devices for passthrough
Since pve-common commit:

 eff5957 (sysfstools: file_write: properly catch errors)

this check here fails now when the reset does not work. It turns out
that resetting the device is not always necessary, and we previously
ignored most errors when trying to do so.

To restore that functionality, downgrade this `die` to a warning.

If the device really needs a reset to work, it will either fail later
during startup, or not work correctly in the guest, but that behavior
existed before and is AFAIK not really detectable from our side.

Also improve the warning message a bit to not scare users and explain
that we're continuing.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
 [ TL: fine-tune error message a bit and avoid parenthesis ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-11-11 20:33:25 +01:00
Thomas Lamprecht
a28e6fe6f9 pci: make variable name slightly easier to read
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-10-24 18:44:53 +02:00
Dominik Csapak
48ada6982f pci: mdev: adapt to NVIDIA's modern interface with kernel >= 6.8
Since kernel 6.8, NVIDIAs vGPU driver does not use the generic mdev
interface anymore, since they relied on a feature there which is not
available anymore. IIUC the kernel [0] recommends drivers to implement
their own device specific features since putting all in the generic one
does not make sense.

They now have an 'nvidia' folder in the device sysfs path, which
contains the files `creatable_vgpu_types`/`current_vgpu_type` to
control the virtual functions model, and then the whole virtual function
has to be passed through (although without resetting and changing to the
vfio-pci driver).

This patch implements changes so that from a config perspective, it
still is an mediated device, and we map the functionality iff the device
has no mediated devices but the new NVIDIAs sysfsapi and the model name
is 'nvidia-<..>'

It behaves a bit different than mdevs and normal pci passthrough, as we
have to choose the correct device immediately since it's bound to the
pciid, but we must not bind the device to vfio-pci as the NVIDIA driver
implements this functionality itself.

When cleaning up, we iterate over all reserved devices (since for a
mapping we can't know at this point which was chosen besides looking at
the reservations) and reset the vgpu model to '0', so it frees up the
reservation from NVIDIAs side. (We also do that in a loop, since it's
not always immediately ready after QEMU closes)

A general problem (but that was previously also the case) is that a
showcmd (for a not running guest) reserves the pciids, which might block
an execution of a different real vm. This is now a bit more problematic
as we (temporarily) set the vgpu type then.

0: https://docs.kernel.org/driver-api/vfio-pci-device-specific-driver-acceptance.html

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Christoph Heiss <c.heiss@proxmox.com>
Reviewed-by: Christoph Heiss <c.heiss@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-10-24 18:43:52 +02:00
Dominik Csapak
d7fe48e9aa pci: device reservation: allow one to only free a subset of IDs
Add an optional parameter to the helper that removes PCI reservations
so that we can partially release IDs again. This will be necessary for
NVIDIAs new sysfs api

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Christoph Heiss <c.heiss@proxmox.com>
Reviewed-by: Christoph Heiss <c.heiss@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-10-24 18:40:55 +02:00
Dominik Csapak
fc23c72a42 pci: device selection: don't reserve PCI IDs when VM is already running
Since the only way this could happen is when we're being called
from 'qm showcmd' and there we don't want to reserve or create anything.

In case the VM was not running, we actually reserve the devices, so we
want to call 'cleanup_pci_devices' after to remove those again. This
minimizes the timespan where those devices are not available for real vm
starts.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Christoph Heiss <c.heiss@proxmox.com>
Reviewed-by: Christoph Heiss <c.heiss@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2024-10-24 18:39:37 +02:00
Maximiliano Sandoval
be8c868f0c fix typos in user-visible strings
This includes docs, and strings printed to stderr or stdout.

These were caught with:

    typos --exclude test --exclude changelog

Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>
2024-10-24 13:15:06 +02:00
Dominik Csapak
9b71c34d61 enable cluster mapped PCI devices for guests
this patch allows configuring pci devices that are mapped via cluster
resource mapping when the user has 'Resource.Use' on the ACL path
'/mapping/pci/{ID}' (in  addition to the usual required vm config
privileges)

When given multiple mappings in the config, we use them as alternatives
for the passthrough, and will select the first free one on startup.
It is using our regular pci reservation mechanism for regular devices and
we introduce a selection mechanism for mediated devices.

A few changes to the inner workings were required to make this work well:
* parse_hostpci now returns a different structure where we have a list
  of lists (first level is for the different alternatives and second
  level is for the different devices that should be passed through
  together)
* factor out the 'parse_hostpci_devices' which parses each device from
  the config and does some precondition checks
* reserve_pci_usage now behaves slightly different when trying to
  reserve an device with the same VMID that's already reserved for,
  since for checking which alternative we can use, we already must
  reserve one (this means that qm showcmd can actually reserve devices,
  albeit only for up to 10 seconds)
* configuring a mediated device on a multifunction device is not
  supported anymore, and results in failure to start (previously, it
  just chose the first device to do it). This is a breaking change
* configuring a single pci device twice on different hostpci slots now
  fails during commandline generation instead on qemu start, so we had
  to adapt one test where this occurred (it could never have worked
  anyway)
* we now check permissions during clone/restore, meaning raw/real
  devices can only be cloned/restored by root@pam from now on.
  this is a breaking change.

Fixes #3574: Improve SR-IOV usability
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-By:  Markus Frank <m.frank@proxmox.com>
2023-06-16 16:24:02 +02:00
Dominik Csapak
6fa358a334 pci: make mediated device sysfs path independent of PCI id
mdevs have a host-unique UUID they are indexed with in the PCI-id
independent `/sys/bus/mdev/devices/<uuid>` path, so there is no need
to go through the PCI id for them.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2022-11-09 09:06:19 +01:00
Thomas Lamprecht
2fa64dbddd pci: add/improve HW reservation comments
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-11-09 08:55:55 +01:00
Dominik Csapak
1b189121fc vm start/stop: cleanup passed-through pci devices in more situations
if the preparing of PCI devices or the start of the VM fails, we need
to cleanup the PCI devices (reservations *and* mdevs), or else it
might happen that there are leftovers which must be manually removed.

to include also mdevs now, refactor the cleanup code from
'vm_stop_cleanup' into it's own function, and call that instead of
only 'remove_pci_reservation'

also simplifies the code, such that it now removes all PCI ids
reserved for that VMID, since we cannot have multiple VMs with the
same VMID anyway

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2022-11-09 08:49:45 +01:00
Dominik Csapak
bbf96e0f1e automatically add 'uuid' parameter when passing through NVIDIA vGPU
When passing through an NVIDIA vGPU via mediated devices, their
software needs the qemu process to have the 'uuid' parameter set to the
one of the vGPU. Since it's currently not possible to pass through multiple
vGPUs to one VM (seems to be an NVIDIA driver limitation at the moment),
we don't have to take care about that.

Sadly, the place we do this, it does not show up in 'qm showcmd' as we
don't (want to) query the pci devices in that case, and then we don't
have a way of knowing if it's an NVIDIA card or not. But since this
is informational with QEMU anyway, i'd say we can ignore that.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2022-08-12 13:42:33 +02:00
Dominik Csapak
d8a7e9e881 PCI: allow longer pci domains
some systems[0] have pci domains longer than the default ('0000') of 4
characters, so change the regex to allow at least 4.

0: https://forum.proxmox.com/threads/problem-with-gpu-passthrough-in-a-virtual-machine.105720/

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2022-03-16 18:03:35 +01:00
Nicholas Sherlock
d806b017ac pci: allow override of PCI vendor/device ids
This allows mobile- and vGPUs to be presented to the guest as if they
were the original desktop variants of the card. It also allows
device-ID variants that guests don't know about to be renamed to
match compatible sibling devices the guest does have drivers for
(e.g. to remove manufacturer-specific vendor ID variants that prevent
the use of a device which would otherwise have a supported chipset)

e.g. hostpci0: 03:00,vendor-id=0x8086,device-id=0x10f6

Signed-off-by: Nicholas Sherlock <n.sherlock@gmail.com>
Reviewed-by: Dominik Csapak <d.csapak@proxmox.com>
Tested-by: Dominik Csapak <d.csapak@proxmox.com>
2022-01-25 10:59:23 +01:00
Thomas Lamprecht
d01de38cb6 pci: prepare: improve no-IOMMU error message
give some context

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-10-15 19:58:16 +02:00
Thomas Lamprecht
a01593676c pci reservation: rework helpers style and readability wise
both style and readability are naturally subjective to a certain
degree...

Also, this patch mixes a bit much into one thing, but splitting that
up would mean lots of work I just wanted to avoid, sorry about that.

Among other things:

- avoid a level of indentation in the reserve loop
- rename pciids to reservation_list where it was a better fit
- make reserve set either pid or time to avoid suggesting that we
  save both
- rename parameters to requested/dropped IDs for easier understanding
  what's going on in the code
- avoid old_pid/pid, use running_pid and reserver_pid instead to
  clarify what they actually mean
- drop useless returns to avoid suggesting the return value has any
  use and save some lnes
- use a hash slice to delete all dropped IDs at once, shorter and
  faster
- use 5 second timeout for reservation, this does nothing intensive
  nor does it wait for anything, so the critical section should be
  really short, 5s is really long enough for a wait..

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-10-15 19:58:16 +02:00
Thomas Lamprecht
bda0ebff2d pci reservation: move lock/reservation file into /run/qemu-server
lck needs to die, the days of any 8.3 file naming schemes are long
gone (in the server space that is ;)

/var/run is /run so use the shorter, and while /var/lock is a OK
place for the locks we try to keep lock and lock-object together
nowadays. The qemu-server sub-directory avoids overly cluttering the
already crowded top-level /run dir

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-10-15 18:17:34 +02:00
Thomas Lamprecht
cda95d5223 pci reservation: encode locklessness of parsers in name
to avoid that they're misused

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-10-15 14:44:50 +02:00
Dominik Csapak
3bfee796f4 pci: add helpers to (un)reserve pciids for a vm
saves a list of pciid <-> vmid mappings in /var/run
that we can check when we start a vm

if we're not given a pid but a timeout, we save the time when the
reservation will run out (current time + timeout + 5s) since each
vm start (until we can save the pid) varies from config to config

reserve_pci_usage and remove_pci_reservation always expect a list of ids
so that we can update the reservation for a vm all at once

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2021-10-11 09:07:52 +02:00
Thomas Lamprecht
71cb8e0f87 pci related code cleanups
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-10-11 08:39:28 +02:00
Thomas Lamprecht
e2b42bee6d pci: use local helper to generated generate_mdev_uuid
avoid (API) leaking qemu-server specific stuff into pve-common

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-10-11 08:38:28 +02:00
Thomas Lamprecht
82712fcd3c pci: prepare_pci_device: fixup parameter name
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2021-10-11 08:37:35 +02:00
Dominik Csapak
acd4b77745 pci: refactor pci device preparation
makes the vm start a bit less crowded

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2021-10-08 06:27:19 +02:00
Dominik Csapak
a4d5b84c9c pci: to not capture first group in PCIRE
we do not need this group, but want to use the regex where we have
multiple groups, so make it a non-capture group

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2021-10-05 16:14:42 +02:00
Thomas Lamprecht
41af2dfc25 PCI: use warnings/strict and fix setting $vga from config2command
fixes commit 74c17b7a23 which moved
this code here, but forgot to pass $vga ref, as the module was not
using warning nor strict mode this was not caught..

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2020-10-16 18:03:32 +02:00
Thomas Lamprecht
f7d1505b0c tree wide cleanups
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2020-10-16 18:03:32 +02:00
Thomas Lamprecht
d1c1af4b02 tree wide cleanup of s/return undef/return/
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2020-10-16 16:20:05 +02:00
Stefan Reiter
2141a802b8 fix #3010: add 'bootorder' parameter for better control of boot devices
(also fixes #3011)

Deprecates the old-style 'boot' and 'bootdisk' options by adding a new
'order=' subproperty to 'boot'.

This allows a user to specify more than one disk in the boot order,
helping with newer versions of SeaBIOS/OVMF where disks without a
bootindex won't be initialized at all (breaks soft-raid and some LVM
setups).

This also allows specifying a bootindex for USB and hostpci devices,
which was not possible before. Floppy boot support is not supported in
the new model, but I doubt that will be a problem (AFAICT we can't even
attach floppy disks to a VM?).

Default behaviour is intended to stay the same, i.e. while new VMs will
receive the new 'order' property, it will be set so the VM starts the
same as before (using get_default_bootorder).

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
2020-10-14 12:30:50 +02:00
Dominik Csapak
7de7f675c2 fix mdev cmdline generation
during refactoring, the vmid got lost, but is necessary to get
the correct mdev id

Fixes commit 74c17b7a23
Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
[ reference fixed commit ]
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2020-07-13 10:29:25 +02:00
Thomas Lamprecht
1fac3a0b31 pci: whitespace, indentation and formating fixes
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2020-06-25 13:33:26 +02:00
Stefan Reiter
13d689792e fix #2794: allow legacy IGD passthrough
Legacy IGD passthrough requires address 00:1f.0 to not be assigned to
anything on QEMU startup (currently it's assigned to bridge pci.2).
Changing this in general would break live-migration, so introduce a new
hostpci parameter "legacy-igd", which if set to 1 will move that bridge
to be nested under bridge 1.

This is safe because:
* Bridge 1 is unconditionally created on i440fx, so nesting is ok
* Defaults are not changed, i.e. PCI layout only changes when the new
parameter is specified manually
* hostpci forbids migration anyway

Additionally, the PT device has to be assigned address 00:02.0 in the
guest as well, which is usually used for VGA assignment. Luckily, IGD PT
requires vga=none, so that is not an issue either.

See https://git.qemu.org/?p=qemu.git;a=blob;f=docs/igd-assign.txt

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
2020-06-25 13:25:35 +02:00
Stefan Reiter
74c17b7a23 cfg2cmd: hostpci: move code to PCI.pm
To avoid further cluttering config_to_command with subsequent changes.

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
2020-06-25 13:25:35 +02:00
Stefan Reiter
2cf61f33d9 fix #2264: add virtio-rng device
Allow a user to add a virtio-rng-pci (an emulated hardware random
number generator) to a VM with the rng0 setting. The setting is
version_guard()-ed.

Limit the selection of entropy source to one of three:
/dev/urandom (preferred): Non-blocking kernel entropy source
/dev/random: Blocking kernel source
/dev/hwrng: Hardware RNG on the host for passthrough

QEMU itself defaults to /dev/urandom (or the equivalent getrandom()
call) if no source file is given, but I don't fully trust that
behaviour to stay constant, considering the documentation [0] already
disagrees with the code [1], so let's always specify the file ourselves.

/dev/urandom is preferred, since it prevents host entropy starvation.
The quality of randomness is still good enough to emulate a hwrng, since
a) it's still seeded from the kernel's true entropy pool periodically
and b) it's mixed with true entropy in the guest as well.

Additionally, all sources about entropy predicition attacks I could find
mention that to predict /dev/urandom results, /dev/random has to be
accessed or manipulated in one way or the other - this is not possible
from a VM however, as the entropy we're talking about comes from the
*hosts* blocking pool.

More about the entropy and security implications of the non-blocking
interface in [2] and [3].

Note further that only one /dev/hwrng exists at any given time, if
multiple RNGs are available, only the one selected in
'/sys/devices/virtual/misc/hw_random/rng_current' will feed the file.
Selecting this is left as an exercise to the user, if at all required.

We limit the available entropy to 1 KiB/s by default, but allow the user
to override this. Interesting to note is that the limiter does not work
linearly, i.e. max_bytes=1024/period=1000 means that up to 1 KiB of data
becomes available on a 1000 millisecond timer, not that 1 KiB is
streamed to the guest over the course of one second - hence the
configurable period.

The default used here is the same as given in the QEMU documentation [0]
and has been verified to affect entropy availability in a guest by
measuring /dev/random throughput. 1 KiB/s is enough to avoid any
early-boot entropy shortages, and already has a significant impact on
/dev/random availability in the guest.

[0] https://wiki.qemu.org/Features/VirtIORNG
[1] https://git.qemu.org/?p=qemu.git;a=blob;f=crypto/random-platform.c;h=f92f96987d7d262047c7604b169a7fdf11236107;hb=HEAD
[2] https://lwn.net/Articles/261804/
[3] https://lwn.net/Articles/808575/

Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
2020-03-06 18:09:04 +01:00
Dominik Csapak
2513b862e6 fix #2566: increase scsi limit to 31
to achieve this we have to add 3 new scsihw addresses since lsi
controllers can only hold 7 scsi drives

we go up to 31, since this is the limit for virtio-scsi-single devices
we have reserved (we can increase this in the future)

to make it more future proof, we add a new pci bridge under pci
bridge 1, so we have to adapt the bridge adding code (we did not
need this for q35 previously)

impact on live migration:
since on older versions of qemu-server we do not have those config
settings, there is no problem from old -> new

new->old is not supported anyway and this breaks so that
the vm crashes and loses the configs for scsi15-30
(same behaviour as e.g. with audio0 and migration from new->old)

tested with 31 scsi disk on
i440fx + virtio-scsi
i440fx + lsi
q35 + virtio-scsi
q35 + lsi
with ovmf + seabios

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2020-01-31 20:26:26 +01:00
Thomas Lamprecht
e2b0d85dda PCIe passthrough: fixup: avoid addr conflict and cleanup a bit
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-09-06 19:27:30 +02:00
Thomas Lamprecht
d7d698f60c pci: add conflict tests
best viewed with: git show -w

Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
2019-09-06 19:27:30 +02:00
Aaron Lauterer
c4e1638148 Add support for up to 16 PCI(e) devices
For non pci express passthrough additional addresses are reserved.
For pcie passthrough pcie root ports are needed (unless guest is like
windows 7).

The first 4 pcie root ports are defined by default in the pve-q35*.cfg
files. If more than 4 pcie devices are passed through the needed root
ports are created on demand. This helps to keep live migration possible
without adding a new pve-q35*.cfg file.

For the windows 7 like guests additional addresses are reserved as well.

Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2019-09-06 19:27:30 +02:00
Aaron Lauterer
d438e06028 Add PCI address for audio device
Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>
2019-07-18 08:24:39 +02:00
Dominik Csapak
6dbcb07367 add ivshmem device to config
with such a shared memory device, a vm can share data with other
vms or with the host via memory

one of the use cases is looking-glass[1] with pci-passthrough, which copies
the guest fb to the host and you get a high-speed, low-latency
display client for the vm

on vm stop we delete the file again

1: https://looking-glass.hostfission.com/

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2019-02-26 08:01:12 +01:00
Dominik Csapak
739ba34024 add win7 pcie quirk
Win7 is very picky about pcie assignments and fails with
'error 12' the way we add hospci devices.

To combat that, we simply give the hostpci device a normal port
instead.

Start with address 0x10, so that we have space before those devices,
and between them and the ones configured in pve-q35.cfg should we
need it in the future.

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2018-12-17 14:00:23 +01:00
Dominik Csapak
b71351a7ed QemuServer: remove PCI sysfs helpers
and use them from PVE::SysFSTools, where they got moved to

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2018-11-19 14:06:11 +01:00
Wolfgang Bumiller
d559309fcf arm: pci addressing, keyboard and ehci controller
On arm we start off with a pcie bridge pcie.0. We need a
keyboard in addition to the tablet device, and we need to
connect both to an 'ehci' controller.

To do all this, we also pass the $arch variable through a
whole lot of function calls to ultimately also adapt the
hotplug code to take care of the new keyboard device.

Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
2018-11-13 14:44:28 +01:00
Dominik Csapak
55655ebc32 fix #1952: make vga memory configurable
we change 'vga' to a property string and add a 'memory' property
with this, the user can better control the memory given to the virtual
gpu, this is especially useful for spice/qxl since high resolutions need
more memory

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2018-11-09 13:45:07 +01:00
Dominik Csapak
de9768f002 refactor PCI into own file
to reduce QemuServer.pm size
also move the $device hash out of any function

Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>
2016-06-22 09:13:16 +02:00