qemu-server

mirror of https://git.proxmox.com/git/qemu-server synced 2025-12-07 23:14:38 +00:00

Author	SHA1	Message	Date
Dominik Csapak	458b487bed	pci: don't hard require resetting devices for passthrough Since pve-common commit: eff5957 (sysfstools: file_write: properly catch errors) this check here fails now when the reset does not work. It turns out that resetting the device is not always necessary, and we previously ignored most errors when trying to do so. To restore that functionality, downgrade this `die` to a warning. If the device really needs a reset to work, it will either fail later during startup, or not work correctly in the guest, but that behavior existed before and is AFAIK not really detectable from our side. Also improve the warning message a bit to not scare users and explain that we're continuing. Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> [ TL: fine-tune error message a bit and avoid parenthesis ] Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2024-11-11 20:33:25 +01:00
Thomas Lamprecht	a28e6fe6f9	pci: make variable name slightly easier to read Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2024-10-24 18:44:53 +02:00
Dominik Csapak	48ada6982f	pci: mdev: adapt to NVIDIA's modern interface with kernel >= 6.8 Since kernel 6.8, NVIDIA's vGPU driver does not use the generic mdev interface anymore, since they relied on a feature there which is not available anymore. IIUC the kernel [0] recommends drivers to implement their own device specific features since putting all in the generic one does not make sense. They now have an 'nvidia' folder in the device sysfs path, which contains the files `creatable_vgpu_types`/`current_vgpu_type` to control the virtual functions model, and then the whole virtual function has to be passed through (although without resetting and changing to the vfio-pci driver). This patch implements changes so that from a config perspective, it still is a mediated device, and we map the functionality iff the device has no mediated devices but NVIDIA's new sysfs API and the model name is 'nvidia-<..>' It behaves a bit differently than mdevs and normal pci passthrough, as we have to choose the correct device immediately since it's bound to the pciid, but we must not bind the device to vfio-pci as the NVIDIA driver implements this functionality itself. When cleaning up, we iterate over all reserved devices (since for a mapping we can't know at this point which was chosen besides looking at the reservations) and reset the vgpu model to '0', so it frees up the reservation from NVIDIA's side. (We also do that in a loop, since it's not always immediately ready after QEMU closes) A general problem (but that was previously also the case) is that a showcmd (for a not running guest) reserves the pciids, which might block an execution of a different real vm. This is now a bit more problematic as we (temporarily) set the vgpu type then. 0: https://docs.kernel.org/driver-api/vfio-pci-device-specific-driver-acceptance.html Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> Tested-by: Christoph Heiss <c.heiss@proxmox.com> Reviewed-by: Christoph Heiss <c.heiss@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2024-10-24 18:43:52 +02:00
Dominik Csapak	d7fe48e9aa	pci: device reservation: allow one to only free a subset of IDs Add an optional parameter to the helper that removes PCI reservations so that we can partially release IDs again. This will be necessary for NVIDIA's new sysfs API Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> Tested-by: Christoph Heiss <c.heiss@proxmox.com> Reviewed-by: Christoph Heiss <c.heiss@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2024-10-24 18:40:55 +02:00
Dominik Csapak	fc23c72a42	pci: device selection: don't reserve PCI IDs when VM is already running Since the only way this could happen is when we're being called from 'qm showcmd' and there we don't want to reserve or create anything. In case the VM was not running, we actually reserve the devices, so we want to call 'cleanup_pci_devices' after to remove those again. This minimizes the timespan where those devices are not available for real vm starts. Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> Tested-by: Christoph Heiss <c.heiss@proxmox.com> Reviewed-by: Christoph Heiss <c.heiss@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2024-10-24 18:39:37 +02:00
Maximiliano Sandoval	be8c868f0c	fix typos in user-visible strings This includes docs, and strings printed to stderr or stdout. These were caught with: typos --exclude test --exclude changelog Signed-off-by: Maximiliano Sandoval <m.sandoval@proxmox.com>	2024-10-24 13:15:06 +02:00
Dominik Csapak	9b71c34d61	enable cluster mapped PCI devices for guests this patch allows configuring pci devices that are mapped via cluster resource mapping when the user has 'Resource.Use' on the ACL path '/mapping/pci/{ID}' (in addition to the usual required vm config privileges) When given multiple mappings in the config, we use them as alternatives for the passthrough, and will select the first free one on startup. It is using our regular pci reservation mechanism for regular devices and we introduce a selection mechanism for mediated devices. A few changes to the inner workings were required to make this work well: * parse_hostpci now returns a different structure where we have a list of lists (first level is for the different alternatives and second level is for the different devices that should be passed through together) * factor out the 'parse_hostpci_devices' which parses each device from the config and does some precondition checks * reserve_pci_usage now behaves slightly differently when trying to reserve a device with the same VMID that's already reserved for, since for checking which alternative we can use, we already must reserve one (this means that qm showcmd can actually reserve devices, albeit only for up to 10 seconds) * configuring a mediated device on a multifunction device is not supported anymore, and results in failure to start (previously, it just chose the first device to do it). This is a breaking change * configuring a single pci device twice on different hostpci slots now fails during commandline generation instead of on qemu start, so we had to adapt one test where this occurred (it could never have worked anyway) * we now check permissions during clone/restore, meaning raw/real devices can only be cloned/restored by root@pam from now on. this is a breaking change. Fixes #3574: Improve SR-IOV usability Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> Tested-By: Markus Frank <m.frank@proxmox.com>	2023-06-16 16:24:02 +02:00
Dominik Csapak	6fa358a334	pci: make mediated device sysfs path independent of PCI id mdevs have a host-unique UUID they are indexed with in the PCI-id independent `/sys/bus/mdev/devices/<uuid>` path, so there is no need to go through the PCI id for them. Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2022-11-09 09:06:19 +01:00
Thomas Lamprecht	2fa64dbddd	pci: add/improve HW reservation comments Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2022-11-09 08:55:55 +01:00
Dominik Csapak	1b189121fc	vm start/stop: cleanup passed-through pci devices in more situations if the preparing of PCI devices or the start of the VM fails, we need to cleanup the PCI devices (reservations and mdevs), or else it might happen that there are leftovers which must be manually removed. to include also mdevs now, refactor the cleanup code from 'vm_stop_cleanup' into its own function, and call that instead of only 'remove_pci_reservation' also simplifies the code, such that it now removes all PCI ids reserved for that VMID, since we cannot have multiple VMs with the same VMID anyway Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2022-11-09 08:49:45 +01:00
Dominik Csapak	bbf96e0f1e	automatically add 'uuid' parameter when passing through NVIDIA vGPU When passing through an NVIDIA vGPU via mediated devices, their software needs the qemu process to have the 'uuid' parameter set to the one of the vGPU. Since it's currently not possible to pass through multiple vGPUs to one VM (seems to be an NVIDIA driver limitation at the moment), we don't have to take care about that. Sadly, the place we do this, it does not show up in 'qm showcmd' as we don't (want to) query the pci devices in that case, and then we don't have a way of knowing if it's an NVIDIA card or not. But since this is informational with QEMU anyway, i'd say we can ignore that. Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2022-08-12 13:42:33 +02:00
Dominik Csapak	d8a7e9e881	PCI: allow longer pci domains some systems[0] have pci domains longer than the default ('0000') of 4 characters, so change the regex to allow at least 4. 0: https://forum.proxmox.com/threads/problem-with-gpu-passthrough-in-a-virtual-machine.105720/ Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2022-03-16 18:03:35 +01:00
Nicholas Sherlock	d806b017ac	pci: allow override of PCI vendor/device ids This allows mobile- and vGPUs to be presented to the guest as if they were the original desktop variants of the card. It also allows device-ID variants that guests don't know about to be renamed to match compatible sibling devices the guest does have drivers for (e.g. to remove manufacturer-specific vendor ID variants that prevent the use of a device which would otherwise have a supported chipset) e.g. hostpci0: 03:00,vendor-id=0x8086,device-id=0x10f6 Signed-off-by: Nicholas Sherlock <n.sherlock@gmail.com> Reviewed-by: Dominik Csapak <d.csapak@proxmox.com> Tested-by: Dominik Csapak <d.csapak@proxmox.com>	2022-01-25 10:59:23 +01:00
Thomas Lamprecht	d01de38cb6	pci: prepare: improve no-IOMMU error message give some context Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2021-10-15 19:58:16 +02:00
Thomas Lamprecht	a01593676c	pci reservation: rework helpers style and readability wise both style and readability are naturally subjective to a certain degree... Also, this patch mixes a bit much into one thing, but splitting that up would mean lots of work I just wanted to avoid, sorry about that. Among other things: - avoid a level of indentation in the reserve loop - rename pciids to reservation_list where it was a better fit - make reserve set either pid or time to avoid suggesting that we save both - rename parameters to requested/dropped IDs for easier understanding what's going on in the code - avoid old_pid/pid, use running_pid and reserver_pid instead to clarify what they actually mean - drop useless returns to avoid suggesting the return value has any use and save some lines - use a hash slice to delete all dropped IDs at once, shorter and faster - use 5 second timeout for reservation, this does nothing intensive nor does it wait for anything, so the critical section should be really short, 5s is really long enough for a wait.. Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2021-10-15 19:58:16 +02:00
Thomas Lamprecht	bda0ebff2d	pci reservation: move lock/reservation file into /run/qemu-server lck needs to die, the days of any 8.3 file naming schemes are long gone (in the server space that is ;) /var/run is /run so use the shorter, and while /var/lock is a OK place for the locks we try to keep lock and lock-object together nowadays. The qemu-server sub-directory avoids overly cluttering the already crowded top-level /run dir Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2021-10-15 18:17:34 +02:00
Thomas Lamprecht	cda95d5223	pci reservation: encode locklessness of parsers in name to avoid that they're misused Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2021-10-15 14:44:50 +02:00
Dominik Csapak	3bfee796f4	pci: add helpers to (un)reserve pciids for a vm saves a list of pciid <-> vmid mappings in /var/run that we can check when we start a vm if we're not given a pid but a timeout, we save the time when the reservation will run out (current time + timeout + 5s) since each vm start (until we can save the pid) varies from config to config reserve_pci_usage and remove_pci_reservation always expect a list of ids so that we can update the reservation for a vm all at once Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2021-10-11 09:07:52 +02:00
Thomas Lamprecht	71cb8e0f87	pci related code cleanups Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2021-10-11 08:39:28 +02:00
Thomas Lamprecht	e2b42bee6d	pci: use local generate_mdev_uuid helper avoid (API) leaking qemu-server specific stuff into pve-common Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2021-10-11 08:38:28 +02:00
Thomas Lamprecht	82712fcd3c	pci: prepare_pci_device: fixup parameter name Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2021-10-11 08:37:35 +02:00
Dominik Csapak	acd4b77745	pci: refactor pci device preparation makes the vm start a bit less crowded Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2021-10-08 06:27:19 +02:00
Dominik Csapak	a4d5b84c9c	pci: do not capture first group in PCIRE we do not need this group, but want to use the regex where we have multiple groups, so make it a non-capture group Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2021-10-05 16:14:42 +02:00
Thomas Lamprecht	41af2dfc25	PCI: use warnings/strict and fix setting $vga from config2command fixes commit `74c17b7a23` which moved this code here, but forgot to pass $vga ref, as the module was not using warning nor strict mode this was not caught.. Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2020-10-16 18:03:32 +02:00
Thomas Lamprecht	f7d1505b0c	tree wide cleanups Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2020-10-16 18:03:32 +02:00
Thomas Lamprecht	d1c1af4b02	tree wide cleanup of s/return undef/return/ Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2020-10-16 16:20:05 +02:00
Stefan Reiter	2141a802b8	fix #3010 : add 'bootorder' parameter for better control of boot devices (also fixes #3011) Deprecates the old-style 'boot' and 'bootdisk' options by adding a new 'order=' subproperty to 'boot'. This allows a user to specify more than one disk in the boot order, helping with newer versions of SeaBIOS/OVMF where disks without a bootindex won't be initialized at all (breaks soft-raid and some LVM setups). This also allows specifying a bootindex for USB and hostpci devices, which was not possible before. Floppy boot support is not supported in the new model, but I doubt that will be a problem (AFAICT we can't even attach floppy disks to a VM?). Default behaviour is intended to stay the same, i.e. while new VMs will receive the new 'order' property, it will be set so the VM starts the same as before (using get_default_bootorder). Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>	2020-10-14 12:30:50 +02:00
Dominik Csapak	7de7f675c2	fix mdev cmdline generation during refactoring, the vmid got lost, but is necessary to get the correct mdev id Fixes commit `74c17b7a23` Signed-off-by: Dominik Csapak <d.csapak@proxmox.com> [ reference fixed commit ] Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2020-07-13 10:29:25 +02:00
Thomas Lamprecht	1fac3a0b31	pci: whitespace, indentation and formatting fixes Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2020-06-25 13:33:26 +02:00
Stefan Reiter	13d689792e	fix #2794 : allow legacy IGD passthrough Legacy IGD passthrough requires address 00:1f.0 to not be assigned to anything on QEMU startup (currently it's assigned to bridge pci.2). Changing this in general would break live-migration, so introduce a new hostpci parameter "legacy-igd", which if set to 1 will move that bridge to be nested under bridge 1. This is safe because: * Bridge 1 is unconditionally created on i440fx, so nesting is ok * Defaults are not changed, i.e. PCI layout only changes when the new parameter is specified manually * hostpci forbids migration anyway Additionally, the PT device has to be assigned address 00:02.0 in the guest as well, which is usually used for VGA assignment. Luckily, IGD PT requires vga=none, so that is not an issue either. See https://git.qemu.org/?p=qemu.git;a=blob;f=docs/igd-assign.txt Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>	2020-06-25 13:25:35 +02:00
Stefan Reiter	74c17b7a23	cfg2cmd: hostpci: move code to PCI.pm To avoid further cluttering config_to_command with subsequent changes. Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>	2020-06-25 13:25:35 +02:00
Stefan Reiter	2cf61f33d9	fix #2264 : add virtio-rng device Allow a user to add a virtio-rng-pci (an emulated hardware random number generator) to a VM with the rng0 setting. The setting is version_guard()-ed. Limit the selection of entropy source to one of three: /dev/urandom (preferred): Non-blocking kernel entropy source /dev/random: Blocking kernel source /dev/hwrng: Hardware RNG on the host for passthrough QEMU itself defaults to /dev/urandom (or the equivalent getrandom() call) if no source file is given, but I don't fully trust that behaviour to stay constant, considering the documentation [0] already disagrees with the code [1], so let's always specify the file ourselves. /dev/urandom is preferred, since it prevents host entropy starvation. The quality of randomness is still good enough to emulate a hwrng, since a) it's still seeded from the kernel's true entropy pool periodically and b) it's mixed with true entropy in the guest as well. Additionally, all sources about entropy prediction attacks I could find mention that to predict /dev/urandom results, /dev/random has to be accessed or manipulated in one way or the other - this is not possible from a VM however, as the entropy we're talking about comes from the host's blocking pool. More about the entropy and security implications of the non-blocking interface in [2] and [3]. Note further that only one /dev/hwrng exists at any given time, if multiple RNGs are available, only the one selected in '/sys/devices/virtual/misc/hw_random/rng_current' will feed the file. Selecting this is left as an exercise to the user, if at all required. We limit the available entropy to 1 KiB/s by default, but allow the user to override this. Interesting to note is that the limiter does not work linearly, i.e. max_bytes=1024/period=1000 means that up to 1 KiB of data becomes available on a 1000 millisecond timer, not that 1 KiB is streamed to the guest over the course of one second - hence the configurable period. The default used here is the same as given in the QEMU documentation [0] and has been verified to affect entropy availability in a guest by measuring /dev/random throughput. 1 KiB/s is enough to avoid any early-boot entropy shortages, and already has a significant impact on /dev/random availability in the guest. [0] https://wiki.qemu.org/Features/VirtIORNG [1] https://git.qemu.org/?p=qemu.git;a=blob;f=crypto/random-platform.c;h=f92f96987d7d262047c7604b169a7fdf11236107;hb=HEAD [2] https://lwn.net/Articles/261804/ [3] https://lwn.net/Articles/808575/ Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>	2020-03-06 18:09:04 +01:00
Dominik Csapak	2513b862e6	fix #2566 : increase scsi limit to 31 to achieve this we have to add 3 new scsihw addresses since lsi controllers can only hold 7 scsi drives we go up to 31, since this is the limit for virtio-scsi-single devices we have reserved (we can increase this in the future) to make it more future proof, we add a new pci bridge under pci bridge 1, so we have to adapt the bridge adding code (we did not need this for q35 previously) impact on live migration: since on older versions of qemu-server we do not have those config settings, there is no problem from old -> new new->old is not supported anyway and this breaks so that the vm crashes and loses the configs for scsi15-30 (same behaviour as e.g. with audio0 and migration from new->old) tested with 31 scsi disk on i440fx + virtio-scsi i440fx + lsi q35 + virtio-scsi q35 + lsi with ovmf + seabios Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2020-01-31 20:26:26 +01:00
Thomas Lamprecht	e2b0d85dda	PCIe passthrough: fixup: avoid addr conflict and cleanup a bit Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2019-09-06 19:27:30 +02:00
Thomas Lamprecht	d7d698f60c	pci: add conflict tests best viewed with: git show -w Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>	2019-09-06 19:27:30 +02:00
Aaron Lauterer	c4e1638148	Add support for up to 16 PCI(e) devices For non pci express passthrough additional addresses are reserved. For pcie passthrough pcie root ports are needed (unless guest is like windows 7). The first 4 pcie root ports are defined by default in the pve-q35.cfg files. If more than 4 pcie devices are passed through the needed root ports are created on demand. This helps to keep live migration possible without adding a new pve-q35.cfg file. For the windows 7 like guests additional addresses are reserved as well. Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>	2019-09-06 19:27:30 +02:00
Aaron Lauterer	d438e06028	Add PCI address for audio device Signed-off-by: Aaron Lauterer <a.lauterer@proxmox.com>	2019-07-18 08:24:39 +02:00
Dominik Csapak	6dbcb07367	add ivshmem device to config with such a shared memory device, a vm can share data with other vms or with the host via memory one of the use cases is looking-glass[1] with pci-passthrough, which copies the guest fb to the host and you get a high-speed, low-latency display client for the vm on vm stop we delete the file again 1: https://looking-glass.hostfission.com/ Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2019-02-26 08:01:12 +01:00
Dominik Csapak	739ba34024	add win7 pcie quirk Win7 is very picky about pcie assignments and fails with 'error 12' the way we add hostpci devices. To combat that, we simply give the hostpci device a normal port instead. Start with address 0x10, so that we have space before those devices, and between them and the ones configured in pve-q35.cfg should we need it in the future. Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2018-12-17 14:00:23 +01:00
Dominik Csapak	b71351a7ed	QemuServer: remove PCI sysfs helpers and use them from PVE::SysFSTools, where they got moved to Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2018-11-19 14:06:11 +01:00
Wolfgang Bumiller	d559309fcf	arm: pci addressing, keyboard and ehci controller On arm we start off with a pcie bridge pcie.0. We need a keyboard in addition to the tablet device, and we need to connect both to an 'ehci' controller. To do all this, we also pass the $arch variable through a whole lot of function calls to ultimately also adapt the hotplug code to take care of the new keyboard device. Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>	2018-11-13 14:44:28 +01:00
Dominik Csapak	55655ebc32	fix #1952 : make vga memory configurable we change 'vga' to a property string and add a 'memory' property with this, the user can better control the memory given to the virtual gpu, this is especially useful for spice/qxl since high resolutions need more memory Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2018-11-09 13:45:07 +01:00
Dominik Csapak	de9768f002	refactor PCI into own file to reduce QemuServer.pm size also move the $device hash out of any function Signed-off-by: Dominik Csapak <d.csapak@proxmox.com>	2016-06-22 09:13:16 +02:00

43 Commits
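
Several entries above (3bfee796f4, a01593676c, d7fe48e9aa, 1b189121fc) describe the PCI reservation helpers: a reservation records either the PID of the running VM or an expiry time of "current time + timeout + 5s", and IDs can be released per VMID, fully or as a subset. The following is only a rough Python sketch of that idea, not the Perl code in qemu-server. The names `Reservation`, `reserve`, `release`, and `is_active`, the in-memory dict, and the rule that a conflict is raised only against a still-active reservation of a different VMID are all assumptions made for illustration.

```python
import os
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

GRACE_SECONDS = 5  # the "+ 5s" from the log entry; everything else is hypothetical


@dataclass
class Reservation:
    vmid: int
    pid: Optional[int] = None        # set once the VM process is known
    expires: Optional[float] = None  # set while the VM is still starting


def _pid_alive(pid: int) -> bool:
    """Best-effort liveness check via signal 0 (POSIX)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def is_active(res: Reservation, now: float,
              pid_alive: Callable[[int], bool] = _pid_alive) -> bool:
    """A reservation holds while its PID lives or its expiry time has not passed."""
    if res.pid is not None:
        return pid_alive(res.pid)
    return res.expires is not None and res.expires >= now


def reserve(table: Dict[str, Reservation], ids: Iterable[str], vmid: int,
            pid: Optional[int] = None, timeout: Optional[float] = None,
            now: Optional[float] = None,
            pid_alive: Callable[[int], bool] = _pid_alive) -> None:
    """Reserve all given PCI IDs for vmid, storing either a PID or an expiry time."""
    if (pid is None) == (timeout is None):
        raise ValueError("pass exactly one of pid or timeout")
    now = time.time() if now is None else now
    ids = list(ids)
    # Check every ID first so a conflict leaves the table untouched.
    for pci_id in ids:
        held = table.get(pci_id)
        if held is not None and held.vmid != vmid and is_active(held, now, pid_alive):
            raise RuntimeError(f"PCI device '{pci_id}' already in use by VMID '{held.vmid}'")
    expires = None if pid is not None else now + timeout + GRACE_SECONDS
    for pci_id in ids:
        table[pci_id] = Reservation(vmid=vmid, pid=pid, expires=expires)


def release(table: Dict[str, Reservation], vmid: int,
            ids: Optional[Iterable[str]] = None) -> None:
    """Drop all reservations of vmid, or only the subset named in ids."""
    wanted = None if ids is None else set(ids)
    for pci_id in [k for k, r in table.items() if r.vmid == vmid]:
        if wanted is None or pci_id in wanted:
            del table[pci_id]
```

In this sketch a start attempt would call `reserve(table, ids, vmid, timeout=10)` first, then `reserve(table, ids, vmid, pid=qemu_pid)` once the process is running, and `release(table, vmid)` on stop or on a failed start. The real helpers additionally persist the table in a lock-protected file under /run/qemu-server, which is omitted here.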