linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-08-27 06:50:37 +00:00

Author	SHA1	Message	Date
Jason Wang	6a20f9fca3	vhost: initialize vq->nheads properly Commit 7918bb2d19c9 ("vhost: basic in order support") introduces vq->nheads to store the number of batched used buffers per used elem but it forgets to initialize the vq->nheads to NULL in vhost_dev_init() this will cause kfree() that would try to free it without be allocated if SET_OWNER is not called. Reported-by: JAEHOON KIM <jhkim@linux.ibm.com> Reported-by: Breno Leitao <leitao@debian.org> Fixes: `45347e79b5` ("vhost: basic in order support") Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250729073916.80647-1-jasowang@redhat.com> Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Tested-by: Breno Leitao <leitao@debian.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Jaehoon Kim <jhkim@linux.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-05 05:57:40 -04:00
Linus Torvalds	821c9e515d	virtio, vhost: features, fixes vhost can now support legacy threading if enabled in Kconfig vsock memory allocation strategies for large buffers have been improved, reducing pressure on kmalloc vhost now supports the in-order feature guest bits missed the merge window fixes, cleanups all over the place Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCgAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmiMvQEPHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRpgr8IAKUrIjqqTYXLkbCWn6tK8T+LxZ6LkMkyHA1v AJ+y5fKDeLsT5QpusD1XRjXJVqXBwQEsTN0pNVuhWHlcCpUeOFEHuJaf/QMncbc3 deFlUfMa3ihniUxBuyhojlWURsf94uTC906lCFXlIsfSKH2CW6/SjKvqR0SH5PhN 5WaqRYiSFFwDlyG2Ul4e5temP/er2KuZfYyvcYCU8VdSEp6bjvqCHd9ztFIVuByp fFWsrHce6IqR8ixOOzavEjzfd8WAN3LGzXntj5KEaX3fZ6HxCZCMv+rNVqvJmLps cSrTgIUo60nCiZb8klUCS1YTEEvmdmJg3UmmddIpIhcsCYJSbOU= =2dxm -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - vhost can now support legacy threading if enabled in Kconfig - vsock memory allocation strategies for large buffers have been improved, reducing pressure on kmalloc - vhost now supports the in-order feature. guest bits missed the merge window. - fixes, cleanups all over the place * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (30 commits) vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers vsock/virtio: Rename virtio_vsock_skb_rx_put() vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers vsock/virtio: Move SKB allocation lower-bound check to callers vsock/virtio: Rename virtio_vsock_alloc_skb() vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put() vsock/virtio: Validate length in packet header before skb_put() vhost/vsock: Avoid allocating arbitrarily-sized SKBs vhost_net: basic in_order support vhost: basic in order support vhost: fail early when __vhost_add_used() fails vhost: Reintroduce kthread API and add mode selection vdpa: Fix IDR memory leak in VDUSE module exit vdpa/mlx5: Fix release of uninitialized resources on error path vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit virtio: virtio_dma_buf: fix missing parameter documentation vhost: Fix typos vhost: vringh: Remove unused functions vhost: vringh: Remove unused iotlb functions ...	2025-08-01 14:17:48 -07:00
Will Deacon	8ca76151d2	vsock/virtio: Rename virtio_vsock_skb_rx_put() In preparation for using virtio_vsock_skb_rx_put() when populating SKBs on the vsock TX path, rename virtio_vsock_skb_rx_put() to virtio_vsock_skb_put(). No functional change. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-9-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	ab9aa2f3af	vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers When receiving a packet from a guest, vhost_vsock_handle_tx_kick() calls vhost_vsock_alloc_linear_skb() to allocate and fill an SKB with the receive data. Unfortunately, these are always linear allocations and can therefore result in significant pressure on kmalloc() considering that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB allocation for each packet. Rework the vsock SKB allocation so that, for sizes with page order greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated instead with the packet header in the SKB and the receive data in the fragments. Finally, add a debug warning if virtio_vsock_skb_rx_put() is ever called on an SKB with a non-zero length, as this would be destructive for the nonlinear case. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-8-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	fac6b82e0f	vsock/virtio: Move SKB allocation lower-bound check to callers virtio_vsock_alloc_linear_skb() checks that the requested size is at least big enough for the packet header (VIRTIO_VSOCK_SKB_HEADROOM). Of the three callers of virtio_vsock_alloc_linear_skb(), only vhost_vsock_alloc_skb() can potentially pass a packet smaller than the header size and, as it already has a check against the maximum packet size, extend its bounds checking to consider the minimum packet size and remove the check from virtio_vsock_alloc_linear_skb(). Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-7-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	2304c64a28	vsock/virtio: Rename virtio_vsock_alloc_skb() In preparation for nonlinear allocations for large SKBs, rename virtio_vsock_alloc_skb() to virtio_vsock_alloc_linear_skb() to indicate that it returns linear SKBs unconditionally and switch all callers over to this new interface for now. No functional change. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-6-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	87dbae5e36	vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put() virtio_vsock_skb_rx_put() only calls skb_put() if the length in the packet header is not zero even though skb_put() handles this case gracefully. Remove the functionally redundant check from virtio_vsock_skb_rx_put() and, on the assumption that this is a worthwhile optimisation for handling credit messages, augment the existing length checks in virtio_transport_rx_work() to elide the call for zero-length payloads. Since the callers all have the length, extend virtio_vsock_skb_rx_put() to take it as an additional parameter rather than fish it back out of the packet header. Note that the vhost code already has similar logic in vhost_vsock_alloc_skb(). Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-4-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	10a886aaed	vhost/vsock: Avoid allocating arbitrarily-sized SKBs vhost_vsock_alloc_skb() returns NULL for packets advertising a length larger than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE in the packet header. However, this is only checked once the SKB has been allocated and, if the length in the packet header is zero, the SKB may not be freed immediately. Hoist the size check before the SKB allocation so that an iovec larger than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + the header size is rejected outright. The subsequent check on the length field in the header can then simply check that the allocated SKB is indeed large enough to hold the packet. Cc: <stable@vger.kernel.org> Fixes: `71dc9ec9ac` ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-2-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Jason Wang	45347e79b5	vhost_net: basic in_order support This patch introduces basic in-order support for vhost-net. By recording the number of batched buffers in an array when calling `vhost_add_used_and_signal_n()`, we can reduce the number of userspace accesses. Note that the vhost-net batching logic is kept as we still count the number of buffers there. Testing Results: With testpmd: - TX: txonly mode + vhost_net with XDP_DROP on TAP shows a 17.5% improvement, from 4.75 Mpps to 5.35 Mpps. - RX: No obvious improvements were observed. With virtio-ring in-order experimental code in the guest: - TX: pktgen in the guest + XDP_DROP on TAP shows a 19% improvement, from 5.2 Mpps to 6.2 Mpps. - RX: pktgen on TAP with vhost_net + XDP_DROP in the guest achieves a 6.1% improvement, from 3.47 Mpps to 3.61 Mpps. Acked-by: Jonah Palmer <jonah.palmer@oracle.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250714084755.11921-4-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:09 -04:00
Jason Wang	67a873df0c	vhost: basic in order support This patch adds basic in order support for vhost. Two optimizations are implemented in this patch: 1) Since driver uses descriptor in order, vhost can deduce the next avail ring head by counting the number of descriptors that has been used in next_avail_head. This eliminate the need to access the available ring in vhost. 2) vhost_add_used_and_singal_n() is extended to accept the number of batched buffers per used elem. While this increases the times of userspace memory access but it helps to reduce the chance of used ring access of both the driver and vhost. Vhost-net will be the first user for this. Acked-by: Jonah Palmer <jonah.palmer@oracle.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250714084755.11921-3-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:09 -04:00
Jason Wang	b4ba1207d4	vhost: fail early when __vhost_add_used() fails This patch fails vhost_add_used_n() early when __vhost_add_used() fails to make sure used idx is not updated with stale used ring information. Reported-by: Eugenio Pérez <eperezma@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250714084755.11921-2-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:09 -04:00
Cindy Lu	7d9896e9f6	vhost: Reintroduce kthread API and add mode selection Since commit `6e890c5d50` ("vhost: use vhost_tasks for worker threads"), the vhost uses vhost_task and operates as a child of the owner thread. This is required for correct CPU usage accounting, especially when using containers. However, this change has caused confusion for some legacy userspace applications, and we didn't notice until it's too late. Unfortunately, it's too late to revert - we now have userspace depending both on old and new behaviour :( To address the issue, reintroduce kthread mode for vhost workers and provide a configuration to select between kthread and task worker. - Add 'fork_owner' parameter to vhost_dev to let users select kthread or task mode. Default mode is task mode(VHOST_FORK_OWNER_TASK). - Reintroduce kthread mode support: * Bring back the original vhost_worker() implementation, and renamed to vhost_run_work_kthread_list(). * Add cgroup support for the kthread * Introduce struct vhost_worker_ops: - Encapsulates create / stop / wake‑up callbacks. - vhost_worker_create() selects the proper ops according to inherit_owner. - Userspace configuration interface: * New IOCTLs: - VHOST_SET_FORK_FROM_OWNER lets userspace select task mode (VHOST_FORK_OWNER_TASK) or kthread mode (VHOST_FORK_OWNER_KTHREAD) - VHOST_GET_FORK_FROM_OWNER reads the current worker mode * Expose module parameter 'fork_from_owner_default' to allow system administrators to configure the default mode for vhost workers * Kconfig option CONFIG_VHOST_ENABLE_FORK_OWNER_CONTROL controls whether these IOCTLs and the parameter are available - The VHOST_NEW_WORKER functionality requires fork_owner to be set to true, with validation added to ensure proper configuration This partially reverts or improves upon: commit `6e890c5d50` ("vhost: use vhost_tasks for worker threads") commit `1cdaafa1b8` ("vhost: replace single worker pointer with xarray") Fixes: `6e890c5d50` ("vhost: use vhost_tasks for worker threads"), Signed-off-by: Cindy Lu <lulu@redhat.com> Message-Id: <20250714071333.59794-2-lulu@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:09 -04:00
Alok Tiwari	400cad513c	vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit The condition comparing ret to VHOST_SCSI_PREALLOC_SGLS was incorrect, as ret holds the result of kstrtouint() (typically 0 on success), not the parsed value. Update the check to use cnt, which contains the actual user-provided value. prevents silently accepting values exceeding the maximum inline_sg_cnt. Fixes: `bca939d5bc` ("vhost-scsi: Dynamically allocate scatterlists") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20250628183405.3979538-1-alok.a.tiwari@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>	2025-08-01 09:11:08 -04:00
Alok Tiwari	8a0d18a934	vhost: Fix typos Fix multiple typos and improve comment clarity across vhost.c. Spelling errors: "thead" -> "thread", "RUNNUNG" -> "RUNNING" and "available". Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Message-Id: <20250615173933.1610324-1-alok.a.tiwari@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org>	2025-08-01 09:11:08 -04:00
Dr. David Alan Gilbert	6e9ef6937c	vhost: vringh: Remove unused functions The functions: vringh_abandon_kern() vringh_abandon_user() vringh_iov_pull_kern() and vringh_iov_push_kern() were all added in 2013 by commit `f87d0fbb57` ("vringh: host-side implementation of virtio rings.") but have remained unused. Remove them and the two helper functions they used. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Message-Id: <20250617001838.114457-3-linux@treblig.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org>	2025-08-01 09:11:08 -04:00
Dr. David Alan Gilbert	569c392e19	vhost: vringh: Remove unused iotlb functions The functions: vringh_abandon_iotlb() vringh_notify_disable_iotlb() and vringh_notify_enable_iotlb() were added in 2020 by commit `9ad9c49cfe` ("vringh: IOTLB support") but have remained unused. Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Simon Horman <horms@kernel.org> Message-Id: <20250617001838.114457-2-linux@treblig.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:08 -04:00
Mike Christie	69cd720a8a	vhost-scsi: Fix log flooding with target does not exist errors As part of the normal initiator side scanning the guest's scsi layer will loop over all possible targets and send an inquiry. Since the max number of targets for virtio-scsi is 256, this can result in 255 error messages about targets not existing if you only have a single target. When there's more than 1 vhost-scsi device each with a single target, then you get N * 255 log messages. It looks like the log message was added by accident in: commit `3f8ca2e115` ("vhost/scsi: Extract common handling code from control queue handler") when we added common helpers. Then in: commit `09d7583294` ("vhost/scsi: Use common handling code in request queue handler") we converted the scsi command processing path to use the new helpers so we started to see the extra log messages during scanning. The patches were just making some code common but added the vq_err call and I'm guessing the patch author forgot to enable the vq_err call (vq_err is implemented by pr_debug which defaults to off). So this patch removes the call since it's expected to hit this path during device discovery. Fixes: `09d7583294` ("vhost/scsi: Use common handling code in request queue handler") Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20250611210113.10912-1-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:08 -04:00
Alok Tiwari	652abad085	vhost-scsi: Fix typos and formatting in comments and logs This patch corrects several minor typos and formatting issues. Changes include: Fixing misspellings like in comments - "explict" -> "explicit" - "infight" -> "inflight", - "with generate" -> "will generate" formatting in logs - Correcting log formatting specifier from "%dd" to "%d" - Adding a missing space in the sysfs emit string to prevent misinterpreted output like "X86_64on ". changing to "X86_64 on " - Cleaning up stray semicolons in struct definition endings These changes improve code readability and consistency. no functionality changes. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Message-Id: <20250611143932.2443796-1-alok.a.tiwari@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com>	2025-08-01 09:11:08 -04:00
Linus Torvalds	63eb28bb14	ARM: - Host driver for GICv5, the next generation interrupt controller for arm64, including support for interrupt routing, MSIs, interrupt translation and wired interrupts. - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on GICv5 hardware, leveraging the legacy VGIC interface. - Userspace control of the 'nASSGIcap' GICv3 feature, allowing userspace to disable support for SGIs w/o an active state on hardware that previously advertised it unconditionally. - Map supporting endpoints with cacheable memory attributes on systems with FEAT_S2FWB and DIC where KVM no longer needs to perform cache maintenance on the address range. - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest hypervisor to inject external aborts into an L2 VM and take traps of masked external aborts to the hypervisor. - Convert more system register sanitization to the config-driven implementation. - Fixes to the visibility of EL2 registers, namely making VGICv3 system registers accessible through the VGIC device instead of the ONE_REG vCPU ioctls. - Various cleanups and minor fixes. LoongArch: - Add stat information for in-kernel irqchip - Add tracepoints for CPUCFG and CSR emulation exits - Enhance in-kernel irqchip emulation - Various cleanups. RISC-V: - Enable ring-based dirty memory tracking - Improve perf kvm stat to report interrupt events - Delegate illegal instruction trap to VS-mode - MMU improvements related to upcoming nested virtualization s390x - Fixes x86: - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time. - Share device posted IRQ code between SVM and VMX and harden it against bugs and runtime errors. - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n). - For MMIO stale data mitigation, track whether or not a vCPU has access to (host) MMIO based on whether the page tables have MMIO pfns mapped; using VFIO is prone to false negatives - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical. - Recalculate all MSR intercepts from scratch on MSR filter changes, instead of maintaining shadow bitmaps. - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently. - Fix a user-triggerable WARN that syzkaller found by setting the vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU into VMX Root Mode (post-VMXON). Trying to detect every possible path leading to architecturally forbidden states is hard and even risks breaking userspace (if it goes from valid to valid state but passes through invalid states), so just wait until KVM_RUN to detect that the vCPU state isn't allowed. - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can access APERF/MPERF. This has many caveats (APERF/MPERF cannot be zeroed on vCPU creation or saved/restored on suspend and resume, or preserved over thread migration let alone VM migration) but can be useful whenever you're interested in letting Linux guests see the effective physical CPU frequency in /proc/cpuinfo. - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been created, as there's no known use case for changing the default frequency for other VM types and it goes counter to the very reason why the ioctl was added to the vm file descriptor. And also, there would be no way to make it work for confidential VMs with a "secure" TSC, so kill two birds with one stone. - Dynamically allocation the shadow MMU's hashed page list, and defer allocating the hashed list until it's actually needed (the TDP MMU doesn't use the list). - Extract many of KVM's helpers for accessing architectural local APIC state to common x86 so that they can be shared by guest-side code for Secure AVIC. - Various cleanups and fixes. x86 (Intel): - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest. Failure to honor FREEZE_IN_SMM can leak host state into guests. - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF. x86 (AMD): - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty much a static condition and therefore should never happen, but still). - Fix a variety of flaws and bugs in the AVIC device posted IRQ code. - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation. - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry. - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs. - Request GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU. - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model. - Accept any SNP policy that is accepted by the firmware with respect to SMT and single-socket restrictions. An incompatible policy doesn't put the kernel at risk in any way, so there's no reason for KVM to care. - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and use WBNOINVD instead of WBINVD when possible for SEV cache maintenance. - When reclaiming memory from an SEV guest, only do cache flushes on CPUs that have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs that can't possibly have cache lines with dirty, encrypted data. Generic: - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI. - Track irqbypass's "token" as "struct eventfd_ctx " instead of a "void ", to avoid making a simple concept unnecessarily difficult to understand. - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs. - Clean up and document/comment the irqfd assignment code. - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique. - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related to private <=> shared memory conversions. - Drop guest_memfd's .getattr() implementation as the VFS layer will call generic_fillattr() if inode_operations.getattr is NULL. - Fix issues with dirty ring harvesting where KVM doesn't bound the processing of entries in any way, which allows userspace to keep KVM in a tight loop indefinitely. - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking, now that KVM no longer uses assigned_device_count as a heuristic for either irqbypass usage or MDS mitigation. Selftests: - Fix a comment typo. - Verify KVM is loaded when getting any KVM module param so that attempting to run a selftest without kvm.ko loaded results in a SKIP message about KVM not being loaded/enabled (versus some random parameter not existing). - Skip tests that hit EACCES when attempting to access a file, and rpint a "Root required?" help message. In most cases, the test just needs to be run with elevated permissions. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmiKXMgUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroMhMQf/QDhC/CP1aGXph2whuyeD2NMqPKiU 9KdnDNST+ftPwjg9QxZ9mTaa8zeVz/wly6XlxD9OQHy+opM1wcys3k0GZAFFEEQm YrThgURdzEZ3nwJZgb+m0t4wjJQtpiFIBwAf7qq6z1VrqQBEmHXJ/8QxGuqO+BNC j5q/X+q6KZwehKI6lgFBrrOKWFaxqhnRAYfW6rGBxRXxzTJuna37fvDpodQnNceN zOiq+avfriUMArTXTqOteJNKU0229HjiPSnjILLnFQ+B3akBlwNG0jk7TMaAKR6q IZWG1EIS9q1BAkGXaw6DE1y6d/YwtXCR5qgAIkiGwaPt5yj9Oj6kRN2Ytw== =j2At -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM: - Host driver for GICv5, the next generation interrupt controller for arm64, including support for interrupt routing, MSIs, interrupt translation and wired interrupts - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on GICv5 hardware, leveraging the legacy VGIC interface - Userspace control of the 'nASSGIcap' GICv3 feature, allowing userspace to disable support for SGIs w/o an active state on hardware that previously advertised it unconditionally - Map supporting endpoints with cacheable memory attributes on systems with FEAT_S2FWB and DIC where KVM no longer needs to perform cache maintenance on the address range - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest hypervisor to inject external aborts into an L2 VM and take traps of masked external aborts to the hypervisor - Convert more system register sanitization to the config-driven implementation - Fixes to the visibility of EL2 registers, namely making VGICv3 system registers accessible through the VGIC device instead of the ONE_REG vCPU ioctls - Various cleanups and minor fixes LoongArch: - Add stat information for in-kernel irqchip - Add tracepoints for CPUCFG and CSR emulation exits - Enhance in-kernel irqchip emulation - Various cleanups RISC-V: - Enable ring-based dirty memory tracking - Improve perf kvm stat to report interrupt events - Delegate illegal instruction trap to VS-mode - MMU improvements related to upcoming nested virtualization s390x - Fixes x86: - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time - Share device posted IRQ code between SVM and VMX and harden it against bugs and runtime errors - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n) - For MMIO stale data mitigation, track whether or not a vCPU has access to (host) MMIO based on whether the page tables have MMIO pfns mapped; using VFIO is prone to false negatives - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical - Recalculate all MSR intercepts from scratch on MSR filter changes, instead of maintaining shadow bitmaps - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently - Fix a user-triggerable WARN that syzkaller found by setting the vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU into VMX Root Mode (post-VMXON). Trying to detect every possible path leading to architecturally forbidden states is hard and even risks breaking userspace (if it goes from valid to valid state but passes through invalid states), so just wait until KVM_RUN to detect that the vCPU state isn't allowed - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can access APERF/MPERF. This has many caveats (APERF/MPERF cannot be zeroed on vCPU creation or saved/restored on suspend and resume, or preserved over thread migration let alone VM migration) but can be useful whenever you're interested in letting Linux guests see the effective physical CPU frequency in /proc/cpuinfo - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been created, as there's no known use case for changing the default frequency for other VM types and it goes counter to the very reason why the ioctl was added to the vm file descriptor. And also, there would be no way to make it work for confidential VMs with a "secure" TSC, so kill two birds with one stone - Dynamically allocation the shadow MMU's hashed page list, and defer allocating the hashed list until it's actually needed (the TDP MMU doesn't use the list) - Extract many of KVM's helpers for accessing architectural local APIC state to common x86 so that they can be shared by guest-side code for Secure AVIC - Various cleanups and fixes x86 (Intel): - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest. Failure to honor FREEZE_IN_SMM can leak host state into guests - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF x86 (AMD): - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty much a static condition and therefore should never happen, but still) - Fix a variety of flaws and bugs in the AVIC device posted IRQ code - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs - Request GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model - Accept any SNP policy that is accepted by the firmware with respect to SMT and single-socket restrictions. An incompatible policy doesn't put the kernel at risk in any way, so there's no reason for KVM to care - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and use WBNOINVD instead of WBINVD when possible for SEV cache maintenance - When reclaiming memory from an SEV guest, only do cache flushes on CPUs that have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs that can't possibly have cache lines with dirty, encrypted data Generic: - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI - Track irqbypass's "token" as "struct eventfd_ctx " instead of a "void ", to avoid making a simple concept unnecessarily difficult to understand - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs - Clean up and document/comment the irqfd assignment code - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related to private <=> shared memory conversions - Drop guest_memfd's .getattr() implementation as the VFS layer will call generic_fillattr() if inode_operations.getattr is NULL - Fix issues with dirty ring harvesting where KVM doesn't bound the processing of entries in any way, which allows userspace to keep KVM in a tight loop indefinitely - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking, now that KVM no longer uses assigned_device_count as a heuristic for either irqbypass usage or MDS mitigation Selftests: - Fix a comment typo - Verify KVM is loaded when getting any KVM module param so that attempting to run a selftest without kvm.ko loaded results in a SKIP message about KVM not being loaded/enabled (versus some random parameter not existing) - Skip tests that hit EACCES when attempting to access a file, and print a "Root required?" help message. In most cases, the test just needs to be run with elevated permissions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits) Documentation: KVM: Use unordered list for pre-init VGIC registers RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map() RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs RISC-V: perf/kvm: Add reporting of interrupt events RISC-V: KVM: Enable ring-based dirty memory tracking RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap RISC-V: KVM: Delegate illegal instruction fault to VS mode RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs RISC-V: KVM: Factor-out g-stage page table management RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence RISC-V: KVM: Introduce struct kvm_gstage_mapping RISC-V: KVM: Factor-out MMU related declarations into separate headers RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect() RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range() RISC-V: KVM: Don't flush TLB when PTE is unchanged RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize() RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init() RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list ...	2025-07-30 17:14:01 -07:00
Jakub Kicinski	b430f6c38d	Merge branch 'virtio_udp_tunnel_08_07_2025' of https://github.com/pabeni/linux-devel Paolo Abeni says: ==================== virtio: introduce GSO over UDP tunnel Some virtualized deployments use UDP tunnel pervasively and are impacted negatively by the lack of GSO support for such kind of traffic in the virtual NIC driver. The virtio_net specification recently introduced support for GSO over UDP tunnel, this series updates the virtio implementation to support such a feature. Currently the kernel virtio support limits the feature space to 64, while the virtio specification allows for a larger number of features. Specifically the GSO-over-UDP-tunnel-related virtio features use bits 65-69. The first four patches in this series rework the virtio and vhost feature support to cope with up to 128 bits. The limit is set by a define and could be easily raised in future, as needed. This implementation choice is aimed at keeping the code churn as limited as possible. For the same reason, only the virtio_net driver is reworked to leverage the extended feature space; all other virtio/vhost drivers are unaffected, but could be upgraded to support the extended features space in a later time. The last four patches bring in the actual GSO over UDP tunnel support. As per specification, some additional fields are introduced into the virtio net header to support the new offload. The presence of such fields depends on the negotiated features. New helpers are introduced to convert the UDP-tunneled skb metadata to an extended virtio net header and vice versa. Such helpers are used by the tun and virtio_net driver to cope with the newly supported offloads. Tested with basic stream transfer with all the possible permutations of host kernel/qemu/guest kernel with/without GSO over UDP tunnel support. ==================== Link: https://patch.msgid.link/cover.1751874094.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-10 13:32:35 -07:00
Paolo Abeni	bbca931fce	vhost/net: enable gso over UDP tunnel support. Vhost net need to know the exact virtio net hdr size to be able to copy such header correctly. Teach it about the newly defined UDP tunnel-related option and update the hdr size computation accordingly. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-08 18:07:28 +02:00
Paolo Abeni	333c515d18	vhost-net: allow configuring extended features Use the extended feature type for 'acked_features' and implement two new ioctls operation allowing the user-space to set/query an unbounded amount of features. The actual number of processed features is limited by VIRTIO_FEATURES_MAX and attempts to set features above such limit fail with EOPNOTSUPP. Note that: the legacy ioctls implicitly truncate the negotiated features to the lower 64 bits range and the 'acked_backend_features' field don't need conversion, as the only negotiated feature there is in the low 64 bit range. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-08 18:05:23 +02:00
Jason Wang	97b2409f28	vhost-net: reduce one userspace copy when building XDP buff We used to do twice copy_from_iter() to copy virtio-net and packet separately. This introduce overheads for userspace access hardening as well as SMAP (for x86 it's stac/clac). So this patch tries to use one copy_from_iter() to copy them once and move the virtio-net header afterwards to reduce overheads. Testpmd + vhost_net shows 10% improvement from 5.45Mpps to 6.0Mpps. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-2-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:29:46 -07:00
Jason Wang	4d313f2bd2	tun: remove unnecessary tun_xdp_hdr structure With f95f0f95cfb7("net, xdp: Introduce xdp_init_buff utility routine"), buffer length could be stored as frame size so there's no need to have a dedicated tun_xdp_hdr structure. We can simply store virtio net header instead. Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-1-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:29:46 -07:00
Sean Christopherson	23b54381ce	irqbypass: Require producers to pass in Linux IRQ number during registration Pass in the Linux IRQ associated with an IRQ bypass producer instead of relying on the caller to set the field prior to registration, as there's no benefit to relying on callers to do the right thing. Take care to set producer->irq before __connect(), as KVM expects the IRQ to be valid as soon as a connection is possible. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:52:41 -07:00
Sean Christopherson	2b521d86ee	irqbypass: Take ownership of producer/consumer token tracking Move ownership of IRQ bypass token tracking into irqbypass.ko, and explicitly require callers to pass an eventfd_ctx structure instead of a completely opaque token. Relying on producers and consumers to set the token appropriately is error prone, and hiding the fact that the token must be an eventfd_ctx pointer (for all intents and purposes) unnecessarily obfuscates the code and makes it more brittle. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:52:38 -07:00
Linus Torvalds	8ca154e491	virtio, vhost: features, fixes A new virtio RTC driver. vhost scsi now logs write descriptors so migration works. Some hardening work in virtio core. An old spec compliance issue fixed in vhost net. A couple of cleanups, fixes in vringh, virtio-pci, vdpa. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCgAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmg2ukkPHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRpXCYIAKvqsUujiF0W1Kv2uqrztrWVYPL5wEOgiwfD ElxeYvkzGsaBHW1zfAxUz4az5K9qigNdm9JE55Kc79Sd/5yuLASXZtcma5d5Y4G6 JS45EhsCFxbJjtIFv2hXlTBTuo0LkxDRNzCX/CtNL/k4+H0vCn5pPfv19hUTT1T4 BH+/JA64FAquSCnMK10CG3byT8hfchlxM5TV4iQriolaOtN4gCroLp2lSzT0D2Ux cT8e+2IwEjdx3iRkrAaNJSbpmz34xUu3TsaR7KxTZ+a0/+Zcs6aBVOhkqsyPLnPf XK1EDwtizT+Kz6XXei+kH8sWhARQ4dU4iNn4aMKmf3937QIOSYs= =VJy5 -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - A new virtio RTC driver - vhost scsi now logs write descriptors so migration works - Some hardening work in virtio core - An old spec compliance issue fixed in vhost net - A couple of cleanups, fixes in vringh, virtio-pci, vdpa * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio: reject shm region if length is zero virtio_rtc: Add RTC class driver virtio_rtc: Add Arm Generic Timer cross-timestamping virtio_rtc: Add PTP clocks virtio_rtc: Add module and driver core vringh: use bvec_kmap_local vhost: vringh: Use matching allocation type in resize_iovec() virtio-pci: Fix result size returned for the admin command completion vdpa/octeon_ep: Control PCI dev enabling manually vhost-scsi: log event queue write descriptors vhost-scsi: log control queue write descriptors vhost-scsi: log I/O queue write descriptors vhost-scsi: adjust vhost_scsi_get_desc() to log vring descriptors vhost: modify vhost_log_write() for broader users	2025-05-29 08:15:35 -07:00
Christoph Hellwig	169294a14b	vringh: use bvec_kmap_local Use the bvec_kmap_local helper rather than digging into the bvec internals. Signed-off-by: Christoph Hellwig <hch@lst.de> Message-Id: <20250501142244.2888227-1-hch@lst.de> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>	2025-05-27 10:27:53 -04:00
Kees Cook	8b3f9967b1	vhost: vringh: Use matching allocation type in resize_iovec() In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void ", which can be implicitly cast to any pointer type.) The assigned type is "struct kvec ", but the returned type will be "struct iovec ". These have the same allocation size, so there is no bug: struct kvec { void iov_base; /* and that should never hold a userland pointer / size_t iov_len; }; struct iovec { void __user iov_base; /* BSD uses caddr_t (1003.1g requires void ) / __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ }; Adjust the allocation type to match the assignment. Signed-off-by: Kees Cook <kees@kernel.org> Message-Id: <20250426062214.work.334-kees@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-05-27 10:27:53 -04:00
Dongli Zhang	ac9dcca236	vhost-scsi: log event queue write descriptors Log write descriptors for the event queue, leveraging vhost_get_vq_desc() to retrieve the array of write descriptors to obtain the log buffer. There is only one path for event queue. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20250403063028.16045-9-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-05-18 17:25:24 -04:00
Dongli Zhang	a94c96a352	vhost-scsi: log control queue write descriptors Log write descriptors for the control queue, leveraging vhost_scsi_get_desc() and vhost_get_vq_desc() to retrieve the array of write descriptors to obtain the log buffer. For Task Management Requests, similar to the I/O queue, store the log buffer during the submission path and log it in the completion or error handling path. For Asynchronous Notifications, only the submission path is involved. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Message-Id: <20250403063028.16045-8-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com>	2025-05-18 17:25:24 -04:00
Dongli Zhang	c2c5c259aa	vhost-scsi: log I/O queue write descriptors Log write descriptors for the I/O queue, leveraging vhost_scsi_get_desc() and vhost_get_vq_desc() to retrieve the array of write descriptors to obtain the log buffer. In addition, introduce a vhost-scsi specific function to log vring descriptors. In this function, the 'partial' argument is set to false, and the 'len' argument is set to 0, because vhost-scsi always logs all pages shared by a vring descriptor. Add WARN_ON_ONCE() since vhost-scsi doesn't support VIRTIO_F_ACCESS_PLATFORM. The per-cmd log buffer is allocated on demand in the submission path after VHOST_F_LOG_ALL is set. Return -ENOMEM on allocation failure, in order to send SAM_STAT_TASK_SET_FULL to the guest. It isn't reclaimed in the completion path. Instead, it is reclaimed when VHOST_F_LOG_ALL is removed, or during VHOST_SCSI_SET_ENDPOINT when all commands are destroyed. Store the log buffer during the submission path and log it in the completion path. Logging is also required in the error handling path of the submission process. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Message-Id: <20250403063028.16045-7-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com>	2025-05-18 17:25:24 -04:00
Dongli Zhang	e5e6b15b0d	vhost-scsi: adjust vhost_scsi_get_desc() to log vring descriptors Adjust vhost_scsi_get_desc() to facilitate logging of vring descriptors. Add new arguments to allow passing the log buffer and length to vhost_get_vq_desc(). In addition, reset 'log_num' since vhost_get_vq_desc() may reset it only after certain condition checks. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20250403063028.16045-6-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-05-18 17:25:24 -04:00
Dongli Zhang	41cff026cf	vhost: modify vhost_log_write() for broader users Currently, the only user of vhost_log_write() is vhost-net. The 'len' argument prevents logging of pages that are not tainted by the RX path. Adjustments are needed since more drivers (i.e. vhost-scsi) begin using vhost_log_write(). So far vhost-net RX path may only partially use pages shared via the last vring descriptor. Unlike vhost-net, vhost-scsi always logs all pages shared via vring descriptors. To accommodate this, use (len == U64_MAX) to indicate whether the driver would log all pages of vring descriptors, or only pages that are tainted by the driver. In addition, removes BUG(). Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Message-Id: <20250403063028.16045-5-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-05-18 17:25:24 -04:00
Jon Kohler	8c2e6b26ff	vhost/net: Defer TX queue re-enable until after sendmsg In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and batches up to 64 messages before calling sock->sendmsg. Currently, when there are no more messages on the ring to dequeue, handle_tx_copy re-enables kicks on the ring before firing off the batch sendmsg. However, sock->sendmsg incurs a non-zero delay, especially if it needs to wake up a thread (e.g., another vhost worker). If the guest submits additional messages immediately after the last ring check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to kick the vhost worker. This may happen while the worker is still processing the sendmsg, leading to wasteful exit(s). This is particularly problematic for single-threaded guest submission threads, as they must exit, wait for the exit to be processed (potentially involving a TTWU), and then resume. In scenarios like a constant stream of UDP messages, this results in a sawtooth pattern where the submitter frequently vmexits, and the vhost-net worker alternates between sleeping and waking. A common solution is to configure vhost-net busy polling via userspace (e.g., qemu poll-us). However, treating the sendmsg as the "busy" period by keeping kicks disabled during the final sendmsg and performing one additional ring check afterward provides a significant performance improvement without any excess busy poll cycles. If messages are found in the ring after the final sendmsg, requeue the TX handler. This ensures fairness for the RX handler and allows vhost_run_work_list to cond_resched() as needed. Test Case TX VM: taskset -c 2 iperf3 -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5 RX VM: taskset -c 2 iperf3 -s -p 5200 -D 6.12.0, each worker backed by tun interface with IFF_NAPI setup. Note: TCP side is largely unchanged as that was copy bound 6.12.0 unpatched EPT_MISCONFIG/second: 5411 Datagrams/second: ~382k Interval Transfer Bitrate Lost/Total Datagrams 0.00-30.00 sec 15.5 GBytes 4.43 Gbits/sec 0/11481630 (0%) sender 6.12.0 patched EPT_MISCONFIG/second: 58 (~93x reduction) Datagrams/second: ~650k (~1.7x increase) Interval Transfer Bitrate Lost/Total Datagrams 0.00-30.00 sec 26.4 GBytes 7.55 Gbits/sec 0/19554720 (0%) sender Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Jon Kohler <jon@nutanix.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20250501020428.1889162-1-jon@nutanix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-05 18:18:41 -07:00
Dongli Zhang	58465d8607	vhost-scsi: Fix vhost_scsi_send_status() Although the support of VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 was signaled by the commit `664ed90e62` ("vhost/scsi: Set VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 feature bits"), vhost_scsi_send_bad_target() still assumes the response in a single descriptor. Similar issue in vhost_scsi_send_bad_target() has been fixed in previous commit. In addition, similar issue for vhost_scsi_complete_cmd_work() has been fixed by the commit `6dd88fd59d` ("vhost-scsi: unbreak any layout for response"). Fixes: `3ca51662f8` ("vhost-scsi: Add better resource allocation failure handling") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20250403063028.16045-4-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-04-18 10:08:11 -04:00
Dongli Zhang	b182687135	vhost-scsi: Fix vhost_scsi_send_bad_target() Although the support of VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 was signaled by the commit `664ed90e62` ("vhost/scsi: Set VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 feature bits"), vhost_scsi_send_bad_target() still assumes the response in a single descriptor. In addition, although vhost_scsi_send_bad_target() is used by both I/O queue and control queue, the response header is always virtio_scsi_cmd_resp. It is required to use virtio_scsi_ctrl_tmf_resp or virtio_scsi_ctrl_an_resp for control queue. Fixes: `664ed90e62` ("vhost/scsi: Set VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 feature bits") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20250403063028.16045-3-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-04-18 10:08:11 -04:00
Dongli Zhang	f591cf9fce	vhost-scsi: protect vq->log_used with vq->mutex The vhost-scsi completion path may access vq->log_base when vq->log_used is already set to false. vhost-thread QEMU-thread vhost_scsi_complete_cmd_work() -> vhost_add_used() -> vhost_add_used_n() if (unlikely(vq->log_used)) QEMU disables vq->log_used via VHOST_SET_VRING_ADDR. mutex_lock(&vq->mutex); vq->log_used = false now! mutex_unlock(&vq->mutex); QEMU gfree(vq->log_base) log_used() -> log_write(vq->log_base) Assuming the VMM is QEMU. The vq->log_base is from QEMU userpace and can be reclaimed via gfree(). As a result, this causes invalid memory writes to QEMU userspace. The control queue path has the same issue. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20250403063028.16045-2-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-04-18 10:08:11 -04:00
Linus Torvalds	4b98d5dcd1	virtio: features, fixes, cleanups A small number of improvements all over the place: shutdown has been reworked to reset devices. virtio fs is now allowed in vduse. vhost-scsi memory use has been reduced. cleanups, fixes all over the place. A couple more fixes are being tested and will be merged after rc1. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmfqw70PHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRpnHwIAL+3kc7Na6iFu995K7iVaqPf5l+xqflvLXDU bdm+oxz7SPlM39qgX3LwaRW8aILH4+bQ5zP8f0O16Ou+1Hraf+BZT0cah7zjPMdn WRE/zsHaNrXNKuPERI2lG5v26NhD9UIqpufutBjKu1OVQ55kt/20Rt/HlyWjCzU8 /gROCs5nlUr47+PRLQraRMA6hKpLjVaIEBLLlWVnB1nr9TQZmtVwn742RFu5urr+ wosaCSq0JCFdxtoUhHHGUhO1rq+3LO3kPpuKE5AeBJUqVo9IMwZVfcK7s+5zTECb FylXu3vNUNWG5Cr2lnzONhkjg2i0BXUTHDz/iCmcMQks9cdojUE= =DbWQ -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: "A small number of improvements all over the place: - shutdown has been reworked to reset devices - virtio fs is now allowed in vduse - vhost-scsi memory use has been reduced - cleanups, fixes all over the place A couple more fixes are being tested and will be merged after rc1" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost-scsi: Reduce response iov mem use vhost-scsi: Allocate iov_iter used for unaligned copies when needed vhost-scsi: Stop duplicating se_cmd fields vhost-scsi: Dynamically allocate scatterlists vhost-scsi: Return queue full for page alloc failures during copy vhost-scsi: Add better resource allocation failure handling vhost-scsi: Allocate T10 PI structs only when enabled vhost-scsi: Reduce mem use by moving upages to per queue vduse: add virtio_fs to allowed dev id sound/virtio: Fix cancel_sync warnings on uninitialized work_structs vdpa/mlx5: Fix oversized null mkey longer than 32bit vdpa/mlx5: Fix mlx5_vdpa_get_config() endianness on big-endian machines vhost-scsi: Fix handling of multiple calls to vhost_scsi_set_endpoint tools: virtio/linux/module.h add MODULE_DESCRIPTION() define. tools: virtio/linux/compiler.h: Add data_race() define. tools/virtio: Add DMA_MAPPING_ERROR and sg_dma_len api define for virtio test virtio: break and reset virtio devices on device_shutdown()	2025-04-01 18:52:54 -07:00
Keith Busch	cb380909ae	vhost: return task creation error instead of NULL Lets callers distinguish why the vhost task creation failed. No one currently cares why it failed, so no real runtime change from this patch, but that will not be the case for long. Signed-off-by: Keith Busch <kbusch@kernel.org> Message-ID: <20250227230631.303431-2-kbusch@meta.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-03-01 02:52:52 -05:00
Mike Christie	9d8960672d	vhost-scsi: Reduce response iov mem use We have to save N iov entries to copy the virtio_scsi_cmd_resp struct back to the guest's buffer. The difficulty is that we can't assume the virtio_scsi_cmd_resp will be in 1 iov because older virtio specs allowed you to break it up. The worst case is that the guest was doing something like breaking up the virtio_scsi_cmd_resp struct into 108 (the struct is 108 bytes) byte sized vecs like: iov[0].iov_base = ((unsigned char )virtio_scsi_cmd_resp)[0] iov[0].iov_len = 1 iov[1].iov_base = ((unsigned char )virtio_scsi_cmd_resp)[1] iov[1].iov_len = 1 .... iov[107].iov_base = ((unsigned char )virtio_scsi_cmd_resp)[107] iov[1].iov_len = 1 Right now we allocate UIO_MAXIOV vecs which is 1024 and so for a small device with just 1 queue and 128 commands per queue, we are wasting 1.8 MB = (1024 current entries - 108) 16 bytes per entry * 128 cmds The most common case is going to be where the initiator puts the entire virtio_scsi_cmd_resp in the first iov and does not split it. We've always done it this way for Linux and the windows driver looks like it's always done the same. It's highly unlikely anyone has ever split the response and if they did it might just be where they have the sense in a second iov but that doesn't seem likely as well. So to optimize for the common implementation, this has us only pre-allocate the single iovec. If we do hit the split up response case this has us allocate the needed iovec when needed. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20241203191705.19431-9-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-02-25 07:10:46 -05:00
Mike Christie	fd47976581	vhost-scsi: Allocate iov_iter used for unaligned copies when needed It's extremely rare that we get unaligned requests that need to drop down to the data copy code path. However, the iov_iter is almost 5% of the mem used for the vhost_scsi_cmd. This patch has us allocate the iov_iter only when needed since it's not a perf path that uses the struct. This along with the patches that removed the duplicated fields on the vhost_scsd_cmd allow us to reduce mem use by 1 MB in mid size setups where we have 16 virtqueues and are doing 1024 cmds per queue. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20241203191705.19431-8-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-02-25 07:10:46 -05:00
Mike Christie	ddc5b5f68e	vhost-scsi: Stop duplicating se_cmd fields When setting up the command we will initially set values like lun and data direction on the vhost scsi command. We then pass them to LIO which stores them again on the LIO se_cmd. The se_cmd is actually stored in the vhost scsi command so we are storing these values twice on the same struct. So this patch has stop duplicating the storing of SCSI values like lun, data dir, data len, cdb, etc on the vhost scsi command and just pass them to LIO which will store them on the se_cmd. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20241203191705.19431-7-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-02-25 07:10:46 -05:00
Mike Christie	bca939d5bc	vhost-scsi: Dynamically allocate scatterlists We currently preallocate scatterlists which have 2048 entries for each command. For a small device with just 1 queue this results in: 8 MB = 32 bytes per sg * 2048 entries * 128 cmd When mq is turned on and we increase the virtqueue_size so we can handle commands from multiple queues in parallel, then this can sky rocket. This patch allows us to dynamically allocate the scatterlist like is done with drivers like NVMe and SCSI. For small IO (4-16K) IOPs testing, we didn't see any regressions, but for throughput testing we sometimes saw a 2-5% regression when the backend device was very fast (8 NVMe drives in a MD RAID0 config or a memory backed device). As a result this patch makes the dynamic allocation feature a modparam so userspace can decide how it wants to balance mem use and perf. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20241203191705.19431-6-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-02-25 07:10:46 -05:00
Mike Christie	891b99eab0	vhost-scsi: Return queue full for page alloc failures during copy This has us return queue full if we can't allocate a page during the copy operation so the initiator can retry. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20241203191705.19431-5-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-02-25 07:10:46 -05:00
Mike Christie	3ca51662f8	vhost-scsi: Add better resource allocation failure handling If we can't allocate mem to map in data for a request or can't find a tag for a command, we currently drop the command. This leads to the error handler running to clean it up. Instead of dropping the command this has us return an error telling the initiator that it queued more commands than we can handle. The initiator will then reduce how many commands it will send us and retry later. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20241203191705.19431-4-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-02-25 07:10:46 -05:00
Mike Christie	bf2d650391	vhost-scsi: Allocate T10 PI structs only when enabled T10 PI is not a widely used feature. This has us only allocate the structs for it if the feature has been enabled. For a common small setup where you have 1 virtqueue and 128 commands per queue, this saves: 8MB = 32 bytes per sg * 2048 entries * 128 commands per vhost-scsi device. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20241203191705.19431-3-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-02-25 07:10:46 -05:00
Mike Christie	4c1f3a7d74	vhost-scsi: Reduce mem use by moving upages to per queue Each worker thread can process 1 command at a time so there's no need to allocate a upages array per cmd. This patch moves it to per queue. Even a small device with 128 cmds and 1 queue this brings mem use for the array from 2 MB = 8 bytes per page pointer * 2048 pointers * 128 cmds to 16K = 8 bytes per pointer * 2048 * 1 queue Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20241203191705.19431-2-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-02-25 07:10:46 -05:00
Mike Christie	5dd639a164	vhost-scsi: Fix handling of multiple calls to vhost_scsi_set_endpoint If vhost_scsi_set_endpoint is called multiple times without a vhost_scsi_clear_endpoint between them, we can hit multiple bugs found by Haoran Zhang: 1. Use-after-free when no tpgs are found: This fixes a use after free that occurs when vhost_scsi_set_endpoint is called more than once and calls after the first call do not find any tpgs to add to the vs_tpg. When vhost_scsi_set_endpoint first finds tpgs to add to the vs_tpg array match=true, so we will do: vhost_vq_set_backend(vq, vs_tpg); ... kfree(vs->vs_tpg); vs->vs_tpg = vs_tpg; If vhost_scsi_set_endpoint is called again and no tpgs are found match=false so we skip the vhost_vq_set_backend call leaving the pointer to the vs_tpg we then free via: kfree(vs->vs_tpg); vs->vs_tpg = vs_tpg; If a scsi request is then sent we do: vhost_scsi_handle_vq -> vhost_scsi_get_req -> vhost_vq_get_backend which sees the vs_tpg we just did a kfree on. 2. Tpg dir removal hang: This patch fixes an issue where we cannot remove a LIO/target layer tpg (and structs above it like the target) dir due to the refcount dropping to -1. The problem is that if vhost_scsi_set_endpoint detects a tpg is already in the vs->vs_tpg array or if the tpg has been removed so target_depend_item fails, the undepend goto handler will do target_undepend_item on all tpgs in the vs_tpg array dropping their refcount to 0. At this time vs_tpg contains both the tpgs we have added in the current vhost_scsi_set_endpoint call as well as tpgs we added in previous calls which are also in vs->vs_tpg. Later, when vhost_scsi_clear_endpoint runs it will do target_undepend_item on all the tpgs in the vs->vs_tpg which will drop their refcount to -1. Userspace will then not be able to remove the tpg and will hang when it tries to do rmdir on the tpg dir. 3. Tpg leak: This fixes a bug where we can leak tpgs and cause them to be un-removable because the target name is overwritten when vhost_scsi_set_endpoint is called multiple times but with different target names. The bug occurs if a user has called VHOST_SCSI_SET_ENDPOINT and setup a vhost-scsi device to target/tpg mapping, then calls VHOST_SCSI_SET_ENDPOINT again with a new target name that has tpgs we haven't seen before (target1 has tpg1 but target2 has tpg2). When this happens we don't teardown the old target tpg mapping and just overwrite the target name and the vs->vs_tpg array. Later when we do vhost_scsi_clear_endpoint, we are passed in either target1 or target2's name and we will only match that target's tpgs when we loop over the vs->vs_tpg. We will then return from the function without doing target_undepend_item on the tpgs. Because of all these bugs, it looks like being able to call vhost_scsi_set_endpoint multiple times was never supported. The major user, QEMU, already has checks to prevent this use case. So to fix the issues, this patch prevents vhost_scsi_set_endpoint from being called if it's already successfully added tpgs. To add, remove or change the tpg config or target name, you must do a vhost_scsi_clear_endpoint first. Fixes: `25b98b64e2` ("vhost scsi: alloc cmds per vq instead of session") Fixes: `4f7f46d32c` ("tcm_vhost: Use vq->private_data to indicate if the endpoint is setup") Reported-by: Haoran Zhang <wh1sper@zju.edu.cn> Closes: https://lore.kernel.org/virtualization/e418a5ee-45ca-4d18-9b5d-6f8b6b1add8e@oracle.com/T/#me6c0041ce376677419b9b2563494172a01487ecb Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20250129210922.121533-1-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com>	2025-02-25 07:10:45 -05:00
Akihiko Odaki	a3b9c053d8	vhost/net: Set num_buffers for virtio 1.0 The specification says the device MUST set num_buffers to 1 if VIRTIO_NET_F_MRG_RXBUF has not been negotiated. Fixes: `41e3e42108` ("vhost/net: enable virtio 1.0") Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-Id: <20240915-v1-v1-1-f10d2cb5e759@daynix.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-01-27 09:39:25 -05:00

1 2 3 4 5 ...

1053 Commits