- Host driver for GICv5, the next generation interrupt controller for
arm64, including support for interrupt routing, MSIs, interrupt
translation and wired interrupts.
- Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
GICv5 hardware, leveraging the legacy VGIC interface.
- Userspace control of the 'nASSGIcap' GICv3 feature, allowing
userspace to disable support for SGIs w/o an active state on hardware
that previously advertised it unconditionally.
- Map supporting endpoints with cacheable memory attributes on systems
with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
maintenance on the address range.
- Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
hypervisor to inject external aborts into an L2 VM and take traps of
masked external aborts to the hypervisor.
- Convert more system register sanitization to the config-driven
implementation.
- Fixes to the visibility of EL2 registers, namely making VGICv3 system
registers accessible through the VGIC device instead of the ONE_REG
vCPU ioctls.
- Various cleanups and minor fixes.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCaIezbRccb2xpdmVyLnVw
dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFr/eAQDY5NIG5cR6ZcAWnPQLmGWpz2ou
pq4Jhn9E/mGR3n5L1AEAsJpfLLpOsmnLBdwfbjmW59gGsa8k3i5tjWEOJ6yzAwk=
=r+sp
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.17' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.17, round #1
- Host driver for GICv5, the next generation interrupt controller for
arm64, including support for interrupt routing, MSIs, interrupt
translation and wired interrupts.
- Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on
GICv5 hardware, leveraging the legacy VGIC interface.
- Userspace control of the 'nASSGIcap' GICv3 feature, allowing
userspace to disable support for SGIs w/o an active state on hardware
that previously advertised it unconditionally.
- Map supporting endpoints with cacheable memory attributes on systems
with FEAT_S2FWB and DIC where KVM no longer needs to perform cache
maintenance on the address range.
- Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest
hypervisor to inject external aborts into an L2 VM and take traps of
masked external aborts to the hypervisor.
- Convert more system register sanitization to the config-driven
implementation.
- Fixes to the visibility of EL2 registers, namely making VGICv3 system
registers accessible through the VGIC device instead of the ONE_REG
vCPU ioctls.
- Various cleanups and minor fixes.
A preliminary version of a hack to invoke unmap_all_vpes() from an ioctl
didn't work very well. We eventually determined this was because we were
invoking it on the wrong file descriptor, but not getting an error.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/bbbddd56135399baf699bc46ffb6e7f08d9f8c9f.camel@infradead.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When unmapping a vLPI, WARN if nullifying vCPU affinity fails, not just if
failure occurs when freeing an ITE. If undoing vCPU affinity fails, then
odds are very good that vLPI state tracking has has gotten out of whack,
i.e. that KVM and the GIC disagree on the state of an IRQ/vLPI. At best,
inconsistent state means there is a lurking bug/flaw somewhere. At worst,
the inconsistency could eventually be fatal to the host, e.g. if an ITS
command fails because KVM's view of things doesn't match reality/hardware.
Note, only the call from kvm_arch_irq_bypass_del_producer() by way of
kvm_vgic_v4_unset_forwarding() doesn't already WARN. Common KVM's
kvm_irq_routing_update() WARNs if kvm_arch_update_irqfd_routing() fails.
For that path, if its_unmap_vlpi() fails in kvm_vgic_v4_unset_forwarding(),
the only possible causes are that the GIC doesn't have a v4 ITS (from
its_irq_set_vcpu_affinity()):
/* Need a v4 ITS */
if (!is_v4(its_dev->its))
return -EINVAL;
guard(raw_spinlock)(&its_dev->event_map.vlpi_lock);
/* Unmap request? */
if (!info)
return its_vlpi_unmap(d);
or that KVM has gotten out of sync with the GIC/ITS (from its_vlpi_unmap()):
if (!its_dev->event_map.vm || !irqd_is_forwarded_to_vcpu(d))
return -EINVAL;
All of the above failure scenarios are warnable offences, as they should
never occur absent a kernel/KVM bug.
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/all/aFWY2LTVIxz5rfhh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
- Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding behind stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change.
- Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o private
IRQs allocated.
- Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum.
- Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping.
- Avoid dereferencing a NULL pointer in the VGIC debug code, which can
happen if the device doesn't have any mapping yet.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmg5fbgACgkQI9DQutE9
ekNMyhAAq99rswsyqUrDwTd4qwJBtRYj4FMkBD8rH1oPQcXkqPGVllOMOgAnJhtd
mU1/XpU8gk49nsx5kwdg1RORQG4lKCZATH1F2MhInDWMMSSWUTWBu0+qG62GFQWF
Db+aXcvyebHdfzunDmT1/NebhV9VGCnkUtbQrFBvBbqOt1xNCRhKNvv5RWNEugTW
R2DIg4gCk2x1/iZceP8dDRDpnvFk8txtzZsWE6g1DZRw1mmlrz01rRgD7EJtkyaQ
c74mqpc/A2GuJv2mwpz3GCXTLiNgAuNq1X9+yeEqYM+VVvuyGy906BTFXxgvYl6l
o4OvE1F6QOtcJIokPb5kBlpxz76J2bNxKuUrY7rxM73Of0ZgGMsSltQxdcoGhJZ7
cyfUWFVQ+H5Gx+BVs/+5u2XdtX7OC0U50eXP0XrzviunTkEUhvRaH1WfTh9Zike/
U150d0oU17MeVZ9HlOWyhXV/nId6CPX5xo7BUjIq2e1Oy4mYkIYYbowdrV5t6NiJ
FHwzXNa/T0prZDSPW4t1GL/8o8hcnd4KRGLTJDWBYk58J6S1hHG0QElAz5RvqUj+
nf8HSycZMBrpdXtIK6nu/GRk9Ng+6gT4KHomwjgIDj6QVn7DTFFId911j+HRmoCP
DiV4QD4yVRSMBDX5cdQj2XBk2raMTOxdN9A1y6T2H3b+iLavVyo=
=JjTn
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-6.16-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.16, take #1
- Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding behind stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change.
- Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o private
IRQs allocated.
- Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum.
- Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping.
- Avoid dereferencing a NULL pointer in the VGIC debug code, which can
happen if the device doesn't have any mapping yet.
Though undocumented, KVM generally protects the translation of a vLPI
with the its_lock. While this makes perfectly good sense, as the ITS
itself contains the guest translation, an upcoming change will require
twiddling the vLPI mapping in an atomic context.
Switch to using the vIRQ's irq_lock to protect the translation. Use of
the its_lock in vgic_v4_unset_forwarding() is preserved for now as it
still needs to walk the ITS.
Tested-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250523194722.4066715-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Use kvm_trylock_all_vcpus instead of a custom implementation when locking
all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in
which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs.
This fixes the following false lockdep warning:
[ 328.171264] BUG: MAX_LOCK_DEPTH too low!
[ 328.175227] turning off the locking correctness validator.
[ 328.180726] Please attach the output of /proc/lock_stat to the bug report
[ 328.187531] depth: 48 max: 48!
[ 328.190678] 48 locks held by qemu-kvm/11664:
[ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This commit introduces a debugfs interface to display the contents of the
VGIC Interrupt Translation Service (ITS) tables.
The ITS tables map Device/Event IDs to Interrupt IDs and target processors.
Exposing this information through debugfs allows for easier inspection and
debugging of the interrupt routing configuration.
The debugfs interface presents the ITS table data in a tabular format:
Device ID: 0x0, Event ID Range: [0 - 31]
EVENT_ID INTID HWINTID TARGET COL_ID HW
-----------------------------------------------
0 8192 0 0 0 0
1 8193 0 0 0 0
2 8194 0 2 2 0
Device ID: 0x18, Event ID Range: [0 - 3]
EVENT_ID INTID HWINTID TARGET COL_ID HW
-----------------------------------------------
0 8225 0 0 0 0
1 8226 0 1 1 0
2 8227 0 3 3 0
Device ID: 0x10, Event ID Range: [0 - 7]
EVENT_ID INTID HWINTID TARGET COL_ID HW
-----------------------------------------------
0 8229 0 3 3 1
1 8230 0 0 0 1
2 8231 0 1 1 1
3 8232 0 2 2 1
4 8233 0 3 3 1
The output is generated using the seq_file interface, allowing for efficient
handling of potentially large ITS tables.
This interface is read-only and does not allow modification of the ITS
tables. It is intended for debugging and informational purposes only.
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20250220224247.2017205-1-jingzhangos@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The return value of xa_store() needs to be checked. This fix adds an
error handling path that resolves the kref inconsistency on failure. As
suggested by Oliver Upton, this function does not return the error code
intentionally because the translation cache is best effort.
Fixes: 8201d1028c ("KVM: arm64: vgic-its: Maintain a translation cache per ITS")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241130144952.23729-1-keisuke.nishimura@inria.fr
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The ITS ABI infrastructure allows for some pretty lax code, where
the size of the data doesn't have to match the size of the entry,
potentially leading to a collection of interesting bugs.
Commit 7fe28d7e68 ("KVM: arm64: vgic-its: Add a data length check
in vgic_its_save_*") added some checks, but starts by implicitly
casting all writes to a 64bit value, hiding some of the issues.
Instead, introduce macros that will check the data type actually used
for dealing with the table entries. The macros are taking a symbolic
entry type that is used to fetch the size of the entry type for the
current ABI. This immediately catches a couple of low-impact gotchas
(zero values that are implicitly 32bit), easy enough to fix.
Given that we currently only have a single ABI, hardcode a couple of
BUILD_BUG_ON()s that will fire if we use anything but a 64bit quantity,
and some (currently unreachable) fallback code that may become useful
one day.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
vgic_get_irq() has an awkward signature, as it takes both a kvm
*and* a vcpu, where the vcpu is allowed to be NULL if the INTID
being looked up is a global interrupt (SPI or LPI).
This leads to potentially problematic situations where the INTID
passed is a private interrupt, but that there is no vcpu.
In order to make things less ambiguous, let have *two* helpers
instead:
- vgic_get_irq(struct kvm *kvm, u32 intid), which is only concerned
with *global* interrupts, as indicated by the lack of vcpu.
- vgic_get_vcpu_irq(struct kvm_vcpu *vcpu, u32 intid), which can
return *any* interrupt class, but must have of course a non-NULL
vcpu.
Most of the code nicely falls under one or the other situations,
except for a couple of cases (close to the UABI or in the debug code)
where we have to distinguish between the two cases.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
When DISCARD frees an ITE, it does not invalidate the
corresponding ITE. In the scenario of continuous saves and
restores, there may be a situation where an ITE is not saved
but is restored. This is unreasonable and may cause restore
to fail. This patch clears the corresponding ITE when DISCARD
frees an ITE.
Cc: stable@vger.kernel.org
Fixes: eff484e029 ("KVM: arm64: vgic-its: ITT save and restore")
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with entry write helper]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-6-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
vgic_its_save_device_tables will traverse its->device_list to
save DTE for each device. vgic_its_restore_device_tables will
traverse each entry of device table and check if it is valid.
Restore if valid.
But when MAPD unmaps a device, it does not invalidate the
corresponding DTE. In the scenario of continuous saves
and restores, there may be a situation where a device's DTE
is not saved but is restored. This is unreasonable and may
cause restore to fail. This patch clears the corresponding
DTE when MAPD unmaps a device.
Cc: stable@vger.kernel.org
Fixes: 57a9a11715 ("KVM: arm64: vgic-its: Device table save/restore")
Co-developed-by: Shusen Li <lishusen2@huawei.com>
Signed-off-by: Shusen Li <lishusen2@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with entry write helper]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-5-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
In all the vgic_its_save_*() functinos, they do not check whether
the data length is 8 bytes before calling vgic_write_guest_lock.
This patch adds the check. To prevent the kernel from being blown up
when the fault occurs, KVM_BUG_ON() is used. And the other BUG_ON()s
are replaced together.
Cc: stable@vger.kernel.org
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with the new entry read/write helpers]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-4-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Fix kdoc warnings by adding missing function parameter
descriptions or by conversion to a normal comment.
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240723101204.7356-3-sebott@redhat.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The last genuine use case for the lpi_list_lock was the global LPI
translation cache, which has been removed in favor of a per-ITS xarray.
Remove a layer from the locking puzzle by getting rid of it.
vgic_add_lpi() still has a critical section that needs to protect
against the insertion of other LPIs; change it to take the LPI xarray's
xa_lock to retain this property.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-13-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Everything is in place to switch to per-ITS translation caches. Start
using the per-ITS cache to avoid the lock serialization related to the
global translation cache. Explicitly check for out-of-range device and
event IDs as the cache index is packed based on the range the ITS
actually supports.
Take the RCU read lock to protect against the returned descriptor being
freed while trying to take a reference on it, as it is no longer
necessary to acquire the lpi_list_lock.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-11-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The fast path will soon need to find an ITS by doorbell address, as the
translation caches will become local to an ITS. Spin off a helper to do
just that.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-10-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Within the context of a single ITS, it is possible to use an xarray to
cache the device ID & event ID translation to a particular irq
descriptor. Take advantage of this to build a translation cache capable
of fitting all valid translations for a given ITS.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-9-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
As the current LPI translation cache is global, the corresponding
invalidation helpers are also globally-scoped. In anticipation of
constructing a translation cache per ITS, add a helper for scoped cache
invalidations.
We still need to support global invalidations when LPIs are toggled on
a redistributor, as a property of the translation cache is that all
stored LPIs are known to be delieverable.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-8-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The last user has been transitioned to walking the LPI xarray directly.
Cut the wart off, and get rid of the now unneeded lpi_count while doing
so.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-7-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The vgic debug iterator is the final user of vgic_copy_lpi_list(), but
is a bit more complicated to transition to something else. Use a mark
in the LPI xarray to record the indices 'known' to the debug iterator.
Protect against the LPIs from being freed by associating an additional
reference with the xarray mark.
Rework iter_next() to let the xarray walk 'drive' the iteration after
visiting all of the SGIs, PPIs, and SPIs.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-6-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The new LPI xarray makes it possible to walk the VM's LPIs without
holding a lock, meaning that vgic_copy_lpi_list() is no longer
necessary. Prepare for the deletion by walking the LPI xarray directly
in vgic_its_cmd_handle_movall().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-5-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The new LPI xarray makes it possible to walk the VM's LPIs without
holding a lock, meaning that vgic_copy_lpi_list() is no longer
necessary. Prepare for the deletion by walking the LPI xarray directly
in vgic_its_invall().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
The new LPI xarray makes it possible to walk the VM's LPIs without
holding a lock, meaning that vgic_copy_lpi_list() is no longer
necessary. Prepare for the deletion by walking the LPI xarray directly
in its_sync_lpi_pending_table().
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240422200158.2606761-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized to
address serialization some of the serialization on the LPI injection
path
- Support for _architectural_ VHE-only systems, advertised through the
absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZepBjgAKCRCivnWIJHzd
FnngAP93VxjCkJ+5qSmYpFNG6r0ECVIbLHFQ59nKn0+GgvbPEgEAwt8svdLdW06h
njFTpdzvl4Po+aD/V9xHgqVz3kVvZwE=
=1FbW
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.9
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized to
address serialization some of the serialization on the LPI injection
path
- Support for _architectural_ VHE-only systems, advertised through the
absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
* kvm-arm64/lpi-xarray:
: xarray-based representation of vgic LPIs
:
: KVM's linked-list of LPI state has proven to be a bottleneck in LPI
: injection paths, due to lock serialization when acquiring / releasing a
: reference on an IRQ.
:
: Start the tedious process of reworking KVM's LPI injection by replacing
: the LPI linked-list with an xarray, leveraging this to allow RCU readers
: to walk it outside of the spinlock.
KVM: arm64: vgic: Don't acquire the lpi_list_lock in vgic_put_irq()
KVM: arm64: vgic: Ensure the irq refcount is nonzero when taking a ref
KVM: arm64: vgic: Rely on RCU protection in vgic_get_lpi()
KVM: arm64: vgic: Free LPI vgic_irq structs in an RCU-safe manner
KVM: arm64: vgic: Use atomics to count LPIs
KVM: arm64: vgic: Get rid of the LPI linked-list
KVM: arm64: vgic-its: Walk the LPI xarray in vgic_copy_lpi_list()
KVM: arm64: vgic-v3: Iterate the xarray to find pending LPIs
KVM: arm64: vgic: Use xarray to find LPI in vgic_get_lpi()
KVM: arm64: vgic: Store LPIs in an xarray
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The LPI xarray's xa_lock is sufficient for synchronizing writers when
freeing a given LPI. Furthermore, readers can only take a new reference
on an IRQ if it was already nonzero.
Stop taking the lpi_list_lock unnecessarily and get rid of
__vgic_put_lpi_locked().
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240221054253.3848076-11-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
It will soon be possible for get() and put() calls to happen in
parallel, which means in most cases we must ensure the refcount is
nonzero when taking a new reference. Switch to using
vgic_try_get_irq_kref() where necessary, and document the few conditions
where an IRQ's refcount is guaranteed to be nonzero.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240221054253.3848076-10-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
All readers of LPI configuration have been transitioned to use the LPI
xarray. Get rid of the linked-list altogether.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240221054253.3848076-6-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Using a linked-list for LPIs is less than ideal as it of course requires
iterative searches to find a particular entry. An xarray is a better
data structure for this use case, as it provides faster searches and can
still handle a potentially sparse range of INTID allocations.
Start by storing LPIs in an xarray, punting usage of the xarray to a
subsequent change. The observant among you will notice that we added yet
another lock to the chain of locking order rules; document the ordering
of the xa_lock. Don't worry, we'll get rid of the lpi_list_lock one
day...
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240221054253.3848076-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
It is possible that an LPI mapped in a different ITS gets unmapped while
handling the MOVALL command. If that is the case, there is no state that
can be migrated to the destination. Silently ignore it and continue
migrating other LPIs.
Cc: stable@vger.kernel.org
Fixes: ff9c114394 ("KVM: arm/arm64: GICv4: Handle MOVALL applied to a vPE")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240221092732.4126848-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
vgic_get_irq() may not return a valid descriptor if there is no ITS that
holds a valid translation for the specified INTID. If that is the case,
it is safe to silently ignore it and continue processing the LPI pending
table.
Cc: stable@vger.kernel.org
Fixes: 33d3bc9556 ("KVM: arm64: vgic-its: Read initial LPI pending table")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240221092732.4126848-2-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Correct the function parameter name "@save tables" -> "@save_tables".
Use the "typedef" keyword in the kernel-doc comment for a typedef.
These changes prevent kernel-doc warnings:
vgic/vgic-its.c:174: warning: Function parameter or struct member 'save_tables' not described in 'vgic_its_abi'
arch/arm64/kvm/vgic/vgic-its.c:2152: warning: expecting prototype for entry_fn_t(). Prototype was for int() instead
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Zenghui Yu <yuzenghui@huawei.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.linux.dev
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20240117230714.31025-10-rdunlap@infradead.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
There is a potential UAF scenario in the case of an LPI translation
cache hit racing with an operation that invalidates the cache, such
as a DISCARD ITS command. The root of the problem is that
vgic_its_check_cache() does not elevate the refcount on the vgic_irq
before dropping the lock that serializes refcount changes.
Have vgic_its_check_cache() raise the refcount on the returned vgic_irq
and add the corresponding decrement after queueing the interrupt.
Cc: stable@vger.kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240104183233.3560639-1-oliver.upton@linux.dev
Since our emulated ITS advertises GITS_TYPER.PTA=0, the target
address associated to a collection is a PE number and not
an address. So far, so good. However, the PE number is what userspace
has provided given us (aka the vcpu_id), and not the internal vcpu
index.
Make sure we consistently retrieve the vcpu by ID rather than
by index, adding a helper that deals with most of the cases.
We also get rid of the pointless (and bogus) comparisons to
online_vcpus, which don't really make sense.
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230927090911.3355209-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
vgic_its_create() changes the vgic state without holding the
config_lock, which triggers a lockdep warning in vgic_v4_init():
[ 358.667941] WARNING: CPU: 3 PID: 178 at arch/arm64/kvm/vgic/vgic-v4.c:245 vgic_v4_init+0x15c/0x7a8
...
[ 358.707410] vgic_v4_init+0x15c/0x7a8
[ 358.708550] vgic_its_create+0x37c/0x4a4
[ 358.709640] kvm_vm_ioctl+0x1518/0x2d80
[ 358.710688] __arm64_sys_ioctl+0x7ac/0x1ba8
[ 358.711960] invoke_syscall.constprop.0+0x70/0x1e0
[ 358.713245] do_el0_svc+0xe4/0x2d4
[ 358.714289] el0_svc+0x44/0x8c
[ 358.715329] el0t_64_sync_handler+0xf4/0x120
[ 358.716615] el0t_64_sync+0x190/0x194
Wrap the whole of vgic_its_create() with config_lock since, in addition
to calling vgic_v4_init(), it also modifies the global kvm->arch.vgic
state.
Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230518100914.2837292-3-jean-philippe@linaro.org
commit f003277311 ("KVM: arm64: Use config_lock to protect vgic
state") was meant to rectify a longstanding lock ordering issue in KVM
where the kvm->lock is taken while holding vcpu->mutex. As it so
happens, the aforementioned commit introduced yet another locking issue
by acquiring the its_lock before acquiring the config lock.
This is obviously wrong, especially considering that the lock ordering
is well documented in vgic.c. Reshuffle the locks once more to take the
config_lock before the its_lock. While at it, sprinkle in the lockdep
hinting that has become popular as of late to keep lockdep apprised of
our ordering.
Cc: stable@vger.kernel.org
Fixes: f003277311 ("KVM: arm64: Use config_lock to protect vgic state")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230412062733.988229-1-oliver.upton@linux.dev
Almost all of the vgic state is VM-scoped but accessed from the context
of a vCPU. These accesses were serialized on the kvm->lock which cannot
be nested within a vcpu->mutex critical section.
Move over the vgic state to using the config_lock. Tweak the lock
ordering where necessary to ensure that the config_lock is acquired
after the vcpu->mutex. Acquire the config_lock in kvm_vgic_create() to
avoid a race between the converted flows and GIC creation. Where
necessary, continue to acquire kvm->lock to avoid a race with vCPU
creation (i.e. flows that use lock_all_vcpus()).
Finally, promote the locking expectations in comments to lockdep
assertions and update the locking documentation for the config_lock as
well as vcpu->mutex.
Cc: stable@vger.kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230327164747.2466958-5-oliver.upton@linux.dev
Currently, the unknown no-running-vcpu sites are reported when a
dirty page is tracked by mark_page_dirty_in_slot(). Until now, the
only known no-running-vcpu site is saving vgic/its tables through
KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} command on KVM device
"kvm-arm-vgic-its". Unfortunately, there are more unknown sites to
be handled and no-running-vcpu context will be allowed in these
sites: (1) KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} command
on KVM device "kvm-arm-vgic-its" to restore vgic/its tables. The
vgic3 LPI pending status could be restored. (2) Save vgic3 pending
table through KVM_DEV_ARM_{VGIC_GRP_CTRL, VGIC_SAVE_PENDING_TABLES}
command on KVM device "kvm-arm-vgic-v3".
In order to handle those unknown cases, we need a unified helper
vgic_write_guest_lock(). struct vgic_dist::save_its_tables_in_progress
is also renamed to struct vgic_dist::save_tables_in_progress.
No functional change intended.
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230126235451.469087-3-gshan@redhat.com
Enable ring-based dirty memory tracking on ARM64:
- Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL.
- Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP.
- Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page
offset.
- Add ARM64 specific kvm_arch_allow_write_without_running_vcpu() to
keep the site of saving vgic/its tables out of the no-running-vcpu
radar.
Signed-off-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110104914.31280-5-gshan@redhat.com
With some PCIe topologies, restoring a guest fails while
parsing the ITS device tables.
Reproducer hints:
1. Create ARM virt VM with pxb-pcie bus which adds
extra host bridges, with qemu command like:
```
-device pxb-pcie,bus_nr=8,id=pci.x,numa_node=0,bus=pcie.0 \
-device pcie-root-port,..,bus=pci.x \
...
-device pxb-pcie,bus_nr=37,id=pci.y,numa_node=1,bus=pcie.0 \
-device pcie-root-port,..,bus=pci.y \
...
```
2. Ensure the guest uses 2-level device table
3. Perform VM migration which calls save/restore device tables
In that setup, we get a big "offset" between 2 device_ids,
which makes unsigned "len" round up a big positive number,
causing the scan loop to continue with a bad GPA. For example:
1. L1 table has 2 entries;
2. and we are now scanning at L2 table entry index 2075 (pointed
to by L1 first entry)
3. if next device id is 9472, we will get a big offset: 7397;
4. with unsigned 'len', 'len -= offset * esz', len will underflow to a
positive number, mistakenly into next iteration with a bad GPA;
(It should break out of the current L2 table scanning, and jump
into the next L1 table entry)
5. that bad GPA fails the guest read.
Fix it by stopping the L2 table scan when the next device id is
outside of the current table, allowing the scan to continue from
the next L1 table entry.
Thanks to Eric Auger for the fix suggestion.
Fixes: 920a7a8fa9 ("KVM: arm64: vgic-its: Add infrastructure for tableookup")
Suggested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Ren <renzhengeek@gmail.com>
[maz: commit message tidy-up]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/d9c3a564af9e2c5bf63f48a7dcbf08cd593c5c0b.1665802985.git.renzhengeek@gmail.com
The 'coll' parameter to update_affinity_collection() is never NULL,
so comparing it with 'ite->collection' is enough to cover both
the NULL case and the "another collection" case.
Remove the duplicate check in update_affinity_collection().
Signed-off-by: Gavin Shan <gshan@redhat.com>
[maz: repainted commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220923065447.323445-1-gshan@redhat.com
* kvm-arm64/its-save-restore-fixes-5.19:
: .
: Tighten the ITS save/restore infrastructure to fail early rather
: than late. Patches courtesy of Rocardo Koller.
: .
KVM: arm64: vgic: Undo work in failed ITS restores
KVM: arm64: vgic: Do not ignore vgic_its_restore_cte failures
KVM: arm64: vgic: Add more checks when restoring ITS tables
KVM: arm64: vgic: Check that new ITEs could be saved in guest memory
Signed-off-by: Marc Zyngier <maz@kernel.org>
Failed ITS restores should clean up all state restored until the
failure. There is some cleanup already present when failing to restore
some tables, but it's not complete. Add the missing cleanup.
Note that this changes the behavior in case of a failed restore of the
device tables.
restore ioctl:
1. restore collection tables
2. restore device tables
With this commit, failures in 2. clean up everything created so far,
including state created by 1.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510001633.552496-5-ricarkol@google.com
Restoring a corrupted collection entry (like an out of range ID) is
being ignored and treated as success. More specifically, a
vgic_its_restore_cte failure is treated as success by
vgic_its_restore_collection_table. vgic_its_restore_cte uses positive
and negative numbers to return error, and +1 to return success. The
caller then uses "ret > 0" to check for success.
Fix this by having vgic_its_restore_cte only return negative numbers on
error. Do this by changing alloc_collection return codes to only return
negative numbers on error.
Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220510001633.552496-4-ricarkol@google.com