Commit Graph

3144 Commits

Oliver Upton
07212d16ad KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race
syzkaller has found another ugly race in the VGIC, this time dealing
with VGIC creation. Since kvm_vgic_create() doesn't sufficiently protect
against in-flight vCPU creations, it is possible to get a vCPU into the
kernel w/ an in-kernel VGIC but no allocation of private IRQs:

  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000d20
  Mem abort info:
    ESR = 0x0000000096000046
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    FSC = 0x06: level 2 translation fault
  Data abort info:
    ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000
    CM = 0, WnR = 1, TnD = 0, TagAccess = 0
    GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
  user pgtable: 4k pages, 48-bit VAs, pgdp=0000000103e4f000
  [0000000000000d20] pgd=0800000102e1c403, p4d=0800000102e1c403, pud=0800000101146403, pmd=0000000000000000
  Internal error: Oops: 0000000096000046 [#1] PREEMPT SMP
  CPU: 9 UID: 0 PID: 246 Comm: test Not tainted 6.14.0-rc6-00097-g0c90821f5db8 #16
  Hardware name: linux,dummy-virt (DT)
  pstate: 814020c5 (Nzcv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
  pc : _raw_spin_lock_irqsave+0x34/0x8c
  lr : kvm_vgic_set_owner+0x54/0xa4
  sp : ffff80008086ba20
  x29: ffff80008086ba20 x28: ffff0000c19b5640 x27: 0000000000000000
  x26: 0000000000000000 x25: ffff0000c4879bd0 x24: 000000000000001e
  x23: 0000000000000000 x22: 0000000000000000 x21: ffff0000c487af80
  x20: ffff0000c487af18 x19: 0000000000000000 x18: 0000001afadd5a8b
  x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000001
  x14: ffff0000c19b56c0 x13: 0030c9adf9d9889e x12: ffffc263710e1908
  x11: 0000001afb0d74f2 x10: e0966b840b373664 x9 : ec806bf7d6a57cd5
  x8 : ffff80008086b980 x7 : 0000000000000001 x6 : 0000000000000001
  x5 : 0000000080800054 x4 : 4ec4ec4ec4ec4ec5 x3 : 0000000000000000
  x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000d20
  Call trace:
   _raw_spin_lock_irqsave+0x34/0x8c (P)
   kvm_vgic_set_owner+0x54/0xa4
   kvm_timer_enable+0xf4/0x274
   kvm_arch_vcpu_run_pid_change+0xe0/0x380
   kvm_vcpu_ioctl+0x93c/0x9e0
   __arm64_sys_ioctl+0xb4/0xec
   invoke_syscall+0x48/0x110
   el0_svc_common.constprop.0+0x40/0xe0
   do_el0_svc+0x1c/0x28
   el0_svc+0x30/0xd0
   el0t_64_sync_handler+0x10c/0x138
   el0t_64_sync+0x198/0x19c
   Code: b9000841 d503201f 52800001 52800022 (88e17c02)
   ---[ end trace 0000000000000000 ]---

Plug the race by explicitly checking for an in-progress vCPU creation
and failing kvm_vgic_create() when that's the case. Add some comments to
document all the things kvm_vgic_create() is trying to guard against
too.
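
Roughly, the added check has this shape (a sketch, not the verbatim
patch): kvm->created_vcpus is bumped under kvm->lock before the vCPU
shows up in online_vcpus, so a mismatch between the two means a
creation is still in flight.

  /* Sketch: fail VGIC creation while a vCPU creation is in flight. */
  lockdep_assert_held(&kvm->lock);

  if (kvm->created_vcpus != atomic_read(&kvm->online_vcpus))
          return -EBUSY;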

Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20250523194722.4066715-6-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-30 09:11:29 +01:00
Oliver Upton
4bf3693d36 KVM: arm64: Unmap vLPIs affected by changes to GSI routing information
KVM's interrupt infrastructure is dodgy at best, allowing for some ugly
'off label' usage of the various UAPIs. In one example, userspace can
change the routing entry of a particular "GSI" after configuring
irqbypass with KVM_IRQFD. KVM/arm64 is oblivious to this, and winds up
preserving the stale translation in cases where vLPIs are configured.

Honor userspace's intentions and tear down the vLPI mapping if affected
by a "GSI" routing change. Make no attempt to reconstruct vLPIs if the
new target is an MSI and just fall back to software injection.

Tested-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250523194722.4066715-5-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-30 09:11:29 +01:00
Oliver Upton
05b9405f2f KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding()
The virtual mapping and "GSI" routing of a particular vLPI is subject to
change in response to the guest / userspace. This can be pretty annoying
to deal with when KVM needs to track the physical state that's managed
for vLPI direct injection.

Make vgic_v4_unset_forwarding() resilient by using the host IRQ to
resolve the vgic IRQ. Since this uses the LPI xarray directly, finding
the ITS by doorbell address + grabbing its its_lock is no longer
necessary. Note that matching the right ITS / ITE is already handled in
vgic_v4_set_forwarding(), and unless there's a bug in KVM's VGIC ITS
emulation the virtual mapping should remain stable for the lifetime
of the vLPI mapping.
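
A sketch of what such a host-IRQ-based resolution looks like (helper
name illustrative, not the exact upstream code):

  /* Walk the per-VM LPI xarray for the vLPI backed by this host IRQ. */
  static struct vgic_irq *vlpi_find_by_host_irq(struct kvm *kvm, int host_irq)
  {
          struct vgic_irq *irq;
          unsigned long intid;

          xa_for_each(&kvm->arch.vgic.lpi_xa, intid, irq) {
                  if (irq->hw && irq->host_irq == host_irq)
                          return irq;
          }

          return NULL;
  }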

Tested-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250523194722.4066715-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-30 09:11:29 +01:00
Oliver Upton
fc4dafe87b KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock
Though undocumented, KVM generally protects the translation of a vLPI
with the its_lock. While this makes perfectly good sense, as the ITS
itself contains the guest translation, an upcoming change will require
twiddling the vLPI mapping in an atomic context.

Switch to using the vIRQ's irq_lock to protect the translation. Use of
the its_lock in vgic_v4_unset_forwarding() is preserved for now as it
still needs to walk the ITS.

Tested-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250523194722.4066715-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-30 09:11:29 +01:00
Oliver Upton
761aabe76e KVM: arm64: Use lock guard in vgic_v4_set_forwarding()
The locking dance is about to get more interesting, switch the its_lock
over to a lock guard to make it a bit easier to handle.
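
With the kernel's scope-based cleanup helpers (linux/cleanup.h), the
pattern looks roughly like this (a sketch, assuming its_lock stays a
mutex):

  static int its_update_vlpi(struct vgic_its *its)
  {
          guard(mutex)(&its->its_lock);   /* dropped on every return path */

          /* ... resolve the ITE and update the vLPI mapping ... */

          return 0;
  }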

Tested-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250523194722.4066715-2-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-30 09:11:28 +01:00
Marc Zyngier
6673047405 KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation
When handling a TLBI VA* instruction that potentially targets a
VNCR page mapping, we fail to mask out the top bits that contain
the ASID and TTL fields, hence potentially failing the VA check
in the TLB code.

An additional wrinkle is that we fail to sign extend the VA,
again leading to failed VA checks.

Fix both in one go by sign-extending the VA from bit 48, making
it comparable to the way we interpret VNCR_EL2.BADDR.
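
Concretely, the recovered address can be computed like this (a sketch;
the payload layout is the architectural TLBI VA* encoding, with
VA[55:12] in bits [43:0] and TTL/ASID in the bits above):

  /* Keep only VA[55:12], dropping the TTL and ASID fields. */
  u64 va = FIELD_GET(GENMASK_ULL(43, 0), val) << 12;

  /* Sign-extend from bit 48, like VNCR_EL2.BADDR is interpreted. */
  va = sign_extend64(va, 48);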

Fixes: 4ffa72ad8f ("KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2")
Link: https://lore.kernel.org/r/20250525175759.780891-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-30 09:09:16 +01:00
Maxim Levitsky
b586c5d219 KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs
Use kvm_trylock_all_vcpus instead of a custom implementation when locking
all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in
which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs.

This fixes the following false lockdep warning:

[  328.171264] BUG: MAX_LOCK_DEPTH too low!
[  328.175227] turning off the locking correctness validator.
[  328.180726] Please attach the output of /proc/lock_stat to the bug report
[  328.187531] depth: 48  max: 48!
[  328.190678] 48 locks held by qemu-kvm/11664:
[  328.194957]  #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[  328.204048]  #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.212521]  #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.220991]  #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.229463]  #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.237934]  #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[  328.246405]  #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
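
The resulting call pattern, as a sketch (assuming the helper returns 0
on success and non-zero when any vcpu->mutex is contended):

  lockdep_assert_held(&kvm->lock);

  if (kvm_trylock_all_vcpus(kvm))
          return -EBUSY;

  /* ... operate on the VM with every vCPU quiesced ... */

  kvm_unlock_all_vcpus(kvm);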

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-27 12:16:41 -04:00
Paolo Bonzini
4d526b02df KVM/arm64 updates for 6.16

Merge tag 'kvmarm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 6.16

* New features:

  - Add large stage-2 mapping support for non-protected pKVM guests,
    clawing back some performance.

  - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE and
    protected modes.

  - Enable nested virtualisation support on systems that support it
    (yes, it has been a long time coming), though it is disabled by
    default.

* Improvements, fixes and cleanups:

  - Large rework of the way KVM tracks architecture features and links
    them with the effects of control bits. This ensures correctness of
    emulation (the data is automatically extracted from the published
    JSON files), and helps dealing with the evolution of the
    architecture.

  - Significant changes to the way pKVM tracks ownership of pages,
    avoiding page table walks by storing the state in the hypervisor's
    vmemmap. This in turn enables the THP support described above.

  - New selftest checking the pKVM ownership transition rules

  - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
    even if the host didn't have it.

  - Fixes for the address translation emulation, which happened to be
    rather buggy in some specific contexts.

  - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
    from the number of counters exposed to a guest and addressing a
    number of issues in the process.

  - Add a new selftest for the SVE host state being corrupted by a
    guest.

  - Keep HCR_EL2.xMO set at all times for systems running with the
    kernel at EL2, ensuring that the window for interrupts is slightly
    bigger, and avoiding a pretty bad erratum on the AmpereOne HW.

  - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
    from a pretty bad case of TLB corruption unless accesses to HCR_EL2
    are heavily synchronised.

  - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
    tables in a human-friendly fashion.

  - and the usual random cleanups.
2025-05-26 16:19:46 -04:00
Roman Kisel
13423063c7 arm64: kvm, smccc: Introduce and use API for getting hypervisor UUID
KVM/arm64 uses SMCCC to detect hypervisor presence. That code is
private, and it follows the SMCCC specification. Other existing and
emerging hypervisor guest implementations can and should use that
standard approach as well.

Factor out a common infrastructure that the guests can use, and update
KVM to employ the new API. The central notion of the SMCCC method is the
UUID of the hypervisor, and the new API follows that.

No functional changes. Validated with a KVM/arm64 guest.
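
The underlying SMCCC sequence is the Vendor Specific Hypervisor Service
UID call; the sketch below shows its shape (helper name illustrative;
the UUID comes back as four 32-bit words in a0..a3):

  static bool hyp_uuid_matches(u32 w0, u32 w1, u32 w2, u32 w3)
  {
          struct arm_smccc_res res;

          arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, &res);
          if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
                  return false;

          return res.a0 == w0 && res.a1 == w1 &&
                 res.a2 == w2 && res.a3 == w3;
  }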

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250428210742.435282-2-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250428210742.435282-2-romank@linux.microsoft.com>
2025-05-23 16:30:55 +00:00
Marc Zyngier
1b85d923ba Merge branch kvm-arm64/misc-6.16 into kvmarm-master/next
* kvm-arm64/misc-6.16:
  : .
  : Misc changes and improvements for 6.16:
  :
  : - Add a new selftest for the SVE host state being corrupted by a guest
  :
  : - Keep HCR_EL2.xMO set at all times for systems running with the kernel at EL2,
  :   ensuring that the window for interrupts is slightly bigger, and avoiding
  :   a pretty bad erratum on the AmpereOne HW
  :
  : - Replace a couple of open-coded on/off strings with str_on_off()
  :
  : - Get rid of the pKVM memblock sorting, which now appears to be superfluous
  :
  : - Drop superfluous clearing of ICH_LR_EOI in the LR when nesting
  :
  : - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers from
  :   a pretty bad case of TLB corruption unless accesses to HCR_EL2 are
  :   heavily synchronised
  :
  : - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS tables
  :   in a human-friendly fashion
  : .
  KVM: arm64: Fix documentation for vgic_its_iter_next()
  KVM: arm64: vgic-its: Add debugfs interface to expose ITS tables
  arm64: errata: Work around AmpereOne's erratum AC04_CPU_23
  KVM: arm64: nv: Remove clearing of ICH_LR<n>.EOI if ICH_LR<n>.HW == 1
  KVM: arm64: Drop sort_memblock_regions()
  KVM: arm64: selftests: Add test for SVE host corruption
  KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode
  KVM: arm64: Replace ternary flags with str_on_off() helper

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 10:59:43 +01:00
Marc Zyngier
7f3225fe8b Merge branch kvm-arm64/nv-nv into kvmarm-master/next
* kvm-arm64/nv-nv:
  : .
  : Flick the switch on the NV support by adding the missing piece
  : in the form of the VNCR page management. From the cover letter:
  :
  : "This is probably the most interesting bit of the whole NV adventure.
  : So far, everything else has been a walk in the park, but this one is
  : where the real fun takes place.
  :
  : With FEAT_NV2, most of the NV support revolves around tricking a guest
  : into accessing memory while it tries to access system registers. The
  : hypervisor's job is to handle the context switch of the actual
  : registers with the state in memory as needed."
  : .
  KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
  KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
  KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
  KVM: arm64: Document NV caps and vcpu flags
  KVM: arm64: Allow userspace to request KVM_ARM_VCPU_EL2*
  KVM: arm64: nv: Remove dead code from ERET handling
  KVM: arm64: nv: Plumb TLBI S1E2 into system instruction dispatch
  KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2
  KVM: arm64: nv: Program host's VNCR_EL2 to the fixmap address
  KVM: arm64: nv: Handle VNCR_EL2 invalidation from MMU notifiers
  KVM: arm64: nv: Handle mapping of VNCR_EL2 at EL2
  KVM: arm64: nv: Handle VNCR_EL2-triggered faults
  KVM: arm64: nv: Add userspace and guest handling of VNCR_EL2
  KVM: arm64: nv: Add pseudo-TLB backing VNCR_EL2
  KVM: arm64: nv: Don't adjust PSTATE.M when L2 is nesting
  KVM: arm64: nv: Move TLBI range decoding to a helper
  KVM: arm64: nv: Snapshot S1 ASID tagging information during walk
  KVM: arm64: nv: Extract translation helper from the AT code
  KVM: arm64: nv: Allocate VNCR page when required
  arm64: sysreg: Add layout for VNCR_EL2

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 10:58:57 +01:00
Marc Zyngier
35e4d7fe69 Merge branch kvm-arm64/at-fixes-6.16 into kvmarm-master/next
* kvm-arm64/at-fixes-6.16:
  : .
  : Set of fixes for Address Translation (AT) instruction emulation,
  : which affect the (not yet upstream) NV support.
  :
  : From the cover letter:
  :
  : "Here's a small series of fixes for KVM's implementation of address
  : translation (aka the AT S1* instructions), addressing a number of
  : issues in increasing levels of severity:
  :
  : - We misreport PAR_EL1.PTW in a number of occasions, including state
  :   that is not possible as per the architecture definition
  :
  : - We don't handle access faults at all, and that doesn't play very
  :   well with the rest of the VNCR stuff
  :
  : - AT S1E{0,1} from EL2 with HCR_EL2.{E2H,TGE}={1,1} will absolutely
  :   take the host down, no questions asked"
  : .
  KVM: arm64: Don't feed uninitialised data to HCR_EL2
  KVM: arm64: Teach address translation about access faults
  KVM: arm64: Fix PAR_EL1.{PTW,S} reporting on AT S1E*

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 10:58:34 +01:00
Marc Zyngier
fef3acf5ae Merge branch kvm-arm64/fgt-masks into kvmarm-master/next
* kvm-arm64/fgt-masks: (43 commits)
  : .
  : Large rework of the way KVM deals with trap bits in conjunction with
  : the CPU feature registers. It now draws a direct link between
  : the feature set, the system registers that need to UNDEF to match
  : the configuration, and the bits that need to behave as RES0 or RES1 in
  : the trap registers that are visible to the guest.
  :
  : Best of all, these definitions are mostly automatically generated
  : from the JSON description published by ARM under a permissive
  : license.
  : .
  KVM: arm64: Handle TSB CSYNC traps
  KVM: arm64: Add FGT descriptors for FEAT_FGT2
  KVM: arm64: Allow sysreg ranges for FGT descriptors
  KVM: arm64: Add context-switch for FEAT_FGT2 registers
  KVM: arm64: Add trap routing for FEAT_FGT2 registers
  KVM: arm64: Add sanitisation for FEAT_FGT2 registers
  KVM: arm64: Add FEAT_FGT2 registers to the VNCR page
  KVM: arm64: Use HCR_EL2 feature map to drive fixed-value bits
  KVM: arm64: Use HCRX_EL2 feature map to drive fixed-value bits
  KVM: arm64: Allow kvm_has_feat() to take variable arguments
  KVM: arm64: Use FGT feature maps to drive RES0 bits
  KVM: arm64: Validate FGT register descriptions against RES0 masks
  KVM: arm64: Switch to table-driven FGU configuration
  KVM: arm64: Handle PSB CSYNC traps
  KVM: arm64: Use KVM-specific HCRX_EL2 RES0 mask
  KVM: arm64: Remove hand-crafted masks for FGT registers
  KVM: arm64: Use computed FGT masks to setup FGT registers
  KVM: arm64: Propagate FGT masks to the nVHE hypervisor
  KVM: arm64: Unconditionally configure fine-grain traps
  KVM: arm64: Use computed masks as sanitisers for FGT registers
  ...

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 10:58:15 +01:00
Marc Zyngier
6eb0ed9629 Merge branch kvm-arm64/mte-frac into kvmarm-master/next
* kvm-arm64/mte-frac:
  : .
  : Prevent FEAT_MTE_ASYNC from being accidentally exposed to a guest,
  : courtesy of Ben Horgan. From the cover letter:
  :
  : "The ID_AA64PFR1_EL1.MTE_frac field is currently hidden from KVM.
  : However, when ID_AA64PFR1_EL1.MTE==2, ID_AA64PFR1_EL1.MTE_frac==0
  : indicates that MTE_ASYNC is supported. On a host with
  : ID_AA64PFR1_EL1.MTE==2 but without MTE_ASYNC support a guest with the
  : MTE capability enabled will incorrectly see MTE_ASYNC advertised as
  : supported. This series fixes that."
  : .
  KVM: selftests: Confirm exposing MTE_frac does not break migration
  KVM: arm64: Make MTE_frac masking conditional on MTE capability
  arm64/sysreg: Expose MTE_frac so that it is visible to KVM

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 10:57:44 +01:00
Marc Zyngier
cb86616c39 Merge branch kvm-arm64/ubsan-el2 into kvmarm-master/next
* kvm-arm64/ubsan-el2:
  : .
  : Add UBSAN support to the EL2 portion of KVM, reusing most of the
  : existing logic provided by CONFIG_UBSAN_TRAP.
  :
  : Patches courtesy of Mostafa Saleh.
  : .
  KVM: arm64: Handle UBSAN faults
  KVM: arm64: Introduce CONFIG_UBSAN_KVM_EL2
  ubsan: Remove regs from report_ubsan_failure()
  arm64: Introduce esr_is_ubsan_brk()

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 10:57:32 +01:00
Marc Zyngier
a90e001754 Merge branch kvm-arm64/pkvm-np-thp-6.16 into kvmarm-master/next
* kvm-arm64/pkvm-np-thp-6.16: (21 commits)
  : .
  : Large mapping support for non-protected pKVM guests, courtesy of
  : Vincent Donnefort. From the cover letter:
  :
  : "This series adds support for stage-2 huge mappings (PMD_SIZE) to pKVM
  : np-guests, that is installing PMD-level mappings in the stage-2,
  : whenever the stage-1 is backed by either Hugetlbfs or THPs."
  : .
  KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
  KVM: arm64: Stage-2 huge mappings for np-guests
  KVM: arm64: Add a range to pkvm_mappings
  KVM: arm64: Convert pkvm_mappings to interval tree
  KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
  KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
  KVM: arm64: Add a range to __pkvm_host_unshare_guest()
  KVM: arm64: Add a range to __pkvm_host_share_guest()
  KVM: arm64: Introduce for_each_hyp_page
  KVM: arm64: Handle huge mappings for np-guest CMOs
  KVM: arm64: Extend pKVM selftest for np-guests
  KVM: arm64: Selftest for pKVM transitions
  KVM: arm64: Don't WARN from __pkvm_host_share_guest()
  KVM: arm64: Add .hyp.data section
  KVM: arm64: Unconditionally cross check hyp state
  KVM: arm64: Defer EL2 stage-1 mapping on share
  KVM: arm64: Move hyp state to hyp_vmemmap
  KVM: arm64: Introduce {get,set}_host_state() helpers
  KVM: arm64: Use 0b11 for encoding PKVM_NOPAGE
  KVM: arm64: Fix pKVM page-tracking comments
  ...

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-23 10:56:25 +01:00
Marc Zyngier
bf809a0aab KVM: arm64: Fix documentation for vgic_its_iter_next()
As reported by the build robot, the documentation for vgic_its_iter_next()
contains a typo. Fix it.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505221421.KAuWlmSr-lkp@intel.com/
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-22 09:15:02 +01:00
Vincent Donnefort
c353fde17d KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
With the introduction of stage-2 huge mappings in the pKVM hypervisor,
guest page CMOs are needed at PMD_SIZE granularity. The fixmap only
supports PAGE_SIZE, and iterating over the huge page is time-consuming
(mostly due to the TLBI on hyp_fixmap_unmap), which is a problem for EL2
latency.

Introduce a shared PMD_SIZE fixmap (hyp_fixblock_map/hyp_fixblock_unmap)
to improve guest page CMOs when stage-2 huge mappings are installed.

On a Pixel6, the iterative solution resulted in a latency of ~700us,
while the PMD_SIZE fixmap reduces it to ~100us.

Because of the horrendous private range allocation that would be
necessary, this is disabled for 64KiB-page systems.
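
The intended use looks roughly like this (signatures illustrative):

  /* Map the whole 2MiB block once, CMO it in one pass, pay one TLBI. */
  void *va = hyp_fixblock_map(phys);

  __clean_dcache_guest_page(va, PMD_SIZE);

  hyp_fixblock_unmap();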

Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-11-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Vincent Donnefort
db14091d8f KVM: arm64: Stage-2 huge mappings for np-guests
Now that np-guest hypercalls with a range are supported, we can let the
hypervisor install block mappings whenever the stage-1 allows it, that
is, when backed by either Hugetlbfs or THPs. The size of those block
mappings is limited to PMD_SIZE.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-10-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Quentin Perret
3669ddd8fa KVM: arm64: Add a range to pkvm_mappings
In preparation for supporting stage-2 huge mappings for np-guests, add a
nr_pages member to pkvm_mappings to allow EL1 to track the size of the
stage-2 mapping.

Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-9-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Quentin Perret
b38c9775f7 KVM: arm64: Convert pkvm_mappings to interval tree
In preparation for supporting stage-2 huge mappings for np-guests, let's
convert pgt.pkvm_mappings to an interval tree.

No functional change intended.

Suggested-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-8-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Vincent Donnefort
c4d99a833d KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
In preparation for supporting stage-2 huge mappings for np-guests, add a
nr_pages argument to the __pkvm_host_test_clear_young_guest hypercall.
This range supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is,
512 on a 4K-pages system).
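
The argument check shared by this family of range hypercalls (also the
wrprotect/unshare/share variants below) is then simply, as a sketch:

  if (nr_pages != 1 && nr_pages != PMD_SIZE / PAGE_SIZE)
          return -EINVAL;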

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-7-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Vincent Donnefort
0eb802b3b4 KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
In preparation for supporting stage-2 huge mappings for np-guests, add a
nr_pages argument to the __pkvm_host_wrprotect_guest hypercall. This
range supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is, 512
on a 4K-pages system).

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-6-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Vincent Donnefort
f28f1d02f4 KVM: arm64: Add a range to __pkvm_host_unshare_guest()
In preparation for supporting stage-2 huge mappings for np-guests, add a
nr_pages argument to the __pkvm_host_unshare_guest hypercall. This range
supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is, 512 on a
4K-pages system).

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-5-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Vincent Donnefort
4274385ebf KVM: arm64: Add a range to __pkvm_host_share_guest()
In preparation for supporting stage-2 huge mappings for np-guests, add a
nr_pages argument to the __pkvm_host_share_guest hypercall. This range
supports only two values: 1 or PMD_SIZE / PAGE_SIZE (that is, 512 on a
4K-pages system).

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-4-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Vincent Donnefort
3db771fa23 KVM: arm64: Introduce for_each_hyp_page
Add a helper to iterate over the hypervisor vmemmap. This will be
particularly handy with the introduction of huge mapping support
for the np-guest stage-2.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-3-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Vincent Donnefort
944a1ed8cc KVM: arm64: Handle huge mappings for np-guest CMOs
clean_dcache_guest_page() and invalidate_icache_guest_page() accept a
size as an argument. But they also rely on the fixmap, which can only
map a single PAGE_SIZE page.

With the upcoming stage-2 huge mappings for pKVM np-guests, those
callbacks will get size > PAGE_SIZE. Loop the CMOs on a PAGE_SIZE basis
until the whole range is done.
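
A sketch of the loop (assuming the existing per-page fixmap helpers):

  static void clean_dcache_guest_range(phys_addr_t phys, size_t size)
  {
          size_t off;

          for (off = 0; off < size; off += PAGE_SIZE) {
                  void *va = hyp_fixmap_map(phys + off);

                  __clean_dcache_guest_page(va, PAGE_SIZE);
                  hyp_fixmap_unmap();     /* per-page TLBI: the costly bit */
          }
  }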

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250521124834.1070650-2-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:51 +01:00
Marc Zyngier
d5702dd224 Merge branch kvm-arm64/pkvm-selftest-6.16 into kvm-arm64/pkvm-np-thp-6.16
* kvm-arm64/pkvm-selftest-6.16:
  : .
  : pKVM selftests covering the memory ownership transitions by
  : Quentin Perret. From the initial cover letter:
  :
  : "We have recently found a bug [1] in the pKVM memory ownership
  : transitions by code inspection, but it could have been caught with a
  : test.
  :
  : Introduce a boot-time selftest exercising all the known pKVM memory
  : transitions and importantly checks the rejection of illegal transitions.
  :
  : The new test is hidden behind a new Kconfig option separate from
  : CONFIG_EL2_NVHE_DEBUG on purpose as that has side effects on the
  : transition checks ([1] doesn't reproduce with EL2 debug enabled).
  :
  : [1] https://lore.kernel.org/kvmarm/20241128154406.602875-1-qperret@google.com/"
  : .
  KVM: arm64: Extend pKVM selftest for np-guests
  KVM: arm64: Selftest for pKVM transitions
  KVM: arm64: Don't WARN from __pkvm_host_share_guest()
  KVM: arm64: Add .hyp.data section

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 14:33:43 +01:00
Marc Zyngier
538fbac740 KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
The conversion to kvm_release_faultin_page() missed the requirement
for this to be called within a critical section with mmu_lock held
for write. Move this call up to satisfy this requirement.

Fixes: 069a05e535 ("KVM: arm64: nv: Handle VNCR_EL2-triggered faults")
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 11:40:12 +01:00
Marc Zyngier
beab7d0583 KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
Calling invalidate_vncr_va() without the mmu_lock held for write
is a bad idea, and lockdep tells you about that.

Fixes: 4ffa72ad8f ("KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2")
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 11:40:12 +01:00
Marc Zyngier
d43548f422 KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
When translating a VNCR translation fault, we start by marking the
current SW-managed TLB as invalid, so that we can populate it
in place. This is, however, done without the mmu_lock held.

A consequence of this is that another CPU dealing with TLBI
emulation can observe a translation still flagged as valid, but
with invalid walk results (such as pgshift being 0). Bad things
can result from this, such as a BUG() in pgshift_level_to_ttl().

Fix it by taking the mmu_lock for write to perform this local
invalidation, and use invalidate_vncr() instead of open-coding
the write to the 'valid' flag.

Fixes: 069a05e535 ("KVM: arm64: nv: Handle VNCR_EL2-triggered faults")
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250520144116.3667978-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-21 09:53:08 +01:00
Jing Zhang
30deb51a67 KVM: arm64: vgic-its: Add debugfs interface to expose ITS tables
This commit introduces a debugfs interface to display the contents of the
VGIC Interrupt Translation Service (ITS) tables.

The ITS tables map Device/Event IDs to Interrupt IDs and target processors.
Exposing this information through debugfs allows for easier inspection and
debugging of the interrupt routing configuration.

The debugfs interface presents the ITS table data in a tabular format:

    Device ID: 0x0, Event ID Range: [0 - 31]
    EVENT_ID    INTID  HWINTID   TARGET   COL_ID HW
    -----------------------------------------------
           0     8192        0        0        0  0
           1     8193        0        0        0  0
           2     8194        0        2        2  0

    Device ID: 0x18, Event ID Range: [0 - 3]
    EVENT_ID    INTID  HWINTID   TARGET   COL_ID HW
    -----------------------------------------------
           0     8225        0        0        0  0
           1     8226        0        1        1  0
           2     8227        0        3        3  0

    Device ID: 0x10, Event ID Range: [0 - 7]
    EVENT_ID    INTID  HWINTID   TARGET   COL_ID HW
    -----------------------------------------------
           0     8229        0        3        3  1
           1     8230        0        0        0  1
           2     8231        0        1        1  1
           3     8232        0        2        2  1
           4     8233        0        3        3  1

The output is generated using the seq_file interface, allowing for efficient
handling of potentially large ITS tables.

This interface is read-only and does not allow modification of the ITS
tables. It is intended for debugging and informational purposes only.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20250220224247.2017205-1-jingzhangos@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 16:10:02 +01:00
D Scott Phillips
fed55f49fa arm64: errata: Work around AmpereOne's erratum AC04_CPU_23
On AmpereOne AC04, updates to HCR_EL2 can rarely corrupt simultaneous
translations for data addresses initiated by load/store instructions.
Only instruction-initiated translations are vulnerable, not translations
from prefetches, for example. A DSB before the store to HCR_EL2 is
sufficient to prevent older instructions from hitting the window for
corruption, and an ISB after is sufficient to prevent younger
instructions from hitting the window for corruption.
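
The resulting sequence around any HCR_EL2 update, as a sketch:

  dsb(nsh);                       /* older instructions complete first */
  write_sysreg(val, hcr_el2);
  isb();                          /* younger instructions wait */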

Signed-off-by: D Scott Phillips <scott@os.amperecomputing.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250513184514.2678288-1-scott@os.amperecomputing.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 12:46:26 +01:00
Marc Zyngier
98dbe56a01 KVM: arm64: Handle TSB CSYNC traps
The architecture introduces a trap for TSB CSYNC that fits in
the same EC as LS64 and PSB CSYNC. Let's deal with it in a similar
way.

It's not that we expect this to be useful any time soon anyway.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 11:36:21 +01:00
Marc Zyngier
af2d78dcad KVM: arm64: Add FGT descriptors for FEAT_FGT2
Bulk addition of all the FGT2 traps reported with EC == 0x18,
as described in the 2025-03 JSON drop.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 11:36:21 +01:00
Marc Zyngier
f654e9e47e KVM: arm64: Allow sysreg ranges for FGT descriptors
Just like we allow sysreg ranges for Coarse Grained Trap descriptors,
allow them for Fine Grain Traps as well.

This comes with a warning that not all ranges are suitable for this
particular definition of ranges.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 11:36:21 +01:00
Marc Zyngier
1ba41c8160 KVM: arm64: Add context-switch for FEAT_FGT2 registers
Just like the rest of the FGT registers, perform a switch of the
FGT2 equivalent. This avoids the host configuration leaking into
the guest...

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 11:36:21 +01:00
Marc Zyngier
fc631df00c KVM: arm64: Add trap routing for FEAT_FGT2 registers
Similarly to the FEAT_FGT registers, pick the correct FEAT_FGT2
register when a sysreg trap indicates they could be responsible
for the exception.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 11:36:21 +01:00
Marc Zyngier
4bc0fe0898 KVM: arm64: Add sanitisation for FEAT_FGT2 registers
Just like the FEAT_FGT registers, treat the FGT2 variant the same
way. This is a large update, but a fairly mechanical one.

The config dependencies are extracted from the 2025-03 JSON drop.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 11:36:10 +01:00
Marc Zyngier
b2a324ff01 KVM: arm64: Use HCR_EL2 feature map to drive fixed-value bits
Similarly to other registers, describe which HCR_EL2 bit depends
on which feature, and use this to compute the RES0 status of these
bits.

An additional complexity stems from the status of some bits such
as E2H and RW, which do not have a RESx status, but still take
a fixed value due to implementation choices in KVM.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 11:35:30 +01:00
Marc Zyngier
beed444841 KVM: arm64: Use HCRX_EL2 feature map to drive fixed-value bits
Similarly to other registers, describe which HCRX_EL2 bit depends
on which feature, and use this to compute the RES0 status of these
bits.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 11:35:30 +01:00
Marc Zyngier
c6cbe6a4c1 KVM: arm64: Use FGT feature maps to drive RES0 bits
Another benefit of mapping bits to features is that it becomes trivial
to define which bits should be handled as RES0.

Let's apply this principle to the guest's view of the FGT registers.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 11:35:00 +01:00
Marc Zyngier
a7484c80e5 KVM: arm64: Allow userspace to request KVM_ARM_VCPU_EL2*
Since we're (almost) feature complete, let's allow userspace to
request KVM_ARM_VCPU_EL2* by bumping KVM_VCPU_MAX_FEATURES up.

We also now advertise the features to userspace with new capabilities.

It's going to be great...

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20250514103501.2225951-17-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 08:01:19 +01:00
Marc Zyngier
6ec4c371d4 KVM: arm64: nv: Remove dead code from ERET handling
Cleanly, this code cannot trigger, since we filter this from the
caller. Drop it.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-16-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 08:01:19 +01:00
Marc Zyngier
aa98df31f6 KVM: arm64: nv: Plumb TLBI S1E2 into system instruction dispatch
Now that we have to handle TLBI S1E2 in the core code, plumb the
sysinsn dispatch code into it, so that these instructions don't
just UNDEF anymore.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-15-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 08:01:19 +01:00
Marc Zyngier
4ffa72ad8f KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2
A TLBI by VA for S1 must take effect on our pseudo-TLB for VNCR
and potentially knock the fixmap mapping. Even worse, that TLBI
must be able to work cross-vcpu.

For that, we track on a per-VM basis if any VNCR is mapped, using
an atomic counter. Whenever a TLBI S1E2 occurs while this counter
is non-zero, we take the long road all the way back to the core code.

There, we iterate over all vcpus and check whether this particular
invalidation has any damaging effect. If it does, we nuke the pseudo
TLB and the corresponding fixmap.

Yes, this is costly.
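
The fast-path gate is then a single counter check, roughly (field name
illustrative):

  if (!atomic_read(&kvm->arch.vncr_map_count))
          return;         /* no VNCR mapped anywhere, nothing to do */

  /* Slow path: for each vCPU, compare the invalidated range with its
   * cached VNCR translation; nuke the pseudo-TLB and fixmap on overlap. */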

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-14-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 08:01:19 +01:00
Marc Zyngier
73e1b621b2 KVM: arm64: nv: Program host's VNCR_EL2 to the fixmap address
Since we now have a way to map the guest's VNCR_EL2 on the host,
we can point the host's VNCR_EL2 to it and go full circle!

Note that we unconditionally assign the fixmap to VNCR_EL2,
irrespective of the guest's version being mapped or not. We want
to take a fault on first access, so the fixmap contains something
guaranteed to be either invalid or a guest mapping.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-13-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 08:01:19 +01:00
Marc Zyngier
7270cc9157 KVM: arm64: nv: Handle VNCR_EL2 invalidation from MMU notifiers
During an invalidation triggered by an MMU notifier, we need to
make sure we can drop the *host* mapping that would have been
translated by the stage-2 mapping being invalidated.

For the moment, the invalidation is pretty brutal, as we nuke
the full IPA range, and therefore any VNCR_EL2 mapping.

At some point, we'll be more light-weight, once the code is able
to deal with something more targeted.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-12-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 08:01:19 +01:00
Marc Zyngier
2a359e0725 KVM: arm64: nv: Handle mapping of VNCR_EL2 at EL2
Now that we can handle faults triggered through VNCR_EL2, we need
to map the corresponding page at EL2. But where, you'll ask?

Since each CPU in the system can run a vcpu, we need a per-CPU
mapping. For that, we carve a NR_CPUS range in the fixmap, giving
us a per-CPU va at which to map the guest's VNCR page.

The mapping occurs both on vcpu load and on the back of a fault,
both generating a request that will take care of the mapping.
That mapping will also get dropped on vcpu put.

Yes, this is a bit heavy handed, but it is simple. Eventually,
we may want to have a per-VM, per-CPU mapping, which would avoid
all the TLBI overhead.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-11-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 08:01:19 +01:00
Marc Zyngier
069a05e535 KVM: arm64: nv: Handle VNCR_EL2-triggered faults
As VNCR_EL2.BADDR contains a VA, it is bound to trigger faults.

These faults can have multiple sources:

- We haven't mapped anything on the host: we need to compute the
  resulting translation, populate a TLB, and eventually map
  the corresponding page

- The permissions are out of whack: we need to tell the guest about
  this state of affairs

Note that the kernel doesn't support S1POE for itself yet, so
the particular case of a VNCR page mapped with no permissions
or with write-only permissions is not correctly handled yet.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-10-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 08:01:19 +01:00
Marc Zyngier
6fb75733f1 KVM: arm64: nv: Add userspace and guest handling of VNCR_EL2
Plug VNCR_EL2 in the vcpu_sysreg enum, define its RES0/RES1 bits,
and make it accessible to userspace when the VM is configured to
support FEAT_NV2.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-9-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 07:59:46 +01:00
Marc Zyngier
ea8d3cf46d KVM: arm64: nv: Add pseudo-TLB backing VNCR_EL2
FEAT_NV2 introduces an interesting problem for NV, as VNCR_EL2.BADDR
is a virtual address in the EL2&0 (or EL2, but we thankfully ignore
this) translation regime.

As we need to replicate such a mapping in the real EL2, it means that
we need to remember that there is such a translation, and that any
TLBI affecting EL2 can possibly affect this translation.

It also means that any invalidation driven by an MMU notifier must
be able to shoot down any such mapping.

All in all, we need a data structure that represents this mapping,
and that is extremely close to a TLB. Given that we can only use
one of those per vcpu at any given time, we only allocate one.

No effort is made to keep that structure small. If we need to
start caching multiple of them, we may want to revisit that design
point. But for now, it is kept simple so that we can reason about it.

Oh, and add a braindump of how things are supposed to work, because
I will definitely page this out at some point. Yes, pun intended.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-8-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 07:59:46 +01:00
Marc Zyngier
bd914a9814 KVM: arm64: nv: Don't adjust PSTATE.M when L2 is nesting
We currently check for HCR_EL2.NV being set to decide whether we
need to repaint PSTATE.M to say EL2 instead of EL1 on exit.

However, this isn't correct when L2 is itself a hypervisor, and
L1 has set its own HCR_EL2.NV. That's because we "flatten"
the state and inherit parts of the guest's own setup. In that case,
we shouldn't adjust PSTATE.M, as this is really EL1 for both us
and the guest.

Instead of trying to work out how we ended up with HCR_EL2.NV
being set by introspecting both the host and guest states, use
a per-CPU flag to remember the context (HYP or not), and use that
information to decide whether PSTATE needs tweaking.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-7-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 07:59:46 +01:00
Marc Zyngier
85bba00425 KVM: arm64: nv: Move TLBI range decoding to a helper
As we are about to expand our TLB invalidation capabilities to support
recursive virtualisation, move the decoding of a TLBI by range into
a helper that returns the base, the range and the ASID.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 07:59:46 +01:00
Marc Zyngier
a0ec2b822c KVM: arm64: nv: Snapshot S1 ASID tagging information during walk
We currently completely ignore any sort of ASID tagging during a S1
walk, as AT doesn't care about it.

However, such information is required if we are going to create
anything that looks like a TLB from this walk.

Let's capture both the nG and ASID information while walking
the page tables.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 07:59:46 +01:00
Marc Zyngier
34fa9dece5 KVM: arm64: nv: Extract translation helper from the AT code
The address translation infrastructure is currently pretty tied to
the AT emulation.

However, we also need it for features that require the use of VAs, such
as VNCR_EL2 (and maybe one of these days SPE), meaning that we need
a slightly more generic infrastructure.

Start this by introducing a new helper (__kvm_translate_va()) that
performs a S1 walk for a given translation regime, EL and PAN
settings.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 07:59:46 +01:00
Marc Zyngier
469c4713d4 KVM: arm64: nv: Allocate VNCR page when required
If running an NV guest on an ARMv8.4-NV capable system, let's
allocate an additional page that will be used by the hypervisor
to fulfill system register accesses.

Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250514103501.2225951-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-19 07:59:46 +01:00
Wei-Lin Chang
92c749e4aa KVM: arm64: nv: Remove clearing of ICH_LR<n>.EOI if ICH_LR<n>.HW == 1
In the case of ICH_LR<n>.HW == 1, bit 41 of the LR is just a part of the
pINTID with no EOI meaning, and it will be zeroed by the subsequent
clearing of ICH_LR_PHYS_ID_MASK anyway.
No functional changes intended.

Signed-off-by: Wei-Lin Chang <r09922117@csie.ntu.edu.tw>
Link: https://lore.kernel.org/r/20250512133223.866999-1-r09922117@csie.ntu.edu.tw
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-16 13:05:23 +01:00
Ben Horgan
fe21ff5d4b KVM: arm64: Make MTE_frac masking conditional on MTE capability
If MTE_frac is masked out unconditionally then the guest will always
see ID_AA64PFR1_EL1_MTE_frac as 0. However, a value of 0 when
ID_AA64PFR1_EL1_MTE is 2 indicates that MTE_ASYNC is supported. Hence, for
a host with ID_AA64PFR1_EL1_MTE==2 and ID_AA64PFR1_EL1_MTE_frac==0xf
(MTE_ASYNC unsupported) the guest would see MTE_ASYNC advertised as
supported whilst the host does not support it. Hence, expose the sanitised
value of MTE_frac to the guest and user-space.

As MTE_frac was previously hidden (always 0), and KVM must accept the
values previously provided by user-space, allow user-space to set
ID_AA64PFR1_EL1.MTE_frac to 0 when ID_AA64PFR1_EL1.MTE is 2. However,
ignore it to avoid incorrectly claiming hardware support for MTE_ASYNC
in the guest.

Note that Linux does not check the value of ID_AA64PFR1_EL1_MTE_frac and
wrongly assumes that MTE async faults can be generated even on hardware
that does not support them. This issue is not addressed here.
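
The guest-visible masking then becomes conditional, roughly:

  u64 val = read_sanitised_ftr_reg(SYS_ID_AA64PFR1_EL1);

  /* Only hide MTE_frac when the VM has no MTE capability. */
  if (!kvm_has_mte(kvm))
          val &= ~ID_AA64PFR1_EL1_MTE_frac_MASK;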

Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Link: https://lore.kernel.org/r/20250512114112.359087-3-ben.horgan@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-16 13:01:18 +01:00
Marc Zyngier
3e4d597220 KVM: arm64: Don't feed uninitialised data to HCR_EL2
When the guest executes an AT S1E{0,1} from EL2, and its
HCR_EL2.{E2H,TGE}=={1,1}, this is a pure S1 translation
that doesn't involve a guest-supplied S2, and the full S1
context is already in place. This allows us to take a shortcut
and avoid save/restoring a bunch of registers.

However, we set HCR_EL2 to a value suitable for the use of AT
in guest context. And we do so by using the value that we saved.
Or not. In the case described above, we restore whatever junk
was on the stack, and carry on with it until the next entry.

Needless to say, this is completely broken.

But this also triggers the realisation that saving HCR_EL2 is
a bit pointless. We are always in host context at the point where we
reach this code, and what we program to enter the guest is a known
value (vcpu->arch.hcr_el2).

Drop the pointless save/restore, and wrap the AT operations with
writes that switch between guest and host values for HCR_EL2.

Reported-by: D Scott Phillips <scott@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20250422122612.2675672-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-14 10:43:50 +01:00
Marc Zyngier
ed648ab804 KVM: arm64: Teach address translation about access faults
It appears that our S1 PTW is completely oblivious of access faults.
Teach the S1 translation code about it.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250422122612.2675672-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-14 10:43:50 +01:00
Marc Zyngier
493b01de72 KVM: arm64: Fix PAR_EL1.{PTW,S} reporting on AT S1E*
When an AT S1E* operation fails, we need to report whether the
translation failed at S2, and whether this was during a S1 PTW.

But these two bits are not independent. PAR_EL1.PTW can only be
set if PAR_EL1.S is also set, and PAR_EL1.S can only be set on
its own when the full S1 PTW has succeeded but the access
itself reported a fault at S2.

As a result, it makes no sense to carry both ptw and s2 as parameters
to fail_s1_walk(), and they should be unified.

This fixes a number of cases where we were reporting PTW=1 *and*
S=0, which makes no sense.

Link: https://lore.kernel.org/r/20250422122612.2675672-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-14 10:43:50 +01:00
Marc Zyngier
938a79d0aa KVM: arm64: Validate FGT register descriptions against RES0 masks
In order to point out to the unsuspecting KVM hacker that they
are missing something somewhere, validate that the known FGT bits
do not intersect with the corresponding RES0 mask, as computed at
boot time.

This check is also performed at boot time, ensuring that there is
no runtime overhead.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-10 11:04:35 +01:00
Marc Zyngier
63d423a763 KVM: arm64: Switch to table-driven FGU configuration
Defining the FGU behaviour is extremely tedious. It relies on matching
each set of bits from FGT registers with an architectural feature, and
adding them to the FGU list if the corresponding feature isn't advertised
to the guest.

It is however relatively easy to dump most of that information from
the architecture JSON description, and use that to control the FGU bits.

Let's introduce a new set of tables describing the mapping between
FGT bits and features. Most of the time, this is only a lookup in
an idreg field, with a few more complex exceptions.

While this is obviously many more lines in a new file, this is
mostly generated, and is pretty easy to maintain.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-10 11:04:35 +01:00
Marc Zyngier
397411c743 KVM: arm64: Handle PSB CSYNC traps
The architecture introduces a trap for PSB CSYNC that fits in
the same EC as LS64. Let's deal with it in a similar way.

It's not that we expect this to be useful any time soon anyway.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-10 11:04:35 +01:00
Marc Zyngier
ef6d7d2682 KVM: arm64: Use KVM-specific HCRX_EL2 RES0 mask
We do not have a computed table for HCRX_EL2, so statically define
the bits we know about. A warning will fire if the architecture
grows bits that are not handled yet.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-10 11:04:35 +01:00
Marc Zyngier
3ce9bbba93 KVM: arm64: Remove hand-crafted masks for FGT registers
These masks are now useless, and can be removed.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-10 11:04:35 +01:00
Marc Zyngier
aed34b6d21 KVM: arm64: Use computed FGT masks to setup FGT registers
Flip the hypervisor FGT configuration over to the computed FGT
masks.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-10 11:04:09 +01:00
Gavin Shan
00b0300cf1 KVM: arm64: Drop sort_memblock_regions()
Drop sort_memblock_regions() and avoid sorting the copied memory
regions into ascending order by their base addresses, because the
source memory regions should have been sorted correctly when they
were added by memblock_add() or its variants.

This is generally reverting commit a14307f531 ("KVM: arm64: Sort
the hypervisor memblocks"). No functional changes intended.

Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250311043718.91004-1-gshan@redhat.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-08 11:29:45 +01:00
Mostafa Saleh
446692759b KVM: arm64: Handle UBSAN faults
Now that UBSAN can be enabled, handle brk64 exits from UBSAN.
Re-use the decoding code from the kernel, and panic with the
UBSAN message.

Signed-off-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250430162713.1997569-5-smostafa@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-07 11:21:35 +01:00
Mostafa Saleh
61b38f7591 KVM: arm64: Introduce CONFIG_UBSAN_KVM_EL2
Add a new Kconfig option, CONFIG_UBSAN_KVM_EL2, which enables
UBSAN for the EL2 code (in protected/nVHE/hVHE modes).
This re-uses the same checks enabled for the kernel for the
hypervisor. The only difference is that for EL2 it always
emits a "brk" instead of implementing hooks, as the hypervisor
can't print reports.

The KVM code re-uses the kernel's report_ubsan_failure(), so the
#ifdefs are changed to also build this code for
CONFIG_UBSAN_KVM_EL2.

Signed-off-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250430162713.1997569-4-smostafa@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-07 11:21:35 +01:00
Mostafa Saleh
3949e28786 KVM: arm64: Fix memory check in host_stage2_set_owner_locked()
I found this simple bug while preparing some patches for pKVM.
AFAICT, it should be harmless (besides crashing the kernel if it
was misbehaving).

Fixes: e94a7dea29 ("KVM: arm64: Move host page ownership tracking to the hyp vmemmap")
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Link: https://lore.kernel.org/r/20250501162450.2784043-1-smostafa@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-05-07 00:17:05 -07:00
Marc Zyngier
ffea7c73d1 KVM: arm64: Properly save/restore HCRX_EL2
Rather than restoring HCRX_EL2 to a fixed value on vcpu exit,
perform a full save/restore of the register, ensuring that
we don't lose bits that would have been set at some point in
the host kernel lifetime, such as the GCSEn bit.
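
A minimal sketch of the save/restore pairing, assuming an illustrative
per-CPU host context field (the actual storage location may differ):

  static inline void save_host_hcrx(struct kvm_cpu_context *host_ctxt)
  {
          /* snapshot whatever the host had set, e.g. GCSEn */
          host_ctxt->hcrx_el2 = read_sysreg_s(SYS_HCRX_EL2);
  }

  static inline void restore_host_hcrx(struct kvm_cpu_context *host_ctxt)
  {
          /* put back the saved value rather than a fixed one */
          write_sysreg_s(host_ctxt->hcrx_el2, SYS_HCRX_EL2);
  }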

Fixes: ff5181d8a2 ("arm64/gcs: Provide basic EL2 setup to allow GCS usage at EL0 and EL1")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250430105916.3815157-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-05-07 00:16:44 -07:00
Marc Zyngier
311ba55a5f KVM: arm64: Propagate FGT masks to the nVHE hypervisor
The nVHE hypervisor needs to have access to its own view of the FGT
masks, which unfortunately results in a bit of data duplication.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:25 +01:00
Mark Rutland
ea266c7249 KVM: arm64: Unconditionally configure fine-grain traps
... otherwise we can inherit the host configuration if this differs from
the KVM configuration.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[maz: simplified a couple of things]
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:25 +01:00
Marc Zyngier
7ed43d84c1 KVM: arm64: Use computed masks as sanitisers for FGT registers
Now that we have computed RES0 bits, use them to sanitise the
guest view of FGT registers.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:25 +01:00
Marc Zyngier
3164899c21 KVM: arm64: Add description of FGT bits leading to EC!=0x18
The current FGT tables are only concerned with the bits generating
ESR_ELx.EC==0x18. However, we want an exhaustive view of what KVM
really knows about.

So let's add another small table that provides that extra information.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:25 +01:00
Marc Zyngier
1b8570be89 KVM: arm64: Compute FGT masks from KVM's own FGT tables
In the process of decoupling KVM's view of the FGT bits from the
wider architectural state, use KVM's own FGT tables to build
a synthetic view of what is actually known.

This allows for some checking along the way.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:19 +01:00
Marc Zyngier
5329358c22 KVM: arm64: Plug FEAT_GCS handling
We don't seem to be handling the GCS-specific exception class.
Handle it by delivering an UNDEF to the guest, and populate the
relevant trap bits.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:19 +01:00
Marc Zyngier
09be03c6b5 KVM: arm64: Don't treat HCRX_EL2 as a FGT register
Treating HCRX_EL2 as yet another FGT register seems excessive, and
gets in the way of further improvements. It is actually simpler to
just be explicit about the masking, so just do that.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:19 +01:00
Marc Zyngier
9308d0b1d7 KVM: arm64: Restrict ACCDATA_EL1 undef to FEAT_LS64_ACCDATA being disabled
We currently unconditionally make ACCDATA_EL1 accesses UNDEF.

As we are about to support it, restrict the UNDEF behaviour to cases
where FEAT_LS64_ACCDATA is not exposed to the guest.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:19 +01:00
Marc Zyngier
2e04378f1a KVM: arm64: Handle trapping of FEAT_LS64* instructions
We generally don't expect FEAT_LS64* instructions to trap, unless
they are trapped by a guest hypervisor.

Otherwise, this is just the guest playing tricks on us by using
an instruction that isn't advertised, which we handle with a well
deserved UNDEF.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:14 +01:00
Marc Zyngier
4b4af68dd9 KVM: arm64: Simplify handling of negative FGT bits
check_fgt_bit() and triage_sysreg_trap() implement the same thing
twice for no good reason. We have to look up the FGT register twice,
as we don't communicate it. Similarly, we extract the register value
at the wrong spot.

Reorganise the code in a more logical way so that things are done
at the correct location, removing a lot of duplication.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:09 +01:00
Marc Zyngier
04af8a3968 KVM: arm64: Tighten handling of unknown FGT groups
triage_sysreg_trap() assumes that it knows all the possible values
for FGT groups, which won't be the case as we start adding more
FGT registers (unless we add everything in one go, which is obviously
undesirable).

At the same time, it doesn't offer much in terms of debugging info
when things go wrong.

Turn the "__NR_FGT_GROUP_IDS__" case into a default, covering any
unhandled value, and give the kernel hacker a bit of a clue about
what's wrong (system register and full trap descriptor).
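
A sketch of the shape of the tightened triage; types and field names
are illustrative rather than KVM's actual ones:

  static bool triage_fgt_group(unsigned int group, u32 sysreg, u64 desc)
  {
          switch (group) {
          case 0: /* ...known FGT groups handled here... */
                  return true;
          default:
                  /* give the kernel hacker a clue instead of guessing */
                  WARN_ONCE(1, "Unhandled FGT group %u (sysreg %08x, desc %016llx)\n",
                            group, sysreg, desc);
                  return false;
          }
  }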

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:09 +01:00
Marc Zyngier
0f013a524b arm64: sysreg: Replace HFGxTR_EL2 with HFG{R,W}TR_EL2
Treating HFGRTR_EL2 and HFGWTR_EL2 identically was a mistake.
It makes things hard to reason about, has the potential to
introduce bugs by giving a meaning to bits that are really reserved,
and is in general a bad description of the architecture.

Given that #defines are cheap, let's describe both registers as
intended by the architecture, and repaint all the existing uses.

Yes, this is painful.

The registers themselves are generated from the JSON file in
an automated way.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 17:35:03 +01:00
Quentin Perret
48d5645072 KVM: arm64: Extend pKVM selftest for np-guests
The pKVM selftest intends to test as many memory 'transitions' as
possible, so extend it to cover sharing pages with non-protected guests,
including in the case of multi-sharing.

Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416160900.3078417-5-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 09:56:18 +01:00
Quentin Perret
6c2d4c319c KVM: arm64: Selftest for pKVM transitions
We have recently found a bug [1] in the pKVM memory ownership
transitions by code inspection, but it could have been caught with a
test.

Introduce a boot-time selftest exercising all the known pKVM memory
transitions and importantly checks the rejection of illegal transitions.

The new test is hidden behind a new Kconfig option separate from
CONFIG_EL2_NVHE_DEBUG on purpose as that has side effects on the
transition checks ([1] doesn't reproduce with EL2 debug enabled).

[1] https://lore.kernel.org/kvmarm/20241128154406.602875-1-qperret@google.com/

Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416160900.3078417-4-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 09:56:18 +01:00
Quentin Perret
845f126732 KVM: arm64: Don't WARN from __pkvm_host_share_guest()
We currently WARN() if the host attempts to share a page that is not in
an acceptable state with a guest. This isn't strictly necessary and
makes testing much harder, so drop the WARN and make sure to propagate the
error code instead.
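
The gist of the change, sketched with a hypothetical check_page_state()
helper standing in for the real state check:

  ret = check_page_state(page);   /* hypothetical helper */
  if (ret)
          return ret;             /* was: WARN_ON(ret); */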

Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416160900.3078417-3-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 09:56:18 +01:00
David Brazdil
74b13d5816 KVM: arm64: Add .hyp.data section
The hypervisor has not needed its own .data section because all globals
were either .rodata or .bss. To avoid having to initialize future
data-structures at run-time, let's add a .data section to the
hypervisor.

Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416160900.3078417-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 09:56:18 +01:00
Marc Zyngier
bae247ccad KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode
We keep setting and clearing these bits depending on the role of
the host kernel, mimicking what we do for nVHE. But that's actually
pretty pointless, as we always want physical interrupts to make it
to the host, at EL2.

This also has two problems:

- it prevents IRQs from being taken when these bits are cleared
  if the implementation has chosen to implement these bits as
  masks when HCR_EL2.{TGE,xMO}=={0,0}

- it triggers a bad erratum on the AmpereOne HW, which catches
  fire on clearing these bits while an interrupt is being taken
  (AC03_CPU_36).

Let's kill these two birds with a single stone, and permanently
set the xMO bits when running VHE. This involves a bit of surgery
on code paths that rely on flipping these bits on and off for
other purposes.

Note that the earliest setting of hcr_el2 (in the init_hcr_el2
macro) is left untouched as it runs extremely early, with interrupts
disabled, and soon enough overwritten with the final value containing
the xMO bits.

Reported-by: D Scott Phillips <scott@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20250429114326.3618875-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 09:41:32 +01:00
Seongsu Park
d2f14174f9 KVM: arm64: Replace ternary flags with str_on_off() helper
Replace repetitive ternary expressions with the str_on_off() helper
function. This change improves code readability and ensures consistency
in tracepoint string formatting.
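
For illustration, the before/after shape of such a conversion (the
debug print itself is hypothetical):

  #include <linux/string_choices.h>

  /* before: repetitive ternary in the format arguments */
  pr_debug("vgic: %s\n", enabled ? "on" : "off");

  /* after: the shared helper keeps the strings consistent */
  pr_debug("vgic: %s\n", str_on_off(enabled));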

Signed-off-by: Seongsu Park <sgsu.park@samsung.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/1891546521.01744691102904.JavaMail.epsvc@epcpadp1new
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06 09:38:37 +01:00
Marc Zyngier
7af7cfbe78 KVM: arm64: Prevent userspace from disabling AArch64 support at any virtualisable EL
A sorry excuse for a selftest is trying to disable AArch64 support.
And yes, this goes as well as you can imagine.

Let's forbid this sort of thing. Normal userspace shouldn't get
caught doing that.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20250429114117.3618800-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-05-05 12:19:45 -07:00
Marc Zyngier
859c60276e KVM: arm64: Force HCR_EL2.xMO to 1 at all times in VHE mode
We keep setting and clearing these bits depending on the role of
the host kernel, mimicking what we do for nVHE. But that's actually
pretty pointless, as we always want physical interrupts to make it
to the host, at EL2.

This also has two problems:

- it prevents IRQs from being taken when these bits are cleared
  if the implementation has chosen to implement these bits as
  masks when HCR_EL2.{TGE,xMO}=={0,0}

- it triggers a bad erratum on the AmpereOne HW, which catches
  fire on clearing these bits while an interrupt is being taken
  (AC03_CPU_36).

Let's kill these two birds with a single stone, and permanently
set the xMO bits when running VHE. This involves a bit of surgery
on code paths that rely on flipping these bits on and off for
other purposes.

Note that the earliest setting of hcr_el2 (in the init_hcr_el2
macro) is left untouched as it runs extremely early, with interrupts
disabled, and soon enough overwritten with the final value containing
the xMO bits.

Reported-by: D Scott Phillips <scott@os.amperecomputing.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250429114326.3618875-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-05-05 12:19:24 -07:00
Sebastian Ott
157dbc4a32 KVM: arm64: Fix uninitialized memcache pointer in user_mem_abort()
Commit fce886a602 ("KVM: arm64: Plumb the pKVM MMU in KVM") made the
initialization of the local memcache variable in user_mem_abort()
conditional, leaving a codepath where it is used uninitialized via
kvm_pgtable_stage2_map().

This can fail on any path that requires a stage-2 allocation
without transition via a permission fault or dirty logging.

Fix this by making sure that memcache is always valid.
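
A sketch of the fix as described: select the memcache matching the MMU
flavour up front, so no path reaches kvm_pgtable_stage2_map() with an
uninitialized pointer (field names follow the commit text but may not
match the tree exactly):

  struct kvm_mmu_memory_cache *memcache;

  if (is_protected_kvm_enabled())
          memcache = &vcpu->arch.pkvm_memcache;
  else
          memcache = &vcpu->arch.mmu_page_cache;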

Fixes: fce886a602 ("KVM: arm64: Plumb the pKVM MMU in KVM")
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/kvmarm/3f5db4c7-ccce-fb95-595c-692fa7aad227@redhat.com/
Link: https://lore.kernel.org/r/20250505173148.33900-1-sebott@redhat.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-05-05 12:12:27 -07:00
Arnd Bergmann
2555d4c687 arm64: drop binutils version checks
Now that gcc-8 and binutils-2.30 are the minimum versions, a lot of
the individual feature checks can go away for simplification.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-04-30 21:55:06 +02:00
Marc Zyngier
67bd641517 Merge branch kvm-arm64/nv-pmu-fixes into kvmarm-master/next
* kvm-arm64/nv-pmu-fixes:
  : .
  : Fixes for NV PMU emulation. From the cover letter:
  :
  : "Joey reports that some of his PMU tests do not behave quite as
  : expected:
  :
  : - MDCR_EL2.HPMN is set to 0 out of reset
  :
  : - PMCR_EL0.P should reset all the counters when written from EL2
  :
  : Oliver points out that setting PMCR_EL0.N from userspace by writing to
  : the register is silly with NV, and that we need a new PMU attribute
  : instead.
  :
  : On top of that, I figured out that we had a number of little gotchas:
  :
  : - It is possible for a guest to write an HPMN value that is out of
  :   bound, and it seems valuable to limit it
  :
  : - PMCR_EL0.N should be the maximum number of counters when read from
  :   EL2, and MDCR_EL2.HPMN when read from EL0/EL1
  :
  : - Prevent userspace from updating PMCR_EL0.N when EL2 is available"
  : .
  KVM: arm64: Let kvm_vcpu_read_pmcr() return an EL-dependent value for PMCR_EL0.N
  KVM: arm64: Handle out-of-bound write to MDCR_EL2.HPMN
  KVM: arm64: Don't let userspace write to PMCR_EL0.N when the vcpu has EL2
  KVM: arm64: Allow userspace to limit the number of PMU counters for EL2 VMs
  KVM: arm64: Contextualise the handling of PMCR_EL0.P writes
  KVM: arm64: Fix MDCR_EL2.HPMN reset value
  KVM: arm64: Repaint pmcr_n into nr_pmu_counters

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-29 13:38:21 +01:00
Quentin Perret
43c475504a KVM: arm64: Unconditionally cross check hyp state
Now that the hypervisor's state is stored in the hyp_vmemmap, we no
longer need an expensive page-table walk to read it. This means we can
now afford to unconditionally cross-check the hyp state during all
memory ownership transitions where the hyp is involved, hence avoiding
problems such as [1].

[1] https://lore.kernel.org/kvmarm/20241128154406.602875-1-qperret@google.com/

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-8-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Quentin Perret
48d8488823 KVM: arm64: Defer EL2 stage-1 mapping on share
We currently blindly map into EL2 stage-1 *any* page passed to the
__pkvm_host_share_hyp() HVC. This is less than ideal from a security
perspective as it makes exploitation of potential hypervisor gadgets
easier than it should be. But interestingly, pKVM should never need to
access SHARED_BORROWED pages that it hasn't previously pinned, so there
is no need to map the page before that.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-7-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Quentin Perret
3390b3cbb6 KVM: arm64: Move hyp state to hyp_vmemmap
Tracking the hypervisor's ownership state into struct hyp_page has
several benefits, including allowing far more efficient lookups (no
page-table walk needed) and de-correlating the state from the presence
of a mapping. This will later allow mapping pages into EL2 stage-1 less
proactively which is generally a good thing for security. And in the
future this will help with tracking the state of pages mapped into the
hypervisor's private range without requiring an alias into the 'linear
map' range.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-6-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Quentin Perret
ba5b2e5b9d KVM: arm64: Introduce {get,set}_host_state() helpers
Instead of directly accessing the host_state member in struct hyp_page,
introduce static inline accessors to do it. The future hyp_state member
will follow the same pattern as it will need some logic in the accessors.
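
A minimal sketch of the accessor pattern, so later logic lands in
exactly one place (the enum and member names follow the commit text):

  static inline enum pkvm_page_state get_host_state(struct hyp_page *p)
  {
          return p->host_state;
  }

  static inline void set_host_state(struct hyp_page *p,
                                    enum pkvm_page_state state)
  {
          p->host_state = state;
  }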

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-5-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Quentin Perret
cd4b039165 KVM: arm64: Use 0b11 for encoding PKVM_NOPAGE
The page ownership state encoded as 0b11 is currently considered
reserved for future use, and PKVM_NOPAGE uses bit 2. In order to
simplify the relocation of the hyp ownership state into the
vmemmap in later patches, let's use the 'reserved' encoding for
the PKVM_NOPAGE state. The struct hyp_page layout isn't guaranteed
stable at all, so there is no real reason to have 'reserved' encodings.

No functional changes intended.
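
A sketch of the resulting encoding, assuming the usual two state bits
(exact enumerator names may differ from the tree):

  enum pkvm_page_state {
          PKVM_PAGE_OWNED                 = 0,
          PKVM_PAGE_SHARED_OWNED          = BIT(0),
          PKVM_PAGE_SHARED_BORROWED       = BIT(1),
          /* 0b11 was 'reserved'; PKVM_NOPAGE previously used BIT(2) */
          PKVM_NOPAGE                     = BIT(0) | BIT(1),
  };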

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-4-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Quentin Perret
ba637018ca KVM: arm64: Fix pKVM page-tracking comments
Most of the comments relating to pKVM page-tracking in nvhe/memory.h are
now either slightly outdated or outright wrong. Fix the comments.

Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-3-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Fuad Tabba
5db1bef933 KVM: arm64: Track SVE state in the hypervisor vcpu structure
When dealing with a guest with SVE enabled, make sure the host SVE
state is pinned at EL2 S1, and that the hypervisor vCPU state is
correctly initialised (and then unpinned on teardown).

Co-authored-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Paolo Bonzini
5f9e169814 KVM: arm64, x86: make kvm_arch_has_irq_bypass() inline
kvm_arch_has_irq_bypass() is a small function and even though it does
not appear in any *really* hot paths, it's also not entirely rare.
Make it inline---it also works out nicely in preparation for using it in
kvm-intel.ko and kvm-amd.ko, since the function is not currently exported.
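
One plausible shape of the inline version; the actual condition it
returns may differ per architecture:

  /* in a header, so kvm-intel.ko/kvm-amd.ko can use it without an export */
  static inline bool kvm_arch_has_irq_bypass(void)
  {
          return IS_ENABLED(CONFIG_HAVE_KVM_IRQ_BYPASS);
  }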

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:46:58 -04:00
Marc Zyngier
600f6fa5c9 KVM: arm64: Let kvm_vcpu_read_pmcr() return an EL-dependent value for PMCR_EL0.N
When EL2 is present, PMCR_EL0.N is the effective value of MDCR_EL2.HPMN
when accessed from EL1 or EL0.

Make sure we honor this requirement.
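
Sketched logic for the EL-dependent read; the accessor and field-mask
names are illustrative:

  static u64 vcpu_pmcr_n(struct kvm_vcpu *vcpu)
  {
          /* EL2 sees the full complement of counters... */
          if (vcpu_is_el2(vcpu))
                  return vcpu->kvm->arch.nr_pmu_counters;

          /* ...EL0/EL1 see the effective MDCR_EL2.HPMN */
          return FIELD_GET(MDCR_EL2_HPMN, __vcpu_sys_reg(vcpu, MDCR_EL2));
  }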

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 13:08:24 +01:00
Marc Zyngier
efff9dd2fe KVM: arm64: Handle out-of-bound write to MDCR_EL2.HPMN
We don't really pay attention to what gets written to MDCR_EL2.HPMN,
and funky guests could play ugly games on us.

Restrict what gets written there, and limit the number of counters
to what the PMU is allowed to have.
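
A sketch of the sanitisation, clamping the written field to the vPMU's
actual counter count (FIELD_GET()/FIELD_PREP() from <linux/bitfield.h>;
the mask name is illustrative):

  u64 hpmn = FIELD_GET(MDCR_EL2_HPMN, val);

  hpmn = min(hpmn, (u64)vcpu->kvm->arch.nr_pmu_counters);
  val &= ~MDCR_EL2_HPMN;
  val |= FIELD_PREP(MDCR_EL2_HPMN, hpmn);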

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 13:08:24 +01:00
Marc Zyngier
cd84a42c67 KVM: arm64: Don't let userspace write to PMCR_EL0.N when the vcpu has EL2
Now that userspace can provide its limit for the maximum number of
counters, prevent it from writing to PMCR_EL0.N, as the value should
be derived from MDCR_EL2.HPMN in that case.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 13:08:23 +01:00
Marc Zyngier
b7628c7973 KVM: arm64: Allow userspace to limit the number of PMU counters for EL2 VMs
As long as we had purely EL1 VMs, we could easily update the number
of guest-visible counters by letting userspace write to PMCR_EL0.N.

With VMs started at EL2, PMCR_EL0.N only reflects MDCR_EL2.HPMN,
and we don't have a good way to limit it.

For this purpose, introduce a new PMUv3 attribute that allows
limiting the maximum number of counters. This requires the explicit
selection of a PMU.

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 13:08:23 +01:00
Marc Zyngier
0224353343 KVM: arm64: Contextualise the handling of PMCR_EL0.P writes
Contrary to what the comment says in kvm_pmu_handle_pmcr(),
writing PMCR_EL0.P==1 has the following effects:

<quote>
The event counters affected by this field are:
  * All event counters in the first range.
  * If any of the following are true, all event counters in the second
    range:
    - EL2 is disabled or not implemented in the current Security state.
    - The PE is executing at EL2 or EL3.
</quote>

where the "first range" represent the counters in the [0..HPMN-1]
range, and the "second range" the counters in the [HPMN..MAX] range.

It so appears that writing P from EL2 should nuke all counters,
and not just the "guest" view. Just do that, and nuke the misleading
comment.

Reported-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 12:59:10 +01:00
Marc Zyngier
c8823e51b5 KVM: arm64: Fix MDCR_EL2.HPMN reset value
The MDCR_EL2 documentation indicates that the HPMN field has
the following behaviour:

"On a Warm reset, this field resets to the expression NUM_PMU_COUNTERS."

However, it appears we reset it to zero, which is not very useful.

Add a reset helper for MDCR_EL2, and handle the case where userspace
changes the target PMU, which may force us to change HPMN again.

Reported-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 12:58:51 +01:00
Marc Zyngier
f12b54d7c2 KVM: arm64: Repaint pmcr_n into nr_pmu_counters
The pmcr_n field obviously refers to PMCR_EL0.N, but is generally used
as the number of counters seen by the guest. Rename it accordingly.

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 11:54:22 +01:00
Linus Torvalds
0e8863244e ARM:
* Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
   stage-1 page tables) to align with the architecture. This avoids
   possibly taking an SEA at EL2 on the page table walk or using an
   architecturally UNKNOWN fault IPA.
 
 * Use acquire/release semantics in the KVM FF-A proxy to avoid reading
   a stale value for the FF-A version.
 
 * Fix KVM guest driver to match PV CPUID hypercall ABI.
 
 * Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
   selftests, which is the only memory type for which atomic
   instructions are architecturally guaranteed to work.
 
 s390:
 
 * Don't use %pK for debug printing and tracepoints.
 
 x86:
 
 * Use a separate subclass when acquiring KVM's per-CPU posted interrupts
   wakeup lock in the scheduled out path, i.e. when adding a vCPU on
   the list of vCPUs to wake, to work around a false positive deadlock.
   The schedule out code runs with a scheduler lock that the wakeup
   handler takes in the opposite order; but it does so with IRQs disabled
   and cannot run concurrently with a wakeup.
 
 * Explicitly zero-initialize on-stack CPUID unions
 
 * Allow building irqbypass.ko as a module when kvm.ko is a module
 
 * Wrap relatively expensive sanity check with KVM_PROVE_MMU
 
 * Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
 
 selftests:
 
 * Add more scenarios to the MONITOR/MWAIT test.
 
 * Add option to rseq test to override /dev/cpu_dma_latency
 
 * Bring list of exit reasons up to date
 
 * Cleanup Makefile to list once tests that are valid on all architectures
 
 Other:
 
 * Documentation fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmf083IUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroN1dgf/QwfpZcHoMNQSnrc1jMy2LHrArln2
 XfmsOGZTU7kyoLQsLWGAPNocOveGdiemTDsj5ZXoNMnqV8hCBr+tZuv2gWI1rr/o
 kiGerdIgSZ9piTjBlJkVAaOzbWhg2DUnr7qVVzEzFY9+rPNyQ81vgAfU7h56KhYB
 optecozmBrHHAxvQZwmPeL9UyPWFjOF1BY/8LTMx7X+aVuCX6qx1JqO3a3ylAw4J
 tGXv6qFJfuCnu1d1b4X0ILce0iMUTOjQzvTcIm+BKjYycecl+3j1aczC/BOorIgc
 mf0+XeauhcTduK73pirnvx2b05eOxntgkOpwJytO2RP6pE0uK+2Th/C3Qg==
 =ba/Y
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
     stage-1 page tables) to align with the architecture. This avoids
     possibly taking an SEA at EL2 on the page table walk or using an
     architecturally UNKNOWN fault IPA

   - Use acquire/release semantics in the KVM FF-A proxy to avoid
     reading a stale value for the FF-A version

   - Fix KVM guest driver to match PV CPUID hypercall ABI

   - Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
     selftests, which is the only memory type for which atomic
     instructions are architecturally guaranteed to work

  s390:

   - Don't use %pK for debug printing and tracepoints

  x86:

   - Use a separate subclass when acquiring KVM's per-CPU posted
     interrupts wakeup lock in the scheduled out path, i.e. when adding
     a vCPU on the list of vCPUs to wake, to work around a false positive
     deadlock. The schedule out code runs with a scheduler lock that the
     wakeup handler takes in the opposite order; but it does so with
     IRQs disabled and cannot run concurrently with a wakeup

   - Explicitly zero-initialize on-stack CPUID unions

   - Allow building irqbypass.ko as a module when kvm.ko is a module

   - Wrap relatively expensive sanity check with KVM_PROVE_MMU

   - Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses

  selftests:

   - Add more scenarios to the MONITOR/MWAIT test

   - Add option to rseq test to override /dev/cpu_dma_latency

   - Bring list of exit reasons up to date

   - Cleanup Makefile to list once tests that are valid on all
     architectures

  Other:

   - Documentation fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
  KVM: arm64: Use acquire/release to communicate FF-A version negotiation
  KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable
  KVM: arm64: selftests: Introduce and use hardware-definition macros
  KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
  KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
  KVM: x86: Explicitly zero-initialize on-stack CPUID unions
  KVM: Allow building irqbypass.ko as a module when kvm.ko is a module
  KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
  KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency
  KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
  Documentation: kvm: remove KVM_CAP_MIPS_TE
  Documentation: kvm: organize capabilities in the right section
  Documentation: kvm: fix some definition lists
  Documentation: kvm: drop "Capability" heading from capabilities
  Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE
  Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE
  selftests: kvm: list once tests that are valid on all architectures
  selftests: kvm: bring list of exit reasons up to date
  selftests: kvm: revamp MONITOR/MWAIT tests
  KVM: arm64: Don't translate FAR if invalid/unsafe
  ...
2025-04-08 13:47:55 -07:00
Will Deacon
a344e258ac KVM: arm64: Use acquire/release to communicate FF-A version negotiation
The pKVM FF-A proxy rejects FF-A requests other than FFA_VERSION until
version negotiation is complete, which is signalled by setting the
global 'has_version_negotiated' variable.

To avoid excessive locking, this variable is checked directly from
kvm_host_ffa_handler() in response to an FF-A call, but this can race
against another CPU performing the negotiation and potentially lead to
reading a torn value (incredibly unlikely for a 'bool') or problematic
re-ordering of the accesses to 'has_version_negotiated' and
'hyp_ffa_version' whereby a stale version number could be read by
__do_ffa_mem_xfer().

Use acquire/release primitives when writing 'has_version_negotiated'
with the version lock held and when reading without the lock held.
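
The pairing this describes, sketched with the variable names from the
text:

  /* writer, with the version lock held: publish the version first */
  hyp_ffa_version = negotiated_version;
  smp_store_release(&has_version_negotiated, true);

  /* lockless reader: a true flag guarantees the version is visible */
  if (!smp_load_acquire(&has_version_negotiated))
          return false;   /* reject anything but FFA_VERSION */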

Cc: Sebastian Ene <sebastianene@google.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>
Fixes: c9c012625e ("KVM: arm64: Trap FFA_VERSION host call in pKVM")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250407152755.1041-1-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-07 15:03:34 -07:00
Oliver Upton
26fbdf3692 KVM: arm64: Don't translate FAR if invalid/unsafe
Don't re-walk the page tables if an SEA occurred during the faulting
page table walk to avoid taking a fatal exception in the hyp.
Additionally, check that FAR_EL2 is valid for SEAs not taken on PTW
as the architecture doesn't guarantee it contains the fault VA.

Finally, fix up the rest of the abort path by checking for SEAs early
and bugging the VM if we get further along with an UNKNOWN fault IPA.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250402201725.2963645-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-03 00:28:51 -07:00
Oliver Upton
1cf3e126f1 arm64: Convert HPFAR_EL2 to sysreg table
Switch over to the typical sysreg table for HPFAR_EL2 as we're about to
start using more fields in the register.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250402201725.2963645-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-03 00:28:51 -07:00
Oliver Upton
fb8a3eba9c KVM: arm64: Only read HPFAR_EL2 when value is architecturally valid
KVM's logic for deciding when HPFAR_EL2 is UNKNOWN doesn't align with
the architecture. Most notably, KVM assumes HPFAR_EL2 contains the
faulting IPA even in the case of an SEA.

Align the logic with the architecture rather than attempting to
paraphrase it. Additionally, take the opportunity to improve the
language around ARM erratum #834220 such that it actually describes the
bug.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250402201725.2963645-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-03 00:28:51 -07:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was found to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively minor cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found when
   using device-exclusive PTE entries when THP is enabled.  More work is
   needed, but this makes things better - our own HMM selftests now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed remove the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DAMON/DAMOS to filter by the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while running our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatory cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallelization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documentation is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permitting the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatory refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarray split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively minor cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found when using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes things better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DAMON/DAMOS to filter by the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while running our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatory cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallelization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documentation
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permitting the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatory refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarray split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Linus Torvalds
edb0e8f6e2 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Nested virtualization support for VGICv3, giving the nested
     hypervisor control of the VGIC hardware when running an L2 VM

   - Removal of 'late' nested virtualization feature register masking,
     making the supported feature set directly visible to userspace

   - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
     of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers

   - Paravirtual interface for discovering the set of CPU
     implementations where a VM may run, addressing a longstanding issue
     of guest CPU errata awareness in big-little systems and
     cross-implementation VM migration

   - Userspace control of the registers responsible for identifying a
     particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
     allowing VMs to be migrated cross-implementation

   - pKVM updates, including support for tracking stage-2 page table
     allocations in the protected hypervisor in the 'SecPageTable' stat

   - Fixes to vPMU, ensuring that userspace updates to the vPMU after
     KVM_RUN are reflected into the backing perf events

  LoongArch:

   - Remove unnecessary header include path

   - Assume constant PGD during VM context switch

   - Add perf events support for guest VM

  RISC-V:

   - Disable the kernel perf counter during configure

   - KVM selftests improvements for PMU

   - Fix warning at the time of KVM module removal

  x86:

   - Add support for aging of SPTEs without holding mmu_lock.

     Not taking mmu_lock allows multiple aging actions to run in
     parallel, and more importantly avoids stalling vCPUs. This includes
     an implementation of per-rmap-entry locking; aging the gfn is done
     with only a per-rmap single-bit spinlock taken, whereas locking an
     rmap for write requires taking both the per-rmap spinlock and the
     mmu_lock (a sketch of this locking scheme follows this message).

     Note that this decreases slightly the accuracy of accessed-page
     information, because changes to the SPTE outside aging might not
     use atomic operations even if they could race against a clear of
     the Accessed bit.

     This is deliberate because KVM and mm/ tolerate false
     positives/negatives for accessed information, and testing has shown
     that reducing the latency of aging is far more beneficial to
     overall system performance than providing "perfect" young/old
     information.

   - Defer runtime CPUID updates until KVM emulates a CPUID instruction,
     to coalesce updates when multiple pieces of vCPU state are
     changing, e.g. as part of a nested transition

   - Fix a variety of nested emulation bugs, and add VMX support for
     synthesizing nested VM-Exit on interception (instead of injecting
     #UD into L2)

   - Drop "support" for async page faults for protected guests that do
     not set SEND_ALWAYS (i.e. that only want async page faults at CPL3)

   - Bring a bit of sanity to x86's VM teardown code, which has
     accumulated a lot of cruft over the years. Particularly, destroy
     vCPUs before the MMU, despite the latter being a VM-wide operation

   - Add common secure TSC infrastructure for use within SNP and in the
     future TDX

   - Block KVM_CAP_SYNC_REGS if guest state is protected. It does not
     make sense to use the capability if the relevant registers are not
     available for reading or writing

   - Don't take kvm->lock when iterating over vCPUs in the suspend
     notifier to fix a largely theoretical deadlock

   - Use the vCPU's actual Xen PV clock information when starting the
     Xen timer, as the cached state in arch.hv_clock can be stale/bogus

   - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across
     different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as
     KVM's suspend notifier only accounts for kvmclock, and there's no
     evidence that the flag is actually supported by Xen guests

   - Clean up the per-vCPU "cache" of its reference pvclock, and instead
     only track the vCPU's TSC scaling (multiplier+shift) metadata (which
     is moderately expensive to compute, and rarely changes for modern
     setups)

   - Don't write to the Xen hypercall page on MSR writes that are
     initiated by the host (userspace or KVM) to fix a class of bugs
     where KVM can write to guest memory at unexpected times, e.g.
     during vCPU creation if userspace has set the Xen hypercall MSR
     index to collide with an MSR that KVM emulates

   - Restrict the Xen hypercall MSR index to the unofficial synthetic
     range to reduce the set of possible collisions with MSRs that are
     emulated by KVM (collisions can still happen as KVM emulates
     Hyper-V MSRs, which also reside in the synthetic range)

   - Clean up and optimize KVM's handling of Xen MSR writes and
     xen_hvm_config

   - Update Xen TSC leaves during CPUID emulation instead of modifying
     the CPUID entries when updating PV clocks; there is no guarantee PV
     clocks will be updated between TSC frequency changes and CPUID
     emulation, and guest reads of the TSC leaves should be rare, i.e.
     are not a hot path

  x86 (Intel):

   - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and
     thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1

   - Pass XFD_ERR as the payload when injecting #NM, as a preparatory
     step for upcoming FRED virtualization support

   - Decouple the EPT entry RWX protection bit macros from the EPT
     Violation bits, both as a general cleanup and in anticipation of
     adding support for emulating Mode-Based Execution Control (MBEC)

   - Reject KVM_RUN if userspace manages to gain control and stuff
     invalid guest state while KVM is in the middle of emulating nested
     VM-Enter

   - Add a macro to handle KVM's sanity checks on entry/exit VMCS
     control pairs in anticipation of adding sanity checks for secondary
     exit controls (the primary field is out of bits)

  x86 (AMD):

   - Ensure the PSP driver is initialized when both the PSP and KVM
     modules are built-in (the initcall framework doesn't handle
     dependencies)

   - Use long-term pins when registering encrypted memory regions, so
     that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and
     don't lead to excessive fragmentation

   - Add macros and helpers for setting GHCB return/error codes

   - Add support for Idle HLT interception, which elides interception if
     the vCPU has a pending, unmasked virtual IRQ when HLT is executed

   - Fix a bug in INVPCID emulation where KVM fails to check for a
     non-canonical address

   - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is
     invalid, e.g. because the vCPU was "destroyed" via SNP's AP
     Creation hypercall

   - Reject SNP AP Creation if the requested SEV features for the vCPU
     don't match the VM's configured set of features

  Selftests:

   - Fix again the Intel PMU counters test; add a data load and do
     CLFLUSH{OPT} on the data instead of executing code. The theory is
     that modern Intel CPUs have learned new code prefetching tricks
     that bypass the PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that an
     event is counting correctly without actually knowing what the event
     counts on the underlying hardware

   - Fix a variety of flaws, bugs, and false failures/passes in
     dirty_log_test, and improve its coverage by collecting all dirty
     entries on each iteration

   - Fix a few minor bugs related to handling of stats FDs

   - Add infrastructure to make vCPU and VM stats FDs available to tests
     by default (open the FDs during VM/vCPU creation)

   - Relax an assertion on the number of HLT exits in the xAPIC IPI test
     when running on a CPU that supports AMD's Idle HLT (which elides
     interception of HLT if a virtual IRQ is pending and unmasked)"
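
The per-rmap "single-bit spinlock" mentioned in the x86 aging item can be
pictured roughly as follows (an illustrative sketch under assumed names,
not KVM's actual code):

  #define KVM_RMAP_LOCKED	BIT(0)

  static unsigned long kvm_rmap_lock(unsigned long *rmap)
  {
  	unsigned long old;

  	/* Spin until bit 0 of the rmap head is clear, then claim it. */
  	do {
  		old = READ_ONCE(*rmap) & ~KVM_RMAP_LOCKED;
  	} while (cmpxchg(rmap, old, old | KVM_RMAP_LOCKED) != old);

  	return old;	/* the rmap head value, sans lock bit */
  }

  static void kvm_rmap_unlock(unsigned long *rmap, unsigned long new)
  {
  	/* Publish the (possibly updated) head and release the lock bit. */
  	smp_store_release(rmap, new & ~KVM_RMAP_LOCKED);
  }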

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits)
  RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed
  RISC-V: KVM: Teardown riscv specific bits after kvm_exit
  LoongArch: KVM: Register perf callbacks for guest
  LoongArch: KVM: Implement arch-specific functions for guest perf
  LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel()
  LoongArch: KVM: Remove PGD saving during VM context switch
  LoongArch: KVM: Remove unnecessary header include path
  KVM: arm64: Tear down vGIC on failed vCPU creation
  KVM: arm64: PMU: Reload when resetting
  KVM: arm64: PMU: Reload when user modifies registers
  KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
  KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
  KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
  KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
  KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
  KVM: arm64: Initialize HCRX_EL2 traps in pKVM
  KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
  KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected
  KVM: x86: Add infrastructure for secure TSC
  KVM: x86: Push down setting vcpu.arch.user_set_tsc
  ...
2025-03-25 14:22:07 -07:00
Linus Torvalds
2d09a9449e Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
 "Nothing major this time around.

   Apart from the usual perf/PMU updates and some page table cleanups,
   the notable features are average CPU frequency based on the AMUv1
  counters, CONFIG_HOTPLUG_SMT and MOPS instructions (memcpy/memset) in
  the uaccess routines.

  Perf and PMUs:

   - Support for the 'Rainier' CPU PMU from Arm

   - Preparatory driver changes and cleanups that pave the way for BRBE
     support

   - Support for partial virtualisation of the Apple-M1 PMU

   - Support for the second event filter in Arm CSPMU designs

   - Minor fixes and cleanups (CMN and DWC PMUs)

   - Enable EL2 requirements for FEAT_PMUv3p9

  Power, CPU topology:

   - Support for AMUv1-based average CPU frequency

   - Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It
     adds a generic topology_is_primary_thread() function overridden by
     x86 and powerpc

  New(ish) features:

   - MOPS (memcpy/memset) support for the uaccess routines

  Security/confidential compute:

   - Fix the DMA address for devices used in Realms with Arm CCA. The
     CCA architecture uses the address bit to differentiate between
     shared and private addresses

   - Spectre-BHB: assume CPUs Linux doesn't know about are vulnerable by
     default

  Memory management clean-ups:

   - Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs

   - Some minor page table accessor clean-ups

   - PIE/POE (permission indirection/overlay) helpers clean-up

  Kselftests:

   - MTE: skip hugetlb tests if MTE is not supported on such mappings
     and use correct naming for sync/async tag checking modes

  Miscellaneous:

   - Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people
     request)

   - Sysreg updates for new register fields

   - CPU type info for some Qualcomm Kryo cores"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
  arm64: mm: Don't use %pK through printk
  perf/arm_cspmu: Fix missing io.h include
  arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
  arm64: cputype: Add MIDR_CORTEX_A76AE
  arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
  arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
  arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
  arm64/sysreg: Enforce whole word match for open/close tokens
  arm64/sysreg: Fix unbalanced closing block
  arm64: Kconfig: Enable HOTPLUG_SMT
  arm64: topology: Support SMT control on ACPI based system
  arch_topology: Support SMT control for OF based system
  cpu/SMT: Provide a default topology_is_primary_thread()
  arm64/mm: Define PTDESC_ORDER
  perf/arm_cspmu: Add PMEVFILT2R support
  perf/arm_cspmu: Generalise event filtering
  perf/arm_cspmu: Move register definitions to header
  arm64/kernel: Always use level 2 or higher for early mappings
  arm64/mm: Drop PXD_TABLE_BIT
  arm64/mm: Check pmd_table() in pmd_trans_huge()
  ...
2025-03-25 13:16:16 -07:00
Catalin Marinas
8cc14fdcc1 Merge branches 'for-next/amuv1-avg-freq', 'for-next/pkey_unrestricted', 'for-next/sysreg', 'for-next/misc', 'for-next/pgtable-cleanups', 'for-next/kselftest', 'for-next/uaccess-mops', 'for-next/pie-poe-cleanup', 'for-next/cputype-kryo', 'for-next/cca-dma-address', 'for-next/drop-pxd_table_bit' and 'for-next/spectre-bhb-assume-vulnerable', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
  perf/arm_cspmu: Fix missing io.h include
  perf/arm_cspmu: Add PMEVFILT2R support
  perf/arm_cspmu: Generalise event filtering
  perf/arm_cspmu: Move register definitions to header
  drivers/perf: apple_m1: Support host/guest event filtering
  drivers/perf: apple_m1: Refactor event select/filter configuration
  perf/dwc_pcie: fix duplicate pci_dev devices
  perf/dwc_pcie: fix some unreleased resources
  perf/arm-cmn: Minor event type housekeeping
  perf: arm_pmu: Move PMUv3-specific data
  perf: apple_m1: Don't disable counter in m1_pmu_enable_event()
  perf: arm_v7_pmu: Don't disable counter in (armv7|krait_|scorpion_)pmu_enable_event()
  perf: arm_v7_pmu: Drop obvious comments for enabling/disabling counters and interrupts
  perf: arm_pmuv3: Don't disable counter in armv8pmu_enable_event()
  perf: arm_pmu: Don't disable counter in armpmu_add()
  perf: arm_pmuv3: Call kvm_vcpu_pmu_resync_el0() before enabling counters
  perf: arm_pmuv3: Add support for ARM Rainier PMU

* for-next/amuv1-avg-freq:
  : Add support for AArch64 AMUv1-based average freq
  arm64: Utilize for_each_cpu_wrap for reference lookup
  arm64: Update AMU-based freq scale factor on entering idle
  arm64: Provide an AMU-based version of arch_freq_get_on_cpu
  cpufreq: Introduce an optional cpuinfo_avg_freq sysfs entry
  cpufreq: Allow arch_freq_get_on_cpu to return an error
  arch_topology: init capacity_freq_ref to 0

* for-next/pkey_unrestricted:
  : mm/pkey: Add PKEY_UNRESTRICTED macro
  selftest/powerpc/mm/pkey: fix build-break introduced by commit 00894c3fc9
  selftests/powerpc: Use PKEY_UNRESTRICTED macro
  selftests/mm: Use PKEY_UNRESTRICTED macro
  mm/pkey: Add PKEY_UNRESTRICTED macro

* for-next/sysreg:
  : arm64 sysreg updates
  arm64/sysreg: Enforce whole word match for open/close tokens
  arm64/sysreg: Fix unbalanced closing block
  arm64/sysreg: Add register fields for HFGWTR2_EL2
  arm64/sysreg: Add register fields for HFGRTR2_EL2
  arm64/sysreg: Add register fields for HFGITR2_EL2
  arm64/sysreg: Add register fields for HDFGWTR2_EL2
  arm64/sysreg: Add register fields for HDFGRTR2_EL2
  arm64/sysreg: Update register fields for ID_AA64MMFR0_EL1

* for-next/misc:
  : Miscellaneous arm64 patches
  arm64: mm: Don't use %pK through printk
  arm64/fpsimd: Remove unused declaration fpsimd_kvm_prepare()

* for-next/pgtable-cleanups:
  : arm64 pgtable accessors cleanup
  arm64/mm: Define PTDESC_ORDER
  arm64/kernel: Always use level 2 or higher for early mappings
  arm64/hugetlb: Consistently use pud_sect_supported()
  arm64/mm: Convert __pte_to_phys() and __phys_to_pte_val() as functions

* for-next/kselftest:
  : arm64 kselftest updates
  kselftest/arm64: mte: Skip the hugetlb tests if MTE not supported on such mappings
  kselftest/arm64: mte: Use the correct naming for tag check modes in check_hugetlb_options.c

* for-next/uaccess-mops:
  : Implement the uaccess memory copy/set using MOPS instructions
  arm64: lib: Use MOPS for usercopy routines
  arm64: mm: Handle PAN faults on uaccess CPY* instructions
  arm64: extable: Add fixup handling for uaccess CPY* instructions

* for-next/pie-poe-cleanup:
  : PIE/POE helpers cleanup
  arm64/sysreg: Move POR_EL0_INIT to asm/por.h
  arm64/sysreg: Rename POE_RXW to POE_RWX
  arm64/sysreg: Improve PIR/POR helpers

* for-next/cputype-kryo:
  : Add cputype info for some Qualcomm Kryo cores
  arm64: cputype: Add comments about Qualcomm Kryo 5XX and 6XX cores
  arm64: cputype: Add QCOM_CPU_PART_KRYO_3XX_GOLD

* for-next/cca-dma-address:
  : Fix DMA address for devices used in realms with Arm CCA
  arm64: realm: Use aliased addresses for device DMA to shared buffers
  dma: Introduce generic dma_addr_*crypted helpers
  dma: Fix encryption bit clearing for dma_to_phys

* for-next/drop-pxd_table_bit:
  : Drop the arm64 PXD_TABLE_BIT (clean-up in preparation for 128-bit PTEs)
  arm64/mm: Drop PXD_TABLE_BIT
  arm64/mm: Check pmd_table() in pmd_trans_huge()
  arm64/mm: Check PUD_TYPE_TABLE in pud_bad()
  arm64/mm: Check PXD_TYPE_TABLE in [p4d|pgd]_bad()
  arm64/mm: Clear PXX_TYPE_MASK and set PXD_TYPE_SECT in [pmd|pud]_mkhuge()
  arm64/mm: Clear PXX_TYPE_MASK in mk_[pmd|pud]_sect_prot()
  arm64/ptdump: Test PMD_TYPE_MASK for block mapping
  KVM: arm64: ptdump: Test PMD_TYPE_MASK for block mapping

* for-next/spectre-bhb-assume-vulnerable:
  : Rework Spectre BHB mitigations to not assume "safe"
  arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
  arm64: cputype: Add MIDR_CORTEX_A76AE
  arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
  arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
  arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
2025-03-25 19:32:03 +00:00
Linus Torvalds
a50b4fe095 Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"
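
A minimal before/after sketch of the conversion (the 'foo' timer and
callback names here are made up):

  /* Before: two-step initialization */
  hrtimer_init(&foo->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
  foo->timer.function = foo_timer_fn;

  /* After: one-step setup */
  hrtimer_setup(&foo->timer, foo_timer_fn, CLOCK_MONOTONIC,
  	      HRTIMER_MODE_REL);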

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Oliver Upton
369c012268 Merge branch 'kvm-arm64/pmu-fixes' into kvmarm/next
* kvm-arm64/pmu-fixes:
  : vPMU fixes for 6.15 courtesy of Akihiko Odaki
  :
  : Various fixes to KVM's vPMU implementation, notably ensuring
  : userspace-directed changes to the PMCs are reflected in the backing perf
  : events.
  KVM: arm64: PMU: Reload when resetting
  KVM: arm64: PMU: Reload when user modifies registers
  KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
  KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
  KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:54:52 -07:00
Oliver Upton
ca19dd4323 Merge branch 'kvm-arm64/pkvm-6.15' into kvmarm/next
* kvm-arm64/pkvm-6.15:
  : pKVM updates for 6.15
  :
  :  - SecPageTable stats for stage-2 table pages allocated by the protected
  :    hypervisor (Vincent Donnefort)
  :
  :  - HCRX_EL2 trap + vCPU initialization fixes for pKVM (Fuad Tabba)
  KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
  KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
  KVM: arm64: Initialize HCRX_EL2 traps in pKVM
  KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
  KVM: arm64: Count pKVM stage-2 usage in secondary pagetable stats
  KVM: arm64: Distinct pKVM teardown memcache for stage-2
  KVM: arm64: Add flags to kvm_hyp_memcache

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:54:40 -07:00
Oliver Upton
4f2774c57a Merge branch 'kvm-arm64/writable-midr' into kvmarm/next
* kvm-arm64/writable-midr:
  : Writable implementation ID registers, courtesy of Sebastian Ott
  :
  : Introduce a new capability that allows userspace to set the
  : ID registers that identify a CPU implementation: MIDR_EL1, REVIDR_EL1,
  : and AIDR_EL1. Also plug a hole in KVM's trap configuration where
  : SMIDR_EL1 was readable at EL1, despite the fact that KVM does not
  : support SME.
  KVM: arm64: Fix documentation for KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
  KVM: arm64: Copy MIDR_EL1 into hyp VM when it is writable
  KVM: arm64: Copy guest CTR_EL0 into hyp VM
  KVM: selftests: arm64: Test writes to MIDR,REVIDR,AIDR
  KVM: arm64: Allow userspace to change the implementation ID registers
  KVM: arm64: Load VPIDR_EL2 with the VM's MIDR_EL1 value
  KVM: arm64: Maintain per-VM copy of implementation ID regs
  KVM: arm64: Set HCR_EL2.TID1 unconditionally
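
For context, the new behavior is opt-in; a hedged userspace sketch of
enabling the capability (error handling and VM setup elided):

  struct kvm_enable_cap cap = {
  	.cap = KVM_CAP_ARM_WRITABLE_IMP_ID_REGS,
  };

  /* Enable on the VM fd before any vCPUs are created. */
  if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
  	err(1, "KVM_ENABLE_CAP");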

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:54:32 -07:00
Oliver Upton
1b1d1b17b8 Merge branch 'kvm-arm64/pmuv3-asahi' into kvmarm/next
* kvm-arm64/pmuv3-asahi:
  : Support PMUv3 for KVM guests on Apple silicon
  :
  : Take advantage of some IMPLEMENTATION DEFINED traps available on Apple
  : parts to trap-and-emulate the PMUv3 registers on behalf of a KVM guest.
  : Constrain the vPMU to a cycle counter and single event counter, as the
  : Apple PMU has events that cannot be counted on every counter.
  :
  : There is a small new interface between the ARM PMU driver and KVM, where
  : the PMU driver owns the PMUv3 -> hardware event mappings.
  arm64: Enable IMP DEF PMUv3 traps on Apple M*
  KVM: arm64: Provide 1 event counter on IMPDEF hardware
  drivers/perf: apple_m1: Provide helper for mapping PMUv3 events
  KVM: arm64: Remap PMUv3 events onto hardware
  KVM: arm64: Advertise PMUv3 if IMPDEF traps are present
  KVM: arm64: Compute synthetic sysreg ESR for Apple PMUv3 traps
  KVM: arm64: Move PMUVer filtering into KVM code
  KVM: arm64: Use guard() to cleanup usage of arm_pmus_lock
  KVM: arm64: Drop kvm_arm_pmu_available static key
  KVM: arm64: Use a cpucap to determine if system supports FEAT_PMUv3
  KVM: arm64: Always support SW_INCR PMU event
  KVM: arm64: Compute PMCEID from arm_pmu's event bitmaps
  drivers/perf: apple_m1: Support host/guest event filtering
  drivers/perf: apple_m1: Refactor event select/filter configuration

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:54:23 -07:00
Oliver Upton
d300b0168e Merge branch 'kvm-arm64/pv-cpuid' into kvmarm/next
* kvm-arm64/pv-cpuid:
  : Paravirtualized implementation ID, courtesy of Shameer Kolothum
  :
  : Big-little has historically been a pain in the ass to virtualize. The
  : implementation ID (MIDR, REVIDR, AIDR) of a vCPU can change at the whim
  : of vCPU scheduling. This can be particularly annoying when the guest
  : needs to know the underlying implementation to mitigate errata.
  :
  : "Hyperscalers" face a similar scheduling problem, where VMs may freely
  : migrate between hosts in a pool of heterogeneous hardware. And yes, our
  : server-class friends are equally riddled with errata too.
  :
  : In absence of an architected solution to this wart on the ecosystem,
  : introduce support for paravirtualizing the implementation exposed
  : to a VM, allowing the VMM to describe the pool of implementations that a
  : VM may be exposed to due to scheduling/migration.
  :
  : Userspace is expected to intercept and handle these hypercalls using the
  : SMCCC filter UAPI, should it choose to do so.
  smccc: kvm_guest: Fix kernel builds for 32 bit arm
  KVM: selftests: Add test for KVM_REG_ARM_VENDOR_HYP_BMAP_2
  smccc/kvm_guest: Enable errata based on implementation CPUs
  arm64: Make _midr_in_range_list() an exported function
  KVM: arm64: Introduce KVM_REG_ARM_VENDOR_HYP_BMAP_2
  KVM: arm64: Specify hypercall ABI for retrieving target implementations
  arm64: Modify _midr_range() functions to read MIDR/REVIDR internally

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:53:16 -07:00
Oliver Upton
13f64f6d21 Merge branch 'kvm-arm64/nv-idregs' into kvmarm/next
* kvm-arm64/nv-idregs:
  : Changes to exposure of NV features, courtesy of Marc Zyngier
  :
  : Apply NV-specific feature restrictions at reset rather than at the point
  : of KVM_RUN. This makes the true feature set visible to userspace, a
  : necessary step towards save/restore support of NV VMs.
  :
  : Add an additional vCPU feature flag for selecting the E2H0 flavor of NV,
  : such that the VHE-ness of the VM can be applied to the feature set.
  KVM: arm64: selftests: Test that TGRAN*_2 fields are writable
  KVM: arm64: Allow userspace to write ID_AA64MMFR0_EL1.TGRAN*_2
  KVM: arm64: Advertise FEAT_ECV when possible
  KVM: arm64: Make ID_AA64MMFR4_EL1.NV_frac writable
  KVM: arm64: Allow userspace to limit NV support to nVHE
  KVM: arm64: Move NV-specific capping to idreg sanitisation
  KVM: arm64: Enforce NV limits on a per-idregs basis
  KVM: arm64: Make ID_REG_LIMIT_FIELD_ENUM() more widely available
  KVM: arm64: Consolidate idreg callbacks
  KVM: arm64: Advertise NV2 in the boot messages
  KVM: arm64: Mark HCR.EL2.{NV*,AT} RES0 when ID_AA64MMFR4_EL1.NV_frac is 0
  KVM: arm64: Mark HCR.EL2.E2H RES0 when ID_AA64MMFR1_EL1.VH is zero
  KVM: arm64: Hide ID_AA64MMFR2_EL1.NV from guest and userspace
  arm64: cpufeature: Handle NV_frac as a synonym of NV2

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:52:26 -07:00
Oliver Upton
56e3e5c8f7 Merge branch 'kvm-arm64/nv-vgic' into kvmarm/next
* kvm-arm64/nv-vgic:
  : NV VGICv3 support, courtesy of Marc Zyngier
  :
  : Support for emulating the GIC hypervisor controls and managing shadow
  : VGICv3 state for the L1 hypervisor. As part of it, bring in support for
  : taking IRQs to the L1 and UAPI to manage the VGIC maintenance interrupt.
  KVM: arm64: nv: Fail KVM init if asking for NV without GICv3
  KVM: arm64: nv: Allow userland to set VGIC maintenance IRQ
  KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup
  KVM: arm64: nv: Propagate used_lrs between L1 and L0 contexts
  KVM: arm64: nv: Request vPE doorbell upon nested ERET to L2
  KVM: arm64: nv: Respect virtual HCR_EL2.TWx setting
  KVM: arm64: nv: Add Maintenance Interrupt emulation
  KVM: arm64: nv: Handle L2->L1 transition on interrupt injection
  KVM: arm64: nv: Nested GICv3 emulation
  KVM: arm64: nv: Sanitise ICH_HCR_EL2 accesses
  KVM: arm64: nv: Plumb handling of GICv3 EL2 accesses
  KVM: arm64: nv: Add ICH_*_EL2 registers to vcpu_sysreg
  KVM: arm64: nv: Load timer before the GIC
  arm64: sysreg: Add layout for ICH_MISR_EL2
  arm64: sysreg: Add layout for ICH_VTR_EL2
  arm64: sysreg: Add layout for ICH_HCR_EL2

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:51:43 -07:00
Oliver Upton
3ed0dc03f6 Merge branch 'kvm-arm64/misc' into kvmarm/next
* kvm-arm64/misc:
  : Miscellaneous fixes/cleanups for KVM/arm64
  :
  :  - Avoid GICv4 vLPI configuration when confronted with user error
  :
  :  - Only attempt vLPI configuration when the target routing is an MSI
  :
  :  - Document ordering requirements to avoid aforementioned user error
  KVM: arm64: Tear down vGIC on failed vCPU creation
  KVM: arm64: Document ordering requirements for irqbypass
  KVM: arm64: vgic-v4: Fall back to software irqbypass if LPI not found
  KVM: arm64: vgic-v4: Only WARN for HW IRQ mismatch when unmapping vLPI
  KVM: arm64: vgic-v4: Only attempt vLPI mapping for actual MSIs

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:51:15 -07:00
Will Deacon
250f25367b KVM: arm64: Tear down vGIC on failed vCPU creation
If kvm_arch_vcpu_create() fails to share the vCPU page with the
hypervisor, we propagate the error back to the ioctl but leave the
vGIC vCPU data initialised. Not only does this leak the corresponding
memory when the vCPU is destroyed, but it can also lead to a
use-after-free if the redistributor device handling tries to walk into
the vCPU.

Add the missing cleanup to kvm_arch_vcpu_create(), ensuring that the
vGIC vCPU structures are destroyed on error.
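
A hedged sketch of the resulting error path, assuming the failing share
step is kvm_share_hyp() as in kvm_arch_vcpu_create() (illustrative, not
the literal diff):

  err = kvm_share_hyp(vcpu, vcpu + 1);
  if (err)
  	kvm_vgic_vcpu_destroy(vcpu);	/* the missing cleanup */

  return err;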

Cc: <stable@vger.kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250314133409.9123-1-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 23:30:51 -07:00
Akihiko Odaki
fe53538069 KVM: arm64: PMU: Reload when resetting
Replace kvm_pmu_vcpu_reset() with the generic PMU reloading mechanism to
ensure consistency with system registers and to reduce code size.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-5-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:45:25 -07:00
Akihiko Odaki
1db4aaa055 KVM: arm64: PMU: Reload when user modifies registers
Commit d0c94c4979 ("KVM: arm64: Restore PMU configuration on first
run") added the code to reload the PMU configuration on first run.

It is also important to keep the correct state even if system registers
are modified after first run, specifically when debugging Windows on
QEMU with GDB; QEMU tries to write back all visible registers when
resuming the VM execution with GDB, corrupting the PMU state. Windows
always uses the PMU so this can cause adverse effects on that particular
OS.

The usual register writes and reset are already handled independently,
but register writes from userspace are not covered.
Trigger the code to reload the PMU configuration for them instead, so
that PMU configuration changes made by users are also applied after the
first run.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-4-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:45:25 -07:00
Akihiko Odaki
64074ca8ca KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
Reload the perf event when setting the vPMU counter (vPMC) registers
(PMCCNTR_EL0 and PMEVCNTR<n>_EL0). This is a change corresponding to
commit 9228b26194 ("KVM: arm64: PMU: Fix GET_ONE_REG
for vPMC regs to return the current value") but for SET_ONE_REG.

Values of vPMC registers are saved in sysreg files on certain occasions.
These saved values don't represent the current values of the vPMC
registers if the perf events for the vPMCs count events after the save.
The current values of those registers are the sum of the sysreg file
value and the current perf event counter value.  But, when userspace
writes those registers (using KVM_SET_ONE_REG), KVM only updates the
sysreg file value and leaves the current perf event counter value as is.

It is also important to keep the correct state even if userspace writes
them after first run, specifically when debugging Windows on QEMU with
GDB; QEMU tries to write back all visible registers when resuming the VM
execution with GDB, corrupting the PMU state. Windows always uses the
PMU so this can cause adverse effects on that particular OS.

Fix this by releasing the current perf event and triggering the
recreation of a new one with KVM_REQ_RELOAD_PMU.
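
A hedged sketch of the pattern (the helper name is made up; the request
macro is real):

  static int set_pmu_evcntr(struct kvm_vcpu *vcpu, u64 reg, u64 val)
  {
  	__vcpu_sys_reg(vcpu, reg) = val;

  	/* Drop the stale perf event; a fresh one is created on reload. */
  	kvm_make_request(KVM_REQ_RELOAD_PMU, vcpu);
  	return 0;
  }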

Fixes: 051ff581ce ("arm64: KVM: Add access handler for event counter register")
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-3-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:45:21 -07:00
Akihiko Odaki
be5ccac3f1 KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
Many functions in pmu-emul.c check kvm_vcpu_has_pmu(vcpu). A favorable
interpretation is defensive programming, but it also has downsides:

- It is confusing as it implies these functions are called without PMU
  although most of them are called only when a PMU is present.

- It makes semantics of functions fuzzy. For example, calling
  kvm_pmu_disable_counter_mask() without a PMU may result in a no-op as
  there are no enabled counters, but it's unclear what
  kvm_pmu_get_counter_value() returns when there is no PMU.

- It allows calls from callers that haven't checked
  kvm_vcpu_has_pmu(vcpu), but it is often wrong to call these functions
  without a PMU.

- It is error-prone to duplicate kvm_vcpu_has_pmu(vcpu) checks into
  multiple functions. Many functions are called for system registers,
  and the system register infrastructure already employs less
  error-prone, comprehensive checks.

Check kvm_vcpu_has_pmu(vcpu) in callers of these functions instead,
and remove the obsolete checks from pmu-emul.c. The only exceptions are
the functions that implement ioctls as they have definitive semantics
even when the PMU is not present.
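
A sketch of the caller-side pattern this moves to (illustrative;
undef_access() is the usual sys_regs.c rejection helper):

  if (!kvm_vcpu_has_pmu(vcpu))
  	return undef_access(vcpu, p, r);

  kvm_pmu_handle_pmcr(vcpu, val);	/* pmu-emul.c may now assume a PMU */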

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-2-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:42:22 -07:00
Akihiko Odaki
f2aeb7bbd5 KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
Commit a45f41d754 ("KVM: arm64: Add {get,set}_user for
PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}") changed KVM_SET_ONE_REG to update
the mentioned registers in a way matching the behavior of guest
register writes. This is a breaking UAPI change; although the new
semantics look cleaner, VMMs are not prepared for it.

Firecracker, QEMU, and crosvm perform migration by listing registers
with KVM_GET_REG_LIST, getting their values with KVM_GET_ONE_REG and
setting them with KVM_SET_ONE_REG. This algorithm assumes
KVM_SET_ONE_REG restores the values retrieved with KVM_GET_ONE_REG
without any alteration. However, bit operations added by the earlier
commit do not preserve the values retrieved with KVM_GET_ONE_REG and
potentially break migration.

Remove the bit operations that alter the values retrieved with
KVM_GET_ONE_REG.
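
The invariant these VMMs rely on, as a small userspace sketch:

  uint64_t val;
  struct kvm_one_reg reg = { .id = id, .addr = (uintptr_t)&val };

  ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);	/* save on the source */
  ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);	/* restore on the target */
  /* KVM must not alter the effect of 'val': set(get(x)) == x */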

Cc: stable@vger.kernel.org
Fixes: a45f41d754 ("KVM: arm64: Add {get,set}_user for PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}")
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-1-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:42:22 -07:00
Anshuman Khandual
f9aad62200 mm: rename GENERIC_PTDUMP and PTDUMP_CORE
Platforms subscribe to the generic ptdump implementation via
GENERIC_PTDUMP, but generic ptdump gets enabled via PTDUMP_CORE. This
combination of configs is confusing, as the names sound very similar and
do not differentiate between a platform's feature subscription and
feature enablement for ptdump. Rename the configs to ARCH_HAS_PTDUMP and
PTDUMP, making the distinction clearer and improving readability.
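
A sketch of what the rename means for a platform's Kconfig (illustrative):

  # Before: subscription and enablement sound alike
  select GENERIC_PTDUMP
  # After: ARCH_HAS_PTDUMP marks subscription; PTDUMP enables the feature
  select ARCH_HAS_PTDUMP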

Link: https://lkml.kernel.org/r/20250226122404.1927473-6-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 00:05:32 -07:00
Fuad Tabba
1eab115486 KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
Instead of creating and initializing _all_ hyp vcpus in pKVM when
the first host vcpu runs for the first time, initialize _each_
hyp vcpu in conjunction with its corresponding host vcpu.

Some of the host vcpu state (e.g., system register and trap
values) is not initialized until the first time the host vcpu is
run. Therefore, initializing a hyp vcpu before its corresponding
host vcpu has run for the first time might give it an incomplete
view of the host state of these vcpus.

Additionally, this behavior is in line with non-protected modes.

Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20250314111832.4137161-5-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 16:06:03 -07:00
Fuad Tabba
8b21fb47c7 KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
Move the code that creates and initializes the hyp view of a vcpu
in pKVM to its own function. This is meant to make the transition
to initializing every vcpu individually clearer.

Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20250314111832.4137161-4-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 16:04:23 -07:00
Fuad Tabba
066daa8d3b KVM: arm64: Initialize HCRX_EL2 traps in pKVM
Initialize and set the traps controlled by the HCRX_EL2 in pKVM.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20250314111832.4137161-3-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 16:00:49 -07:00
Fuad Tabba
44f979bf43 KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
Factor out the code for setting a vcpu's HCRX_EL2 traps into a
separate inline function. This allows us to share the logic with
pKVM when setting the traps in protected mode.

No functional change intended.
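
A heavily hedged sketch of what the factored-out helper might look like
(the name vcpu_set_hcrx() and the use of HCRX_GUEST_FLAGS are
assumptions, not the literal patch):

  static inline void vcpu_set_hcrx(struct kvm_vcpu *vcpu)
  {
  	u64 hcrx = HCRX_GUEST_FLAGS;

  	/* OR in feature-conditional trap bits here, e.g. for FEAT_MOPS. */
  	vcpu->arch.hcrx_el2 = hcrx;
  }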

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20250314111832.4137161-2-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 16:00:49 -07:00
Vincent Donnefort
79ea662315 KVM: arm64: Count pKVM stage-2 usage in secondary pagetable stats
Count the pages used by pKVM for the guest stage-2 in the memory stats
under secondary pagetable, similarly to what VHE mode does.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250313114038.1502357-4-vdonnefort@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 00:56:30 -07:00
Vincent Donnefort
8c0d7d14c5 KVM: arm64: Distinct pKVM teardown memcache for stage-2
In order to account for memory dedicated to the stage-2 page-tables, use
a separate memcache when tearing down the VM. Meanwhile, rename
reclaim_guest_pages to reflect the fact that it only reclaims page-table
pages.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250313114038.1502357-3-vdonnefort@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 00:56:29 -07:00
Vincent Donnefort
cf2d228da9 KVM: arm64: Add flags to kvm_hyp_memcache
Add flags to kvm_hyp_memcache and propagate them to the allocation
and free callbacks. This will later allow memory to be accounted for
based on the memcache configuration.
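
A sketch of the shape of the change (existing fields and the placement of
the new one are assumptions):

  struct kvm_hyp_memcache {
  	phys_addr_t head;
  	unsigned long nr_pages;
  	unsigned long flags;	/* new: consulted by alloc/free callbacks */
  };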

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250313114038.1502357-2-vdonnefort@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 00:56:29 -07:00
Sebastian Ott
3f1e072753 KVM: arm64: Allow userspace to write ID_AA64MMFR0_EL1.TGRAN*_2
Allow userspace to write the safe (NI) value for ID_AA64MMFR0_EL1.TGRAN*_2.
Disallow changing these fields for NV since KVM provides a sanitized view
of them based on PAGE_SIZE.

Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/kvmarm/20250306184013.30008-1-sebott@redhat.com/
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-12 13:39:50 -07:00
Anshuman Khandual
0b626b245c KVM: arm64: ptdump: Test PMD_TYPE_MASK for block mapping
Test the given page table entries against PMD_TYPE_SECT using the
PMD_TYPE_MASK bits to identify block mappings in stage-2 page tables.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.linux.dev
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20250221044227.1145393-2-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-03-12 12:20:59 +00:00
Oliver Upton
1b92e65f50 KVM: arm64: Provide 1 event counter on IMPDEF hardware
PMUv3 requires that all programmable event counters are capable of
counting any event. The Apple M* PMU is quite a bit different, and
events have affinities for particular PMCs.

Expose 1 event counter on IMPDEF hardware, allowing the guest to do
something useful with its PMU while also upholding the requirements of
the architecture.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305203021.428366-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:30 -07:00
Oliver Upton
1e7dcbfa4b KVM: arm64: Remap PMUv3 events onto hardware
Map PMUv3 event IDs onto hardware, if the driver exposes such a helper.
This is expected to be quite rare, and only useful for non-PMUv3 hardware.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-12-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Oliver Upton
bed9b8ec8c KVM: arm64: Advertise PMUv3 if IMPDEF traps are present
Advertise a baseline PMUv3 implementation when running on hardware with
IMPDEF traps of the PMUv3 sysregs.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-11-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Oliver Upton
2c433f70dc KVM: arm64: Compute synthetic sysreg ESR for Apple PMUv3 traps
Apple M* CPUs provide an IMPDEF trap for PMUv3 sysregs, where ESR_EL2.EC
is a reserved value (0x3F) and a sysreg-like ISS is reported in
AFSR1_EL2.

Compute a synthetic ESR for these PMUv3 traps, giving the illusion of
something architectural to the rest of KVM.
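
A hedged sketch of the synthesis (illustrative, not the literal patch):

  /* EC is a reserved IMPDEF value; the sysreg-like ISS is in AFSR1_EL2. */
  u64 iss = read_sysreg_s(SYS_AFSR1_EL2) & ESR_ELx_ISS_MASK;
  u64 esr = FIELD_PREP(ESR_ELx_EC_MASK, ESR_ELx_EC_SYS64) | iss;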

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-10-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Oliver Upton
56290316a4 KVM: arm64: Move PMUVer filtering into KVM code
The supported guest PMU version on a particular platform is ultimately a
KVM decision. Move PMUVer filtering into KVM code.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-9-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00