Commit Graph

3144 Commits

Author SHA1 Message Date
Quentin Perret
cd4b039165 KVM: arm64: Use 0b11 for encoding PKVM_NOPAGE
The page ownership state encoded as 0b11 is currently considered
reserved for future use, and PKVM_NOPAGE uses bit 2. In order to
simplify the relocation of the hyp ownership state into the
vmemmap in later patches, let's use the 'reserved' encoding for
the PKVM_NOPAGE state. The struct hyp_page layout isn't guaranteed
stable at all, so there is no real reason to have 'reserved' encodings.

No functional changes intended.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-4-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Quentin Perret
ba637018ca KVM: arm64: Fix pKVM page-tracking comments
Most of the comments relating to pKVM page-tracking in nvhe/memory.h are
now either slightly outdated or outright wrong. Fix the comments.

Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-3-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Fuad Tabba
5db1bef933 KVM: arm64: Track SVE state in the hypervisor vcpu structure
When dealing with a guest with SVE enabled, make sure the host SVE
state is pinned at EL2 S1, and that the hypervisor vCPU state is
correctly initialised (and then unpinned on teardown).

Co-authored-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-28 09:23:46 +01:00
Paolo Bonzini
5f9e169814 KVM: arm64, x86: make kvm_arch_has_irq_bypass() inline
kvm_arch_has_irq_bypass() is a small function and even though it does
not appear in any *really* hot paths, it's also not entirely rare.
Make it inline---it also works out nicely in preparation for using it in
kvm-intel.ko and kvm-amd.ko, since the function is not currently exported.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 09:46:58 -04:00
Marc Zyngier
600f6fa5c9 KVM: arm64: Let kvm_vcpu_read_pmcr() return an EL-dependent value for PMCR_EL0.N
When EL2 is present, PMCR_EL0.N is the effective value of MDCR_EL2.HPMN
when accessed from EL1 or EL0.

Make sure we honor this requirement.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 13:08:24 +01:00
Marc Zyngier
efff9dd2fe KVM: arm64: Handle out-of-bound write to MDCR_EL2.HPMN
We don't really pay attention to what gets written to MDCR_EL2.HPMN,
and funky guests could play ugly games on us.

Restrict what gets written there, and limit the number of counters
to what the PMU is allowed to have.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 13:08:24 +01:00
Marc Zyngier
cd84a42c67 KVM: arm64: Don't let userspace write to PMCR_EL0.N when the vcpu has EL2
Now that userspace can provide its limit for hte maximum number of
counters, prevent it from writing to PMCR_EL0.N, as the value should
be derived from MDCR_EL2.HPMN in that case.

Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 13:08:23 +01:00
Marc Zyngier
b7628c7973 KVM: arm64: Allow userspace to limit the number of PMU counters for EL2 VMs
As long as we had purely EL1 VMs, we could easily update the number
of guest-visible counters by letting userspace write to PMCR_EL0.N.

With VMs started at EL2, PMCR_EL1.N only reflects MDCR_EL2.HPMN,
and we don't have a good way to limit it.

For this purpose, introduce a new PMUv3 attribute that allows
limiting the maximum number of counters. This requires the explicit
selection of a PMU.

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 13:08:23 +01:00
Marc Zyngier
0224353343 KVM: arm64: Contextualise the handling of PMCR_EL0.P writes
Contrary to what the comment says in kvm_pmu_handle_pmcr(),
writing PMCR_EL0.P==1 has the following effects:

<quote>
The event counters affected by this field are:
  * All event counters in the first range.
  * If any of the following are true, all event counters in the second
    range:
    - EL2 is disabled or not implemented in the current Security state.
    - The PE is executing at EL2 or EL3.
</quote>

where the "first range" represent the counters in the [0..HPMN-1]
range, and the "second range" the counters in the [HPMN..MAX] range.

It so appears that writing P from EL2 should nuke all counters,
and not just the "guest" view. Just do that, and nuke the misleading
comment.

Reported-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 12:59:10 +01:00
Marc Zyngier
c8823e51b5 KVM: arm64: Fix MDCR_EL2.HPMN reset value
The MDCR_EL2 documentation indicates that the HPMN field has
the following behaviour:

"On a Warm reset, this field resets to the expression NUM_PMU_COUNTERS."

However, it appears we reset it to zero, which is not very useful.

Add a reset helper for MDCR_EL2, and handle the case where userspace
changes the target PMU, which may force us to change HPMN again.

Reported-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 12:58:51 +01:00
Marc Zyngier
f12b54d7c2 KVM: arm64: Repaint pmcr_n into nr_pmu_counters
The pmcr_n field obviously refers to PMCR_EL0.N, but is generally used
as the number of counters seen by the guest. Rename it accordingly.

Suggested-by: Oliver upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-04-11 11:54:22 +01:00
Linus Torvalds
0e8863244e ARM:
* Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
   stage-1 page tables) to align with the architecture. This avoids
   possibly taking an SEA at EL2 on the page table walk or using an
   architecturally UNKNOWN fault IPA.
 
 * Use acquire/release semantics in the KVM FF-A proxy to avoid reading
   a stale value for the FF-A version.
 
 * Fix KVM guest driver to match PV CPUID hypercall ABI.
 
 * Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
   selftests, which is the only memory type for which atomic
   instructions are architecturally guaranteed to work.
 
 s390:
 
 * Don't use %pK for debug printing and tracepoints.
 
 x86:
 
 * Use a separate subclass when acquiring KVM's per-CPU posted interrupts
   wakeup lock in the scheduled out path, i.e. when adding a vCPU on
   the list of vCPUs to wake, to workaround a false positive deadlock.
   The schedule out code runs with a scheduler lock that the wakeup
   handler takes in the opposite order; but it does so with IRQs disabled
   and cannot run concurrently with a wakeup.
 
 * Explicitly zero-initialize on-stack CPUID unions
 
 * Allow building irqbypass.ko as as module when kvm.ko is a module
 
 * Wrap relatively expensive sanity check with KVM_PROVE_MMU
 
 * Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
 
 selftests:
 
 * Add more scenarios to the MONITOR/MWAIT test.
 
 * Add option to rseq test to override /dev/cpu_dma_latency
 
 * Bring list of exit reasons up to date
 
 * Cleanup Makefile to list once tests that are valid on all architectures
 
 Other:
 
 * Documentation fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmf083IUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroN1dgf/QwfpZcHoMNQSnrc1jMy2LHrArln2
 XfmsOGZTU7kyoLQsLWGAPNocOveGdiemTDsj5ZXoNMnqV8hCBr+tZuv2gWI1rr/o
 kiGerdIgSZ9piTjBlJkVAaOzbWhg2DUnr7qVVzEzFY9+rPNyQ81vgAfU7h56KhYB
 optecozmBrHHAxvQZwmPeL9UyPWFjOF1BY/8LTMx7X+aVuCX6qx1JqO3a3ylAw4J
 tGXv6qFJfuCnu1d1b4X0ILce0iMUTOjQzvTcIm+BKjYycecl+3j1aczC/BOorIgc
 mf0+XeauhcTduK73pirnvx2b05eOxntgkOpwJytO2RP6pE0uK+2Th/C3Qg==
 =ba/Y
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
     stage-1 page tables) to align with the architecture. This avoids
     possibly taking an SEA at EL2 on the page table walk or using an
     architecturally UNKNOWN fault IPA

   - Use acquire/release semantics in the KVM FF-A proxy to avoid
     reading a stale value for the FF-A version

   - Fix KVM guest driver to match PV CPUID hypercall ABI

   - Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
     selftests, which is the only memory type for which atomic
     instructions are architecturally guaranteed to work

  s390:

   - Don't use %pK for debug printing and tracepoints

  x86:

   - Use a separate subclass when acquiring KVM's per-CPU posted
     interrupts wakeup lock in the scheduled out path, i.e. when adding
     a vCPU on the list of vCPUs to wake, to workaround a false positive
     deadlock. The schedule out code runs with a scheduler lock that the
     wakeup handler takes in the opposite order; but it does so with
     IRQs disabled and cannot run concurrently with a wakeup

   - Explicitly zero-initialize on-stack CPUID unions

   - Allow building irqbypass.ko as as module when kvm.ko is a module

   - Wrap relatively expensive sanity check with KVM_PROVE_MMU

   - Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses

  selftests:

   - Add more scenarios to the MONITOR/MWAIT test

   - Add option to rseq test to override /dev/cpu_dma_latency

   - Bring list of exit reasons up to date

   - Cleanup Makefile to list once tests that are valid on all
     architectures

  Other:

   - Documentation fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
  KVM: arm64: Use acquire/release to communicate FF-A version negotiation
  KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable
  KVM: arm64: selftests: Introduce and use hardware-definition macros
  KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
  KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
  KVM: x86: Explicitly zero-initialize on-stack CPUID unions
  KVM: Allow building irqbypass.ko as as module when kvm.ko is a module
  KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
  KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency
  KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
  Documentation: kvm: remove KVM_CAP_MIPS_TE
  Documentation: kvm: organize capabilities in the right section
  Documentation: kvm: fix some definition lists
  Documentation: kvm: drop "Capability" heading from capabilities
  Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE
  Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE
  selftests: kvm: list once tests that are valid on all architectures
  selftests: kvm: bring list of exit reasons up to date
  selftests: kvm: revamp MONITOR/MWAIT tests
  KVM: arm64: Don't translate FAR if invalid/unsafe
  ...
2025-04-08 13:47:55 -07:00
Will Deacon
a344e258ac KVM: arm64: Use acquire/release to communicate FF-A version negotiation
The pKVM FF-A proxy rejects FF-A requests other than FFA_VERSION until
version negotiation is complete, which is signalled by setting the
global 'has_version_negotiated' variable.

To avoid excessive locking, this variable is checked directly from
kvm_host_ffa_handler() in response to an FF-A call, but this can race
against another CPU performing the negotiation and potentially lead to
reading a torn value (incredibly unlikely for a 'bool') or problematic
re-ordering of the accesses to 'has_version_negotiated' and
'hyp_ffa_version' whereby a stale version number could be read by
__do_ffa_mem_xfer().

Use acquire/release primitives when writing 'has_version_negotiated'
with the version lock held and when reading without the lock held.

Cc: Sebastian Ene <sebastianene@google.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>
Fixes: c9c012625e ("KVM: arm64: Trap FFA_VERSION host call in pKVM")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250407152755.1041-1-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-07 15:03:34 -07:00
Oliver Upton
26fbdf3692 KVM: arm64: Don't translate FAR if invalid/unsafe
Don't re-walk the page tables if an SEA occurred during the faulting
page table walk to avoid taking a fatal exception in the hyp.
Additionally, check that FAR_EL2 is valid for SEAs not taken on PTW
as the architecture doesn't guarantee it contains the fault VA.

Finally, fix up the rest of the abort path by checking for SEAs early
and bugging the VM if we get further along with an UNKNOWN fault IPA.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250402201725.2963645-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-03 00:28:51 -07:00
Oliver Upton
1cf3e126f1 arm64: Convert HPFAR_EL2 to sysreg table
Switch over to the typical sysreg table for HPFAR_EL2 as we're about to
start using more fields in the register.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250402201725.2963645-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-03 00:28:51 -07:00
Oliver Upton
fb8a3eba9c KVM: arm64: Only read HPFAR_EL2 when value is architecturally valid
KVM's logic for deciding when HPFAR_EL2 is UNKNOWN doesn't align with
the architecture. Most notably, KVM assumes HPFAR_EL2 contains the
faulting IPA even in the case of an SEA.

Align the logic with the architecture rather than attempting to
paraphrase it. Additionally, take the opportunity to improve the
language around ARM erratum #834220 such that it actually describes the
bug.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250402201725.2963645-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-04-03 00:28:51 -07:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was founf to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively monir cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found then
   using device-exclusive PTE entries when THP is enabled.  More work is
   needed, but this makes thins better - our own HMM selftests now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed remove the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while runnimg our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatoty cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documention is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatoty refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarry split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively monir cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found then using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes thins better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while runnimg our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatoty cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documention
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatoty refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Linus Torvalds
edb0e8f6e2 ARM:
* Nested virtualization support for VGICv3, giving the nested
 hypervisor control of the VGIC hardware when running an L2 VM
 
 * Removal of 'late' nested virtualization feature register masking,
   making the supported feature set directly visible to userspace
 
 * Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
   of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers
 
 * Paravirtual interface for discovering the set of CPU implementations
   where a VM may run, addressing a longstanding issue of guest CPU
   errata awareness in big-little systems and cross-implementation VM
   migration
 
 * Userspace control of the registers responsible for identifying a
   particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
   allowing VMs to be migrated cross-implementation
 
 * pKVM updates, including support for tracking stage-2 page table
   allocations in the protected hypervisor in the 'SecPageTable' stat
 
 * Fixes to vPMU, ensuring that userspace updates to the vPMU after
   KVM_RUN are reflected into the backing perf events
 
 LoongArch:
 
 * Remove unnecessary header include path
 
 * Assume constant PGD during VM context switch
 
 * Add perf events support for guest VM
 
 RISC-V:
 
 * Disable the kernel perf counter during configure
 
 * KVM selftests improvements for PMU
 
 * Fix warning at the time of KVM module removal
 
 x86:
 
 * Add support for aging of SPTEs without holding mmu_lock.  Not taking mmu_lock
   allows multiple aging actions to run in parallel, and more importantly avoids
   stalling vCPUs.  This includes an implementation of per-rmap-entry locking;
   aging the gfn is done with only a per-rmap single-bin spinlock taken, whereas
   locking an rmap for write requires taking both the per-rmap spinlock and
   the mmu_lock.
 
   Note that this decreases slightly the accuracy of accessed-page information,
   because changes to the SPTE outside aging might not use atomic operations
   even if they could race against a clear of the Accessed bit.  This is
   deliberate because KVM and mm/ tolerate false positives/negatives for
   accessed information, and testing has shown that reducing the latency of
   aging is far more beneficial to overall system performance than providing
   "perfect" young/old information.
 
 * Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
   coalesce updates when multiple pieces of vCPU state are changing, e.g. as
   part of a nested transition.
 
 * Fix a variety of nested emulation bugs, and add VMX support for synthesizing
   nested VM-Exit on interception (instead of injecting #UD into L2).
 
 * Drop "support" for async page faults for protected guests that do not set
   SEND_ALWAYS (i.e. that only want async page faults at CPL3)
 
 * Bring a bit of sanity to x86's VM teardown code, which has accumulated
   a lot of cruft over the years.  Particularly, destroy vCPUs before
   the MMU, despite the latter being a VM-wide operation.
 
 * Add common secure TSC infrastructure for use within SNP and in the
   future TDX
 
 * Block KVM_CAP_SYNC_REGS if guest state is protected.  It does not make
   sense to use the capability if the relevant registers are not
   available for reading or writing.
 
 * Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
   fix a largely theoretical deadlock.
 
 * Use the vCPU's actual Xen PV clock information when starting the Xen timer,
   as the cached state in arch.hv_clock can be stale/bogus.
 
 * Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
   PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend
   notifier only accounts for kvmclock, and there's no evidence that the
   flag is actually supported by Xen guests.
 
 * Clean up the per-vCPU "cache" of its reference pvclock, and instead only
   track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately
   expensive to compute, and rarely changes for modern setups).
 
 * Don't write to the Xen hypercall page on MSR writes that are initiated by
   the host (userspace or KVM) to fix a class of bugs where KVM can write to
   guest memory at unexpected times, e.g. during vCPU creation if userspace has
   set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
 
 * Restrict the Xen hypercall MSR index to the unofficial synthetic range to
   reduce the set of possible collisions with MSRs that are emulated by KVM
   (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
   in the synthetic range).
 
 * Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
 
 * Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
   entries when updating PV clocks; there is no guarantee PV clocks will be
   updated between TSC frequency changes and CPUID emulation, and guest reads
   of the TSC leaves should be rare, i.e. are not a hot path.
 
 x86 (Intel):
 
 * Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus
   modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1.
 
 * Pass XFD_ERR as the payload when injecting #NM, as a preparatory step
   for upcoming FRED virtualization support.
 
 * Decouple the EPT entry RWX protection bit macros from the EPT Violation
   bits, both as a general cleanup and in anticipation of adding support for
   emulating Mode-Based Execution Control (MBEC).
 
 * Reject KVM_RUN if userspace manages to gain control and stuff invalid guest
   state while KVM is in the middle of emulating nested VM-Enter.
 
 * Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs
   in anticipation of adding sanity checks for secondary exit controls (the
   primary field is out of bits).
 
 x86 (AMD):
 
 * Ensure the PSP driver is initialized when both the PSP and KVM modules are
   built-in (the initcall framework doesn't handle dependencies).
 
 * Use long-term pins when registering encrypted memory regions, so that the
   pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to
   excessive fragmentation.
 
 * Add macros and helpers for setting GHCB return/error codes.
 
 * Add support for Idle HLT interception, which elides interception if the vCPU
   has a pending, unmasked virtual IRQ when HLT is executed.
 
 * Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical
   address.
 
 * Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g.
   because the vCPU was "destroyed" via SNP's AP Creation hypercall.
 
 * Reject SNP AP Creation if the requested SEV features for the vCPU don't
   match the VM's configured set of features.
 
 Selftests:
 
 * Fix again the Intel PMU counters test; add a data load and do CLFLUSH{OPT} on the data
   instead of executing code.  The theory is that modern Intel CPUs have
   learned new code prefetching tricks that bypass the PMU counters.
 
 * Fix a flaw in the Intel PMU counters test where it asserts that an event is
   counting correctly without actually knowing what the event counts on the
   underlying hardware.
 
 * Fix a variety of flaws, bugs, and false failures/passes dirty_log_test, and
   improve its coverage by collecting all dirty entries on each iteration.
 
 * Fix a few minor bugs related to handling of stats FDs.
 
 * Add infrastructure to make vCPU and VM stats FDs available to tests by
   default (open the FDs during VM/vCPU creation).
 
 * Relax an assertion on the number of HLT exits in the xAPIC IPI test when
   running on a CPU that supports AMD's Idle HLT (which elides interception of
   HLT if a virtual IRQ is pending and unmasked).
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmfcTkEUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMnQAf/cPx72hJOdNy4Qrm8M33YLXVRVV00
 yEZ8eN8TWdOclr0ltE/w/ELGh/qS4CU8pjURAk0A6lPioU+mdcTn3dPEqMDMVYom
 uOQ2lusEHw0UuSnGZSEjvZJsE/Ro2NSAsHIB6PWRqig1ZBPJzyu0frce34pMpeQH
 diwriJL9lKPAhBWXnUQ9BKoi1R0P5OLW9ahX4SOWk7cAFg4DLlDE66Nqf6nKqViw
 DwEucTiUEg5+a3d93gihdD4JNl+fb3vI2erxrMxjFjkacl0qgqRu3ei3DG0MfdHU
 wNcFSG5B1n0OECKxr80lr1Ip1KTVNNij0Ks+w6Gc6lSg9c4PptnNkfLK3A==
 =nnCN
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Nested virtualization support for VGICv3, giving the nested
     hypervisor control of the VGIC hardware when running an L2 VM

   - Removal of 'late' nested virtualization feature register masking,
     making the supported feature set directly visible to userspace

   - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
     of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers

   - Paravirtual interface for discovering the set of CPU
     implementations where a VM may run, addressing a longstanding issue
     of guest CPU errata awareness in big-little systems and
     cross-implementation VM migration

   - Userspace control of the registers responsible for identifying a
     particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
     allowing VMs to be migrated cross-implementation

   - pKVM updates, including support for tracking stage-2 page table
     allocations in the protected hypervisor in the 'SecPageTable' stat

   - Fixes to vPMU, ensuring that userspace updates to the vPMU after
     KVM_RUN are reflected into the backing perf events

  LoongArch:

   - Remove unnecessary header include path

   - Assume constant PGD during VM context switch

   - Add perf events support for guest VM

  RISC-V:

   - Disable the kernel perf counter during configure

   - KVM selftests improvements for PMU

   - Fix warning at the time of KVM module removal

  x86:

   - Add support for aging of SPTEs without holding mmu_lock.

     Not taking mmu_lock allows multiple aging actions to run in
     parallel, and more importantly avoids stalling vCPUs. This includes
     an implementation of per-rmap-entry locking; aging the gfn is done
     with only a per-rmap single-bin spinlock taken, whereas locking an
     rmap for write requires taking both the per-rmap spinlock and the
     mmu_lock.

     Note that this decreases slightly the accuracy of accessed-page
     information, because changes to the SPTE outside aging might not
     use atomic operations even if they could race against a clear of
     the Accessed bit.

     This is deliberate because KVM and mm/ tolerate false
     positives/negatives for accessed information, and testing has shown
     that reducing the latency of aging is far more beneficial to
     overall system performance than providing "perfect" young/old
     information.

   - Defer runtime CPUID updates until KVM emulates a CPUID instruction,
     to coalesce updates when multiple pieces of vCPU state are
     changing, e.g. as part of a nested transition

   - Fix a variety of nested emulation bugs, and add VMX support for
     synthesizing nested VM-Exit on interception (instead of injecting
     #UD into L2)

   - Drop "support" for async page faults for protected guests that do
     not set SEND_ALWAYS (i.e. that only want async page faults at CPL3)

   - Bring a bit of sanity to x86's VM teardown code, which has
     accumulated a lot of cruft over the years. Particularly, destroy
     vCPUs before the MMU, despite the latter being a VM-wide operation

   - Add common secure TSC infrastructure for use within SNP and in the
     future TDX

   - Block KVM_CAP_SYNC_REGS if guest state is protected. It does not
     make sense to use the capability if the relevant registers are not
     available for reading or writing

   - Don't take kvm->lock when iterating over vCPUs in the suspend
     notifier to fix a largely theoretical deadlock

   - Use the vCPU's actual Xen PV clock information when starting the
     Xen timer, as the cached state in arch.hv_clock can be stale/bogus

   - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across
     different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as
     KVM's suspend notifier only accounts for kvmclock, and there's no
     evidence that the flag is actually supported by Xen guests

   - Clean up the per-vCPU "cache" of its reference pvclock, and instead
     only track the vCPU's TSC scaling (multipler+shift) metadata (which
     is moderately expensive to compute, and rarely changes for modern
     setups)

   - Don't write to the Xen hypercall page on MSR writes that are
     initiated by the host (userspace or KVM) to fix a class of bugs
     where KVM can write to guest memory at unexpected times, e.g.
     during vCPU creation if userspace has set the Xen hypercall MSR
     index to collide with an MSR that KVM emulates

   - Restrict the Xen hypercall MSR index to the unofficial synthetic
     range to reduce the set of possible collisions with MSRs that are
     emulated by KVM (collisions can still happen as KVM emulates
     Hyper-V MSRs, which also reside in the synthetic range)

   - Clean up and optimize KVM's handling of Xen MSR writes and
     xen_hvm_config

   - Update Xen TSC leaves during CPUID emulation instead of modifying
     the CPUID entries when updating PV clocks; there is no guarantee PV
     clocks will be updated between TSC frequency changes and CPUID
     emulation, and guest reads of the TSC leaves should be rare, i.e.
     are not a hot path

  x86 (Intel):

   - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and
     thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1

   - Pass XFD_ERR as the payload when injecting #NM, as a preparatory
     step for upcoming FRED virtualization support

   - Decouple the EPT entry RWX protection bit macros from the EPT
     Violation bits, both as a general cleanup and in anticipation of
     adding support for emulating Mode-Based Execution Control (MBEC)

   - Reject KVM_RUN if userspace manages to gain control and stuff
     invalid guest state while KVM is in the middle of emulating nested
     VM-Enter

   - Add a macro to handle KVM's sanity checks on entry/exit VMCS
     control pairs in anticipation of adding sanity checks for secondary
     exit controls (the primary field is out of bits)

  x86 (AMD):

   - Ensure the PSP driver is initialized when both the PSP and KVM
     modules are built-in (the initcall framework doesn't handle
     dependencies)

   - Use long-term pins when registering encrypted memory regions, so
     that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and
     don't lead to excessive fragmentation

   - Add macros and helpers for setting GHCB return/error codes

   - Add support for Idle HLT interception, which elides interception if
     the vCPU has a pending, unmasked virtual IRQ when HLT is executed

   - Fix a bug in INVPCID emulation where KVM fails to check for a
     non-canonical address

   - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is
     invalid, e.g. because the vCPU was "destroyed" via SNP's AP
     Creation hypercall

   - Reject SNP AP Creation if the requested SEV features for the vCPU
     don't match the VM's configured set of features

  Selftests:

   - Fix again the Intel PMU counters test; add a data load and do
     CLFLUSH{OPT} on the data instead of executing code. The theory is
     that modern Intel CPUs have learned new code prefetching tricks
     that bypass the PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that an
     event is counting correctly without actually knowing what the event
     counts on the underlying hardware

   - Fix a variety of flaws, bugs, and false failures/passes
     dirty_log_test, and improve its coverage by collecting all dirty
     entries on each iteration

   - Fix a few minor bugs related to handling of stats FDs

   - Add infrastructure to make vCPU and VM stats FDs available to tests
     by default (open the FDs during VM/vCPU creation)

   - Relax an assertion on the number of HLT exits in the xAPIC IPI test
     when running on a CPU that supports AMD's Idle HLT (which elides
     interception of HLT if a virtual IRQ is pending and unmasked)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits)
  RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed
  RISC-V: KVM: Teardown riscv specific bits after kvm_exit
  LoongArch: KVM: Register perf callbacks for guest
  LoongArch: KVM: Implement arch-specific functions for guest perf
  LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel()
  LoongArch: KVM: Remove PGD saving during VM context switch
  LoongArch: KVM: Remove unnecessary header include path
  KVM: arm64: Tear down vGIC on failed vCPU creation
  KVM: arm64: PMU: Reload when resetting
  KVM: arm64: PMU: Reload when user modifies registers
  KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
  KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
  KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
  KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
  KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
  KVM: arm64: Initialize HCRX_EL2 traps in pKVM
  KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
  KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected
  KVM: x86: Add infrastructure for secure TSC
  KVM: x86: Push down setting vcpu.arch.user_set_tsc
  ...
2025-03-25 14:22:07 -07:00
Linus Torvalds
2d09a9449e arm64 updates for 6.15:
Perf and PMUs:
 
  - Support for the "Rainier" CPU PMU from Arm
 
  - Preparatory driver changes and cleanups that pave the way for BRBE
    support
 
  - Support for partial virtualisation of the Apple-M1 PMU
 
  - Support for the second event filter in Arm CSPMU designs
 
  - Minor fixes and cleanups (CMN and DWC PMUs)
 
  - Enable EL2 requirements for FEAT_PMUv3p9
 
 Power, CPU topology:
 
  - Support for AMUv1-based average CPU frequency
 
  - Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It adds
    a generic topology_is_primary_thread() function overridden by x86 and
    powerpc
 
 New(ish) features:
 
  - MOPS (memcpy/memset) support for the uaccess routines
 
 Security/confidential compute:
 
  - Fix the DMA address for devices used in Realms with Arm CCA. The
    CCA architecture uses the address bit to differentiate between shared
    and private addresses
 
  - Spectre-BHB: assume CPUs Linux doesn't know about vulnerable by
    default
 
 Memory management clean-ups:
 
  - Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs
 
  - Some minor page table accessor clean-ups
 
  - PIE/POE (permission indirection/overlay) helpers clean-up
 
 Kselftests:
 
  - MTE: skip hugetlb tests if MTE is not supported on such mappings and
    user correct naming for sync/async tag checking modes
 
 Miscellaneous:
 
  - Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people
    request)
 
  - Sysreg updates for new register fields
 
  - CPU type info for some Qualcomm Kryo cores
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmfjB2QACgkQa9axLQDI
 XvGrfg//W3Bx9+jw1G/XHHEQqGEVFmvltvxZUkvgV0Qki0rPSMnappJhZRL9n0Nm
 V6PvGd2KoKHZuL3g5ViZb3cs2R9BiD2JB6PncwBKuxumHGh3vz3kk1JMkDVfWdHv
 qAceOckFJD9rXjPZn+PDsfYiEi2i3RRWIP5VglZ14ue8j3prHQ6DJXLUQF2GYvzE
 /bgLSq44wp5N59ddy23+qH9rxrHzz3bgpbVv/F56W/LErvE873mRmyFwiuGJm+M0
 Pn8ra572rI6a4sgSwrMTeNPBU+F9o5AbqwauVhkz428RdMvgfEuW6qHUBnGWJDmt
 HotXmu+4Eb2KJks/iQkDo4OTJ38yUqvvZZJtP171ms3E4yqESSJngWP6O2A6LF+y
 xhe0sESF/Ew6jLhM6/hvOmBcE2AyB14JE3ymqLkXbWub4NXddBn2AF1WXFjF4CBw
 F8KSUhNLekrCYKv1k9M3nhvkcpoS9FkTF/TI+zEg546alI/GLPih6uDRkgMAODh1
 RDJYixHsf2NDDRQbfwvt9Xua/KKpDF6qNkHLA4OiqqVUwh1hkas24Lrnp8vmce4o
 wIpWCLqYWey8Rl3XWuWgWz2Xu58fHH4Dl2k72Z8I0pwp3abCDa9xEj79G0Svk7Si
 Q+FCYrNlpKee1RXBC+1MUD/Gl5r/28dEUFkAzPD80F7AgafXPd0=
 =Kc9c
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "Nothing major this time around.

  Apart from the usual perf/PMU updates, some page table cleanups, the
  notable features are average CPU frequency based on the AMUv1
  counters, CONFIG_HOTPLUG_SMT and MOPS instructions (memcpy/memset) in
  the uaccess routines.

  Perf and PMUs:

   - Support for the 'Rainier' CPU PMU from Arm

   - Preparatory driver changes and cleanups that pave the way for BRBE
     support

   - Support for partial virtualisation of the Apple-M1 PMU

   - Support for the second event filter in Arm CSPMU designs

   - Minor fixes and cleanups (CMN and DWC PMUs)

   - Enable EL2 requirements for FEAT_PMUv3p9

  Power, CPU topology:

   - Support for AMUv1-based average CPU frequency

   - Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It
     adds a generic topology_is_primary_thread() function overridden by
     x86 and powerpc

  New(ish) features:

   - MOPS (memcpy/memset) support for the uaccess routines

  Security/confidential compute:

   - Fix the DMA address for devices used in Realms with Arm CCA. The
     CCA architecture uses the address bit to differentiate between
     shared and private addresses

   - Spectre-BHB: assume CPUs Linux doesn't know about vulnerable by
     default

  Memory management clean-ups:

   - Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs

   - Some minor page table accessor clean-ups

   - PIE/POE (permission indirection/overlay) helpers clean-up

  Kselftests:

   - MTE: skip hugetlb tests if MTE is not supported on such mappings
     and user correct naming for sync/async tag checking modes

  Miscellaneous:

   - Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people
     request)

   - Sysreg updates for new register fields

   - CPU type info for some Qualcomm Kryo cores"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
  arm64: mm: Don't use %pK through printk
  perf/arm_cspmu: Fix missing io.h include
  arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
  arm64: cputype: Add MIDR_CORTEX_A76AE
  arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
  arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
  arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
  arm64/sysreg: Enforce whole word match for open/close tokens
  arm64/sysreg: Fix unbalanced closing block
  arm64: Kconfig: Enable HOTPLUG_SMT
  arm64: topology: Support SMT control on ACPI based system
  arch_topology: Support SMT control for OF based system
  cpu/SMT: Provide a default topology_is_primary_thread()
  arm64/mm: Define PTDESC_ORDER
  perf/arm_cspmu: Add PMEVFILT2R support
  perf/arm_cspmu: Generalise event filtering
  perf/arm_cspmu: Move register definitons to header
  arm64/kernel: Always use level 2 or higher for early mappings
  arm64/mm: Drop PXD_TABLE_BIT
  arm64/mm: Check pmd_table() in pmd_trans_huge()
  ...
2025-03-25 13:16:16 -07:00
Catalin Marinas
8cc14fdcc1 Merge branches 'for-next/amuv1-avg-freq', 'for-next/pkey_unrestricted', 'for-next/sysreg', 'for-next/misc', 'for-next/pgtable-cleanups', 'for-next/kselftest', 'for-next/uaccess-mops', 'for-next/pie-poe-cleanup', 'for-next/cputype-kryo', 'for-next/cca-dma-address', 'for-next/drop-pxd_table_bit' and 'for-next/spectre-bhb-assume-vulnerable', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
  perf/arm_cspmu: Fix missing io.h include
  perf/arm_cspmu: Add PMEVFILT2R support
  perf/arm_cspmu: Generalise event filtering
  perf/arm_cspmu: Move register definitons to header
  drivers/perf: apple_m1: Support host/guest event filtering
  drivers/perf: apple_m1: Refactor event select/filter configuration
  perf/dwc_pcie: fix duplicate pci_dev devices
  perf/dwc_pcie: fix some unreleased resources
  perf/arm-cmn: Minor event type housekeeping
  perf: arm_pmu: Move PMUv3-specific data
  perf: apple_m1: Don't disable counter in m1_pmu_enable_event()
  perf: arm_v7_pmu: Don't disable counter in (armv7|krait_|scorpion_)pmu_enable_event()
  perf: arm_v7_pmu: Drop obvious comments for enabling/disabling counters and interrupts
  perf: arm_pmuv3: Don't disable counter in armv8pmu_enable_event()
  perf: arm_pmu: Don't disable counter in armpmu_add()
  perf: arm_pmuv3: Call kvm_vcpu_pmu_resync_el0() before enabling counters
  perf: arm_pmuv3: Add support for ARM Rainier PMU

* for-next/amuv1-avg-freq:
  : Add support for AArch64 AMUv1-based average freq
  arm64: Utilize for_each_cpu_wrap for reference lookup
  arm64: Update AMU-based freq scale factor on entering idle
  arm64: Provide an AMU-based version of arch_freq_get_on_cpu
  cpufreq: Introduce an optional cpuinfo_avg_freq sysfs entry
  cpufreq: Allow arch_freq_get_on_cpu to return an error
  arch_topology: init capacity_freq_ref to 0

* for-next/pkey_unrestricted:
  : mm/pkey: Add PKEY_UNRESTRICTED macro
  selftest/powerpc/mm/pkey: fix build-break introduced by commit 00894c3fc9
  selftests/powerpc: Use PKEY_UNRESTRICTED macro
  selftests/mm: Use PKEY_UNRESTRICTED macro
  mm/pkey: Add PKEY_UNRESTRICTED macro

* for-next/sysreg:
  : arm64 sysreg updates
  arm64/sysreg: Enforce whole word match for open/close tokens
  arm64/sysreg: Fix unbalanced closing block
  arm64/sysreg: Add register fields for HFGWTR2_EL2
  arm64/sysreg: Add register fields for HFGRTR2_EL2
  arm64/sysreg: Add register fields for HFGITR2_EL2
  arm64/sysreg: Add register fields for HDFGWTR2_EL2
  arm64/sysreg: Add register fields for HDFGRTR2_EL2
  arm64/sysreg: Update register fields for ID_AA64MMFR0_EL1

* for-next/misc:
  : Miscellaneous arm64 patches
  arm64: mm: Don't use %pK through printk
  arm64/fpsimd: Remove unused declaration fpsimd_kvm_prepare()

* for-next/pgtable-cleanups:
  : arm64 pgtable accessors cleanup
  arm64/mm: Define PTDESC_ORDER
  arm64/kernel: Always use level 2 or higher for early mappings
  arm64/hugetlb: Consistently use pud_sect_supported()
  arm64/mm: Convert __pte_to_phys() and __phys_to_pte_val() as functions

* for-next/kselftest:
  : arm64 kselftest updates
  kselftest/arm64: mte: Skip the hugetlb tests if MTE not supported on such mappings
  kselftest/arm64: mte: Use the correct naming for tag check modes in check_hugetlb_options.c

* for-next/uaccess-mops:
  : Implement the uaccess memory copy/set using MOPS instructions
  arm64: lib: Use MOPS for usercopy routines
  arm64: mm: Handle PAN faults on uaccess CPY* instructions
  arm64: extable: Add fixup handling for uaccess CPY* instructions

* for-next/pie-poe-cleanup:
  : PIE/POE helpers cleanup
  arm64/sysreg: Move POR_EL0_INIT to asm/por.h
  arm64/sysreg: Rename POE_RXW to POE_RWX
  arm64/sysreg: Improve PIR/POR helpers

* for-next/cputype-kryo:
  : Add cputype info for some Qualcomm Kryo cores
  arm64: cputype: Add comments about Qualcomm Kryo 5XX and 6XX cores
  arm64: cputype: Add QCOM_CPU_PART_KRYO_3XX_GOLD

* for-next/cca-dma-address:
  : Fix DMA address for devices used in realms with Arm CCA
  arm64: realm: Use aliased addresses for device DMA to shared buffers
  dma: Introduce generic dma_addr_*crypted helpers
  dma: Fix encryption bit clearing for dma_to_phys

* for-next/drop-pxd_table_bit:
  : Drop the arm64 PXD_TABLE_BIT (clean-up in preparation for 128-bit PTEs)
  arm64/mm: Drop PXD_TABLE_BIT
  arm64/mm: Check pmd_table() in pmd_trans_huge()
  arm64/mm: Check PUD_TYPE_TABLE in pud_bad()
  arm64/mm: Check PXD_TYPE_TABLE in [p4d|pgd]_bad()
  arm64/mm: Clear PXX_TYPE_MASK and set PXD_TYPE_SECT in [pmd|pud]_mkhuge()
  arm64/mm: Clear PXX_TYPE_MASK in mk_[pmd|pud]_sect_prot()
  arm64/ptdump: Test PMD_TYPE_MASK for block mapping
  KVM: arm64: ptdump: Test PMD_TYPE_MASK for block mapping

* for-next/spectre-bhb-assume-vulnerable:
  : Rework Spectre BHB mitigations to not assume "safe"
  arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
  arm64: cputype: Add MIDR_CORTEX_A76AE
  arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
  arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
  arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
2025-03-25 19:32:03 +00:00
Linus Torvalds
a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Oliver Upton
369c012268 Merge branch 'kvm-arm64/pmu-fixes' into kvmarm/next
* kvm-arm64/pmu-fixes:
  : vPMU fixes for 6.15 courtesy of Akihiko Odaki
  :
  : Various fixes to KVM's vPMU implementation, notably ensuring
  : userspace-directed changes to the PMCs are reflected in the backing perf
  : events.
  KVM: arm64: PMU: Reload when resetting
  KVM: arm64: PMU: Reload when user modifies registers
  KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
  KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
  KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:54:52 -07:00
Oliver Upton
ca19dd4323 Merge branch 'kvm-arm64/pkvm-6.15' into kvmarm/next
* kvm-arm64/pkvm-6.15:
  : pKVM updates for 6.15
  :
  :  - SecPageTable stats for stage-2 table pages allocated by the protected
  :    hypervisor (Vincent Donnefort)
  :
  :  - HCRX_EL2 trap + vCPU initialization fixes for pKVM (Fuad Tabba)
  KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
  KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
  KVM: arm64: Initialize HCRX_EL2 traps in pKVM
  KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
  KVM: arm64: Count pKVM stage-2 usage in secondary pagetable stats
  KVM: arm64: Distinct pKVM teardown memcache for stage-2
  KVM: arm64: Add flags to kvm_hyp_memcache

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:54:40 -07:00
Oliver Upton
4f2774c57a Merge branch 'kvm-arm64/writable-midr' into kvmarm/next
* kvm-arm64/writable-midr:
  : Writable implementation ID registers, courtesy of Sebastian Ott
  :
  : Introduce a new capability that allows userspace to set the
  : ID registers that identify a CPU implementation: MIDR_EL1, REVIDR_EL1,
  : and AIDR_EL1. Also plug a hole in KVM's trap configuration where
  : SMIDR_EL1 was readable at EL1, despite the fact that KVM does not
  : support SME.
  KVM: arm64: Fix documentation for KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
  KVM: arm64: Copy MIDR_EL1 into hyp VM when it is writable
  KVM: arm64: Copy guest CTR_EL0 into hyp VM
  KVM: selftests: arm64: Test writes to MIDR,REVIDR,AIDR
  KVM: arm64: Allow userspace to change the implementation ID registers
  KVM: arm64: Load VPIDR_EL2 with the VM's MIDR_EL1 value
  KVM: arm64: Maintain per-VM copy of implementation ID regs
  KVM: arm64: Set HCR_EL2.TID1 unconditionally

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:54:32 -07:00
Oliver Upton
1b1d1b17b8 Merge branch 'kvm-arm64/pmuv3-asahi' into kvmarm/next
* kvm-arm64/pmuv3-asahi:
  : Support PMUv3 for KVM guests on Apple silicon
  :
  : Take advantage of some IMPLEMENTATION DEFINED traps available on Apple
  : parts to trap-and-emulate the PMUv3 registers on behalf of a KVM guest.
  : Constrain the vPMU to a cycle counter and single event counter, as the
  : Apple PMU has events that cannot be counted on every counter.
  :
  : There is a small new interface between the ARM PMU driver and KVM, where
  : the PMU driver owns the PMUv3 -> hardware event mappings.
  arm64: Enable IMP DEF PMUv3 traps on Apple M*
  KVM: arm64: Provide 1 event counter on IMPDEF hardware
  drivers/perf: apple_m1: Provide helper for mapping PMUv3 events
  KVM: arm64: Remap PMUv3 events onto hardware
  KVM: arm64: Advertise PMUv3 if IMPDEF traps are present
  KVM: arm64: Compute synthetic sysreg ESR for Apple PMUv3 traps
  KVM: arm64: Move PMUVer filtering into KVM code
  KVM: arm64: Use guard() to cleanup usage of arm_pmus_lock
  KVM: arm64: Drop kvm_arm_pmu_available static key
  KVM: arm64: Use a cpucap to determine if system supports FEAT_PMUv3
  KVM: arm64: Always support SW_INCR PMU event
  KVM: arm64: Compute PMCEID from arm_pmu's event bitmaps
  drivers/perf: apple_m1: Support host/guest event filtering
  drivers/perf: apple_m1: Refactor event select/filter configuration

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:54:23 -07:00
Oliver Upton
d300b0168e Merge branch 'kvm-arm64/pv-cpuid' into kvmarm/next
* kvm-arm64/pv-cpuid:
  : Paravirtualized implementation ID, courtesy of Shameer Kolothum
  :
  : Big-little has historically been a pain in the ass to virtualize. The
  : implementation ID (MIDR, REVIDR, AIDR) of a vCPU can change at the whim
  : of vCPU scheduling. This can be particularly annoying when the guest
  : needs to know the underlying implementation to mitigate errata.
  :
  : "Hyperscalers" face a similar scheduling problem, where VMs may freely
  : migrate between hosts in a pool of heterogenous hardware. And yes, our
  : server-class friends are equally riddled with errata too.
  :
  : In absence of an architected solution to this wart on the ecosystem,
  : introduce support for paravirtualizing the implementation exposed
  : to a VM, allowing the VMM to describe the pool of implementations that a
  : VM may be exposed to due to scheduling/migration.
  :
  : Userspace is expected to intercept and handle these hypercalls using the
  : SMCCC filter UAPI, should it choose to do so.
  smccc: kvm_guest: Fix kernel builds for 32 bit arm
  KVM: selftests: Add test for KVM_REG_ARM_VENDOR_HYP_BMAP_2
  smccc/kvm_guest: Enable errata based on implementation CPUs
  arm64: Make  _midr_in_range_list() an exported function
  KVM: arm64: Introduce KVM_REG_ARM_VENDOR_HYP_BMAP_2
  KVM: arm64: Specify hypercall ABI for retrieving target implementations
  arm64: Modify _midr_range() functions to read MIDR/REVIDR internally

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:53:16 -07:00
Oliver Upton
13f64f6d21 Merge branch 'kvm-arm64/nv-idregs' into kvmarm/next
* kvm-arm64/nv-idregs:
  : Changes to exposure of NV features, courtesy of Marc Zyngier
  :
  : Apply NV-specific feature restrictions at reset rather than at the point
  : of KVM_RUN. This makes the true feature set visible to userspace, a
  : necessary step towards save/restore support or NV VMs.
  :
  : Add an additional vCPU feature flag for selecting the E2H0 flavor of NV,
  : such that the VHE-ness of the VM can be applied to the feature set.
  KVM: arm64: selftests: Test that TGRAN*_2 fields are writable
  KVM: arm64: Allow userspace to write ID_AA64MMFR0_EL1.TGRAN*_2
  KVM: arm64: Advertise FEAT_ECV when possible
  KVM: arm64: Make ID_AA64MMFR4_EL1.NV_frac writable
  KVM: arm64: Allow userspace to limit NV support to nVHE
  KVM: arm64: Move NV-specific capping to idreg sanitisation
  KVM: arm64: Enforce NV limits on a per-idregs basis
  KVM: arm64: Make ID_REG_LIMIT_FIELD_ENUM() more widely available
  KVM: arm64: Consolidate idreg callbacks
  KVM: arm64: Advertise NV2 in the boot messages
  KVM: arm64: Mark HCR.EL2.{NV*,AT} RES0 when ID_AA64MMFR4_EL1.NV_frac is 0
  KVM: arm64: Mark HCR.EL2.E2H RES0 when ID_AA64MMFR1_EL1.VH is zero
  KVM: arm64: Hide ID_AA64MMFR2_EL1.NV from guest and userspace
  arm64: cpufeature: Handle NV_frac as a synonym of NV2

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:52:26 -07:00
Oliver Upton
56e3e5c8f7 Merge branch 'kvm-arm64/nv-vgic' into kvmarm/next
* kvm-arm64/nv-vgic:
  : NV VGICv3 support, courtesy of Marc Zyngier
  :
  : Support for emulating the GIC hypervisor controls and managing shadow
  : VGICv3 state for the L1 hypervisor. As part of it, bring in support for
  : taking IRQs to the L1 and UAPI to manage the VGIC maintenance interrupt.
  KVM: arm64: nv: Fail KVM init if asking for NV without GICv3
  KVM: arm64: nv: Allow userland to set VGIC maintenance IRQ
  KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup
  KVM: arm64: nv: Propagate used_lrs between L1 and L0 contexts
  KVM: arm64: nv: Request vPE doorbell upon nested ERET to L2
  KVM: arm64: nv: Respect virtual HCR_EL2.TWx setting
  KVM: arm64: nv: Add Maintenance Interrupt emulation
  KVM: arm64: nv: Handle L2->L1 transition on interrupt injection
  KVM: arm64: nv: Nested GICv3 emulation
  KVM: arm64: nv: Sanitise ICH_HCR_EL2 accesses
  KVM: arm64: nv: Plumb handling of GICv3 EL2 accesses
  KVM: arm64: nv: Add ICH_*_EL2 registers to vpcu_sysreg
  KVM: arm64: nv: Load timer before the GIC
  arm64: sysreg: Add layout for ICH_MISR_EL2
  arm64: sysreg: Add layout for ICH_VTR_EL2
  arm64: sysreg: Add layout for ICH_HCR_EL2

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:51:43 -07:00
Oliver Upton
3ed0dc03f6 Merge branch 'kvm-arm64/misc' into kvmarm/next
* kvm-arm64/misc:
  : Miscellaneous fixes/cleanups for KVM/arm64
  :
  :  - Avoid GICv4 vLPI configuration when confronted with user error
  :
  :  - Only attempt vLPI configuration when the target routing is an MSI
  :
  :  - Document ordering requirements to avoid aforementioned user error
  KVM: arm64: Tear down vGIC on failed vCPU creation
  KVM: arm64: Document ordering requirements for irqbypass
  KVM: arm64: vgic-v4: Fall back to software irqbypass if LPI not found
  KVM: arm64: vgic-v4: Only WARN for HW IRQ mismatch when unmapping vLPI
  KVM: arm64: vgic-v4: Only attempt vLPI mapping for actual MSIs

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-19 14:51:15 -07:00
Will Deacon
250f25367b KVM: arm64: Tear down vGIC on failed vCPU creation
If kvm_arch_vcpu_create() fails to share the vCPU page with the
hypervisor, we propagate the error back to the ioctl but leave the
vGIC vCPU data initialised. Note only does this leak the corresponding
memory when the vCPU is destroyed but it can also lead to use-after-free
if the redistributor device handling tries to walk into the vCPU.

Add the missing cleanup to kvm_arch_vcpu_create(), ensuring that the
vGIC vCPU structures are destroyed on error.

Cc: <stable@vger.kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250314133409.9123-1-will@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 23:30:51 -07:00
Akihiko Odaki
fe53538069 KVM: arm64: PMU: Reload when resetting
Replace kvm_pmu_vcpu_reset() with the generic PMU reloading mechanism to
ensure the consistency with system registers and to reduce code size.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-5-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:45:25 -07:00
Akihiko Odaki
1db4aaa055 KVM: arm64: PMU: Reload when user modifies registers
Commit d0c94c4979 ("KVM: arm64: Restore PMU configuration on first
run") added the code to reload the PMU configuration on first run.

It is also important to keep the correct state even if system registers
are modified after first run, specifically when debugging Windows on
QEMU with GDB; QEMU tries to write back all visible registers when
resuming the VM execution with GDB, corrupting the PMU state. Windows
always uses the PMU so this can cause adverse effects on that particular
OS.

The usual register writes and reset are already handled independently,
but register writes from userspace are not covered.
Trigger the code to reload the PMU configuration for them instead so
that PMU configuration changes made by users will be applied also after
the first run.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-4-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:45:25 -07:00
Akihiko Odaki
64074ca8ca KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
Reload the perf event when setting the vPMU counter (vPMC) registers
(PMCCNTR_EL0 and PMEVCNTR<n>_EL0). This is a change corresponding to
commit 9228b26194 ("KVM: arm64: PMU: Fix GET_ONE_REG
for vPMC regs to return the current value") but for SET_ONE_REG.

Values of vPMC registers are saved in sysreg files on certain occasions.
These saved values don't represent the current values of the vPMC
registers if the perf events for the vPMCs count events after the save.
The current values of those registers are the sum of the sysreg file
value and the current perf event counter value.  But, when userspace
writes those registers (using KVM_SET_ONE_REG), KVM only updates the
sysreg file value and leaves the current perf event counter value as is.

It is also important to keep the correct state even if userspace writes
them after first run, specifically when debugging Windows on QEMU with
GDB; QEMU tries to write back all visible registers when resuming the VM
execution with GDB, corrupting the PMU state. Windows always uses the
PMU so this can cause adverse effects on that particular OS.

Fix this by releasing the current perf event and trigger recreating one
with KVM_REQ_RELOAD_PMU.

Fixes: 051ff581ce ("arm64: KVM: Add access handler for event counter register")
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-3-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:45:21 -07:00
Akihiko Odaki
be5ccac3f1 KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
Many functions in pmu-emul.c checks kvm_vcpu_has_pmu(vcpu). A favorable
interpretation is defensive programming, but it also has downsides:

- It is confusing as it implies these functions are called without PMU
  although most of them are called only when a PMU is present.

- It makes semantics of functions fuzzy. For example, calling
  kvm_pmu_disable_counter_mask() without PMU may result in no-op as
  there are no enabled counters, but it's unclear what
  kvm_pmu_get_counter_value() returns when there is no PMU.

- It allows callers without checking kvm_vcpu_has_pmu(vcpu), but it is
  often wrong to call these functions without PMU.

- It is error-prone to duplicate kvm_vcpu_has_pmu(vcpu) checks into
  multiple functions. Many functions are called for system registers,
  and the system register infrastructure already employs less
  error-prone, comprehensive checks.

Check kvm_vcpu_has_pmu(vcpu) in callers of these functions instead,
and remove the obsolete checks from pmu-emul.c. The only exceptions are
the functions that implement ioctls as they have definitive semantics
even when the PMU is not present.

Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-2-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:42:22 -07:00
Akihiko Odaki
f2aeb7bbd5 KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
Commit a45f41d754 ("KVM: arm64: Add {get,set}_user for
PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}") changed KVM_SET_ONE_REG to update
the mentioned registers in a way matching with the behavior of guest
register writes. This is a breaking change of a UAPI though the new
semantics looks cleaner and VMMs are not prepared for this.

Firecracker, QEMU, and crosvm perform migration by listing registers
with KVM_GET_REG_LIST, getting their values with KVM_GET_ONE_REG and
setting them with KVM_SET_ONE_REG. This algorithm assumes
KVM_SET_ONE_REG restores the values retrieved with KVM_GET_ONE_REG
without any alteration. However, bit operations added by the earlier
commit do not preserve the values retried with KVM_GET_ONE_REG and
potentially break migration.

Remove the bit operations that alter the values retrieved with
KVM_GET_ONE_REG.

Cc: stable@vger.kernel.org
Fixes: a45f41d754 ("KVM: arm64: Add {get,set}_user for PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}")
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250315-pmc-v5-1-ecee87dab216@daynix.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-17 10:42:22 -07:00
Anshuman Khandual
f9aad62200 mm: rename GENERIC_PTDUMP and PTDUMP_CORE
Platforms subscribe into generic ptdump implementation via GENERIC_PTDUMP.
But generic ptdump gets enabled via PTDUMP_CORE.  These configs
combination is confusing as they sound very similar and does not
differentiate between platform's feature subscription and feature
enablement for ptdump.  Rename the configs as ARCH_HAS_PTDUMP and PTDUMP
making it more clear and improve readability.

Link: https://lkml.kernel.org/r/20250226122404.1927473-6-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 00:05:32 -07:00
Fuad Tabba
1eab115486 KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
Instead of creating and initializing _all_ hyp vcpus in pKVM when
the first host vcpu runs for the first time, initialize _each_
hyp vcpu in conjunction with its corresponding host vcpu.

Some of the host vcpu state (e.g., system registers and traps
values) is not initialized until the first time the host vcpu is
run. Therefore, initializing a hyp vcpu before its corresponding
host vcpu has run for the first time might not view the complete
host state of these vcpus.

Additionally, this behavior is inline with non-protected modes.

Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20250314111832.4137161-5-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 16:06:03 -07:00
Fuad Tabba
8b21fb47c7 KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
Move the code that creates and initializes the hyp view of a vcpu
in pKVM to its own function. This is meant to make the transition
to initializing every vcpu individually clearer.

Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20250314111832.4137161-4-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 16:04:23 -07:00
Fuad Tabba
066daa8d3b KVM: arm64: Initialize HCRX_EL2 traps in pKVM
Initialize and set the traps controlled by the HCRX_EL2 in pKVM.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20250314111832.4137161-3-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 16:00:49 -07:00
Fuad Tabba
44f979bf43 KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
Factor out the code for setting a vcpu's HCRX_EL2 traps in to a
separate inline function. This allows us to share the logic with
pKVM when setting the traps in protected mode.

No functional change intended.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20250314111832.4137161-2-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 16:00:49 -07:00
Vincent Donnefort
79ea662315 KVM: arm64: Count pKVM stage-2 usage in secondary pagetable stats
Count the pages used by pKVM for the guest stage-2 in memory stats under
secondary pagetable, similarly to what the VHE mode does.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250313114038.1502357-4-vdonnefort@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 00:56:30 -07:00
Vincent Donnefort
8c0d7d14c5 KVM: arm64: Distinct pKVM teardown memcache for stage-2
In order to account for memory dedicated to the stage-2 page-tables, use
a separated memcache when tearing down the VM. Meanwhile rename
reclaim_guest_pages to reflect the fact it only reclaim page-table
pages.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250313114038.1502357-3-vdonnefort@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 00:56:29 -07:00
Vincent Donnefort
cf2d228da9 KVM: arm64: Add flags to kvm_hyp_memcache
Add flags to kvm_hyp_memcache and propagate the latter to the allocation
and free callbacks. This will later allow to account for memory, based
on the memcache configuration.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250313114038.1502357-2-vdonnefort@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-14 00:56:29 -07:00
Sebastian Ott
3f1e072753 KVM: arm64: Allow userspace to write ID_AA64MMFR0_EL1.TGRAN*_2
Allow userspace to write the safe (NI) value for ID_AA64MMFR0_EL1.TGRAN*_2.
Disallow to change these fields for NV since kvm provides a sanitized view
for them based on the PAGE_SIZE.

Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/kvmarm/20250306184013.30008-1-sebott@redhat.com/
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-12 13:39:50 -07:00
Anshuman Khandual
0b626b245c KVM: arm64: ptdump: Test PMD_TYPE_MASK for block mapping
Test given page table entries against PMD_TYPE_SECT on PMD_TYPE_MASK mask
bits for identifying block mappings in stage 2 page tables.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: kvmarm@lists.linux.dev
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20250221044227.1145393-2-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-03-12 12:20:59 +00:00
Oliver Upton
1b92e65f50 KVM: arm64: Provide 1 event counter on IMPDEF hardware
PMUv3 requires that all programmable event counters are capable of
counting any event. The Apple M* PMU is quite a bit different, and
events have affinities for particular PMCs.

Expose 1 event counter on IMPDEF hardware, allowing the guest to do
something useful with its PMU while also upholding the requirements of
the architecture.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305203021.428366-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:30 -07:00
Oliver Upton
1e7dcbfa4b KVM: arm64: Remap PMUv3 events onto hardware
Map PMUv3 event IDs onto hardware, if the driver exposes such a helper.
This is expected to be quite rare, and only useful for non-PMUv3 hardware.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-12-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Oliver Upton
bed9b8ec8c KVM: arm64: Advertise PMUv3 if IMPDEF traps are present
Advertise a baseline PMUv3 implementation when running on hardware with
IMPDEF traps of the PMUv3 sysregs.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-11-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Oliver Upton
2c433f70dc KVM: arm64: Compute synthetic sysreg ESR for Apple PMUv3 traps
Apple M* CPUs provide an IMPDEF trap for PMUv3 sysregs, where ESR_EL2.EC
is a reserved value (0x3F) and a sysreg-like ISS is reported in
AFSR1_EL2.

Compute a synthetic ESR for these PMUv3 traps, giving the illusion of
something architectural to the rest of KVM.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-10-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Oliver Upton
56290316a4 KVM: arm64: Move PMUVer filtering into KVM code
The supported guest PMU version on a particular platform is ultimately a
KVM decision. Move PMUVer filtering into KVM code.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-9-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Oliver Upton
3d6d917212 KVM: arm64: Use guard() to cleanup usage of arm_pmus_lock
Get rid of some goto label patterns by using guard() to drop the
arm_pmus_lock when returning from a function.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-8-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Oliver Upton
a38b67d151 KVM: arm64: Drop kvm_arm_pmu_available static key
With the PMUv3 cpucap, kvm_arm_pmu_available is no longer used in the
hot path of guest entry/exit. On top of that, guest support for PMUv3
may not correlate with host support for the feature, e.g. on IMPDEF
hardware.

Throw out the static key and just inspect the list of PMUs to determine
if PMUv3 is supported for KVM guests.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-7-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:29 -07:00
Oliver Upton
6f34024d18 KVM: arm64: Use a cpucap to determine if system supports FEAT_PMUv3
KVM is about to learn some new tricks to virtualize PMUv3 on IMPDEF
hardware. As part of that, we now need to differentiate host support
from guest support for PMUv3.

Add a cpucap to determine if an architectural PMUv3 is present to guard
host usage of PMUv3 controls.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-6-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:54:23 -07:00
Oliver Upton
ed335722b4 KVM: arm64: Always support SW_INCR PMU event
Support for SW_INCR is unconditional, as KVM traps accesses to
PMSWINC_EL0 and emulates the intended event increment. While it is
expected that ~all PMUv3 implementations already advertise this event,
non-PMUv3 hardware may not.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:52:32 -07:00
Oliver Upton
93b0152858 KVM: arm64: Compute PMCEID from arm_pmu's event bitmaps
The PMUv3 driver populates a couple of bitmaps with the values of
PMCEID{0,1}, from which the guest's PMCEID{0,1} can be derived. This
is particularly convenient when virtualizing PMUv3 on IMP DEF hardware,
as reading the nonexistent PMCEID registers leads to a rather unpleasant
UNDEF.

Tested-by: Janne Grunau <j@jannau.net>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305202641.428114-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-11 12:52:32 -07:00
Kevin Brodsky
83d78bbfd2 arm64/sysreg: Rename POE_RXW to POE_RWX
It is customary to list R, W, X permissions in that order. In fact
this is already the case for PIE constants (PIE_RWX). Rename POE_RXW
accordingly, as well as POE_XW (currently unused).

While at it also swap the W/X lines in
compute_s1_overlay_permissions() to follow the R, W, X order.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Link: https://lore.kernel.org/r/20250219164029.2309119-3-kevin.brodsky@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-03-10 18:10:40 +00:00
Oliver Upton
03e1b89d05 KVM: arm64: Copy MIDR_EL1 into hyp VM when it is writable
KVM recently added a capability that allows userspace to override the
'implementation ID' registers presented to the VM. MIDR_EL1 is a special
example, where the hypervisor can directly set the value when read from
EL1 using VPIDR_EL2.

Copy the VM-wide value for MIDR_EL1 into the hyp VM for non-protected
guests when the capability is enabled so VPIDR_EL2 gets set up
correctly.

Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/kvmarm/ac594b9c-4bbb-46c8-9391-e7a68ce4de5b@sirena.org.uk/
Fixes: 3adaee7830 ("KVM: arm64: Allow userspace to change the implementation ID registers")
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305230825.484091-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-05 16:56:57 -08:00
Oliver Upton
9d91227364 KVM: arm64: Copy guest CTR_EL0 into hyp VM
Since commit 2843cae266 ("KVM: arm64: Treat CTR_EL0 as a VM feature
ID register") KVM has allowed userspace to configure the VM-wide view of
CTR_EL0, falling back to trap-n-emulate if the value doesn't match
hardware. It appears that this has worked by chance in protected-mode
for some time, and on systems with FEAT_EVT protected-mode
unconditionally sets TID4 (i.e. TID2 traps sans CTR_EL0).

Forward the guest CTR_EL0 value through to the hyp VM and align the
TID2/TID4 configuration with the non-protected setup.

Fixes: 2843cae266 ("KVM: arm64: Treat CTR_EL0 as a VM feature ID register")
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250305230825.484091-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-05 16:55:41 -08:00
Marc Zyngier
83c6cb2014 KVM: arm64: nv: Fail KVM init if asking for NV without GICv3
Although there is nothing in NV that is fundamentally incompatible
with the lack of GICv3, there is no HW implementation without one,
at least on the virtual side (yes, even fruits have some form of
vGICv3).

We therefore make the decision to require GICv3, which will only
affect models such as QEMU. Booting with a GICv2 or something
even more exotic while asking for NV will result in KVM being
disabled.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-17-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Andre Przywara
faf7714a47 KVM: arm64: nv: Allow userland to set VGIC maintenance IRQ
The VGIC maintenance IRQ signals various conditions about the LRs, when
the GIC's virtualization extension is used.
So far we didn't need it, but nested virtualization needs to know about
this interrupt, so add a userland interface to setup the IRQ number.
The architecture mandates that it must be a PPI, on top of that this code
only exports a per-device option, so the PPI is the same on all VCPUs.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
[added some bits of documentation]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-16-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Marc Zyngier
89896cc159 KVM: arm64: nv: Fold GICv3 host trapping requirements into guest setup
Popular HW that is able to use NV also has a broken vgic implementation
that requires trapping.

On such HW, propagate the host trap bits into the guest's shadow
ICH_HCR_EL2 register, making sure we don't allow an L2 guest to bring
the system down.

This involves a bit of tweaking so that the emulation code correctly
poicks up the shadow state as needed, and to only partially sync
ICH_HCR_EL2 back with the guest state to capture EOIcount.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-15-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Marc Zyngier
7682c02321 KVM: arm64: nv: Propagate used_lrs between L1 and L0 contexts
We have so far made sure that L1 and L0 vgic contexts were
totally independent. There is however one spot of bother with
this approach, and that's in the GICv3 emulation code required by
our fruity friends.

The issue is that the emulation code needs to know how many LRs
are in flight. And while it is easy to reach the L0 version through
the vcpu pointer, doing so for the L1 is much more complicated,
as these structures are private to the nested code.

We could simply expose that structure and pick one or the other
depending on the context, but this seems extra complexity for not
much benefit.

Instead, just propagate the number of used LRs from the nested code
into the L0 context, and be done with it. Should this become a burden,
it can be easily rectified.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-14-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Oliver Upton
93078ae63f KVM: arm64: nv: Request vPE doorbell upon nested ERET to L2
Running an L2 guest with GICv4 enabled goes absolutely nowhere, and gets
into a vicious cycle of nested ERET followed by nested exception entry
into the L1.

When KVM does a put on a runnable vCPU, it marks the vPE as nonresident
but does not request a doorbell IRQ. Behind the scenes in the ITS
driver's view of the vCPU, its_vpe::pending_last gets set to true to
indicate that context is still runnable.

This comes to a head when doing the nested ERET into L2. The vPE doesn't
get scheduled on the redistributor as it is exclusively part of the L1's
VGIC context. kvm_vgic_vcpu_pending_irq() returns true because the vPE
appears runnable, and KVM does a nested exception entry into the L1
before L2 ever gets off the ground.

This issue can be papered over by requesting a doorbell IRQ when
descheduling a vPE as part of a nested ERET. KVM needs this anyway to
kick the vCPU out of the L2 when an IRQ becomes pending for the L1.

Link: https://lore.kernel.org/r/20240823212703.3576061-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-13-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Jintack Lim
69c9176c38 KVM: arm64: nv: Respect virtual HCR_EL2.TWx setting
Forward exceptions due to WFI or WFE instructions to the virtual EL2 if
they are not coming from the virtual EL2 and virtual HCR_EL2.TWx is set.

Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-12-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Marc Zyngier
201c8d40dd KVM: arm64: nv: Add Maintenance Interrupt emulation
Emulating the vGIC means emulating the dreaded Maintenance Interrupt.

This is a two-pronged problem:

- while running L2, getting an MI translates into an MI injected
  in the L1 based on the state of the HW.

- while running L1, we must accurately reflect the state of the
  MI line, based on the in-memory state.

The MI INTID is added to the distributor, as expected on any
virtualisation-capable implementation, and further patches
will allow its configuration.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-11-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Marc Zyngier
4b1b97f0d7 KVM: arm64: nv: Handle L2->L1 transition on interrupt injection
An interrupt being delivered to L1 while running L2 must result
in the correct exception being delivered to L1.

This means that if, on entry to L2, we found ourselves with pending
interrupts in the L1 distributor, we need to take immediate action.
This is done by posting a request which will prevent the entry in
L2, and deliver an IRQ exception to L1, forcing the switch.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Marc Zyngier
146a050f2d KVM: arm64: nv: Nested GICv3 emulation
When entering a nested VM, we set up the hypervisor control interface
based on what the guest hypervisor has set. Especially, we investigate
each list register written by the guest hypervisor whether HW bit is
set.  If so, we translate hw irq number from the guest's point of view
to the real hardware irq number if there is a mapping.

Co-developed-by: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
[Christoffer: Redesigned execution flow around vcpu load/put]
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[maz: Rewritten to support GICv3 instead of GICv2, NV2 support]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-9-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:04 -08:00
Marc Zyngier
21d29cd814 KVM: arm64: nv: Sanitise ICH_HCR_EL2 accesses
As ICH_HCR_EL2 is a VNCR accessor when runnintg NV, add some
sanitising to what gets written. Crucially, mark TDIR as RES0
if the HW doesn't support it (unlikely, but hey...), as well
as anything GICv4 related, since we only expose a GICv3 to the
uest.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-8-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:55:10 -08:00
Marc Zyngier
96c2f03311 KVM: arm64: nv: Plumb handling of GICv3 EL2 accesses
Wire the handling of all GICv3 EL2 registers, and provide emulation
for all the non memory-backed registers (ICC_SRE_EL2, ICH_VTR_EL2,
ICH_MISR_EL2, ICH_ELRSR_EL2, and ICH_EISR_EL2).

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:55:10 -08:00
Marc Zyngier
16abeb60be KVM: arm64: nv: Load timer before the GIC
In order for vgic_v3_load_nested to be able to observe which timer
interrupts have the HW bit set for the current context, the timers
must have been loaded in the new mode and the right timer mapped
to their corresponding HW IRQs.

At the moment, we load the GIC first, meaning that timer interrupts
injected to an L2 guest will never have the HW bit set (we see the
old configuration).

Swapping the two loads solves this particular problem.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:54:57 -08:00
Marc Zyngier
5815fb82dc arm64: sysreg: Add layout for ICH_VTR_EL2
The ICH_VTR_EL2-related macros are missing a number of config
bits that we are about to handle. Take this opportunity to fully
describe the layout of that register as part of the automatic
generation infrastructure.

This results in a bit of churn to repaint constants that are now
generated with a different format.

Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:51:51 -08:00
Marc Zyngier
22513c0d2a arm64: sysreg: Add layout for ICH_HCR_EL2
The ICH_HCR_EL2-related macros are missing a number of control
bits that we are about to handle. Take this opportunity to fully
describe the layout of that register as part of the automatic
generation infrastructure.

This results in a bit of churn, unfortunately.

Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:51:51 -08:00
Ahmed Genidi
3855a7b91d KVM: arm64: Initialize SCTLR_EL1 in __kvm_hyp_init_cpu()
When KVM is in protected mode, host calls to PSCI are proxied via EL2,
and cold entries from CPU_ON, CPU_SUSPEND, and SYSTEM_SUSPEND bounce
through __kvm_hyp_init_cpu() at EL2 before entering the host kernel's
entry point at EL1. While __kvm_hyp_init_cpu() initializes SPSR_EL2 for
the exception return to EL1, it does not initialize SCTLR_EL1.

Due to this, it's possible to enter EL1 with SCTLR_EL1 in an UNKNOWN
state. In practice this has been seen to result in kernel crashes after
CPU_ON as a result of SCTLR_EL1.M being 1 in violation of the initial
core configuration specified by PSCI.

Fix this by initializing SCTLR_EL1 for cold entry to the host kernel.
As it's necessary to write to SCTLR_EL12 in VHE mode, this
initialization is moved into __kvm_host_psci_cpu_entry() where we can
use write_sysreg_el1().

The remnants of the '__init_el2_nvhe_prepare_eret' macro are folded into
its only caller, as this is clearer than having the macro.

Fixes: cdf3671927 ("KVM: arm64: Intercept host's CPU_ON SMCs")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ahmed Genidi <ahmed.genidi@arm.com>
[ Mark: clarify commit message, handle E2H, move to C, remove macro ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ahmed Genidi <ahmed.genidi@arm.com>
Cc: Ben Horgan <ben.horgan@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20250227180526.1204723-3-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-03-02 08:36:52 +00:00
Mark Rutland
7a68b55ff3 KVM: arm64: Initialize HCR_EL2.E2H early
On CPUs without FEAT_E2H0, HCR_EL2.E2H is RES1, but may reset to an
UNKNOWN value out of reset and consequently may not read as 1 unless it
has been explicitly initialized.

We handled this for the head.S boot code in commits:

  3944382fa6 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative")
  b3320142f3 ("arm64: Fix early handling of FEAT_E2H0 not being implemented")

Unfortunately, we forgot to apply a similar fix to the KVM PSCI entry
points used when relaying CPU_ON, CPU_SUSPEND, and SYSTEM SUSPEND. When
KVM is entered via these entry points, the value of HCR_EL2.E2H may be
consumed before it has been initialized (e.g. by the 'init_el2_state'
macro).

Initialize HCR_EL2.E2H early in these paths such that it can be consumed
reliably. The existing code in head.S is factored out into a new
'init_el2_hcr' macro, and this is used in the __kvm_hyp_init_cpu()
function common to all the relevant PSCI entry points.

For clarity, I've tweaked the assembly used to check whether
ID_AA64MMFR4_EL1.E2H0 is negative. The bitfield is extracted as a signed
value, and this is checked with a signed-greater-or-equal (GE) comparison.

As the hyp code will reconfigure HCR_EL2 later in ___kvm_hyp_init(), all
bits other than E2H are initialized to zero in __kvm_hyp_init_cpu().

Fixes: 3944382fa6 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative")
Fixes: b3320142f3 ("arm64: Fix early handling of FEAT_E2H0 not being implemented")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ahmed Genidi <ahmed.genidi@arm.com>
Cc: Ben Horgan <ben.horgan@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250227180526.1204723-2-mark.rutland@arm.com
[maz: fixed LT->GE thinko]
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-03-02 08:36:52 +00:00
Shameer Kolothum
c0000e58c7 KVM: arm64: Introduce KVM_REG_ARM_VENDOR_HYP_BMAP_2
The vendor_hyp_bmap bitmap holds the information about the Vendor Hyp
services available to the user space and can be get/set using
{G, S}ET_ONE_REG interfaces. This is done using the pseudo-firmware
bitmap register KVM_REG_ARM_VENDOR_HYP_BMAP.

At present, this bitmap is a 64 bit one and since the function numbers
for newly added DISCOVER_IPML_* hypercalls are 64-65, introduce
another pseudo-firmware bitmap register KVM_REG_ARM_VENDOR_HYP_BMAP_2.

Reviewed-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20250221140229.12588-4-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:30:36 -08:00
Shameer Kolothum
e3121298c7 arm64: Modify _midr_range() functions to read MIDR/REVIDR internally
These changes lay the groundwork for adding support for guest kernels,
allowing them to leverage target CPU implementations provided by the
VMM.

No functional changes intended.

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250221140229.12588-2-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:29:44 -08:00
Oliver Upton
5c57533eb8 KVM: arm64: vgic-v4: Fall back to software irqbypass if LPI not found
Continuing with the theme of broken VMMs and guests, irqbypass
registration can fail if the virtual ITS lacks a translation for the
MSI. Either the guest hasn't mapped it or userspace may have forgotten
to restore the ITS.

Exit silently and allow irqbypass configuration to succeed. As a reward
for ingenuity, LPIs are demoted to software injection.

Tested-by: Sudheer Dantuluri <dantuluris@google.com>
Fixes: 196b136498 ("KVM: arm/arm64: GICv4: Wire mapping/unmapping of VLPIs in VFIO irq bypass")
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250226183124.82094-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:22:49 -08:00
Oliver Upton
d0b79563fd KVM: arm64: vgic-v4: Only WARN for HW IRQ mismatch when unmapping vLPI
The VMM or guest can easily screw up GICv4 vLPI injection by
misconfiguring the MSI or the virtual ITS. Don't fuss over it; limit the
WARN in vgic_v4_unset_forwarding() to fire in the worrying case where an
unrelated HW IRQ was mapped to a vLPI.

Reported-by: Sudheer Dantuluri <dantuluris@google.com>
Tested-by: Sudheer Dantuluri <dantuluris@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250226183124.82094-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:22:48 -08:00
Oliver Upton
a0d7e2fc61 KVM: arm64: vgic-v4: Only attempt vLPI mapping for actual MSIs
Some 'creative' VMMs out there may assign a VFIO MSI eventfd to an SPI
routing entry.

And yes, I can already hear you shouting about possibly driving a level
interrupt with an edge-sensitive one. You know who you are.

This works for the most part, and interrupt injection winds up taking
the software path. However, when running on GICv4-enabled hardware, KVM
erroneously attempts to setup LPI forwarding, even though the KVM
routing isn't an MSI.

Thanks to misuse of a union, the MSI destination is unlikely to match any
ITS in the VM and kvm_vgic_v4_set_forwarding() bails early. Later on when
the VM is being torn down, this half-configured state triggers the
WARN_ON() in kvm_vgic_v4_unset_forwarding() due to the fact that no HW
IRQ was ever assigned.

Avoid the whole mess by preventing SPI routing entries from getting into
the LPI forwarding helpers.

Reported-by: Sudheer Dantuluri <dantuluris@google.com>
Tested-by: Sudheer Dantuluri <dantuluris@google.com>
Fixes: 196b136498 ("KVM: arm/arm64: GICv4: Wire mapping/unmapping of VLPIs in VFIO irq bypass")
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250226183124.82094-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:22:48 -08:00
Sebastian Ott
3adaee7830 KVM: arm64: Allow userspace to change the implementation ID registers
KVM's treatment of the ID registers that describe the implementation
(MIDR, REVIDR, and AIDR) is interesting, to say the least. On the
userspace-facing end of it, KVM presents the values of the boot CPU on
all vCPUs and treats them as invariant. On the guest side of things KVM
presents the hardware values of the local CPU, which can change during
CPU migration in a big-little system.

While one may call this fragile, there is at least some degree of
predictability around it. For example, if a VMM wanted to present
big-little to a guest, it could affine vCPUs accordingly to the correct
clusters.

All of this makes a giant mess out of adding support for making these
implementation ID registers writable. Avoid breaking the rather subtle
ABI around the old way of doing things by requiring opt-in from
userspace to make the registers writable.

When the cap is enabled, allow userspace to set MIDR, REVIDR, and AIDR
to any non-reserved value and present those values consistently across
all vCPUs.

Signed-off-by: Sebastian Ott <sebott@redhat.com>
[oliver: changelog, capability]
Link: https://lore.kernel.org/r/20250225005401.679536-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 01:32:16 -08:00
Oliver Upton
d0d81e03e6 KVM: arm64: Load VPIDR_EL2 with the VM's MIDR_EL1 value
Userspace will soon be able to change the value of MIDR_EL1. Prepare by
loading VPIDR_EL2 with the guest value for non-nested VMs.

Since VPIDR_EL2 is set for any VM, get rid of the NV-specific cleanup of
reloading the hardware value on vcpu_put(). And for nVHE, load the
hardware value before switching to the host.

Link: https://lore.kernel.org/r/20250225005401.679536-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 01:32:05 -08:00
Sebastian Ott
b4043e7cb7 KVM: arm64: Maintain per-VM copy of implementation ID regs
Get ready to allow changes to the implementation ID registers by
tracking the VM-wide values.

Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20250225005401.679536-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 01:31:58 -08:00
Oliver Upton
4cd48565b0 KVM: arm64: Set HCR_EL2.TID1 unconditionally
commit 90807748ca ("KVM: arm64: Hide SME system registers from
guests") added trap handling for SMIDR_EL1, treating it as UNDEFINED as
KVM does not support SME. This is right for the most part, however KVM
needs to set HCR_EL2.TID1 to _actually_ trap the register.

Unfortunately, this comes with some collateral damage as TID1 forces
REVIDR_EL1 and AIDR_EL1 to trap as well. KVM has long treated these
registers as "invariant" which is an awful term for the following:

 - Userspace sees the boot CPU values on all vCPUs

 - The guest sees the hardware values of the CPU on which a vCPU is
   scheduled

Keep the plates spinning by adding trap handling for the affected
registers and repaint all of the "invariant" crud into terms of
identifying an implementation. Yes, at this point we only need to
set TID1 on SME hardware, but REVIDR_EL1 and AIDR_EL1 are about to
become mutable anyway.

Cc: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 90807748ca ("KVM: arm64: Hide SME system registers from guests")
[maz: handle traps from 32bit]
Co-developed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225005401.679536-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 01:31:52 -08:00
Marc Zyngier
8b0b98ebf3 KVM: arm64: Advertise FEAT_ECV when possible
We can advertise support for FEAT_ECV if supported on the HW as
long as we limit it to the basic trap bits, and not advertise
CNTPOFF_EL2 support, even if the host has it (the short story
being that CNTPOFF_EL2 is not virtualisable).

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-13-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:33:34 -08:00
Marc Zyngier
642c23ea8b KVM: arm64: Make ID_AA64MMFR4_EL1.NV_frac writable
We want to make sure that it is possible for userspace to configure
whether recursive NV is possible. Make NV_frac writable for that
purpose.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-12-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:31:35 -08:00
Marc Zyngier
f83c41fb3d KVM: arm64: Allow userspace to limit NV support to nVHE
NV is hard. No kidding.

In order to make things simpler, we have established that NV would
support two mutually exclusive configurations:

- VHE-only, and supporting recursive virtualisation

- nVHE-only, and not supporting recursive virtualisation

For that purpose, introduce a new vcpu feature flag that denotes
the second configuration. We use this flag to limit the idregs
further.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-11-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:30:17 -08:00
Marc Zyngier
94f296dcd6 KVM: arm64: Move NV-specific capping to idreg sanitisation
Instead of applying the NV idreg limits at run time, switch to
doing it at the same time as the reset of the VM initialisation.

This will make things much simpler once we introduce vcpu-driven
variants of NV.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:28:43 -08:00
Marc Zyngier
e7ef6ed458 KVM: arm64: Enforce NV limits on a per-idregs basis
As we are about to change the way the idreg reset values are computed,
move all the NV limits into a function that initialises one register
at a time.

This will be most useful in the upcoming patches. We take this opportunity
to remove the NV_FTR() macro and rely on the generated names instead.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-9-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:28:37 -08:00
Marc Zyngier
179fd7e30f KVM: arm64: Make ID_REG_LIMIT_FIELD_ENUM() more widely available
ID_REG_LIMIT_FIELD_ENUM() is a useful macro to limit the idreg
features exposed to guest and userspace, and the NV code can
make use of it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-8-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:06:55 -08:00
Marc Zyngier
57e7de2650 KVM: arm64: Consolidate idreg callbacks
Most of the ID_DESC() users use the same callbacks, with only a few
overrides. Consolidate the common callbacks in a macro, and consistently
use it everywhere.

Whilst we're at it, give ID_UNALLOCATED() a .name string, so that we can
easily decode traces.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:06:55 -08:00
Marc Zyngier
2cd9542a37 KVM: arm64: Advertise NV2 in the boot messages
Make it a bit easier to understand what people are running by
adding a +NV2 string to the successful KVM initialisation.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-6-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:06:55 -08:00
Marc Zyngier
8f8d6084f5 KVM: arm64: Mark HCR.EL2.{NV*,AT} RES0 when ID_AA64MMFR4_EL1.NV_frac is 0
Enforce HCR_EL2.{NV*,AT} being RES0 when NV2 is disabled, so that
we can actually rely on these bits never being flipped behind our back.

This of course relies on our earlier ID reg sanitising.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:06:55 -08:00
Marc Zyngier
d9f943f765 KVM: arm64: Mark HCR.EL2.E2H RES0 when ID_AA64MMFR1_EL1.VH is zero
Enforce HCR_EL2.E2H being RES0 when VHE is disabled, so that we can
actually rely on that bit never being flipped behind our back.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:06:55 -08:00
Marc Zyngier
9d67455728 KVM: arm64: Hide ID_AA64MMFR2_EL1.NV from guest and userspace
Since our take on FEAT_NV is to only support FEAT_NV2, we should
never expose ID_AA64MMFR2_EL1.NV to a guest nor userspace.

Make sure we mask this field for good.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-3-maz@kernel.org
[oliver: squash diff for NV field]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:06:55 -08:00
Oliver Upton
fa808ed4e1 KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2
Vladimir reports that a race condition to attach a VMID to a stage-2 MMU
sometimes results in a vCPU entering the guest with a VMID of 0:

| CPU1                                            |   CPU2
|                                                 |
|                                                 | kvm_arch_vcpu_ioctl_run
|                                                 |   vcpu_load             <= load VTTBR_EL2
|                                                 |                            kvm_vmid->id = 0
|                                                 |
| kvm_arch_vcpu_ioctl_run                         |
|   vcpu_load             <= load VTTBR_EL2       |
|                            with kvm_vmid->id = 0|
|   kvm_arm_vmid_update   <= allocates fresh      |
|                            kvm_vmid->id and     |
|                            reload VTTBR_EL2     |
|                                                 |
|                                                 |   kvm_arm_vmid_update <= observes that kvm_vmid->id
|                                                 |                          already allocated,
|                                                 |                          skips reload VTTBR_EL2

Oh yeah, it's as bad as it looks. Remember that VHE loads the stage-2
MMU eagerly but a VMID only gets attached to the MMU later on in the
KVM_RUN loop.

Even in the "best case" where VTTBR_EL2 correctly gets reprogrammed
before entering the EL1&0 regime, there is a period of time where
hardware is configured with VMID 0. That's completely insane. So, rather
than decorating the 'late' binding with another hack, just allocate the
damn thing up front.

Attaching a VMID from vcpu_load() is still rollover safe since
(surprise!) it'll always get called after a vCPU was preempted.

Excuse me while I go find a brown paper bag.

Cc: stable@vger.kernel.org
Fixes: 934bf871f0 ("KVM: arm64: Load the stage-2 MMU context in kvm_vcpu_load_vhe()")
Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250219220737.130842-1-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-20 16:29:28 +00:00
Will Deacon
102c51c50d KVM: arm64: Fix tcr_el2 initialisation in hVHE mode
When not running in VHE mode, cpu_prepare_hyp_mode() computes the value
of TCR_EL2 using the host's TCR_EL1 settings as a starting point. For
nVHE, this amounts to masking out everything apart from the TG0, SH0,
ORGN0, IRGN0 and T0SZ fields before setting the RES1 bits, shifting the
IPS field down to the PS field and setting DS if LPA2 is enabled.

Unfortunately, for hVHE, things go slightly wonky: EPD1 is correctly set
to disable walks via TTBR1_EL2 but then the T1SZ and IPS fields are
corrupted when we mistakenly attempt to initialise the PS and DS fields
in their E2H=0 positions. Furthermore, many fields are retained from
TCR_EL1 which should not be propagated to TCR_EL2. Notably, this means
we can end up with A1 set despite not initialising TTBR1_EL2 at all.
This has been shown to cause unexpected translation faults at EL2 with
pKVM due to TLB invalidation not taking effect when running with a
non-zero ASID.

Fix the TCR_EL2 initialisation code to set PS and DS only when E2H=0,
masking out HD, HA and A1 when E2H=1.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Fixes: ad744e8cb3 ("arm64: Allow arm64_sw.hvhe on command line")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250214133724.13179-1-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-19 22:09:24 +00:00
Nam Cao
7e5fd922c1 KVM: arm64: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/all/59f527713562ad491df7c216eeee0378e0eb2402.1738746821.git.namcao@linutronix.de
2025-02-18 10:32:31 +01:00
Paolo Bonzini
3bb7dcebd0 KVM/arm64 fixes for 6.14, take #2
- Large set of fixes for vector handling, specially in the interactions
   between host and guest state. This fixes a number of bugs affecting
   actual deployments, and greatly simplifies the FP/SIMD/SVE handling.
   Thanks to Mark Rutland for dealing with this thankless task.
 
 - Fix an ugly race between vcpu and vgic creation/init, resulting in
   unexpected behaviours.
 
 - Fix use of kernel VAs at EL2 when emulating timers with nVHE.
 
 - Small set of pKVM improvements and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmevLKMACgkQI9DQutE9
 ekP3hQ//db7pAPzLr++//PAyam0GP+ooKlgpB0ImZisQwkrTrTMP+IjNJG+NCJ46
 y88anBErFijvWb3BINpeTM/dux7DmuoaolGx7lquFu+i0L8UfFFjYG7UU+NZscim
 KE4j0tJz8jm5ksN4iwaj3RIkGKc1zJtRyoPny3j1blOtm8aTtujRJB7/Gx2QefZR
 1Z13RaIzk1tKdY0JxAmPpGkaRY99MQahx96iBsk2u4rlypcxmVr9aQ1Madp7Pc6Y
 pBcX9jZwLf75cj6CAK93YSjFF3j/x4QM8jSupLCu5tyin6YZ4sRaZa6sy52byk2v
 zes7i83l5g3+JEKv5oZVwjD5SFBu02UPbnMGSxKQitgz4Zej3qMIq5BxgII2kHZV
 jwXrNEx4trNegEcoqwFX5xA0FMUr1/g3Cr4+rZBoUramj80cBhzbBdUkhyWd3eey
 j2EOuAG3pgUD5Wv9SyojlbHBwmSAcBEtr3vqJpTjWQS6AyEmdKNvzh/8JCH1h7UM
 fBo4+LIEylzmZXbqDrZNwXh31tELoTCR9Ur3pTCEO3Yfg9npTLWmvKs+tAgO/282
 IOjZE0N/ZtzPJ6Cgr+2efBGd+id81HXh+H8gWo35Dyx3EH2k44FHwQ3rW2NKOVzo
 10eSbswYpjk3gi/6GxwC0lDqFi4Bk6ILvC6roqTghixBf7xThfY=
 =L5HS
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.14, take #2

- Large set of fixes for vector handling, specially in the interactions
  between host and guest state. This fixes a number of bugs affecting
  actual deployments, and greatly simplifies the FP/SIMD/SVE handling.
  Thanks to Mark Rutland for dealing with this thankless task.

- Fix an ugly race between vcpu and vgic creation/init, resulting in
  unexpected behaviours.

- Fix use of kernel VAs at EL2 when emulating timers with nVHE.

- Small set of pKVM improvements and cleanups.
2025-02-14 18:32:47 -05:00
Marc Zyngier
b3aa9283c0 KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()
If userspace creates vcpus, then a vgic, we end-up in a situation
where irqchip_in_kernel() will return true, but no private interrupt
has been allocated for these vcpus. This situation will continue
until userspace initialises the vgic, at which point we fix the
early vcpus. Should a vcpu run or be initialised in the interval,
bad things may happen.

An obvious solution is to move this fix-up phase to the point where
the vgic is created. This ensures that from that point onwards,
all vcpus have their private interrupts, as new vcpus will directly
allocate them.

With that, we have the invariant that when irqchip_in_kernel() is
true, all vcpus have their private interrupts.

Reported-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250212182558.2865232-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 18:03:54 +00:00
Marc Zyngier
e6e3e0022e KVM: arm64: timer: Drop warning on failed interrupt signalling
We currently spit out a warning if making a timer interrupt pending
fails. But not only this is loud and easy to trigger from userspace,
we also fail to do anything useful with that information.

Dropping the warning is the easiest thing to do for now. We can
always add error reporting if we really want in the future.

Reported-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250212182558.2865232-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 18:03:54 +00:00
Marc Zyngier
65729da9ce KVM: arm64: Convert timer offset VA when accessed in HYP code
Now that EL2 has gained some early timer emulation, it accesses
the offsets pointed to by the timer structure, both of which
live in the KVM structure.

Of course, these are *kernel* pointers, so the dereferencing
of these pointers in non-kernel code must be itself be offset.

Given switch.h its own version of timer_get_offset() and use that
instead.

Fixes: b86fc215dc ("KVM: arm64: Handle counter access early in non-HYP context")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Link: https://lore.kernel.org/r/20250212173454.2864462-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:55:13 +00:00
Mark Rutland
332b7e6d62 KVM: arm64: Simplify warning in kvm_arch_vcpu_load_fp()
At the end of kvm_arch_vcpu_load_fp() we check that no bits are set in
SVCR. We only check this for protected mode despite this mattering
equally for non-protected mode, and the comment above this is confusing.

Remove the comment and simplify the check, moving from WARN_ON() to
WARN_ON_ONCE() to avoid spamming the log.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:55:13 +00:00
Mark Rutland
59419f1004 KVM: arm64: Eagerly switch ZCR_EL{1,2}
In non-protected KVM modes, while the guest FPSIMD/SVE/SME state is live on the
CPU, the host's active SVE VL may differ from the guest's maximum SVE VL:

* For VHE hosts, when a VM uses NV, ZCR_EL2 contains a value constrained
  by the guest hypervisor, which may be less than or equal to that
  guest's maximum VL.

  Note: in this case the value of ZCR_EL1 is immaterial due to E2H.

* For nVHE/hVHE hosts, ZCR_EL1 contains a value written by the guest,
  which may be less than or greater than the guest's maximum VL.

  Note: in this case hyp code traps host SVE usage and lazily restores
  ZCR_EL2 to the host's maximum VL, which may be greater than the
  guest's maximum VL.

This can be the case between exiting a guest and kvm_arch_vcpu_put_fp().
If a softirq is taken during this period and the softirq handler tries
to use kernel-mode NEON, then the kernel will fail to save the guest's
FPSIMD/SVE state, and will pend a SIGKILL for the current thread.

This happens because kvm_arch_vcpu_ctxsync_fp() binds the guest's live
FPSIMD/SVE state with the guest's maximum SVE VL, and
fpsimd_save_user_state() verifies that the live SVE VL is as expected
before attempting to save the register state:

| if (WARN_ON(sve_get_vl() != vl)) {
|         force_signal_inject(SIGKILL, SI_KERNEL, 0, 0);
|         return;
| }

Fix this and make this a bit easier to reason about by always eagerly
switching ZCR_EL{1,2} at hyp during guest<->host transitions. With this
happening, there's no need to trap host SVE usage, and the nVHE/nVHE
__deactivate_cptr_traps() logic can be simplified to enable host access
to all present FPSIMD/SVE/SME features.

In protected nVHE/hVHE modes, the host's state is always saved/restored
by hyp, and the guest's state is saved prior to exit to the host, so
from the host's PoV the guest never has live FPSIMD/SVE/SME state, and
the host's ZCR_EL1 is never clobbered by hyp.

Fixes: 8c8010d69c ("KVM: arm64: Save/restore SVE state for nVHE")
Fixes: 2e3cf82063 ("KVM: arm64: nv: Ensure correct VL is loaded before saving SVE state")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-9-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:55:06 +00:00
Mark Rutland
f9dd00de1e KVM: arm64: Mark some header functions as inline
The shared hyp switch header has a number of static functions which
might not be used by all files that include the header, and when unused
they will provoke compiler warnings, e.g.

| In file included from arch/arm64/kvm/hyp/nvhe/hyp-main.c:8:
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:703:13: warning: 'kvm_hyp_handle_dabt_low' defined but not used [-Wunused-function]
|   703 | static bool kvm_hyp_handle_dabt_low(struct kvm_vcpu *vcpu, u64 *exit_code)
|       |             ^~~~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:682:13: warning: 'kvm_hyp_handle_cp15_32' defined but not used [-Wunused-function]
|   682 | static bool kvm_hyp_handle_cp15_32(struct kvm_vcpu *vcpu, u64 *exit_code)
|       |             ^~~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:662:13: warning: 'kvm_hyp_handle_sysreg' defined but not used [-Wunused-function]
|   662 | static bool kvm_hyp_handle_sysreg(struct kvm_vcpu *vcpu, u64 *exit_code)
|       |             ^~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:458:13: warning: 'kvm_hyp_handle_fpsimd' defined but not used [-Wunused-function]
|   458 | static bool kvm_hyp_handle_fpsimd(struct kvm_vcpu *vcpu, u64 *exit_code)
|       |             ^~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:329:13: warning: 'kvm_hyp_handle_mops' defined but not used [-Wunused-function]
|   329 | static bool kvm_hyp_handle_mops(struct kvm_vcpu *vcpu, u64 *exit_code)
|       |             ^~~~~~~~~~~~~~~~~~~

Mark these functions as 'inline' to suppress this warning. This
shouldn't result in any functional change.

At the same time, avoid the use of __alias() in the header and alias
kvm_hyp_handle_iabt_low() and kvm_hyp_handle_watchpt_low() to
kvm_hyp_handle_memory_fault() using CPP, matching the style in the rest
of the kernel. For consistency, kvm_hyp_handle_memory_fault() is also
marked as 'inline'.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-8-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:55:03 +00:00
Mark Rutland
9b66195063 KVM: arm64: Refactor exit handlers
The hyp exit handling logic is largely shared between VHE and nVHE/hVHE,
with common logic in arch/arm64/kvm/hyp/include/hyp/switch.h. The code
in the header depends on function definitions provided by
arch/arm64/kvm/hyp/vhe/switch.c and arch/arm64/kvm/hyp/nvhe/switch.c
when they include the header.

This is an unusual header dependency, and prevents the use of
arch/arm64/kvm/hyp/include/hyp/switch.h in other files as this would
result in compiler warnings regarding missing definitions, e.g.

| In file included from arch/arm64/kvm/hyp/nvhe/hyp-main.c:8:
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:733:31: warning: 'kvm_get_exit_handler_array' used but never defined
|   733 | static const exit_handler_fn *kvm_get_exit_handler_array(struct kvm_vcpu *vcpu);
|       |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~
| ./arch/arm64/kvm/hyp/include/hyp/switch.h:735:13: warning: 'early_exit_filter' used but never defined
|   735 | static void early_exit_filter(struct kvm_vcpu *vcpu, u64 *exit_code);
|       |             ^~~~~~~~~~~~~~~~~

Refactor the logic such that the header doesn't depend on anything from
the C files. There should be no functional change as a result of this
patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-7-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:55:00 +00:00
Mark Rutland
ee14db31a9 KVM: arm64: Refactor CPTR trap deactivation
For historical reasons, the VHE and nVHE/hVHE implementations of
__activate_cptr_traps() pair with a common implementation of
__kvm_reset_cptr_el2(), which ideally would be named
__deactivate_cptr_traps().

Rename __kvm_reset_cptr_el2() to __deactivate_cptr_traps(), and split it
into separate VHE and nVHE/hVHE variants so that each can be paired with
its corresponding implementation of __activate_cptr_traps().

At the same time, fold kvm_write_cptr_el2() into its callers. This
makes it clear in-context whether a write is made to the CPACR_EL1
encoding or the CPTR_EL2 encoding, and removes the possibility of
confusion as to whether kvm_write_cptr_el2() reformats the sysreg fields
as cpacr_clear_set() does.

In the nVHE/hVHE implementation of __activate_cptr_traps(), placing the
sysreg writes within the if-else blocks requires that the call to
__activate_traps_fpsimd32() is moved earlier, but as this was always
called before writing to CPTR_EL2/CPACR_EL1, this should not result in a
functional change.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-6-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:54:57 +00:00
Mark Rutland
407a99c465 KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN
When KVM is in VHE mode, the host kernel tries to save and restore the
configuration of CPACR_EL1.SMEN (i.e. CPTR_EL2.SMEN when HCR_EL2.E2H=1)
across kvm_arch_vcpu_load_fp() and kvm_arch_vcpu_put_fp(), since the
configuration may be clobbered by hyp when running a vCPU. This logic
has historically been broken, and is currently redundant.

This logic was originally introduced in commit:

  861262ab86 ("KVM: arm64: Handle SME host state when running guests")

At the time, the VHE hyp code would reset CPTR_EL2.SMEN to 0b00 when
returning to the host, trapping host access to SME state. Unfortunately,
this was unsafe as the host could take a softirq before calling
kvm_arch_vcpu_put_fp(), and if a softirq handler were to use kernel mode
NEON the resulting attempt to save the live FPSIMD/SVE/SME state would
result in a fatal trap.

That issue was limited to VHE mode. For nVHE/hVHE modes, KVM always
saved/restored the host kernel's CPACR_EL1 value, and configured
CPTR_EL2.TSM to 0b0, ensuring that host usage of SME would not be
trapped.

The issue above was incidentally fixed by commit:

  375110ab51 ("KVM: arm64: Fix resetting SME trap values on reset for (h)VHE")

That commit changed the VHE hyp code to configure CPTR_EL2.SMEN to 0b01
when returning to the host, permitting host kernel usage of SME,
avoiding the issue described above. At the time, this was not identified
as a fix for commit 861262ab86.

Now that the host eagerly saves and unbinds its own FPSIMD/SVE/SME
state, there's no need to save/restore the state of the EL0 SME trap.
The kernel can safely save/restore state without trapping, as described
above, and will restore userspace state (including trap controls) before
returning to userspace.

Remove the redundant logic.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-5-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:54:54 +00:00
Mark Rutland
459f059be7 KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
When KVM is in VHE mode, the host kernel tries to save and restore the
configuration of CPACR_EL1.ZEN (i.e. CPTR_EL2.ZEN when HCR_EL2.E2H=1)
across kvm_arch_vcpu_load_fp() and kvm_arch_vcpu_put_fp(), since the
configuration may be clobbered by hyp when running a vCPU. This logic is
currently redundant.

The VHE hyp code unconditionally configures CPTR_EL2.ZEN to 0b01 when
returning to the host, permitting host kernel usage of SVE.

Now that the host eagerly saves and unbinds its own FPSIMD/SVE/SME
state, there's no need to save/restore the state of the EL0 SVE trap.
The kernel can safely save/restore state without trapping, as described
above, and will restore userspace state (including trap controls) before
returning to userspace.

Remove the redundant logic.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-4-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:54:51 +00:00
Mark Rutland
8eca7f6d51 KVM: arm64: Remove host FPSIMD saving for non-protected KVM
Now that the host eagerly saves its own FPSIMD/SVE/SME state,
non-protected KVM never needs to save the host FPSIMD/SVE/SME state,
and the code to do this is never used. Protected KVM still needs to
save/restore the host FPSIMD/SVE state to avoid leaking guest state to
the host (and to avoid revealing to the host whether the guest used
FPSIMD/SVE/SME), and that code needs to be retained.

Remove the unused code and data structures.

To avoid the need for a stub copy of kvm_hyp_save_fpsimd_host() in the
VHE hyp code, the nVHE/hVHE version is moved into the shared switch
header, where it is only invoked when KVM is in protected mode.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-3-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:54:44 +00:00
Mark Rutland
fbc7e61195 KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state
There are several problems with the way hyp code lazily saves the host's
FPSIMD/SVE state, including:

* Host SVE being discarded unexpectedly due to inconsistent
  configuration of TIF_SVE and CPACR_ELx.ZEN. This has been seen to
  result in QEMU crashes where SVE is used by memmove(), as reported by
  Eric Auger:

  https://issues.redhat.com/browse/RHEL-68997

* Host SVE state is discarded *after* modification by ptrace, which was an
  unintentional ptrace ABI change introduced with lazy discarding of SVE state.

* The host FPMR value can be discarded when running a non-protected VM,
  where FPMR support is not exposed to a VM, and that VM uses
  FPSIMD/SVE. In these cases the hyp code does not save the host's FPMR
  before unbinding the host's FPSIMD/SVE/SME state, leaving a stale
  value in memory.

Avoid these by eagerly saving and "flushing" the host's FPSIMD/SVE/SME
state when loading a vCPU such that KVM does not need to save any of the
host's FPSIMD/SVE/SME state. For clarity, fpsimd_kvm_prepare() is
removed and the necessary call to fpsimd_save_and_flush_cpu_state() is
placed in kvm_arch_vcpu_load_fp(). As 'fpsimd_state' and 'fpmr_ptr'
should not be used, they are set to NULL; all uses of these will be
removed in subsequent patches.

Historical problems go back at least as far as v5.17, e.g. erroneous
assumptions about TIF_SVE being clear in commit:

  8383741ab2 ("KVM: arm64: Get rid of host SVE tracking/saving")

... and so this eager save+flush probably needs to be backported to ALL
stable trees.

Fixes: 93ae6b01ba ("KVM: arm64: Discard any SVE state when entering KVM guests")
Fixes: 8c845e2731 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
Fixes: ef3be86021 ("KVM: arm64: Add save/restore support for FPMR")
Reported-by: Eric Auger <eauger@redhat.com>
Reported-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Tested-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-2-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:54:25 +00:00
Marc Zyngier
8dbccafce3 KVM: arm64: Fix __pkvm_host_mkyoung_guest() return value
Don't use an uninitialised stack variable, and just return 0
on the non-error path.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202502100911.8c9DbtKD-lkp@intel.com/
Reviewed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-10 09:58:16 +00:00
Quentin Perret
eabc7aaef7 KVM: arm64: Simplify np-guest hypercalls
When the handling of a guest stage-2 permission fault races with an MMU
notifier, the faulting page might be gone from the guest's stage-2 by
the point we attempt to call (p)kvm_pgtable_stage2_relax_perms(). In the
normal KVM case, this leads to returning -EAGAIN which user_mem_abort()
handles correctly by simply re-entering the guest. However, the pKVM
hypercall implementation has additional logic to check the page state
using __check_host_shared_guest() which gets confused with absence of a
page mapped at the requested IPA and returns -ENOENT, hence breaking
user_mem_abort() and hilarity ensues.

Luckily, several of the hypercalls for managing the stage-2 page-table
of NP guests have no effect on the pKVM ownership tracking (wrprotect,
test_clear_young, mkyoung, and crucially relax_perms), so the extra
state checking logic is in fact not strictly necessary. So, to fix the
discrepancy between standard KVM and pKVM, let's just drop the
superfluous __check_host_shared_guest() logic from those hypercalls and
make the extra state checking a debug assertion dependent on
CONFIG_NVHE_EL2_DEBUG as we already do for other transitions.

Signed-off-by: Quentin Perret <qperret@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250207145438.1333475-3-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-09 10:20:38 +00:00
Quentin Perret
c53fbdb60f KVM: arm64: Improve error handling from check_host_shared_guest()
The check_host_shared_guest() path expects to find a last-level valid
PTE in the guest's stage-2 page-table. However, it checks the PTE's
level before its validity, which makes it hard for callers to figure out
what went wrong.

To make error handling simpler, check the PTE's validity first.

Signed-off-by: Quentin Perret <qperret@google.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250207145438.1333475-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-09 10:20:38 +00:00
Paolo Bonzini
5e21d0c5b9 KVM/arm64 fixes for 6.14, take #1
- Correctly clean the BSS to the PoC before allowing EL2 to access it
   on nVHE/hVHE/protected configurations
 
 - Propagate ownership of debug registers in protected mode after
   the rework that landed in 6.14-rc1
 
 - Stop pretending that we can run the protected mode without a GICv3
   being present on the host
 
 - Fix a use-after-free situation that can occur if a vcpu fails to
   initialise the NV shadow S2 MMU contexts
 
 - Always evaluate the need to arm a background timer for fully emulated
   guest timers
 
 - Fix the emulation of EL1 timers in the absence of FEAT_ECV
 
 - Correctly handle the EL2 virtual timer, specially when HCR_EL2.E2H==0
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmeiNZsACgkQI9DQutE9
 ekOmiQ//WsCDESxdg+kezCCmx9yHo8CM5tsOqzcDoIz+KV5aiLB6KlegGbciC9MZ
 Fgj8lBr+Mu2GGEhYP/H6glqMLB8VPLiGYzwbeBE8ty2IH5Nn6M9hiKFWTZJVpGma
 gNDtFVSO39d39gtklQ4Y3SYSlN0dN2wW47pXxlZiY/aD9GRie7z2m1yC3T05xVpR
 TT7XAYjucfx0Cj8vK9MaJyyy69TXsqocdAxZI+ZCp+NI1yuBv7sunjDdfPcp9n4G
 Lk+Ikn39FC3qlrjZ38eHidMgY6LQvucrwtlvDuPp7yMuNYTXW6UEIU7RUF1AUts4
 IsKKvENV0zLsmtoEAgSa4d9Rtid3eomyll6mb3hBF+1GSBf7+v+sgKdpoB7/mOyn
 exGEbYCOXghXswhF9up2YC6jI8XtcO/2nTt6PdS2nMK61mvCb31u6xFvU0yTEMQW
 dEDVTcIKmvKrAeVzNOBNNjZj0bGNns0P5+oLFbkOzYj8qNueYS0Y9o996kqODET2
 d4J8xib6Mh0cr6usgj0Z8Jp+7vl9Li9KR0SSQqiQyOQcu5St0zidJeCf7J46HCR7
 5d8sQi6lYZ+G+9rm0X7Rl8+yBaUhrUs9v+jOAip3M+1fIxIWr9DyDUcQQKv+obW0
 KqyMWx4Fv+bDqBqTRdd1r6KQEpLL0DbeWinNrdJdVZkmLyI1Y98=
 =idVr
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.14, take #1

- Correctly clean the BSS to the PoC before allowing EL2 to access it
  on nVHE/hVHE/protected configurations

- Propagate ownership of debug registers in protected mode after
  the rework that landed in 6.14-rc1

- Stop pretending that we can run the protected mode without a GICv3
  being present on the host

- Fix a use-after-free situation that can occur if a vcpu fails to
  initialise the NV shadow S2 MMU contexts

- Always evaluate the need to arm a background timer for fully emulated
  guest timers

- Fix the emulation of EL1 timers in the absence of FEAT_ECV

- Correctly handle the EL2 virtual timer, specially when HCR_EL2.E2H==0
2025-02-04 11:14:53 -05:00
Marc Zyngier
0e45981028 KVM: arm64: timer: Don't adjust the EL2 virtual timer offset
The way we deal with the EL2 virtual timer is a bit odd.

We try to cope with E2H being flipped, and adjust which offset
applies to that timer depending on the current E2H value. But that's
a complexity we shouldn't have to worry about.

What we have to deal with is either E2H being RES1, in which case
there is no offset, or E2H being RES0, and the virtual timer simply
does not exist.

Drop the adjusting of the timer offset, which makes things a bit
simpler. At the same time, make sure that accessing the HV timer
when E2H is RES0 results in an UNDEF in the guest.

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250204110050.150560-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-04 15:10:38 +00:00
Marc Zyngier
1b8705ad53 KVM: arm64: timer: Correctly handle EL1 timer emulation when !FEAT_ECV
Both Wei-Lin Chang and Volodymyr Babchuk report that the way we
handle the emulation of EL1 timers with NV is completely wrong,
specially in the case of HCR_EL2.E2H==0.

There are three problems in about as many lines of code:

- With E2H==0, the EL1 timers are overwritten with the EL1 state,
  while they should actually contain the EL2 state (as per the timer
  map)

- With E2H==1, we run the full EL1 timer emulation even when ECV
  is present, hiding a bug in timer_emulate() (see previous patch)

- The comments are actively misleading, and say all the wrong things.

This is only attributable to the code having been initially written
for FEAT_NV, hacked up to handle FEAT_NV2 *in parallel*, and vaguely
hacked again to be FEAT_NV2 only. Oh, and yours truly being a gold
plated idiot.

The fix is obvious: just delete most of the E2H==0 code, have a unified
handling of the timers (because they really are E2H agnostic), and
make sure we don't execute any of that when FEAT_ECV is present.

Fixes: 4bad3068cf ("KVM: arm64: nv: Sync nested timer state with FEAT_NV2")
Reported-by: Wei-Lin Chang <r09922117@csie.ntu.edu.tw>
Reported-by: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
Link: https://lore.kernel.org/r/fqiqfjzwpgbzdtouu2pwqlu7llhnf5lmy4hzv5vo6ph4v3vyls@jdcfy3fjjc5k
Link: https://lore.kernel.org/r/87frl51tse.fsf@epam.com
Tested-by: Dmytro Terletskyi <dmytro_terletskyi@epam.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250204110050.150560-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-04 15:10:38 +00:00
Marc Zyngier
b450dcce93 KVM: arm64: timer: Always evaluate the need for a soft timer
When updating the interrupt state for an emulated timer, we return
early and skip the setup of a soft timer that runs in parallel
with the guest.

While this is OK if we have set the interrupt pending, it is pretty
wrong if the guest moved CVAL into the future.  In that case,
no timer is armed and the guest can wait for a very long time
(it will take a full put/load cycle for the situation to resolve).

This is specially visible with EDK2 running at EL2, but still
using the EL1 virtual timer, which in that case is fully emulated.
Any key-press takes ages to be captured, as there is no UART
interrupt and EDK2 relies on polling from a timer...

The fix is simply to drop the early return. If the timer interrupt
is pending, we will still return early, and otherwise arm the soft
timer.

Fixes: 4d74ecfa64 ("KVM: arm64: Don't arm a hrtimer for an already pending timer")
Cc: stable@vger.kernel.org
Tested-by: Dmytro Terletskyi <dmytro_terletskyi@epam.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250204110050.150560-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-04 15:10:38 +00:00
Marc Zyngier
5417a2e9b1 KVM: arm64: Fix nested S2 MMU structures reallocation
For each vcpu that userspace creates, we allocate a number of
s2_mmu structures that will eventually contain our shadow S2
page tables.

Since this is a dynamically allocated array, we reallocate
the array and initialise the newly allocated elements. Once
everything is correctly initialised, we adjust pointer and size
in the kvm structure, and move on.

But should that initialisation fail *and* the reallocation triggered
a copy to another location, we end-up returning early, with the
kvm structure still containing the (now stale) old pointer. Weeee!

Cure it by assigning the pointer early, and use this to perform
the initialisation. If everything succeeds, we adjust the size.
Otherwise, we just leave the size as it was, no harm done, and the
new memory is as good as the ol' one (we hope...).

Fixes: 4f128f8e1a ("KVM: arm64: nv: Support multiple nested Stage-2 mmu structures")
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20250204145554.774427-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-04 15:02:16 +00:00
Oliver Upton
32392e04cb KVM: arm64: Fail protected mode init if no vgic hardware is present
Protected mode assumes that at minimum vgic-v3 is present, however KVM
fails to actually enforce this at the time of initialization. As such,
when running protected mode in a half-baked state on GICv2 hardware we
see the hyp go belly up at vcpu_load() when it tries to restore the
vgic-v3 cpuif:

  $ ./arch_timer_edge_cases
  [  130.599140] kvm [4518]: nVHE hyp panic at: [<ffff800081102b58>] __kvm_nvhe___vgic_v3_restore_vmcr_aprs+0x8/0x84!
  [  130.603685] kvm [4518]: Cannot dump pKVM nVHE stacktrace: !CONFIG_PROTECTED_NVHE_STACKTRACE
  [  130.611962] kvm [4518]: Hyp Offset: 0xfffeca95ed000000
  [  130.617053] Kernel panic - not syncing: HYP panic:
  [  130.617053] PS:800003c9 PC:0000b56a94102b58 ESR:0000000002000000
  [  130.617053] FAR:ffff00007b98d4d0 HPFAR:00000000007b98d0 PAR:0000000000000000
  [  130.617053] VCPU:0000000000000000
  [  130.638013] CPU: 0 UID: 0 PID: 4518 Comm: arch_timer_edge Tainted: G         C         6.13.0-rc3-00009-gf7d03fcbf1f4 #1
  [  130.648790] Tainted: [C]=CRAP
  [  130.651721] Hardware name: Libre Computer AML-S905X-CC (DT)
  [  130.657242] Call trace:
  [  130.659656]  show_stack+0x18/0x24 (C)
  [  130.663279]  dump_stack_lvl+0x38/0x90
  [  130.666900]  dump_stack+0x18/0x24
  [  130.670178]  panic+0x388/0x3e8
  [  130.673196]  nvhe_hyp_panic_handler+0x104/0x208
  [  130.677681]  kvm_arch_vcpu_load+0x290/0x548
  [  130.681821]  vcpu_load+0x50/0x80
  [  130.685013]  kvm_arch_vcpu_ioctl_run+0x30/0x868
  [  130.689498]  kvm_vcpu_ioctl+0x2e0/0x974
  [  130.693293]  __arm64_sys_ioctl+0xb4/0xec
  [  130.697174]  invoke_syscall+0x48/0x110
  [  130.700883]  el0_svc_common.constprop.0+0x40/0xe0
  [  130.705540]  do_el0_svc+0x1c/0x28
  [  130.708818]  el0_svc+0x30/0xd0
  [  130.711837]  el0t_64_sync_handler+0x10c/0x138
  [  130.716149]  el0t_64_sync+0x198/0x19c
  [  130.719774] SMP: stopping secondary CPUs
  [  130.723660] Kernel Offset: disabled
  [  130.727103] CPU features: 0x000,00000800,02800000,0200421b
  [  130.732537] Memory Limit: none
  [  130.735561] ---[ end Kernel panic - not syncing: HYP panic:
  [  130.735561] PS:800003c9 PC:0000b56a94102b58 ESR:0000000002000000
  [  130.735561] FAR:ffff00007b98d4d0 HPFAR:00000000007b98d0 PAR:0000000000000000
  [  130.735561] VCPU:0000000000000000 ]---

Fix it by failing KVM initialization if the system doesn't implement
vgic-v3, as protected mode will never do anything useful on such
hardware.

Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/kvmarm/5ca7588c-7bf2-4352-8661-e4a56a9cd9aa@sirena.org.uk/
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250203231543.233511-1-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-04 10:49:23 +00:00
Oliver Upton
0f1a6c5c97 KVM: arm64: Flush/sync debug state in protected mode
The recent changes to debug state management broke self-hosted debug for
guests when running in protected mode, since both the debug owner and
the debug state itself aren't shared with the hyp's view of the vcpu.

Fix it by flushing/syncing the relevant bits with the hyp vcpu.

Fixes: beb470d96c ("KVM: arm64: Use debug_owner to track if debug regs need save/restore")
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/kvmarm/5f62740f-a065-42d9-9f56-8fb648b9c63f@sirena.org.uk/
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250131222922.1548780-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-01 09:31:11 +00:00
Linus Torvalds
e2ee2e9b15 KVM/arm64 updates for 6.14
* New features:
 
   - Support for non-protected guest in protected mode, achieving near
     feature parity with the non-protected mode
 
   - Support for the EL2 timers as part of the ongoing NV support
 
   - Allow control of hardware tracing for nVHE/hVHE
 
 * Improvements, fixes and cleanups:
 
   - Massive cleanup of the debug infrastructure, making it a bit less
     awkward and definitely easier to maintain. This should pave the
     way for further optimisations
 
   - Complete rewrite of pKVM's fixed-feature infrastructure, aligning
     it with the rest of KVM and making the code easier to follow
 
   - Large simplification of pKVM's memory protection infrastructure
 
   - Better handling of RES0/RES1 fields for memory-backed system
     registers
 
   - Add a workaround for Qualcomm's Snapdragon X CPUs, which suffer
     from a pretty nasty timer bug
 
   - Small collection of cleanups and low-impact fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmeYqJcQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNLUhCACxUTMVQXhfW3qbh0UQxPd7XXvjI+Hm7SPS
 wDuVTle4jrFVGHxuZqtgWLmx8hD7bqO965qmFgbevKlwsRY33onH2nbH4i4AcwbA
 jcdM4yMHZI4+Qmnb4G5ZJ89IwjAhHPZTBOV5KRhyHQ/qtRciHHtOgJde7II9fd68
 uIESg4SSSyUzI47YSEHmGVmiBIhdQhq2qust0m6NPFalEGYstPbpluPQ6R1CsDqK
 v14TIAW7t0vSPucBeODxhA5gEa2JsvNi+sqA+DF/ELH2ZqpkuR7rofgMGblaXCSD
 JXa5xamRB9dI5zi8vatwfOzYlog+/gzmPqMh/9JXpiDGHxJe0vlz
 =tQ8F
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull KVM/arm64 updates from Will Deacon:
 "New features:

   - Support for non-protected guest in protected mode, achieving near
     feature parity with the non-protected mode

   - Support for the EL2 timers as part of the ongoing NV support

   - Allow control of hardware tracing for nVHE/hVHE

  Improvements, fixes and cleanups:

   - Massive cleanup of the debug infrastructure, making it a bit less
     awkward and definitely easier to maintain. This should pave the way
     for further optimisations

   - Complete rewrite of pKVM's fixed-feature infrastructure, aligning
     it with the rest of KVM and making the code easier to follow

   - Large simplification of pKVM's memory protection infrastructure

   - Better handling of RES0/RES1 fields for memory-backed system
     registers

   - Add a workaround for Qualcomm's Snapdragon X CPUs, which suffer
     from a pretty nasty timer bug

   - Small collection of cleanups and low-impact fixes"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (87 commits)
  arm64/sysreg: Get rid of TRFCR_ELx SysregFields
  KVM: arm64: nv: Fix doc header layout for timers
  KVM: arm64: nv: Apply RESx settings to sysreg reset values
  KVM: arm64: nv: Always evaluate HCR_EL2 using sanitising accessors
  KVM: arm64: Fix selftests after sysreg field name update
  coresight: Pass guest TRFCR value to KVM
  KVM: arm64: Support trace filtering for guests
  KVM: arm64: coresight: Give TRBE enabled state to KVM
  coresight: trbe: Remove redundant disable call
  arm64/sysreg/tools: Move TRFCR definitions to sysreg
  tools: arm64: Update sysreg.h header files
  KVM: arm64: Drop pkvm_mem_transition for host/hyp donations
  KVM: arm64: Drop pkvm_mem_transition for host/hyp sharing
  KVM: arm64: Drop pkvm_mem_transition for FF-A
  KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
  KVM: arm64: vgic: Use str_enabled_disabled() in vgic_v3_probe()
  arm64: kvm: Introduce nvhe stack size constants
  KVM: arm64: Fix nVHE stacktrace VA bits mask
  KVM: arm64: Fix FEAT_MTE in pKVM
  Documentation: Update the behaviour of "kvm-arm.mode"
  ...
2025-01-28 09:01:36 -08:00
Lokesh Vutla
9bcbb6104a KVM: arm64: Flush hyp bss section after initialization of variables in bss
To determine CPU features during initialization, the nVHE hypervisor
utilizes sanitized values of the host's CPU features registers. These
values, stored in u64 idaa64*_el1_sys_val variables are updated by the
kvm_hyp_init_symbols() function at EL1. To ensure EL2 visibility with
the MMU off, the data cache needs to be flushed after these updates.
However, individually flushing each variable using
kvm_flush_dcache_to_poc() is inefficient.

These cpu feature variables would be part of the bss section of
the hypervisor. Hence, flush the entire bss section of hypervisor
once the initialization is complete.

Fixes: 6c30bfb18d ("KVM: arm64: Add handlers for protected VM System Registers")
Suggested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Lokesh Vutla <lokeshvutla@google.com>
Link: https://lore.kernel.org/r/20250121044016.2219256-1-lokeshvutla@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-21 08:32:53 +00:00
Linus Torvalds
9ad09c4f28 arm64 updates for 6.14
Confidential Computing:
 * Register a platform device when running in CCA realm mode to enable
   automatic loading of dependent modules.
 
 CPU Features:
 * Update a bunch of system register definitions to pick up new field
   encodings from the architectural documentation.
 
 * Add hwcaps and selftests for the new (2024) dpISA extensions.
 
 Documentation:
 * Update EL3 (firmware) requirements for booting Linux on modern arm64
   designs.
 
 * Remove stale information about the kernel virtual memory map.
 
 Miscellaneous:
 * Minor cleanups and typo fixes.
 
 Memory management:
 * Fix vmemmap_check_pmd() to look at the PMD type bits
 
 * LPA2 (52-bit physical addressing) cleanups and minor fixes.
 
 * Adjust physical address space depending upon whether or not LPA2 is
   enabled.
 
 Perf and PMUs:
 * Add port filtering support for NVIDIA's NVLINK-C2C Coresight PMU
 
 * Extend AXI filtering support for the DDR PMU on NXP IMX SoCs
 
 * Fix Designware PCIe PMU event numbering.
 
 * Add generic branch events for the Apple M1 CPU PMU.
 
 * Add support for Marvell Odyssey DDR and LLC-TAD PMUs.
 
 * Cleanups to the Hisilicon DDRC and Uncore PMU code.
 
 * Advertise discard mode for the SPE PMU.
 
 * Add the perf users mailing list to our MAINTAINERS entry.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmeKZLcQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNEQzB/0X2U89ZiqxIkTPQvfFrjN/uUGybkq59rEL
 DfeoGukTgJIwc3GHWXXtQ//wuuYKdTeCXaIz5NFK3+7/wmKSLvjkexmue8pta6EY
 5rx9bAPr/D8lAUvhKIN2l3pF/ygoRwDz+nT2yVQ1xlZxYJWX7ZIsMj7W7ceb5kdx
 HRrTSQuhEEPREAWWO4oCMWl5SQZSrIflSE3Be/PsP0OhW6k//ZmWbcJTgUcHbKam
 o2WtNjITyGzxMpRCcrGEZKoe9YcwSxiut/PoD7JuoB4C/rbsf1cdJ6uLmtvGJcZj
 qsdRHhVfBzP1+ahONrDbiT3C2+s1UZySKdCDIxiYy6lB39wpP0dd
 =E7Mf
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "We've got a little less than normal thanks to the holidays in
  December, but there's the usual summary below. The highlight is
  probably the 52-bit physical addressing (LPA2) clean-up from Ard.

  Confidential Computing:

   - Register a platform device when running in CCA realm mode to enable
     automatic loading of dependent modules

  CPU Features:

   - Update a bunch of system register definitions to pick up new field
     encodings from the architectural documentation

   - Add hwcaps and selftests for the new (2024) dpISA extensions

  Documentation:

   - Update EL3 (firmware) requirements for booting Linux on modern
     arm64 designs

   - Remove stale information about the kernel virtual memory map

  Miscellaneous:

   - Minor cleanups and typo fixes

  Memory management:

   - Fix vmemmap_check_pmd() to look at the PMD type bits

   - LPA2 (52-bit physical addressing) cleanups and minor fixes

   - Adjust physical address space depending upon whether or not LPA2 is
     enabled

  Perf and PMUs:

   - Add port filtering support for NVIDIA's NVLINK-C2C Coresight PMU

   - Extend AXI filtering support for the DDR PMU on NXP IMX SoCs

   - Fix Designware PCIe PMU event numbering

   - Add generic branch events for the Apple M1 CPU PMU

   - Add support for Marvell Odyssey DDR and LLC-TAD PMUs

   - Cleanups to the Hisilicon DDRC and Uncore PMU code

   - Advertise discard mode for the SPE PMU

   - Add the perf users mailing list to our MAINTAINERS entry"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits)
  Documentation: arm64: Remove stale and redundant virtual memory diagrams
  perf docs: arm_spe: Document new discard mode
  perf: arm_spe: Add format option for discard mode
  MAINTAINERS: Add perf list for drivers/perf/
  arm64: Remove duplicate included header
  drivers/perf: apple_m1: Map generic branch events
  arm64: rsi: Add automatic arm-cca-guest module loading
  kselftest/arm64: Add 2024 dpISA extensions to hwcap test
  KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
  arm64/hwcap: Describe 2024 dpISA extensions to userspace
  arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-12
  arm64: Filter out SVE hwcaps when FEAT_SVE isn't implemented
  drivers/perf: hisi: Set correct IRQ affinity for PMUs with no association
  arm64/sme: Move storage of reg_smidr to __cpuinfo_store_cpu()
  arm64: mm: Test for pmd_sect() in vmemmap_check_pmd()
  arm64/mm: Replace open encodings with PXD_TABLE_BIT
  arm64/mm: Rename pte_mkpresent() as pte_mkvalid()
  arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
  ...
2025-01-20 21:21:49 -08:00
Will Deacon
602ffd4ce3 Merge branch 'for-next/mm' into for-next/core
* for-next/mm:
  arm64: mm: Test for pmd_sect() in vmemmap_check_pmd()
  arm64/mm: Replace open encodings with PXD_TABLE_BIT
  arm64/mm: Rename pte_mkpresent() as pte_mkvalid()
  arm64: Kconfig: force ARM64_PAN=y when enabling TTBR0 sw PAN
  arm64/kvm: Avoid invalid physical addresses to signal owner updates
  arm64/kvm: Configure HYP TCR.PS/DS based on host stage1
  arm64/mm: Override PARange for !LPA2 and use it consistently
  arm64/mm: Reduce PA space to 48 bits when LPA2 is not enabled
2025-01-17 13:52:33 +00:00
Marc Zyngier
fa5e4043e9 Merge branch kvm-arm64/misc-6.14 into kvmarm-master/next
* kvm-arm64/misc-6.14:
  : .
  : Misc KVM/arm64 changes for 6.14
  :
  : - Don't expose AArch32 EL0 capability when NV is enabled
  :
  : - Update documentation to reflect the full gamut of kvm-arm.mode
  :   behaviours
  :
  : - Use the hypervisor VA bit width when dumping stacktraces
  :
  : - Decouple the hypervisor stack size from PAGE_SIZE, at least
  :   on the surface...
  :
  : - Make use of str_enabled_disabled() when advertising GICv4.1 support
  :
  : - Explicitly handle BRBE traps as UNDEFINED
  : .
  KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
  KVM: arm64: vgic: Use str_enabled_disabled() in vgic_v3_probe()
  arm64: kvm: Introduce nvhe stack size constants
  KVM: arm64: Fix nVHE stacktrace VA bits mask
  Documentation: Update the behaviour of "kvm-arm.mode"
  KVM: arm64: nv: Advertise the lack of AArch32 EL0 support

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-17 11:06:50 +00:00
Marc Zyngier
3643b334aa Merge branch kvm-arm64/nv-resx-fixes-6.14 into kvmarm-master/next
* kvm-arm64/nv-resx-fixes-6.14:
  : .
  : Fixes for NV sysreg accessors. From the cover letter:
  :
  : "Joey recently reported that some rather basic tests were failing on
  : NV, and managed to track it down to critical register fields (such as
  : HCR_EL2.E2H) not having their expect value.
  :
  : Further investigation has outlined a couple of critical issues:
  :
  : - Evaluating HCR_EL2.E2H must always be done with a sanitising
  :   accessor, no ifs, no buts. Given that KVM assumes a fixed value for
  :   this bit, we cannot leave it to the guest to mess with.
  :
  : - Resetting the sysreg file must result in the RESx bits taking
  :   effect. Otherwise, we may end-up making the wrong decision (see
  :   above), and we definitely expose invalid values to the guest. Note
  :   that because we compute the RESx masks very late in the VM setup, we
  :   need to apply these masks at that particular point as well.
  : [...]"
  : .
  KVM: arm64: nv: Apply RESx settings to sysreg reset values
  KVM: arm64: nv: Always evaluate HCR_EL2 using sanitising accessors

Signed-off-by: Marc Zyngier <maz@kernel.org>

# Conflicts:
#	arch/arm64/kvm/nested.c
2025-01-17 11:06:33 +00:00
Marc Zyngier
946904e728 Merge branch kvm-arm64/coresight-6.14 into kvmarm-master/next
* kvm-arm64/coresight-6.14:
  : .
  : Trace filtering update from James Clark. From the cover letter:
  :
  : "The guest filtering rules from the Perf session are now honored for both
  : nVHE and VHE modes. This is done by either writing to TRFCR_EL12 at the
  : start of the Perf session and doing nothing else further, or caching the
  : guest value and writing it at guest switch for nVHE. In pKVM, trace is
  : now be disabled for both protected and unprotected guests."
  : .
  KVM: arm64: Fix selftests after sysreg field name update
  coresight: Pass guest TRFCR value to KVM
  KVM: arm64: Support trace filtering for guests
  KVM: arm64: coresight: Give TRBE enabled state to KVM
  coresight: trbe: Remove redundant disable call
  arm64/sysreg/tools: Move TRFCR definitions to sysreg
  tools: arm64: Update sysreg.h header files

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-17 11:05:44 +00:00
Marc Zyngier
5e68d2eeac Merge branch kvm-arm64/pkvm-memshare-declutter into kvmarm-master/next
* kvm-arm64/pkvm-memshare-declutter:
  : .
  : pKVM memory transition simplifications, courtesy of Quentin Perret.
  :
  : From the cover letter:
  : "Since its early days, pKVM has formalized memory 'transitions' (shares
  : and donations) using 'struct pkvm_mem_transition' and bunch of helpers
  : to manipulate it. The intention was for all transitions to use this
  : machinery to ensure we're checking things consistently. However, as
  : development progressed, it became clear that the rigidity of this model
  : made it really difficult to use in some use-cases which ended-up
  : side-stepping it entirely. That is the case for the
  : hyp_{un}pin_shared_mem() and host_{un}share_guest() paths upstream which
  : use lower level helpers directly, as well as for several other pKVM
  : features that should land upstream in the future (ex: when a guest
  : relinquishes a page during ballooning, when annotating a page that is
  : being DMA'd to, ...). On top of this, the pkvm_mem_transition machinery
  : requires a lot of boilerplate which makes the code hard to read, but
  : also adds layers of indirection that no compilers seems to see through,
  : hence leading to suboptimal generated code.
  :
  : Given all the above, this series removes the pkvm_mem_transition
  : machinery from mem_protect.c, and converts all its users to use
  : __*_{check,set}_page_state_range() low-level helpers directly."
  : .
  KVM: arm64: Drop pkvm_mem_transition for host/hyp donations
  KVM: arm64: Drop pkvm_mem_transition for host/hyp sharing
  KVM: arm64: Drop pkvm_mem_transition for FF-A
  KVM: arm64: Only apply PMCR_EL0.P to the guest range of counters
  KVM: arm64: nv: Reload PMU events upon MDCR_EL2.HPME change
  KVM: arm64: Use KVM_REQ_RELOAD_PMU to handle PMCR_EL0.E change
  KVM: arm64: Add unified helper for reprogramming counters by mask
  KVM: arm64: Always check the state from hyp_ack_unshare()
  KVM: arm64: Fix set_id_regs selftest for ASIDBITS becoming unwritable

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-17 11:05:18 +00:00
Marc Zyngier
080612b294 Merge branch kvm-arm64/nv-timers into kvmarm-master/next
* kvm-arm64/nv-timers:
  : .
  : Nested Virt support for the EL2 timers. From the initial cover letter:
  :
  : "Here's another batch of NV-related patches, this time bringing in most
  : of the timer support for EL2 as well as nested guests.
  :
  : The code is pretty convoluted for a bunch of reasons:
  :
  : - FEAT_NV2 breaks the timer semantics by redirecting HW controls to
  :   memory, meaning that a guest could setup a timer and never see it
  :   firing until the next exit
  :
  : - We go try hard to reflect the timer state in memory, but that's not
  :   great.
  :
  : - With FEAT_ECV, we can finally correctly emulate the virtual timer,
  :   but this emulation is pretty costly
  :
  : - As a way to make things suck less, we handle timer reads as early as
  :   possible, and only defer writes to the normal trap handling
  :
  : - Finally, some implementations are badly broken, and require some
  :   hand-holding, irrespective of NV support. So we try and reuse the NV
  :   infrastructure to make them usable. This could be further optimised,
  :   but I'm running out of patience for this sort of HW.
  :
  : [...]"
  : .
  KVM: arm64: nv: Fix doc header layout for timers
  KVM: arm64: nv: Document EL2 timer API
  KVM: arm64: Work around x1e's CNTVOFF_EL2 bogosity
  KVM: arm64: nv: Sanitise CNTHCTL_EL2
  KVM: arm64: nv: Propagate CNTHCTL_EL2.EL1NV{P,V}CT bits
  KVM: arm64: nv: Add trap routing for CNTHCTL_EL2.EL1{NVPCT,NVVCT,TVT,TVCT}
  KVM: arm64: Handle counter access early in non-HYP context
  KVM: arm64: nv: Accelerate EL0 counter accesses from hypervisor context
  KVM: arm64: nv: Accelerate EL0 timer read accesses when FEAT_ECV in use
  KVM: arm64: nv: Use FEAT_ECV to trap access to EL0 timers
  KVM: arm64: nv: Publish emulated timer interrupt state in the in-memory state
  KVM: arm64: nv: Sync nested timer state with FEAT_NV2
  KVM: arm64: nv: Add handling of EL2-specific timer registers

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-17 11:04:53 +00:00
Marc Zyngier
36f998de85 KVM: arm64: nv: Apply RESx settings to sysreg reset values
While we have sanitisation in place for the guest sysregs, we lack
that sanitisation out of reset. So some of the fields could be
evaluated and not reflect their RESx status, which sounds like
a very bad idea.

Apply the RESx masks to the the sysreg file in two situations:

- when going via a reset of the sysregs

- after having computed the RESx masks

Having this separate reset phase from the actual reset handling is
a bit grotty, but we need to apply this after the ID registers are
final.

Tested-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250112165029.1181056-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-14 11:33:09 +00:00
Marc Zyngier
c139b6d1b4 KVM: arm64: nv: Always evaluate HCR_EL2 using sanitising accessors
A lot of the NV code depends on HCR_EL2.{E2H,TGE}, and we assume
in places that at least HCR_EL2.E2H is invariant for a given guest.

However, we make a point in *not* using the sanitising accessor
that would enforce this, and are at the mercy of the guest doing
stupid things. Clearly, that's not good.

Rework the HCR_EL2 accessors to use __vcpu_sys_reg() instead,
guaranteeing that the RESx settings get applied, specially
when HCR_EL2.E2H is evaluated. This results in fewer accessors
overall.

Huge thanks to Joey who spent a long time tracking this bug down.

Reported-by: Joey Gouly <Joey.Gouly@arm.com>
Tested-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250112165029.1181056-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-14 11:27:25 +00:00
James Clark
054b88391b KVM: arm64: Support trace filtering for guests
For nVHE, switch the filter value in and out if the Coresight driver
asks for it. This will support filters for guests when sinks other than
TRBE are used.

For VHE, just write the filter directly to TRFCR_EL1 where trace can be
used even with TRBE sinks.

Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250106142446.628923-7-james.clark@linaro.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-12 12:50:11 +00:00
James Clark
a665e3bc88 KVM: arm64: coresight: Give TRBE enabled state to KVM
Currently in nVHE, KVM has to check if TRBE is enabled on every guest
switch even if it was never used. Because it's a debug feature and is
more likely to not be used than used, give KVM the TRBE buffer status to
allow a much simpler and faster do-nothing path in the hyp.

Protected mode now disables trace regardless of TRBE (because
trfcr_while_in_guest is always 0), which was not previously done.
However, it continues to flush whenever the buffer is enabled
regardless of the filter status. This avoids the hypothetical case of a
host that had disabled the filter but not flushed which would arise if
only doing the flush when the filter was enabled.

Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250106142446.628923-6-james.clark@linaro.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-12 12:50:11 +00:00
Quentin Perret
6f91d31d47 KVM: arm64: Drop pkvm_mem_transition for host/hyp donations
Simplify the __pkvm_host_donate_hyp() and pkvm_hyp_donate_host() paths
by not using the pkvm_mem_transition machinery. As the last users of
this, also remove all the now unused code.

No functional changes intended.

Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250110121936.1559655-4-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-12 10:46:39 +00:00
Quentin Perret
7cbf7c3771 KVM: arm64: Drop pkvm_mem_transition for host/hyp sharing
Simplify the __pkvm_host_{un}share_hyp() paths by not using the
pkvm_mem_transition machinery. As there are the last users of the
do_share()/do_unshare(), remove all the now-unused code as well.

No functional changes intended.

Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250110121936.1559655-3-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-12 10:46:39 +00:00
Quentin Perret
7a0688832f KVM: arm64: Drop pkvm_mem_transition for FF-A
Simplify the __pkvm_host_{un}share_ffa() paths by using
{check,set}_page_state_range().

No functional changes intended.

Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250110121936.1559655-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-12 10:46:39 +00:00
Marc Zyngier
60f20d82c1 Merge branch 'kvmarm-fixes-6.13-3' 2025-01-12 10:46:17 +00:00
Marc Zyngier
e880b16efb Merge branch kvm-arm64/pkvm-fixed-features-6.14 into kvmarm-master/next
* kvm-arm64/pkvm-fixed-features-6.14: (24 commits)
  : .
  : Complete rework of the pKVM handling of features, catching up
  : with the rest of the code deals with it these days.
  : Patches courtesy of Fuad Tabba. From the cover letter:
  :
  : "This patch series uses the vm's feature id registers to track the
  : supported features, a framework similar to nested virt to set the
  : trap values, and removes the need to store cptr_el2 per vcpu in
  : favor of setting its value when traps are activated, as VHE mode
  : does."
  :
  : This branch drags the arm64/for-next/cpufeature branch to solve
  : ugly conflicts in -next.
  : .
  KVM: arm64: Fix FEAT_MTE in pKVM
  KVM: arm64: Use kvm_vcpu_has_feature() directly for struct kvm
  KVM: arm64: Convert the SVE guest vcpu flag to a vm flag
  KVM: arm64: Remove PtrAuth guest vcpu flag
  KVM: arm64: Fix the value of the CPTR_EL2 RES1 bitmask for nVHE
  KVM: arm64: Refactor kvm_reset_cptr_el2()
  KVM: arm64: Calculate cptr_el2 traps on activating traps
  KVM: arm64: Remove redundant setting of HCR_EL2 trap bit
  KVM: arm64: Remove fixed_config.h header
  KVM: arm64: Rework specifying restricted features for protected VMs
  KVM: arm64: Set protected VM traps based on its view of feature registers
  KVM: arm64: Fix RAS trapping in pKVM for protected VMs
  KVM: arm64: Initialize feature id registers for protected VMs
  KVM: arm64: Use KVM extension checks for allowed protected VM capabilities
  KVM: arm64: Remove KVM_ARM_VCPU_POWER_OFF from protected VMs allowed features in pKVM
  KVM: arm64: Move checking protected vcpu features to a separate function
  KVM: arm64: Group setting traps for protected VMs by control register
  KVM: arm64: Consolidate allowed and restricted VM feature checks
  arm64/sysreg: Get rid of CPACR_ELx SysregFields
  arm64/sysreg: Convert *_EL12 accessors to Mapping
  ...

Signed-off-by: Marc Zyngier <maz@kernel.org>

# Conflicts:
#	arch/arm64/kvm/fpsimd.c
#	arch/arm64/kvm/hyp/nvhe/pkvm.c
2025-01-12 10:40:10 +00:00
Marc Zyngier
d0670128d4 Merge branch kvm-arm64/pkvm-np-guest into kvmarm-master/next
* kvm-arm64/pkvm-np-guest:
  : .
  : pKVM support for non-protected guests using the standard MM
  : infrastructure, courtesy of Quentin Perret. From the cover letter:
  :
  : "This series moves the stage-2 page-table management of non-protected
  : guests to EL2 when pKVM is enabled. This is only intended as an
  : incremental step towards a 'feature-complete' pKVM, there is however a
  : lot more that needs to come on top.
  :
  : With that series applied, pKVM provides near-parity with standard KVM
  : from a functional perspective all while Linux no longer touches the
  : stage-2 page-tables itself at EL1. The majority of mm-related KVM
  : features work out of the box, including MMU notifiers, dirty logging,
  : RO memslots and things of that nature. There are however two gotchas:
  :
  :  - We don't support mapping devices into guests: this requires
  :    additional hypervisor support for tracking the 'state' of devices,
  :    which will come in a later series. No device assignment until then.
  :
  :  - Stage-2 mappings are forced to page-granularity even when backed by a
  :    huge page for the sake of simplicity of this series. I'm only aiming
  :    at functional parity-ish (from userspace's PoV) for now, support for
  :    HP can be added on top later as a perf improvement."
  : .
  KVM: arm64: Plumb the pKVM MMU in KVM
  KVM: arm64: Introduce the EL1 pKVM MMU
  KVM: arm64: Introduce __pkvm_tlb_flush_vmid()
  KVM: arm64: Introduce __pkvm_host_mkyoung_guest()
  KVM: arm64: Introduce __pkvm_host_test_clear_young_guest()
  KVM: arm64: Introduce __pkvm_host_wrprotect_guest()
  KVM: arm64: Introduce __pkvm_host_relax_guest_perms()
  KVM: arm64: Introduce __pkvm_host_unshare_guest()
  KVM: arm64: Introduce __pkvm_host_share_guest()
  KVM: arm64: Introduce __pkvm_vcpu_{load,put}()
  KVM: arm64: Add {get,put}_pkvm_hyp_vm() helpers
  KVM: arm64: Make kvm_pgtable_stage2_init() a static inline function
  KVM: arm64: Pass walk flags to kvm_pgtable_stage2_relax_perms
  KVM: arm64: Pass walk flags to kvm_pgtable_stage2_mkyoung
  KVM: arm64: Move host page ownership tracking to the hyp vmemmap
  KVM: arm64: Make hyp_page::order a u8
  KVM: arm64: Move enum pkvm_page_state to memory.h
  KVM: arm64: Change the layout of enum pkvm_page_state

Signed-off-by: Marc Zyngier <maz@kernel.org>

# Conflicts:
#	arch/arm64/kvm/arm.c
2025-01-12 10:37:15 +00:00
Marc Zyngier
4e26de25d2 Merge remote-tracking branch 'arm64/for-next/cpufeature' into kvm-arm64/pkvm-fixed-features-6.14
Merge arm64/for-next/cpufeature to solve extensive conflicts
caused by the CPACR_ELx->CPACR_EL1 repainting.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-11 14:55:18 +00:00
Mark Rutland
a7f1fa5564 KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
The Branch Record Buffer Extension (BRBE) adds a number of system
registers and instructions which we don't currently intend to expose to
guests. Our existing logic handles this safely, but this could be
improved with some explicit handling of BRBE.

KVM currently hides BRBE from guests: the cpufeature code's
ftr_id_aa64dfr0[] table doesn't have an entry for the BRBE field, and so
this will be zero in the sanitised value of ID_AA64DFR0 exposed to
guests via read_sanitised_id_aa64dfr0_el1().

KVM currently traps BRBE usage from guests: the default configuration of
the fine-grained trap controls HDFGRTR_EL2.{nBRBDATA,nBRBCTL,nBRBIDR}
and HFGITR_EL2.{nBRBINJ_nBRBIALL} cause these to be trapped to EL2.

Well-behaved guests shouldn't try to use the registers or instructions,
but badly-behaved guests could use these, resulting in unnecessary
warnings from KVM before it injects an UNDEF, e.g.

| kvm [197]: Unsupported guest access at: 401c98
|  { Op0( 2), Op1( 1), CRn( 9), CRm( 0), Op2( 0), func_read },
| kvm [197]: Unsupported guest access at: 401d04
|  { Op0( 2), Op1( 1), CRn( 9), CRm( 0), Op2( 1), func_read },
| kvm [197]: Unsupported guest access at: 401d70
|  { Op0( 2), Op1( 1), CRn( 9), CRm( 2), Op2( 0), func_read },
| kvm [197]: Unsupported guest access at: 401ddc
|  { Op0( 2), Op1( 1), CRn( 9), CRm( 1), Op2( 0), func_read },
| kvm [197]: Unsupported guest access at: 401e48
|  { Op0( 2), Op1( 1), CRn( 9), CRm( 1), Op2( 1), func_read },
| kvm [197]: Unsupported guest access at: 401eb4
|  { Op0( 2), Op1( 1), CRn( 9), CRm( 1), Op2( 2), func_read },
| kvm [197]: Unsupported guest access at: 401f20
|  { Op0( 2), Op1( 1), CRn( 9), CRm( 0), Op2( 2), func_read },
| kvm [197]: Unsupported guest access at: 401f8c
|  { Op0( 1), Op1( 1), CRn( 7), CRm( 2), Op2( 4), func_write },
| kvm [197]: Unsupported guest access at: 401ff8
|  { Op0( 1), Op1( 1), CRn( 7), CRm( 2), Op2( 5), func_write },

As with other features that we know how to handle, these warnings aren't
particularly interesting, and we can simply treat these as UNDEFINED
without any warning. Add the necessary fine-grained undef configuration
to make this happen, as suggested by Marc Zyngier:

  https://lore.kernel.org/linux-arm-kernel/86r0czk6wd.wl-maz@kernel.org/

At the same time, update read_sanitised_id_aa64dfr0_el1() to hide BRBE
from guests, as we do for SPE. This will prevent accidentally exposing
BRBE to guests if/when ftr_id_aa64dfr0[] gains a BRBE entry.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20250109223836.419240-1-robh@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-11 10:28:10 +00:00
Thorsten Blum
dea8838128 KVM: arm64: vgic: Use str_enabled_disabled() in vgic_v3_probe()
Remove hard-coded strings by using the str_enabled_disabled() helper
function.

Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250110225310.369980-2-thorsten.blum@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-11 10:26:56 +00:00
Mark Brown
fd22af17a4 KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
ID_AA64ISAR3_EL1 is currently marked as unallocated in KVM but does have a
number of bitfields defined in it. Expose FPRCVT and FAMINMAX, two simple
instruction only extensions to guests.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250107-arm64-2024-dpisa-v5-4-7578da51fc3d@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2025-01-08 13:41:07 +00:00
Kalesh Singh
38f9e4b905 arm64: kvm: Introduce nvhe stack size constants
Refactor nvhe stack code to use NVHE_STACK_SIZE/SHIFT constants,
instead of directly using PAGE_SIZE/SHIFT. This makes the code a bit
easier to read, without introducing any functional changes.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Link: https://lore.kernel.org/r/20241112003336.1375584-1-kaleshsingh@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-08 11:25:28 +00:00
Vincent Donnefort
68344037b7 KVM: arm64: Fix nVHE stacktrace VA bits mask
The hypervisor VA space size depends on both the ID map's
(IDMAP_VA_BITS) and the kernel stage-1 (VA_BITS). However, the
hypervisor stacktrace decoding is solely relying on VA_BITS. This is
especially an issue when VA_BITS < IDMAP_VA_BITS (i.e. VA_BITS is
39-bit): the hypervisor may have addresses bigger than the stacktrace is
masking.

Align this mask with hyp_va_bits.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250107112821.416591-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-08 11:18:39 +00:00
Vladimir Murzin
b7f345fbc3 KVM: arm64: Fix FEAT_MTE in pKVM
Make sure we do not trap access to Allocation Tags.

Fixes: b56680de9c ("KVM: arm64: Initialize trap register values in hyp in pKVM")
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20250106112421.65355-1-vladimir.murzin@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-08 10:24:32 +00:00
Marc Zyngier
0bc9a9e85f KVM: arm64: Work around x1e's CNTVOFF_EL2 bogosity
It appears that on Qualcomm's x1e CPU, CNTVOFF_EL2 doesn't really
work, specially with HCR_EL2.E2H=1.

A non-zero offset results in a screaming virtual timer interrupt,
to the tune of a few 100k interrupts per second on a 4 vcpu VM.
This is also evidenced by this CPU's inability to correctly run
any of the timer selftests.

The only case this doesn't break is when this register is set to 0,
which breaks VM migration.

When HCR_EL2.E2H=0, the timer seems to behave normally, and does
not result in an interrupt storm.

As a workaround, use the fact that this CPU implements FEAT_ECV,
and trap all accesses to the virtual timer and counter, keeping
CNTVOFF_EL2 set to zero, and emulate accesses to CVAL/TVAL/CTL
and the counter itself, fixing up the timer to account for the
missing offset.

And if you think this is disgusting, you'd probably be right.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-12-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:10 +00:00
Marc Zyngier
d1e37a50e1 KVM: arm64: nv: Sanitise CNTHCTL_EL2
Inject some sanity in CNTHCTL_EL2, ensuring that we don't handle
more than we advertise to the guest.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-11-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:10 +00:00
Marc Zyngier
479428cc3d KVM: arm64: nv: Propagate CNTHCTL_EL2.EL1NV{P,V}CT bits
Allow a guest hypervisor to trap accesses to CNT{P,V}CT_EL02 by
propagating these trap bits to the host trap configuration.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-10-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:10 +00:00
Marc Zyngier
c271269e35 KVM: arm64: nv: Add trap routing for CNTHCTL_EL2.EL1{NVPCT,NVVCT,TVT,TVCT}
For completeness, fun, and cerebral meltdown, add the virtualisation
related traps to the counter and timers.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-9-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:10 +00:00
Marc Zyngier
b86fc215dc KVM: arm64: Handle counter access early in non-HYP context
We already deal with CNTPCT_EL0 accesses in non-HYP context.
Let's add CNTVCT_EL0 as a good measure.

This is also an opportunity to simplify things and make it
plain that this code is only for non-HYP context handling.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-8-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:10 +00:00
Marc Zyngier
9b3b2f0029 KVM: arm64: nv: Accelerate EL0 counter accesses from hypervisor context
Similarly to handling the physical timer accesses early when FEAT_ECV
causes a trap, we try to handle the physical counter without returning
to the general sysreg handling.

More surprisingly, we introduce something similar for the virtual
counter. Although this isn't necessary yet, it will prove useful on
systems that have a broken CNTVOFF_EL2 implementation. Yes, they exist.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-7-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:10 +00:00
Marc Zyngier
338f8ea519 KVM: arm64: nv: Accelerate EL0 timer read accesses when FEAT_ECV in use
Although FEAT_ECV allows us to correctly emulate the timers, it also
reduces performances pretty badly.

Mitigate this by emulating the CTL/CVAL register reads in the
inner run loop, without returning to the general kernel.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:09 +00:00
Marc Zyngier
2cd2a77f9c KVM: arm64: nv: Use FEAT_ECV to trap access to EL0 timers
Although FEAT_NV2 makes most things fast, it also makes it impossible
to correctly emulate the timers, as the sysreg accesses are redirected
to memory.

FEAT_ECV addresses this by giving a hypervisor the ability to trap
the EL02 sysregs as well as the virtual timer.

Add the required trap setting to make use of the feature, allowing
us to elide the ugly resync in the middle of the run loop.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:09 +00:00
Marc Zyngier
cc45963cbf KVM: arm64: nv: Publish emulated timer interrupt state in the in-memory state
With FEAT_NV2, the EL0 timer state is entirely stored in memory,
meaning that the hypervisor can only provide a very poor emulation.

The only thing we can really do is to publish the interrupt state
in the guest view of CNT{P,V}_CTL_EL0, and defer everything else
to the next exit.

Only FEAT_ECV will allow us to fix it, at the cost of extra trapping.

Suggested-by: Chase Conklin <chase.conklin@arm.com>
Suggested-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-4-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:09 +00:00
Marc Zyngier
4bad3068cf KVM: arm64: nv: Sync nested timer state with FEAT_NV2
Emulating the timers with FEAT_NV2 is a bit odd, as the timers
can be reconfigured behind our back without the hypervisor even
noticing. In the VHE case, that's an actual regression in the
architecture...

Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:09 +00:00
Marc Zyngier
b59dbb91f7 KVM: arm64: nv: Add handling of EL2-specific timer registers
Add the required handling for EL2 and EL02 registers, as
well as EL1 registers used in the E2H context. This includes
handling the virtual timer accesses when CNTHCTL_EL2.EL1TVT
or CNTHCTL_EL2.EL1TVCT are set.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:09 +00:00
Marc Zyngier
e891432cf7 KVM: arm64: nv: Advertise the lack of AArch32 EL0 support
Although we never supported 32bit anywhere in NV, we fail to
advertise so for EL0, probably owing to the relative lack of
hardware supporting both NV2 and 32bit EL0.

Add some sanitising to ID_AA64PFR0_EL1.EL0, and reaffirm that
"in 64bit-only we trust".

Reported-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-22 10:09:43 +00:00
Fuad Tabba
aac64ad369 KVM: arm64: Use kvm_vcpu_has_feature() directly for struct kvm
Now that we have introduced kvm_vcpu_has_feature(), use it in the
remaining code that checks for features in struct kvm, instead of
using the __vcpu_has_feature() helper.

No functional change intended.

Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-18-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:54:12 +00:00
Fuad Tabba
41d6028e28 KVM: arm64: Convert the SVE guest vcpu flag to a vm flag
The vcpu flag GUEST_HAS_SVE is per-vcpu, but it is based on what
is now a per-vm feature. Make the flag per-vm.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-17-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:54:09 +00:00
Fuad Tabba
c5c1763596 KVM: arm64: Remove PtrAuth guest vcpu flag
The vcpu flag GUEST_HAS_PTRAUTH is always associated with the
vcpu PtrAuth features, which are defined per vm rather than per
vcpu.

Remove the flag, and replace it with checks for the features
instead.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-16-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:54:06 +00:00
Fuad Tabba
2fd5b4b0e7 KVM: arm64: Calculate cptr_el2 traps on activating traps
Similar to VHE, calculate the value of cptr_el2 from scratch on
activate traps. This removes the need to store cptr_el2 in every
vcpu structure. Moreover, some traps, such as whether the guest
owns the fp registers, need to be set on every vcpu run.

Reported-by: James Clark <james.clark@linaro.org>
Fixes: 5294afdbf4 ("KVM: arm64: Exclude FP ownership from kvm_vcpu_arch")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-13-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:53:57 +00:00
Fuad Tabba
092e7b2c3b KVM: arm64: Remove redundant setting of HCR_EL2 trap bit
In hVHE mode, HCR_E2H should be set for both protected and
non-protected VMs. Since commit b56680de9c ("KVM: arm64:
Initialize trap register values in hyp in pKVM"), this has been
fixed, and the setting of the flag here is redundant.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-12-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:53:55 +00:00
Fuad Tabba
81403c8d04 KVM: arm64: Remove fixed_config.h header
The few remaining items needed in fixed_config.h are better
suited for pkvm.h. Move them there and delete it.

No functional change intended.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-11-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:53:53 +00:00
Fuad Tabba
3d7ff00700 KVM: arm64: Rework specifying restricted features for protected VMs
The existing code didn't properly distinguish between signed and
unsigned features, and was difficult to read and to maintain.
Rework it using the same method used in other parts of KVM when
handling vcpu features.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-10-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:53:01 +00:00
Fuad Tabba
0401f7e76d KVM: arm64: Set protected VM traps based on its view of feature registers
Now that the VM's feature id registers are initialized with the
values of the supported features, use those values to determine
which traps to set using kvm_has_feature().

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-9-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:53:00 +00:00
Fuad Tabba
9df9186f8d KVM: arm64: Fix RAS trapping in pKVM for protected VMs
Trap RAS in pKVM if not supported at all for protected VMs. The
RAS version doesn't matter in this case.

Fixes: 2a0c343386 ("KVM: arm64: Initialize trap registers for protected VMs")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-8-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:52:58 +00:00
Fuad Tabba
7ba5b8f804 KVM: arm64: Initialize feature id registers for protected VMs
The hypervisor maintains the state of protected VMs. Initialize
the values for feature ID registers for protected VMs, to be used
when setting traps and when advertising features to protected
VMs.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-7-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:52:50 +00:00
Fuad Tabba
a3163dca48 KVM: arm64: Use KVM extension checks for allowed protected VM capabilities
Use KVM extension checks as the source for determining which
capabilities are allowed for protected VMs. KVM extension checks
is the natural place for this, since it is also the interface
exposed to users.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-6-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:45:25 +00:00
Fuad Tabba
27f5cf8ad5 KVM: arm64: Remove KVM_ARM_VCPU_POWER_OFF from protected VMs allowed features in pKVM
The hypervisor is responsible for the power state of protected
VMs in pKVM. Therefore, remove KVM_ARM_VCPU_POWER_OFF from the
list of allowed features for protected VMs.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-5-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:45:25 +00:00
Fuad Tabba
1fea164ccf KVM: arm64: Move checking protected vcpu features to a separate function
At the moment, checks for supported vcpu features for protected
VMs are build-time bugs. In the following patch, they will become
runtime checks based on the vcpu's features registers. Therefore,
consolidate them into one function that would return an error if
it encounters an unsupported feature.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-4-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:45:25 +00:00
Fuad Tabba
f50758260b KVM: arm64: Group setting traps for protected VMs by control register
Group setting protected VM traps by control register rather than
feature id register, since some trap values (e.g., PAuth), depend
on more than one feature id register.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:45:21 +00:00
Marc Zyngier
2589dbd727 KVM: arm64: Consolidate allowed and restricted VM feature checks
The definitions for features allowed and allowed with
restrictions for protected guests, which are based on feature
registers, were defined and checked for separately, even though
they are handled in the same way. This could result in missing
checks for certain features, e.g., pointer authentication,
causing traps for allowed features.

Consolidate the definitions into one. Use that new definition to
construct the guest view of the feature registers for
consistency.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:39:10 +00:00
Quentin Perret
fce886a602 KVM: arm64: Plumb the pKVM MMU in KVM
Introduce the KVM_PGT_CALL() helper macro to allow switching from the
traditional pgtable code to the pKVM version easily in mmu.c. The cost
of this 'indirection' is expected to be very minimal due to
is_protected_kvm_enabled() being backed by a static key.

With this, everything is in place to allow the delegation of
non-protected guest stage-2 page-tables to pKVM, so let's stop using the
host's kvm_s2_mmu from EL2 and enjoy the ride.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-19-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
e912efed48 KVM: arm64: Introduce the EL1 pKVM MMU
Introduce a set of helper functions allowing to manipulate the pKVM
guest stage-2 page-tables from EL1 using pKVM's HVC interface.

Each helper has an exact one-to-one correspondance with the traditional
kvm_pgtable_stage2_*() functions from pgtable.c, with a strictly
matching prototype. This will ease plumbing later on in mmu.c.

These callbacks track the gfn->pfn mappings in a simple rb_tree indexed
by IPA in lieu of a page-table. This rb-tree is kept in sync with pKVM's
state and is protected by the mmu_lock like a traditional stage-2
page-table.

Signed-off-by: Quentin Perret <qperret@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-18-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
0adce4d42f KVM: arm64: Introduce __pkvm_tlb_flush_vmid()
Introduce a new hypercall to flush the TLBs of non-protected guests. The
host kernel will be responsible for issuing this hypercall after changing
stage-2 permissions using the __pkvm_host_relax_guest_perms() or
__pkvm_host_wrprotect_guest() paths. This is left under the host's
responsibility for performance reasons.

Note however that the TLB maintenance for all *unmap* operations still
remains entirely under the hypervisor's responsibility for security
reasons -- an unmapped page may be donated to another entity, so a stale
TLB entry could be used to leak private data.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-17-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
76f0b18b3d KVM: arm64: Introduce __pkvm_host_mkyoung_guest()
Plumb the kvm_pgtable_stage2_mkyoung() callback into pKVM for
non-protected guests. It will be called later from the fault handling
path.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-16-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
56ab4de37f KVM: arm64: Introduce __pkvm_host_test_clear_young_guest()
Plumb the kvm_stage2_test_clear_young() callback into pKVM for
non-protected guest. It will be later be called from MMU notifiers.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-15-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
26117e4c63 KVM: arm64: Introduce __pkvm_host_wrprotect_guest()
Introduce a new hypercall to remove the write permission from a
non-protected guest stage-2 mapping. This will be used for e.g. enabling
dirty logging.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-14-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
34884a0a4a KVM: arm64: Introduce __pkvm_host_relax_guest_perms()
Introduce a new hypercall allowing the host to relax the stage-2
permissions of mappings in a non-protected guest page-table. It will be
used later once we start allowing RO memslots and dirty logging.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-13-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
72db3d3fba KVM: arm64: Introduce __pkvm_host_unshare_guest()
In preparation for letting the host unmap pages from non-protected
guests, introduce a new hypercall implementing the host-unshare-guest
transition.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-12-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
d0bd3e6570 KVM: arm64: Introduce __pkvm_host_share_guest()
In preparation for handling guest stage-2 mappings at EL2, introduce a
new pKVM hypercall allowing to share pages with non-protected guests.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-11-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Marc Zyngier
f7d03fcbf1 KVM: arm64: Introduce __pkvm_vcpu_{load,put}()
Rather than look-up the hyp vCPU on every run hypercall at EL2,
introduce a per-CPU 'loaded_hyp_vcpu' tracking variable which is updated
by a pair of load/put hypercalls called directly from
kvm_arch_vcpu_{load,put}() when pKVM is enabled.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-10-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
99996d575e KVM: arm64: Add {get,put}_pkvm_hyp_vm() helpers
In preparation for accessing pkvm_hyp_vm structures at EL2 in a context
where we can't always expect a vCPU to be loaded (e.g. MMU notifiers),
introduce get/put helpers to get temporary references to hyp VMs from
any context.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-9-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
e279c25d78 KVM: arm64: Pass walk flags to kvm_pgtable_stage2_relax_perms
kvm_pgtable_stage2_relax_perms currently assumes that it is being called
from a 'shared' walker, which will not be true once called from pKVM. To
allow for the re-use of that function, make the walk flags one of its
parameters.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-7-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
5398ddc5c9 KVM: arm64: Pass walk flags to kvm_pgtable_stage2_mkyoung
kvm_pgtable_stage2_mkyoung currently assumes that it is being called
from a 'shared' walker, which will not be true once called from pKVM.
To allow for the re-use of that function, make the walk flags one of
its parameters.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-6-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:43:59 +00:00
Quentin Perret
e94a7dea29 KVM: arm64: Move host page ownership tracking to the hyp vmemmap
We currently store part of the page-tracking state in PTE software bits
for the host, guests and the hypervisor. This is sub-optimal when e.g.
sharing pages as this forces to break block mappings purely to support
this software tracking. This causes an unnecessarily fragmented stage-2
page-table for the host in particular when it shares pages with Secure,
which can lead to measurable regressions. Moreover, having this state
stored in the page-table forces us to do multiple costly walks on the
page transition path, hence causing overhead.

In order to work around these problems, move the host-side page-tracking
logic from SW bits in its stage-2 PTEs to the hypervisor's vmemmap.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-5-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:43:59 +00:00
Quentin Perret
b35875d466 KVM: arm64: Make hyp_page::order a u8
We don't need 16 bits to store the hyp page order, and we'll need some
bits to store page ownership data soon, so let's reduce the order
member.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-4-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:43:59 +00:00
Quentin Perret
d4fc42a479 KVM: arm64: Move enum pkvm_page_state to memory.h
In order to prepare the way for storing page-tracking information in
pKVM's vmemmap, move the enum pkvm_page_state definition to
nvhe/memory.h.

No functional changes intended.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-3-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:43:59 +00:00
Quentin Perret
a1a1f1ff1f KVM: arm64: Change the layout of enum pkvm_page_state
The 'concrete' (a.k.a non-meta) page states are currently encoded using
software bits in PTEs. For performance reasons, the abstract
pkvm_page_state enum uses the same bits to encode these states as that
makes conversions from and to PTEs easy.

In order to prepare the ground for moving the 'concrete' state storage
to the hyp vmemmap, re-arrange the enum to use bits 0 and 1 for this
purpose.

No functional changes intended.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:43:59 +00:00
Oliver Upton
c4a6ed8545 KVM: arm64: Promote guest ownership for DBGxVR/DBGxCR reads
Only yielding control of the debug registers for writes is a bit silly,
unless of course you're a fan of pointless traps. Give control of the
debug registers to the guest upon the first access, regardless of
direction.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-20-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:04:19 +00:00
Oliver Upton
3ce9f3357e KVM: arm64: Fold DBGxVR/DBGxCR accessors into common set
There is a nauseating amount of boilerplate for accessing the
breakpoint and watchpoint registers. Fold everything together into a
single set of accessors and select the right storage based on the sysreg
encoding.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-19-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:04:17 +00:00
Oliver Upton
8c02c2bbd6 KVM: arm64: Avoid reading ID_AA64DFR0_EL1 for debug save/restore
Similar to other per-CPU profiling/debug features we handle, store the
number of breakpoints/watchpoints in kvm_host_data to avoid reading the
ID register 4 times on every guest entry/exit. And if you're in the
nested virt business that's quite a few avoidable exits to the L0
hypervisor.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-18-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:04:14 +00:00
Oliver Upton
b0ee51033a KVM: arm64: nv: Honor MDCR_EL2.TDE routing for debug exceptions
Inject debug exceptions into vEL2 if MDCR_EL2.TDE is set.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-17-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:04:11 +00:00
Marc Zyngier
2ca3f03bf5 KVM: arm64: Manage software step state at load/put
KVM takes over the guest's software step state machine if the VMM is
debugging the guest, but it does the save/restore fiddling for every
guest entry.

Note that the only constraint on host usage of software step is that the
guest's configuration remains visible to userspace via the ONE_REG
ioctls. So, we can cut down on the amount of fiddling by doing this at
load/put instead.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-16-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:04:06 +00:00
Oliver Upton
4ad3a0b87f KVM: arm64: Don't hijack guest context MDSCR_EL1
Stealing MDSCR_EL1 in the guest's kvm_cpu_context for external debugging
is rather gross. Just add a field for this instead and let the context
switch code pick the correct one based on the debug owner.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-15-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:02:51 +00:00
Oliver Upton
75a5fbaf66 KVM: arm64: Compute MDCR_EL2 at vcpu_load()
KVM has picked up several hacks to cope with vcpu->arch.mdcr_el2 needing
to be prepared before vcpu_load(), which is when it gets programmed
into hardware on VHE.

Now that the flows for reprogramming MDCR_EL2 have been simplified, move
that computation to vcpu_load().

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-14-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
06d22a9c1b KVM: arm64: Reload vCPU for accesses to OSLAR_EL1
KVM takes ownership of the debug regs if the guest enables the OS lock,
as it needs to use MDSCR_EL1 to mask debug exceptions. Just reload the
vCPU if the guest toggles the OS lock, relying on kvm_vcpu_load_debug()
to update the debug owner and get the right trap configuration in place.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-13-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
beb470d96c KVM: arm64: Use debug_owner to track if debug regs need save/restore
Use the debug owner to determine if the debug regs are in use instead of
keeping around the DEBUG_DIRTY flag. Debug registers are now
saved/restored after the first trap, regardless of whether it was a read
or a write. This also shifts the point at which KVM becomes lazy to
vcpu_put() rather than the next exception taken from the guest.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-12-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
803602b0d9 KVM: arm64: Remove vestiges of debug_ptr
Delete the remnants of debug_ptr now that debug registers are selected
based on the debug owner instead.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-11-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
3b7780945c KVM: arm64: Remove debug tracepoints
The debug tracepoints are a useless firehose of information that track
implementation detail rather than well-defined events. These are going
to be rather difficult to uphold now that the implementation is getting
redone, so throw them out instead of bending over backwards.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-10-oliver.upton@linux.dev
[maz: fixed compilation after trace-ectomy]
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
58db67e9ac KVM: arm64: Select debug state to save/restore based on debug owner
Select the set of debug registers to use based on the owner rather than
relying on debug_ptr. Besides the code cleanup, this allows us to
eliminate a couple instances kern_hyp_va() as well.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-9-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
4cefbec97d KVM: arm64: Clean up KVM_SET_GUEST_DEBUG handler
No particular reason other than it isn't nice to look at.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-8-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
cd9b10102a KVM: arm64: Evaluate debug owner at vcpu_load()
In preparation for tossing the debug_ptr mess, introduce an enumeration
to track the ownership of the debug registers while in the guest. Update
the owner at vcpu_load() based on whether the host needs to steal the
guest's debug context or if breakpoints/watchpoints are actively in use.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-7-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
b47ffd13fd KVM: arm64: Write MDCR_EL2 directly from kvm_arm_setup_mdcr_el2()
Expecting the callee to know when MDCR_EL2 needs to be written to
hardware asking for trouble. Do the deed from kvm_arm_setup_mdcr_el2()
instead.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-6-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
d381e53384 KVM: arm64: Move host SME/SVE tracking flags to host data
The SME/SVE state tracking flags have no business in the vCPU. Move them
to kvm_host_data.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-5-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 08:49:09 +00:00
Oliver Upton
38131c02a5 KVM: arm64: Track presence of SPE/TRBE in kvm_host_data instead of vCPU
Add flags to kvm_host_data to track if SPE/TRBE is present +
programmable on a per-CPU basis. Set the flags up at init rather than
vcpu_load() as the programmability of these buffers is unlikely to
change.

Reviewed-by: James Clark <james.clark@linaro.org>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 08:49:09 +00:00
Oliver Upton
2417218f2f KVM: arm64: Get rid of __kvm_get_mdcr_el2() and related warts
KVM caches MDCR_EL2 on a per-CPU basis in order to preserve the
configuration of MDCR_EL2.HPMN while running a guest. This is a bit
gross, since we're relying on some baked configuration rather than the
hardware definition of implemented counters.

Discover the number of implemented counters by reading PMCR_EL0.N
instead. This works because:

 - In VHE the kernel runs at EL2, and N always returns the number of
   counters implemented in hardware

 - In {n,h}VHE, the EL2 setup code programs MDCR_EL2.HPMN with the EL2
   view of PMCR_EL0.N for the host

Lastly, avoid traps under nested virtualization by saving PMCR_EL0.N in
host data.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 08:49:08 +00:00
Oliver Upton
8ca19c40c4 KVM: arm64: Drop MDSCR_EL1_DEBUG_MASK
Nothing is using this macro, get rid of it.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-2-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 08:49:08 +00:00
Marc Zyngier
e5ecedcd7c arm64/sysreg: Get rid of CPACR_ELx SysregFields
There is no such thing as CPACR_ELx in the architecture.
What we have is CPACR_EL1, for which CPTR_EL12 is an accessor.

Rename CPACR_ELx_* to CPACR_EL1_*, and fix the bit of code using
these names.

Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241219173351.1123087-5-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-19 18:00:58 +00:00
Marc Zyngier
7052e808c4 arm64/sysreg: Get rid of the TCR2_EL1x SysregFields
TCR2_EL1x is a pretty bizarre construct, as it is shared between
TCR2_EL1 and TCR2_EL12. But the latter is obviously only an
accessor to the former.

In order to make things more consistent, upgrade TCR2_EL1x to
a full-blown sysreg definition for TCR2_EL1, and describe TCR2_EL12
as a mapping to TCR2_EL1.

This results in a couple of minor changes to the actual code.

Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241219173351.1123087-3-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-19 18:00:58 +00:00
Ard Biesheuvel
9d86c3c974 arm64/kvm: Avoid invalid physical addresses to signal owner updates
The pKVM stage2 mapping code relies on an invalid physical address to
signal to the internal API that only the annotations of descriptors
should be updated, and these are stored in the high bits of invalid
descriptors covering memory that has been donated to protected guests,
and is therefore unmapped from the host stage-2 page tables.

Given that these invalid PAs are never stored into the descriptors, it
is better to rely on an explicit flag, to clarify the API and to avoid
confusion regarding whether or not the output address of a descriptor
can ever be invalid to begin with (which is not the case with LPA2).

That removes a dependency on the logic that reasons about the maximum PA
range, which differs on LPA2 capable CPUs based on whether LPA2 is
enabled or not, and will be further clarified in subsequent patches.

Cc: Quentin Perret <qperret@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241212081841.2168124-12-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-19 17:23:53 +00:00
Ard Biesheuvel
f0da16992a arm64/kvm: Configure HYP TCR.PS/DS based on host stage1
When the host stage1 is configured for LPA2, the value currently being
programmed into TCR_EL2.T0SZ may be invalid unless LPA2 is configured
at HYP as well.  This means kvm_lpa2_is_enabled() is not the right
condition to test when setting TCR_EL2.DS, as it will return false if
LPA2 is only available for stage 1 but not for stage 2.

Similary, programming TCR_EL2.PS based on a limited IPA range due to
lack of stage2 LPA2 support could potentially result in problems.

So use lpa2_is_enabled() instead, and set the PS field according to the
host's IPS, which is capped at 48 bits if LPA2 support is absent or
disabled. Whether or not we can make meaningful use of such a
configuration is a different question.

Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241212081841.2168124-11-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-19 17:23:53 +00:00
Oliver Upton
e96d8b80af KVM: arm64: Only apply PMCR_EL0.P to the guest range of counters
An important distinction from other registers affected by HPMN is that
PMCR_EL0 only affects the guest range of counters, regardless of the EL
from which it is accessed. Ensure that PMCR_EL0.P is always applied to
'guest' counters by manually computing the mask rather than deriving it
from the current context.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175611.3658290-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-12-18 13:22:25 -08:00
Oliver Upton
d3ba35b69e KVM: arm64: nv: Reload PMU events upon MDCR_EL2.HPME change
MDCR_EL2.HPME is the 'global' enable bit for event counters reserved for
EL2. Give the PMU a kick when it's changed to ensure events are
reprogrammed before returning to the guest.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175550.3658212-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-12-18 13:22:25 -08:00
Oliver Upton
adf8623b3f KVM: arm64: Use KVM_REQ_RELOAD_PMU to handle PMCR_EL0.E change
Nested virt introduces yet another set of 'global' knobs for controlling
event counters that are reserved for EL2 (i.e. >= HPMN). Get ready to
share some plumbing with the NV controls by offloading counter
reprogramming to KVM_REQ_RELOAD_PMU.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175532.3658134-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-12-18 13:22:25 -08:00
Oliver Upton
e22c369520 KVM: arm64: Add unified helper for reprogramming counters by mask
Having separate helpers for enabling/disabling counters provides the
wrong abstraction, as the state of each counter needs to be evaluated
independently and, in some cases, use a different global enable bit.

Collapse the enable/disable accessors into a single, common helper that
reconfigures every counter set in @mask, leaving the complexity of
determining if an event is actually enabled in
kvm_pmu_counter_is_enabled().

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241217175513.3658056-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-12-18 13:22:25 -08:00
Quentin Perret
985bb51f17 KVM: arm64: Always check the state from hyp_ack_unshare()
There are multiple pKVM memory transitions where the state of a page is
not cross-checked from the completer's PoV for performance reasons.
For example, if a page is PKVM_PAGE_OWNED from the initiator's PoV,
we should be guaranteed by construction that it is PKVM_NOPAGE for
everybody else, hence allowing us to save a page-table lookup.

When it was introduced, hyp_ack_unshare() followed that logic and bailed
out without checking the PKVM_PAGE_SHARED_BORROWED state in the
hypervisor's stage-1. This was correct as we could safely assume that
all host-initiated shares were directed at the hypervisor at the time.
But with the introduction of other types of shares (e.g. for FF-A or
non-protected guests), it is now very much required to cross check this
state to prevent the host from running __pkvm_host_unshare_hyp() on a
page shared with TZ or a non-protected guest.

Thankfully, if an attacker were to try this, the hyp_unmap() call from
hyp_complete_unshare() would fail, hence causing to WARN() from
__do_unshare() with the host lock held, which is fatal. But this is
fragile at best, and can hardly be considered a security measure.

Let's just do the right thing and always check the state from
hyp_ack_unshare().

Signed-off-by: Quentin Perret <qperret@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241128154406.602875-1-qperret@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-12-18 13:21:37 -08:00
Paolo Bonzini
3154bddf8c KVM/arm64 fixes for 6.13, part #2
- Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
    SPE/TRBE initialization
 
  - Align nested page table walker with the intended memory attribute
    combining rules of the architecture
 
  - Prevent userspace from constraining the advertised ASID width,
    avoiding horrors of guest TLBIs not matching the intended context in
    hardware
 
  - Don't leak references on LPIs when insertion into the translation
    cache fails
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZ0+mZhccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFuKcAQDnFcLru8MVor4zjloe25oPPeuW
 iBocGpgKwJMioHrAdwEAoq8v0eqfxrUpwr5KJ7iN9CTo9oANJYhVACC8jPHEowI=
 =fLPh
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.13-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.13, part #2

 - Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
   SPE/TRBE initialization

 - Align nested page table walker with the intended memory attribute
   combining rules of the architecture

 - Prevent userspace from constraining the advertised ASID width,
   avoiding horrors of guest TLBIs not matching the intended context in
   hardware

 - Don't leak references on LPIs when insertion into the translation
   cache fails
2024-12-10 08:50:55 -05:00
Keisuke Nishimura
be7e611274 KVM: arm64: vgic-its: Add error handling in vgic_its_cache_translation
The return value of xa_store() needs to be checked. This fix adds an
error handling path that resolves the kref inconsistency on failure. As
suggested by Oliver Upton, this function does not return the error code
intentionally because the translation cache is best effort.

Fixes: 8201d1028c ("KVM: arm64: vgic-its: Maintain a translation cache per ITS")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241130144952.23729-1-keisuke.nishimura@inria.fr
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-12-03 16:22:10 -08:00
Marc Zyngier
03c7527e97 KVM: arm64: Do not allow ID_AA64MMFR0_EL1.ASIDbits to be overridden
Catalin reports that a hypervisor lying to a guest about the size
of the ASID field may result in unexpected issues:

- if the underlying HW does only supports 8 bit ASIDs, the ASID
  field in a TLBI VAE1* operation is only 8 bits, and the HW will
  ignore the other 8 bits

- if on the contrary the HW is 16 bit capable, the ASID field
  in the same TLBI operation is always 16 bits, irrespective of
  the value of TCR_ELx.AS.

This could lead to missed invalidations if the guest was lead to
assume that the HW had 8 bit ASIDs while they really are 16 bit wide.

In order to avoid any potential disaster that would be hard to debug,
prenent the migration between a host with 8 bit ASIDs to one with
wider ASIDs (the converse was obviously always forbidden). This is
also consistent with what we already do for VMIDs.

If it becomes absolutely mandatory to support such a migration path
in the future, we will have to trap and emulate all TLBIs, something
that nobody should look forward to.

Fixes: d5a32b60dc ("KVM: arm64: Allow userspace to change ID_AA64MMFR{0-2}_EL1")
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20241203190236.505759-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-12-03 16:21:08 -08:00
Linus Torvalds
c4bb3a2d64 ARM:
* Fixes.
 
 RISC-V:
 
 * Svade and Svadu (accessed and dirty bit) extension support for host and
   guest.  This was acked on the mailing list by the RISC-V maintainer, see
   https://patchew.org/linux/20240726084931.28924-1-yongxuan.wang@sifive.com/.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmdKS0QUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP7hggAmt5CJesFGIuDwQgJX1KuNWAS84AX
 Oq5SPZLH0XjE5YDm6AusSzvbtOhRM6mARU5/iqMRE6Mqpf4MXpP9tOo6xaDiL7+m
 bOFsDYEO73WQyrIfFUCZ7dXiTbDVtQfNH8Z1yQwHPsa1d+WDYY3tLbCe5qCdqYMF
 JDiB7K0cQzPDmhCwf3Zf8mW2ZRI0QsTqiuFUfVGGNgFDspWfBFBqkLCkrMNmbp9z
 ye375oKb2VCe6OBJCY+Nl6tdoBUkz+CtZDCxkxuh0Uk4NmsUC9JMye9iwgU9DuI7
 nagFuvpUGcgbZvrx1ly47TL+wcEFLwnBJ0xBZTGIgVoZHj/wX9GM+tSgIw==
 =semZ
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm updates from Paolo Bonzini:

 - ARM fixes

 - RISC-V Svade and Svadu (accessed and dirty bit) extension support for
   host and guest

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: riscv: selftests: Add Svade and Svadu Extension to get-reg-list test
  RISC-V: KVM: Add Svade and Svadu Extensions Support for Guest/VM
  dt-bindings: riscv: Add Svade and Svadu Entries
  RISC-V: Add Svade and Svadu Extensions Support
  KVM: arm64: Use MDCR_EL2.HPME to evaluate overflow of hyp counters
  KVM: arm64: Ignore PMCNTENSET_EL0 while checking for overflow status
  KVM: arm64: Mark set_sysreg_masks() as inline to avoid build failure
  KVM: arm64: vgic-its: Add stronger type-checking to the ITS entry sizes
  KVM: arm64: vgic: Kill VGIC_MAX_PRIVATE definition
  KVM: arm64: vgic: Make vgic_get_irq() more robust
  KVM: arm64: vgic-v3: Sanitise guest writes to GICR_INVLPIR
2024-11-30 14:51:08 -08:00
Marc Zyngier
6fc3a49f23 KVM: arm64: Fix S1/S2 combination when FWB==1 and S2 has Device memory type
The G.a revision of the ARM ARM had it pretty clear that HCR_EL2.FWB
had no influence on "The way that stage 1 memory types and attributes
are combined with stage 2 Device type and attributes." (D5.5.5).

However, this wording was lost in further revisions of the architecture.

Restore the intended behaviour, which is to take the strongest memory
type of S1 and S2 in this case, as if FWB was 0. The specification is
being fixed accordingly.

Fixes: be04cebf3e ("KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241125094756.609590-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-26 07:32:27 -08:00
James Clark
d798bc6f3c arm64: Fix usage of new shifted MDCR_EL2 values
Since the linked fixes commit, these masks are already shifted so remove
the shifts. One issue that this fixes is SPE and TRBE not being
available anymore:

 arm_spe_pmu arm,spe-v1: profiling buffer owned by higher exception level

Fixes: 641630313e ("arm64: sysreg: Migrate MDCR_EL2 definition to table")
Signed-off-by: James Clark <james.clark@linaro.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241122164636.2944180-1-james.clark@linaro.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-26 06:31:36 -08:00
Linus Torvalds
9f16d5e6f2 The biggest change here is eliminating the awful idea that KVM had, of
essentially guessing which pfns are refcounted pages.  The reason to
 do so was that KVM needs to map both non-refcounted pages (for example
 BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP VMAs that contain
 refcounted pages.  However, the result was security issues in the past,
 and more recently the inability to map VM_IO and VM_PFNMAP memory
 that _is_ backed by struct page but is not refcounted.  In particular
 this broke virtio-gpu blob resources (which directly map host graphics
 buffers into the guest as "vram" for the virtio-gpu device) with the
 amdgpu driver, because amdgpu allocates non-compound higher order pages
 and the tail pages could not be mapped into KVM.
 
 This requires adjusting all uses of struct page in the per-architecture
 code, to always work on the pfn whenever possible.  The large series that
 did this, from David Stevens and Sean Christopherson, also cleaned up
 substantially the set of functions that provided arch code with the
 pfn for a host virtual addresses.  The previous maze of twisty little
 passages, all different, is replaced by five functions (__gfn_to_page,
 __kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages)
 saving almost 200 lines of code.
 
 ARM:
 
 * Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker
 
 * Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI
 
 * Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps
 
 * PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest
 
 * Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM
 
 * Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection
 
 * Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
 
 LoongArch:
 
 * Add iocsr and mmio bus simulation in kernel.
 
 * Add in-kernel interrupt controller emulation.
 
 * Add support for virtualization extensions to the eiointc irqchip.
 
 PPC:
 
 * Drop lingering and utterly obsolete references to PPC970 KVM, which was
   removed 10 years ago.
 
 * Fix incorrect documentation references to non-existing ioctls
 
 RISC-V:
 
 * Accelerate KVM RISC-V when running as a guest
 
 * Perf support to collect KVM guest statistics from host side
 
 s390:
 
 * New selftests: more ucontrol selftests and CPU model sanity checks
 
 * Support for the gen17 CPU model
 
 * List registers supported by KVM_GET/SET_ONE_REG in the documentation
 
 x86:
 
 * Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
   documentation, harden against unexpected changes.  Even if the hardware
   A/D tracking is disabled, it is possible to use the hardware-defined A/D
   bits to track if a PFN is Accessed and/or Dirty, and that removes a lot
   of special cases.
 
 * Elide TLB flushes when aging secondary PTEs, as has been done in x86's
   primary MMU for over 10 years.
 
 * Recover huge pages in-place in the TDP MMU when dirty page logging is
   toggled off, instead of zapping them and waiting until the page is
   re-accessed to create a huge mapping.  This reduces vCPU jitter.
 
 * Batch TLB flushes when dirty page logging is toggled off.  This reduces
   the time it takes to disable dirty logging by ~3x.
 
 * Remove the shrinker that was (poorly) attempting to reclaim shadow page
   tables in low-memory situations.
 
 * Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
 
 * Advertise CPUIDs for new instructions in Clearwater Forest
 
 * Quirk KVM's misguided behavior of initialized certain feature MSRs to
   their maximum supported feature set, which can result in KVM creating
   invalid vCPU state.  E.g. initializing PERF_CAPABILITIES to a non-zero
   value results in the vCPU having invalid state if userspace hides PDCM
   from the guest, which in turn can lead to save/restore failures.
 
 * Fix KVM's handling of non-canonical checks for vCPUs that support LA57
   to better follow the "architecture", in quotes because the actual
   behavior is poorly documented.  E.g. most MSR writes and descriptor
   table loads ignore CR4.LA57 and operate purely on whether the CPU
   supports LA57.
 
 * Bypass the register cache when querying CPL from kvm_sched_out(), as
   filling the cache from IRQ context is generally unsafe; harden the
   cache accessors to try to prevent similar issues from occuring in the
   future.  The issue that triggered this change was already fixed in 6.12,
   but was still kinda latent.
 
 * Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
   over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
 
 * Minor cleanups
 
 * Switch hugepage recovery thread to use vhost_task.  These kthreads can
   consume significant amounts of CPU time on behalf of a VM or in response
   to how the VM behaves (for example how it accesses its memory); therefore
   KVM tried to place the thread in the VM's cgroups and charge the CPU
   time consumed by that work to the VM's container.  However the kthreads
   did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM
   instances inside could not complete freezing.  Fix this by replacing the
   kthread with a PF_USER_WORKER thread, via the vhost_task abstraction.
   Another 100+ lines removed, with generally better behavior too like
   having these threads properly parented in the process tree.
 
 * Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't
   really work; there was really nothing to work around anyway: the broken
   patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL
   MSR is virtualized and therefore unaffected by the erratum.
 
 * Fix 6.12 regression where CONFIG_KVM will be built as a module even
   if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'.
 
 x86 selftests:
 
 * x86 selftests can now use AVX.
 
 Documentation:
 
 * Use rST internal links
 
 * Reorganize the introduction to the API document
 
 Generic:
 
 * Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
   of RCU, so that running a vCPU on a different task doesn't encounter long
   due to having to wait for all CPUs become quiescent.  In general both reads
   and writes are rare, but userspace that supports confidential computing is
   introducing the use of "helper" vCPUs that may jump from one host processor
   to another.  Those will be very happy to trigger a synchronize_rcu(), and
   the effect on performance is quite the disaster.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmc9MRYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP00QgArxqxBIGLCW5t7bw7vtNq63QYRyh4
 dTiDguLiYQJ+AXmnRu11R6aPC7HgMAvlFCCmH+GEce4WEgt26hxCmncJr/aJOSwS
 letCS7TrME16PeZvh25A1nhPBUw6mTF1qqzgcdHMrqXG8LuHoGcKYGSRVbkf3kfI
 1ZoMq1r8ChXbVVmCx9DQ3gw1TVr5Dpjs2voLh8rDSE9Xpw0tVVabHu3/NhQEz/F+
 t8/nRaqH777icCHIf9PCk5HnarHxLAOvhM2M0Yj09PuBcE5fFQxpxltw/qiKQqqW
 ep4oquojGl87kZnhlDaac2UNtK90Ws+WxxvCwUmbvGN0ZJVaQwf4FvTwig==
 =lWpE
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "The biggest change here is eliminating the awful idea that KVM had of
  essentially guessing which pfns are refcounted pages.

  The reason to do so was that KVM needs to map both non-refcounted
  pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP
  VMAs that contain refcounted pages.

  However, the result was security issues in the past, and more recently
  the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
  struct page but is not refcounted. In particular this broke virtio-gpu
  blob resources (which directly map host graphics buffers into the
  guest as "vram" for the virtio-gpu device) with the amdgpu driver,
  because amdgpu allocates non-compound higher order pages and the tail
  pages could not be mapped into KVM.

  This requires adjusting all uses of struct page in the
  per-architecture code, to always work on the pfn whenever possible.
  The large series that did this, from David Stevens and Sean
  Christopherson, also cleaned up substantially the set of functions
  that provided arch code with the pfn for a host virtual addresses.

  The previous maze of twisty little passages, all different, is
  replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
  non-__ versions of these two, and kvm_prefetch_pages) saving almost
  200 lines of code.

  ARM:

   - Support for stage-1 permission indirection (FEAT_S1PIE) and
     permission overlays (FEAT_S1POE), including nested virt + the
     emulated page table walker

   - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
     call was introduced in PSCIv1.3 as a mechanism to request
     hibernation, similar to the S4 state in ACPI

   - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
     part of it, introduce trivial initialization of the host's MPAM
     context so KVM can use the corresponding traps

   - PMU support under nested virtualization, honoring the guest
     hypervisor's trap configuration and event filtering when running a
     nested guest

   - Fixes to vgic ITS serialization where stale device/interrupt table
     entries are not zeroed when the mapping is invalidated by the VM

   - Avoid emulated MMIO completion if userspace has requested
     synchronous external abort injection

   - Various fixes and cleanups affecting pKVM, vCPU initialization, and
     selftests

  LoongArch:

   - Add iocsr and mmio bus simulation in kernel.

   - Add in-kernel interrupt controller emulation.

   - Add support for virtualization extensions to the eiointc irqchip.

  PPC:

   - Drop lingering and utterly obsolete references to PPC970 KVM, which
     was removed 10 years ago.

   - Fix incorrect documentation references to non-existing ioctls

  RISC-V:

   - Accelerate KVM RISC-V when running as a guest

   - Perf support to collect KVM guest statistics from host side

  s390:

   - New selftests: more ucontrol selftests and CPU model sanity checks

   - Support for the gen17 CPU model

   - List registers supported by KVM_GET/SET_ONE_REG in the
     documentation

  x86:

   - Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
     improve documentation, harden against unexpected changes.

     Even if the hardware A/D tracking is disabled, it is possible to
     use the hardware-defined A/D bits to track if a PFN is Accessed
     and/or Dirty, and that removes a lot of special cases.

   - Elide TLB flushes when aging secondary PTEs, as has been done in
     x86's primary MMU for over 10 years.

   - Recover huge pages in-place in the TDP MMU when dirty page logging
     is toggled off, instead of zapping them and waiting until the page
     is re-accessed to create a huge mapping. This reduces vCPU jitter.

   - Batch TLB flushes when dirty page logging is toggled off. This
     reduces the time it takes to disable dirty logging by ~3x.

   - Remove the shrinker that was (poorly) attempting to reclaim shadow
     page tables in low-memory situations.

   - Clean up and optimize KVM's handling of writes to
     MSR_IA32_APICBASE.

   - Advertise CPUIDs for new instructions in Clearwater Forest

   - Quirk KVM's misguided behavior of initialized certain feature MSRs
     to their maximum supported feature set, which can result in KVM
     creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
     a non-zero value results in the vCPU having invalid state if
     userspace hides PDCM from the guest, which in turn can lead to
     save/restore failures.

   - Fix KVM's handling of non-canonical checks for vCPUs that support
     LA57 to better follow the "architecture", in quotes because the
     actual behavior is poorly documented. E.g. most MSR writes and
     descriptor table loads ignore CR4.LA57 and operate purely on
     whether the CPU supports LA57.

   - Bypass the register cache when querying CPL from kvm_sched_out(),
     as filling the cache from IRQ context is generally unsafe; harden
     the cache accessors to try to prevent similar issues from occuring
     in the future. The issue that triggered this change was already
     fixed in 6.12, but was still kinda latent.

   - Advertise AMD_IBPB_RET to userspace, and fix a related bug where
     KVM over-advertises SPEC_CTRL when trying to support cross-vendor
     VMs.

   - Minor cleanups

   - Switch hugepage recovery thread to use vhost_task.

     These kthreads can consume significant amounts of CPU time on
     behalf of a VM or in response to how the VM behaves (for example
     how it accesses its memory); therefore KVM tried to place the
     thread in the VM's cgroups and charge the CPU time consumed by that
     work to the VM's container.

     However the kthreads did not process SIGSTOP/SIGCONT, and therefore
     cgroups which had KVM instances inside could not complete freezing.

     Fix this by replacing the kthread with a PF_USER_WORKER thread, via
     the vhost_task abstraction. Another 100+ lines removed, with
     generally better behavior too like having these threads properly
     parented in the process tree.

   - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
     didn't really work; there was really nothing to work around anyway:
     the broken patch was meant to fix nested virtualization, but the
     PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
     erratum.

   - Fix 6.12 regression where CONFIG_KVM will be built as a module even
     if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
     'y'.

  x86 selftests:

   - x86 selftests can now use AVX.

  Documentation:

   - Use rST internal links

   - Reorganize the introduction to the API document

  Generic:

   - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
     instead of RCU, so that running a vCPU on a different task doesn't
     encounter long due to having to wait for all CPUs become quiescent.

     In general both reads and writes are rare, but userspace that
     supports confidential computing is introducing the use of "helper"
     vCPUs that may jump from one host processor to another. Those will
     be very happy to trigger a synchronize_rcu(), and the effect on
     performance is quite the disaster"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
  KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
  KVM: x86: add back X86_LOCAL_APIC dependency
  Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
  KVM: x86: switch hugepage recovery thread to vhost_task
  KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
  x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
  Documentation: KVM: fix malformed table
  irqchip/loongson-eiointc: Add virt extension support
  LoongArch: KVM: Add irqfd support
  LoongArch: KVM: Add PCHPIC user mode read and write functions
  LoongArch: KVM: Add PCHPIC read and write functions
  LoongArch: KVM: Add PCHPIC device support
  LoongArch: KVM: Add EIOINTC user mode read and write functions
  LoongArch: KVM: Add EIOINTC read and write functions
  LoongArch: KVM: Add EIOINTC device support
  LoongArch: KVM: Add IPI user mode read and write function
  LoongArch: KVM: Add IPI read and write function
  LoongArch: KVM: Add IPI device support
  LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
  KVM: arm64: Pass on SVE mapping failures
  ...
2024-11-23 16:00:50 -08:00
Paolo Bonzini
1508bae370 KVM/arm64 changes for 6.13, part #2
- Constrain invalidations from GICR_INVLPIR to only affect the LPI
    INTID space
 
  - Set of robustness improvements to the management of vgic irqs and GIC
    ITS table entries
 
  - Fix compilation issue w/ CONFIG_CC_OPTIMIZE_FOR_SIZE=y where
    set_sysreg_masks() wasn't getting inlined, breaking check for a
    constant sysreg index
 
  - Correct KVM's vPMU overflow condition to match the architecture for
    hyp and non-hyp counters
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZz7sPxccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFh88AQC5nNcGT2LSuwJYgGhjkAEwMdcv
 /yKyiOM4fNmzXCoMPAD/cYEq6JLiAclMI+Xqs9lEnmn1cteyxjdV/8rIWvvJqAc=
 =Mi/s
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.13-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 changes for 6.13, part #2

 - Constrain invalidations from GICR_INVLPIR to only affect the LPI
   INTID space

 - Set of robustness improvements to the management of vgic irqs and GIC
   ITS table entries

 - Fix compilation issue w/ CONFIG_CC_OPTIMIZE_FOR_SIZE=y where
   set_sysreg_masks() wasn't getting inlined, breaking check for a
   constant sysreg index

 - Correct KVM's vPMU overflow condition to match the architecture for
   hyp and non-hyp counters
2024-11-21 06:00:16 -05:00
Oliver Upton
13905f4547 KVM: arm64: Use MDCR_EL2.HPME to evaluate overflow of hyp counters
The 'global enable control' (as it is termed in the architecture) for
counters reserved by EL2 is MDCR_EL2.HPME. Use that instead of
PMCR_EL0.E when evaluating the overflow state for hyp counters.

Change the return value to a bool while at it, which better reflects the
fact that the overflow state is a shared signal and not a per-counter
property.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241120005230.2335682-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-20 17:23:32 -08:00
Raghavendra Rao Ananta
54bbee190d KVM: arm64: Ignore PMCNTENSET_EL0 while checking for overflow status
DDI0487K.a D13.3.1 describes the PMU overflow condition, which evaluates
to true if any counter's global enable (PMCR_EL0.E), overflow flag
(PMOVSSET_EL0[n]), and interrupt enable (PMINTENSET_EL1[n]) are all 1.
Of note, this does not require a counter to be enabled
(i.e. PMCNTENSET_EL0[n] = 1) to generate an overflow.

Align kvm_pmu_overflow_status() with the reality of the architecture
and stop using PMCNTENSET_EL0 as part of the overflow condition. The
bug was discovered while running an SBSA PMU test [*], which only sets
PMCR.E, PMOVSSET<0>, PMINTENSET<0>, and expects an overflow interrupt.

Cc: stable@vger.kernel.org
Fixes: 76d883c4e6 ("arm64: KVM: Add access handler for PMOVSSET and PMOVSCLR register")
Link: https://github.com/ARM-software/sbsa-acs/blob/master/test_pool/pmu/operating_system/test_pmu001.c
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
[ oliver: massaged changelog ]
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241120005230.2335682-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-20 17:22:49 -08:00
Marc Zyngier
0f3a0f23f5 KVM: arm64: Mark set_sysreg_masks() as inline to avoid build failure
When compiling with CONFIG_CC_OPTIMIZE_FOR_SIZE=y, set_sysreg_masks()
fails to compile thanks to:

	BUILD_BUG_ON(!__builtin_constant_p(sr));

as the compiler doesn't identify sr as a constant, despite all the
callers passing constants.

Fix the issue by always inlining this function, which allows GCC to
do the right thing.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202411201857.ZNudtGJl-lkp@intel.com/
Fixes: a016202009 ("KVM: arm64: Extend masking facility to arbitrary registers")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241120111516.304250-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-20 17:22:00 -08:00
Marc Zyngier
3b2c81d5fe KVM: arm64: vgic-its: Add stronger type-checking to the ITS entry sizes
The ITS ABI infrastructure allows for some pretty lax code, where
the size of the data doesn't have to match the size of the entry,
potentially leading to a collection of interesting bugs.

Commit 7fe28d7e68 ("KVM: arm64: vgic-its: Add a data length check
in vgic_its_save_*") added some checks, but starts by implicitly
casting all writes to a 64bit value, hiding some of the issues.

Instead, introduce macros that will check the data type actually used
for dealing with the table entries. The macros are taking a symbolic
entry type that is used to fetch the size of the entry type for the
current ABI. This immediately catches a couple of low-impact gotchas
(zero values that are implicitly 32bit), easy enough to fix.

Given that we currently only have a single ABI, hardcode a couple of
BUILD_BUG_ON()s that will fire if we use anything but a 64bit quantity,
and some (currently unreachable) fallback code that may become useful
one day.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-20 17:21:08 -08:00
Marc Zyngier
e7619f2a2f KVM: arm64: vgic: Kill VGIC_MAX_PRIVATE definition
VGIC_MAX_PRIVATE is a pretty useless definition, and is better
replaced with VGIC_NR_PRIVATE_IRQS.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-20 17:21:08 -08:00
Marc Zyngier
add570b39f KVM: arm64: vgic: Make vgic_get_irq() more robust
vgic_get_irq() has an awkward signature, as it takes both a kvm
*and* a vcpu, where the vcpu is allowed to be NULL if the INTID
being looked up is a global interrupt (SPI or LPI).

This leads to potentially problematic situations where the INTID
passed is a private interrupt, but that there is no vcpu.

In order to make things less ambiguous, let have *two* helpers
instead:

- vgic_get_irq(struct kvm *kvm, u32 intid), which is only concerned
  with *global* interrupts, as indicated by the lack of vcpu.

- vgic_get_vcpu_irq(struct kvm_vcpu *vcpu, u32 intid), which can
  return *any* interrupt class, but must have of course a non-NULL
  vcpu.

Most of the code nicely falls under one or the other situations,
except for a couple of cases (close to the UABI or in the debug code)
where we have to distinguish between the two cases.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241117165757.247686-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-20 17:21:08 -08:00
Marc Zyngier
d561491ba9 KVM: arm64: vgic-v3: Sanitise guest writes to GICR_INVLPIR
Make sure we filter out non-LPI invalidation when handling writes
to GICR_INVLPIR.

Fixes: 4645d11f4a ("KVM: arm64: vgic-v3: Implement MMIO-based LPI invalidation")
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241117165757.247686-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-20 17:21:07 -08:00
Linus Torvalds
ba1f9c8fe3 arm64 updates for 6.13:
* Support for running Linux in a protected VM under the Arm Confidential
   Compute Architecture (CCA)
 
 * Guarded Control Stack user-space support. Current patches follow the
   x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
   patches (already on the list) will add support for clone3() allowing
   finer-grained control of the shadow stack size and placement from libc
 
 * AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
   getting close with the upcoming dpISA support)
 
 * Other arch features:
 
   - In-kernel use of the memcpy instructions, FEAT_MOPS (previously only
     exposed to user; uaccess support not merged yet)
 
   - MTE: hugetlbfs support and the corresponding kselftests
 
   - Optimise CRC32 using the PMULL instructions
 
   - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG
 
   - Optimise the kernel TLB flushing to use the range operations
 
   - POE/pkey (permission overlays): further cleanups after bringing the
     signal handler in line with the x86 behaviour for 6.12
 
 * arm64 perf updates:
 
   - Support for the NXP i.MX91 PMU in the existing IMX driver
 
   - Support for Ampere SoCs in the Designware PCIe PMU driver
 
   - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC
 
   - Support for Samsung's 'Mongoose' CPU PMU
 
   - Support for PMUv3.9 finer-grained userspace counter access control
 
   - Switch back to platform_driver::remove() now that it returns 'void'
 
   - Add some missing events for the CXL PMU driver
 
 * Miscellaneous arm64 fixes/cleanups:
 
   - Page table accessors cleanup: type updates, drop unused macros,
     reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
     check addresses before runtime P4D/PUD folding
 
   - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
     FEAT_ECV for the generic timers) allowing Linux to boot with
     firmware deployments that don't set SCTLR_EL3.ECVEn
 
   - ACPI/arm64: tighten the check for the array of platform timer
     structures and adjust the error handling procedure in
     gtdt_parse_timer_block()
 
   - Optimise the cache flush for the uprobes xol slot (skip if no
     change) and other uprobes/kprobes cleanups
 
   - Fix the context switching of tpidrro_el0 when kpti is enabled
 
   - Dynamic shadow call stack fixes
 
   - Sysreg updates
 
   - Various arm64 kselftest improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmc5POIACgkQa9axLQDI
 XvEDYA//a3eeNkgMuGdnSCVcLz+zy+oNwAwboG/4X1DqL8jiCbI4npwugPx95RIA
 YZOUvo9T2aL3OyefpUHll4gFHqx9OwoZIig2F70TEUmlPsGUbh0KBkdfQF3xZPdl
 EwV0kHSGEqMWMBwsGJGwgCYrUaf1MUQzh1GBl7VJ2ts5XsJBaBeOyKkysij26wtZ
 V+aHq2IUx7qQS7+HC/4P6IoHxKziFcsCMovaKaynP4cw9xXBQbDMcNlHEwndOMyk
 pu2zrv7GG0j3KQuVP/2Alf5FKhmI0GVGP/6Nc/zsOmw96w8Kf7HfzEtkHawr2aRq
 rqg/c9ivzDn1p+fUBo4ZYtrRk4IAY+yKu6hdzdLTP5+bQrBTWTO9rjQVBm9FAGYT
 sCdEj1NqzvExvNHD7X6ut/GJ05lmce3K+qeSXSEysN9gqiT3eomYWMXrD2V2lxzb
 rIDDcb/icfaqjt14Mksh19r/rzNeq7noj9CGSmcqw0BHZfHzl38Lai6pdfYzCNyn
 vCM/c4c1D/WWX8/lifO1JZVbhDk1jy82Iphg2KEhL8iKPxDsKBBZLmYuU1oa7tMo
 WryGAz9+GQwd+W9chFuaOEtMnzvW2scEJ5Eb2fEf0Qj0aEurkL+C9dZR6o1GN77V
 DBUxtU628Ef4PJJGfbNCwZzdd8UPYG3a/mKfQQ3dz0oz2LySlW4=
 =wDot
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - Support for running Linux in a protected VM under the Arm
   Confidential Compute Architecture (CCA)

 - Guarded Control Stack user-space support. Current patches follow the
   x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
   patches (already on the list) will add support for clone3() allowing
   finer-grained control of the shadow stack size and placement from
   libc

 - AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
   getting close with the upcoming dpISA support)

 - Other arch features:

     - In-kernel use of the memcpy instructions, FEAT_MOPS (previously
       only exposed to user; uaccess support not merged yet)

     - MTE: hugetlbfs support and the corresponding kselftests

     - Optimise CRC32 using the PMULL instructions

     - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG

     - Optimise the kernel TLB flushing to use the range operations

     - POE/pkey (permission overlays): further cleanups after bringing
       the signal handler in line with the x86 behaviour for 6.12

 - arm64 perf updates:

     - Support for the NXP i.MX91 PMU in the existing IMX driver

     - Support for Ampere SoCs in the Designware PCIe PMU driver

     - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC

     - Support for Samsung's 'Mongoose' CPU PMU

     - Support for PMUv3.9 finer-grained userspace counter access
       control

     - Switch back to platform_driver::remove() now that it returns
       'void'

     - Add some missing events for the CXL PMU driver

 - Miscellaneous arm64 fixes/cleanups:

     - Page table accessors cleanup: type updates, drop unused macros,
       reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
       check addresses before runtime P4D/PUD folding

     - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
       FEAT_ECV for the generic timers) allowing Linux to boot with
       firmware deployments that don't set SCTLR_EL3.ECVEn

     - ACPI/arm64: tighten the check for the array of platform timer
       structures and adjust the error handling procedure in
       gtdt_parse_timer_block()

     - Optimise the cache flush for the uprobes xol slot (skip if no
       change) and other uprobes/kprobes cleanups

     - Fix the context switching of tpidrro_el0 when kpti is enabled

     - Dynamic shadow call stack fixes

     - Sysreg updates

     - Various arm64 kselftest improvements

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits)
  arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
  kselftest/arm64: Try harder to generate different keys during PAC tests
  kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
  arm64/ptrace: Clarify documentation of VL configuration via ptrace
  kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
  acpi/arm64: remove unnecessary cast
  arm64/mm: Change protval as 'pteval_t' in map_range()
  kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
  kselftest/arm64: Add FPMR coverage to fp-ptrace
  kselftest/arm64: Expand the set of ZA writes fp-ptrace does
  kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
  kselftest/arm64: Enable build of PAC tests with LLVM=1
  kselftest/arm64: Check that SVCR is 0 in signal handlers
  selftests/mm: Fix unused function warning for aarch64_write_signal_pkey()
  kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
  kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
  kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
  kselftest/arm64: Fix build with stricter assemblers
  arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
  arm64/scs: Deal with 64-bit relative offsets in FDE frames
  ...
2024-11-18 18:10:37 -08:00
Catalin Marinas
5a4332062e Merge branches 'for-next/gcs', 'for-next/probes', 'for-next/asm-offsets', 'for-next/tlb', 'for-next/misc', 'for-next/mte', 'for-next/sysreg', 'for-next/stacktrace', 'for-next/hwcap3', 'for-next/kselftest', 'for-next/crc32', 'for-next/guest-cca', 'for-next/haft' and 'for-next/scs', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
  perf: Switch back to struct platform_driver::remove()
  perf: arm_pmuv3: Add support for Samsung Mongoose PMU
  dt-bindings: arm: pmu: Add Samsung Mongoose core compatible
  perf/dwc_pcie: Fix typos in event names
  perf/dwc_pcie: Add support for Ampere SoCs
  ARM: pmuv3: Add missing write_pmuacr()
  perf/marvell: Marvell PEM performance monitor support
  perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control
  perf/dwc_pcie: Convert the events with mixed case to lowercase
  perf/cxlpmu: Support missing events in 3.1 spec
  perf: imx_perf: add support for i.MX91 platform
  dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible
  drivers perf: remove unused field pmu_node

* for-next/gcs: (42 commits)
  : arm64 Guarded Control Stack user-space support
  kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
  arm64/gcs: Fix outdated ptrace documentation
  kselftest/arm64: Ensure stable names for GCS stress test results
  kselftest/arm64: Validate that GCS push and write permissions work
  kselftest/arm64: Enable GCS for the FP stress tests
  kselftest/arm64: Add a GCS stress test
  kselftest/arm64: Add GCS signal tests
  kselftest/arm64: Add test coverage for GCS mode locking
  kselftest/arm64: Add a GCS test program built with the system libc
  kselftest/arm64: Add very basic GCS test program
  kselftest/arm64: Always run signals tests with GCS enabled
  kselftest/arm64: Allow signals tests to specify an expected si_code
  kselftest/arm64: Add framework support for GCS to signal handling tests
  kselftest/arm64: Add GCS as a detected feature in the signal tests
  kselftest/arm64: Verify the GCS hwcap
  arm64: Add Kconfig for Guarded Control Stack (GCS)
  arm64/ptrace: Expose GCS via ptrace and core files
  arm64/signal: Expose GCS state in signal frames
  arm64/signal: Set up and restore the GCS context for signal handlers
  arm64/mm: Implement map_shadow_stack()
  ...

* for-next/probes:
  : Various arm64 uprobes/kprobes cleanups
  arm64: insn: Simulate nop instruction for better uprobe performance
  arm64: probes: Remove probe_opcode_t
  arm64: probes: Cleanup kprobes endianness conversions
  arm64: probes: Move kprobes-specific fields
  arm64: probes: Fix uprobes for big-endian kernels
  arm64: probes: Fix simulate_ldr*_literal()
  arm64: probes: Remove broken LDR (literal) uprobe support

* for-next/asm-offsets:
  : arm64 asm-offsets.c cleanup (remove unused offsets)
  arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET
  arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE
  arm64: asm-offsets: remove VM_EXEC and PAGE_SZ
  arm64: asm-offsets: remove MM_CONTEXT_ID
  arm64: asm-offsets: remove COMPAT_{RT_,SIGFRAME_REGS_OFFSET
  arm64: asm-offsets: remove VMA_VM_*
  arm64: asm-offsets: remove TSK_ACTIVE_MM

* for-next/tlb:
  : TLB flushing optimisations
  arm64: optimize flush tlb kernel range
  arm64: tlbflush: add __flush_tlb_range_limit_excess()

* for-next/misc:
  : Miscellaneous patches
  arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
  arm64/ptrace: Clarify documentation of VL configuration via ptrace
  acpi/arm64: remove unnecessary cast
  arm64/mm: Change protval as 'pteval_t' in map_range()
  arm64: uprobes: Optimize cache flushes for xol slot
  acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block()
  arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG
  arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings
  arm64/mm: Sanity check PTE address before runtime P4D/PUD folding
  arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont()
  ACPI: GTDT: Tighten the check for the array of platform timer structures
  arm64/fpsimd: Fix a typo
  arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers
  arm64: Return early when break handler is found on linked-list
  arm64/mm: Re-organize arch_make_huge_pte()
  arm64/mm: Drop _PROT_SECT_DEFAULT
  arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV
  arm64: head: Drop SWAPPER_TABLE_SHIFT
  arm64: cpufeature: add POE to cpucap_is_possible()
  arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t

* for-next/mte:
  : Various MTE improvements
  selftests: arm64: add hugetlb mte tests
  hugetlb: arm64: add mte support

* for-next/sysreg:
  : arm64 sysreg updates
  arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09

* for-next/stacktrace:
  : arm64 stacktrace improvements
  arm64: preserve pt_regs::stackframe during exec*()
  arm64: stacktrace: unwind exception boundaries
  arm64: stacktrace: split unwind_consume_stack()
  arm64: stacktrace: report recovered PCs
  arm64: stacktrace: report source of unwind data
  arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk()
  arm64: use a common struct frame_record
  arm64: pt_regs: swap 'unused' and 'pmr' fields
  arm64: pt_regs: rename "pmr_save" -> "pmr"
  arm64: pt_regs: remove stale big-endian layout
  arm64: pt_regs: assert pt_regs is a multiple of 16 bytes

* for-next/hwcap3:
  : Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4)
  arm64: Support AT_HWCAP3
  binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4

* for-next/kselftest: (30 commits)
  : arm64 kselftest fixes/cleanups
  kselftest/arm64: Try harder to generate different keys during PAC tests
  kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
  kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
  kselftest/arm64: Add FPMR coverage to fp-ptrace
  kselftest/arm64: Expand the set of ZA writes fp-ptrace does
  kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
  kselftest/arm64: Enable build of PAC tests with LLVM=1
  kselftest/arm64: Check that SVCR is 0 in signal handlers
  kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
  kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
  kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
  kselftest/arm64: Fix build with stricter assemblers
  kselftest/arm64: Test signal handler state modification in fp-stress
  kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test
  kselftest/arm64: Implement irritators for ZA and ZT
  kselftest/arm64: Remove unused ADRs from irritator handlers
  kselftest/arm64: Correct misleading comments on fp-stress irritators
  kselftest/arm64: Poll less often while waiting for fp-stress children
  kselftest/arm64: Increase frequency of signal delivery in fp-stress
  kselftest/arm64: Fix encoding for SVE B16B16 test
  ...

* for-next/crc32:
  : Optimise CRC32 using PMULL instructions
  arm64/crc32: Implement 4-way interleave using PMULL
  arm64/crc32: Reorganize bit/byte ordering macros
  arm64/lib: Handle CRC-32 alternative in C code

* for-next/guest-cca:
  : Support for running Linux as a guest in Arm CCA
  arm64: Document Arm Confidential Compute
  virt: arm-cca-guest: TSM_REPORT support for realms
  arm64: Enable memory encrypt for Realms
  arm64: mm: Avoid TLBI when marking pages as valid
  arm64: Enforce bounce buffers for realm DMA
  efi: arm64: Map Device with Prot Shared
  arm64: rsi: Map unprotected MMIO as decrypted
  arm64: rsi: Add support for checking whether an MMIO is protected
  arm64: realm: Query IPA size from the RMM
  arm64: Detect if in a realm and set RIPAS RAM
  arm64: rsi: Add RSI definitions

* for-next/haft:
  : Support for arm64 FEAT_HAFT
  arm64: pgtable: Warn unexpected pmdp_test_and_clear_young()
  arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG
  arm64: Add support for FEAT_HAFT
  arm64: setup: name 'tcr2' register
  arm64/sysreg: Update ID_AA64MMFR1_EL1 register

* for-next/scs:
  : Dynamic shadow call stack fixes
  arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
  arm64/scs: Deal with 64-bit relative offsets in FDE frames
  arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
2024-11-14 12:07:16 +00:00
Paolo Bonzini
7b541d557f KVM/arm64 changes for 6.13, part #1
- Support for stage-1 permission indirection (FEAT_S1PIE) and
    permission overlays (FEAT_S1POE), including nested virt + the
    emulated page table walker
 
  - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
    was introduced in PSCIv1.3 as a mechanism to request hibernation,
    similar to the S4 state in ACPI
 
  - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
    part of it, introduce trivial initialization of the host's MPAM
    context so KVM can use the corresponding traps
 
  - PMU support under nested virtualization, honoring the guest
    hypervisor's trap configuration and event filtering when running a
    nested guest
 
  - Fixes to vgic ITS serialization where stale device/interrupt table
    entries are not zeroed when the mapping is invalidated by the VM
 
  - Avoid emulated MMIO completion if userspace has requested synchronous
    external abort injection
 
  - Various fixes and cleanups affecting pKVM, vCPU initialization, and
    selftests
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZzTZXRccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFioUAP0cs2pYcwuCqLgmeHqfz6L5Xsw3
 hKBCNuvr5mjU0hZfLAEA5ml2eUKD7OnssAOmUZ/K/NoCdJFCe8mJWQDlURvr9g4=
 =u2/3
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.13' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 changes for 6.13, part #1

 - Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker

 - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI

 - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps

 - PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest

 - Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM

 - Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection

 - Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
2024-11-14 07:05:36 -05:00
James Clark
60ad25e14a KVM: arm64: Pass on SVE mapping failures
This function can fail but its return value isn't passed onto the
caller. Presumably this could result in a broken state.

Fixes: 66d5b53e20 ("KVM: arm64: Allocate memory mapped at hyp for host sve state in pKVM")
Signed-off-by: James Clark <james.clark@linaro.org>
Reviewed-by: Fuad Tabba <tabba@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241112105604.795809-1-james.clark@linaro.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-12 11:04:39 -08:00
Oliver Upton
9d0bee66f7 Merge branch kvm-arm64/vgic-its-fixes into kvmarm/next
* kvm-arm64/vgic-its-fixes:
  : Fixes for vgic-its save/restore, courtesy of Kunkun Jiang and Jing Zhang
  :
  : Address bugs where restoring an ITS consumes a stale DTE/ITE, which
  : may lead to either garbage mappings in the ITS or the overall restore
  : ioctl failing. The fix in both cases is to zero a DTE/ITE when its
  : translation has been invalidated by the guest.
  KVM: arm64: vgic-its: Clear ITE when DISCARD frees an ITE
  KVM: arm64: vgic-its: Clear DTE when MAPD unmaps a device
  KVM: arm64: vgic-its: Add a data length check in vgic_its_save_*

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 20:09:09 +00:00
Kunkun Jiang
7602ffd1d5 KVM: arm64: vgic-its: Clear ITE when DISCARD frees an ITE
When DISCARD frees an ITE, it does not invalidate the
corresponding ITE. In the scenario of continuous saves and
restores, there may be a situation where an ITE is not saved
but is restored. This is unreasonable and may cause restore
to fail. This patch clears the corresponding ITE when DISCARD
frees an ITE.

Cc: stable@vger.kernel.org
Fixes: eff484e029 ("KVM: arm64: vgic-its: ITT save and restore")
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with entry write helper]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-6-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 19:54:03 +00:00
Kunkun Jiang
e9649129d3 KVM: arm64: vgic-its: Clear DTE when MAPD unmaps a device
vgic_its_save_device_tables will traverse its->device_list to
save DTE for each device. vgic_its_restore_device_tables will
traverse each entry of device table and check if it is valid.
Restore if valid.

But when MAPD unmaps a device, it does not invalidate the
corresponding DTE. In the scenario of continuous saves
and restores, there may be a situation where a device's DTE
is not saved but is restored. This is unreasonable and may
cause restore to fail. This patch clears the corresponding
DTE when MAPD unmaps a device.

Cc: stable@vger.kernel.org
Fixes: 57a9a11715 ("KVM: arm64: vgic-its: Device table save/restore")
Co-developed-by: Shusen Li <lishusen2@huawei.com>
Signed-off-by: Shusen Li <lishusen2@huawei.com>
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with entry write helper]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-5-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 19:52:31 +00:00
Jing Zhang
7fe28d7e68 KVM: arm64: vgic-its: Add a data length check in vgic_its_save_*
In all the vgic_its_save_*() functinos, they do not check whether
the data length is 8 bytes before calling vgic_write_guest_lock.
This patch adds the check. To prevent the kernel from being blown up
when the fault occurs, KVM_BUG_ON() is used. And the other BUG_ON()s
are replaced together.

Cc: stable@vger.kernel.org
Signed-off-by: Kunkun Jiang <jiangkunkun@huawei.com>
[Jing: Update with the new entry read/write helpers]
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Link: https://lore.kernel.org/r/20241107214137.428439-4-jingzhangos@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 19:50:14 +00:00
Oliver Upton
6d4b81e2e7 Merge branch kvm-arm64/nv-pmu into kvmarm/next
* kvm-arm64/nv-pmu:
  : Support for vEL2 PMU controls
  :
  : Align the vEL2 PMU support with the current state of non-nested KVM,
  : including:
  :
  :  - Trap routing, with the annoying complication of EL2 traps that apply
  :    in Host EL0
  :
  :  - PMU emulation, using the correct configuration bits depending on
  :    whether a counter falls in the hypervisor or guest range of PMCs
  :
  :  - Perf event swizzling across nested boundaries, as the event filtering
  :    needs to be remapped to cope with vEL2
  KVM: arm64: nv: Reprogram PMU events affected by nested transition
  KVM: arm64: nv: Apply EL2 event filtering when in hyp context
  KVM: arm64: nv: Honor MDCR_EL2.HLP
  KVM: arm64: nv: Honor MDCR_EL2.HPME
  KVM: arm64: Add helpers to determine if PMC counts at a given EL
  KVM: arm64: nv: Adjust range of accessible PMCs according to HPMN
  KVM: arm64: Rename kvm_pmu_valid_counter_mask()
  KVM: arm64: nv: Advertise support for FEAT_HPMN0
  KVM: arm64: nv: Describe trap behaviour of MDCR_EL2.HPMN
  KVM: arm64: nv: Honor MDCR_EL2.{TPM, TPMCR} in Host EL0
  KVM: arm64: nv: Reinject traps that take effect in Host EL0
  KVM: arm64: nv: Rename BEHAVE_FORWARD_ANY
  KVM: arm64: nv: Allow coarse-grained trap combos to use complex traps
  KVM: arm64: Describe RES0/RES1 bits of MDCR_EL2
  arm64: sysreg: Add new definitions for ID_AA64DFR0_EL1
  arm64: sysreg: Migrate MDCR_EL2 definition to table
  arm64: sysreg: Describe ID_AA64DFR2_EL1 fields

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:49:02 +00:00
Oliver Upton
4bc1a8808e Merge branch kvm-arm64/mmio-sea into kvmarm/next
* kvm-arm64/mmio-sea:
  : Fix for SEA injection in response to MMIO
  :
  : Fix + test coverage for SEA injection in response to an unhandled MMIO
  : exit to userspace. Naturally, if userspace decides to abort an MMIO
  : instruction KVM shouldn't continue with instruction emulation...
  KVM: arm64: selftests: Add tests for MMIO external abort injection
  KVM: arm64: selftests: Convert to kernel's ESR terminology
  tools: arm64: Grab a copy of esr.h from kernel
  KVM: arm64: Don't retire aborted MMIO instruction

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:48:12 +00:00
Oliver Upton
fbf3372baa Merge branch kvm-arm64/misc into kvmarm/next
* kvm-arm64/misc:
  : Miscellaneous updates
  :
  :  - Drop useless check against vgic state in ICC_CLTR_EL1.SEIS read
  :    emulation
  :
  :  - Fix trap configuration for pKVM
  :
  :  - Close the door on initialization bugs surrounding userspace irqchip
  :    static key by removing it.
  KVM: selftests: Don't bother deleting memslots in KVM when freeing VMs
  KVM: arm64: Get rid of userspace_irqchip_in_use
  KVM: arm64: Initialize trap register values in hyp in pKVM
  KVM: arm64: Initialize the hypervisor's VM state at EL2
  KVM: arm64: Refactor kvm_vcpu_enable_ptrauth() for hyp use
  KVM: arm64: Move pkvm_vcpu_init_traps() to init_pkvm_hyp_vcpu()
  KVM: arm64: Don't map 'kvm_vgic_global_state' at EL2 with pKVM
  KVM: arm64: Just advertise SEIS as 0 when emulating ICC_CTLR_EL1

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:47:50 +00:00
Oliver Upton
24bb181136 Merge branch kvm-arm64/mpam-ni into kvmarm/next
* kvm-arm64/mpam-ni:
  : Hiding FEAT_MPAM from KVM guests, courtesy of James Morse + Joey Gouly
  :
  : Fix a longstanding bug where FEAT_MPAM was accidentally exposed to KVM
  : guests + the EL2 trap configuration was not explicitly configured. As
  : part of this, bring in skeletal support for initialising the MPAM CPU
  : context so KVM can actually set traps for its guests.
  :
  : Be warned -- if this series leads to boot failures on your system,
  : you're running on turd firmware.
  :
  : As an added bonus (that builds upon the infrastructure added by the MPAM
  : series), allow userspace to configure CTR_EL0.L1Ip, courtesy of Shameer
  : Kolothum.
  KVM: arm64: Make L1Ip feature in CTR_EL0 writable from userspace
  KVM: arm64: selftests: Test ID_AA64PFR0.MPAM isn't completely ignored
  KVM: arm64: Disable MPAM visibility by default and ignore VMM writes
  KVM: arm64: Add a macro for creating filtered sys_reg_descs entries
  KVM: arm64: Fix missing traps of guest accesses to the MPAM registers
  arm64: cpufeature: discover CPU support for MPAM
  arm64: head.S: Initialise MPAM EL2 registers and disable traps
  arm64/sysreg: Convert existing MPAM sysregs and add the remaining entries

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:38:30 +00:00
Oliver Upton
7ccd615bc6 Merge branch kvm-arm64/psci-1.3 into kvmarm/next
* kvm-arm64/psci-1.3:
  : PSCI v1.3 support, courtesy of David Woodhouse
  :
  : Bump KVM's PSCI implementation up to v1.3, with the added bonus of
  : implementing the SYSTEM_OFF2 call. Like other system-scoped PSCI calls,
  : this gets relayed to userspace for further processing with a new
  : KVM_SYSTEM_EVENT_SHUTDOWN flag.
  :
  : As an added bonus, implement client-side support for hibernation with
  : the SYSTEM_OFF2 call.
  arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate
  KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
  KVM: selftests: Add test for PSCI SYSTEM_OFF2
  KVM: arm64: Add support for PSCI v1.2 and v1.3
  KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
  firmware/psci: Add definitions for PSCI v1.3 specification

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:36:46 +00:00
Shameer Kolothum
e9b57d7f97 KVM: arm64: Make L1Ip feature in CTR_EL0 writable from userspace
Only allow userspace to set VIPT(0b10) or PIPT(0b11) for L1Ip based on
what hardware reports as both AIVIVT (0b01) and VPIPT (0b00) are
documented as reserved.

Using a VIPT for Guest where hardware reports PIPT may lead to over
invalidation, but is still correct. Hence, we can allow downgrading
PIPT to VIPT, but not the other way around.

Reviewed-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20241022073943.35764-1-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:33:09 +00:00
Raghavendra Rao Ananta
38d7aacca0 KVM: arm64: Get rid of userspace_irqchip_in_use
Improper use of userspace_irqchip_in_use led to syzbot hitting the
following WARN_ON() in kvm_timer_update_irq():

WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
kvm_timer_update_irq+0x21c/0x394
Call trace:
  kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
  kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
  kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
  kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
  kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
  kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
  kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:907 [inline]
  __se_sys_ioctl fs/ioctl.c:893 [inline]
  __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
  __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
  invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
  el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
  do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
  el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
  el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
  el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598

The following sequence led to the scenario:
 - Userspace creates a VM and a vCPU.
 - The vCPU is initialized with KVM_ARM_VCPU_PMU_V3 during
   KVM_ARM_VCPU_INIT.
 - Without any other setup, such as vGIC or vPMU, userspace issues
   KVM_RUN on the vCPU. Since the vPMU is requested, but not setup,
   kvm_arm_pmu_v3_enable() fails in kvm_arch_vcpu_run_pid_change().
   As a result, KVM_RUN returns after enabling the timer, but before
   incrementing 'userspace_irqchip_in_use':
   kvm_arch_vcpu_run_pid_change()
       ret = kvm_arm_pmu_v3_enable()
           if (!vcpu->arch.pmu.created)
               return -EINVAL;
       if (ret)
           return ret;
       [...]
       if (!irqchip_in_kernel(kvm))
           static_branch_inc(&userspace_irqchip_in_use);
 - Userspace ignores the error and issues KVM_ARM_VCPU_INIT again.
   Since the timer is already enabled, control moves through the
   following flow, ultimately hitting the WARN_ON():
   kvm_timer_vcpu_reset()
       if (timer->enabled)
          kvm_timer_update_irq()
              if (!userspace_irqchip())
                  ret = kvm_vgic_inject_irq()
                      ret = vgic_lazy_init()
                          if (unlikely(!vgic_initialized(kvm)))
                              if (kvm->arch.vgic.vgic_model !=
                                  KVM_DEV_TYPE_ARM_VGIC_V2)
                                      return -EBUSY;
                  WARN_ON(ret);

Theoretically, since userspace_irqchip_in_use's functionality can be
simply replaced by '!irqchip_in_kernel()', get rid of the static key
to avoid the mismanagement, which also helps with the syzbot issue.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 21:22:52 +00:00
Oliver Upton
ae323e0358 KVM: arm64: nv: Reprogram PMU events affected by nested transition
Start reprogramming PMU events at nested boundaries now that everything
is in place to handle the EL2 event filter. Only repaint events where
the filter differs between EL1 and EL2 as a slight optimization.

PMU now 'works' for nested VMs, albeit slow.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182559.3364829-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 19:00:41 +00:00
Oliver Upton
8a34979030 KVM: arm64: nv: Apply EL2 event filtering when in hyp context
It hopefully comes as no surprise when I say that vEL2 actually runs at
EL1. So, the guest hypervisor's EL2 event filter (NSH) needs to actually
be applied to EL1 in the perf event. In addition to this, the disable
bit for the guest counter range (HPMD) needs to have the effect of
stopping the affected counters.

Do exactly that by stuffing ::exclude_kernel with the combined effect of
these controls. This isn't quite enough yet, as the backing perf events
need to be reprogrammed upon nested ERET/exception entry to remap the
effective filter onto ::exclude_kernel.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-18-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 19:00:41 +00:00