Commit Graph

6076 Commits

Author SHA1 Message Date
Linus Torvalds
a382b06d29 KVM/arm64 fixes for 6.14, take #4
* Fix a couple of bugs affecting pKVM's PSCI relay implementation
   when running in the hVHE mode, resulting in the host being entered
   with the MMU in an unknown state, and EL2 being in the wrong mode.
 
 x86:
 
 * Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow.
 
 * Ensure DEBUGCTL is context switched on AMD to avoid running the guest with
   the host's value, which can lead to unexpected bus lock #DBs.
 
 * Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly
   emulate BTF.  KVM's lack of context switching has meant BTF has always been
   broken to some extent.
 
 * Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest
   can enable DebugSwap without KVM's knowledge.
 
 * Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO
   memory" phase without actually generating a write-protection fault.
 
 * Fix a printf() goof in the SEV smoke test that causes build failures with
   -Werror.
 
 * Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2
   isn't supported by KVM.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmfNSeUUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNKngf/cLgQAT9AF4nFqcwh5b5uucKHVJ8W
 uTiGlWqLAf2UN53L63eZ/7vKQWGQYkOTFvormR14Jam6IYtytsZw1xLBH4fGtUyB
 qVjk0EPzaKGqn3LrgyneQNCXdyxJv7EBVBgoOKH0pvOksoW2E5ZizhhtRFtL7nCE
 Yk8FQKpP0mIBk04RMsvzJVEFKIb4OZgJadWo0gryg1oF2aAv7mxQjyqUWsBDsb3q
 99c0ElSBfV39FeT8xeok4k7S5jbBWii2KiaH72ZsNiBu0rYmEuLwIoygCNNWL9Wu
 FPdQ+r//YrzfCJSXwGPfdUaRaF4p2642S6oiXQuusNNUmhK6/MRo3mZo8A==
 =XQHm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "arm64:

   - Fix a couple of bugs affecting pKVM's PSCI relay implementation
     when running in the hVHE mode, resulting in the host being entered
     with the MMU in an unknown state, and EL2 being in the wrong mode

  x86:

   - Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow

   - Ensure DEBUGCTL is context switched on AMD to avoid running the
     guest with the host's value, which can lead to unexpected bus lock
     #DBs

   - Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't
     properly emulate BTF. KVM's lack of context switching has meant BTF
     has always been broken to some extent

   - Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as
     the guest can enable DebugSwap without KVM's knowledge

   - Fix a bug in mmu_stress_tests where a vCPU could finish the "writes
     to RO memory" phase without actually generating a write-protection
     fault

   - Fix a printf() goof in the SEV smoke test that causes build
     failures with -Werror

   - Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when
     PERFMON_V2 isn't supported by KVM"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Explicitly zero EAX and EBX when PERFMON_V2 isn't supported by KVM
  KVM: selftests: Fix printf() format goof in SEV smoke test
  KVM: selftests: Ensure all vCPUs hit -EFAULT during initial RO stage
  KVM: SVM: Don't rely on DebugSwap to restore host DR0..DR3
  KVM: SVM: Save host DR masks on CPUs with DebugSwap
  KVM: arm64: Initialize SCTLR_EL1 in __kvm_hyp_init_cpu()
  KVM: arm64: Initialize HCR_EL2.E2H early
  KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs
  KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabled
  KVM: x86: Snapshot the host's DEBUGCTL in common x86
  KVM: SVM: Suppress DEBUGCTL.BTF on AMD
  KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective value
  KVM: selftests: Assert that STI blocking isn't set after event injection
  KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadow
2025-03-09 09:04:08 -10:00
Anna-Maria Behnsen
886653e366 vdso: Rework struct vdso_time_data and introduce struct vdso_clock
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be an array of VDSO clocks.

Now that all preparatory changes are in place:

Split the clock related struct members into a separate struct
vdso_clock. Make sure all users are aware, that vdso_time_data is no longer
initialized as an array and vdso_clock is now the array inside
vdso_data. Remove the vdso_clock define, which mapped it to vdso_time_data
for the transition.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-19-c1b5c69a166f@linutronix.de
2025-03-08 14:37:41 +01:00
Nam Cao
5340f3cb20 arm64/vdso: Prepare introduction of struct vdso_clock
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be array of VDSO clocks. At the moment,
vdso_clock is simply a define which maps vdso_clock to vdso_time_data.

To prepare for the rework of the data structures, replace the struct
vdso_time_data pointer with a struct vdso_clock pointer where applicable.

No functional change.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-16-c1b5c69a166f@linutronix.de
2025-03-08 14:37:41 +01:00
Thomas Weißschuh
b69b47a6b5 arm64: Make asm/cache.h compatible with vDSO
asm/cache.h can be used during the vDSO build through vdso/cache.h.
Not all definitions in it are compatible with the vDSO, especially the
compat vDSO.

Hide the more complex definitions from the vDSO build.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-2-c1b5c69a166f@linutronix.de
2025-03-08 14:37:39 +01:00
Kristina Martšenko
fe59e0358d arm64: lib: Use MOPS for usercopy routines
Similarly to what was done with the memcpy() routines, make
copy_to_user(), copy_from_user() and clear_user() also use the Armv8.8
FEAT_MOPS instructions.

Both MOPS implementation options (A and B) are supported, including
asymmetric systems. The exception fixup code fixes up the registers
according to the option used.

In case of a fault the routines return precisely how much was not copied
(as required by the comment in include/linux/uaccess.h), as unprivileged
versions of CPY/SET are guaranteed not to have written past the
addresses reported in the GPRs.

The MOPS instructions could possibly be inlined into callers (and
patched to branch to the generic implementation if not detected;
similarly to what x86 does), but as a first step this patch just uses
them in the out-of-line routines.

Signed-off-by: Kristina Martšenko <kristina.martsenko@arm.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20250228170006.390100-4-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-03-07 18:30:16 +00:00
Kristina Martšenko
04a9f771d8 arm64: mm: Handle PAN faults on uaccess CPY* instructions
A subsequent patch will use CPY* instructions to copy between user and
kernel memory. Add handling for PAN faults caused by an intended kernel
memory access erroneously accessing user memory, in order to make it
easier to debug kernel bugs and to keep the same behavior as with
regular loads/stores.

Signed-off-by: Kristina Martšenko <kristina.martsenko@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20250228170006.390100-3-kristina.martsenko@arm.com
[catalin.marinas@arm.com: Folded the extable search into insn_may_access_user()]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-03-07 18:28:29 +00:00
Kristina Martšenko
653884f887 arm64: extable: Add fixup handling for uaccess CPY* instructions
A subsequent patch will use CPY* instructions to copy between user and
kernel memory. Add a new exception fixup type to avoid fixing up faults
on kernel memory accesses, in order to make it easier to debug kernel
bugs and to keep the same behavior as with regular loads/stores.

Signed-off-by: Kristina Martšenko <kristina.martsenko@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20250228170006.390100-2-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-03-07 16:18:06 +00:00
Anshuman Khandual
2d7872f3ae arm64/mm: Convert __pte_to_phys() and __phys_to_pte_val() as functions
When CONFIG_ARM64_PA_BITS_52 is enabled, page table helpers __pte_to_phys()
and __phys_to_pte_val() are functions which return phys_addr_t and pteval_t
respectively as expected. But otherwise without this config being enabled,
they are defined as macros and their return types are implicit.

Until now this has worked out correctly as both pte_t and phys_addr_t data
types have been 64 bits. But with the introduction of 128 bit page tables,
pte_t becomes 128 bits. Hence this ends up with incorrect widths after the
conversions, which leads to compiler warnings.

Fix these warnings by converting __pte_to_phys() and __phys_to_pte_val()
as functions instead where the return types are handled explicitly.

Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20250227022412.2015835-1-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-03-05 18:16:15 +00:00
Andre Przywara
faf7714a47 KVM: arm64: nv: Allow userland to set VGIC maintenance IRQ
The VGIC maintenance IRQ signals various conditions about the LRs, when
the GIC's virtualization extension is used.
So far we didn't need it, but nested virtualization needs to know about
this interrupt, so add a userland interface to setup the IRQ number.
The architecture mandates that it must be a PPI, on top of that this code
only exports a per-device option, so the PPI is the same on all VCPUs.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
[added some bits of documentation]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-16-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Oliver Upton
93078ae63f KVM: arm64: nv: Request vPE doorbell upon nested ERET to L2
Running an L2 guest with GICv4 enabled goes absolutely nowhere, and gets
into a vicious cycle of nested ERET followed by nested exception entry
into the L1.

When KVM does a put on a runnable vCPU, it marks the vPE as nonresident
but does not request a doorbell IRQ. Behind the scenes in the ITS
driver's view of the vCPU, its_vpe::pending_last gets set to true to
indicate that context is still runnable.

This comes to a head when doing the nested ERET into L2. The vPE doesn't
get scheduled on the redistributor as it is exclusively part of the L1's
VGIC context. kvm_vgic_vcpu_pending_irq() returns true because the vPE
appears runnable, and KVM does a nested exception entry into the L1
before L2 ever gets off the ground.

This issue can be papered over by requesting a doorbell IRQ when
descheduling a vPE as part of a nested ERET. KVM needs this anyway to
kick the vCPU out of the L2 when an IRQ becomes pending for the L1.

Link: https://lore.kernel.org/r/20240823212703.3576061-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-13-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Jintack Lim
69c9176c38 KVM: arm64: nv: Respect virtual HCR_EL2.TWx setting
Forward exceptions due to WFI or WFE instructions to the virtual EL2 if
they are not coming from the virtual EL2 and virtual HCR_EL2.TWx is set.

Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-12-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Marc Zyngier
4b1b97f0d7 KVM: arm64: nv: Handle L2->L1 transition on interrupt injection
An interrupt being delivered to L1 while running L2 must result
in the correct exception being delivered to L1.

This means that if, on entry to L2, we found ourselves with pending
interrupts in the L1 distributor, we need to take immediate action.
This is done by posting a request which will prevent the entry in
L2, and deliver an IRQ exception to L1, forcing the switch.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:10 -08:00
Marc Zyngier
146a050f2d KVM: arm64: nv: Nested GICv3 emulation
When entering a nested VM, we set up the hypervisor control interface
based on what the guest hypervisor has set. Especially, we investigate
each list register written by the guest hypervisor whether HW bit is
set.  If so, we translate hw irq number from the guest's point of view
to the real hardware irq number if there is a mapping.

Co-developed-by: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
[Christoffer: Redesigned execution flow around vcpu load/put]
Co-developed-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
[maz: Rewritten to support GICv3 instead of GICv2, NV2 support]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-9-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:57:04 -08:00
Marc Zyngier
182f159694 KVM: arm64: nv: Add ICH_*_EL2 registers to vpcu_sysreg
FEAT_NV2 comes with a bunch of register-to-memory redirection
involving the ICH_*_EL2 registers (LRs, APRs, VMCR, HCR).

Adds them to the vcpu_sysreg enumeration.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-6-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:55:10 -08:00
Marc Zyngier
b7a252e881 arm64: sysreg: Add layout for ICH_MISR_EL2
The ICH_MISR_EL2-related macros are missing a number of status
bits that we are about to handle. Take this opportunity to fully
describe the layout of that register as part of the automatic
generation infrastructure.

Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:51:51 -08:00
Marc Zyngier
5815fb82dc arm64: sysreg: Add layout for ICH_VTR_EL2
The ICH_VTR_EL2-related macros are missing a number of config
bits that we are about to handle. Take this opportunity to fully
describe the layout of that register as part of the automatic
generation infrastructure.

This results in a bit of churn to repaint constants that are now
generated with a different format.

Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:51:51 -08:00
Marc Zyngier
22513c0d2a arm64: sysreg: Add layout for ICH_HCR_EL2
The ICH_HCR_EL2-related macros are missing a number of control
bits that we are about to handle. Take this opportunity to fully
describe the layout of that register as part of the automatic
generation infrastructure.

This results in a bit of churn, unfortunately.

Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225172930.1850838-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-03-03 14:51:51 -08:00
Ahmed Genidi
3855a7b91d KVM: arm64: Initialize SCTLR_EL1 in __kvm_hyp_init_cpu()
When KVM is in protected mode, host calls to PSCI are proxied via EL2,
and cold entries from CPU_ON, CPU_SUSPEND, and SYSTEM_SUSPEND bounce
through __kvm_hyp_init_cpu() at EL2 before entering the host kernel's
entry point at EL1. While __kvm_hyp_init_cpu() initializes SPSR_EL2 for
the exception return to EL1, it does not initialize SCTLR_EL1.

Due to this, it's possible to enter EL1 with SCTLR_EL1 in an UNKNOWN
state. In practice this has been seen to result in kernel crashes after
CPU_ON as a result of SCTLR_EL1.M being 1 in violation of the initial
core configuration specified by PSCI.

Fix this by initializing SCTLR_EL1 for cold entry to the host kernel.
As it's necessary to write to SCTLR_EL12 in VHE mode, this
initialization is moved into __kvm_host_psci_cpu_entry() where we can
use write_sysreg_el1().

The remnants of the '__init_el2_nvhe_prepare_eret' macro are folded into
its only caller, as this is clearer than having the macro.

Fixes: cdf3671927 ("KVM: arm64: Intercept host's CPU_ON SMCs")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ahmed Genidi <ahmed.genidi@arm.com>
[ Mark: clarify commit message, handle E2H, move to C, remove macro ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ahmed Genidi <ahmed.genidi@arm.com>
Cc: Ben Horgan <ben.horgan@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20250227180526.1204723-3-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-03-02 08:36:52 +00:00
Mark Rutland
7a68b55ff3 KVM: arm64: Initialize HCR_EL2.E2H early
On CPUs without FEAT_E2H0, HCR_EL2.E2H is RES1, but may reset to an
UNKNOWN value out of reset and consequently may not read as 1 unless it
has been explicitly initialized.

We handled this for the head.S boot code in commits:

  3944382fa6 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative")
  b3320142f3 ("arm64: Fix early handling of FEAT_E2H0 not being implemented")

Unfortunately, we forgot to apply a similar fix to the KVM PSCI entry
points used when relaying CPU_ON, CPU_SUSPEND, and SYSTEM SUSPEND. When
KVM is entered via these entry points, the value of HCR_EL2.E2H may be
consumed before it has been initialized (e.g. by the 'init_el2_state'
macro).

Initialize HCR_EL2.E2H early in these paths such that it can be consumed
reliably. The existing code in head.S is factored out into a new
'init_el2_hcr' macro, and this is used in the __kvm_hyp_init_cpu()
function common to all the relevant PSCI entry points.

For clarity, I've tweaked the assembly used to check whether
ID_AA64MMFR4_EL1.E2H0 is negative. The bitfield is extracted as a signed
value, and this is checked with a signed-greater-or-equal (GE) comparison.

As the hyp code will reconfigure HCR_EL2 later in ___kvm_hyp_init(), all
bits other than E2H are initialized to zero in __kvm_hyp_init_cpu().

Fixes: 3944382fa6 ("arm64: Treat HCR_EL2.E2H as RES1 when ID_AA64MMFR4_EL1.E2H0 is negative")
Fixes: b3320142f3 ("arm64: Fix early handling of FEAT_E2H0 not being implemented")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ahmed Genidi <ahmed.genidi@arm.com>
Cc: Ben Horgan <ben.horgan@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250227180526.1204723-2-mark.rutland@arm.com
[maz: fixed LT->GE thinko]
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-03-02 08:36:52 +00:00
Linus Torvalds
9d20040d71 arm64 fixes for -rc5
- Fix a sporadic boot failure due to incorrect randomization of the
   linear map on systems that support it
 
 - Fix the zapping (both clearing the entries *and* invalidating the TLB)
   of hugetlb PTEs constructed using the contiguous bit
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmfDdBIQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNN0GB/9gmEOX1GwMU6wFjPYqvjWlkGCFDwrldO84
 uF9jEUbPaw3P4xHTOFyPCfEWidktqa+yDVbe90mB7GVOM+1eEZ81em1k1hYBEXbz
 Q73Nl5VrNzxX4BjOrdxxoTSaR/TKklUh5mqWfIzy1RxEnBfpr/GuDPtUn1GViCAs
 sU16Ju12UdYXn3tyHFDHpjZS9WYZskfnrvS0QvXinz0LahZrCkeaH+ptYHrTjMFx
 hxyrRQwOlqLnZWvjLOegH9AC6uyRkKDinXKhXqHYvUfcfEkQsKwM7Fpc6cviUD0Q
 X2npLNegnYxPniwmLpXfNXazPDnKVMzxb9lpqw1fZS3nAuh8XOde
 =RqDZ
 -----END PGP SIGNATURE-----

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "Ryan's been hard at work finding and fixing mm bugs in the arm64 code,
  so here's a small crop of fixes for -rc5.

  The main changes are to fix our zapping of non-present PTEs for
  hugetlb entries created using the contiguous bit in the page-table
  rather than a block entry at the level above. Prior to these fixes, we
  were pulling the contiguous bit back out of the PTE in order to
  determine the size of the hugetlb page but this is clearly bogus if
  the thing isn't present and consequently both the clearing of the
  PTE(s) and the TLB invalidation were unreliable.

  Although the problem was found by code inspection, we really don't
  want this sitting around waiting to trigger and the changes are CC'd
  to stable accordingly.

  Note that the diffstat looks a lot worse than it really is;
  huge_ptep_get_and_clear() now takes a size argument from the core code
  and so all the arch implementations of that have been updated in a
  pretty mechanical fashion.

   - Fix a sporadic boot failure due to incorrect randomization of the
     linear map on systems that support it

   - Fix the zapping (both clearing the entries *and* invalidating the
     TLB) of hugetlb PTEs constructed using the contiguous bit"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level
  arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes
  mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
  arm64/mm: Fix Boot panic on Ampere Altra
2025-03-01 13:44:51 -08:00
Ryan Roberts
eed6bfa8b2 arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level
commit c910f2b655 ("arm64/mm: Update tlb invalidation routines for
FEAT_LPA2") changed the "invalidation level unknown" hint from 0 to
TLBI_TTL_UNKNOWN (INT_MAX). But the fallback "unknown level" path in
flush_hugetlb_tlb_range() was not updated. So as it stands, when trying
to invalidate CONT_PMD_SIZE or CONT_PTE_SIZE hugetlb mappings, we will
spuriously try to invalidate at level 0 on LPA2-enabled systems.

Fix this so that the fallback passes TLBI_TTL_UNKNOWN, and while we are
at it, explicitly use the correct stride and level for CONT_PMD_SIZE and
CONT_PTE_SIZE, which should provide a minor optimization.

Cc: stable@vger.kernel.org
Fixes: c910f2b655 ("arm64/mm: Update tlb invalidation routines for FEAT_LPA2")
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/r/20250226120656.2400136-4-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-02-27 17:40:58 +00:00
Ryan Roberts
02410ac72a mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
In order to fix a bug, arm64 needs to be told the size of the huge page
for which the huge_pte is being cleared in huge_ptep_get_and_clear().
Provide for this by adding an `unsigned long sz` parameter to the
function. This follows the same pattern as huge_pte_clear() and
set_huge_pte_at().

This commit makes the required interface modifications to the core mm as
well as all arches that implement this function (arm64, loongarch, mips,
parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed
in a separate commit.

Cc: stable@vger.kernel.org
Fixes: 66b3923a1a ("arm64: hugetlb: add support for PTE contiguous bit")
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390
Link: https://lore.kernel.org/r/20250226120656.2400136-2-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-02-27 17:40:57 +00:00
Shameer Kolothum
86edf6bdcf smccc/kvm_guest: Enable errata based on implementation CPUs
Retrieve any migration target implementation CPUs using the hypercall
and enable associated errata.

Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250221140229.12588-6-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:30:37 -08:00
Shameer Kolothum
c8c2647e69 arm64: Make  _midr_in_range_list() an exported function
Subsequent patch will add target implementation CPU support and that
will require _midr_in_range_list() to access new data. To avoid
exporting the data make _midr_in_range_list() a normal function and
export it.

No functional changes intended.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250221140229.12588-5-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:30:36 -08:00
Shameer Kolothum
c0000e58c7 KVM: arm64: Introduce KVM_REG_ARM_VENDOR_HYP_BMAP_2
The vendor_hyp_bmap bitmap holds the information about the Vendor Hyp
services available to the user space and can be get/set using
{G, S}ET_ONE_REG interfaces. This is done using the pseudo-firmware
bitmap register KVM_REG_ARM_VENDOR_HYP_BMAP.

At present, this bitmap is a 64 bit one and since the function numbers
for newly added DISCOVER_IPML_* hypercalls are 64-65, introduce
another pseudo-firmware bitmap register KVM_REG_ARM_VENDOR_HYP_BMAP_2.

Reviewed-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20250221140229.12588-4-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:30:36 -08:00
Shameer Kolothum
e3121298c7 arm64: Modify _midr_range() functions to read MIDR/REVIDR internally
These changes lay the groundwork for adding support for guest kernels,
allowing them to leverage target CPU implementations provided by the
VMM.

No functional changes intended.

Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Sebastian Ott <sebott@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250221140229.12588-2-shameerali.kolothum.thodi@huawei.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 13:29:44 -08:00
Sean Christopherson
b2aba529bf KVM: Drop kvm_arch_sync_events() now that all implementations are nops
Remove kvm_arch_sync_events() now that x86 no longer uses it (no other
arch has ever used it).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20250224235542.2562848-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-02-26 13:17:23 -05:00
Sebastian Ott
3adaee7830 KVM: arm64: Allow userspace to change the implementation ID registers
KVM's treatment of the ID registers that describe the implementation
(MIDR, REVIDR, and AIDR) is interesting, to say the least. On the
userspace-facing end of it, KVM presents the values of the boot CPU on
all vCPUs and treats them as invariant. On the guest side of things KVM
presents the hardware values of the local CPU, which can change during
CPU migration in a big-little system.

While one may call this fragile, there is at least some degree of
predictability around it. For example, if a VMM wanted to present
big-little to a guest, it could affine vCPUs accordingly to the correct
clusters.

All of this makes a giant mess out of adding support for making these
implementation ID registers writable. Avoid breaking the rather subtle
ABI around the old way of doing things by requiring opt-in from
userspace to make the registers writable.

When the cap is enabled, allow userspace to set MIDR, REVIDR, and AIDR
to any non-reserved value and present those values consistently across
all vCPUs.

Signed-off-by: Sebastian Ott <sebott@redhat.com>
[oliver: changelog, capability]
Link: https://lore.kernel.org/r/20250225005401.679536-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 01:32:16 -08:00
Sebastian Ott
b4043e7cb7 KVM: arm64: Maintain per-VM copy of implementation ID regs
Get ready to allow changes to the implementation ID registers by
tracking the VM-wide values.

Signed-off-by: Sebastian Ott <sebott@redhat.com>
Link: https://lore.kernel.org/r/20250225005401.679536-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 01:31:58 -08:00
Oliver Upton
4cd48565b0 KVM: arm64: Set HCR_EL2.TID1 unconditionally
commit 90807748ca ("KVM: arm64: Hide SME system registers from
guests") added trap handling for SMIDR_EL1, treating it as UNDEFINED as
KVM does not support SME. This is right for the most part, however KVM
needs to set HCR_EL2.TID1 to _actually_ trap the register.

Unfortunately, this comes with some collateral damage as TID1 forces
REVIDR_EL1 and AIDR_EL1 to trap as well. KVM has long treated these
registers as "invariant" which is an awful term for the following:

 - Userspace sees the boot CPU values on all vCPUs

 - The guest sees the hardware values of the CPU on which a vCPU is
   scheduled

Keep the plates spinning by adding trap handling for the affected
registers and repaint all of the "invariant" crud into terms of
identifying an implementation. Yes, at this point we only need to
set TID1 on SME hardware, but REVIDR_EL1 and AIDR_EL1 are about to
become mutable anyway.

Cc: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 90807748ca ("KVM: arm64: Hide SME system registers from guests")
[maz: handle traps from 32bit]
Co-developed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250225005401.679536-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-26 01:31:52 -08:00
Marc Zyngier
f83c41fb3d KVM: arm64: Allow userspace to limit NV support to nVHE
NV is hard. No kidding.

In order to make things simpler, we have established that NV would
support two mutually exclusive configurations:

- VHE-only, and supporting recursive virtualisation

- nVHE-only, and not supporting recursive virtualisation

For that purpose, introduce a new vcpu feature flag that denotes
the second configuration. We use this flag to limit the idregs
further.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-11-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:30:17 -08:00
Marc Zyngier
94f296dcd6 KVM: arm64: Move NV-specific capping to idreg sanitisation
Instead of applying the NV idreg limits at run time, switch to
doing it at the same time as the reset of the VM initialisation.

This will make things much simpler once we introduce vcpu-driven
variants of NV.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250220134907.554085-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2025-02-24 11:28:43 -08:00
Thomas Weißschuh
0b3bc3354e arm64: vdso: Switch to generic storage implementation
The generic storage implementation provides the same features as the
custom one. However it can be shared between architectures, making
maintenance easier.

This switch also moves the random state data out of the time data page.
The currently used hardcoded __VDSO_RND_DATA_OFFSET does not take into
account changes to the time data page layout.

Co-developed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-8-13a4669dfc8c@linutronix.de
2025-02-21 09:54:01 +01:00
Oliver Upton
fa808ed4e1 KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2
Vladimir reports that a race condition to attach a VMID to a stage-2 MMU
sometimes results in a vCPU entering the guest with a VMID of 0:

| CPU1                                            |   CPU2
|                                                 |
|                                                 | kvm_arch_vcpu_ioctl_run
|                                                 |   vcpu_load             <= load VTTBR_EL2
|                                                 |                            kvm_vmid->id = 0
|                                                 |
| kvm_arch_vcpu_ioctl_run                         |
|   vcpu_load             <= load VTTBR_EL2       |
|                            with kvm_vmid->id = 0|
|   kvm_arm_vmid_update   <= allocates fresh      |
|                            kvm_vmid->id and     |
|                            reload VTTBR_EL2     |
|                                                 |
|                                                 |   kvm_arm_vmid_update <= observes that kvm_vmid->id
|                                                 |                          already allocated,
|                                                 |                          skips reload VTTBR_EL2

Oh yeah, it's as bad as it looks. Remember that VHE loads the stage-2
MMU eagerly but a VMID only gets attached to the MMU later on in the
KVM_RUN loop.

Even in the "best case" where VTTBR_EL2 correctly gets reprogrammed
before entering the EL1&0 regime, there is a period of time where
hardware is configured with VMID 0. That's completely insane. So, rather
than decorating the 'late' binding with another hack, just allocate the
damn thing up front.

Attaching a VMID from vcpu_load() is still rollover safe since
(surprise!) it'll always get called after a vCPU was preempted.

Excuse me while I go find a brown paper bag.

Cc: stable@vger.kernel.org
Fixes: 934bf871f0 ("KVM: arm64: Load the stage-2 MMU context in kvm_vcpu_load_vhe()")
Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250219220737.130842-1-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-20 16:29:28 +00:00
Will Deacon
102c51c50d KVM: arm64: Fix tcr_el2 initialisation in hVHE mode
When not running in VHE mode, cpu_prepare_hyp_mode() computes the value
of TCR_EL2 using the host's TCR_EL1 settings as a starting point. For
nVHE, this amounts to masking out everything apart from the TG0, SH0,
ORGN0, IRGN0 and T0SZ fields before setting the RES1 bits, shifting the
IPS field down to the PS field and setting DS if LPA2 is enabled.

Unfortunately, for hVHE, things go slightly wonky: EPD1 is correctly set
to disable walks via TTBR1_EL2 but then the T1SZ and IPS fields are
corrupted when we mistakenly attempt to initialise the PS and DS fields
in their E2H=0 positions. Furthermore, many fields are retained from
TCR_EL1 which should not be propagated to TCR_EL2. Notably, this means
we can end up with A1 set despite not initialising TTBR1_EL2 at all.
This has been shown to cause unexpected translation faults at EL2 with
pKVM due to TLB invalidation not taking effect when running with a
non-zero ASID.

Fix the TCR_EL2 initialisation code to set PS and DS only when E2H=0,
masking out HD, HA and A1 when E2H=1.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Fixes: ad744e8cb3 ("arm64: Allow arm64_sw.hvhe on command line")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250214133724.13179-1-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-19 22:09:24 +00:00
Paolo Bonzini
3bb7dcebd0 KVM/arm64 fixes for 6.14, take #2
- Large set of fixes for vector handling, specially in the interactions
   between host and guest state. This fixes a number of bugs affecting
   actual deployments, and greatly simplifies the FP/SIMD/SVE handling.
   Thanks to Mark Rutland for dealing with this thankless task.
 
 - Fix an ugly race between vcpu and vgic creation/init, resulting in
   unexpected behaviours.
 
 - Fix use of kernel VAs at EL2 when emulating timers with nVHE.
 
 - Small set of pKVM improvements and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmevLKMACgkQI9DQutE9
 ekP3hQ//db7pAPzLr++//PAyam0GP+ooKlgpB0ImZisQwkrTrTMP+IjNJG+NCJ46
 y88anBErFijvWb3BINpeTM/dux7DmuoaolGx7lquFu+i0L8UfFFjYG7UU+NZscim
 KE4j0tJz8jm5ksN4iwaj3RIkGKc1zJtRyoPny3j1blOtm8aTtujRJB7/Gx2QefZR
 1Z13RaIzk1tKdY0JxAmPpGkaRY99MQahx96iBsk2u4rlypcxmVr9aQ1Madp7Pc6Y
 pBcX9jZwLf75cj6CAK93YSjFF3j/x4QM8jSupLCu5tyin6YZ4sRaZa6sy52byk2v
 zes7i83l5g3+JEKv5oZVwjD5SFBu02UPbnMGSxKQitgz4Zej3qMIq5BxgII2kHZV
 jwXrNEx4trNegEcoqwFX5xA0FMUr1/g3Cr4+rZBoUramj80cBhzbBdUkhyWd3eey
 j2EOuAG3pgUD5Wv9SyojlbHBwmSAcBEtr3vqJpTjWQS6AyEmdKNvzh/8JCH1h7UM
 fBo4+LIEylzmZXbqDrZNwXh31tELoTCR9Ur3pTCEO3Yfg9npTLWmvKs+tAgO/282
 IOjZE0N/ZtzPJ6Cgr+2efBGd+id81HXh+H8gWo35Dyx3EH2k44FHwQ3rW2NKOVzo
 10eSbswYpjk3gi/6GxwC0lDqFi4Bk6ILvC6roqTghixBf7xThfY=
 =L5HS
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.14, take #2

- Large set of fixes for vector handling, specially in the interactions
  between host and guest state. This fixes a number of bugs affecting
  actual deployments, and greatly simplifies the FP/SIMD/SVE handling.
  Thanks to Mark Rutland for dealing with this thankless task.

- Fix an ugly race between vcpu and vgic creation/init, resulting in
  unexpected behaviours.

- Fix use of kernel VAs at EL2 when emulating timers with nVHE.

- Small set of pKVM improvements and cleanups.
2025-02-14 18:32:47 -05:00
Quentin Perret
b938731ed2 KVM: arm64: Fix alignment of kvm_hyp_memcache allocations
When allocating guest stage-2 page-table pages at EL2, pKVM can consume
pages from the host-provided kvm_hyp_memcache. As pgtable.c expects
zeroed pages, guest_s2_zalloc_page() actively implements this zeroing
with a PAGE_SIZE memset. Unfortunately, we don't check the page
alignment of the host-provided address before doing so, which could
lead to the memset overrunning the page if the host was malicious.

Fix this by simply force-aligning all kvm_hyp_memcache allocations to
page boundaries.

Fixes: 60dfe093ec ("KVM: arm64: Instantiate guest stage-2 page-tables at EL2")
Reported-by: Ben Simner <ben.simner@cl.cam.ac.uk>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250213153615.3642515-1-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:57:27 +00:00
Mark Rutland
ee14db31a9 KVM: arm64: Refactor CPTR trap deactivation
For historical reasons, the VHE and nVHE/hVHE implementations of
__activate_cptr_traps() pair with a common implementation of
__kvm_reset_cptr_el2(), which ideally would be named
__deactivate_cptr_traps().

Rename __kvm_reset_cptr_el2() to __deactivate_cptr_traps(), and split it
into separate VHE and nVHE/hVHE variants so that each can be paired with
its corresponding implementation of __activate_cptr_traps().

At the same time, fold kvm_write_cptr_el2() into its callers. This
makes it clear in-context whether a write is made to the CPACR_EL1
encoding or the CPTR_EL2 encoding, and removes the possibility of
confusion as to whether kvm_write_cptr_el2() reformats the sysreg fields
as cpacr_clear_set() does.

In the nVHE/hVHE implementation of __activate_cptr_traps(), placing the
sysreg writes within the if-else blocks requires that the call to
__activate_traps_fpsimd32() is moved earlier, but as this was always
called before writing to CPTR_EL2/CPACR_EL1, this should not result in a
functional change.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-6-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:54:57 +00:00
Mark Rutland
407a99c465 KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN
When KVM is in VHE mode, the host kernel tries to save and restore the
configuration of CPACR_EL1.SMEN (i.e. CPTR_EL2.SMEN when HCR_EL2.E2H=1)
across kvm_arch_vcpu_load_fp() and kvm_arch_vcpu_put_fp(), since the
configuration may be clobbered by hyp when running a vCPU. This logic
has historically been broken, and is currently redundant.

This logic was originally introduced in commit:

  861262ab86 ("KVM: arm64: Handle SME host state when running guests")

At the time, the VHE hyp code would reset CPTR_EL2.SMEN to 0b00 when
returning to the host, trapping host access to SME state. Unfortunately,
this was unsafe as the host could take a softirq before calling
kvm_arch_vcpu_put_fp(), and if a softirq handler were to use kernel mode
NEON the resulting attempt to save the live FPSIMD/SVE/SME state would
result in a fatal trap.

That issue was limited to VHE mode. For nVHE/hVHE modes, KVM always
saved/restored the host kernel's CPACR_EL1 value, and configured
CPTR_EL2.TSM to 0b0, ensuring that host usage of SME would not be
trapped.

The issue above was incidentally fixed by commit:

  375110ab51 ("KVM: arm64: Fix resetting SME trap values on reset for (h)VHE")

That commit changed the VHE hyp code to configure CPTR_EL2.SMEN to 0b01
when returning to the host, permitting host kernel usage of SME,
avoiding the issue described above. At the time, this was not identified
as a fix for commit 861262ab86.

Now that the host eagerly saves and unbinds its own FPSIMD/SVE/SME
state, there's no need to save/restore the state of the EL0 SME trap.
The kernel can safely save/restore state without trapping, as described
above, and will restore userspace state (including trap controls) before
returning to userspace.

Remove the redundant logic.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-5-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:54:54 +00:00
Mark Rutland
459f059be7 KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
When KVM is in VHE mode, the host kernel tries to save and restore the
configuration of CPACR_EL1.ZEN (i.e. CPTR_EL2.ZEN when HCR_EL2.E2H=1)
across kvm_arch_vcpu_load_fp() and kvm_arch_vcpu_put_fp(), since the
configuration may be clobbered by hyp when running a vCPU. This logic is
currently redundant.

The VHE hyp code unconditionally configures CPTR_EL2.ZEN to 0b01 when
returning to the host, permitting host kernel usage of SVE.

Now that the host eagerly saves and unbinds its own FPSIMD/SVE/SME
state, there's no need to save/restore the state of the EL0 SVE trap.
The kernel can safely save/restore state without trapping, as described
above, and will restore userspace state (including trap controls) before
returning to userspace.

Remove the redundant logic.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-4-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:54:51 +00:00
Mark Rutland
8eca7f6d51 KVM: arm64: Remove host FPSIMD saving for non-protected KVM
Now that the host eagerly saves its own FPSIMD/SVE/SME state,
non-protected KVM never needs to save the host FPSIMD/SVE/SME state,
and the code to do this is never used. Protected KVM still needs to
save/restore the host FPSIMD/SVE state to avoid leaking guest state to
the host (and to avoid revealing to the host whether the guest used
FPSIMD/SVE/SME), and that code needs to be retained.

Remove the unused code and data structures.

To avoid the need for a stub copy of kvm_hyp_save_fpsimd_host() in the
VHE hyp code, the nVHE/hVHE version is moved into the shared switch
header, where it is only invoked when KVM is in protected mode.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20250210195226.1215254-3-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-02-13 17:54:44 +00:00
Linus Torvalds
e2ee2e9b15 KVM/arm64 updates for 6.14
* New features:
 
   - Support for non-protected guest in protected mode, achieving near
     feature parity with the non-protected mode
 
   - Support for the EL2 timers as part of the ongoing NV support
 
   - Allow control of hardware tracing for nVHE/hVHE
 
 * Improvements, fixes and cleanups:
 
   - Massive cleanup of the debug infrastructure, making it a bit less
     awkward and definitely easier to maintain. This should pave the
     way for further optimisations
 
   - Complete rewrite of pKVM's fixed-feature infrastructure, aligning
     it with the rest of KVM and making the code easier to follow
 
   - Large simplification of pKVM's memory protection infrastructure
 
   - Better handling of RES0/RES1 fields for memory-backed system
     registers
 
   - Add a workaround for Qualcomm's Snapdragon X CPUs, which suffer
     from a pretty nasty timer bug
 
   - Small collection of cleanups and low-impact fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmeYqJcQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNLUhCACxUTMVQXhfW3qbh0UQxPd7XXvjI+Hm7SPS
 wDuVTle4jrFVGHxuZqtgWLmx8hD7bqO965qmFgbevKlwsRY33onH2nbH4i4AcwbA
 jcdM4yMHZI4+Qmnb4G5ZJ89IwjAhHPZTBOV5KRhyHQ/qtRciHHtOgJde7II9fd68
 uIESg4SSSyUzI47YSEHmGVmiBIhdQhq2qust0m6NPFalEGYstPbpluPQ6R1CsDqK
 v14TIAW7t0vSPucBeODxhA5gEa2JsvNi+sqA+DF/ELH2ZqpkuR7rofgMGblaXCSD
 JXa5xamRB9dI5zi8vatwfOzYlog+/gzmPqMh/9JXpiDGHxJe0vlz
 =tQ8F
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull KVM/arm64 updates from Will Deacon:
 "New features:

   - Support for non-protected guest in protected mode, achieving near
     feature parity with the non-protected mode

   - Support for the EL2 timers as part of the ongoing NV support

   - Allow control of hardware tracing for nVHE/hVHE

  Improvements, fixes and cleanups:

   - Massive cleanup of the debug infrastructure, making it a bit less
     awkward and definitely easier to maintain. This should pave the way
     for further optimisations

   - Complete rewrite of pKVM's fixed-feature infrastructure, aligning
     it with the rest of KVM and making the code easier to follow

   - Large simplification of pKVM's memory protection infrastructure

   - Better handling of RES0/RES1 fields for memory-backed system
     registers

   - Add a workaround for Qualcomm's Snapdragon X CPUs, which suffer
     from a pretty nasty timer bug

   - Small collection of cleanups and low-impact fixes"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (87 commits)
  arm64/sysreg: Get rid of TRFCR_ELx SysregFields
  KVM: arm64: nv: Fix doc header layout for timers
  KVM: arm64: nv: Apply RESx settings to sysreg reset values
  KVM: arm64: nv: Always evaluate HCR_EL2 using sanitising accessors
  KVM: arm64: Fix selftests after sysreg field name update
  coresight: Pass guest TRFCR value to KVM
  KVM: arm64: Support trace filtering for guests
  KVM: arm64: coresight: Give TRBE enabled state to KVM
  coresight: trbe: Remove redundant disable call
  arm64/sysreg/tools: Move TRFCR definitions to sysreg
  tools: arm64: Update sysreg.h header files
  KVM: arm64: Drop pkvm_mem_transition for host/hyp donations
  KVM: arm64: Drop pkvm_mem_transition for host/hyp sharing
  KVM: arm64: Drop pkvm_mem_transition for FF-A
  KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
  KVM: arm64: vgic: Use str_enabled_disabled() in vgic_v3_probe()
  arm64: kvm: Introduce nvhe stack size constants
  KVM: arm64: Fix nVHE stacktrace VA bits mask
  KVM: arm64: Fix FEAT_MTE in pKVM
  Documentation: Update the behaviour of "kvm-arm.mode"
  ...
2025-01-28 09:01:36 -08:00
Linus Torvalds
9c5968db9e The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
 
 - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
   page allocator so we end up with the ability to allocate and free
   zero-refcount pages.  So that callers (ie, slab) can avoid a refcount
   inc & dec.
 
 - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
   large folios other than PMD-sized ones.
 
 - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
   fixes for this small built-in kernel selftest.
 
 - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
   the mapletree code.
 
 - "mm: fix format issues and param types" from Keren Sun implements a
   few minor code cleanups.
 
 - "simplify split calculation" from Wei Yang provides a few fixes and a
   test for the mapletree code.
 
 - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
   continues the work of moving vma-related code into the (relatively) new
   mm/vma.c.
 
 - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
   Hildenbrand cleans up and rationalizes handling of gfp flags in the page
   allocator.
 
 - "readahead: Reintroduce fix for improper RA window sizing" from Jan
   Kara is a second attempt at fixing a readahead window sizing issue.  It
   should reduce the amount of unnecessary reading.
 
 - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
   addresses an issue where "huge" amounts of pte pagetables are
   accumulated
   (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
   Qi's series addresses this windup by synchronously freeing PTE memory
   within the context of madvise(MADV_DONTNEED).
 
 - "selftest/mm: Remove warnings found by adding compiler flags" from
   Muhammad Usama Anjum fixes some build warnings in the selftests code
   when optional compiler warnings are enabled.
 
 - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
   Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
 
 - "pkeys kselftests improvements" from Kevin Brodsky implements various
   fixes and cleanups in the MM selftests code, mainly pertaining to the
   pkeys tests.
 
 - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
   estimate application working set size.
 
 - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
   provides some cleanups to memcg's hugetlb charging logic.
 
 - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
   removes the global swap cgroup lock.  A speedup of 10% for a tmpfs-based
   kernel build was demonstrated.
 
 - "zram: split page type read/write handling" from Sergey Senozhatsky
   has several fixes and cleaups for zram in the area of zram_write_page().
   A watchdog softlockup warning was eliminated.
 
 - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
   cleans up the pagetable destructor implementations.  A rare
   use-after-free race is fixed.
 
 - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
   simplifies and cleans up the debugging code in the VMA merging logic.
 
 - "Account page tables at all levels" from Kevin Brodsky cleans up and
   regularizes the pagetable ctor/dtor handling.  This results in
   improvements in accounting accuracy.
 
 - "mm/damon: replace most damon_callback usages in sysfs with new core
   functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
   file interface logic.
 
 - "mm/damon: enable page level properties based monitoring" from
   SeongJae Park increases the amount of information which is presented in
   response to DAMOS actions.
 
 - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
   DAMON's long-deprecated debugfs interfaces.  Thus the migration to sysfs
   is completed.
 
 - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
   Xu cleans up and generalizes the hugetlb reservation accounting.
 
 - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
   removes a never-used feature of the alloc_pages_bulk() interface.
 
 - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
   extends DAMOS filters to support not only exclusion (rejecting), but
   also inclusion (allowing) behavior.
 
 - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
   "introduces a new memory descriptor for zswap.zpool that currently
   overlaps with struct page for now.  This is part of the effort to reduce
   the size of struct page and to enable dynamic allocation of memory
   descriptors."
 
 - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
   simplifies the swap allocator locking.  A speedup of 400% was
   demonstrated for one workload.  As was a 35% reduction for kernel build
   time with swap-on-zram.
 
 - "mm: update mips to use do_mmap(), make mmap_region() internal" from
   Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
   mmap_region() can be made MM-internal.
 
 - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
   regressions and otherwise improves MGLRU performance.
 
 - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
   updates DAMON documentation.
 
 - "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
 
 - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
   provides various cleanups in the areas of hugetlb folios, THP folios and
   migration.
 
 - "Uncached buffered IO" from Jens Axboe implements the new
   RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
   reading and writing.  To permite userspace to address issues with
   massive buildup of useless pagecache when reading/writing fast devices.
 
 - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
   Weißschuh fixes and optimizes some of the MM selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
 jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
 uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
 =Ib2s
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "The various patchsets are summarized below. Plus of course many
  indivudual patches which are described in their changelogs.

   - "Allocate and free frozen pages" from Matthew Wilcox reorganizes
     the page allocator so we end up with the ability to allocate and
     free zero-refcount pages. So that callers (ie, slab) can avoid a
     refcount inc & dec

   - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
     use large folios other than PMD-sized ones

   - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
     and fixes for this small built-in kernel selftest

   - "mas_anode_descend() related cleanup" from Wei Yang tidies up part
     of the mapletree code

   - "mm: fix format issues and param types" from Keren Sun implements a
     few minor code cleanups

   - "simplify split calculation" from Wei Yang provides a few fixes and
     a test for the mapletree code

   - "mm/vma: make more mmap logic userland testable" from Lorenzo
     Stoakes continues the work of moving vma-related code into the
     (relatively) new mm/vma.c

   - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
     Hildenbrand cleans up and rationalizes handling of gfp flags in the
     page allocator

   - "readahead: Reintroduce fix for improper RA window sizing" from Jan
     Kara is a second attempt at fixing a readahead window sizing issue.
     It should reduce the amount of unnecessary reading

   - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
     addresses an issue where "huge" amounts of pte pagetables are
     accumulated:

       https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

     Qi's series addresses this windup by synchronously freeing PTE
     memory within the context of madvise(MADV_DONTNEED)

   - "selftest/mm: Remove warnings found by adding compiler flags" from
     Muhammad Usama Anjum fixes some build warnings in the selftests
     code when optional compiler warnings are enabled

   - "mm: don't use __GFP_HARDWALL when migrating remote pages" from
     David Hildenbrand tightens the allocator's observance of
     __GFP_HARDWALL

   - "pkeys kselftests improvements" from Kevin Brodsky implements
     various fixes and cleanups in the MM selftests code, mainly
     pertaining to the pkeys tests

   - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
     estimate application working set size

   - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
     provides some cleanups to memcg's hugetlb charging logic

   - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
     removes the global swap cgroup lock. A speedup of 10% for a
     tmpfs-based kernel build was demonstrated

   - "zram: split page type read/write handling" from Sergey Senozhatsky
     has several fixes and cleaups for zram in the area of
     zram_write_page(). A watchdog softlockup warning was eliminated

   - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
     Brodsky cleans up the pagetable destructor implementations. A rare
     use-after-free race is fixed

   - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
     simplifies and cleans up the debugging code in the VMA merging
     logic

   - "Account page tables at all levels" from Kevin Brodsky cleans up
     and regularizes the pagetable ctor/dtor handling. This results in
     improvements in accounting accuracy

   - "mm/damon: replace most damon_callback usages in sysfs with new
     core functions" from SeongJae Park cleans up and generalizes
     DAMON's sysfs file interface logic

   - "mm/damon: enable page level properties based monitoring" from
     SeongJae Park increases the amount of information which is
     presented in response to DAMOS actions

   - "mm/damon: remove DAMON debugfs interface" from SeongJae Park
     removes DAMON's long-deprecated debugfs interfaces. Thus the
     migration to sysfs is completed

   - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
     Peter Xu cleans up and generalizes the hugetlb reservation
     accounting

   - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
     removes a never-used feature of the alloc_pages_bulk() interface

   - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
     extends DAMOS filters to support not only exclusion (rejecting),
     but also inclusion (allowing) behavior

   - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
     introduces a new memory descriptor for zswap.zpool that currently
     overlaps with struct page for now. This is part of the effort to
     reduce the size of struct page and to enable dynamic allocation of
     memory descriptors

   - "mm, swap: rework of swap allocator locks" from Kairui Song redoes
     and simplifies the swap allocator locking. A speedup of 400% was
     demonstrated for one workload. As was a 35% reduction for kernel
     build time with swap-on-zram

   - "mm: update mips to use do_mmap(), make mmap_region() internal"
     from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
     mmap_region() can be made MM-internal

   - "mm/mglru: performance optimizations" from Yu Zhao fixes a few
     MGLRU regressions and otherwise improves MGLRU performance

   - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
     Park updates DAMON documentation

   - "Cleanup for memfd_create()" from Isaac Manjarres does that thing

   - "mm: hugetlb+THP folio and migration cleanups" from David
     Hildenbrand provides various cleanups in the areas of hugetlb
     folios, THP folios and migration

   - "Uncached buffered IO" from Jens Axboe implements the new
     RWF_DONTCACHE flag which provides synchronous dropbehind for
     pagecache reading and writing. To permite userspace to address
     issues with massive buildup of useless pagecache when
     reading/writing fast devices

   - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
     Weißschuh fixes and optimizes some of the MM selftests"

* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm/compaction: fix UBSAN shift-out-of-bounds warning
  s390/mm: add missing ctor/dtor on page table upgrade
  kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
  tools: add VM_WARN_ON_VMG definition
  mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
  seqlock: add missing parameter documentation for raw_seqcount_try_begin()
  mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
  mm/page_alloc: remove the incorrect and misleading comment
  zram: remove zcomp_stream_put() from write_incompressible_page()
  mm: separate move/undo parts from migrate_pages_batch()
  mm/kfence: use str_write_read() helper in get_access_type()
  selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
  kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
  selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
  selftests/mm: vm_util: split up /proc/self/smaps parsing
  selftests/mm: virtual_address_range: unmap chunks after validation
  selftests/mm: virtual_address_range: mmap() without PROT_WRITE
  selftests/memfd/memfd_test: fix possible NULL pointer dereference
  mm: add FGP_DONTCACHE folio creation flag
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  ...
2025-01-26 18:36:23 -08:00
Qi Zheng
2dccdf7076 mm: pgtable: introduce generic __tlb_remove_table()
Several architectures (arm, arm64, riscv and x86) define exactly the same
__tlb_remove_table(), just introduce generic __tlb_remove_table() to
eliminate these duplications.

The s390 __tlb_remove_table() is nearly the same, so also make s390
__tlb_remove_table() version generic.

Link: https://lkml.kernel.org/r/ea372633d94f4d3f9f56a7ec5994bf050bf77e39.1736317725.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: Andreas Larsson <andreas@gaisler.com>		[sparc]
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>	[s390]
Acked-by: Arnd Bergmann <arnd@arndb.de>			[asm-generic]
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:23 -08:00
Qi Zheng
12359c039b arm64: pgtable: move pagetable_dtor() to __tlb_remove_table()
Move pagetable_dtor() to __tlb_remove_table(), so that ptlock and page
table pages can be freed together (regardless of whether RCU is used). 
This prevents the use-after-free problem where the ptlock is freed
immediately but the page table pages is freed later via RCU.

Page tables shouldn't have swap cache, so use pagetable_free() instead of
free_page_and_swap_cache() to free page table pages.

Link: https://lkml.kernel.org/r/cf4b847caf390f96a3e3d534dacb2c174e16c154.1736317725.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:22 -08:00
Qi Zheng
db6b435d73 mm: pgtable: introduce pagetable_dtor()
The pagetable_p*_dtor() are exactly the same except for the handling of
ptlock.  If we make ptlock_free() handle the case where ptdesc->ptl is
NULL and remove VM_BUG_ON_PAGE() from pmd_ptlock_free(), we can unify
pagetable_p*_dtor() into one function.  Let's introduce pagetable_dtor()
to do this.

Later, pagetable_dtor() will be moved to tlb_remove_ptdesc(), so that
ptlock and page table pages can be freed together (regardless of whether
RCU is used).  This prevents the use-after-free problem where the ptlock
is freed immediately but the page table pages is freed later via RCU.

Link: https://lkml.kernel.org/r/47f44fff9dc68d9d9e9a0d6c036df275f820598a.1736317725.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>	[s390]
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:22 -08:00
Qi Zheng
440af48d68 arm64: pgtable: use mmu gather to free p4d level page table
Like other levels of page tables, also use mmu gather mechanism to free
p4d level page table.

Link: https://lkml.kernel.org/r/3fd48525397b34a64f7c0eb76746da30814dc941.1736317725.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:21 -08:00
Kevin Brodsky
98a7e47faa asm-generic: pgalloc: provide generic p4d_{alloc_one,free}
Four architectures currently implement 5-level pgtables: arm64, riscv, x86
and s390.  The first three have essentially the same implementation for
p4d_alloc_one() and p4d_free(), so we've got an opportunity to reduce
duplication like at the lower levels.

Provide a generic version of p4d_alloc_one() and p4d_free(), and make use
of it on those architectures.

Their implementation is the same as at PUD level, except that p4d_free()
performs a runtime check by calling mm_p4d_folded().  5-level pgtables
depend on a runtime-detected hardware feature on all supported
architectures, so we might as well include this check in the generic
implementation.  No runtime check is required in p4d_alloc_one() as the
top-level p4d_alloc() already does the required check.

Link: https://lkml.kernel.org/r/26d69c74a29183ecc335b9b407040d8e4cd70c6a.1736317725.git.zhengqi.arch@bytedance.com
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>		[asm-generic]
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:21 -08:00
Linus Torvalds
0f8e26b38d Loongarch:
* Clear LLBCTL if secondary mmu mapping changes.
 
 * Add hypercall service support for usermode VMM.
 
 x86:
 
 * Add a comment to kvm_mmu_do_page_fault() to explain why KVM performs a
   direct call to kvm_tdp_page_fault() when RETPOLINE is enabled.
 
 * Ensure that all SEV code is compiled out when disabled in Kconfig, even
   if building with less brilliant compilers.
 
 * Remove a redundant TLB flush on AMD processors when guest CR4.PGE changes.
 
 * Use str_enabled_disabled() to replace open coded strings.
 
 * Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's APICv cache
   prior to every VM-Enter.
 
 * Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware.  Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.
 
 * Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some scenarios
   (e.g. if emulation is triggered by the exit), and brings parity between VMX
   and SVM.
 
 * Add pending request and interrupt injection information to the kvm_exit and
   kvm_entry tracepoints respectively.
 
 * Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
 
 * Make the completion of hypercalls go through the complete_hypercall
   function pointer argument, no matter if the hypercall exits to
   userspace or not.  Previously, the code assumed that KVM_HC_MAP_GPA_RANGE
   specifically went to userspace, and all the others did not; the new code
   need not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
   all whether there was an exit to userspace or not.
 
 * As part of enabling TDX virtual machines, support support separation of
   private/shared EPT into separate roots.  When TDX will be enabled, operations
   on private pages will need to go through the privileged TDX Module via SEAMCALLs;
   as a result, they are limited and relatively slow compared to reading a PTE.
   The patches included in 6.14 allow KVM to keep a mirror of the private EPT in
   host memory, and define entries in kvm_x86_ops to operate on external page
   tables such as the TDX private EPT.
 
 * The recently introduced conversion of the NX-page reclamation kthread to
   vhost_task moved the task under the main process.  The task is created as
   soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that
   didn't expect to see any child task of the VM process until it started
   creating its own userspace threads.  In particular crosvm refuses to fork()
   if procfs shows any child task, so unbreak it by creating the task lazily.
   This is arguably a userspace bug, as there can be other kinds of legitimate
   worker tasks and they wouldn't impede fork(); but it's not like userspace
   has a way to distinguish kernel worker tasks right now.  Should they show
   as "Kthread: 1" in proc/.../status?
 
 x86 - Intel:
 
 * Fix a bug where KVM updates hardware's APICv cache of the highest ISR bit
   while L2 is active, while ultimately results in a hardware-accelerated L1
   EOI effectively being lost.
 
 * Honor event priority when emulating Posted Interrupt delivery during nested
   VM-Enter by queueing KVM_REQ_EVENT instead of immediately handling the
   interrupt.
 
 * Rework KVM's processing of the Page-Modification Logging buffer to reap
   entries in the same order they were created, i.e. to mark gfns dirty in the
   same order that hardware marked the page/PTE dirty.
 
 * Misc cleanups.
 
 Generic:
 
 * Cleanup and harden kvm_set_memory_region(); add proper lockdep assertions when
   setting memory regions and add a dedicated API for setting KVM-internal
   memory regions.  The API can then explicitly disallow all flags for
   KVM-internal memory regions.
 
 * Explicitly verify the target vCPU is online in kvm_get_vcpu() to fix a bug
   where KVM would return a pointer to a vCPU prior to it being fully online,
   and give kvm_for_each_vcpu() similar treatment to fix a similar flaw.
 
 * Wait for a vCPU to come online prior to executing a vCPU ioctl, to fix a
   bug where userspace could coerce KVM into handling the ioctl on a vCPU that
   isn't yet onlined.
 
 * Gracefully handle xarray insertion failures; even though such failures are
   impossible in practice after xa_reserve(), reserving an entry is always followed
   by xa_store() which does not know (or differentiate) whether there was an
   xa_reserve() before or not.
 
 RISC-V:
 
 * Zabha, Svvptc, and Ziccrse extension support for guests.  None of them
   require anything in KVM except for detecting them and marking them
   as supported; Zabha adds byte and halfword atomic operations, while the
   others are markers for specific operation of the TLB and of LL/SC
   instructions respectively.
 
 * Virtualize SBI system suspend extension for Guest/VM
 
 * Support firmware counters which can be used by the guests to collect
   statistics about traps that occur in the host.
 
 Selftests:
 
 * Rework vcpu_get_reg() to return a value instead of using an out-param, and
   update all affected arch code accordingly.
 
 * Convert the max_guest_memory_test into a more generic mmu_stress_test.
   The basic gist of the "conversion" is to have the test do mprotect() on
   guest memory while vCPUs are accessing said memory, e.g. to verify KVM
   and mmu_notifiers are working as intended.
 
 * Play nice with treewrite builds of unsupported architectures, e.g. arm
   (32-bit), as KVM selftests' Makefile doesn't do anything to ensure the
   target architecture is actually one KVM selftests supports.
 
 * Use the kernel's $(ARCH) definition instead of the target triple for arch
   specific directories, e.g. arm64 instead of aarch64, mainly so as not to
   be different from the rest of the kernel.
 
 * Ensure that format strings for logging statements are checked by the
   compiler even when the logging statement itself is disabled.
 
 * Attempt to whack the last LLC references/misses mole in the Intel PMU
   counters test by adding a data load and doing CLFLUSH{OPT} on the data
   instead of the code being executed.  It seems that modern Intel CPUs
   have learned new code prefetching tricks that bypass the PMU counters.
 
 * Fix a flaw in the Intel PMU counters test where it asserts that events
   are counting correctly without actually knowing what the events count
   given the underlying hardware; this can happen if Intel reuses a
   formerly microarchitecture-specific event encoding as an architectural
   event, as was the case for Top-Down Slots.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmeTuzoUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOkBwf8CRNExYaM3j9y2E7mmo6AiL2ug6+J
 Uy5Hai1poY48pPwKC6ke3EWT8WVsgj/Py5pCeHvLojQchWNjCCYNfSQluJdkRxwG
 DgP3QUljSxEJWBeSwyTRcKM+IySi5hZd1IFo3gePFRB829Jpnj05vjbvCyv8gIwU
 y3HXxSYDsViaaFoNg4OlZFsIGis7mtknsZzk++QjuCXmxNa6UCbv3qvE/UkVLhVg
 WH65RTRdjk+EsdwaOMHKuUvQoGa+iM4o39b6bqmw8+ZMK39+y33WeTX/y5RXsp1N
 tUUBRfS+MuuYgC/6LmTr66EkMzoChxk3Dp3kKUaCBcfqRC8PxQag5reZhw==
 =NEaO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Loongarch:

   - Clear LLBCTL if secondary mmu mapping changes

   - Add hypercall service support for usermode VMM

  x86:

   - Add a comment to kvm_mmu_do_page_fault() to explain why KVM
     performs a direct call to kvm_tdp_page_fault() when RETPOLINE is
     enabled

   - Ensure that all SEV code is compiled out when disabled in Kconfig,
     even if building with less brilliant compilers

   - Remove a redundant TLB flush on AMD processors when guest CR4.PGE
     changes

   - Use str_enabled_disabled() to replace open coded strings

   - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's
     APICv cache prior to every VM-Enter

   - Overhaul KVM's CPUID feature infrastructure to track all vCPU
     capabilities instead of just those where KVM needs to manage state
     and/or explicitly enable the feature in hardware. Along the way,
     refactor the code to make it easier to add features, and to make it
     more self-documenting how KVM is handling each feature

   - Rework KVM's handling of VM-Exits during event vectoring; this
     plugs holes where KVM unintentionally puts the vCPU into infinite
     loops in some scenarios (e.g. if emulation is triggered by the
     exit), and brings parity between VMX and SVM

   - Add pending request and interrupt injection information to the
     kvm_exit and kvm_entry tracepoints respectively

   - Fix a relatively benign flaw where KVM would end up redoing RDPKRU
     when loading guest/host PKRU, due to a refactoring of the kernel
     helpers that didn't account for KVM's pre-checking of the need to
     do WRPKRU

   - Make the completion of hypercalls go through the complete_hypercall
     function pointer argument, no matter if the hypercall exits to
     userspace or not.

     Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically
     went to userspace, and all the others did not; the new code need
     not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
     all whether there was an exit to userspace or not

   - As part of enabling TDX virtual machines, support support
     separation of private/shared EPT into separate roots.

     When TDX will be enabled, operations on private pages will need to
     go through the privileged TDX Module via SEAMCALLs; as a result,
     they are limited and relatively slow compared to reading a PTE.

     The patches included in 6.14 allow KVM to keep a mirror of the
     private EPT in host memory, and define entries in kvm_x86_ops to
     operate on external page tables such as the TDX private EPT

   - The recently introduced conversion of the NX-page reclamation
     kthread to vhost_task moved the task under the main process. The
     task is created as soon as KVM_CREATE_VM was invoked and this, of
     course, broke userspace that didn't expect to see any child task of
     the VM process until it started creating its own userspace threads.

     In particular crosvm refuses to fork() if procfs shows any child
     task, so unbreak it by creating the task lazily. This is arguably a
     userspace bug, as there can be other kinds of legitimate worker
     tasks and they wouldn't impede fork(); but it's not like userspace
     has a way to distinguish kernel worker tasks right now. Should they
     show as "Kthread: 1" in proc/.../status?

  x86 - Intel:

   - Fix a bug where KVM updates hardware's APICv cache of the highest
     ISR bit while L2 is active, while ultimately results in a
     hardware-accelerated L1 EOI effectively being lost

   - Honor event priority when emulating Posted Interrupt delivery
     during nested VM-Enter by queueing KVM_REQ_EVENT instead of
     immediately handling the interrupt

   - Rework KVM's processing of the Page-Modification Logging buffer to
     reap entries in the same order they were created, i.e. to mark gfns
     dirty in the same order that hardware marked the page/PTE dirty

   - Misc cleanups

  Generic:

   - Cleanup and harden kvm_set_memory_region(); add proper lockdep
     assertions when setting memory regions and add a dedicated API for
     setting KVM-internal memory regions. The API can then explicitly
     disallow all flags for KVM-internal memory regions

   - Explicitly verify the target vCPU is online in kvm_get_vcpu() to
     fix a bug where KVM would return a pointer to a vCPU prior to it
     being fully online, and give kvm_for_each_vcpu() similar treatment
     to fix a similar flaw

   - Wait for a vCPU to come online prior to executing a vCPU ioctl, to
     fix a bug where userspace could coerce KVM into handling the ioctl
     on a vCPU that isn't yet onlined

   - Gracefully handle xarray insertion failures; even though such
     failures are impossible in practice after xa_reserve(), reserving
     an entry is always followed by xa_store() which does not know (or
     differentiate) whether there was an xa_reserve() before or not

  RISC-V:

   - Zabha, Svvptc, and Ziccrse extension support for guests. None of
     them require anything in KVM except for detecting them and marking
     them as supported; Zabha adds byte and halfword atomic operations,
     while the others are markers for specific operation of the TLB and
     of LL/SC instructions respectively

   - Virtualize SBI system suspend extension for Guest/VM

   - Support firmware counters which can be used by the guests to
     collect statistics about traps that occur in the host

  Selftests:

   - Rework vcpu_get_reg() to return a value instead of using an
     out-param, and update all affected arch code accordingly

   - Convert the max_guest_memory_test into a more generic
     mmu_stress_test. The basic gist of the "conversion" is to have the
     test do mprotect() on guest memory while vCPUs are accessing said
     memory, e.g. to verify KVM and mmu_notifiers are working as
     intended

   - Play nice with treewrite builds of unsupported architectures, e.g.
     arm (32-bit), as KVM selftests' Makefile doesn't do anything to
     ensure the target architecture is actually one KVM selftests
     supports

   - Use the kernel's $(ARCH) definition instead of the target triple
     for arch specific directories, e.g. arm64 instead of aarch64,
     mainly so as not to be different from the rest of the kernel

   - Ensure that format strings for logging statements are checked by
     the compiler even when the logging statement itself is disabled

   - Attempt to whack the last LLC references/misses mole in the Intel
     PMU counters test by adding a data load and doing CLFLUSH{OPT} on
     the data instead of the code being executed. It seems that modern
     Intel CPUs have learned new code prefetching tricks that bypass the
     PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that
     events are counting correctly without actually knowing what the
     events count given the underlying hardware; this can happen if
     Intel reuses a formerly microarchitecture-specific event encoding
     as an architectural event, as was the case for Top-Down Slots"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
  kvm: defer huge page recovery vhost task to later
  KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  KVM: Disallow all flags for KVM-internal memslots
  KVM: x86: Drop double-underscores from __kvm_set_memory_region()
  KVM: Add a dedicated API for setting KVM-internal memslots
  KVM: Assert slots_lock is held when setting memory regions
  KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
  LoongArch: KVM: Add hypercall service support for usermode VMM
  LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
  KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
  KVM: VMX: read the PML log in the same order as it was written
  KVM: VMX: refactor PML terminology
  KVM: VMX: Fix comment of handle_vmx_instruction()
  KVM: VMX: Reinstate __exit attribute for vmx_exit()
  KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
  KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
  KVM: x86: Use LVT_TIMER instead of an open coded literal
  RISC-V: KVM: Add new exit statstics for redirected traps
  RISC-V: KVM: Update firmware counters for various events
  RISC-V: KVM: Redirect instruction access fault trap to guest
  ...
2025-01-25 09:55:09 -08:00
Linus Torvalds
382e391365 hyperv-next for v6.14
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmeTFQ4THHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXqMWB/4uHjnu50u+m00OwXAKQr6i92zh50BZ
 RQragd9s9C8tuUNwPDmS/ct2BNAhoy43KJ0ClegdZjKxT1Ys8cLv4Wr5CaGckqWq
 +WCHqTgt+cPe0vUofqahB5wiAZMsnBgzFkV/OfFwBx0wkub9y5T3qVq5KapYlaDI
 7Gftb+wg1AAsrdZ/HuLRy5ZVvkM/73rU2uoi8WXjr/T14E1krCFR/qirLd1OXo6Q
 Jb97qhnCt/N9JPwIq5/VnYWde5Mpqz6UgtA2rFLDXgNGz+h9/ND6ecWFHjZWNVdc
 AKWZTO5t+fRVBOSyahoyRoYSntPw3wlxyL7A2/54h6j4Dex7wLt6NQBj
 =empO
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Introduce a new set of Hyper-V headers in include/hyperv and replace
   the old hyperv-tlfs.h with the new headers (Nuno Das Neves)

 - Fixes for the Hyper-V VTL mode (Roman Kisel)

 - Fixes for cpu mask usage in Hyper-V code (Michael Kelley)

 - Document the guest VM hibernation behaviour (Michael Kelley)

 - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain)

* tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Documentation: hyperv: Add overview of guest VM hibernation
  hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id()
  hyperv: Do not overlap the hvcall IO areas in get_vtl()
  hyperv: Enable the hypercall output page for the VTL mode
  hv_balloon: Fallback to generic_online_page() for non-HV hot added mem
  Drivers: hv: vmbus: Log on missing offers if any
  Drivers: hv: vmbus: Wait for boot-time offers during boot and resume
  uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup
  iommu/hyper-v: Don't assume cpu_possible_mask is dense
  Drivers: hv: Don't assume cpu_possible_mask is dense
  x86/hyperv: Don't assume cpu_possible_mask is dense
  hyperv: Remove the now unused hyperv-tlfs.h files
  hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
  hyperv: Add new Hyper-V headers in include/hyperv
  hyperv: Clean up unnecessary #includes
  hyperv: Move hv_connection_id to hyperv-tlfs.h
2025-01-25 09:22:55 -08:00
Linus Torvalds
1d6d399223 Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute
    relevant code on any other CPU. This is currently handled by smpboot
    code which takes care of CPU-hotplug operations. Affinity here is
    a correctness constraint.
 
 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't
    run anywhere else. The affinity is set through kthread_bind_mask()
    and the subsystem takes care by itself to handle CPU-hotplug
    operations. Affinity here is assumed to be a correctness constraint.
 
 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This
    is not a correctness constraint but merely a preference in terms of
    memory locality. kswapd and kcompactd both fall into this category.
    The affinity is set manually like for any other task and CPU-hotplug
    is supposed to be handled by the relevant subsystem so that the task
    is properly reaffined whenever a given CPU from the node comes up.
    Also care should be taken so that the node affinity doesn't cross
    isolated (nohz_full) cpumask boundaries.
 
 4) Similar to the previous point except kthreads have a _preferred_
    affinity different than a node. Both RCU boost kthreads and RCU
    exp kworkers fall into this category as they refer to "RCU nodes"
    from a distinctly distributed tree.
 
 Currently the preferred affinity patterns (3 and 4) have at least 4
 identified users, with more or less success when it comes to handle
 CPU-hotplug operations and CPU isolation. Each of which do it in its own
 ad-hoc way.
 
 This is an infrastructure proposal to handle this with the following API
 changes:
 
 _ kthread_create_on_node() automatically affines the created kthread to
   its target node unless it has been set as per-cpu or bound with
   kthread_bind[_mask]() before the first wake-up.
 
 - kthread_affine_preferred() is a new function that can be called right
   after kthread_create_on_node() to specify a preferred affinity
   different than the specified node.
 
 When the preferred affinity can't be applied because the possible
 targets are offline or isolated (nohz_full), the kthread is affine
 to the housekeeping CPUs (which means to all online CPUs most of the
 time or only the non-nohz_full CPUs when nohz_full= is set).
 
 kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
 converted, along with a few old drivers.
 
 Summary of the changes:
 
 * Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()
 
 * Introduce task_cpu_fallback_mask() that defines the default last
   resort affinity of a task to become nohz_full aware
 
 * Add some correctness check to ensure kthread_bind() is always called
   before the first kthread wake up.
 
 * Default affine kthread to its preferred node.
 
 * Convert kswapd / kcompactd and remove their halfway working ad-hoc
   affinity implementation
 
 * Implement kthreads preferred affinity
 
 * Unify kthread worker and kthread API's style
 
 * Convert RCU kthreads to the new API and remove the ad-hoc affinity
   implementation.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO
 jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM
 Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C
 u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2
 eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ
 v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn
 ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH
 NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s
 f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW
 IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5
 wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/
 AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ=
 =g8In
 -----END PGP SIGNATURE-----

Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "Kthreads affinity follow either of 4 existing different patterns:

   1) Per-CPU kthreads must stay affine to a single CPU and never
      execute relevant code on any other CPU. This is currently handled
      by smpboot code which takes care of CPU-hotplug operations.
      Affinity here is a correctness constraint.

   2) Some kthreads _have_ to be affine to a specific set of CPUs and
      can't run anywhere else. The affinity is set through
      kthread_bind_mask() and the subsystem takes care by itself to
      handle CPU-hotplug operations. Affinity here is assumed to be a
      correctness constraint.

   3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
      This is not a correctness constraint but merely a preference in
      terms of memory locality. kswapd and kcompactd both fall into this
      category. The affinity is set manually like for any other task and
      CPU-hotplug is supposed to be handled by the relevant subsystem so
      that the task is properly reaffined whenever a given CPU from the
      node comes up. Also care should be taken so that the node affinity
      doesn't cross isolated (nohz_full) cpumask boundaries.

   4) Similar to the previous point except kthreads have a _preferred_
      affinity different than a node. Both RCU boost kthreads and RCU
      exp kworkers fall into this category as they refer to "RCU nodes"
      from a distinctly distributed tree.

  Currently the preferred affinity patterns (3 and 4) have at least 4
  identified users, with more or less success when it comes to handle
  CPU-hotplug operations and CPU isolation. Each of which do it in its
  own ad-hoc way.

  This is an infrastructure proposal to handle this with the following
  API changes:

   - kthread_create_on_node() automatically affines the created kthread
     to its target node unless it has been set as per-cpu or bound with
     kthread_bind[_mask]() before the first wake-up.

   - kthread_affine_preferred() is a new function that can be called
     right after kthread_create_on_node() to specify a preferred
     affinity different than the specified node.

  When the preferred affinity can't be applied because the possible
  targets are offline or isolated (nohz_full), the kthread is affine to
  the housekeeping CPUs (which means to all online CPUs most of the time
  or only the non-nohz_full CPUs when nohz_full= is set).

  kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
  converted, along with a few old drivers.

  Summary of the changes:

   - Consolidate a bunch of ad-hoc implementations of
     kthread_run_on_cpu()

   - Introduce task_cpu_fallback_mask() that defines the default last
     resort affinity of a task to become nohz_full aware

   - Add some correctness check to ensure kthread_bind() is always
     called before the first kthread wake up.

   - Default affine kthread to its preferred node.

   - Convert kswapd / kcompactd and remove their halfway working ad-hoc
     affinity implementation

   - Implement kthreads preferred affinity

   - Unify kthread worker and kthread API's style

   - Convert RCU kthreads to the new API and remove the ad-hoc affinity
     implementation"

* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: modify kernel-doc function name to match code
  rcu: Use kthread preferred affinity for RCU exp kworkers
  treewide: Introduce kthread_run_worker[_on_cpu]()
  kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
  rcu: Use kthread preferred affinity for RCU boost
  kthread: Implement preferred affinity
  mm: Create/affine kswapd to its preferred node
  mm: Create/affine kcompactd to its preferred node
  kthread: Default affine kthread to its preferred NUMA node
  kthread: Make sure kthread hasn't started while binding it
  sched,arm64: Handle CPU isolation on last resort fallback rq selection
  arm64: Exclude nohz_full CPUs from 32bits el0 support
  lib: test_objpool: Use kthread_run_on_cpu()
  kallsyms: Use kthread_run_on_cpu()
  soc/qman: test: Use kthread_run_on_cpu()
  arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21 17:10:05 -08:00
Linus Torvalds
2e04247f7c ftrace updates for v6.14:
- Have fprobes built on top of function graph infrastructure
 
   The fprobe logic is an optimized kprobe that uses ftrace to attach to
   functions when a probe is needed at the start or end of the function. The
   fprobe and kretprobe logic implements a similar method as the function
   graph tracer to trace the end of the function. That is to hijack the
   return address and jump to a trampoline to do the trace when the function
   exits. To do this, a shadow stack needs to be created to store the
   original return address.  Fprobes and function graph do this slightly
   differently. Fprobes (and kretprobes) has slots per callsite that are
   reserved to save the return address. This is fine when just a few points
   are traced. But users of fprobes, such as BPF programs, are starting to add
   many more locations, and this method does not scale.
 
   The function graph tracer was created to trace all functions in the
   kernel. In order to do this, when function graph tracing is started, every
   task gets its own shadow stack to hold the return address that is going to
   be traced. The function graph tracer has been updated to allow multiple
   users to use its infrastructure. Now have fprobes be one of those users.
   This will also allow for the fprobe and kretprobe methods to trace the
   return address to become obsolete. With new technologies like CFI that
   need to know about these methods of hijacking the return address, going
   toward a solution that has only one method of doing this will make the
   kernel less complex.
 
 - Cleanup with guard() and free() helpers
 
   There were several places in the code that had a lot of "goto out" in the
   error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.
 
 - Remove disabling of interrupts in the function graph tracer
 
   When function graph tracer was first introduced, it could race with
   interrupts and NMIs. To prevent that race, it would disable interrupts and
   not trace NMIs. But the code has changed to allow NMIs and also
   interrupts. This change was done a long time ago, but the disabling of
   interrupts was never removed. Remove the disabling of interrupts in the
   function graph tracer is it is not needed. This greatly improves its
   performance.
 
 - Allow the :mod: command to enable tracing module functions on the kernel
   command line.
 
   The function tracer already has a way to enable functions to be traced in
   modules by writing ":mod:<module>" into set_ftrace_filter. That will
   enable either all the functions for the module if it is loaded, or if it
   is not, it will cache that command, and when the module is loaded that
   matches <module>, its functions will be enabled. This also allows init
   functions to be traced. But currently events do not have that feature.
 
   Because enabling function tracing can be done very early at boot up
   (before scheduling is enabled), the commands that can be done when
   function tracing is started is limited. Having the ":mod:" command to
   trace module functions as they are loaded is very useful. Update the
   kernel command line function filtering to allow it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ42E2RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqXSAPwOMxuhye8tb1GYG62QD9+w7e6nOmlC
 2GCPj4detnEM2QD/ciivkhespVKhHpZHRewAuSnJgHPSM45NQ3EVESzjWQ4=
 =snbx
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Have fprobes built on top of function graph infrastructure

   The fprobe logic is an optimized kprobe that uses ftrace to attach to
   functions when a probe is needed at the start or end of the function.
   The fprobe and kretprobe logic implements a similar method as the
   function graph tracer to trace the end of the function. That is to
   hijack the return address and jump to a trampoline to do the trace
   when the function exits. To do this, a shadow stack needs to be
   created to store the original return address. Fprobes and function
   graph do this slightly differently. Fprobes (and kretprobes) has
   slots per callsite that are reserved to save the return address. This
   is fine when just a few points are traced. But users of fprobes, such
   as BPF programs, are starting to add many more locations, and this
   method does not scale.

   The function graph tracer was created to trace all functions in the
   kernel. In order to do this, when function graph tracing is started,
   every task gets its own shadow stack to hold the return address that
   is going to be traced. The function graph tracer has been updated to
   allow multiple users to use its infrastructure. Now have fprobes be
   one of those users. This will also allow for the fprobe and kretprobe
   methods to trace the return address to become obsolete. With new
   technologies like CFI that need to know about these methods of
   hijacking the return address, going toward a solution that has only
   one method of doing this will make the kernel less complex.

 - Cleanup with guard() and free() helpers

   There were several places in the code that had a lot of "goto out" in
   the error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.

 - Remove disabling of interrupts in the function graph tracer

   When function graph tracer was first introduced, it could race with
   interrupts and NMIs. To prevent that race, it would disable
   interrupts and not trace NMIs. But the code has changed to allow NMIs
   and also interrupts. This change was done a long time ago, but the
   disabling of interrupts was never removed. Remove the disabling of
   interrupts in the function graph tracer is it is not needed. This
   greatly improves its performance.

 - Allow the :mod: command to enable tracing module functions on the
   kernel command line.

   The function tracer already has a way to enable functions to be
   traced in modules by writing ":mod:<module>" into set_ftrace_filter.
   That will enable either all the functions for the module if it is
   loaded, or if it is not, it will cache that command, and when the
   module is loaded that matches <module>, its functions will be
   enabled. This also allows init functions to be traced. But currently
   events do not have that feature.

   Because enabling function tracing can be done very early at boot up
   (before scheduling is enabled), the commands that can be done when
   function tracing is started is limited. Having the ":mod:" command to
   trace module functions as they are loaded is very useful. Update the
   kernel command line function filtering to allow it.

* tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits)
  ftrace: Implement :mod: cache filtering on kernel command line
  tracing: Adopt __free() and guard() for trace_fprobe.c
  bpf: Use ftrace_get_symaddr() for kprobe_multi probes
  ftrace: Add ftrace_get_symaddr to convert fentry_ip to symaddr
  Documentation: probes: Update fprobe on function-graph tracer
  selftests/ftrace: Add a test case for repeating register/unregister fprobe
  selftests: ftrace: Remove obsolate maxactive syntax check
  tracing/fprobe: Remove nr_maxactive from fprobe
  fprobe: Add fprobe_header encoding feature
  fprobe: Rewrite fprobe on function-graph tracer
  s390/tracing: Enable HAVE_FTRACE_GRAPH_FUNC
  ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC
  bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled
  tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
  tracing: Add ftrace_fill_perf_regs() for perf event
  tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs
  fprobe: Use ftrace_regs in fprobe exit handler
  fprobe: Use ftrace_regs in fprobe entry handler
  fgraph: Pass ftrace_regs to retfunc
  fgraph: Replace fgraph_ret_regs with ftrace_regs
  ...
2025-01-21 15:15:28 -08:00
Linus Torvalds
9ad09c4f28 arm64 updates for 6.14
Confidential Computing:
 * Register a platform device when running in CCA realm mode to enable
   automatic loading of dependent modules.
 
 CPU Features:
 * Update a bunch of system register definitions to pick up new field
   encodings from the architectural documentation.
 
 * Add hwcaps and selftests for the new (2024) dpISA extensions.
 
 Documentation:
 * Update EL3 (firmware) requirements for booting Linux on modern arm64
   designs.
 
 * Remove stale information about the kernel virtual memory map.
 
 Miscellaneous:
 * Minor cleanups and typo fixes.
 
 Memory management:
 * Fix vmemmap_check_pmd() to look at the PMD type bits
 
 * LPA2 (52-bit physical addressing) cleanups and minor fixes.
 
 * Adjust physical address space depending upon whether or not LPA2 is
   enabled.
 
 Perf and PMUs:
 * Add port filtering support for NVIDIA's NVLINK-C2C Coresight PMU
 
 * Extend AXI filtering support for the DDR PMU on NXP IMX SoCs
 
 * Fix Designware PCIe PMU event numbering.
 
 * Add generic branch events for the Apple M1 CPU PMU.
 
 * Add support for Marvell Odyssey DDR and LLC-TAD PMUs.
 
 * Cleanups to the Hisilicon DDRC and Uncore PMU code.
 
 * Advertise discard mode for the SPE PMU.
 
 * Add the perf users mailing list to our MAINTAINERS entry.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmeKZLcQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNEQzB/0X2U89ZiqxIkTPQvfFrjN/uUGybkq59rEL
 DfeoGukTgJIwc3GHWXXtQ//wuuYKdTeCXaIz5NFK3+7/wmKSLvjkexmue8pta6EY
 5rx9bAPr/D8lAUvhKIN2l3pF/ygoRwDz+nT2yVQ1xlZxYJWX7ZIsMj7W7ceb5kdx
 HRrTSQuhEEPREAWWO4oCMWl5SQZSrIflSE3Be/PsP0OhW6k//ZmWbcJTgUcHbKam
 o2WtNjITyGzxMpRCcrGEZKoe9YcwSxiut/PoD7JuoB4C/rbsf1cdJ6uLmtvGJcZj
 qsdRHhVfBzP1+ahONrDbiT3C2+s1UZySKdCDIxiYy6lB39wpP0dd
 =E7Mf
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "We've got a little less than normal thanks to the holidays in
  December, but there's the usual summary below. The highlight is
  probably the 52-bit physical addressing (LPA2) clean-up from Ard.

  Confidential Computing:

   - Register a platform device when running in CCA realm mode to enable
     automatic loading of dependent modules

  CPU Features:

   - Update a bunch of system register definitions to pick up new field
     encodings from the architectural documentation

   - Add hwcaps and selftests for the new (2024) dpISA extensions

  Documentation:

   - Update EL3 (firmware) requirements for booting Linux on modern
     arm64 designs

   - Remove stale information about the kernel virtual memory map

  Miscellaneous:

   - Minor cleanups and typo fixes

  Memory management:

   - Fix vmemmap_check_pmd() to look at the PMD type bits

   - LPA2 (52-bit physical addressing) cleanups and minor fixes

   - Adjust physical address space depending upon whether or not LPA2 is
     enabled

  Perf and PMUs:

   - Add port filtering support for NVIDIA's NVLINK-C2C Coresight PMU

   - Extend AXI filtering support for the DDR PMU on NXP IMX SoCs

   - Fix Designware PCIe PMU event numbering

   - Add generic branch events for the Apple M1 CPU PMU

   - Add support for Marvell Odyssey DDR and LLC-TAD PMUs

   - Cleanups to the Hisilicon DDRC and Uncore PMU code

   - Advertise discard mode for the SPE PMU

   - Add the perf users mailing list to our MAINTAINERS entry"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (64 commits)
  Documentation: arm64: Remove stale and redundant virtual memory diagrams
  perf docs: arm_spe: Document new discard mode
  perf: arm_spe: Add format option for discard mode
  MAINTAINERS: Add perf list for drivers/perf/
  arm64: Remove duplicate included header
  drivers/perf: apple_m1: Map generic branch events
  arm64: rsi: Add automatic arm-cca-guest module loading
  kselftest/arm64: Add 2024 dpISA extensions to hwcap test
  KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
  arm64/hwcap: Describe 2024 dpISA extensions to userspace
  arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-12
  arm64: Filter out SVE hwcaps when FEAT_SVE isn't implemented
  drivers/perf: hisi: Set correct IRQ affinity for PMUs with no association
  arm64/sme: Move storage of reg_smidr to __cpuinfo_store_cpu()
  arm64: mm: Test for pmd_sect() in vmemmap_check_pmd()
  arm64/mm: Replace open encodings with PXD_TABLE_BIT
  arm64/mm: Rename pte_mkpresent() as pte_mkvalid()
  arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
  ...
2025-01-20 21:21:49 -08:00
Will Deacon
602ffd4ce3 Merge branch 'for-next/mm' into for-next/core
* for-next/mm:
  arm64: mm: Test for pmd_sect() in vmemmap_check_pmd()
  arm64/mm: Replace open encodings with PXD_TABLE_BIT
  arm64/mm: Rename pte_mkpresent() as pte_mkvalid()
  arm64: Kconfig: force ARM64_PAN=y when enabling TTBR0 sw PAN
  arm64/kvm: Avoid invalid physical addresses to signal owner updates
  arm64/kvm: Configure HYP TCR.PS/DS based on host stage1
  arm64/mm: Override PARange for !LPA2 and use it consistently
  arm64/mm: Reduce PA space to 48 bits when LPA2 is not enabled
2025-01-17 13:52:33 +00:00
Will Deacon
6e1173306e Merge branch 'for-next/misc' into for-next/core
* for-next/misc:
  arm64: Remove duplicate included header
  arm64/Kconfig: Drop EXECMEM dependency from ARCH_WANTS_EXECMEM_LATE
  arm64: asm: Fix typo in pgtable.h
  arm64/mm: Ensure adequate HUGE_MAX_HSTATE
  arm64/mm: Replace open encodings with PXD_TABLE_BIT
  arm64/mm: Drop INIT_MM_CONTEXT()
2025-01-17 13:52:29 +00:00
Will Deacon
763d584c5b Merge branch 'for-next/cpufeature' into for-next/core
* for-next/cpufeature:
  kselftest/arm64: Add 2024 dpISA extensions to hwcap test
  KVM: arm64: Allow control of dpISA extensions in ID_AA64ISAR3_EL1
  arm64/hwcap: Describe 2024 dpISA extensions to userspace
  arm64/sysreg: Update ID_AA64SMFR0_EL1 to DDI0601 2024-12
  arm64: Filter out SVE hwcaps when FEAT_SVE isn't implemented
  arm64/sme: Move storage of reg_smidr to __cpuinfo_store_cpu()
  arm64/sysreg: Update ID_AA64ISAR2_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64ZFR0_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64FPFR0_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64ISAR3_EL1 to DDI0601 2024-09
  arm64/sysreg: Update ID_AA64PFR2_EL1 to DDI0601 2024-09
  arm64/sysreg: Get rid of CPACR_ELx SysregFields
  arm64/sysreg: Convert *_EL12 accessors to Mapping
  arm64/sysreg: Get rid of the TCR2_EL1x SysregFields
  arm64/sysreg: Allow a 'Mapping' descriptor for system registers
  arm64/cpufeature: Refactor conditional logic in init_cpu_ftr_reg()
  arm64: cpufeature: Add HAFT to cpucap_is_possible()
2025-01-17 13:52:15 +00:00
Marc Zyngier
fa5e4043e9 Merge branch kvm-arm64/misc-6.14 into kvmarm-master/next
* kvm-arm64/misc-6.14:
  : .
  : Misc KVM/arm64 changes for 6.14
  :
  : - Don't expose AArch32 EL0 capability when NV is enabled
  :
  : - Update documentation to reflect the full gamut of kvm-arm.mode
  :   behaviours
  :
  : - Use the hypervisor VA bit width when dumping stacktraces
  :
  : - Decouple the hypervisor stack size from PAGE_SIZE, at least
  :   on the surface...
  :
  : - Make use of str_enabled_disabled() when advertising GICv4.1 support
  :
  : - Explicitly handle BRBE traps as UNDEFINED
  : .
  KVM: arm64: Explicitly handle BRBE traps as UNDEFINED
  KVM: arm64: vgic: Use str_enabled_disabled() in vgic_v3_probe()
  arm64: kvm: Introduce nvhe stack size constants
  KVM: arm64: Fix nVHE stacktrace VA bits mask
  Documentation: Update the behaviour of "kvm-arm.mode"
  KVM: arm64: nv: Advertise the lack of AArch32 EL0 support

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-17 11:06:50 +00:00
Marc Zyngier
3643b334aa Merge branch kvm-arm64/nv-resx-fixes-6.14 into kvmarm-master/next
* kvm-arm64/nv-resx-fixes-6.14:
  : .
  : Fixes for NV sysreg accessors. From the cover letter:
  :
  : "Joey recently reported that some rather basic tests were failing on
  : NV, and managed to track it down to critical register fields (such as
  : HCR_EL2.E2H) not having their expect value.
  :
  : Further investigation has outlined a couple of critical issues:
  :
  : - Evaluating HCR_EL2.E2H must always be done with a sanitising
  :   accessor, no ifs, no buts. Given that KVM assumes a fixed value for
  :   this bit, we cannot leave it to the guest to mess with.
  :
  : - Resetting the sysreg file must result in the RESx bits taking
  :   effect. Otherwise, we may end-up making the wrong decision (see
  :   above), and we definitely expose invalid values to the guest. Note
  :   that because we compute the RESx masks very late in the VM setup, we
  :   need to apply these masks at that particular point as well.
  : [...]"
  : .
  KVM: arm64: nv: Apply RESx settings to sysreg reset values
  KVM: arm64: nv: Always evaluate HCR_EL2 using sanitising accessors

Signed-off-by: Marc Zyngier <maz@kernel.org>

# Conflicts:
#	arch/arm64/kvm/nested.c
2025-01-17 11:06:33 +00:00
Marc Zyngier
946904e728 Merge branch kvm-arm64/coresight-6.14 into kvmarm-master/next
* kvm-arm64/coresight-6.14:
  : .
  : Trace filtering update from James Clark. From the cover letter:
  :
  : "The guest filtering rules from the Perf session are now honored for both
  : nVHE and VHE modes. This is done by either writing to TRFCR_EL12 at the
  : start of the Perf session and doing nothing else further, or caching the
  : guest value and writing it at guest switch for nVHE. In pKVM, trace is
  : now be disabled for both protected and unprotected guests."
  : .
  KVM: arm64: Fix selftests after sysreg field name update
  coresight: Pass guest TRFCR value to KVM
  KVM: arm64: Support trace filtering for guests
  KVM: arm64: coresight: Give TRBE enabled state to KVM
  coresight: trbe: Remove redundant disable call
  arm64/sysreg/tools: Move TRFCR definitions to sysreg
  tools: arm64: Update sysreg.h header files

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-17 11:05:44 +00:00
Marc Zyngier
080612b294 Merge branch kvm-arm64/nv-timers into kvmarm-master/next
* kvm-arm64/nv-timers:
  : .
  : Nested Virt support for the EL2 timers. From the initial cover letter:
  :
  : "Here's another batch of NV-related patches, this time bringing in most
  : of the timer support for EL2 as well as nested guests.
  :
  : The code is pretty convoluted for a bunch of reasons:
  :
  : - FEAT_NV2 breaks the timer semantics by redirecting HW controls to
  :   memory, meaning that a guest could setup a timer and never see it
  :   firing until the next exit
  :
  : - We go try hard to reflect the timer state in memory, but that's not
  :   great.
  :
  : - With FEAT_ECV, we can finally correctly emulate the virtual timer,
  :   but this emulation is pretty costly
  :
  : - As a way to make things suck less, we handle timer reads as early as
  :   possible, and only defer writes to the normal trap handling
  :
  : - Finally, some implementations are badly broken, and require some
  :   hand-holding, irrespective of NV support. So we try and reuse the NV
  :   infrastructure to make them usable. This could be further optimised,
  :   but I'm running out of patience for this sort of HW.
  :
  : [...]"
  : .
  KVM: arm64: nv: Fix doc header layout for timers
  KVM: arm64: nv: Document EL2 timer API
  KVM: arm64: Work around x1e's CNTVOFF_EL2 bogosity
  KVM: arm64: nv: Sanitise CNTHCTL_EL2
  KVM: arm64: nv: Propagate CNTHCTL_EL2.EL1NV{P,V}CT bits
  KVM: arm64: nv: Add trap routing for CNTHCTL_EL2.EL1{NVPCT,NVVCT,TVT,TVCT}
  KVM: arm64: Handle counter access early in non-HYP context
  KVM: arm64: nv: Accelerate EL0 counter accesses from hypervisor context
  KVM: arm64: nv: Accelerate EL0 timer read accesses when FEAT_ECV in use
  KVM: arm64: nv: Use FEAT_ECV to trap access to EL0 timers
  KVM: arm64: nv: Publish emulated timer interrupt state in the in-memory state
  KVM: arm64: nv: Sync nested timer state with FEAT_NV2
  KVM: arm64: nv: Add handling of EL2-specific timer registers

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-17 11:04:53 +00:00
Marc Zyngier
36f998de85 KVM: arm64: nv: Apply RESx settings to sysreg reset values
While we have sanitisation in place for the guest sysregs, we lack
that sanitisation out of reset. So some of the fields could be
evaluated and not reflect their RESx status, which sounds like
a very bad idea.

Apply the RESx masks to the the sysreg file in two situations:

- when going via a reset of the sysregs

- after having computed the RESx masks

Having this separate reset phase from the actual reset handling is
a bit grotty, but we need to apply this after the ID registers are
final.

Tested-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250112165029.1181056-3-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-14 11:33:09 +00:00
Marc Zyngier
c139b6d1b4 KVM: arm64: nv: Always evaluate HCR_EL2 using sanitising accessors
A lot of the NV code depends on HCR_EL2.{E2H,TGE}, and we assume
in places that at least HCR_EL2.E2H is invariant for a given guest.

However, we make a point in *not* using the sanitising accessor
that would enforce this, and are at the mercy of the guest doing
stupid things. Clearly, that's not good.

Rework the HCR_EL2 accessors to use __vcpu_sys_reg() instead,
guaranteeing that the RESx settings get applied, specially
when HCR_EL2.E2H is evaluated. This results in fewer accessors
overall.

Huge thanks to Joey who spent a long time tracking this bug down.

Reported-by: Joey Gouly <Joey.Gouly@arm.com>
Tested-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20250112165029.1181056-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-14 11:27:25 +00:00
James Clark
054b88391b KVM: arm64: Support trace filtering for guests
For nVHE, switch the filter value in and out if the Coresight driver
asks for it. This will support filters for guests when sinks other than
TRBE are used.

For VHE, just write the filter directly to TRFCR_EL1 where trace can be
used even with TRBE sinks.

Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250106142446.628923-7-james.clark@linaro.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-12 12:50:11 +00:00
James Clark
a665e3bc88 KVM: arm64: coresight: Give TRBE enabled state to KVM
Currently in nVHE, KVM has to check if TRBE is enabled on every guest
switch even if it was never used. Because it's a debug feature and is
more likely to not be used than used, give KVM the TRBE buffer status to
allow a much simpler and faster do-nothing path in the hyp.

Protected mode now disables trace regardless of TRBE (because
trfcr_while_in_guest is always 0), which was not previously done.
However, it continues to flush whenever the buffer is enabled
regardless of the filter status. This avoids the hypothetical case of a
host that had disabled the filter but not flushed which would arise if
only doing the flush when the filter was enabled.

Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250106142446.628923-6-james.clark@linaro.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-12 12:50:11 +00:00
James Clark
c382ee674c arm64/sysreg/tools: Move TRFCR definitions to sysreg
Convert TRFCR to automatic generation. Add separate definitions for ELx
and EL2 as TRFCR_EL1 doesn't have CX. This also mirrors the previous
definition so no code change is required.

Also add TRFCR_EL12 which will start to be used in a later commit.

Unfortunately, to avoid breaking the Perf build with duplicate
definition errors, the tools copy of the sysreg.h header needs to be
updated at the same time rather than the usual second commit. This is
because the generated version of sysreg
(arch/arm64/include/generated/asm/sysreg-defs.h), is currently shared
and tools/ does not have its own copy.

Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250106142446.628923-4-james.clark@linaro.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-12 12:50:11 +00:00
Marc Zyngier
e880b16efb Merge branch kvm-arm64/pkvm-fixed-features-6.14 into kvmarm-master/next
* kvm-arm64/pkvm-fixed-features-6.14: (24 commits)
  : .
  : Complete rework of the pKVM handling of features, catching up
  : with the rest of the code deals with it these days.
  : Patches courtesy of Fuad Tabba. From the cover letter:
  :
  : "This patch series uses the vm's feature id registers to track the
  : supported features, a framework similar to nested virt to set the
  : trap values, and removes the need to store cptr_el2 per vcpu in
  : favor of setting its value when traps are activated, as VHE mode
  : does."
  :
  : This branch drags the arm64/for-next/cpufeature branch to solve
  : ugly conflicts in -next.
  : .
  KVM: arm64: Fix FEAT_MTE in pKVM
  KVM: arm64: Use kvm_vcpu_has_feature() directly for struct kvm
  KVM: arm64: Convert the SVE guest vcpu flag to a vm flag
  KVM: arm64: Remove PtrAuth guest vcpu flag
  KVM: arm64: Fix the value of the CPTR_EL2 RES1 bitmask for nVHE
  KVM: arm64: Refactor kvm_reset_cptr_el2()
  KVM: arm64: Calculate cptr_el2 traps on activating traps
  KVM: arm64: Remove redundant setting of HCR_EL2 trap bit
  KVM: arm64: Remove fixed_config.h header
  KVM: arm64: Rework specifying restricted features for protected VMs
  KVM: arm64: Set protected VM traps based on its view of feature registers
  KVM: arm64: Fix RAS trapping in pKVM for protected VMs
  KVM: arm64: Initialize feature id registers for protected VMs
  KVM: arm64: Use KVM extension checks for allowed protected VM capabilities
  KVM: arm64: Remove KVM_ARM_VCPU_POWER_OFF from protected VMs allowed features in pKVM
  KVM: arm64: Move checking protected vcpu features to a separate function
  KVM: arm64: Group setting traps for protected VMs by control register
  KVM: arm64: Consolidate allowed and restricted VM feature checks
  arm64/sysreg: Get rid of CPACR_ELx SysregFields
  arm64/sysreg: Convert *_EL12 accessors to Mapping
  ...

Signed-off-by: Marc Zyngier <maz@kernel.org>

# Conflicts:
#	arch/arm64/kvm/fpsimd.c
#	arch/arm64/kvm/hyp/nvhe/pkvm.c
2025-01-12 10:40:10 +00:00
Marc Zyngier
d0670128d4 Merge branch kvm-arm64/pkvm-np-guest into kvmarm-master/next
* kvm-arm64/pkvm-np-guest:
  : .
  : pKVM support for non-protected guests using the standard MM
  : infrastructure, courtesy of Quentin Perret. From the cover letter:
  :
  : "This series moves the stage-2 page-table management of non-protected
  : guests to EL2 when pKVM is enabled. This is only intended as an
  : incremental step towards a 'feature-complete' pKVM, there is however a
  : lot more that needs to come on top.
  :
  : With that series applied, pKVM provides near-parity with standard KVM
  : from a functional perspective all while Linux no longer touches the
  : stage-2 page-tables itself at EL1. The majority of mm-related KVM
  : features work out of the box, including MMU notifiers, dirty logging,
  : RO memslots and things of that nature. There are however two gotchas:
  :
  :  - We don't support mapping devices into guests: this requires
  :    additional hypervisor support for tracking the 'state' of devices,
  :    which will come in a later series. No device assignment until then.
  :
  :  - Stage-2 mappings are forced to page-granularity even when backed by a
  :    huge page for the sake of simplicity of this series. I'm only aiming
  :    at functional parity-ish (from userspace's PoV) for now, support for
  :    HP can be added on top later as a perf improvement."
  : .
  KVM: arm64: Plumb the pKVM MMU in KVM
  KVM: arm64: Introduce the EL1 pKVM MMU
  KVM: arm64: Introduce __pkvm_tlb_flush_vmid()
  KVM: arm64: Introduce __pkvm_host_mkyoung_guest()
  KVM: arm64: Introduce __pkvm_host_test_clear_young_guest()
  KVM: arm64: Introduce __pkvm_host_wrprotect_guest()
  KVM: arm64: Introduce __pkvm_host_relax_guest_perms()
  KVM: arm64: Introduce __pkvm_host_unshare_guest()
  KVM: arm64: Introduce __pkvm_host_share_guest()
  KVM: arm64: Introduce __pkvm_vcpu_{load,put}()
  KVM: arm64: Add {get,put}_pkvm_hyp_vm() helpers
  KVM: arm64: Make kvm_pgtable_stage2_init() a static inline function
  KVM: arm64: Pass walk flags to kvm_pgtable_stage2_relax_perms
  KVM: arm64: Pass walk flags to kvm_pgtable_stage2_mkyoung
  KVM: arm64: Move host page ownership tracking to the hyp vmemmap
  KVM: arm64: Make hyp_page::order a u8
  KVM: arm64: Move enum pkvm_page_state to memory.h
  KVM: arm64: Change the layout of enum pkvm_page_state

Signed-off-by: Marc Zyngier <maz@kernel.org>

# Conflicts:
#	arch/arm64/kvm/arm.c
2025-01-12 10:37:15 +00:00
Marc Zyngier
4e26de25d2 Merge remote-tracking branch 'arm64/for-next/cpufeature' into kvm-arm64/pkvm-fixed-features-6.14
Merge arm64/for-next/cpufeature to solve extensive conflicts
caused by the CPACR_ELx->CPACR_EL1 repainting.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-11 14:55:18 +00:00
Thorsten Blum
965e9bbe02 arm64: Remove duplicate included header
The header asm/unistd_compat_32.h is included whether CONFIG_COMPAT is
defined or not.

Include it only once and remove the following make includecheck warning:

  asm/unistd_compat_32.h is included more than once

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20250109104636.124507-2-thorsten.blum@linux.dev
Signed-off-by: Will Deacon <will@kernel.org>
2025-01-10 13:44:22 +00:00
Nuno Das Neves
962a4c7ea8 hyperv: Remove the now unused hyperv-tlfs.h files
Remove all hyperv-tlfs.h files. These are no longer included
anywhere. hyperv/hvhdk.h serves the same role, but with an easier
path for adding new definitions.

Remove the relevant lines in MAINTAINERS.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-6-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1732577084-2122-6-git-send-email-nunodasneves@linux.microsoft.com>
2025-01-10 00:54:21 +00:00
Nuno Das Neves
ef5a3c92a8 hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
Switch to using hvhdk.h everywhere in the kernel. This header
includes all the new Hyper-V headers in include/hyperv, which form a
superset of the definitions found in hyperv-tlfs.h.

This makes it easier to add new Hyper-V interfaces without being
restricted to those in the TLFS doc (reflected in hyperv-tlfs.h).

To be more consistent with the original Hyper-V code, the names of
some definitions are changed slightly. Update those where needed.

Update comments in mshyperv.h files to point to include/hyperv for
adding new definitions.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com
Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-01-10 00:54:21 +00:00
Frederic Weisbecker
3a5446612a sched,arm64: Handle CPU isolation on last resort fallback rq selection
When a kthread or any other task has an affinity mask that is fully
offline or unallowed, the scheduler reaffines the task to all possible
CPUs as a last resort.

This default decision doesn't mix up very well with nohz_full CPUs that
are part of the possible cpumask but don't want to be disturbed by
unbound kthreads or even detached pinned user tasks.

Make the fallback affinity setting aware of nohz_full.

Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2025-01-08 18:14:23 +01:00
Jeremy Linton
a1edec2245 arm64: rsi: Add automatic arm-cca-guest module loading
The TSM module provides guest identification and attestation when a
guest runs in CCA realm mode. By creating a dummy platform device,
let's ensure the module is automatically loaded. The udev daemon loads
the TSM module after it receives a device addition event. Once that
happens, it can be used earlier in the boot process to decrypt the
rootfs.

Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/20241220181236.172060-2-jeremy.linton@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-01-08 13:58:49 +00:00
Mark Brown
819935464c arm64/hwcap: Describe 2024 dpISA extensions to userspace
The 2024 dpISA introduces a number of architecture features all of which
only add new instructions so only require the addition of hwcaps and ID
register visibility.

Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250107-arm64-2024-dpisa-v5-3-7578da51fc3d@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2025-01-08 13:41:06 +00:00
Kalesh Singh
38f9e4b905 arm64: kvm: Introduce nvhe stack size constants
Refactor nvhe stack code to use NVHE_STACK_SIZE/SHIFT constants,
instead of directly using PAGE_SIZE/SHIFT. This makes the code a bit
easier to read, without introducing any functional changes.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Link: https://lore.kernel.org/r/20241112003336.1375584-1-kaleshsingh@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-08 11:25:28 +00:00
Vincent Donnefort
68344037b7 KVM: arm64: Fix nVHE stacktrace VA bits mask
The hypervisor VA space size depends on both the ID map's
(IDMAP_VA_BITS) and the kernel stage-1 (VA_BITS). However, the
hypervisor stacktrace decoding is solely relying on VA_BITS. This is
especially an issue when VA_BITS < IDMAP_VA_BITS (i.e. VA_BITS is
39-bit): the hypervisor may have addresses bigger than the stacktrace is
masking.

Align this mask with hyp_va_bits.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20250107112821.416591-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-08 11:18:39 +00:00
Anshuman Khandual
fe2169f556 arm64/mm: Replace open encodings with PXD_TABLE_BIT
[pgd|p4d]_bad() helpers have open encodings for their respective table bits
which can be replaced with corresponding macros. This makes things clearer,
thus improving their readability as well.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20250107015529.798319-1-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-01-07 16:47:45 +00:00
Anshuman Khandual
1692265830 arm64/mm: Rename pte_mkpresent() as pte_mkvalid()
pte_present() is no longer synonymous with pte_valid() as it also tests for
pte_present_invalid() as well. Hence pte_mkpresent() is misleading, because
all that does is make an entry mapped, via setting PTE_VALID. Hence rename
the helper as pte_mkvalid() which reflects its functionality appropriately.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250107023016.829416-1-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2025-01-07 16:47:33 +00:00
Marc Zyngier
0bc9a9e85f KVM: arm64: Work around x1e's CNTVOFF_EL2 bogosity
It appears that on Qualcomm's x1e CPU, CNTVOFF_EL2 doesn't really
work, specially with HCR_EL2.E2H=1.

A non-zero offset results in a screaming virtual timer interrupt,
to the tune of a few 100k interrupts per second on a 4 vcpu VM.
This is also evidenced by this CPU's inability to correctly run
any of the timer selftests.

The only case this doesn't break is when this register is set to 0,
which breaks VM migration.

When HCR_EL2.E2H=0, the timer seems to behave normally, and does
not result in an interrupt storm.

As a workaround, use the fact that this CPU implements FEAT_ECV,
and trap all accesses to the virtual timer and counter, keeping
CNTVOFF_EL2 set to zero, and emulate accesses to CVAL/TVAL/CTL
and the counter itself, fixing up the timer to account for the
missing offset.

And if you think this is disgusting, you'd probably be right.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-12-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:10 +00:00
Marc Zyngier
d1e37a50e1 KVM: arm64: nv: Sanitise CNTHCTL_EL2
Inject some sanity in CNTHCTL_EL2, ensuring that we don't handle
more than we advertise to the guest.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-11-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:10 +00:00
Marc Zyngier
b59dbb91f7 KVM: arm64: nv: Add handling of EL2-specific timer registers
Add the required handling for EL2 and EL02 registers, as
well as EL1 registers used in the E2H context. This includes
handling the virtual timer accesses when CNTHCTL_EL2.EL1TVT
or CNTHCTL_EL2.EL1TVCT are set.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241217142321.763801-2-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-01-02 19:19:09 +00:00
Masami Hiramatsu (Google)
2bc56fdae1 ftrace: Add ftrace_get_symaddr to convert fentry_ip to symaddr
This introduces ftrace_get_symaddr() which tries to convert fentry_ip
passed by ftrace or fgraph callback to symaddr without calling
kallsyms API. It returns the symbol address or 0 if it fails to
convert it.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/173519011487.391279.5450806886342723151.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412061423.K79V55Hd-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202412061804.5VRzF14E-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:50:07 -05:00
Masami Hiramatsu (Google)
b5fa903b7f fprobe: Add fprobe_header encoding feature
Fprobe store its data structure address and size on the fgraph return stack
by __fprobe_header. But most 64bit architecture can combine those to
one unsigned long value because 4 MSB in the kernel address are the same.
With this encoding, fprobe can consume less space on ret_stack.

This introduces asm/fprobe.h to define arch dependent encode/decode
macros. Note that since fprobe depends on CONFIG_HAVE_FUNCTION_GRAPH_FREGS,
currently only arm64, loongarch, riscv, s390 and x86 are supported.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173519005783.391279.5307910947400277525.stgit@devnote2
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:50:05 -05:00
Masami Hiramatsu (Google)
4346ba1604 fprobe: Rewrite fprobe on function-graph tracer
Rewrite fprobe implementation on function-graph tracer.
Major API changes are:
 -  'nr_maxactive' field is deprecated.
 -  This depends on CONFIG_DYNAMIC_FTRACE_WITH_ARGS or
    !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS, and
    CONFIG_HAVE_FUNCTION_GRAPH_FREGS. So currently works only
    on x86_64.
 -  Currently the entry size is limited in 15 * sizeof(long).
 -  If there is too many fprobe exit handler set on the same
    function, it will fail to probe.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/173519003970.391279.14406792285453830996.stgit@devnote2
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:50:05 -05:00
Masami Hiramatsu (Google)
d5d01b7199 tracing: Add ftrace_fill_perf_regs() for perf event
Add ftrace_fill_perf_regs() which should be compatible with the
perf_fetch_caller_regs(). In other words, the pt_regs returned from the
ftrace_fill_perf_regs() must satisfy 'user_mode(regs) == false' and can be
used for stack tracing.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/173518997908.391279.15910334347345106424.stgit@devnote2
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:50:04 -05:00
Masami Hiramatsu (Google)
b9b55c8912 tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs
Add ftrace_partial_regs() which converts the ftrace_regs to pt_regs.
This is for the eBPF which needs this to keep the same pt_regs interface
to access registers.
Thus when replacing the pt_regs with ftrace_regs in fprobes (which is
used by kprobe_multi eBPF event), this will be used.

If the architecture defines its own ftrace_regs, this copies partial
registers to pt_regs and returns it. If not, ftrace_regs is the same as
pt_regs and ftrace_partial_regs() will return ftrace_regs::regs.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Florent Revest <revest@chromium.org>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Link: https://lore.kernel.org/173518996761.391279.4987911298206448122.stgit@devnote2
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:50:03 -05:00
Masami Hiramatsu (Google)
a3ed4157b7 fgraph: Replace fgraph_ret_regs with ftrace_regs
Use ftrace_regs instead of fgraph_ret_regs for tracing return value
on function_graph tracer because of simplifying the callback interface.

The CONFIG_HAVE_FUNCTION_GRAPH_RETVAL is also replaced by
CONFIG_HAVE_FUNCTION_GRAPH_FREGS.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/173518991508.391279.16635322774382197642.stgit@devnote2
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-26 10:50:02 -05:00
Fuad Tabba
41d6028e28 KVM: arm64: Convert the SVE guest vcpu flag to a vm flag
The vcpu flag GUEST_HAS_SVE is per-vcpu, but it is based on what
is now a per-vm feature. Make the flag per-vm.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-17-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:54:09 +00:00
Fuad Tabba
c5c1763596 KVM: arm64: Remove PtrAuth guest vcpu flag
The vcpu flag GUEST_HAS_PTRAUTH is always associated with the
vcpu PtrAuth features, which are defined per vm rather than per
vcpu.

Remove the flag, and replace it with checks for the features
instead.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-16-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:54:06 +00:00
Fuad Tabba
1eccad35c9 KVM: arm64: Fix the value of the CPTR_EL2 RES1 bitmask for nVHE
Since the introduction of SME, bit 12 in CPTR_EL2 (nVHE) is TSM
for trapping SME, instead of RES1, as per ARM ARM DDI 0487K.a,
section D23.2.34.

Fix the value of CPTR_NVHE_EL2_RES1 to reflect that, and adjust
the code that relies on it accordingly.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-15-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:54:03 +00:00
Fuad Tabba
8f7df795b2 KVM: arm64: Refactor kvm_reset_cptr_el2()
Fold kvm_get_reset_cptr_el2() into kvm_reset_cptr_el2(), since it
is its only caller. Add a comment to clarify that this function
is meant for the host value of cptr_el2.

No functional change intended.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-14-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:54:00 +00:00
Fuad Tabba
2fd5b4b0e7 KVM: arm64: Calculate cptr_el2 traps on activating traps
Similar to VHE, calculate the value of cptr_el2 from scratch on
activate traps. This removes the need to store cptr_el2 in every
vcpu structure. Moreover, some traps, such as whether the guest
owns the fp registers, need to be set on every vcpu run.

Reported-by: James Clark <james.clark@linaro.org>
Fixes: 5294afdbf4 ("KVM: arm64: Exclude FP ownership from kvm_vcpu_arch")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-13-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:53:57 +00:00
Fuad Tabba
3d7ff00700 KVM: arm64: Rework specifying restricted features for protected VMs
The existing code didn't properly distinguish between signed and
unsigned features, and was difficult to read and to maintain.
Rework it using the same method used in other parts of KVM when
handling vcpu features.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-10-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:53:01 +00:00
Fuad Tabba
a3163dca48 KVM: arm64: Use KVM extension checks for allowed protected VM capabilities
Use KVM extension checks as the source for determining which
capabilities are allowed for protected VMs. KVM extension checks
is the natural place for this, since it is also the interface
exposed to users.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241216105057.579031-6-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 13:45:25 +00:00
Quentin Perret
fce886a602 KVM: arm64: Plumb the pKVM MMU in KVM
Introduce the KVM_PGT_CALL() helper macro to allow switching from the
traditional pgtable code to the pKVM version easily in mmu.c. The cost
of this 'indirection' is expected to be very minimal due to
is_protected_kvm_enabled() being backed by a static key.

With this, everything is in place to allow the delegation of
non-protected guest stage-2 page-tables to pKVM, so let's stop using the
host's kvm_s2_mmu from EL2 and enjoy the ride.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-19-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
e912efed48 KVM: arm64: Introduce the EL1 pKVM MMU
Introduce a set of helper functions allowing to manipulate the pKVM
guest stage-2 page-tables from EL1 using pKVM's HVC interface.

Each helper has an exact one-to-one correspondance with the traditional
kvm_pgtable_stage2_*() functions from pgtable.c, with a strictly
matching prototype. This will ease plumbing later on in mmu.c.

These callbacks track the gfn->pfn mappings in a simple rb_tree indexed
by IPA in lieu of a page-table. This rb-tree is kept in sync with pKVM's
state and is protected by the mmu_lock like a traditional stage-2
page-table.

Signed-off-by: Quentin Perret <qperret@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-18-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
0adce4d42f KVM: arm64: Introduce __pkvm_tlb_flush_vmid()
Introduce a new hypercall to flush the TLBs of non-protected guests. The
host kernel will be responsible for issuing this hypercall after changing
stage-2 permissions using the __pkvm_host_relax_guest_perms() or
__pkvm_host_wrprotect_guest() paths. This is left under the host's
responsibility for performance reasons.

Note however that the TLB maintenance for all *unmap* operations still
remains entirely under the hypervisor's responsibility for security
reasons -- an unmapped page may be donated to another entity, so a stale
TLB entry could be used to leak private data.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-17-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
76f0b18b3d KVM: arm64: Introduce __pkvm_host_mkyoung_guest()
Plumb the kvm_pgtable_stage2_mkyoung() callback into pKVM for
non-protected guests. It will be called later from the fault handling
path.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-16-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
56ab4de37f KVM: arm64: Introduce __pkvm_host_test_clear_young_guest()
Plumb the kvm_stage2_test_clear_young() callback into pKVM for
non-protected guest. It will be later be called from MMU notifiers.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-15-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
26117e4c63 KVM: arm64: Introduce __pkvm_host_wrprotect_guest()
Introduce a new hypercall to remove the write permission from a
non-protected guest stage-2 mapping. This will be used for e.g. enabling
dirty logging.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-14-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
34884a0a4a KVM: arm64: Introduce __pkvm_host_relax_guest_perms()
Introduce a new hypercall allowing the host to relax the stage-2
permissions of mappings in a non-protected guest page-table. It will be
used later once we start allowing RO memslots and dirty logging.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-13-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
72db3d3fba KVM: arm64: Introduce __pkvm_host_unshare_guest()
In preparation for letting the host unmap pages from non-protected
guests, introduce a new hypercall implementing the host-unshare-guest
transition.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-12-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
d0bd3e6570 KVM: arm64: Introduce __pkvm_host_share_guest()
In preparation for handling guest stage-2 mappings at EL2, introduce a
new pKVM hypercall allowing to share pages with non-protected guests.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-11-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Marc Zyngier
f7d03fcbf1 KVM: arm64: Introduce __pkvm_vcpu_{load,put}()
Rather than look-up the hyp vCPU on every run hypercall at EL2,
introduce a per-CPU 'loaded_hyp_vcpu' tracking variable which is updated
by a pair of load/put hypercalls called directly from
kvm_arch_vcpu_{load,put}() when pKVM is enabled.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-10-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
c77e5181fe KVM: arm64: Make kvm_pgtable_stage2_init() a static inline function
Turn kvm_pgtable_stage2_init() into a static inline function instead of
a macro. This will allow the usage of typeof() on it later on.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-8-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
e279c25d78 KVM: arm64: Pass walk flags to kvm_pgtable_stage2_relax_perms
kvm_pgtable_stage2_relax_perms currently assumes that it is being called
from a 'shared' walker, which will not be true once called from pKVM. To
allow for the re-use of that function, make the walk flags one of its
parameters.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-7-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:44:00 +00:00
Quentin Perret
5398ddc5c9 KVM: arm64: Pass walk flags to kvm_pgtable_stage2_mkyoung
kvm_pgtable_stage2_mkyoung currently assumes that it is being called
from a 'shared' walker, which will not be true once called from pKVM.
To allow for the re-use of that function, make the walk flags one of
its parameters.

Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20241218194059.3670226-6-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:43:59 +00:00
Oliver Upton
8c02c2bbd6 KVM: arm64: Avoid reading ID_AA64DFR0_EL1 for debug save/restore
Similar to other per-CPU profiling/debug features we handle, store the
number of breakpoints/watchpoints in kvm_host_data to avoid reading the
ID register 4 times on every guest entry/exit. And if you're in the
nested virt business that's quite a few avoidable exits to the L0
hypervisor.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-18-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:04:14 +00:00
Oliver Upton
b0ee51033a KVM: arm64: nv: Honor MDCR_EL2.TDE routing for debug exceptions
Inject debug exceptions into vEL2 if MDCR_EL2.TDE is set.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-17-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:04:11 +00:00
Marc Zyngier
2ca3f03bf5 KVM: arm64: Manage software step state at load/put
KVM takes over the guest's software step state machine if the VMM is
debugging the guest, but it does the save/restore fiddling for every
guest entry.

Note that the only constraint on host usage of software step is that the
guest's configuration remains visible to userspace via the ONE_REG
ioctls. So, we can cut down on the amount of fiddling by doing this at
load/put instead.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-16-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:04:06 +00:00
Oliver Upton
4ad3a0b87f KVM: arm64: Don't hijack guest context MDSCR_EL1
Stealing MDSCR_EL1 in the guest's kvm_cpu_context for external debugging
is rather gross. Just add a field for this instead and let the context
switch code pick the correct one based on the debug owner.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-15-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:02:51 +00:00
Oliver Upton
75a5fbaf66 KVM: arm64: Compute MDCR_EL2 at vcpu_load()
KVM has picked up several hacks to cope with vcpu->arch.mdcr_el2 needing
to be prepared before vcpu_load(), which is when it gets programmed
into hardware on VHE.

Now that the flows for reprogramming MDCR_EL2 have been simplified, move
that computation to vcpu_load().

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-14-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
06d22a9c1b KVM: arm64: Reload vCPU for accesses to OSLAR_EL1
KVM takes ownership of the debug regs if the guest enables the OS lock,
as it needs to use MDSCR_EL1 to mask debug exceptions. Just reload the
vCPU if the guest toggles the OS lock, relying on kvm_vcpu_load_debug()
to update the debug owner and get the right trap configuration in place.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-13-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
beb470d96c KVM: arm64: Use debug_owner to track if debug regs need save/restore
Use the debug owner to determine if the debug regs are in use instead of
keeping around the DEBUG_DIRTY flag. Debug registers are now
saved/restored after the first trap, regardless of whether it was a read
or a write. This also shifts the point at which KVM becomes lazy to
vcpu_put() rather than the next exception taken from the guest.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-12-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
803602b0d9 KVM: arm64: Remove vestiges of debug_ptr
Delete the remnants of debug_ptr now that debug registers are selected
based on the debug owner instead.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-11-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
58db67e9ac KVM: arm64: Select debug state to save/restore based on debug owner
Select the set of debug registers to use based on the owner rather than
relying on debug_ptr. Besides the code cleanup, this allows us to
eliminate a couple instances kern_hyp_va() as well.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-9-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
cd9b10102a KVM: arm64: Evaluate debug owner at vcpu_load()
In preparation for tossing the debug_ptr mess, introduce an enumeration
to track the ownership of the debug registers while in the guest. Update
the owner at vcpu_load() based on whether the host needs to steal the
guest's debug context or if breakpoints/watchpoints are actively in use.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-7-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 09:01:25 +00:00
Oliver Upton
d381e53384 KVM: arm64: Move host SME/SVE tracking flags to host data
The SME/SVE state tracking flags have no business in the vCPU. Move them
to kvm_host_data.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-5-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 08:49:09 +00:00
Oliver Upton
38131c02a5 KVM: arm64: Track presence of SPE/TRBE in kvm_host_data instead of vCPU
Add flags to kvm_host_data to track if SPE/TRBE is present +
programmable on a per-CPU basis. Set the flags up at init rather than
vcpu_load() as the programmability of these buffers is unlikely to
change.

Reviewed-by: James Clark <james.clark@linaro.org>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 08:49:09 +00:00
Oliver Upton
2417218f2f KVM: arm64: Get rid of __kvm_get_mdcr_el2() and related warts
KVM caches MDCR_EL2 on a per-CPU basis in order to preserve the
configuration of MDCR_EL2.HPMN while running a guest. This is a bit
gross, since we're relying on some baked configuration rather than the
hardware definition of implemented counters.

Discover the number of implemented counters by reading PMCR_EL0.N
instead. This works because:

 - In VHE the kernel runs at EL2, and N always returns the number of
   counters implemented in hardware

 - In {n,h}VHE, the EL2 setup code programs MDCR_EL2.HPMN with the EL2
   view of PMCR_EL0.N for the host

Lastly, avoid traps under nested virtualization by saving PMCR_EL0.N in
host data.

Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241219224116.3941496-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20 08:49:08 +00:00
Marc Zyngier
e5ecedcd7c arm64/sysreg: Get rid of CPACR_ELx SysregFields
There is no such thing as CPACR_ELx in the architecture.
What we have is CPACR_EL1, for which CPTR_EL12 is an accessor.

Rename CPACR_ELx_* to CPACR_EL1_*, and fix the bit of code using
these names.

Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241219173351.1123087-5-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-19 18:00:58 +00:00
Ard Biesheuvel
62cffa496a arm64/mm: Override PARange for !LPA2 and use it consistently
When FEAT_LPA{,2} are not implemented, the ID_AA64MMFR0_EL1.PARange and
TCR.IPS values corresponding with 52-bit physical addressing are
reserved.

Setting the TCR.IPS field to 0b110 (52-bit physical addressing) has side
effects, such as how the TTBRn_ELx.BADDR fields are interpreted, and so
it is important that disabling FEAT_LPA2 (by overriding the
ID_AA64MMFR0.TGran fields) also presents a PARange field consistent with
that.

So limit the field to 48 bits unless LPA2 is enabled, and update
existing references to use the override consistently.

Fixes: 352b0395b5 ("arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs")
Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241212081841.2168124-10-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-19 17:23:52 +00:00
Ard Biesheuvel
bf74bb73cd arm64/mm: Reduce PA space to 48 bits when LPA2 is not enabled
Currently, LPA2 kernel support implies support for up to 52 bits of
physical addressing, and this is reflected in global definitions such as
PHYS_MASK_SHIFT and MAX_PHYSMEM_BITS.

This is potentially problematic, given that LPA2 hardware support is
modeled as a CPU feature which can be overridden, and with LPA2 hardware
support turned off, attempting to map physical regions with address bits
[51:48] set (which may exist on LPA2 capable systems booting with
arm64.nolva) will result in corrupted mappings with a truncated output
address and bogus shareability attributes.

This means that the accepted physical address range in the mapping
routines should be at most 48 bits wide when LPA2 support is configured
but not enabled at runtime.

Fixes: 352b0395b5 ("arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs")
Cc: stable@vger.kernel.org
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241212081841.2168124-9-ardb+git@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-19 17:23:52 +00:00
Sean Christopherson
915d2f0718 KVM: Move KVM_REG_SIZE() definition to common uAPI header
Define KVM_REG_SIZE() in the common kvm.h header, and delete the arm64 and
RISC-V versions.  As evidenced by the surrounding definitions, all aspects
of the register size encoding are generic, i.e. RISC-V should have moved
arm64's definition to common code instead of copy+pasting.

Acked-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-17 08:49:48 -08:00
Paolo Bonzini
3154bddf8c KVM/arm64 fixes for 6.13, part #2
- Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
    SPE/TRBE initialization
 
  - Align nested page table walker with the intended memory attribute
    combining rules of the architecture
 
  - Prevent userspace from constraining the advertised ASID width,
    avoiding horrors of guest TLBIs not matching the intended context in
    hardware
 
  - Don't leak references on LPIs when insertion into the translation
    cache fails
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZ0+mZhccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFuKcAQDnFcLru8MVor4zjloe25oPPeuW
 iBocGpgKwJMioHrAdwEAoq8v0eqfxrUpwr5KJ7iN9CTo9oANJYhVACC8jPHEowI=
 =fLPh
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-6.13-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.13, part #2

 - Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
   SPE/TRBE initialization

 - Align nested page table walker with the intended memory attribute
   combining rules of the architecture

 - Prevent userspace from constraining the advertised ASID width,
   avoiding horrors of guest TLBIs not matching the intended context in
   hardware

 - Don't leak references on LPIs when insertion into the translation
   cache fails
2024-12-10 08:50:55 -05:00
Mark Rutland
264a593da6 arm64: cpufeature: Add HAFT to cpucap_is_possible()
For consistency with other cpucaps, handle the configuration check for
ARM64_HAFT in cpucap_is_possible() rather than this being explicit in
system_supports_haft(). The configuration check will now happen
implicitly as cpus_have_final_cap() uses cpucap_is_possible() via
alternative_has_cap_unlikely().

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20241209155948.2124393-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-10 12:10:09 +00:00
Zhu Jun
e281bd2299 arm64: asm: Fix typo in pgtable.h
The word 'trasferring' is wrong, so fix it.

Signed-off-by: Zhu Jun <zhujun2@cmss.chinamobile.com>
Link: https://lore.kernel.org/r/20241203093323.7831-1-zhujun2@cmss.chinamobile.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-10 11:34:45 +00:00
Anshuman Khandual
a0e33f528e arm64/mm: Replace open encodings with PXD_TABLE_BIT
[pgd|p4d]_bad() helpers have open encodings for their respective table bits
which can be replaced with corresponding macros. This makes things clearer,
thus improving their readability as well.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20241202083850.73207-1-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-10 11:32:57 +00:00
Anshuman Khandual
5f882f4aa8 arm64/mm: Drop INIT_MM_CONTEXT()
Platform override for INIT_MM_CONTEXT() is redundant because swapper_pg_dir
always gets assigned as the pgd during init_mm initialization. So just drop
this override on arm64.

Originally this override was added via the 'commit 2b5548b681 ("arm64/mm:
Separate boot-time page tables from swapper_pg_dir")' because non standard
init_pg_dir was assigned as the pgd. Subsequently it was changed as default
swapper_pg_dir by the 'commit ba5b0333a8 ("arm64: mm: omit redundant
remap of kernel image")', which might have also just dropped this override.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Link: https://lore.kernel.org/r/20241202043553.29592-1-anshuman.khandual@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-12-10 11:32:26 +00:00
Robin Murphy
f00b53f161 arm64: cpufeature: Add GCS to cpucap_is_possible()
Since system_supports_gcs() ends up referring to cpucap_is_possible(),
teach the latter about GCS for consistency with similar features.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/416c7369fcdce4ebb2a8f12daae234507be27e38.1733406275.git.robin.murphy@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-12-05 17:15:38 +00:00
Yang Shi
49ccf2c3ca arm64: mte: set VM_MTE_ALLOWED for hugetlbfs at correct place
The commit 5de195060b ("mm: resolve faulty mmap_region() error path
behaviour") moved vm flags validation before fop->mmap for file
mappings.  But when commit 25c17c4b55 ("hugetlb: arm64: add mte support")
was rebased on top of it, the hugetlbfs part was missed.  Mmapping
hugetlbfs file may not have MAP_HUGETLB set.

Fixes: 25c17c4b55 ("hugetlb: arm64: add mte support")
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20241119200914.1145249-1-yang@os.amperecomputing.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-12-02 11:48:14 +00:00
James Clark
d798bc6f3c arm64: Fix usage of new shifted MDCR_EL2 values
Since the linked fixes commit, these masks are already shifted so remove
the shifts. One issue that this fixes is SPE and TRBE not being
available anymore:

 arm_spe_pmu arm,spe-v1: profiling buffer owned by higher exception level

Fixes: 641630313e ("arm64: sysreg: Migrate MDCR_EL2 definition to table")
Signed-off-by: James Clark <james.clark@linaro.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241122164636.2944180-1-james.clark@linaro.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-26 06:31:36 -08:00
Linus Torvalds
f5f4745a7f - The series "resource: A couple of cleanups" from Andy Shevchenko
performs some cleanups in the resource management code.
 
 - The series "Improve the copy of task comm" from Yafang Shao addresses
   possible race-induced overflows in the management of task_struct.comm[].
 
 - The series "Remove unnecessary header includes from
   {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
   small fix to the list_sort library code and to its selftest.
 
 - The series "Enhance min heap API with non-inline functions and
   optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
   min_heap library code.
 
 - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
   finishes off nilfs2's folioification.
 
 - The series "add detect count for hung tasks" from Lance Yang adds more
   userspace visibility into the hung-task detector's activity.
 
 - Apart from that, singelton patches in many places - please see the
   individual changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ0L6lQAKCRDdBJ7gKXxA
 jmEIAPwMSglNPKRIOgzOvHh8MUJW1Dy8iKJ2kWCO3f6QTUIM2AEA+PazZbUd/g2m
 Ii8igH0UBibIgva7MrCyJedDI1O23AA=
 =8BIU
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - The series "resource: A couple of cleanups" from Andy Shevchenko
   performs some cleanups in the resource management code

 - The series "Improve the copy of task comm" from Yafang Shao addresses
   possible race-induced overflows in the management of
   task_struct.comm[]

 - The series "Remove unnecessary header includes from
   {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a
   small fix to the list_sort library code and to its selftest

 - The series "Enhance min heap API with non-inline functions and
   optimizations" also from Kuan-Wei Chiu optimizes and cleans up the
   min_heap library code

 - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi
   finishes off nilfs2's folioification

 - The series "add detect count for hung tasks" from Lance Yang adds
   more userspace visibility into the hung-task detector's activity

 - Apart from that, singelton patches in many places - please see the
   individual changelogs for details

* tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits)
  gdb: lx-symbols: do not error out on monolithic build
  kernel/reboot: replace sprintf() with sysfs_emit()
  lib: util_macros_kunit: add kunit test for util_macros.h
  util_macros.h: fix/rework find_closest() macros
  Improve consistency of '#error' directive messages
  ocfs2: fix uninitialized value in ocfs2_file_read_iter()
  hung_task: add docs for hung_task_detect_count
  hung_task: add detect count for hung tasks
  dma-buf: use atomic64_inc_return() in dma_buf_getfile()
  fs/proc/kcore.c: fix coccinelle reported ERROR instances
  resource: avoid unnecessary resource tree walking in __region_intersects()
  ocfs2: remove unused errmsg function and table
  ocfs2: cluster: fix a typo
  lib/scatterlist: use sg_phys() helper
  checkpatch: always parse orig_commit in fixes tag
  nilfs2: convert metadata aops from writepage to writepages
  nilfs2: convert nilfs_recovery_copy_block() to take a folio
  nilfs2: convert nilfs_page_count_clean_buffers() to take a folio
  nilfs2: remove nilfs_writepage
  nilfs2: convert checkpoint file to be folio-based
  ...
2024-11-25 16:09:48 -08:00
Linus Torvalds
7f4f3b14e8 Add Rust support for trace events:
- Allow Rust code to have trace events
 
   Trace events is a popular way to debug what is happening inside the kernel
   or just to find out what is happening. Rust code is being added to the
   Linux kernel but it currently does not support the tracing infrastructure.
   Add support of trace events inside Rust code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ0DjqhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrLlAPsF6t/c1nHSGTKDv9FJDJe4JHdP7e+U
 7X0S8BmSTKFNAQD+K2TEd0bjVP7ug8dQZBT+fveiFr+ARYxAwJ3JnEFjUwg=
 =Ab+T
 -----END PGP SIGNATURE-----

Merge tag 'trace-rust-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull rust trace event support from Steven Rostedt:
 "Allow Rust code to have trace events

  Trace events is a popular way to debug what is happening inside the
  kernel or just to find out what is happening. Rust code is being added
  to the Linux kernel but it currently does not support the tracing
  infrastructure. Add support of trace events inside Rust code"

* tag 'trace-rust-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  rust: jump_label: skip formatting generated file
  jump_label: rust: pass a mut ptr to `static_key_count`
  samples: rust: fix `rust_print` build making it a combined module
  rust: add arch_static_branch
  jump_label: adjust inline asm to be consistent
  rust: samples: add tracepoint to Rust sample
  rust: add tracepoint support
  rust: add static_branch_unlikely for static_key_false
2024-11-25 15:44:29 -08:00
Linus Torvalds
9f16d5e6f2 The biggest change here is eliminating the awful idea that KVM had, of
essentially guessing which pfns are refcounted pages.  The reason to
 do so was that KVM needs to map both non-refcounted pages (for example
 BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP VMAs that contain
 refcounted pages.  However, the result was security issues in the past,
 and more recently the inability to map VM_IO and VM_PFNMAP memory
 that _is_ backed by struct page but is not refcounted.  In particular
 this broke virtio-gpu blob resources (which directly map host graphics
 buffers into the guest as "vram" for the virtio-gpu device) with the
 amdgpu driver, because amdgpu allocates non-compound higher order pages
 and the tail pages could not be mapped into KVM.
 
 This requires adjusting all uses of struct page in the per-architecture
 code, to always work on the pfn whenever possible.  The large series that
 did this, from David Stevens and Sean Christopherson, also cleaned up
 substantially the set of functions that provided arch code with the
 pfn for a host virtual addresses.  The previous maze of twisty little
 passages, all different, is replaced by five functions (__gfn_to_page,
 __kvm_faultin_pfn, the non-__ versions of these two, and kvm_prefetch_pages)
 saving almost 200 lines of code.
 
 ARM:
 
 * Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker
 
 * Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI
 
 * Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps
 
 * PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest
 
 * Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM
 
 * Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection
 
 * Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
 
 LoongArch:
 
 * Add iocsr and mmio bus simulation in kernel.
 
 * Add in-kernel interrupt controller emulation.
 
 * Add support for virtualization extensions to the eiointc irqchip.
 
 PPC:
 
 * Drop lingering and utterly obsolete references to PPC970 KVM, which was
   removed 10 years ago.
 
 * Fix incorrect documentation references to non-existing ioctls
 
 RISC-V:
 
 * Accelerate KVM RISC-V when running as a guest
 
 * Perf support to collect KVM guest statistics from host side
 
 s390:
 
 * New selftests: more ucontrol selftests and CPU model sanity checks
 
 * Support for the gen17 CPU model
 
 * List registers supported by KVM_GET/SET_ONE_REG in the documentation
 
 x86:
 
 * Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve
   documentation, harden against unexpected changes.  Even if the hardware
   A/D tracking is disabled, it is possible to use the hardware-defined A/D
   bits to track if a PFN is Accessed and/or Dirty, and that removes a lot
   of special cases.
 
 * Elide TLB flushes when aging secondary PTEs, as has been done in x86's
   primary MMU for over 10 years.
 
 * Recover huge pages in-place in the TDP MMU when dirty page logging is
   toggled off, instead of zapping them and waiting until the page is
   re-accessed to create a huge mapping.  This reduces vCPU jitter.
 
 * Batch TLB flushes when dirty page logging is toggled off.  This reduces
   the time it takes to disable dirty logging by ~3x.
 
 * Remove the shrinker that was (poorly) attempting to reclaim shadow page
   tables in low-memory situations.
 
 * Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE.
 
 * Advertise CPUIDs for new instructions in Clearwater Forest
 
 * Quirk KVM's misguided behavior of initialized certain feature MSRs to
   their maximum supported feature set, which can result in KVM creating
   invalid vCPU state.  E.g. initializing PERF_CAPABILITIES to a non-zero
   value results in the vCPU having invalid state if userspace hides PDCM
   from the guest, which in turn can lead to save/restore failures.
 
 * Fix KVM's handling of non-canonical checks for vCPUs that support LA57
   to better follow the "architecture", in quotes because the actual
   behavior is poorly documented.  E.g. most MSR writes and descriptor
   table loads ignore CR4.LA57 and operate purely on whether the CPU
   supports LA57.
 
 * Bypass the register cache when querying CPL from kvm_sched_out(), as
   filling the cache from IRQ context is generally unsafe; harden the
   cache accessors to try to prevent similar issues from occuring in the
   future.  The issue that triggered this change was already fixed in 6.12,
   but was still kinda latent.
 
 * Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM
   over-advertises SPEC_CTRL when trying to support cross-vendor VMs.
 
 * Minor cleanups
 
 * Switch hugepage recovery thread to use vhost_task.  These kthreads can
   consume significant amounts of CPU time on behalf of a VM or in response
   to how the VM behaves (for example how it accesses its memory); therefore
   KVM tried to place the thread in the VM's cgroups and charge the CPU
   time consumed by that work to the VM's container.  However the kthreads
   did not process SIGSTOP/SIGCONT, and therefore cgroups which had KVM
   instances inside could not complete freezing.  Fix this by replacing the
   kthread with a PF_USER_WORKER thread, via the vhost_task abstraction.
   Another 100+ lines removed, with generally better behavior too like
   having these threads properly parented in the process tree.
 
 * Revert a workaround for an old CPU erratum (Nehalem/Westmere) that didn't
   really work; there was really nothing to work around anyway: the broken
   patch was meant to fix nested virtualization, but the PERF_GLOBAL_CTRL
   MSR is virtualized and therefore unaffected by the erratum.
 
 * Fix 6.12 regression where CONFIG_KVM will be built as a module even
   if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is 'y'.
 
 x86 selftests:
 
 * x86 selftests can now use AVX.
 
 Documentation:
 
 * Use rST internal links
 
 * Reorganize the introduction to the API document
 
 Generic:
 
 * Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
   of RCU, so that running a vCPU on a different task doesn't encounter long
   due to having to wait for all CPUs become quiescent.  In general both reads
   and writes are rare, but userspace that supports confidential computing is
   introducing the use of "helper" vCPUs that may jump from one host processor
   to another.  Those will be very happy to trigger a synchronize_rcu(), and
   the effect on performance is quite the disaster.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmc9MRYUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP00QgArxqxBIGLCW5t7bw7vtNq63QYRyh4
 dTiDguLiYQJ+AXmnRu11R6aPC7HgMAvlFCCmH+GEce4WEgt26hxCmncJr/aJOSwS
 letCS7TrME16PeZvh25A1nhPBUw6mTF1qqzgcdHMrqXG8LuHoGcKYGSRVbkf3kfI
 1ZoMq1r8ChXbVVmCx9DQ3gw1TVr5Dpjs2voLh8rDSE9Xpw0tVVabHu3/NhQEz/F+
 t8/nRaqH777icCHIf9PCk5HnarHxLAOvhM2M0Yj09PuBcE5fFQxpxltw/qiKQqqW
 ep4oquojGl87kZnhlDaac2UNtK90Ws+WxxvCwUmbvGN0ZJVaQwf4FvTwig==
 =lWpE
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "The biggest change here is eliminating the awful idea that KVM had of
  essentially guessing which pfns are refcounted pages.

  The reason to do so was that KVM needs to map both non-refcounted
  pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXMEDMAP
  VMAs that contain refcounted pages.

  However, the result was security issues in the past, and more recently
  the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
  struct page but is not refcounted. In particular this broke virtio-gpu
  blob resources (which directly map host graphics buffers into the
  guest as "vram" for the virtio-gpu device) with the amdgpu driver,
  because amdgpu allocates non-compound higher order pages and the tail
  pages could not be mapped into KVM.

  This requires adjusting all uses of struct page in the
  per-architecture code, to always work on the pfn whenever possible.
  The large series that did this, from David Stevens and Sean
  Christopherson, also cleaned up substantially the set of functions
  that provided arch code with the pfn for a host virtual addresses.

  The previous maze of twisty little passages, all different, is
  replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
  non-__ versions of these two, and kvm_prefetch_pages) saving almost
  200 lines of code.

  ARM:

   - Support for stage-1 permission indirection (FEAT_S1PIE) and
     permission overlays (FEAT_S1POE), including nested virt + the
     emulated page table walker

   - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
     call was introduced in PSCIv1.3 as a mechanism to request
     hibernation, similar to the S4 state in ACPI

   - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
     part of it, introduce trivial initialization of the host's MPAM
     context so KVM can use the corresponding traps

   - PMU support under nested virtualization, honoring the guest
     hypervisor's trap configuration and event filtering when running a
     nested guest

   - Fixes to vgic ITS serialization where stale device/interrupt table
     entries are not zeroed when the mapping is invalidated by the VM

   - Avoid emulated MMIO completion if userspace has requested
     synchronous external abort injection

   - Various fixes and cleanups affecting pKVM, vCPU initialization, and
     selftests

  LoongArch:

   - Add iocsr and mmio bus simulation in kernel.

   - Add in-kernel interrupt controller emulation.

   - Add support for virtualization extensions to the eiointc irqchip.

  PPC:

   - Drop lingering and utterly obsolete references to PPC970 KVM, which
     was removed 10 years ago.

   - Fix incorrect documentation references to non-existing ioctls

  RISC-V:

   - Accelerate KVM RISC-V when running as a guest

   - Perf support to collect KVM guest statistics from host side

  s390:

   - New selftests: more ucontrol selftests and CPU model sanity checks

   - Support for the gen17 CPU model

   - List registers supported by KVM_GET/SET_ONE_REG in the
     documentation

  x86:

   - Cleanup KVM's handling of Accessed and Dirty bits to dedup code,
     improve documentation, harden against unexpected changes.

     Even if the hardware A/D tracking is disabled, it is possible to
     use the hardware-defined A/D bits to track if a PFN is Accessed
     and/or Dirty, and that removes a lot of special cases.

   - Elide TLB flushes when aging secondary PTEs, as has been done in
     x86's primary MMU for over 10 years.

   - Recover huge pages in-place in the TDP MMU when dirty page logging
     is toggled off, instead of zapping them and waiting until the page
     is re-accessed to create a huge mapping. This reduces vCPU jitter.

   - Batch TLB flushes when dirty page logging is toggled off. This
     reduces the time it takes to disable dirty logging by ~3x.

   - Remove the shrinker that was (poorly) attempting to reclaim shadow
     page tables in low-memory situations.

   - Clean up and optimize KVM's handling of writes to
     MSR_IA32_APICBASE.

   - Advertise CPUIDs for new instructions in Clearwater Forest

   - Quirk KVM's misguided behavior of initialized certain feature MSRs
     to their maximum supported feature set, which can result in KVM
     creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
     a non-zero value results in the vCPU having invalid state if
     userspace hides PDCM from the guest, which in turn can lead to
     save/restore failures.

   - Fix KVM's handling of non-canonical checks for vCPUs that support
     LA57 to better follow the "architecture", in quotes because the
     actual behavior is poorly documented. E.g. most MSR writes and
     descriptor table loads ignore CR4.LA57 and operate purely on
     whether the CPU supports LA57.

   - Bypass the register cache when querying CPL from kvm_sched_out(),
     as filling the cache from IRQ context is generally unsafe; harden
     the cache accessors to try to prevent similar issues from occuring
     in the future. The issue that triggered this change was already
     fixed in 6.12, but was still kinda latent.

   - Advertise AMD_IBPB_RET to userspace, and fix a related bug where
     KVM over-advertises SPEC_CTRL when trying to support cross-vendor
     VMs.

   - Minor cleanups

   - Switch hugepage recovery thread to use vhost_task.

     These kthreads can consume significant amounts of CPU time on
     behalf of a VM or in response to how the VM behaves (for example
     how it accesses its memory); therefore KVM tried to place the
     thread in the VM's cgroups and charge the CPU time consumed by that
     work to the VM's container.

     However the kthreads did not process SIGSTOP/SIGCONT, and therefore
     cgroups which had KVM instances inside could not complete freezing.

     Fix this by replacing the kthread with a PF_USER_WORKER thread, via
     the vhost_task abstraction. Another 100+ lines removed, with
     generally better behavior too like having these threads properly
     parented in the process tree.

   - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
     didn't really work; there was really nothing to work around anyway:
     the broken patch was meant to fix nested virtualization, but the
     PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
     erratum.

   - Fix 6.12 regression where CONFIG_KVM will be built as a module even
     if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
     'y'.

  x86 selftests:

   - x86 selftests can now use AVX.

  Documentation:

   - Use rST internal links

   - Reorganize the introduction to the API document

  Generic:

   - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
     instead of RCU, so that running a vCPU on a different task doesn't
     encounter long due to having to wait for all CPUs become quiescent.

     In general both reads and writes are rare, but userspace that
     supports confidential computing is introducing the use of "helper"
     vCPUs that may jump from one host processor to another. Those will
     be very happy to trigger a synchronize_rcu(), and the effect on
     performance is quite the disaster"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
  KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
  KVM: x86: add back X86_LOCAL_APIC dependency
  Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
  KVM: x86: switch hugepage recovery thread to vhost_task
  KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
  x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
  Documentation: KVM: fix malformed table
  irqchip/loongson-eiointc: Add virt extension support
  LoongArch: KVM: Add irqfd support
  LoongArch: KVM: Add PCHPIC user mode read and write functions
  LoongArch: KVM: Add PCHPIC read and write functions
  LoongArch: KVM: Add PCHPIC device support
  LoongArch: KVM: Add EIOINTC user mode read and write functions
  LoongArch: KVM: Add EIOINTC read and write functions
  LoongArch: KVM: Add EIOINTC device support
  LoongArch: KVM: Add IPI user mode read and write function
  LoongArch: KVM: Add IPI read and write function
  LoongArch: KVM: Add IPI device support
  LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
  KVM: arm64: Pass on SVE mapping failures
  ...
2024-11-23 16:00:50 -08:00
Linus Torvalds
5c00ff742b - The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection algorithm.
   This leads to improved memory savings.
 
 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
 
 	- "refine mas_mab_cp()"
 	- "Reduce the space to be cleared for maple_big_node"
 	- "maple_tree: simplify mas_push_node()"
 	- "Following cleanup after introduce mas_wr_store_type()"
 	- "refine storing null"
 
 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.
 
 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping code.
 
 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of shadow
   entries.
 
 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.
 
 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in the
   hugetlb code.
 
 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page into
   small pages.  Instead we replace the whole thing with a THP.  More
   consistent cleaner and potentiall saves a large number of pagefaults.
 
 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.
 
 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to do.
 
 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio size
   rather than as individual pages.  A 20% speedup was observed.
 
 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting.
 
 - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt
   removes the long-deprecated memcgv2 charge moving feature.
 
 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.
 
 - The series "x86/module: use large ROX pages for text allocations" from
   Mike Rapoport teaches x86 to use large pages for read-only-execute
   module text.
 
 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.
 
 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/.  A slow march towards shrinking
   struct page.
 
 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.
 
 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression.  It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.
 
 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests
   over to the KUnit framework.
 
 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a single
   VMA, rather than requiring that multiple VMAs be created for this.
   Improved efficiencies for userspace memory allocators are expected.
 
 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.
 
 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.
 
 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP from
   the kernel boot command line.
 
 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.
 
 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep is
   enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA
 jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq
 xyyenpibWgUoShU2wZ/Ha8FE5WDINwg=
 =JfWR
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationaizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent cleaner and potentiall saves a large number of pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcgv2 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleanup up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is followon maintenance work for the new page allocation profiling
   feature.

 - The series "page->index removals in mm" from Matthew Wilcox remove
   most references to page->index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userapace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configuremultisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
2024-11-23 09:58:07 -08:00
Linus Torvalds
79caa6c88a asm-generic updates for 6.13
These are a number of unrelated cleanups, generally simplifying the
 architecture specific header files:
 
  - A series from Al Viro simplifies asm/vga.h, after it turns out that
    most of it can be generalized.
 
  - A series from Julian Vetter adds a common version of
    memcpy_{to,from}io() and memset_io() and changes most architectures
    to use that instead of their own implementation
 
  - A series from Niklas Schnelle concludes his work to make PC
    style inb()/outb() optional
 
  - Nicolas Pitre contributes improvements for the generic do_div()
    helper
 
  - Christoph Hellwig adds a generic version of page_to_phys()
    and phys_to_page(), replacing the slightly different architecture
    specific definitions.
 
  - Uwe Kleine-Koenig has a minor cleanup for ioctl definitions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmc+Z0gACgkQYKtH/8kJ
 UicqzA/8CcqVdcWKlFAyiFI62DCkd3iYm/joNK3/JhvUIvVFvY+HI0+XpTeOEN1r
 dfYBNg/KTVSbia5MEEy28Lk5WdoA3X7p9E8NuYC1ik/qvH3Y0kXDU2NiRcJDwalq
 u56tGUwDITFUzRo47a4Z53JpV60FlGaUVjuKp1jJiOQkcs/iussVYuti8mNVb1ud
 1tf21TEAIywq43IC8CxevIRsBkJBqMhalaGWYgKw3ZTwXdiKaXed6RH7IjPodanN
 6b7R6aFEqlT7usFX9vLOYNRGzd3HIueXOT1iqiiGI1lm5u/iutxKH+8eS4q381oN
 WJL0jQdo4sv2MxtSHYrjpzPRQpSp/qrin29h3PVjwBjZF3i5WvFeTYgfjQEEkqe0
 fpTXjUsr5n1F1pGV90DtJHwaD5TxKD4VYFLDRCDGUiAnWPkZ7EYUBL3SA6GqEkXB
 1lVRPsEBo0y867/WQcoCZA/x7ANZDI6bDZ6fjumwx8OCZOHZeN6FGtqQJHcVZR5O
 +nu/j3I8YH1tZGKbA+wliyQwt/T60Oxs62HHcFzFLGakARwUEDYO53IGCJUByFwk
 kCrgNVvzFklwWpqqyTADqb5lkQKpZr5gIdpst185qttCQkb+EFWiCi9w2inXTjHl
 2oCc7Uf0cvoxnhVlJAw73eGTtpqS37KCWK+iNyrQbOfy+hgIv+w=
 =zEHk
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann:
 "These are a number of unrelated cleanups, generally simplifying the
  architecture specific header files:

   - A series from Al Viro simplifies asm/vga.h, after it turns out that
     most of it can be generalized.

   - A series from Julian Vetter adds a common version of
     memcpy_{to,from}io() and memset_io() and changes most architectures
     to use that instead of their own implementation

   - A series from Niklas Schnelle concludes his work to make PC style
     inb()/outb() optional

   - Nicolas Pitre contributes improvements for the generic do_div()
     helper

   - Christoph Hellwig adds a generic version of page_to_phys() and
     phys_to_page(), replacing the slightly different architecture
     specific definitions.

   - Uwe Kleine-Koenig has a minor cleanup for ioctl definitions"

* tag 'asm-generic-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (24 commits)
  empty include/asm-generic/vga.h
  sparc: get rid of asm/vga.h
  asm/vga.h: don't bother with scr_mem{cpy,move}v() unless we need to
  vt_buffer.h: get rid of dead code in default scr_...() instances
  tty: serial: export serial_8250_warn_need_ioport
  lib/iomem_copy: fix kerneldoc format style
  hexagon: simplify asm/io.h for !HAS_IOPORT
  loongarch: Use new fallback IO memcpy/memset
  csky: Use new fallback IO memcpy/memset
  arm64: Use new fallback IO memcpy/memset
  New implementation for IO memcpy and IO memset
  watchdog: Add HAS_IOPORT dependency for SBC8360 and SBC7240
  __arch_xprod64(): make __always_inline when optimizing for performance
  ARM: div64: improve __arch_xprod_64()
  asm-generic/div64: optimize/simplify __div64_const32()
  lib/math/test_div64: add some edge cases relevant to __div64_const32()
  asm-generic: add an optional pfn_valid check to page_to_phys
  asm-generic: provide generic page_to_phys and phys_to_page implementations
  asm-generic/io.h: Remove I/O port accessors for HAS_IOPORT=n
  tty: serial: handle HAS_IOPORT dependencies
  ...
2024-11-20 15:13:02 -08:00
Linus Torvalds
aad3a0d084 ftrace updates for v6.13:
- Merged tag ftrace-v6.12-rc4
 
   There was a fix to locking in register_ftrace_graph() for shadow stacks
   that was sent upstream. But this code was also being rewritten, and the
   locking fix was needed. Merging this fix was required to continue the
   work.
 
 - Restructure the function graph shadow stack to prepare it for use with
   kretprobes
 
   With the goal of merging the shadow stack logic of function graph and
   kretprobes, some more restructuring of the function shadow stack is
   required.
 
   Move out function graph specific fields from the fgraph infrastructure and
   store it on the new stack variables that can pass data from the entry
   callback to the exit callback.
 
   Hopefully, with this change, the merge of kretprobes to use fgraph shadow
   stacks will be ready by the next merge window.
 
 - Make shadow stack 4k instead of using PAGE_SIZE.
 
   Some architectures have very large PAGE_SIZE values which make its use for
   shadow stacks waste a lot of memory.
 
 - Give shadow stacks its own kmem cache.
 
   When function graph is started, every task on the system gets a shadow
   stack. In the future, shadow stacks may not be 4K in size. Have it have
   its own kmem cache so that whatever size it becomes will still be
   efficient in allocations.
 
 - Initialize profiler graph ops as it will be needed for new updates to fgraph
 
 - Convert to use guard(mutex) for several ftrace and fgraph functions
 
 - Add more comments and documentation
 
 - Show function return address in function graph tracer
 
   Add an option to show the caller of a function at each entry of the
   function graph tracer, similar to what the function tracer does.
 
 - Abstract out ftrace_regs from being used directly like pt_regs
 
   ftrace_regs was created to store a partial pt_regs. It holds only the
   registers and stack information to get to the function arguments and
   return values. On several archs, it is simply a wrapper around pt_regs.
   But some users would access ftrace_regs directly to get the pt_regs which
   will not work on all archs. Make ftrace_regs an abstract structure that
   requires all access to its fields be through accessor functions.
 
 - Show how long it takes to do function code modifications
 
   When code modification for function hooks happen, it always had the time
   recorded in how long it took to do the conversion. But this value was
   never exported. Recently the code was touched due to new ROX modification
   handling that caused a large slow down in doing the modifications and
   had a significant impact on boot times.
 
   Expose the timings in the dyn_ftrace_total_info file. This file was
   created a while ago to show information about memory usage and such to
   implement dynamic function tracing. It's also an appropriate file to store
   the timings of this modification as well. This will make it easier to see
   the impact of changes to code modification on boot up timings.
 
 - Other clean ups and small fixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZztrUxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qnnNAQD6w4q9VQ7oOE2qKLqtnj87h4c1GqKn
 SPkpEfC3n/ATEAD/fnYjT/eOSlHiGHuD/aTA+U/bETrT99bozGM/4mFKEgY=
 =6nCa
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Restructure the function graph shadow stack to prepare it for use
   with kretprobes

   With the goal of merging the shadow stack logic of function graph and
   kretprobes, some more restructuring of the function shadow stack is
   required.

   Move out function graph specific fields from the fgraph
   infrastructure and store it on the new stack variables that can pass
   data from the entry callback to the exit callback.

   Hopefully, with this change, the merge of kretprobes to use fgraph
   shadow stacks will be ready by the next merge window.

 - Make shadow stack 4k instead of using PAGE_SIZE.

   Some architectures have very large PAGE_SIZE values which make its
   use for shadow stacks waste a lot of memory.

 - Give shadow stacks its own kmem cache.

   When function graph is started, every task on the system gets a
   shadow stack. In the future, shadow stacks may not be 4K in size.
   Have it have its own kmem cache so that whatever size it becomes will
   still be efficient in allocations.

 - Initialize profiler graph ops as it will be needed for new updates to
   fgraph

 - Convert to use guard(mutex) for several ftrace and fgraph functions

 - Add more comments and documentation

 - Show function return address in function graph tracer

   Add an option to show the caller of a function at each entry of the
   function graph tracer, similar to what the function tracer does.

 - Abstract out ftrace_regs from being used directly like pt_regs

   ftrace_regs was created to store a partial pt_regs. It holds only the
   registers and stack information to get to the function arguments and
   return values. On several archs, it is simply a wrapper around
   pt_regs. But some users would access ftrace_regs directly to get the
   pt_regs which will not work on all archs. Make ftrace_regs an
   abstract structure that requires all access to its fields be through
   accessor functions.

 - Show how long it takes to do function code modifications

   When code modification for function hooks happen, it always had the
   time recorded in how long it took to do the conversion. But this
   value was never exported. Recently the code was touched due to new
   ROX modification handling that caused a large slow down in doing the
   modifications and had a significant impact on boot times.

   Expose the timings in the dyn_ftrace_total_info file. This file was
   created a while ago to show information about memory usage and such
   to implement dynamic function tracing. It's also an appropriate file
   to store the timings of this modification as well. This will make it
   easier to see the impact of changes to code modification on boot up
   timings.

 - Other clean ups and small fixes

* tag 'ftrace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits)
  ftrace: Show timings of how long nop patching took
  ftrace: Use guard to take ftrace_lock in ftrace_graph_set_hash()
  ftrace: Use guard to take the ftrace_lock in release_probe()
  ftrace: Use guard to lock ftrace_lock in cache_mod()
  ftrace: Use guard for match_records()
  fgraph: Use guard(mutex)(&ftrace_lock) for unregister_ftrace_graph()
  fgraph: Give ret_stack its own kmem cache
  fgraph: Separate size of ret_stack from PAGE_SIZE
  ftrace: Rename ftrace_regs_return_value to ftrace_regs_get_return_value
  selftests/ftrace: Fix check of return value in fgraph-retval.tc test
  ftrace: Use arch_ftrace_regs() for ftrace_regs_*() macros
  ftrace: Consolidate ftrace_regs accessor functions for archs using pt_regs
  ftrace: Make ftrace_regs abstract from direct use
  fgragh: No need to invoke the function call_filter_check_discard()
  fgraph: Simplify return address printing in function graph tracer
  function_graph: Remove unnecessary initialization in ftrace_graph_ret_addr()
  function_graph: Support recording and printing the function return address
  ftrace: Have calltime be saved in the fgraph storage
  ftrace: Use a running sleeptime instead of saving on shadow stack
  fgraph: Use fgraph data to store subtime for profiler
  ...
2024-11-20 11:34:10 -08:00
Linus Torvalds
0352387523 First step of consolidating the VDSO data page handling:
The VDSO data page handling is architecture specific for historical
   reasons, but there is no real technical reason to do so.
 
   Aside of that VDSO data has become a dump ground for various mechanisms
   and fail to provide a clear separation of the functionalities.
 
   Clean this up by:
 
     * consolidating the VDSO page data by getting rid of architecture
       specific warts especially in x86 and PowerPC.
 
     * removing the last includes of header files which are pulling in other
       headers outside of the VDSO namespace.
 
     * seperating timekeeping and other VDSO data accordingly.
 
   Further consolidation of the VDSO page handling is done in subsequent
   changes scheduled for the next merge window.
 
   This also lays the ground for expanding the VDSO time getters for
   independent PTP clocks in a generic way without making every architecture
   add support seperately.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kyoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVBjD/9awdN2YeCGIM9rlHIktUdNRmRSL2SL
 6av1CPffN5DenONYTXWrDYPkC4yfjUwIs8H57uzFo10yA7RQ/Qfq+O68k5GnuFew
 jvpmmYSZ6TT21AmAaCIhn+kdl9YbEJFvN2AWH85Bl29k9FGB04VzJlQMMjfEZ1a5
 Mhwv+cfYNuPSZmU570jcxW2XgbyTWlLZBByXX/Tuz9bwpmtszba507bvo45x6gIP
 twaWNzrsyJpdXfMrfUnRiChN8jHlDN7I6fgQvpsoRH5FOiVwIFo0Ip2rKbk+ONfD
 W/rcU5oeqRIxRVDHzf2Sv8WPHMCLRv01ZHBcbJOtgvZC3YiKgKYoeEKabu9ZL1BH
 6VmrxjYOBBFQHOYAKPqBuS7BgH5PmtMbDdSZXDfRaAKaCzhCRysdlWW7z48r2R//
 zPufb7J6Tle23AkuZWhFjvlGgSBl4zxnTFn31HYOyQps3TMI4y50Z2DhE/EeU8a6
 DRl8/k1KQVDUZ6udJogS5kOr1J8pFtUPrA2uhR8UyLdx7YKiCzcdO1qWAjtXlVe8
 oNpzinU+H9bQqGe9IyS7kCG9xNaCRZNkln5Q1WfnkTzg5f6ihfaCvIku3l4bgVpw
 3HmcxYiC6RxQB+ozwN7hzCCKT4L9aMhr/457TNOqRkj2Elw3nvJ02L4aI86XAKLE
 jwO9Fkp9qcCxCw==
 =q5eD
 -----END PGP SIGNATURE-----

Merge tag 'timers-vdso-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull vdso data page handling updates from Thomas Gleixner:
 "First steps of consolidating the VDSO data page handling.

  The VDSO data page handling is architecture specific for historical
  reasons, but there is no real technical reason to do so.

  Aside of that VDSO data has become a dump ground for various
  mechanisms and fail to provide a clear separation of the
  functionalities.

  Clean this up by:

   - consolidating the VDSO page data by getting rid of architecture
     specific warts especially in x86 and PowerPC.

   - removing the last includes of header files which are pulling in
     other headers outside of the VDSO namespace.

   - seperating timekeeping and other VDSO data accordingly.

  Further consolidation of the VDSO page handling is done in subsequent
  changes scheduled for the next merge window.

  This also lays the ground for expanding the VDSO time getters for
  independent PTP clocks in a generic way without making every
  architecture add support seperately"

* tag 'timers-vdso-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  x86/vdso: Add missing brackets in switch case
  vdso: Rename struct arch_vdso_data to arch_vdso_time_data
  powerpc: Split systemcfg struct definitions out from vdso
  powerpc: Split systemcfg data out of vdso data page
  powerpc: Add kconfig option for the systemcfg page
  powerpc/pseries/lparcfg: Use num_possible_cpus() for potential processors
  powerpc/pseries/lparcfg: Fix printing of system_active_processors
  powerpc/procfs: Propagate error of remap_pfn_range()
  powerpc/vdso: Remove offset comment from 32bit vdso_arch_data
  x86/vdso: Split virtual clock pages into dedicated mapping
  x86/vdso: Delete vvar.h
  x86/vdso: Access vdso data without vvar.h
  x86/vdso: Move the rng offset to vsyscall.h
  x86/vdso: Access rng vdso data without vvar.h
  x86/vdso: Access timens vdso data without vvar.h
  x86/vdso: Allocate vvar page from C code
  x86/vdso: Access rng data from kernel without vvar
  x86/vdso: Place vdso_data at beginning of vvar page
  x86/vdso: Use __arch_get_vdso_data() to access vdso data
  x86/mm/mmap: Remove arch_vma_name()
  ...
2024-11-19 16:09:13 -08:00
Linus Torvalds
f41dac3efb Performance events changes for v6.13:
- Uprobes:
     - Add BPF session support (Jiri Olsa)
     - Switch to RCU Tasks Trace flavor for better performance (Andrii Nakryiko)
     - Massively increase uretprobe SMP scalability by SRCU-protecting
       the uretprobe lifetime (Andrii Nakryiko)
     - Kill xol_area->slot_count (Oleg Nesterov)
 
  - Core facilities:
     - Implement targeted high-frequency profiling by adding the ability
       for an event to "pause" or "resume" AUX area tracing (Adrian Hunter)
 
  - VM profiling/sampling:
     - Correct perf sampling with guest VMs (Colton Lewis)
 
  - New hardware support:
     - x86/intel: Add PMU support for Intel ArrowLake-H CPUs (Dapeng Mi)
 
  - Misc fixes and enhancements:
     - x86/intel/pt: Fix buffer full but size is 0 case (Adrian Hunter)
     - x86/amd: Warn only on new bits set (Breno Leitao)
     - x86/amd/uncore: Avoid a false positive warning about snprintf
                       truncation in amd_uncore_umc_ctx_init (Jean Delvare)
     - uprobes: Re-order struct uprobe_task to save some space (Christophe JAILLET)
     - x86/rapl: Move the pmu allocation out of CPU hotplug (Kan Liang)
     - x86/rapl: Clean up cpumask and hotplug (Kan Liang)
     - uprobes: Deuglify xol_get_insn_slot/xol_free_insn_slot paths (Oleg Nesterov)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7eKERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i57A/+KQ6TrIoICVTE+BPlDfUw8NU+N3DagVb0
 dzoyDxlDRsnsYzeXZipPn+3IitX1w+DrGxBNIojSoiFVCLnHIKgo4uHbj7cVrR7J
 fBTVSnoJ94SGAk5ySebvLwMLce/YhXBeHK2lx6W/pI6acNcxzDfIabjjETeqltUo
 g7hmT9lo10pzZEZyuUfYX9khlWBxda1dKHc9pMIq7baeLe4iz/fCGlJ0K4d4M4z3
 NPZw239Np6iHUwu3Lcs4gNKe4rcDe7Bt47hpedemHe0Y+7c4s2HaPxbXWxvDtE76
 mlsg93i28f8SYxeV83pREn0EOCptXcljhiek+US+GR7NSbltMnV+uUiDfPKIE9+Y
 vYP/DYF9hx73FsOucEFrHxYYcePorn3pne5/khBYWdQU6TnlrBYWpoLQsjgCKTTR
 4JhCFlBZ5cDpc6ihtpwCwVTQ4Q/H7vM1XOlDwx0hPhcIPPHDreaQD/wxo61jBdXf
 PY0EPAxh3BcQxfPYuDS+XiYjQ8qO8MtXMKz5bZyHBZlbHwccV6T4ExjsLKxFk5As
 6BG8pkBWLg7drXAgVdleIY0ux+34w/Zzv7gemdlQxvWLlZrVvpjiG93oU3PTpZeq
 A2UD9eAOuXVD6+HsF/dmn88sFmcLWbrMskFWujkvhEUmCvSGAnz3YSS/mLEawBiT
 2xI8xykNWSY=
 =ItOT
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "Uprobes:
    - Add BPF session support (Jiri Olsa)
    - Switch to RCU Tasks Trace flavor for better performance (Andrii
      Nakryiko)
    - Massively increase uretprobe SMP scalability by SRCU-protecting
      the uretprobe lifetime (Andrii Nakryiko)
    - Kill xol_area->slot_count (Oleg Nesterov)

  Core facilities:
    - Implement targeted high-frequency profiling by adding the ability
      for an event to "pause" or "resume" AUX area tracing (Adrian
      Hunter)

  VM profiling/sampling:
    - Correct perf sampling with guest VMs (Colton Lewis)

  New hardware support:
    - x86/intel: Add PMU support for Intel ArrowLake-H CPUs (Dapeng Mi)

  Misc fixes and enhancements:
    - x86/intel/pt: Fix buffer full but size is 0 case (Adrian Hunter)
    - x86/amd: Warn only on new bits set (Breno Leitao)
    - x86/amd/uncore: Avoid a false positive warning about snprintf
      truncation in amd_uncore_umc_ctx_init (Jean Delvare)
    - uprobes: Re-order struct uprobe_task to save some space
      (Christophe JAILLET)
    - x86/rapl: Move the pmu allocation out of CPU hotplug (Kan Liang)
    - x86/rapl: Clean up cpumask and hotplug (Kan Liang)
    - uprobes: Deuglify xol_get_insn_slot/xol_free_insn_slot paths (Oleg
      Nesterov)"

* tag 'perf-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  perf/core: Correct perf sampling with guest VMs
  perf/x86: Refactor misc flag assignments
  perf/powerpc: Use perf_arch_instruction_pointer()
  perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()
  perf/arm: Drop unused functions
  uprobes: Re-order struct uprobe_task to save some space
  perf/x86/amd/uncore: Avoid a false positive warning about snprintf truncation in amd_uncore_umc_ctx_init
  perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling
  perf/x86/intel/pt: Add support for pause / resume
  perf/core: Add aux_pause, aux_resume, aux_start_paused
  perf/x86/intel/pt: Fix buffer full but size is 0 case
  uprobes: SRCU-protect uretprobe lifetime (with timeout)
  uprobes: allow put_uprobe() from non-sleepable softirq context
  perf/x86/rapl: Clean up cpumask and hotplug
  perf/x86/rapl: Move the pmu allocation out of CPU hotplug
  uprobe: Add support for session consumer
  uprobe: Add data pointer to consumer handlers
  perf/x86/amd: Warn only on new bits set
  uprobes: fold xol_take_insn_slot() into xol_get_insn_slot()
  uprobes: kill xol_area->slot_count
  ...
2024-11-19 13:34:06 -08:00
Linus Torvalds
ba1f9c8fe3 arm64 updates for 6.13:
* Support for running Linux in a protected VM under the Arm Confidential
   Compute Architecture (CCA)
 
 * Guarded Control Stack user-space support. Current patches follow the
   x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
   patches (already on the list) will add support for clone3() allowing
   finer-grained control of the shadow stack size and placement from libc
 
 * AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
   getting close with the upcoming dpISA support)
 
 * Other arch features:
 
   - In-kernel use of the memcpy instructions, FEAT_MOPS (previously only
     exposed to user; uaccess support not merged yet)
 
   - MTE: hugetlbfs support and the corresponding kselftests
 
   - Optimise CRC32 using the PMULL instructions
 
   - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG
 
   - Optimise the kernel TLB flushing to use the range operations
 
   - POE/pkey (permission overlays): further cleanups after bringing the
     signal handler in line with the x86 behaviour for 6.12
 
 * arm64 perf updates:
 
   - Support for the NXP i.MX91 PMU in the existing IMX driver
 
   - Support for Ampere SoCs in the Designware PCIe PMU driver
 
   - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC
 
   - Support for Samsung's 'Mongoose' CPU PMU
 
   - Support for PMUv3.9 finer-grained userspace counter access control
 
   - Switch back to platform_driver::remove() now that it returns 'void'
 
   - Add some missing events for the CXL PMU driver
 
 * Miscellaneous arm64 fixes/cleanups:
 
   - Page table accessors cleanup: type updates, drop unused macros,
     reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
     check addresses before runtime P4D/PUD folding
 
   - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
     FEAT_ECV for the generic timers) allowing Linux to boot with
     firmware deployments that don't set SCTLR_EL3.ECVEn
 
   - ACPI/arm64: tighten the check for the array of platform timer
     structures and adjust the error handling procedure in
     gtdt_parse_timer_block()
 
   - Optimise the cache flush for the uprobes xol slot (skip if no
     change) and other uprobes/kprobes cleanups
 
   - Fix the context switching of tpidrro_el0 when kpti is enabled
 
   - Dynamic shadow call stack fixes
 
   - Sysreg updates
 
   - Various arm64 kselftest improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmc5POIACgkQa9axLQDI
 XvEDYA//a3eeNkgMuGdnSCVcLz+zy+oNwAwboG/4X1DqL8jiCbI4npwugPx95RIA
 YZOUvo9T2aL3OyefpUHll4gFHqx9OwoZIig2F70TEUmlPsGUbh0KBkdfQF3xZPdl
 EwV0kHSGEqMWMBwsGJGwgCYrUaf1MUQzh1GBl7VJ2ts5XsJBaBeOyKkysij26wtZ
 V+aHq2IUx7qQS7+HC/4P6IoHxKziFcsCMovaKaynP4cw9xXBQbDMcNlHEwndOMyk
 pu2zrv7GG0j3KQuVP/2Alf5FKhmI0GVGP/6Nc/zsOmw96w8Kf7HfzEtkHawr2aRq
 rqg/c9ivzDn1p+fUBo4ZYtrRk4IAY+yKu6hdzdLTP5+bQrBTWTO9rjQVBm9FAGYT
 sCdEj1NqzvExvNHD7X6ut/GJ05lmce3K+qeSXSEysN9gqiT3eomYWMXrD2V2lxzb
 rIDDcb/icfaqjt14Mksh19r/rzNeq7noj9CGSmcqw0BHZfHzl38Lai6pdfYzCNyn
 vCM/c4c1D/WWX8/lifO1JZVbhDk1jy82Iphg2KEhL8iKPxDsKBBZLmYuU1oa7tMo
 WryGAz9+GQwd+W9chFuaOEtMnzvW2scEJ5Eb2fEf0Qj0aEurkL+C9dZR6o1GN77V
 DBUxtU628Ef4PJJGfbNCwZzdd8UPYG3a/mKfQQ3dz0oz2LySlW4=
 =wDot
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - Support for running Linux in a protected VM under the Arm
   Confidential Compute Architecture (CCA)

 - Guarded Control Stack user-space support. Current patches follow the
   x86 ABI of implicitly creating a shadow stack on clone(). Subsequent
   patches (already on the list) will add support for clone3() allowing
   finer-grained control of the shadow stack size and placement from
   libc

 - AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are
   getting close with the upcoming dpISA support)

 - Other arch features:

     - In-kernel use of the memcpy instructions, FEAT_MOPS (previously
       only exposed to user; uaccess support not merged yet)

     - MTE: hugetlbfs support and the corresponding kselftests

     - Optimise CRC32 using the PMULL instructions

     - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG

     - Optimise the kernel TLB flushing to use the range operations

     - POE/pkey (permission overlays): further cleanups after bringing
       the signal handler in line with the x86 behaviour for 6.12

 - arm64 perf updates:

     - Support for the NXP i.MX91 PMU in the existing IMX driver

     - Support for Ampere SoCs in the Designware PCIe PMU driver

     - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC

     - Support for Samsung's 'Mongoose' CPU PMU

     - Support for PMUv3.9 finer-grained userspace counter access
       control

     - Switch back to platform_driver::remove() now that it returns
       'void'

     - Add some missing events for the CXL PMU driver

 - Miscellaneous arm64 fixes/cleanups:

     - Page table accessors cleanup: type updates, drop unused macros,
       reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity
       check addresses before runtime P4D/PUD folding

     - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the
       FEAT_ECV for the generic timers) allowing Linux to boot with
       firmware deployments that don't set SCTLR_EL3.ECVEn

     - ACPI/arm64: tighten the check for the array of platform timer
       structures and adjust the error handling procedure in
       gtdt_parse_timer_block()

     - Optimise the cache flush for the uprobes xol slot (skip if no
       change) and other uprobes/kprobes cleanups

     - Fix the context switching of tpidrro_el0 when kpti is enabled

     - Dynamic shadow call stack fixes

     - Sysreg updates

     - Various arm64 kselftest improvements

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits)
  arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
  kselftest/arm64: Try harder to generate different keys during PAC tests
  kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
  arm64/ptrace: Clarify documentation of VL configuration via ptrace
  kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
  acpi/arm64: remove unnecessary cast
  arm64/mm: Change protval as 'pteval_t' in map_range()
  kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
  kselftest/arm64: Add FPMR coverage to fp-ptrace
  kselftest/arm64: Expand the set of ZA writes fp-ptrace does
  kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
  kselftest/arm64: Enable build of PAC tests with LLVM=1
  kselftest/arm64: Check that SVCR is 0 in signal handlers
  selftests/mm: Fix unused function warning for aarch64_write_signal_pkey()
  kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
  kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
  kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
  kselftest/arm64: Fix build with stricter assemblers
  arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
  arm64/scs: Deal with 64-bit relative offsets in FDE frames
  ...
2024-11-18 18:10:37 -08:00
Catalin Marinas
437330d90c Merge branch 'for-next/mops' into for-next/core
* for-next/mops:
  : More FEAT_MOPS (memcpy instructions) uses - in-kernel routines
  arm64: mops: Document requirements for hypervisors
  arm64: lib: Use MOPS for copy_page() and clear_page()
  arm64: lib: Use MOPS for memcpy() routines
  arm64: mops: Document booting requirement for HCR_EL2.MCE2
  arm64: mops: Handle MOPS exceptions from EL1
  arm64: probes: Disable kprobes/uprobes on MOPS instructions

# Conflicts:
#	arch/arm64/kernel/entry-common.c
2024-11-14 12:07:28 +00:00
Catalin Marinas
5a4332062e Merge branches 'for-next/gcs', 'for-next/probes', 'for-next/asm-offsets', 'for-next/tlb', 'for-next/misc', 'for-next/mte', 'for-next/sysreg', 'for-next/stacktrace', 'for-next/hwcap3', 'for-next/kselftest', 'for-next/crc32', 'for-next/guest-cca', 'for-next/haft' and 'for-next/scs', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
  perf: Switch back to struct platform_driver::remove()
  perf: arm_pmuv3: Add support for Samsung Mongoose PMU
  dt-bindings: arm: pmu: Add Samsung Mongoose core compatible
  perf/dwc_pcie: Fix typos in event names
  perf/dwc_pcie: Add support for Ampere SoCs
  ARM: pmuv3: Add missing write_pmuacr()
  perf/marvell: Marvell PEM performance monitor support
  perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control
  perf/dwc_pcie: Convert the events with mixed case to lowercase
  perf/cxlpmu: Support missing events in 3.1 spec
  perf: imx_perf: add support for i.MX91 platform
  dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible
  drivers perf: remove unused field pmu_node

* for-next/gcs: (42 commits)
  : arm64 Guarded Control Stack user-space support
  kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c
  arm64/gcs: Fix outdated ptrace documentation
  kselftest/arm64: Ensure stable names for GCS stress test results
  kselftest/arm64: Validate that GCS push and write permissions work
  kselftest/arm64: Enable GCS for the FP stress tests
  kselftest/arm64: Add a GCS stress test
  kselftest/arm64: Add GCS signal tests
  kselftest/arm64: Add test coverage for GCS mode locking
  kselftest/arm64: Add a GCS test program built with the system libc
  kselftest/arm64: Add very basic GCS test program
  kselftest/arm64: Always run signals tests with GCS enabled
  kselftest/arm64: Allow signals tests to specify an expected si_code
  kselftest/arm64: Add framework support for GCS to signal handling tests
  kselftest/arm64: Add GCS as a detected feature in the signal tests
  kselftest/arm64: Verify the GCS hwcap
  arm64: Add Kconfig for Guarded Control Stack (GCS)
  arm64/ptrace: Expose GCS via ptrace and core files
  arm64/signal: Expose GCS state in signal frames
  arm64/signal: Set up and restore the GCS context for signal handlers
  arm64/mm: Implement map_shadow_stack()
  ...

* for-next/probes:
  : Various arm64 uprobes/kprobes cleanups
  arm64: insn: Simulate nop instruction for better uprobe performance
  arm64: probes: Remove probe_opcode_t
  arm64: probes: Cleanup kprobes endianness conversions
  arm64: probes: Move kprobes-specific fields
  arm64: probes: Fix uprobes for big-endian kernels
  arm64: probes: Fix simulate_ldr*_literal()
  arm64: probes: Remove broken LDR (literal) uprobe support

* for-next/asm-offsets:
  : arm64 asm-offsets.c cleanup (remove unused offsets)
  arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET
  arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE
  arm64: asm-offsets: remove VM_EXEC and PAGE_SZ
  arm64: asm-offsets: remove MM_CONTEXT_ID
  arm64: asm-offsets: remove COMPAT_{RT_,SIGFRAME_REGS_OFFSET
  arm64: asm-offsets: remove VMA_VM_*
  arm64: asm-offsets: remove TSK_ACTIVE_MM

* for-next/tlb:
  : TLB flushing optimisations
  arm64: optimize flush tlb kernel range
  arm64: tlbflush: add __flush_tlb_range_limit_excess()

* for-next/misc:
  : Miscellaneous patches
  arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled
  arm64/ptrace: Clarify documentation of VL configuration via ptrace
  acpi/arm64: remove unnecessary cast
  arm64/mm: Change protval as 'pteval_t' in map_range()
  arm64: uprobes: Optimize cache flushes for xol slot
  acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block()
  arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG
  arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings
  arm64/mm: Sanity check PTE address before runtime P4D/PUD folding
  arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont()
  ACPI: GTDT: Tighten the check for the array of platform timer structures
  arm64/fpsimd: Fix a typo
  arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers
  arm64: Return early when break handler is found on linked-list
  arm64/mm: Re-organize arch_make_huge_pte()
  arm64/mm: Drop _PROT_SECT_DEFAULT
  arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV
  arm64: head: Drop SWAPPER_TABLE_SHIFT
  arm64: cpufeature: add POE to cpucap_is_possible()
  arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t

* for-next/mte:
  : Various MTE improvements
  selftests: arm64: add hugetlb mte tests
  hugetlb: arm64: add mte support

* for-next/sysreg:
  : arm64 sysreg updates
  arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09

* for-next/stacktrace:
  : arm64 stacktrace improvements
  arm64: preserve pt_regs::stackframe during exec*()
  arm64: stacktrace: unwind exception boundaries
  arm64: stacktrace: split unwind_consume_stack()
  arm64: stacktrace: report recovered PCs
  arm64: stacktrace: report source of unwind data
  arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk()
  arm64: use a common struct frame_record
  arm64: pt_regs: swap 'unused' and 'pmr' fields
  arm64: pt_regs: rename "pmr_save" -> "pmr"
  arm64: pt_regs: remove stale big-endian layout
  arm64: pt_regs: assert pt_regs is a multiple of 16 bytes

* for-next/hwcap3:
  : Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4)
  arm64: Support AT_HWCAP3
  binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4

* for-next/kselftest: (30 commits)
  : arm64 kselftest fixes/cleanups
  kselftest/arm64: Try harder to generate different keys during PAC tests
  kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all()
  kselftest/arm64: Corrupt P0 in the irritator when testing SSVE
  kselftest/arm64: Add FPMR coverage to fp-ptrace
  kselftest/arm64: Expand the set of ZA writes fp-ptrace does
  kselftets/arm64: Use flag bits for features in fp-ptrace assembler code
  kselftest/arm64: Enable build of PAC tests with LLVM=1
  kselftest/arm64: Check that SVCR is 0 in signal handlers
  kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests
  kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test
  kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests
  kselftest/arm64: Fix build with stricter assemblers
  kselftest/arm64: Test signal handler state modification in fp-stress
  kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test
  kselftest/arm64: Implement irritators for ZA and ZT
  kselftest/arm64: Remove unused ADRs from irritator handlers
  kselftest/arm64: Correct misleading comments on fp-stress irritators
  kselftest/arm64: Poll less often while waiting for fp-stress children
  kselftest/arm64: Increase frequency of signal delivery in fp-stress
  kselftest/arm64: Fix encoding for SVE B16B16 test
  ...

* for-next/crc32:
  : Optimise CRC32 using PMULL instructions
  arm64/crc32: Implement 4-way interleave using PMULL
  arm64/crc32: Reorganize bit/byte ordering macros
  arm64/lib: Handle CRC-32 alternative in C code

* for-next/guest-cca:
  : Support for running Linux as a guest in Arm CCA
  arm64: Document Arm Confidential Compute
  virt: arm-cca-guest: TSM_REPORT support for realms
  arm64: Enable memory encrypt for Realms
  arm64: mm: Avoid TLBI when marking pages as valid
  arm64: Enforce bounce buffers for realm DMA
  efi: arm64: Map Device with Prot Shared
  arm64: rsi: Map unprotected MMIO as decrypted
  arm64: rsi: Add support for checking whether an MMIO is protected
  arm64: realm: Query IPA size from the RMM
  arm64: Detect if in a realm and set RIPAS RAM
  arm64: rsi: Add RSI definitions

* for-next/haft:
  : Support for arm64 FEAT_HAFT
  arm64: pgtable: Warn unexpected pmdp_test_and_clear_young()
  arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG
  arm64: Add support for FEAT_HAFT
  arm64: setup: name 'tcr2' register
  arm64/sysreg: Update ID_AA64MMFR1_EL1 register

* for-next/scs:
  : Dynamic shadow call stack fixes
  arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
  arm64/scs: Deal with 64-bit relative offsets in FDE frames
  arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
2024-11-14 12:07:16 +00:00
Paolo Bonzini
0586ade9e7 LoongArch KVM changes for v6.13
1. Add iocsr and mmio bus simulation in kernel.
 2. Add in-kernel interrupt controller emulation.
 3. Add virt extension support for eiointc irqchip.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmc0otUWHGNoZW5odWFj
 YWlAa2VybmVsLm9yZwAKCRAChivD8uImega1D/0Q91hUlKVp55QXDZrnpW7Z71v+
 I9u8avjRiISDMLkjku/HE9eoD7lVYndzkDDSH32W+UVpBharJvuR+MIoH4jtLf3k
 IImybEaBwXru0+8YxbMqIzqcUEbQda0U5u31Ju1U6xcp+y1PGJJJDVPk4vBXOQB3
 +wnLE6Q7orddw3s6G0QYtTv8jPDPOOL0Jv2ClqBaM8mTr2dIEpMjbZg2yGPMQVlE
 mVEgoked9OS5blkoxz2rEfUMQX5CVs20lyhfr05Qk2mTbeKITceqVlx183CyLMUO
 /9uJl7sD1ctxmQtU7ezeM7n7ItP9ehdAPECkt8WWSHM6mGbwHVTAtJoQGZjgoc6O
 pL1aSzhfGH3mdbwUCjhGsov6cZ4hliDQ76H3dlxrSr0JJX3zOPY5qDegmfDlxlyT
 uoKOAsx5D2N+WgshDPApZonkh38agaeTWposamseJbVNZXHmQV8Q8ipiNhgcgtVe
 mAReWfoYHL2mFIQNrfKS2i9J8mRj9SrjcQyNxgeU3L1s5Mr1p11yYXrkfVrZiHVk
 0KzPfNJZvHO7zvgAIbyqyXEAY2Cq6F2r7UIELUOzY2zayoZwbn2jIZrsUVVbUsWp
 G4FbTRQDK1UR1cCVqe9jLmf5BzlSZ+jXOgcg+CxGIAelZ0qRcK/IgkX6/KygSlgY
 49W45xpHtVUycsWDNA==
 =Jov3
 -----END PGP SIGNATURE-----

Merge tag 'loongarch-kvm-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD

LoongArch KVM changes for v6.13

1. Add iocsr and mmio bus simulation in kernel.
2. Add in-kernel interrupt controller emulation.
3. Add virt extension support for eiointc irqchip.
2024-11-14 07:06:24 -05:00
Paolo Bonzini
7b541d557f KVM/arm64 changes for 6.13, part #1
- Support for stage-1 permission indirection (FEAT_S1PIE) and
    permission overlays (FEAT_S1POE), including nested virt + the
    emulated page table walker
 
  - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
    was introduced in PSCIv1.3 as a mechanism to request hibernation,
    similar to the S4 state in ACPI
 
  - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
    part of it, introduce trivial initialization of the host's MPAM
    context so KVM can use the corresponding traps
 
  - PMU support under nested virtualization, honoring the guest
    hypervisor's trap configuration and event filtering when running a
    nested guest
 
  - Fixes to vgic ITS serialization where stale device/interrupt table
    entries are not zeroed when the mapping is invalidated by the VM
 
  - Avoid emulated MMIO completion if userspace has requested synchronous
    external abort injection
 
  - Various fixes and cleanups affecting pKVM, vCPU initialization, and
    selftests
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQSNXHjWXuzMZutrKNKivnWIJHzdFgUCZzTZXRccb2xpdmVyLnVw
 dG9uQGxpbnV4LmRldgAKCRCivnWIJHzdFioUAP0cs2pYcwuCqLgmeHqfz6L5Xsw3
 hKBCNuvr5mjU0hZfLAEA5ml2eUKD7OnssAOmUZ/K/NoCdJFCe8mJWQDlURvr9g4=
 =u2/3
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-6.13' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 changes for 6.13, part #1

 - Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker

 - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI

 - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps

 - PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest

 - Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM

 - Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection

 - Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
2024-11-14 07:05:36 -05:00
Colton Lewis
2c47e7a74f perf/core: Correct perf sampling with guest VMs
Previously any PMU overflow interrupt that fired while a VCPU was
loaded was recorded as a guest event whether it truly was or not. This
resulted in nonsense perf recordings that did not honor
perf_event_attr.exclude_guest and recorded guest IPs where it should
have recorded host IPs.

Rework the sampling logic to only record guest samples for events with
exclude_guest = 0. This way any host-only events with exclude_guest
set will never see unexpected guest samples. The behaviour of events
with exclude_guest = 0 is unchanged.

Note that events configured to sample both host and guest may still
misattribute a PMI that arrived in the host as a guest event depending
on KVM arch and vendor behavior.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241113190156.2145593-6-coltonlewis@google.com
2024-11-14 10:40:01 +01:00
Colton Lewis
04782e6391 perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()
For clarity, rename the arch-specific definitions of these functions
to perf_arch_* to denote they are arch-specifc. Define the
generic-named functions in one place where they can call the
arch-specific ones as needed.

Signed-off-by: Colton Lewis <coltonlewis@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Acked-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20241113190156.2145593-3-coltonlewis@google.com
2024-11-14 10:40:01 +01:00
Paolo Bonzini
b39d1578d5 KVM generic changes for 6.13
- Rework kvm_vcpu_on_spin() to use a single for-loop instead of making two
    partial poasses over "all" vCPUs.  Opportunistically expand the comment
    to better explain the motivation and logic.
 
  - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
    of RCU, so that running a vCPU on a different task doesn't encounter
    long stalls due to having to wait for all CPUs become quiescent.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmczn8AACgkQOlYIJqCj
 N/39KA/8CKNrGLbbRSn0jaOLW2Xq8mA2ce1CgzTGScIcL6R+bLHSTRPeeUXMPuC7
 CbaRCrBtkP+qn0GJPp+Jo0luH5fxSQL16/iXC+rB8//Lw0ttEygWO3oY8MsV1YjF
 opRhHjqHNIzT1gLEXRSzwm/lBv3jjZG8+5hjaz/LKKIABuzS4bsmacNvHFpIXqhx
 tJDoZydiTFNGFA9+GD5telD/CxK43U5VIbwwnisFtSvNE+PMm1vT3cHIT8lUvuxB
 eoeAyS78SuWnJn4KhYSjByleBYAeqgxaGNHZVow0B6CMKRLocZdppRQklXE3ap/2
 3+V3LNdFl5LBjAP/FHWuMUoovGANUyVQumCHaflz9d8a4acMQInWJtirg/iqGLDN
 JXAXJZBh+JybEBVsPbGvxotTaXPRuMYzKCF512dRcmSzNk73yUSDg8XfOI45rJ9k
 sxFTAGk4blFhE7Htos5rQ/NmvY1K62va1TAdFXqThdx+b6Y5TjWmRE8dCIMouLoJ
 UpEiL0CmbS1yAFXReeyIV/EcBI4Q8C3KXP13n9zyNUCnEQK7CQgc+OmakFL6d5MX
 p8Dq98NBZ0GN+Ay1x9WjfpQCvaH8/7Hf9mltbkTzQ8Zrz/4nmuZ5YuRsIunLL/uO
 dyAs72I9lw7tHAoHJtBd9W7ShKJwYHLtQl9i9umKPOnsc03iwN8=
 =/hvr
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-generic-6.13' of https://github.com/kvm-x86/linux into HEAD

KVM generic changes for 6.13

 - Rework kvm_vcpu_on_spin() to use a single for-loop instead of making two
   partial poasses over "all" vCPUs.  Opportunistically expand the comment
   to better explain the motivation and logic.

 - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead
   of RCU, so that running a vCPU on a different task doesn't encounter
   long stalls due to having to wait for all CPUs become quiescent.
2024-11-13 06:24:19 -05:00
Nataniel Farzan
a7306f3c28 Improve consistency of '#error' directive messages
Remove the use of contractions and use proper punctuation in #error
directive messages that discourage the direct inclusion of header files.

Link: https://lkml.kernel.org/r/20241105032231.28833-1-natanielfarzan@gmail.com
Signed-off-by: Nataniel Farzan <natanielfarzan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 17:17:04 -08:00
Oliver Upton
6d4b81e2e7 Merge branch kvm-arm64/nv-pmu into kvmarm/next
* kvm-arm64/nv-pmu:
  : Support for vEL2 PMU controls
  :
  : Align the vEL2 PMU support with the current state of non-nested KVM,
  : including:
  :
  :  - Trap routing, with the annoying complication of EL2 traps that apply
  :    in Host EL0
  :
  :  - PMU emulation, using the correct configuration bits depending on
  :    whether a counter falls in the hypervisor or guest range of PMCs
  :
  :  - Perf event swizzling across nested boundaries, as the event filtering
  :    needs to be remapped to cope with vEL2
  KVM: arm64: nv: Reprogram PMU events affected by nested transition
  KVM: arm64: nv: Apply EL2 event filtering when in hyp context
  KVM: arm64: nv: Honor MDCR_EL2.HLP
  KVM: arm64: nv: Honor MDCR_EL2.HPME
  KVM: arm64: Add helpers to determine if PMC counts at a given EL
  KVM: arm64: nv: Adjust range of accessible PMCs according to HPMN
  KVM: arm64: Rename kvm_pmu_valid_counter_mask()
  KVM: arm64: nv: Advertise support for FEAT_HPMN0
  KVM: arm64: nv: Describe trap behaviour of MDCR_EL2.HPMN
  KVM: arm64: nv: Honor MDCR_EL2.{TPM, TPMCR} in Host EL0
  KVM: arm64: nv: Reinject traps that take effect in Host EL0
  KVM: arm64: nv: Rename BEHAVE_FORWARD_ANY
  KVM: arm64: nv: Allow coarse-grained trap combos to use complex traps
  KVM: arm64: Describe RES0/RES1 bits of MDCR_EL2
  arm64: sysreg: Add new definitions for ID_AA64DFR0_EL1
  arm64: sysreg: Migrate MDCR_EL2 definition to table
  arm64: sysreg: Describe ID_AA64DFR2_EL1 fields

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:49:02 +00:00
Oliver Upton
fbf3372baa Merge branch kvm-arm64/misc into kvmarm/next
* kvm-arm64/misc:
  : Miscellaneous updates
  :
  :  - Drop useless check against vgic state in ICC_CLTR_EL1.SEIS read
  :    emulation
  :
  :  - Fix trap configuration for pKVM
  :
  :  - Close the door on initialization bugs surrounding userspace irqchip
  :    static key by removing it.
  KVM: selftests: Don't bother deleting memslots in KVM when freeing VMs
  KVM: arm64: Get rid of userspace_irqchip_in_use
  KVM: arm64: Initialize trap register values in hyp in pKVM
  KVM: arm64: Initialize the hypervisor's VM state at EL2
  KVM: arm64: Refactor kvm_vcpu_enable_ptrauth() for hyp use
  KVM: arm64: Move pkvm_vcpu_init_traps() to init_pkvm_hyp_vcpu()
  KVM: arm64: Don't map 'kvm_vgic_global_state' at EL2 with pKVM
  KVM: arm64: Just advertise SEIS as 0 when emulating ICC_CTLR_EL1

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:47:50 +00:00
Oliver Upton
24bb181136 Merge branch kvm-arm64/mpam-ni into kvmarm/next
* kvm-arm64/mpam-ni:
  : Hiding FEAT_MPAM from KVM guests, courtesy of James Morse + Joey Gouly
  :
  : Fix a longstanding bug where FEAT_MPAM was accidentally exposed to KVM
  : guests + the EL2 trap configuration was not explicitly configured. As
  : part of this, bring in skeletal support for initialising the MPAM CPU
  : context so KVM can actually set traps for its guests.
  :
  : Be warned -- if this series leads to boot failures on your system,
  : you're running on turd firmware.
  :
  : As an added bonus (that builds upon the infrastructure added by the MPAM
  : series), allow userspace to configure CTR_EL0.L1Ip, courtesy of Shameer
  : Kolothum.
  KVM: arm64: Make L1Ip feature in CTR_EL0 writable from userspace
  KVM: arm64: selftests: Test ID_AA64PFR0.MPAM isn't completely ignored
  KVM: arm64: Disable MPAM visibility by default and ignore VMM writes
  KVM: arm64: Add a macro for creating filtered sys_reg_descs entries
  KVM: arm64: Fix missing traps of guest accesses to the MPAM registers
  arm64: cpufeature: discover CPU support for MPAM
  arm64: head.S: Initialise MPAM EL2 registers and disable traps
  arm64/sysreg: Convert existing MPAM sysregs and add the remaining entries

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:38:30 +00:00
Oliver Upton
7ccd615bc6 Merge branch kvm-arm64/psci-1.3 into kvmarm/next
* kvm-arm64/psci-1.3:
  : PSCI v1.3 support, courtesy of David Woodhouse
  :
  : Bump KVM's PSCI implementation up to v1.3, with the added bonus of
  : implementing the SYSTEM_OFF2 call. Like other system-scoped PSCI calls,
  : this gets relayed to userspace for further processing with a new
  : KVM_SYSTEM_EVENT_SHUTDOWN flag.
  :
  : As an added bonus, implement client-side support for hibernation with
  : the SYSTEM_OFF2 call.
  arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate
  KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
  KVM: selftests: Add test for PSCI SYSTEM_OFF2
  KVM: arm64: Add support for PSCI v1.2 and v1.3
  KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
  firmware/psci: Add definitions for PSCI v1.3 specification

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:36:46 +00:00
Linus Torvalds
28e43197c4 20 hotfixes, 14 of which are cc:stable.
Three affect DAMON.  Lorenzo's five-patch series to address the
 mmap_region error handling is here also.
 
 Apart from that, various singletons.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzBVmAAKCRDdBJ7gKXxA
 ju42AQD0EEnzW+zFyI+E7x5FwCmLL6ofmzM8Sw9YrKjaeShdZgEAhcyS2Rc/AaJq
 Uty2ZvVMDF2a9p9gqHfKKARBXEbN2w0=
 =n+lO
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "20 hotfixes, 14 of which are cc:stable.

  Three affect DAMON. Lorenzo's five-patch series to address the
  mmap_region error handling is here also.

  Apart from that, various singletons"

* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: add entry for Thorsten Blum
  ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
  signal: restore the override_rlimit logic
  fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
  ucounts: fix counter leak in inc_rlimit_get_ucounts()
  selftests: hugetlb_dio: check for initial conditions to skip in the start
  mm: fix docs for the kernel parameter ``thp_anon=``
  mm/damon/core: avoid overflow in damon_feed_loop_next_input()
  mm/damon/core: handle zero schemes apply interval
  mm/damon/core: handle zero {aggregation,ops_update} intervals
  mm/mlock: set the correct prev on failure
  objpool: fix to make percpu slot allocation more robust
  mm/page_alloc: keep track of free highatomic
  mm: resolve faulty mmap_region() error path behaviour
  mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
  mm: refactor map_deny_write_exec()
  mm: unconditionally close VMAs on error
  mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  mm/thp: fix deferred split unqueue naming and locking
  mm/thp: fix deferred split queue not partially_mapped
2024-11-10 09:04:27 -08:00
Paolo Bonzini
e3e0f9b7ae KVM/riscv changes for 6.13
- Accelerate KVM RISC-V when running as a guest
 - Perf support to collect KVM guest statistics from host side
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmct20AACgkQrUjsVaLH
 LAdyWw/+Os6fIFhThsur7z26E/+7Edxk/QEfUMEEvQPwKfwiRdHYLU1dWGTCw2nA
 xr8yR9MQFkUMTHZBPi+zki47uE6gW1qMb1tl8AuZ4cjZCMSg0pBObdB/axT5+517
 boC74A1r04dlvW3sEvcMaHen0B75Hwm703jjOIiThiZFlB519HcattTPAJo8nTRD
 D4hhpTY/v8QyFlMg0m7KnOOa3MtmR0Q704LaAdEQ5S9kdMlUtMzljHvWIzYdTfjJ
 C7v7EAo+iRi7ecBdg9ZxhOkcKUB2xF+lmLYBRoiOeaECJ4RIaseY63SqLyzlGeNZ
 Dx7DuF0mljC3nmHPH/60pOxDKYWGxjx5IQF1ORBvcsux4uhwH8KkyjcxYody/m82
 HXalAwmN4mA2asrt0xTpIoOUkwOGy7LEKx6083hbrSgzk9E/bnoi+YDe+Gm4z+YR
 ofqsj4G6x0fS85RCKB4iQGRcHgW2IPHYeKWXYYP0m1COe9kj9E3POBpAjj3qMxZr
 R+eveYhLZ7A3B/DAVYQQVoCcQWECXC0M4FOOVh4vEV2JeEB55dlI77/p6X5y7RGj
 YA7njv9eC5hrxOgUMOLFIvMqFKe7dcuYmE3V11+TD5lHBfqWfKE7vA1V1rOA8e8R
 Qf5DXLVBbsNj2yirbj1LI3x7jg5aXw5IhylJNn/c1vezqw8RUfg=
 =sI+4
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-6.13-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv changes for 6.13

- Accelerate KVM RISC-V when running as a guest
- Perf support to collect KVM guest statistics from host side
2024-11-08 12:13:48 -05:00
Ard Biesheuvel
47965a49a2 arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux()
The function scs_patch_vmlinux() was removed in the LPA2 boot code
refactoring so remove the declaration as well.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20241106185513.3096442-8-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-08 16:37:55 +00:00
Ard Biesheuvel
ccf54058f5 arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
The dynamic SCS patching code pretends to parse the DWARF augmentation
data in the CIE (header) frame, and handle accordingly when processing
the individual FDE frames based on this CIE frame. However, the boolean
variable is defined inside the loop, and so the parsed value is ignored.

The same applies to the code alignment field, which is also read from
the header but then discarded.

This was never spotted before because Clang is the only compiler that
supports dynamic SCS patching (which is essentially an Android feature),
and the unwind tables it produces are highly uniform, and match the
de facto defaults.

So instead of testing for the 'z' flag in the augmentation data field,
require a fixed augmentation data string of 'zR', and simplify the rest
of the code accordingly.

Also introduce some error codes to specify why the patching failed, and
log it to the kernel console on failure when this happens when loading a
module. (Doing so for vmlinux is infeasible, as the patching is done
extremely early in the boot.)

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Tested-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20241106185513.3096442-6-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-08 16:37:55 +00:00
Mike Rapoport (Microsoft)
0c6378a715 arch: introduce set_direct_map_valid_noflush()
Add an API that will allow updates of the direct/linear map for a set of
physically contiguous pages.

It will be used in the following patches.

Link: https://lkml.kernel.org/r/20241023162711.2579610-6-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: kdevops <kdevops@lists.linux.dev>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Song Liu <song@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:25:15 -08:00
Mike Rapoport (Microsoft)
0c3beacf68 asm-generic: introduce text-patching.h
Several architectures support text patching, but they name the header
files that declare patching functions differently.

Make all such headers consistently named text-patching.h and add an empty
header in asm-generic for architectures that do not support text patching.

Link: https://lkml.kernel.org/r/20241023162711.2579610-4-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: kdevops <kdevops@lists.linux.dev>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Song Liu <song@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:25:15 -08:00
John Hubbard
afe789b736 kaslr: rename physmem_end and PHYSMEM_END to direct_map_physmem_end
For clarity.  It's increasingly hard to reason about the code, when KASLR
is moving around the boundaries.  In this case where KASLR is randomizing
the location of the kernel image within physical memory, the maximum
number of address bits for physical memory has not changed.

What has changed is the ending address of memory that is allowed to be
directly mapped by the kernel.

Let's name the variable, and the associated macro accordingly.

Also, enhance the comment above the direct_map_physmem_end definition,
to further clarify how this all works.

Link: https://lkml.kernel.org/r/20241009025024.89813-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jordan Niethe <jniethe@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-06 20:11:11 -08:00
Mario Limonciello
b79276dcac ACPI: processor: Move arch_init_invariance_cppc() call later
arch_init_invariance_cppc() is called at the end of
acpi_cppc_processor_probe() in order to configure frequency invariance
based upon the values from _CPC.

This however doesn't work on AMD CPPC shared memory designs that have
AMD preferred cores enabled because _CPC needs to be analyzed from all
cores to judge if preferred cores are enabled.

This issue manifests to users as a warning since commit 21fb59ab4b
("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"):
```
Could not retrieve highest performance (-19)
```

However the warning isn't the cause of this, it was actually
commit 279f838a61 ("x86/amd: Detect preferred cores in
amd_get_boost_ratio_numerator()") which exposed the issue.

To fix this problem, change arch_init_invariance_cppc() into a new weak
symbol that is called at the end of acpi_processor_driver_init().
Each architecture that supports it can declare the symbol to override
the weak one.

Define it for x86, in arch/x86/kernel/acpi/cppc.c, and for all of the
architectures using the generic arch_topology.c code.

Fixes: 279f838a61 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()")
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219431
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/20241104222855.3959267-1-superm1@kernel.org
[ rjw: Changelog edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-11-06 21:31:36 +01:00
Lorenzo Stoakes
5baf8b037d mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
Currently MTE is permitted in two circumstances (desiring to use MTE
having been specified by the VM_MTE flag) - where MAP_ANONYMOUS is
specified, as checked by arch_calc_vm_flag_bits() and actualised by
setting the VM_MTE_ALLOWED flag, or if the file backing the mapping is
shmem, in which case we set VM_MTE_ALLOWED in shmem_mmap() when the mmap
hook is activated in mmap_region().

The function that checks that, if VM_MTE is set, VM_MTE_ALLOWED is also
set is the arm64 implementation of arch_validate_flags().

Unfortunately, we intend to refactor mmap_region() to perform this check
earlier, meaning that in the case of a shmem backing we will not have
invoked shmem_mmap() yet, causing the mapping to fail spuriously.

It is inappropriate to set this architecture-specific flag in general mm
code anyway, so a sensible resolution of this issue is to instead move the
check somewhere else.

We resolve this by setting VM_MTE_ALLOWED much earlier in do_mmap(), via
the arch_calc_vm_flag_bits() call.

This is an appropriate place to do this as we already check for the
MAP_ANONYMOUS case here, and the shmem file case is simply a variant of
the same idea - we permit RAM-backed memory.

This requires a modification to the arch_calc_vm_flag_bits() signature to
pass in a pointer to the struct file associated with the mapping, however
this is not too egregious as this is only used by two architectures anyway
- arm64 and parisc.

So this patch performs this adjustment and removes the unnecessary
assignment of VM_MTE_ALLOWED in shmem_mmap().

[akpm@linux-foundation.org: fix whitespace, per Catalin]
Link: https://lkml.kernel.org/r/ec251b20ba1964fb64cf1607d2ad80c47f3873df.1730224667.git.lorenzo.stoakes@oracle.com
Fixes: deb0f65628 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05 16:49:55 -08:00
Yicong Yang
b349a5a2b6 arm64: pgtable: Warn unexpected pmdp_test_and_clear_young()
Young bit operation on PMD table entry is only supported if
FEAT_HAFT enabled system wide. Add a warning for notifying
the misbehaviour.

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20241102104235.62560-6-yangyicong@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-05 13:21:14 +00:00
Yicong Yang
62df5870eb arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG
With the support of FEAT_HAFT, the NONLEAF_PMD_YOUNG can be enabled
on arm64 since the hardware is capable of updating the AF flag for
PMD table descriptor. Since the AF bit of the table descriptor
shares the same bit position in block descriptors, we only need
to implement arch_has_hw_nonleaf_pmd_young() and select related
configs. The related pmd_young test/update operations keeps the
same with and already implemented for transparent page support.

Currently ARCH_HAS_NONLEAF_PMD_YOUNG is used to improve the
efficiency of lru-gen aging.

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20241102104235.62560-5-yangyicong@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-05 13:21:14 +00:00
Yicong Yang
efe7254135 arm64: Add support for FEAT_HAFT
Armv8.9/v9.4 introduces the feature Hardware managed Access Flag
for Table descriptors (FEAT_HAFT). The feature is indicated by
ID_AA64MMFR1_EL1.HAFDBS == 0b0011 and can be enabled by
TCR2_EL1.HAFT so it has a dependency on FEAT_TCR2.

Adds the Kconfig for FEAT_HAFT and support detecting and enabling
the feature. The feature is enabled in __cpu_setup() before MMU on
just like HA. A CPU capability is added to notify the user of the
feature.

Add definition of P{G,4,U,M}D_TABLE_AF bit and set the AF bit
when creating the page table, which will save the hardware
from having to update them at runtime. This will be ignored if
FEAT_HAFT is not enabled.

The AF bit of table descriptors cannot be managed by the software
per spec, unlike the HA. So this should be used only if it's supported
system wide by system_supports_haft().

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20241102104235.62560-4-yangyicong@huawei.com
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
[catalin.marinas@arm.com: added the ID check back to __cpu_setup in case of future CPU errata]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-05 13:18:35 +00:00
Ard Biesheuvel
baec239797 arm64/mm: Sanity check PTE address before runtime P4D/PUD folding
The runtime P4D/PUD folding logic assumes that the respective pgd_t* and
p4d_t* arguments are pointers into actual page tables that are part of
the hierarchy being operated on.

This may not always be the case, and we have been bitten once by this
already [0], where the argument was actually a stack variable, and in
this case, the logic does not work at all.

So let's add a VM_BUG_ON() for each case, to ensure that the address of
the provided page table entry is consistent with the address being
translated.

[0] https://lore.kernel.org/all/20240725090345.28461-1-will@kernel.org/T/#u

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20241105093919.1312049-2-ardb+git@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-05 11:48:58 +00:00
Alice Ryhl
aecaf18165 jump_label: adjust inline asm to be consistent
To avoid duplication of inline asm between C and Rust, we need to
import the inline asm from the relevant `jump_label.h` header into Rust.
To make that easier, this patch updates the header files to expose the
inline asm via a new ARCH_STATIC_BRANCH_ASM macro.

The header files are all updated to define a ARCH_STATIC_BRANCH_ASM that
takes the same arguments in a consistent order so that Rust can use the
same logic for every architecture.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: " =?utf-8?q?Bj=C3=B6rn_Roy_Baron?= " <bjorn3_gh@protonmail.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anup Patel <apatel@ventanamicro.com>
Cc: Andrew Jones <ajones@ventanamicro.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Conor Dooley <conor.dooley@microchip.com>
Cc: Samuel Holland <samuel.holland@sifive.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Bibo Mao <maobibo@loongson.cn>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tianrui Zhao <zhaotianrui@loongson.cn>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/20241030-tracepoint-v12-4-eec7f0f8ad22@google.com
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com> # RISC-V
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-04 16:21:45 -05:00
Anshuman Khandual
ced841702e arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont()
PTE_TYPE_PAGE bits were being set in pte_mkcont() because PTE_TABLE_BIT
was being cleared in pte_mkhuge(). But after arch_make_huge_pte()
modification in commit f8192813dc ("arm64/mm: Re-organize
arch_make_huge_pte()"), which dropped pte_mkhuge() completely, setting
back PTE_TYPE_PAGE bits is no longer necessary. Change pte_mkcont() to
only set PTE_CONT.

Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20241104041617.3804617-1-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-04 16:21:31 +00:00
Thomas Weißschuh
0973fed6a5 arm64: vdso: Drop LBASE_VDSO
This constant is always "0", providing no value and making the logic
harder to understand.
Also prepare for a consolidation of the vdso linkerscript logic by
aligning it with other architectures.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/all/20241010-vdso-generic-base-v1-4-b64f0842d512@linutronix.de
2024-11-02 12:37:33 +01:00
Anshuman Khandual
f8192813dc arm64/mm: Re-organize arch_make_huge_pte()
Core HugeTLB defines a fallback definition for arch_make_huge_pte(), which
calls platform provided pte_mkhuge(). But if any platform already provides
an override for arch_make_huge_pte(), then it does not need to provide the
helper pte_mkhuge().

arm64 override for arch_make_huge_pte() calls pte_mkhuge() internally, thus
creating an impression, that both of these callbacks are being used in core
HugeTLB and hence required to be defined. This drops off pte_mkhuge() which
was never required to begin with as there could not be any section mappings
at the PTE level. Re-organize arch_make_huge_pte() based on requested page
size and create the entry for the applicable page table level as needed. It
also removes a redundancy of clearing PTE_TABLE_BIT bit followed by setting
both PTE_TABLE_BIT and PTE_VALID bits (via PTE_TYPE_MASK) in the pte, while
creating CONT_PTE_SIZE size entries.

Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20241029044529.2624785-1-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-11-01 15:29:39 +00:00
Raghavendra Rao Ananta
38d7aacca0 KVM: arm64: Get rid of userspace_irqchip_in_use
Improper use of userspace_irqchip_in_use led to syzbot hitting the
following WARN_ON() in kvm_timer_update_irq():

WARNING: CPU: 0 PID: 3281 at arch/arm64/kvm/arch_timer.c:459
kvm_timer_update_irq+0x21c/0x394
Call trace:
  kvm_timer_update_irq+0x21c/0x394 arch/arm64/kvm/arch_timer.c:459
  kvm_timer_vcpu_reset+0x158/0x684 arch/arm64/kvm/arch_timer.c:968
  kvm_reset_vcpu+0x3b4/0x560 arch/arm64/kvm/reset.c:264
  kvm_vcpu_set_target arch/arm64/kvm/arm.c:1553 [inline]
  kvm_arch_vcpu_ioctl_vcpu_init arch/arm64/kvm/arm.c:1573 [inline]
  kvm_arch_vcpu_ioctl+0x112c/0x1b3c arch/arm64/kvm/arm.c:1695
  kvm_vcpu_ioctl+0x4ec/0xf74 virt/kvm/kvm_main.c:4658
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:907 [inline]
  __se_sys_ioctl fs/ioctl.c:893 [inline]
  __arm64_sys_ioctl+0x108/0x184 fs/ioctl.c:893
  __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
  invoke_syscall+0x78/0x1b8 arch/arm64/kernel/syscall.c:49
  el0_svc_common+0xe8/0x1b0 arch/arm64/kernel/syscall.c:132
  do_el0_svc+0x40/0x50 arch/arm64/kernel/syscall.c:151
  el0_svc+0x54/0x14c arch/arm64/kernel/entry-common.c:712
  el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
  el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598

The following sequence led to the scenario:
 - Userspace creates a VM and a vCPU.
 - The vCPU is initialized with KVM_ARM_VCPU_PMU_V3 during
   KVM_ARM_VCPU_INIT.
 - Without any other setup, such as vGIC or vPMU, userspace issues
   KVM_RUN on the vCPU. Since the vPMU is requested, but not setup,
   kvm_arm_pmu_v3_enable() fails in kvm_arch_vcpu_run_pid_change().
   As a result, KVM_RUN returns after enabling the timer, but before
   incrementing 'userspace_irqchip_in_use':
   kvm_arch_vcpu_run_pid_change()
       ret = kvm_arm_pmu_v3_enable()
           if (!vcpu->arch.pmu.created)
               return -EINVAL;
       if (ret)
           return ret;
       [...]
       if (!irqchip_in_kernel(kvm))
           static_branch_inc(&userspace_irqchip_in_use);
 - Userspace ignores the error and issues KVM_ARM_VCPU_INIT again.
   Since the timer is already enabled, control moves through the
   following flow, ultimately hitting the WARN_ON():
   kvm_timer_vcpu_reset()
       if (timer->enabled)
          kvm_timer_update_irq()
              if (!userspace_irqchip())
                  ret = kvm_vgic_inject_irq()
                      ret = vgic_lazy_init()
                          if (unlikely(!vgic_initialized(kvm)))
                              if (kvm->arch.vgic.vgic_model !=
                                  KVM_DEV_TYPE_ARM_VGIC_V2)
                                      return -EBUSY;
                  WARN_ON(ret);

Theoretically, since userspace_irqchip_in_use's functionality can be
simply replaced by '!irqchip_in_kernel()', get rid of the static key
to avoid the mismanagement, which also helps with the syzbot issue.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 21:22:52 +00:00
Oliver Upton
d97e66fbcb KVM: arm64: nv: Reinject traps that take effect in Host EL0
Wire up the other end of traps that affect host EL0 by actually
injecting them into the guest hypervisor. Skip over FGT entirely, as a
cursory glance suggests no FGT is effective in host EL0.

Note that kvm_inject_nested() is already equipped for handling
exceptions while the VM is already in a host context.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-9-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 19:00:39 +00:00
Oliver Upton
eb609638da KVM: arm64: Describe RES0/RES1 bits of MDCR_EL2
Add support for sanitising MDCR_EL2 and describe the RES0/RES1 bits
according to the feature set exposed to the VM.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-6-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 19:00:39 +00:00
Oliver Upton
641630313e arm64: sysreg: Migrate MDCR_EL2 definition to table
Migrate MDCR_EL2 over to the sysreg table and align definitions with
DDI0601 2024-09.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241025182354.3364124-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 19:00:38 +00:00
Fuad Tabba
3663b258f7 KVM: arm64: Refactor kvm_vcpu_enable_ptrauth() for hyp use
Move kvm_vcpu_enable_ptrauth() to a shared header to be used by
hypervisor code in protected mode.

No functional change intended.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241018074833.2563674-3-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 18:45:24 +00:00
Fuad Tabba
0546d4a925 KVM: arm64: Move pkvm_vcpu_init_traps() to init_pkvm_hyp_vcpu()
Move pkvm_vcpu_init_traps() to the initialization of the
hypervisor's vcpu state in init_pkvm_hyp_vcpu(), and remove the
associated hypercall.

In protected mode, traps need to be initialized whenever a VCPU
is initialized anyway, and not only for protected VMs. This also
saves an unnecessary hypercall.

Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/r/20241018074833.2563674-2-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 18:45:24 +00:00
James Morse
31ff96c38e KVM: arm64: Fix missing traps of guest accesses to the MPAM registers
commit 011e5f5bf5 ("arm64/cpufeature: Add remaining feature bits in
ID_AA64PFR0 register") exposed the MPAM field of AA64PFR0_EL1 to guests,
but didn't add trap handling.

If you are unlucky, this results in an MPAM aware guest being delivered
an undef during boot. The host prints:
| kvm [97]: Unsupported guest sys_reg access at: ffff800080024c64 [00000005]
| { Op0( 3), Op1( 0), CRn(10), CRm( 5), Op2( 0), func_read },

Which results in:
| Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc7-00559-gd89c186d50b2 #14616
| Hardware name: linux,dummy-virt (DT)
| pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : test_has_mpam+0x18/0x30
| lr : test_has_mpam+0x10/0x30
| sp : ffff80008000bd90
...
| Call trace:
|  test_has_mpam+0x18/0x30
|  update_cpu_capabilities+0x7c/0x11c
|  setup_cpu_features+0x14/0xd8
|  smp_cpus_done+0x24/0xb8
|  smp_init+0x7c/0x8c
|  kernel_init_freeable+0xf8/0x280
|  kernel_init+0x24/0x1e0
|  ret_from_fork+0x10/0x20
| Code: 910003fd 97ffffde 72001c00 54000080 (d538a500)
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
| ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Add the support to enable the traps, and handle the three guest accessible
registers by injecting an UNDEF. This stops KVM from spamming the host
log, but doesn't yet hide the feature from the id registers.

With MPAM v1.0 we can trap the MPAMIDR_EL1 register only if
ARM64_HAS_MPAM_HCR, with v1.1 an additional MPAM2_EL2.TIDR bit traps
MPAMIDR_EL1 on platforms that don't have MPAMHCR_EL2. Enable one of
these if either is supported. If neither is supported, the guest can
discover that the CPU has MPAM support, and how many PARTID etc the
host has ... but it can't influence anything, so its harmless.

Fixes: 011e5f5bf5 ("arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register")
CC: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/linux-arm-kernel/20200925160102.118858-1-james.morse@arm.com/
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-5-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 18:09:38 +00:00
James Morse
09e6b306f3 arm64: cpufeature: discover CPU support for MPAM
ARMv8.4 adds support for 'Memory Partitioning And Monitoring' (MPAM)
which describes an interface to cache and bandwidth controls wherever
they appear in the system.

Add support to detect MPAM. Like SVE, MPAM has an extra id register that
describes some more properties, including the virtualisation support,
which is optional. Detect this separately so we can detect
mismatched/insane systems, but still use MPAM on the host even if the
virtualisation support is missing.

MPAM needs enabling at the highest implemented exception level, otherwise
the register accesses trap. The 'enabled' flag is accessible to lower
exception levels, but its in a register that traps when MPAM isn't enabled.
The cpufeature 'matches' hook is extended to test this on one of the
CPUs, so that firmware can emulate MPAM as disabled if it is reserved
for use by secure world.

Secondary CPUs that appear late could trip cpufeature's 'lower safe'
behaviour after the MPAM properties have been advertised to user-space.
Add a verify call to ensure late secondaries match the existing CPUs.

(If you have a boot failure that bisects here its likely your CPUs
advertise MPAM in the id registers, but firmware failed to either enable
or MPAM, or emulate the trap as if it were disabled)

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-4-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 18:09:38 +00:00
James Morse
23b33d1e16 arm64: head.S: Initialise MPAM EL2 registers and disable traps
Add code to head.S's el2_setup to detect MPAM and disable any EL2 traps.
This register resets to an unknown value, setting it to the default
parititons/pmg before we enable the MMU is the best thing to do.

Kexec/kdump will depend on this if the previous kernel left the CPU
configured with a restrictive configuration.

If linux is booted at the highest implemented exception level el2_setup
will clear the enable bit, disabling MPAM.

This code can't be enabled until a subsequent patch adds the Kconfig
and cpufeature boiler plate.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-3-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 18:09:38 +00:00
James Morse
83732ce6a0 arm64/sysreg: Convert existing MPAM sysregs and add the remaining entries
Move the existing MPAM system register defines from sysreg.h to
tools/sysreg and add the remaining system registers.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241030160317.2528209-2-joey.gouly@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 18:09:38 +00:00
Marc Zyngier
5970e9903f KVM: arm64: Add basic support for POR_EL2
S1POE support implies support for POR_EL2, which  we provide by

- adding it to the vcpu_sysreg enum
- advertising it as mapped to its EL1 counterpart in get_el2_to_el1_mapping
- wiring it in the sys_reg_desc table with the correct visibility
- handling POR_EL1 in __vcpu_{read,write}_sys_reg_from_cpu()

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-32-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 02:44:22 +00:00
Marc Zyngier
26e89dccdf KVM: arm64: Add kvm_has_s1poe() helper
Just like we have kvm_has_s1pie(), add its S1POE counterpart,
making the code slightly more readable.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-31-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 02:44:22 +00:00
Mark Brown
a68cddbe47 KVM: arm64: Hide S1PIE registers from userspace when disabled for guests
When the guest does not support S1PIE we should not allow any access
to the system registers it adds in order to ensure that we do not create
spurious issues with guest migration. Add a visibility operation for these
registers.

Fixes: 86f9de9db1 ("KVM: arm64: Save/restore PIE registers")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240822-kvm-arm64-hide-pie-regs-v2-3-376624fa829c@kernel.org
[maz: simplify by using __el2_visibility(), kvm_has_s1pie() throughout]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-26-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 02:44:21 +00:00
Mark Brown
0fcb4eea53 KVM: arm64: Hide TCR2_EL1 from userspace when disabled for guests
When the guest does not support FEAT_TCR2 we should not allow any access
to it in order to ensure that we do not create spurious issues with guest
migration. Add a visibility operation for it.

Fixes: fbff560682 ("KVM: arm64: Save/restore TCR2_EL1")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240822-kvm-arm64-hide-pie-regs-v2-2-376624fa829c@kernel.org
[maz: simplify by using __el2_visibility(), kvm_has_tcr2() throughout]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-25-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 02:44:21 +00:00
Marc Zyngier
5f8d5a15ef KVM: arm64: Add PIR{,E0}_EL2 to the sysreg arrays
Add the FEAT_S1PIE EL2 registers to the per-vcpu sysreg register
array.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-15-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 02:42:31 +00:00
Marc Zyngier
a016202009 KVM: arm64: Extend masking facility to arbitrary registers
We currently only use the masking (RES0/RES1) facility for VNCR
registers, as they are memory-based and thus easy to sanitise.

But we could apply the same thing to other registers if we:

- split the sanitisation from __VNCR_START__
- apply the sanitisation when reading from a HW register

This involves a new "marker" in the vcpu_sysreg enum, which
defines the point at which the sanitisation applies (the VNCR
registers being of course after this marker).

Whle we are at it, rename kvm_vcpu_sanitise_vncr_reg() to
kvm_vcpu_apply_reg_masks(), which is vaguely more explicit,
and harden set_sysreg_masks() against setting masks for
random registers...

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 02:42:30 +00:00
Marc Zyngier
14ca930d82 KVM: arm64: Correctly access TCR2_EL1, PIR_EL1, PIRE0_EL1 with VHE
For code that accesses any of the guest registers for emulation
purposes, it is crucial to know where the most up-to-date data is.

While this is pretty clear for nVHE (memory is the sole repository),
things are a lot muddier for VHE, as depending on the SYSREGS_ON_CPU
flag, registers can either be loaded on the HW or be in memory.

Even worse with NV, where the loaded state is by definition partial.

For these reasons, KVM offers the vcpu_read_sys_reg() and
vcpu_write_sys_reg() primitives that always do the right thing.
However, these primitive must know what register to access, and
this is the role of the __vcpu_read_sys_reg_from_cpu() and
__vcpu_write_sys_reg_to_cpu() helpers.

As it turns out, TCR2_EL1, PIR_EL1, PIRE0_EL1 and not described
in the latter helpers, meaning that the AT code cannot use them
to emulate S1PIE.

Add the three registers to the (long) list.

Fixes: 86f9de9db1 ("KVM: arm64: Save/restore PIE registers")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-9-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 02:42:30 +00:00
Marc Zyngier
69c19e047d KVM: arm64: Add TCR2_EL2 to the sysreg arrays
Add the TCR2_EL2 register to the per-vcpu sysreg register array,
the sysreg descriptor array, and advertise it as mapped to TCR2_EL1
for NV purposes.

Access to this register is conditional based on ID_AA64MMFR3_EL1.TCRX
being advertised.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-12-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 02:42:30 +00:00
Marc Zyngier
5792349d0c arm64: Remove VNCR definition for PIRE0_EL2
As of the ARM ARM Known Issues document 102105_K.a_04_en, D22677
fixes a problem with the PIRE0_EL2 register, resulting in its
removal from the VNCR page (it had no purpose being there the
first place).

Follow the architecture update by removing this offset.

Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-31 02:42:29 +00:00
Sean Christopherson
3e7f43188e KVM: Protect vCPU's "last run PID" with rwlock, not RCU
To avoid jitter on KVM_RUN due to synchronize_rcu(), use a rwlock instead
of RCU to protect vcpu->pid, a.k.a. the pid of the task last used to a
vCPU.  When userspace is doing M:N scheduling of tasks to vCPUs, e.g. to
run SEV migration helper vCPUs during post-copy, the synchronize_rcu()
needed to change the PID associated with the vCPU can stall for hundreds
of milliseconds, which is problematic for latency sensitive post-copy
operations.

In the directed yield path, do not acquire the lock if it's contended,
i.e. if the associated PID is changing, as that means the vCPU's task is
already running.

Reported-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Steve Rutherford <srutherford@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240802200136.329973-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-30 14:41:22 -07:00
Julian Vetter
0110feaaf6
arm64: Use new fallback IO memcpy/memset
Use the new fallback memcpy_{from,to}io and memset_io functions from
lib/iomem_copy.c on the arm64 processor architecture.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Yann Sionneau <ysionneau@kalrayinc.com>
Signed-off-by: Julian Vetter <jvetter@kalrayinc.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-10-28 21:44:29 +00:00
Christoph Hellwig
c5c3238d9b
asm-generic: provide generic page_to_phys and phys_to_page implementations
page_to_phys is duplicated by all architectures, and from some strange
reason placed in <asm/io.h> where it doesn't fit at all.

phys_to_page is only provided by a few architectures despite having a lot
of open coded users.

Provide generic versions in <asm-generic/memory_model.h> to make these
helpers more easily usable.

Note with this patch powerpc loses the CONFIG_DEBUG_VIRTUAL pfn_valid
check.  It will be added back in a generic version later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-10-28 21:44:28 +00:00
Rob Herring (Arm)
0bbff9ed81 perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control
Armv8.9/9.4 PMUv3.9 adds per counter EL0 access controls. Per counter
access is enabled with the UEN bit in PMUSERENR_EL1 register. Individual
counters are enabled/disabled in the PMUACR_EL1 register. When UEN is
set, the CR/ER bits control EL0 write access and must be set to disable
write access.

With the access controls, the clearing of unused counters can be
skipped.

KVM also configures PMUSERENR_EL1 in order to trap to EL2. UEN does not
need to be set for it since only PMUv3.5 is exposed to guests.

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20241002184326.1105499-1-robh@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2024-10-28 17:27:15 +00:00
Sean Christopherson
2362506f7c KVM: arm64: Don't mark "struct page" accessed when making SPTE young
Don't mark pages/folios as accessed in the primary MMU when making a SPTE
young in KVM's secondary MMU, as doing so relies on
kvm_pfn_to_refcounted_page(), and generally speaking is unnecessary and
wasteful.  KVM participates in page aging via mmu_notifiers, so there's no
need to push "accessed" updates to the primary MMU.

Dropping use of kvm_set_pfn_accessed() also paves the way for removing
kvm_pfn_to_refcounted_page() and all its users.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-84-seanjc@google.com>
2024-10-25 13:01:35 -04:00
David Woodhouse
97413cea1c KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
The PSCI v1.3 specification adds support for a SYSTEM_OFF2 function
which is analogous to ACPI S4 state. This will allow hosting
environments to determine that a guest is hibernated rather than just
powered off, and ensure that they preserve the virtual environment
appropriately to allow the guest to resume safely (or bump the
hardware_signature in the FACS to trigger a clean reboot instead).

This feature is safe to enable unconditionally (in a subsequent commit)
because it is exposed to userspace through the existing
KVM_SYSTEM_EVENT_SHUTDOWN event, just with an additional flag which
userspace can use to know that the instance intended hibernation instead
of a plain power-off.

As with SYSTEM_RESET2, there is only one type available (in this case
HIBERNATE_OFF), and it is not explicitly reported to userspace through
the event; userspace can get it from the registers if it cares).

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Miguel Luis <miguel.luis@oracle.com>
Link: https://lore.kernel.org/r/20241019172459.2241939-3-dwmw2@infradead.org
[oliver: slight cleanup of comments]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-24 16:36:51 -07:00
Anshuman Khandual
0448a96e24 arm64/mm: Drop _PROT_SECT_DEFAULT
'commit db95ea787b ("arm64: mm: Wire up TCR.DS bit to PTE shareability
fields")' dropped the last reference to symbol _PROT_SECT_DEFAULT, while
transitioning from PMD_SECT_S to PMD_MAYBE_SHARED for PROT_SECT_DEFAULT.
Hence let's just drop that symbol which is now unused.

Cc: Will Deacon <will@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20241021063713.750870-1-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23 12:04:12 +01:00
Suzuki K Poulose
42be24a417 arm64: Enable memory encrypt for Realms
Use the memory encryption APIs to trigger a RSI call to request a
transition between protected memory and shared memory (or vice versa)
and updating the kernel's linear map of modified pages to flip the top
bit of the IPA. This requires that block mappings are not used in the
direct map for realm guests.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Co-developed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/20241017131434.40935-10-steven.price@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23 10:19:33 +01:00
Suzuki K Poulose
3715894376 arm64: rsi: Add support for checking whether an MMIO is protected
On Arm CCA, with RMM-v1.0, all MMIO regions are shared. However, in
the future, an Arm CCA-v1.0 compliant guest may be run in a lesser
privileged partition in the Realm World (with Arm CCA-v1.1 Planes
feature). In this case, some of the MMIO regions may be emulated
by a higher privileged component in the Realm world, i.e, protected.

Thus the guest must decide today, whether a given MMIO region is shared
vs Protected and create the stage1 mapping accordingly. On Arm CCA, this
detection is based on the "IPA State" (RIPAS == RIPAS_IO). Provide a
helper to run this check on a given range of MMIO.

Also, provide a arm64 helper which may be hooked in by other solutions.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/20241017131434.40935-5-steven.price@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23 10:19:32 +01:00
Steven Price
3993069549 arm64: realm: Query IPA size from the RMM
The top bit of the configured IPA size is used as an attribute to
control whether the address is protected or shared. Query the
configuration from the RMM to assertain which bit this is.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Co-developed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/20241017131434.40935-4-steven.price@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23 10:19:32 +01:00
Suzuki K Poulose
c077711f71 arm64: Detect if in a realm and set RIPAS RAM
Detect that the VM is a realm guest by the presence of the RSI
interface. This is done after PSCI has been initialised so that we can
check the SMCCC conduit before making any RSI calls.

If in a realm then iterate over all memory ensuring that it is marked as
RIPAS RAM. The loader is required to do this for us, however if some
memory is missed this will cause the guest to receive a hard to debug
external abort at some random point in the future. So for a
belt-and-braces approach set all memory to RIPAS RAM. Any failure here
implies that the RAM regions passed to Linux are incorrect so panic()
promptly to make the situation clear.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Co-developed-by: Steven Price <steven.price@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/20241017131434.40935-3-steven.price@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23 10:19:32 +01:00
Suzuki K Poulose
b880a80011 arm64: rsi: Add RSI definitions
The RMM (Realm Management Monitor) provides functionality that can be
accessed by a realm guest through SMC (Realm Services Interface) calls.

The SMC definitions are based on DEN0137[1] version 1.0-rel0.

[1] https://developer.arm.com/documentation/den0137/1-0rel0/

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Steven Price <steven.price@arm.com>
Link: https://lore.kernel.org/r/20241017131434.40935-2-steven.price@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-23 10:19:32 +01:00
Mark Rutland
f260c44267 arm64: preserve pt_regs::stackframe during exec*()
When performing an exec*(), there's a transient period before the return
to userspace where any stacktrace will result in a warning triggered by
kunwind_next_frame_record_meta() encountering a struct frame_record_meta
with an unknown type. This can be seen fairly reliably by enabling KASAN
or KFENCE, e.g.

| WARNING: CPU: 3 PID: 143 at arch/arm64/kernel/stacktrace.c:223 arch_stack_walk+0x264/0x3b0
| Modules linked in:
| CPU: 3 UID: 0 PID: 143 Comm: login Not tainted 6.12.0-rc2-00010-g0f0b9a3f6a50 #1
| Hardware name: linux,dummy-virt (DT)
| pstate: 814000c5 (Nzcv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x264/0x3b0
| lr : arch_stack_walk+0x1ec/0x3b0
| sp : ffff80008060b970
| x29: ffff80008060ba10 x28: fff00000051133c0 x27: 0000000000000000
| x26: 0000000000000000 x25: 0000000000000000 x24: fff000007fe84000
| x23: ffff9d1b3c940af0 x22: 0000000000000000 x21: fff00000051133c0
| x20: ffff80008060ba50 x19: ffff9d1b3c9408e0 x18: 0000000000000014
| x17: 000000006d50da47 x16: 000000008e3f265e x15: fff0000004e8bf40
| x14: 0000ffffc5e50e48 x13: 000000000000000f x12: 0000ffffc5e50fed
| x11: 000000000000001f x10: 000018007f8bffff x9 : 0000000000000000
| x8 : ffff80008060b9c0 x7 : ffff80008060bfd8 x6 : ffff80008060ba80
| x5 : ffff80008060ba00 x4 : ffff80008060c000 x3 : ffff80008060bff0
| x2 : 0000000000000018 x1 : ffff80008060bfd8 x0 : 0000000000000000
| Call trace:
|  arch_stack_walk+0x264/0x3b0 (P)
|  arch_stack_walk+0x1ec/0x3b0 (L)
|  stack_trace_save+0x50/0x80
|  metadata_update_state+0x98/0xa0
|  kfence_guarded_free+0xec/0x2c4
|  __kfence_free+0x50/0x100
|  kmem_cache_free+0x1a4/0x37c
|  putname+0x9c/0xc0
|  do_execveat_common.isra.0+0xf0/0x1e4
|  __arm64_sys_execve+0x40/0x60
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x34/0xe0
|  el0t_64_sync_handler+0x120/0x12c
|  el0t_64_sync+0x198/0x19c

This happens because start_thread_common() zeroes the entirety of
current_pt_regs(), including pt_regs::stackframe::type, changing this
from FRAME_META_TYPE_FINAL to 0 and making the final record invalid.
The stacktrace code will reject this until the next return to userspace,
where a subsequent exception entry will reinitialize the type to
FRAME_META_TYPE_FINAL.

This zeroing wasn't a problem prior to commit:

  c2c6b27b5a ("arm64: stacktrace: unwind exception boundaries")

... as before that commit the stacktrace code only expected the final
pt_regs::stackframe to contain zeroes, which was unchanged by
start_thread_common().

A stacktrace could occur at any time, either due to instrumentation or
an exception, and so start_thread_common() must ensure that
pt_regs::stackframe is always valid.

Fix this by changing the way start_thread_common() zeroes and
reinitializes the pt_regs fields:

* The '{regs,pc,pstate}' fields are initialized in one go via a struct
  assignment to the user_regs, with start_thread() and
  compat_start_thread() modified to pass 'pstate' into
  start_thread_common().

* The 'sp' and 'compat_sp' fields are zeroed by the struct assignment in
  start_thread_common(), and subsequently overwritten in start_thread()
  and compat_start_thread respectively, matching existing behaviour.

* The 'syscallno' field is implicitly preserved while the 'orig_x0'
  field is explicitly zeroed, maintaining existing ABI.

* The 'pmr' field is explicitly initialized, as necessary for an exec*()
  from a kernel thread, matching existing behaviour.

* The 'stackframe' field is implicitly preserved, with a new comment and
  some assertions to ensure we don't accidentally break this in future.

* All other fields are implicitly preserved, and should have no
  functional impact:

  - 'sdei_ttbr1' is only used for SDEI exception entry/exit, and we
    never exec*() inside an SDEI handler.

  - 'lockdep_hardirqs' and 'exit_rcu' are only used for EL1 exception
    entry/exit, and we never exec*() inside an EL1 exception handler.

While updating compat_start_thread() to pass 'pstate' into
start_thread_common(), I've also updated the logic to remove the
ifdeffery, replacing:

| #ifdef __AARCH64EB__
|        regs->pstate |= PSR_AA32_E_BIT;
| #endif

... with:

| if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
|         pstate |= PSR_AA32_E_BIT;

... which should be functionally equivalent, and matches our preferred
code style.

Fixes: c2c6b27b5a ("arm64: stacktrace: unwind exception boundaries")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Puranjay Mohan <puranjay12@gmail.com>
Cc: Will Deacon <will@kernel.org>
Fixes: c2c6b27b5a ("arm64: stacktrace: unwind exception boundaries")
Tested-by: Puranjay Mohan <puranjay12@gmail.com>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Link: https://lore.kernel.org/r/20241021164456.2275285-1-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-22 11:56:43 +01:00
Linus Torvalds
d129377639 ARM64:
* Fix the guest view of the ID registers, making the relevant fields
   writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1)
 
 * Correcly expose S1PIE to guests, fixing a regression introduced
   in 6.12-rc1 with the S1POE support
 
 * Fix the recycling of stage-2 shadow MMUs by tracking the context
   (are we allowed to block or not) as well as the recycling state
 
 * Address a couple of issues with the vgic when userspace misconfigures
   the emulation, resulting in various splats. Headaches courtesy
   of our Syzkaller friends
 
 * Stop wasting space in the HYP idmap, as we are dangerously close
   to the 4kB limit, and this has already exploded in -next
 
 * Fix another race in vgic_init()
 
 * Fix a UBSAN error when faking the cache topology with MTE
   enabled
 
 RISCV:
 
 * RISCV: KVM: use raw_spinlock for critical section in imsic
 
 x86:
 
 * A bandaid for lack of XCR0 setup in selftests, which causes trouble
   if the compiler is configured to have x86-64-v3 (with AVX) as the
   default ISA.  Proper XCR0 setup will come in the next merge window.
 
 * Fix an issue where KVM would not ignore low bits of the nested CR3
   and potentially leak up to 31 bytes out of the guest memory's bounds
 
 * Fix case in which an out-of-date cached value for the segments could
   by returned by KVM_GET_SREGS.
 
 * More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL
 
 * Override MTRR state for KVM confidential guests, making it WB by
   default as is already the case for Hyper-V guests.
 
 Generic:
 
 * Remove a couple of unused functions
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmcVK54UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOfrgf7BRyihd28OGaqVuv2BqGYrxqfOkd6
 ZqpJDOy+X7UE3iG5NhTxw4mghCJFhOwIL7gDSZwPLe6D2k01oqPSP2pLMqXb5oOv
 /EkltRvzG0YIH3sjZY5PROrMMxnvSKkJKxETFxFQQzMKRym2v/T5LAzrium58YIT
 vWZXxo2HTPXOw/U5upAqqMYJMeeJEL3kurVHtOsPytUFjrIOl0BfeKvgjOwonDIh
 Awm4JZwk0+1d8sYfkuzsSrTQmtshDCx1jkFN1juirt90s1EwgmOvVKiHo3gMsVP9
 veDRoLTx2fM/r7TrhoHo46DTA2vbfmCltWcT0cn5x8P24BFGXXe/IDJIHA==
 =IVlI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM64:

   - Fix the guest view of the ID registers, making the relevant fields
     writable from userspace (affecting ID_AA64DFR0_EL1 and
     ID_AA64PFR1_EL1)

   - Correcly expose S1PIE to guests, fixing a regression introduced in
     6.12-rc1 with the S1POE support

   - Fix the recycling of stage-2 shadow MMUs by tracking the context
     (are we allowed to block or not) as well as the recycling state

   - Address a couple of issues with the vgic when userspace
     misconfigures the emulation, resulting in various splats. Headaches
     courtesy of our Syzkaller friends

   - Stop wasting space in the HYP idmap, as we are dangerously close to
     the 4kB limit, and this has already exploded in -next

   - Fix another race in vgic_init()

   - Fix a UBSAN error when faking the cache topology with MTE enabled

  RISCV:

   - RISCV: KVM: use raw_spinlock for critical section in imsic

  x86:

   - A bandaid for lack of XCR0 setup in selftests, which causes trouble
     if the compiler is configured to have x86-64-v3 (with AVX) as the
     default ISA. Proper XCR0 setup will come in the next merge window.

   - Fix an issue where KVM would not ignore low bits of the nested CR3
     and potentially leak up to 31 bytes out of the guest memory's
     bounds

   - Fix case in which an out-of-date cached value for the segments
     could by returned by KVM_GET_SREGS.

   - More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL

   - Override MTRR state for KVM confidential guests, making it WB by
     default as is already the case for Hyper-V guests.

  Generic:

   - Remove a couple of unused functions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits)
  RISCV: KVM: use raw_spinlock for critical section in imsic
  KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups
  KVM: selftests: x86: Avoid using SSE/AVX instructions
  KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory
  KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()
  KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL
  KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()
  KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot
  x86/kvm: Override default caching mode for SEV-SNP and TDX
  KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic
  KVM: Remove unused kvm_vcpu_gfn_to_pfn
  KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration
  KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS
  KVM: arm64: Fix shift-out-of-bounds bug
  KVM: arm64: Shave a few bytes from the EL2 idmap code
  KVM: arm64: Don't eagerly teardown the vgic on init error
  KVM: arm64: Expose S1PIE to guests
  KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule
  KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
  ...
2024-10-21 11:22:04 -07:00
Mark Brown
ddadbcdaae arm64: Support AT_HWCAP3
We have filled all 64 bits of AT_HWCAP2 so in order to support discovery of
further features provide the framework to use the already defined AT_HWCAP3
for further CPU features.

Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20241004-arm64-elf-hwcap3-v2-2-799d1daad8b0@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 18:38:50 +01:00
Mark Rutland
c2c6b27b5a arm64: stacktrace: unwind exception boundaries
When arm64's stack unwinder encounters an exception boundary, it uses
the pt_regs::stackframe created by the entry code, which has a copy of
the PC and FP at the time the exception was taken. The unwinder doesn't
know anything about pt_regs, and reports the PC from the stackframe, but
does not report the LR.

The LR is only guaranteed to contain the return address at function call
boundaries, and can be used as a scratch register at other times, so the
LR at an exception boundary may or may not be a legitimate return
address. It would be useful to report the LR value regardless, as it can
be helpful when debugging, and in future it will be helpful for reliable
stacktrace support.

This patch changes the way we unwind across exception boundaries,
allowing both the PC and LR to be reported. The entry code creates a
frame_record_meta structure embedded within pt_regs, which the unwinder
uses to find the pt_regs. The unwinder can then extract pt_regs::pc and
pt_regs::lr as two separate unwind steps before continuing with a
regular walk of frame records.

When a PC is unwound from pt_regs::lr, dump_backtrace() will log this
with an "L" marker so that it can be identified easily. For example,
an unwind across an exception boundary will appear as follows:

|  el1h_64_irq+0x6c/0x70
|  _raw_spin_unlock_irqrestore+0x10/0x60 (P)
|  __aarch64_insn_write+0x6c/0x90 (L)
|  aarch64_insn_patch_text_nosync+0x28/0x80

... with a (P) entry for pt_regs::pc, and an (L) entry for pt_regs:lr.

Note that the LR may be stale at the point of the exception, for example,
shortly after a return:

|  el1h_64_irq+0x6c/0x70
|  default_idle_call+0x34/0x180 (P)
|  default_idle_call+0x28/0x180 (L)
|  do_idle+0x204/0x268

... where the LR points a few instructions before the current PC.

This plays nicely with all the other unwind metadata tracking. With the
ftrace_graph profiler enabled globally, and kretprobes installed on
generic_handle_domain_irq() and do_interrupt_handler(), a backtrace triggered
by magic-sysrq + L reports:

| Call trace:
|  show_stack+0x20/0x40 (CF)
|  dump_stack_lvl+0x60/0x80 (F)
|  dump_stack+0x18/0x28
|  nmi_cpu_backtrace+0xfc/0x140
|  nmi_trigger_cpumask_backtrace+0x1c8/0x200
|  arch_trigger_cpumask_backtrace+0x20/0x40
|  sysrq_handle_showallcpus+0x24/0x38 (F)
|  __handle_sysrq+0xa8/0x1b0 (F)
|  handle_sysrq+0x38/0x50 (F)
|  pl011_int+0x460/0x5a8 (F)
|  __handle_irq_event_percpu+0x60/0x220 (F)
|  handle_irq_event+0x54/0xc0 (F)
|  handle_fasteoi_irq+0xa8/0x1d0 (F)
|  generic_handle_domain_irq+0x34/0x58 (F)
|  gic_handle_irq+0x54/0x140 (FK)
|  call_on_irq_stack+0x24/0x58 (F)
|  do_interrupt_handler+0x88/0xa0
|  el1_interrupt+0x34/0x68 (FK)
|  el1h_64_irq_handler+0x18/0x28
|  el1h_64_irq+0x6c/0x70
|  default_idle_call+0x34/0x180 (P)
|  default_idle_call+0x28/0x180 (L)
|  do_idle+0x204/0x268
|  cpu_startup_entry+0x3c/0x50 (F)
|  rest_init+0xe4/0xf0
|  start_kernel+0x744/0x750
|  __primary_switched+0x88/0x98

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-11-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 18:06:25 +01:00
Mark Rutland
f05a4a42de arm64: stacktrace: split unwind_consume_stack()
When unwinding stacks, we use unwind_consume_stack() to both find
whether an object (e.g. a frame record) is on an accessible stack *and*
to update the stack boundaries. This works fine today since we only care
about one type of object which does not overlap other objects.

In subsequent patches we'll want to check whether an object (e.g a frame
record) is on the stack and follow this up by accessing a larger object
containing the first (e.g. a pt_regs with an embedded frame record).

To make that pattern easier to implement, this patch reworks
unwind_find_next_stack() and unwind_consume_stack() so that the former
can be used to check if an object is on any accessible stack, and the
latter is purely used to update the stack boundaries.

As unwind_find_next_stack() is modified to also check the stack
currently being unwound, it is renamed to unwind_find_stack().

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-10-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 18:06:25 +01:00
Mark Rutland
886c2b0ba8 arm64: use a common struct frame_record
Currently the signal handling code has its own struct frame_record,
the definition of struct pt_regs open-codes a frame record as an array,
and the kernel unwinder hard-codes frame record offsets.

Move to a common struct frame_record that can be used throughout the
kernel.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 18:06:24 +01:00
Mark Rutland
1454363098 arm64: pt_regs: swap 'unused' and 'pmr' fields
In subsequent patches we'll want to add an additional u64 to struct
pt_regs. To make space, this patch swaps the 'unused' and 'pmr' fields,
as the 'pmr' value only requires bits[7:0] and can safely fit into a
u32, which frees up a 64-bit unused field.

The 'lockdep_hardirqs' and 'exit_rcu' fields should eventually be moved
out of pt_regs and managed locally within entry-common.c, so I've left
those as-is for the moment.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-5-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 18:06:24 +01:00
Mark Rutland
00d9597903 arm64: pt_regs: rename "pmr_save" -> "pmr"
The pt_regs::pmr_save field is weirdly named relative to all other
pt_regs fields, with a '_save' suffix that doesn't make anything clearer
and only leads to more typing to access the field.

Remove the '_save' suffix.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-4-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 18:06:24 +01:00
Mark Rutland
2716d59bf4 arm64: pt_regs: remove stale big-endian layout
For historical reasons the layout of struct pt_regs depends on the
configured endianness, with the order of the 'syscallno' and 'unused2'
fields varying dependent upon whether __AARCH64EB__ is defined. We no
longer depend on the order of these two fields and can remove the
ifdeffery.

The current conditional layout was introduced in commit:

  35d0e6fb4d ("arm64: syscallno is secretly an int, make it official")

At the time, this was necessary so that the entry assembly could use a
single STP instruction to save the pt_regs::{orig_x0,syscallno} fields,
without logic that was conditional on the endianness of the kernel:

| el0_svc_naked:
|         stp     x0, xscno, [sp, #S_ORIG_X0]     // save the original x0 and syscall number

This logic was converted to C in commit:

  f37099b699 ("arm64: convert syscall trace logic to C")

Since that commit, we no longer manipulate pt_regs::orig_x0 from
assembly, and only manipulate pt_regs::syscallno as a 32-bit quantity
early in the kernel_entry assembly:

| /* Not in a syscall by default (el0_svc overwrites for real syscall) */
| .if     \el == 0
| mov     w21, #NO_SYSCALL
| str     w21, [sp, #S_SYSCALLNO]
| .endif

Given the above, there's no longer a need for the layout of
pt_regs::{syscallno,unused2} to depend on the endianness of the kernel.

This patch removes the ifdeffery and places 'syscallno' before 'unused2'
regardless of the endianess of the kernel. At the same time, 'unused2'
is renamed to 'unused', as it is the only unused field within pt_regs.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 18:06:24 +01:00
Mark Rutland
c87df9cb9a arm64: pt_regs: assert pt_regs is a multiple of 16 bytes
To ensure that the stack is correctly aligned when branching to C code,
we require that struct pt_regs is a multiple of 16 bytes, as noted in a
comment.

Add an explicit assertion for this, so that any accidental violation of
this requirement will be caught by the compiler.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241017092538.1859841-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 18:06:24 +01:00
Linus Torvalds
6efbea77b3 arm64 fixes for -rc4
- Disable software tag-based KASAN when compiling with GCC, as functions
   are incorrectly instrumented leading to a crash early during boot.
 
 - Fix pkey configuration for kernel threads when POE is enabled.
 
 - Fix invalid memory accesses in uprobes when targetting load-literal
   instructions.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmcPrzQQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNIr6B/wN+o1xI7Fv/QdlaTuKYLvOOg/XTl6sbUDj
 YssxtjhpKuaFVG4zJHNsWvgUqO+YCM7m3F1L8LVPMF7l2xoKtRTIB1Ye315hTjYm
 dW5Te6xBMVKF8SVxE8sBbZobdokIW1JNPBrvGvHO3d5ujmofzwHU8RNMXuTUItRw
 z85Qy75FkEDTEbsWhS3VL5HOgEr+k0TYDRa8SXwKWVj7/rYna3tO39kIdS5dt9VX
 wDJbnxtWJMhiHmDnevFFhBkSZrips12P1Rb6HUSmhpUJh0Rk4TAZntSl2f/lr+jA
 PuboBbSG68UOCwAHoNmTcLdFhkiNaiyw4w2F7hk2A6aNRtme+bT0
 =M/ug
 -----END PGP SIGNATURE-----

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:

 - Disable software tag-based KASAN when compiling with GCC, as
   functions are incorrectly instrumented leading to a crash early
   during boot

 - Fix pkey configuration for kernel threads when POE is enabled

 - Fix invalid memory accesses in uprobes when targetting load-literal
   instructions

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  kasan: Disable Software Tag-Based KASAN with GCC
  Documentation/protection-keys: add AArch64 to documentation
  arm64: set POR_EL0 for kernel threads
  arm64: probes: Fix uprobes for big-endian kernels
  arm64: probes: Fix simulate_ldr*_literal()
  arm64: probes: Remove broken LDR (literal) uprobe support
2024-10-17 09:51:03 -07:00
Kristina Martsenko
13840229d6 arm64: mops: Handle MOPS exceptions from EL1
We will soon be using MOPS instructions in the kernel, so wire up the
exception handler to handle exceptions from EL1 caused by the copy/set
operation being stopped and resumed on a different type of CPU.

Add a helper for advancing the single step state machine, similarly to
what the EL0 exception handler does.

Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Link: https://lore.kernel.org/r/20240930161051.3777828-3-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 16:42:51 +01:00
Kristina Martsenko
c56c599d90 arm64: probes: Disable kprobes/uprobes on MOPS instructions
FEAT_MOPS instructions require that all three instructions (prologue,
main and epilogue) appear consecutively in memory. Placing a
kprobe/uprobe on one of them doesn't work as only a single instruction
gets executed out-of-line or simulated. So don't allow placing a probe
on a MOPS instruction.

Fixes: b7564127ff ("arm64: mops: detect and enable FEAT_MOPS")
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Link: https://lore.kernel.org/r/20240930161051.3777828-2-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 16:42:50 +01:00
Marc Zyngier
afa9b48f32 KVM: arm64: Shave a few bytes from the EL2 idmap code
Our idmap is becoming too big, to the point where it doesn't fit in
a 4kB page anymore.

There are some low-hanging fruits though, such as the el2_init_state
horror that is expanded 3 times in the kernel. Let's at least limit
ourselves to two copies, which makes the kernel link again.

At some point, we'll have to have a better way of doing this.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241009204903.GA3353168@thelio-3990X
2024-10-17 09:17:56 +01:00
Gavin Shan
0f612c6eb1 arm64: head: Drop SWAPPER_TABLE_SHIFT
There is no users of SWAPPER_TABLE_SHIFT after commit 84b04d3e6b
("arm64: kernel: Create initial ID map from C code"). Just drop it.

Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20241014030341.995806-1-gshan@redhat.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-16 16:04:19 +01:00
Joey Gouly
9c4a25140d arm64: cpufeature: add POE to cpucap_is_possible()
Since de66cb37ab ("arm64: Add cpucap_is_possible()"),
alternative_has_cap_unlikely() includes the IS_ENABLED() check.

Add CONFIG_ARM64_POE to cpucap_is_possible() to avoid the explicit check.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241008140121.2774348-1-joey.gouly@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-16 15:30:53 +01:00
Yang Shi
25c17c4b55 hugetlb: arm64: add mte support
Enable MTE support for hugetlb.

The MTE page flags will be set on the folio only.  When copying
hugetlb folio (for example, CoW), the tags for all subpages will be copied
when copying the first subpage.

When freeing hugetlb folio, the MTE flags will be cleared.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20241001225220.271178-1-yang@os.amperecomputing.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-16 14:50:47 +01:00
Anshuman Khandual
8ef41786d8 arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t
pgattr_change_is_safe() processes two distinct page table entries that just
happen to be 64 bits for all levels. This changes both arguments to reflect
the actual data type being processed in the function.

This change is important when moving to FEAT_D128 based 128 bit page tables
because it makes it simple to change the entry size in one place.

Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20241001045804.1119881-1-anshuman.khandual@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-16 14:46:39 +01:00
Kefeng Wang
a923705c69 arm64: optimize flush tlb kernel range
Currently the kernel TLBs is flushed page by page if the target
VA range is less than MAX_DVM_OPS * PAGE_SIZE, otherwise we'll
brutally issue a TLBI ALL.

But we could optimize it when CPU supports TLB range operations,
convert to use __flush_tlb_range_op() like other tlb range flush
to improve performance.

Co-developed-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20240923131351.713304-3-wangkefeng.wang@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-16 12:01:53 +01:00
Kefeng Wang
7ffc13e233 arm64: tlbflush: add __flush_tlb_range_limit_excess()
The __flush_tlb_range_limit_excess() helper will be used when
flush tlb kernel range soon.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20240923131351.713304-2-wangkefeng.wang@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-16 12:01:53 +01:00
Vincenzo Frascino
efe8419ae7 vdso: Introduce vdso/page.h
The VDSO implementation includes headers from outside of the
vdso/ namespace.

Introduce vdso/page.h to make sure that the generic library
uses only the allowed namespace.

Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Link: https://lore.kernel.org/all/20241014151340.1639555-3-vincenzo.frascino@arm.com
2024-10-16 00:13:04 +02:00
Liao Chang
ac4ad5c09b arm64: insn: Simulate nop instruction for better uprobe performance
v2->v1:
1. Remove the simuation of STP and the related bits.
2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI
   scenario.

As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
counterintuitive result that nop and push variants are much slower than
ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
which excludes 'nop' and 'stp' from the emulatable instructions list.
This force the kernel returns to userspace and execute them out-of-line,
then trapping back to kernel for running uprobe callback functions. This
leads to a significant performance overhead compared to 'ret' variant,
which is already emulated.

Typicall uprobe is installed on 'nop' for USDT and on function entry
which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
and fp into stack regardless kernel or userspace binary. In order to
improve the performance of handling uprobe for common usecases. This
patch supports the emulation of Arm64 equvialents instructions of 'nop'
and 'push'. The benchmark results below indicates the performance gain
of emulation is obvious.

On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.

xol (1 cpus)
------------
uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
-------------------
uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)

As shown above, the performance gap between 'nop/push' and 'ret'
variants has been significantly reduced. Due to the emulation of 'push'
instruction needs to access userspace memory, it spent more cycles than
the other.

As Mark suggested [1], it is painful to emulate the correct atomicity
and ordering properties of STP, especially when it interacts with MTE,
POE, etc. So this patch just focus on the simuation of 'nop'. The
simluation of STP and related changes will be addressed in a separate
patch.

[0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
[1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/

CC: Andrii Nakryiko <andrii.nakryiko@gmail.com>
CC: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Liao Chang <liaochang1@huawei.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20240909071114.1150053-1-liaochang1@huawei.com
[catalin.marinas@arm.com: small tweaks following MarkR's comments]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-15 19:43:26 +01:00
Mark Rutland
7bd8870af8 arm64: asm-offsets: remove VMA_VM_*
The VMA_VM_MM definition is only used by the vma_vm_mm macro, which
itself is unused. The VMA_VM_FLAGS definition isn't used anywhere.

Remove them all.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241007123921.549340-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-15 18:42:06 +01:00
Mark Rutland
14762109de arm64: probes: Remove probe_opcode_t
The probe_opcode_t typedef for u32 isn't necessary, and is a source of
confusion as it is easily confused with kprobe_opcode_t, which is a
typedef for __le32.

The typedef is only used within arch/arm64, and all of arm64's commn
insn code uses u32 for the endian-agnostic value of an instruction, so
it'd be clearer to use u32 consistently.

Remove probe_opcode_t and use u32 directly.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marnias@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241008155851.801546-7-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-15 18:16:20 +01:00
Mark Rutland
dd0eb50e7c arm64: probes: Cleanup kprobes endianness conversions
The core kprobes code uses kprobe_opcode_t for the in-memory
representation of an instruction, using 'kprobe_opcode_t *' for XOL
slots. As arm64 instructions are always little-endian 32-bit values,
kprobes_opcode_t should be __le32, but at the moment kprobe_opcode_t
is typedef'd to u32.

Today there is no functional issue as we convert values via
cpu_to_le32() and le32_to_cpu() where necessary, but these conversions
are inconsistent with the types used, causing sparse warnings:

|   CHECK   arch/arm64/kernel/probes/kprobes.c
| arch/arm64/kernel/probes/kprobes.c:102:21: warning: cast to restricted __le32
|   CHECK   arch/arm64/kernel/probes/decode-insn.c
| arch/arm64/kernel/probes/decode-insn.c:122:46: warning: cast to restricted __le32
| arch/arm64/kernel/probes/decode-insn.c:124:50: warning: cast to restricted __le32
| arch/arm64/kernel/probes/decode-insn.c:136:31: warning: cast to restricted __le32

Improve this by making kprobes_opcode_t a typedef for __le32 and
consistently using this for pointers to executable instructions. With
this change we can rely on the type system to tell us where conversions
are necessary.

Since kprobe::opcode is changed from u32 to __le32, the existing
le32_to_cpu() converion moves from the point this is initialized (in
arch_prepare_kprobe()) to the points this is consumed when passed to
a handler or text patching function. As kprobe::opcode isn't altered or
consumed elsewhere, this shouldn't result in a functional change.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241008155851.801546-6-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-15 18:16:20 +01:00
Mark Rutland
6105c5d46d arm64: probes: Move kprobes-specific fields
We share struct arch_probe_insn between krpboes and uprobes, but most of
its fields aren't necessary for uprobes:

* The 'insn' field is only used by kprobes as a pointer to the XOL slot.

* The 'restore' field is only used by probes as the PC to restore after
  stepping an instruction in the XOL slot.

* The 'pstate_cc' field isn't used by kprobes or uprobes, and seems to
  only exist as a result of copy-pasting the 32-bit arm implementation
  of kprobes.

As these fields live in struct arch_probe_insn they cannot use
definitions that only exist when CONFIG_KPROBES=y, such as the
kprobe_opcode_t typedef, which we'd like to use in subsequent patches.

Clean this up by removing the 'pstate_cc' field, and moving the
kprobes-specific fields into the kprobes-specific struct
arch_specific_insn. To make it clear that the fields are related to
stepping instructions in the XOL slot, 'insn' is renamed to 'xol_insn'
and 'restore' is renamed to 'xol_restore'

At the same time, remove the misleading and useless comment above struct
arch_probe_insn.

The should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241008155851.801546-5-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-15 18:16:19 +01:00
Thomas Weißschuh
39c089a01a vdso: Remove timekeeper argument of __arch_update_vsyscall()
No implementation of this hook uses the passed in timekeeper anymore.

This avoids including a non-VDSO header while building the VDSO, which can
lead to compilation errors.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/all/20241010-vdso-generic-arch_update_vsyscall-v1-1-7fe5a3ea4382@linutronix.de
2024-10-15 17:50:28 +02:00
Steven Rostedt
e4cf33ca48 ftrace: Consolidate ftrace_regs accessor functions for archs using pt_regs
Most architectures use pt_regs within ftrace_regs making a lot of the
accessor functions just calls to the pt_regs internally. Instead of
duplication this effort, use a HAVE_ARCH_FTRACE_REGS for architectures
that have their own ftrace_regs that is not based on pt_regs and will
define all the accessor functions, and for the architectures that just use
pt_regs, it will leave it undefined, and the default accessor functions
will be used.

Note, this will also make it easier to add new accessor functions to
ftrace_regs as it will mean having to touch less architectures.

Cc: <linux-arch@vger.kernel.org>
Cc: "x86@kernel.org" <x86@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/20241010202114.2289f6fd@gandalf.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> # powerpc
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-15 11:42:35 -04:00
Steven Rostedt
7888af4166 ftrace: Make ftrace_regs abstract from direct use
ftrace_regs was created to hold registers that store information to save
function parameters, return value and stack. Since it is a subset of
pt_regs, it should only be used by its accessor functions. But because
pt_regs can easily be taken from ftrace_regs (on most archs), it is
tempting to use it directly. But when running on other architectures, it
may fail to build or worse, build but crash the kernel!

Instead, make struct ftrace_regs an empty structure and have the
architectures define __arch_ftrace_regs and all the accessor functions
will typecast to it to get to the actual fields. This will help avoid
usage of ftrace_regs directly.

Link: https://lore.kernel.org/all/20241007171027.629bdafd@gandalf.local.home/

Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "x86@kernel.org" <x86@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Paul  Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas  Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav  Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/20241008230628.958778821@goodmis.org
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # s390
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-10 20:18:01 -04:00
Mark Rutland
13f8f1e05f arm64: probes: Fix uprobes for big-endian kernels
The arm64 uprobes code is broken for big-endian kernels as it doesn't
convert the in-memory instruction encoding (which is always
little-endian) into the kernel's native endianness before analyzing and
simulating instructions. This may result in a few distinct problems:

* The kernel may may erroneously reject probing an instruction which can
  safely be probed.

* The kernel may erroneously erroneously permit stepping an
  instruction out-of-line when that instruction cannot be stepped
  out-of-line safely.

* The kernel may erroneously simulate instruction incorrectly dur to
  interpretting the byte-swapped encoding.

The endianness mismatch isn't caught by the compiler or sparse because:

* The arch_uprobe::{insn,ixol} fields are encoded as arrays of u8, so
  the compiler and sparse have no idea these contain a little-endian
  32-bit value. The core uprobes code populates these with a memcpy()
  which similarly does not handle endianness.

* While the uprobe_opcode_t type is an alias for __le32, both
  arch_uprobe_analyze_insn() and arch_uprobe_skip_sstep() cast from u8[]
  to the similarly-named probe_opcode_t, which is an alias for u32.
  Hence there is no endianness conversion warning.

Fix this by changing the arch_uprobe::{insn,ixol} fields to __le32 and
adding the appropriate __le32_to_cpu() conversions prior to consuming
the instruction encoding. The core uprobes copies these fields as opaque
ranges of bytes, and so is unaffected by this change.

At the same time, remove MAX_UINSN_BYTES and consistently use
AARCH64_INSN_SIZE for clarity.

Tested with the following:

| #include <stdio.h>
| #include <stdbool.h>
|
| #define noinline __attribute__((noinline))
|
| static noinline void *adrp_self(void)
| {
|         void *addr;
|
|         asm volatile(
|         "       adrp    %x0, adrp_self\n"
|         "       add     %x0, %x0, :lo12:adrp_self\n"
|         : "=r" (addr));
| }
|
|
| int main(int argc, char *argv)
| {
|         void *ptr = adrp_self();
|         bool equal = (ptr == adrp_self);
|
|         printf("adrp_self   => %p\n"
|                "adrp_self() => %p\n"
|                "%s\n",
|                adrp_self, ptr, equal ? "EQUAL" : "NOT EQUAL");
|
|         return 0;
| }

.... where the adrp_self() function was compiled to:

| 00000000004007e0 <adrp_self>:
|   4007e0:       90000000        adrp    x0, 400000 <__ehdr_start>
|   4007e4:       911f8000        add     x0, x0, #0x7e0
|   4007e8:       d65f03c0        ret

Before this patch, the ADRP is not recognized, and is assumed to be
steppable, resulting in corruption of the result:

| # ./adrp-self
| adrp_self   => 0x4007e0
| adrp_self() => 0x4007e0
| EQUAL
| # echo 'p /root/adrp-self:0x007e0' > /sys/kernel/tracing/uprobe_events
| # echo 1 > /sys/kernel/tracing/events/uprobes/enable
| # ./adrp-self
| adrp_self   => 0x4007e0
| adrp_self() => 0xffffffffff7e0
| NOT EQUAL

After this patch, the ADRP is correctly recognized and simulated:

| # ./adrp-self
| adrp_self   => 0x4007e0
| adrp_self() => 0x4007e0
| EQUAL
| #
| # echo 'p /root/adrp-self:0x007e0' > /sys/kernel/tracing/uprobe_events
| # echo 1 > /sys/kernel/tracing/events/uprobes/enable
| # ./adrp-self
| adrp_self   => 0x4007e0
| adrp_self() => 0x4007e0
| EQUAL

Fixes: 9842ceae9f ("arm64: Add uprobe support")
Cc: stable@vger.kernel.org
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241008155851.801546-4-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-10-09 16:56:53 +01:00
Oliver Upton
c268f204f7 KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
Currently, when a nested MMU is repurposed for some other MMU context,
KVM unmaps everything during vcpu_load() while holding the MMU lock for
write. This is quite a performance bottleneck for large nested VMs, as
all vCPU scheduling will spin until the unmap completes.

Start punting the MMU cleanup to a vCPU request, where it is then
possible to periodically release the MMU lock and CPU in the presence of
contention.

Ensure that no vCPU winds up using a stale MMU by tracking the pending
unmap on the S2 MMU itself and requesting an unmap on every vCPU that
finds it.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007233028.2236133-4-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08 10:40:27 +01:00
Oliver Upton
3c164eb946 KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
Right now the nested code allows unmap operations on a shadow stage-2 to
block unconditionally. This is wrong in a couple places, such as a
non-blocking MMU notifier or on the back of a sched_in() notifier as
part of shadow MMU recycling.

Carry through whether or not blocking is allowed to
kvm_pgtable_stage2_unmap(). This 'fixes' an issue where stage-2 MMU
reclaim would precipitate a stack overflow from a pile of kvm_sched_in()
callbacks, all trying to recycle a stage-2 MMU.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20241007233028.2236133-3-oliver.upton@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08 10:40:27 +01:00
Linus Torvalds
4563243ede ARM64:
* Fix pKVM error path on init, making sure we do not change critical
   system registers as we're about to fail
 
 * Make sure that the host's vector length is at capped by a value
   common to all CPUs
 
 * Fix kvm_has_feat*() handling of "negative" features, as the current
   code is pretty broken
 
 * Promote Joey to the status of official reviewer, while James steps
   down -- hopefully only temporarly
 
 x86:
 
 * Fix compilation with KVM_INTEL=KVM_AMD=n
 
 * Fix disabling KVM_X86_QUIRK_SLOT_ZAP_ALL when shadow MMU is in use
 
 Selftests:
 
 * Fix compilation on non-x86 architectures
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmcCRMgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMNIgf/T80+VxFy7eP1yTkZy9nd3UjSsAeT
 fWvYMyN2isOTWTVbl3ckjMZc4i7L/nOngxfkLzI3OfFUO8TI8cw11hNFn85m+WKM
 95DVgEaqz1kuJg25VjSj9AySvPFDNec8bV37C2vk2jF4YsGo6qBugSSjktZUgGiW
 ozsdV39lcVcLf+x8/52Vc2eb736nrrYg8QaFP0tEQs9MHuYob/XBw3Zx42dJoZYl
 tCjGP5oW7EvUdRD48GkgXP9DWA12QmDxNOHEmUdxWamsK88YQXFyWwb7uwV5x+hd
 mO3bJaYInkJsh3D2e5QARswQb+D5HMVYFwvEkxQF/wvmcMosRVz4vv65Sw==
 =P4uw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM64:

   - Fix pKVM error path on init, making sure we do not change critical
     system registers as we're about to fail

   - Make sure that the host's vector length is at capped by a value
     common to all CPUs

   - Fix kvm_has_feat*() handling of "negative" features, as the current
     code is pretty broken

   - Promote Joey to the status of official reviewer, while James steps
     down -- hopefully only temporarly

  x86:

   - Fix compilation with KVM_INTEL=KVM_AMD=n

   - Fix disabling KVM_X86_QUIRK_SLOT_ZAP_ALL when shadow MMU is in use

  Selftests:

   - Fix compilation on non-x86 architectures"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  x86/reboot: emergency callbacks are now registered by common KVM code
  KVM: x86: leave kvm.ko out of the build if no vendor module is requested
  KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU
  KVM: arm64: Fix kvm_has_feat*() handling of negative features
  KVM: selftests: Fix build on architectures other than x86_64
  KVM: arm64: Another reviewer reshuffle
  KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM
  KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error path
2024-10-06 10:53:28 -07:00
Mark Brown
7ec3b57cb2 arm64/ptrace: Expose GCS via ptrace and core files
Provide a new register type NT_ARM_GCS reporting the current GCS mode
and pointer for EL0.  Due to the interactions with allocation and
deallocation of Guarded Control Stacks we do not permit any changes to
the GCS mode via ptrace, only GCSPR_EL0 may be changed.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-27-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:41 +01:00
Mark Brown
16f47bb9ac arm64/signal: Expose GCS state in signal frames
Add a context for the GCS state and include it in the signal context when
running on a system that supports GCS. We reuse the same flags that the
prctl() uses to specify which GCS features are enabled and also provide the
current GCS pointer.

We do not support enabling GCS via signal return, there is a conflict
between specifying GCSPR_EL0 and allocation of a new GCS and this is not
an ancticipated use case.  We also enforce GCS configuration locking on
signal return.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-26-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:41 +01:00
Mark Brown
eaf62ce156 arm64/signal: Set up and restore the GCS context for signal handlers
When invoking a signal handler we use the GCS configuration and stack
for the current thread.

Since we implement signal return by calling the signal handler with a
return address set up pointing to a trampoline in the vDSO we need to
also configure any active GCS for this by pushing a frame for the
trampoline onto the GCS.  If we do not do this then signal return will
generate a GCS protection fault.

In order to guard against attempts to bypass GCS protections via signal
return we only allow returning with GCSPR_EL0 pointing to an address
where it was previously preempted by a signal.  We do this by pushing a
cap onto the GCS, this takes the form of an architectural GCS cap token
with the top bit set and token type of 0 which we add on signal entry
and validate and pop off on signal return.  The combination of the top
bit being set and the token type mean that this can't be interpreted as
a valid token or address.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-25-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:40 +01:00
Mark Brown
b57180c75c arm64/gcs: Implement shadow stack prctl() interface
Implement the architecture neutral prctl() interface for setting the
shadow stack status, this supports setting and reading the current GCS
configuration for the current thread.

Userspace can enable basic GCS functionality and additionally also
support for GCS pushes and arbitrary GCS stores.  It is expected that
this prctl() will be called very early in application startup, for
example by the dynamic linker, and not subsequently adjusted during
normal operation.  Users should carefully note that after enabling GCS
for a thread GCS will become active with no call stack so it is not
normally possible to return from the function that invoked the prctl().

State is stored per thread, enabling GCS for a thread causes a GCS to be
allocated for that thread.

Userspace may lock the current GCS configuration by specifying
PR_SHADOW_STACK_ENABLE_LOCK, this prevents any further changes to the
GCS configuration via any means.

If GCS is not being enabled then all flags other than _LOCK are ignored,
it is not possible to enable stores or pops without enabling GCS.

When disabling the GCS we do not free the allocated stack, this allows
for inspection of the GCS after disabling as part of fault reporting.
Since it is not an expected use case and since it presents some
complications in determining what to do with previously initialsed data
on the GCS attempts to reenable GCS after this are rejected.  This can
be revisted if a use case arises.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-23-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:39 +01:00
Mark Brown
506496bcbb arm64/gcs: Ensure that new threads have a GCS
When a new thread is created by a thread with GCS enabled the GCS needs
to be specified along with the regular stack.

Unfortunately plain clone() is not extensible and existing clone3()
users will not specify a stack so all existing code would be broken if
we mandated specifying the stack explicitly.  For compatibility with
these cases and also x86 (which did not initially implement clone3()
support for shadow stacks) if no GCS is specified we will allocate one
so when a thread is created which has GCS enabled allocate one for it.
We follow the extensively discussed x86 implementation and allocate
min(RLIMIT_STACK/2, 2G).  Since the GCS only stores the call stack and not
any variables this should be more than sufficient for most applications.

GCSs allocated via this mechanism will be freed when the thread exits.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-22-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:39 +01:00
Mark Brown
fc84bc5378 arm64/gcs: Context switch GCS state for EL0
There are two registers controlling the GCS state of EL0, GCSPR_EL0 which
is the current GCS pointer and GCSCRE0_EL1 which has enable bits for the
specific GCS functionality enabled for EL0. Manage these on context switch
and process lifetime events, GCS is reset on exec().  Also ensure that
any changes to the GCS memory are visible to other PEs and that changes
from other PEs are visible on this one by issuing a GCSB DSYNC when
moving to or from a thread with GCS.

Since the current GCS configuration of a thread will be visible to
userspace we store the configuration in the format used with userspace
and provide a helper which configures the system register as needed.

On systems that support GCS we always allow access to GCSPR_EL0, this
facilitates reporting of GCS faults if userspace implements disabling of
GCS on error - the GCS can still be discovered and examined even if GCS
has been disabled.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-21-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:38 +01:00
Mark Brown
8ce71d2705 arm64/traps: Handle GCS exceptions
A new exception code is defined for GCS specific faults other than
standard load/store faults, for example GCS token validation failures,
add handling for this. These faults are reported to userspace as
segfaults with code SEGV_CPERR (protection error), mirroring the
reporting for x86 shadow stack errors.

GCS faults due to memory load/store operations generate data aborts with
a flag set, these will be handled separately as part of the data abort
handling.

Since we do not currently enable GCS for EL1 we should not get any faults
there but while we're at it we wire things up there, treating any GCS
fault as fatal.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-19-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:37 +01:00
Mark Brown
eefc98711f arm64/hwcap: Add hwcap for GCS
Provide a hwcap to enable userspace to detect support for GCS.

Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-18-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:37 +01:00
Mark Brown
6497b66ba6 arm64/mm: Map pages for guarded control stack
Map pages flagged as being part of a GCS as such rather than using the
full set of generic VM flags.

This is done using a conditional rather than extending the size of
protection_map since that would make for a very sparse array.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-15-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:36 +01:00
Mark Brown
092055f150 arm64/mm: Allocate PIE slots for EL0 guarded control stack
Pages used for guarded control stacks need to be described to the hardware
using the Permission Indirection Extension, GCS is not supported without
PIE. In order to support copy on write for guarded stacks we allocate two
values, one for active GCSs and one for GCS pages marked as read only prior
to copy.

Since the actual effect is defined using PIE the specific bit pattern used
does not matter to the hardware but we choose two values which differ only
in PTE_WRITE in order to help share code with non-PIE cases.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-13-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:35 +01:00
Mark Brown
6487c96308 arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
Add a cpufeature for GCS, allowing other code to conditionally support it
at runtime.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-12-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:35 +01:00
Mark Brown
ff5181d8a2 arm64/gcs: Provide basic EL2 setup to allow GCS usage at EL0 and EL1
There is a control HCRX_EL2.GCSEn which must be set to allow GCS
features to take effect at lower ELs and also fine grained traps for GCS
usage at EL0 and EL1.  Configure all these to allow GCS usage by EL0 and
EL1.

We also initialise GCSCR_EL1 and GCSCRE0_EL1 to ensure that we can
execute function call instructions without faulting regardless of the
state when the kernel is started.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-11-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:35 +01:00
Mark Brown
d0aa2b4351 arm64/gcs: Provide put_user_gcs()
In order for EL1 to write to an EL0 GCS it must use the GCSSTTR instruction
rather than a normal STTR. Provide a put_user_gcs() which does this.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-10-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:34 +01:00
Mark Brown
dad947cc22 arm64/gcs: Add manual encodings of GCS instructions
Define C callable functions for GCS instructions used by the kernel. In
order to avoid ambitious toolchain requirements for GCS support these are
manually encoded, this means we have fixed register numbers which will be
a bit limiting for the compiler but none of these should be used in
sufficiently fast paths for this to be a problem.

Note that GCSSTTR is used to store to EL0.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-9-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:34 +01:00
Mark Brown
ce0641d48d arm64/sysreg: Add definitions for architected GCS caps
The architecture defines a format for guarded control stack caps, used
to mark the top of an unused GCS in order to limit the potential for
exploitation via stack switching. Add definitions associated with these.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-8-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:34 +01:00
Mark Brown
f645e888b1 arm64/mm: Restructure arch_validate_flags() for extensibility
Currently arch_validate_flags() is written in a very non-extensible
fashion, returning immediately if MTE is not supported and writing the MTE
check as a direct return. Since we will want to add more checks for GCS
refactor the existing code to be more extensible, no functional change
intended.

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-3-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:33 +01:00
Marc Zyngier
a1d402abf8 KVM: arm64: Fix kvm_has_feat*() handling of negative features
Oliver reports that the kvm_has_feat() helper is not behaviing as
expected for negative feature. On investigation, the main issue
seems to be caused by the following construct:

 #define get_idreg_field(kvm, id, fld)				\
 	(id##_##fld##_SIGNED ?					\
	 get_idreg_field_signed(kvm, id, fld) :			\
	 get_idreg_field_unsigned(kvm, id, fld))

where one side of the expression evaluates as something signed,
and the other as something unsigned. In retrospect, this is totally
braindead, as the compiler converts this into an unsigned expression.
When compared to something that is 0, the test is simply elided.

Epic fail. Similar issue exists in the expand_field_sign() macro.

The correct way to handle this is to chose between signed and unsigned
comparisons, so that both sides of the ternary expression are of the
same type (bool).

In order to keep the code readable (sort of), we introduce new
comparison primitives taking an operator as a parameter, and
rewrite the kvm_has_feat*() helpers in terms of these primitives.

Fixes: c62d7a23b9 ("KVM: arm64: Add feature checking helpers")
Reported-by: Oliver Upton <oliver.upton@linux.dev>
Tested-by: Oliver Upton <oliver.upton@linux.dev>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241002204239.2051637-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-03 19:35:27 +01:00
Mark Rutland
924725707d arm64: cputype: Add Neoverse-N3 definitions
Add cputype definitions for Neoverse-N3. These will be used for errata
detection in subsequent patches.

These values can be found in Table A-261 ("MIDR_EL1 bit descriptions")
in issue 02 of the Neoverse-N3 TRM, which can be found at:

  https://developer.arm.com/documentation/107997/0000/?lang=en

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240930111705.3352047-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-01 12:46:54 +01:00
Linus Torvalds
617a814f14 ALong with the usual shower of singleton patches, notable patch series in
this pull request are:
 
 "Align kvrealloc() with krealloc()" from Danilo Krummrich.  Adds
 consistency to the APIs and behaviour of these two core allocation
 functions.  This also simplifies/enables Rustification.
 
 "Some cleanups for shmem" from Baolin Wang.  No functional changes - mode
 code reuse, better function naming, logic simplifications.
 
 "mm: some small page fault cleanups" from Josef Bacik.  No functional
 changes - code cleanups only.
 
 "Various memory tiering fixes" from Zi Yan.  A small fix and a little
 cleanup.
 
 "mm/swap: remove boilerplate" from Yu Zhao.  Code cleanups and
 simplifications and .text shrinkage.
 
 "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt.  This
 is a feature, it adds new feilds to /proc/vmstat such as
 
     $ grep kstack /proc/vmstat
     kstack_1k 3
     kstack_2k 188
     kstack_4k 11391
     kstack_8k 243
     kstack_16k 0
 
 which tells us that 11391 processes used 4k of stack while none at all
 used 16k.  Useful for some system tuning things, but partivularly useful
 for "the dynamic kernel stack project".
 
 "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov.
 Teaches kmemleak to detect leaksage of percpu memory.
 
 "mm: memcg: page counters optimizations" from Roman Gushchin.  "3
 independent small optimizations of page counters".
 
 "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David
 Hildenbrand.  Improves PTE/PMD splitlock detection, makes powerpc/8xx work
 correctly by design rather than by accident.
 
 "mm: remove arch_make_page_accessible()" from David Hildenbrand.  Some
 folio conversions which make arch_make_page_accessible() unneeded.
 
 "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel.
 Cleans up and fixes our handling of the resetting of the cgroup/process
 peak-memory-use detector.
 
 "Make core VMA operations internal and testable" from Lorenzo Stoakes.
 Rationalizaion and encapsulation of the VMA manipulation APIs.  With a
 view to better enable testing of the VMA functions, even from a
 userspace-only harness.
 
 "mm: zswap: fixes for global shrinker" from Takero Funaki.  Fix issues in
 the zswap global shrinker, resulting in improved performance.
 
 "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao.  Fill in
 some missing info in /proc/zoneinfo.
 
 "mm: replace follow_page() by folio_walk" from David Hildenbrand.  Code
 cleanups and rationalizations (conversion to folio_walk()) resulting in
 the removal of follow_page().
 
 "improving dynamic zswap shrinker protection scheme" from Nhat Pham.  Some
 tuning to improve zswap's dynamic shrinker.  Significant reductions in
 swapin and improvements in performance are shown.
 
 "mm: Fix several issues with unaccepted memory" from Kirill Shutemov.
 Improvements to the new unaccepted memory feature,
 
 "mm/mprotect: Fix dax puds" from Peter Xu.  Implements mprotect on DAX
 PUDs.  This was missing, although nobody seems to have notied yet.
 
 "Introduce a store type enum for the Maple tree" from Sidhartha Kumar.
 Cleanups and modest performance improvements for the maple tree library
 code.
 
 "memcg: further decouple v1 code from v2" from Shakeel Butt.  Move more
 cgroup v1 remnants away from the v2 memcg code.
 
 "memcg: initiate deprecation of v1 features" from Shakeel Butt.  Adds
 various warnings telling users that memcg v1 features are deprecated.
 
 "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li.
 Greatly improves the success rate of the mTHP swap allocation.
 
 "mm: introduce numa_memblks" from Mike Rapoport.  Moves various disparate
 per-arch implementations of numa_memblk code into generic code.
 
 "mm: batch free swaps for zap_pte_range()" from Barry Song.  Greatly
 improves the performance of munmap() of swap-filled ptes.
 
 "support large folio swap-out and swap-in for shmem" from Baolin Wang.
 With this series we no longer split shmem large folios into simgle-page
 folios when swapping out shmem.
 
 "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao.  Nice performance
 improvements and code reductions for gigantic folios.
 
 "support shmem mTHP collapse" from Baolin Wang.  Adds support for
 khugepaged's collapsing of shmem mTHP folios.
 
 "mm: Optimize mseal checks" from Pedro Falcato.  Fixes an mprotect()
 performance regression due to the addition of mseal().
 
 "Increase the number of bits available in page_type" from Matthew Wilcox.
 Increases the number of bits available in page_type!
 
 "Simplify the page flags a little" from Matthew Wilcox.  Many legacy page
 flags are now folio flags, so the page-based flags and their
 accessors/mutators can be removed.
 
 "mm: store zero pages to be swapped out in a bitmap" from Usama Arif.  An
 optimization which permits us to avoid writing/reading zero-filled zswap
 pages to backing store.
 
 "Avoid MAP_FIXED gap exposure" from Liam Howlett.  Fixes a race window
 which occurs when a MAP_FIXED operqtion is occurring during an unrelated
 vma tree walk.
 
 "mm: remove vma_merge()" from Lorenzo Stoakes.  Major rotorooting of the
 vma_merge() functionality, making ot cleaner, more testable and better
 tested.
 
 "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.  Minor
 fixups of DAMON selftests and kunit tests.
 
 "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.  Code
 cleanups and folio conversions.
 
 "Shmem mTHP controls and stats improvements" from Ryan Roberts.  Cleanups
 for shmem controls and stats.
 
 "mm: count the number of anonymous THPs per size" from Barry Song.  Expose
 additional anon THP stats to userspace for improved tuning.
 
 "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio
 conversions and removal of now-unused page-based APIs.
 
 "replace per-quota region priorities histogram buffer with per-context
 one" from SeongJae Park.  DAMON histogram rationalization.
 
 "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae
 Park.  DAMON documentation updates.
 
 "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve
 related doc and warn" from Jason Wang: fixes usage of page allocator
 __GFP_NOFAIL and GFP_ATOMIC flags.
 
 "mm: split underused THPs" from Yu Zhao.  Improve THP=always policy - this
 was overprovisioning THPs in sparsely accessed memory areas.
 
 "zram: introduce custom comp backends API" frm Sergey Senozhatsky.  Add
 support for zram run-time compression algorithm tuning.
 
 "mm: Care about shadow stack guard gap when getting an unmapped area" from
 Mark Brown.  Fix up the various arch_get_unmapped_area() implementations
 to better respect guard areas.
 
 "Improve mem_cgroup_iter()" from Kinsey Ho.  Improve the reliability of
 mem_cgroup_iter() and various code cleanups.
 
 "mm: Support huge pfnmaps" from Peter Xu.  Extends the usage of huge
 pfnmap support.
 
 "resource: Fix region_intersects() vs add_memory_driver_managed()" from
 Huang Ying.  Fix a bug in region_intersects() for systems with CXL memory.
 
 "mm: hwpoison: two more poison recovery" from Kefeng Wang.  Teaches a
 couple more code paths to correctly recover from the encountering of
 poisoned memry.
 
 "mm: enable large folios swap-in support" from Barry Song.  Support the
 swapin of mTHP memory into appropriately-sized folios, rather than into
 single-page folios.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZu1BBwAKCRDdBJ7gKXxA
 jlWNAQDYlqQLun7bgsAN4sSvi27VUuWv1q70jlMXTfmjJAvQqwD/fBFVR6IOOiw7
 AkDbKWP2k0hWPiNJBGwoqxdHHx09Xgo=
 =s0T+
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series
  in this pull request are:

   - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
     consistency to the APIs and behaviour of these two core allocation
     functions. This also simplifies/enables Rustification.

   - "Some cleanups for shmem" from Baolin Wang. No functional changes -
     mode code reuse, better function naming, logic simplifications.

   - "mm: some small page fault cleanups" from Josef Bacik. No
     functional changes - code cleanups only.

   - "Various memory tiering fixes" from Zi Yan. A small fix and a
     little cleanup.

   - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
     simplifications and .text shrinkage.

   - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
     Butt. This is a feature, it adds new feilds to /proc/vmstat such as

       $ grep kstack /proc/vmstat
       kstack_1k 3
       kstack_2k 188
       kstack_4k 11391
       kstack_8k 243
       kstack_16k 0

     which tells us that 11391 processes used 4k of stack while none at
     all used 16k. Useful for some system tuning things, but
     partivularly useful for "the dynamic kernel stack project".

   - "kmemleak: support for percpu memory leak detect" from Pavel
     Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.

   - "mm: memcg: page counters optimizations" from Roman Gushchin. "3
     independent small optimizations of page counters".

   - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
     David Hildenbrand. Improves PTE/PMD splitlock detection, makes
     powerpc/8xx work correctly by design rather than by accident.

   - "mm: remove arch_make_page_accessible()" from David Hildenbrand.
     Some folio conversions which make arch_make_page_accessible()
     unneeded.

   - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
     Finkel. Cleans up and fixes our handling of the resetting of the
     cgroup/process peak-memory-use detector.

   - "Make core VMA operations internal and testable" from Lorenzo
     Stoakes. Rationalizaion and encapsulation of the VMA manipulation
     APIs. With a view to better enable testing of the VMA functions,
     even from a userspace-only harness.

   - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
     issues in the zswap global shrinker, resulting in improved
     performance.

   - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
     in some missing info in /proc/zoneinfo.

   - "mm: replace follow_page() by folio_walk" from David Hildenbrand.
     Code cleanups and rationalizations (conversion to folio_walk())
     resulting in the removal of follow_page().

   - "improving dynamic zswap shrinker protection scheme" from Nhat
     Pham. Some tuning to improve zswap's dynamic shrinker. Significant
     reductions in swapin and improvements in performance are shown.

   - "mm: Fix several issues with unaccepted memory" from Kirill
     Shutemov. Improvements to the new unaccepted memory feature,

   - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
     DAX PUDs. This was missing, although nobody seems to have notied
     yet.

   - "Introduce a store type enum for the Maple tree" from Sidhartha
     Kumar. Cleanups and modest performance improvements for the maple
     tree library code.

   - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
     more cgroup v1 remnants away from the v2 memcg code.

   - "memcg: initiate deprecation of v1 features" from Shakeel Butt.
     Adds various warnings telling users that memcg v1 features are
     deprecated.

   - "mm: swap: mTHP swap allocator base on swap cluster order" from
     Chris Li. Greatly improves the success rate of the mTHP swap
     allocation.

   - "mm: introduce numa_memblks" from Mike Rapoport. Moves various
     disparate per-arch implementations of numa_memblk code into generic
     code.

   - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
     improves the performance of munmap() of swap-filled ptes.

   - "support large folio swap-out and swap-in for shmem" from Baolin
     Wang. With this series we no longer split shmem large folios into
     simgle-page folios when swapping out shmem.

   - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
     performance improvements and code reductions for gigantic folios.

   - "support shmem mTHP collapse" from Baolin Wang. Adds support for
     khugepaged's collapsing of shmem mTHP folios.

   - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
     performance regression due to the addition of mseal().

   - "Increase the number of bits available in page_type" from Matthew
     Wilcox. Increases the number of bits available in page_type!

   - "Simplify the page flags a little" from Matthew Wilcox. Many legacy
     page flags are now folio flags, so the page-based flags and their
     accessors/mutators can be removed.

   - "mm: store zero pages to be swapped out in a bitmap" from Usama
     Arif. An optimization which permits us to avoid writing/reading
     zero-filled zswap pages to backing store.

   - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
     window which occurs when a MAP_FIXED operqtion is occurring during
     an unrelated vma tree walk.

   - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
     the vma_merge() functionality, making ot cleaner, more testable and
     better tested.

   - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
     Minor fixups of DAMON selftests and kunit tests.

   - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
     Code cleanups and folio conversions.

   - "Shmem mTHP controls and stats improvements" from Ryan Roberts.
     Cleanups for shmem controls and stats.

   - "mm: count the number of anonymous THPs per size" from Barry Song.
     Expose additional anon THP stats to userspace for improved tuning.

   - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
     folio conversions and removal of now-unused page-based APIs.

   - "replace per-quota region priorities histogram buffer with
     per-context one" from SeongJae Park. DAMON histogram
     rationalization.

   - "Docs/damon: update GitHub repo URLs and maintainer-profile" from
     SeongJae Park. DAMON documentation updates.

   - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
     improve related doc and warn" from Jason Wang: fixes usage of page
     allocator __GFP_NOFAIL and GFP_ATOMIC flags.

   - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
     This was overprovisioning THPs in sparsely accessed memory areas.

   - "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
     Add support for zram run-time compression algorithm tuning.

   - "mm: Care about shadow stack guard gap when getting an unmapped
     area" from Mark Brown. Fix up the various arch_get_unmapped_area()
     implementations to better respect guard areas.

   - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
     of mem_cgroup_iter() and various code cleanups.

   - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
     pfnmap support.

   - "resource: Fix region_intersects() vs add_memory_driver_managed()"
     from Huang Ying. Fix a bug in region_intersects() for systems with
     CXL memory.

   - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
     a couple more code paths to correctly recover from the encountering
     of poisoned memry.

   - "mm: enable large folios swap-in support" from Barry Song. Support
     the swapin of mTHP memory into appropriately-sized folios, rather
     than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area->pages[2] into xol_area->page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
2024-09-21 07:29:05 -07:00
Linus Torvalds
4a39ac5b7d Random number generator updates for Linux 6.12-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmboHyUACgkQSfxwEqXe
 A66wGQ/8DRIjBllwf1YuTWi4T6OcfoYxK6C9bXO6QPP5gzdTyFE9pvDuuPyad6+F
 FR086ydTHeodemz1dFiQCL9etcUaxo4+6FRKyXKF9/1ezGbTA5nJd0/fKJGlqbI2
 EoA4LNYHOsvCZk1BTpxRNWKeKphU9zQgQdSigy6Rx8p269UkGmIZjD1PtUc+vqfR
 Ox0dK/Cswyo236fRi5HzaoMntWI4vXgLfxty0e1R7tfbstkCxSKWAON1lo3uHgkA
 0HpJXWgWXAPt9gp++Fs/jGNpOqbt6IaKeV5f7CjYfvWhlFjNMhQxF+PbxknaZn/k
 K0gQsItOIoFTfbQdLDIdfnj9awMdLW8FB2A1WXHpNr9pVC4ickPb1bMTF/XRd0tm
 wBNu4BL0gklx6017KZg5uINMIduzMLGkBLRFiBW0en/sZMLTJTMg58BJn0CL1Pmh
 1ll/Q3ToSMHalvxU2OnJagTwh4fzzCEpK/hW9WiDO4jSCsMXyX0clinrCjNo1JfA
 tqgTWEy3uGtg+dg0Du9VD5JASbNQSJ0ZRnas5+qz10IRWWfTolrsk61dliXLQ4Sv
 tSryDtsE2znwJF1Krh4aHNSSVhD5/l/8QaXkf9aZc/kkaHxwsx83FuWnqw6nMz8c
 l4B2MbH0jUgsEqEyx+0iwk+FXE9kZKWumTVLjFZ6bRnq3q+uq0U=
 =mWCw
 -----END PGP SIGNATURE-----

Merge tag 'random-6.12-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull random number generator updates from Jason Donenfeld:
 "Originally I'd planned on sending each of the vDSO getrandom()
  architecture ports to their respective arch trees. But as we started
  to work on this, we found lots of interesting issues in the shared
  code and infrastructure, the fixes for which the various archs needed
  to base their work.

  So in the end, this turned into a nice collaborative effort fixing up
  issues and porting to 5 new architectures -- arm64, powerpc64,
  powerpc32, s390x, and loongarch64 -- with everybody pitching in and
  commenting on each other's code. It was a fun development cycle.

  This contains:

   - Numerous fixups to the vDSO selftest infrastructure, getting it
     running successfully on more platforms, and fixing bugs in it.

   - Additions to the vDSO getrandom & chacha selftests. Basically every
     time manual review unearthed a bug in a revision of an arch patch,
     or an ambiguity, the tests were augmented.

     By the time the last arch was submitted for review, s390x, v1 of
     the series was essentially fine right out of the gate.

   - Fixes to the the generic C implementation of vDSO getrandom, to
     build and run successfully on all archs, decoupling it from
     assumptions we had (unintentionally) made on x86_64 that didn't
     carry through to the other architectures.

   - Port of vDSO getrandom to LoongArch64, from Xi Ruoyao and acked by
     Huacai Chen.

   - Port of vDSO getrandom to ARM64, from Adhemerval Zanella and acked
     by Will Deacon.

   - Port of vDSO getrandom to PowerPC, in both 32-bit and 64-bit
     varieties, from Christophe Leroy and acked by Michael Ellerman.

   - Port of vDSO getrandom to S390X from Heiko Carstens, the arch
     maintainer.

  While it'd be natural for there to be things to fix up over the course
  of the development cycle, these patches got a decent amount of review
  from a fairly diverse crew of folks on the mailing lists, and, for the
  most part, they've been cooking in linux-next, which has been helpful
  for ironing out build issues.

  In terms of architectures, I think that mostly takes care of the
  important 64-bit archs with hardware still being produced and running
  production loads in settings where vDSO getrandom is likely to help.

  Arguably there's still RISC-V left, and we'll see for 6.13 whether
  they find it useful and submit a port"

* tag 'random-6.12-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (47 commits)
  selftests: vDSO: check cpu caps before running chacha test
  s390/vdso: Wire up getrandom() vdso implementation
  s390/vdso: Move vdso symbol handling to separate header file
  s390/vdso: Allow alternatives in vdso code
  s390/module: Provide find_section() helper
  s390/facility: Let test_facility() generate static branch if possible
  s390/alternatives: Remove ALT_FACILITY_EARLY
  s390/facility: Disable compile time optimization for decompressor code
  selftests: vDSO: fix vdso_config for s390
  selftests: vDSO: fix ELF hash table entry size for s390x
  powerpc/vdso: Wire up getrandom() vDSO implementation on VDSO64
  powerpc/vdso: Wire up getrandom() vDSO implementation on VDSO32
  powerpc/vdso: Refactor CFLAGS for CVDSO build
  powerpc/vdso32: Add crtsavres
  mm: Define VM_DROPPABLE for powerpc/32
  powerpc/vdso: Fix VDSO data access when running in a non-root time namespace
  selftests: vDSO: don't include generated headers for chacha test
  arm64: vDSO: Wire up getrandom() vDSO implementation
  arm64: alternative: make alternative_has_cap_likely() VDSO compatible
  selftests: vDSO: also test counter in vdso_test_chacha
  ...
2024-09-18 15:26:31 +02:00
Peter Xu
3e509c9b03 mm/arm64: support large pfn mappings
Support huge pfnmaps by using bit 56 (PTE_SPECIAL) for "special" on
pmds/puds.  Provide the pmd/pud helpers to set/get special bit.

There's one more thing missing for arm64 which is the pxx_pgprot() for
pmd/pud.  Add them too, which is mostly the same as the pte version by
dropping the pfn field.  These helpers are essential to be used in the new
follow_pfnmap*() API to report valid pgprot_t results.

Note that arm64 doesn't yet support huge PUD yet, but it's still
straightforward to provide the pud helpers that we need altogether.  Only
PMD helpers will make an immediate benefit until arm64 will support huge
PUDs first in general (e.g.  in THPs).

Link: https://lkml.kernel.org/r/20240826204353.2228736-19-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-17 01:07:00 -07:00
Peter Xu
0515e022e1 mm: always define pxx_pgprot()
There're:

  - 8 archs (arc, arm64, include, mips, powerpc, s390, sh, x86) that
  support pte_pgprot().

  - 2 archs (x86, sparc) that support pmd_pgprot().

  - 1 arch (x86) that support pud_pgprot().

Always define them to be used in generic code, and then we don't need to
fiddle with "#ifdef"s when doing so.

Link: https://lkml.kernel.org/r/20240826204353.2228736-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-17 01:06:59 -07:00
Linus Torvalds
64dd3b6a79 ARM:
* New Stage-2 page table dumper, reusing the main ptdump infrastructure
 
 * FP8 support
 
 * Nested virtualization now supports the address translation (FEAT_ATS1A)
   family of instructions
 
 * Add selftest checks for a bunch of timer emulation corner cases
 
 * Fix multiple cases where KVM/arm64 doesn't correctly handle the guest
   trying to use a GICv3 that wasn't advertised
 
 * Remove REG_HIDDEN_USER from the sysreg infrastructure, making
   things little simpler
 
 * Prevent MTE tags being restored by userspace if we are actively
   logging writes, as that's a recipe for disaster
 
 * Correct the refcount on a page that is not considered for MTE tag
   copying (such as a device)
 
 * When walking a page table to split block mappings, synchronize only
   at the end the walk rather than on every store
 
 * Fix boundary check when transfering memory using FFA
 
 * Fix pKVM TLB invalidation, only affecting currently out of tree
   code but worth addressing for peace of mind
 
 LoongArch:
 
 * Revert qspinlock to test-and-set simple lock on VM.
 
 * Add Loongson Binary Translation extension support.
 
 * Add PMU support for guest.
 
 * Enable paravirt feature control from VMM.
 
 * Implement function kvm_para_has_feature().
 
 RISC-V:
 
 * Fix sbiret init before forwarding to userspace
 
 * Don't zero-out PMU snapshot area before freeing data
 
 * Allow legacy PMU access from guest
 
 * Fix to allow hpmcounter31 from the guest
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmbmghAUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPFQgf+Ijeqlx90BGy96pyzo/NkYKPeEc8G
 gKhlm8PdtdZYaRdJ53MVRLLpzbLuzqbwrn0ZX2tvoDRLzuAqTt2GTFoT6e2HtY5B
 Sf7KQMFwHWGtGklC1EmZ1fXsCocswpuAcexCLKLRBoWUcKABlgwV3N3vJo5gx/Ag
 8XXhYpcLTh+p7bjMdJShQy019pTwEDE68pPVnL2NPzla1G6Qox7ZJIdOEMZXuyJA
 MJ4jbFWE/T8vLFUf/8MGQ/+bo+4140kzB8N9wkazNcBRoodY6Hx+Lm1LiZjNudO1
 ilIdB4P3Ht+D8UuBv2DO5XTakfJz9T9YsoRcPlwrOWi/8xBRbt236gFB3Q==
 =sHTI
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-non-x86' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "These are the non-x86 changes (mostly ARM, as is usually the case).
  The generic and x86 changes will come later"

  ARM:

   - New Stage-2 page table dumper, reusing the main ptdump
     infrastructure

   - FP8 support

   - Nested virtualization now supports the address translation
     (FEAT_ATS1A) family of instructions

   - Add selftest checks for a bunch of timer emulation corner cases

   - Fix multiple cases where KVM/arm64 doesn't correctly handle the
     guest trying to use a GICv3 that wasn't advertised

   - Remove REG_HIDDEN_USER from the sysreg infrastructure, making
     things little simpler

   - Prevent MTE tags being restored by userspace if we are actively
     logging writes, as that's a recipe for disaster

   - Correct the refcount on a page that is not considered for MTE tag
     copying (such as a device)

   - When walking a page table to split block mappings, synchronize only
     at the end the walk rather than on every store

   - Fix boundary check when transfering memory using FFA

   - Fix pKVM TLB invalidation, only affecting currently out of tree
     code but worth addressing for peace of mind

  LoongArch:

   - Revert qspinlock to test-and-set simple lock on VM.

   - Add Loongson Binary Translation extension support.

   - Add PMU support for guest.

   - Enable paravirt feature control from VMM.

   - Implement function kvm_para_has_feature().

  RISC-V:

   - Fix sbiret init before forwarding to userspace

   - Don't zero-out PMU snapshot area before freeing data

   - Allow legacy PMU access from guest

   - Fix to allow hpmcounter31 from the guest"

* tag 'for-linus-non-x86' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (64 commits)
  LoongArch: KVM: Implement function kvm_para_has_feature()
  LoongArch: KVM: Enable paravirt feature control from VMM
  LoongArch: KVM: Add PMU support for guest
  KVM: arm64: Get rid of REG_HIDDEN_USER visibility qualifier
  KVM: arm64: Simplify visibility handling of AArch32 SPSR_*
  KVM: arm64: Simplify handling of CNTKCTL_EL12
  LoongArch: KVM: Add vm migration support for LBT registers
  LoongArch: KVM: Add Binary Translation extension support
  LoongArch: KVM: Add VM feature detection function
  LoongArch: Revert qspinlock to test-and-set simple lock on VM
  KVM: arm64: Register ptdump with debugfs on guest creation
  arm64: ptdump: Don't override the level when operating on the stage-2 tables
  arm64: ptdump: Use the ptdump description from a local context
  arm64: ptdump: Expose the attribute parsing functionality
  KVM: arm64: Add memory length checks and remove inline in do_ffa_mem_xfer
  KVM: arm64: Move pagetable definitions to common header
  KVM: arm64: nv: Add support for FEAT_ATS1A
  KVM: arm64: nv: Plumb handling of AT S1* traps from EL2
  KVM: arm64: nv: Make AT+PAN instructions aware of FEAT_PAN3
  KVM: arm64: nv: Sanitise SCTLR_EL1.EPAN according to VM configuration
  ...
2024-09-16 07:38:18 +02:00
Linus Torvalds
114143a595 arm64 updates for 6.12
ACPI:
 * Enable PMCG erratum workaround for HiSilicon HIP10 and 11 platforms.
 * Ensure arm64-specific IORT header is covered by MAINTAINERS.
 
 CPU Errata:
 * Enable workaround for hardware access/dirty issue on Ampere-1A cores.
 
 Memory management:
 * Define PHYSMEM_END to fix a crash in the amdgpu driver.
 * Avoid tripping over invalid kernel mappings on the kexec() path.
 * Userspace support for the Permission Overlay Extension (POE) using
   protection keys.
 
 Perf and PMUs:
 * Add support for the "fixed instruction counter" extension in the CPU
   PMU architecture.
 * Extend and fix the event encodings for Apple's M1 CPU PMU.
 * Allow LSM hooks to decide on SPE permissions for physical profiling.
 * Add support for the CMN S3 and NI-700 PMUs.
 
 Confidential Computing:
 * Add support for booting an arm64 kernel as a protected guest under
   Android's "Protected KVM" (pKVM) hypervisor.
 
 Selftests:
 * Fix vector length issues in the SVE/SME sigreturn tests
 * Fix build warning in the ptrace tests.
 
 Timers:
 * Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
   non-determinism arising from the architected counter.
 
 Miscellaneous:
 * Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
   don't succeed.
 * Minor fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmbkVNEQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNKeIB/9YtbN7JMgsXktM94GP03r3tlFF36Y1S51S
 +zdDZclAVZCTCZN+PaFeAZ/+ah2EQYrY6rtDoHUSEMQdF9kH+ycuIPDTwaJ4Qkam
 QKXMpAgtY/4yf2rX4lhDF8rEvkhLDsu7oGDhqUZQsA33GrMBHfgA3oqpYwlVjvGq
 gkm7olTo9LdWAxkPpnjGrjB6Mv5Dq8dJRhW+0Q5AntI5zx3RdYGJZA9GUSzyYCCt
 FIYOtMmWPkQ0kKxIVxOxAOm/ubhfyCs2sjSfkaa3vtvtt+Yjye1Xd81rFciIbPgP
 QlK/Mes2kBZmjhkeus8guLI5Vi7tx3DQMkNqLXkHAAzOoC4oConE
 =6osL
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "The highlights are support for Arm's "Permission Overlay Extension"
  using memory protection keys, support for running as a protected guest
  on Android as well as perf support for a bunch of new interconnect
  PMUs.

  Summary:

  ACPI:
   - Enable PMCG erratum workaround for HiSilicon HIP10 and 11
     platforms.
   - Ensure arm64-specific IORT header is covered by MAINTAINERS.

  CPU Errata:
   - Enable workaround for hardware access/dirty issue on Ampere-1A
     cores.

  Memory management:
   - Define PHYSMEM_END to fix a crash in the amdgpu driver.
   - Avoid tripping over invalid kernel mappings on the kexec() path.
   - Userspace support for the Permission Overlay Extension (POE) using
     protection keys.

  Perf and PMUs:
   - Add support for the "fixed instruction counter" extension in the
     CPU PMU architecture.
   - Extend and fix the event encodings for Apple's M1 CPU PMU.
   - Allow LSM hooks to decide on SPE permissions for physical
     profiling.
   - Add support for the CMN S3 and NI-700 PMUs.

  Confidential Computing:
   - Add support for booting an arm64 kernel as a protected guest under
     Android's "Protected KVM" (pKVM) hypervisor.

  Selftests:
   - Fix vector length issues in the SVE/SME sigreturn tests
   - Fix build warning in the ptrace tests.

  Timers:
   - Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
     non-determinism arising from the architected counter.

  Miscellaneous:
   - Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
     don't succeed.
   - Minor fixes and cleanups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
  perf: arm-ni: Fix an NULL vs IS_ERR() bug
  arm64: hibernate: Fix warning for cast from restricted gfp_t
  arm64: esr: Define ESR_ELx_EC_* constants as UL
  arm64: pkeys: remove redundant WARN
  perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
  MAINTAINERS: List Arm interconnect PMUs as supported
  perf: Add driver for Arm NI-700 interconnect PMU
  dt-bindings/perf: Add Arm NI-700 PMU
  perf/arm-cmn: Improve format attr printing
  perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
  arm64/mm: use lm_alias() with addresses passed to memblock_free()
  mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
  arm64: Expose the end of the linear map in PHYSMEM_END
  arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
  arm64/mm: Delete __init region from memblock.reserved
  perf/arm-cmn: Support CMN S3
  dt-bindings: perf: arm-cmn: Add CMN S3
  perf/arm-cmn: Refactor DTC PMU register access
  perf/arm-cmn: Make cycle counts less surprising
  perf/arm-cmn: Improve build-time assertion
  ...
2024-09-16 06:55:07 +02:00
Adhemerval Zanella
712676ea2b arm64: vDSO: Wire up getrandom() vDSO implementation
Hook up the generic vDSO implementation to the aarch64 vDSO data page.
The _vdso_rng_data required data is placed within the _vdso_data vvar
page, by using a offset larger than the vdso_data.

The vDSO function requires a ChaCha20 implementation that does not write
to the stack, and that can do an entire ChaCha20 permutation.  The one
provided uses NEON on the permute operation, with a fallback to the
syscall for chips that do not support AdvSIMD.

This also passes the vdso_test_chacha test along with
vdso_test_getrandom. The vdso_test_getrandom bench-single result on
Neoverse-N1 shows:

   vdso: 25000000 times in 0.783884250 seconds
   libc: 25000000 times in 8.780275399 seconds
syscall: 25000000 times in 8.786581518 seconds

A small fixup to arch/arm64/include/asm/mman.h was required to avoid
pulling kernel code into the vDSO, similar to what's already done in
arch/arm64/include/asm/rwonce.h.

Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13 17:28:36 +02:00
Mark Rutland
2c2ca3416b arm64: alternative: make alternative_has_cap_likely() VDSO compatible
Currently alternative_has_cap_unlikely() can be used in VDSO code, but
alternative_has_cap_likely() cannot as it references alt_cb_patch_nops,
which is not available when linking the VDSO. This is unfortunate as it
would be useful to have alternative_has_cap_likely() available in VDSO
code.

The use of alt_cb_patch_nops was added in commit:

  d926079f17 ("arm64: alternatives: add shared NOP callback")

... as removing duplicate NOPs within the kernel Image saved areasonable
amount of space.

Given the VDSO code will have nowhere near as many alternative branches
as the main kernel image, this isn't much of a concern, and a few extra
nops isn't a massive problem.

Change alternative_has_cap_likely() to only use alt_cb_patch_nops for
the main kernel image, and allow duplicate NOPs in VDSO code.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-09-13 17:28:35 +02:00
Will Deacon
75078ba2b3 Merge branch 'for-next/timers' into for-next/core
* for-next/timers:
  arm64: Implement prctl(PR_{G,S}ET_TSC)
2024-09-12 13:44:03 +01:00
Will Deacon
982a847c71 Merge branch 'for-next/poe' into for-next/core
* for-next/poe: (31 commits)
  arm64: pkeys: remove redundant WARN
  kselftest/arm64: Add test case for POR_EL0 signal frame records
  kselftest/arm64: parse POE_MAGIC in a signal frame
  kselftest/arm64: add HWCAP test for FEAT_S1POE
  selftests: mm: make protection_keys test work on arm64
  selftests: mm: move fpregs printing
  kselftest/arm64: move get_header()
  arm64: add Permission Overlay Extension Kconfig
  arm64: enable PKEY support for CPUs with S1POE
  arm64: enable POE and PIE to coexist
  arm64/ptrace: add support for FEAT_POE
  arm64: add POE signal support
  arm64: implement PKEYS support
  arm64: add pte_access_permitted_no_overlay()
  arm64: handle PKEY/POE faults
  arm64: mask out POIndex when modifying a PTE
  arm64: convert protection key into vm_flags and pgprot values
  arm64: add POIndex defines
  arm64: re-order MTE VM_ flags
  arm64: enable the Permission Overlay Extension for EL0
  ...
2024-09-12 13:43:41 +01:00
Will Deacon
3175e051c3 Merge branch 'for-next/pkvm-guest' into for-next/core
* for-next/pkvm-guest:
  arm64: smccc: Reserve block of KVM "vendor" services for pKVM hypercalls
  drivers/virt: pkvm: Intercept ioremap using pKVM MMIO_GUARD hypercall
  arm64: mm: Add confidential computing hook to ioremap_prot()
  drivers/virt: pkvm: Hook up mem_encrypt API using pKVM hypercalls
  arm64: mm: Add top-level dispatcher for internal mem_encrypt API
  drivers/virt: pkvm: Add initial support for running as a protected guest
  firmware/smccc: Call arch-specific hook on discovering KVM services
2024-09-12 13:43:22 +01:00
Will Deacon
119e3eef32 Merge branch 'for-next/perf' into for-next/core
* for-next/perf: (33 commits)
  perf: arm-ni: Fix an NULL vs IS_ERR() bug
  perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
  MAINTAINERS: List Arm interconnect PMUs as supported
  perf: Add driver for Arm NI-700 interconnect PMU
  dt-bindings/perf: Add Arm NI-700 PMU
  perf/arm-cmn: Improve format attr printing
  perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
  perf/arm-cmn: Support CMN S3
  dt-bindings: perf: arm-cmn: Add CMN S3
  perf/arm-cmn: Refactor DTC PMU register access
  perf/arm-cmn: Make cycle counts less surprising
  perf/arm-cmn: Improve build-time assertion
  perf/arm-cmn: Ensure dtm_idx is big enough
  perf/arm-cmn: Fix CCLA register offset
  perf/arm-cmn: Refactor node ID handling. Again.
  drivers/perf: hisi_pcie: Export supported Root Ports [bdf_min, bdf_max]
  drivers/perf: hisi_pcie: Fix TLP headers bandwidth counting
  drivers/perf: hisi_pcie: Record hardware counts correctly
  drivers/perf: arm_spe: Use perf_allow_kernel() for permissions
  perf/dwc_pcie: Add support for QCOM vendor devices
  ...
2024-09-12 13:43:16 +01:00
Will Deacon
c2c9402369 Merge branch 'for-next/mm' into for-next/core
* for-next/mm:
  arm64/mm: use lm_alias() with addresses passed to memblock_free()
  mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
  arm64: Expose the end of the linear map in PHYSMEM_END
  arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
  arm64/mm: Delete __init region from memblock.reserved
2024-09-12 13:43:08 +01:00
Will Deacon
f661eb5f8d Merge branch 'for-next/misc' into for-next/core
* for-next/misc:
  arm64: hibernate: Fix warning for cast from restricted gfp_t
  arm64: esr: Define ESR_ELx_EC_* constants as UL
  arm64: Constify struct kobj_type
  arm64: smp: smp_send_stop() and crash_smp_send_stop() should try non-NMI first
  arm64/sve: Remove unused declaration read_smcr_features()
  arm64: mm: Remove unused declaration early_io_map()
  arm64: el2_setup.h: Rename some labels to be more diff-friendly
  arm64: signal: Fix some under-bracketed UAPI macros
  arm64/mm: Drop TCR_SMP_FLAGS
  arm64/mm: Drop PMD_SECT_VALID
2024-09-12 13:42:57 +01:00
Marc Zyngier
f625469051 Merge branch kvm-arm64/s2-ptdump into kvmarm-master/next
* kvm-arm64/s2-ptdump:
  : .
  : Stage-2 page table dumper, reusing the main ptdump infrastructure,
  : courtesy of Sebastian Ene. From the cover letter:
  :
  : "This series extends the ptdump support to allow dumping the guest
  : stage-2 pagetables. When CONFIG_PTDUMP_STAGE2_DEBUGFS is enabled, ptdump
  : registers the new following files under debugfs:
  : - /sys/debug/kvm/<guest_id>/stage2_page_tables
  : - /sys/debug/kvm/<guest_id>/stage2_levels
  : - /sys/debug/kvm/<guest_id>/ipa_range
  :
  : This allows userspace tools (eg. cat) to dump the stage-2 pagetables by
  : reading the 'stage2_page_tables' file.
  : [...]"
  : .
  KVM: arm64: Register ptdump with debugfs on guest creation
  arm64: ptdump: Don't override the level when operating on the stage-2 tables
  arm64: ptdump: Use the ptdump description from a local context
  arm64: ptdump: Expose the attribute parsing functionality
  KVM: arm64: Move pagetable definitions to common header

Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-09-12 08:38:02 +01:00
Marc Zyngier
2e0f239457 Merge branch kvm-arm64/nv-at-pan into kvmarm-master/next
* kvm-arm64/nv-at-pan:
  : .
  : Add NV support for the AT family of instructions, which mostly results
  : in adding a page table walker that deals with most of the complexity
  : of the architecture.
  :
  : From the cover letter:
  :
  : "Another task that a hypervisor supporting NV on arm64 has to deal with
  : is to emulate the AT instruction, because we multiplex all the S1
  : translations on a single set of registers, and the guest S2 is never
  : truly resident on the CPU.
  :
  : So given that we lie about page tables, we also have to lie about
  : translation instructions, hence the emulation. Things are made
  : complicated by the fact that guest S1 page tables can be swapped out,
  : and that our shadow S2 is likely to be incomplete. So while using AT
  : to emulate AT is tempting (and useful), it is not going to always
  : work, and we thus need a fallback in the shape of a SW S1 walker."
  : .
  KVM: arm64: nv: Add support for FEAT_ATS1A
  KVM: arm64: nv: Plumb handling of AT S1* traps from EL2
  KVM: arm64: nv: Make AT+PAN instructions aware of FEAT_PAN3
  KVM: arm64: nv: Sanitise SCTLR_EL1.EPAN according to VM configuration
  KVM: arm64: nv: Add SW walker for AT S1 emulation
  KVM: arm64: nv: Make ps_to_output_size() generally available
  KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
  KVM: arm64: nv: Add basic emulation of AT S1E2{R,W}
  KVM: arm64: nv: Add basic emulation of AT S1E1{R,W}P
  KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}
  KVM: arm64: nv: Honor absence of FEAT_PAN2
  KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor
  KVM: arm64: nv: Enforce S2 alignment when contiguous bit is set
  arm64: Add ESR_ELx_FSC_ADDRSZ_L() helper
  arm64: Add system register encoding for PSTATE.PAN
  arm64: Add PAR_EL1 field description
  arm64: Add missing APTable and TCR_ELx.HPD masks
  KVM: arm64: Make kvm_at() take an OP_AT_*

Signed-off-by: Marc Zyngier <maz@kernel.org>

# Conflicts:
#	arch/arm64/kvm/nested.c
2024-09-12 08:37:47 +01:00
Marc Zyngier
acf2ab2899 Merge branch kvm-arm64/vgic-sre-traps into kvmarm-master/next
* kvm-arm64/vgic-sre-traps:
  : .
  : Fix the multiple of cases where KVM/arm64 doesn't correctly
  : handle the guest trying to use a GICv3 that isn't advertised.
  :
  : From the cover letter:
  :
  : "It recently appeared that, when running on a GICv3-equipped platform
  : (which is what non-ancient arm64 HW has), *not* configuring a GICv3
  : for the guest could result in less than desirable outcomes.
  :
  : We have multiple issues to fix:
  :
  : - for registers that *always* trap (the SGI registers) or that *may*
  :   trap (the SRE register), we need to check whether a GICv3 has been
  :   instantiated before acting upon the trap.
  :
  : - for registers that only conditionally trap, we must actively trap
  :   them even in the absence of a GICv3 being instantiated, and handle
  :   those traps accordingly.
  :
  : - finally, ID registers must reflect the absence of a GICv3, so that
  :   we are consistent.
  :
  : This series goes through all these requirements. The main complexity
  : here is to apply a GICv3 configuration on the host in the absence of a
  : GICv3 in the guest. This is pretty hackish, but I don't have a much
  : better solution so far.
  :
  : As part of making wider use of of the trap bits, we fully define the
  : trap routing as per the architecture, something that we eventually
  : need for NV anyway."
  : .
  KVM: arm64: selftests: Cope with lack of GICv3 in set_id_regs
  KVM: arm64: Add selftest checking how the absence of GICv3 is handled
  KVM: arm64: Unify UNDEF injection helpers
  KVM: arm64: Make most GICv3 accesses UNDEF if they trap
  KVM: arm64: Honor guest requested traps in GICv3 emulation
  KVM: arm64: Add trap routing information for ICH_HCR_EL2
  KVM: arm64: Add ICH_HCR_EL2 to the vcpu state
  KVM: arm64: Zero ID_AA64PFR0_EL1.GIC when no GICv3 is presented to the guest
  KVM: arm64: Add helper for last ditch idreg adjustments
  KVM: arm64: Force GICv3 trap activation when no irqchip is configured on VHE
  KVM: arm64: Force SRE traps when SRE access is not enabled
  KVM: arm64: Move GICv3 trap configuration to kvm_calculate_traps()

Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-09-12 08:37:06 +01:00
Marc Zyngier
091258a0a0 Merge branch kvm-arm64/fpmr into kvmarm-master/next
* kvm-arm64/fpmr:
  : .
  : Add FP8 support to the KVM/arm64 floating point handling.
  :
  : This includes new ID registers (ID_AA64PFR2_EL1 ID_AA64FPFR0_EL1)
  : being made visible to guests, as well as a new confrol register
  : (FPMR) which gets context-switched.
  : .
  KVM: arm64: Expose ID_AA64PFR2_EL1 to userspace and guests
  KVM: arm64: Enable FP8 support when available and configured
  KVM: arm64: Expose ID_AA64FPFR0_EL1 as a writable ID reg
  KVM: arm64: Honor trap routing for FPMR
  KVM: arm64: Add save/restore support for FPMR
  KVM: arm64: Move FPMR into the sysreg array
  KVM: arm64: Add predicate for FPMR support in a VM
  KVM: arm64: Move SVCR into the sysreg array

Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-09-12 08:36:41 +01:00
Sebastian Ene
7c4f73548e KVM: arm64: Register ptdump with debugfs on guest creation
While arch/*/mem/ptdump handles the kernel pagetable dumping code,
introduce KVM/ptdump to show the guest stage-2 pagetables. The
separation is necessary because most of the definitions from the
stage-2 pagetable reside in the KVM path and we will be invoking
functionality specific to KVM. Introduce the PTDUMP_STAGE2_DEBUGFS config.

When a guest is created, register a new file entry under the guest
debugfs dir which allows userspace to show the contents of the guest
stage-2 pagetables when accessed.

[maz: moved function prototypes from kvm_host.h to kvm_mmu.h]

Signed-off-by: Sebastian Ene <sebastianene@google.com>
Reviewed-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/r/20240909124721.1672199-6-sebastianene@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-09-10 21:32:51 +01:00
Sebastian Ene
9182301a7b arm64: ptdump: Use the ptdump description from a local context
Rename the attributes description array to allow the parsing method
to use the description from a local context. To be able to do this,
store a pointer to the description array in the state structure. This
will allow for the later introduced callers (stage_2 ptdump) to specify
their own page table description format to the ptdump parser.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240909124721.1672199-4-sebastianene@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-09-10 21:32:51 +01:00
Sebastian Ene
acc3d3a817 arm64: ptdump: Expose the attribute parsing functionality
Reuse the descriptor parsing functionality to keep the same output format
as the original ptdump code. In order for this to happen, move the state
tracking objects into a common header.

[maz: Fixed note_page() stub as suggested by Will]

Signed-off-by: Sebastian Ene <sebastianene@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240909124721.1672199-3-sebastianene@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-09-10 21:31:59 +01:00
Anastasia Belova
b6db3eb6c3 arm64: esr: Define ESR_ELx_EC_* constants as UL
Add explicit casting to prevent expantion of 32th bit of
u32 into highest half of u64 in several places.

For example, in inject_abt64:
ESR_ELx_EC_DABT_LOW << ESR_ELx_EC_SHIFT = 0x24 << 26.
This operation's result is int with 1 in 32th bit.
While casting this value into u64 (esr is u64) 1
fills 32 highest bits.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Cc: <stable@vger.kernel.org>
Fixes: aa8eff9bfb ("arm64: KVM: fault injection into a guest")
Signed-off-by: Anastasia Belova <abelova@astralinux.ru>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/stable/20240910085016.32120-1-abelova%40astralinux.ru
Link: https://lore.kernel.org/r/20240910085016.32120-1-abelova@astralinux.ru
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-10 18:22:58 +01:00
Joey Gouly
10166c23f4 arm64: pkeys: remove redundant WARN
FEAT_PAN3 is present if FEAT_S1POE is, this WARN() was to represent that.
However execute_only_pkey() is always called by mmap(), even on a CPU without
POE support.

Rather than making the WARN() conditional, just delete it.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/linux-arm-kernel/CA+G9fYvarKEPN3u1Ogw2pcw4h6r3OMzg+5qJpYkAXRunAEF_0Q@mail.gmail.com/
Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240910105004.706981-1-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-10 18:19:59 +01:00
Sebastian Ene
29caeda359 KVM: arm64: Move pagetable definitions to common header
In preparation for using the stage-2 definitions in ptdump, move some of
these macros in the common header.

Signed-off-by: Sebastian Ene <sebastianene@google.com>
Link: https://lore.kernel.org/r/20240909124721.1672199-2-sebastianene@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-09-10 17:46:57 +01:00
D Scott Phillips
eeb8fdfcf0 arm64: Expose the end of the linear map in PHYSMEM_END
The memory hot-plug and resource management code needs to know the
largest address which can fit in the linear map, so set PHYSMEM_END for
that purpose.

This fixes a crash at boot when amdgpu tries to create
DEVICE_PRIVATE_MEMORY and is given a physical address by the resource
management code which is outside the range which can have a `struct
page`

 | Unable to handle kernel paging request at virtual address 000001ffa6000034
 | user pgtable: 4k pages, 48-bit VAs, pgdp=000008000287c000
 | [000001ffa6000034] pgd=0000000000000000, p4d=0000000000000000
 | Call trace:
 |  __init_zone_device_page.constprop.0+0x2c/0xa8
 |  memmap_init_zone_device+0xf0/0x210
 |  pagemap_range+0x1e0/0x410
 |  memremap_pages+0x18c/0x2e0
 |  devm_memremap_pages+0x30/0x90
 |  kgd2kfd_init_zone_device+0xf0/0x200 [amdgpu]
 |  amdgpu_device_ip_init+0x674/0x888 [amdgpu]
 |  amdgpu_device_init+0x7a4/0xea0 [amdgpu]
 |  amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu]
 |  amdgpu_pci_probe+0x1a0/0x560 [amdgpu]

Signed-off-by: D Scott Phillips <scott@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20240903164532.3874988-1-scott@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 16:39:58 +01:00
Joey Gouly
4afd00641b arm64: enable PKEY support for CPUs with S1POE
Now that PKEYs support has been implemented, enable it for CPUs that
support S1POE.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-23-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:54:05 +01:00
Joey Gouly
d0d6e7e081 arm64: enable POE and PIE to coexist
Permission Indirection Extension and Permission Overlay Extension can be
enabled independently.

When PIE is disabled and POE is enabled, the permissions set by POR_EL0 will be
applied on top of the permissions set in the PTE.

When both PIE and POE are enabled, the permissions set by POR_EL0 will be
applied on top of the permissions set by the PIRE0_EL1 register.
However PIRE0_EL1 has encodings that specifically enable and disable the
overlay from applying.

For example:
	0001	Read, Overlay applied.
	1000	Read, Overlay not applied.

Switch to using the 'Overlay applied' encodings in PIRE0_EL1, so that PIE and
POE can coexist.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-22-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:54:05 +01:00
Joey Gouly
9160f7e909 arm64: add POE signal support
Add PKEY support to signals, by saving and restoring POR_EL0 from the stackframe.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-20-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:54:05 +01:00
Joey Gouly
7f955be9f8 arm64: implement PKEYS support
Implement the PKEYS interface, using the Permission Overlay Extension.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-19-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:54:04 +01:00
Joey Gouly
fc2d9cd330 arm64: add pte_access_permitted_no_overlay()
We do not want take POE into account when clearing the MTE tags.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-18-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:54:04 +01:00
Joey Gouly
7f0ab60763 arm64: handle PKEY/POE faults
If a memory fault occurs that is due to an overlay/pkey fault, report that to
userspace with a SEGV_PKUERR.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-17-joey.gouly@arm.com
[will: Add ESR.FSC check to data abort handler]
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:53:44 +01:00
Joey Gouly
6580a36dd7 arm64: mask out POIndex when modifying a PTE
When a PTE is modified, the POIndex must be masked off so that it can be modified.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-16-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:52:41 +01:00
Joey Gouly
b3c03fe137 arm64: convert protection key into vm_flags and pgprot values
Modify arch_calc_vm_prot_bits() and vm_get_page_prot() such that the pkey
value is set in the vm_flags and then into the pgprot value.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20240822151113.1479789-15-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:52:41 +01:00
Joey Gouly
b66db4f3cc arm64: add POIndex defines
The 3-bit POIndex is stored in the PTE at bits 60..62.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-14-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:52:40 +01:00
Joey Gouly
bf83dae90f arm64: enable the Permission Overlay Extension for EL0
Expose a HWCAP and ID_AA64MMFR3_EL1_S1POE to userspace, so they can be used to
check if the CPU supports the feature.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-12-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:52:40 +01:00
Joey Gouly
b86c9bea63 KVM: arm64: Save/restore POE registers
Define the new system registers that POE introduces and context switch them.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240822151113.1479789-8-joey.gouly@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:52:39 +01:00
Joey Gouly
160a8e13de arm64: context switch POR_EL0 register
POR_EL0 is a register that can be modified by userspace directly,
so it must be context switched.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-7-joey.gouly@arm.com
[will: Dropped unnecessary isb()s]
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:52:18 +01:00
Joey Gouly
878c05e8ef arm64: disable trapping of POR_EL0 to EL2
Allow EL0 or EL1 to access POR_EL0 without being trapped to EL2.

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20240822151113.1479789-5-joey.gouly@arm.com
[will: Rename Lset_poe_fgt to Lskip_pie_fgt to ease merge with for-next/misc]
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04 12:47:10 +01:00
Will Deacon
cf19cc5764 Merge remote-tracking branch 'kvmarm/arm64-shared-6.12' into for-next/poe
Pull in the AT instruction conversion patch from the KVM arm64 tree, as
this is a shared dependency between the POE series from Joey and the AT
emulation series for Nested Virtualisation from Marc.
2024-09-04 11:15:52 +01:00
Mike Rapoport (Microsoft)
46bcce5031 arch, mm: move definition of node_data to generic code
Every architecture that supports NUMA defines node_data in the same way:

	struct pglist_data *node_data[MAX_NUMNODES];

No reason to keep multiple copies of this definition and its forward
declarations, especially when such forward declaration is the only thing
in include/asm/mmzone.h for many architectures.

Add definition and declaration of node_data to generic code and drop
architecture-specific versions.

Link: https://lkml.kernel.org/r/20240807064110.1003856-8-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU]
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Rob Herring (Arm) <robh@kernel.org>
Cc: Samuel Holland <samuel.holland@sifive.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03 21:15:28 -07:00
Will Deacon
c86fa3470c arm64: mm: Add confidential computing hook to ioremap_prot()
Confidential Computing environments such as pKVM and Arm's CCA
distinguish between shared (i.e. emulated) and private (i.e. assigned)
MMIO regions.

Introduce a hook into our implementation of ioremap_prot() so that MMIO
regions can be shared if necessary.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240830130150.8568-6-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-30 16:30:41 +01:00
Will Deacon
e7bafbf717 arm64: mm: Add top-level dispatcher for internal mem_encrypt API
Implementing the internal mem_encrypt API for arm64 depends entirely on
the Confidential Computing environment in which the kernel is running.

Introduce a simple dispatcher so that backend hooks can be registered
depending upon the environment in which the kernel finds itself.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240830130150.8568-4-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-30 16:30:41 +01:00
Will Deacon
a06c3fad49 drivers/virt: pkvm: Add initial support for running as a protected guest
Implement a pKVM protected guest driver to probe the presence of pKVM
and determine the memory protection granule using the HYP_MEMINFO
hypercall.

Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240830130150.8568-3-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-30 16:30:41 +01:00
Marc Zyngier
0ba5b4ba61 firmware/smccc: Call arch-specific hook on discovering KVM services
arm64 will soon require its own callback to initialise services
that are only available on this architecture. Introduce a hook
that can be overloaded by the architecture.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20240830130150.8568-2-will@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-30 16:30:41 +01:00
D Scott Phillips
db0d8a8434 arm64: errata: Enable the AC03_CPU_38 workaround for ampere1a
The ampere1a cpu is affected by erratum AC04_CPU_10 which is the same
bug as AC03_CPU_38. Add ampere1a to the AC03_CPU_38 workaround midr list.

Cc: <stable@vger.kernel.org>
Signed-off-by: D Scott Phillips <scott@os.amperecomputing.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240827211701.2216719-1-scott@os.amperecomputing.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-08-30 14:22:12 +01:00
Marc Zyngier
ff987ffc0c KVM: arm64: nv: Add support for FEAT_ATS1A
Handling FEAT_ATS1A (which provides the AT S1E{1,2}A instructions)
is pretty easy, as it is just the usual AT without the permission
check.

This basically amounts to plumbing the instructions in the various
dispatch tables, and handling FEAT_ATS1A being disabled in the
ID registers.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-08-30 12:04:20 +01:00
Marc Zyngier
97634dac19 KVM: arm64: nv: Make ps_to_output_size() generally available
Make this helper visible to at.c, we are going to need it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-08-30 12:04:20 +01:00
Marc Zyngier
be04cebf3e KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
On the face of it, AT S12E{0,1}{R,W} is pretty simple. It is the
combination of AT S1E{0,1}{R,W}, followed by an extra S2 walk.

However, there is a great deal of complexity coming from combining
the S1 and S2 attributes to report something consistent in PAR_EL1.

This is an absolute mine field, and I have a splitting headache.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-08-30 12:04:20 +01:00