Commit Graph

50042 Commits

Author SHA1 Message Date
Li Chen
fbc2010d92 x86/smpboot: moves x86_topology to static initialize and truncate
The #ifdeffery and the initializers in build_sched_topology() are just
disgusting.

Statically initialize the domain levels in the topology array and let
build_sched_topology() invalidate the package domain level when NUMA in
package is available.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250710105715.66594-4-me@linux.beauty
2025-07-14 10:59:34 +02:00
Li Chen
992de2b025 x86/smpboot: remove redundant CONFIG_SCHED_SMT
On x86 CONFIG_SCHED_SMT is default y if SMP is enabled, so let's
simply drop CONFIG_SCHED_SMT.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250710105715.66594-3-me@linux.beauty
2025-07-14 10:59:34 +02:00
Li Chen
e075f43609 smpboot: introduce SDTL_INIT() helper to tidy sched topology setup
Define a small SDTL_INIT(maskfn, flagsfn, name) macro and use it to build the
sched_domain_topology_level array. Purely a cleanup; behaviour is unchanged.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250710105715.66594-2-me@linux.beauty
2025-07-14 10:59:34 +02:00
Tiwei Bie
f7e9077a16 um: Stop tracking stub's PID via userspace_pid[]
The PID of the stub process can be obtained from current_mm_id().
There is no need to track it via userspace_pid[]. Stop doing that
to simplify the code.

Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20250711065021.2535362-4-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-13 19:42:49 +02:00
Linus Torvalds
5d5d62298b - Update Kirill's email address
- Allow hugetlb PMD sharing only on 64-bit as it doesn't make a whole lotta
   sense on 32-bit
 
 - Add fixes for a misconfigured AMD Zen2 client which wasn't even supposed to
   run Linux
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhziVAACgkQEsHwGGHe
 VUoPLRAAqnv0D8pKO/UPUp05bOvKvEvYarK3Va4MV1QrqOgPIvabGbJOzYStU9+Q
 4FZ5ZCZbi0eV1sZuNP1Zk7Ryp/bYipR6gLX+jg06VTXXTrjKnUN3ofBLPQf0+fE9
 AZDShoSjS+6ifzt6BaUWW3uDMLOzwv50X/xwtLG+Nrprshs7HzfvJq2oFFQX6drQ
 kg7Cj9N8WNHl1kp6CVy2DXVRzv4VR9+yxeNfOCPOJiCVEmzRlMulPzQYWWagYidB
 +U+IYDJiG2p8YNL9aiCmnrRNpSfA4Podn8ZJVPKDwXSpmuUmfcLPur0c1Tjt97h0
 85ovsJs+RqBBzD3ixkbNSpdNLRBFVX7q5mx4n4+1DuR5ygZrBbDyjZce9gwY2YPh
 h1c2dnxxxkp9LAnBFJcaWiv8jzScRRbkqwHprBidkCS4plJiGhrD6MC78fMf8kE5
 i+dydBefrsYnBwe3ciyCZh/fCvPHk6OmegSdT1+0jlz2YJOGlD1uSPSMeE6YFFvW
 64R7MV3BLllBkpRx57zafx8tsdRiH9mZM6naltlcOcQV3JkHZhDJ5aNkfvBJpXw3
 RZtHHAG0noCGSMAhl/crQUkZOany2TdkDQn6SQpcM+iY/E0OQSH+/QM4Rw/s+05/
 FEtmkC2FqJ00RODhzSqKIwkKSgiwSCokR4pty5OJqBirUqDFNiw=
 =d3UC
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Update Kirill's email address

 - Allow hugetlb PMD sharing only on 64-bit as it doesn't make a whole
   lotta sense on 32-bit

 - Add fixes for a misconfigured AMD Zen2 client which wasn't even
   supposed to run Linux

* tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Update Kirill Shutemov's email address for TDX
  x86/mm: Disable hugetlb page table sharing on 32-bit
  x86/CPU/AMD: Disable INVLPGB on Zen2
  x86/rdrand: Disable RDSEED on AMD Cyan Skillfish
2025-07-13 10:41:19 -07:00
David Kaplan
a026dc61cf x86/bugs: Print enabled attack vectors
Print the status of enabled attack vectors and SMT mitigation status in the
boot log for easier reporting and debugging.  This information will also be
available through sysfs.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-21-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
6b21d2f0dc x86/bugs: Add attack vector controls for TSA
Use attack vector controls to determine which TSA mitigation to use.

  [ bp: Simplify the condition in the select function for better
    readability. ]

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250709155844.3279471-1-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
02c7d5b8e0 x86/pti: Add attack vector controls for PTI
Disable PTI mitigation if user->kernel attack vector mitigations are
disabled.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-20-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
0cdd2c4f35 x86/bugs: Add attack vector controls for ITS
Use attack vector controls to determine if ITS mitigation is required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-19-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
eda718fde6 x86/bugs: Add attack vector controls for SRSO
Use attack vector controls to determine if SRSO mitigation is required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-18-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
2f970a5269 x86/bugs: Add attack vector controls for L1TF
Use attack vector controls to determine if L1TF mitigation is required.

Disable SMT if cross-thread protection is desired.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-17-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
fdf99228e2 x86/bugs: Add attack vector controls for spectre_v2
Use attack vector controls to determine if spectre_v2 mitigation is
required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-16-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
ddcd4d3cb3 x86/bugs: Add attack vector controls for BHI
Use attack vector controls to determine if BHI mitigation is required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-15-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
07a659edcf x86/bugs: Add attack vector controls for spectre_v2_user
Use attack vector controls to determine if spectre_v2_user mitigation is
required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-14-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
9687eb2399 x86/bugs: Add attack vector controls for retbleed
Use attack vector controls to determine if retbleed mitigation is
required.

Disable SMT if cross-thread protection is desired and STIBP is not
available.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-13-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
19a5f3ea43 x86/bugs: Add attack vector controls for spectre_v1
Use attack vector controls to determine if spectre_v1 mitigation is
required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-12-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
8c7261abcb x86/bugs: Add attack vector controls for GDS
Use attack vector controls to determine if GDS mitigation is required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-11-david.kaplan@amd.com
2025-07-11 17:56:41 +02:00
David Kaplan
71dc301c26 x86/bugs: Add attack vector controls for SRBDS
Use attack vector controls to determine if SRBDS mitigation is required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-10-david.kaplan@amd.com
2025-07-11 17:56:40 +02:00
David Kaplan
54b53dca65 x86/bugs: Add attack vector controls for RFDS
Use attack vector controls to determine if RFDS mitigation is required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-9-david.kaplan@amd.com
2025-07-11 17:56:40 +02:00
David Kaplan
de6f0921ba x86/bugs: Add attack vector controls for MMIO
Use attack vectors controls to determine if MMIO mitigation is required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-8-david.kaplan@amd.com
2025-07-11 17:56:40 +02:00
David Kaplan
736565d4ed x86/bugs: Add attack vector controls for TAA
Use attack vector controls to determine if TAA mitigation is required.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-7-david.kaplan@amd.com
2025-07-11 17:56:40 +02:00
David Kaplan
e3a88d4c06 x86/bugs: Add attack vector controls for MDS
Use attack vector controls to determine if MDS mitigation is required.
The global mitigations=off command now simply disables all attack vectors
so explicit checking of mitigations=off is no longer needed.

If cross-thread attack mitigations are required, disable SMT.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-6-david.kaplan@amd.com
2025-07-11 17:56:40 +02:00
David Kaplan
2d31d28746 x86/bugs: Define attack vectors relevant for each bug
Add a function which defines which vulnerabilities should be mitigated
based on the selected attack vector controls.  The selections here are
based on the individual characteristics of each vulnerability.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-5-david.kaplan@amd.com
2025-07-11 17:56:40 +02:00
David Kaplan
735e59204b x86/Kconfig: Add arch attack vector support
ARCH_HAS_CPU_ATTACK_VECTORS should be set for architectures which implement
the new attack-vector based controls for CPU mitigations.  If an arch does
not support attack-vector based controls then an attempt to use them
results in a warning.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250707183316.1349127-4-david.kaplan@amd.com
2025-07-11 17:56:40 +02:00
Johannes Berg
ac1ad16f10 um: simplify syscall header files
Since Thomas's recent commit 2af10530639b ("um/x86: Add
system call table to header file") , we now have two
extern declarations of the syscall table, one internal
and one external, and they don't even match on 32-bit.
Clean this up and remove all the extra code.

Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/20250704141243.a68366f6acc3.If8587a4aafdb90644fc6d0b2f5e31a2d1887915f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-11 08:49:02 +02:00
Thomas Weißschuh
32a15664ef um/x86: Add system call table to header file
The generic system call tracing infrastructure requires access to the
system call table. The symbol is already visible to the linker but is
lacking a public declaration.

Add a public declaration.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://patch.msgid.link/20250703-uml-have_syscall_tracepoints-v1-1-23c1d3808578@linutronix.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-07-11 08:49:02 +02:00
Neeraj Upadhyay
b95a9d3136 x86/apic: Rename 'reg_off' to 'reg'
Rename the 'reg_off' parameter of apic_{set|get}_reg() to 'reg' to
match other usages in apic.h.

No functional change intended.

Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-15-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:44 -07:00
Neeraj Upadhyay
17776e6c20 x86/apic: KVM: Move apic_test)vector() to common code
Move apic_test_vector() to apic.h in order to reuse it in the Secure AVIC
guest APIC driver in later patches to test vector state in the APIC
backing page.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-14-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:43 -07:00
Neeraj Upadhyay
fe954bcd57 x86/apic: KVM: Move lapic set/clear_vector() helpers to common code
Move apic_clear_vector() and apic_set_vector() helper functions to
apic.h in order to reuse them in the Secure AVIC guest APIC driver
in later patches to atomically set/clear vectors in the APIC backing
page.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-13-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:43 -07:00
Neeraj Upadhyay
3d3a9083da x86/apic: KVM: Move lapic get/set helpers to common code
Move the apic_get_reg(), apic_set_reg(), apic_get_reg64() and
apic_set_reg64() helper functions to apic.h in order to reuse them in the
Secure AVIC guest APIC driver in later patches to read/write registers
from/to the APIC backing page.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-12-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:42 -07:00
Neeraj Upadhyay
39e81633f6 x86/apic: KVM: Move apic_find_highest_vector() to a common header
In preparation for using apic_find_highest_vector() in Secure AVIC
guest APIC driver, move it and associated macros to apic.h.

No functional change intended.

Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-11-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:41 -07:00
Neeraj Upadhyay
b5f8980f29 KVM: x86: Rename lapic set/clear vector helpers
In preparation for moving kvm-internal kvm_lapic_set_vector(),
kvm_lapic_clear_vector() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-10-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:41 -07:00
Neeraj Upadhyay
9c23bc4fec KVM: x86: Rename lapic get/set_reg64() helpers
In preparation for moving kvm-internal __kvm_lapic_set_reg64(),
__kvm_lapic_get_reg64() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-9-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:40 -07:00
Neeraj Upadhyay
b9bd231913 KVM: x86: Rename lapic get/set_reg() helpers
In preparation for moving kvm-internal __kvm_lapic_set_reg(),
__kvm_lapic_get_reg() to apic.h for use in Secure AVIC APIC driver,
rename them as part of the APIC API.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-8-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:39 -07:00
Neeraj Upadhyay
bdaccfe4e5 KVM: x86: Rename find_highest_vector()
In preparation for moving kvm-internal find_highest_vector() to
apic.h for use in Secure AVIC APIC driver, rename find_highest_vector()
to apic_find_highest_vector() as part of the APIC API.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-7-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:39 -07:00
Neeraj Upadhyay
e2fa7905b2 KVM: x86: Change lapic regs base address to void pointer
Change APIC base address from "char *" to "void *" in KVM
lapic's set/get helper functions. Pointer arithmetic for "void *"
and "char *" operate identically. With "void *" there is less
of a chance of doing the wrong thing, e.g. neglecting to cast and
reading a byte instead of the desired APIC register size.

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-6-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:38 -07:00
Neeraj Upadhyay
9cbb5fd156 KVM: x86: Rename VEC_POS/REG_POS macro usages
In preparation for moving most of the KVM's lapic helpers which
use VEC_POS/REG_POS macros to common APIC header for use in Secure
AVIC APIC driver, rename all VEC_POS/REG_POS macro usages to
APIC_VECTOR_TO_BIT_NUMBER/APIC_VECTOR_TO_REG_OFFSET and remove
VEC_POS/REG_POS.

While at it, clean up line wrap in find_highest_vector().

No functional change intended.

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-5-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:37 -07:00
Sean Christopherson
dc98e3bd49 x86/apic: KVM: Deduplicate APIC vector => register+bit math
Consolidate KVM's {REG,VEC}_POS() macros and lapic_vector_set_in_irr()'s
open coded equivalent logic in anticipation of the kernel gaining more
usage of vector => reg+bit lookups.

Use lapic_vector_set_in_irr()'s math as using divides for both the bit
number and register offset makes it easier to connect the dots, and for at
least one user, fixup_irqs(), "/ 32 * 0x10" generates ever so slightly
better code with gcc-14 (shaves a whole 3 bytes from the code stream):

((v) >> 5) << 4:
  c1 ef 05           shr    $0x5,%edi
  c1 e7 04           shl    $0x4,%edi
  81 c7 00 02 00 00  add    $0x200,%edi

(v) / 32 * 0x10:
  c1 ef 05           shr    $0x5,%edi
  83 c7 20           add    $0x20,%edi
  c1 e7 04           shl    $0x4,%edi

Keep KVM's tersely named macros as "wrappers" to avoid unnecessary churn
in KVM, and because the shorter names yield more readable code overall in
KVM.

The new macros type cast the vector parameter to "unsigned int". This is
required from better code generation for cases where an "int" is passed
to these macros in KVM code.

int v;

((v) >> 5) << 4:

  c1 f8 05    sar    $0x5,%eax
  c1 e0 04    shl    $0x4,%eax

((v) / 32 * 0x10):

  85 ff       test   %edi,%edi
  8d 47 1f    lea    0x1f(%rdi),%eax
  0f 49 c7    cmovns %edi,%eax
  c1 f8 05    sar    $0x5,%eax
  c1 e0 04    shl    $0x4,%eax

((unsigned int)(v) / 32 * 0x10):

  c1 f8 05    sar    $0x5,%eax
  c1 e0 04    shl    $0x4,%eax

(v) & (32 - 1):

  89 f8       mov    %edi,%eax
  83 e0 1f    and    $0x1f,%eax

(v) % 32

  89 fa       mov    %edi,%edx
  c1 fa 1f    sar    $0x1f,%edx
  c1 ea 1b    shr    $0x1b,%edx
  8d 04 17    lea    (%rdi,%rdx,1),%eax
  83 e0 1f    and    $0x1f,%eax
  29 d0       sub    %edx,%eax

(unsigned int)(v) % 32:

  89 f8       mov    %edi,%eax
  83 e0 1f    and    $0x1f,%eax

Overall kvm.ko text size is impacted if "unsigned int" is not used.

Bin      Orig     New (w/o unsigned int)  New (w/ unsigned int)

lapic.o  28580        28772                 28580
kvm.o    670810       671002                670810
kvm.ko   708079       708271                708079

No functional change intended.

[Neeraj: Type cast vec macro param to "unsigned int", provide data
         in commit log on "unsigned int" requirement]

Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-4-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:37 -07:00
Neeraj Upadhyay
3fb7b83e2a KVM: x86: Remove redundant parentheses around 'bitmap'
When doing pointer arithmetic in apic_test_vector() and
kvm_lapic_{set|clear}_vector(), remove the unnecessary
parentheses surrounding the 'bitmap' parameter.

No functional change intended.

Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250709033242.267892-3-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:36 -07:00
Neeraj Upadhyay
ac48017020 KVM: x86: Open code setting/clearing of bits in the ISR
Remove __apic_test_and_set_vector() and __apic_test_and_clear_vector(),
because the _only_ register that's safe to modify with a non-atomic
operation is ISR, because KVM isn't running the vCPU, i.e. hardware can't
service an IRQ or process an EOI for the relevant (virtual) APIC.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250709033242.267892-2-Neeraj.Upadhyay@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:36 -07:00
Kevin Loughlin
a77896eea3 KVM: SEV: Prefer WBNOINVD over WBINVD for cache maintenance efficiency
AMD CPUs currently execute WBINVD in the host when unregistering SEV
guest memory or when deactivating SEV guests. Such cache maintenance is
performed to prevent data corruption, wherein the encrypted (C=1)
version of a dirty cache line might otherwise only be written back
after the memory is written in a different context (ex: C=0), yielding
corruption. However, WBINVD is performance-costly, especially because
it invalidates processor caches.

Strictly-speaking, unless the SEV ASID is being recycled (meaning the
SNP firmware requires the use of WBINVD prior to DF_FLUSH), the cache
invalidation triggered by WBINVD is unnecessary; only the writeback is
needed to prevent data corruption in remaining scenarios.

To improve performance in these scenarios, use WBNOINVD when available
instead of WBINVD. WBNOINVD still writes back all dirty lines
(preventing host data corruption by SEV guests) but does *not*
invalidate processor caches. Note that the implementation of wbnoinvd()
ensures fall back to WBINVD if WBNOINVD is unavailable.

In anticipation of forthcoming optimizations to limit the WBNOINVD only
to physical CPUs that have executed SEV guests, place the call to
wbnoinvd_on_all_cpus() in a wrapper function sev_writeback_caches().

Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250201000259.3289143-3-kevinloughlin@google.com
[sean: tweak comment regarding CLFUSH]
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:44:19 -07:00
Zheyun Shen
7e00013bd3 KVM: SVM: Remove wbinvd in sev_vm_destroy()
Before sev_vm_destroy() is called, kvm_arch_guest_memory_reclaimed()
has been called for SEV and SEV-ES and kvm_arch_gmem_invalidate()
has been called for SEV-SNP. These functions have already handled
flushing the memory. Therefore, this wbinvd_on_all_cpus() can
simply be dropped.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:43:43 -07:00
Sean Christopherson
55aed8c2db KVM: x86: Use wbinvd_on_cpu() instead of an open-coded equivalent
Use wbinvd_on_cpu() to target a single CPU instead of open-coding an
equivalent, and drop KVM's wbinvd_ipi() now that all users have switched
to x86 library versions.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250522233733.3176144-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-10 09:43:43 -07:00
Linus Torvalds
73d7cf0710 ARM:
- Remove the last leftovers of the ill-fated FPSIMD host state
   mapping at EL2 stage-1
 
 - Fix unexpected advertisement to the guest of unimplemented S2 base
   granule sizes
 
 - Gracefully fail initialising pKVM if the interrupt controller isn't
   GICv3
 
 - Also gracefully fail initialising pKVM if the carveout allocation
   fails
 
 - Fix the computing of the minimum MMIO range required for the host on
   stage-2 fault
 
 - Fix the generation of the GICv3 Maintenance Interrupt in nested mode
 
 x86:
 
 - Reject SEV{-ES} intra-host migration if one or more vCPUs are actively
   being created, so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM.
 
 - Use a pre-allocated, per-vCPU buffer for handling de-sparsification of
   vCPU masks in Hyper-V hypercalls; fixes a "stack frame too large" issue.
 
 - Allow out-of-range/invalid Xen event channel ports when configuring IRQ
   routing, to avoid dictating a specific ioctl() ordering to userspace.
 
 - Conditionally reschedule when setting memory attributes to avoid soft
   lockups when userspace converts huge swaths of memory to/from private.
 
 - Add back MWAIT as a required feature for the MONITOR/MWAIT selftest.
 
 - Add a missing field in struct sev_data_snp_launch_start that resulted in
   the guest-visible workarounds field being filled at the wrong offset.
 
 - Skip non-canonical address when processing Hyper-V PV TLB flushes to avoid
   VM-Fail on INVVPID.
 
 - Advertise supported TDX TDVMCALLs to userspace.
 
 - Pass SetupEventNotifyInterrupt arguments to userspace.
 
 - Fix TSC frequency underflow.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmhurKgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNxHggApTP4vw+oOzfN7UoNmgR9XZMI1p2a
 R8AzQ1zDyVbEVWq3xTKvXtld+dKeO0yKB/XeI/1JLck1OiHxY57I3X6k5AnsurEr
 CBzeAhAjXivF8woMgmlP+30aqpomcPACdQm0gRnWkRDDJfXqSUas/iE/s9Ct1dT4
 4w3PtFLsSsU8vX/RttR+CqF1AQ6SeV/NRvA8hzPGMGZoQ2um74j4ZsM/3xh77Kdw
 Z2vOnZOIA4dk0074JjO/Yb9l00Ib4hn+MWG5jVJ+6i2HRRYd2knnB29apVS/ARdL
 X20j+LvtYj/jrPPdYwqjvxbIXyLbJrLCZyjKhfueN+rnisPNvzR+7YE4ZQ==
 =NduO
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "Many patches, pretty much all of them small, that accumulated while I
  was on vacation.

  ARM:

   - Remove the last leftovers of the ill-fated FPSIMD host state
     mapping at EL2 stage-1

   - Fix unexpected advertisement to the guest of unimplemented S2 base
     granule sizes

   - Gracefully fail initialising pKVM if the interrupt controller isn't
     GICv3

   - Also gracefully fail initialising pKVM if the carveout allocation
     fails

   - Fix the computing of the minimum MMIO range required for the host
     on stage-2 fault

   - Fix the generation of the GICv3 Maintenance Interrupt in nested
     mode

  x86:

   - Reject SEV{-ES} intra-host migration if one or more vCPUs are
     actively being created, so as not to create a non-SEV{-ES} vCPU in
     an SEV{-ES} VM

   - Use a pre-allocated, per-vCPU buffer for handling de-sparsification
     of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too
     large" issue

   - Allow out-of-range/invalid Xen event channel ports when configuring
     IRQ routing, to avoid dictating a specific ioctl() ordering to
     userspace

   - Conditionally reschedule when setting memory attributes to avoid
     soft lockups when userspace converts huge swaths of memory to/from
     private

   - Add back MWAIT as a required feature for the MONITOR/MWAIT selftest

   - Add a missing field in struct sev_data_snp_launch_start that
     resulted in the guest-visible workarounds field being filled at the
     wrong offset

   - Skip non-canonical address when processing Hyper-V PV TLB flushes
     to avoid VM-Fail on INVVPID

   - Advertise supported TDX TDVMCALLs to userspace

   - Pass SetupEventNotifyInterrupt arguments to userspace

   - Fix TSC frequency underflow"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: avoid underflow when scaling TSC frequency
  KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()
  KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes
  KVM: arm64: Don't free hyp pages with pKVM on GICv2
  KVM: arm64: Fix error path in init_hyp_mode()
  KVM: arm64: Adjust range correctly during host stage-2 faults
  KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi()
  KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
  KVM: SVM: Add missing member in SNP_LAUNCH_START command structure
  Documentation: KVM: Fix unexpected unindent warnings
  KVM: selftests: Add back the missing check of MONITOR/MWAIT availability
  KVM: Allow CPU to reschedule while setting per-page memory attributes
  KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
  KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
  KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
  KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
  KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
  KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
2025-07-10 09:06:53 -07:00
Zheyun Shen
4fdc3431e0 x86/lib: Add WBINVD and WBNOINVD helpers to target multiple CPUs
Extract KVM's open-coded calls to do writeback caches on multiple CPUs to
common library helpers for both WBINVD and WBNOINVD (KVM will use both).
Put the onus on the caller to check for a non-empty mask to simplify the
SMP=n implementation, e.g. so that it doesn't need to check that the one
and only CPU in the system is present in the mask.

  [sean: move to lib, add SMP=n helpers, clarify usage]

Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250128015345.7929-2-szy0127@sjtu.edu.cn
Link: https://lore.kernel.org/20250522233733.3176144-5-seanjc@google.com
2025-07-10 13:30:17 +02:00
Kevin Loughlin
07f99c3fbe x86/lib: Add WBNOINVD helper functions
In line with WBINVD usage, add WBNOINVD helper functions.  Explicitly fall
back to WBINVD (via alternative()) if WBNOINVD isn't supported even though
the instruction itself is backwards compatible (WBNOINVD is WBINVD with an
ignored REP prefix), so that disabling X86_FEATURE_WBNOINVD behaves as one
would expect, e.g. in case there's a hardware issue that affects WBNOINVD.

Opportunistically, add comments explaining the architectural behavior of
WBINVD and WBNOINVD, and provide hints and pointers to uarch-specific
behavior.

Note, alternative() ensures compatibility with early boot code as needed.

  [ bp: Massage, fix typos, make export _GPL. ]

Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/20250522233733.3176144-4-seanjc@google.com
2025-07-10 13:23:10 +02:00
Sean Christopherson
e638081751 x86/lib: Drop the unused return value from wbinvd_on_all_cpus()
Drop wbinvd_on_all_cpus()'s return value; both the "real" version and the
stub always return '0', and none of the callers check the return.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250522233733.3176144-3-seanjc@google.com
2025-07-10 13:13:26 +02:00
Alistair Popple
21aa65bf82 mm: remove callers of pfn_t functionality
All PFN_* pfn_t flags have been removed.  Therefore there is no longer a
need for the pfn_t type and all uses can be replaced with normal pfns.

Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Björn Töpel <bjorn@rivosinc.com>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:19 -07:00
Alistair Popple
d438d27341 mm: remove devmap related functions and page table bits
Now that DAX and all other reference counts to ZONE_DEVICE pages are
managed normally there is no need for the special devmap PTE/PMD/PUD page
table bits.  So drop all references to these, freeing up a software
defined page table bit on architectures supporting it.

Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Will Deacon <will@kernel.org> # arm64
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Björn Töpel <bjorn@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: John Groves <john@groves.net>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:18 -07:00
Lorenzo Stoakes
d75fa3c947 mm: update architecture and driver code to use vm_flags_t
In future we intend to change the vm_flags_t type, so it isn't correct for
architecture and driver code to assume it is unsigned long.  Correct this
assumption across the board.

Overall, this patch does not introduce any functional change.

Link: https://lkml.kernel.org/r/b6eb1894abc5555ece80bb08af5c022ef780c8bc.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:14 -07:00
Lorenzo Stoakes
78ddaa358e mm: change vm_get_page_prot() to accept vm_flags_t argument
Patch series "use vm_flags_t consistently".

The VMA flags field vma->vm_flags is of type vm_flags_t.  Right now this
is exactly equivalent to unsigned long, but it should not be assumed to
be.

Much code that references vma->vm_flags already correctly uses vm_flags_t,
but a fairly large chunk of code simply uses unsigned long and assumes
that the two are equivalent.

This series corrects that and has us use vm_flags_t consistently.

This series is motivated by the desire to, in a future series, adjust
vm_flags_t to be a u64 regardless of whether the kernel is 32-bit or
64-bit in order to deal with the VMA flag exhaustion issue and avoid all
the various problems that arise from it (being unable to use certain
features in 32-bit, being unable to add new flags except for 64-bit, etc.)

This is therefore a critical first step towards that goal.  At any rate,
using the correct type is of value regardless.

We additionally take the opportunity to refer to VMA flags as vm_flags
where possible to make clear what we're referring to.

Overall, this series does not introduce any functional change.


This patch (of 3):

We abstract the type of the VMA flags to vm_flags_t, however in may places
it is simply assumed this is unsigned long, which is simply incorrect.

At the moment this is simply an incongruity, however in future we plan to
change this type and therefore this change is a critical requirement for
doing so.

Overall, this patch does not introduce any functional change.

[lorenzo.stoakes@oracle.com: add missing vm_get_page_prot() instance, remove include]
  Link: https://lkml.kernel.org/r/552f88e1-2df8-4e95-92b8-812f7c8db829@lucifer.local
Link: https://lkml.kernel.org/r/cover.1750274467.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/a12769720a2743f235643b158c4f4f0a9911daf0.1750274467.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:42:13 -07:00
Nuno Das Neves
faab52b59b x86/hyperv: Clean up hv_map/unmap_interrupt() return values
Fix the return values of these hypercall helpers so they return
a negated errno either directly or via hv_result_to_errno().

Update the callers to check for errno instead of using
hv_status_success(), and remove redundant error printing.

While at it, rearrange some variable declarations to adhere to style
guidelines i.e. "reverse fir tree order".

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1751582677-30930-5-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1751582677-30930-5-git-send-email-nunodasneves@linux.microsoft.com>
2025-07-09 23:49:25 +00:00
Nuno Das Neves
bb169f80ed x86/hyperv: Fix usage of cpu_online_mask to get valid cpu
Accessing cpu_online_mask here is problematic because the cpus read lock
is not held in this context.

However, cpu_online_mask isn't needed here since the effective affinity
mask is guaranteed to be valid in this callback. So, just use
cpumask_first() to get the cpu instead of ANDing it with cpus_online_mask
unnecessarily.

Fixes: e39397d1fd ("x86/hyperv: implement an MSI domain for root partition")
Reported-by: Michael Kelley <mhklinux@outlook.com>
Closes: https://lore.kernel.org/linux-hyperv/SN6PR02MB4157639630F8AD2D8FD8F52FD475A@SN6PR02MB4157.namprd02.prod.outlook.com/
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1751582677-30930-4-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1751582677-30930-4-git-send-email-nunodasneves@linux.microsoft.com>
2025-07-09 23:49:25 +00:00
Naman Jain
0271e72bc0 x86/hyperv: Fix warnings for missing export.h header inclusion
Fix below warning in Hyper-V drivers that comes when kernel is compiled
with W=1 option. Include export.h in driver files to fix it.
* warning: EXPORT_SYMBOL() is used, but #include <linux/export.h>
is missing

Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Link: https://lore.kernel.org/r/20250611100459.92900-3-namjain@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250611100459.92900-3-namjain@linux.microsoft.com>
2025-07-09 23:43:53 +00:00
Paolo Bonzini
4578a747f3 KVM: x86: avoid underflow when scaling TSC frequency
In function kvm_guest_time_update(), __scale_tsc() is used to calculate
a TSC *frequency* rather than a TSC value.  With low-enough ratios,
a TSC value that is less than 1 would underflow to 0 and to an infinite
while loop in kvm_get_time_scale():

  kvm_guest_time_update(struct kvm_vcpu *v)
    if (kvm_caps.has_tsc_control)
      tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
                                  v->arch.l1_tsc_scaling_ratio);
        __scale_tsc(u64 ratio, u64 tsc)
          ratio=122380531, tsc=2299998, N=48
          ratio*tsc >> N = 0.999... -> 0

Later in the function:

  Call Trace:
   <TASK>
   kvm_get_time_scale arch/x86/kvm/x86.c:2458 [inline]
   kvm_guest_time_update+0x926/0xb00 arch/x86/kvm/x86.c:3268
   vcpu_enter_guest.constprop.0+0x1e70/0x3cf0 arch/x86/kvm/x86.c:10678
   vcpu_run+0x129/0x8d0 arch/x86/kvm/x86.c:11126
   kvm_arch_vcpu_ioctl_run+0x37a/0x13d0 arch/x86/kvm/x86.c:11352
   kvm_vcpu_ioctl+0x56b/0xe60 virt/kvm/kvm_main.c:4188
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:871 [inline]
   __se_sys_ioctl+0x12d/0x190 fs/ioctl.c:857
   do_syscall_x64 arch/x86/entry/common.c:51 [inline]
   do_syscall_64+0x59/0x110 arch/x86/entry/common.c:81
   entry_SYSCALL_64_after_hwframe+0x78/0xe2

This can really happen only when fuzzing, since the TSC frequency
would have to be nonsensically low.

Fixes: 35181e86df ("KVM: x86: Add a common TSC scaling function")
Reported-by: Yuntao Liu <liuyuntao12@huawei.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-07-09 13:52:50 -04:00
Jim Mattson
a7cec20845 KVM: x86: Provide a capability to disable APERF/MPERF read intercepts
Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs
without interception.

The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not
handled at all. The MSR values are not zeroed on vCPU creation, saved
on suspend, or restored on resume. No accommodation is made for
processor migration or for sharing a logical processor with other
tasks. No adjustments are made for non-unit TSC multipliers. The MSRs
do not account for time the same way as the comparable PMU events,
whether the PMU is virtualized by the traditional emulation method or
the new mediated pass-through approach.

Nonetheless, in a properly constrained environment, this capability
can be combined with a guest CPUID table that advertises support for
CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the
effective physical CPU frequency in /proc/cpuinfo. Moreover, there is
no performance cost for this capability.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:33:37 -07:00
Jim Mattson
6fbef8615d KVM: x86: Replace growing set of *_in_guest bools with a u64
Store each "disabled exit" boolean in a single bit rather than a byte.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-2-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:32:32 -07:00
Xin Li
e88cfd50b6 KVM: x86: Advertise support for LKGS
Advertise support for LKGS (load into IA32_KERNEL_GS_BASE) to userspace
if the instruction is supported by the underlying CPU.

LKGS is introduced with FRED to completely eliminate the need to swapgs
explicilty.  It behaves like the MOV to GS instruction except that it
loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the
GS segment’s descriptor cache, which is exactly what Linux kernel does
to load a user level GS base.  Thus there is no need to SWAPGS away
from the kernel GS base.

LKGS is an independent CPU feature that works correctly in a KVM guest
without requiring explicit enablement.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250626173521.2301088-1-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:32:25 -07:00
Sean Christopherson
e1ef1c57ff KVM: VMX: Add a macro to track which DEBUGCTL bits are host-owned
Add VMX_HOST_OWNED_DEBUGCTL_BITS to track which bits are host-owned, i.e.
need to be preserved when running the guest, to dedup the logic without
having to incur a memory load to get at kvm_x86_ops.HOST_OWNED_DEBUGCTL.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/all/aF1yni8U6XNkyfRf@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09 09:30:52 -07:00
Borislav Petkov (AMD)
fde494e905 Add the mitigation logic for Transient Scheduler Attacks (TSA)
TSA are new aspeculative side channel attacks related to the execution
 timing of instructions under specific microarchitectural conditions. In
 some cases, an attacker may be able to use this timing information to
 infer data from other contexts, resulting in information leakage.
 
 Add the usual controls of the mitigation and integrate it into the
 existing speculation bugs infrastructure in the kernel.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhSsvQACgkQEsHwGGHe
 VUrWNw//V+ZabYq3Nnvh4jEe6Altobnpn8bOIWmcBx6I3xuuArb9bLqcbKerDIcC
 POVVW6zrdNigDe/U4aqaJXE7qCRX55uTYbhp8OLH0zzqX3Pjl/hUnEXWtMtlXj/G
 CIM5mqjqEFp5JRGXetdjjuvjG1IPf+CbjKqj2WXbi//T6F3LiAFxkzdUhd+clBF/
 ztWchjwUmqU0WJd6+Smb8ZnvWrLoZuOFldjhFad820B7fqkdJhzjHMmwBHJKUEZu
 oABv8B0/4IALrx6LenCspWS4OuTOGG7DKyIgzitByXygXXb4L3ZUKpuqkxBU7hFx
 bscwtOP7e5HIYAekx6ZSLZoZpYQXr1iH0aRGrjwapi3ASIpUwI0UA9ck2PdGo0IY
 0GvmN0vbybskewBQyG819BM+DCau5pOLWuL7cYmaD2eTNoOHOknMDNlO8VzXqJxa
 NnignSuEWFm2vNV1FXEav2YbVjlanV6JleiPDGBe5Xd9dnxZTvg9HuP2NkYio4dZ
 mb/kEU/kTcN8nWh0Q96tX45kmj0vCbBgrSQkmUpyAugp38n69D1tp3ii9D/hyQFH
 hKGcFC9m+rYVx1NLyAxhTGxaEqF801d5Qawwud8HsnQudTpCdSXD9fcBg9aCbWEa
 FymtDpIeUQrFAjDpVEp6Syh3odKvLXsGEzL+DVvqKDuA8r6DxFo=
 =2cLl
 -----END PGP SIGNATURE-----

Merge tag 'tsa_x86_bugs_for_6.16' into tip-x86-bugs

Pick up TSA changes from mainline so that attack vectors work can
continue ontop.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-07-09 18:16:53 +02:00
Jann Horn
76303ee8d5 x86/mm: Disable hugetlb page table sharing on 32-bit
Only select ARCH_WANT_HUGE_PMD_SHARE on 64-bit x86.
Page table sharing requires at least three levels because it involves
shared references to PMD tables; 32-bit x86 has either two-level paging
(without PAE) or three-level paging (with PAE), but even with
three-level paging, having a dedicated PGD entry for hugetlb is only
barely possible (because the PGD only has four entries), and it seems
unlikely anyone's actually using PMD sharing on 32-bit.

Having ARCH_WANT_HUGE_PMD_SHARE enabled on non-PAE 32-bit X86 (which
has 2-level paging) became particularly problematic after commit
59d9094df3 ("mm: hugetlb: independent PMD page table shared count"),
since that changes `struct ptdesc` such that the `pt_mm` (for PGDs) and
the `pt_share_count` (for PMDs) share the same union storage - and with
2-level paging, PMDs are PGDs.

(For comparison, arm64 also gates ARCH_WANT_HUGE_PMD_SHARE on the
configuration of page tables such that it is never enabled with 2-level
paging.)

Closes: https://lore.kernel.org/r/srhpjxlqfna67blvma5frmy3aa@altlinux.org
Fixes: cfe28c5d63 ("x86: mm: Remove x86 version of huge_pmd_share.")
Reported-by: Vitaly Chikunov <vt@altlinux.org>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Vitaly Chikunov <vt@altlinux.org>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250702-x86-2level-hugetlb-v2-1-1a98096edf92%40google.com
2025-07-09 07:46:36 -07:00
Kan Liang
829f5a6308 perf/x86/intel/uncore: Add iMC freerunning for Panther Lake
PTL uncore imc freerunning counters are the same as the previous HW.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250707201750.616527-5-kan.liang@linux.intel.com
2025-07-09 13:40:20 +02:00
Kan Liang
64ad6d6ede perf/x86/intel/uncore: Add Panther Lake support
The Panther Lake supports CBOX, MC, sNCU, and HBO uncore PMON.

The CBOX is similar to Lunar Lake. The only difference is the number of
CBOX.

The other three uncore PMON can be retrieved from the discovery table.
The global control register resides in the sNCU. The global freeze bit
is set by default. It must be cleared before monitoring any uncore
counters.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250707201750.616527-4-kan.liang@linux.intel.com
2025-07-09 13:40:19 +02:00
Kan Liang
fca24bf2b6 perf/x86/intel/uncore: Support customized MMIO map size
For a server platform, the MMIO map size is always 0x4000. However, a
client platform may have a smaller map size.

Make the map size customizable.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250707201750.616527-3-kan.liang@linux.intel.com
2025-07-09 13:40:19 +02:00
Kan Liang
cf002dafed perf/x86/intel/uncore: Support MSR portal for discovery tables
Starting from the Panther Lake, the discovery table mechanism is also
supported in client platforms. The difference is that the portal of the
global discovery table is retrieved from an MSR.

The layout of discovery tables are the same as the server platforms.
Factor out __parse_discovery_table() to parse discover tables.

The uncore PMON is Die scope. Need to parse the discovery tables for
each die.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250707201750.616527-2-kan.liang@linux.intel.com
2025-07-09 13:40:19 +02:00
Greg Kroah-Hartman
9b355cdb63 x86/microcode: Move away from using a fake platform device
Downloading firmware needs a device to hang off of, and so a platform device
seemed like the simplest way to do this.  Now that we have a faux device
interface, use that instead as this "microcode device" is not anything
resembling a platform device at all.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/2025070121-omission-small-9308@gregkh
2025-07-09 13:12:08 +02:00
Mikhail Paulyshka
a74bb5f202 x86/CPU/AMD: Disable INVLPGB on Zen2
AMD Cyan Skillfish (Family 17h, Model 47h, Stepping 0h) has an issue
that causes system oopses and panics when performing TLB flush using
INVLPGB.

However, the problem is that that machine has misconfigured CPUID and
should not report the INVLPGB bit in the first place. So zap the
kernel's representation of the flag so that nothing gets confused.

  [ bp: Massage. ]

Fixes: 767ae437a3 ("x86/mm: Add INVLPGB feature and Kconfig entry")
Signed-off-by: Mikhail Paulyshka <me@mixaill.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/1ebe845b-322b-4929-9093-b41074e9e939@mixaill.net
2025-07-08 21:34:01 +02:00
Mikhail Paulyshka
5b937a1ed6 x86/rdrand: Disable RDSEED on AMD Cyan Skillfish
AMD Cyan Skillfish (Family 17h, Model 47h, Stepping 0h) has an error that
causes RDSEED to always return 0xffffffff, while RDRAND works correctly.

Mask the RDSEED cap for this CPU so that both /proc/cpuinfo and direct CPUID
read report RDSEED as unavailable.

  [ bp: Move to amd.c, massage. ]

Signed-off-by: Mikhail Paulyshka <me@mixaill.net>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/20250524145319.209075-1-me@mixaill.net
2025-07-08 21:33:26 +02:00
Linus Torvalds
6e9128ff9d Add the mitigation logic for Transient Scheduler Attacks (TSA)
TSA are new aspeculative side channel attacks related to the execution
 timing of instructions under specific microarchitectural conditions. In
 some cases, an attacker may be able to use this timing information to
 infer data from other contexts, resulting in information leakage.
 
 Add the usual controls of the mitigation and integrate it into the
 existing speculation bugs infrastructure in the kernel.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhSsvQACgkQEsHwGGHe
 VUrWNw//V+ZabYq3Nnvh4jEe6Altobnpn8bOIWmcBx6I3xuuArb9bLqcbKerDIcC
 POVVW6zrdNigDe/U4aqaJXE7qCRX55uTYbhp8OLH0zzqX3Pjl/hUnEXWtMtlXj/G
 CIM5mqjqEFp5JRGXetdjjuvjG1IPf+CbjKqj2WXbi//T6F3LiAFxkzdUhd+clBF/
 ztWchjwUmqU0WJd6+Smb8ZnvWrLoZuOFldjhFad820B7fqkdJhzjHMmwBHJKUEZu
 oABv8B0/4IALrx6LenCspWS4OuTOGG7DKyIgzitByXygXXb4L3ZUKpuqkxBU7hFx
 bscwtOP7e5HIYAekx6ZSLZoZpYQXr1iH0aRGrjwapi3ASIpUwI0UA9ck2PdGo0IY
 0GvmN0vbybskewBQyG819BM+DCau5pOLWuL7cYmaD2eTNoOHOknMDNlO8VzXqJxa
 NnignSuEWFm2vNV1FXEav2YbVjlanV6JleiPDGBe5Xd9dnxZTvg9HuP2NkYio4dZ
 mb/kEU/kTcN8nWh0Q96tX45kmj0vCbBgrSQkmUpyAugp38n69D1tp3ii9D/hyQFH
 hKGcFC9m+rYVx1NLyAxhTGxaEqF801d5Qawwud8HsnQudTpCdSXD9fcBg9aCbWEa
 FymtDpIeUQrFAjDpVEp6Syh3odKvLXsGEzL+DVvqKDuA8r6DxFo=
 =2cLl
 -----END PGP SIGNATURE-----

Merge tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU speculation fixes from Borislav Petkov:
 "Add the mitigation logic for Transient Scheduler Attacks (TSA)

  TSA are new aspeculative side channel attacks related to the execution
  timing of instructions under specific microarchitectural conditions.
  In some cases, an attacker may be able to use this timing information
  to infer data from other contexts, resulting in information leakage.

  Add the usual controls of the mitigation and integrate it into the
  existing speculation bugs infrastructure in the kernel"

* tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/process: Move the buffer clearing before MONITOR
  x86/microcode/AMD: Add TSA microcode SHAs
  KVM: SVM: Advertise TSA CPUID bits to guests
  x86/bugs: Add a Transient Scheduler Attacks mitigation
  x86/bugs: Rename MDS machinery to something more generic
2025-07-07 17:08:36 -07:00
Mario Limonciello
f126821482 x86/itmt: Add debugfs file to show core priorities
Multiple drivers can report priorities to ITMT. To aid in debugging
any issues with the values reported by drivers introduce a debugfs
file to read out the values.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/20250609200518.3616080-14-superm1@kernel.org
2025-07-07 22:35:51 +02:00
Perry Yuan
9e8f6bf782 x86/process: Clear hardware feedback history for AMD processors
Incorporate a mechanism within the context switching code to reset the
hardware history for AMD processors. Specifically, when a task is switched in,
the class ID is read and the hardware workload classification history of the
CPU firmware is reset. Then, the workload classification for the next running
thread is begun.

  [ bp: Massage commit message. ]

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Co-developed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/20250609200518.3616080-10-superm1@kernel.org
2025-07-07 22:30:36 +02:00
Perry Yuan
a3c4f3396b x86/msr-index: Add AMD workload classification MSRs
Introduce new MSR registers for AMD hardware feedback support.  They provide
workload classification and configuration capabilities.

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/20250609200518.3616080-4-superm1@kernel.org
2025-07-07 19:24:58 +02:00
Linus Torvalds
5fc2e891a5 - Make sure AMD SEV guests using secure TSC, include a TSC_FACTOR which
prevents their TSCs from going skewed from the hypervisor's
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhqMCYACgkQEsHwGGHe
 VUrz6A/9EMN2fEdpeStGrk5t0BvYPhnIApr2x2QuvVN6qXgyGWQXhnF5G94SFNn8
 ykNxcGi97R/wxqk2uK3RfBg4P0ScoPDOLKKSeaqO0LVuHDTVX72fwB1F3qdaPNbp
 EIEL+OOEwUAwviT2GSH4mwTb1C7TuJnOZH2lC6yDkWwN5BLnIA4P4C0Wr4pIQ+MT
 TMzxGMT01yTnCAGHGOD2NRUIv/29qeJl+18uOqDO4A64RPT5Pp0yLJ72U4grGzOr
 2C49+/XD0qjXMs8vRz/CbeBK47ZaE1v/ui3g5ofbZ2YtcrfJXDY2/SwoJ0Oz+liM
 TSEXj8IpFZ3aq3+Pvgp9Qibu5QnFxJi7xTzrGCG1OSouHXTH2eFSVIXCiojVU12Y
 s+pKCBTXs9wVJN4z/FaSSwmTvQolld7oozShgPieZsYNBfJeeWIm+6LZRT4Zr/7Y
 UVsYEc/7m36ggKK+XFHsea2ZnmUFV18kEHPuWAXwmH3DW3dDfI5nm/s811jsbS+6
 2RaLZPiKBsYmNZ7iCujrY3GEmE5Eyemr8Ricj2zSGTCH2EYNeODDQBOn+hYgQOTK
 WJFWqpC5JqI5oJapmhugCkjfT75e+XTgO8Dox7HdlJR4UAb61xxf1zGFbThH8T45
 LZgbIKtLwwLShg0FwzDl7swnADJ/SiaKl049Q8Z5YthhllHy6zI=
 =6yl8
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Borislav Petkov:

 - Make sure AMD SEV guests using secure TSC, include a TSC_FACTOR which
   prevents their TSCs from going skewed from the hypervisor's

* tag 'x86_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev: Use TSC_FACTOR for Secure TSC frequency calculation
2025-07-06 10:44:20 -07:00
Linus Torvalds
bdde3141ce - Do not remove the MCE sysfs hierarchy if thresholding sysfs nodes init
fails due to new/unknown banks present, which in itself is not fatal
   anyway; add default names for new banks
 
 - Make sure MCE polling settings are honored after CMCI storms
 
 - Make sure MCE threshold limit is reset after the thresholding interrupt has
   been serviced
 
 - Clean up properly and disable CMCI banks on shutdown so that
   a second/kexec-ed kernel can rediscover those banks again
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhqLrkACgkQEsHwGGHe
 VUqm/hAAlg6nX/uJIeszPpZK1o4j6WP3IZ0RAo5yxd9mQC+zNP35Xqv3MOUBZVGE
 1EhrQSgrqeC9NIbT8a9ansnc3BKxODOB8raoNA3HpViTG3LeQ5ycmpJv5qWqcr8B
 EV14BpZHAEx3AXAiDcGXkuKYM9RrpHtOwtyKjib4ScT+xCkzQzJtrO2hPTk8Slr/
 8Hly14+st7Iqvnh9nH1TeMOjogGg+lr9cIlG6rzZGj3IPbfEWbvmHhjJut7p4TfU
 g5AY3djzJ8eyrOv3aKxgVDkJ7qet283sc+mvTzrhAnoJEYI9v7tde4g6AKJj0BLA
 +u0tTOu47c/wijLcHPpaS+zwifo4BxKDIG8q+tHT6ixMMPT3Mev8nwanjAotjgz7
 WSN3eL9jie+IJPq4c0eN2z2um5tiqxFHF5M7q4Ol5VJiUU8Wa31B/pzPZymQoCWZ
 F/SC5VVq+ZwfRqzQMAK5dHQSj1zvkbtb0HOMYoFTOU1JocF7Px1gxx401UQ6pKdG
 Qw2rE1SUKxaQT4HZ2+SRvO2egJItsQw4r+ZT/7sMQhII9v9qK500kD8o30HcCEh3
 o8kT+dQwKEv0KLga5vYFnITT29XM9GySdAEI7HxKi/kpnwaUppUljuycfhzAMtYe
 xz86txluUsk7sUW8eUPgjuKPYO3a20FkY8VB5TYWTHxJdMAeb4E=
 =JI7d
 -----END PGP SIGNATURE-----

Merge tag 'ras_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS fixes from Borislav Petkov:

 - Do not remove the MCE sysfs hierarchy if thresholding sysfs nodes
   init fails due to new/unknown banks present, which in itself is not
   fatal anyway; add default names for new banks

 - Make sure MCE polling settings are honored after CMCI storms

 - Make sure MCE threshold limit is reset after the thresholding
   interrupt has been serviced

 - Clean up properly and disable CMCI banks on shutdown so that a
   second/kexec-ed kernel can rediscover those banks again

* tag 'ras_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Make sure CMCI banks are cleared during shutdown on Intel
  x86/mce/amd: Fix threshold limit reset
  x86/mce/amd: Add default names for MCA banks and blocks
  x86/mce: Ensure user polling settings are honored when restarting timer
  x86/mce: Don't remove sysfs if thresholding sysfs init fails
2025-07-06 09:17:48 -07:00
Eric Biggers
b86ced882b lib/crypto: sha256: Make library API use strongly-typed contexts
Currently the SHA-224 and SHA-256 library functions can be mixed
arbitrarily, even in ways that are incorrect, for example using
sha224_init() and sha256_final().  This is because they operate on the
same structure, sha256_state.

Introduce stronger typing, as I did for SHA-384 and SHA-512.

Also as I did for SHA-384 and SHA-512, use the names *_ctx instead of
*_state.  The *_ctx names have the following small benefits:

- They're shorter.
- They avoid an ambiguity with the compression function state.
- They're consistent with the well-known OpenSSL API.
- Users usually name the variable 'sctx' anyway, which suggests that
  *_ctx would be the more natural name for the actual struct.

Therefore: update the SHA-224 and SHA-256 APIs, implementation, and
calling code accordingly.

In the new structs, also strongly-type the compression function state.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250630160645.3198-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-04 10:18:53 -07:00
Linus Torvalds
df46426745 platform-drivers-x86 for v6.16-3
Fixes and New HW Support
 
 - amd/isp4: Improve swnode graph (new driver exception)
 
 - asus-nb-wmi: Use duo keyboard quirk for Zenbook Duo UX8406CA
 
 - dell-lis3lv02d: Add Latitude 5500 accelerometer address
 
 - dell-wmi-sysman: Fix WMI data block retrieval and class dev unreg
 
 - hp-bioscfg: Fix class device unregistration
 
 - i2c: piix4: Re-enable on non-x86 + move FCH header under platform_data/
 
 - intel/hid: Wildcat Lake support
 
 - mellanox:
 
   - mlxbf-pmc: Fix duplicate event ID
 
   - mlxbf-tmfifo: Fix vring_desc.len assignment
 
   - mlxreg-lc: Fix bit-not-set logic check
 
   - nvsw-sn2201: Fix bus number in error message & spelling errors
 
 - portwell-ec: Move watchdog device under correct platform hierarchy
 
 - think-lmi: Error handling fixes (sysfs, kset, kobject, class dev unreg)
 
 - thinkpad_acpi: Handle HKEY 0x1402 event (2025 Thinkpads)
 
 - wmi: Fix WMI event enablement
 
 The following is an automated shortlog grouped by driver:
 
 asus-nb-wmi:
  -  add DMI quirk for ASUS Zenbook Duo UX8406CA
 
 dell-lis3lv02d:
  -  Add Latitude 5500
 
 dell-wmi-sysman:
  -  Fix class device unregistration
  -  Fix WMI data block retrieval in sysfs callbacks
 
 hp-bioscfg:
  -  Fix class device unregistration
 
 i2c:
  -  Re-enable piix4 driver on non-x86
 
 intel/hid:
  -  Add Wildcat Lake support
 
 mellanox:
  -  Fix spelling and comment clarity in Mellanox drivers
 
 mlxbf-pmc:
  -  Fix duplicate event ID for CACHE_DATA1
 
 mlxbf-tmfifo:
  -  fix vring_desc.len assignment
 
 mlxreg-lc:
  -  Fix logic error in power state check
 
 Move FCH header to a location accessible by all archs:
  - Move FCH header to a location accessible by all archs
 
 nvsw-sn2201:
  -  Fix bus number in adapter error message
 
 portwell-ec:
  -  Move watchdog device under correct platform hierarchy
 
 think-lmi:
  -  Create ksets consecutively
  -  Fix class device unregistration
  -  Fix kobject cleanup
  -  Fix sysfs group cleanup
 
 thinkpad_acpi:
  -  handle HKEY 0x1402 event
 
 Update swnode graph for amd isp4:
  - Update swnode graph for amd isp4
 
 wmi:
  -  Fix WMI event enablement
  -  Update documentation of WCxx/WExx ACPI methods
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSCSUwRdwTNL2MhaBlZrE9hU+XOMQUCaGfkwwAKCRBZrE9hU+XO
 MVK1AQCK3C21auqcEbiZrx67hr5ir6VwTAZ9S6IR8R2FKqw8YwEAinUOcHSbmP6a
 eXV0v5xVRPxZV7JBO5aN7FESqVHpBQ4=
 =uxUH
 -----END PGP SIGNATURE-----

Merge tag 'platform-drivers-x86-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform drivers fixes from Ilpo Järvinen:
 "Mostly a few lines fixed here and there except amd/isp4 which improves
  swnodes relationships but that is a new driver not in any stable
  kernels yet. The think-lmi driver changes also look relatively large
  but there are just many fixes to it.

  The i2c/piix4 change is a effectively a revert of the commit
  7e173eb82a ("i2c: piix4: Make CONFIG_I2C_PIIX4 dependent on
  CONFIG_X86") but that required moving the header out from arch/x86
  under include/linux/platform_data/

  Summary:

   - amd/isp4: Improve swnode graph (new driver exception)

   - asus-nb-wmi: Use duo keyboard quirk for Zenbook Duo UX8406CA

   - dell-lis3lv02d: Add Latitude 5500 accelerometer address

   - dell-wmi-sysman: Fix WMI data block retrieval and class dev unreg

   - hp-bioscfg: Fix class device unregistration

   - i2c: piix4: Re-enable on non-x86 + move FCH header under platform_data/

   - intel/hid: Wildcat Lake support

   - mellanox:
      - mlxbf-pmc: Fix duplicate event ID
      - mlxbf-tmfifo: Fix vring_desc.len assignment
      - mlxreg-lc: Fix bit-not-set logic check
      - nvsw-sn2201: Fix bus number in error message & spelling errors

   - portwell-ec: Move watchdog device under correct platform hierarchy

   - think-lmi: Error handling fixes (sysfs, kset, kobject, class dev unreg)

   - thinkpad_acpi: Handle HKEY 0x1402 event (2025 Thinkpads)

   - wmi: Fix WMI event enablement"

* tag 'platform-drivers-x86-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (22 commits)
  platform/x86: think-lmi: Fix sysfs group cleanup
  platform/x86: think-lmi: Fix kobject cleanup
  platform/x86: think-lmi: Create ksets consecutively
  platform/mellanox: mlxreg-lc: Fix logic error in power state check
  i2c: Re-enable piix4 driver on non-x86
  Move FCH header to a location accessible by all archs
  platform/x86/intel/hid: Add Wildcat Lake support
  platform/x86: dell-wmi-sysman: Fix class device unregistration
  platform/x86: think-lmi: Fix class device unregistration
  platform/x86: hp-bioscfg: Fix class device unregistration
  platform/x86: Update swnode graph for amd isp4
  platform/x86: dell-wmi-sysman: Fix WMI data block retrieval in sysfs callbacks
  platform/x86: wmi: Update documentation of WCxx/WExx ACPI methods
  platform/x86: wmi: Fix WMI event enablement
  platform/mellanox: nvsw-sn2201: Fix bus number in adapter error message
  platform/mellanox: Fix spelling and comment clarity in Mellanox drivers
  platform/mellanox: mlxbf-pmc: Fix duplicate event ID for CACHE_DATA1
  platform/x86: thinkpad_acpi: handle HKEY 0x1402 event
  platform/x86: asus-nb-wmi: add DMI quirk for ASUS Zenbook Duo UX8406CA
  platform/x86: dell-lis3lv02d: Add Latitude 5500
  ...
2025-07-04 10:05:31 -07:00
Kumar Kartikeya Dwivedi
f0c53fd4a7 bpf: Add function to find program from stack trace
In preparation of figuring out the closest program that led to the
current point in the kernel, implement a function that scans through the
stack trace and finds out the closest BPF program when walking down the
stack trace.

Special care needs to be taken to skip over kernel and BPF subprog
frames. We basically scan until we find a BPF main prog frame. The
assumption is that if a program calls into us transitively, we'll
hit it along the way. If not, we end up returning NULL.

Contextually the function will be used in places where we know the
program may have called into us.

Due to reliance on arch_bpf_stack_walk(), this function only works on
x86 with CONFIG_UNWINDER_ORC, arm64, and s390. Remove the warning from
arch_bpf_stack_walk as well since we call it outside bpf_throw()
context.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250703204818.925464-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-07-03 19:30:06 -07:00
Andrey Albershteyn
be7efb2d20
fs: introduce file_getattr and file_setattr syscalls
Introduce file_getattr() and file_setattr() syscalls to manipulate inode
extended attributes. The syscalls takes pair of file descriptor and
pathname. Then it operates on inode opened accroding to openat()
semantics. The struct file_attr is passed to obtain/change extended
attributes.

This is an alternative to FS_IOC_FSSETXATTR ioctl with a difference
that file don't need to be open as we can reference it with a path
instead of fd. By having this we can manipulated inode extended
attributes not only on regular files but also on special ones. This
is not possible with FS_IOC_FSSETXATTR ioctl as with special files
we can not call ioctl() directly on the filesystem inode using fd.

This patch adds two new syscalls which allows userspace to get/set
extended inode attributes on special files by using parent directory
and a path - *at() like syscall.

CC: linux-api@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
CC: linux-xfs@vger.kernel.org
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://lore.kernel.org/20250630-xattrat-syscall-v6-6-c4e3bc35227b@kernel.org
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-02 17:05:17 +02:00
Ilpo Järvinen
88e326b331
Merge branch 'fixes' into for-next
Merge fixes back into for-next to be able to take dell_rbu change that
is build on top of fixes material, and to bring lenovo related changes
in sync after the move under lenovo/ subdir in the for-next branch and
diverging changes in the fixes branch.
2025-07-02 13:30:30 +03:00
Nikunj A Dadhania
52e1a03e6c x86/sev: Use TSC_FACTOR for Secure TSC frequency calculation
When using Secure TSC, the GUEST_TSC_FREQ MSR reports a frequency based on
the nominal P0 frequency, which deviates slightly (typically ~0.2%) from
the actual mean TSC frequency due to clocking parameters.

Over extended VM uptime, this discrepancy accumulates, causing clock skew
between the hypervisor and a SEV-SNP VM, leading to early timer interrupts as
perceived by the guest.

The guest kernel relies on the reported nominal frequency for TSC-based
timekeeping, while the actual frequency set during SNP_LAUNCH_START may
differ. This mismatch results in inaccurate time calculations, causing the
guest to perceive hrtimers as firing earlier than expected.

Utilize the TSC_FACTOR from the SEV firmware's secrets page (see "Secrets
Page Format" in the SNP Firmware ABI Specification) to calculate the mean
TSC frequency, ensuring accurate timekeeping and mitigating clock skew in
SEV-SNP VMs.

Use early_ioremap_encrypted() to map the secrets page as
ioremap_encrypted() uses kmalloc() which is not available during early TSC
initialization and causes a panic.

  [ bp: Drop the silly dummy var:
    https://lore.kernel.org/r/20250630192726.GBaGLlHl84xIopx4Pt@fat_crate.local ]

Fixes: 73bbf3b0fb ("x86/tsc: Init the TSC for Secure TSC guests")
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250630081858.485187-1-nikunj@amd.com
2025-07-01 00:29:27 +02:00
Eric Biggers
b10749d89f lib/crc: x86: Migrate optimized CRC code into lib/crc/
Move the x86-optimized CRC code from arch/x86/lib/crc* into its new
location in lib/crc/x86/, and wire it up in the new way.  This new way
of organizing the CRC code eliminates the need to artificially split the
code for each CRC variant into separate arch and generic modules,
enabling better inlining and dead code elimination.  For more details,
see "lib/crc: Prepare for arch-optimized code in subdirs of lib/crc/".

Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Jason A. Donenfeld" <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20250607200454.73587-12-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-30 09:31:57 -07:00
Eric Biggers
cece5689e1 x86/crc: drop checks of CONFIG_AS_VPCLMULQDQ
Now that the minimum binutils version supports VPCLMULQDQ (and the
minimum clang version does too), there is no need to check for assembler
support before compiling code that uses these instructions.

Link: https://lore.kernel.org/r/20250531211318.83677-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-30 09:31:56 -07:00
Eric Biggers
74750aa78d lib/crypto: x86: Move arch/x86/lib/crypto/ into lib/crypto/
Move the contents of arch/x86/lib/crypto/ into lib/crypto/x86/.

The new code organization makes a lot more sense for how this code
actually works and is developed.  In particular, it makes it possible to
build each algorithm as a single module, with better inlining and dead
code elimination.  For a more detailed explanation, see the patchset
which did this for the CRC library code:
https://lore.kernel.org/r/20250607200454.73587-1-ebiggers@kernel.org/.
Also see the patchset which did this for SHA-512:
https://lore.kernel.org/linux-crypto/20250616014019.415791-1-ebiggers@kernel.org/

This is just a preparatory commit, which does the move to get the files
into their new location but keeps them building the same way as before.
Later commits will make the actual improvements to the way the
arch-optimized code is integrated for each algorithm.

Add a gitignore entry for the removed directory arch/x86/lib/crypto/ so
that people don't accidentally commit leftover generated files.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20250619191908.134235-9-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-30 09:26:20 -07:00
Eric Biggers
484c18119f lib/crypto: x86/sha512: Migrate optimized SHA-512 code to library
Instead of exposing the x86-optimized SHA-512 code via x86-specific
crypto_shash algorithms, instead just implement the sha512_blocks()
library function.  This is much simpler, it makes the SHA-512 (and
SHA-384) library functions be x86-optimized, and it fixes the
longstanding issue where the x86-optimized SHA-512 code was disabled by
default.  SHA-512 still remains available through crypto_shash, but
individual architectures no longer need to handle it.

To match sha512_blocks(), change the type of the nblocks parameter of
the assembly functions from int to size_t.  The assembly functions
actually already treated it as size_t.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250630160320.2888-15-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-30 09:26:20 -07:00
Eric Biggers
e0fca17755 crypto: sha512 - Rename conflicting symbols
Rename existing functions and structs in architecture-optimized SHA-512
code that had names conflicting with the upcoming library interface
which will be added to <crypto/sha2.h>: sha384_init, sha512_init,
sha512_update, sha384, and sha512.

Note: all affected code will be superseded by later commits that migrate
the arch-optimized SHA-512 code into the library.  This commit simply
keeps the kernel building for the initial introduction of the library.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250630160320.2888-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-06-30 09:26:19 -07:00
Mario Limonciello
b1c26e0595
Move FCH header to a location accessible by all archs
A new header fch.h was created to store registers used by different AMD
drivers.  This header was included by i2c-piix4 in
commit 624b0d5696 ("i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH
definitions to <asm/amd/fch.h>"). To prevent compile failures on non-x86
archs i2c-piix4 was set to only compile on x86 by commit 7e173eb82a
("i2c: piix4: Make CONFIG_I2C_PIIX4 dependent on CONFIG_X86").
This was not a good decision because loongarch and mips both actually
support i2c-piix4 and set it enabled in the defconfig.

Move the header to a location accessible by all architectures.

Fixes: 624b0d5696 ("i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h>")
Suggested-by: Hans de Goede <hansg@kernel.org>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Hans de Goede <hansg@kernel.org>
Link: https://lore.kernel.org/r/20250610205817.3912944-1-superm1@kernel.org
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-06-30 13:42:11 +03:00
Greg Kroah-Hartman
815ac67919 Merge 6.16-rc4 into tty-next
We need the tty/serial fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-30 07:50:04 +02:00
Linus Torvalds
cc69ac7a65 - Make sure DR6 and DR7 are initialized to their architectural values and not
accidentally cleared, leading to misconfigurations
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhg/UQACgkQEsHwGGHe
 VUrmXBAAtOrpEtR4geeBeZtCEaUxE4DE8Zvj36dr+sAScHTXNTzYK94mAy/AHU22
 V3rF12/kuyyZSrwROXLBD6PgkHEn8u0WLztSeqP/SisnoLjMTV9H9TuGYBUoz1NS
 clkoElQ6DJP5BVzmYpZlJrcofqNjkS/mAxfMRoIAq+LzKkb3iL/Lge+Ox/IDUg8z
 L9wRlKh/IaJ5EETWlqh0gkFeS/M9DXYmfkasDQeVkUxnKFeXBdyUGc2jFzBGX5RA
 rsdnz+C3x3ow2U9N+ZMVr+n06yTZvh+fAiU8emeBQm0q5fZBBHWDbnZZtWf+KG6s
 43tlWyVqic5yzyQbUpRC2sttOkIAtOCMx36XexbGm1eKRNNc6fTz9IlgO/97HkuE
 lYBNq0zd/p5Kb53lXb3uwBVy4sjIEZUyD/K5DO4YfTgamcwXl8BP5xnKtNPqImI5
 aaF3xKKLOUDOTL1CcK5YG0joaU1k+I0F0KO7HYqkDi8Uf5naWZSUNil8nPQn8RX7
 3f3LJx0e3j2o0f60AHI4mjUAUJHsxExmpaTl079k03wt8YVE3ucNaUN6se6nidVz
 H5q0JU4q3C3DCu0I3Ub4wa5QXGA+TOHKuqhJCapKAVAAQDlbV2z8GxA/3B2YbRM+
 eZ6/RVyk++VrRXIyfmfwLPu0CLVoSNhaUhu/hFrYbjA3NX+85qw=
 =fjgT
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Make sure DR6 and DR7 are initialized to their architectural values
   and not accidentally cleared, leading to misconfigurations

* tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/traps: Initialize DR7 by writing its architectural reset value
  x86/traps: Initialize DR6 by writing its architectural reset value
2025-06-29 08:28:24 -07:00
Andy Shevchenko
acc902de05 serial: 8250: Move CE4100 quirks to a module under 8250 driver
There is inconvenient for maintainers and maintainership to have
some quirks under architectural code. Move it to the specific quirk
file like other 8250-compatible drivers do.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20250627182743.1273326-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-29 14:24:46 +02:00
JP Kobryn
30ad231a50 x86/mce: Make sure CMCI banks are cleared during shutdown on Intel
CMCI banks are not cleared during shutdown on Intel CPUs. As a side effect,
when a kexec is performed, CPUs coming back online are unable to
rediscover/claim these occupied banks which breaks MCE reporting.

Clear the CPU ownership during shutdown via cmci_clear() so the banks can
be reclaimed and MCE reporting will become functional once more.

  [ bp: Massage commit message. ]

Reported-by: Aijay Adams <aijay@meta.com>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/20250627174935.95194-1-inwardvessel@gmail.com
2025-06-28 12:45:48 +02:00
Gerd Hoffmann
a7549636f6 x86/sev: Let sev_es_efi_map_ghcbs() map the CA pages too
OVMF EFI firmware needs access to the CA page to do SVSM protocol calls. For
example, when the SVSM implements an EFI variable store, such calls will be
necessary.

So add that to sev_es_efi_map_ghcbs() and also rename the function to reflect
the additional job it is doing now.

  [ bp: Massage. ]

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250626114014.373748-4-kraxel@redhat.com
2025-06-27 14:07:10 +02:00
Gerd Hoffmann
7b22e04329 x86/sev/vc: Fix EFI runtime instruction emulation
In case efi_mm is active go use the userspace instruction decoder which
supports fetching instructions from active_mm.  This is needed to make
instruction emulation work for EFI runtime code, so it can use CPUID and
RDMSR.

EFI runtime code uses the CPUID instruction to gather information about
the environment it is running in, such as SEV being enabled or not, and
choose (if needed) the SEV code path for ioport access.

EFI runtime code uses the RDMSR instruction to get the location of the
CAA page (see SVSM spec, section 4.2 - "Post Boot").

The big picture behind this is that the kernel needs to be able to
properly handle #VC exceptions that come from EFI runtime services.
Since EFI runtime services have a special page table mapping for the EFI
virtual address space, the efi_mm context must be used when decoding
instructions during #VC handling.

  [ bp: Massage. ]

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/20250626114014.373748-2-kraxel@redhat.com
2025-06-27 13:53:12 +02:00
Yazen Ghannam
5f6e3b7206 x86/mce/amd: Fix threshold limit reset
The MCA threshold limit must be reset after servicing the interrupt.

Currently, the restart function doesn't have an explicit check for this.  It
makes some assumptions based on the current limit and what's in the registers.
These assumptions don't always hold, so the limit won't be reset in some
cases.

Make the reset condition explicit. Either an interrupt/overflow has occurred
or the bank is being initialized.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-4-236dd74f645f@amd.com
2025-06-27 13:16:23 +02:00
Yazen Ghannam
d66e1e90b1 x86/mce/amd: Add default names for MCA banks and blocks
Ensure that sysfs init doesn't fail for new/unrecognized bank types or if
a bank has additional blocks available.

Most MCA banks have a single thresholding block, so the block takes the same
name as the bank.

Unified Memory Controllers (UMCs) are a special case where there are two
blocks and each has a unique name.

However, the microarchitecture allows for five blocks. Any new MCA bank types
with more than one block will be missing names for the extra blocks. The MCE
sysfs will fail to initialize in this case.

Fixes: 87a6d4091b ("x86/mce/AMD: Update sysfs bank names for SMCA systems")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-3-236dd74f645f@amd.com
2025-06-27 13:13:36 +02:00
Yazen Ghannam
00c092de6f x86/mce: Ensure user polling settings are honored when restarting timer
Users can disable MCA polling by setting the "ignore_ce" parameter or by
setting "check_interval=0". This tells the kernel to *not* start the MCE
timer on a CPU.

If the user did not disable CMCI, then storms can occur. When these
happen, the MCE timer will be started with a fixed interval. After the
storm subsides, the timer's next interval is set to check_interval.

This disregards the user's input through "ignore_ce" and
"check_interval". Furthermore, if "check_interval=0", then the new timer
will run faster than expected.

Create a new helper to check these conditions and use it when a CMCI
storm ends.

  [ bp: Massage. ]

Fixes: 7eae17c4ad ("x86/mce: Add per-bank CMCI storm mitigation")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-2-236dd74f645f@amd.com
2025-06-27 12:41:44 +02:00
Yazen Ghannam
4c113a5b28 x86/mce: Don't remove sysfs if thresholding sysfs init fails
Currently, the MCE subsystem sysfs interface will be removed if the
thresholding sysfs interface fails to be created. A common failure is due to
new MCA bank types that are not recognized and don't have a short name set.

The MCA thresholding feature is optional and should not break the common MCE
sysfs interface. Also, new MCA bank types are occasionally introduced, and
updates will be needed to recognize them. But likewise, this should not break
the common sysfs interface.

Keep the MCE sysfs interface regardless of the status of the thresholding
sysfs interface.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-1-236dd74f645f@amd.com
2025-06-26 17:28:13 +02:00
David Kaplan
98b5dab4d2 x86/bugs: Clean up SRSO microcode handling
SRSO microcode only exists for Zen3/Zen4 CPUs.  For those CPUs, the microcode
is required for any mitigation other than Safe-RET to be effective.  Safe-RET
can still protect user->kernel and guest->host attacks without microcode.

Clarify this in the code and ensure that SRSO_MITIGATION_UCODE_NEEDED is
selected for any mitigation besides Safe-RET if the required microcode isn't
present.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250625155805.600376-4-david.kaplan@amd.com
2025-06-26 13:32:31 +02:00
David Kaplan
ff54ae7314 x86/bugs: Use IBPB for retbleed if used by SRSO
If spec_rstack_overflow=ibpb then this mitigates retbleed as well.  This
is relevant for AMD Zen1 and Zen2 CPUs which are vulnerable to both bugs.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250625155805.600376-3-david.kaplan@amd.com
2025-06-26 10:56:39 +02:00
David Kaplan
1fd5eb0286 x86/bugs: Add SRSO_MITIGATION_NOSMT
AMD Zen1 and Zen2 CPUs with SMT disabled are not vulnerable to SRSO.

Instead of overloading the X86_FEATURE_SRSO_NO bit to indicate this,
define a separate mitigation to make the code cleaner.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: H . Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250625155805.600376-2-david.kaplan@amd.com
2025-06-26 10:56:39 +02:00
Dave Hansen
1cec9ac2d0 x86/fpu: Delay instruction pointer fixup until after warning
Right now, if XRSTOR fails a console message like this is be printed:

	Bad FPU state detected at restore_fpregs_from_fpstate+0x9a/0x170, reinitializing FPU registers.

However, the text location (...+0x9a in this case) is the instruction
*AFTER* the XRSTOR. The highlighted instruction in the "Code:" dump
also points one instruction late.

The reason is that the "fixup" moves RIP up to pass the bad XRSTOR and
keep on running after returning from the #GP handler. But it does this
fixup before warning.

The resulting warning output is nonsensical because it looks like the
non-FPU-related instruction is #GP'ing.

Do not fix up RIP until after printing the warning. Do this by using
the more generic and standard ex_handler_default().

Fixes: d5c8028b47 ("x86/fpu: Reinitialize FPU registers if restoring FPU state fails")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Acked-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250624210148.97126F9E%40davehans-spike.ostc.intel.com
2025-06-25 16:28:06 -07:00
Sean Christopherson
bbc13ae593 VFIO: KVM: x86: Drop kvm_arch_{start,end}_assignment()
Drop kvm_arch_{start,end}_assignment() and all associated code now that
KVM x86 no longer consumes assigned_device_count.  Tracking whether or not
a VFIO-assigned device is formally associated with a VM is fundamentally
flawed, as such an association is optional for general usage, i.e. is prone
to false negatives.  E.g. prior to commit 2edd9cb79f ("kvm: detect
assigned device via irqbypass manager"), device passthrough via VFIO would
fail to enable IRQ bypass if userspace omitted the formal VFIO<=>KVM
binding.

And device drivers that *need* the VFIO<=>KVM connection, e.g. KVM-GT,
shouldn't be relying on generic x86 tracking infrastructure.

Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:51:33 -07:00
Sean Christopherson
ff845e6a84 Revert "kvm: detect assigned device via irqbypass manager"
Now that KVM explicitly tracks the number of possible bypass IRQs, and
doesn't conflate IRQ bypass with host MMIO access, stop bumping the
assigned device count when adding an IRQ bypass producer.

This reverts commit 2edd9cb79f.

Link: https://lore.kernel.org/r/20250523011756.3243624-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:51:29 -07:00
Sean Christopherson
95d9b5d8d6 Merge branch 'kvm-x86 mmio'
Merge the MMIO stale data branch with the device posted IRQs branch to
provide a common base for removing KVM's tracking of "assigned" devices.

Link: https://lore.kernel.org/all/20250523011756.3243624-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:50:58 -07:00
Chao Gao
05186d7a8e KVM: SVM: Simplify MSR interception logic for IA32_XSS MSR
Use svm_set_intercept_for_msr() directly to configure IA32_XSS MSR
interception, ensuring consistency with other cases where MSRs are
intercepted depending on guest caps and CPUIDs.

No functional change intended.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250612081947.94081-3-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:47:28 -07:00
Manuel Andreas
fa787ac07b KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
In KVM guests with Hyper-V hypercalls enabled, the hypercalls
HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX
allow a guest to request invalidation of portions of a virtual TLB.
For this, the hypercall parameter includes a list of GVAs that are supposed
to be invalidated.

However, when non-canonical GVAs are passed, there is currently no
filtering in place and they are eventually passed to checked invocations of
INVVPID on Intel / INVLPGA on AMD.  While AMD's INVLPGA silently ignores
non-canonical addresses (effectively a no-op), Intel's INVVPID explicitly
signals VM-Fail and ultimately triggers the WARN_ONCE in invvpid_error():

  invvpid failed: ext=0x0 vpid=1 gva=0xaaaaaaaaaaaaa000
  WARNING: CPU: 6 PID: 326 at arch/x86/kvm/vmx/vmx.c:482
  invvpid_error+0x91/0xa0 [kvm_intel]
  Modules linked in: kvm_intel kvm 9pnet_virtio irqbypass fuse
  CPU: 6 UID: 0 PID: 326 Comm: kvm-vm Not tainted 6.15.0 #14 PREEMPT(voluntary)
  RIP: 0010:invvpid_error+0x91/0xa0 [kvm_intel]
  Call Trace:
    vmx_flush_tlb_gva+0x320/0x490 [kvm_intel]
    kvm_hv_vcpu_flush_tlb+0x24f/0x4f0 [kvm]
    kvm_arch_vcpu_ioctl_run+0x3013/0x5810 [kvm]

Hyper-V documents that invalid GVAs (those that are beyond a partition's
GVA space) are to be ignored.  While not completely clear whether this
ruling also applies to non-canonical GVAs, it is likely fine to make that
assumption, and manual testing on Azure confirms "real" Hyper-V interprets
the specification in the same way.

Skip non-canonical GVAs when processing the list of address to avoid
tripping the INVVPID failure.  Alternatively, KVM could filter out "bad"
GVAs before inserting into the FIFO, but practically speaking the only
downside of pushing validation to the final processing is that doing so
is suboptimal for the guest, and no well-behaved guest will request TLB
flushes for non-canonical addresses.

Fixes: 260970862c ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently")
Cc: stable@vger.kernel.org
Signed-off-by: Manuel Andreas <manuel.andreas@tum.de>
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/c090efb3-ef82-499f-a5e0-360fc8420fb7@tum.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 09:15:24 -07:00
Sean Christopherson
83ebe71574 KVM: VMX: Apply MMIO Stale Data mitigation if KVM maps MMIO into the guest
Enforce the MMIO State Data mitigation if KVM has ever mapped host MMIO
into the VM, not if the VM has an assigned device.  VFIO is but one of
many ways to map host MMIO into a KVM guest, and even within VFIO,
formally attaching a device to a VM via KVM_DEV_VFIO_FILE_ADD is entirely
optional.

Track whether or not the guest can access host MMIO on a per-MMU basis,
i.e. based on whether or not the vCPU has a mapping to host MMIO.  For
simplicity, track MMIO mappings in "special" rools (those without a
kvm_mmu_page) at the VM level, as only Intel CPUs are vulnerable, and so
only legacy 32-bit shadow paging is affected, i.e. lack of precise
tracking is a complete non-issue.

Make the per-MMU and per-VM flags sticky.  Detecting when *all* MMIO
mappings have been removed would be absurdly complex.  And in practice,
removing MMIO from a guest will be done by deleting the associated memslot,
which by default will force KVM to re-allocate all roots.  Special roots
will forever be mitigated, but as above, the affected scenarios are not
expected to be performance sensitive.

Use a VMX_RUN flag to communicate the need for a buffers flush to
vmx_vcpu_enter_exit() so that kvm_vcpu_can_access_host_mmio() and all its
dependencies don't need to be marked __always_inline, e.g. so that KASAN
doesn't trigger a noinstr violation.

Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Fixes: 8cb861e9e3 ("x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data")
Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25 08:42:51 -07:00
Tiwei Bie
8948941276 um: Use correct data source in fpregs_legacy_set()
Read from the buffer pointed to by 'from' instead of '&buf', as
'buf' contains no valid data when 'ubuf' is NULL.

Fixes: b1e1bd2e69 ("um: Add helper functions to get/set state for SECCOMP")
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20250606124428.148164-5-tiwei.btw@antgroup.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-06-25 09:26:33 +02:00
Chao Gao
3f06b8927a KVM: x86: Deduplicate MSR interception enabling and disabling
Extract a common function from MSR interception disabling logic and create
disabling and enabling functions based on it. This removes most of the
duplicated code for MSR interception disabling/enabling.

No functional change intended.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com
[sean: s/enable/set, inline the wrappers]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 15:42:12 -07:00
Yang Weijiang
8b05b3c988 x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only feature
== Background ==

CET defines two register states: CET user, which includes user-mode control
registers, and CET supervisor, which consists of shadow-stack pointers for
privilege levels 0-2.

Current kernels disable shadow stacks in kernel mode, making the CET
supervisor state unused and eliminating the need for context switching.

== Problem ==

To virtualize CET for guests, KVM must accurately emulate hardware
behavior. A key challenge arises because there is no CPUID flag to indicate
that shadow stack is supported only in user mode. Therefore, KVM cannot
assume guests will not enable shadow stacks in kernel mode and must
preserve the CET supervisor state of vCPUs.

== Solution ==

An initial proposal to manually save and restore CET supervisor states
using raw RDMSR/WRMSR in KVM was rejected due to performance concerns and
its impact on KVM's ABI. Instead, leveraging the kernel's FPU
infrastructure for context switching was favored [1].

The main question then became whether to enable the CET supervisor state
globally for all processes or restrict it to vCPU processes. This decision
involves a trade-off between a 24-byte XSTATE buffer waste for all non-vCPU
processes and approximately 100 lines of code complexity in the kernel [2].
The agreed approach is to first try this optimal solution [3], i.e.,
restricting the CET supervisor state to guest FPUs only and eliminating
unnecessary space waste.

The guest-only xfeature infrastructure has already been added. Now,
introduce CET supervisor xstate support as the first guest-only feature
to prepare for the upcoming CET virtualization in KVM.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/kvm/ZM1jV3UPL0AMpVDI@google.com/ [1]
Link: https://lore.kernel.org/kvm/1c2fd06e-2e97-4724-80ab-8695aa4334e7@intel.com/ [2]
Link: https://lore.kernel.org/kvm/2597a87b-1248-b8ce-ce60-94074bc67ea4@intel.com/ [3]
Link: https://lore.kernel.org/all/20250522151031.426788-7-chao.gao%40intel.com
2025-06-24 13:46:33 -07:00
Yang Weijiang
151bf23249 x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set
In preparation for upcoming CET virtualization support, the CET supervisor
state will be added as a "guest-only" feature, since it is required only by
KVM (i.e., guest FPUs). Establish the infrastructure for "guest-only"
features.

Define a new XFEATURE_MASK_GUEST_SUPERVISOR mask to specify features that
are enabled by default in guest FPUs but not in host FPUs. Specifically,
for any bit in this set, permission is granted and XSAVE space is allocated
during vCPU creation. Non-guest FPUs cannot enable guest-only features,
even dynamically, and no XSAVE space will be allocated for them.

The mask is currently empty, but this will be changed by a subsequent
patch.

Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/all/20250522151031.426788-6-chao.gao%40intel.com
2025-06-24 13:46:32 -07:00
Chao Gao
fafb29e18d x86/fpu: Remove xfd argument from __fpstate_reset()
The initial values for fpstate::xfd differ between guest and host fpstates.
Currently, the initial values are passed as an argument to
__fpstate_reset(). But, __fpstate_reset() already assigns different default
features and sizes based on the type of fpstates (i.e., guest or host). So,
handle fpstate::xfd in a similar way to highlight the differences in the
initial xfd value between guest and host fpstates

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/all/aBuf7wiiDT0Wflhk@google.com/
Link: https://lore.kernel.org/all/20250522151031.426788-5-chao.gao%40intel.com
2025-06-24 13:46:32 -07:00
Chao Gao
509e880b77 x86/fpu: Initialize guest fpstate and FPU pseudo container from guest defaults
fpu_alloc_guest_fpstate() currently uses host defaults to initialize guest
fpstate and pseudo containers. Guest defaults were introduced to
differentiate the features and sizes of host and guest FPUs. Switch to
using guest defaults instead.

Adjust __fpstate_reset() to handle different defaults for host and guest
FPUs. And to distinguish between the types of FPUs, move the initialization
of indicators (is_guest and is_valloc) before the reset.

Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/all/20250522151031.426788-4-chao.gao%40intel.com
2025-06-24 13:46:32 -07:00
Chao Gao
7c2c89364d x86/fpu: Initialize guest FPU permissions from guest defaults
Currently, fpu->guest_perm is copied from fpu->perm, which is derived from
fpu_kernel_cfg.default_features.

Guest defaults were introduced to differentiate the features and sizes of
host and guest FPUs. Copying guest FPU permissions from the host will lead
to inconsistencies between the guest default features and permissions.

Initialize guest FPU permissions from guest defaults instead of host
defaults. This ensures that any changes to guest default features are
automatically reflected in guest permissions, which in turn guarantees
that fpstate_realloc() allocates a correctly sized XSAVE buffer for guest
FPUs.

Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/all/20250522151031.426788-3-chao.gao%40intel.com
2025-06-24 13:46:32 -07:00
Chao Gao
7bc4ed75f2 x86/fpu/xstate: Differentiate default features for host and guest FPUs
Currently, guest and host FPUs share the same default features. However,
the CET supervisor xstate is the first feature that needs to be enabled
exclusively for guest FPUs. Enabling it for host FPUs leads to a waste of
24 bytes in the XSAVE buffer.

To support "guest-only" features, add a new structure to hold the
default features and sizes for guest FPUs to clearly differentiate them
from those for host FPUs.

Add two helpers to provide the default feature masks for guest and host
FPUs. Default features are derived by applying the masks to the maximum
supported features.

Note that,
1) for now, guest_default_mask() and host_default_mask() are identical.
This will change in a follow-up patch once guest permissions, default
xfeatures, and fpstate size are all converted to use the guest defaults.

2) only supervisor features will diverge between guest FPUs and host
FPUs, while user features will remain the same [1][2]. So, the new
vcpu_fpu_config struct does not include default user features and size
for the UABI buffer.

An alternative approach is adding a guest_only_xfeatures member to
fpu_kernel_cfg and adding two helper functions to calculate the guest
default xfeatures and size. However, calculating these defaults at runtime
would introduce unnecessary overhead.

Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/kvm/aAwdQ759Y6V7SGhv@google.com/ [1]
Link: https://lore.kernel.org/kvm/9ca17e1169805f35168eb722734fbf3579187886.camel@intel.com/ [2]
Link: https://lore.kernel.org/all/20250522151031.426788-2-chao.gao%40intel.com
2025-06-24 13:46:32 -07:00
Sean Christopherson
ffe9d7966d KVM: x86/mmu: Locally cache whether a PFN is host MMIO when making a SPTE
When making a SPTE, cache whether or not the target PFN is host MMIO in
order to avoid multiple rounds of the slow path of kvm_is_mmio_pfn(), e.g.
hitting pat_pfn_immune_to_uc_mtrr() in particular can be problematic.  KVM
currently avoids multiple calls by virtue of the two users being mutually
exclusive (.get_mt_mask() is Intel-only, shadow_me_value is AMD-only), but
that won't hold true if/when KVM needs to detect host MMIO mappings for
other reasons, e.g. for mitigating the MMIO Stale Data vulnerability.

No functional change intended.

Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 13:29:31 -07:00
Sean Christopherson
c126b46e6f KVM: x86: Avoid calling kvm_is_mmio_pfn() when kvm_x86_ops.get_mt_mask is NULL
Guard the call to kvm_x86_call(get_mt_mask) with an explicit check on
kvm_x86_ops.get_mt_mask so as to avoid unnecessarily calling
kvm_is_mmio_pfn(), which is moderately expensive for some backing types.
E.g. lookup_memtype() conditionally takes a system-wide spinlock if KVM
ends up being call pat_pfn_immune_to_uc_mtrr(), e.g. for DAX memory.

While the call to kvm_x86_ops.get_mt_mask() itself is elided, the compiler
still needs to compute all parameters, as it can't know at build time that
the call will be squashed.

   <+243>:   call   0xffffffff812ad880 <kvm_is_mmio_pfn>
   <+248>:   mov    %r13,%rsi
   <+251>:   mov    %rbx,%rdi
   <+254>:   movzbl %al,%edx
   <+257>:   call   0xffffffff81c26af0 <__SCT__kvm_x86_get_mt_mask>

Fixes: 3fee4837ef ("KVM: x86: remove shadow_memtype_mask")
Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250523011756.3243624-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 13:29:31 -07:00
Xin Li (Intel)
fa7d0f83c5 x86/traps: Initialize DR7 by writing its architectural reset value
Initialize DR7 by writing its architectural reset value to always set
bit 10, which is reserved to '1', when "clearing" DR7 so as not to
trigger unanticipated behavior if said bit is ever unreserved, e.g. as
a feature enabling flag with inverted polarity.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250620231504.2676902-3-xin%40zytor.com
2025-06-24 13:15:52 -07:00
Xin Li (Intel)
5f465c148c x86/traps: Initialize DR6 by writing its architectural reset value
Initialize DR6 by writing its architectural reset value to avoid
incorrectly zeroing DR6 to clear DR6.BLD at boot time, which leads
to a false bus lock detected warning.

The Intel SDM says:

  1) Certain debug exceptions may clear bits 0-3 of DR6.

  2) BLD induced #DB clears DR6.BLD and any other debug exception
     doesn't modify DR6.BLD.

  3) RTM induced #DB clears DR6.RTM and any other debug exception
     sets DR6.RTM.

  To avoid confusion in identifying debug exceptions, debug handlers
  should set DR6.BLD and DR6.RTM, and clear other DR6 bits before
  returning.

The DR6 architectural reset value 0xFFFF0FF0, already defined as
macro DR6_RESERVED, satisfies these requirements, so just use it to
reinitialize DR6 whenever needed.

Since clear_all_debug_regs() no longer zeros all debug registers,
rename it to initialize_debug_regs() to better reflect its current
behavior.

Since debug_read_clear_dr6() no longer clears DR6, rename it to
debug_read_reset_dr6() to better reflect its current behavior.

Fixes: ebb1064e7c ("x86/traps: Handle #DB for bus lock")
Reported-by: Sohil Mehta <sohil.mehta@intel.com>
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/lkml/06e68373-a92b-472e-8fd9-ba548119770c@intel.com/
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250620231504.2676902-2-xin%40zytor.com
2025-06-24 13:15:51 -07:00
Sean Christopherson
9c4fe6d150 KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a
nested TDP VM is run, defer allocation of the array of hashed lists used
to track shadow MMU pages until the first shadow root is allocated.

Setting the list outside of mmu_lock is safe, as concurrent readers must
hold mmu_lock in some capacity, shadow pages can only be added (or removed)
from the list when mmu_lock is held for write, and tasks that are creating
a shadow root are serialized by slots_arch_lock.  I.e. it's impossible for
the list to become non-empty until all readers go away, and so readers are
guaranteed to see an empty list even if they make multiple calls to
kvm_get_mmu_page_hash() in a single mmu_lock critical section.

Use smp_store_release() and smp_load_acquire() to access the hash table
pointer to ensure the stores to zero the lists are retired before readers
start to walk the list.  E.g. if the compiler hoisted the store before the
zeroing of memory, for_each_gfn_valid_sp_with_gptes() could consume stale
kernel data.

Cc: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:51:07 -07:00
Sean Christopherson
ac777fbf06 KVM: x86: Use kvzalloc() to allocate VM struct
Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
allocation before falling back to __vmalloc(), to avoid the overhead of
establishing the virtual mappings.  For non-debug builds, The SVM and VMX
(and TDX) structures are now just below 7000 bytes in the worst case
scenario (see below), i.e. are order-1 allocations, and will likely remain
that way for quite some time.

Add compile-time assertions in vendor code to ensure the size of the
structures, sans the memslot hash tables, are order-0 allocations, i.e.
are less than 4KiB.  There's nothing fundamentally wrong with a larger
kvm_{svm,vmx,tdx} size, but given that the size of the structure (without
the memslots hash tables) is below 2KiB after 18+ years of existence,
more than doubling the size would be quite notable.

Add sanity checks on the memslot hash table sizes, partly to ensure they
aren't resized without accounting for the impact on VM structure size, and
partly to document that the majority of the size of VM structures comes
from the memslots.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:50:48 -07:00
Sean Christopherson
039ef33e2f KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
Dynamically allocate the (massive) array of hashed lists used to track
shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
all on its own, and is *exactly* an order-3 allocation.  Dynamically
allocating the array will allow allocating "struct kvm" using kvmalloc(),
and will also allow deferring allocation of the array until it's actually
needed, i.e. until the first shadow root is allocated.

Opportunistically use kvmalloc() for the hashed lists, as an order-3
allocation is (stating the obvious) less likely to fail than an order-4
allocation, and the overhead of vmalloc() is undesirable given that the
size of the allocation is fixed.

Cc: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:50:34 -07:00
David Woodhouse
a7f4dff21f KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
To avoid imposing an ordering constraint on userspace, allow 'invalid'
event channel targets to be configured in the IRQ routing table.

This is the same as accepting interrupts targeted at vCPUs which don't
exist yet, which is already the case for both Xen event channels *and*
for MSIs (which don't do any filtering of permitted APIC ID targets at
all).

If userspace actually *triggers* an IRQ with an invalid target, that
will fail cleanly, as kvm_xen_set_evtchn_fast() also does the same range
check.

If KVM enforced that the IRQ target must be valid at the time it is
*configured*, that would force userspace to create all vCPUs and do
various other parts of setup (in this case, setting the Xen long_mode)
before restoring the IRQ table.

Cc: stable@vger.kernel.org
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/e489252745ac4b53f1f7f50570b03fb416aa2065.camel@infradead.org
[sean: massage comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:20:17 -07:00
Sean Christopherson
0b6f4a5f08 KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
Use a preallocated per-vCPU bitmap for tracking the unpacked set of vCPUs
being targeted for Hyper-V's paravirt TLB flushing.  If KVM_MAX_NR_VCPUS
is set to 4096 (which is allowed even for MAXSMP=n builds), putting the
vCPU mask on-stack pushes kvm_hv_flush_tlb() past the default FRAME_WARN
limit.

  arch/x86/kvm/hyperv.c:2001:12: error: stack frame size (1288) exceeds limit (1024)
                                 in 'kvm_hv_flush_tlb' [-Werror,-Wframe-larger-than]
  2001 | static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
       |            ^
  1 error generated.

Note, sparse_banks was given the same treatment by commit 7d5e88d301
("KVM: x86: hyper-v: Use preallocated buffer in 'struct kvm_vcpu_hv'
instead of on-stack 'sparse_banks'"), for the exact same reason.

Reported-by: Abinash Lalotra <abinashsinghlalotra@gmail.com>
Closes: https://lore.kernel.org/all/20250613111023.786265-1-abinashsinghlalotra@gmail.com
Link: https://lore.kernel.org/all/aEylI-O8kFnFHrOH@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:20:16 -07:00
Sean Christopherson
48f15f6241 KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
When creating an SEV-ES vCPU for intra-host migration, set its vmsa_pa to
INVALID_PAGE to harden against doing VMRUN with a bogus VMSA (KVM checks
for a valid VMSA page in pre_sev_run()).

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Link: https://lore.kernel.org/r/20250602224459.41505-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:20:15 -07:00
Sean Christopherson
ecf371f8b0 KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
Reject migration of SEV{-ES} state if either the source or destination VM
is actively creating a vCPU, i.e. if kvm_vm_ioctl_create_vcpu() is in the
section between incrementing created_vcpus and online_vcpus.  The bulk of
vCPU creation runs _outside_ of kvm->lock to allow creating multiple vCPUs
in parallel, and so sev_info.es_active can get toggled from false=>true in
the destination VM after (or during) svm_vcpu_create(), resulting in an
SEV{-ES} VM effectively having a non-SEV{-ES} vCPU.

The issue manifests most visibly as a crash when trying to free a vCPU's
NULL VMSA page in an SEV-ES VM, but any number of things can go wrong.

  BUG: unable to handle page fault for address: ffffebde00000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] SMP KASAN NOPTI
  CPU: 227 UID: 0 PID: 64063 Comm: syz.5.60023 Tainted: G     U     O        6.15.0-smp-DEV #2 NONE
  Tainted: [U]=USER, [O]=OOT_MODULE
  Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 12.52.0-0 10/28/2024
  RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:206 [inline]
  RIP: 0010:arch_test_bit arch/x86/include/asm/bitops.h:238 [inline]
  RIP: 0010:_test_bit include/asm-generic/bitops/instrumented-non-atomic.h:142 [inline]
  RIP: 0010:PageHead include/linux/page-flags.h:866 [inline]
  RIP: 0010:___free_pages+0x3e/0x120 mm/page_alloc.c:5067
  Code: <49> f7 06 40 00 00 00 75 05 45 31 ff eb 0c 66 90 4c 89 f0 4c 39 f0
  RSP: 0018:ffff8984551978d0 EFLAGS: 00010246
  RAX: 0000777f80000001 RBX: 0000000000000000 RCX: ffffffff918aeb98
  RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffebde00000000
  RBP: 0000000000000000 R08: ffffebde00000007 R09: 1ffffd7bc0000000
  R10: dffffc0000000000 R11: fffff97bc0000001 R12: dffffc0000000000
  R13: ffff8983e19751a8 R14: ffffebde00000000 R15: 1ffffd7bc0000000
  FS:  0000000000000000(0000) GS:ffff89ee661d3000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffebde00000000 CR3: 000000793ceaa000 CR4: 0000000000350ef0
  DR0: 0000000000000000 DR1: 0000000000000b5f DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   sev_free_vcpu+0x413/0x630 arch/x86/kvm/svm/sev.c:3169
   svm_vcpu_free+0x13a/0x2a0 arch/x86/kvm/svm/svm.c:1515
   kvm_arch_vcpu_destroy+0x6a/0x1d0 arch/x86/kvm/x86.c:12396
   kvm_vcpu_destroy virt/kvm/kvm_main.c:470 [inline]
   kvm_destroy_vcpus+0xd1/0x300 virt/kvm/kvm_main.c:490
   kvm_arch_destroy_vm+0x636/0x820 arch/x86/kvm/x86.c:12895
   kvm_put_kvm+0xb8e/0xfb0 virt/kvm/kvm_main.c:1310
   kvm_vm_release+0x48/0x60 virt/kvm/kvm_main.c:1369
   __fput+0x3e4/0x9e0 fs/file_table.c:465
   task_work_run+0x1a9/0x220 kernel/task_work.c:227
   exit_task_work include/linux/task_work.h:40 [inline]
   do_exit+0x7f0/0x25b0 kernel/exit.c:953
   do_group_exit+0x203/0x2d0 kernel/exit.c:1102
   get_signal+0x1357/0x1480 kernel/signal.c:3034
   arch_do_signal_or_restart+0x40/0x690 arch/x86/kernel/signal.c:337
   exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
   exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
   __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
   syscall_exit_to_user_mode+0x67/0xb0 kernel/entry/common.c:218
   do_syscall_64+0x7c/0x150 arch/x86/entry/syscall_64.c:100
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f87a898e969
   </TASK>
  Modules linked in: gq(O)
  gsmi: Log Shutdown Reason 0x03
  CR2: ffffebde00000000
  ---[ end trace 0000000000000000 ]---

Deliberately don't check for a NULL VMSA when freeing the vCPU, as crashing
the host is likely desirable due to the VMSA being consumed by hardware.
E.g. if KVM manages to allow VMRUN on the vCPU, hardware may read/write a
bogus VMSA page.  Accessing PFN 0 is "fine"-ish now that it's sequestered
away thanks to L1TF, but panicking in this scenario is preferable to
potentially running with corrupted state.

Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Fixes: 0b020f5af0 ("KVM: SEV: Add support for SEV-ES intra host migration")
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Cc: stable@vger.kernel.org
Cc: James Houghton <jthoughton@google.com>
Cc: Peter Gonda <pgonda@google.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Tested-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250602224459.41505-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24 12:20:10 -07:00
Jiri Slaby (SUSE)
d22cf13814 serial: ce4100: clean up serial_in/out() hooks
ce4100_mem_serial_in() unnecessarily contains 4 nested 'if's. That makes
the code hard to follow. Invert the conditions and return early if the
particular conditions do not hold.

And use "<<=" for shifting the offset in both of the hooks.

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250623101246.486866-2-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-24 15:32:50 +01:00
Jiri Slaby (SUSE)
f565594077 serial: ce4100: fix build after serial_in/out() changes
This x86_32 driver remain unnoticed, so after the commit below, the
compilation now fails with:
  arch/x86/platform/ce4100/ce4100.c:107:16: error: incompatible function pointer types assigning to '...' from '...'

To fix the error, convert also ce4100 to the new
uart_port::serial_{in,out}() types.

Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org>
Fixes: fc9ceb501e ("serial: 8250: sanitize uart_port::serial_{in,out}() types")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506190552.TqNasrC3-lkp@intel.com/
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250623101246.486866-1-jirislaby@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-24 15:32:50 +01:00
Pawan Gupta
ab9f2388e0 x86/bugs: Allow ITS stuffing in eIBRS+retpoline mode also
After a recent restructuring of the ITS mitigation, RSB stuffing can no longer
be enabled in eIBRS+Retpoline mode. Before ITS, retbleed mitigation only
allowed stuffing when eIBRS was not enabled. This was perfectly fine since
eIBRS mitigates retbleed.

However, RSB stuffing mitigation for ITS is still needed with eIBRS. The
restructuring solely relies on retbleed to deploy stuffing, and does not allow
it when eIBRS is enabled. This behavior is different from what was before the
restructuring. Fix it by allowing stuffing in eIBRS+retpoline mode also.

Fixes: 61ab72c2c6 ("x86/bugs: Restructure ITS mitigation")
Closes: https://lore.kernel.org/lkml/20250519235101.2vm6sc5txyoykb2r@desk/
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250611-eibrs-fix-v4-7-5ff86cac6c61@linux.intel.com
2025-06-24 14:12:41 +02:00
Sean Christopherson
6f34372483 KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq()
Rename kvm_set_msi_irq() to kvm_msi_to_lapic_irq() to better capture what
it actually does, e.g. it's _really_ easy to conflate kvm_set_msi_irq()
with kvm_set_msi().

Opportunistically delete the public declaration and export, as they are
no longer used/needed.

No functional change intended.

Link: https://lore.kernel.org/r/20250611224604.313496-64-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:53 -07:00
Sean Christopherson
b03500f03e KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking
Configure IRTEs to GA log interrupts for device posted IRQs that hit
non-running vCPUs if and only if the target vCPU is blocking, i.e.
actually needs a wake event.  If the vCPU has exited to userspace or was
preempted, generating GA log entries and interrupts is wasteful and
unnecessary, as the vCPU will be re-loaded and/or scheduled back in
irrespective of the GA log notification (avic_ga_log_notifier() is just a
fancy wrapper for kvm_vcpu_wake_up()).

Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to
track whether or not the vCPU's associated IRTEs are configured to
generate GA logs, but only set the synthetic bit in KVM's "cache", i.e.
never set the should-be-zero bit in tables that are used by hardware.
Use a synthetic bit instead of a dedicated boolean to minimize the odds
of messing up the locking, i.e. so that all the existing rules that apply
to avic_physical_id_entry for IS_RUNNING are reused verbatim for
GA_LOG_INTR.

Note, because KVM (by design) "puts" AVIC state in a "pre-blocking"
phase, using kvm_vcpu_is_blocking() to track the need for notifications
isn't a viable option.

Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:52 -07:00
Sean Christopherson
b9e53f9ff4 iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or
not an IRTE is configured to generate GA log interrupts.  KVM only needs a
notification if the target vCPU is blocking, so the vCPU can be awakened.
If a vCPU is preempted or exits to userspace, KVM clears is_run, but will
set the vCPU back to running when userspace does KVM_RUN and/or the vCPU
task is scheduled back in, i.e. KVM doesn't need a notification.

Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes
from the KVM changes insofar as possible.

Opportunistically swap the ordering of parameters for amd_iommu_update_ga()
so that the match amd_iommu_activate_guest_mode().

Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as
a non-cached field, but per AMD hardware architects, it's not cached and
can be safely updated without an invalidation.

Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:51 -07:00
Sean Christopherson
5f3d06b164 KVM: SVM: Consolidate IRTE update when toggling AVIC on/off
Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into
__avic_vcpu_{load,put}(), and add a param to the helpers to communicate
whether or not AVIC is being toggled, i.e. if IRTE needs a "full" update,
or just a quick update to set the CPU and IsRun.

Link: https://lore.kernel.org/r/20250611224604.313496-61-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:50 -07:00
Sean Christopherson
6eab2340f3 KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off
Don't query a vCPU's blocking status when toggling AVIC on/off; barring
KVM bugs, the vCPU can't be blocking when refreshing AVIC controls.  And
if there are KVM bugs, ensuring the vCPU and its associated IRTEs are in
the correct state is desirable, i.e. well worth any overhead in a buggy
scenario.

Isolating the "real" load/put flows will allow moving the IOMMU IRTE
(de)activation logic from avic_refresh_apicv_exec_ctrl() to
avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's
physical ID entry and its IRTEs in a common path, under a single critical
section of ir_list_lock.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-60-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:50 -07:00
Sean Christopherson
f2bc961d38 KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller
Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in
anticipation of moving the __avic_vcpu_{load,put}() calls into the
critical section, and because having a one-off helper with a name that's
easily confused with avic_pi_update_irte() is unnecessary.

No functional change intended.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-59-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:49 -07:00
Sean Christopherson
11a60455d4 KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata
Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to
find and kick vCPUs when a device posted interrupt serves as a wake event.
Lookups on a vCPU index are O(fast) (not sure what xa_load() actually
provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match
its index.

Unlike the Physical APIC Table, which is accessed by hardware when
virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_
use APIC IDs to fill the Physical APIC Table, but KVM has free rein over
the format/meaning of the GA tag.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-57-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:48 -07:00
Sean Christopherson
ce9d54f41b KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ bypass
WARN if KVM attempts to "start" IRQ bypass when VT-d Posted IRQs are
disabled, to make it obvious that the logic is a sanity check, and so that
a bug related to nr_possible_bypass_irqs is more like to cause noisy
failures, e.g. so that KVM doesn't silently fail to wake blocking vCPUs.

Link: https://lore.kernel.org/r/20250611224604.313496-56-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:47 -07:00
Sean Christopherson
77e1b8332d KVM: x86: Decouple device assignment from IRQ bypass
Use a dedicated counter to track the number of IRQs that can utilize IRQ
bypass instead of piggybacking the assigned device count.  As evidenced by
commit 2edd9cb79f ("kvm: detect assigned device via irqbypass manager"),
it's possible for a device to be able to post IRQs to a vCPU without said
device being assigned to a VM.

Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment
to avoid regressing the MMIO stale data mitigation.  KVM is abusing the
assigned device count when applying mmio_stale_data_clear, and it's not at
all clear if vDPA devices rely on this behavior.  This will hopefully be
cleaned up in the future, as the number of assigned devices is a terrible
heuristic for detecting if a VM has access to host MMIO.

Link: https://lore.kernel.org/r/20250611224604.313496-55-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:46 -07:00
Sean Christopherson
99836eb9c5 KVM: SVM: WARN if ir_list is non-empty at vCPU free
Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is
freed with ir_list entries, i.e. if KVM leaves a dangling IRTE.

Initialize the per-vCPU interrupt remapping list and its lock even if AVIC
is disabled so that the WARN doesn't hit false positives (and so that KVM
doesn't need to call into AVIC code for a simple sanity check).

Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:46 -07:00
Sean Christopherson
25ef059e8b KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC
Yell if kvm_pi_update_irte() is reached without an in-kernel local APIC,
as kvm_arch_irqfd_allowed() should prevent attaching an irqfd and thus any
and all postable IRQs to an APIC-less VM.

Link: https://lore.kernel.org/r/20250611224604.313496-53-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:45 -07:00
Sean Christopherson
d1bccaa179 KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()
WARN if kvm_pi_update_irte() is reached without IRQ bypass support, as the
code is only reachable if the VM already has an IRQ bypass producer (see
kvm_irq_routing_update()), or from kvm_arch_irq_bypass_{add,del}_producer(),
which, stating the obvious, are called if and only if KVM enables its IRQ
bypass hooks.

Cc: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20250611224604.313496-52-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:44 -07:00
Sean Christopherson
04c4ca0ae4 KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte()
Don't bother checking if the VM has an assigned device when updating
IRTE entries.  kvm_arch_irq_bypass_add_producer() explicitly increments
the assigned device count, kvm_arch_irq_bypass_del_producer() explicitly
decrements the count before invoking kvm_pi_update_irte(), and
kvm_irq_routing_update() only updates IRTE entries if there's an active
IRQ bypass producer.

Link: https://lore.kernel.org/r/20250611224604.313496-51-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:44 -07:00
Sean Christopherson
cd86240fea KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails
WARN if updating GA information for an IRTE entry fails as modifying an
IRTE should only fail if KVM is buggy, e.g. has stale metadata, and
because returning an error that is always ignored is pointless.

Link: https://lore.kernel.org/r/20250611224604.313496-50-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:43 -07:00
Sean Christopherson
48f79c6c86 KVM: SVM: Process all IRTEs on affinity change even if one update fails
When updating IRTE GA fields, keep processing all other IRTEs if an update
fails, as not updating later entries risks making a bad situation worse.

Link: https://lore.kernel.org/r/20250611224604.313496-49-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:42 -07:00
Sean Christopherson
16562766f1 KVM: SVM: WARN if (de)activating guest mode in IOMMU fails
WARN if (de)activating "guest mode" for an IRTE entry fails as modifying
an IRTE should only fail if KVM is buggy, e.g. has stale metadata.

Link: https://lore.kernel.org/r/20250611224604.313496-48-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:42 -07:00
Sean Christopherson
fe0213923d KVM: SVM: Don't check for assigned device(s) when activating AVIC
Don't short-circuit IRTE updating when (de)activating AVIC based on the
VM having assigned devices, as nothing prevents AVIC (de)activation from
racing with device (de)assignment.  And from a performance perspective,
bailing early when there is no assigned device doesn't add much, as
ir_list_lock will never be contended if there's no assigned device.

Link: https://lore.kernel.org/r/20250611224604.313496-47-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:41 -07:00
Sean Christopherson
f5998661ff KVM: SVM: Don't check for assigned device(s) when updating affinity
Don't bother checking if a VM has an assigned device when updating AVIC
vCPU affinity, querying ir_list is just as cheap and nothing prevents
racing with changes in device assignment.

Link: https://lore.kernel.org/r/20250611224604.313496-46-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:40 -07:00
Sean Christopherson
6df262f915 iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited
If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the
vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave
the actual IRTE in remapped mode.  KVM already handles the case where AVIC
is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()),
but doesn't handle the scenario where a postable IRQ comes along while AVIC
is inhibited.

Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:40 -07:00
Sean Christopherson
f965255dc5 iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that
avic_physical_id_entry can be safely accessed, set the pCPU info
straight-away when setting vCPU affinity.  Putting the IRTE into posted
mode, and then immediately updating the IRTE a second time if the target
vCPU is running is wasteful and confusing.

This also fixes a flaw where a posted IRQ that arrives between putting
the IRTE into guest_mode and setting the correct destination could cause
the IOMMU to ring the doorbell on the wrong pCPU.

Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:39 -07:00
Sean Christopherson
08d9ccdd1a iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination
Infer whether or not a vCPU should be marked running from the validity of
the pCPU on which it is running.  amd_iommu_update_ga() already skips the
IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an
invalid pCPU would be a blatant and egregrious KVM bug.

Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:37 -07:00
Sean Christopherson
c3d591c91f KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU
Now that svm_ir_list_add() isn't overloaded with all manner of weird
things, fold it into avic_pi_update_irte(), and more importantly take
ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info
that's shoved into the IRTE is fresh.  While preemption (and IRQs) is
disabled on the task performing the IRTE update, thanks to irqfds.lock,
that task doesn't hold the vCPU's mutex, i.e. preemption being disabled
is irrelevant.

Link: https://lore.kernel.org/r/20250611224604.313496-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23 09:50:36 -07:00