linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-08-27 15:36:48 +00:00

Author	SHA1	Message	Date
Li Chen	fbc2010d92	x86/smpboot: moves x86_topology to static initialize and truncate The #ifdeffery and the initializers in build_sched_topology() are just disgusting. Statically initialize the domain levels in the topology array and let build_sched_topology() invalidate the package domain level when NUMA in package is available. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-4-me@linux.beauty	2025-07-14 10:59:34 +02:00
Li Chen	992de2b025	x86/smpboot: remove redundant CONFIG_SCHED_SMT On x86 CONFIG_SCHED_SMT is default y if SMP is enabled, so let's simply drop CONFIG_SCHED_SMT. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-3-me@linux.beauty	2025-07-14 10:59:34 +02:00
Li Chen	e075f43609	smpboot: introduce SDTL_INIT() helper to tidy sched topology setup Define a small SDTL_INIT(maskfn, flagsfn, name) macro and use it to build the sched_domain_topology_level array. Purely a cleanup; behaviour is unchanged. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-2-me@linux.beauty	2025-07-14 10:59:34 +02:00
Tiwei Bie	f7e9077a16	um: Stop tracking stub's PID via userspace_pid[] The PID of the stub process can be obtained from current_mm_id(). There is no need to track it via userspace_pid[]. Stop doing that to simplify the code. Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250711065021.2535362-4-tiwei.bie@linux.dev Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-07-13 19:42:49 +02:00
Linus Torvalds	5d5d62298b	- Update Kirill's email address - Allow hugetlb PMD sharing only on 64-bit as it doesn't make a whole lotta sense on 32-bit - Add fixes for a misconfigured AMD Zen2 client which wasn't even supposed to run Linux -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhziVAACgkQEsHwGGHe VUoPLRAAqnv0D8pKO/UPUp05bOvKvEvYarK3Va4MV1QrqOgPIvabGbJOzYStU9+Q 4FZ5ZCZbi0eV1sZuNP1Zk7Ryp/bYipR6gLX+jg06VTXXTrjKnUN3ofBLPQf0+fE9 AZDShoSjS+6ifzt6BaUWW3uDMLOzwv50X/xwtLG+Nrprshs7HzfvJq2oFFQX6drQ kg7Cj9N8WNHl1kp6CVy2DXVRzv4VR9+yxeNfOCPOJiCVEmzRlMulPzQYWWagYidB +U+IYDJiG2p8YNL9aiCmnrRNpSfA4Podn8ZJVPKDwXSpmuUmfcLPur0c1Tjt97h0 85ovsJs+RqBBzD3ixkbNSpdNLRBFVX7q5mx4n4+1DuR5ygZrBbDyjZce9gwY2YPh h1c2dnxxxkp9LAnBFJcaWiv8jzScRRbkqwHprBidkCS4plJiGhrD6MC78fMf8kE5 i+dydBefrsYnBwe3ciyCZh/fCvPHk6OmegSdT1+0jlz2YJOGlD1uSPSMeE6YFFvW 64R7MV3BLllBkpRx57zafx8tsdRiH9mZM6naltlcOcQV3JkHZhDJ5aNkfvBJpXw3 RZtHHAG0noCGSMAhl/crQUkZOany2TdkDQn6SQpcM+iY/E0OQSH+/QM4Rw/s+05/ FEtmkC2FqJ00RODhzSqKIwkKSgiwSCokR4pty5OJqBirUqDFNiw= =d3UC -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Update Kirill's email address - Allow hugetlb PMD sharing only on 64-bit as it doesn't make a whole lotta sense on 32-bit - Add fixes for a misconfigured AMD Zen2 client which wasn't even supposed to run Linux * tag 'x86_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Update Kirill Shutemov's email address for TDX x86/mm: Disable hugetlb page table sharing on 32-bit x86/CPU/AMD: Disable INVLPGB on Zen2 x86/rdrand: Disable RDSEED on AMD Cyan Skillfish	2025-07-13 10:41:19 -07:00
David Kaplan	a026dc61cf	x86/bugs: Print enabled attack vectors Print the status of enabled attack vectors and SMT mitigation status in the boot log for easier reporting and debugging. This information will also be available through sysfs. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-21-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	6b21d2f0dc	x86/bugs: Add attack vector controls for TSA Use attack vector controls to determine which TSA mitigation to use. [ bp: Simplify the condition in the select function for better readability. ] Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250709155844.3279471-1-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	02c7d5b8e0	x86/pti: Add attack vector controls for PTI Disable PTI mitigation if user->kernel attack vector mitigations are disabled. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-20-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	0cdd2c4f35	x86/bugs: Add attack vector controls for ITS Use attack vector controls to determine if ITS mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-19-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	eda718fde6	x86/bugs: Add attack vector controls for SRSO Use attack vector controls to determine if SRSO mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-18-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	2f970a5269	x86/bugs: Add attack vector controls for L1TF Use attack vector controls to determine if L1TF mitigation is required. Disable SMT if cross-thread protection is desired. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-17-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	fdf99228e2	x86/bugs: Add attack vector controls for spectre_v2 Use attack vector controls to determine if spectre_v2 mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-16-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	ddcd4d3cb3	x86/bugs: Add attack vector controls for BHI Use attack vector controls to determine if BHI mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-15-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	07a659edcf	x86/bugs: Add attack vector controls for spectre_v2_user Use attack vector controls to determine if spectre_v2_user mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-14-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	9687eb2399	x86/bugs: Add attack vector controls for retbleed Use attack vector controls to determine if retbleed mitigation is required. Disable SMT if cross-thread protection is desired and STIBP is not available. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-13-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	19a5f3ea43	x86/bugs: Add attack vector controls for spectre_v1 Use attack vector controls to determine if spectre_v1 mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-12-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	8c7261abcb	x86/bugs: Add attack vector controls for GDS Use attack vector controls to determine if GDS mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-11-david.kaplan@amd.com	2025-07-11 17:56:41 +02:00
David Kaplan	71dc301c26	x86/bugs: Add attack vector controls for SRBDS Use attack vector controls to determine if SRBDS mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-10-david.kaplan@amd.com	2025-07-11 17:56:40 +02:00
David Kaplan	54b53dca65	x86/bugs: Add attack vector controls for RFDS Use attack vector controls to determine if RFDS mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-9-david.kaplan@amd.com	2025-07-11 17:56:40 +02:00
David Kaplan	de6f0921ba	x86/bugs: Add attack vector controls for MMIO Use attack vectors controls to determine if MMIO mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-8-david.kaplan@amd.com	2025-07-11 17:56:40 +02:00
David Kaplan	736565d4ed	x86/bugs: Add attack vector controls for TAA Use attack vector controls to determine if TAA mitigation is required. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-7-david.kaplan@amd.com	2025-07-11 17:56:40 +02:00
David Kaplan	e3a88d4c06	x86/bugs: Add attack vector controls for MDS Use attack vector controls to determine if MDS mitigation is required. The global mitigations=off command now simply disables all attack vectors so explicit checking of mitigations=off is no longer needed. If cross-thread attack mitigations are required, disable SMT. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-6-david.kaplan@amd.com	2025-07-11 17:56:40 +02:00
David Kaplan	2d31d28746	x86/bugs: Define attack vectors relevant for each bug Add a function which defines which vulnerabilities should be mitigated based on the selected attack vector controls. The selections here are based on the individual characteristics of each vulnerability. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-5-david.kaplan@amd.com	2025-07-11 17:56:40 +02:00
David Kaplan	735e59204b	x86/Kconfig: Add arch attack vector support ARCH_HAS_CPU_ATTACK_VECTORS should be set for architectures which implement the new attack-vector based controls for CPU mitigations. If an arch does not support attack-vector based controls then an attempt to use them results in a warning. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-4-david.kaplan@amd.com	2025-07-11 17:56:40 +02:00
Johannes Berg	ac1ad16f10	um: simplify syscall header files Since Thomas's recent commit 2af10530639b ("um/x86: Add system call table to header file") , we now have two extern declarations of the syscall table, one internal and one external, and they don't even match on 32-bit. Clean this up and remove all the extra code. Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://patch.msgid.link/20250704141243.a68366f6acc3.If8587a4aafdb90644fc6d0b2f5e31a2d1887915f@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-07-11 08:49:02 +02:00
Thomas Weißschuh	32a15664ef	um/x86: Add system call table to header file The generic system call tracing infrastructure requires access to the system call table. The symbol is already visible to the linker but is lacking a public declaration. Add a public declaration. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Nam Cao <namcao@linutronix.de> Link: https://patch.msgid.link/20250703-uml-have_syscall_tracepoints-v1-1-23c1d3808578@linutronix.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-07-11 08:49:02 +02:00
Neeraj Upadhyay	b95a9d3136	x86/apic: Rename 'reg_off' to 'reg' Rename the 'reg_off' parameter of apic_{set\|get}_reg() to 'reg' to match other usages in apic.h. No functional change intended. Reviewed-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250709033242.267892-15-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:44 -07:00
Neeraj Upadhyay	17776e6c20	x86/apic: KVM: Move apic_test)vector() to common code Move apic_test_vector() to apic.h in order to reuse it in the Secure AVIC guest APIC driver in later patches to test vector state in the APIC backing page. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-14-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:43 -07:00
Neeraj Upadhyay	fe954bcd57	x86/apic: KVM: Move lapic set/clear_vector() helpers to common code Move apic_clear_vector() and apic_set_vector() helper functions to apic.h in order to reuse them in the Secure AVIC guest APIC driver in later patches to atomically set/clear vectors in the APIC backing page. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-13-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:43 -07:00
Neeraj Upadhyay	3d3a9083da	x86/apic: KVM: Move lapic get/set helpers to common code Move the apic_get_reg(), apic_set_reg(), apic_get_reg64() and apic_set_reg64() helper functions to apic.h in order to reuse them in the Secure AVIC guest APIC driver in later patches to read/write registers from/to the APIC backing page. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-12-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:42 -07:00
Neeraj Upadhyay	39e81633f6	x86/apic: KVM: Move apic_find_highest_vector() to a common header In preparation for using apic_find_highest_vector() in Secure AVIC guest APIC driver, move it and associated macros to apic.h. No functional change intended. Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250709033242.267892-11-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:41 -07:00
Neeraj Upadhyay	b5f8980f29	KVM: x86: Rename lapic set/clear vector helpers In preparation for moving kvm-internal kvm_lapic_set_vector(), kvm_lapic_clear_vector() to apic.h for use in Secure AVIC APIC driver, rename them as part of the APIC API. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-10-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:41 -07:00
Neeraj Upadhyay	9c23bc4fec	KVM: x86: Rename lapic get/set_reg64() helpers In preparation for moving kvm-internal __kvm_lapic_set_reg64(), __kvm_lapic_get_reg64() to apic.h for use in Secure AVIC APIC driver, rename them as part of the APIC API. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-9-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:40 -07:00
Neeraj Upadhyay	b9bd231913	KVM: x86: Rename lapic get/set_reg() helpers In preparation for moving kvm-internal __kvm_lapic_set_reg(), __kvm_lapic_get_reg() to apic.h for use in Secure AVIC APIC driver, rename them as part of the APIC API. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-8-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:39 -07:00
Neeraj Upadhyay	bdaccfe4e5	KVM: x86: Rename find_highest_vector() In preparation for moving kvm-internal find_highest_vector() to apic.h for use in Secure AVIC APIC driver, rename find_highest_vector() to apic_find_highest_vector() as part of the APIC API. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-7-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:39 -07:00
Neeraj Upadhyay	e2fa7905b2	KVM: x86: Change lapic regs base address to void pointer Change APIC base address from "char " to "void " in KVM lapic's set/get helper functions. Pointer arithmetic for "void " and "char " operate identically. With "void *" there is less of a chance of doing the wrong thing, e.g. neglecting to cast and reading a byte instead of the desired APIC register size. No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-6-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:38 -07:00
Neeraj Upadhyay	9cbb5fd156	KVM: x86: Rename VEC_POS/REG_POS macro usages In preparation for moving most of the KVM's lapic helpers which use VEC_POS/REG_POS macros to common APIC header for use in Secure AVIC APIC driver, rename all VEC_POS/REG_POS macro usages to APIC_VECTOR_TO_BIT_NUMBER/APIC_VECTOR_TO_REG_OFFSET and remove VEC_POS/REG_POS. While at it, clean up line wrap in find_highest_vector(). No functional change intended. Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-5-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:37 -07:00
Sean Christopherson	dc98e3bd49	x86/apic: KVM: Deduplicate APIC vector => register+bit math Consolidate KVM's {REG,VEC}_POS() macros and lapic_vector_set_in_irr()'s open coded equivalent logic in anticipation of the kernel gaining more usage of vector => reg+bit lookups. Use lapic_vector_set_in_irr()'s math as using divides for both the bit number and register offset makes it easier to connect the dots, and for at least one user, fixup_irqs(), "/ 32 * 0x10" generates ever so slightly better code with gcc-14 (shaves a whole 3 bytes from the code stream): ((v) >> 5) << 4: c1 ef 05 shr $0x5,%edi c1 e7 04 shl $0x4,%edi 81 c7 00 02 00 00 add $0x200,%edi (v) / 32 * 0x10: c1 ef 05 shr $0x5,%edi 83 c7 20 add $0x20,%edi c1 e7 04 shl $0x4,%edi Keep KVM's tersely named macros as "wrappers" to avoid unnecessary churn in KVM, and because the shorter names yield more readable code overall in KVM. The new macros type cast the vector parameter to "unsigned int". This is required from better code generation for cases where an "int" is passed to these macros in KVM code. int v; ((v) >> 5) << 4: c1 f8 05 sar $0x5,%eax c1 e0 04 shl $0x4,%eax ((v) / 32 * 0x10): 85 ff test %edi,%edi 8d 47 1f lea 0x1f(%rdi),%eax 0f 49 c7 cmovns %edi,%eax c1 f8 05 sar $0x5,%eax c1 e0 04 shl $0x4,%eax ((unsigned int)(v) / 32 * 0x10): c1 f8 05 sar $0x5,%eax c1 e0 04 shl $0x4,%eax (v) & (32 - 1): 89 f8 mov %edi,%eax 83 e0 1f and $0x1f,%eax (v) % 32 89 fa mov %edi,%edx c1 fa 1f sar $0x1f,%edx c1 ea 1b shr $0x1b,%edx 8d 04 17 lea (%rdi,%rdx,1),%eax 83 e0 1f and $0x1f,%eax 29 d0 sub %edx,%eax (unsigned int)(v) % 32: 89 f8 mov %edi,%eax 83 e0 1f and $0x1f,%eax Overall kvm.ko text size is impacted if "unsigned int" is not used. Bin Orig New (w/o unsigned int) New (w/ unsigned int) lapic.o 28580 28772 28580 kvm.o 670810 671002 670810 kvm.ko 708079 708271 708079 No functional change intended. [Neeraj: Type cast vec macro param to "unsigned int", provide data in commit log on "unsigned int" requirement] Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250709033242.267892-4-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:37 -07:00
Neeraj Upadhyay	3fb7b83e2a	KVM: x86: Remove redundant parentheses around 'bitmap' When doing pointer arithmetic in apic_test_vector() and kvm_lapic_{set\|clear}_vector(), remove the unnecessary parentheses surrounding the 'bitmap' parameter. No functional change intended. Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250709033242.267892-3-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:36 -07:00
Neeraj Upadhyay	ac48017020	KVM: x86: Open code setting/clearing of bits in the ISR Remove __apic_test_and_set_vector() and __apic_test_and_clear_vector(), because the _only_ register that's safe to modify with a non-atomic operation is ISR, because KVM isn't running the vCPU, i.e. hardware can't service an IRQ or process an EOI for the relevant (virtual) APIC. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250709033242.267892-2-Neeraj.Upadhyay@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:36 -07:00
Kevin Loughlin	a77896eea3	KVM: SEV: Prefer WBNOINVD over WBINVD for cache maintenance efficiency AMD CPUs currently execute WBINVD in the host when unregistering SEV guest memory or when deactivating SEV guests. Such cache maintenance is performed to prevent data corruption, wherein the encrypted (C=1) version of a dirty cache line might otherwise only be written back after the memory is written in a different context (ex: C=0), yielding corruption. However, WBINVD is performance-costly, especially because it invalidates processor caches. Strictly-speaking, unless the SEV ASID is being recycled (meaning the SNP firmware requires the use of WBINVD prior to DF_FLUSH), the cache invalidation triggered by WBINVD is unnecessary; only the writeback is needed to prevent data corruption in remaining scenarios. To improve performance in these scenarios, use WBNOINVD when available instead of WBINVD. WBNOINVD still writes back all dirty lines (preventing host data corruption by SEV guests) but does not invalidate processor caches. Note that the implementation of wbnoinvd() ensures fall back to WBINVD if WBNOINVD is unavailable. In anticipation of forthcoming optimizations to limit the WBNOINVD only to physical CPUs that have executed SEV guests, place the call to wbnoinvd_on_all_cpus() in a wrapper function sev_writeback_caches(). Signed-off-by: Kevin Loughlin <kevinloughlin@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250201000259.3289143-3-kevinloughlin@google.com [sean: tweak comment regarding CLFUSH] Cc: Francesco Lavra <francescolavra.fl@gmail.com> Link: https://lore.kernel.org/r/20250522233733.3176144-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:44:19 -07:00
Zheyun Shen	7e00013bd3	KVM: SVM: Remove wbinvd in sev_vm_destroy() Before sev_vm_destroy() is called, kvm_arch_guest_memory_reclaimed() has been called for SEV and SEV-ES and kvm_arch_gmem_invalidate() has been called for SEV-SNP. These functions have already handled flushing the memory. Therefore, this wbinvd_on_all_cpus() can simply be dropped. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250522233733.3176144-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:43:43 -07:00
Sean Christopherson	55aed8c2db	KVM: x86: Use wbinvd_on_cpu() instead of an open-coded equivalent Use wbinvd_on_cpu() to target a single CPU instead of open-coding an equivalent, and drop KVM's wbinvd_ipi() now that all users have switched to x86 library versions. No functional change intended. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250522233733.3176144-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-10 09:43:43 -07:00
Linus Torvalds	73d7cf0710	ARM: - Remove the last leftovers of the ill-fated FPSIMD host state mapping at EL2 stage-1 - Fix unexpected advertisement to the guest of unimplemented S2 base granule sizes - Gracefully fail initialising pKVM if the interrupt controller isn't GICv3 - Also gracefully fail initialising pKVM if the carveout allocation fails - Fix the computing of the minimum MMIO range required for the host on stage-2 fault - Fix the generation of the GICv3 Maintenance Interrupt in nested mode x86: - Reject SEV{-ES} intra-host migration if one or more vCPUs are actively being created, so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM. - Use a pre-allocated, per-vCPU buffer for handling de-sparsification of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too large" issue. - Allow out-of-range/invalid Xen event channel ports when configuring IRQ routing, to avoid dictating a specific ioctl() ordering to userspace. - Conditionally reschedule when setting memory attributes to avoid soft lockups when userspace converts huge swaths of memory to/from private. - Add back MWAIT as a required feature for the MONITOR/MWAIT selftest. - Add a missing field in struct sev_data_snp_launch_start that resulted in the guest-visible workarounds field being filled at the wrong offset. - Skip non-canonical address when processing Hyper-V PV TLB flushes to avoid VM-Fail on INVVPID. - Advertise supported TDX TDVMCALLs to userspace. - Pass SetupEventNotifyInterrupt arguments to userspace. - Fix TSC frequency underflow. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmhurKgUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNxHggApTP4vw+oOzfN7UoNmgR9XZMI1p2a R8AzQ1zDyVbEVWq3xTKvXtld+dKeO0yKB/XeI/1JLck1OiHxY57I3X6k5AnsurEr CBzeAhAjXivF8woMgmlP+30aqpomcPACdQm0gRnWkRDDJfXqSUas/iE/s9Ct1dT4 4w3PtFLsSsU8vX/RttR+CqF1AQ6SeV/NRvA8hzPGMGZoQ2um74j4ZsM/3xh77Kdw Z2vOnZOIA4dk0074JjO/Yb9l00Ib4hn+MWG5jVJ+6i2HRRYd2knnB29apVS/ARdL X20j+LvtYj/jrPPdYwqjvxbIXyLbJrLCZyjKhfueN+rnisPNvzR+7YE4ZQ== =NduO -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM fixes from Paolo Bonzini: "Many patches, pretty much all of them small, that accumulated while I was on vacation. ARM: - Remove the last leftovers of the ill-fated FPSIMD host state mapping at EL2 stage-1 - Fix unexpected advertisement to the guest of unimplemented S2 base granule sizes - Gracefully fail initialising pKVM if the interrupt controller isn't GICv3 - Also gracefully fail initialising pKVM if the carveout allocation fails - Fix the computing of the minimum MMIO range required for the host on stage-2 fault - Fix the generation of the GICv3 Maintenance Interrupt in nested mode x86: - Reject SEV{-ES} intra-host migration if one or more vCPUs are actively being created, so as not to create a non-SEV{-ES} vCPU in an SEV{-ES} VM - Use a pre-allocated, per-vCPU buffer for handling de-sparsification of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too large" issue - Allow out-of-range/invalid Xen event channel ports when configuring IRQ routing, to avoid dictating a specific ioctl() ordering to userspace - Conditionally reschedule when setting memory attributes to avoid soft lockups when userspace converts huge swaths of memory to/from private - Add back MWAIT as a required feature for the MONITOR/MWAIT selftest - Add a missing field in struct sev_data_snp_launch_start that resulted in the guest-visible workarounds field being filled at the wrong offset - Skip non-canonical address when processing Hyper-V PV TLB flushes to avoid VM-Fail on INVVPID - Advertise supported TDX TDVMCALLs to userspace - Pass SetupEventNotifyInterrupt arguments to userspace - Fix TSC frequency underflow" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: avoid underflow when scaling TSC frequency KVM: arm64: Remove kvm_arch_vcpu_run_map_fp() KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes KVM: arm64: Don't free hyp pages with pKVM on GICv2 KVM: arm64: Fix error path in init_hyp_mode() KVM: arm64: Adjust range correctly during host stage-2 faults KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi() KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush KVM: SVM: Add missing member in SNP_LAUNCH_START command structure Documentation: KVM: Fix unexpected unindent warnings KVM: selftests: Add back the missing check of MONITOR/MWAIT availability KVM: Allow CPU to reschedule while setting per-page memory attributes KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table. KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt	2025-07-10 09:06:53 -07:00
Zheyun Shen	4fdc3431e0	x86/lib: Add WBINVD and WBNOINVD helpers to target multiple CPUs Extract KVM's open-coded calls to do writeback caches on multiple CPUs to common library helpers for both WBINVD and WBNOINVD (KVM will use both). Put the onus on the caller to check for a non-empty mask to simplify the SMP=n implementation, e.g. so that it doesn't need to check that the one and only CPU in the system is present in the mask. [sean: move to lib, add SMP=n helpers, clarify usage] Signed-off-by: Zheyun Shen <szy0127@sjtu.edu.cn> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250128015345.7929-2-szy0127@sjtu.edu.cn Link: https://lore.kernel.org/20250522233733.3176144-5-seanjc@google.com	2025-07-10 13:30:17 +02:00
Kevin Loughlin	07f99c3fbe	x86/lib: Add WBNOINVD helper functions In line with WBINVD usage, add WBNOINVD helper functions. Explicitly fall back to WBINVD (via alternative()) if WBNOINVD isn't supported even though the instruction itself is backwards compatible (WBNOINVD is WBINVD with an ignored REP prefix), so that disabling X86_FEATURE_WBNOINVD behaves as one would expect, e.g. in case there's a hardware issue that affects WBNOINVD. Opportunistically, add comments explaining the architectural behavior of WBINVD and WBNOINVD, and provide hints and pointers to uarch-specific behavior. Note, alternative() ensures compatibility with early boot code as needed. [ bp: Massage, fix typos, make export _GPL. ] Signed-off-by: Kevin Loughlin <kevinloughlin@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/20250522233733.3176144-4-seanjc@google.com	2025-07-10 13:23:10 +02:00
Sean Christopherson	e638081751	x86/lib: Drop the unused return value from wbinvd_on_all_cpus() Drop wbinvd_on_all_cpus()'s return value; both the "real" version and the stub always return '0', and none of the callers check the return. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250522233733.3176144-3-seanjc@google.com	2025-07-10 13:13:26 +02:00
Alistair Popple	21aa65bf82	mm: remove callers of pfn_t functionality All PFN_* pfn_t flags have been removed. Therefore there is no longer a need for the pfn_t type and all uses can be replaced with normal pfns. Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:42:19 -07:00
Alistair Popple	d438d27341	mm: remove devmap related functions and page table bits Now that DAX and all other reference counts to ZONE_DEVICE pages are managed normally there is no need for the special devmap PTE/PMD/PUD page table bits. So drop all references to these, freeing up a software defined page table bit on architectures supporting it. Link: https://lkml.kernel.org/r/6389398c32cc9daa3dfcaa9f79c7972525d310ce.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Will Deacon <will@kernel.org> # arm64 Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:42:18 -07:00
Lorenzo Stoakes	d75fa3c947	mm: update architecture and driver code to use vm_flags_t In future we intend to change the vm_flags_t type, so it isn't correct for architecture and driver code to assume it is unsigned long. Correct this assumption across the board. Overall, this patch does not introduce any functional change. Link: https://lkml.kernel.org/r/b6eb1894abc5555ece80bb08af5c022ef780c8bc.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:42:14 -07:00
Lorenzo Stoakes	78ddaa358e	mm: change vm_get_page_prot() to accept vm_flags_t argument Patch series "use vm_flags_t consistently". The VMA flags field vma->vm_flags is of type vm_flags_t. Right now this is exactly equivalent to unsigned long, but it should not be assumed to be. Much code that references vma->vm_flags already correctly uses vm_flags_t, but a fairly large chunk of code simply uses unsigned long and assumes that the two are equivalent. This series corrects that and has us use vm_flags_t consistently. This series is motivated by the desire to, in a future series, adjust vm_flags_t to be a u64 regardless of whether the kernel is 32-bit or 64-bit in order to deal with the VMA flag exhaustion issue and avoid all the various problems that arise from it (being unable to use certain features in 32-bit, being unable to add new flags except for 64-bit, etc.) This is therefore a critical first step towards that goal. At any rate, using the correct type is of value regardless. We additionally take the opportunity to refer to VMA flags as vm_flags where possible to make clear what we're referring to. Overall, this series does not introduce any functional change. This patch (of 3): We abstract the type of the VMA flags to vm_flags_t, however in may places it is simply assumed this is unsigned long, which is simply incorrect. At the moment this is simply an incongruity, however in future we plan to change this type and therefore this change is a critical requirement for doing so. Overall, this patch does not introduce any functional change. [lorenzo.stoakes@oracle.com: add missing vm_get_page_prot() instance, remove include] Link: https://lkml.kernel.org/r/552f88e1-2df8-4e95-92b8-812f7c8db829@lucifer.local Link: https://lkml.kernel.org/r/cover.1750274467.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/a12769720a2743f235643b158c4f4f0a9911daf0.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:42:13 -07:00
Nuno Das Neves	faab52b59b	x86/hyperv: Clean up hv_map/unmap_interrupt() return values Fix the return values of these hypercall helpers so they return a negated errno either directly or via hv_result_to_errno(). Update the callers to check for errno instead of using hv_status_success(), and remove redundant error printing. While at it, rearrange some variable declarations to adhere to style guidelines i.e. "reverse fir tree order". Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1751582677-30930-5-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1751582677-30930-5-git-send-email-nunodasneves@linux.microsoft.com>	2025-07-09 23:49:25 +00:00
Nuno Das Neves	bb169f80ed	x86/hyperv: Fix usage of cpu_online_mask to get valid cpu Accessing cpu_online_mask here is problematic because the cpus read lock is not held in this context. However, cpu_online_mask isn't needed here since the effective affinity mask is guaranteed to be valid in this callback. So, just use cpumask_first() to get the cpu instead of ANDing it with cpus_online_mask unnecessarily. Fixes: `e39397d1fd` ("x86/hyperv: implement an MSI domain for root partition") Reported-by: Michael Kelley <mhklinux@outlook.com> Closes: https://lore.kernel.org/linux-hyperv/SN6PR02MB4157639630F8AD2D8FD8F52FD475A@SN6PR02MB4157.namprd02.prod.outlook.com/ Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1751582677-30930-4-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1751582677-30930-4-git-send-email-nunodasneves@linux.microsoft.com>	2025-07-09 23:49:25 +00:00
Naman Jain	0271e72bc0	x86/hyperv: Fix warnings for missing export.h header inclusion Fix below warning in Hyper-V drivers that comes when kernel is compiled with W=1 option. Include export.h in driver files to fix it. * warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing Signed-off-by: Naman Jain <namjain@linux.microsoft.com> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Link: https://lore.kernel.org/r/20250611100459.92900-3-namjain@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250611100459.92900-3-namjain@linux.microsoft.com>	2025-07-09 23:43:53 +00:00
Paolo Bonzini	4578a747f3	KVM: x86: avoid underflow when scaling TSC frequency In function kvm_guest_time_update(), __scale_tsc() is used to calculate a TSC frequency rather than a TSC value. With low-enough ratios, a TSC value that is less than 1 would underflow to 0 and to an infinite while loop in kvm_get_time_scale(): kvm_guest_time_update(struct kvm_vcpu v) if (kvm_caps.has_tsc_control) tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz, v->arch.l1_tsc_scaling_ratio); __scale_tsc(u64 ratio, u64 tsc) ratio=122380531, tsc=2299998, N=48 ratiotsc >> N = 0.999... -> 0 Later in the function: Call Trace: <TASK> kvm_get_time_scale arch/x86/kvm/x86.c:2458 [inline] kvm_guest_time_update+0x926/0xb00 arch/x86/kvm/x86.c:3268 vcpu_enter_guest.constprop.0+0x1e70/0x3cf0 arch/x86/kvm/x86.c:10678 vcpu_run+0x129/0x8d0 arch/x86/kvm/x86.c:11126 kvm_arch_vcpu_ioctl_run+0x37a/0x13d0 arch/x86/kvm/x86.c:11352 kvm_vcpu_ioctl+0x56b/0xe60 virt/kvm/kvm_main.c:4188 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0x12d/0x190 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x59/0x110 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x78/0xe2 This can really happen only when fuzzing, since the TSC frequency would have to be nonsensically low. Fixes: `35181e86df` ("KVM: x86: Add a common TSC scaling function") Reported-by: Yuntao Liu <liuyuntao12@huawei.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2025-07-09 13:52:50 -04:00
Jim Mattson	a7cec20845	KVM: x86: Provide a capability to disable APERF/MPERF read intercepts Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs without interception. The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not handled at all. The MSR values are not zeroed on vCPU creation, saved on suspend, or restored on resume. No accommodation is made for processor migration or for sharing a logical processor with other tasks. No adjustments are made for non-unit TSC multipliers. The MSRs do not account for time the same way as the comparable PMU events, whether the PMU is virtualized by the traditional emulation method or the new mediated pass-through approach. Nonetheless, in a properly constrained environment, this capability can be combined with a guest CPUID table that advertises support for CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the effective physical CPU frequency in /proc/cpuinfo. Moreover, there is no performance cost for this capability. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-09 09:33:37 -07:00
Jim Mattson	6fbef8615d	KVM: x86: Replace growing set of *_in_guest bools with a u64 Store each "disabled exit" boolean in a single bit rather than a byte. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250530185239.2335185-2-jmattson@google.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250626001225.744268-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-09 09:32:32 -07:00
Xin Li	e88cfd50b6	KVM: x86: Advertise support for LKGS Advertise support for LKGS (load into IA32_KERNEL_GS_BASE) to userspace if the instruction is supported by the underlying CPU. LKGS is introduced with FRED to completely eliminate the need to swapgs explicilty. It behaves like the MOV to GS instruction except that it loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor cache, which is exactly what Linux kernel does to load a user level GS base. Thus there is no need to SWAPGS away from the kernel GS base. LKGS is an independent CPU feature that works correctly in a KVM guest without requiring explicit enablement. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250626173521.2301088-1-xin@zytor.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-09 09:32:25 -07:00
Sean Christopherson	e1ef1c57ff	KVM: VMX: Add a macro to track which DEBUGCTL bits are host-owned Add VMX_HOST_OWNED_DEBUGCTL_BITS to track which bits are host-owned, i.e. need to be preserved when running the guest, to dedup the logic without having to incur a memory load to get at kvm_x86_ops.HOST_OWNED_DEBUGCTL. No functional change intended. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/all/aF1yni8U6XNkyfRf@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-07-09 09:30:52 -07:00
Borislav Petkov (AMD)	fde494e905	Add the mitigation logic for Transient Scheduler Attacks (TSA) TSA are new aspeculative side channel attacks related to the execution timing of instructions under specific microarchitectural conditions. In some cases, an attacker may be able to use this timing information to infer data from other contexts, resulting in information leakage. Add the usual controls of the mitigation and integrate it into the existing speculation bugs infrastructure in the kernel. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhSsvQACgkQEsHwGGHe VUrWNw//V+ZabYq3Nnvh4jEe6Altobnpn8bOIWmcBx6I3xuuArb9bLqcbKerDIcC POVVW6zrdNigDe/U4aqaJXE7qCRX55uTYbhp8OLH0zzqX3Pjl/hUnEXWtMtlXj/G CIM5mqjqEFp5JRGXetdjjuvjG1IPf+CbjKqj2WXbi//T6F3LiAFxkzdUhd+clBF/ ztWchjwUmqU0WJd6+Smb8ZnvWrLoZuOFldjhFad820B7fqkdJhzjHMmwBHJKUEZu oABv8B0/4IALrx6LenCspWS4OuTOGG7DKyIgzitByXygXXb4L3ZUKpuqkxBU7hFx bscwtOP7e5HIYAekx6ZSLZoZpYQXr1iH0aRGrjwapi3ASIpUwI0UA9ck2PdGo0IY 0GvmN0vbybskewBQyG819BM+DCau5pOLWuL7cYmaD2eTNoOHOknMDNlO8VzXqJxa NnignSuEWFm2vNV1FXEav2YbVjlanV6JleiPDGBe5Xd9dnxZTvg9HuP2NkYio4dZ mb/kEU/kTcN8nWh0Q96tX45kmj0vCbBgrSQkmUpyAugp38n69D1tp3ii9D/hyQFH hKGcFC9m+rYVx1NLyAxhTGxaEqF801d5Qawwud8HsnQudTpCdSXD9fcBg9aCbWEa FymtDpIeUQrFAjDpVEp6Syh3odKvLXsGEzL+DVvqKDuA8r6DxFo= =2cLl -----END PGP SIGNATURE----- Merge tag 'tsa_x86_bugs_for_6.16' into tip-x86-bugs Pick up TSA changes from mainline so that attack vectors work can continue ontop. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2025-07-09 18:16:53 +02:00
Jann Horn	76303ee8d5	x86/mm: Disable hugetlb page table sharing on 32-bit Only select ARCH_WANT_HUGE_PMD_SHARE on 64-bit x86. Page table sharing requires at least three levels because it involves shared references to PMD tables; 32-bit x86 has either two-level paging (without PAE) or three-level paging (with PAE), but even with three-level paging, having a dedicated PGD entry for hugetlb is only barely possible (because the PGD only has four entries), and it seems unlikely anyone's actually using PMD sharing on 32-bit. Having ARCH_WANT_HUGE_PMD_SHARE enabled on non-PAE 32-bit X86 (which has 2-level paging) became particularly problematic after commit `59d9094df3` ("mm: hugetlb: independent PMD page table shared count"), since that changes `struct ptdesc` such that the `pt_mm` (for PGDs) and the `pt_share_count` (for PMDs) share the same union storage - and with 2-level paging, PMDs are PGDs. (For comparison, arm64 also gates ARCH_WANT_HUGE_PMD_SHARE on the configuration of page tables such that it is never enabled with 2-level paging.) Closes: https://lore.kernel.org/r/srhpjxlqfna67blvma5frmy3aa@altlinux.org Fixes: `cfe28c5d63` ("x86: mm: Remove x86 version of huge_pmd_share.") Reported-by: Vitaly Chikunov <vt@altlinux.org> Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Vitaly Chikunov <vt@altlinux.org> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250702-x86-2level-hugetlb-v2-1-1a98096edf92%40google.com	2025-07-09 07:46:36 -07:00
Kan Liang	829f5a6308	perf/x86/intel/uncore: Add iMC freerunning for Panther Lake PTL uncore imc freerunning counters are the same as the previous HW. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250707201750.616527-5-kan.liang@linux.intel.com	2025-07-09 13:40:20 +02:00
Kan Liang	64ad6d6ede	perf/x86/intel/uncore: Add Panther Lake support The Panther Lake supports CBOX, MC, sNCU, and HBO uncore PMON. The CBOX is similar to Lunar Lake. The only difference is the number of CBOX. The other three uncore PMON can be retrieved from the discovery table. The global control register resides in the sNCU. The global freeze bit is set by default. It must be cleared before monitoring any uncore counters. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250707201750.616527-4-kan.liang@linux.intel.com	2025-07-09 13:40:19 +02:00
Kan Liang	fca24bf2b6	perf/x86/intel/uncore: Support customized MMIO map size For a server platform, the MMIO map size is always 0x4000. However, a client platform may have a smaller map size. Make the map size customizable. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250707201750.616527-3-kan.liang@linux.intel.com	2025-07-09 13:40:19 +02:00
Kan Liang	cf002dafed	perf/x86/intel/uncore: Support MSR portal for discovery tables Starting from the Panther Lake, the discovery table mechanism is also supported in client platforms. The difference is that the portal of the global discovery table is retrieved from an MSR. The layout of discovery tables are the same as the server platforms. Factor out __parse_discovery_table() to parse discover tables. The uncore PMON is Die scope. Need to parse the discovery tables for each die. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250707201750.616527-2-kan.liang@linux.intel.com	2025-07-09 13:40:19 +02:00
Greg Kroah-Hartman	9b355cdb63	x86/microcode: Move away from using a fake platform device Downloading firmware needs a device to hang off of, and so a platform device seemed like the simplest way to do this. Now that we have a faux device interface, use that instead as this "microcode device" is not anything resembling a platform device at all. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Link: https://lore.kernel.org/2025070121-omission-small-9308@gregkh	2025-07-09 13:12:08 +02:00
Mikhail Paulyshka	a74bb5f202	x86/CPU/AMD: Disable INVLPGB on Zen2 AMD Cyan Skillfish (Family 17h, Model 47h, Stepping 0h) has an issue that causes system oopses and panics when performing TLB flush using INVLPGB. However, the problem is that that machine has misconfigured CPUID and should not report the INVLPGB bit in the first place. So zap the kernel's representation of the flag so that nothing gets confused. [ bp: Massage. ] Fixes: `767ae437a3` ("x86/mm: Add INVLPGB feature and Kconfig entry") Signed-off-by: Mikhail Paulyshka <me@mixaill.net> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/1ebe845b-322b-4929-9093-b41074e9e939@mixaill.net	2025-07-08 21:34:01 +02:00
Mikhail Paulyshka	5b937a1ed6	x86/rdrand: Disable RDSEED on AMD Cyan Skillfish AMD Cyan Skillfish (Family 17h, Model 47h, Stepping 0h) has an error that causes RDSEED to always return 0xffffffff, while RDRAND works correctly. Mask the RDSEED cap for this CPU so that both /proc/cpuinfo and direct CPUID read report RDSEED as unavailable. [ bp: Move to amd.c, massage. ] Signed-off-by: Mikhail Paulyshka <me@mixaill.net> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/20250524145319.209075-1-me@mixaill.net	2025-07-08 21:33:26 +02:00
Linus Torvalds	6e9128ff9d	Add the mitigation logic for Transient Scheduler Attacks (TSA) TSA are new aspeculative side channel attacks related to the execution timing of instructions under specific microarchitectural conditions. In some cases, an attacker may be able to use this timing information to infer data from other contexts, resulting in information leakage. Add the usual controls of the mitigation and integrate it into the existing speculation bugs infrastructure in the kernel. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhSsvQACgkQEsHwGGHe VUrWNw//V+ZabYq3Nnvh4jEe6Altobnpn8bOIWmcBx6I3xuuArb9bLqcbKerDIcC POVVW6zrdNigDe/U4aqaJXE7qCRX55uTYbhp8OLH0zzqX3Pjl/hUnEXWtMtlXj/G CIM5mqjqEFp5JRGXetdjjuvjG1IPf+CbjKqj2WXbi//T6F3LiAFxkzdUhd+clBF/ ztWchjwUmqU0WJd6+Smb8ZnvWrLoZuOFldjhFad820B7fqkdJhzjHMmwBHJKUEZu oABv8B0/4IALrx6LenCspWS4OuTOGG7DKyIgzitByXygXXb4L3ZUKpuqkxBU7hFx bscwtOP7e5HIYAekx6ZSLZoZpYQXr1iH0aRGrjwapi3ASIpUwI0UA9ck2PdGo0IY 0GvmN0vbybskewBQyG819BM+DCau5pOLWuL7cYmaD2eTNoOHOknMDNlO8VzXqJxa NnignSuEWFm2vNV1FXEav2YbVjlanV6JleiPDGBe5Xd9dnxZTvg9HuP2NkYio4dZ mb/kEU/kTcN8nWh0Q96tX45kmj0vCbBgrSQkmUpyAugp38n69D1tp3ii9D/hyQFH hKGcFC9m+rYVx1NLyAxhTGxaEqF801d5Qawwud8HsnQudTpCdSXD9fcBg9aCbWEa FymtDpIeUQrFAjDpVEp6Syh3odKvLXsGEzL+DVvqKDuA8r6DxFo= =2cLl -----END PGP SIGNATURE----- Merge tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU speculation fixes from Borislav Petkov: "Add the mitigation logic for Transient Scheduler Attacks (TSA) TSA are new aspeculative side channel attacks related to the execution timing of instructions under specific microarchitectural conditions. In some cases, an attacker may be able to use this timing information to infer data from other contexts, resulting in information leakage. Add the usual controls of the mitigation and integrate it into the existing speculation bugs infrastructure in the kernel" * tag 'tsa_x86_bugs_for_6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/process: Move the buffer clearing before MONITOR x86/microcode/AMD: Add TSA microcode SHAs KVM: SVM: Advertise TSA CPUID bits to guests x86/bugs: Add a Transient Scheduler Attacks mitigation x86/bugs: Rename MDS machinery to something more generic	2025-07-07 17:08:36 -07:00
Mario Limonciello	f126821482	x86/itmt: Add debugfs file to show core priorities Multiple drivers can report priorities to ITMT. To aid in debugging any issues with the values reported by drivers introduce a debugfs file to read out the values. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/20250609200518.3616080-14-superm1@kernel.org	2025-07-07 22:35:51 +02:00
Perry Yuan	9e8f6bf782	x86/process: Clear hardware feedback history for AMD processors Incorporate a mechanism within the context switching code to reset the hardware history for AMD processors. Specifically, when a task is switched in, the class ID is read and the hardware workload classification history of the CPU firmware is reset. Then, the workload classification for the next running thread is begun. [ bp: Massage commit message. ] Signed-off-by: Perry Yuan <perry.yuan@amd.com> Co-developed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/20250609200518.3616080-10-superm1@kernel.org	2025-07-07 22:30:36 +02:00
Perry Yuan	a3c4f3396b	x86/msr-index: Add AMD workload classification MSRs Introduce new MSR registers for AMD hardware feedback support. They provide workload classification and configuration capabilities. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/20250609200518.3616080-4-superm1@kernel.org	2025-07-07 19:24:58 +02:00
Linus Torvalds	5fc2e891a5	- Make sure AMD SEV guests using secure TSC, include a TSC_FACTOR which prevents their TSCs from going skewed from the hypervisor's -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhqMCYACgkQEsHwGGHe VUrz6A/9EMN2fEdpeStGrk5t0BvYPhnIApr2x2QuvVN6qXgyGWQXhnF5G94SFNn8 ykNxcGi97R/wxqk2uK3RfBg4P0ScoPDOLKKSeaqO0LVuHDTVX72fwB1F3qdaPNbp EIEL+OOEwUAwviT2GSH4mwTb1C7TuJnOZH2lC6yDkWwN5BLnIA4P4C0Wr4pIQ+MT TMzxGMT01yTnCAGHGOD2NRUIv/29qeJl+18uOqDO4A64RPT5Pp0yLJ72U4grGzOr 2C49+/XD0qjXMs8vRz/CbeBK47ZaE1v/ui3g5ofbZ2YtcrfJXDY2/SwoJ0Oz+liM TSEXj8IpFZ3aq3+Pvgp9Qibu5QnFxJi7xTzrGCG1OSouHXTH2eFSVIXCiojVU12Y s+pKCBTXs9wVJN4z/FaSSwmTvQolld7oozShgPieZsYNBfJeeWIm+6LZRT4Zr/7Y UVsYEc/7m36ggKK+XFHsea2ZnmUFV18kEHPuWAXwmH3DW3dDfI5nm/s811jsbS+6 2RaLZPiKBsYmNZ7iCujrY3GEmE5Eyemr8Ricj2zSGTCH2EYNeODDQBOn+hYgQOTK WJFWqpC5JqI5oJapmhugCkjfT75e+XTgO8Dox7HdlJR4UAb61xxf1zGFbThH8T45 LZgbIKtLwwLShg0FwzDl7swnADJ/SiaKl049Q8Z5YthhllHy6zI= =6yl8 -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: - Make sure AMD SEV guests using secure TSC, include a TSC_FACTOR which prevents their TSCs from going skewed from the hypervisor's * tag 'x86_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Use TSC_FACTOR for Secure TSC frequency calculation	2025-07-06 10:44:20 -07:00
Linus Torvalds	bdde3141ce	- Do not remove the MCE sysfs hierarchy if thresholding sysfs nodes init fails due to new/unknown banks present, which in itself is not fatal anyway; add default names for new banks - Make sure MCE polling settings are honored after CMCI storms - Make sure MCE threshold limit is reset after the thresholding interrupt has been serviced - Clean up properly and disable CMCI banks on shutdown so that a second/kexec-ed kernel can rediscover those banks again -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhqLrkACgkQEsHwGGHe VUqm/hAAlg6nX/uJIeszPpZK1o4j6WP3IZ0RAo5yxd9mQC+zNP35Xqv3MOUBZVGE 1EhrQSgrqeC9NIbT8a9ansnc3BKxODOB8raoNA3HpViTG3LeQ5ycmpJv5qWqcr8B EV14BpZHAEx3AXAiDcGXkuKYM9RrpHtOwtyKjib4ScT+xCkzQzJtrO2hPTk8Slr/ 8Hly14+st7Iqvnh9nH1TeMOjogGg+lr9cIlG6rzZGj3IPbfEWbvmHhjJut7p4TfU g5AY3djzJ8eyrOv3aKxgVDkJ7qet283sc+mvTzrhAnoJEYI9v7tde4g6AKJj0BLA +u0tTOu47c/wijLcHPpaS+zwifo4BxKDIG8q+tHT6ixMMPT3Mev8nwanjAotjgz7 WSN3eL9jie+IJPq4c0eN2z2um5tiqxFHF5M7q4Ol5VJiUU8Wa31B/pzPZymQoCWZ F/SC5VVq+ZwfRqzQMAK5dHQSj1zvkbtb0HOMYoFTOU1JocF7Px1gxx401UQ6pKdG Qw2rE1SUKxaQT4HZ2+SRvO2egJItsQw4r+ZT/7sMQhII9v9qK500kD8o30HcCEh3 o8kT+dQwKEv0KLga5vYFnITT29XM9GySdAEI7HxKi/kpnwaUppUljuycfhzAMtYe xz86txluUsk7sUW8eUPgjuKPYO3a20FkY8VB5TYWTHxJdMAeb4E= =JI7d -----END PGP SIGNATURE----- Merge tag 'ras_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS fixes from Borislav Petkov: - Do not remove the MCE sysfs hierarchy if thresholding sysfs nodes init fails due to new/unknown banks present, which in itself is not fatal anyway; add default names for new banks - Make sure MCE polling settings are honored after CMCI storms - Make sure MCE threshold limit is reset after the thresholding interrupt has been serviced - Clean up properly and disable CMCI banks on shutdown so that a second/kexec-ed kernel can rediscover those banks again * tag 'ras_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Make sure CMCI banks are cleared during shutdown on Intel x86/mce/amd: Fix threshold limit reset x86/mce/amd: Add default names for MCA banks and blocks x86/mce: Ensure user polling settings are honored when restarting timer x86/mce: Don't remove sysfs if thresholding sysfs init fails	2025-07-06 09:17:48 -07:00
Eric Biggers	b86ced882b	lib/crypto: sha256: Make library API use strongly-typed contexts Currently the SHA-224 and SHA-256 library functions can be mixed arbitrarily, even in ways that are incorrect, for example using sha224_init() and sha256_final(). This is because they operate on the same structure, sha256_state. Introduce stronger typing, as I did for SHA-384 and SHA-512. Also as I did for SHA-384 and SHA-512, use the names _ctx instead of _state. The _ctx names have the following small benefits: - They're shorter. - They avoid an ambiguity with the compression function state. - They're consistent with the well-known OpenSSL API. - Users usually name the variable 'sctx' anyway, which suggests that _ctx would be the more natural name for the actual struct. Therefore: update the SHA-224 and SHA-256 APIs, implementation, and calling code accordingly. In the new structs, also strongly-type the compression function state. Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250630160645.3198-7-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-07-04 10:18:53 -07:00
Linus Torvalds	df46426745	platform-drivers-x86 for v6.16-3 Fixes and New HW Support - amd/isp4: Improve swnode graph (new driver exception) - asus-nb-wmi: Use duo keyboard quirk for Zenbook Duo UX8406CA - dell-lis3lv02d: Add Latitude 5500 accelerometer address - dell-wmi-sysman: Fix WMI data block retrieval and class dev unreg - hp-bioscfg: Fix class device unregistration - i2c: piix4: Re-enable on non-x86 + move FCH header under platform_data/ - intel/hid: Wildcat Lake support - mellanox: - mlxbf-pmc: Fix duplicate event ID - mlxbf-tmfifo: Fix vring_desc.len assignment - mlxreg-lc: Fix bit-not-set logic check - nvsw-sn2201: Fix bus number in error message & spelling errors - portwell-ec: Move watchdog device under correct platform hierarchy - think-lmi: Error handling fixes (sysfs, kset, kobject, class dev unreg) - thinkpad_acpi: Handle HKEY 0x1402 event (2025 Thinkpads) - wmi: Fix WMI event enablement The following is an automated shortlog grouped by driver: asus-nb-wmi: - add DMI quirk for ASUS Zenbook Duo UX8406CA dell-lis3lv02d: - Add Latitude 5500 dell-wmi-sysman: - Fix class device unregistration - Fix WMI data block retrieval in sysfs callbacks hp-bioscfg: - Fix class device unregistration i2c: - Re-enable piix4 driver on non-x86 intel/hid: - Add Wildcat Lake support mellanox: - Fix spelling and comment clarity in Mellanox drivers mlxbf-pmc: - Fix duplicate event ID for CACHE_DATA1 mlxbf-tmfifo: - fix vring_desc.len assignment mlxreg-lc: - Fix logic error in power state check Move FCH header to a location accessible by all archs: - Move FCH header to a location accessible by all archs nvsw-sn2201: - Fix bus number in adapter error message portwell-ec: - Move watchdog device under correct platform hierarchy think-lmi: - Create ksets consecutively - Fix class device unregistration - Fix kobject cleanup - Fix sysfs group cleanup thinkpad_acpi: - handle HKEY 0x1402 event Update swnode graph for amd isp4: - Update swnode graph for amd isp4 wmi: - Fix WMI event enablement - Update documentation of WCxx/WExx ACPI methods -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSCSUwRdwTNL2MhaBlZrE9hU+XOMQUCaGfkwwAKCRBZrE9hU+XO MVK1AQCK3C21auqcEbiZrx67hr5ir6VwTAZ9S6IR8R2FKqw8YwEAinUOcHSbmP6a eXV0v5xVRPxZV7JBO5aN7FESqVHpBQ4= =uxUH -----END PGP SIGNATURE----- Merge tag 'platform-drivers-x86-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform drivers fixes from Ilpo Järvinen: "Mostly a few lines fixed here and there except amd/isp4 which improves swnodes relationships but that is a new driver not in any stable kernels yet. The think-lmi driver changes also look relatively large but there are just many fixes to it. The i2c/piix4 change is a effectively a revert of the commit `7e173eb82a` ("i2c: piix4: Make CONFIG_I2C_PIIX4 dependent on CONFIG_X86") but that required moving the header out from arch/x86 under include/linux/platform_data/ Summary: - amd/isp4: Improve swnode graph (new driver exception) - asus-nb-wmi: Use duo keyboard quirk for Zenbook Duo UX8406CA - dell-lis3lv02d: Add Latitude 5500 accelerometer address - dell-wmi-sysman: Fix WMI data block retrieval and class dev unreg - hp-bioscfg: Fix class device unregistration - i2c: piix4: Re-enable on non-x86 + move FCH header under platform_data/ - intel/hid: Wildcat Lake support - mellanox: - mlxbf-pmc: Fix duplicate event ID - mlxbf-tmfifo: Fix vring_desc.len assignment - mlxreg-lc: Fix bit-not-set logic check - nvsw-sn2201: Fix bus number in error message & spelling errors - portwell-ec: Move watchdog device under correct platform hierarchy - think-lmi: Error handling fixes (sysfs, kset, kobject, class dev unreg) - thinkpad_acpi: Handle HKEY 0x1402 event (2025 Thinkpads) - wmi: Fix WMI event enablement" * tag 'platform-drivers-x86-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (22 commits) platform/x86: think-lmi: Fix sysfs group cleanup platform/x86: think-lmi: Fix kobject cleanup platform/x86: think-lmi: Create ksets consecutively platform/mellanox: mlxreg-lc: Fix logic error in power state check i2c: Re-enable piix4 driver on non-x86 Move FCH header to a location accessible by all archs platform/x86/intel/hid: Add Wildcat Lake support platform/x86: dell-wmi-sysman: Fix class device unregistration platform/x86: think-lmi: Fix class device unregistration platform/x86: hp-bioscfg: Fix class device unregistration platform/x86: Update swnode graph for amd isp4 platform/x86: dell-wmi-sysman: Fix WMI data block retrieval in sysfs callbacks platform/x86: wmi: Update documentation of WCxx/WExx ACPI methods platform/x86: wmi: Fix WMI event enablement platform/mellanox: nvsw-sn2201: Fix bus number in adapter error message platform/mellanox: Fix spelling and comment clarity in Mellanox drivers platform/mellanox: mlxbf-pmc: Fix duplicate event ID for CACHE_DATA1 platform/x86: thinkpad_acpi: handle HKEY 0x1402 event platform/x86: asus-nb-wmi: add DMI quirk for ASUS Zenbook Duo UX8406CA platform/x86: dell-lis3lv02d: Add Latitude 5500 ...	2025-07-04 10:05:31 -07:00
Kumar Kartikeya Dwivedi	f0c53fd4a7	bpf: Add function to find program from stack trace In preparation of figuring out the closest program that led to the current point in the kernel, implement a function that scans through the stack trace and finds out the closest BPF program when walking down the stack trace. Special care needs to be taken to skip over kernel and BPF subprog frames. We basically scan until we find a BPF main prog frame. The assumption is that if a program calls into us transitively, we'll hit it along the way. If not, we end up returning NULL. Contextually the function will be used in places where we know the program may have called into us. Due to reliance on arch_bpf_stack_walk(), this function only works on x86 with CONFIG_UNWINDER_ORC, arm64, and s390. Remove the warning from arch_bpf_stack_walk as well since we call it outside bpf_throw() context. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250703204818.925464-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-07-03 19:30:06 -07:00
Andrey Albershteyn	be7efb2d20	fs: introduce file_getattr and file_setattr syscalls Introduce file_getattr() and file_setattr() syscalls to manipulate inode extended attributes. The syscalls takes pair of file descriptor and pathname. Then it operates on inode opened accroding to openat() semantics. The struct file_attr is passed to obtain/change extended attributes. This is an alternative to FS_IOC_FSSETXATTR ioctl with a difference that file don't need to be open as we can reference it with a path instead of fd. By having this we can manipulated inode extended attributes not only on regular files but also on special ones. This is not possible with FS_IOC_FSSETXATTR ioctl as with special files we can not call ioctl() directly on the filesystem inode using fd. This patch adds two new syscalls which allows userspace to get/set extended inode attributes on special files by using parent directory and a path - *at() like syscall. CC: linux-api@vger.kernel.org CC: linux-fsdevel@vger.kernel.org CC: linux-xfs@vger.kernel.org Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Link: https://lore.kernel.org/20250630-xattrat-syscall-v6-6-c4e3bc35227b@kernel.org Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-02 17:05:17 +02:00
Ilpo Järvinen	88e326b331	Merge branch 'fixes' into for-next Merge fixes back into for-next to be able to take dell_rbu change that is build on top of fixes material, and to bring lenovo related changes in sync after the move under lenovo/ subdir in the for-next branch and diverging changes in the fixes branch.	2025-07-02 13:30:30 +03:00
Nikunj A Dadhania	52e1a03e6c	x86/sev: Use TSC_FACTOR for Secure TSC frequency calculation When using Secure TSC, the GUEST_TSC_FREQ MSR reports a frequency based on the nominal P0 frequency, which deviates slightly (typically ~0.2%) from the actual mean TSC frequency due to clocking parameters. Over extended VM uptime, this discrepancy accumulates, causing clock skew between the hypervisor and a SEV-SNP VM, leading to early timer interrupts as perceived by the guest. The guest kernel relies on the reported nominal frequency for TSC-based timekeeping, while the actual frequency set during SNP_LAUNCH_START may differ. This mismatch results in inaccurate time calculations, causing the guest to perceive hrtimers as firing earlier than expected. Utilize the TSC_FACTOR from the SEV firmware's secrets page (see "Secrets Page Format" in the SNP Firmware ABI Specification) to calculate the mean TSC frequency, ensuring accurate timekeeping and mitigating clock skew in SEV-SNP VMs. Use early_ioremap_encrypted() to map the secrets page as ioremap_encrypted() uses kmalloc() which is not available during early TSC initialization and causes a panic. [ bp: Drop the silly dummy var: https://lore.kernel.org/r/20250630192726.GBaGLlHl84xIopx4Pt@fat_crate.local ] Fixes: `73bbf3b0fb` ("x86/tsc: Init the TSC for Secure TSC guests") Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250630081858.485187-1-nikunj@amd.com	2025-07-01 00:29:27 +02:00
Eric Biggers	b10749d89f	lib/crc: x86: Migrate optimized CRC code into lib/crc/ Move the x86-optimized CRC code from arch/x86/lib/crc* into its new location in lib/crc/x86/, and wire it up in the new way. This new way of organizing the CRC code eliminates the need to artificially split the code for each CRC variant into separate arch and generic modules, enabling better inlining and dead code elimination. For more details, see "lib/crc: Prepare for arch-optimized code in subdirs of lib/crc/". Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: "Jason A. Donenfeld" <Jason@zx2c4.com> Link: https://lore.kernel.org/r/20250607200454.73587-12-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-06-30 09:31:57 -07:00
Eric Biggers	cece5689e1	x86/crc: drop checks of CONFIG_AS_VPCLMULQDQ Now that the minimum binutils version supports VPCLMULQDQ (and the minimum clang version does too), there is no need to check for assembler support before compiling code that uses these instructions. Link: https://lore.kernel.org/r/20250531211318.83677-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-06-30 09:31:56 -07:00
Eric Biggers	74750aa78d	lib/crypto: x86: Move arch/x86/lib/crypto/ into lib/crypto/ Move the contents of arch/x86/lib/crypto/ into lib/crypto/x86/. The new code organization makes a lot more sense for how this code actually works and is developed. In particular, it makes it possible to build each algorithm as a single module, with better inlining and dead code elimination. For a more detailed explanation, see the patchset which did this for the CRC library code: https://lore.kernel.org/r/20250607200454.73587-1-ebiggers@kernel.org/. Also see the patchset which did this for SHA-512: https://lore.kernel.org/linux-crypto/20250616014019.415791-1-ebiggers@kernel.org/ This is just a preparatory commit, which does the move to get the files into their new location but keeps them building the same way as before. Later commits will make the actual improvements to the way the arch-optimized code is integrated for each algorithm. Add a gitignore entry for the removed directory arch/x86/lib/crypto/ so that people don't accidentally commit leftover generated files. Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Link: https://lore.kernel.org/r/20250619191908.134235-9-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-06-30 09:26:20 -07:00
Eric Biggers	484c18119f	lib/crypto: x86/sha512: Migrate optimized SHA-512 code to library Instead of exposing the x86-optimized SHA-512 code via x86-specific crypto_shash algorithms, instead just implement the sha512_blocks() library function. This is much simpler, it makes the SHA-512 (and SHA-384) library functions be x86-optimized, and it fixes the longstanding issue where the x86-optimized SHA-512 code was disabled by default. SHA-512 still remains available through crypto_shash, but individual architectures no longer need to handle it. To match sha512_blocks(), change the type of the nblocks parameter of the assembly functions from int to size_t. The assembly functions actually already treated it as size_t. Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250630160320.2888-15-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-06-30 09:26:20 -07:00
Eric Biggers	e0fca17755	crypto: sha512 - Rename conflicting symbols Rename existing functions and structs in architecture-optimized SHA-512 code that had names conflicting with the upcoming library interface which will be added to <crypto/sha2.h>: sha384_init, sha512_init, sha512_update, sha384, and sha512. Note: all affected code will be superseded by later commits that migrate the arch-optimized SHA-512 code into the library. This commit simply keeps the kernel building for the initial introduction of the library. Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250630160320.2888-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>	2025-06-30 09:26:19 -07:00
Mario Limonciello	b1c26e0595	Move FCH header to a location accessible by all archs A new header fch.h was created to store registers used by different AMD drivers. This header was included by i2c-piix4 in commit `624b0d5696` ("i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h>"). To prevent compile failures on non-x86 archs i2c-piix4 was set to only compile on x86 by commit `7e173eb82a` ("i2c: piix4: Make CONFIG_I2C_PIIX4 dependent on CONFIG_X86"). This was not a good decision because loongarch and mips both actually support i2c-piix4 and set it enabled in the defconfig. Move the header to a location accessible by all architectures. Fixes: `624b0d5696` ("i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h>") Suggested-by: Hans de Goede <hansg@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Hans de Goede <hansg@kernel.org> Link: https://lore.kernel.org/r/20250610205817.3912944-1-superm1@kernel.org Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>	2025-06-30 13:42:11 +03:00
Greg Kroah-Hartman	815ac67919	Merge 6.16-rc4 into tty-next We need the tty/serial fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-06-30 07:50:04 +02:00
Linus Torvalds	cc69ac7a65	- Make sure DR6 and DR7 are initialized to their architectural values and not accidentally cleared, leading to misconfigurations -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmhg/UQACgkQEsHwGGHe VUrmXBAAtOrpEtR4geeBeZtCEaUxE4DE8Zvj36dr+sAScHTXNTzYK94mAy/AHU22 V3rF12/kuyyZSrwROXLBD6PgkHEn8u0WLztSeqP/SisnoLjMTV9H9TuGYBUoz1NS clkoElQ6DJP5BVzmYpZlJrcofqNjkS/mAxfMRoIAq+LzKkb3iL/Lge+Ox/IDUg8z L9wRlKh/IaJ5EETWlqh0gkFeS/M9DXYmfkasDQeVkUxnKFeXBdyUGc2jFzBGX5RA rsdnz+C3x3ow2U9N+ZMVr+n06yTZvh+fAiU8emeBQm0q5fZBBHWDbnZZtWf+KG6s 43tlWyVqic5yzyQbUpRC2sttOkIAtOCMx36XexbGm1eKRNNc6fTz9IlgO/97HkuE lYBNq0zd/p5Kb53lXb3uwBVy4sjIEZUyD/K5DO4YfTgamcwXl8BP5xnKtNPqImI5 aaF3xKKLOUDOTL1CcK5YG0joaU1k+I0F0KO7HYqkDi8Uf5naWZSUNil8nPQn8RX7 3f3LJx0e3j2o0f60AHI4mjUAUJHsxExmpaTl079k03wt8YVE3ucNaUN6se6nidVz H5q0JU4q3C3DCu0I3Ub4wa5QXGA+TOHKuqhJCapKAVAAQDlbV2z8GxA/3B2YbRM+ eZ6/RVyk++VrRXIyfmfwLPu0CLVoSNhaUhu/hFrYbjA3NX+85qw= =fjgT -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Make sure DR6 and DR7 are initialized to their architectural values and not accidentally cleared, leading to misconfigurations * tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/traps: Initialize DR7 by writing its architectural reset value x86/traps: Initialize DR6 by writing its architectural reset value	2025-06-29 08:28:24 -07:00
Andy Shevchenko	acc902de05	serial: 8250: Move CE4100 quirks to a module under 8250 driver There is inconvenient for maintainers and maintainership to have some quirks under architectural code. Move it to the specific quirk file like other 8250-compatible drivers do. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250627182743.1273326-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-06-29 14:24:46 +02:00
JP Kobryn	30ad231a50	x86/mce: Make sure CMCI banks are cleared during shutdown on Intel CMCI banks are not cleared during shutdown on Intel CPUs. As a side effect, when a kexec is performed, CPUs coming back online are unable to rediscover/claim these occupied banks which breaks MCE reporting. Clear the CPU ownership during shutdown via cmci_clear() so the banks can be reclaimed and MCE reporting will become functional once more. [ bp: Massage commit message. ] Reported-by: Aijay Adams <aijay@meta.com> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/20250627174935.95194-1-inwardvessel@gmail.com	2025-06-28 12:45:48 +02:00
Gerd Hoffmann	a7549636f6	x86/sev: Let sev_es_efi_map_ghcbs() map the CA pages too OVMF EFI firmware needs access to the CA page to do SVSM protocol calls. For example, when the SVSM implements an EFI variable store, such calls will be necessary. So add that to sev_es_efi_map_ghcbs() and also rename the function to reflect the additional job it is doing now. [ bp: Massage. ] Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250626114014.373748-4-kraxel@redhat.com	2025-06-27 14:07:10 +02:00
Gerd Hoffmann	7b22e04329	x86/sev/vc: Fix EFI runtime instruction emulation In case efi_mm is active go use the userspace instruction decoder which supports fetching instructions from active_mm. This is needed to make instruction emulation work for EFI runtime code, so it can use CPUID and RDMSR. EFI runtime code uses the CPUID instruction to gather information about the environment it is running in, such as SEV being enabled or not, and choose (if needed) the SEV code path for ioport access. EFI runtime code uses the RDMSR instruction to get the location of the CAA page (see SVSM spec, section 4.2 - "Post Boot"). The big picture behind this is that the kernel needs to be able to properly handle #VC exceptions that come from EFI runtime services. Since EFI runtime services have a special page table mapping for the EFI virtual address space, the efi_mm context must be used when decoding instructions during #VC handling. [ bp: Massage. ] Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/20250626114014.373748-2-kraxel@redhat.com	2025-06-27 13:53:12 +02:00
Yazen Ghannam	5f6e3b7206	x86/mce/amd: Fix threshold limit reset The MCA threshold limit must be reset after servicing the interrupt. Currently, the restart function doesn't have an explicit check for this. It makes some assumptions based on the current limit and what's in the registers. These assumptions don't always hold, so the limit won't be reset in some cases. Make the reset condition explicit. Either an interrupt/overflow has occurred or the bank is being initialized. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-4-236dd74f645f@amd.com	2025-06-27 13:16:23 +02:00
Yazen Ghannam	d66e1e90b1	x86/mce/amd: Add default names for MCA banks and blocks Ensure that sysfs init doesn't fail for new/unrecognized bank types or if a bank has additional blocks available. Most MCA banks have a single thresholding block, so the block takes the same name as the bank. Unified Memory Controllers (UMCs) are a special case where there are two blocks and each has a unique name. However, the microarchitecture allows for five blocks. Any new MCA bank types with more than one block will be missing names for the extra blocks. The MCE sysfs will fail to initialize in this case. Fixes: `87a6d4091b` ("x86/mce/AMD: Update sysfs bank names for SMCA systems") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-3-236dd74f645f@amd.com	2025-06-27 13:13:36 +02:00
Yazen Ghannam	00c092de6f	x86/mce: Ensure user polling settings are honored when restarting timer Users can disable MCA polling by setting the "ignore_ce" parameter or by setting "check_interval=0". This tells the kernel to not start the MCE timer on a CPU. If the user did not disable CMCI, then storms can occur. When these happen, the MCE timer will be started with a fixed interval. After the storm subsides, the timer's next interval is set to check_interval. This disregards the user's input through "ignore_ce" and "check_interval". Furthermore, if "check_interval=0", then the new timer will run faster than expected. Create a new helper to check these conditions and use it when a CMCI storm ends. [ bp: Massage. ] Fixes: `7eae17c4ad` ("x86/mce: Add per-bank CMCI storm mitigation") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-2-236dd74f645f@amd.com	2025-06-27 12:41:44 +02:00
Yazen Ghannam	4c113a5b28	x86/mce: Don't remove sysfs if thresholding sysfs init fails Currently, the MCE subsystem sysfs interface will be removed if the thresholding sysfs interface fails to be created. A common failure is due to new MCA bank types that are not recognized and don't have a short name set. The MCA thresholding feature is optional and should not break the common MCE sysfs interface. Also, new MCA bank types are occasionally introduced, and updates will be needed to recognize them. But likewise, this should not break the common sysfs interface. Keep the MCE sysfs interface regardless of the status of the thresholding sysfs interface. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-1-236dd74f645f@amd.com	2025-06-26 17:28:13 +02:00
David Kaplan	98b5dab4d2	x86/bugs: Clean up SRSO microcode handling SRSO microcode only exists for Zen3/Zen4 CPUs. For those CPUs, the microcode is required for any mitigation other than Safe-RET to be effective. Safe-RET can still protect user->kernel and guest->host attacks without microcode. Clarify this in the code and ensure that SRSO_MITIGATION_UCODE_NEEDED is selected for any mitigation besides Safe-RET if the required microcode isn't present. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250625155805.600376-4-david.kaplan@amd.com	2025-06-26 13:32:31 +02:00
David Kaplan	ff54ae7314	x86/bugs: Use IBPB for retbleed if used by SRSO If spec_rstack_overflow=ibpb then this mitigates retbleed as well. This is relevant for AMD Zen1 and Zen2 CPUs which are vulnerable to both bugs. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: H . Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250625155805.600376-3-david.kaplan@amd.com	2025-06-26 10:56:39 +02:00
David Kaplan	1fd5eb0286	x86/bugs: Add SRSO_MITIGATION_NOSMT AMD Zen1 and Zen2 CPUs with SMT disabled are not vulnerable to SRSO. Instead of overloading the X86_FEATURE_SRSO_NO bit to indicate this, define a separate mitigation to make the code cleaner. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: H . Peter Anvin <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250625155805.600376-2-david.kaplan@amd.com	2025-06-26 10:56:39 +02:00
Dave Hansen	1cec9ac2d0	x86/fpu: Delay instruction pointer fixup until after warning Right now, if XRSTOR fails a console message like this is be printed: Bad FPU state detected at restore_fpregs_from_fpstate+0x9a/0x170, reinitializing FPU registers. However, the text location (...+0x9a in this case) is the instruction AFTER the XRSTOR. The highlighted instruction in the "Code:" dump also points one instruction late. The reason is that the "fixup" moves RIP up to pass the bad XRSTOR and keep on running after returning from the #GP handler. But it does this fixup before warning. The resulting warning output is nonsensical because it looks like the non-FPU-related instruction is #GP'ing. Do not fix up RIP until after printing the warning. Do this by using the more generic and standard ex_handler_default(). Fixes: `d5c8028b47` ("x86/fpu: Reinitialize FPU registers if restoring FPU state fails") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Alison Schofield <alison.schofield@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250624210148.97126F9E%40davehans-spike.ostc.intel.com	2025-06-25 16:28:06 -07:00
Sean Christopherson	bbc13ae593	VFIO: KVM: x86: Drop kvm_arch_{start,end}_assignment() Drop kvm_arch_{start,end}_assignment() and all associated code now that KVM x86 no longer consumes assigned_device_count. Tracking whether or not a VFIO-assigned device is formally associated with a VM is fundamentally flawed, as such an association is optional for general usage, i.e. is prone to false negatives. E.g. prior to commit `2edd9cb79f` ("kvm: detect assigned device via irqbypass manager"), device passthrough via VFIO would fail to enable IRQ bypass if userspace omitted the formal VFIO<=>KVM binding. And device drivers that need the VFIO<=>KVM connection, e.g. KVM-GT, shouldn't be relying on generic x86 tracking infrastructure. Cc: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250523011756.3243624-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-25 09:51:33 -07:00
Sean Christopherson	ff845e6a84	Revert "kvm: detect assigned device via irqbypass manager" Now that KVM explicitly tracks the number of possible bypass IRQs, and doesn't conflate IRQ bypass with host MMIO access, stop bumping the assigned device count when adding an IRQ bypass producer. This reverts commit `2edd9cb79f`. Link: https://lore.kernel.org/r/20250523011756.3243624-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-25 09:51:29 -07:00
Sean Christopherson	95d9b5d8d6	Merge branch 'kvm-x86 mmio' Merge the MMIO stale data branch with the device posted IRQs branch to provide a common base for removing KVM's tracking of "assigned" devices. Link: https://lore.kernel.org/all/20250523011756.3243624-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-25 09:50:58 -07:00
Chao Gao	05186d7a8e	KVM: SVM: Simplify MSR interception logic for IA32_XSS MSR Use svm_set_intercept_for_msr() directly to configure IA32_XSS MSR interception, ensuring consistency with other cases where MSRs are intercepted depending on guest caps and CPUIDs. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-3-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-25 09:47:28 -07:00
Manuel Andreas	fa787ac07b	KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush In KVM guests with Hyper-V hypercalls enabled, the hypercalls HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX allow a guest to request invalidation of portions of a virtual TLB. For this, the hypercall parameter includes a list of GVAs that are supposed to be invalidated. However, when non-canonical GVAs are passed, there is currently no filtering in place and they are eventually passed to checked invocations of INVVPID on Intel / INVLPGA on AMD. While AMD's INVLPGA silently ignores non-canonical addresses (effectively a no-op), Intel's INVVPID explicitly signals VM-Fail and ultimately triggers the WARN_ONCE in invvpid_error(): invvpid failed: ext=0x0 vpid=1 gva=0xaaaaaaaaaaaaa000 WARNING: CPU: 6 PID: 326 at arch/x86/kvm/vmx/vmx.c:482 invvpid_error+0x91/0xa0 [kvm_intel] Modules linked in: kvm_intel kvm 9pnet_virtio irqbypass fuse CPU: 6 UID: 0 PID: 326 Comm: kvm-vm Not tainted 6.15.0 #14 PREEMPT(voluntary) RIP: 0010:invvpid_error+0x91/0xa0 [kvm_intel] Call Trace: vmx_flush_tlb_gva+0x320/0x490 [kvm_intel] kvm_hv_vcpu_flush_tlb+0x24f/0x4f0 [kvm] kvm_arch_vcpu_ioctl_run+0x3013/0x5810 [kvm] Hyper-V documents that invalid GVAs (those that are beyond a partition's GVA space) are to be ignored. While not completely clear whether this ruling also applies to non-canonical GVAs, it is likely fine to make that assumption, and manual testing on Azure confirms "real" Hyper-V interprets the specification in the same way. Skip non-canonical GVAs when processing the list of address to avoid tripping the INVVPID failure. Alternatively, KVM could filter out "bad" GVAs before inserting into the FIFO, but practically speaking the only downside of pushing validation to the final processing is that doing so is suboptimal for the guest, and no well-behaved guest will request TLB flushes for non-canonical addresses. Fixes: `260970862c` ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently") Cc: stable@vger.kernel.org Signed-off-by: Manuel Andreas <manuel.andreas@tum.de> Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/c090efb3-ef82-499f-a5e0-360fc8420fb7@tum.de Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-25 09:15:24 -07:00
Sean Christopherson	83ebe71574	KVM: VMX: Apply MMIO Stale Data mitigation if KVM maps MMIO into the guest Enforce the MMIO State Data mitigation if KVM has ever mapped host MMIO into the VM, not if the VM has an assigned device. VFIO is but one of many ways to map host MMIO into a KVM guest, and even within VFIO, formally attaching a device to a VM via KVM_DEV_VFIO_FILE_ADD is entirely optional. Track whether or not the guest can access host MMIO on a per-MMU basis, i.e. based on whether or not the vCPU has a mapping to host MMIO. For simplicity, track MMIO mappings in "special" rools (those without a kvm_mmu_page) at the VM level, as only Intel CPUs are vulnerable, and so only legacy 32-bit shadow paging is affected, i.e. lack of precise tracking is a complete non-issue. Make the per-MMU and per-VM flags sticky. Detecting when all MMIO mappings have been removed would be absurdly complex. And in practice, removing MMIO from a guest will be done by deleting the associated memslot, which by default will force KVM to re-allocate all roots. Special roots will forever be mitigated, but as above, the affected scenarios are not expected to be performance sensitive. Use a VMX_RUN flag to communicate the need for a buffers flush to vmx_vcpu_enter_exit() so that kvm_vcpu_can_access_host_mmio() and all its dependencies don't need to be marked __always_inline, e.g. so that KASAN doesn't trigger a noinstr violation. Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Fixes: `8cb861e9e3` ("x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data") Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/r/20250523011756.3243624-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-25 08:42:51 -07:00
Tiwei Bie	8948941276	um: Use correct data source in fpregs_legacy_set() Read from the buffer pointed to by 'from' instead of '&buf', as 'buf' contains no valid data when 'ubuf' is NULL. Fixes: `b1e1bd2e69` ("um: Add helper functions to get/set state for SECCOMP") Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Link: https://patch.msgid.link/20250606124428.148164-5-tiwei.btw@antgroup.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>	2025-06-25 09:26:33 +02:00
Chao Gao	3f06b8927a	KVM: x86: Deduplicate MSR interception enabling and disabling Extract a common function from MSR interception disabling logic and create disabling and enabling functions based on it. This removes most of the duplicated code for MSR interception disabling/enabling. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com [sean: s/enable/set, inline the wrappers] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 15:42:12 -07:00
Yang Weijiang	8b05b3c988	x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only feature == Background == CET defines two register states: CET user, which includes user-mode control registers, and CET supervisor, which consists of shadow-stack pointers for privilege levels 0-2. Current kernels disable shadow stacks in kernel mode, making the CET supervisor state unused and eliminating the need for context switching. == Problem == To virtualize CET for guests, KVM must accurately emulate hardware behavior. A key challenge arises because there is no CPUID flag to indicate that shadow stack is supported only in user mode. Therefore, KVM cannot assume guests will not enable shadow stacks in kernel mode and must preserve the CET supervisor state of vCPUs. == Solution == An initial proposal to manually save and restore CET supervisor states using raw RDMSR/WRMSR in KVM was rejected due to performance concerns and its impact on KVM's ABI. Instead, leveraging the kernel's FPU infrastructure for context switching was favored [1]. The main question then became whether to enable the CET supervisor state globally for all processes or restrict it to vCPU processes. This decision involves a trade-off between a 24-byte XSTATE buffer waste for all non-vCPU processes and approximately 100 lines of code complexity in the kernel [2]. The agreed approach is to first try this optimal solution [3], i.e., restricting the CET supervisor state to guest FPUs only and eliminating unnecessary space waste. The guest-only xfeature infrastructure has already been added. Now, introduce CET supervisor xstate support as the first guest-only feature to prepare for the upcoming CET virtualization in KVM. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/kvm/ZM1jV3UPL0AMpVDI@google.com/ [1] Link: https://lore.kernel.org/kvm/1c2fd06e-2e97-4724-80ab-8695aa4334e7@intel.com/ [2] Link: https://lore.kernel.org/kvm/2597a87b-1248-b8ce-ce60-94074bc67ea4@intel.com/ [3] Link: https://lore.kernel.org/all/20250522151031.426788-7-chao.gao%40intel.com	2025-06-24 13:46:33 -07:00
Yang Weijiang	151bf23249	x86/fpu/xstate: Introduce "guest-only" supervisor xfeature set In preparation for upcoming CET virtualization support, the CET supervisor state will be added as a "guest-only" feature, since it is required only by KVM (i.e., guest FPUs). Establish the infrastructure for "guest-only" features. Define a new XFEATURE_MASK_GUEST_SUPERVISOR mask to specify features that are enabled by default in guest FPUs but not in host FPUs. Specifically, for any bit in this set, permission is granted and XSAVE space is allocated during vCPU creation. Non-guest FPUs cannot enable guest-only features, even dynamically, and no XSAVE space will be allocated for them. The mask is currently empty, but this will be changed by a subsequent patch. Co-developed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/all/20250522151031.426788-6-chao.gao%40intel.com	2025-06-24 13:46:32 -07:00
Chao Gao	fafb29e18d	x86/fpu: Remove xfd argument from __fpstate_reset() The initial values for fpstate::xfd differ between guest and host fpstates. Currently, the initial values are passed as an argument to __fpstate_reset(). But, __fpstate_reset() already assigns different default features and sizes based on the type of fpstates (i.e., guest or host). So, handle fpstate::xfd in a similar way to highlight the differences in the initial xfd value between guest and host fpstates Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/all/aBuf7wiiDT0Wflhk@google.com/ Link: https://lore.kernel.org/all/20250522151031.426788-5-chao.gao%40intel.com	2025-06-24 13:46:32 -07:00
Chao Gao	509e880b77	x86/fpu: Initialize guest fpstate and FPU pseudo container from guest defaults fpu_alloc_guest_fpstate() currently uses host defaults to initialize guest fpstate and pseudo containers. Guest defaults were introduced to differentiate the features and sizes of host and guest FPUs. Switch to using guest defaults instead. Adjust __fpstate_reset() to handle different defaults for host and guest FPUs. And to distinguish between the types of FPUs, move the initialization of indicators (is_guest and is_valloc) before the reset. Suggested-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/all/20250522151031.426788-4-chao.gao%40intel.com	2025-06-24 13:46:32 -07:00
Chao Gao	7c2c89364d	x86/fpu: Initialize guest FPU permissions from guest defaults Currently, fpu->guest_perm is copied from fpu->perm, which is derived from fpu_kernel_cfg.default_features. Guest defaults were introduced to differentiate the features and sizes of host and guest FPUs. Copying guest FPU permissions from the host will lead to inconsistencies between the guest default features and permissions. Initialize guest FPU permissions from guest defaults instead of host defaults. This ensures that any changes to guest default features are automatically reflected in guest permissions, which in turn guarantees that fpstate_realloc() allocates a correctly sized XSAVE buffer for guest FPUs. Suggested-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/all/20250522151031.426788-3-chao.gao%40intel.com	2025-06-24 13:46:32 -07:00
Chao Gao	7bc4ed75f2	x86/fpu/xstate: Differentiate default features for host and guest FPUs Currently, guest and host FPUs share the same default features. However, the CET supervisor xstate is the first feature that needs to be enabled exclusively for guest FPUs. Enabling it for host FPUs leads to a waste of 24 bytes in the XSAVE buffer. To support "guest-only" features, add a new structure to hold the default features and sizes for guest FPUs to clearly differentiate them from those for host FPUs. Add two helpers to provide the default feature masks for guest and host FPUs. Default features are derived by applying the masks to the maximum supported features. Note that, 1) for now, guest_default_mask() and host_default_mask() are identical. This will change in a follow-up patch once guest permissions, default xfeatures, and fpstate size are all converted to use the guest defaults. 2) only supervisor features will diverge between guest FPUs and host FPUs, while user features will remain the same [1][2]. So, the new vcpu_fpu_config struct does not include default user features and size for the UABI buffer. An alternative approach is adding a guest_only_xfeatures member to fpu_kernel_cfg and adding two helper functions to calculate the guest default xfeatures and size. However, calculating these defaults at runtime would introduce unnecessary overhead. Suggested-by: Chang S. Bae <chang.seok.bae@intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/kvm/aAwdQ759Y6V7SGhv@google.com/ [1] Link: https://lore.kernel.org/kvm/9ca17e1169805f35168eb722734fbf3579187886.camel@intel.com/ [2] Link: https://lore.kernel.org/all/20250522151031.426788-2-chao.gao%40intel.com	2025-06-24 13:46:32 -07:00
Sean Christopherson	ffe9d7966d	KVM: x86/mmu: Locally cache whether a PFN is host MMIO when making a SPTE When making a SPTE, cache whether or not the target PFN is host MMIO in order to avoid multiple rounds of the slow path of kvm_is_mmio_pfn(), e.g. hitting pat_pfn_immune_to_uc_mtrr() in particular can be problematic. KVM currently avoids multiple calls by virtue of the two users being mutually exclusive (.get_mt_mask() is Intel-only, shadow_me_value is AMD-only), but that won't hold true if/when KVM needs to detect host MMIO mappings for other reasons, e.g. for mitigating the MMIO Stale Data vulnerability. No functional change intended. Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/r/20250523011756.3243624-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 13:29:31 -07:00
Sean Christopherson	c126b46e6f	KVM: x86: Avoid calling kvm_is_mmio_pfn() when kvm_x86_ops.get_mt_mask is NULL Guard the call to kvm_x86_call(get_mt_mask) with an explicit check on kvm_x86_ops.get_mt_mask so as to avoid unnecessarily calling kvm_is_mmio_pfn(), which is moderately expensive for some backing types. E.g. lookup_memtype() conditionally takes a system-wide spinlock if KVM ends up being call pat_pfn_immune_to_uc_mtrr(), e.g. for DAX memory. While the call to kvm_x86_ops.get_mt_mask() itself is elided, the compiler still needs to compute all parameters, as it can't know at build time that the call will be squashed. <+243>: call 0xffffffff812ad880 <kvm_is_mmio_pfn> <+248>: mov %r13,%rsi <+251>: mov %rbx,%rdi <+254>: movzbl %al,%edx <+257>: call 0xffffffff81c26af0 <__SCT__kvm_x86_get_mt_mask> Fixes: `3fee4837ef` ("KVM: x86: remove shadow_memtype_mask") Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/r/20250523011756.3243624-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 13:29:31 -07:00
Xin Li (Intel)	fa7d0f83c5	x86/traps: Initialize DR7 by writing its architectural reset value Initialize DR7 by writing its architectural reset value to always set bit 10, which is reserved to '1', when "clearing" DR7 so as not to trigger unanticipated behavior if said bit is ever unreserved, e.g. as a feature enabling flag with inverted polarity. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Sean Christopherson <seanjc@google.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250620231504.2676902-3-xin%40zytor.com	2025-06-24 13:15:52 -07:00
Xin Li (Intel)	5f465c148c	x86/traps: Initialize DR6 by writing its architectural reset value Initialize DR6 by writing its architectural reset value to avoid incorrectly zeroing DR6 to clear DR6.BLD at boot time, which leads to a false bus lock detected warning. The Intel SDM says: 1) Certain debug exceptions may clear bits 0-3 of DR6. 2) BLD induced #DB clears DR6.BLD and any other debug exception doesn't modify DR6.BLD. 3) RTM induced #DB clears DR6.RTM and any other debug exception sets DR6.RTM. To avoid confusion in identifying debug exceptions, debug handlers should set DR6.BLD and DR6.RTM, and clear other DR6 bits before returning. The DR6 architectural reset value 0xFFFF0FF0, already defined as macro DR6_RESERVED, satisfies these requirements, so just use it to reinitialize DR6 whenever needed. Since clear_all_debug_regs() no longer zeros all debug registers, rename it to initialize_debug_regs() to better reflect its current behavior. Since debug_read_clear_dr6() no longer clears DR6, rename it to debug_read_reset_dr6() to better reflect its current behavior. Fixes: `ebb1064e7c` ("x86/traps: Handle #DB for bus lock") Reported-by: Sohil Mehta <sohil.mehta@intel.com> Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Link: https://lore.kernel.org/lkml/06e68373-a92b-472e-8fd9-ba548119770c@intel.com/ Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250620231504.2676902-2-xin%40zytor.com	2025-06-24 13:15:51 -07:00
Sean Christopherson	9c4fe6d150	KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a nested TDP VM is run, defer allocation of the array of hashed lists used to track shadow MMU pages until the first shadow root is allocated. Setting the list outside of mmu_lock is safe, as concurrent readers must hold mmu_lock in some capacity, shadow pages can only be added (or removed) from the list when mmu_lock is held for write, and tasks that are creating a shadow root are serialized by slots_arch_lock. I.e. it's impossible for the list to become non-empty until all readers go away, and so readers are guaranteed to see an empty list even if they make multiple calls to kvm_get_mmu_page_hash() in a single mmu_lock critical section. Use smp_store_release() and smp_load_acquire() to access the hash table pointer to ensure the stores to zero the lists are retired before readers start to walk the list. E.g. if the compiler hoisted the store before the zeroing of memory, for_each_gfn_valid_sp_with_gptes() could consume stale kernel data. Cc: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250523001138.3182794-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 12:51:07 -07:00
Sean Christopherson	ac777fbf06	KVM: x86: Use kvzalloc() to allocate VM struct Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical allocation before falling back to __vmalloc(), to avoid the overhead of establishing the virtual mappings. For non-debug builds, The SVM and VMX (and TDX) structures are now just below 7000 bytes in the worst case scenario (see below), i.e. are order-1 allocations, and will likely remain that way for quite some time. Add compile-time assertions in vendor code to ensure the size of the structures, sans the memslot hash tables, are order-0 allocations, i.e. are less than 4KiB. There's nothing fundamentally wrong with a larger kvm_{svm,vmx,tdx} size, but given that the size of the structure (without the memslots hash tables) is below 2KiB after 18+ years of existence, more than doubling the size would be quite notable. Add sanity checks on the memslot hash table sizes, partly to ensure they aren't resized without accounting for the impact on VM structure size, and partly to document that the majority of the size of VM structures comes from the memslots. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 12:50:48 -07:00
Sean Christopherson	039ef33e2f	KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list Dynamically allocate the (massive) array of hashed lists used to track shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation all on its own, and is exactly an order-3 allocation. Dynamically allocating the array will allow allocating "struct kvm" using kvmalloc(), and will also allow deferring allocation of the array until it's actually needed, i.e. until the first shadow root is allocated. Opportunistically use kvmalloc() for the hashed lists, as an order-3 allocation is (stating the obvious) less likely to fail than an order-4 allocation, and the overhead of vmalloc() is undesirable given that the size of the allocation is fixed. Cc: Vipin Sharma <vipinsh@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 12:50:34 -07:00
David Woodhouse	a7f4dff21f	KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table. To avoid imposing an ordering constraint on userspace, allow 'invalid' event channel targets to be configured in the IRQ routing table. This is the same as accepting interrupts targeted at vCPUs which don't exist yet, which is already the case for both Xen event channels and for MSIs (which don't do any filtering of permitted APIC ID targets at all). If userspace actually triggers an IRQ with an invalid target, that will fail cleanly, as kvm_xen_set_evtchn_fast() also does the same range check. If KVM enforced that the IRQ target must be valid at the time it is configured, that would force userspace to create all vCPUs and do various other parts of setup (in this case, setting the Xen long_mode) before restoring the IRQ table. Cc: stable@vger.kernel.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/e489252745ac4b53f1f7f50570b03fb416aa2065.camel@infradead.org [sean: massage comment] Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 12:20:17 -07:00
Sean Christopherson	0b6f4a5f08	KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks Use a preallocated per-vCPU bitmap for tracking the unpacked set of vCPUs being targeted for Hyper-V's paravirt TLB flushing. If KVM_MAX_NR_VCPUS is set to 4096 (which is allowed even for MAXSMP=n builds), putting the vCPU mask on-stack pushes kvm_hv_flush_tlb() past the default FRAME_WARN limit. arch/x86/kvm/hyperv.c:2001:12: error: stack frame size (1288) exceeds limit (1024) in 'kvm_hv_flush_tlb' [-Werror,-Wframe-larger-than] 2001 \| static u64 kvm_hv_flush_tlb(struct kvm_vcpu vcpu, struct kvm_hv_hcall hc) \| ^ 1 error generated. Note, sparse_banks was given the same treatment by commit `7d5e88d301` ("KVM: x86: hyper-v: Use preallocated buffer in 'struct kvm_vcpu_hv' instead of on-stack 'sparse_banks'"), for the exact same reason. Reported-by: Abinash Lalotra <abinashsinghlalotra@gmail.com> Closes: https://lore.kernel.org/all/20250613111023.786265-1-abinashsinghlalotra@gmail.com Link: https://lore.kernel.org/all/aEylI-O8kFnFHrOH@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 12:20:16 -07:00
Sean Christopherson	48f15f6241	KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL When creating an SEV-ES vCPU for intra-host migration, set its vmsa_pa to INVALID_PAGE to harden against doing VMRUN with a bogus VMSA (KVM checks for a valid VMSA page in pre_sev_run()). Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Link: https://lore.kernel.org/r/20250602224459.41505-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 12:20:15 -07:00
Sean Christopherson	ecf371f8b0	KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight Reject migration of SEV{-ES} state if either the source or destination VM is actively creating a vCPU, i.e. if kvm_vm_ioctl_create_vcpu() is in the section between incrementing created_vcpus and online_vcpus. The bulk of vCPU creation runs _outside_ of kvm->lock to allow creating multiple vCPUs in parallel, and so sev_info.es_active can get toggled from false=>true in the destination VM after (or during) svm_vcpu_create(), resulting in an SEV{-ES} VM effectively having a non-SEV{-ES} vCPU. The issue manifests most visibly as a crash when trying to free a vCPU's NULL VMSA page in an SEV-ES VM, but any number of things can go wrong. BUG: unable to handle page fault for address: ffffebde00000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP KASAN NOPTI CPU: 227 UID: 0 PID: 64063 Comm: syz.5.60023 Tainted: G U O 6.15.0-smp-DEV #2 NONE Tainted: [U]=USER, [O]=OOT_MODULE Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 12.52.0-0 10/28/2024 RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:206 [inline] RIP: 0010:arch_test_bit arch/x86/include/asm/bitops.h:238 [inline] RIP: 0010:_test_bit include/asm-generic/bitops/instrumented-non-atomic.h:142 [inline] RIP: 0010:PageHead include/linux/page-flags.h:866 [inline] RIP: 0010:___free_pages+0x3e/0x120 mm/page_alloc.c:5067 Code: <49> f7 06 40 00 00 00 75 05 45 31 ff eb 0c 66 90 4c 89 f0 4c 39 f0 RSP: 0018:ffff8984551978d0 EFLAGS: 00010246 RAX: 0000777f80000001 RBX: 0000000000000000 RCX: ffffffff918aeb98 RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffebde00000000 RBP: 0000000000000000 R08: ffffebde00000007 R09: 1ffffd7bc0000000 R10: dffffc0000000000 R11: fffff97bc0000001 R12: dffffc0000000000 R13: ffff8983e19751a8 R14: ffffebde00000000 R15: 1ffffd7bc0000000 FS: 0000000000000000(0000) GS:ffff89ee661d3000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffebde00000000 CR3: 000000793ceaa000 CR4: 0000000000350ef0 DR0: 0000000000000000 DR1: 0000000000000b5f DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: <TASK> sev_free_vcpu+0x413/0x630 arch/x86/kvm/svm/sev.c:3169 svm_vcpu_free+0x13a/0x2a0 arch/x86/kvm/svm/svm.c:1515 kvm_arch_vcpu_destroy+0x6a/0x1d0 arch/x86/kvm/x86.c:12396 kvm_vcpu_destroy virt/kvm/kvm_main.c:470 [inline] kvm_destroy_vcpus+0xd1/0x300 virt/kvm/kvm_main.c:490 kvm_arch_destroy_vm+0x636/0x820 arch/x86/kvm/x86.c:12895 kvm_put_kvm+0xb8e/0xfb0 virt/kvm/kvm_main.c:1310 kvm_vm_release+0x48/0x60 virt/kvm/kvm_main.c:1369 __fput+0x3e4/0x9e0 fs/file_table.c:465 task_work_run+0x1a9/0x220 kernel/task_work.c:227 exit_task_work include/linux/task_work.h:40 [inline] do_exit+0x7f0/0x25b0 kernel/exit.c:953 do_group_exit+0x203/0x2d0 kernel/exit.c:1102 get_signal+0x1357/0x1480 kernel/signal.c:3034 arch_do_signal_or_restart+0x40/0x690 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop kernel/entry/common.c:111 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x67/0xb0 kernel/entry/common.c:218 do_syscall_64+0x7c/0x150 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f87a898e969 </TASK> Modules linked in: gq(O) gsmi: Log Shutdown Reason 0x03 CR2: ffffebde00000000 ---[ end trace 0000000000000000 ]--- Deliberately don't check for a NULL VMSA when freeing the vCPU, as crashing the host is likely desirable due to the VMSA being consumed by hardware. E.g. if KVM manages to allow VMRUN on the vCPU, hardware may read/write a bogus VMSA page. Accessing PFN 0 is "fine"-ish now that it's sequestered away thanks to L1TF, but panicking in this scenario is preferable to potentially running with corrupted state. Reported-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Fixes: `0b020f5af0` ("KVM: SEV: Add support for SEV-ES intra host migration") Fixes: `b56639318b` ("KVM: SEV: Add support for SEV intra host migration") Cc: stable@vger.kernel.org Cc: James Houghton <jthoughton@google.com> Cc: Peter Gonda <pgonda@google.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250602224459.41505-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-24 12:20:10 -07:00
Jiri Slaby (SUSE)	d22cf13814	serial: ce4100: clean up serial_in/out() hooks ce4100_mem_serial_in() unnecessarily contains 4 nested 'if's. That makes the code hard to follow. Invert the conditions and return early if the particular conditions do not hold. And use "<<=" for shifting the offset in both of the hooks. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20250623101246.486866-2-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-06-24 15:32:50 +01:00
Jiri Slaby (SUSE)	f565594077	serial: ce4100: fix build after serial_in/out() changes This x86_32 driver remain unnoticed, so after the commit below, the compilation now fails with: arch/x86/platform/ce4100/ce4100.c:107:16: error: incompatible function pointer types assigning to '...' from '...' To fix the error, convert also ce4100 to the new uart_port::serial_{in,out}() types. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Fixes: `fc9ceb501e` ("serial: 8250: sanitize uart_port::serial_{in,out}() types") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506190552.TqNasrC3-lkp@intel.com/ Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20250623101246.486866-1-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2025-06-24 15:32:50 +01:00
Pawan Gupta	ab9f2388e0	x86/bugs: Allow ITS stuffing in eIBRS+retpoline mode also After a recent restructuring of the ITS mitigation, RSB stuffing can no longer be enabled in eIBRS+Retpoline mode. Before ITS, retbleed mitigation only allowed stuffing when eIBRS was not enabled. This was perfectly fine since eIBRS mitigates retbleed. However, RSB stuffing mitigation for ITS is still needed with eIBRS. The restructuring solely relies on retbleed to deploy stuffing, and does not allow it when eIBRS is enabled. This behavior is different from what was before the restructuring. Fix it by allowing stuffing in eIBRS+retpoline mode also. Fixes: `61ab72c2c6` ("x86/bugs: Restructure ITS mitigation") Closes: https://lore.kernel.org/lkml/20250519235101.2vm6sc5txyoykb2r@desk/ Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250611-eibrs-fix-v4-7-5ff86cac6c61@linux.intel.com	2025-06-24 14:12:41 +02:00
Sean Christopherson	6f34372483	KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq() Rename kvm_set_msi_irq() to kvm_msi_to_lapic_irq() to better capture what it actually does, e.g. it's _really_ easy to conflate kvm_set_msi_irq() with kvm_set_msi(). Opportunistically delete the public declaration and export, as they are no longer used/needed. No functional change intended. Link: https://lore.kernel.org/r/20250611224604.313496-64-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:53 -07:00
Sean Christopherson	b03500f03e	KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking Configure IRTEs to GA log interrupts for device posted IRQs that hit non-running vCPUs if and only if the target vCPU is blocking, i.e. actually needs a wake event. If the vCPU has exited to userspace or was preempted, generating GA log entries and interrupts is wasteful and unnecessary, as the vCPU will be re-loaded and/or scheduled back in irrespective of the GA log notification (avic_ga_log_notifier() is just a fancy wrapper for kvm_vcpu_wake_up()). Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to track whether or not the vCPU's associated IRTEs are configured to generate GA logs, but only set the synthetic bit in KVM's "cache", i.e. never set the should-be-zero bit in tables that are used by hardware. Use a synthetic bit instead of a dedicated boolean to minimize the odds of messing up the locking, i.e. so that all the existing rules that apply to avic_physical_id_entry for IS_RUNNING are reused verbatim for GA_LOG_INTR. Note, because KVM (by design) "puts" AVIC state in a "pre-blocking" phase, using kvm_vcpu_is_blocking() to track the need for notifications isn't a viable option. Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:52 -07:00
Sean Christopherson	b9e53f9ff4	iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts Add plumbing to the AMD IOMMU driver to allow KVM to control whether or not an IRTE is configured to generate GA log interrupts. KVM only needs a notification if the target vCPU is blocking, so the vCPU can be awakened. If a vCPU is preempted or exits to userspace, KVM clears is_run, but will set the vCPU back to running when userspace does KVM_RUN and/or the vCPU task is scheduled back in, i.e. KVM doesn't need a notification. Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes from the KVM changes insofar as possible. Opportunistically swap the ordering of parameters for amd_iommu_update_ga() so that the match amd_iommu_activate_guest_mode(). Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as a non-cached field, but per AMD hardware architects, it's not cached and can be safely updated without an invalidation. Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Joao Martins <joao.m.martins@oracle.com> Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:51 -07:00
Sean Christopherson	5f3d06b164	KVM: SVM: Consolidate IRTE update when toggling AVIC on/off Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into __avic_vcpu_{load,put}(), and add a param to the helpers to communicate whether or not AVIC is being toggled, i.e. if IRTE needs a "full" update, or just a quick update to set the CPU and IsRun. Link: https://lore.kernel.org/r/20250611224604.313496-61-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:50 -07:00
Sean Christopherson	6eab2340f3	KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off Don't query a vCPU's blocking status when toggling AVIC on/off; barring KVM bugs, the vCPU can't be blocking when refreshing AVIC controls. And if there are KVM bugs, ensuring the vCPU and its associated IRTEs are in the correct state is desirable, i.e. well worth any overhead in a buggy scenario. Isolating the "real" load/put flows will allow moving the IOMMU IRTE (de)activation logic from avic_refresh_apicv_exec_ctrl() to avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's physical ID entry and its IRTEs in a common path, under a single critical section of ir_list_lock. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-60-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:50 -07:00
Sean Christopherson	f2bc961d38	KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in anticipation of moving the __avic_vcpu_{load,put}() calls into the critical section, and because having a one-off helper with a name that's easily confused with avic_pi_update_irte() is unnecessary. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-59-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:49 -07:00
Sean Christopherson	11a60455d4	KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to find and kick vCPUs when a device posted interrupt serves as a wake event. Lookups on a vCPU index are O(fast) (not sure what xa_load() actually provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match its index. Unlike the Physical APIC Table, which is accessed by hardware when virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_ use APIC IDs to fill the Physical APIC Table, but KVM has free rein over the format/meaning of the GA tag. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-57-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:48 -07:00
Sean Christopherson	ce9d54f41b	KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ bypass WARN if KVM attempts to "start" IRQ bypass when VT-d Posted IRQs are disabled, to make it obvious that the logic is a sanity check, and so that a bug related to nr_possible_bypass_irqs is more like to cause noisy failures, e.g. so that KVM doesn't silently fail to wake blocking vCPUs. Link: https://lore.kernel.org/r/20250611224604.313496-56-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:47 -07:00
Sean Christopherson	77e1b8332d	KVM: x86: Decouple device assignment from IRQ bypass Use a dedicated counter to track the number of IRQs that can utilize IRQ bypass instead of piggybacking the assigned device count. As evidenced by commit `2edd9cb79f` ("kvm: detect assigned device via irqbypass manager"), it's possible for a device to be able to post IRQs to a vCPU without said device being assigned to a VM. Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment to avoid regressing the MMIO stale data mitigation. KVM is abusing the assigned device count when applying mmio_stale_data_clear, and it's not at all clear if vDPA devices rely on this behavior. This will hopefully be cleaned up in the future, as the number of assigned devices is a terrible heuristic for detecting if a VM has access to host MMIO. Link: https://lore.kernel.org/r/20250611224604.313496-55-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:46 -07:00
Sean Christopherson	99836eb9c5	KVM: SVM: WARN if ir_list is non-empty at vCPU free Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is freed with ir_list entries, i.e. if KVM leaves a dangling IRTE. Initialize the per-vCPU interrupt remapping list and its lock even if AVIC is disabled so that the WARN doesn't hit false positives (and so that KVM doesn't need to call into AVIC code for a simple sanity check). Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:46 -07:00
Sean Christopherson	25ef059e8b	KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC Yell if kvm_pi_update_irte() is reached without an in-kernel local APIC, as kvm_arch_irqfd_allowed() should prevent attaching an irqfd and thus any and all postable IRQs to an APIC-less VM. Link: https://lore.kernel.org/r/20250611224604.313496-53-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:45 -07:00
Sean Christopherson	d1bccaa179	KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte() WARN if kvm_pi_update_irte() is reached without IRQ bypass support, as the code is only reachable if the VM already has an IRQ bypass producer (see kvm_irq_routing_update()), or from kvm_arch_irq_bypass_{add,del}_producer(), which, stating the obvious, are called if and only if KVM enables its IRQ bypass hooks. Cc: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250611224604.313496-52-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:44 -07:00
Sean Christopherson	04c4ca0ae4	KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte() Don't bother checking if the VM has an assigned device when updating IRTE entries. kvm_arch_irq_bypass_add_producer() explicitly increments the assigned device count, kvm_arch_irq_bypass_del_producer() explicitly decrements the count before invoking kvm_pi_update_irte(), and kvm_irq_routing_update() only updates IRTE entries if there's an active IRQ bypass producer. Link: https://lore.kernel.org/r/20250611224604.313496-51-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:44 -07:00
Sean Christopherson	cd86240fea	KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails WARN if updating GA information for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata, and because returning an error that is always ignored is pointless. Link: https://lore.kernel.org/r/20250611224604.313496-50-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:43 -07:00
Sean Christopherson	48f79c6c86	KVM: SVM: Process all IRTEs on affinity change even if one update fails When updating IRTE GA fields, keep processing all other IRTEs if an update fails, as not updating later entries risks making a bad situation worse. Link: https://lore.kernel.org/r/20250611224604.313496-49-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:42 -07:00
Sean Christopherson	16562766f1	KVM: SVM: WARN if (de)activating guest mode in IOMMU fails WARN if (de)activating "guest mode" for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata. Link: https://lore.kernel.org/r/20250611224604.313496-48-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:42 -07:00
Sean Christopherson	fe0213923d	KVM: SVM: Don't check for assigned device(s) when activating AVIC Don't short-circuit IRTE updating when (de)activating AVIC based on the VM having assigned devices, as nothing prevents AVIC (de)activation from racing with device (de)assignment. And from a performance perspective, bailing early when there is no assigned device doesn't add much, as ir_list_lock will never be contended if there's no assigned device. Link: https://lore.kernel.org/r/20250611224604.313496-47-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:41 -07:00
Sean Christopherson	f5998661ff	KVM: SVM: Don't check for assigned device(s) when updating affinity Don't bother checking if a VM has an assigned device when updating AVIC vCPU affinity, querying ir_list is just as cheap and nothing prevents racing with changes in device assignment. Link: https://lore.kernel.org/r/20250611224604.313496-46-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:40 -07:00
Sean Christopherson	6df262f915	iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave the actual IRTE in remapped mode. KVM already handles the case where AVIC is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()), but doesn't handle the scenario where a postable IRQ comes along while AVIC is inhibited. Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:40 -07:00
Sean Christopherson	f965255dc5	iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that avic_physical_id_entry can be safely accessed, set the pCPU info straight-away when setting vCPU affinity. Putting the IRTE into posted mode, and then immediately updating the IRTE a second time if the target vCPU is running is wasteful and confusing. This also fixes a flaw where a posted IRQ that arrives between putting the IRTE into guest_mode and setting the correct destination could cause the IOMMU to ring the doorbell on the wrong pCPU. Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:39 -07:00
Sean Christopherson	08d9ccdd1a	iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination Infer whether or not a vCPU should be marked running from the validity of the pCPU on which it is running. amd_iommu_update_ga() already skips the IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an invalid pCPU would be a blatant and egregrious KVM bug. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:37 -07:00
Sean Christopherson	c3d591c91f	KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU Now that svm_ir_list_add() isn't overloaded with all manner of weird things, fold it into avic_pi_update_irte(), and more importantly take ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info that's shoved into the IRTE is fresh. While preemption (and IRQs) is disabled on the task performing the IRTE update, thanks to irqfds.lock, that task doesn't hold the vCPU's mutex, i.e. preemption being disabled is irrelevant. Link: https://lore.kernel.org/r/20250611224604.313496-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-23 09:50:36 -07:00

1 2 3 4 5 ...

50042 Commits