Dave Hansen reports the following crash on a 32-bit system with
CONFIG_HIGHMEM=y and CONFIG_X86_PAE=y:
> 0xf75fe000 is the mem_map[] entry for the first page >4GB. It
> obviously wasn't allocated, thus the oops.
BUG: unable to handle page fault for address: f75fe000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
*pdpt = 0000000002da2001 *pde = 000000000300c067 *pte = 0000000000000000
Oops: Oops: 0002 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.15.0-rc1-00288-ge618ee89561b-dirty #311 PREEMPT(undef)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
EIP: __free_pages_core+0x3c/0x74
...
Call Trace:
memblock_free_pages+0x11/0x2c
memblock_free_all+0x2ce/0x3a0
mm_core_init+0xf5/0x320
start_kernel+0x296/0x79c
i386_start_kernel+0xad/0xb0
startup_32_smp+0x151/0x154
The mem_map[] is allocated up to the end of ZONE_HIGHMEM which is defined
by max_pfn.
The bug was introduced by this recent commit:
6faea3422e ("arch, mm: streamline HIGHMEM freeing")
Previously, freeing of high memory was also clamped to the end of
ZONE_HIGHMEM, but after this change memblock_free_all() tries to
free memory above the end of ZONE_HIGHMEM as well, and that causes
access to mem_map[] entries beyond the end of the memory map.
To fix this, discard the memory after max_pfn from memblock on
32-bit systems so that the core MM is only aware of actually
usable memory.
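A minimal sketch of the idea (illustrative only, not necessarily the exact
hunk; memblock_remove() caps the range internally, so an "everything above
max_pfn" size is fine):

    /* Make memblock (and thus memblock_free_all()) forget memory above max_pfn */
    memblock_remove(PFN_PHYS(max_pfn), PHYS_ADDR_MAX);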
Fixes: 6faea3422e ("arch, mm: streamline HIGHMEM freeing")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Arnd Bergmann <arnd@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Davide Ciminaghi <ciminaghi@gnudd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20250413080858.743221-1-rppt@kernel.org # discussion and submission
Though the module_exit functions are now no-ops, they should still be
defined, since otherwise the modules become unremovable.
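For illustration, the exit stubs end up looking roughly like this (sketch;
the actual function names differ per architecture):

    static void __exit poly1305_simd_mod_exit(void)
    {
            /* Intentionally empty: its presence is what keeps the module removable. */
    }
    module_exit(poly1305_simd_mod_exit);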
Fixes: 1f81c58279 ("crypto: arm/poly1305 - remove redundant shash algorithm")
Fixes: f4b1a73aec ("crypto: arm64/poly1305 - remove redundant shash algorithm")
Fixes: 378a337ab4 ("crypto: powerpc/poly1305 - implement library instead of shash")
Fixes: 21969da642 ("crypto: x86/poly1305 - remove redundant shash algorithm")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Though the module_exit functions are now no-ops, they should still be
defined, since otherwise the modules become unremovable.
Fixes: 08820553f3 ("crypto: arm/chacha - remove the redundant skcipher algorithms")
Fixes: 8c28abede1 ("crypto: arm64/chacha - remove the skcipher algorithms")
Fixes: f7915484c0 ("crypto: powerpc/chacha - remove the skcipher algorithms")
Fixes: ceba0eda83 ("crypto: riscv/chacha - implement library instead of skcipher")
Fixes: 632ab0978f ("crypto: x86/chacha - remove the skcipher algorithms")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- Fix hypercall detection on Xen guests
- Extend the AMD microcode loader SHA check to Zen5,
to block loading of any unreleased standalone
Zen5 microcode patches
- Add new Intel CPU model number for Bartlett Lake
- Fix the workaround for AMD erratum 1054
- Fix buggy early memory acceptance between
SEV-SNP guests and the EFI stub
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgCuUsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jnuRAAqj+XgVPE25Yw0qyFjWxJwHg3duPPMBql
BTs6766d8DJTZI2x+zGmiJzxUynKyMItQnkhS7w2n/T11YKw/wW2nMOYNqBgFJwS
A7au/4PSp0LJARO5I3GxEVUFDQwawsf8ly+OeOZacKhtycmxL3Pr9pMHU98GWb5C
TvqKQlohp/YI3SKpLxSE/LCJJZr/8o7uOt3i95Rx8fH9zEp4Bat5OPpFpPZpefDS
kWnQKFq+iSwDldPMg00/SdpFDZVqhHItodeMqJdz6MMa7+sPB5dyjYNGuT9Dvf03
zMv3mjWFDTPjezvuQH+kTxhOfgFtV8VI+b35c2JqTyqkvSzrcrOV1W2EJausSt4H
D//UXzDaAcJJaq/YWBuX+DaajyRdVl6i8trtgKMM0BWRPa7wTBFiJU7Lvt73gW/s
8/c5+V0iI0tkySkqoCZJKVwVVxHDxf9z5CQomEwupf7SrI+O8gjjwi0F8NwV0Zeo
kP8InCOHVWFHKqf5G4lVsF7qqLgCSJFkKeyVXR8ZHtrcqEMDoF4eZDfky36K5d8f
OMMWF/LAh3Fa2CyQDdwZkqtDi2D+3+99Bbw3zixOZpElPB90jtRrxbu2tfVOE8nC
RCjdqLMYq7EPlrCzPiq85PbQjYLA8gDJw9WUZkeb3KJzFdv3zyV9VHc8eMF83JNq
gRnKlwXAXPU=
=cqkj
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix hypercall detection on Xen guests
- Extend the AMD microcode loader SHA check to Zen5, to block loading
of any unreleased standalone Zen5 microcode patches
- Add new Intel CPU model number for Bartlett Lake
- Fix the workaround for AMD erratum 1054
- Fix buggy early memory acceptance between SEV-SNP guests and the EFI
stub
* tag 'x86-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/sev: Avoid shared GHCB page for early memory acceptance
x86/cpu/amd: Fix workaround for erratum 1054
x86/cpu: Add CPU model number for Bartlett Lake CPUs with Raptor Cove cores
x86/microcode/AMD: Extend the SHA check to Zen5, block loading of any unreleased standalone Zen5 microcode patches
x86/xen: Fix __xen_hypercall_setfunc()
- Fix Intel uncore PMU IIO free running counters on
SPR, ICX and SNR systems.
- Fix Intel PEBS buffer overflow handling
- Fix skid in Intel PEBS sampling of user-space
general purpose registers
- Enable Panther Lake PMU support - similar to Lunar Lake.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgCtpgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iKJxAAp+Sigxdj7jrq6a8txDsoPeCcb+tyqhxl
j/hJ8JeTuQM+y09AwJDYGzRc0tatGMNUKqpR5YQbHLDJAFNQpQZRRXd0VzVyVOgy
29zMTtG+zfWAncqddy0v2QWCsMdJxW3coDj1ahbQjrFIfa1Qdus3hTodXlSJ2+/c
dBqXNjz3sckdT1LLPC2+Ahc2dCTotvqsJV2kkY6njbIOnhqncPyVMCi8EaBxbrKO
s4iSfYBS3IUX8px9Yqr/f4CkVMFJHG/jNI8TXnZEzyg/geAEY34K84T09qvq9hOl
BB/oO0N0C81tARW9oehc7p5nhJe/59W7lvylMcdBDMPm2B1iHLIRLJw2aqo4VP0G
apQPsEPGMBfnIdF7jVNBQztqUMJoCH33Mz94D3mr4VsXNdCvP0891kvNm0ftqlgV
2puWchBiHo5cdXP45o+CYJLrufzBm+v0hZF1YlfQxEiksBJlpKcpC2+4gZKqzHDO
R4j3w7FBlEikySJnTwl7n+BZNSRUwXUuOYBekO2fhlmwPAtvK0pbnijjLb5jajQn
xPh1UKwYl2MC0qzIGkBfnAG3yWnU2XeChnBGj9NXuzdky2qA9Oqyeix0yTEugvmW
yj53nqFmP7vFYtBlDrondl3EaJlAsPRVvQGgB+zfgsro7G/NIud/j+VzcNBKwYMy
VxbRLrejvHU=
=rUbq
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf event fixes from Ingo Molnar:
"Miscellaneous fixes and a hardware-enabling change:
- Fix Intel uncore PMU IIO free running counters on SPR, ICX and SNR
systems
- Fix Intel PEBS buffer overflow handling
- Fix skid in Intel PEBS sampling of user-space general purpose
registers
- Enable Panther Lake PMU support - similar to Lunar Lake"
* tag 'perf-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Add Panther Lake support
perf/x86/intel: Allow to update user space GPRs from PEBS records
perf/x86/intel: Don't clear perf metrics overflow bit unconditionally
perf/x86/intel/uncore: Fix the scale of IIO free running counters on SPR
perf/x86/intel/uncore: Fix the scale of IIO free running counters on ICX
perf/x86/intel/uncore: Fix the scale of IIO free running counters on SNR
As the name switch_mm_irqs_off() implies, it ought to be called with
IRQs *off*. Commit 58f8ffa917 ("x86/mm: Allow temporary MMs when IRQs
are on") caused this to not be the case for EFI.
Ensure IRQs are off where it matters.
Fixes: 58f8ffa917 ("x86/mm: Allow temporary MMs when IRQs are on")
Reported-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20250418095034.GR38216@noisy.programming.kicks-ass.net
Communicating with the hypervisor using the shared GHCB page requires
clearing the C bit in the mapping of that page. When executing in the
context of the EFI boot services, the page tables are owned by the
firmware, and this manipulation is not possible.
So switch to a different API for accepting memory in SEV-SNP guests, one
which is actually supported at the point during boot where the EFI stub
may need to accept memory, but the SEV-SNP init code has not executed
yet.
For simplicity, also switch the memory acceptance carried out by the
decompressor when not booting via EFI - this only involves the
allocation for the decompressed kernel, and is generally only called
after kexec, as normal boot will jump straight into the kernel from the
EFI stub.
Fixes: 6c32117963 ("x86/sev: Add SNP-specific unaccepted memory support")
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250404082921.2767593-8-ardb+git@google.com # discussion thread #1
Link: https://lore.kernel.org/r/20250410132850.3708703-2-ardb+git@google.com # discussion thread #2
Link: https://lore.kernel.org/r/20250417202120.1002102-2-ardb+git@google.com # final submission
Erratum 1054 affects AMD Zen processors that are a part of Family 17h
Models 00-2Fh and the workaround is to not set HWCR[IRPerfEn]. However,
when X86_FEATURE_ZEN1 was introduced, the condition to detect unaffected
processors was incorrectly changed in a way that the IRPerfEn bit gets
set only for unaffected Zen 1 processors.
Ensure that HWCR[IRPerfEn] is set for all unaffected processors. This
includes a subset of Zen 1 (Family 17h Models 30h and above) and all
later processors. Also clear X86_FEATURE_IRPERF on affected processors
so that the IRPerfCount register is not used by other entities like the
MSR PMU driver.
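Roughly, the intended logic is (illustrative sketch, not the exact hunk):

    if (cpu_has(c, X86_FEATURE_IRPERF)) {
            /* Erratum 1054: Zen1 parts in Family 17h Models 00h-2Fh are affected */
            bool affected = cpu_has(c, X86_FEATURE_ZEN1) && c->x86_model < 0x30;

            if (!affected)
                    msr_set_bit(MSR_K7_HWCR, MSR_K7_HWCR_IRPERF_EN_BIT);
            else
                    clear_cpu_cap(c, X86_FEATURE_IRPERF);
    }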
Fixes: 232afb5578 ("x86/CPU/AMD: Add X86_FEATURE_ZEN1")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/caa057a9d6f8ad579e2f1abaa71efbd5bd4eaf6d.1744956467.git.sandipan.das@amd.com
Unlike L3 and DF counters, UMC counters (PERF_CTRs) set the Overflow bit
(bit 48) and saturate on overflow. A subsequent pmu->read() of the event
reports an incorrect accumulated count as there is no difference between
the previous and the current values of the counter.
To avoid this, inspect the current counter value and proactively reset
the corresponding PERF_CTR register on every pmu->read(). Combined with
the periodic reads initiated by the hrtimer, the counters never get a
chance to saturate, but the resolution is reduced to 47 bits.
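The idea in pmu->read(), as a rough sketch (field names follow the existing
uncore code; not the exact hunk):

    prev = local64_read(&hwc->prev_count);
    rdmsrl(hwc->event_base, curr);
    local64_add(curr - prev, &event->count);

    /* Proactively reset the PERF_CTR so it can never reach saturation */
    wrmsrl(hwc->event_base, 0);
    local64_set(&hwc->prev_count, 0);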
Fixes: 25e5684782 ("perf/x86/amd/uncore: Add memory controller support")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/dee9c8af2c6d66814cf4c6224529c144c620cf2c.1744906694.git.sandipan.das@amd.com
Introduce a module parameter for configuring the hrtimer duration in
milliseconds. The default duration is 60000 milliseconds and the intent
is to allow users to customize it to suit jitter tolerances. It should
be noted that a longer duration will reduce jitter but affect accuracy
if the programmed events cause the counters to overflow multiple times
in a single interval.
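Roughly along these lines (sketch; the parameter and variable names are
illustrative):

    static unsigned int update_interval = 60 * MSEC_PER_SEC;
    module_param(update_interval, uint, 0444);
    MODULE_PARM_DESC(update_interval, "hrtimer duration in milliseconds");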
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/6cb0101da74955fa9c8361f168ffdf481ae8a200.1744906694.git.sandipan.das@amd.com
Uncore counters do not provide mechanisms like interrupts to report
overflows and the accumulated user-visible count is incorrect if there
is more than one overflow between two successive read requests for the
same event because the value of prev_count goes out-of-date for
calculating the correct delta.
To avoid this, start a hrtimer to periodically initiate a pmu->read() of
the active counters for keeping prev_count up-to-date. It should be
noted that the hrtimer duration should be less than the shortest time
it takes for a counter to overflow for this approach to be effective.
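A sketch of the mechanism (illustrative struct and field names; the callback
simply re-reads the active events and re-arms itself):

    static enum hrtimer_restart uncore_hrtimer_handler(struct hrtimer *t)
    {
            struct my_uncore_box *box = container_of(t, struct my_uncore_box, hrtimer);
            struct perf_event *event;

            /* Refresh prev_count of every active event before a counter can wrap twice */
            list_for_each_entry(event, &box->active_list, active_entry)
                    event->pmu->read(event);

            hrtimer_forward_now(t, ns_to_ktime(box->hrtimer_duration));
            return HRTIMER_RESTART;
    }

    /* at box init time: */
    hrtimer_init(&box->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    box->hrtimer.function = uncore_hrtimer_handler;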
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/8ecf5fe20452da1cd19cf3ff4954d3e7c5137468.1744906694.git.sandipan.das@amd.com
hrtimer handlers can be deferred to softirq context and affect timely
detection of counter overflows. Hence switch to HRTIMER_MODE_HARD.
Disabling and re-enabling IRQs in the hrtimer handler is not required
as pmu->start() and pmu->stop() can no longer intervene while updating
event->hw.prev_count.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/0ad4698465077225769e8edd5b2c7e8f48f636d5.1744906694.git.sandipan.das@amd.com
Rename the rep_nop() function to reflect what it really does.
No functional change intended.
Suggested-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250418080805.83679-1-ubizjak@gmail.com
The current minimum required version of binutils is 2.25,
which supports the PAUSE instruction mnemonic.
Replace "REP; NOP" with this proper mnemonic.
No functional change intended.
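Concretely, the inline asm in question changes along these lines (sketch of
the relevant line):

    -	asm volatile("rep; nop" ::: "memory");
    +	asm volatile("pause" ::: "memory");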
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250418080805.83679-2-ubizjak@gmail.com
Minimum version of binutils required to compile the kernel is 2.25.
This version correctly handles the "rep" prefixes, so it is possible
to remove the semicolon, which was used to support ancient versions
of GNU as.
Due to the semicolon, the compiler considers "rep; insn" (or its
alternate "rep\n\tinsn" form) as two separate instructions. Removing
the semicolon makes asm length calculations more accurate, consequently
making scheduling and inlining decisions of the compiler more accurate.
Removing the semicolon also enables assembler checks involving "rep"
prefixes. Trying to assemble e.g. "rep addl %eax, %ebx" results in:
Error: invalid instruction `add' after `rep'
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20250418071437.4144391-2-ubizjak@gmail.com
Minimum version of binutils required to compile the kernel is 2.25.
This version correctly handles the "rep" prefixes, so it is possible
to remove the semicolon, which was used to support ancient versions
of GNU as.
Due to the semicolon, the compiler considers "rep; insn" (or its
alternate "rep\n\tinsn" form) as two separate instructions. Removing
the semicolon makes asm length calculations more accurate, consequently
making scheduling and inlining decisions of the compiler more accurate.
Removing the semicolon also enables assembler checks involving "rep"
prefixes. Trying to assemble e.g. "rep addl %eax, %ebx" results in:
Error: invalid instruction `add' after `rep'
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Mares <mj@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250418071437.4144391-1-ubizjak@gmail.com
Add support to emulate all NOP instructions as the original uprobe
instruction.
This change speeds up uprobes on top of all NOP instructions and is a
preparation for the usdt probe optimization that will be done on top of
NOP5 instructions.
With this change the usdt probe on top of NOP5s won't take the performance
hit compared to usdt probe on top of standard NOP instructions.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Hao Luo <haoluo@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20250414083647.1234007-1-jolsa@kernel.org
CE4100 PCI-specific code has no 'pci' suffix in its filename;
intel_mid_pci.c is the only file that duplicates the folder name in its
filename, so drop that redundancy.
While at it, group the respective modules in the Makefile.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20250407070321.3761063-1-andriy.shevchenko@linux.intel.com
The "paravirt environment" is no longer in the tree. Axe that part of the
comment. Also add a blurb to remind readers that "USER_PMDS" refers to
the PTI user *copy* of the page tables, not the user *portion*.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250414173240.5B1AB322%40davehans-spike.ostc.intel.com
There are a few too many levels of abstraction here.
First, just expand the PREALLOCATED_PMDS macro in place to make it
clear that it is only conditional on PTI.
Second, MAX_PREALLOCATED_PMDS is only used in one spot for an
on-stack allocation. It has a *maximum* value of 4. Do not bother
with the macro MAX() magic. Just set it to 4.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250414173238.6E3CDA56%40davehans-spike.ostc.intel.com
Each mm_struct has its own copy of the page tables. When core mm code
makes changes to a copy of the page tables those changes sometimes
need to be synchronized with other mms' copies of the page tables. But
when this synchronization actually needs to happen is highly
architecture and configuration specific.
In cases where kernel PMDs are shared across processes
(SHARED_KERNEL_PMD) the core mm does not itself need to do that
synchronization for kernel PMD changes. The x86 code communicates
this by keeping the PGTBL_PMD_MODIFIED bit clear in those
configs to avoid expensive synchronization.
The kernel is moving toward never sharing kernel PMDs on 32-bit.
Prepare for that and make 32-bit PAE always set PGTBL_PMD_MODIFIED,
even if there is no modification to synchronize. This obviously adds
some synchronization overhead in cases where the kernel page tables
are being changed.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250414173237.EC790E95%40davehans-spike.ostc.intel.com
Kernel PMDs can either be shared across processes or private to a
process. On 64-bit, they are always shared. 32-bit non-PAE hardware
does not have PMDs, but the kernel logically squishes them into the
PGD and treats them as private. Here are the four cases:
64-bit: Shared
32-bit: non-PAE: Private
32-bit: PAE+ PTI: Private
32-bit: PAE+noPTI: Shared
Note that 32-bit is all "Private" except for PAE+noPTI being an
oddball. The 32-bit+PAE+noPTI case will be made like the rest of
32-bit shortly.
But until that can be done, temporarily treat the 32-bit+PAE+noPTI
case as Private. This will do unnecessary walks across pgd_list and
unnecessary PTE setting but should be otherwise harmless.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250414173235.F63F50D1%40davehans-spike.ostc.intel.com
A hardware PAE PGD is only 32 bytes. A PGD is PAGE_SIZE in the other
paging modes. But for reasons*, the kernel _sometimes_ allocates a
whole page even though it only ever uses 32 bytes.
Make PAE less weird. Just allocate a page like the other paging modes.
This was already being done for PTI (and Xen in the past) and nobody
screamed that loudly about it so it can't be that bad.
* The original reason for PAGE_SIZE allocations for the PAE PGDs was
Xen's need to detect page table writes. But 32-bit PTI forced it too
for reasons I'm unclear about.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250414173234.D34F0C3E%40davehans-spike.ostc.intel.com
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCaAEWhAAKCRCAXGG7T9hj
vne3AQCXbD8VtBJZEZ6h3sxRYLYpDHaJa5B0NTNUwgbVDPG/pgD/ad8c8iOomlWT
EllZmgwobkMwz0XZHDjsBfHIYjA4AgE=
=aDuc
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.15a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fix from Juergen Gross:
"Just a single fix for the Xen multicall driver avoiding a percpu
variable referencing initdata by its initializer"
* tag 'for-linus-6.15a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: fix multicall debug feature
The CONFIG_DEBUG_VM=y warning in switch_mm_irqs_off() started
triggering in testing:
VM_WARN_ON_ONCE(prev != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(prev)));
AFAIU what happens is that unuse_temporary_mm() clears the mm_cpumask()
for the current CPU, while switch_mm_irqs_off() then checks that the
mm_cpumask() bit is set for the current CPU.
This behaviour hasn't really changed since the following commit
introduced both:
209954cbc7 ("x86/mm/tlb: Update mm_cpumask lazily")
The warning is wrong, so remove it.
[ mingo: Patchified Peter's email. ]
Reported-by: syzbot+c2537ce72a879a38113e@syzkaller.appspotmail.com
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/r/20250414135629.GA17910@noisy.programming.kicks-ass.net
Arch-PEBS retires the IA32_PEBS_ENABLE and MSR_PEBS_DATA_CFG MSRs, so
intel_pmu_pebs_enable/disable() and intel_pmu_pebs_enable/disable_all()
don't need to be called for arch-PEBS.
To make the code cleaner, introduce static calls
x86_pmu_pebs_enable/disable() and x86_pmu_pebs_enable/disable_all()
instead of adding "x86_pmu.arch_pebs" check directly in these helpers.
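The shape of it, as a sketch (following the existing x86 PMU static-call
pattern; the exact call sites differ):

    DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_enable, intel_pmu_pebs_enable);

    /* during PMU init, wired up only when legacy DS-based PEBS is in use: */
    static_call_update(x86_pmu_pebs_enable, intel_pmu_pebs_enable);

    /* at the former direct call sites: */
    static_call_cond(x86_pmu_pebs_enable)(event);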
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250415114428.341182-7-dapeng1.mi@linux.intel.com
Since architectural PEBS will be introduced in subsequent patches,
rename x86_pmu.pebs to x86_pmu.ds_pebs to distinguish it from the
upcoming architectural PEBS.
Besides, restrict the reserve_ds_buffers() helper to work only for the
legacy DS-based PEBS so that it does not corrupt the pebs_active flag or
release the PEBS buffer incorrectly for arch-PEBS, since a later patch
will reuse these flags and the alloc/release_pebs_buffer() helpers for
arch-PEBS.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250415114428.341182-6-dapeng1.mi@linux.intel.com
Move x86_pmu.bts flag initialization into bts_init() from
intel_ds_init() and rename intel_ds_init() to intel_pebs_init() since it
fully initializes PEBS now after removing the x86_pmu.bts
initialization.
It's safe to move the x86_pmu.bts initialization into bts_init() since
all users of the x86_pmu.bts flag run after bts_init() has executed.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250415114428.341182-5-dapeng1.mi@linux.intel.com
The CPUID archPerfmonExt (0x23) leaves enumerate CPU-level PMU
capabilities on non-hybrid processors as well.
This patch adds support for parsing the archPerfmonExt leaves on
non-hybrid processors. Architectural PEBS leverages archPerfmonExt
sub-leaves 0x4 and 0x5 to enumerate the PEBS capabilities. This patch is
a precursor of the subsequent arch-PEBS enabling patches.
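Illustratively (sketch; the sub-leaf meanings are per the description above):

    unsigned int eax, ebx, ecx, edx;

    /* archPerfmonExt main leaf */
    cpuid_count(0x23, 0, &eax, &ebx, &ecx, &edx);

    /* arch-PEBS capability sub-leaves, 0x4 and 0x5 */
    cpuid_count(0x23, 0x4, &eax, &ebx, &ecx, &edx);
    cpuid_count(0x23, 0x5, &eax, &ebx, &ecx, &edx);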
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250415114428.341182-4-dapeng1.mi@linux.intel.com
From the PMU's perspective, Clearwater Forest is similar to the previous
generation Sierra Forest.
The key differences are the ARCH PEBS feature and the 3 newly added
fixed counters for topdown L1 metrics events.
ARCH PEBS is supported in the following patches. This patch provides
support for the basic perfmon features and the 3 newly added fixed
counters.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250415114428.341182-3-dapeng1.mi@linux.intel.com
From PMU's perspective, Panther Lake is similar to the previous
generation Lunar Lake. Both are hybrid platforms, with e-core and
p-core.
The key differences are the ARCH PEBS feature and several new events.
The ARCH PEBS is supported in the following patches.
The new events will be supported later in perf tool.
Share the code path with Lunar Lake. Only update the name.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250415114428.341182-2-dapeng1.mi@linux.intel.com
Currently when a user samples user space GPRs (--user-regs option) with
PEBS, the user space GPRs actually always come from software PMI
instead of from PEBS hardware. This can lead to the sampled GPRs being
inaccurate in the single PEBS record case because of the skid between
the counter overflow and GPR sampling on the PMI.
For the large PEBS case, it is even worse. If the user sets the
exclude_kernel attribute, large PEBS is used to sample user space
GPRs, but since the PEBS GPR group is not really enabled, all samples
in the large PEBS record end up sharing the same piece of user space
GPRs, as this reproducer shows:
$ perf record -e branches:pu --user-regs=ip,ax -c 100000 ./foo
$ perf report -D | grep "AX"
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
.... AX 0x000000003a0d4ead
So enable the GPR group for user space GPR sampling and prioritize
reading GPRs from PEBS. If the PEBS-sampled GPRs are not user space
GPRs (single PEBS record case), perf_sample_regs_user() modifies them
to user space GPRs.
[ mingo: Clarified the changelog. ]
Fixes: c22497f583 ("perf/x86/intel: Support adaptive PEBS v4")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250415104135.318169-2-dapeng1.mi@linux.intel.com
The code below unconditionally clears other status bits, like the perf
metrics overflow bit, once the PEBS buffer overflows:
status &= intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
This is incorrect. The perf metrics overflow bit should be cleared only
when fixed counter 3 is in the PEBS counter group. Otherwise, handling
of a perf metrics overflow could be missed.
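Conceptually, the fix amounts to something like this (hedged sketch, not the
exact upstream hunk; the fixed-counter-3 test in particular is an assumption):

    u64 mask = intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;

    /*
     * Only treat the perf metrics overflow as already handled by the PEBS
     * drain when fixed counter 3 is part of the PEBS counter group.
     */
    if (!(cpuc->pebs_enabled & BIT_ULL(INTEL_PMC_IDX_FIXED + 3)))
            mask |= GLOBAL_STATUS_PERF_METRICS_OVF;

    status &= mask;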
Closes: https://lore.kernel.org/all/20250225110012.GK31462@noisy.programming.kicks-ass.net/
Fixes: 7b2c05a15d ("perf/x86/intel: Generic support for hardware TopDown metrics")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250415104135.318169-1-dapeng1.mi@linux.intel.com
The scale of the IIO bandwidth-in free running counters is inherited
from ICX. The counter increments for every 32 bytes rather than 4 bytes.
The IIO bandwidth-out free running counters don't increment with a
consistent size. The increment depends on the requested size. It's
impossible to find a fixed increment. Remove it from the event_descs.
Fixes: 0378c93a92 ("perf/x86/intel/uncore: Support IIO free-running counters on Sapphire Rapids server")
Reported-by: Tang Jun <dukang.tj@alibaba-inc.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250416142426.3933977-3-kan.liang@linux.intel.com
There was a mistake in the ICX uncore spec too. The counter increments
for every 32 bytes rather than 4 bytes.
As on SNR, there are 1 ioclk and 8 IIO bandwidth-in free running
counters. Reuse the snr_uncore_iio_freerunning_events().
Fixes: 2b3b76b5ec ("perf/x86/intel/uncore: Add Ice Lake server uncore support")
Reported-by: Tang Jun <dukang.tj@alibaba-inc.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250416142426.3933977-2-kan.liang@linux.intel.com
There was a mistake in the SNR uncore spec. The counter increments for
every 32 bytes of data sent from the IO agent to the SOC, not 4 bytes
which was documented in the spec.
The event list has been updated:
"EventName": "UNC_IIO_BANDWIDTH_IN.PART0_FREERUN",
"BriefDescription": "Free running counter that increments for every 32
bytes of data sent from the IO agent to the SOC",
Update the scale of the IIO bandwidth in free running counters as well.
Fixes: 210cc5f9db ("perf/x86/intel/uncore: Add uncore support for Snow Ridge server")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250416142426.3933977-1-kan.liang@linux.intel.com
When building with CONFIG_LTO_CLANG, there is an error in the x86 boot
startup code because it builds with a different code model than the rest
of the kernel:
ld.lld: error: Function Import: link error: linking module flags 'Code Model': IDs have conflicting values: 'i32 2' from vmlinux.a(head64.o at 1302448), and 'i32 1' from vmlinux.a(map_kernel.o at 1314208)
ld.lld: error: Function Import: link error: linking module flags 'Code Model': IDs have conflicting values: 'i32 2' from vmlinux.a(common.o at 1306108), and 'i32 1' from vmlinux.a(gdt_idt.o at 1314148)
As this directory is for code that only runs during early system
initialization, LTO is not very important, so filter out the LTO flags
from KBUILD_CFLAGS for arch/x86/boot/startup to resolve the build error.
Fixes: 4cecebf200 ("x86/boot: Move the early GDT/IDT setup code into startup/")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: llvm@lists.linux.dev
Link: https://lore.kernel.org/r/20250414-x86-boot-startup-lto-error-v1-1-7c8bed7c131c@kernel.org
Closes: https://lore.kernel.org/CA+G9fYvnun+bhYgtt425LWxzOmj+8Jf3ruKeYxQSx-F6U7aisg@mail.gmail.com/
The static key mmio_stale_data_clear controls the KVM-only mitigation for MMIO
Stale Data vulnerability. Rename it to reflect its purpose.
No functional change.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250416-mmio-rename-v2-1-ad1f5488767c@linux.intel.com
The original function name came from an overly compressed form of
'fpstate_regs' by commit:
e61d6310a0 ("x86/fpu: Reset permission and fpstate on exec()")
However, the term 'fpregs' typically refers to physical FPU registers. In
contrast, this function copies the init values to fpu->fpstate->regs, not
hardware registers.
Rename the function to better reflect what it actually does.
No functional change.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-11-chang.seok.bae@intel.com
The variable was previously referenced in KVM code but the last usage was
removed by:
ea4d6938d4 ("x86/fpu: Replace KVMs home brewed FPU copy from user")
Remove its export symbol.
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-10-chang.seok.bae@intel.com
The signal delivery logic was modified to always set the PKRU bit in
xregs_state->header->xfeatures by this commit:
ae6012d72f ("x86/pkeys: Ensure updated PKRU value is XRSTOR'd")
However, the change derives the bitmask value using XGETBV(1), rather
than simply updating the buffer that already holds the value. Thus, this
approach induces an unnecessary dependency on XGETBV1 for PKRU handling.
Eliminate the dependency by using the established helper function.
Subsequently, remove the now-unused 'mask' argument.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250416021720.12305-9-chang.seok.bae@intel.com
Currently, when saving register states in the signal frame, the legacy feature
bits are always set in xregs_state->header->xfeatures. This code sequence
can be generalized for reuse in similar cases.
Refactor the logic to ensure a consistent approach across similar usages.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-8-chang.seok.bae@intel.com
Not all paths that lead to fpu__init_disable_system_xstate() currently
emit a message indicating that XSAVE has been disabled. Move the print
statement into the function to ensure the message is printed in all cases.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250416021720.12305-7-chang.seok.bae@intel.com
With APX now secured against conflicting MPX, it is ready to be enabled.
Include APX in the enabled xfeature set.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-5-chang.seok.bae@intel.com
XSTATE components are architecturally independent. There is no rule
requiring their offsets in the non-compacted format to be strictly
ascending or mutually non-overlapping. However, in practice, such
overlaps have not occurred -- until now.
APX is introduced as xstate component 19, following AMX. In the
non-compacted XSAVE format, its offset overlaps with the space previously
occupied by the now-deprecated MPX feature:
45fc24e89b ("x86/mpx: remove MPX from arch/x86")
To prevent conflicts, the kernel must ensure the CPU never exposes both
features at the same time. If it does, that indicates unreliable hardware. In
such cases, XSAVE should be disabled entirely as a precautionary measure.
Add a sanity check to detect this condition and disable XSAVE if an
invalid hardware configuration is identified.
Note: MPX state components remain enabled on legacy systems solely for
KVM guest support.
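A hedged sketch of such a check (assuming the XFEATURE_MASK_APX mask
introduced later in this series; the exact disable path may differ):

    /* APX reuses MPX's non-compacted offset: both enumerated => bogus CPU */
    if ((fpu_kernel_cfg.max_features & XFEATURE_MASK_APX) &&
        (fpu_kernel_cfg.max_features & (XFEATURE_MASK_BNDREGS |
                                        XFEATURE_MASK_BNDCSR))) {
            pr_err("x86/fpu: Both APX and MPX enumerated, disabling XSAVE\n");
            goto out_disable;
    }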
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-4-chang.seok.bae@intel.com
Advanced Performance Extensions (APX) is associated with a new state
component number 19. To support saving and restoring of the corresponding
registers via the XSAVE mechanism, introduce the component definition
along with the necessary sanity checks.
Define the new component number, state name, and the corresponding register
data type. Then, extend the size checker to validate the register data type
and explicitly list the APX feature flag as a dependency for the new
component in xsave_cpuid_features[].
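For illustration, the new state type could look roughly like this (sketch;
APX adds 16 extended GPRs, R16-R31, i.e. 16 x 8 = 128 bytes):

    /* State component 19: extended GPRs R16-R31 */
    struct apx_state {
            u64 egpr[16];
    } __packed;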
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-3-chang.seok.bae@intel.com
Intel Advanced Performance Extensions (APX) introduce a new set of
general-purpose registers, managed as an extended state component via the
xstate management facility.
Before enabling this new xstate, define a feature flag to clarify the
dependency in xsave_cpuid_features[]. APX is enumerated under CPUID
leaf 7, sub-leaf 1 (in EDX). Since this CPUID word is not yet allocated,
place the flag in a scattered feature word.
While this feature is intended only for userspace, exposing it via
/proc/cpuinfo is unnecessary. Instead, the existing arch_prctl(2)
mechanism with the ARCH_GET_XCOMP_SUPP option can be used to query the
feature availability.
Finally, clarify that APX depends on XSAVE.
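As a sketch, the scattered-feature entry would be along these lines (assuming
APX_F sits in CPUID.(EAX=7,ECX=1):EDX bit 21 and the flag name used by this
series is X86_FEATURE_APX):

    { X86_FEATURE_APX,	CPUID_EDX, 21, 0x00000007, 1 },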
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250416021720.12305-2-chang.seok.bae@intel.com
The x86 Poly1305 code never falls back to the generic code, so selecting
CRYPTO_LIB_POLY1305_GENERIC is unnecessary.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since crypto/poly1305.c now registers a poly1305-$(ARCH) shash algorithm
that uses the architecture's Poly1305 library functions, individual
architectures no longer need to do the same. Therefore, remove the
redundant shash algorithm from the arch-specific code and leave just the
library functions there.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Following the example of the crc32, crc32c, and chacha code, make the
crypto subsystem register both generic and architecture-optimized
poly1305 shash algorithms, both implemented on top of the appropriate
library functions. This eliminates the need for every architecture to
implement the same shash glue code.
Note that the poly1305 shash requires that the key be prepended to the
data, which differs from the library functions where the key is simply a
parameter to poly1305_init(). Previously this was handled at a fairly
low level, polluting the library code with shash-specific code.
Reorganize things so that the shash code handles this quirk itself.
Also, to register the architecture-optimized shashes only when
architecture-optimized code is actually being used, add a function
poly1305_is_arch_optimized() and make each arch implement it. Change
each architecture's Poly1305 module_init function to arch_initcall so
that the CPU feature detection is guaranteed to run before
poly1305_is_arch_optimized() gets called by crypto/poly1305.c. (In
cases where poly1305_is_arch_optimized() just returns true
unconditionally, using arch_initcall is not strictly needed, but it's
still good to be consistent across architectures.)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Bartlett Lake has a P-core only product with Raptor Cove.
[ mingo: Switch around the define as pointed out by Christian Ludloff:
Raptor Cove is the core, Bartlett Lake is the product. ]
Signed-off-by: Pi Xiange <xiange.pi@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Christian Ludloff <ludloff@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: "Ahmed S. Darwish" <darwi@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250414032839.5368-1-xiange.pi@intel.com
Shorten X86_FEATURE_AMD_HETEROGENEOUS_CORES to X86_FEATURE_AMD_HTR_CORES
to make the last column aligned consistently in the whole file.
No functional changes.
Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250415175410.2944032-4-xin@zytor.com
Shorten X86_FEATURE_CLEAR_BHB_LOOP_ON_VMEXIT to
X86_FEATURE_CLEAR_BHB_VMEXIT to make the last column aligned
consistently in the whole file.
There's no need to explain in the name what the mitigation does.
No functional changes.
Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250415175410.2944032-3-xin@zytor.com
It is a special file with special formatting, so remove one instance of
whitespace damage and format newer defines like the rest.
No functional changes.
[ Xin: Do the same to tools/arch/x86/include/asm/cpufeatures.h. ]
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250415175410.2944032-2-xin@zytor.com
Whack this thing because:
- the "unknown" handling is done only for this vuln and not for the
others
- it doesn't do anything besides reporting things differently. It
doesn't apply any mitigations - it is simply causing unnecessary
complications to the code which don't bring anything besides
maintenance overhead to what is already a very nasty spaghetti pile
- all the currently unaffected CPUs can also be in "unknown" status so
there's no need for special handling here
so get rid of it.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: David Kaplan <david.kaplan@amd.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Link: https://lore.kernel.org/r/20250414150951.5345-1-bp@kernel.org
- There's no need for a newline after the SPDX line
- But there's a need for one before the closing header guard.
Collect AMD specific platform header files in <asm/amd/*.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Carlos Bilbao <carlos.bilbao@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mario Limonciello <superm1@kernel.org>
Cc: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Link: https://lore.kernel.org/r/20250413084144.3746608-6-mingo@kernel.org
A few uses of 'fps' snuck in, which is rather confusing
(to me) as it suggests frames-per-second. ;-)
Rename them to the canonical 'fpstate' name.
No change in functionality.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409211127.3544993-9-mingo@kernel.org
init_task's FPU state initialization was a bit of a hack:
__x86_init_fpu_begin = .;
. = __x86_init_fpu_begin + 128*PAGE_SIZE;
__x86_init_fpu_end = .;
But the init task isn't supposed to be using the FPU context
in any case, so remove the hack and add in some debug warnings.
As Linus noted in the discussion, the init task (and other
PF_KTHREAD tasks) *can* use the FPU via kernel_fpu_begin()/_end(),
but they don't need the context area because their FPU use is not
preemptible or reentrant, and they don't return to user-space.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250409211127.3544993-8-mingo@kernel.org
fpu__drop() and arch_release_task_struct() call x86_task_fpu()
unconditionally, while the FPU context area will not be present
if it's the init task, and should not be in use when it's some
other type of kthread.
Return early for PF_KTHREAD or PF_USER_WORKER tasks. The debug
warning in x86_task_fpu() will catch any kthreads attempting to
use the FPU save area.
Fixed-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409211127.3544993-7-mingo@kernel.org
This encapsulates the fpu__drop() functionality better, and it
will also enable other changes that want to check a task for
PF_KTHREAD before calling x86_task_fpu().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409211127.3544993-6-mingo@kernel.org
As suggested by Oleg, remove the thread::fpu pointer, as we can
calculate it via x86_task_fpu() at compile-time.
This improves code generation a bit:
kepler:~/tip> size vmlinux.before vmlinux.after
text data bss dec hex filename
26475405 10435342 1740804 38651551 24dc69f vmlinux.before
26475339 10959630 1216516 38651485 24dc65d vmlinux.after
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250409211127.3544993-5-mingo@kernel.org
Turn thread.fpu into a pointer. Since most FPU code internals work by passing
around the FPU pointer already, the code generation impact is small.
This allows us to remove the old kludge of task_struct being variable size:
struct task_struct {
...
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
*/
randomized_struct_fields_end
/* CPU-specific state of this task: */
struct thread_struct thread;
/*
* WARNING: on x86, 'thread_struct' contains a variable-sized
* structure. It *MUST* be at the end of 'task_struct'.
*
* Do not put anything below here!
*/
};
... which creates a number of problems, such as requiring thread_struct to be
the last member of the struct - not allowing it to be struct-randomized, etc.
But the primary motivation is to allow the decoupling of task_struct from
hardware details (<asm/processor.h> in particular), and to eventually allow
the per-task infrastructure:
DECLARE_PER_TASK(type, name);
...
per_task(current, name) = val;
... which requires task_struct to be a constant size struct.
The fpu_thread_struct_whitelist() quirk to hardened usercopy can be removed,
now that the FPU structure is not embedded in the task struct anymore, which
reduces text footprint a bit.
Fixed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250409211127.3544993-4-mingo@kernel.org
This will make the removal of the task_struct::thread.fpu array
easier.
No change in functionality - code generated before and after this
commit is identical on x86-defconfig:
kepler:~/tip> diff -up vmlinux.before.asm vmlinux.after.asm
kepler:~/tip>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250409211127.3544993-3-mingo@kernel.org
The per-task FPU context/save area is allocated right
next to task_struct, currently in a variable-size
array via task_struct::thread.fpu[], but we plan to
fully hide it from the C type scope.
Introduce the x86_task_fpu() accessor that gets to the
FPU context pointer explicitly from the task pointer.
Right now this is a simple (task)->thread.fpu wrapper.
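I.e., at this point in the series it is nothing more than (sketch):

    #define x86_task_fpu(task) (&(task)->thread.fpu)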
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409211127.3544993-2-mingo@kernel.org
== Background ==
As feature positions in the userspace XSAVE buffer do not always align
with their feature numbers, the XSAVE format conversion needs to be
reconsidered to align with the revised xstate size calculation logic.
* For signal handling, XSAVE and XRSTOR are used directly to save and
restore extended registers.
* For ptrace, KVM, and signal returns (for 32-bit frame), the kernel
copies data between its internal buffer and the userspace XSAVE buffer.
If memcpy() were used for these cases, existing offset helpers — such
as __raw_xsave_addr() or xstate_offsets[] — would be sufficient to
handle the format conversion.
== Problem ==
When copying data from the compacted in-kernel buffer to the
non-compacted userspace buffer, the function follows the
user_regset_get2_fn() prototype. This means it utilizes struct membuf
helpers for the destination buffer. As defined in regset.h, these helpers
update the memory pointer during the copy process, enforcing sequential
writes within the loop.
Since xstate components are processed sequentially, any component whose
buffer position does not align with its feature number has an issue.
== Solution ==
Replace for_each_extended_xfeature() with the newly introduced
for_each_extended_xfeature_in_order(). This macro ensures xstate
components are handled in the correct order based on their actual
positions in the destination buffer, rather than their feature numbers.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320234301.8342-5-chang.seok.bae@intel.com
The current xstate size calculation assumes that the highest-numbered
xstate feature has the highest offset in the buffer, determining the size
based on the topmost bit in the feature mask. However, this assumption is
not architecturally guaranteed -- higher-numbered features may have lower
offsets.
With the introduction of the xfeature order table and its helper macro,
xstate components can now be traversed in their positional order. Update
the non-compacted format handling to iterate through the table to
determine the last-positioned feature. Then, set the offset accordingly.
Since size calculation primarily occurs during initialization or in
non-critical paths, looping to find the last feature is not expected to
have a meaningful performance impact.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320234301.8342-4-chang.seok.bae@intel.com
The kernel has largely assumed that higher xstate component numbers
correspond to later offsets in the buffer. However, this assumption no
longer holds for the non-compacted format, where a newer state component
may have a lower offset.
When iterating over xstate components in offset order, using the feature
number as an index may be misleading. At the same time, the CPU exposes
each component’s size and offset based on its feature number, making it a
key for state information.
To provide flexibility in handling xstate ordering, introduce a mapping
table: feature order -> feature number. The table is dynamically
populated based on the CPU-exposed features and is sorted in offset order
at boot time.
Additionally, add an accessor macro to facilitate sequential traversal of
xstate components based on their actual buffer positions, given a feature
bitmask. This accessor macro will be particularly useful for computing
custom non-compacted format sizes and iterating over xstate offsets in
non-compacted buffers.
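A rough sketch of how such a table could be populated and sorted at boot
(names and details are illustrative):

	static u8 xfeature_uncompact_order[XFEATURE_MAX] __ro_after_init;

	static int __init cmp_xstate_offset(const void *a, const void *b)
	{
		/* Entries are feature numbers; compare their CPUID-reported offsets: */
		return xstate_offsets[*(const u8 *)a] - xstate_offsets[*(const u8 *)b];
	}

	static void __init setup_xfeature_order(void)
	{
		unsigned int i, nr = 0;

		for_each_extended_xfeature(i, fpu_kernel_cfg.max_features)
			xfeature_uncompact_order[nr++] = i;

		/* Sort by non-compacted buffer offset rather than by feature number: */
		sort(xfeature_uncompact_order, nr, sizeof(u8), cmp_xstate_offset, NULL);
	}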
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320234301.8342-3-chang.seok.bae@intel.com
Traditionally, new xstate components have been assigned sequentially,
aligning feature numbers with their offsets in the XSAVE buffer. However,
this ordering is not architecturally mandated in the non-compacted
format, where a component's offset may not correspond to its feature
number.
The kernel caches CPUID-reported xstate component details, including size
and offset in the non-compacted format. As part of this process, a sanity
check is also conducted to ensure alignment between feature numbers and
offsets.
This check was likely intended as a general guideline rather than a
strict requirement. Upcoming changes will support out-of-order offsets.
Remove the check, as it has become obsolete.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320234301.8342-2-chang.seok.bae@intel.com
Use struct_size() to calculate the number of bytes to allocate for a new
bts_buffer. Compared to offsetof(), struct_size() provides additional
compile-time checks (e.g., __must_be_array()).
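For illustration, with assumed local names:

	/* Before: size computed via offsetof() into the flexible array member: */
	buf = kzalloc(offsetof(struct bts_buffer, buf[nr_buf]), GFP_KERNEL);

	/* After: same size, plus struct_size()'s compile-time sanity checks: */
	buf = kzalloc(struct_size(buf, buf, nr_buf), GFP_KERNEL);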
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250413104108.49142-2-thorsten.blum@linux.dev
To reduce the impact of the API renames in -next, add compatibility
wrappers for the two most popular MSR access APIs: rdmsrl() and wrmsrl().
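A sketch of what such wrappers look like, assuming the new names are
rdmsrq()/wrmsrq():

	/* Compatibility wrappers mapping the old API names onto the new ones: */
	#define rdmsrl(msr, val)	rdmsrq(msr, val)
	#define wrmsrl(msr, val)	wrmsrq(msr, val)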
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Since objtool's inception, frame pointer warnings have been manually
silenced for __arch_hweight*() to allow those functions' inline asm to
avoid using ASM_CALL_CONSTRAINT.
The potentially dubious reasoning for that decision over nine years ago
was that since !X86_FEATURE_POPCNT is exceedingly rare, it's not worth
hurting the code layout for a function call that will never happen on
the vast majority of systems.
However, those functions actually started using ASM_CALL_CONSTRAINT with
the following commit:
194a613088 ("x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm()")
And rightfully so, as it makes the code correct. ASM_CALL_CONSTRAINT
will soon have no effect for non-FP configs anyway.
With ASM_CALL_CONSTRAINT in place, ANNOTATE_IGNORE_ALTERNATIVE no longer
has a purpose for the hweight functions. Remove it.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/e7070dba3278c90f1a836b16157dcd34ccd21e21.1744318586.git.jpoimboe@kernel.org
Define __percpu_prefix in terms of __force_percpu_prefix to avoid a
duplicate definition. While there, slightly reorder the definitions into a
more logical sequence, remove unneeded double quotes and move a misplaced
comment to the right place.
No functional changes intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250411093130.81389-1-ubizjak@gmail.com
All Zen5 machines out there should get BIOS updates which update to the
correct microcode patches addressing the microcode signature issue.
However, silly people carve out random microcode blobs from BIOS
packages and think they are doing other people a service this way...
Block loading of any unreleased standalone Zen5 microcode patches.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250410114222.32523-1-bp@kernel.org
Prepare for splitting off parts of the SEV core.c source file into a
file that carries code that must tolerate being called from the early
1:1 mapping. This will allow special build-time handling of this code,
to ensure that it gets generated in a way that is compatible with the
early execution context.
So create a de-facto internal SEV API and put the definitions into
sev-internal.h. No attempt is made to allow this header file to be
included in arbitrary other sources - this is explicitly not the intent.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-20-ardb+git@google.com
RIP_REL_REF() has no effect on code residing in arch/x86/boot/startup,
as it is built with -fPIC. So remove any occurrences from the SME
startup code.
Note the SME is the only caller of cc_set_mask() that requires this, so
drop it from there as well.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-19-ardb+git@google.com
Move the SME initialization code, which runs from the 1:1 mapping of
memory as it operates on the kernel virtual mapping, into the new
sub-directory arch/x86/boot/startup/ where all startup code will reside
that needs to tolerate executing from the 1:1 mapping.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-18-ardb+git@google.com
Now that __startup_64() is built using -fPIC, RIP_REL_REF() has become a
NOP and can be removed. Only some occurrences of rip_rel_ptr() will
remain, to explicitly take the address of certain global structures in
the 1:1 mapping of memory.
While at it, update the code comment to describe why this is needed.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-17-ardb+git@google.com
The startup code that constructs the kernel virtual mapping runs from
the 1:1 mapping of memory itself, and therefore, cannot use absolute
symbol references. Before making changes in subsequent patches, move
this code into a separate source file under arch/x86/boot/startup/ where
all such code will be kept from now on.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-16-ardb+git@google.com
Move the early GDT/IDT setup code that runs long before the kernel
virtual mapping is up into arch/x86/boot/startup/, and build it in a way
that ensures that the code tolerates being called from the 1:1 mapping
of memory. The code itself is left unchanged by this patch.
Also tweak the sed symbol matching pattern in the decompressor to match
on lower case 't' or 'b', as these will be emitted by Clang for symbols
with hidden linkage.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-15-ardb+git@google.com
RIP_REL_REF() is used in non-PIC C code that is called very early,
before the kernel virtual mapping is up, which is the mapping that the
linker expects. It is currently used in two different ways:
- to refer to the value of a global variable, including as an lvalue in
assignments;
- to take the address of a global variable via the mapping that the code
currently executes at.
The former case is only needed in non-PIC code, as PIC code will never
use absolute symbol references when the address of the symbol is not
being used. But taking the address of a variable in PIC code may still
require extra care, as a stack allocated struct assignment may be
emitted as a memcpy() from a statically allocated copy in .rodata.
For instance, this
void startup_64_setup_gdt_idt(void)
{
struct desc_ptr startup_gdt_descr = {
.address = (__force unsigned long)gdt_page.gdt,
.size = GDT_SIZE - 1,
};
may result in an absolute symbol reference in PIC code, even though the
struct is allocated on the stack and populated at runtime.
To address this case, make rip_rel_ptr() accessible in PIC code, and
update any existing uses where the address of a global variable is
taken using RIP_REL_REF.
Once all code of this nature has been moved into arch/x86/boot/startup
and built with -fPIC, RIP_REL_REF() can be retired, and only
rip_rel_ptr() will remain.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250410134117.3713574-14-ardb+git@google.com
We gain nothing by having the core code enable IRQs right before calling
activate_mm() only for us to turn them right back off again in switch_mm().
This will save a few cycles, so execve() should be blazingly fast with this
patch applied!
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-8-mingo@kernel.org
This should be considerably more robust. It's also necessary for optimized
for_each_possible_lazymm_cpu() on x86 -- without this patch, EFI calls in
lazy context would remove the lazy mm from mm_cpumask().
[ mingo: Merged it on top of x86/alternatives ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-7-mingo@kernel.org
EFI runtime services should use temporary MMs, but EFI runtime services
want IRQs on. Preemption must still be disabled in a temporary MM context.
At some point, the entire temporary MM mechanism should be moved out of
arch code.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-6-mingo@kernel.org
Now that unuse_temporary_mm() lives in tlb.c it can access
cpu_tlbstate.loaded_mm.
[ mingo: Merged it on top of x86/alternatives ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-5-mingo@kernel.org
This prepares them for use outside of the alternative machinery.
The code is unchanged.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-4-mingo@kernel.org
When decoding an instruction or handling a perf event that references an
LDT segment, if we don't have a valid user context, trying to access the
LDT by any means other than SLDT is racy. Certainly, using
current->active_mm is wrong, as active_mm can point to a real user mm when
CR3 and LDTR no longer reference that mm.
Clean up the code. If nmi_uaccess_okay() says we don't have a valid
context, just fail. Otherwise use current->mm.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-3-mingo@kernel.org
In commit 209954cbc7 ("x86/mm/tlb: Update mm_cpumask lazily")
unuse_temporary_mm() grew the assumption that it gets used on
poking_mm exclusively. While this is currently true, let's not hard-code
this assumption.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402094540.3586683-2-mingo@kernel.org
Hypercall detection is failing with xen_hypercall_intel() chosen even on
an AMD processor. Looking at the disassembly, the call to
xen_get_vendor() was removed.
The check for boot_cpu_has(X86_FEATURE_CPUID) was used as a proxy for
the x86_vendor having been set.
When CONFIG_X86_REQUIRED_FEATURE_CPUID=y (the default value), dead code
elimination (DCE) removes the call to xen_get_vendor(). The uninitialized
x86_vendor value of 0 then means X86_VENDOR_INTEL, so the Intel function is
always returned.
Remove the if and always call xen_get_vendor() to avoid this issue.
Fixes: 3d37d9396e ("x86/cpufeatures: Add {REQUIRED,DISABLED} feature configs")
Suggested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Xin Li (Intel)" <xin@zytor.com>
Link: https://lore.kernel.org/r/20250410193106.16353-1-jason.andryuk@amd.com
Reference header files using their canonical form <linux/cacheinfo.h>.
Standardize on CPUID(0xN), instead of CPUID(N), for all standard leaves.
This removes ambiguity and aligns them with their extended counterparts
like CPUID(0x8000001d).
References: 0dd09e215a ("x86/cacheinfo: Apply maintainer-tip coding style fixes")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20250411070401.1358760-3-darwi@linutronix.de
The CPUID(0x2) cache descriptors iterator at <cpuid/leaf_0x2_api.h>:
for_each_leaf_0x2_desc()
has no more call sites. Remove it.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250411070401.1358760-2-darwi@linutronix.de
Simplify the alternatives interface some more by moving the
poke_batch_finish check into poke_batch_process and renaming the latter.
The net effect is one less function name to consider when reading the
code.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-54-mingo@kernel.org
- Capitalize 'INT3' consistently,
- make it clear that 'sync cores' means an SMP sync to all CPUs,
- fix typos and spelling.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-51-mingo@kernel.org
It only has a single user left, merge it into smp_text_poke_batch_add()
and remove the helper function.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-50-mingo@kernel.org
Move it from the middle of a .c file next to the similar declarations
of __alt_instructions[] et al.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-49-mingo@kernel.org
We accumulated lots of unnecessary header inclusions over the years,
trim them.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-48-mingo@kernel.org
- No need to cast over to 'struct smp_text_poke_loc *', void * is just fine
for a binary search,
- Use the canonical (a, b) input parameter nomenclature of cmp_func_t
functions and rename the input parameters from (tp, elt) to
(tpl_a, tpl_b).
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-44-mingo@kernel.org
This will also allow the simplification of patch_cmp().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-43-mingo@kernel.org
- Use direct 'void *' pointer comparison, there's no
need to force the type to 'unsigned long'.
- Remove the 'tp' local variable indirection
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-42-mingo@kernel.org
Unlike sync_core(), text_poke_sync() is a very heavy operation, as
it sends an IPI to every online CPU in the system and waits for
completion.
Reflect this in the name.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-41-mingo@kernel.org
At this point smp_text_poke_single(addr, opcode, len, emulate) is equivalent to:
smp_text_poke_batch_add(addr, opcode, len, emulate);
smp_text_poke_batch_finish();
So remove the restriction on mixing single-instruction patching
with multi-instruction patching.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-37-mingo@kernel.org
This function is now using the text_poke_array state exclusively,
make that explicit by removing the redundant input parameters.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-34-mingo@kernel.org
There's no need to return a pointer on success - it's always
the same pointer.
Return a bool instead.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-32-mingo@kernel.org
Just like with try_get_text_poke_array(), this name better reflects
what the underlying code is doing, there's no 'descriptor'
indirection anymore.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-31-mingo@kernel.org
At this point we are always working out of an uptodate
text_poke_array, there's no need for smp_text_poke_int3_handler()
to read via the int3_vec indirection - remove it.
This simplifies the code:
1 file changed, 5 insertions(+), 15 deletions(-)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-29-mingo@kernel.org
struct text_poke_array is an equivalent structure to these global variables:
static struct smp_text_poke_loc tp_vec[TP_VEC_MAX];
static int tp_vec_nr;
Note that we intentionally mirror much of the naming of
'struct text_poke_int3_vec', which will further highlight
the unnecessary layering going on in this code, and will
ease its removal.
No change in functionality.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-28-mingo@kernel.org
At this point the 'tp' input parameter must always be the
global 'tp_vec' array, and 'nr_entries' must always be equal
to 'tp_vec_nr'.
Assert these conditions - which will allow the removal of
a layer of indirection between these values.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-27-mingo@kernel.org
Instead of constructing a vector on-stack, just use the already
available batch-patching vector - which should always be empty
at this point.
This will allow subsequent simplifications.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-25-mingo@kernel.org
There's this weird hack used by smp_text_poke_batch_finish() to indicate
a 'forced flush':
smp_text_poke_batch_flush(NULL);
Just open-code the vector-flush in a straightforward fashion:
smp_text_poke_batch_process(tp_vec, tp_vec_nr);
tp_vec_nr = 0;
And get rid of the !addr hack from text_poke_addr_ordered().
Leave a WARN_ON_ONCE(), just in case some external code learned
to rely on this behavior.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-24-mingo@kernel.org
tp_order_fail() uses inverted logic: it returns true in case something
is false, which is only a plus at the IOCCC.
Instead, rename it with regular polarity as 'text_poke_addr_ordered()',
and adjust the code accordingly.
Also add a comment explaining how the address ordering should be
understood.
No change in functionality intended.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-23-mingo@kernel.org
It's possible to escape the text_mutex-held assert in
smp_text_poke_batch_process() if the caller uses a properly
batched and sorted series of patch requests, so add
an explicit lockdep_assert_held() to make sure it's
held by all callers.
All text_poke_int3_*() APIs will call either smp_text_poke_batch_process()
or smp_text_poke_batch_flush() internally.
The text_mutex must be held, because tp_vec and tp_vec_nr et al
are all globals, and the INT3 patching machinery itself relies on
external serialization.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-22-mingo@kernel.org
Make it clear that this structure is part of the INT3 based
SMP patching facility, not the regular text_poke*() MM-switch
based facility.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-19-mingo@kernel.org
This name is actively confusing as well, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_loc_init()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.
Rename it to text_poke_int3_loc_init() to make it clear which layer
it belongs to.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-18-mingo@kernel.org
This name is actively confusing as well, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_queue()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.
Rename it to smp_text_poke_batch_add() to make it clear which layer
it belongs to.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-17-mingo@kernel.org
This name is actively confusing as well, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_finish()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.
Rename it to smp_text_poke_batch_finish() to make it clear which layer
it belongs to.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-16-mingo@kernel.org
This name is actually actively confusing, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_flush()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.
Rename it to smp_text_poke_batch_flush() to make it clear which layer
it belongs to.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-15-mingo@kernel.org
So the temp_mm_state_t abstraction used by use_temporary_mm() and
unuse_temporary_mm() is super confusing:
- The whole machinery is about temporarily switching to the
text_poke_mm utility MM that got allocated during bootup
for text-patching purposes alone:
temp_mm_state_t prev;
/*
* Loading the temporary mm behaves as a compiler barrier, which
* guarantees that the PTE will be set at the time memcpy() is done.
*/
prev = use_temporary_mm(text_poke_mm);
- Yet the value that gets saved in the temp_mm_state_t variable
is not the temporary MM ... but the previous MM...
- Ie. we temporarily put the non-temporary MM into a variable
that has the temp_mm_state_t type. This makes no sense whatsoever.
- The confusion continues in unuse_temporary_mm():
static inline void unuse_temporary_mm(temp_mm_state_t prev_state)
Here we unuse an MM that is ... not the temporary MM, but the
previous MM. :-/
Fix up all this confusion by removing the unnecessary layer of
abstraction and using a bog-standard 'struct mm_struct *prev_mm'
variable to save the MM to.
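The resulting usage pattern then reads naturally (sketch):

	struct mm_struct *prev_mm;

	/* Switch to the text-poking MM, remembering the previously loaded one: */
	prev_mm = use_temporary_mm(text_poke_mm);

	/* ... patch kernel text via the temporary mapping ... */

	/* Switch back to the previous MM: */
	unuse_temporary_mm(prev_mm);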
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-14-mingo@kernel.org
The idtentry macro in entry_64.S hasn't had a create_gap
option for 5 years - update the comment.
(Also clean up the entire comment block while at it.)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-13-mingo@kernel.org
Put it into the text_poke_* namespace of <asm/text-patching.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-10-mingo@kernel.org
Put it into the text_poke_* namespace of <asm/text-patching.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-9-mingo@kernel.org
All related functions in this subsystem already have a
text_poke_int3_ prefix - add it to the trap handler
as well.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-8-mingo@kernel.org
Make it clear that these reference counts lock access
to text_poke_array.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-5-mingo@kernel.org
Follow the INT3 text-poking nomenclature, and also adopt the
'vector' name for the entire object, instead of the rather
opaque 'descriptor' naming.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-4-mingo@kernel.org
eBPF programs can be run 50,000,000 times per second on busy servers.
Whenever /proc/sys/kernel/bpf_stats_enabled is turned off,
hundreds of call sites are patched from text_poke_bp_batch()
and we see a huge loss of performance due to false sharing
on bp_desc.refs lasting up to three seconds.
51.30% server_bin [kernel.kallsyms] [k] poke_int3_handler
|
|--46.45%--poke_int3_handler
| exc_int3
| asm_exc_int3
| |
| |--24.26%--cls_bpf_classify
| | tcf_classify
| | __dev_queue_xmit
| | ip6_finish_output2
| | ip6_output
| | ip6_xmit
| | inet6_csk_xmit
| | __tcp_transmit_skb
Fix this by replacing bp_desc.refs with a per-cpu bp_refs.
Before the patch, on a host with 240 cores (480 threads):
$ sysctl -wq kernel.bpf_stats_enabled=0
text_poke_bp_batch(nr_entries=164) : Took 2655300 usec
$ bpftool prog | grep run_time_ns
...
105: sched_cls name hn_egress tag 699fc5eea64144e3 gpl run_time_ns
3009063719 run_cnt 82757845 : average cost is 36 nsec per call
After this patch:
$ sysctl -wq kernel.bpf_stats_enabled=0
text_poke_bp_batch(nr_entries=164) : Took 702 usec
$ bpftool prog | grep run_time_ns
...
105: sched_cls name hn_egress tag 699fc5eea64144e3 gpl run_time_ns
1928223019 run_cnt 67682728 : average cost is 28 nsec per call
Ie. text-patching performance improved 3700x: from 2.65 seconds
to 0.0007 seconds.
Since the atomic_cond_read_acquire(refs, !VAL) spin-loop was not triggered
even once in my tests, add an unlikely() annotation, as not having to spin
appears to be the common case.
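A rough sketch of the per-CPU scheme (simplified, the actual helpers differ
in detail):

	static DEFINE_PER_CPU(atomic_t, bp_refs);

	/* INT3 handler fast path: only touches this CPU's own cache line: */
	static __always_inline bool try_get_bp_refs(void)
	{
		return atomic_inc_not_zero(this_cpu_ptr(&bp_refs));
	}

	/* Patching side: wait for every CPU's counter to drain: */
	static void wait_for_bp_refs(void)
	{
		int cpu;

		for_each_possible_cpu(cpu)
			atomic_cond_read_acquire(per_cpu_ptr(&bp_refs, cpu), !VAL);
	}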
[ mingo: Improved the changelog some more. ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250411054105.2341982-2-mingo@kernel.org
Initializing a percpu variable with the address of a struct tagged as
.initdata is breaking the build with CONFIG_SECTION_MISMATCH_WARN_ONLY
not set to "y".
Fix that by using an access function instead, which returns the .initdata
struct address if the percpu space of the struct hasn't been allocated
yet.
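A sketch of the accessor pattern described above (struct and variable names
are illustrative):

	static struct mc_debug_data mc_debug_data_early __initdata;
	static DEFINE_PER_CPU(struct mc_debug_data *, mc_debug_data_ptr);

	static struct mc_debug_data *get_mc_debug(void)
	{
		struct mc_debug_data *mcdb = this_cpu_read(mc_debug_data_ptr);

		/* Fall back to the boot-time template until the percpu copy exists: */
		return mcdb ? mcdb : &mc_debug_data_early;
	}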
Fixes: 368990a7fe ("xen: fix multicall debug data referencing")
Reported-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: "Borislav Petkov (AMD)" <bp@alien8.de>
Tested-by: "Borislav Petkov (AMD)" <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20250327190602.26015-1-jgross@suse.com>
There's a lockdep false positive warning related to i8253_lock:
WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
...
systemd-sleep/3324 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
ffffffffb2c23398 (i8253_lock){+.+.}-{2:2}, at: pcspkr_event+0x3f/0xe0 [pcspkr]
...
... which became HARDIRQ-irq-unsafe at:
...
lock_acquire+0xd0/0x2f0
_raw_spin_lock+0x30/0x40
clockevent_i8253_disable+0x1c/0x60
pit_timer_init+0x25/0x50
hpet_time_init+0x46/0x50
x86_late_time_init+0x1b/0x40
start_kernel+0x962/0xa00
x86_64_start_reservations+0x24/0x30
x86_64_start_kernel+0xed/0xf0
common_startup_64+0x13e/0x141
...
Lockdep complains due to pit_timer_init() using the lock in an IRQ-unsafe
fashion, but it's a false positive, because there is no deadlock
possible at that point due to init ordering: at the point where
pit_timer_init() is called there is no other possible usage of
i8253_lock because the system is still in the very early boot stage
with no interrupts.
But in any case, pit_timer_init() should disable interrupts before
calling clockevent_i8253_disable() out of general principle, and to
keep lockdep working even in this scenario.
Use scoped_guard() for that, as suggested by Thomas Gleixner.
[ mingo: Cleaned up the changelog. ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/Z-uwd4Bnn7FcCShX@gmail.com
- Fix CPU topology related regression that limited
Xen PV guests to a single CPU
- Fix ancient e820__register_nosave_regions() bugs that
were causing problems with kexec's artificial memory
maps
- Fix an S4 hibernation crash caused by two missing ENDBR's that
were mistakenly removed in a recent commit
- Fix a resctrl serialization bug
- Fix early_printk documentation and comments
- Fix RSB bugs, combined with preparatory updates to better
match the code to vendor recommendations.
- Add RSB mitigation document
- Fix/update documentation
- Fix the erratum_1386_microcode[] table to be NULL terminated
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'x86-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix CPU topology related regression that limited Xen PV guests to a
single CPU
- Fix ancient e820__register_nosave_regions() bugs that were causing
problems with kexec's artificial memory maps
- Fix an S4 hibernation crash caused by two missing ENDBR's that were
mistakenly removed in a recent commit
- Fix a resctrl serialization bug
- Fix early_printk documentation and comments
- Fix RSB bugs, combined with preparatory updates to better match the
code to vendor recommendations.
- Add RSB mitigation document
- Fix/update documentation
- Fix the erratum_1386_microcode[] table to be NULL terminated
* tag 'x86-urgent-2025-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ibt: Fix hibernate
x86/cpu: Avoid running off the end of an AMD erratum table
Documentation/x86: Zap the subsection letters
Documentation/x86: Update the naming of CPU features for /proc/cpuinfo
x86/bugs: Add RSB mitigation document
x86/bugs: Don't fill RSB on context switch with eIBRS
x86/bugs: Don't fill RSB on VMEXIT with eIBRS+retpoline
x86/bugs: Fix RSB clearing in indirect_branch_prediction_barrier()
x86/bugs: Use SBPB in write_ibpb() if applicable
x86/bugs: Rename entry_ibpb() to write_ibpb()
x86/early_printk: Use 'mmio32' for consistency, fix comments
x86/resctrl: Fix rdtgroup_mkdir()'s unlocked use of kernfs_node::name
x86/e820: Fix handling of subpage regions when calculating nosave ranges in e820__register_nosave_regions()
x86/acpi: Don't limit CPUs to 1 for Xen PV guests due to disabled ACPI
An SNP platform can provide a vTPM device emulated by the SVSM.
The "tpm-svsm" device can be handled by the platform driver registered by the
x86/sev core code.
Register the platform device only when SVSM is available and it supports vTPM
commands as checked by snp_svsm_vtpm_probe().
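A sketch of the registration (simplified; error handling and the exact hook
point in the x86/sev code may differ):

	static int __init snp_register_svsm_vtpm(void)
	{
		/* Only register when SVSM is present and advertises vTPM commands: */
		if (!snp_svsm_vtpm_probe())
			return -ENODEV;

		return PTR_ERR_OR_ZERO(platform_device_register_simple("tpm-svsm", -1, NULL, 0));
	}
	device_initcall(snp_register_svsm_vtpm);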
[ bp: Massage commit message. ]
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20250410135118.133240-5-sgarzare@redhat.com
Add two new functions to probe and send commands to the SVSM vTPM. They
leverage the two calls defined by the AMD SVSM specification [1] for the vTPM
protocol: SVSM_VTPM_QUERY and SVSM_VTPM_CMD.
Expose snp_svsm_vtpm_send_command() to be used by a TPM driver.
[1] "Secure VM Service Module for SEV-SNP Guests"
Publication # 58019 Revision: 1.00
[ bp: Some doc touchups. ]
Co-developed-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Co-developed-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20250403100943.120738-2-sgarzare@redhat.com
Merge tag 'for-linus-6.15a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- A simple fix adding the module description of the Xenbus frontend
module
- A fix correcting the xen-acpi-processor Kconfig dependency for PVH
Dom0 support
- A fix for the Xen balloon driver when running as Xen Dom0 in PVH mode
- A fix for PVH Dom0 in order to avoid problems with CPU idle and
frequency drivers conflicting with Xen
* tag 'for-linus-6.15a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: disable CPU idle and frequency drivers for PVH dom0
x86/xen: fix balloon target initialization for PVH dom0
xen: Change xen-acpi-processor dom0 dependency
xenbus: add module description
Reduce the window during which exceptions are unhandled, by leaving the
GDT/IDT in place all the way into the relocate_kernel() function, until
the moment that %cr3 gets replaced.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250326142404.256980-4-dwmw2@infradead.org
This supports the same 32-bit MMIO-mapped 8250 as the early_printk code.
It's not clear why the early_printk code supports this form and only this
form; the actual runtime 8250_pci doesn't seem to support it. But having
hacked up QEMU to expose such a device, early_printk does work with it,
and now so does the kexec debug code.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250326142404.256980-3-dwmw2@infradead.org
If a serial port was configured for early_printk, use it for debug output
from the relocate_kernel exception handler too.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250326142404.256980-2-dwmw2@infradead.org
This is the customary type used for hardware ABIs.
Suggested-by: Xin Li <xin@zytor.com>
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
In <asm/msr.h> the first parameter of do_trace_rdpmc() is named 'msr':
extern void do_trace_rdpmc(unsigned int msr, u64 val, int failed);
But in the definition it's 'counter':
void do_trace_rdpmc(unsigned counter, u64 val, int failed)
Use 'msr' in both cases, and change the type to u32.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
The paravirt_read_pmc() result is in fact only loaded into a u64 variable.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Also fix some nearby whitespace damage while at it.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
There's 9 uses of 'unsigned long long' in <asm/msr.h>, which is
really the same as 'u64', which is used 34 times.
Standardize on u64.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
GCC PR82602 that caused invalid scheduling of volatile asms:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82602
was fixed for gcc-8.1.0, the current minimum version of the
compiler required to compile the kernel.
Remove the workaround that prevented invalid scheduling on compilers
affected by PR82602.
There were no differences between old and new kernel object file
when compiled for x86_64 defconfig with gcc-8.1.0.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250407112316.378347-1-ubizjak@gmail.com
There are 4 variants of initmem_init(): for 32 and 64 bits, and for
CONFIG_NUMA enabled and disabled.
After commit bbeb69ce30 ("x86/mm: Remove CONFIG_HIGHMEM64G support")
NUMA is not supported on 32-bit kernels anymore, so
arch/x86/mm/numa_32.c can simply be deleted, and setup_bootmem_allocator(),
with its completely misleading name, can be folded into initmem_init().
For 64 bits, the NUMA variant calls x86_numa_init() and the !NUMA variant
sets all memory to node 0. The latter can be split out into an inline
helper called x86_numa_init(), and then both initmem_init() functions
become the same.
Split out memblock_set_node() from initmem_init() for !NUMA on 64 bit
into an x86_numa_init() helper, and remove the arch/x86/mm/numa_*.c files
that only contained initmem_init() variants for NUMA configs.
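A sketch of the resulting !NUMA helper (body illustrative):

	#ifndef CONFIG_NUMA
	static inline void x86_numa_init(void)
	{
		/* Without NUMA, all memory belongs to node 0: */
		memblock_set_node(0, PHYS_ADDR_MAX, &memblock.memory, 0);
	}
	#endif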
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Len Brown <len.brown@intel.com>
Link: https://lore.kernel.org/r/20250409122815.420041-1-rppt@kernel.org
This works around what seems to be an optimization bug in GCC (at least
13.3.0), where it predicts access_ok() to fail despite the hint to the
contrary.
_copy_to_user() contains:
if (access_ok(to, n)) {
instrument_copy_to_user(to, from, n);
n = raw_copy_to_user(to, from, n);
}
Where access_ok() is likely(__access_ok(addr, size)), yet the compiler
emits conditional jumps forward for the case where it succeeds:
<+0>: endbr64
<+4>: mov %rdx,%rcx
<+7>: mov %rdx,%rax
<+10>: xor %edx,%edx
<+12>: add %rdi,%rcx
<+15>: setb %dl
<+18>: movabs $0x123456789abcdef,%r8
<+28>: test %rdx,%rdx
<+31>: jne 0xffffffff81b3b7c6 <_copy_to_user+38>
<+33>: cmp %rcx,%r8
<+36>: jae 0xffffffff81b3b7cb <_copy_to_user+43>
<+38>: jmp 0xffffffff822673e0 <__x86_return_thunk>
<+43>: nop
<+44>: nop
<+45>: nop
<+46>: mov %rax,%rcx
<+49>: rep movsb %ds:(%rsi),%es:(%rdi)
<+51>: nop
<+52>: nop
<+53>: nop
<+54>: mov %rcx,%rax
<+57>: nop
<+58>: nop
<+59>: nop
<+60>: jmp 0xffffffff822673e0 <__x86_return_thunk>
Patching _copy_to_user() to use likely() around the access_ok() check does
not change the asm.
However, spelling out the prediction *within* valid_user_address() does the
trick:
<+0>: endbr64
<+4>: xor %eax,%eax
<+6>: mov %rdx,%rcx
<+9>: add %rdi,%rdx
<+12>: setb %al
<+15>: movabs $0x123456789abcdef,%r8
<+25>: test %rax,%rax
<+28>: jne 0xffffffff81b315e6 <_copy_to_user+54>
<+30>: cmp %rdx,%r8
<+33>: jb 0xffffffff81b315e6 <_copy_to_user+54>
<+35>: nop
<+36>: nop
<+37>: nop
<+38>: rep movsb %ds:(%rsi),%es:(%rdi)
<+40>: nop
<+41>: nop
<+42>: nop
<+43>: nop
<+44>: nop
<+45>: nop
<+46>: mov %rcx,%rax
<+49>: jmp 0xffffffff82255ba0 <__x86_return_thunk>
<+54>: mov %rcx,%rax
<+57>: jmp 0xffffffff82255ba0 <__x86_return_thunk>
Since we kinda expect valid_user_address() to be likely anyway,
add the likely() annotation that also happens to work around
this compiler bug.
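For illustration, the hint spelled out inside the helper could look roughly
like this (the USER_PTR_MAX comparison is only an approximation of the
x86-64 implementation):

	#define valid_user_address(x) \
		likely((__force unsigned long)(x) <= runtime_const_ptr(USER_PTR_MAX))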
[ mingo: Moved the unlikely() branch into valid_user_address() & updated the changelog ]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250401203029.1132135-1-mjguzik@gmail.com
Todd reported, and Len confirmed, that commit 582077c940 ("x86/cfi:
Clean up linkage") broke S4 hiberate on a fair number of machines.
Turns out these machines trip #CP when trying to restore the image.
As it happens, the commit in question removes two ENDBR instructions
in the hibernate code, and clearly got it wrong.
Notably restore_image() does an indirect jump to
relocated_restore_code(), which is a relocated copy of
core_restore_code().
In turn, core_restore_code(), will at the end do an indirect jump to
restore_jump_address (r8), which is pointing at a relocated
restore_registers().
So both sites do indeed need to be ENDBR.
Fixes: 582077c940 ("x86/cfi: Clean up linkage")
Reported-by: Todd Brandt <todd.e.brandt@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Todd Brandt <todd.e.brandt@intel.com>
Tested-by: Len Brown <len.brown@intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219998
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219998
Complete the AMD CPUID(4) emulation logic, which uses CPUID(0x80000006)
for L2/L3 cache info and an assocs[] associativity mapping array, by
adding entries for 3-way caches and 6-way caches.
Properly handle the case where CPUID(0x80000006) returns an L2/L3
associativity of 9. This is not real associativity, but a marker to
indicate that the respective L2/L3 cache information should be retrieved
from CPUID(0x8000001d) instead. If such a marker is encountered, return
early from legacy_amd_cpuid4(), thus effectively emulating an "invalid
index" CPUID(4) response with a cache type of zero.
When checking if CPUID(0x80000006) L2/L3 cache info output is valid, and
given the associativity marker 9 above, do not just check if the whole
ECX/EDX register is zero. Rather, check if the associativity is zero or
9. An associativity of zero implies no L2/L3 cache, which makes it the
more correct check anyway vs. a zero check of the whole output register.
Fixes: a326e948c5 ("x86, cacheinfo: Fixup L3 cache information for AMD multi-node processors")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250409122233.1058601-3-darwi@linutronix.de
For the AMD CPUID(4) emulation cache info logic, the same associativity
mapping array, assocs[], is used for both CPUID(0x80000005) and
CPUID(0x80000006).
This is incorrect since, per the AMD manuals, the mappings for
CPUID(0x80000005) L1d/L1i associativity are:
	n = 0x1 -> 0xfe		n
	n = 0xff		fully associative
while assocs[] maps these values to:
	n = 0x1, 0x2, 0x4	n
	n = 0x3, 0x7, 0x9	0
	n = 0x6			8
	n = 0x8			16
	n = 0xa			32
	n = 0xb			48
	n = 0xc			64
	n = 0xd			96
	n = 0xe			128
	n = 0xf			fully associative
which is only valid for CPUID(0x80000006).
Parse CPUID(0x80000005) L1d/L1i associativity values as shown in the AMD
manuals. Since the 0xffff literal is used to denote full associativity
at the AMD CPUID(4)-emulation logic, define AMD_CPUID4_FULLY_ASSOCIATIVE
for it instead of spreading that literal in more places.
Mark the assocs[] mapping array as only valid for CPUID(0x80000006) L2/L3
cache information.
Fixes: a326e948c5 ("x86, cacheinfo: Fixup L3 cache information for AMD multi-node processors")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250409122233.1058601-2-darwi@linutronix.de
The NULL array terminator at the end of erratum_1386_microcode was
removed during the switch from x86_cpu_desc to x86_cpu_id. This
causes readers to run off the end of the array.
Replace the NULL.
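For reference, the shape of the fix (entries elided; only the sentinel
matters here):

	static const struct x86_cpu_id erratum_1386_microcode[] = {
		/* ... CPU / microcode revision match entries ... */
		{}	/* empty terminator: table walkers stop here */
	};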
Fixes: f3f3251526 ("x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
The chacha_use_simd static branch is required for x86 machines that
lack SSSE3 support. Restore it and the generic fallback code.
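A minimal sketch of the restored guard, assuming the existing
chacha_crypt_arch()/chacha_crypt_generic() library entry points; the exact
condition in the tree may differ:

	void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src,
			       unsigned int bytes, int nrounds)
	{
		if (!static_branch_likely(&chacha_use_simd)) {
			/* pre-SSSE3 CPUs: use the generic C implementation */
			chacha_crypt_generic(state, dst, src, bytes, nrounds);
			return;
		}
		/* ... SIMD path ... */
	}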
Reported-by: Eric Biggers <ebiggers@kernel.org>
Fixes: 9b4400215e ("crypto: x86/chacha - Remove SIMD fallback path")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This change alters the PowerPC and x86 driver implementations to record
the last sample period before the event is updated for the next period.
A common pattern in PMU driver implementations is to have a
"*_event_set_period" function which takes care of updating the various
period-related fields in a perf_event structure. In most cases, the
drivers choose to call this function after initializing a sample data
structure with perf_sample_data_init. The x86 and PowerPC drivers
deviate from this, choosing to update the period before initializing the
sample data. When using an event with an alternate sample period, this
causes an incorrect period to be written to the sample data that gets
reported to userspace.
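A hedged sketch of the corrected ordering inside a driver's per-event
overflow loop (the surrounding call site is illustrative):

	/* capture the period that actually produced this sample ... */
	perf_sample_data_init(&data, 0, event->hw.last_period);

	/* ... before programming the period for the next sample */
	if (!x86_perf_event_set_period(event))
		continue;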
Signed-off-by: Mark Barnett <mark.barnett@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250408171530.140858-2-mark.barnett@arm.com
User->user Spectre v2 attacks (including RSB) across context switches
are already mitigated by IBPB in cond_mitigation(), if enabled globally
or if either the prev or the next task has opted in to protection. RSB
filling without IBPB serves no purpose for protecting user space, as
indirect branches are still vulnerable.
User->kernel RSB attacks are mitigated by eIBRS, in which case the RSB
filling on context switch isn't needed, so remove it.
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Amit Shah <amit.shah@amd.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/98cdefe42180358efebf78e3b80752850c7a3e1b.1744148254.git.jpoimboe@kernel.org
eIBRS protects against guest->host RSB underflow/poisoning attacks.
Adding retpoline to the mix doesn't change that. Retpoline has a
balanced CALL/RET anyway.
So the current full RSB filling on VMEXIT with eIBRS+retpoline is
overkill. Disable it or do the VMEXIT_LITE mitigation if needed.
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Amit Shah <amit.shah@amd.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Link: https://lore.kernel.org/r/84a1226e5c9e2698eae1b5ade861f1b8bf3677dc.1744148254.git.jpoimboe@kernel.org
IBPB is expected to clear the RSB. However, if X86_BUG_IBPB_NO_RET is
set, that doesn't happen. Make indirect_branch_prediction_barrier()
take that into account by calling write_ibpb() which clears RSB on
X86_BUG_IBPB_NO_RET:
/* Make sure IBPB clears return stack predictions too. */
FILL_RETURN_BUFFER %rax, RSB_CLEAR_LOOPS, X86_BUG_IBPB_NO_RET
Note that, as of the previous patch, write_ibpb() also reads
'x86_pred_cmd' in order to use SBPB when applicable:
movl _ASM_RIP(x86_pred_cmd), %eax
Therefore that existing behavior in indirect_branch_prediction_barrier()
is not lost.
Fixes: 50e4b3b940 ("x86/entry: Have entry_ibpb() invalidate return predictions")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/bba68888c511743d4cd65564d1fc41438907523f.1744148254.git.jpoimboe@kernel.org
write_ibpb() does IBPB, which (among other things) flushes branch type
predictions on AMD. If the CPU has SRSO_NO, or if the SRSO mitigation
has been disabled, branch type flushing isn't needed, in which case the
lighter-weight SBPB can be used.
The 'x86_pred_cmd' variable already keeps track of whether IBPB or SBPB
should be used. Use that instead of hardcoding IBPB.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/17c5dcd14b29199b75199d67ff7758de9d9a4928.1744148254.git.jpoimboe@kernel.org
First of all, using 'mmio' prevents proper implementation of 8-bit accessors.
Second, it's simply inconsistent with the uart8250 set of options. Rename it to
'mmio32'. While at it, remove the rather misleading comment in the
documentation. From now on mmio32 is self-explanatory and pciserial supports
not only 32-bit MMIO accessors.
Also, while at it, fix the comment for the "pciserial" case. The comment
seems to be a copy'n'paste error, mentioning "serial" instead of
"pciserial" (with double quotes). Fix this.
With that, move it up, so we don't calculate 'buf' twice.
Fixes: 3181424aea ("x86/early_printk: Add support for MMIO-based UARTs")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Denis Mukhin <dmukhin@ford.com>
Link: https://lore.kernel.org/r/20250407172214.792745-1-andriy.shevchenko@linux.intel.com
Rename struct bts_buffer objects from 'buf' to 'bb' to improve the
readability when accessing the structure's 'buf' member. For example,
'buf->buf[]' becomes 'bb->buf[]'.
Indent line 327 using tabs to silence a checkpatch warning.
No functional changes intended.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250407085253.742834-2-thorsten.blum@linux.dev
The library code built under arch/x86/boot/startup is not intended to be
linked into vmlinux but only into the decompressor and/or the EFI stub.
This means objtool validation is not needed here, and may result in
false positive errors for things like missing retpolines.
So disable it for all objects added to lib-y.
Tested-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250408085254.836788-10-ardb+git@google.com
Since
741c10b096 ("kernfs: Use RCU to access kernfs_node::name.")
a helper rdt_kn_name() that checks that rdtgroup_mutex is held has been used
for all accesses to the kernfs node name.
rdtgroup_mkdir() uses the name to determine if a valid monitor group is being
created by checking the parent name is "mon_groups". This is done without
holding rdtgroup_mutex, and now triggers the following warning:
| WARNING: suspicious RCU usage
| 6.15.0-rc1 #4465 Tainted: G E
| -----------------------------
| arch/x86/kernel/cpu/resctrl/internal.h:408 suspicious rcu_dereference_check() usage!
[...]
| Call Trace:
| <TASK>
| dump_stack_lvl
| lockdep_rcu_suspicious.cold
| is_mon_groups
| rdtgroup_mkdir
| kernfs_iop_mkdir
| vfs_mkdir
| do_mkdirat
| __x64_sys_mkdir
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
Creating a control or monitor group calls mkdir_rdt_prepare(), which uses
rdtgroup_kn_lock_live() to take the rdtgroup_mutex.
To avoid taking and dropping the lock, move the check for the monitor group
name and position into mkdir_rdt_prepare() so that it occurs under
rdtgroup_mutex. Hoist is_mon_groups() earlier in the file.
[ bp: Massage. ]
Fixes: 741c10b096 ("kernfs: Use RCU to access kernfs_node::name.")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250407124637.2433230-1-james.morse@arm.com
* Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
stage-1 page tables) to align with the architecture. This avoids
possibly taking an SEA at EL2 on the page table walk or using an
architecturally UNKNOWN fault IPA.
* Use acquire/release semantics in the KVM FF-A proxy to avoid reading
a stale value for the FF-A version.
* Fix KVM guest driver to match PV CPUID hypercall ABI.
* Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
selftests, which is the only memory type for which atomic
instructions are architecturally guaranteed to work.
s390:
* Don't use %pK for debug printing and tracepoints.
x86:
* Use a separate subclass when acquiring KVM's per-CPU posted interrupts
wakeup lock in the scheduled out path, i.e. when adding a vCPU on
the list of vCPUs to wake, to work around a false positive deadlock.
The schedule out code runs with a scheduler lock that the wakeup
handler takes in the opposite order; but it does so with IRQs disabled
and cannot run concurrently with a wakeup.
* Explicitly zero-initialize on-stack CPUID unions
* Allow building irqbypass.ko as a module when kvm.ko is a module
* Wrap relatively expensive sanity check with KVM_PROVE_MMU
* Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
selftests:
* Add more scenarios to the MONITOR/MWAIT test.
* Add option to rseq test to override /dev/cpu_dma_latency
* Bring list of exit reasons up to date
* Cleanup Makefile to list once tests that are valid on all architectures
Other:
* Documentation fixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmf083IUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroN1dgf/QwfpZcHoMNQSnrc1jMy2LHrArln2
XfmsOGZTU7kyoLQsLWGAPNocOveGdiemTDsj5ZXoNMnqV8hCBr+tZuv2gWI1rr/o
kiGerdIgSZ9piTjBlJkVAaOzbWhg2DUnr7qVVzEzFY9+rPNyQ81vgAfU7h56KhYB
optecozmBrHHAxvQZwmPeL9UyPWFjOF1BY/8LTMx7X+aVuCX6qx1JqO3a3ylAw4J
tGXv6qFJfuCnu1d1b4X0ILce0iMUTOjQzvTcIm+BKjYycecl+3j1aczC/BOorIgc
mf0+XeauhcTduK73pirnvx2b05eOxntgkOpwJytO2RP6pE0uK+2Th/C3Qg==
=ba/Y
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
stage-1 page tables) to align with the architecture. This avoids
possibly taking an SEA at EL2 on the page table walk or using an
architecturally UNKNOWN fault IPA
- Use acquire/release semantics in the KVM FF-A proxy to avoid
reading a stale value for the FF-A version
- Fix KVM guest driver to match PV CPUID hypercall ABI
- Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
selftests, which is the only memory type for which atomic
instructions are architecturally guaranteed to work
s390:
- Don't use %pK for debug printing and tracepoints
x86:
- Use a separate subclass when acquiring KVM's per-CPU posted
interrupts wakeup lock in the scheduled out path, i.e. when adding
a vCPU on the list of vCPUs to wake, to work around a false positive
deadlock. The schedule out code runs with a scheduler lock that the
wakeup handler takes in the opposite order; but it does so with
IRQs disabled and cannot run concurrently with a wakeup
- Explicitly zero-initialize on-stack CPUID unions
- Allow building irqbypass.ko as a module when kvm.ko is a module
- Wrap relatively expensive sanity check with KVM_PROVE_MMU
- Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
selftests:
- Add more scenarios to the MONITOR/MWAIT test
- Add option to rseq test to override /dev/cpu_dma_latency
- Bring list of exit reasons up to date
- Cleanup Makefile to list once tests that are valid on all
architectures
Other:
- Documentation fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
KVM: arm64: Use acquire/release to communicate FF-A version negotiation
KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable
KVM: arm64: selftests: Introduce and use hardware-definition macros
KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
KVM: x86: Explicitly zero-initialize on-stack CPUID unions
KVM: Allow building irqbypass.ko as a module when kvm.ko is a module
KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency
KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
Documentation: kvm: remove KVM_CAP_MIPS_TE
Documentation: kvm: organize capabilities in the right section
Documentation: kvm: fix some definition lists
Documentation: kvm: drop "Capability" heading from capabilities
Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE
Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE
selftests: kvm: list once tests that are valid on all architectures
selftests: kvm: bring list of exit reasons up to date
selftests: kvm: revamp MONITOR/MWAIT tests
KVM: arm64: Don't translate FAR if invalid/unsafe
...
ANNOTATE_IGNORE_ALTERNATIVE adds additional noise to the code generated
by CLAC/STAC alternatives, hurting readability for those who read
uaccess-related code generation on a regular basis.
Remove the annotation specifically for the "NOP patched with CLAC/STAC"
case in favor of a manual check.
Leave the other uses of that annotation in place as they're less common
and more difficult to detect.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/fc972ba4995d826fcfb8d02733a14be8d670900b.1744098446.git.jpoimboe@kernel.org
The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples with a relative rate that exceeds some
threshold are more useful. However, the traditional sampling takes
samples of events separately. To get the relative rates among two or
more events, a high sample rate is required, which can bring high
overhead. Many samples taken in the non-hotspot area are also dropped
(useless) in the post-process.
The auto counter reload (ACR) feature takes samples when the relative
rate of two or more events exceeds some threshold, which provides the
fine-grained information at a low cost.
To support the feature, two sets of MSRs are introduced. For a given
counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the
IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s)
can cause a reload of that counter. The reload value is stored in the
IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C.
The details can be found at Intel SDM (085), Volume 3, 21.9.11 Auto
Counter Reload.
In hw_config(), an ACR event is specially configured, because the
cause/reloadable counter mask has to be applied to the dyn_constraint.
Besides the HW limits (e.g., no support for perf metrics, PDist, etc.), a
SW limit is applied as well: ACR events in a group must be contiguous.
This facilitates the later conversion from the event idx to the counter
idx. Otherwise, intel_pmu_acr_late_setup() would have to traverse the whole
event list again to find the "cause" event.
Also, add a new flag PERF_X86_EVENT_ACR to indicate an ACR group, which
is set on the group leader.
The late setup() is also required for an ACR group. It converts the
event idx to the counter idx and saves it in hw.config1.
The ACR configuration MSRs are only updated in the enable_event().
The disable_event() doesn't clear the ACR CFG register.
Add acr_cfg_b/acr_cfg_c in the struct cpu_hw_events to cache the MSR
values. This avoids an MSR write if the value has not changed.
Expose an acr_mask to the sysfs. The perf tool can utilize the new
format to configure the relation of events in the group. The bit
sequence of the acr_mask follows the events enabled order of the group.
Example:
Here is a snippet of mispredict.c. Since the array holds random
numbers, jumps are random and often mispredicted.
The mispredicted rate depends on the compared value.
For Loop 1, ~11% of all branches are mispredicted.
For Loop 2, ~21% of all branches are mispredicted.
main()
{
	...
	for (i = 0; i < N; i++)
		data[i] = rand() % 256;
	...
	/* Loop 1 */
	for (k = 0; k < 50; k++)
		for (i = 0; i < N; i++)
			if (data[i] >= 64)
				sum += data[i];
	...
	...
	/* Loop 2 */
	for (k = 0; k < 50; k++)
		for (i = 0; i < N; i++)
			if (data[i] >= 128)
				sum += data[i];
	...
}
Usually, code with a high branch miss rate means bad performance.
To understand the branch miss rate of the codes, the traditional method
usually samples both branches and branch-misses events. E.g.,
perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
-c 1000000 -- ./mispredict
[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
The 5106 samples are from both events and spread in both Loops.
In the post-process stage, a user can see that Loop 2 has a 21%
branch miss rate. Then they can focus on the samples of branch-misses
events for Loop 2.
With this patch, the user can generate the samples only when the branch
miss rate > 20%. For example,
perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
-- ./mispredict
(Two different periods are applied to branch-misses and
branch-instructions. The ratio is set to 20%.
If the branch-instructions is overflowed first, the branch-miss
rate < 20%. No samples should be generated. All counters should be
automatically reloaded.
If the branch-misses is overflowed first, the branch-miss rate > 20%.
A sample triggered by the branch-misses event should be
generated. Just the counter of the branch-instructions should be
automatically reloaded.
The branch-misses event should only be automatically reloaded when
the branch-instructions is overflowed. So the "cause" event is the
branch-instructions event. The acr_mask is set to 0x2, since the
event index in the group of branch-instructions is 1.
The branch-instructions event is automatically reloaded no matter which
events are overflowed. So the "cause" events are the branch-misses
and the branch-instructions event. The acr_mask should be set to 0x3.)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]
$perf report
Percent │154: movl $0x0,-0x14(%rbp)
│ ↓ jmp 1af
│ for (i = j; i < N; i++)
│15d: mov -0x10(%rbp),%eax
│ mov %eax,-0x18(%rbp)
│ ↓ jmp 1a2
│ if (data[i] >= 128)
│165: mov -0x18(%rbp),%eax
│ cltq
│ lea 0x0(,%rax,4),%rdx
│ mov -0x8(%rbp),%rax
│ add %rdx,%rax
│ mov (%rax),%eax
│ ┌──cmp $0x7f,%eax
100.00 0.00 │ ├──jle 19e
│ │sum += data[i];
The 2498 samples are all from the branch-misses event for Loop 2.
The number of samples and the overhead are significantly reduced without
losing any information.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lkml.kernel.org/r/20250327195217.2683619-6-kan.liang@linux.intel.com
The counters that support the auto counter reload feature can be
enumerated in the CPUID Leaf 0x23 sub-leaf 0x2.
Add acr_cntr_mask to store the mask of counters which are reloadable.
Add acr_cause_mask to store the mask of counters which can cause reload.
Since the e-core and p-core may have different numbers of counters,
track the masks in the struct x86_hybrid_pmu as well.
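A sketch of the enumeration, assuming the two masks are reported in the low
output registers of CPUID leaf 0x23, sub-leaf 0x2 (the exact register
assignment is defined by the SDM, not by this sketch):

	unsigned int eax, ebx, ecx, edx;

	cpuid_count(0x23, 0x2, &eax, &ebx, &ecx, &edx);
	pmu->acr_cntr_mask  = eax;	/* counters that can be auto-reloaded (assumed) */
	pmu->acr_cause_mask = ebx;	/* counters that can cause a reload (assumed) */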
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lkml.kernel.org/r/20250327195217.2683619-5-kan.liang@linux.intel.com
The auto counter reload feature requires an event flag to indicate an
auto counter reload group, which can only be scheduled on specific
counters that are enumerated in CPUID. However, hw_perf_event.flags has
run out of bits on X86.
Two solutions were considered to address the issue.
- Currently, 20 bits are reserved for the architecture-specific flags.
Only bit 31 is used for the generic flag. There is still plenty
of space left. Reserve 8 more bits for the arch-specific flags.
- Add a new X86 specific hw_perf_event.flags1 to support more flags.
The former is implemented. Enough room is still left in the global
generic flag.
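Purely for illustration, what the widened arch-specific flag window allows
(the bit value below is hypothetical, not the one chosen by the patch):

	/*
	 * Arch-specific flags occupy the low bits of hw_perf_event.flags;
	 * the generic PERF_EVENT_FLAG_* space keeps bit 31. Widening the
	 * arch window from 20 to 28 bits leaves room for new x86 flags
	 * such as an ACR group marker.
	 */
	#define PERF_X86_EVENT_ACR	0x00100000 /* hypothetical bit 20 */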
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lkml.kernel.org/r/20250327195217.2683619-4-kan.liang@linux.intel.com
When a machine supports PEBS v6, perf unconditionally searches the
cpuc->event_list[] for every event and checks if the late setup is
required, which is unnecessary.
The late setup is only required for special events, e.g., events that
support the counter snapshotting feature. Add n_late_setup to track the
number of events that need the late setup.
Other features, e.g., the auto counter reload feature, require the late
setup as well. Add a wrapper, intel_pmu_pebs_late_setup(), for the events
that support the counter snapshotting feature.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lkml.kernel.org/r/20250327195217.2683619-3-kan.liang@linux.intel.com
More and more features require a dynamic event constraint, e.g., branch
counter logging, auto counter reload, Arch PEBS, etc.
Add a generic flag, PMU_FL_DYN_CONSTRAINT, to indicate the case. It
avoids having to keep adding individual flags in intel_cpuc_prepare().
Add a variable dyn_constraint in the struct hw_perf_event to track the
dynamic constraint of the event. Apply it if it's updated.
Apply the generic dynamic constraint for branch counter logging.
Many features on and after V6 require a dynamic constraint, so
unconditionally set the flag for V6+.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lkml.kernel.org/r/20250327195217.2683619-2-kan.liang@linux.intel.com
When running as a PVH dom0 the ACPI tables exposed to Linux are (mostly)
the native ones, thus exposing the C and P states, which can lead to
attachment of CPU idle and frequency drivers. However, the entity in
control of the CPU C and P states is Xen, as dom0 doesn't have a full view
of the system load, nor does it have all CPUs assigned and identity pinned.
Like it's done for classic PV guests, prevent Linux from using idle or
frequency state drivers when running as a PVH dom0.
On an AMD EPYC 7543P system without this fix a Linux PVH dom0 will keep the
host CPUs spinning at 100% even when dom0 is completely idle, as it's
attempting to use the acpi_idle driver.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20250407101842.67228-1-roger.pau@citrix.com>
Move platform initialization of SEV/SNP from CCP driver probe time to
KVM module load time so that KVM can do SEV/SNP platform initialization
explicitly if it actually wants to use SEV/SNP functionality.
Add support for KVM to explicitly call into the CCP driver at load time
to initialize SEV/SNP. If required, this behavior can be altered with KVM
module parameters to not do SEV/SNP platform initialization at module load
time. Additionally, a corresponding SEV/SNP platform shutdown is invoked
during KVM module unload time.
Continue to support SEV deferred initialization as the user may have the
file containing SEV persistent data for SEV INIT_EX available only later
after module load/init.
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Objtool uses an arbitrary rule for INSN_SYSCALL and INSN_SYSRET that
almost works by accident: if it's in a function, control flow continues
after the instruction, otherwise it terminates.
That behavior should instead be based on the semantics of the underlying
instruction. Change INSN_SYSCALL to always preserve control flow and
INSN_SYSRET to always terminate it.
The changed semantic for INSN_SYSCALL requires a tweak to the
!CONFIG_IA32_EMULATION version of xen_entry_SYSCALL_compat(). In Xen,
SYSCALL is a hypercall which usually returns. But in this case it's a
hypercall to IRET which doesn't return. Add UD2 to tell objtool to
terminate control flow, and to prevent undefined behavior at runtime.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com> # for the Xen part
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/19453dfe9a0431b7f016e9dc16d031cad3812a50.1744095216.git.jpoimboe@kernel.org
While debugging kexec/hibernation hangs and crashes, it turned out that
the current implementation of e820__register_nosave_regions() suffers from
multiple serious issues:
- The end of last region is tracked by PFN, causing it to find holes
that aren't there if two consecutive subpage regions are present
- The nosave PFN ranges derived from holes are rounded out (instead of
rounded in) which makes it inconsistent with how explicitly reserved
regions are handled
Fix this by:
- Treating reserved regions as if they were holes, to ensure consistent
handling (rounding out nosave PFN ranges is more correct as the
kernel does not use partial pages)
- Tracking the end of the last RAM region by address instead of pages
to detect holes more precisely
These bugs appear to have been introduced ~18 years ago with the very
first version of e820_mark_nosave_regions(), and its flawed assumptions were
carried forward uninterrupted through various waves of rewrites and renames.
[ mingo: Added Git archeology details, for kicks and giggles. ]
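A hedged sketch of the repaired loop (simplified; names follow the existing
e820 helpers, and the handling of the trailing region after the loop is
omitted):

	u64 last_addr = 0;
	unsigned int i;

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		/* treat anything that is not usable RAM as a hole */
		if (entry->type != E820_TYPE_RAM)
			continue;

		/* round the nosave range out: whole pages covering the hole */
		if (last_addr < entry->addr)
			register_nosave_region(PFN_DOWN(last_addr),
					       PFN_UP(entry->addr));

		/* track the end of the last RAM region by address, not PFN */
		last_addr = entry->addr + entry->size;
	}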
Fixes: e8eff5ac29 ("[PATCH] Make swsusp avoid memory holes and reserved memory regions on x86_64")
Reported-by: Roberto Ricci <io@r-ricci.it>
Tested-by: Roberto Ricci <io@r-ricci.it>
Signed-off-by: Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Len Brown <len.brown@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250406-fix-e820-nosave-v3-1-f3787bc1ee1d@qtmlabs.xyz
Closes: https://lore.kernel.org/all/Z4WFjBVHpndct7br@desktop0a/
Xen disables ACPI for PV guests in DomU, which causes acpi_mps_check() to
return 1 when CONFIG_X86_MPPARSE is not set. As a result, the local APIC is
disabled and the guest is later limited to a single vCPU, despite being
configured with more.
This regression was introduced in version 6.9 in commit 7c0edad364
("x86/cpu/topology: Rework possible CPU management"), which added an
early check that limits CPUs to 1 if apic_is_disabled.
Update the acpi_mps_check() logic to return 0 early when running as a Xen
PV guest in DomU, preventing APIC from being disabled in this specific case
and restoring correct multi-vCPU behaviour.
Fixes: 7c0edad364 ("x86/cpu/topology: Rework possible CPU management")
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250407132445.6732-2-arkamar@atlas.cz
If microcode did not get loaded there is no reason to keep it in the cache.
Moreover, if loading failed it will not be possible to load an earlier version
of microcode, since the failed revision will always be selected from the cache
on the next reload attempt.
Since the failed revision is not easily available at this point, just clean the
whole cache. It will be rebuilt later if needed.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250327230503.1850368-3-boris.ostrovsky@oracle.com
This large commit contains the initial support for TDX in KVM. All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).
The series is in turn split into multiple sub-series, each with a separate
merge commit:
- Initialization: basic setup for using the TDX module from KVM, plus
ioctls to create TDX VMs and vCPUs.
- MMU: in TDX, private and shared halves of the address space are mapped by
different EPT roots, and the private half is managed by the TDX module.
Using the support that was added to the generic MMU code in 6.14,
add support for TDX's secure page tables to the Intel side of KVM.
Generic KVM code takes care of maintaining a mirror of the secure page
tables so that they can be queried efficiently, and ensuring that changes
are applied to both the mirror and the secure EPT.
- vCPU enter/exit: implement the callbacks that handle the entry of a TDX
vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore
of host state.
- Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to
userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits"
but are triggered through a different mechanism, similar to VMGEXIT for
SEV-ES and SEV-SNP.
- Interrupt handling: support for virtual interrupt injection as well as
handling VM-Exits that are caused by vectored events. Exclusive to
TDX are machine-check SMIs, which the kernel already knows how to
handle through the kernel machine check handler (commit 7911f145de,
"x86/mce: Implement recovery for errors in TDX/SEAM non-root mode")
- Loose ends: handling of the remaining exits from the TDX module, including
EPT violation/misconfig and several TDVMCALL leaves that are handled in
the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning
an error or ignoring operations that are not supported by TDX guests
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PVH dom0 re-uses logic from PV dom0, in which RAM ranges not assigned to
dom0 are re-used as scratch memory to map foreign and grant pages. Such
logic relies on reporting those unpopulated ranges as RAM to Linux, and
marking them as reserved. This way Linux creates the underlying page
structures required for metadata management.
Such an approach works fine on PV because the initial balloon target is
calculated using Xen-specific data that doesn't take into account the
memory type changes described above. However, on HVM and PVH the initial
balloon target is calculated using get_num_physpages(), and that function
does take into account the unpopulated RAM regions used as scratch space
for remote domain mappings.
This leads to PVH dom0 having an incorrect initial balloon target, which
causes malfunction (excessive memory freeing) of the balloon driver if the
dom0 memory target is later adjusted from the toolstack.
Fix this by using xen_released_pages to account for any pages that are part
of the memory map, but are already unpopulated when the balloon driver is
initialized. This accounts for any regions used for scratch remote
mappings. Note that on x86 the xen_released_pages definition is moved to
enlighten.c so it's uniformly available for all Xen-enabled builds.
Take the opportunity to unify PV with PVH/HVM guests regarding the usage of
get_num_physpages(), as that avoids having to add different logic for PV vs
PVH in both balloon_add_regions() and arch_xen_unpopulated_init().
Much like a6aa4eb994, the code in this changeset should have been part of
38620fc4e8.
Fixes: a6aa4eb994 ('xen/x86: add extra pages to unpopulated-alloc if available')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20250407082838.65495-1-roger.pau@citrix.com>
Since crypto/chacha.c now registers chacha20-$(ARCH), xchacha20-$(ARCH),
and xchacha12-$(ARCH) skcipher algorithms that use the architecture's
ChaCha and HChaCha library functions, individual architectures no longer
need to do the same. Therefore, remove the redundant skcipher
algorithms and leave just the library functions.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Following the example of the crc32 and crc32c code, make the crypto
subsystem register both generic and architecture-optimized chacha20,
xchacha20, and xchacha12 skcipher algorithms, all implemented on top of
the appropriate library functions. This eliminates the need for every
architecture to implement the same skcipher glue code.
To register the architecture-optimized skciphers only when
architecture-optimized code is actually being used, add a function
chacha_is_arch_optimized() and make each arch implement it. Change each
architecture's ChaCha module_init function to arch_initcall so that the
CPU feature detection is guaranteed to run before
chacha_is_arch_optimized() gets called by crypto/chacha.c. In the case
of s390, remove the CPU feature based module autoloading, which is no
longer needed since the module just gets pulled in via function linkage.
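On x86, for example, chacha_is_arch_optimized() could simply report the
existing SIMD static key (a sketch under that assumption):

	bool chacha_is_arch_optimized(void)
	{
		return static_key_enabled(&chacha_use_simd);
	}
	EXPORT_SYMBOL(chacha_is_arch_optimized);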
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Optimize the AVX-512 version of _compute_first_set_of_tweaks by using
vectorized shifts to compute the first vector of tweak blocks, and by
using byte-aligned shifts when multiplying by x^8.
AES-XTS performance on AMD Ryzen 9 9950X (Zen 5) improves by about 2%
for 4096-byte messages or 6% for 512-byte messages. AES-XTS performance
on Intel Sapphire Rapids improves by about 1% for 4096-byte messages or
3% for 512-byte messages. Code size decreases by 75 bytes which
outweighs the increase in rodata size of 16 bytes.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Current minimum required version of binutils is 2.25,
which supports AVX-512 instruction mnemonics.
Remove check for assembler support of AVX-512 instructions
and all relevant macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Current minimum required version of binutils is 2.25,
which supports SHA-256 instruction mnemonics.
Remove check for assembler support of SHA-256 instructions
and all relevant macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Current minimum required version of binutils is 2.25,
which supports SHA-1 instruction mnemonics.
Remove check for assembler support of SHA-1 instructions
and all relevant macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Stop wrapping skcipher and aead algorithms with the crypto SIMD helper
(crypto/simd.c). The only purpose of doing so was to work around x86
not always supporting kernel-mode FPU in softirqs. Specifically, if a
hardirq interrupted a task context kernel-mode FPU section and then
softirqs were run at the end of that hardirq, those softirqs could not
use kernel-mode FPU. This has now been fixed. In combination with the
fact that the skcipher and aead APIs only support task and softirq
contexts, these can now just use kernel-mode FPU unconditionally on x86.
This simplifies the code and improves performance.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Intel made a late change to the AVX10 specification that removes support
for a 256-bit maximum vector length and enumeration of the maximum
vector length. AVX10 will imply a maximum vector length of 512 bits.
I.e. there won't be any such thing as AVX10/256 or AVX10/512; there will
just be AVX10, and it will essentially just consolidate AVX512 features.
As a result of this new development, my strategy of providing both
*_avx10_256 and *_avx10_512 functions didn't turn out to be that useful.
The only remaining motivation for the 256-bit AVX512 / AVX10 functions
is to avoid downclocking on older Intel CPUs. But in the case of
AES-XTS and AES-CTR, I already wrote *_avx2 code too (primarily to
support CPUs without AVX512), which performs almost as well as
*_avx10_256. So we should just use that.
Therefore, remove the *_avx10_256 AES-XTS and AES-CTR functions and
algorithms, and rename the *_avx10_512 AES-XTS and AES-CTR functions and
algorithms to *_avx512. Make Ice Lake and Tiger Lake use *_avx2 instead
of *_avx10_256 which they previously used.
I've left AES-GCM unchanged for now. There is no VAES+AVX2 optimized
AES-GCM in the kernel yet, so the path forward for that is not as clear.
However, I did write a VAES+AVX2 optimized AES-GCM for BoringSSL. So
one option is to port that to the kernel and then do the same cleanup.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Linus expressed a strong preference for arch-specific asm code (i.e.,
virtually all of it) to reside under arch/ rather than anywhere else.
So move the EFI mixed mode startup code back, and put it under
arch/x86/boot/startup/ where all shared x86 startup code is going to
live.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20250401133416.1436741-11-ardb+git@google.com
The 5-level paging trampoline is used by both the EFI stub and the
traditional decompressor. Move it out of the decompressor sources into
the newly minted arch/x86/boot/startup/ sub-directory which will hold
startup code that may be shared between the decompressor, the EFI stub
and the kernel proper, and needs to tolerate being called during early
boot, before the kernel virtual mapping has been created.
This will allow the 5-level paging trampoline to be used by EFI boot
images such as zboot that omit the traditional decompressor entirely.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250401133416.1436741-10-ardb+git@google.com
Merge the local include "pgtable.h" -which declares the API of the
5-level paging trampoline- into <asm/boot.h> so that its implementation
in la57toggle.S as well as the calling code can be decoupled from the
traditional decompressor.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250401133416.1436741-9-ardb+git@google.com
LOAD_PHYSICAL_ADDR is calculated as CONFIG_PHYSICAL_START aligned up
to the respective alignment value CONFIG_PHYSICAL_ALIGN. However,
the calculation is open-coded while we have the __ALIGN_KERNEL_MASK()
macro that does the same. There is nothing special about this macro,
which is why it may be used in assembler code or linker scripts
(unlike __ALIGN_KERNEL(), which may not). Switch to it.
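In other words, a sketch of the before/after, assuming the usual
__ALIGN_KERNEL_MASK() definition of ((x) + (mask)) & ~(mask):

	/* before: open-coded align-up */
	#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
				+ (CONFIG_PHYSICAL_ALIGN - 1)) \
				& ~(CONFIG_PHYSICAL_ALIGN - 1))

	/* after: same value, expressed via the existing macro */
	#define LOAD_PHYSICAL_ADDR \
		__ALIGN_KERNEL_MASK(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN - 1)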
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250404165303.3657139-1-andriy.shevchenko@linux.intel.com
Add some missing dependencies to the CPUID dependency table:
- All the AMX features depend on AMX_TILE
- All the SPEC_CTRL features depend on SPEC_CTRL
[ mingo: Keep the AMX part of the table grouped ... ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Ahmed S. Darwish <darwi@linutronix.de>
Link: https://lore.kernel.org/r/20240924170128.2611854-1-ak@linux.intel.com
1) Convert all del_timer[_sync]() instances over to the new
timer_delete[_sync]() API and remove the legacy wrappers.
Conversion was done with coccinelle plus some manual fixups as
coccinelle chokes on scoped_guard().
2) The final cleanup of the hrtimer_init() to hrtimer_setup() conversion.
This has been delayed to the end of the merge window, so that all
patches which have been merged through other trees are in mainline and
all new users are caught.
Doing this right before rc1 ensures that new code which is merged post rc1
is not introducing new instances of the original functionality.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmfyXi0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYzlD/4ykDZbUzgTreYOxEQpBJ9elPwBhxfL
1v8OwDjRWlNrmLup8RiUfKrlbmztGl1J/u9ld0qhjcqkywCCBC1N5S+DhCjYetyP
MPWLbi2Dc35cFA+M7i8fMgxI2K9MLz2Zj1UKxz1MdsSuNHm07N3mul/3T11Ye4Rz
nPlzeQBTBDFCKTEGKjr8zjuoD15Wl48sObM0AjV35BPuQR1jfY4CE6VXo2h78+0c
jYwpJpDmcd+o1bDrfFhWUME2DzABEkHhn4wNSETnM4E5RXZRMUbi4UiigzInibQr
JOUTKwPJXTMX/Erd0XyXErrYf2qy1X9BQy6NlyDDOv+8kLEVRsC9Efplx9uoEtfi
QvVT/UmgmhZFJBfIT3/B8OvasrfwOropaYoG4L0zbDpp1b09VY47N5lCLlNr/mZf
jb2TwIln8Szy2EfIT2RSd0ZNupyU8V4aH/mYNpSlbUJ6mfvfIAttBSS/YH+Zeqku
7zOJkoCusaySOCZCOQkeikL3ZBN+FHtNteXxmGnp34ed/tsfgGZj1lsbmkM2rrWo
f2mQsYAclUA4KQeY9z/Xf7/c5wJUkME69PxOaaN23dOpBR7GA58Cvb0PQTnPlAiT
KnH/JRweBHtcv4KEHMi2f5no4cxcmXyKTj7/TLyYNjc8LATL9Eo/nxG36PLxy4lN
QPOWz11zEBLjQQ==
=8Ftq
-----END PGP SIGNATURE-----
Merge tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A set of final cleanups for the timer subsystem:
- Convert all del_timer[_sync]() instances over to the new
timer_delete[_sync]() API and remove the legacy wrappers.
Conversion was done with coccinelle plus some manual fixups as
coccinelle chokes on scoped_guard().
- The final cleanup of the hrtimer_init() to hrtimer_setup()
conversion.
This has been delayed to the end of the merge window, so that all
patches which have been merged through other trees are in mainline
and all new users are caught.
Doing this right before rc1 ensures that new code which is merged post
rc1 is not introducing new instances of the original functionality"
* tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing/timers: Rename the hrtimer_init event to hrtimer_setup
hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack()
hrtimers: Rename debug_init() to debug_setup()
hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper()
hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns()
hrtimers: Make callback function pointer private
hrtimers: Merge __hrtimer_init() into __hrtimer_setup()
hrtimers: Switch to use __htimer_setup()
hrtimers: Delete hrtimer_init()
treewide: Convert new and leftover hrtimer_init() users
treewide: Switch/rename to timer_delete[_sync]()
1) A treewide cleanup for the irq_domain code, which makes the naming
consistent and gets rid of the original oddity of naming domains
'host'.
This is a trivial mechanical change and is done late to ensure that
all instances have been caught and new code merged post rc1 won't
reintroduce new instances.
2) A trivial consistency fix in the migration code
The recent introduction of irq_force_complete_move() in the core
code, causes a problem for the nostalgia crowd who maintains ia64 out
of tree.
The code assumes that hierarchical interrupt domains are enabled and
dereferences irq_data::parent_data unconditionally. That works in mainline
because both architectures which enable that code have hierarchical domains
enabled. Though it breaks the ia64 build, which enables the functionality,
but does not have hierarchical domains.
While it's not really a problem for mainline today, this
unconditional dereference is inconsistent and trivially fixable by
using the existing helper function irqd_get_parent_data(), which has
the appropriate #ifdeffery in place.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmfyW1sTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWywD/sG69q7rjt0bBHleXPjjUIrM5TdRI9k
r9S3BhVtZzfreiMnhQS1CLrA64fBFhKGJVo9HtKbsjC0hF8r10A1+OKEftYpydPz
Mk7DreqCvQO/GQ/p2MiwHiQL39iXW5eFqL8qScafD8jUnkQ1kjHu53blLuoAzx2u
ysfe/4V3KtcziKgShss4Y0SGg3CEL5sJiLbU7SLNCSRNkO/hCPh1KYAFcsrRaXnQ
pcnHae8N58RrgGIhe1F9oPNji2B0YdQ2vt7Ora2g6TlbMv66LYQ+QCu++/0n3HZI
EV/ikBtuF7zwAg6qzcmfY63XfTMj/K/Oj7qKTsMtcgHFlrpcQ9HW33qMUm90rATB
Sx/oeiJS10XFlEoseX0dO8NoRE/ZvF9wioAXnvbxxZtOchr+3hyQSbI3hGdJoncL
mqIRyf08o5kzBoRUY7Nqztlst6/+0bBgxPgDFsW7j47V/NBlUYQ0UBlB+FyoeVfk
RWS3Z18jpKlvVNKn67ZYRI0zlaxgyyGszwSsLTpQvOFt2HGdKiHFeCuBiBVOboel
vhtIRW+zT3cyMKvZimQ3BfKnBgFiEKd73VQIjaHBB+eLt2DtNpq6x0dnaOQLvVau
7eSFgBKOwEz3zAu81omcgHwMb/5/Z46e5jrtliF4YFThHWUZPZFrhrr7JFJ+pqTz
PTNWb0zGIzQCmg==
=lhoB
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more irq updates from Thomas Gleixner:
"A set of updates for the interrupt subsystem:
- A treewide cleanup for the irq_domain code, which makes the naming
consistent and gets rid of the original oddity of naming domains
'host'.
This is a trivial mechanical change and is done late to ensure that
all instances have been caught and new code merged post rc1 won't
reintroduce new instances.
- A trivial consistency fix in the migration code
The recent introduction of irq_force_complete_move() in the core
code, causes a problem for the nostalgia crowd who maintains ia64
out of tree.
The code assumes that hierarchical interrupt domains are enabled
and dereferences irq_data::parent_data unconditionally. That works
in mainline because both architectures which enable that code have
hierarchical domains enabled. Though it breaks the ia64 build,
which enables the functionality, but does not have hierarchical
domains.
While it's not really a problem for mainline today, this
unconditional dereference is inconsistent and trivially fixable by
using the existing helper function irqd_get_parent_data(), which
has the appropriate #ifdeffery in place"
* tag 'irq-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/migration: Use irqd_get_parent_data() in irq_force_complete_move()
irqdomain: Stop using 'host' for domain
irqdomain: Rename irq_get_default_host() to irq_get_default_domain()
irqdomain: Rename irq_set_default_host() to irq_set_default_domain()
- Improve performance in gendwarfksyms
- Remove deprecated EXTRA_*FLAGS and KBUILD_ENABLE_EXTRA_GCC_CHECKS
- Support CONFIG_HEADERS_INSTALL for ARCH=um
- Use more relative paths to source files for better reproducibility
- Support the loong64 Debian architecture
- Add Kbuild bash completion
- Introduce intermediate vmlinux.unstripped for architectures that need
static relocations to be stripped from the final vmlinux
- Fix versioning in Debian packages for -rc releases
- Treat missing MODULE_DESCRIPTION() as an error
- Convert Nios2 Makefiles to use the generic rule for built-in DTB
- Add debuginfo support to the RPM package
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmfxp2EVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGkIUP/AgNiP6or6fmY5+HSyjlrdutBWAh
QNW0AiKh5vytmBIv63/i103OE0SRbt+U6IApn9c7FQKkeuyIlD1e9NfSwFMZixmP
P7t6JqDCL61G5d3W2Iisqle1cpBoVvNgUwu0k3sTSXl0vNsDbiyxcCzQzLhZMKsd
O+Ppwp3zNGE2vIUwpIjzJsR5Dt/Z5MfuKDi4UShsyWpFZ1rg9X93YKc9QJOXjKwj
4Np2x2cukDo2oz4uXuZQ8F1+bOFsKYoilCwjtxlrC6BO0lSPiJsRTN6nGJ0ejns9
GGD56mBNGcGk+NEPGhAMQmZHqNAP4JfjEvAgaoSBn0Rdnjd9Cj/2T+4n61xkR4Wu
MXCP/LEJ3MyctmkZjUq+0fDAe2wjxuaAG15kAHCha+9KxIG2NzHbf2XXb4E49DDU
2rw3fqA41/cKCq1ZEaqRn3pZZgU6ysfsEW42JmnNxO+7zz9k8RX4rk8CVaVIEUuw
Xojkis//KnE6+OCBe6Tb0H2Rzo0JF3AG2eNF4zY/xnc562FRIMS19WYS38tKZng6
Gr1BRG0bA4t9mf2Vck1W1LcAb3Jh0mddtyrgYKhbcwq0YOj2q/H6F50DkC+wL282
wvhV6B/vKAH8BByEWAn3rBcN0N+w/VFc0uPCz//tkoAm4nPg8PvKq63JHPrHsyZe
mOMhifoiVbjF4KFo
=GiQ6
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Improve performance in gendwarfksyms
- Remove deprecated EXTRA_*FLAGS and KBUILD_ENABLE_EXTRA_GCC_CHECKS
- Support CONFIG_HEADERS_INSTALL for ARCH=um
- Use more relative paths to source files for better reproducibility
- Support the loong64 Debian architecture
- Add Kbuild bash completion
- Introduce intermediate vmlinux.unstripped for architectures that need
static relocations to be stripped from the final vmlinux
- Fix versioning in Debian packages for -rc releases
- Treat missing MODULE_DESCRIPTION() as an error
- Convert Nios2 Makefiles to use the generic rule for built-in DTB
- Add debuginfo support to the RPM package
* tag 'kbuild-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (40 commits)
kbuild: rpm-pkg: build a debuginfo RPM
kconfig: merge_config: use an empty file as initfile
nios2: migrate to the generic rule for built-in DTB
rust: kbuild: skip `--remap-path-prefix` for `rustdoc`
kbuild: pacman-pkg: hardcode module installation path
kbuild: deb-pkg: don't set KBUILD_BUILD_VERSION unconditionally
modpost: require a MODULE_DESCRIPTION()
kbuild: make all file references relative to source root
x86: drop unnecessary prefix map configuration
kbuild: deb-pkg: add comment about future removal of KDEB_COMPRESS
kbuild: Add a help message for "headers"
kbuild: deb-pkg: remove "version" variable in mkdebian
kbuild: deb-pkg: fix versioning for -rc releases
Documentation/kbuild: Fix indentation in modules.rst example
x86: Get rid of Makefile.postlink
kbuild: Create intermediate vmlinux build with relocations preserved
kbuild: Introduce Kconfig symbol for linking vmlinux with relocations
kbuild: link-vmlinux.sh: Make output file name configurable
kbuild: do not generate .tmp_vmlinux*.map when CONFIG_VMLINUX_MAP=y
Revert "kheaders: Ignore silly-rename files"
...
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
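As a before/after illustration of the conversion (the timer field below is a
made-up example, not a specific call site):
  /* Before: historical wrapper inlines */
  del_timer(&dev->poll_timer);
  if (del_timer_sync(&dev->poll_timer))
          pr_debug("timer was still pending\n");
  /* After: the canonical timer API */
  timer_delete(&dev->poll_timer);
  if (timer_delete_sync(&dev->poll_timer))
          pr_debug("timer was still pending\n");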
Naming interrupt domains host is confusing at best and the irqdomain code
uses both domain and host inconsistently.
Therefore rename irq_set_default_host() to irq_set_default_domain().
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-3-jirislaby@kernel.org
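An illustrative call site (the domain pointer is hypothetical); only the name
changes, the behaviour does not:
  /* Before */
  irq_set_default_host(root_domain);
  /* After */
  irq_set_default_domain(root_domain);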
Merge tag 'x86-urgent-2025-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix a performance regression on AMD iGPU and dGPU drivers, related to
the unintended activation of DMA bounce buffers that regressed game
performance if KASLR disturbed things just enough
- Fix a copy_user_generic() performance regression on certain older
non-FSRM/ERMS CPUs
- Fix a Clang build warning due to a semantic merge conflict the Kunit
tree generated with the x86 tree
- Fix FRED related system hang during S4 resume
- Remove an unused API
* tag 'x86-urgent-2025-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fred: Fix system hang during S4 resume with FRED enabled
x86/platform/iosf_mbi: Remove unused iosf_mbi_unregister_pmic_bus_access_notifier()
x86/mm/init: Handle the special case of device private pages in add_pages(), to not increase max_pfn and trigger dma_addressing_limited() bounce buffers
x86/tools: Drop duplicate unlikely() definition in insn_decoder_test.c
x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs
Use a separate subclass when acquiring KVM's per-CPU posted interrupts
wakeup lock in the scheduled out path, i.e. when adding a vCPU to the list
of vCPUs to wake, to work around a false positive deadlock.
Chain exists of:
&p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
lock(&rq->__lock);
lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
lock(&p->pi_lock);
*** DEADLOCK ***
In the wakeup handler, the callchain is *always*:
sysvec_kvm_posted_intr_wakeup_ipi()
|
--> pi_wakeup_handler()
|
--> kvm_vcpu_wake_up()
|
--> try_to_wake_up(),
and the lock order is:
&per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock.
For the schedule out path, the callchain is always (for all intents and
purposes; if the kernel is preemptible, kvm_sched_out() can be called from
something other than schedule(), but the beginning of the callchain will
be the same point in vcpu_block()):
vcpu_block()
|
--> schedule()
|
--> kvm_sched_out()
|
--> vmx_vcpu_put()
|
--> vmx_vcpu_pi_put()
|
--> pi_enable_wakeup_handler()
and the lock order is:
&rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
wakeup, and complains about the A=>C versus C=>A inversion. In practice,
deadlock can't occur between schedule out and the wakeup handler as they
are mutually exclusive. The entirety of the schedule out code that runs
with the problematic scheduler locks held does so with IRQs disabled,
i.e. it can't run concurrently with the wakeup handler.
Use a subclass instead of disabling lockdep entirely, and tell lockdep that
both subclasses are being acquired when loading a vCPU, as the sched_out
and sched_in paths are NOT mutually exclusive, e.g.
CPU 0 CPU 1
--------------- ---------------
vCPU0 sched_out
vCPU1 sched_in
vCPU1 sched_out vCPU 0 sched_in
where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup
list+lock.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-ID: <20250401154727.835231-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
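A hedged sketch of the sched-out side (field names follow the existing
posted-interrupt code, but this is not the literal diff):
  /* IRQs are off here, so this cannot race with the wakeup handler;
   * the lockdep subclass suppresses the false-positive report. */
  raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu), 1);
  list_add_tail(&vmx->pi_wakeup_list, &per_cpu(wakeup_vcpus_on_cpu, cpu));
  raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));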
Assert that IRQs are already disabled when putting a vCPU on a CPU's PI
wakeup list, as opposed to saving/disabling+restoring IRQs. KVM relies on
IRQs being disabled until the vCPU task is fully scheduled out, i.e. until
the scheduler has dropped all of its per-CPU locks (e.g. for the runqueue),
as attempting to wake the task while it's being scheduled out could lead
to deadlock.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250401154727.835231-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
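A minimal sketch of the resulting pattern (not the exact function body):
  /* Previously: local_irq_save(flags); ... local_irq_restore(flags); */
  lockdep_assert_irqs_disabled();
  raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
  /* ... put the vCPU on this CPU's wakeup list ... */
  raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));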
Explicitly zero/empty-initialize the unions used for PMU related CPUID
entries, instead of manually zeroing all fields (hopefully), or in the
case of 0x80000022, relying on the compiler to clobber the uninitialized
bitfields.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-ID: <20250315024102.2361628-1-seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
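As an illustration of the pattern, using one of the perf CPUID unions (the
field picked is arbitrary):
  /* Before: only some fields are assigned, the rest is left to chance */
  union cpuid10_eax eax;
  eax.split.version_id = 2;
  /* After: empty-initialize the whole union, then fill in what matters */
  union cpuid10_eax eax = { };
  eax.split.version_id = 2;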
Wrap the TDP MMU page counter in CONFIG_KVM_PROVE_MMU so that the sanity
check is omitted from production builds, and more importantly to remove
the atomic accesses to account pages. A one-off memory leak in production
is relatively uninteresting, and a WARN_ON won't help mitigate a systemic
issue; it's as much about helping triage memory leaks as it is about
detecting them in the first place, and doesn't magically stop the leaks.
I.e. production environments will be quite sad if a severe KVM bug escapes,
regardless of whether or not KVM WARNs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250315023448.2358456-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
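A sketch of the shape of the change, assuming the existing tdp_mmu_pages
counter (not the verbatim diff):
  static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
  {
          /* other accounting elided */
  #ifdef CONFIG_KVM_PROVE_MMU
          atomic64_inc(&kvm->arch.tdp_mmu_pages);
  #endif
  }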
Acquire a lock on kvm->srcu when userspace is getting MP state to handle a
rather extreme edge case where "accepting" APIC events, i.e. processing
pending INIT or SIPI, can trigger accesses to guest memory. If the vCPU
is in L2 with INIT *and* a TRIPLE_FAULT request pending, then getting MP
state will trigger a nested VM-Exit by way of ->check_nested_events(), and
emulating the nested VM-Exit can access guest memory.
The splat was originally hit by syzkaller on a Google-internal kernel, and
reproduced on an upstream kernel by hacking the triple_fault_event_test
selftest to stuff a pending INIT, store an MSR on VM-Exit (to generate a
memory access on VMX), and do vcpu_mp_state_get() to trigger the scenario.
=============================
WARNING: suspicious RCU usage
6.14.0-rc3-b112d356288b-vmx/pi_lockdep_false_pos-lock #3 Not tainted
-----------------------------
include/linux/kvm_host.h:1058 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by triple_fault_ev/1256:
#0: ffff88810df5a330 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x8b/0x9a0 [kvm]
stack backtrace:
CPU: 11 UID: 1000 PID: 1256 Comm: triple_fault_ev Not tainted 6.14.0-rc3-b112d356288b-vmx #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0x90
lockdep_rcu_suspicious+0x144/0x190
kvm_vcpu_gfn_to_memslot+0x156/0x180 [kvm]
kvm_vcpu_read_guest+0x3e/0x90 [kvm]
read_and_check_msr_entry+0x2e/0x180 [kvm_intel]
__nested_vmx_vmexit+0x550/0xde0 [kvm_intel]
kvm_check_nested_events+0x1b/0x30 [kvm]
kvm_apic_accept_events+0x33/0x100 [kvm]
kvm_arch_vcpu_ioctl_get_mpstate+0x30/0x1d0 [kvm]
kvm_vcpu_ioctl+0x33e/0x9a0 [kvm]
__x64_sys_ioctl+0x8b/0xb0
do_syscall_64+0x6c/0x170
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250401150504.829812-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
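A minimal sketch of the fix, in the spirit of how other vCPU ioctls already
take SRCU (not the literal diff):
  kvm_vcpu_srcu_read_lock(vcpu);
  r = kvm_arch_vcpu_ioctl_get_mpstate(vcpu, &mp_state);
  kvm_vcpu_srcu_read_unlock(vcpu);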
Merge tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- The series "mm: fixes for fallouts from mem_init() cleanup" from Mike
Rapoport fixes a couple of issues with the just-merged "arch, mm:
reduce code duplication in mem_init()" series
- The series "MAINTAINERS: add my isub-entries to MM part." from Mike
Rapoport does some maintenance on MAINTAINERS
- The series "remove tlb_remove_page_ptdesc()" from Qi Zheng does some
cleanup work to the page mapping code
- The series "mseal system mappings" from Jeff Xu permits sealing of
"system mappings", such as vdso, vvar, vvar_vclock, vectors (arm
compat-mode), sigpage (arm compat-mode)
- Plus the usual shower of singleton patches
* tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
mseal sysmap: add arch-support txt
mseal sysmap: enable s390
selftest: test system mappings are sealed
mseal sysmap: update mseal.rst
mseal sysmap: uprobe mapping
mseal sysmap: enable arm64
mseal sysmap: enable x86-64
mseal sysmap: generic vdso vvar mapping
selftests: x86: test_mremap_vdso: skip if vdso is msealed
mseal sysmap: kernel config and header change
mm: pgtable: remove tlb_remove_page_ptdesc()
x86: pgtable: convert to use tlb_remove_ptdesc()
riscv: pgtable: unconditionally use tlb_remove_ptdesc()
mm: pgtable: convert some architectures to use tlb_remove_ptdesc()
mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*
mm: pgtable: make generic tlb_remove_table() use struct ptdesc
microblaze/mm: put mm_cmdline_setup() in .init.text section
mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range
MAINTAINERS: mm: add entry for secretmem
MAINTAINERS: mm: add entry for numa memblocks and numa emulation
...
Current minimum required version of binutils is 2.25,
which supports MONITOR and MWAIT instruction mnemonics.
Replace the byte-wise specification of MONITOR and
MWAIT with these proper mnemonics.
No functional change intended.
Note: LLVM assembler is not able to assemble correct forms of MONITOR
and MWAIT instructions with explicit operands and reports:
error: invalid operand for instruction
monitor %rax,%ecx,%edx
^~~~
# https://lore.kernel.org/oe-kbuild-all/202504030802.2lEVBSpN-lkp@intel.com/
Use instruction mnemonics with implicit operands to
work around this issue.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250403125111.429805-1-ubizjak@gmail.com
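A before/after sketch of the substitution (register constraints as in the
existing helpers):
  /* Before: byte-wise encodings */
  asm volatile(".byte 0x0f, 0x01, 0xc8" :: "a" (eax), "c" (ecx), "d" (edx)); /* MONITOR */
  asm volatile(".byte 0x0f, 0x01, 0xc9" :: "a" (eax), "c" (ecx));            /* MWAIT */
  /* After: mnemonics with implicit operands, accepted by GNU as and LLVM */
  asm volatile("monitor" :: "a" (eax), "c" (ecx), "d" (edx));
  asm volatile("mwait" :: "a" (eax), "c" (ecx));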
All functions in mwait_idle_with_hints() cast eax and ecx arguments
to u32. Propagate argument type to the enclosing function.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250403073105.245987-1-ubizjak@gmail.com
Have it return the two things it does return:
- a new ASID and
- the need to flush the TLB or not,
in a struct which fits in a single 32-bit register and whack the IO
parameters.
Beyond being easier to read, this also helps the compiler generate
better, more compact code:
# arch/x86/mm/tlb.o:
text data bss dec hex filename
9341 753 516 10610 2972 tlb.o.before
9213 753 516 10482 28f2 tlb.o.after
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250403085623.20824-1-bp@kernel.org
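A hedged sketch of the returned type and a call site (names approximate the
patch; the point is that both outputs fit in one 32-bit register):
  struct new_asid {
          unsigned int asid       : 16;
          unsigned int need_flush :  1;
  };
  static struct new_asid choose_new_asid(struct mm_struct *next, u64 next_tlb_gen);
  /* caller side */
  struct new_asid ns = choose_new_asid(next, next_tlb_gen);
  if (ns.need_flush)
          need_flush = true;
  new_asid = ns.asid;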
There is not much point in CONFIG_AS_TPAUSE at all when the emitted
assembly is always the same - it only obfuscates the __tpause() code
in essence.
Remove the TPAUSE insn mnemonic from __tpause() and leave only
the equivalent byte-wise definition. This can then be changed
back to insn mnemonic once binutils 2.31.1 is the minimum version
to build the kernel. (Right now it's 2.25.)
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402180827.3762-4-ubizjak@gmail.com
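For reference, a sketch of the byte-wise form that stays (0x66 0x0f 0xae 0xf1
encodes TPAUSE %ecx; operand constraints as in the existing helper):
  /* tpause %ecx, kept byte-wise so binutils < 2.31.1 can still assemble it */
  asm volatile(".byte 0x66, 0x0f, 0xae, 0xf1"
               :: "c" (ecx), "d" (edx), "a" (eax));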
Delimiters in asm() templates such as ';', '\t' or '\n' are not
required syntactically, they were used historically in the Linux
kernel to prettify the compiler's .s output for people who were
looking at compiler generated .s output.
Most x86 developers these days are primarily looking at:
1) objdump --disassemble-all .o
2) perf top's live kernel function annotation and disassembler
feature that uses /dev/mem.
... because:
- this kind of assembler output is standardized regardless of
compiler used,
- it's generally less messy looking,
- it gives ground-truth instead of being some intermediate layer
in the toolchain that might or might not be the real deal,
- and on a live kernel it also sees through the kernel's various
layers of runtime patching code obfuscation facilities, also
known as: alternative-instructions, tracepoints and jump labels.
There are some cases where the .s output is the most useful
tool, such as alternatives() code generation, but other than
that these delimiters used in simple asm() statements mostly
add noise to the source code side, which isn't desirable for
assembly code that is fragile enough already.
Remove the delimiters for <asm/mwait.h>, which also happens to
make the GCC inliner's asm() instruction length heuristics
more accurate...
[ mingo: Wrote a new changelog to give historic context and
to give people a chance to object. :-) ]
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402180827.3762-3-ubizjak@gmail.com
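The kind of change involved, shown on a hypothetical MWAIT template; only the
template string shrinks, the emitted instruction is identical:
  /* Before: the trailing delimiter only prettified the compiler's .s output */
  asm volatile("mwait\n\t" :: "a" (eax), "c" (ecx));
  /* After */
  asm volatile("mwait" :: "a" (eax), "c" (ecx));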
Merge tag 'cxl-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull Compute Express Link (CXL) updates from Dave Jiang:
- Add support for Global Persistent Flush (GPF)
- Cleanup of DPA partition metadata handling:
- Remove the CXL_DECODER_MIXED enum that's not needed anymore
- Introduce helpers to access resource and perf meta data
- Introduce 'struct cxl_dpa_partition' and 'struct cxl_range_info'
- Make cxl_dpa_alloc() DPA partition number agnostic
- Remove cxl_decoder_mode
- Cleanup partition size and perf helpers
- Remove unused CXL partition values
- Add logging support for CXL CPER endpoint and port protocol errors:
- Prefix protocol error struct and function names with cxl_
- Move protocol error definitions and structures to a common location
- Remove drivers/firmware/efi/cper_cxl.h to include/linux/cper.h
- Add support in GHES to process CXL CPER protocol errors
- Process CXL CPER protocol errors
- Add trace logging for CXL PCIe port RAS errors
- Remove redundant gp_port init
- Add validation of cxl device serial number
- CXL ABI documentation updates/fixups
- A series that uses guard() to clean up open coded mutex lockings and
remove gotos for error handling.
- Some followup patches to support dirty shutdown accounting:
- Add helper to retrieve DVSEC offset for dirty shutdown registers
- Rename cxl_get_dirty_shutdown() to cxl_arm_dirty_shutdown()
- Add support for dirty shutdown count via sysfs
- cxl_test support for dirty shutdown
- A series to support CXL mailbox Features commands.
Mostly in preparation for CXL EDAC code to utilize the Features
commands. It's also in preparation for CXL fwctl support to utilize
the CXL Features. The commands include "Get Supported Features", "Get
Feature", and "Set Feature".
- A series to support extended linear cache support described by the
ACPI HMAT table.
The addition helps enumerate the cache and also provides additional
RAS reporting support for configuration with extended linear cache.
(and related fixes for the series).
- An update to cxl_test to support a 3-way capable CFMWS
- A documentation fix to remove unused "mixed mode"
* tag 'cxl-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (39 commits)
cxl/region: Fix the first aliased address miscalculation
cxl/region: Quiet some dev_warn()s in extended linear cache setup
cxl/Documentation: Remove 'mixed' from sysfs mode doc
cxl: Fix warning from emitting resource_size_t as long long int on 32bit systems
cxl/test: Define a CFMWS capable of a 3 way HB interleave
cxl/mem: Do not return error if CONFIG_CXL_MCE unset
tools/testing/cxl: Set Shutdown State support
cxl/pmem: Export dirty shutdown count via sysfs
cxl/pmem: Rename cxl_dirty_shutdown_state()
cxl/pci: Introduce cxl_gpf_get_dvsec()
cxl/pci: Support Global Persistent Flush (GPF)
cxl: Document missing sysfs files
cxl: Plug typos in ABI doc
cxl/pmem: debug invalid serial number data
cxl/cdat: Remove redundant gp_port initialization
cxl/memdev: Remove unused partition values
cxl/region: Drop goto pattern of construct_region()
cxl/region: Drop goto pattern in cxl_dax_region_alloc()
cxl/core: Use guard() to drop goto pattern of cxl_dpa_alloc()
cxl/core: Use guard() to drop the goto pattern of cxl_dpa_free()
...
MONITOR and MONITORX expect 32-bit unsigned integer arguments in the %ecx
and %edx registers. MWAIT and MWAITX expect 32-bit unsigned integer
arguments in the %eax and %ecx registers.
Some of the helpers around these instructions in <asm/mwait.h> use
overly wide types (long); standardize on u32 instead, which makes it clear
that this is a hardware ABI.
[ mingo: Cleaned up the changelog. ]
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250402180827.3762-1-ubizjak@gmail.com
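A sketch of the resulting prototype change (the helper lives in <asm/mwait.h>;
the exact signature is abbreviated here):
  /* Before: wider-than-hardware types */
  static __always_inline void __monitor(const void *eax, unsigned long ecx, unsigned long edx);
  /* After: the hardware ABI is 32 bits wide */
  static __always_inline void __monitor(const void *eax, u32 ecx, u32 edx);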
The following commit, 12 years ago:
7e98b71920 ("x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers")
added barriers around the CLFLUSH in mwait_idle_with_hints(), justified with:
... and add memory barriers around it since the documentation is explicit
that CLFLUSH is only ordered with respect to MFENCE.
This also triggered, 11 years ago, the same adjustment in:
f8e617f458 ("sched/idle/x86: Optimize unnecessary mwait_idle() resched IPIs")
during development, although it failed to get the static_cpu_has_bug() treatment.
X86_BUG_CLFLUSH_MONITOR (a.k.a the AAI65 errata) is specific to Intel CPUs,
and the SDM currently states:
Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write instructions,
and fence instructions[1].
With footnote 1 reading:
Earlier versions of this manual specified that executions of the CLFLUSH
instruction were ordered only by the MFENCE instruction. All processors
implementing the CLFLUSH instruction also order it relative to the other
operations enumerated above.
i.e. The SDM was incorrect at the time, and barriers should not have been
inserted. Double checking the original AAI65 errata (not available from
intel.com any more) shows no mention of barriers either.
Note: If this were a general codepath, the MFENCEs would be needed, because
AMD CPUs of the same vintage do sport otherwise-unordered CLFLUSHs.
Remove the unnecessary barriers. Furthermore, use a plain alternative(),
rather than static_cpu_has_bug() and/or no optimisation. The workaround
is a single instruction.
Use an explicit %rax pointer rather than a general memory operand, because
MONITOR takes the pointer implicitly in the same way.
[ mingo: Cleaned up the commit a bit. ]
Fixes: 7e98b71920 ("x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20250402172458.1378112-1-andrew.cooper3@citrix.com
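A hedged sketch of the resulting workaround using alternative_input() (not the
exact diff); the CLFLUSH is patched in only on CPUs with
X86_BUG_CLFLUSH_MONITOR and no fences are emitted:
  /* The "a" constraint puts the pointer in %rax, matching MONITOR's
   * implicit operand. */
  alternative_input("", "clflush (%%rax)", X86_BUG_CLFLUSH_MONITOR,
                    "a" (&current_thread_info()->flags));
  __monitor(&current_thread_info()->flags, 0, 0);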
Merge tag 'uml-for-linux-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux
Pull UML updates from Johannes Berg:
- proper nofault accesses and read-only rodata
- hostfs fix for host inode number reuse
- fixes for host errno handling
- various cleanups/small fixes
* tag 'uml-for-linux-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
um: Rewrite the sigio workaround based on epoll and tgkill
um: Prohibit the VM_CLONE flag in run_helper_thread()
um: Switch to the pthread-based helper in sigio workaround
um: ubd: Switch to the pthread-based helper
um: Add pthread-based helper support
um: x86: clean up elf specific definitions
um: Store full CSGSFS and SS register from mcontext
um: virt-pci: Refactor virtio_pcidev into its own module
um: work around sched_yield not yielding in time-travel mode
um/locking: Remove semicolon from "lock" prefix
um: Update min_low_pfn to match changes in uml_reserved
um: use str_yes_no() to remove hardcoded "yes" and "no"
um: hostfs: avoid issues on inode number reuse by host
um: Allocate vdso page pointer statically
um: remove copy_from_kernel_nofault_allowed
um: mark rodata read-only and implement _nofault accesses
um: Pass the correct Rust target and options with gcc
Merge tag 'x86_tdx_for_6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 TDX updates from Dave Hansen:
"Avoid direct HLT instruction execution in TDX guests.
TDX guests aren't expected to use the HLT instruction directly. It
causes a virtualization exception (#VE). While the #VE _can_ be
handled, the current handling is slow and buggy and the easiest thing
is just to avoid HLT in the first place. Plus, the kernel already has
paravirt infrastructure that makes it relatively painless.
Make TDX guests require paravirt and add some TDX-specific paravirt
handlers which avoid HLT in the normal halt routines. Also add a
warning in case another HLT sneaks in.
There was a report that this leads to a "major performance
improvement" on specjbb2015, probably because of the extra #VE
overhead or missed wakeups from the buggy HLT handling"
* tag 'x86_tdx_for_6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling
x86/tdx: Fix arch_safe_halt() execution for TDX VMs
x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
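A sketch of the mechanism, assuming the existing x86 paravirt hooks (the exact
TDX wiring may differ):
  /* During early TDX setup: route halt through TDCALL-based helpers so the
   * guest never executes a raw HLT and never takes the slow #VE path. */
  pv_ops.irq.safe_halt = tdx_safe_halt;
  pv_ops.irq.halt = tdx_halt;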
Merge tag 'objtool-urgent-2025-04-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
"These are objtool fixes and updates by Josh Poimboeuf, centered around
the fallout from the new CONFIG_OBJTOOL_WERROR=y feature, which,
despite its default-off nature, increased the profile/impact of
objtool warnings:
- Improve error handling and the presentation of warnings/errors
- Revert the new summary warning line that some test-bot tools
interpreted as new regressions
- Fix a number of objtool warnings in various drivers, core kernel
code and architecture code. About half of them are potential
problems related to out-of-bounds accesses or potential undefined
behavior, the other half are additional objtool annotations
- Update objtool to latest (known) compiler quirks and objtool bugs
triggered by compiler code generation
- Misc fixes"
* tag 'objtool-urgent-2025-04-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
objtool/loongarch: Add unwind hints in prepare_frametrace()
rcu-tasks: Always inline rcu_irq_work_resched()
context_tracking: Always inline ct_{nmi,irq}_{enter,exit}()
sched/smt: Always inline sched_smt_active()
objtool: Fix verbose disassembly if CROSS_COMPILE isn't set
objtool: Change "warning:" to "error: " for fatal errors
objtool: Always fail on fatal errors
Revert "objtool: Increase per-function WARN_FUNC() rate limit"
objtool: Append "()" to function name in "unexpected end of section" warning
objtool: Ignore end-of-section jumps for KCOV/GCOV
objtool: Silence more KCOV warnings, part 2
objtool, drm/vmwgfx: Don't ignore vmw_send_msg() for ORC
objtool: Fix STACK_FRAME_NON_STANDARD for cold subfunctions
objtool: Fix segfault in ignore_unreachable_insn()
objtool: Fix NULL printf() '%s' argument in builtin-check.c:save_argv()
objtool, lkdtm: Obfuscate the do_nothing() pointer
objtool, regulator: rk808: Remove potential undefined behavior in rk806_set_mode_dcdc()
objtool, ASoC: codecs: wcd934x: Remove potential undefined behavior in wcd934x_slim_irq_handler()
objtool, Input: cyapa - Remove undefined behavior in cyapa_update_fw_store()
objtool, panic: Disable SMAP in __stack_chk_fail()
...
The x86 has already been converted to use struct ptdesc, so convert it to
use tlb_remove_ptdesc() instead of tlb_remove_table().
Link: https://lkml.kernel.org/r/36ad56b7e06fa4b17fb23c4fc650e8e0d72bb3cd.1740454179.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
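The shape of the conversion at a free site (illustrative; ptdesc stands for
the struct ptdesc of the page-table page being freed):
  /* Before */
  tlb_remove_table(tlb, ptdesc);
  /* After */
  tlb_remove_ptdesc(tlb, ptdesc);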
The prefetchw() dates back decades and the fundamental notion of doing
something like this on a lock is shady.
Moreover, for a few years now in the fast path faults are handled with RCU
+ per-vma locking, hopefully not even looking at the lock to begin with.
As such just remove it.
I did not see a point in benchmarking this. Given that the lock is not
expected to be looked at by default, skipping the prefetch is justified.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250401143520.1113572-1-mjguzik@gmail.com
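A hedged sketch of what goes away in the fault entry path:
  /* Before: speculatively prefetch the lock cacheline for write */
  prefetchw(&current->mm->mmap_lock);
  /* After: nothing - the RCU + per-VMA lock path may never touch mmap_lock */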
P4D huge pages are not supported yet, let's use the generic definition
in <linux/pgtable.h>.
[ mingo: Cleaned up the changelog. ]
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Link: https://lore.kernel.org/r/20250331081327.256412-7-bhe@redhat.com
PGD huge pages are not supported yet, let's use the generic definition
in <linux/pgtable.h>.
[ mingo: Cleaned up the changelog. ]
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Link: https://lore.kernel.org/r/20250331081327.256412-6-bhe@redhat.com
Upon a wakeup from S4, the restore kernel starts and initializes the
FRED MSRs as needed from its perspective. It then loads a hibernation
image, including the image kernel, and attempts to load image pages
directly into their original page frames used before hibernation unless
those frames are currently in use. Once all pages are moved to their
original locations, it jumps to a "trampoline" page in the image kernel.
At this point, the image kernel takes control, but the FRED MSRs still
contain values set by the restore kernel, which may differ from those
set by the image kernel before hibernation. Therefore, the image kernel
must ensure the FRED MSRs have the same values as before hibernation.
Since these values depend only on the location of the kernel text and
data, they can be recomputed from scratch.
Reported-by: Xi Pardee <xi.pardee@intel.com>
Reported-by: Todd Brandt <todd.e.brandt@intel.com>
Tested-by: Todd Brandt <todd.e.brandt@intel.com>
Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250401075728.3626147-1-xin@zytor.com
Convert the last remaining printk() in nmi.c to pr_info(). Along with
it, use timespec macros to calculate the NMI handler duration.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-10-sohil.mehta@intel.com
The expected_testcase_failures variable in the NMI selftest has never
been set since its introduction. Remove this unused variable along with
the related checks to simplify the code.
While at it, replace printk() with the corresponding pr_{cont,info}()
calls. Also, get rid of the superfluous testname wrapper and the
redundant file path comment.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-9-sohil.mehta@intel.com
NMI handlers can be registered by various subsystems, including drivers.
However, the interface for registering and unregistering such handlers
is not clearly documented. In the future, the interface may need to be
extended to identify the source of the NMI.
Add documentation to make the current API more understandable and easier
to use.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-8-sohil.mehta@intel.com
Some of the comments in the default NMI handling code are out of place
or inadequate. Move them to the appropriate locations and update them as
needed.
Move the comment related to CPU-specific NMIs closer to the actual code.
Also, add more details about how back-to-back NMIs are detected since
that isn't immediately obvious.
Opportunistically, replace an #ifdef section in the vicinity with an
IS_ENABLED() check to make the code easier to read.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-7-sohil.mehta@intel.com
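The #ifdef-to-IS_ENABLED() pattern mentioned above, in generic form (the body
is hypothetical):
  /* Before */
  #ifdef CONFIG_X86_LOCAL_APIC
          handle_cpu_specific_nmi();      /* hypothetical body */
  #endif
  /* After: regular C, always parsed, dead-code eliminated when disabled */
  if (IS_ENABLED(CONFIG_X86_LOCAL_APIC))
          handle_cpu_specific_nmi();      /* hypothetical body */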
The comment in unknown_nmi_error() is incorrect and misleading. There
is no longer a restriction on having a single Unknown NMI handler. Also,
nmi_handle() never used the 'b2b' parameter.
The commits that made the comment outdated are:
0d443b70cc ("x86/platform: Remove warning message for duplicate NMI handlers")
bf9f2ee28d ("x86/nmi: Remove the 'b2b' parameter from nmi_handle()")
Remove the old comment and update it to reflect the current logic.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-6-sohil.mehta@intel.com
Commit:
feb6cd6a0f ("thermal/intel_powerclamp: stop sched tick in forced idle")
got rid of the last exported user of local_touch_nmi() a while back.
Remove the unnecessary export.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-5-sohil.mehta@intel.com
The NMI descriptors for each NMI type are stored in an array. However,
they are currently initialized using raw numbers, which makes it
difficult to understand the code.
Introduce a macro to initialize the NMI descriptors using the NMI type
enum values to make the code more readable.
No functional change intended.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-4-sohil.mehta@intel.com
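A hedged sketch of such an initializer (macro name and layout are
illustrative, not the exact code added):
  #define NMI_DESC_INIT(type) [type] = {                                  \
          .lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[type].lock),         \
          .head = LIST_HEAD_INIT(nmi_desc[type].head),                    \
  }
  static struct nmi_desc nmi_desc[NMI_MAX] = {
          NMI_DESC_INIT(NMI_LOCAL),
          NMI_DESC_INIT(NMI_UNKNOWN),
          NMI_DESC_INIT(NMI_SERR),
          NMI_DESC_INIT(NMI_IO_CHECK),
  };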
Commit:
c305a4e983 ("x86: Move sysctls into arch/x86")
recently moved the sysctl handling of panic_on_unrecovered_nmi and
panic_on_io_nmi to x86-specific code. These variables no longer need to
be declared in the generic header file.
Relocate the variable definitions and declarations closer to where they
are used. This makes all the NMI panic options consistent and easier to
track.
[ mingo: Fixed up the SHA1 of the commit reference. ]
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Joel Granados <joel.granados@kernel.org>
Link: https://lore.kernel.org/r/20250327234629.3953536-3-sohil.mehta@intel.com
The unknown_nmi_panic variable is used to control whether the kernel
should panic on unknown NMIs. There is a sysctl entry under:
/proc/sys/kernel/unknown_nmi_panic
which can be used to change the behavior at runtime.
However, it seems that in some places, the option unnecessarily depends
on CONFIG_X86_LOCAL_APIC. Other code in nmi.c uses unknown_nmi_panic
without such a dependency. This results in a few messy #ifdefs
splattered across the code. The dependency was likely introduced due to a
potential build bug reported a long time ago:
https://lore.kernel.org/lkml/40BC67F9.3000609@myrealbox.com/
This build bug no longer exists.
Also, similar NMI panic options, such as panic_on_unrecovered_nmi and
panic_on_io_nmi, do not have an explicit dependency on the local APIC
either.
Though it's hard to imagine a production system without the local APIC
configured, making a specific NMI sysctl option dependent on it
doesn't make sense.
Remove the explicit dependency between unknown NMI handling and the
local APIC to make the code cleaner and more consistent.
While at it, reorder the header includes to maintain alphabetical order.
[ mingo: Cleaned up the changelog a bit, truly ordered the headers ... ]
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-2-sohil.mehta@intel.com
The last use of iosf_mbi_unregister_pmic_bus_access_notifier() was
removed in 2017 by:
a5266db4d3 ("drm/i915: Acquire PUNIT->PMIC bus for intel_uncore_forcewake_reset()")
Remove it.
(Note that the '_unlocked' version is still used.)
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: intel-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Link: https://lore.kernel.org/r/20241225175010.91783-1-linux@treblig.org
Merge tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big set of driver core updates for 6.15-rc1. Lots of stuff
happened this development cycle, including:
- kernfs scaling changes to make it even faster thanks to rcu
- bin_attribute constify work in many subsystems
- faux bus minor tweaks for the rust bindings
- rust binding updates for driver core, pci, and platform busses,
making more functionality available to rust drivers. These are all
due to people actually trying to use the bindings that were in
6.14.
- make Rafael and Danilo full co-maintainers of the driver core
codebase
- other minor fixes and updates"
* tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (52 commits)
rust: platform: require Send for Driver trait implementers
rust: pci: require Send for Driver trait implementers
rust: platform: impl Send + Sync for platform::Device
rust: pci: impl Send + Sync for pci::Device
rust: platform: fix unrestricted &mut platform::Device
rust: pci: fix unrestricted &mut pci::Device
rust: device: implement device context marker
rust: pci: use to_result() in enable_device_mem()
MAINTAINERS: driver core: mark Rafael and Danilo as co-maintainers
rust/kernel/faux: mark Registration methods inline
driver core: faux: only create the device if probe() succeeds
rust/faux: Add missing parent argument to Registration::new()
rust/faux: Drop #[repr(transparent)] from faux::Registration
rust: io: fix devres test with new io accessor functions
rust: io: rename `io::Io` accessors
kernfs: Move dput() outside of the RCU section.
efi: rci2: mark bin_attribute as __ro_after_init
rapidio: constify 'struct bin_attribute'
firmware: qemu_fw_cfg: constify 'struct bin_attribute'
powerpc/perf/hv-24x7: Constify 'struct bin_attribute'
...
reservation" from Sourabh Jain changes powerpc's kexec code to use more
of the generic layers.
- The 2 patch series "get_maintainer: report subsystem status
separately" from Vlastimil Babka makes some long-requested improvements
to the get_maintainer output.
- The 4 patch series "ucount: Simplify refcounting with rcuref_t" from
Sebastian Siewior cleans up and optimizing the refcounting in the ucount
code.
- The 12 patch series "reboot: support runtime configuration of
emergency hw_protection action" from Ahmad Fatoum improves the ability
for a driver to perform an emergency system shutdown or reboot.
- The 16 patch series "Converge on using secs_to_jiffies() part two"
from Easwar Hariharan performs further migrations from
msecs_to_jiffies() to secs_to_jiffies().
- The 7 patch series "lib/interval_tree: add some test cases and
cleanup" from Wei Yang permits more userspace testing of kernel library
code, adds some more tests and performs some cleanups.
- The 2 patch series "hung_task: Dump the blocking task stacktrace" from
Masami Hiramatsu arranges for the hung_task detector to dump the stack
of the blocking task and not just that of the blocked task.
- The 4 patch series "resource: Split and use DEFINE_RES*() macros" from
Andy Shevchenko provides some cleanups to the resource definition
macros.
- Plus the usual shower of singleton patches - please see the individual
changelogs for details.
Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- The series "powerpc/crash: use generic crashkernel reservation" from
Sourabh Jain changes powerpc's kexec code to use more of the generic
layers.
- The series "get_maintainer: report subsystem status separately" from
Vlastimil Babka makes some long-requested improvements to the
get_maintainer output.
- The series "ucount: Simplify refcounting with rcuref_t" from
Sebastian Siewior cleans up and optimizes the refcounting in the
ucount code.
- The series "reboot: support runtime configuration of emergency
hw_protection action" from Ahmad Fatoum improves the ability for a
driver to perform an emergency system shutdown or reboot.
- The series "Converge on using secs_to_jiffies() part two" from Easwar
Hariharan performs further migrations from msecs_to_jiffies() to
secs_to_jiffies().
- The series "lib/interval_tree: add some test cases and cleanup" from
Wei Yang permits more userspace testing of kernel library code, adds
some more tests and performs some cleanups.
- The series "hung_task: Dump the blocking task stacktrace" from Masami
Hiramatsu arranges for the hung_task detector to dump the stack of
the blocking task and not just that of the blocked task.
- The series "resource: Split and use DEFINE_RES*() macros" from Andy
Shevchenko provides some cleanups to the resource definition macros.
- Plus the usual shower of singleton patches - please see the
individual changelogs for details.
* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
mailmap: consolidate email addresses of Alexander Sverdlin
fs/procfs: fix the comment above proc_pid_wchan()
relay: use kasprintf() instead of fixed buffer formatting
resource: replace open coded variant of DEFINE_RES()
resource: replace open coded variants of DEFINE_RES_*_NAMED()
resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
samples: add hung_task detector mutex blocking sample
hung_task: show the blocker task if the task is hung on mutex
kexec_core: accept unaccepted kexec segments' destination addresses
watchdog/perf: optimize bytes copied and remove manual NUL-termination
lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
lib/interval_tree: skip the check before go to the right subtree
lib/interval_tree: add test case for span iteration
lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
lib/rbtree: add random seed
lib/rbtree: split tests
lib/rbtree: enable userland test suite for rbtree related data structure
checkpatch: describe --min-conf-desc-length
scripts/gdb/symbols: determine KASLR offset on s390
...
Uros Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The 4 patch series "Some cleanup for memcg" from Chen Ridong
implements some relatively minor cleanups for the memcontrol code.
- The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
from David Hildenbrand fixes a boatload of issues which David found when
using device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes things better - our own HMM selftests now succeed.
- The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
Ahmed removes the z3fold and zbud implementations. They have been
deprecated for half a year and nobody has complained.
- The 5 patch series "mm: further simplify VMA merge operation" from
Lorenzo Stoakes implements numerous simplifications in this area. No
runtime effects are anticipated.
- The 4 patch series "mm/madvise: remove redundant mmap_lock operations
from process_madvise()" from SeongJae Park rationalizes the locking in
the madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The 12 patch series "Tiny cleanup and improvements about SWAP code"
from Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak user-visible
output.
- The 2 patch series "mm/damon/paddr: fix large folios access and
schemes handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The 3 patch series "mm/damon/core: fix wrong and/or useless
damos_walk() behaviors" from SeongJae Park fixes a few issues with the
accuracy of kdamond's walking of DAMON regions.
- The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
Lorenzo Stoakes changes the interaction between framebuffer deferred-io
and core MM. No functional changes are anticipated - this is
preparatory work for the future removal of page structure fields.
- The 4 patch series "mm/damon: add support for hugepage_size DAMOS
filter" from Usama Arif adds a DAMOS filter which permits the filtering
by huge page sizes.
- The 4 patch series "mm: permit guard regions for file-backed/shmem
mappings" from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The 4 patch series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping for
pte-mapped large folios.
- The 18 patch series "reimplement per-vma lock as a refcount" from
Suren Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
fixes and improves" from SeongJae Park does some maintenance work on the
DAMON docs.
- The 27 patch series "hugetlb/CMA improvements for large systems" from
Frank van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The 12 patch series "selftests/mm: Some cleanups from trying to run
them" from Brendan Jackman fixes a pile of unrelated issues which
Brendan encountered while running our selftests.
- The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply wasn't
being effective.
- The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
from David Hildenbrand implements a number of unrelated cleanups in this
code.
- The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
Khandual implements a number of preparatory cleanups to the
GENERIC_PTDUMP Kconfig logic.
- The 8 patch series "mm/damon: auto-tune aggregation interval" from
SeongJae Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
issues in powerpc, sparc and x86 lazy MMU implementations. Ryan did
this in preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The 2 patch series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the code
easier to follow.
- The 3 patch series "page_counter cleanup and size reduction" from
Shakeel Butt cleans up the page_counter code and fixes a size increase
which we accidentally added late last year.
- The 3 patch series "Add a command line option that enables control of
how many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallelization of huge page
initialization.
- The 3 patch series "Fix calculations in trace_balance_dirty_pages()
for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The 9 patch series "mm/damon: make allow filters after reject filters
useful and intuitive" from SeongJae Park improves the handling of allow
and reject filters. Behaviour is made more consistent and the
documentation is updated accordingly.
- The 5 patch series "Switch zswap to object read/write APIs" from Yosry
Ahmed updates zswap to the new object read/write APIs and thus permits
the removal of some legacy code from zpool and zsmalloc.
- The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
does as it claims.
- The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permitting the removal of a number of special-case
checks.
- The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
is a preparatory refactoring and cleanup of the mremap() code.
- The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
+ CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
filters based on handling layers" from SeongJae Park adds a couple of
new sysfs directories to ease the management of DAMON/DAMOS filters.
- The 13 patch series "arch, mm: reduce code duplication in mem_init()"
from Mike Rapoport consolidates many per-arch implementations of
mem_init() into generic code, where that is practical.
- The 13 patch series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The 3 patch series "mm: page_ext: Introduce new iteration API" from
Luiz Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The 8 patch series "Buddy allocator like (or non-uniform) folio split"
from Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios are
generated.
- The 2 patch series "Minimize xa_node allocation during xarray split"
from Zi Yan reduces the number of xarray xa_nodes which are generated
during an xarray split.
- The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The 3 patch series "Add tracepoints for lowmem reserves, watermarks
and totalreserve_pages" from Martin Liu adds some more tracepoints to
the page allocator code.
- The 4 patch series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which SeongJae
observed during his earlier madvise work.
- The 3 patch series "mm/hwpoison: Fix regressions in memory failure
handling" from Shuai Xue addresses two quite serious regressions which
Shuai has observed in the memory-failure implementation.
- The 5 patch series "mm: reliable huge page allocator" from Johannes
Weiner makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The 5 patch series "Minor memcg cleanups & prep for memdescs" from
Matthew Wilcox is preparatory work for the future implementation of
memdescs.
- The 4 patch series "track memory used by balloon drivers" from Nico
Pache introduces a way to track memory used by our various balloon
drivers.
- The 2 patch series "mm/damon: introduce DAMOS filter type for active
pages" from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
Hao Jia separates the proactive reclaim statistics from the direct
reclaim statistics.
- The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
from Jinjiang Tu fixes our handling of hwpoisoned pages within the
reclaim code.
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
=Pn2K
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively minor cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found when using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes things better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DAMON/DAMOS to filter by the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while running our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatory cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallelization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documentation
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permitting the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatory refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
As Bert Karwatzki reported, the following recent commit causes a
performance regression on AMD iGPU and dGPU systems:
7ffb791423 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
It exposed a bug with nokaslr and zone device interaction.
The root cause of the bug is that, the GPU driver registers a zone
device private memory region. When KASLR is disabled or the above commit
is applied, direct_map_physmem_end is set much higher than 10 TiB,
typically to the 64 TiB address. When zone device private memory is added
to the system via add_pages(), it bumps up the max_pfn to the same
value. This causes dma_addressing_limited() to return true, since the
device cannot address memory all the way up to max_pfn.
This caused a regression for games played on the iGPU, as it resulted in
the DMA32 zone being used for GPU allocations.
Fix this by not bumping up max_pfn on x86 systems, when pgmap is passed
into add_pages(). The presence of pgmap is used to determine if device
private memory is being added via add_pages().
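As a rough sketch of that approach, modelled on the existing x86-64
add_pages() and on the description above (the update_end_of_memory_vars()
helper and the exact guard should be treated as illustrative, not as the
final patch):

  int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
                struct mhp_params *params)
  {
          int ret;

          ret = __add_pages(nid, start_pfn, nr_pages, params);
          WARN_ON_ONCE(ret);

          /*
           * Device private memory comes with a pgmap; do not let it bump
           * max_pfn/high_memory, otherwise dma_addressing_limited() starts
           * returning true for devices that can address all of real RAM.
           */
          if (!params->pgmap)
                  update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
                                            nr_pages << PAGE_SHIFT);

          return ret;
  }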
More details:
devm_request_mem_region() and request_free_mem_region() request for
device private memory. iomem_resource is passed as the base resource
with start and end parameters. iomem_resource's end depends on several
factors, including the platform and virtualization. On x86 for example
on bare metal, this value is set to boot_cpu_data.x86_phys_bits.
boot_cpu_data.x86_phys_bits can change depending on support for MKTME.
By default it is set to the same as log2(direct_map_physmem_end) which
is 46 to 52 bits depending on the number of levels in the page table.
The allocation routines used iomem_resource's end and
direct_map_physmem_end to figure out where to allocate the region.
[ arch/powerpc is also impacted by this problem, but this patch does not fix
the issue for PowerPC. ]
Testing:
1. Tested on a virtual machine with test_hmm for zone device insertion
2. A previous version of this patch was tested by Bert, please see:
https://lore.kernel.org/lkml/d87680bab997fdc9fb4e638983132af235d9a03a.camel@web.de/
[ mingo: Clarified the comments and the changelog. ]
Reported-by: Bert Karwatzki <spasswolf@web.de>
Tested-by: Bert Karwatzki <spasswolf@web.de>
Fixes: 7ffb791423 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Link: https://lore.kernel.org/r/20250401000752.249348-1-balbirs@nvidia.com
It turns out the code to generate the x86 cpufeaturemasks.h header was
way too aggressive, and would re-generate it whenever the timestamp on
the kernel config file changed.
Now, the regular 'make *config' tools are fairly careful to not rewrite
the kernel config file unless the contents change, but other usecases
aren't that careful.
Michael Kelley reports that 'make-kpkg' ends up doing "make syncconfig"
multiple times in prepping to build, and will modify the config file in
the process (and then modify it back, but by then the timestamps have
changed).
Jakub Kicinski reports that the netdev CI does something similar in how
it generates the config file in multiple steps.
In both cases, the config file timestamp updates then cause the
cpufeaturemasks.h file to be regenerated, and that in turn then causes
lots of unnecessary rebuilds due to all the normal dependencies.
Fix it by using our 'filechk' infrastructure in the Makefile to generate
the header file. That will only write a new version of the file if the
contents of the file have actually changed.
Fixes: 841326332b ("x86/cpufeatures: Generate the <asm/cpufeaturemasks.h> header based on build config")
Reported-by: Michael Kelley <mhklinux@outlook.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/all/SN6PR02MB415756D1829740F6E8AC11D1D4D82@SN6PR02MB4157.namprd02.prod.outlook.com/
Link: https://lore.kernel.org/all/20250328162311.08134fa6@kernel.org/
Cc: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Use RCU instead of RCU-sched
The mix of rcu_read_lock(), rcu_read_lock_sched() and preempt_disable()
in the module code and its users has been replaced with just
rcu_read_lock().
- The rest of changes are smaller fixes and updates.
The changes have been on linux-next for at least 2 weeks, with the RCU
cleanup present for 2 months. One performance problem was reported with the
RCU change when KASAN + lockdep were enabled, but it was effectively
addressed by the already merged ee57ab5a32 ("locking/lockdep: Disable
KASAN instrumentation of lockdep.c").
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEIduBR9MnFA82q/jtumpXJwqY6poFAmfmwrsUHHBldHIucGF2
bHVAc3VzZS5jb20ACgkQumpXJwqY6prWxgf/S7Pvdywm10vJ6fooYa+GxXNMwhyh
XRjZ4m9gjeTNf2KLwX0XHv0XZeFHOmHfjd3iI+pS6CXZnCFTN9J3XPLYsrTxXUb6
U6zzLf8Zsz8TzeI4dgvSBsZln7oICSACkAgdJCq23hpNKeaeRo91dgiZaIwyZJG3
FekqSFtP7pYhfFoNkrFKysqbgl1+RWWZ79L2qRJA0bPzVFlvRUuh6cOHQw+8RMqf
BYLwnArjTkW8AcXpxIXSiwphDHVZ81B96xoplavyoprA5FDpv1W+8y4DtxdWFn+1
QVWCs/ZV3KrwXWpZev625w3fIOOIXILqRINOzLfvXTw+1xFS3TzSQEpVeg==
=4OKc
-----END PGP SIGNATURE-----
Merge tag 'modules-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull modules updates from Petr Pavlu:
- Use RCU instead of RCU-sched
The mix of rcu_read_lock(), rcu_read_lock_sched() and
preempt_disable() in the module code and its users has
been replaced with just rcu_read_lock()
- The rest of changes are smaller fixes and updates
* tag 'modules-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: (32 commits)
MAINTAINERS: Update the MODULE SUPPORT section
module: Remove unnecessary size argument when calling strscpy()
module: Replace deprecated strncpy() with strscpy()
params: Annotate struct module_param_attrs with __counted_by()
bug: Use RCU instead RCU-sched to protect module_bug_list.
static_call: Use RCU in all users of __module_text_address().
kprobes: Use RCU in all users of __module_text_address().
bpf: Use RCU in all users of __module_text_address().
jump_label: Use RCU in all users of __module_text_address().
jump_label: Use RCU in all users of __module_address().
x86: Use RCU in all users of __module_address().
cfi: Use RCU while invoking __module_address().
powerpc/ftrace: Use RCU in all users of __module_text_address().
LoongArch: ftrace: Use RCU in all users of __module_text_address().
LoongArch/orc: Use RCU in all users of __module_address().
arm64: module: Use RCU in all users of __module_text_address().
ARM: module: Use RCU in all users of __module_text_address().
module: Use RCU in all users of __module_text_address().
module: Use RCU in all users of __module_address().
module: Use RCU in search_module_extables().
...
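A minimal sketch of the pattern targeted by the RCU conversion above; the
caller below is illustrative and not lifted from any specific commit:

  static bool addr_is_module_text(unsigned long addr)
  {
          bool ret;

          /* Previously guarded by preempt_disable()/rcu_read_lock_sched(). */
          rcu_read_lock();
          ret = __module_text_address(addr) != NULL;
          rcu_read_unlock();

          return ret;
  }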
- Fix a large number of x86 Kconfig dependency and help text accuracy
bugs/problems, by Mateusz Jończyk and David Heideberg.
- Fix a VM_PAT interaction with fork() crash. This also touches
core kernel code.
- Fix an ORC unwinder bug for interrupt entries
- Fixes and cleanups.
- Fix an AMD microcode loader bug that can promote verification failures
into success.
- Add early-printk support for MMIO based UARTs on an x86 board that
had no other serial debugging facility and also experienced early
boot crashes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfnFBERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iVDxAAmiB4soT3/WbaWJJdeVyxEL7sOmUNOm04
5kAVHJVK8QGdje0eWa6h7xmuQD3UOxafE2coCrOxHhZi2qpAAY6CPIIy6oIBRwZK
gLgT5xn1CHojfm4UFC3YUOyecBRPUF2C5jfkajWdZHumyPP/sOObqvGanpQRAYd5
bfPHEvrBpeEeS7WkATCdyF2j+I5xYflD4g/MDAsMmqasQHOnjBuFX5VBeVxxkysC
dMsFkFpxqcA95MnnyOnxXzgOtRTY0UystX07D3Bk1pqhG9zor+mp8OynsTRCU87T
ZPPbUr2qACNmCqEEXl+F1mAkgj5H66xE2gaJdYx0/jBAIbX8Nwih7mMxhJShVU07
Lhc0tukmVrDoDaVIr2HsxqI8iokuYLszUjDAqEQmQDrgelL6usPYghN1b2bDSJ9r
0hCO/s79024H/U9oMrC+CF52D5UH/fE98ipigrbKRIO/hOsoxiiniF3DG2NVWZM2
n5nPnOdbperqjCEteN1nxQfr7XZkvP95Bwmuqqc90XH+tzKJdHruUkbm4ua7NEEz
WKgsUIYFjeN5ZrHbJaNtHlQueTyvsyGmL1nlaLi/MaJbSXPsM/WfwvHsaKTh3NrE
BFwEAhMZVLDHEfnFT0Ev7Mm1MGpW8MbHoRBR1+E5FWWNS4X0yGLKXWRp8diw25Tm
W3ZVsn65E6U=
=/qKX
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes and updates from Ingo Molnar:
- Fix a large number of x86 Kconfig dependency and help text accuracy
bugs/problems, by Mateusz Jończyk and David Heideberg
- Fix a VM_PAT interaction with fork() crash. This also touches core
kernel code
- Fix an ORC unwinder bug for interrupt entries
- Fixes and cleanups
- Fix an AMD microcode loader bug that can promote verification
failures into success
- Add early-printk support for MMIO based UARTs on an x86 board that
had no other serial debugging facility and also experienced early
boot crashes
* tag 'x86-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Fix __apply_microcode_amd()'s return value
x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
x86/fpu: Update the outdated comment above fpstate_init_user()
x86/early_printk: Add support for MMIO-based UARTs
x86/dumpstack: Fix inaccurate unwinding from exception stacks due to misplaced assignment
x86/entry: Fix ORC unwinder for PUSH_REGS with save_ret=1
x86/Kconfig: Fix lists in X86_EXTENDED_PLATFORM help text
x86/Kconfig: Correct X86_X2APIC help text
x86/speculation: Remove the extra #ifdef around CALL_NOSPEC
x86/Kconfig: Document release year of glibc 2.3.3
x86/Kconfig: Make CONFIG_PCI_CNB20LE_QUIRK depend on X86_32
x86/Kconfig: Document CONFIG_PCI_MMCONFIG
x86/Kconfig: Update lists in X86_EXTENDED_PLATFORM
x86/Kconfig: Move all X86_EXTENDED_PLATFORM options together
x86/Kconfig: Always enable ARCH_SPARSEMEM_ENABLE
x86/Kconfig: Enable X86_X2APIC by default and improve help text
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfcq3kACgkQ6rmadz2v
bToxkw/8DHIqjVnzU2O9hbRM1anYo6yM8e34IxCt0ajHTSEVJ93+C161QDWo/6Dk
+RNlaeGekaBUk+QOLb4u+rzZ2eR/pWSm37xuDRAiBCQ+3MgR60gGRaSljpS3IUem
0FvS6C1HObBCEUXMU2rNv/5cJB5/qrQYa9FEEjRvBTLqgQkdS7yaW/KKuZaNb+Ts
KiEeWvPrPSZXStfRGy8Wr4eS2rYhxPAikUR+xde9CM+HtMWwKTCTSp8qXrqA92Dj
Cz9ix01scznuf78QCRDZp09im3lZys8ZQprmPgMxyEscN+CDL7n68wAhmTJq0uo3
3NqIv7zBQ8wMChj0f0HjwZ0Wrj7BJAveY2Q0RterxdzT4vMKdtNkThX46ISaCoX/
XQAAhZHemK6MvBJk+LKkqqMgrD+3FAzvY7O+SCyUBAMs4FK1myRJQihdLXHGfiBU
DMDZE1jsE8qBaeUbz4LIuCy8fx2LhtVwVNwbNIBUZHdyfjxIXnQT/8Cnrgklwy2i
tnYekhAsHDQY+QDkrvJpc4E1vUtiXwSDI5ErcnWdSzctEOyVeUg7OuuGD4riCd1c
emdJmtASM1z9Ajqa1dytDxVaF6wjKlbhQgnKamuex5JLGCK6makk8ZoB+DBfKYHD
VoWummTu8ldf+Dp4ehBh7AbeF2vn4kLqcF1PLRsBO6ytJs4HIt8=
=5O7h
-----END PGP SIGNATURE-----
Merge tag 'bpf_res_spin_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf resilient spinlock support from Alexei Starovoitov:
"This patch set introduces Resilient Queued Spin Lock (or rqspinlock
with res_spin_lock() and res_spin_unlock() APIs).
This is a qspinlock variant which recovers the kernel from a stalled
state when the lock acquisition path cannot make forward progress.
This can occur when a lock acquisition attempt enters a deadlock
situation (e.g. AA, or ABBA), or more generally, when the owner of the
lock (which we’re trying to acquire) isn’t making forward progress.
Deadlock detection is the main mechanism used to provide instant
recovery, with the timeout mechanism acting as a final line of
defense. Detection is triggered immediately when beginning the waiting
loop of a lock slow path.
Additionally, BPF programs attached to different parts of the kernel
can introduce new control flow into the kernel, which increases the
likelihood of deadlocks in code not written to handle reentrancy.
There have been multiple syzbot reports surfacing deadlocks in
internal kernel code due to the diverse ways in which BPF programs can
be attached to different parts of the kernel. By switching the BPF
subsystem’s lock usage to rqspinlock, all of these issues are
mitigated at runtime.
This spin lock implementation allows BPF maps to become safer and
remove mechanisms that have fallen short in assuring safety when
nesting programs in arbitrary ways in the same context or across
different contexts.
We run benchmarks that stress locking scalability and perform
comparison against the baseline (qspinlock). For the rqspinlock case,
we replace the default qspinlock with it in the kernel, such that all
spin locks in the kernel use the rqspinlock slow path. As such,
benchmarks that stress kernel spin locks end up exercising rqspinlock.
More details in the cover letter in commit 6ffb9017e9 ("Merge branch
'resilient-queued-spin-lock'")"
* tag 'bpf_res_spin_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (24 commits)
selftests/bpf: Add tests for rqspinlock
bpf: Maintain FIFO property for rqspinlock unlock
bpf: Implement verifier support for rqspinlock
bpf: Introduce rqspinlock kfuncs
bpf: Convert lpm_trie.c to rqspinlock
bpf: Convert percpu_freelist.c to rqspinlock
bpf: Convert hashtab.c to rqspinlock
rqspinlock: Add locktorture support
rqspinlock: Add entry to Makefile, MAINTAINERS
rqspinlock: Add macros for rqspinlock usage
rqspinlock: Add basic support for CONFIG_PARAVIRT
rqspinlock: Add a test-and-set fallback
rqspinlock: Add deadlock detection and recovery
rqspinlock: Protect waiters in trylock fallback from stalls
rqspinlock: Protect waiters in queue from stalls
rqspinlock: Protect pending bit owners from stalls
rqspinlock: Hardcode cond_acquire loops for arm64
rqspinlock: Add support for timeouts
rqspinlock: Drop PV and virtualization support
rqspinlock: Add rqspinlock.h header
...
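As a rough usage sketch of the res_spin_lock()/res_spin_unlock() API named
above; the lock type name and exact prototypes here are assumptions, the key
point being that acquisition can fail on deadlock or timeout and the caller
must check for that:

  static rqspinlock_t demo_lock;             /* assumed type name */

  static int demo_update(void)
  {
          int ret;

          ret = res_spin_lock(&demo_lock);   /* may fail on deadlock/timeout */
          if (ret)
                  return ret;                /* caller handles the failed acquisition */

          /* ... critical section ... */

          res_spin_unlock(&demo_lock);
          return 0;
  }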
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmfi6ZAACgkQ6rmadz2v
bTpLOg/+J7xUddPMhlpFAUlifQEadE5hmw6v1tXpM3zyKHzUWJiv/qsx3j8/ckgD
D+d4P8bqIbI9SSuIS4oZ0+D9pr/g7GYztnoYZmPiYJ7v2AijPuof5dsagFQE8E2y
rhfbt9KHTMzzkdkTvaAZaITS/HWAoJ2YVRB6gfLex2ghcXYHcgmtKRZniQrbBiFZ
MIXBN8Rg6HP+pUdIVllSXFcQCb3XIgjPONRAos4hr5tIm+3Ku7Jvkgk2H/9vUcoF
bdXAcg8xygyH7eY+1l3e7nEPQlG0jUZEsL+tq+vpdoLRLqlIpAUYmwUvqcmq4dPS
QGFjiUcpDbXlxsUFpzjXHIFto7fXCfND7HEICQPwAncdflIIfYaATSQUfkEexn0a
wBCFlAChrEzAmg2vFl4EeEr0fdSe/3jswrgKx0m6ctKieMjgloBUeeH4fXOpfkhS
9tvhuduVFuronlebM8ew4w9T/mBgbyxkE5KkvP4hNeB3ni3N0K6Mary5/u2HyN1e
lqTlnZxRA4p6lrvxce/mDrR4VSwlKLcSeQVjxAL1afD5KRkuZJnUv7bUhS361vkG
IjNrQX30EisDAz+X7tMn3ndBf9vVatwFT4+c3yaxlQRor1WofhDfT88HPiyB4QqQ
Kdx2EHgbQxJp4vkzhp4/OXlTfkihsMEn8egzZuphdPEQ9Y+Jdwg=
=aN/V
-----END PGP SIGNATURE-----
Merge tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
"For this merge window we're splitting BPF pull request into three for
higher visibility: main changes, res_spin_lock, try_alloc_pages.
These are the main BPF changes:
- Add DFA-based live registers analysis to improve verification of
programs with loops (Eduard Zingerman)
- Introduce load_acquire and store_release BPF instructions and add
x86, arm64 JIT support (Peilin Ye)
- Fix loop detection logic in the verifier (Eduard Zingerman)
- Drop unnecessary lock in bpf_map_inc_not_zero() (Eric Dumazet)
- Add kfunc for populating cpumask bits (Emil Tsalapatis)
- Convert various shell based tests to selftests/bpf/test_progs
format (Bastien Curutchet)
- Allow passing referenced kptrs into struct_ops callbacks (Amery
Hung)
- Add a flag to LSM bpf hook to facilitate bpf program signing
(Blaise Boscaccy)
- Track arena arguments in kfuncs (Ihor Solodrai)
- Add copy_remote_vm_str() helper for reading strings from remote VM
and bpf_copy_from_user_task_str() kfunc (Jordan Rome)
- Add support for timed may_goto instruction (Kumar Kartikeya
Dwivedi)
- Allow bpf_get_netns_cookie() in cgroup_skb programs (Mahe Tardy)
- Reduce bpf_cgrp_storage_busy false positives when accessing cgroup
local storage (Martin KaFai Lau)
- Introduce bpf_dynptr_copy() kfunc (Mykyta Yatsenko)
- Allow retrieving BTF data with BTF token (Mykyta Yatsenko)
- Add BPF kfuncs to set and get xattrs with 'security.bpf.' prefix
(Song Liu)
- Reject attaching programs to noreturn functions (Yafang Shao)
- Introduce pre-order traversal of cgroup bpf programs (Yonghong
Song)"
* tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (186 commits)
selftests/bpf: Add selftests for load-acquire/store-release when register number is invalid
bpf: Fix out-of-bounds read in check_atomic_load/store()
libbpf: Add namespace for errstr making it libbpf_errstr
bpf: Add struct_ops context information to struct bpf_prog_aux
selftests/bpf: Sanitize pointer prior fclose()
selftests/bpf: Migrate test_xdp_vlan.sh into test_progs
selftests/bpf: test_xdp_vlan: Rename BPF sections
bpf: clarify a misleading verifier error message
selftests/bpf: Add selftest for attaching fexit to __noreturn functions
bpf: Reject attaching fexit/fmod_ret to __noreturn functions
bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storage
bpf: Make perf_event_read_output accessible in all program types.
bpftool: Using the right format specifiers
bpftool: Add -Wformat-signedness flag to detect format errors
selftests/bpf: Test freplace from user namespace
libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID
bpf: Return prog btf_id without capable check
bpf: BPF token support for BPF_BTF_GET_FD_BY_ID
bpf, x86: Fix objtool warning for timed may_goto
bpf: Check map->record at the beginning of check_and_free_fields()
...
- Decouple mixed mode startup code from the traditional x86 decompressor
- Revert zero-length file hack in efivarfs
- Prevent EFI zboot from using the CopyMem/SetMem boot services after
ExitBootServices()
- Update EFI zboot to use the ZLIB/ZSTD library interfaces directly
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZ9vAPwAKCRAwbglWLn0t
XNsFAQCq4zXmbHnFl8gR3rq06f2gR3DKPfUBGVnyfaP/77ag0AD6Alzm4Pg014cL
GsZPQf38uGnygMTGYsU1HdE8EugFFQY=
=UXC0
-----END PGP SIGNATURE-----
Merge tag 'efi-next-for-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
- Decouple mixed mode startup code from the traditional x86
decompressor
- Revert zero-length file hack in efivarfs
- Prevent EFI zboot from using the CopyMem/SetMem boot services after
ExitBootServices()
- Update EFI zboot to use the ZLIB/ZSTD library interfaces directly
* tag 'efi-next-for-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/libstub: Avoid legacy decompressor zlib/zstd wrappers
efi/libstub: Avoid CopyMem/SetMem EFI services after ExitBootServices
efi: efibc: change kmalloc(size * count, ...) to kmalloc_array()
efivarfs: Revert "allow creation of zero length files"
x86/efi/mixed: Move mixed mode startup code into libstub
x86/efi/mixed: Simplify and document thunking logic
x86/efi/mixed: Remove dependency on legacy startup_32 code
x86/efi/mixed: Set up 1:1 mapping of lower 4GiB in the stub
x86/efi/mixed: Factor out and clean up long mode entry
x86/efi/mixed: Check CPU compatibility without relying on verify_cpu()
x86/efistub: Merge PE and handover entrypoints
Find first zero (FFZ) can be implemented by negating the
input and using find first set (FFS).
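A standalone C sketch of that equivalence (not the kernel's actual
implementation):

  static inline unsigned long sketch_ffz(unsigned long word)
  {
          /* Undefined for an all-ones word, matching the usual ffz() contract. */
          return __builtin_ctzl(~word);   /* first zero of word == first set bit of ~word */
  }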
Before/after code generation comparison on ffz()-using
kernel code shows that code generation has not changed:
# kernel/signal.o:
text data bss dec hex filename
42121 3472 8 45601 b221 signal.o.before
42121 3472 8 45601 b221 signal.o.after
md5:
ce4c31e1bce96af19b62a5f9659842f1 signal.o.before.asm
ce4c31e1bce96af19b62a5f9659842f1 signal.o.after.asm
[ mingo: Added code generation check. ]
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250327095641.131483-1-ubizjak@gmail.com
After commit c104c16073 ("Kunit to check the longest symbol length"),
there is a warning when building with clang because there is now a
definition of unlikely from compiler.h in tools/include/linux, which
conflicts with the one in the instruction decoder selftest:
arch/x86/tools/insn_decoder_test.c:15:9: warning: 'unlikely' macro redefined [-Wmacro-redefined]
Remove the second unlikely() definition, as it is no longer necessary,
clearing up the warning.
Fixes: c104c16073 ("Kunit to check the longest symbol length")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lore.kernel.org/r/20250318-x86-decoder-test-fix-unlikely-redef-v1-1-74c84a7bf05b@kernel.org
History of the performance regression:
======================================
Since the following series of user copy updates were merged upstream
~2 years ago via:
a562456643 ("Merge branch 'x86-rep-insns': x86 user copy clarifications")
.. copy_user_generic() on x86_64 stopped doing alignment of the
writes to the destination to an 8-byte boundary for the non-FSRM case.
Previously, this was done through the ALIGN_DESTINATION macro that
was used in the now removed copy_user_generic_unrolled function.
Turns out this change causes some loss of performance/throughput on
some use cases and specific CPUs/platforms without FSRM and ERMS.
Lately I got two reports of performance/throughput issues after a
RHEL 9 kernel pulled the same upstream series with updates to user
copy functions. Both reports consisted of running specific
networking/TCP related testing using iperf3.
Partial upstream fix
====================
The first report was related to a Linux Bridge testing using VMs on a
specific machine with an AMD CPU (EPYC 7402), and after a brief
investigation it turned out that the later change via:
ca96b162bf ("x86: bring back rep movsq for user access on CPUs without ERMS")
... helped/fixed the performance issue.
However, after that later commit/fix was applied, I got another
regression report: a multistream TCP test on a 100Gbit mlx5 NIC, also
running on an AMD based platform (AMD EPYC 7302 CPU), again using
iperf3 to run the test. That regression showed up even with the later
fix/commit applied, so that fix alone didn't tell the whole story.
Testing performed to pinpoint residual regression
=================================================
So I narrowed down the second regression use case, running it
without traffic through a NIC, on localhost, to isolate CPU usage and
avoid being limited by other factors like network bandwidth.
I used another system, also with an AMD CPU (AMD EPYC 7742). Basically,
I ran iperf3 in server and client mode on the same system, for example:
- Start the server binding it to CPU core/thread 19:
$ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000
- Start the client always binding/running on CPU core/thread 17, using
perf to get statistics:
$ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
-n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \
> stat-19.txt
The client was always running/pinned to CPU 17, while the iperf3
server was run pinned to CPU 19, 21 or 23, or not pinned to any
specific CPU. So the test basically consisted of four runs of the same
commands, changing only the CPU to which the server is pinned, or
dropping the taskset call before the server command to leave it
unpinned. The CPUs were chosen based on the NUMA node they were on;
this is the relevant output of lscpu on the system:
$ lscpu
...
Model name: AMD EPYC 7742 64-Core Processor
...
Caches (sum of all):
L1d: 2 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 32 MiB (64 instances)
L3: 256 MiB (16 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
NUMA node1 CPU(s): 2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123
NUMA node2 CPU(s): 4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125
NUMA node3 CPU(s): 6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127
...
So when picking a CPU for the server run, I chose CPUs that were not
on the same node as the client. With that, I was able to measure
relevant performance differences when changing the alignment of the
writes to the destination in copy_user_generic().
Testing shows up to +81% performance improvement under iperf3
=============================================================
Here's a summary of the iperf3 runs:
# Vanilla upstream alignment:
CPU RATE SYS TIME sender-receiver
Server bind 19: 13.0Gbits/sec 28.371851000 33.233499566 86.9%-70.8%
Server bind 21: 12.9Gbits/sec 28.283381000 33.586486621 85.8%-69.9%
Server bind 23: 11.1Gbits/sec 33.660190000 39.012243176 87.7%-64.5%
Server bind none: 18.9Gbits/sec 19.215339000 22.875117865 86.0%-80.5%
# With the attached patch (aligning writes in non ERMS/FSRM case):
CPU RATE SYS TIME sender-receiver
Server bind 19: 20.8Gbits/sec 14.897284000 20.811101382 75.7%-89.0%
Server bind 21: 20.4Gbits/sec 15.205055000 21.263165909 75.4%-89.7%
Server bind 23: 20.2Gbits/sec 15.433801000 21.456175000 75.5%-89.8%
Server bind none: 26.1Gbits/sec 12.534022000 16.632447315 79.8%-89.6%
So I consistently got better results when aligning the writes. The
results above were obtained on 6.14.0-rc6/rc7 based kernels. SYS is
system time and TIME is the total time to run/transfer 50G of data.
The last field is the CPU usage of the sender/receiver iperf3
processes. It's also worth noting that each pair of iperf3 runs may
produce slightly different results, but I always got consistently
higher results with the write alignment for this specific test of
running the processes on CPUs in different NUMA nodes.
Linus Torvalds helped/provided this version of the patch. Initially I
proposed a version which aligned writes for all cases in
rep_movs_alternative, however it used two extra registers, so Linus
provided an enhanced version that only aligns the write in the
large_movsq case. That is sufficient, since the problem happens only
on AMD CPUs without ERMS/FSRM like the ones mentioned above, and it
also doesn't require using extra registers. I also validated that
aligning only in the large_movsq case is really enough to get the
performance back.
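For illustration only, here is the idea in plain C rather than the actual
rep_movs_alternative assembly: a short byte copy until the destination is
8-byte aligned, then the bulk moved in 8-byte chunks (standing in for
MOVSQ), then the tail bytes.

  #include <stdint.h>
  #include <string.h>

  static void *copy_align_dst(void *dst, const void *src, size_t len)
  {
          unsigned char *d = dst;
          const unsigned char *s = src;
          size_t head = (8 - ((uintptr_t)d & 7)) & 7;

          if (head > len)
                  head = len;
          len -= head;
          while (head--)                  /* byte copy until dst is 8-byte aligned */
                  *d++ = *s++;
          while (len >= 8) {              /* aligned 8-byte moves, MOVSQ-like */
                  memcpy(d, s, 8);
                  d += 8;
                  s += 8;
                  len -= 8;
          }
          while (len--)                   /* tail bytes */
                  *d++ = *s++;
          return dst;
  }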
I also tested this patch on an old Intel based non-ERMS/FSRM system
(with a Xeon E5-2667 - Sandy Bridge based) and didn't see any problems:
no performance enhancement, but no regression either, using the
same iperf3 based benchmark. Also, newer Intel processors after
Sandy Bridge usually have ERMS and should not be affected by this change.
[ mingo: Updated the changelog. ]
Fixes: ca96b162bf ("x86: bring back rep movsq for user access on CPUs without ERMS")
Fixes: 034ff37d34 ("x86: rewrite '__copy_user_nocache' function")
Reported-by: Ondrej Lichtner <olichtne@redhat.com>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250320142213.2623518-1-herton@redhat.com
When verify_sha256_digest() fails, __apply_microcode_amd() should propagate
the failure by returning false (and not -1 which is promoted to true).
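An illustrative snippet of the bug class (not the actual driver code): in a
bool-returning function, returning -1 is promoted to true, so the failure is
silently reported as success.

  static bool apply_patch_checked(const void *patch)
  {
          if (!digest_is_valid(patch))    /* hypothetical verification helper */
                  return false;           /* the buggy version did 'return -1;', i.e. true */

          /* ... apply the patch ... */
          return true;
  }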
Fixes: 50cef76d5c ("x86/microcode/AMD: Load only SHA256-checksummed patches")
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250327230503.1850368-2-boris.ostrovsky@oracle.com
- Removal of support for IBM Cell Blades
- SMP support for microwatt platform
- Support for inline static calls on PPC32
- Enable pmu selftests for power11 platform
- Enable hardware trace macro (HTM) hcall support
- Support for limited address mode capability
- Changes to RMA size from 512 MB to 768 MB to handle fadump
- Misc fixes and cleanups
Thanks to: Abhishek Dubey, Amit Machhiwal, Andreas Schwab, Arnd Bergmann,
Athira Rajeev, Avnish Chouhan, Christophe Leroy, Disha Goel, Donet Tom, Gaurav
Batra, Gautam Menghani, Hari Bathini, Kajol Jain, Kees Cook, Mahesh Salgaonkar,
Michael Ellerman, Paul Mackerras, Ritesh Harjani (IBM), Sathvika Vasireddy,
Segher Boessenkool, Sourabh Jain, Vaibhav Jain, Venkat Rao Bagalkote.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqX2DNAOgU8sBX3pRpnEsdPSHZJQFAmfjZnYACgkQpnEsdPSH
ZJTmKg/+NGW7wyFY9d8Iai9ncYY7GSzsMSDTaan7qg0QWOd5gjHbsdbava7TM/DW
8p9XsC+17kSeftRNUjtc52bSN8Ei2gBdsXIagQG1alfB2X2e6wkNauifK+dz3Su6
usMEZZTO5R/jFDotFXNM1nsUj+8dvjnPgOUrji/P8k7PT5295wpza0hz1fy5SrOA
hM5cliBP36UgFe5Efvgm4OUX2gQIhbc3stt9MVfymW/k0Mit5f41UIPuVGiTWowY
s0cUJGkhxUlGXT3VfOVKuZfn4u9KMha7UCl9afSceJzXOdnUIKIbskui1VEv6cD/
iSIxi839uErAobFHlsLYprgYFciYLII3xe2qNZCA/ZxeIMS/Mm6xokESeWLhBnfa
P7ke6l0z3GDtTvgI2eSeU9BdrVveF1NgbP9GYSKgT6gtw/kRRnxgHF8tzmLON5PT
KXpQlzz8VuSBRtF2jnLFU89+FFwSA1bRUhDrp89HyYFqw1B5g4N7kFFTUJWHOuKS
fwPGy+cveKehmCUBedeTRFqHvvqdwpD/WnPlQzCly3WxqdL8U/eTXYftMiAwuK28
ovLuSs3vRThKRQ8DnUa5oB0UGsjMpRV5LdvYkhw+x8mZKUR59oj4fx2ae4TtPakg
dbAYuPPkCORdaSga/nV6vQgsLprFpcGX3dq6E19+BVBAY5D+1PE=
=GFcj
-----END PGP SIGNATURE-----
Merge tag 'powerpc-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Madhavan Srinivasan:
- Remove support for IBM Cell Blades
- SMP support for microwatt platform
- Support for inline static calls on PPC32
- Enable pmu selftests for power11 platform
- Enable hardware trace macro (HTM) hcall support
- Support for limited address mode capability
- Changes to RMA size from 512 MB to 768 MB to handle fadump
- Misc fixes and cleanups
Thanks to Abhishek Dubey, Amit Machhiwal, Andreas Schwab, Arnd Bergmann,
Athira Rajeev, Avnish Chouhan, Christophe Leroy, Disha Goel, Donet Tom,
Gaurav Batra, Gautam Menghani, Hari Bathini, Kajol Jain, Kees Cook,
Mahesh Salgaonkar, Michael Ellerman, Paul Mackerras, Ritesh Harjani
(IBM), Sathvika Vasireddy, Segher Boessenkool, Sourabh Jain, Vaibhav
Jain, and Venkat Rao Bagalkote.
* tag 'powerpc-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (61 commits)
powerpc/kexec: fix physical address calculation in clear_utlb_entry()
crypto: powerpc: Mark ghashp8-ppc.o as an OBJECT_FILES_NON_STANDARD
powerpc: Fix 'intra_function_call not a direct call' warning
powerpc/perf: Fix ref-counting on the PMU 'vpa_pmu'
KVM: PPC: Enable CAP_SPAPR_TCE_VFIO on pSeries KVM guests
powerpc/prom_init: Fixup missing #size-cells on PowerBook6,7
powerpc/microwatt: Add SMP support
powerpc: Define config option for processors with broadcast TLBIE
powerpc/microwatt: Define an idle power-save function
powerpc/microwatt: Device-tree updates
powerpc/microwatt: Select COMMON_CLK in order to get the clock framework
net: toshiba: Remove reference to PPC_IBM_CELL_BLADE
net: spider_net: Remove powerpc Cell driver
cpufreq: ppc_cbe: Remove powerpc Cell driver
genirq: Remove IRQ_EDGE_EOI_HANDLER
docs: Remove reference to removed CBE_CPUFREQ_SPU_GOVERNOR
powerpc: Remove UDBG_RTAS_CONSOLE
powerpc/io: Use standard barrier macros in io.c
powerpc/io: Rename _insw_ns() etc.
powerpc/io: Use generic raw accessors
...
kunit tool:
- Changes to kunit tool to use qboot on QEMU x86_64, and build GDB scripts.
- Fixes kunit tool bug in parsing test plan.
- Adds test to kunit tool to check parsing late test plan.
kunit:
- Clarifies kunit_skip() argument name.
- Adds Kunit check for the longest symbol length.
- Changes qemu_configs for sparc to use Zilog console.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmfkvDYACgkQCwJExA0N
Qxwljg//ZRoF/Jncvlb0vapnOIYywHbJEPRVTKfNurRjhb7stAX7CpLKXing4Gtq
ewy3UXRaAZKg1BvugDYWoUsDDD5o7jx6y9rOMOWM+aAHPzYgxY6gbIzyUVolNZg/
50/ANMhT0bvME8KBB2k2l6p1NAblzOpH3zH35CCDL/40eVodwMPrhq0V5AqccOaE
C5Bn+tDiviS6Icw+b/mVUw8fvmoJSTSKvdjaSeRAqThJN3KtqBVyX383++A1zNqy
Y6tItu9wG06FDjuQ1miOlSMwhgMEYK4TS4GwbX4PUucR8ETaZNUXVviMRou7vMEa
GGOdtsBG3CBgFNtO2VK1qJLWbJesw2G9+w2oIZ2KQKtyfoF7nDMj+DBO2QD/T+GB
u2g/xlSDJ5PTzZBMVKENDMy+C9Q+ux8Y2PsQ0fTCdpYgadytKYBFA23EAiZaMdKa
d1AweNvFS5gi8WkpS8SyMjs0D5pZnKMgHQqOIfRFjCi0HXsGE9RJfkOjLOzRnaOc
zldLAgDcrhtdG8Xin08bux5UuCoqg/e/RJiXF+xQLLJkE7cltN/CuWMrHX4kija+
8xmJtj4Oe0p7JCwnIaXjLAQDuFfxHYHM9wM0nKm+YpVJLPSWqSXk4+xtQEOlvZhN
DJW61ez+pYVCmXuIZ/bgeRzpwXJMfALmI3kn+UtCYwqdTt6Xhp8=
=h8xS
-----END PGP SIGNATURE-----
Merge tag 'linux_kselftest-kunit-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit updates from Shuah Khan:
"kunit tool:
- Changes to kunit tool to use qboot on QEMU x86_64, and build GDB
scripts
- Fixes kunit tool bug in parsing test plan
- Adds test to kunit tool to check parsing late test plan
kunit:
- Clarifies kunit_skip() argument name
- Adds Kunit check for the longest symbol length
- Changes qemu_configs for sparc to use Zilog console"
* tag 'linux_kselftest-kunit-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: tool: add test to check parsing late test plan
kunit: tool: Fix bug in parsing test plan
Kunit to check the longest symbol length
kunit: Clarify kunit_skip() argument name
kunit: tool: Build GDB scripts
kunit: qemu_configs: sparc: use Zilog console
kunit: tool: Use qboot on QEMU x86_64
* Move vm_table members out of kernel/sysctl.c
All vm_table array members have moved to their respective subsystems leading
to the removal of vm_table from kernel/sysctl.c. This increases modularity by
placing the ctl_tables closer to where they are actually used and at the same
time reducing the chances of merge conflicts in kernel/sysctl.c.
* ctl_table range fixes
Replace the proc_handler function that checks variable ranges in
coredump_sysctls and vdso_table with the one that actually uses the extra{1,2}
pointers as min/max values. This tightens the range of the values that users
can pass into the kernel effectively preventing {under,over}flows.
* Misc fixes
Correct grammar errors and typos in test messages. Update sysctl files in
MAINTAINERS. Constified and removed array size in declaration for
alignment_tbl
* Testing
- These have all been in linux-next for at least 1 month
- They have gone through 0-day
- Ran all these through sysctl selftests in x86_64
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmfhV8EACgkQupfNUreW
QU/udAv/VCXGkndQsJ5biXpXYFnokX0gIEaYzzHiqrFycZqr8ys0/wWzc+ar1LjF
Jvanl2uKB0mUviLKt7Gk0+Hri+PJlYIrbx+5K5eo2wsKUUxFykqLLm59y/orPODl
gyPQjKNpHJb7COsnEc3Lrq/fvol4NPHlcBPXG8NwehccTeBHZ1ninfo+pSnxh3o8
kI3GSLLxD4K9AgBl5QuVWH4gU7o//u7lUkKzy03NW+2jmuRv3dRcYF7IdgMINNee
AeXnygdSBxLzECBvmkfNdyg+AmL8hdsmzbsIh7UuJDvxLlQOInVLZa+sXBotCOIc
TImCrr1Ws1OuGrD0kpH+21tJvc8pNFWt61QlulObQdrLndWHdZEGyGOusLpXTwbn
jIWZmMvzk1foSwdgzwPFzUqPEpW3FrBVDo4Z4kenBDrCp56QTX7hGRvkNYJNKvot
Ue+i8BeHR/Gm/p+UMqgsSTOaNJXTqZhFqwJQVzxU/9LN/vkS0On6fbjgBd5X6Pn+
a5dlc9gy
=0bcX
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Move vm_table members out of kernel/sysctl.c
All vm_table array members have moved to their respective subsystems
leading to the removal of vm_table from kernel/sysctl.c. This
increases modularity by placing the ctl_tables closer to where they
are actually used and at the same time reducing the chances of merge
conflicts in kernel/sysctl.c.
- ctl_table range fixes
Replace the proc_handler function that checks variable ranges in
coredump_sysctls and vdso_table with the one that actually uses the
extra{1,2} pointers as min/max values. This tightens the range of the
values that users can pass into the kernel effectively preventing
{under,over}flows.
- Misc fixes
Correct grammar errors and typos in test messages. Update sysctl
files in MAINTAINERS. Constified and removed array size in
declaration for alignment_tbl
* tag 'sysctl-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (22 commits)
selftests/sysctl: fix wording of help messages
selftests: fix spelling/grammar errors in sysctl/sysctl.sh
MAINTAINERS: Update sysctl file list in MAINTAINERS
sysctl: Fix underflow value setting risk in vm_table
coredump: Fixes core_pipe_limit sysctl proc_handler
sysctl: remove unneeded include
sysctl: remove the vm_table
sh: vdso: move the sysctl to arch/sh/kernel/vsyscall/vsyscall.c
x86: vdso: move the sysctl to arch/x86/entry/vdso/vdso32-setup.c
fs: dcache: move the sysctl to fs/dcache.c
sunrpc: simplify rpcauth_cache_shrink_count()
fs: drop_caches: move sysctl to fs/drop_caches.c
fs: fs-writeback: move sysctl to fs/fs-writeback.c
mm: nommu: move sysctl to mm/nommu.c
security: min_addr: move sysctl to security/min_addr.c
mm: mmap: move sysctl to mm/mmap.c
mm: util: move sysctls to mm/util.c
mm: vmscan: move vmscan sysctls to mm/vmscan.c
mm: swap: move sysctl to mm/swap.c
mm: filemap: move sysctl to mm/filemap.c
...
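The ctl_table range fixes above boil down to using a proc handler that
actually honors the extra1/extra2 bounds. A minimal sketch of that pattern,
with illustrative names and limits:

  static int example_value;
  static const int example_min = 0;
  static const int example_max = 100;

  static struct ctl_table example_table[] = {
          {
                  .procname     = "example_value",
                  .data         = &example_value,
                  .maxlen       = sizeof(int),
                  .mode         = 0644,
                  .proc_handler = proc_dointvec_minmax,  /* rejects writes outside extra1..extra2 */
                  .extra1       = (void *)&example_min,
                  .extra2       = (void *)&example_max,
          },
  };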
Direct HLT instruction execution causes #VEs for TDX VMs, which are routed
to the hypervisor via TDCALL. safe_halt() routines execute HLT in the STI-shadow,
so IRQs need to remain disabled until the TDCALL to ensure that pending
IRQs are correctly treated as wake events.
Emit a warning and fail the emulation if IRQs are enabled during HLT #VE
handling, to avoid running into scenarios where IRQ wake events are lost,
resulting in indefinite HLT execution times.
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Ryan Afranji <afranji@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250228014416.3925664-4-vannapurve@google.com
Direct HLT instruction execution causes #VEs for TDX VMs, which are routed
to the hypervisor via TDCALL. If HLT is executed in the STI-shadow, the
resulting #VE handler will enable interrupts before the TDCALL is routed to
the hypervisor, leading to missed wakeup events, as the current TDX spec
doesn't expose interruptibility state information to allow the #VE handler
to selectively enable interrupts.
Commit bfe6ed0c67 ("x86/tdx: Add HLT support for TDX guests")
prevented the idle routines from executing the HLT instruction in the
STI-shadow. But it missed the paravirt routine, which can be reached via
this path, for example:
kvm_wait() =>
safe_halt() =>
raw_safe_halt() =>
arch_safe_halt() =>
irq.safe_halt() =>
pv_native_safe_halt()
To reliably handle arch_safe_halt() for TDX VMs, introduce an explicit
dependency on CONFIG_PARAVIRT and override the paravirt halt()/safe_halt()
routines with TDX-safe versions that execute a direct TDCALL and the needed
interrupt flag updates. Executing a direct TDCALL brings the additional
benefit of avoiding HLT related #VEs altogether.
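A minimal sketch of the override described above; the routine names follow
this changelog, while the init-time hook name is an assumption:

  static void __init tdx_setup_halt_ops(void)     /* assumed init-time hook */
  {
          pv_ops.irq.halt      = tdx_halt;        /* direct TDCALL, IRQs stay disabled */
          pv_ops.irq.safe_halt = tdx_safe_halt;   /* direct TDCALL with wakeup-safe IRQ handling */
  }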
As tested by Ryan Afranji:
"Tested with the specjbb2015 benchmark. It has heavy lock contention which leads
to many halt calls. TDX VMs suffered a poor score before this patchset.
Verified the major performance improvement with this patchset applied."
Fixes: bfe6ed0c67 ("x86/tdx: Add HLT support for TDX guests")
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Ryan Afranji <afranji@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250228014416.3925664-3-vannapurve@google.com
CONFIG_PARAVIRT_XXL is mainly defined/used by Xen PV guests. For
other VM guest types, the features supported under CONFIG_PARAVIRT
are self-sufficient. CONFIG_PARAVIRT mainly provides support for
TLB flush operations and time-related operations.
For TDX guests as well, the paravirt calls under CONFIG_PARAVIRT meet
most of their requirements, except for the HLT and SAFE_HLT
paravirt calls, which are currently defined under
CONFIG_PARAVIRT_XXL.
Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX-guest-like
platforms, move the HLT and SAFE_HLT paravirt calls under
CONFIG_PARAVIRT.
Moving the HLT and SAFE_HLT paravirt calls is not fatal and should not
break any functionality for current users of CONFIG_PARAVIRT.
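Conceptually, the move amounts to something like the following (a sketch of
the intent, not the literal upstream layout): the halt()/safe_halt() members
of the paravirt IRQ ops are no longer guarded by CONFIG_PARAVIRT_XXL.

  /* Sketch of the intent, not the exact upstream structure. */
  struct pv_irq_ops {
  #ifdef CONFIG_PARAVIRT_XXL
          /* Flag save/restore and friends remain Xen-PV (XXL) only. */
          struct paravirt_callee_save save_fl;
          struct paravirt_callee_save irq_disable;
          struct paravirt_callee_save irq_enable;
  #endif
          /* Now reachable by plain CONFIG_PARAVIRT guests such as TDX. */
          void (*safe_halt)(void);
          void (*halt)(void);
  } __no_randomize_layout;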
Fixes: bfe6ed0c67 ("x86/tdx: Add HLT support for TDX guests")
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Ryan Afranji <afranji@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20250228014416.3925664-2-vannapurve@google.com
Another set of improvements to the kernel's CRC (cyclic redundancy
check) code:
- Rework the CRC64 library functions to be directly optimized, like what
I did last cycle for the CRC32 and CRC-T10DIF library functions.
- Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
support and acceleration for crc64_be and crc64_nvme.
- Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
crc_t10dif, crc64_be, and crc64_nvme.
- Remove crc_t10dif and crc64_rocksoft from the crypto API, since they
are no longer needed there.
- Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect.
- Add kunit test cases for crc64_nvme and crc7.
- Eliminate redundant functions for calculating the Castagnoli CRC32,
settling on just crc32c() (see the usage sketch after this list).
- Remove unnecessary prompts from some of the CRC kconfig options.
- Further optimize the x86 crc32c code.
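For the crc32c() consolidation mentioned above, typical usage after the
series reduces to the single library helper (a sketch; the wrapper name and
zero seed are illustrative):

  #include <linux/crc32c.h>

  static u32 example_checksum(const void *buf, size_t len)
  {
          /* 0 is a common seed; callers chain calls by passing in the prior CRC. */
          return crc32c(0, buf, len);
  }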
Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
"Another set of improvements to the kernel's CRC (cyclic redundancy
check) code:
- Rework the CRC64 library functions to be directly optimized, like
what I did last cycle for the CRC32 and CRC-T10DIF library
functions
- Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
support and acceleration for crc64_be and crc64_nvme
- Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
crc_t10dif, crc64_be, and crc64_nvme
- Remove crc_t10dif and crc64_rocksoft from the crypto API, since
they are no longer needed there
- Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect
- Add kunit test cases for crc64_nvme and crc7
- Eliminate redundant functions for calculating the Castagnoli CRC32,
settling on just crc32c()
- Remove unnecessary prompts from some of the CRC kconfig options
- Further optimize the x86 crc32c code"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
lib/crc: remove unnecessary prompt for CONFIG_CRC64
lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
lib/crc: remove unnecessary prompt for CONFIG_CRC8
lib/crc: remove unnecessary prompt for CONFIG_CRC7
lib/crc: remove unnecessary prompt for CONFIG_CRC4
lib/crc7: unexport crc7_be_syndrome_table
lib/crc_kunit.c: update comment in crc_benchmark()
lib/crc_kunit.c: add test and benchmark for crc7_be()
x86/crc32: optimize tail handling for crc32c short inputs
riscv/crc64: add Zbc optimized CRC64 functions
riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
riscv/crc32: reimplement the CRC32 functions using new template
riscv/crc: add "template" for Zbc optimized CRC functions
x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
x86/crc32: improve crc32c_arch() code generation with clang
x86/crc64: implement crc64_be and crc64_nvme using new template
x86/crc-t10dif: implement crc_t10dif using new template
x86/crc32: implement crc32_le using new template
x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
...
Merge tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Various minor updates to the LSM Rust bindings
Changes include marking trivial Rust bindings as inline and tweaking
comments to better reflect the LSM hooks.
- Add LSM/SELinux access controls to io_uring_allowed()
Similar to the io_uring_disabled sysctl, add an LSM hook to
io_uring_allowed() to give LSMs a simple way to enforce security
policy on the use of io_uring. This pull request includes SELinux
support for this new control using the io_uring/allowed permission
(a simplified sketch of the call flow follows this list).
- Remove an unused parameter from the security_perf_event_open() hook
The perf_event_attr struct parameter was not used by any currently
supported LSMs, remove it from the hook.
- Add an explicit MAINTAINERS entry for the credentials code
We've seen problems in the past where patches to the credentials code
sent by non-maintainers would often languish on the lists for
multiple months as there was no one explicitly tasked with the
responsibility of reviewing and/or merging credentials related code.
Considering that most of the code under security/ has a vested
interest in ensuring that the credentials code is well maintained,
I'm volunteering to look after the credentials code and Serge Hallyn
has also volunteered to step up as an official reviewer. I posted the
MAINTAINERS update as a RFC to LKML in hopes that someone else would
jump up with an "I'll do it!", but beyond Serge it was all crickets.
- Update Stephen Smalley's old email address to prevent confusion
This includes a corresponding update to the mailmap file.
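A simplified sketch of the resulting call flow for the io_uring control above
(the sysctl handling is abbreviated; only the new security_uring_allowed()
call is the point here):

  /* Simplified sketch of io_uring_allowed() with the new LSM hook. */
  static int io_uring_allowed(void)
  {
          /* Existing policy: sysctl value 2 disables io_uring entirely. */
          if (sysctl_io_uring_disabled == 2)
                  return -EPERM;

          /* New: give the active LSM(s) a chance to veto io_uring creation. */
          return security_uring_allowed();
  }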
* tag 'lsm-pr-20250323' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
mailmap: map Stephen Smalley's old email addresses
lsm: remove old email address for Stephen Smalley
MAINTAINERS: add Serge Hallyn as a credentials reviewer
MAINTAINERS: add an explicit credentials entry
cred,rust: mark Credential methods inline
lsm,rust: reword "destroy" -> "release" in SecurityCtx
lsm,rust: mark SecurityCtx methods inline
perf: Remove unnecessary parameter of security check
lsm: fix a missing security_uring_allowed() prototype
io_uring,lsm,selinux: add LSM hooks for io_uring_setup()
io_uring: refactor io_uring_allowed()
- Manage sysfs attributes and boost frequencies efficiently from
cpufreq core to reduce boilerplate code in drivers (Viresh Kumar).
- Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider,
Dhananjay Ugwekar, Imran Shaik, zuoqian).
- Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky
Bai).
- cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski).
- Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng).
- Optimize the amd-pstate driver to avoid cases where call paths end
up calling the same writes multiple times and needlessly caching
variables through code reorganization, locking overhaul and tracing
adjustments (Mario Limonciello, Dhananjay Ugwekar).
- Make it possible to avoid enabling capacity-aware scheduling (CAS) in
the intel_pstate driver and relocate a check for out-of-band (OOB)
platform handling in it to make it detect OOB before checking HWP
availability (Rafael Wysocki).
- Fix dbs_update() to avoid inadvertent conversions of negative integer
values to unsigned int, which cause CPU frequency selection to be
inaccurate in some cases when the "conservative" cpufreq governor is
in use (Jie Zhan).
- Update the handling of the most recent idle intervals in the menu
cpuidle governor to prevent useful information from being discarded
by it in some cases and improve the prediction accuracy (Rafael
Wysocki).
- Make it possible to tell the intel_idle driver to ignore its built-in
table of idle states for the given processor, clean up the handling
of auto-demotion disabling on Baytrail and Cherrytrail chips in it,
and update its MAINTAINERS entry (David Arcari, Artem Bityutskiy,
Rafael Wysocki).
- Make some cpuidle drivers use for_each_present_cpu() instead of
for_each_possible_cpu() during initialization to avoid issues
occurring when nosmp or maxcpus=0 are used (Jacky Bai).
- Clean up the Energy Model handling code somewhat (Rafael Wysocki).
- Use kfree_rcu() to simplify the handling of runtime Energy Model
updates (Li RongQing).
- Add an entry for the Energy Model framework to MAINTAINERS as
properly maintained (Lukasz Luba).
- Address RCU-related sparse warnings in the Energy Model code (Rafael
Wysocki).
- Remove ENERGY_MODEL dependency on SMP and allow it to be selected
when DEVFREQ is set without CPUFREQ so it can be used on a wider
range of systems (Jeson Gao).
- Unify error handling during runtime suspend and runtime resume in the
core to help drivers to implement more consistent runtime PM error
handling (Rafael Wysocki).
- Drop a redundant check from pm_runtime_force_resume() and rearrange
documentation related to __pm_runtime_disable() (Rafael Wysocki).
- Rework the handling of the "smart suspend" driver flag in the PM core
to avoid issues that may occur when drivers using it depend on some
other drivers and clean up the related PM core code (Rafael Wysocki,
Colin Ian King).
- Fix the handling of devices with the power.direct_complete flag set
if device_suspend() returns an error for at least one device to avoid
situations in which some of them may not be resumed (Rafael Wysocki).
- Use mutex_trylock() in hibernate_compressor_param_set() to avoid a
possible deadlock that may occur if the "compressor" hibernation
module parameter is accessed during the registration of a new
ieee80211 device (Lizhi Xu).
- Suppress sleeping parent warning in device_pm_add() in the case when
new children are added under a device with the power.direct_complete
set after it has been processed by device_resume() (Xu Yang).
- Remove needless return in three void functions related to system
wakeup (Zijun Hu).
- Replace deprecated kmap_atomic() with kmap_local_page() in the
hibernation core code (David Reaver).
- Remove unused helper functions related to system sleep (David Alan
Gilbert).
- Clean up s2idle_enter() so it does not lock and unlock CPU offline
in vain and update comments in it (Ulf Hansson).
- Clean up broken white space in dpm_wait_for_children() (Geert
Uytterhoeven).
- Update the cpupower utility to fix lib versioning in it and memory
leaks in error legs, remove hard-coded values, and implement CPU
physical core querying (Thomas Renninger, John B. Wyatt IV, Shuah
Khan, Yiwei Lin, Zhongqiu Han).
Merge tag 'pm-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These are dominated by cpufreq updates which in turn are dominated by
updates related to boost support in the core and drivers and
amd-pstate driver optimizations.
Apart from the above, there are some cpuidle updates including a
rework of the most recent idle intervals handling in the venerable
menu governor that leads to significant improvements in some
performance benchmarks, as the governor is now more likely to predict
a shorter idle duration in some cases, and there are updates of the
core device power management code, mostly related to system suspend
and resume, that should help to avoid potential issues arising when
the drivers of devices depending on one another want to use different
optimizations.
There is also a usual collection of assorted fixes and cleanups,
including removal of some unused code.
Specifics:
- Manage sysfs attributes and boost frequencies efficiently from
cpufreq core to reduce boilerplate code in drivers (Viresh Kumar)
- Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider,
Dhananjay Ugwekar, Imran Shaik, zuoqian)
- Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky
Bai)
- cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski)
- Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng)
- Optimize the amd-pstate driver to avoid cases where call paths end
up calling the same writes multiple times and needlessly caching
variables through code reorganization, locking overhaul and tracing
adjustments (Mario Limonciello, Dhananjay Ugwekar)
- Make it possible to avoid enabling capacity-aware scheduling (CAS)
in the intel_pstate driver and relocate a check for out-of-band
(OOB) platform handling in it to make it detect OOB before checking
HWP availability (Rafael Wysocki)
- Fix dbs_update() to avoid inadvertent conversions of negative
integer values to unsigned int, which cause CPU frequency selection
to be inaccurate in some cases when the "conservative" cpufreq
governor is in use (Jie Zhan)
- Update the handling of the most recent idle intervals in the menu
cpuidle governor to prevent useful information from being discarded
by it in some cases and improve the prediction accuracy (Rafael
Wysocki)
- Make it possible to tell the intel_idle driver to ignore its
built-in table of idle states for the given processor, clean up the
handling of auto-demotion disabling on Baytrail and Cherrytrail
chips in it, and update its MAINTAINERS entry (David Arcari, Artem
Bityutskiy, Rafael Wysocki)
- Make some cpuidle drivers use for_each_present_cpu() instead of
for_each_possible_cpu() during initialization to avoid issues
occurring when nosmp or maxcpus=0 are used (Jacky Bai)
- Clean up the Energy Model handling code somewhat (Rafael Wysocki)
- Use kfree_rcu() to simplify the handling of runtime Energy Model
updates (Li RongQing)
- Add an entry for the Energy Model framework to MAINTAINERS as
properly maintained (Lukasz Luba)
- Address RCU-related sparse warnings in the Energy Model code
(Rafael Wysocki)
- Remove ENERGY_MODEL dependency on SMP and allow it to be selected
when DEVFREQ is set without CPUFREQ so it can be used on a wider
range of systems (Jeson Gao)
- Unify error handling during runtime suspend and runtime resume in
the core to help drivers to implement more consistent runtime PM
error handling (Rafael Wysocki)
- Drop a redundant check from pm_runtime_force_resume() and rearrange
documentation related to __pm_runtime_disable() (Rafael Wysocki)
- Rework the handling of the "smart suspend" driver flag in the PM
core to avoid issues that may occur when drivers using it depend on
some other drivers and clean up the related PM core code (Rafael
Wysocki, Colin Ian King)
- Fix the handling of devices with the power.direct_complete flag set
if device_suspend() returns an error for at least one device to
avoid situations in which some of them may not be resumed (Rafael
Wysocki)
- Use mutex_trylock() in hibernate_compressor_param_set() to avoid a
possible deadlock that may occur if the "compressor" hibernation
module parameter is accessed during the registration of a new
ieee80211 device (Lizhi Xu)
- Suppress sleeping parent warning in device_pm_add() in the case
when new children are added under a device with the
power.direct_complete set after it has been processed by
device_resume() (Xu Yang)
- Remove needless return in three void functions related to system
wakeup (Zijun Hu)
- Replace deprecated kmap_atomic() with kmap_local_page() in the
hibernation core code (David Reaver)
- Remove unused helper functions related to system sleep (David Alan
Gilbert)
- Clean up s2idle_enter() so it does not lock and unlock CPU offline
in vain and update comments in it (Ulf Hansson)
- Clean up broken white space in dpm_wait_for_children() (Geert
Uytterhoeven)
- Update the cpupower utility to fix lib versioning in it and memory
leaks in error legs, remove hard-coded values, and implement CPU
physical core querying (Thomas Renninger, John B. Wyatt IV, Shuah
Khan, Yiwei Lin, Zhongqiu Han)"
* tag 'pm-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (139 commits)
PM: sleep: Fix bit masking operation
dt-bindings: cpufreq: cpufreq-qcom-hw: Narrow properties on SDX75, SA8775p and SM8650
dt-bindings: cpufreq: cpufreq-qcom-hw: Drop redundant minItems:1
dt-bindings: cpufreq: cpufreq-qcom-hw: Add missing constraint for interrupt-names
dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCS8300 compatible
cpufreq: Init cpufreq only for present CPUs
PM: sleep: Fix handling devices with direct_complete set on errors
cpuidle: Init cpuidle only for present CPUs
PM: clk: Remove unused pm_clk_remove()
PM: sleep: core: Fix indentation in dpm_wait_for_children()
PM: s2idle: Extend comment in s2idle_enter()
PM: s2idle: Drop redundant locks when entering s2idle
PM: sleep: Remove unused pm_generic_ wrappers
cpufreq: tegra186: Share policy per cluster
cpupower: Make lib versioning scheme more obvious and fix version link
PM: EM: Rework the depends on for CONFIG_ENERGY_MODEL
PM: EM: Address RCU-related sparse warnings
cpupower: Implement CPU physical core querying
pm: cpupower: remove hard-coded topology depth values
pm: cpupower: Fix cmd_monitor() error legs to free cpu_topology
...
- Use the str_on_off() helper function instead of hard-coded strings in
the ACPI power resources handling code (Thorsten Blum).
- Add fan speed reporting for ACPI fans that have _FST, but otherwise
do not support the entire ACPI 4 fan interface (Joshua Grisham).
- Fix a stale comment regarding trip points in acpi_thermal_add() that
diverged from the commented code after removing _CRT evaluation from
acpi_thermal_get_trip_points() (xueqin Luo).
- Make ACPI button driver also subscribe to system events (Mario
Limonciello).
- Use the str_yes_no() helper function instead of hard-coded strings in
the ACPI backlight (video) driver (Thorsten Blum).
- Add a missing header file include to the x86 arch CPPC code (Mario
Limonciello).
- Rework the sysfs attributes implementation in the ACPI platform-profile
driver and improve the unregistration code in it (Nathan Chancellor,
Kurt Borja).
- Prevent the ACPI HED driver from being built as a module and change
its initcall level to subsys_initcall to avoid initialization ordering
issues related to it (Xiaofei Tan).
- Update a maintainer email address in the ACPI PMIC entry in
MAINTAINERS (Mika Westerberg).
- Address GCC 15's -Wunterminated-string-initialization warning in
the core PNP subsystem code and remove some dead code from it (Kees
Cook, David Alan Gilbert).
Merge tag 'acpi-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"From the functional perspective, the most significant changes here are
the ACPI fan driver update allowing it to handle fans with
fine-grained state checking supported, but without fine-grained
control, and the ACPI button driver update making it subscribe to
system event notifications (in addition to device notifications), which
on some systems is required for waking up the system from sleep.
The rest is fixes and cleanups including removal of some dead code.
Specifics:
- Use the str_on_off() helper function instead of hard-coded strings
in the ACPI power resources handling code (Thorsten Blum)
- Add fan speed reporting for ACPI fans that have _FST, but otherwise
do not support the entire ACPI 4 fan interface (Joshua Grisham)
- Fix a stale comment regarding trip points in acpi_thermal_add()
that diverged from the commented code after removing _CRT
evaluation from acpi_thermal_get_trip_points() (xueqin Luo)
- Make ACPI button driver also subscribe to system events (Mario
Limonciello)
- Use the str_yes_no() helper function instead of hard-coded strings
in the ACPI backlight (video) driver (Thorsten Blum)
- Add a missing header file include to the x86 arch CPPC code (Mario
Limonciello)
- Rework the sysfs attributes implementation in the ACPI
platform-profile driver and improve the unregistration code in it
(Nathan Chancellor, Kurt Borja)
- Prevent the ACPI HED driver from being built as a module and change
its initcall level to subsys_initcall to avoid initialization
ordering issues related to it (Xiaofei Tan)
- Update a maintainer email address in the ACPI PMIC entry in
MAINTAINERS (Mika Westerberg)
- Address GCC 15's -Wunterminated-string-initialization warning in
the core PNP subsystem code and remove some dead code from it (Kees
Cook, David Alan Gilbert)"
* tag 'acpi-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PNP: Expand length of fixup id string
PNP: Remove prehistoric deadcode
ACPI: button: Install notifier for system events as well
ACPI: fan: Add fan speed reporting for fans with only _FST
ACPI: HED: Always initialize before evged
x86/ACPI: CPPC: Add missing include
ACPI: video: Use str_yes_no() helper in acpi_video_bus_add()
ACPI: platform_profile: Improve platform_profile_unregister()
ACPI: platform-profile: Fix CFI violation when accessing sysfs files
ACPI: power: Use str_on_off() helper function
ACPI: thermal: Fix stale comment regarding trip points
MAINTAINERS: Use my kernel.org address for ACPI PMIC work
Merge tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Add support for running as the root partition in Hyper-V (Microsoft
Hypervisor) by exposing /dev/mshv (Nuno and various people)
- Add support for CPU offlining in Hyper-V (Hamza Mahfooz)
- Misc fixes and cleanups (Roman Kisel, Tianyu Lan, Wei Liu, Michael
Kelley, Thorsten Blum)
* tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
x86/hyperv: fix an indentation issue in mshyperv.h
x86/hyperv: Add comments about hv_vpset and var size hypercall input args
Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
hyperv: Add definitions for root partition driver to hv headers
x86: hyperv: Add mshv_handler() irq handler and setup function
Drivers: hv: Introduce per-cpu event ring tail
Drivers: hv: Export some functions for use by root partition module
acpi: numa: Export node_to_pxm()
hyperv: Introduce hv_recommend_using_aeoi()
arm64/hyperv: Add some missing functions to arm64
x86/mshyperv: Add support for extended Hyper-V features
hyperv: Log hypercall status codes as strings
x86/hyperv: Fix check of return value from snp_set_vmsa()
x86/hyperv: Add VTL mode callback for restarting the system
x86/hyperv: Add VTL mode emergency restart callback
hyperv: Remove unused union and structs
hyperv: Add CONFIG_MSHV_ROOT to gate root partition support
hyperv: Change hv_root_partition into a function
hyperv: Convert hypercall statuses to linux error codes
drivers/hv: add CPU offlining support
...
The current minimum required version of binutils is 2.25,
which supports the TZCNT instruction mnemonic.
Replace "REP; BSF" in the variable__{ffs,ffz}() functions
with this proper mnemonic.
No functional change intended.
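For illustration, the change boils down to spelling the instruction directly
(a simplified sketch; the real variable__ffs()/variable_ffz() implementations
carry more surrounding context):

  /*
   * Simplified sketch: TZCNT returns the index of the lowest set bit.
   * "REP; BSF" encodes the same bytes (0xf3 prefix + BSF) and was only
   * needed because older assemblers lacked the TZCNT mnemonic; binutils
   * >= 2.25 accepts the mnemonic directly.
   */
  static __always_inline unsigned long example_ffs(unsigned long word)
  {
          asm("tzcnt %1,%0" : "=r" (word) : "rm" (word));
          return word;
  }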
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250325175215.330659-1-ubizjak@gmail.com
If track_pfn_copy() fails, we already added the dst VMA to the maple
tree. As fork() fails, we'll cleanup the maple tree, and stumble over
the dst VMA for which we neither performed any reservation nor copied
any page tables.
Consequently untrack_pfn() will see VM_PAT and try obtaining the
PAT information from the page table -- which fails because the page
table was not copied.
The easiest fix would be to simply clear the VM_PAT flag of the dst VMA
if track_pfn_copy() fails. However, the whole idea of "simply" clearing
the VM_PAT flag is shaky as well: if we passed track_pfn_copy() and
performed a reservation, but copying the page tables fails, we'll simply
clear the VM_PAT flag without properly undoing the reservation ...
which is also wrong.
So let's fix it properly: set the VM_PAT flag only if the reservation
succeeded (leaving it clear initially), and undo the reservation if
anything goes wrong while copying the page tables: clearing the VM_PAT
flag after undoing the reservation.
Note that any copied page table entries will get zapped when the VMA will
get removed later, after copy_page_range() succeeded; as VM_PAT is not set
then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be
happy. Note that leaving these page tables in place without a reservation
is not a problem, as we are aborting fork(); this process will never run.
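A hedged sketch of the resulting ordering in the fork path (the function
names below are simplified stand-ins; the actual interface differs in detail
but follows the same set-VM_PAT-only-on-success / undo-on-failure idea):

  /* Simplified sketch of the intended ordering, not the literal patch. */
  static int example_copy_pfnmap_vma(struct vm_area_struct *dst,
                                     struct vm_area_struct *src)
  {
          unsigned long pfn;
          int err;

          /* dst starts without VM_PAT; set it only once the reservation exists. */
          err = track_pfn_copy(dst, src, &pfn);
          if (err)
                  return err;

          err = example_copy_page_tables(dst, src); /* stand-in for the copy work */
          if (err) {
                  /* Undo the reservation and clear VM_PAT again. */
                  untrack_pfn_copy(dst, pfn);
                  return err;
          }
          return 0;
  }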
A reproducer can trigger this usually at the first try:
https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c
WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110
Modules linked in: ...
CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:get_pat_info+0xf6/0x110
...
Call Trace:
<TASK>
...
untrack_pfn+0x52/0x110
unmap_single_vma+0xa6/0xe0
unmap_vmas+0x105/0x1f0
exit_mmap+0xf6/0x460
__mmput+0x4b/0x120
copy_process+0x1bf6/0x2aa0
kernel_clone+0xab/0x440
__do_sys_clone+0x66/0x90
do_syscall_64+0x95/0x180
Likely this case was missed in:
d155df53f3 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
... and instead of undoing the reservation we simply cleared the VM_PAT flag.
Keep the documentation of these functions in include/linux/pgtable.h;
one place is more than sufficient -- we should clean that up for the
other functions like track_pfn_remap/untrack_pfn separately.
Fixes: d155df53f3 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed")
Fixes: 2ab640379a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3")
Reported-by: xingwei lee <xrivendell7@gmail.com>
Reported-by: yuxin wang <wang1315768607@163.com>
Reported-by: Marius Fleischer <fleischermarius@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20250321112323.153741-1-david@redhat.com
Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/
Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/
Merge tag 'for-linus-6.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- cleanup: remove an unused function
- add support for a XenServer specific virtual PCI device
- fix the handling of a sparse Xen hypervisor symbol table
- avoid warnings when building the kernel with gcc 15
- fix use of devices behind a VMD bridge when running as a Xen PV dom0
* tag 'for-linus-6.15-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
PCI/MSI: Convert pci_msi_ignore_mask to per MSI domain flag
PCI: vmd: Disable MSI remapping bypass under Xen
xen/pci: Do not register devices with segments >= 0x10000
xen/pciback: Remove unused pcistub_get_pci_dev
xenfs/xensyms: respect hypervisor's "next" indication
xen/mcelog: Add __nonstring annotations for unterminated strings
xen: Add support for XenServer 6.1 platform device
* Nested virtualization support for VGICv3, giving the nested
hypervisor control of the VGIC hardware when running an L2 VM
* Removal of 'late' nested virtualization feature register masking,
making the supported feature set directly visible to userspace
* Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers
* Paravirtual interface for discovering the set of CPU implementations
where a VM may run, addressing a longstanding issue of guest CPU
errata awareness in big-little systems and cross-implementation VM
migration
* Userspace control of the registers responsible for identifying a
particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
allowing VMs to be migrated cross-implementation
* pKVM updates, including support for tracking stage-2 page table
allocations in the protected hypervisor in the 'SecPageTable' stat
* Fixes to vPMU, ensuring that userspace updates to the vPMU after
KVM_RUN are reflected into the backing perf events
LoongArch:
* Remove unnecessary header include path
* Assume constant PGD during VM context switch
* Add perf events support for guest VM
RISC-V:
* Disable the kernel perf counter during configure
* KVM selftests improvements for PMU
* Fix warning at the time of KVM module removal
x86:
* Add support for aging of SPTEs without holding mmu_lock. Not taking mmu_lock
allows multiple aging actions to run in parallel, and more importantly avoids
stalling vCPUs. This includes an implementation of per-rmap-entry locking;
aging the gfn is done with only a per-rmap single-bin spinlock taken, whereas
locking an rmap for write requires taking both the per-rmap spinlock and
the mmu_lock.
Note that this decreases slightly the accuracy of accessed-page information,
because changes to the SPTE outside aging might not use atomic operations
even if they could race against a clear of the Accessed bit. This is
deliberate because KVM and mm/ tolerate false positives/negatives for
accessed information, and testing has shown that reducing the latency of
aging is far more beneficial to overall system performance than providing
"perfect" young/old information.
* Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
coalesce updates when multiple pieces of vCPU state are changing, e.g. as
part of a nested transition.
* Fix a variety of nested emulation bugs, and add VMX support for synthesizing
nested VM-Exit on interception (instead of injecting #UD into L2).
* Drop "support" for async page faults for protected guests that do not set
SEND_ALWAYS (i.e. that only want async page faults at CPL3)
* Bring a bit of sanity to x86's VM teardown code, which has accumulated
a lot of cruft over the years. Particularly, destroy vCPUs before
the MMU, despite the latter being a VM-wide operation.
* Add common secure TSC infrastructure for use within SNP and in the
future TDX
* Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make
sense to use the capability if the relevant registers are not
available for reading or writing.
* Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
fix a largely theoretical deadlock.
* Use the vCPU's actual Xen PV clock information when starting the Xen timer,
as the cached state in arch.hv_clock can be stale/bogus.
* Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend
notifier only accounts for kvmclock, and there's no evidence that the
flag is actually supported by Xen guests.
* Clean up the per-vCPU "cache" of its reference pvclock, and instead only
track the vCPU's TSC scaling (multiplier+shift) metadata (which is moderately
expensive to compute, and rarely changes for modern setups).
* Don't write to the Xen hypercall page on MSR writes that are initiated by
the host (userspace or KVM) to fix a class of bugs where KVM can write to
guest memory at unexpected times, e.g. during vCPU creation if userspace has
set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
* Restrict the Xen hypercall MSR index to the unofficial synthetic range to
reduce the set of possible collisions with MSRs that are emulated by KVM
(collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
in the synthetic range).
* Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
* Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
entries when updating PV clocks; there is no guarantee PV clocks will be
updated between TSC frequency changes and CPUID emulation, and guest reads
of the TSC leaves should be rare, i.e. are not a hot path.
x86 (Intel):
* Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus
modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1.
* Pass XFD_ERR as the payload when injecting #NM, as a preparatory step
for upcoming FRED virtualization support.
* Decouple the EPT entry RWX protection bit macros from the EPT Violation
bits, both as a general cleanup and in anticipation of adding support for
emulating Mode-Based Execution Control (MBEC).
* Reject KVM_RUN if userspace manages to gain control and stuff invalid guest
state while KVM is in the middle of emulating nested VM-Enter.
* Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs
in anticipation of adding sanity checks for secondary exit controls (the
primary field is out of bits).
x86 (AMD):
* Ensure the PSP driver is initialized when both the PSP and KVM modules are
built-in (the initcall framework doesn't handle dependencies).
* Use long-term pins when registering encrypted memory regions, so that the
pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to
excessive fragmentation.
* Add macros and helpers for setting GHCB return/error codes.
* Add support for Idle HLT interception, which elides interception if the vCPU
has a pending, unmasked virtual IRQ when HLT is executed.
* Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical
address.
* Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g.
because the vCPU was "destroyed" via SNP's AP Creation hypercall.
* Reject SNP AP Creation if the requested SEV features for the vCPU don't
match the VM's configured set of features.
Selftests:
* Fix again the Intel PMU counters test; add a data load and do CLFLUSH{OPT} on the data
instead of executing code. The theory is that modern Intel CPUs have
learned new code prefetching tricks that bypass the PMU counters.
* Fix a flaw in the Intel PMU counters test where it asserts that an event is
counting correctly without actually knowing what the event counts on the
underlying hardware.
* Fix a variety of flaws, bugs, and false failures/passes in dirty_log_test, and
improve its coverage by collecting all dirty entries on each iteration.
* Fix a few minor bugs related to handling of stats FDs.
* Add infrastructure to make vCPU and VM stats FDs available to tests by
default (open the FDs during VM/vCPU creation).
* Relax an assertion on the number of HLT exits in the xAPIC IPI test when
running on a CPU that supports AMD's Idle HLT (which elides interception of
HLT if a virtual IRQ is pending and unmasked).
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Nested virtualization support for VGICv3, giving the nested
hypervisor control of the VGIC hardware when running an L2 VM
- Removal of 'late' nested virtualization feature register masking,
making the supported feature set directly visible to userspace
- Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers
- Paravirtual interface for discovering the set of CPU
implementations where a VM may run, addressing a longstanding issue
of guest CPU errata awareness in big-little systems and
cross-implementation VM migration
- Userspace control of the registers responsible for identifying a
particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
allowing VMs to be migrated cross-implementation
- pKVM updates, including support for tracking stage-2 page table
allocations in the protected hypervisor in the 'SecPageTable' stat
- Fixes to vPMU, ensuring that userspace updates to the vPMU after
KVM_RUN are reflected into the backing perf events
LoongArch:
- Remove unnecessary header include path
- Assume constant PGD during VM context switch
- Add perf events support for guest VM
RISC-V:
- Disable the kernel perf counter during configure
- KVM selftests improvements for PMU
- Fix warning at the time of KVM module removal
x86:
- Add support for aging of SPTEs without holding mmu_lock.
Not taking mmu_lock allows multiple aging actions to run in
parallel, and more importantly avoids stalling vCPUs. This includes
an implementation of per-rmap-entry locking; aging the gfn is done
with only a per-rmap single-bin spinlock taken, whereas locking an
rmap for write requires taking both the per-rmap spinlock and the
mmu_lock.
Note that this decreases slightly the accuracy of accessed-page
information, because changes to the SPTE outside aging might not
use atomic operations even if they could race against a clear of
the Accessed bit.
This is deliberate because KVM and mm/ tolerate false
positives/negatives for accessed information, and testing has shown
that reducing the latency of aging is far more beneficial to
overall system performance than providing "perfect" young/old
information.
- Defer runtime CPUID updates until KVM emulates a CPUID instruction,
to coalesce updates when multiple pieces of vCPU state are
changing, e.g. as part of a nested transition
- Fix a variety of nested emulation bugs, and add VMX support for
synthesizing nested VM-Exit on interception (instead of injecting
#UD into L2)
- Drop "support" for async page faults for protected guests that do
not set SEND_ALWAYS (i.e. that only want async page faults at CPL3)
- Bring a bit of sanity to x86's VM teardown code, which has
accumulated a lot of cruft over the years. Particularly, destroy
vCPUs before the MMU, despite the latter being a VM-wide operation
- Add common secure TSC infrastructure for use within SNP and in the
future TDX
- Block KVM_CAP_SYNC_REGS if guest state is protected. It does not
make sense to use the capability if the relevant registers are not
available for reading or writing
- Don't take kvm->lock when iterating over vCPUs in the suspend
notifier to fix a largely theoretical deadlock
- Use the vCPU's actual Xen PV clock information when starting the
Xen timer, as the cached state in arch.hv_clock can be stale/bogus
- Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across
different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as
KVM's suspend notifier only accounts for kvmclock, and there's no
evidence that the flag is actually supported by Xen guests
- Clean up the per-vCPU "cache" of its reference pvclock, and instead
only track the vCPU's TSC scaling (multiplier+shift) metadata (which
is moderately expensive to compute, and rarely changes for modern
setups)
- Don't write to the Xen hypercall page on MSR writes that are
initiated by the host (userspace or KVM) to fix a class of bugs
where KVM can write to guest memory at unexpected times, e.g.
during vCPU creation if userspace has set the Xen hypercall MSR
index to collide with an MSR that KVM emulates
- Restrict the Xen hypercall MSR index to the unofficial synthetic
range to reduce the set of possible collisions with MSRs that are
emulated by KVM (collisions can still happen as KVM emulates
Hyper-V MSRs, which also reside in the synthetic range)
- Clean up and optimize KVM's handling of Xen MSR writes and
xen_hvm_config
- Update Xen TSC leaves during CPUID emulation instead of modifying
the CPUID entries when updating PV clocks; there is no guarantee PV
clocks will be updated between TSC frequency changes and CPUID
emulation, and guest reads of the TSC leaves should be rare, i.e.
are not a hot path
x86 (Intel):
- Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and
thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1
- Pass XFD_ERR as the payload when injecting #NM, as a preparatory
step for upcoming FRED virtualization support
- Decouple the EPT entry RWX protection bit macros from the EPT
Violation bits, both as a general cleanup and in anticipation of
adding support for emulating Mode-Based Execution Control (MBEC)
- Reject KVM_RUN if userspace manages to gain control and stuff
invalid guest state while KVM is in the middle of emulating nested
VM-Enter
- Add a macro to handle KVM's sanity checks on entry/exit VMCS
control pairs in anticipation of adding sanity checks for secondary
exit controls (the primary field is out of bits)
x86 (AMD):
- Ensure the PSP driver is initialized when both the PSP and KVM
modules are built-in (the initcall framework doesn't handle
dependencies)
- Use long-term pins when registering encrypted memory regions, so
that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and
don't lead to excessive fragmentation
- Add macros and helpers for setting GHCB return/error codes
- Add support for Idle HLT interception, which elides interception if
the vCPU has a pending, unmasked virtual IRQ when HLT is executed
- Fix a bug in INVPCID emulation where KVM fails to check for a
non-canonical address
- Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is
invalid, e.g. because the vCPU was "destroyed" via SNP's AP
Creation hypercall
- Reject SNP AP Creation if the requested SEV features for the vCPU
don't match the VM's configured set of features
Selftests:
- Fix again the Intel PMU counters test; add a data load and do
CLFLUSH{OPT} on the data instead of executing code. The theory is
that modern Intel CPUs have learned new code prefetching tricks
that bypass the PMU counters
- Fix a flaw in the Intel PMU counters test where it asserts that an
event is counting correctly without actually knowing what the event
counts on the underlying hardware
- Fix a variety of flaws, bugs, and false failures/passes in
dirty_log_test, and improve its coverage by collecting all dirty
entries on each iteration
- Fix a few minor bugs related to handling of stats FDs
- Add infrastructure to make vCPU and VM stats FDs available to tests
by default (open the FDs during VM/vCPU creation)
- Relax an assertion on the number of HLT exits in the xAPIC IPI test
when running on a CPU that supports AMD's Idle HLT (which elides
interception of HLT if a virtual IRQ is pending and unmasked)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits)
RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed
RISC-V: KVM: Teardown riscv specific bits after kvm_exit
LoongArch: KVM: Register perf callbacks for guest
LoongArch: KVM: Implement arch-specific functions for guest perf
LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel()
LoongArch: KVM: Remove PGD saving during VM context switch
LoongArch: KVM: Remove unnecessary header include path
KVM: arm64: Tear down vGIC on failed vCPU creation
KVM: arm64: PMU: Reload when resetting
KVM: arm64: PMU: Reload when user modifies registers
KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
KVM: arm64: Initialize HCRX_EL2 traps in pKVM
KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected
KVM: x86: Add infrastructure for secure TSC
KVM: x86: Push down setting vcpu.arch.user_set_tsc
...
filesystem part so that ARM's MPAM variant of resource control can be added
later while sharing the user interface with x86 (James Morse)
Merge tag 'x86_cache_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- First part of the MPAM work: split the architectural part of resctrl
from the filesystem part so that ARM's MPAM variant of resource
control can be added later while sharing the user interface with x86
(James Morse)
* tag 'x86_cache_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86/resctrl: Move get_{mon,ctrl}_domain_from_cpu() to live with their callers
x86/resctrl: Move get_config_index() to a header
x86/resctrl: Handle throttle_mode for SMBA resources
x86/resctrl: Move RFTYPE flags to be managed by resctrl
x86/resctrl: Make resctrl_arch_pseudo_lock_fn() take a plr
x86/resctrl: Make prefetch_disable_bits belong to the arch code
x86/resctrl: Allow an architecture to disable pseudo lock
x86/resctrl: Add resctrl_arch_ prefix to pseudo lock functions
x86/resctrl: Move mbm_cfg_mask to struct rdt_resource
x86/resctrl: Move mba_mbps_default_event init to filesystem code
x86/resctrl: Change mon_event_config_{read,write}() to be arch helpers
x86/resctrl: Add resctrl_arch_is_evt_configurable() to abstract BMEC
x86/resctrl: Move the is_mbm_*_enabled() helpers to asm/resctrl.h
x86/resctrl: Rewrite and move the for_each_*_rdt_resource() walkers
x86/resctrl: Move monitor init work to a resctrl init call
x86/resctrl: Move monitor exit work to a resctrl exit call
x86/resctrl: Add an arch helper to reset one resource
x86/resctrl: Move resctrl types to a separate header
x86/resctrl: Move rdt_find_domain() to be visible to arch and fs code
x86/resctrl: Expose resctrl fs's init function to the rest of the kernel
...
attack vectors instead of single vulnerabilities
- Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag
- Add support for a Zen5-specific SRSO mitigation
- Cleanups and minor improvements
Merge tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 speculation mitigation updates from Borislav Petkov:
- Some preparatory work to convert the mitigations machinery to
mitigating attack vectors instead of single vulnerabilities
- Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag
- Add support for a Zen5-specific SRSO mitigation
- Cleanups and minor improvements
* tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2
x86/bugs: Use the cpu_smt_possible() helper instead of open-coded code
x86/bugs: Add AUTO mitigations for mds/taa/mmio/rfds
x86/bugs: Relocate mds/taa/mmio/rfds defines
x86/bugs: Add X86_BUG_SPECTRE_V2_USER
x86/bugs: Remove X86_FEATURE_USE_IBPB
KVM: nVMX: Always use IBPB to properly virtualize IBRS
x86/bugs: Use a static branch to guard IBPB on vCPU switch
x86/bugs: Remove the X86_FEATURE_USE_IBPB check in ib_prctl_set()
x86/mm: Remove X86_FEATURE_USE_IBPB checks in cond_mitigation()
x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers
x86/bugs: KVM: Add support for SRSO_MSR_FIX
Perf and PMUs:
- Support for the "Rainier" CPU PMU from Arm
- Preparatory driver changes and cleanups that pave the way for BRBE
support
- Support for partial virtualisation of the Apple-M1 PMU
- Support for the second event filter in Arm CSPMU designs
- Minor fixes and cleanups (CMN and DWC PMUs)
- Enable EL2 requirements for FEAT_PMUv3p9
Power, CPU topology:
- Support for AMUv1-based average CPU frequency
- Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It adds
a generic topology_is_primary_thread() function overridden by x86 and
powerpc
New(ish) features:
- MOPS (memcpy/memset) support for the uaccess routines
Security/confidential compute:
- Fix the DMA address for devices used in Realms with Arm CCA. The
CCA architecture uses the address bit to differentiate between shared
and private addresses
- Spectre-BHB: assume CPUs Linux doesn't know about are vulnerable by
default
Memory management clean-ups:
- Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs
- Some minor page table accessor clean-ups
- PIE/POE (permission indirection/overlay) helpers clean-up
Kselftests:
- MTE: skip hugetlb tests if MTE is not supported on such mappings and
use correct naming for sync/async tag checking modes
Miscellaneous:
- Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people
request)
- Sysreg updates for new register fields
- CPU type info for some Qualcomm Kryo cores
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Nothing major this time around.
Apart from the usual perf/PMU updates and some page table cleanups, the
notable features are average CPU frequency based on the AMUv1
counters, CONFIG_HOTPLUG_SMT and MOPS instructions (memcpy/memset) in
the uaccess routines.
Perf and PMUs:
- Support for the 'Rainier' CPU PMU from Arm
- Preparatory driver changes and cleanups that pave the way for BRBE
support
- Support for partial virtualisation of the Apple-M1 PMU
- Support for the second event filter in Arm CSPMU designs
- Minor fixes and cleanups (CMN and DWC PMUs)
- Enable EL2 requirements for FEAT_PMUv3p9
Power, CPU topology:
- Support for AMUv1-based average CPU frequency
- Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It
adds a generic topology_is_primary_thread() function overridden by
x86 and powerpc
New(ish) features:
- MOPS (memcpy/memset) support for the uaccess routines
Security/confidential compute:
- Fix the DMA address for devices used in Realms with Arm CCA. The
CCA architecture uses the address bit to differentiate between
shared and private addresses
- Spectre-BHB: assume CPUs Linux doesn't know about are vulnerable by
default
Memory management clean-ups:
- Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs
- Some minor page table accessor clean-ups
- PIE/POE (permission indirection/overlay) helpers clean-up
Kselftests:
- MTE: skip hugetlb tests if MTE is not supported on such mappings
and use correct naming for sync/async tag checking modes
Miscellaneous:
- Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people
request)
- Sysreg updates for new register fields
- CPU type info for some Qualcomm Kryo cores"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
arm64: mm: Don't use %pK through printk
perf/arm_cspmu: Fix missing io.h include
arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
arm64: cputype: Add MIDR_CORTEX_A76AE
arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
arm64/sysreg: Enforce whole word match for open/close tokens
arm64/sysreg: Fix unbalanced closing block
arm64: Kconfig: Enable HOTPLUG_SMT
arm64: topology: Support SMT control on ACPI based system
arch_topology: Support SMT control for OF based system
cpu/SMT: Provide a default topology_is_primary_thread()
arm64/mm: Define PTDESC_ORDER
perf/arm_cspmu: Add PMEVFILT2R support
perf/arm_cspmu: Generalise event filtering
perf/arm_cspmu: Move register definitons to header
arm64/kernel: Always use level 2 or higher for early mappings
arm64/mm: Drop PXD_TABLE_BIT
arm64/mm: Check pmd_table() in pmd_trans_huge()
...
* for-next/smt-control:
: Support SMT control on arm64
arm64: Kconfig: Enable HOTPLUG_SMT
arm64: topology: Support SMT control on ACPI based system
arch_topology: Support SMT control for OF based system
cpu/SMT: Provide a default topology_is_primary_thread()
- Consolidate the VDSO storage
The VDSO data storage and data layout have been largely architecture
specific for historical reasons. That increases the maintenance effort
and causes inconsistencies over and over.
There is no real technical reason for architecture specific layouts and
implementations. The architecture specific details can easily be
integrated into a generic layout, which also reduces the amount of
duplicated code for managing the mappings.
Convert all architectures over to a unified layout and common mapping
infrastructure. This splits the VDSO data layout into subsystem
specific blocks, timekeeping, random and architecture parts, which
provides a better structure and allows the functionality to be improved
and updated without conflicts or unwanted interaction between the parts.
- Rework the timekeeping data storage
The current implementation is designed for exposing system timekeeping
accessors, which was good enough at the time when it was designed.
PTP and Time Sensitive Networking (TSN) change that as there are
requirements to expose independent PTP clocks, which are not related to
system timekeeping.
Replace the monolithic data storage with a structured layout, which
allows support for independent PTP clocks to be added on top while
reusing both the data structures and the time accessor implementations.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmfgSWUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYGED/0f/M8YyacAyErDYW4ufW+zh2sUidSf
GVlK0Jn5BMljOoye+y2XfTxuvvXxEDjJNYiJm2uKGPdV29tjNXreGK39XyNqXPu5
jwR4f/IN/QVSM2nCO6jyydMz8ympJ2k6M4RewwmxXBL2KsUzzJWSKTgRNqM5Tdjs
1RhJMjkQVTiiSYerBpHXYCeZLM7/VEfZ120uuzVAYPXo0/R6zuyF7IBgIao9hbfO
IQeCMLLfpDQHQhwquTA8ZbWqQusiEoSYHT+kTDa3eXDDbE/2UklAUs9gaatI979x
73zs0Yqxyx2iIGaghACWOAbKdcBWBeCYDw5fFwYVKn4VMQi1+wcxbtOYL767jp9o
vfkLXGilXcVkvDjv4fH+e1NoJXXBxq1Ug1silKdOeJzenQF8Q1i3tavkWUVCNfwH
qyOIM72NiCEWbYBDcz0lwBxEAyO4o0E6NP1bDc4y50VedEYIbXwSh0QGrdev1abn
rjY9vsuUR9oznmZ6BRPPxMTY87gOSHoKvqydgSZUACEgLV9346f5qZf341OReYai
MXUmXOM4+LdyaM1+Mec8ppvjMbLw+736NZyZtT2InusEBE+Ddp25L3hYiWnklJu8
2uwv0AoyrwaJ8y6ADOX4thcLZq0gND0Z/Ayz/XvpeI30eftsGUCt5KOVlqwfwOkI
4EQKvk2fAixPxg==
=rwei
-----END PGP SIGNATURE-----
Merge tag 'timers-vdso-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull VDSO infrastructure updates from Thomas Gleixner:
- Consolidate the VDSO storage
The VDSO data storage and data layout have been largely architecture
specific for historical reasons. That increases the maintenance
effort and causes inconsistencies over and over.
There is no real technical reason for architecture specific layouts
and implementations. The architecture specific details can easily be
integrated into a generic layout, which also reduces the amount of
duplicated code for managing the mappings.
Convert all architectures over to a unified layout and common mapping
infrastructure. This splits the VDSO data layout into subsystem
specific blocks, timekeeping, random and architecture parts, which
provides a better structure and allows the functionality to be improved
and updated without conflicts or unwanted interaction between the parts.
- Rework the timekeeping data storage
The current implementation is designed for exposing system
timekeeping accessors, which was good enough at the time when it was
designed.
PTP and Time Sensitive Networking (TSN) change that as there are
requirements to expose independent PTP clocks, which are not related
to system timekeeping.
Replace the monolithic data storage with a structured layout, which
allows support for independent PTP clocks to be added on top while
reusing both the data structures and the time accessor implementations.
* tag 'timers-vdso-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
sparc/vdso: Always reject undefined references during linking
x86/vdso: Always reject undefined references during linking
vdso: Rework struct vdso_time_data and introduce struct vdso_clock
vdso: Move architecture related data before basetime data
powerpc/vdso: Prepare introduction of struct vdso_clock
arm64/vdso: Prepare introduction of struct vdso_clock
x86/vdso: Prepare introduction of struct vdso_clock
time/namespace: Prepare introduction of struct vdso_clock
vdso/namespace: Rename timens_setup_vdso_data() to reflect new vdso_clock struct
vdso/vsyscall: Prepare introduction of struct vdso_clock
vdso/gettimeofday: Prepare helper functions for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare introduction of struct vdso_clock
vdso/helpers: Prepare introduction of struct vdso_clock
vdso/datapage: Define vdso_clock to prepare for multiple PTP clocks
vdso: Make vdso_time_data cacheline aligned
arm64: Make asm/cache.h compatible with vDSO
...
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the upcoming
Rust integration and is obviously a silly implementation to begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and hrtimer::function will become a private member.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
iJbvLJeisXr3GA==
=bYAX
-----END PGP SIGNATURE-----
Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the
upcoming Rust integration and is obviously a silly implementation to
begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and hrtimer::function will become a private member"
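For illustration, the conversion pattern looks roughly like the sketch
below, assuming hrtimer_setup() takes the callback together with the
clock id and mode; 'foo' and 'foo_timer_fn' are made-up names:

  /* Before: two-step initialization */
  hrtimer_init(&foo->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
  foo->timer.function = foo_timer_fn;

  /* After: a single call sets up the timer and its callback */
  hrtimer_setup(&foo->timer, foo_timer_fn, CLOCK_MONOTONIC, HRTIMER_MODE_REL);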
* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
wifi: rt2x00: Switch to use hrtimer_update_function()
io_uring: Use helper function hrtimer_update_function()
serial: xilinx_uartps: Use helper function hrtimer_update_function()
ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
RDMA: Switch to use hrtimer_setup()
virtio: mem: Switch to use hrtimer_setup()
drm/vmwgfx: Switch to use hrtimer_setup()
drm/xe/oa: Switch to use hrtimer_setup()
drm/vkms: Switch to use hrtimer_setup()
drm/msm: Switch to use hrtimer_setup()
drm/i915/request: Switch to use hrtimer_setup()
drm/i915/uncore: Switch to use hrtimer_setup()
drm/i915/pmu: Switch to use hrtimer_setup()
drm/i915/perf: Switch to use hrtimer_setup()
drm/i915/gvt: Switch to use hrtimer_setup()
drm/i915/huc: Switch to use hrtimer_setup()
drm/amdgpu: Switch to use hrtimer_setup()
stm class: heartbeat: Switch to use hrtimer_setup()
i2c: Switch to use hrtimer_setup()
iio: Switch to use hrtimer_setup()
...
- Support for hart indices on RISC-V. The hart index identifies a hart
(core) within a specific interrupt domain in RISC-V's Privileged
Architecture.
- Rework of the RISC-V MSI driver.
This moves the driver over to the generic MSI library and solves the
affinity problem of unmaskable PCI/MSI controllers. Unmaskable PCI/MSI
controllers are prone to lose interrupts when the MSI message is
updated to change the affinity because the message write consists of
three 32-bit subsequent writes, which update address and data. As these
writes are non-atomic versus the device raising an interrupt, the
device can observe a half written update and issue an interrupt on the
wrong vector. This is mitigated by a carefully orchestrated step by step
update and the observation of an eventually pending interrupt on the
CPU which issues the update. The algorithm follows the well established
method of the X86 MSI driver.
- A new driver for the RISC-V Sophgo SG2042 MSI controller
- Overhaul of the Renesas RZQ2L driver.
Simplification of the probe function by using devm_*() mechanisms,
which avoid the endless list of error prone gotos in the failure paths.
- Expand the Renesas RZV2H driver to support RZ/G3E SoCs
- A workaround for the Rockchip 3568002 erratum in the GIC-V3 driver to
ensure that the addressing is limited to the lower 32 bits of the
physical address space.
- Add support for the Allwinner A523 NMI controller
- Expand the IMX irqsteer driver to handle up to 960 input interrupts
- The usual small updates, cleanups and device tree changes.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff454THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZqoD/4kdHzbxfLpf7vC3NnG8NWwTq5FpbSx
6grQC9hWNMAs4n2IFjJRFLrjeX3AcdAQXL/BWuM0LfW9tQDQaVmqlSIlB/bn69KB
7HyAR6ozbOgnHKGAqFUXSLf+4pq+6q3mOgGKIF289dy14HFu4ta0DqKgkPZeQnVs
R/J8i7REUnn+YuxzSt5eOqyDPyt2EHJosSUABSWQZBlrM9jy1W7f6NqDFwawiVsa
+tv4U/bz91vjzVxwTIgt7nJK+b2HVYdxoZYuKJwPaTsj26ANPp6ltjRTeOmZhb5h
uKgw+OyzDnk6q+tjGcRqrqwl291VKxCvnRiqHFfu3CERdmI9qvpN9IRcEJqIbkcN
cakekhAyt7OO7sEPcql5vBL97e9hpb7EcH78gYxwHf8Dy0rFZUvSC5v+L6VRFnJS
XcKA1L+f9B6u5qxnBtLan9IW08HYNdvmPq6AuVjk+ndKioPUFqB2q6AtXpuA3Rmu
Y3XH/wh/q5wk0pgeByxQW6swsfpMN3OYK3mpLx475wFh2NKzcdGlwGhDFhiw8DKX
m1AESy3UZatj1a0qGaFS/M+mm9KGrDYIMrje832Wf4Yf1LGmTsDkd3/V99oazSsq
Jm4qhDASXChJXd0imQICX9hPw0aHTlLYNs54obUXVULH4HivQKIgWhUXrjG0dBDL
+tttjuv5FJxr3A==
=jPHa
-----END PGP SIGNATURE-----
Merge tag 'irq-drivers-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq driver updates from Thomas Gleixner:
- Support for hart indices on RISC-V. The hart index identifies a hart
(core) within a specific interrupt domain in RISC-V's Privileged
Architecture.
- Rework of the RISC-V MSI driver
This moves the driver over to the generic MSI library and solves the
affinity problem of unmaskable PCI/MSI controllers. Unmaskable
PCI/MSI controllers are prone to lose interrupts when the MSI message
is updated to change the affinity because the message write consists
of three 32-bit subsequent writes, which update address and data. As
these writes are non-atomic versus the device raising an interrupt,
the device can observe a half written update and issue an interrupt
on the wrong vector. This is mitigated by a carefully orchestrated
step by step update and the observation of an eventually pending
interrupt on the CPU which issues the update. The algorithm follows
the well established method of the X86 MSI driver.
- A new driver for the RISC-V Sophgo SG2042 MSI controller
- Overhaul of the Renesas RZQ2L driver
Simplification of the probe function by using devm_*() mechanisms,
which avoid the endless list of error prone gotos in the failure
paths.
- Expand the Renesas RZV2H driver to support RZ/G3E SoCs
- A workaround for the Rockchip 3568002 erratum in the GIC-V3 driver to
ensure that the addressing is limited to the lower 32 bits of the
physical address space.
- Add support for the Allwinner A523 NMI controller
- Expand the IMX irqsteer driver to handle up to 960 input interrupts
- The usual small updates, cleanups and device tree changes
* tag 'irq-drivers-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
irqchip/imx-irqsteer: Support up to 960 input interrupts
irqchip/sunxi-nmi: Support Allwinner A523 NMI controller
dt-bindings: irq: sun7i-nmi: Document the Allwinner A523 NMI controller
irqchip/davinci-cp-intc: Remove public header
irqchip/renesas-rzv2h: Add RZ/G3E support
irqchip/renesas-rzv2h: Update macros ICU_TSSR_TSSEL_{MASK,PREP}
irqchip/renesas-rzv2h: Update TSSR_TIEN macro
irqchip/renesas-rzv2h: Add field_width to struct rzv2h_hw_info
irqchip/renesas-rzv2h: Add max_tssel to struct rzv2h_hw_info
irqchip/renesas-rzv2h: Add struct rzv2h_hw_info with t_offs variable
irqchip/renesas-rzv2h: Use devm_pm_runtime_enable()
irqchip/renesas-rzv2h: Use devm_reset_control_get_exclusive_deasserted()
irqchip/renesas-rzv2h: Simplify rzv2h_icu_init()
irqchip/renesas-rzv2h: Drop irqchip from struct rzv2h_icu_priv
irqchip/renesas-rzv2h: Fix wrong variable usage in rzv2h_tint_set_type()
dt-bindings: interrupt-controller: renesas,rzv2h-icu: Document RZ/G3E SoC
riscv: sophgo: dts: Add msi controller for SG2042
irqchip: Add the Sophgo SG2042 MSI interrupt controller
dt-bindings: interrupt-controller: Add Sophgo SG2042 MSI
arm64: dts: rockchip: rk356x: Move PCIe MSI to use GIC ITS instead of MBI
...
The actual serial output function is a no-op for now.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20250314173226.3062535-3-dwmw2@infradead.org
The x86/cacheinfo code has been heavily refactored and fleshed out at
parent commits, where any necessary coding style fixes were also done
in place.
Apply Documentation/process/maintainer-tip.rst coding style fixes to the
rest of the code, and align its assignment expressions for readability.
Standardize on CPUID(n) when mentioning leaf queries.
Avoid breaking long lines when doing so helps readability.
At cacheinfo_amd_init_llc_id(), rename variable 'msb' to 'index_msb' as
this is how it's called at the rest of cacheinfo.c code.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-30-darwi@linutronix.de
Multiple code paths at cacheinfo.c and amd_nb.c check for AMD/Hygon CPUs'
L3 cache presence by directly checking leaf 0x80000006 EDX output.
Extract that logic into its own function. While at it, rework the
AMD/Hygon LLC topology ID calculation comments for clarity.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-29-darwi@linutronix.de
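A sketch of what such an extracted helper could look like (the function
name and the exact EDX bit interpretation here are illustrative
assumptions, not the patch itself):

  static bool cpu_has_l3_cache(void)
  {
          u32 eax, ebx, ecx, edx;

          cpuid(0x80000006, &eax, &ebx, &ecx, &edx);

          /* EDX's upper bits report the L3 size in 512 KB units;
           * a zero size field means no L3 cache is present.       */
          return (edx >> 18) != 0;
  }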
The cache_type_map[] array is used to map Intel leaf 0x4 cache_type
values to their corresponding types at <linux/cacheinfo.h>.
Move that array's definition after the actual CPUID leaf 0x4 structures,
instead of having it in the middle of AMD leaf 0x4 emulation code.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-28-darwi@linutronix.de
The logic of not doing a cache flush if the CPU declares cache self
snooping support is repeated across the x86/cacheinfo code. Extract it
into its own function.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-27-darwi@linutronix.de
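A minimal sketch of that idea (the helper name is hypothetical):

  static void flush_caches_unless_self_snooping(void)
  {
          /* CPUs that snoop their own caches do not need the flush */
          if (!boot_cpu_has(X86_FEATURE_SELFSNOOP))
                  wbinvd();
  }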
For Intel CPUID leaf 0x4 parsing, refactor the cache level topology ID
calculation code into its own method instead of repeating the same logic
twice for L2 and L3.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-26-darwi@linutronix.de
init_intel_cacheinfo() was overly complex. It parsed leaf 0x4 data,
leaf 0x2 data, and performed post-processing, all within one function.
Parent commit moved leaf 0x2 parsing and the post-processing logic into
their own functions.
Continue the refactoring by extracting leaf 0x4 parsing into its own
function. Initialize local L2/L3 topology ID variables to BAD_APICID by
default, thus ensuring they can be used unconditionally.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-25-darwi@linutronix.de
The logic of init_intel_cacheinfo() is quite convoluted: it mixes leaf
0x4 parsing, leaf 0x2 parsing, plus some post-processing, in a single
place.
Begin simplifying its logic by extracting the leaf 0x2 parsing code, and
the post-processing logic, into their own functions. While at it,
rework the SMT LLC topology ID comment for clarity.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-24-darwi@linutronix.de
CPUID leaf 0x2 output is a stream of one-byte descriptors, each implying
certain details about the CPU's cache and TLB entries.
At previous commits, the mapping tables for such descriptors were merged
into one consolidated table. The mapping was also transformed into a
hash lookup instead of a loop-based lookup for each descriptor.
Use the new consolidated table and its hash-based lookup through the
for_each_leaf_0x2_tlb_entry() accessor.
Remove the TLB-specific mapping, intel_tlb_table[], as it is now no
longer used. Remove the <cpuid/types.h> macro, for_each_leaf_0x2_desc(),
since the converted code was its last user.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-23-darwi@linutronix.de
CPUID leaf 0x2 output is a stream of one-byte descriptors, each implying
certain details about the CPU's cache and TLB entries.
At previous commits, the mapping tables for such descriptors were merged
into one consolidated table. The mapping was also transformed into a
hash lookup instead of a loop-based lookup for each descriptor.
Use the new consolidated table and its hash-based lookup through the
for_each_leaf_0x2_tlb_entry() accessor. Remove the old cache-specific
mapping, cache_table[], as it is no longer used.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-22-darwi@linutronix.de
CPUID leaf 0x2 describes TLBs and caches. So there are two tables with the
respective descriptor constants in intel.c and cacheinfo.c. The tables
occupy almost 600 bytes and require a loop-based lookup for each variant.
Combining them into one table occupies exactly 1k of rodata and makes it
possible to get rid of the loop-based lookup by just using the descriptor
byte provided by CPUID leaf 0x2 as the index into the table, which
simplifies the code and reduces text size.
The conversion of the intel.c and cacheinfo.c code is done separately.
[ darwi: Actually define struct leaf_0x2_table.
Tab-align all of cpuid_0x2_table[] mapping entries.
Define needed SZ_* macros at <linux/sizes.h> instead (merged commit.)
Use CACHE_L1_{INST,DATA} as names for L1 cache descriptor types.
Set descriptor 0x63 type as TLB_DATA_1G_2M_4M and explain why.
Use enums for cache and TLB descriptor types (parent commits.)
Start enum types at 1 since type 0 is reserved for unknown descriptors.
Ensure that cache and TLB enum type values do not intersect.
Add leaf 0x2 table accessor for_each_leaf_0x2_entry() + documentation. ]
Co-developed-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-21-darwi@linutronix.de
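The shape of such a directly indexed table can be sketched as follows
(the entry layout and field names are illustrative, not the actual
cpuid_0x2_table[] definition; the two sample entries are examples only):

  struct leaf_0x2_entry {
          u8      type;           /* cache/TLB enum type, 0 == unknown descriptor */
          u16     entries_or_kb;  /* TLB entries, or cache size in KB             */
  };

  /* The descriptor byte is itself the index: a single array access
   * replaces the old loop over a termination-entry list.           */
  static const struct leaf_0x2_entry cpuid_0x2_table[256] = {
          [0x01] = { TLB_INST_4K,   32 },
          [0x7d] = { CACHE_L2,    2048 },
  };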
The leaf 0x2 one-byte TLB descriptor types:
TLB_INST_4K
TLB_INST_4M
TLB_INST_2M_4M
...
are just discriminators to be used within the intel_tlb_table[] mapping.
Their specific values are irrelevant.
Use enums for such types.
Make the enum packed and static assert that its values remain within a
single byte so that the intel_tlb_table[] size does not get out of hand.
Use a __CHECKER__ guard for the static_assert(sizeof(enum) == 1) line as
sparse ignores the __packed annotation on enums.
This is similar to:
fe3944fb24 ("fs: Move enum rw_hint into a new header file")
for the core SCSI code.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/Z9rsTirs9lLfEPD9@lx-t490
Link: https://lore.kernel.org/r/20250324133324.23458-20-darwi@linutronix.de
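A simplified sketch of the pattern described above (not the exact kernel
definitions):

  enum tlb_infos {
          TLB_INST_4K = 1,        /* start at 1: type 0 means "unknown descriptor" */
          TLB_INST_4M,
          TLB_INST_2M_4M,
          /* ... */
  } __packed;

  #ifndef __CHECKER__
  /* sparse ignores __packed on enums, hence the guard */
  static_assert(sizeof(enum tlb_infos) == 1);
  #endif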
The leaf 0x2 one-byte cache descriptor types:
CACHE_L1_INST
CACHE_L1_DATA
CACHE_L2
CACHE_L3
are just discriminators to be used within the cache_table[] mapping.
Their specific values are irrelevant.
Use enums for such types.
Make the enum packed and static assert that its values remain within a
single byte so that the cache_table[] array size does not get out of hand.
Use a __CHECKER__ guard for the static_assert(sizeof(enum) == 1) line as
sparse ignores the __packed annotation on enums.
This is similar to:
fe3944fb24 ("fs: Move enum rw_hint into a new header file")
for the core SCSI code.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/Z9rsTirs9lLfEPD9@lx-t490
Link: https://lore.kernel.org/r/20250324133324.23458-19-darwi@linutronix.de
CPUID leaf 0x2 output is a stream of one-byte descriptors, each implying
certain details about the CPU's cache and TLB entries.
Two separate tables exist for interpreting these descriptors: one for
TLBs at intel.c and one for caches at cacheinfo.c. These mapping tables
will be merged in further commits, among other improvements to their
model.
In preparation for this, use more descriptive type names for the leaf
0x2 descriptors associated with cpu caches. Namely:
LVL_1_INST => CACHE_L1_INST
LVL_1_DATA => CACHE_L1_DATA
LVL_2 => CACHE_L2
LVL_3 => CACHE_L3
After the TLB and cache descriptors mapping tables are merged, this will
make it clear that such descriptors correspond to cpu caches.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-18-darwi@linutronix.de
Parent commits decoupled amd_northbridge from _cpuid4_info_regs, moved
AMD L3 northbridge cache_disable_0/1 sysfs code to its own file, and
split AMD vs. Intel leaf 0x4 handling into:
amd_fill_cpuid4_info()
intel_fill_cpuid4_info()
fill_cpuid4_info()
After doing all that, the "_cpuid4_info_regs" name becomes a mouthful.
It is also not totally accurate, as the structure holds cpuid4 derived
information like cache node ID and size -- not just regs.
Rename struct _cpuid4_info_regs to _cpuid4_info. That new name also
better matches the AMD/Intel leaf 0x4 functions mentioned above.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-17-darwi@linutronix.de
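The resulting dispatch between the AMD/Hygon and Intel paths can be
pictured roughly like this (a simplified sketch; parameter types and the
precise vendor check may differ from the actual code):

  static int fill_cpuid4_info(int index, struct _cpuid4_info *id4)
  {
          u8 vendor = boot_cpu_data.x86_vendor;

          if (vendor == X86_VENDOR_AMD || vendor == X86_VENDOR_HYGON)
                  return amd_fill_cpuid4_info(index, id4);

          return intel_fill_cpuid4_info(index, id4);
  }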
The CPUID leaf 0x4 parsing code at cpuid4_cache_lookup_regs() is ugly and
convoluted. It is tangled with multiple nested conditions to handle:
* AMD with TOPEXT, or Hygon CPUs via leaf 0x8000001d
* Legacy AMD fallback via leaf 0x4 emulation
* Intel CPUs via the actual CPUID leaf 0x4
Moreover, AMD L3 northbridge initialization is also awkwardly placed
alongside the CPUID calls of the first two scenarios above. Refactor all
of that as follows:
* Update AMD's leaf 0x4 emulation comment to represent current state
* Clearly label the AMD leaf 0x4 emulation function as a fallback
* Split AMD/Hygon and Intel code paths into separate functions
* Move AMD L3 northbridge initialization out of CPUID leaf 0x4 code,
and into populate_cache_leaves() where it belongs. There,
ci_info_init() can directly store the initialized object in the
private pointer of the <linux/cacheinfo.h> API.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-16-darwi@linutronix.de
Per Documentation/filesystems/sysfs.rst, a sysfs attribute's show()
method should only use sysfs_emit() or sysfs_emit_at() when returning
values to user space.
Use sysfs_emit() for the AMD L3 cache sysfs attributes cache_disable_0,
cache_disable_1, and subcaches.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-15-darwi@linutronix.de
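For reference, a show() method following that rule looks along these
lines (a sketch; the attribute signature and the helper name are
illustrative):

  static ssize_t cache_disable_0_show(struct device *dev,
                                      struct device_attribute *attr, char *buf)
  {
          struct cacheinfo *ci = dev_get_drvdata(dev);
          unsigned long reg = read_l3_disable_reg(ci, 0);   /* hypothetical helper */

          /* sysfs_emit() instead of sprintf()/snprintf() into buf */
          return sysfs_emit(buf, "0x%lx\n", reg);
  }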
Parent commit decoupled amd_northbridge from _cpuid4_info_regs, where
it was merely "parked" until ci_info_init() could store it in the
private pointer of the <linux/cacheinfo.h> API.
Given that decoupling, move the AMD-specific L3 cache_disable_0/1 sysfs
code from the generic (and already extremely convoluted) x86/cacheinfo
code into its own file.
Compile the file only if CONFIG_AMD_NB and CONFIG_SYSFS are both
enabled, which mirrors the existing logic.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-14-darwi@linutronix.de
'struct _cpuid4_info_regs' is meant to hold the CPUID leaf 0x4
output registers (EAX, EBX, and ECX), as well as derived information
such as the cache node ID and size.
It also contains a reference to amd_northbridge, which is there only to
be "parked" until ci_info_init() can store it in the priv pointer of the
<linux/cacheinfo.h> API. That priv pointer is then used by AMD-specific
L3 cache_disable_0/1 sysfs attributes.
Decouple amd_northbridge from _cpuid4_info_regs and pass it explicitly
through the functions at x86/cacheinfo. Doing so clarifies when
amd_northbridge is actually needed (AMD-only code) and when it is
not (Intel-specific code). It also prepares for moving the AMD-specific
L3 cache_disable_0/1 sysfs code into its own file in next commit.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-13-darwi@linutronix.de
While gathering CPU cache info, CPUID leaf 0x8000001d is invoked in two
separate if blocks: one for Hygon CPUs and one for AMDs with topology
extensions. After each invocation, amd_init_l3_cache() is called.
Merge the two if blocks into a single condition, thus removing the
duplicated code. Future commits will expand these if blocks, so
combining them now is both cleaner and more maintainable.
Note, while at it, remove a useless "better error?" comment that was
within the same function since the 2005 commit e2cac78935 ("[PATCH]
x86_64: When running cpuid4 need to run on the correct CPU").
Note, as previously done at commit aec28d852ed2 ("x86/cpuid: Standardize
on u32 in <asm/cpuid/api.h>"), standardize on using 'u32' and 'u8' types.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-12-darwi@linutronix.de
The cacheinfo code frequently uses the output registers from CPUID leaf
0x4. Such registers are cached in 'struct _cpuid4_info_regs', augmented
with related information, and are then passed across functions.
The naming of these _cpuid4_info_regs instances is confusing at best.
Some instances are called "this_leaf", which is vague as "this" lacks
context and "leaf" is overly generic given that other CPUID leaves are
also processed within cacheinfo. Other _cpuid4_info_regs instances are
just called "base", adding further ambiguity.
Standardize on id4 for all instances.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-11-darwi@linutronix.de
The ci_info_init() function initializes 10 members of a 'struct cacheinfo'
instance using passed data from CPUID leaf 0x4.
Such assignment expressions are difficult to read in their current form.
Align them for clarity.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-10-darwi@linutronix.de
_cpuid4_info_regs instances are passed through a large number of
functions at cacheinfo.c. For clarity, constify the instance parameters
where _cpuid4_info_regs is only read from.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-9-darwi@linutronix.de
The cacheinfo structure defined at <include/linux/cacheinfo.h> is a
generic cache info object representation.
Calling its instances at x86 cacheinfo.c "leaf" confuses it with a CPUID
leaf -- especially since multiple CPUID calls are already sprinkled across
that file. Most of such instances also have a redundant "this_" prefix.
Rename all of the cacheinfo "this_leaf" instances to just "ci".
[ darwi: Move into separate commit and write commit log ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-8-darwi@linutronix.de
amd_cpuid4()'s first parameter, "leaf", is not a CPUID leaf as the name
implies. Rather, it's an index emulating CPUID(4)'s subleaf semantics;
i.e. an ID for the cache object currently enumerated. Rename that
parameter to "index".
Apply minor coding style fixes to the rest of the function as well.
[ darwi: Move into a separate commit and write commit log.
Use "index" instead of "subleaf" for amd_cpuid4() first param,
as that's the name typically used at the whole of cacheinfo.c. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-7-darwi@linutronix.de
Extract the cache descriptor lookup logic out of the leaf 0x2 parsing
code and into a dedicated function. This disentangles such lookup from
the deeply nested leaf 0x2 parsing loop.
Remove the cache table termination entry, as it is no longer needed
after the ARRAY_SIZE()-based lookup.
[ darwi: Move refactoring logic into this separate commit + commit log.
Remove the cache table termination entry. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-6-darwi@linutronix.de
Parent commit introduced CPUID leaf 0x2 parsing helpers at
<asm/cpuid/leaf_0x2_api.h>. The new API allows sharing leaf 0x2's output
validation and iteration logic across both intel.c and cacheinfo.c.
Convert cacheinfo.c to that new API.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-5-darwi@linutronix.de
Introduce CPUID leaf 0x2 parsing helpers at <asm/cpuid/leaf_0x2_api.h>.
This allows sharing the leaf 0x2's output validation and iteration logic
across both x86/cpu intel.c and cacheinfo.c.
Start by converting intel.c to the new API.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-4-darwi@linutronix.de
Leaf 0x2 output includes a "query count" byte which was supposed to
specify the number of repeated CPUID leaf 0x2 subleaf 0 queries needed to
extract all of the CPU's cache and TLB descriptors.
Per current Intel manuals, all CPUs supporting this leaf "will always"
return an iteration count of 1.
Remove the leaf 0x2 query loop and just query the hardware once.
Note, as previously done at commit aec28d852ed2 ("x86/cpuid: Standardize
on u32 in <asm/cpuid/api.h>"), standardize on using 'u32' and 'u8' types.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-3-darwi@linutronix.de
Leaf 0x2 output includes a "query count" byte which was supposed to
specify the number of repeated CPUID leaf 0x2 subleaf 0 queries needed to
extract all of the CPU's cache and TLB descriptors.
Per current Intel manuals, all CPUs supporting this leaf "will always"
return an iteration count of 1.
Remove the leaf 0x2 query loop and just query the hardware once.
Note, as previously done in:
aec28d852ed2 ("x86/cpuid: Standardize on u32 in <asm/cpuid/api.h>")
standardize on using 'u32' and 'u8' types.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-2-darwi@linutronix.de
When split_lock_mitigate is disabled, each CPU needs its own delayed_work
structure. They are used to reenable split lock detection after it has
been disabled. But a delayed_work structure must be correctly initialized
after its allocation.
The current implementation uses deferred initialization, which makes the
split lock handler code unclear. The code can be simplified a bit
if the initialization is moved to the appropriate initcall.
sld_setup() is called before setup_per_cpu_areas(), thus it can't be used
for this purpose, so introduce an independent initcall for
the initialization.
[ mingo: Simplified the 'work' assignment line a bit more. ]
Signed-off-by: Maksim Davydov <davydov-max@yandex-team.ru>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250325085807.171885-1-davydov-max@yandex-team.ru
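A sketch of what such an initcall-based initialization can look like
(the per-CPU variable, callback, and initcall level are illustrative
assumptions):

  static int __init split_lock_work_init(void)
  {
          int cpu;

          /* Set up each CPU's delayed_work once, up front, instead of
           * deferring the initialization to the split lock handler path. */
          for_each_possible_cpu(cpu)
                  INIT_DELAYED_WORK(per_cpu_ptr(&sl_reenable, cpu),
                                    split_lock_warn_delayed);

          return 0;
  }
  late_initcall(split_lock_work_init);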
fpu_init_fpstate_user() was removed in:
commit 582b01b6ab ("x86/fpu: Remove old KVM FPU interface").
Update the stale comment that still refers to it, so that it accurately
reflects the current state regarding its callers.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250324131931.2097905-1-chao.gao@intel.com
For X86_FEATURE_SMAP alternatives which replace NOP with STAC or CLAC,
uaccess validation skips the NOP branch to avoid following impossible
code paths, e.g. where a STAC would be patched but a CLAC wouldn't.
However, it's not safe to assume an X86_FEATURE_SMAP alternative is
patching STAC/CLAC. There can be other alternatives, like
static_cpu_has(), where both branches need to be validated.
Fix that by repurposing ANNOTATE_IGNORE_ALTERNATIVE for skipping either
original instructions or new ones. This is a more generic approach
which enables the removal of the feature checking hacks and the
insn->ignore bit.
Fixes the following warnings:
arch/x86/mm/fault.o: warning: objtool: do_user_addr_fault+0x8ec: __stack_chk_fail() missing __noreturn in .c/.h or NORETURN() in noreturns.h
arch/x86/mm/fault.o: warning: objtool: do_user_addr_fault+0x8f1: unreachable instruction
[ mingo: Fix up conflicts with recent x86 changes. ]
Fixes: ea24213d80 ("objtool: Add UACCESS validation")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/de0621ca242130156a55d5d74fed86994dfa4c9c.1742852846.git.jpoimboe@kernel.org
Closes: https://lore.kernel.org/oe-kbuild-all/202503181736.zkZUBv4N-lkp@intel.com/
During the bring-up of an x86 board, the kernel was crashing before
reaching the platform's console driver because of a bug in the firmware,
leaving no trace of the boot progress.
The only available method to debug the kernel boot process was via the
platform's MMIO-based UART, as the board lacked an I/O port-based UART,
PCI UART, or functional video output.
Then it turned out that earlyprintk= does not have a knob to configure
the MMIO-mapped UART.
Extend the early printk facility to support platform MMIO-based UARTs
on x86 systems, enabling debugging during the system bring-up phase.
The command line syntax to enable platform MMIO-based UART is:
earlyprintk=mmio,membase[,{nocfg|baudrate}][,keep]
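For example, a UART mapped at the purely hypothetical physical address
0xfedc9000 could then be enabled with:
  earlyprintk=mmio,0xfedc9000,115200,keep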
Note, the change does not integrate MMIO-based UART support to:
arch/x86/boot/early_serial_console.c
Also, update kernel parameters documentation with the new syntax and
add the missing 'nocfg' setting to the PCI serial cards description.
Signed-off-by: Denis Mukhin <dmukhin@ford.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250324-earlyprintk-v3-1-aee7421dc469@ford.com
Commit:
2e4be0d011 ("x86/show_trace_log_lvl: Ensure stack pointer is aligned, again")
was intended to ensure alignment of the stack pointer; but it also moved
the initialization of the "stack" variable down into the loop header.
This was likely intended as a no-op cleanup, since the commit
message does not mention it; however, this caused a behavioral change
because the value of "regs" is different between the two places.
Originally, get_stack_pointer() used the regs provided by the caller; after
that commit, get_stack_pointer() instead uses the regs at the top of the
stack frame the unwinder is looking at. Often, there are no such regs at
all, and "regs" is NULL, causing get_stack_pointer() to fall back to the
task's current stack pointer, which is not what we want here, but probably
happens to mostly work. Other times, the original regs will point to
another regs frame - in that case, the linear guess unwind logic in
show_trace_log_lvl() will start unwinding too far up the stack, causing the
first frame found by the proper unwinder to never be visited, resulting in
a stack trace consisting purely of guess lines.
Fix it by moving the "stack = " assignment back where it belongs.
Fixes: 2e4be0d011 ("x86/show_trace_log_lvl: Ensure stack pointer is aligned, again")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250325-2025-03-unwind-fixes-v1-2-acd774364768@google.com
PUSH_REGS with save_ret=1 is used by interrupt entry helper functions that
initially start with a UNWIND_HINT_FUNC ORC state.
However, save_ret=1 means that we clobber the helper function's return
address (and then later restore the return address further down on the
stack); after that point, the only thing on the stack we can unwind through
is the IRET frame, so use UNWIND_HINT_IRET_REGS until we have a full
pt_regs frame.
( An alternate approach would be to move the pt_regs->di overwrite down
such that it is the final step of pt_regs setup; but I don't want to
rearrange entry code just to make unwinding a tiny bit more elegant. )
Fixes: 9e809d15d6 ("x86/entry: Reduce the code footprint of the 'idtentry' macro")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250325-2025-03-unwind-fixes-v1-1-acd774364768@google.com
There are two issues that can appear when building with clang and
'-march=native'.
The first issue is a compiler crash in the ipv6 stack with clang-18,
such as when building allmodconfig. This was frequently reported on the
LLVM issue tracker:
# Link: https://github.com/llvm/llvm-project/issues?q=is%3Aissue%20ip6_rcv_core
Stack dump:
0. Program arguments: clang ... -march=native ... net/ipv6/ip6_input.c
1. <eof> parser at end of file
2. Code generation
3. Running pass 'Function Pass Manager' on module 'net/ipv6/ip6_input.c'.
4. Running pass 'X86 DAG->DAG Instruction Selection' on function '@ip6_rcv_core'
The second issue is certain -Wframe-larger-than warnings that appear
with clang versions prior to 19, which introduced an optimization that
produces much better code:
# Link: 90ba33099c [2]
clang-18:
drivers/media/pci/saa7164/saa7164-core.c:605:20: error: stack frame size (2392) exceeds limit (2048) in 'saa7164_irq' [-Werror,-Wframe-larger-than]
605 | static irqreturn_t saa7164_irq(int irq, void *dev_id)
| ^
clang-19:
drivers/media/pci/saa7164/saa7164-core.c:605:20: error: stack frame size (216) exceeds limit (128) in 'saa7164_irq' [-Werror,-Wframe-larger-than]
605 | static irqreturn_t saa7164_irq(int irq, void *dev_id)
| ^
Restrict the testing of the availability of '-march=native' to all
supported GCC versions and to clang-19 and newer, to avoid these known
and already-resolved issues.
Fixes: 0480bc7e65dc ("x86/kbuild/64: Add the CONFIG_X86_NATIVE_CPU option to locally optimize the kernel with '-march=native'")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tor Vic <torvic9@mailbox.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324-x86-march-native-clang-19-and-newer-v1-1-3a05ed32a89e@kernel.org
Stephen reported this build failure when cross-compiling:
cc1: error: bad value 'native' for '-march=' switch
Test for the availability of the -march=native flag.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> # build test
Cc: Tor Vic <torvic9@mailbox.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324172723.49fb0416@canb.auug.org.au
Add a 'native' option that allows users to build an optimized kernel for
their local machine (i.e. the machine which is used to build the kernel)
by passing '-march=native' to CFLAGS.
The idea comes from Linus' reply to Arnd's initial proposal:
https://lore.kernel.org/all/CAHk-=wji1sV93yKbc==Z7OSSHBiDE=LAdG_d5Y-zPBrnSs0k2A@mail.gmail.com/
Here are some numbers comparing 'generic' to 'native' on a Skylake dual-core
laptop (generic --> native):
- vmlinux and compressed modules size:
125'907'744 bytes --> 125'595'280 bytes (-0.248 %)
18'810 kilobytes --> 18'770 kilobytes (-0.213 %)
- phoronix, average of 3 runs:
ffmpeg:
130.99 --> 131.15 (+0.122 %)
nginx:
10'650 --> 10'725 (+0.704 %)
hackbench (lower is better):
102.27 --> 99.50 (-2.709 %)
- xz compression of firefox tarball (lower is better):
319.57 seconds --> 317.34 seconds (-0.698 %)
- stress-ng, bogoops, average of 3 15-second runs:
fork:
111'744 --> 115'509 (+3.397 %)
bsearch:
7'211 --> 7'436 (+3.120 %)
memfd:
3'591 --> 3'604 (+0.362 %)
mmapfork:
630 --> 629 (-0.159 %)
schedmix:
42'715 --> 43'251 (+1.255 %)
epoll:
2'443'767 --> 2'454'413 (+0.436 %)
vm:
1'442'256 --> 1'486'615 (+3.076 %)
- schbench (two message threads), 30-second runs:
304 rps --> 305 rps (+0.329 %)
There is little difference both in terms of size and performance; however,
the native build comes out on top ever so slightly.
[ mingo: Renamed the option to CONFIG_X86_NATIVE_CPU, expanded the help text
and added Linus's Suggested-by tag. ]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tor Vic <torvic9@mailbox.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250321142859.13889-1-torvic9@mailbox.org
Support for STA2X11-based systems was removed in February in:
dcbb01fbb7 ("x86/pci: Remove old STA2x11 support")
Intel MID for 32-bit platforms was removed from this list also in
February in:
ca5955dd5f ("x86/cpu: Document CONFIG_X86_INTEL_MID as 64-bit-only")
Intel MID for 64-bit platforms is a duplicate of the "Merrifield/Moorefield
MID devices" entry.
Fixes: 4047e8773f ("x86/Kconfig: Update lists in X86_EXTENDED_PLATFORM")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy@kernel.org>
Link: https://lore.kernel.org/r/20250322175052.43611-1-mat.jonczyk@o2.pl
Currently, it is not true that the kernel will panic with CONFIG_X86_X2APIC=n
on systems that require it; it will try to disable the APIC and run without
it to at least give the user a clear warning message. See the second
variant of check_x2apic() in arch/x86/kernel/apic/apic.c.
Also massage some other parts of the help text.
Fixes: 9232c49ff3 ("x86/Kconfig: Enable X86_X2APIC by default and improve help text")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250322154541.40325-1-mat.jonczyk@o2.pl
We've had this before: when we remove infrastructure to generate files,
the old stale build artifacts still remain in-tree. And when the
infrastructure to generate them is gone, so is the gitignore file for
those build artifacts.
End result: git will see the old generated files, and people will
mistakenly commit them. That's what happened with the 'genheaders' file
not that long ago (see commit 04a3389b35 "Remove stale generated
'genheaders' file").
This time it's commit 9c54baab44 ("x86/boot: Drop CRC-32 checksum and
the build tool that generates it") that removed the 'build' file from
the arch/x86/boot/tools/ subdirectory, and removed the .gitignore file
too (because the whole subdirectory is gone).
And as a result, if you don't do a 'git clean -dqfx' or similar to clean
up your tree, 'git status' will say
Untracked files:
(use "git add <file>..." to include in what will be committed)
arch/x86/boot/tools/
and some hapless sleep-deprived developer will inevitably decide that
that means that they need to 'git add' that directory. Which would
bring back some stale generated file that we most definitely do not want
in the tree.
So when removing directories that had special .gitignore patterns, make
sure to add a new gitignore entry in the parent directory for the no
longer existing subdirectory.
It will avoid mistakes.
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Fixes: 9c54baab44 ("x86/boot: Drop CRC-32 checksum and the build tool that generates it")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfepu4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gDlxAAjHZwjixviNtICUdOmvx1NwG4MX2w/Wcy
zF1PJWWkcxD7xlp6JXUgQFwJ6BLBj2UulJ3+5yrhv8InzAacfCdSgocrzmriYB9g
Zmw72nFXvj1aR70uS2Fs838ngHJ/SUDpHTRCOddl5mUEA6GxGzT71Zz1bykBRado
w0y0Nv3U0kuROGk/iQa6O9N3bFubQfP7KlfWEpEklrgP8NwFyZYsTX/b33N529TL
8SDJn/dOpBl5ZsHOSxYlZYtT/bpD+plWxZ0VpA92Drw3pZDgAnoWPVd4ADCJw1XX
GFVwWlEO3kuKEQ4F2AFJeZDqK2aHgeevSnDX6PA5f5bDhmU8T1/2/dwASfbCAEEa
1ZwHmuqYtxay6iISliTmIaxnC6a++m81tHQJVpnNcp3Ngl3EJ/c1Rve7W5YOz0E9
RMPJU3mqb+i2C9IUgCiW7f75W4fIyisOmURGhOKTVH3J3VeX/nu5Ggm5Egwf3Wjj
0Q+JTDRDkUa6hmgeBFm2T1ugCfYr0nYxicoEvNFrvm6cR8SdaKwYJXZLF4ZW7Ekn
EFnk5mTtCPe1VxtpRAf9Dt4rIqbzLcXsGiJaqb5ZWOpVMqNGmAX4Kct3VOLTEFMr
CKbso3jMt2iTnL8Z4THVbJDk6k7rJZ11Bg7xYZVHkZwaFAOZ5AHmj673aCgdwl6J
LM0sho6i45g=
=eh6a
-----END PGP SIGNATURE-----
Merge tag 'x86-platform-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Ingo Molnar:
"Two small cleanups in the x86 platform support code"
* tag 'x86-platform-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/olpc: Remove unused variable 'len' in olpc_dt_compatible_match()
x86/platform/olpc-xo1-sci: Don't include <linux/pm_wakeup.h> directly
Miscellaneous x86 cleanups by Arnd Bergmann, Charles Han, Mirsad Todorovac,
Randy Dunlap, Thorsten Blum and Zhang Kunbo.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfeo6ERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gl9w//ZRxGhf3eXYVyPW35aoqaRe7W3fBJhaPn
eLP0JuuglZO3JSTJfmbTmSXrNN4kkzI3nnuTnchK0iYZ+2Tn4E0iaz4tm5sjL5T1
KW9ajv7QQxMOwXS69hrklY/eGxVimfs4mIyhxEhxLjPvzbGJOVS6WFgvaYH5klQQ
7sPYXzrNGZJhGCmRqCelHSPhl7b6YVS/yWtMe+BpPBn1AqIQN+O+S/Kbu7dQokmq
7MNy4eUe7GYgnSB7Qhq/jNSqjmGWsVwhMiuoDxx7GvShM73+kYatQZT0H+qzTxzp
90rIPcPTNCOlKJO3xSWqXStcMH00MbplOhKdEU6SzeM+xC2aD2t4GoIG0EFxIQdS
X0wh40Psm5GehThhpmMBJdmM4Le4TTSkHBhedpQ+sp+6BIX4A9hoeCoybIlcngaJ
W89YKHqC5ruPSRDOzyaTMSypaEGH5VHP6AMj0CmkuyHRbxADsMazifpOVOw9AvVk
IR0USCzyenjLMiugRrJgRpy8R2Q1jp35bP0e7Y4QvmydEGC1Wl5mW5Y8B9Zdlk0C
iuxoqwntem7PAkEA2JejW6rzuRICxXloKv+xjtQjKzXpK+pwhhSp830zT8gaWZ1Z
hdQRz7gSlJg3JzWinX8+XNusdHw9TxCUqwgLw8486nzNp01GU62gby5DWgECEdae
2IyRRSd0FyA=
=GsK4
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Miscellaneous x86 cleanups by Arnd Bergmann, Charles Han, Mirsad
Todorovac, Randy Dunlap, Thorsten Blum and Zhang Kunbo"
* tag 'x86-cleanups-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/coco: Replace 'static const cc_mask' with the newly introduced cc_get_mask() function
x86/delay: Fix inconsistent whitespace
selftests/x86/syscall: Fix coccinelle WARNING recommending the use of ARRAY_SIZE()
x86/platform: Fix missing declaration of 'x86_apple_machine'
x86/irq: Fix missing declaration of 'io_apic_irqs'
x86/usercopy: Fix kernel-doc func param name in clean_cache_range()'s description
x86/apic: Use str_disabled_enabled() helper in print_ipi_mode()
- Improve crypto performance by making kernel-mode FPU reliably usable
in softirqs (Eric Biggers)
- Fully optimize out WARN_ON_FPU() (Eric Biggers)
- Initial steps to support Intel APX (Advanced Performance Extensions)
(Chang S. Bae)
- Fix KASAN for arch_dup_task_struct() (Benjamin Berg)
- Refine and simplify the FPU magic number check during signal return
(Chang S. Bae)
- Fix inconsistencies in guest FPU xfeatures (Chao Gao, Stanislav Spassov)
- selftests/x86/xstate: Introduce common code for testing extended states
(Chang S. Bae)
- Misc fixes and cleanups (Borislav Petkov, Colin Ian King, Uros Bizjak)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfepZQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1grLw/+Lt71PQWu/uVt3/dU0VkpppTqRmUTujuR
XVwFlBVxMFw9A8wi56bLD3WT28mKWr0SepynBt1Wr7/pZCnaW4SH/91ERChRqTzU
ifawmqQqhrVhFlXziDOKp9lcDy9f7NjJVKBfEBa1VBQGC0Q8FzeasOhCwTJw8qXa
Lj2z8zenhDq6NBkk22VlScc1CvsUBiXppvm8uk3md7HKaDqdzmCakm/a0FgaEppt
K6P/u1iaVvGXme2CHSskCshkYoEFIF6LRNYnkVloXsVV4AeWeLaJ54xW+syPd/4H
EX6oMLifdXCmxmsmi3LPRUjBgdSMsDaAkLsPNoX1w4uWjsihqyUl4wTEpobMIuu1
PWOrCxQKxtaGoECJ0nsE4uR7MZEQ0sYCGKv4JSiiXeUVf/EDRuUBfvBCoGlDWXYA
GZcoMmH+BtcYuQdzbrc/vWkfo3bpTxL3x5zsgtFkQL3xyzhugRPNgeOn9yuSZ9x6
AD8QS0G6uRcc9a5ZeeTEcYxoIE/+vvfnh1wfMjisehkVn179ixfZdgcSGrZJUNyH
a7LmUmwLvLYNO5DUUGs1upN8YHP7jiohNc/r2ZC1IX5LHbuK1gyXA4xo6hAZ8r/7
XZ2FfRp0dr3PvuWIwq2v6T5JHvALADsKCAjqvNdwHLkxF87ygixT+B87wTA3H6ov
LSY10A/eRx4=
=U7PR
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/fpu updates from Ingo Molnar:
- Improve crypto performance by making kernel-mode FPU reliably usable
in softirqs (Eric Biggers)
- Fully optimize out WARN_ON_FPU() (Eric Biggers)
- Initial steps to support Intel APX (Advanced Performance
Extensions) (Chang S. Bae)
- Fix KASAN for arch_dup_task_struct() (Benjamin Berg)
- Refine and simplify the FPU magic number check during signal return
(Chang S. Bae)
- Fix inconsistencies in guest FPU xfeatures (Chao Gao, Stanislav
Spassov)
- selftests/x86/xstate: Introduce common code for testing extended
states (Chang S. Bae)
- Misc fixes and cleanups (Borislav Petkov, Colin Ian King, Uros
Bizjak)
* tag 'x86-fpu-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu/xstate: Fix inconsistencies in guest FPU xfeatures
x86/fpu: Clarify the "xa" symbolic name used in the XSTATE* macros
x86/fpu: Use XSAVE{,OPT,C,S} and XRSTOR{,S} mnemonics in xstate.h
x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs
x86/fpu/xstate: Simplify print_xstate_features()
x86/fpu: Refine and simplify the magic number check during signal return
selftests/x86/xstate: Fix spelling mistake "hader" -> "header"
x86/fpu: Avoid copying dynamic FP state from init_task in arch_dup_task_struct()
vmlinux.lds.h: Remove entry to place init_task onto init_stack
selftests/x86/avx: Add AVX tests
selftests/x86/xstate: Clarify supported xstates
selftests/x86/xstate: Consolidate test invocations into a single entry
selftests/x86/xstate: Introduce signal ABI test
selftests/x86/xstate: Refactor ptrace ABI test
selftests/x86/xstate: Refactor context switching test
selftests/x86/xstate: Enumerate and name xstate components
selftests/x86/xstate: Refactor XSAVE helpers for general use
selftests/x86: Consolidate redundant signal helper functions
x86/fpu: Fix guest FPU state buffer allocation size
x86/fpu: Fully optimize out WARN_ON_FPU()
- Memblock setup and other early boot code cleanups (Mike Rapoport)
- Export e820_table_kexec[] to sysfs (Dave Young)
- Baby steps of adding relocate_kernel() debugging support (David Woodhouse)
- Replace open-coded parity calculation with parity8() (Kuan-Wei Chiu)
- Move the LA57 trampoline to separate source file (Ard Biesheuvel)
- Misc micro-optimizations (Uros Bizjak)
- Drop obsolete E820_TYPE_RESERVED_KERN and related code (Mike Rapoport)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'x86-boot-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot code updates from Ingo Molnar:
- Memblock setup and other early boot code cleanups (Mike Rapoport)
- Export e820_table_kexec[] to sysfs (Dave Young)
- Baby steps of adding relocate_kernel() debugging support (David
Woodhouse)
- Replace open-coded parity calculation with parity8() (Kuan-Wei Chiu)
- Move the LA57 trampoline to separate source file (Ard Biesheuvel)
- Misc micro-optimizations (Uros Bizjak)
- Drop obsolete E820_TYPE_RESERVED_KERN and related code (Mike
Rapoport)
* tag 'x86-boot-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kexec: Add relocate_kernel() debugging support: Load a GDT
x86/boot: Move the LA57 trampoline to separate source file
x86/boot: Do not test if AC and ID eflags are changeable on x86_64
x86/bootflag: Replace open-coded parity calculation with parity8()
x86/bootflag: Micro-optimize sbf_write()
x86/boot: Add missing has_cpuflag() prototype
x86/kexec: Export e820_table_kexec[] to sysfs
x86/boot: Change some static bootflag functions to bool
x86/e820: Drop obsolete E820_TYPE_RESERVED_KERN and related code
x86/boot: Split parsing of boot_params into the parse_boot_params() helper function
x86/boot: Split kernel resources setup into the setup_kernel_resources() helper function
x86/boot: Move setting of memblock parameters to e820__memblock_setup()
- Drop CRC-32 checksum and the build tool that generates it
(Ard Biesheuvel)
- Fix broken copy command in genimage.sh when making isoimage
(Nir Lichtman)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'x86-build-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Ingo Molnar:
- Drop CRC-32 checksum and the build tool that generates it (Ard
Biesheuvel)
- Fix broken copy command in genimage.sh when making isoimage (Nir
Lichtman)
* tag 'x86-build-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Add back some padding for the CRC-32 checksum
x86/boot: Drop CRC-32 checksum and the build tool that generates it
x86/build: Fix broken copy command in genimage.sh when making isoimage
two locking commits in the locking tree,
part of the locking-core-2025-03-22 pull request. ]
x86 CPU features support:
- Generate the <asm/cpufeaturemasks.h> header based on build config
(H. Peter Anvin, Xin Li)
- x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
- Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
- Enable modifying CPU bug flags with '{clear,set}cpuid='
(Brendan Jackman)
- Utilize CPU-type for CPU matching (Pawan Gupta)
- Warn about unmet CPU feature dependencies (Sohil Mehta)
- Prepare for new Intel Family numbers (Sohil Mehta)
Percpu code:
- Standardize & reorganize the x86 percpu layout and
related cleanups (Brian Gerst)
- Convert the stackprotector canary to a regular percpu
variable (Brian Gerst)
- Add a percpu subsection for cache hot data (Brian Gerst)
- Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
- Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
MM:
- Add support for broadcast TLB invalidation using AMD's INVLPGB instruction
(Rik van Riel)
- Rework ROX cache to avoid writable copy (Mike Rapoport)
- PAT: restore large ROX pages after fragmentation
(Kirill A. Shutemov, Mike Rapoport)
- Make memremap(MEMREMAP_WB) map memory as encrypted by default
(Kirill A. Shutemov)
- Robustify page table initialization (Kirill A. Shutemov)
- Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
- Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
(Matthew Wilcox)
KASLR:
- x86/kaslr: Reduce KASLR entropy on most x86 systems,
to support PCI BAR space beyond the 10TiB region
(CONFIG_PCI_P2PDMA=y) (Balbir Singh)
CPU bugs:
- Implement FineIBT-BHI mitigation (Peter Zijlstra)
- speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
- speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta)
- RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta)
System calls:
- Break up entry/common.c (Brian Gerst)
- Move sysctls into arch/x86 (Joel Granados)
Intel LAM support updates: (Maciej Wieczor-Retman)
- selftests/lam: Move cpu_has_la57() to use cpuinfo flag
- selftests/lam: Skip test if LAM is disabled
- selftests/lam: Test get_user() LAM pointer handling
AMD SMN access updates:
- Add SMN offsets to exclusive region access (Mario Limonciello)
- Add support for debugfs access to SMN registers (Mario Limonciello)
- Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
Power management updates: (Patryk Wlazlyn)
- Allow calling mwait_play_dead with an arbitrary hint
- ACPI/processor_idle: Add FFH state handling
- intel_idle: Provide the default enter_dead() handler
- Eliminate mwait_play_dead_cpuid_hint()
Build system:
- Raise the minimum GCC version to 8.1 (Brian Gerst)
- Raise the minimum LLVM version to 15.0.0
(Nathan Chancellor)
Kconfig: (Arnd Bergmann)
- Add cmpxchg8b support back to Geode CPUs
- Drop 32-bit "bigsmp" machine support
- Rework CONFIG_GENERIC_CPU compiler flags
- Drop configuration options for early 64-bit CPUs
- Remove CONFIG_HIGHMEM64G support
- Drop CONFIG_SWIOTLB for PAE
- Drop support for CONFIG_HIGHPTE
- Document CONFIG_X86_INTEL_MID as 64-bit-only
- Remove old STA2x11 support
- Only allow CONFIG_EISA for 32-bit
Headers:
- Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers
(Thomas Huth)
Assembly code & machine code patching:
- x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf)
- x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
- KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
- x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
- x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
- x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
(Uros Bizjak)
- Use named operands in inline asm (Uros Bizjak)
- Improve performance by using asm_inline() for atomic locking instructions
(Uros Bizjak)
Earlyprintk:
- Harden early_serial (Peter Zijlstra)
NMI handler:
- Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus()
(Waiman Long)
Miscellaneous fixes and cleanups:
- by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel,
Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst,
Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin,
Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport,
Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker,
Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra,
Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak,
Vitaly Kuznetsov, Xin Li, liuye.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
"x86 CPU features support:
- Generate the <asm/cpufeaturemasks.h> header based on build config
(H. Peter Anvin, Xin Li)
- x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
- Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
- Enable modifying CPU bug flags with '{clear,set}cpuid=' (Brendan
Jackman)
- Utilize CPU-type for CPU matching (Pawan Gupta)
- Warn about unmet CPU feature dependencies (Sohil Mehta)
- Prepare for new Intel Family numbers (Sohil Mehta)
Percpu code:
- Standardize & reorganize the x86 percpu layout and related cleanups
(Brian Gerst)
- Convert the stackprotector canary to a regular percpu variable
(Brian Gerst)
- Add a percpu subsection for cache hot data (Brian Gerst)
- Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
- Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
MM:
- Add support for broadcast TLB invalidation using AMD's INVLPGB
instruction (Rik van Riel)
- Rework ROX cache to avoid writable copy (Mike Rapoport)
- PAT: restore large ROX pages after fragmentation (Kirill A.
Shutemov, Mike Rapoport)
- Make memremap(MEMREMAP_WB) map memory as encrypted by default
(Kirill A. Shutemov)
- Robustify page table initialization (Kirill A. Shutemov)
- Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
- Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
(Matthew Wilcox)
KASLR:
- x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI
BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir
Singh)
CPU bugs:
- Implement FineIBT-BHI mitigation (Peter Zijlstra)
- speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
- speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan
Gupta)
- RFDS: Exclude P-only parts from the RFDS affected list (Pawan
Gupta)
System calls:
- Break up entry/common.c (Brian Gerst)
- Move sysctls into arch/x86 (Joel Granados)
Intel LAM support updates: (Maciej Wieczor-Retman)
- selftests/lam: Move cpu_has_la57() to use cpuinfo flag
- selftests/lam: Skip test if LAM is disabled
- selftests/lam: Test get_user() LAM pointer handling
AMD SMN access updates:
- Add SMN offsets to exclusive region access (Mario Limonciello)
- Add support for debugfs access to SMN registers (Mario Limonciello)
- Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
Power management updates: (Patryk Wlazlyn)
- Allow calling mwait_play_dead with an arbitrary hint
- ACPI/processor_idle: Add FFH state handling
- intel_idle: Provide the default enter_dead() handler
- Eliminate mwait_play_dead_cpuid_hint()
Build system:
- Raise the minimum GCC version to 8.1 (Brian Gerst)
- Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor)
Kconfig: (Arnd Bergmann)
- Add cmpxchg8b support back to Geode CPUs
- Drop 32-bit "bigsmp" machine support
- Rework CONFIG_GENERIC_CPU compiler flags
- Drop configuration options for early 64-bit CPUs
- Remove CONFIG_HIGHMEM64G support
- Drop CONFIG_SWIOTLB for PAE
- Drop support for CONFIG_HIGHPTE
- Document CONFIG_X86_INTEL_MID as 64-bit-only
- Remove old STA2x11 support
- Only allow CONFIG_EISA for 32-bit
Headers:
- Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI
headers (Thomas Huth)
Assembly code & machine code patching:
- x86/alternatives: Simplify alternative_call() interface (Josh
Poimboeuf)
- x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
- KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
- x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
- x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
- x86/kexec: Merge x86_32 and x86_64 code using macros from
<asm/asm.h> (Uros Bizjak)
- Use named operands in inline asm (Uros Bizjak)
- Improve performance by using asm_inline() for atomic locking
instructions (Uros Bizjak)
Earlyprintk:
- Harden early_serial (Peter Zijlstra)
NMI handler:
- Add an emergency handler in nmi_desc & use it in
nmi_shootdown_cpus() (Waiman Long)
Miscellaneous fixes and cleanups:
- by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem
Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan
Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar,
Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej
Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter
Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly
Kuznetsov, Xin Li, liuye"
* tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits)
zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault
x86/asm: Make asm export of __ref_stack_chk_guard unconditional
x86/mm: Only do broadcast flush from reclaim if pages were unmapped
perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones
perf/x86/intel, x86/cpu: Simplify Intel PMU initialization
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers
x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers
x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions
x86/asm: Use asm_inline() instead of asm() in clwb()
x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h>
x86/hweight: Use asm_inline() instead of asm()
x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm()
x86/hweight: Use named operands in inline asm()
x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP
x86/head/64: Avoid Clang < 17 stack protector in startup code
x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro
x86/cpu/intel: Limit the non-architectural constant_tsc model checks
x86/mm/pat: Replace Intel x86_model checks with VFM ones
x86/cpu/intel: Fix fast string initialization for extended Families
...
Core:
- Move perf_event sysctls into kernel/events/ (Joel Granados)
- Use POLLHUP for pinned events in error (Namhyung Kim)
- Avoid the read if the count is already updated (Peter Zijlstra)
- Allow the EPOLLRDNORM flag for poll (Tao Chen)
- locking/percpu-rwsem: Add guard support (Peter Zijlstra)
[ NOTE: this got (mis-)merged into the perf tree due to related work. ]
perf_pmu_unregister() related improvements: (Peter Zijlstra)
- Simplify the perf_event_alloc() error path
- Simplify the perf_pmu_register() error path
- Simplify perf_pmu_register()
- Simplify perf_init_event()
- Simplify perf_event_alloc()
- Merge struct pmu::pmu_disable_count into struct perf_cpu_pmu_context::pmu_disable_count
- Add this_cpc() helper
- Introduce perf_free_addr_filters()
- Robustify perf_event_free_bpf_prog()
- Simplify the perf_mmap() control flow
- Further simplify perf_mmap()
- Remove retry loop from perf_mmap()
- Lift event->mmap_mutex in perf_mmap()
- Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes
- Fix perf_mmap() failure path
Uprobes:
- Harden x86 uretprobe syscall trampoline check (Jiri Olsa)
- Remove redundant spinlock in uprobe_deny_signal() (Liao Chang)
- Remove the spinlock within handle_singlestep() (Liao Chang)
x86 Intel PMU enhancements:
- Support PEBS counters snapshotting (Kan Liang)
- Fix intel_pmu_read_event() (Kan Liang)
- Extend per event callchain limit to branch stack (Kan Liang)
- Fix system-wide LBR profiling (Kan Liang)
- Allocate bts_ctx only if necessary (Li RongQing)
- Apply static call for drain_pebs (Peter Zijlstra)
x86 AMD PMU enhancements: (Ravi Bangoria)
- Remove pointless sample period check
- Fix ->config to sample period calculation for OP PMU
- Fix perf_ibs_op.cnt_mask for CurCnt
- Don't allow freq mode event creation through ->config interface
- Add PMU specific minimum period
- Add ->check_period() callback
- Ceil sample_period to min_period
- Add support for OP Load Latency Filtering
- Update DTLB/PageSize decode logic
Hardware breakpoints:
- Return EOPNOTSUPP for unsupported breakpoint type (Saket Kumar Bhaskar)
Hardlockup detector improvements: (Li Huafei)
- perf_event memory leak
- Warn if watchdog_ev is leaked
Fixes and cleanups:
- Misc fixes and cleanups (Andy Shevchenko, Kan Liang, Peter Zijlstra,
Ravi Bangoria, Thorsten Blum, XieLudan)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'perf-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Core:
- Move perf_event sysctls into kernel/events/ (Joel Granados)
- Use POLLHUP for pinned events in error (Namhyung Kim)
- Avoid the read if the count is already updated (Peter Zijlstra)
- Allow the EPOLLRDNORM flag for poll (Tao Chen)
- locking/percpu-rwsem: Add guard support [ NOTE: this got
(mis-)merged into the perf tree due to related work ] (Peter
Zijlstra)
perf_pmu_unregister() related improvements: (Peter Zijlstra)
- Simplify the perf_event_alloc() error path
- Simplify the perf_pmu_register() error path
- Simplify perf_pmu_register()
- Simplify perf_init_event()
- Simplify perf_event_alloc()
- Merge struct pmu::pmu_disable_count into struct
perf_cpu_pmu_context::pmu_disable_count
- Add this_cpc() helper
- Introduce perf_free_addr_filters()
- Robustify perf_event_free_bpf_prog()
- Simplify the perf_mmap() control flow
- Further simplify perf_mmap()
- Remove retry loop from perf_mmap()
- Lift event->mmap_mutex in perf_mmap()
- Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes
- Fix perf_mmap() failure path
Uprobes:
- Harden x86 uretprobe syscall trampoline check (Jiri Olsa)
- Remove redundant spinlock in uprobe_deny_signal() (Liao Chang)
- Remove the spinlock within handle_singlestep() (Liao Chang)
x86 Intel PMU enhancements:
- Support PEBS counters snapshotting (Kan Liang)
- Fix intel_pmu_read_event() (Kan Liang)
- Extend per event callchain limit to branch stack (Kan Liang)
- Fix system-wide LBR profiling (Kan Liang)
- Allocate bts_ctx only if necessary (Li RongQing)
- Apply static call for drain_pebs (Peter Zijlstra)
x86 AMD PMU enhancements: (Ravi Bangoria)
- Remove pointless sample period check
- Fix ->config to sample period calculation for OP PMU
- Fix perf_ibs_op.cnt_mask for CurCnt
- Don't allow freq mode event creation through ->config interface
- Add PMU specific minimum period
- Add ->check_period() callback
- Ceil sample_period to min_period
- Add support for OP Load Latency Filtering
- Update DTLB/PageSize decode logic
Hardware breakpoints:
- Return EOPNOTSUPP for unsupported breakpoint type (Saket Kumar
Bhaskar)
Hardlockup detector improvements: (Li Huafei)
- perf_event memory leak
- Warn if watchdog_ev is leaked
Fixes and cleanups:
- Misc fixes and cleanups (Andy Shevchenko, Kan Liang, Peter
Zijlstra, Ravi Bangoria, Thorsten Blum, XieLudan)"
* tag 'perf-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
perf: Fix __percpu annotation
perf: Clean up pmu specific data
perf/x86: Remove swap_task_ctx()
perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode
perf: Supply task information to sched_task()
perf: attach/detach PMU specific data
locking/percpu-rwsem: Add guard support
perf: Save PMU specific data in task_struct
perf: Extend per event callchain limit to branch stack
perf/ring_buffer: Allow the EPOLLRDNORM flag for poll
perf/core: Use POLLHUP for pinned events in error
perf/core: Use sysfs_emit() instead of scnprintf()
perf/core: Remove optional 'size' arguments from strscpy() calls
perf/x86/intel/bts: Check if bts_ctx is allocated when calling BTS functions
uprobes/x86: Harden uretprobe syscall trampoline check
watchdog/hardlockup/perf: Warn if watchdog_ev is leaked
watchdog/hardlockup/perf: Fix perf_event memory leak
perf/x86: Annotate struct bts_buffer::buf with __counted_by()
perf/core: Clean up perf_try_init_event()
perf/core: Fix perf_mmap() failure path
...
[ Merge note, these two commits are identical:
- f3fa0e40df ("sched/clock: Don't define sched_clock_irqtime as static key")
- b9f2b29b94 ("sched: Don't define sched_clock_irqtime as static key")
The first one is a cherry-picked version of the second, and the first one
is already upstream. ]
Core & fair scheduler changes:
- Cancel the slice protection of the idle entity (Zihan Zhou)
- Reduce the default slice to avoid tasks getting an extra tick
(Zihan Zhou)
- Force propagating min_slice of cfs_rq when {en,de}queue tasks
(Tianchen Ding)
- Refactor can_migrate_task() to eliminate looping (I Hsin Cheng)
- Add unlikely branch hints to several system calls (Colin Ian King)
- Optimize current_clr_polling() on certain architectures (Yujun Dong)
Deadline scheduler: (Juri Lelli)
- Remove redundant dl_clear_root_domain call
- Move dl_rebuild_rd_accounting to cpuset.h
Uclamp:
- Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan)
- Optimize sched_uclamp_used static key enabling (Xuewen Yan)
Scheduler topology support: (Juri Lelli)
- Ignore special tasks when rebuilding domains
- Add wrappers for sched_domains_mutex
- Generalize unique visiting of root domains
- Rebuild root domain accounting after every update
- Remove partition_and_rebuild_sched_domains
- Stop exposing partition_sched_domains_locked
RSEQ: (Michael Jeanson)
- Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
- Fix segfault on registration when rseq_cs is non-zero
- selftests: Add rseq syscall errors test
- selftests: Ensure the rseq ABI TLS is actually 1024 bytes
Membarriers:
- Fix redundant load of membarrier_state (Nysal Jan K.A.)
Scheduler debugging:
- Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
- Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)
Fixes and cleanups:
- Always save/restore x86 TSC sched_clock() on suspend/resume
(Guilherme G. Piccoli)
- Misc fixes and cleanups (Thorsten Blum, Juri Lelli,
Sebastian Andrzej Siewior)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Core & fair scheduler changes:
- Cancel the slice protection of the idle entity (Zihan Zhou)
- Reduce the default slice to avoid tasks getting an extra tick
(Zihan Zhou)
- Force propagating min_slice of cfs_rq when {en,de}queue tasks
(Tianchen Ding)
- Refactor can_migrate_task() to eliminate looping (I Hsin Cheng)
- Add unlikely branch hints to several system calls (Colin Ian King)
- Optimize current_clr_polling() on certain architectures (Yujun
Dong)
Deadline scheduler: (Juri Lelli)
- Remove redundant dl_clear_root_domain call
- Move dl_rebuild_rd_accounting to cpuset.h
Uclamp:
- Use the uclamp_is_used() helper instead of open-coding it (Xuewen
Yan)
- Optimize sched_uclamp_used static key enabling (Xuewen Yan)
Scheduler topology support: (Juri Lelli)
- Ignore special tasks when rebuilding domains
- Add wrappers for sched_domains_mutex
- Generalize unique visiting of root domains
- Rebuild root domain accounting after every update
- Remove partition_and_rebuild_sched_domains
- Stop exposing partition_sched_domains_locked
RSEQ: (Michael Jeanson)
- Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
- Fix segfault on registration when rseq_cs is non-zero
- selftests: Add rseq syscall errors test
- selftests: Ensure the rseq ABI TLS is actually 1024 bytes
Membarriers:
- Fix redundant load of membarrier_state (Nysal Jan K.A.)
Scheduler debugging:
- Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
- Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)
Fixes and cleanups:
- Always save/restore x86 TSC sched_clock() on suspend/resume
(Guilherme G. Piccoli)
- Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian
Andrzej Siewior)"
* tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling()
sched/debug: Remove CONFIG_SCHED_DEBUG
sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files
sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation
sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
sched/debug: Make 'const_debug' tunables unconditional __read_mostly
sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
rseq/selftests: Fix namespace collision with rseq UAPI header
include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h
sched/topology: Stop exposing partition_sched_domains_locked
cgroup/cpuset: Remove partition_and_rebuild_sched_domains
sched/topology: Remove redundant dl_clear_root_domain call
sched/deadline: Rebuild root domain accounting after every update
sched/deadline: Generalize unique visiting of root domains
sched/topology: Wrappers for sched_domains_mutex
sched/deadline: Ignore special tasks when rebuilding domains
tracing: Use preempt_model_str()
xtensa: Rely on generic printing of preemption model
x86: Rely on generic printing of preemption model
s390: Rely on generic printing of preemption model
...
- The biggest change is the new option to automatically fail
the build on objtool warnings: CONFIG_OBJTOOL_WERROR.
While there are no currently known unfixed false positives
left, such an expansion in the severity of objtool warnings
inevitably creates a risk of build failures, so it's disabled by
default and depends on !COMPILE_TEST, so it shouldn't be enabled
on allyesconfig/allmodconfig builds and won't be forced on people
who just accept build-time defaults in 'make oldconfig'.
While the option is strongly recommended, only people who enable
it explicitly should see it.
(Josh Poimboeuf)
- Disable branch profiling in noinstr code with a broad
brush that includes all of arch/x86/ and kernel/sched/. (Josh Poimboeuf)
- Create backup object files on objtool errors and print exact
objtool arguments to make failure analysis easier (Josh Poimboeuf)
- Improve noreturn handling (Josh Poimboeuf)
- Improve rodata handling (Tiezhu Yang)
- Support jump tables, switch tables and goto tables on LoongArch (Tiezhu Yang)
- Misc cleanups and fixes (Josh Poimboeuf, David Engraf, Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'objtool-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- The biggest change is the new option to automatically fail the build
on objtool warnings: CONFIG_OBJTOOL_WERROR.
While there are no currently known unfixed false positives left, such
an expansion in the severity of objtool warnings inevitably creates a
risk of build failures, so it's disabled by default and depends on
!COMPILE_TEST, so it shouldn't be enabled on
allyesconfig/allmodconfig builds and won't be forced on people who
just accept build-time defaults in 'make oldconfig'.
While the option is strongly recommended, only people who enable it
explicitly should see it.
(Josh Poimboeuf)
- Disable branch profiling in noinstr code with a broad brush that
includes all of arch/x86/ and kernel/sched/. (Josh Poimboeuf)
- Create backup object files on objtool errors and print exact objtool
arguments to make failure analysis easier (Josh Poimboeuf)
- Improve noreturn handling (Josh Poimboeuf)
- Improve rodata handling (Tiezhu Yang)
- Support jump tables, switch tables and goto tables on LoongArch
(Tiezhu Yang)
- Misc cleanups and fixes (Josh Poimboeuf, David Engraf, Ingo Molnar)
* tag 'objtool-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
tracing: Disable branch profiling in noinstr code
objtool: Use O_CREAT with explicit mode mask
objtool: Add CONFIG_OBJTOOL_WERROR
objtool: Create backup on error and print args
objtool: Change "warning:" to "error:" for --Werror
objtool: Add --Werror option
objtool: Add --output option
objtool: Upgrade "Linked object detected" warning to error
objtool: Consolidate option validation
objtool: Remove --unret dependency on --rethunk
objtool: Increase per-function WARN_FUNC() rate limit
objtool: Update documentation
objtool: Improve __noreturn annotation warning
objtool: Fix error handling inconsistencies in check()
x86/traps: Make exc_double_fault() consistently noreturn
LoongArch: Enable jump table for objtool
objtool/LoongArch: Add support for goto table
objtool/LoongArch: Add support for switch table
objtool: Handle PC relative relocation type
objtool: Handle different entry size of rodata
...
- loadpin: remove unsupported MODULE_COMPRESS_NONE (Arulpandiyan Vadivel)
- samples/check-exec: Fix script name (Mickaël Salaün)
- yama: remove needless locking in yama_task_prctl() (Oleg Nesterov)
- lib/string_choices: Sort by function name (R Sundar)
- hardening: Allow default HARDENED_USERCOPY to be set at compile time
(Mel Gorman)
- uaccess: Split out compile-time checks into ucopysize.h
- kbuild: clang: Support building UM with SUBARCH=i386
- x86: Enable i386 FORTIFY_SOURCE on Clang 16+
- ubsan/overflow: Rework integer overflow sanitizer option
- Add missing __nonstring annotations for callers of memtostr*()/strtomem*()
- Add __must_be_noncstr() and have memtostr*()/strtomem*() check for it
- Introduce __nonstring_array for silencing future GCC 15 warnings
Merge tag 'hardening-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"As usual, it's scattered changes all over. Patches touching things
outside of our traditional areas in the tree have been Acked by
maintainers or were trivial changes:
- loadpin: remove unsupported MODULE_COMPRESS_NONE (Arulpandiyan
Vadivel)
- samples/check-exec: Fix script name (Mickaël Salaün)
- yama: remove needless locking in yama_task_prctl() (Oleg Nesterov)
- lib/string_choices: Sort by function name (R Sundar)
- hardening: Allow default HARDENED_USERCOPY to be set at compile
time (Mel Gorman)
- uaccess: Split out compile-time checks into ucopysize.h
- kbuild: clang: Support building UM with SUBARCH=i386
- x86: Enable i386 FORTIFY_SOURCE on Clang 16+
- ubsan/overflow: Rework integer overflow sanitizer option
- Add missing __nonstring annotations for callers of
memtostr*()/strtomem*()
- Add __must_be_noncstr() and have memtostr*()/strtomem*() check for
it
- Introduce __nonstring_array for silencing future GCC 15 warnings"
* tag 'hardening-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
compiler_types: Introduce __nonstring_array
hardening: Enable i386 FORTIFY_SOURCE on Clang 16+
x86/build: Remove -ffreestanding on i386 with GCC
ubsan/overflow: Enable ignorelist parsing and add type filter
ubsan/overflow: Enable pattern exclusions
ubsan/overflow: Rework integer overflow sanitizer option to turn on everything
samples/check-exec: Fix script name
yama: don't abuse rcu_read_lock/get_task_struct in yama_task_prctl()
kbuild: clang: Support building UM with SUBARCH=i386
loadpin: remove MODULE_COMPRESS_NONE as it is no longer supported
lib/string_choices: Rearrange functions in sorted order
string.h: Validate memtostr*()/strtomem*() arguments more carefully
compiler.h: Introduce __must_be_noncstr()
nilfs2: Mark on-disk strings as nonstring
uapi: stddef.h: Introduce __kernel_nonstring
x86/tdx: Mark message.bytes as nonstring
string: kunit: Mark nonstring test strings as __nonstring
scsi: qla2xxx: Mark device strings as nonstring
scsi: mpt3sas: Mark device strings as nonstring
scsi: mpi3mr: Mark device strings as nonstring
...
Merge tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
- Mount notifications
The day has come where we finally provide a new api to listen for
mount topology changes outside of /proc/<pid>/mountinfo. A mount
namespace file descriptor can be supplied and registered with
fanotify to listen for mount topology changes.
Currently notifications for mount, umount and moving mounts are
generated. The generated notification record contains the unique
mount id of the mount.
The listmount() and statmount() api can be used to query detailed
information about the mount using the received unique mount id.
This allows userspace to figure out exactly how the mount topology
changed without having to generate diffs of /proc/<pid>/mountinfo
in userspace.
- Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount
api
- Support detached mounts in overlayfs
Since last cycle we support specifying overlayfs layers via file
descriptors. However, we don't allow detached mounts which means
userspace cannot use file descriptors received via
open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to
attach them to a mount namespace via move_mount() first.
This is cumbersome and means they have to undo mounts via umount().
Allow them to directly use detached mounts.
- Allow to retrieve idmappings with statmount
Currently it isn't possible to figure out what idmapping has been
attached to an idmapped mount. Add an extension to statmount() which
allows to read the idmapping from the mount.
- Allow creating idmapped mounts from mounts that are already idmapped
So far it hasn't been possible to create idmapped mounts from already
idmapped mounts, as this has significant lifetime implications. Make
the creation of idmapped mounts atomic by allowing a struct mount_attr
to be passed together with the open_tree_attr() system call, which
solves these issues without complicating VFS lookup in any way.
The system call has in general the benefit that creating a detached
mount and applying mount attributes to it becomes an atomic operation
for userspace.
- Add a way to query statmount() for supported options
Allow userspace to query which mount information can be retrieved
through statmount().
- Allow superblock owners to force unmount
* tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits)
umount: Allow superblock owners to force umount
selftests: add tests for mount notification
selinux: add FILE__WATCH_MOUNTNS
samples/vfs: fix printf format string for size_t
fs: allow changing idmappings
fs: add kflags member to struct mount_kattr
fs: add open_tree_attr()
fs: add copy_mount_setattr() helper
fs: add vfs_open_tree() helper
statmount: add a new supported_mask field
samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP
selftests: add tests for using detached mount with overlayfs
samples/vfs: check whether flag was raised
statmount: allow to retrieve idmappings
uidgid: add map_id_range_up()
fs: allow detached mounts in clone_private_mount()
selftests/overlayfs: test specifying layers as O_PATH file descriptors
fs: support O_PATH fds with FSCONFIG_SET_FD
vfs: add notifications for mount attach and detach
fanotify: notify on mount attach and detach
...
Merge cpufreq updates for 6.15-rc1:
- Manage sysfs attributes and boost frequencies efficiently from
cpufreq core to reduce boilerplate code from drivers (Viresh Kumar).
- Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider,
Dhananjay Ugwekar, Imran Shaik, and zuoqian).
- Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky
Bai).
- cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski).
- Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng).
- Optimize the amd-pstate driver to avoid cases where call paths end
up calling the same writes multiple times and needlessly caching
variables through code reorganization, locking overhaul and tracing
adjustments (Mario Limonciello, Dhananjay Ugwekar).
- Make it possible to avoid enabling capacity-aware scheduling (CAS) in
the intel_pstate driver and relocate a check for out-of-band (OOB)
platform handling in it to make it detect OOB before checking HWP
availability (Rafael Wysocki).
- Fix dbs_update() to avoid inadvertent conversions of negative integer
values to unsigned int which causes CPU frequency selection to be
inaccurate in some cases when the "conservative" cpufreq governor is
in use (Jie Zhan).
* pm-cpufreq: (91 commits)
dt-bindings: cpufreq: cpufreq-qcom-hw: Narrow properties on SDX75, SA8775p and SM8650
dt-bindings: cpufreq: cpufreq-qcom-hw: Drop redundant minItems:1
dt-bindings: cpufreq: cpufreq-qcom-hw: Add missing constraint for interrupt-names
dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCS8300 compatible
cpufreq: Init cpufreq only for present CPUs
cpufreq: tegra186: Share policy per cluster
cpufreq/amd-pstate: Drop actions in amd_pstate_epp_cpu_offline()
cpufreq/amd-pstate: Stop caching EPP
cpufreq/amd-pstate: Rework CPPC enabling
cpufreq/amd-pstate: Drop debug statements for policy setting
cpufreq/amd-pstate: Update cppc_req_cached for shared mem EPP writes
cpufreq/amd-pstate: Move all EPP tracing into *_update_perf and *_set_epp functions
cpufreq/amd-pstate: Cache CPPC request in shared mem case too
cpufreq/amd-pstate: Replace all AMD_CPPC_* macros with masks
cpufreq/amd-pstate-ut: Adjust variable scope
cpufreq/amd-pstate-ut: Run on all of the correct CPUs
cpufreq/amd-pstate-ut: Drop SUCCESS and FAIL enums
cpufreq/amd-pstate-ut: Allow lowest nonlinear and lowest to be the same
cpufreq/amd-pstate-ut: Use _free macro to free put policy
cpufreq/amd-pstate: Drop `cppc_cap1_cached`
...
Merge an ACPI CPPC update, ACPI platform-profile driver updates, an ACPI
APEI update and a MAINTAINERS update related to ACPI for 6.15-rc1:
- Add a missing header file include to the x86 arch CPPC code (Mario
Limonciello).
- Rework the sysfs attributes implementation in the ACPI platform-profile
driver and improve the unregistration code in it (Nathan Chancellor,
Kurt Borja).
- Prevent the ACPI HED driver from being built as a module and change
its initcall level to subsys_initcall to avoid initialization ordering
issues related to it (Xiaofei Tan).
- Update a maintainer email address in the ACPI PMIC entry in
MAINTAINERS (Mika Westerberg).
* acpi-x86:
x86/ACPI: CPPC: Add missing include
* acpi-platform-profile:
ACPI: platform_profile: Improve platform_profile_unregister()
ACPI: platform-profile: Fix CFI violation when accessing sysfs files
* acpi-apei:
ACPI: HED: Always initialize before evged
* acpi-misc:
MAINTAINERS: Use my kernel.org address for ACPI PMIC work
The toplevel Makefile already provides -fmacro-prefix-map as part of
KBUILD_CPPFLAGS. In contrast to the KBUILD_CFLAGS and KBUILD_AFLAGS
variables, KBUILD_CPPFLAGS is not redefined in the architecture specific
Makefiles. Therefore the toplevel KBUILD_CPPFLAGS do apply just fine to
both C and ASM sources.
The custom configuration was necessary when it was added in
commit 9e2276fa6e ("arch/x86/boot: Use prefix map to avoid embedded
paths") but has since become unnecessary in commit a716bd7432
("kbuild: use -fmacro-prefix-map for .S sources").
Drop the now unnecessary custom prefix map configuration.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update()
for each use of likely() or unlikely(). That breaks noinstr rules if
the affected function is annotated as noinstr.
Disable branch profiling for files with noinstr functions. In addition
to some individual files, this also includes the entire arch/x86
subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle
directories, all of which are noinstr-heavy.
Due to the nature of how sched binaries are built by combining multiple
.c files into one, branch profiling is disabled more broadly across the
sched code than would otherwise be needed.
This fixes many warnings like the following:
vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section
vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section
vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section
...
Reported-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
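For illustration only (not part of the patch), a minimal sketch of how a
single compilation unit opts out of branch profiling; the directory-wide
variant in the patch uses the equivalent -DDISABLE_BRANCH_PROFILING
compiler flag instead:

  /*
   * With CONFIG_TRACE_BRANCH_PROFILING=y, likely()/unlikely() normally
   * expand to a wrapper that calls ftrace_likely_update().  Defining
   * DISABLE_BRANCH_PROFILING before any kernel header makes them collapse
   * back to plain __builtin_expect(), which is what noinstr code needs.
   */
  #define DISABLE_BRANCH_PROFILING        /* must precede any kernel header */

  #include <linux/compiler.h>
  #include <linux/types.h>

  static noinstr bool example_check(long value)
  {
          /* no ftrace_likely_update() call is emitted for this unlikely() */
          if (unlikely(value < 0))
                  return false;
          return true;
  }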
Commit:
010c4a461c ("x86/speculation: Simplify and make CALL_NOSPEC consistent")
added an #ifdef CONFIG_MITIGATION_RETPOLINE around the CALL_NOSPEC definition.
This is not required as this code is already under a larger #ifdef.
Remove the extra #ifdef, no functional change.
vmlinux size remains same before and after this change:
CONFIG_MITIGATION_RETPOLINE=y:
text data bss dec hex filename
25434752 7342290 2301212 35078254 217406e vmlinux.before
25434752 7342290 2301212 35078254 217406e vmlinux.after
# CONFIG_MITIGATION_RETPOLINE is not set:
text data bss dec hex filename
22943094 6214994 1550152 30708240 1d49210 vmlinux.before
22943094 6214994 1550152 30708240 1d49210 vmlinux.after
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20250320-call-nospec-extra-ifdef-v1-1-d9b084d24820@linux.intel.com
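Schematically (not the verbatim <asm/nospec-branch.h> contents), the
redundancy being removed looks like this:

  /*
   * Illustrative sketch only: the inner guard tests the same option that
   * already encloses the block, so dropping it cannot change the code.
   */
  #ifdef CONFIG_MITIGATION_RETPOLINE
  /* ... other retpoline-only definitions ... */
  # ifdef CONFIG_MITIGATION_RETPOLINE     /* redundant: always true here */
  #  define CALL_NOSPEC  /* retpoline thunk call sequence */
  # endif
  #endif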
Although IBS "swfilt" can prevent leaking samples with kernel RIP to the
userspace, there are few subtle cases where a 'data' address and/or a
'branch target' address can fall under kernel address range although RIP
is from userspace. Prevent leaking kernel 'data' addresses by discarding
such samples when {exclude_kernel=1,swfilt=1}.
IBS can now be invoked by unprivileged user with the introduction of
"swfilt". However, this creates a loophole in the interface where an
unprivileged user can get physical address of the userspace virtual
addresses through IBS register raw dump (PERF_SAMPLE_RAW). Prevent this
as well.
This upstream commit fixed the most obvious leak:
  65a99264f5 ("perf/x86: Check data address for IBS software filter")
Follow that up with a more complete fix.
Fixes: d29e744c71 ("perf/x86: Relax privilege filter restriction on AMD IBS")
Suggested-by: Matteo Rizzo <matteorizzo@google.com>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250321161251.1033-1-ravi.bangoria@amd.com
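A hedged sketch of the filtering rule described above; perf_ibs_should_drop()
is an illustrative name rather than the actual kernel helper, while
kernel_ip() is the existing x86 perf helper (from the internal x86 events
header) that tests for kernel-range addresses:

  #include <linux/perf_event.h>

  /* Sketch only, not the real implementation. */
  static bool perf_ibs_should_drop(struct perf_event *event, bool swfilt,
                                   u64 data_addr)
  {
          /*
           * With {exclude_kernel=1, swfilt=1}, a sample whose data (or
           * branch target) address falls in the kernel range must be
           * discarded even when the RIP itself is a user address.
           */
          return swfilt && event->attr.exclude_kernel && kernel_ip(data_addr);
  }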
I wonder how many people were checking their glibc version when
considering whether to enable this option.
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heidelberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-7-b0cbaa6fa338@ixit.cz
I was unable to find a good description of the ServerWorks CNB20LE
chipset. However, it was probably exclusively used with the Pentium III
processor (this CPU model was used in all references to it that I
found where the CPU model was provided: dmesgs in [1] and [2];
[3] page 2; [4]-[7]).
As is widely known, the Pentium III processor did not support the 64-bit
mode, support for which was introduced by Intel a couple of years later.
So it is safe to assume that no systems with the CNB20LE chipset are
amd64-capable, and CONFIG_PCI_CNB20LE_QUIRK may now depend on X86_32.
Additionally, I have determined that most computers with the CNB20LE
chipset did have ACPI support and this driver was inactive on them.
I have submitted a patch to remove this driver, but it was met with
resistance [8].
[1] Jim Studt, Re: Problem with ServerWorks CNB20LE and lost interrupts
Linux Kernel Mailing List, https://lkml.org/lkml/2002/1/11/111
[2] RedHat Bug 665109 - e100 problems on old Compaq Proliant DL320
https://bugzilla.redhat.com/show_bug.cgi?id=665109
[3] R. Hughes-Jones, S. Dallison, G. Fairey, Performance Measurements on
Gigabit Ethernet NICs and Server Quality Motherboards,
http://datatag.web.cern.ch/papers/pfldnet2003-rhj.doc
[4] "Hardware for Linux",
Probe #d6b5151873 of Intel STL2-bd A28808-302 Desktop Computer (STL2)
https://linux-hardware.org/?probe=d6b5151873
[5] "Hardware for Linux", Probe #0b5d843f10 of Compaq ProLiant DL380
https://linux-hardware.org/?probe=0b5d843f10
[6] Ubuntu Forums, Dell Poweredge 2400 - Adaptec SCSI Bus AIC-7880
https://ubuntuforums.org/showthread.php?t=1689552
[7] Ira W. Snyder, "BISECTED: 2.6.35 (and -git) fail to boot: APIC problems"
https://lkml.org/lkml/2010/8/13/220
[8] Bjorn Helgaas, "Re: [PATCH] x86/pci: drop ServerWorks / Broadcom
CNB20LE PCI host bridge driver"
https://lore.kernel.org/lkml/20220318165535.GA840063@bhelgaas/T/
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heidelberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-6-b0cbaa6fa338@ixit.cz
This configuration option had no help text, so add it.
CONFIG_EXPERT is enabled on some distribution kernels, so people using a
distribution kernel's configuration as a starting point will see this
option.
[ mingo: Standardized the new Kconfig text a bit. ]
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heidelberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-5-b0cbaa6fa338@ixit.cz
The order of the entries matches the order they appear in Kconfig.
In 2011, AMD Elan was moved to Kconfig.cpu and the dependency on
X86_EXTENDED_PLATFORM was dropped in:
ce9c99af8d ("x86, cpu: Move AMD Elan Kconfig under "Processor family"")
Support for Moorestown MID devices was removed in 2012 in:
1a8359e411 ("x86/mid: Remove Intel Moorestown")
SGI 320/540 (Visual Workstation) was removed in 2014 in:
c5f9ee3d66 ("x86, platforms: Remove SGI Visual Workstation")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heidelberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-4-b0cbaa6fa338@ixit.cz
So that these options will be displayed together in menuconfig etc.
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heidelberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-3-b0cbaa6fa338@ixit.cz
It appears that (X86_64 || X86_32) is always true on x86.
This logical OR directive was introduced in:
6ea3038648 ("arch/x86: remove depends on CONFIG_EXPERIMENTAL")
by (EXPERIMENTAL && X86_32) turning into (X86_32). Since
this change was an identity transformation, nobody noticed
that the condition turned into 'true'.
[ mingo: Updated changelog ]
Fixes: 6ea3038648 ("arch/x86: remove depends on CONFIG_EXPERIMENTAL")
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: David Heidelberg <david@ixit.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-2-b0cbaa6fa338@ixit.cz
As many current platforms (most modern Intel CPUs and QEMU) have x2APIC
present, enable CONFIG_X86_X2APIC by default as it gives performance
and functionality benefits. Additionally, if the BIOS has already
switched APIC to x2APIC mode, but CONFIG_X86_X2APIC is disabled, the
kernel will panic in arch/x86/kernel/apic/apic.c.
Also improve the help text, which was confusing and really did not
describe what the feature is about.
Help text references and discussion:
Both Intel [1] and AMD [3] spell the name as "x2APIC", not "x2apic".
"It allows faster access to the local APIC"
[2], chapter 2.1, page 15:
"More efficient MSR interface to access APIC registers."
"x2APIC was introduced in Intel CPUs around 2008":
I was unable to find specific information which Intel CPUs
support x2APIC. Wikipedia claims it was "introduced with the
Nehalem microarchitecture in November 2008", but I was not able
to confirm this independently. At least some Nehalem CPUs do not
support x2APIC [1].
The documentation [2] is dated June 2008. Linux kernel also
introduced x2APIC support in 2008, so the year seems to be
right.
"and in AMD EPYC CPUs in 2019":
[3], page 15:
"AMD introduced an x2APIC in our EPYC 7002 Series processors for
the first time."
"It is also frequently emulated in virtual machines, even when the host
CPU does not support it."
[1]
"If this configuration option is disabled, the kernel will not boot on
some platforms that have x2APIC enabled."
According to some BIOS documentation [4], the x2APIC may be
"disabled", "enabled", or "force enabled" on this system.
I think that "enabled" means "made available to the operating
system, but not already turned on" and "force enabled" means
"already switched to x2APIC mode when the OS boots". Only in the
latter mode will a kernel without CONFIG_X86_X2APIC panic in
validate_x2apic() in arch/x86/kernel/apic/apic.c.
QEMU 4.2.1 and my Intel HP laptop (bought in 2019) use the
"enabled" mode and the kernel does not panic.
[1] "Re: [Qemu-devel] [Question] why x2apic's set by default without host sup"
https://lists.gnu.org/archive/html/qemu-devel/2013-07/msg03527.html
[2] Intel® 64 Architecture x2APIC Specification,
( https://www.naic.edu/~phil/software/intel/318148.pdf )
[3] Workload Tuning Guide for AMD EPYC ™ 7002 Series Processor Based
Servers Application Note,
https://developer.amd.com/wp-content/resources/56745_0.80.pdf
[4] UEFI System Utilities and Shell Command Mobile Help for HPE ProLiant
Gen10, ProLiant Gen10 Plus Servers and HPE Synergy:
Enabling or disabling Processor x2APIC Support
https://techlibrary.hpe.com/docs/iss/proliant-gen10-uefi/s_enable_disable_x2APIC_support.html
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250321-x86_x2apic-v3-1-b0cbaa6fa338@ixit.cz
Current code varies in how the size of the variable size input header
for hypercalls is calculated when the input contains struct hv_vpset.
Surprisingly, this variation is correct, as different hypercalls make
different choices for what portion of struct hv_vpset is treated as part
of the variable size input header. The Hyper-V TLFS is silent on these
details, but the behavior has been confirmed with Hyper-V developers.
To avoid future confusion about these differences, add comments to
struct hv_vpset, and to hypercall call sites with input that contains
a struct hv_vpset. The comments describe the overall situation and
the calculation that should be used at each particular call site.
No functional change as only comments are updated.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250318214919.958953-1-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250318214919.958953-1-mhklinux@outlook.com>
All implementations of chacha_init_arch() just call
chacha_init_generic(), so it is pointless. Just delete it, and replace
chacha_init() with what was previously chacha_init_generic().
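For reference, a sketch of what chacha_init() now does directly (the
constant/key/IV loading that used to be chacha_init_generic()); the helper
and constant names follow include/crypto/chacha.h but are not quoted from
the patch:
        static void chacha_init_sketch(u32 state[16], const u32 key[8], const u8 iv[16])
        {
                /* "expand 32-byte k" constants */
                state[0]  = CHACHA_CONSTANT_EXPA;
                state[1]  = CHACHA_CONSTANT_ND_3;
                state[2]  = CHACHA_CONSTANT_2_BY;
                state[3]  = CHACHA_CONSTANT_TE_K;
                memcpy(&state[4], key, CHACHA_KEY_SIZE);        /* 8 key words */
                /* block counter and nonce, little endian (get_unaligned_le32()) */
                state[12] = get_unaligned_le32(iv +  0);
                state[13] = get_unaligned_le32(iv +  4);
                state[14] = get_unaligned_le32(iv +  8);
                state[15] = get_unaligned_le32(iv + 12);
        }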
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Setting pci_msi_ignore_mask inhibits the toggling of the mask bit for both
MSI and MSI-X entries globally, regardless of the IRQ chip they are using.
Only Xen sets pci_msi_ignore_mask, when routing physical interrupts over
event channels, to prevent PCI code from attempting to toggle the mask bit,
as it's Xen that controls the bit.
However, since pci_msi_ignore_mask is global, it also affects devices that
use MSI interrupts but do not route those interrupts over event channels
(not using the Xen pIRQ chip). One example is devices behind a VMD PCI
bridge. In that scenario the VMD bridge configures MSI(-X) using the
normal IRQ chip (the pIRQ one in the Xen case), and devices behind the
bridge configure the MSI entries using indexes into the VMD bridge MSI
table. The VMD bridge then demultiplexes such interrupts and delivers to
the destination device(s). Having pci_msi_ignore_mask set in that scenario
prevents (un)masking of MSI entries for devices behind the VMD bridge.
Move the signaling of no entry masking into the MSI domain flags, as that
allows setting it on a per-domain basis. Set it for the Xen MSI domain
that uses the pIRQ chip, while leaving it unset for the rest of the
cases.
Remove pci_msi_ignore_mask altogether, since it was only used by Xen code,
and with Xen dropping its usage the variable is unneeded.
This fixes using devices behind a VMD bridge on Xen PV hardware domains.
Although devices behind a VMD bridge are not known to Xen, that doesn't mean
Linux cannot use them. By inhibiting the usage of
VMD_FEAT_CAN_BYPASS_MSI_REMAP and removing the pci_msi_ignore_mask bodge,
devices behind a VMD bridge work fine when used from a Linux Xen hardware
domain. That's the whole point of the series.
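A sketch of the shape of the change (the flag name here is assumed, not
quoted from the patch): the Xen pIRQ MSI domain advertises that entry
masking must be left alone, instead of a global variable doing so for every
domain:
        static struct msi_domain_info xen_pci_msi_domain_info = {
                /*
                 * Xen owns the mask bit for pIRQ-routed MSIs, so tell the
                 * MSI core not to touch it -- but only for this domain.
                 */
                .flags = MSI_FLAG_PCI_MSIX | MSI_FLAG_NO_MASK,
                /* ... ops, chip, etc. ... */
        };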
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Message-ID: <20250219092059.90850-4-roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Clang does not tolerate the use of non-TLS symbols for the per-CPU stack
protector very well, and to work around this limitation, the symbol
passed via the -mstack-protector-guard-symbol= option is never defined
in C code, but only in the linker script, and it is exported from an
assembly file. This is necessary because Clang will fail to generate the
correct %GS based references in a compilation unit that includes a
non-TLS definition of the guard symbol being used to store the stack
cookie.
This problem is only triggered by symbol definitions, not by
declarations, but nonetheless, the declaration in <asm/asm-prototypes.h>
is conditional on __GENKSYMS__ being #define'd, so that only genksyms
will observe it, but for ordinary compilation, it will be invisible.
This is causing problems with the genksyms alternative gendwarfksyms,
which does not #define __GENKSYMS__, does not observe the symbol
declaration, and therefore lacks the information it needs to version it.
Adding the #define creates problems in other places, so that is not a
straight-forward solution. So take the easy way out, and drop the
conditional on __GENKSYMS__, as this is not really needed to begin with.
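The gist of the change in <asm/asm-prototypes.h>, as a sketch:
        /* Before: visible only when genksyms defines __GENKSYMS__ */
        #ifdef __GENKSYMS__
        extern unsigned long __ref_stack_chk_guard;
        #endif

        /* After: unconditional, so gendwarfksyms can version the export too.
         * Clang never sees a C *definition* of the symbol, which is all that
         * matters for correct %gs based code generation. */
        extern unsigned long __ref_stack_chk_guard;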
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250320213238.4451-2-ardb@kernel.org
Add mshv_handler() to process messages related to managing guest
partitions such as intercepts, doorbells, and scheduling messages.
In a (non-nested) root partition, the same interrupt vector is shared
between the vmbus and mshv_root drivers.
Introduce a stub for mshv_handler() and call it in
sysvec_hyperv_callback alongside vmbus_handler().
Even though both handlers will be called for every Hyper-V interrupt,
the messages for each driver are delivered to different offsets
within the SYNIC message page, so they won't step on each other.
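A simplified sketch of the shared-vector dispatch (the exact code in
arch/x86/kernel/cpu/mshyperv.c may differ, e.g. in EOI handling):
        DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
        {
                struct pt_regs *old_regs = set_irq_regs(regs);

                inc_irq_stat(irq_hv_callback_count);
                if (vmbus_handler)
                        vmbus_handler();        /* VMBus messages, at their own SYNIC offset */
                mshv_handler();                 /* root-partition messages: intercepts, doorbells, ... */
                set_irq_regs(old_regs);
        }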
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-9-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-9-git-send-email-nunodasneves@linux.microsoft.com>
hv_get_hypervisor_version(), hv_call_deposit_pages(), and
hv_call_create_vp() are all needed in-module with CONFIG_MSHV_ROOT=m.
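Presumably (a sketch, not quoted from the patch) this boils down to
exporting the symbols so a modular mshv_root driver can link against them:
        EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
        EXPORT_SYMBOL_GPL(hv_call_deposit_pages);
        EXPORT_SYMBOL_GPL(hv_call_create_vp);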
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@microsoft.linux.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-7-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-7-git-send-email-nunodasneves@linux.microsoft.com>
Extend the "ms_hyperv_info" structure to include a new field,
"ext_features", for capturing extended Hyper-V features.
Update the "ms_hyperv_init_platform" function to retrieve these features
using the cpuid instruction and include them in the informational output.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1741980536-3865-3-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-3-git-send-email-nunodasneves@linux.microsoft.com>
Introduce hv_status_printk() macros as a convenience to log hypercall
errors, formatting them with the status code (HV_STATUS_*) as a raw hex
value and also as a string, which saves some time while debugging.
Create a table of HV_STATUS_ codes with strings and mapped errnos, and
use it for hv_result_to_string() and hv_result_to_errno().
Use the new hv_status_printk()s in hv_proc.c, hyperv-iommu.c, and
irqdomain.c hypercalls to aid debugging in the root partition.
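A sketch of the table-driven mapping and one of the logging helpers (names
and table entries are assumed from the description above, not quoted from
the patch):
        static const struct {
                u16             code;           /* HV_STATUS_* value         */
                int             errno;          /* closest Linux errno       */
                const char      *str;           /* symbolic name for logging */
        } hv_status_table[] = {
                { HV_STATUS_INVALID_HYPERCALL_CODE, -EINVAL, "HV_STATUS_INVALID_HYPERCALL_CODE" },
                { HV_STATUS_INVALID_PARAMETER,      -EINVAL, "HV_STATUS_INVALID_PARAMETER" },
                { HV_STATUS_INSUFFICIENT_MEMORY,    -ENOMEM, "HV_STATUS_INSUFFICIENT_MEMORY" },
                /* ... */
        };

        #define hv_status_err(status, fmt, ...) \
                pr_err(fmt ": hypercall status %#x (%s)\n", ##__VA_ARGS__, \
                       hv_result(status), hv_result_to_string(status))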
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-2-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-2-git-send-email-nunodasneves@linux.microsoft.com>
snp_set_vmsa() returns 0 on success, so fix the return value check accordingly.
Cc: stable@vger.kernel.org
Fixes: 44676bb9d5 ("x86/hyperv: Add smp support for SEV-SNP guest")
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250313085217.45483-1-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250313085217.45483-1-ltykernel@gmail.com>
The kernel runs as firmware in VTL mode, and the only way
to restart in VTL mode on x86 is to triple fault. Thus, one
always has to supply "reboot=t" on the kernel command line in
VTL mode, and omitting it leaves rebooting broken.
Define the machine restart callback to always use the triple
fault to provide the robust configuration by default.
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Link: https://lore.kernel.org/r/20250227214728.15672-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250227214728.15672-3-romank@linux.microsoft.com>
By default, x86(-64) systems use the emergency restart routine,
in the course of which the code unconditionally writes to
physical address 0x472 to indicate the boot mode
to the firmware (BIOS or UEFI).
When the kernel itself runs as firmware in VTL mode,
that write corrupts guest memory upon emergency
restart. Preserving the state intact in that situation
is important, at least for debugging.
Define the specialized machine callback to avoid that write
and use the triple fault to perform emergency restart.
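A sketch of the specialized callback (helper names assumed): skip the 0x472
boot-mode write entirely and go straight to a triple fault:
        static void hv_vtl_emergency_restart(void)
        {
                /*
                 * Do not touch physical address 0x472 (the BIOS warm-boot
                 * flag): in VTL mode that page is guest memory and must
                 * stay intact.
                 */
                idt_invalidate();       /* load an empty IDT ...                       */
                asm volatile("int3");   /* ... so the #BP escalates to a triple fault  */
        }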
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Link: https://lore.kernel.org/r/20250227214728.15672-2-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250227214728.15672-2-romank@linux.microsoft.com>
If init_rapl_pmu() fails while allocating memory for "rapl_pmu" objects,
we miss freeing the "rapl_pmus" object in the error path. Fix that.
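Roughly, as a sketch with assumed control flow (not the exact hunk):
        static int init_rapl_pmus(struct rapl_pmus **rapl_pmus_ptr, int nr_rapl_pmu)
        {
                struct rapl_pmus *rapl_pmus;

                rapl_pmus = kzalloc(struct_size(rapl_pmus, rapl_pmu, nr_rapl_pmu), GFP_KERNEL);
                if (!rapl_pmus)
                        return -ENOMEM;

                rapl_pmus->nr_rapl_pmu = nr_rapl_pmu;

                if (init_rapl_pmu(rapl_pmus))
                        goto free_rapl_pmus;    /* was: a bare return, leaking rapl_pmus */

                *rapl_pmus_ptr = rapl_pmus;
                return 0;

        free_rapl_pmus:
                kfree(rapl_pmus);
                return -ENOMEM;
        }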
Fixes: 9b99d65c0b ("perf/x86/rapl: Move the pmu allocation out of CPU hotplug")
Signed-off-by: Dhananjay Ugwekar <dhananjay.ugwekar@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250320100617.4480-1-dhananjay.ugwekar@amd.com
- Add common secure TSC infrastructure for use within SNP and in the
future TDX
- Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make
sense to use the capability if the relevant registers are not
available for reading or writing.
The immediate issue being fixed here is an nVMX bug where KVM fails to
detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI).
However, checking for a pending interrupt accesses the legacy PIC, and
x86's kvm_arch_destroy_vm() currently frees the PIC before destroying
vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results
in a NULL pointer deref; fixing that teardown ordering is a prerequisite
for the nVMX fix.
The remaining patches attempt to bring a bit of sanity to x86's VM
teardown code, which has accumulated a lot of cruft over the years. E.g.
KVM currently unloads each vCPU's MMUs in a separate operation from
destroying vCPUs, all because when guest SMP support was added, KVM had a
kludgy MMU teardown flow that broke when a VM had more than one vCPU.
And that oddity lived on, for 18 years...
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Track whether pages were unmapped from any MM (even ones with a currently
empty mm_cpumask) by the reclaim code, to figure out whether or not
broadcast TLB flush should be done when reclaim finishes.
The reason any MM must be tracked, and not only ones contributing to the
tlbbatch cpumask, is that broadcast ASIDs are expected to be kept up to
date even on CPUs where the MM is not currently active.
This change allows reclaim to avoid doing TLB flushes when only clean page
cache pages and/or slab memory were reclaimed, which is fairly common.
( This is a simpler alternative to the code that was in my INVLPGB series
before, and it seems to capture most of the benefit due to how common
it is to reclaim only page cache. )
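A sketch of the idea with assumed field and helper names (not the exact
patch):
        struct arch_tlbflush_unmap_batch {
                struct cpumask  cpumask;
                bool            unmapped_pages;         /* any mm had PTEs cleared */
        };

        static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
                                                     struct mm_struct *mm, unsigned long uaddr)
        {
                cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
                batch->unmapped_pages = true;   /* broadcast ASIDs must be invalidated too */
        }
        /*
         * ...and at the end of reclaim the flush can be skipped entirely when
         * batch->unmapped_pages is still false, i.e. nothing was unmapped.
         */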
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250319132520.6b10ad90@fangorn
Introduce a name for an old Pentium 4 model and replace the x86_model
checks with VFM ones. This gets rid of one of the last remaining
Intel-specific x86_model checks.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250318223828.2945651-3-sohil.mehta@intel.com
Architectural Perfmon was introduced on the Family 6 "Core" processors
starting with Yonah. Processors before Yonah need their own customized
PMU initialization.
p6_pmu_init() is expected to provide that initialization for early
Family 6 processors. But, currently, it could get called for any Family
6 processor if the architectural perfmon feature is disabled on that
processor. To simplify, restrict the P6 PMU initialization to early
Family 6 processors that do not have architectural perfmon support and
truly need the special handling.
As a result, the "unsupported" console print becomes practically
unreachable because all the released P6 processors are covered by the
switch cases. Move the console print to a common location where it can
cover all modern processors (including Family >15) that may not have
architectural perfmon support enumerated.
Also, use this opportunity to get rid of the unnecessary switch cases in
P6 initialization. Only the Pentium Pro processor needs a quirk, and the
rest of the processors do not need any special handling. The gaps in the
case numbers are only due to no processor with those model numbers being
released.
Use decimal numbers to represent Intel Family numbers. Also, convert one
of the last few Intel x86_model comparisons to a VFM-based one.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250318223828.2945651-2-sohil.mehta@intel.com
Intel made a late change to the AVX10 specification that removes support
for a 256-bit maximum vector length and enumeration of the maximum
vector length. AVX10 will imply a maximum vector length of 512 bits.
I.e. there won't be any such thing as AVX10/256 or AVX10/512; there will
just be AVX10, and it will essentially just consolidate AVX512 features.
As a result of this new development, my strategy of providing both
*_avx10_256 and *_avx10_512 functions didn't turn out to be that useful.
The only remaining motivation for the 256-bit AVX512 / AVX10 functions
is to avoid downclocking on older Intel CPUs. But I already wrote
*_avx2 code too (primarily to support CPUs without AVX512), which
performs almost as well as *_avx10_256. So we should just use that.
Therefore, remove the *_avx10_256 CRC functions, and rename the
*_avx10_512 CRC functions to *_avx512. Make Ice Lake and Tiger Lake use
the *_avx2 functions instead of *_avx10_256 which they previously used.
Link: https://lore.kernel.org/r/20250319181316.91271-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
We ripped out PV and virtualization related bits from rqspinlock in an
earlier commit; however, a fair lock performs poorly within a virtual
machine when the lock holder is preempted. As such, retain the
virt_spin_lock fallback to a test-and-set lock, but with timeout and
deadlock detection. We can do this by simply depending on the
resilient_tas_spin_lock implementation from the previous patch.
We don't integrate support for CONFIG_PARAVIRT_SPINLOCKS yet, as that
requires more involved algorithmic changes and introduces more
complexity. It can be done when the need arises in the future.
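A sketch of the fallback wiring, using the names from the description; the
actual glue in the rqspinlock code may differ:
        static __always_inline int resilient_virt_spin_lock(rqspinlock_t *lock)
        {
                /*
                 * In a VM, fall back to the resilient test-and-set lock:
                 * unfair, but it keeps the timeout and deadlock detection
                 * guarantees and avoids the pathological behaviour of a
                 * fair queued lock when the lock holder's vCPU is preempted.
                 */
                return resilient_tas_spin_lock(lock);
        }
        /*
         * Called early in the slowpath, e.g.:
         *      if (static_branch_likely(&virt_spin_lock_key))
         *              return resilient_virt_spin_lock(lock);
         */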
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-15-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge tag 'kvm-x86-xen-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM Xen changes for 6.15
- Don't write to the Xen hypercall page on MSR writes that are initiated by
the host (userspace or KVM) to fix a class of bugs where KVM can write to
guest memory at unexpected times, e.g. during vCPU creation if userspace has
set the Xen hypercall MSR index to collide with an MSR that KVM emulates.
- Restrict the Xen hypercall MSR index to the unofficial synthetic range to
reduce the set of possible collisions with MSRs that are emulated by KVM
(collisions can still happen as KVM emulates Hyper-V MSRs, which also reside
in the synthetic range).
- Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config.
- Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID
entries when updating PV clocks, as there is no guarantee PV clocks will be
updated between TSC frequency changes and CPUID emulation, and guest reads
of Xen TSC should be rare, i.e. are not a hot path.
Merge tag 'kvm-x86-pvclock-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM PV clock changes for 6.15:
- Don't take kvm->lock when iterating over vCPUs in the suspend notifier to
fix a largely theoretical deadlock.
- Use the vCPU's actual Xen PV clock information when starting the Xen timer,
as the cached state in arch.hv_clock can be stale/bogus.
- Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
PV clocks.
- Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only
accounts for kvmclock, and there's no evidence that the flag is actually
supported by Xen guests.
- Clean up the per-vCPU "cache" of its reference pvclock, and instead only
  track the vCPU's TSC scaling (multiplier+shift) metadata (which is moderately
expensive to compute, and rarely changes for modern setups).
Merge tag 'kvm-x86-svm-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.15
- Ensure the PSP driver is initialized when both the PSP and KVM modules are
built-in (the initcall framework doesn't handle dependencies).
- Use long-term pins when registering encrypted memory regions, so that the
pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to
excessive fragmentation.
- Add macros and helpers for setting GHCB return/error codes.
- Add support for Idle HLT interception, which elides interception if the vCPU
has a pending, unmasked virtual IRQ when HLT is executed.
- Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical
address.
- Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g.
because the vCPU was "destroyed" via SNP's AP Creation hypercall.
- Reject SNP AP Creation if the requested SEV features for the vCPU don't
match the VM's configured set of features.
- Misc cleanups
Merge tag 'kvm-x86-vmx-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.15
- Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus
modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1.
- Pass XFD_ERR as a pseudo-payload when injecting #NM as a preparatory step
for upcoming FRED virtualization support.
- Decouple the EPT entry RWX protection bit macros from the EPT Violation bits
as a general cleanup, and in anticipation of adding support for emulating
Mode-Based Execution (MBEC).
- Reject KVM_RUN if userspace manages to gain control and stuff invalid guest
state while KVM is in the middle of emulating nested VM-Enter.
- Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs
in anticipation of adding sanity checks for secondary exit controls (the
primary field is out of bits).
Merge tag 'kvm-x86-misc-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.15:
- Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT.
- Add a helper to consolidate handling of mp_state transitions, and use it to
clear pv_unhalted whenever a vCPU is made RUNNABLE.
- Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
coalesce updates when multiple pieces of vCPU state are changing, e.g. as
part of a nested transition.
- Fix a variety of nested emulation bugs, and add VMX support for synthesizing
nested VM-Exit on interception (instead of injecting #UD into L2).
- Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS,
as KVM can't get the current CPL.
- Misc cleanups
Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where
"fast" means "without holding mmu_lock". Not taking mmu_lock allows multiple
aging actions to run in parallel, and more importantly avoids stalling vCPUs,
e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting
in memory.
For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and
KVM doesn't need to access any metadata to age SPTEs.
For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a
list of rmaps) to implement a per-rmap single-bit spinlock. When aging a gfn,
acquire the rmap's spinlock with read-only permissions, which allows hardening
and optimizing the locking and aging, e.g. locking an rmap for write requires
mmu_lock to also be held. The lock is NOT a true R/W spinlock, i.e. multiple
concurrent readers aren't supported.
To avoid forcing all SPTE updates to use atomic operations (clearing the
Accessed bit out of mmu_lock makes it inherently volatile), rework and rename
spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude
the Accessed bit. KVM (and mm/) already tolerates false positives/negatives
for Accessed information, and all testing has shown that reducing the latency
of aging is far more beneficial to overall system performance than providing
"perfect" young/old information.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfZizUACgkQOlYIJqCj
N/1XkBAAwLxK3uKwkIIgzu+V6NPMiqPBsNtGiiQgRYMfvCwMW2vU6ztsBgNgs6zI
eMnOCCo6fQaxPFvpKue8VN7TD33BcjzKZaPiuHZrzQIa/oYQeOlZ4oaN8lr9F9Ec
5l1Lg/p2+z1GGDhWc2opNpg48sCtX7IQ0Tx46LkoB3VSDFP33GwW4+Ht2r71rNeL
ofKB+T0hU5HOry5j0w0nTVwOEoNzlm1sVFqOHzgnK18Lmqw2CfOPm+46K+w8nOh+
v+rwGuGa//1kcCjNCcGP1OuJdNAMgXBxND/l6LAkWcHfffIRbXlO07O05dAGqPeF
rRn5JUl02OkI6lq99+935OmtEROe6mt+Bx0dhAzk4Z0CD6JY34ShZSAADSnltQlK
2a1E95t63v8a7ZM5dwED7os2HBhxODoyeWQAlIHpkVdmeTJstkyvjPhubJc13+Js
oDL6ehs3hhZ171ePn2aXo0NobA5fe7xbl4wugP3hNmBXjLvu04D+llcDmC095nBk
ICuzFqFXCXzdjEwgWwPzTseWOCoWTkoRqeJ9lch4UD3mMMcmK0MbK6joocGvCFto
cL/eZdElnf1MZwWYdo44X+NEc1jItZVvktkRrllpwCtpRSDINO6RYZGcRf/g0Lha
XmaU7jICfi3AKc4N3S2l4KIkd/AeJQySM+kGArxIOYoaqFCe2Mc=
=Iy57
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.15' of https://github.com/kvm-x86/linux into HEAD
KVM x86/mmu changes for 6.15
Add support for "fast" aging of SPTEs in both the TDP MMU and Shadow MMU, where
"fast" means "without holding mmu_lock". Not taking mmu_lock allows multiple
aging actions to run in parallel, and more importantly avoids stalling vCPUs,
e.g. due to holding mmu_lock for an extended duration while a vCPU is faulting
in memory.
For the TDP MMU, protect aging via RCU; the page tables are RCU-protected and
KVM doesn't need to access any metadata to age SPTEs.
For the Shadow MMU, use bit 1 of rmap pointers (bit 0 is used to terminate a
list of rmaps) to implement a per-rmap single-bit spinlock. When aging a gfn,
acquire the rmap's spinlock with read-only permissions, which allows hardening
and optimizing the locking and aging, e.g. locking an rmap for write requires
mmu_lock to also be held. The lock is NOT a true R/W spinlock, i.e. multiple
concurrent readers aren't supported.
To avoid forcing all SPTE updates to use atomic operations (clearing the
Accessed bit out of mmu_lock makes it inherently volatile), rework and rename
spte_has_volatile_bits() to spte_needs_atomic_update() and deliberately exclude
the Accessed bit. KVM (and mm/) already tolerates false positives/negatives
for Accessed information, and all testing has shown that reducing the latency
of aging is far more beneficial to overall system performance than providing
"perfect" young/old information.
While the GCC and Clang compilers already define __ASSEMBLER__
automatically when compiling assembly code, __ASSEMBLY__ is a
macro that only gets defined by the Makefiles in the kernel.
This can be very confusing when switching between userspace
and kernelspace coding, or when dealing with UAPI headers that
rather should use __ASSEMBLER__ instead. So let's standardize on
the __ASSEMBLER__ macro that is provided by the compilers now.
This is mostly a mechanical patch (done with a simple "sed -i"
statement), with some manual tweaks in <asm/frame.h>, <asm/hw_irq.h>
and <asm/setup.h> that mentioned this macro in comments with some
missing underscores.
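A typical before/after in a header shared between C and assembly (the
declaration is just a placeholder):
        /* Before: __ASSEMBLY__ is only defined by the kernel's Makefiles */
        #ifndef __ASSEMBLY__
        extern void some_c_only_declaration(void);
        #endif

        /* After: __ASSEMBLER__ is predefined by GCC and Clang for assembly input */
        #ifndef __ASSEMBLER__
        extern void some_c_only_declaration(void);
        #endif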
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250314071013.1575167-38-thuth@redhat.com
__ASSEMBLY__ is only defined by the Makefile of the kernel, so
this is not really useful for UAPI headers (unless the userspace
Makefile defines it, too). Let's switch to __ASSEMBLER__ which
gets set automatically by the compiler when compiling assembly
code.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Brian Gerst <brgerst@gmail.com>
Link: https://lore.kernel.org/r/20250310104256.123527-1-thuth@redhat.com
According to:
https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html
the usage of asm pseudo directives in an asm template can mislead the
compiler into wrongly estimating the size of the generated code.
The LOCK_PREFIX macro expands to several asm pseudo directives, so
its usage in atomic locking insns causes instruction length estimates
to fail significantly (the specially instrumented compiler reports
the estimated length of these asm templates to be 6 instructions long).
This incorrect estimate further causes suboptimal inlining decisions,
suboptimal instruction scheduling and suboptimal code block alignments
for functions that use these locking primitives.
Use asm_inline instead of plain asm:
https://gcc.gnu.org/pipermail/gcc-patches/2018-December/512349.html
This feature makes GCC pretend some inline assembler code is tiny
(while it would otherwise think it is huge).
For code size estimation, the size of the asm is then taken as
the minimum size of one instruction, ignoring how many instructions
compiler thinks it is.
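For illustration, the change applied to an atomic primitive (a
representative sketch, not a specific hunk from the patch):
        static __always_inline void arch_atomic_inc(atomic_t *v)
        {
                /*
                 * asm_inline: tell the compiler this is one instruction long
                 * for inlining/scheduling purposes, despite the LOCK_PREFIX
                 * expansion into several pseudo directives.
                 */
                asm_inline volatile(LOCK_PREFIX "incl %0"
                                    : "+m" (v->counter) :: "memory");
        }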
bloat-o-meter reports the following code size increase
(x86_64 defconfig, gcc-14.2.1):
add/remove: 82/283 grow/shrink: 870/372 up/down: 76272/-43618 (32654)
Total: Before=22770320, After=22802974, chg +0.14%
with top grows (>500 bytes):
Function old new delta
----------------------------------------------------------------
copy_process 6465 10191 +3726
balance_dirty_pages_ratelimited_flags 237 2949 +2712
icl_plane_update_noarm 5800 7969 +2169
samsung_input_mapping 3375 5170 +1795
ext4_do_update_inode.isra - 1526 +1526
__schedule 2416 3472 +1056
__i915_vma_resource_unhold - 946 +946
sched_mm_cid_after_execve 175 1097 +922
__do_sys_membarrier - 862 +862
filemap_fault 2666 3462 +796
nl80211_send_wiphy 11185 11874 +689
samsung_input_mapping.cold 900 1500 +600
virtio_gpu_queue_fenced_ctrl_buffer 839 1410 +571
ilk_update_pipe_csc 1201 1735 +534
enable_step - 525 +525
icl_color_commit_noarm 1334 1847 +513
tg3_read_bc_ver - 501 +501
and top shrinks (>500 bytes):
Function old new delta
----------------------------------------------------------------
nl80211_send_iftype_data 580 - -580
samsung_gamepad_input_mapping.isra.cold 604 - -604
virtio_gpu_queue_ctrl_sgs 724 - -724
tg3_get_invariants 9218 8376 -842
__i915_vma_resource_unhold.part 899 - -899
ext4_mark_iloc_dirty 1735 106 -1629
samsung_gamepad_input_mapping.isra 2046 - -2046
icl_program_input_csc 2203 - -2203
copy_mm 2242 - -2242
balance_dirty_pages 2657 - -2657
These code size changes can be grouped into 4 groups:
a) some functions now include once-called functions in full or
in part. These are:
Function old new delta
----------------------------------------------------------------
copy_process 6465 10191 +3726
balance_dirty_pages_ratelimited_flags 237 2949 +2712
icl_plane_update_noarm 5800 7969 +2169
samsung_input_mapping 3375 5170 +1795
ext4_do_update_inode.isra - 1526 +1526
that now include:
Function old new delta
----------------------------------------------------------------
copy_mm 2242 - -2242
balance_dirty_pages 2657 - -2657
icl_program_input_csc 2203 - -2203
samsung_gamepad_input_mapping.isra 2046 - -2046
ext4_mark_iloc_dirty 1735 106 -1629
b) ISRA [interprocedural scalar replacement of aggregates,
interprocedural pass that removes unused function return values
(turning functions returning a value which is never used into void
functions) and removes unused function parameters. It can also
replace an aggregate parameter by a set of other parameters
representing part of the original, turning those passed by reference
into new ones which pass the value directly.]
Top grows and shrinks of this group are listed below:
Function old new delta
----------------------------------------------------------------
ext4_do_update_inode.isra - 1526 +1526
nfs4_begin_drain_session.isra - 249 +249
nfs4_end_drain_session.isra - 168 +168
__guc_action_register_multi_lrc_v70.isra 335 500 +165
__i915_gem_free_objects.isra - 144 +144
...
membarrier_register_private_expedited.isra 108 - -108
syncobj_eventfd_entry_func.isra 445 314 -131
__ext4_sb_bread_gfp.isra 140 - -140
class_preempt_notrace_destructor.isra 145 - -145
p9_fid_put.isra 151 - -151
__mm_cid_try_get.isra 238 - -238
membarrier_global_expedited.isra 294 - -294
mm_cid_get.isra 295 - -295
samsung_gamepad_input_mapping.isra.cold 604 - -604
samsung_gamepad_input_mapping.isra 2046 - -2046
c) different split points of hot/cold split that just move code around:
Top grows and shrinks of this group are listed below:
Function old new delta
----------------------------------------------------------------
samsung_input_mapping.cold 900 1500 +600
__i915_request_reset.cold 311 389 +78
nfs_update_inode.cold 77 153 +76
__do_sys_swapon.cold 404 455 +51
copy_process.cold - 45 +45
tg3_get_invariants.cold 73 115 +42
...
hibernate.cold 671 643 -28
copy_mm.cold 31 - -31
software_resume.cold 249 207 -42
io_poll_wake.cold 106 54 -52
samsung_gamepad_input_mapping.isra.cold 604 - -604
d) full inlining of small functions with a locking insn (~150 cases).
These bring in most of the code size increase because the removed
function code is now inlined in multiple places. E.g.:
0000000000a50e10 <release_devnum>:
a50e10: 48 63 07 movslq (%rdi),%rax
a50e13: 85 c0 test %eax,%eax
a50e15: 7e 10 jle a50e27 <release_devnum+0x17>
a50e17: 48 8b 4f 50 mov 0x50(%rdi),%rcx
a50e1b: f0 48 0f b3 41 50 lock btr %rax,0x50(%rcx)
a50e21: c7 07 ff ff ff ff movl $0xffffffff,(%rdi)
a50e27: e9 00 00 00 00 jmp a50e2c <release_devnum+0x1c>
a50e28: R_X86_64_PLT32 __x86_return_thunk-0x4
a50e2c: 0f 1f 40 00 nopl 0x0(%rax)
is now fully inlined into the caller function. This is desirable due
to the per function overhead of CPU bug mitigations like retpolines.
FTR a) with -Os (where generated code size really matters) the x86_64
defconfig object file decreases by 24388 bytes (~24.4 kB), representing a
0.1% code size decrease:
text data bss dec hex filename
23883860 4617284 814212 29315356 1bf511c vmlinux-old.o
23859472 4615404 814212 29289088 1beea80 vmlinux-new.o
FTR b) clang recognizes "asm inline", but there was no difference in
code sizes:
text data bss dec hex filename
27577163 4503078 807732 32887973 1f5d4a5 vmlinux-clang-patched.o
27577181 4503078 807732 32887991 1f5d4b7 vmlinux-clang-unpatched.o
The performance impact of the patch was assessed by recompiling
fedora-41 6.13.5 kernel and running lmbench with old and new kernel.
The most noticeable improvements were:
Process fork+exit: 270.0952 microseconds
Process fork+execve: 2620.3333 microseconds
Process fork+/bin/sh -c: 6781.0000 microseconds
File /usr/tmp/XXX write bandwidth: 1780350 KB/sec
Pagefaults on /usr/tmp/XXX: 0.3875 microseconds
to:
Process fork+exit: 298.6842 microseconds
Process fork+execve: 1662.7500 microseconds
Process fork+/bin/sh -c: 2127.6667 microseconds
File /usr/tmp/XXX write bandwidth: 1950077 KB/sec
Pagefaults on /usr/tmp/XXX: 0.1958 microseconds
and from:
Socket bandwidth using localhost
0.000001 2.52 MB/sec
0.000064 163.02 MB/sec
0.000128 321.70 MB/sec
0.000256 630.06 MB/sec
0.000512 1207.07 MB/sec
0.001024 2004.06 MB/sec
0.001437 2475.43 MB/sec
10.000000 5817.34 MB/sec
Avg xfer: 3.2KB, 41.8KB in 1.2230 millisecs, 34.15 MB/sec
AF_UNIX sock stream bandwidth: 9850.01 MB/sec
Pipe bandwidth: 4631.28 MB/sec
to:
Socket bandwidth using localhost
0.000001 3.13 MB/sec
0.000064 187.08 MB/sec
0.000128 324.12 MB/sec
0.000256 618.51 MB/sec
0.000512 1137.13 MB/sec
0.001024 1962.95 MB/sec
0.001437 2458.27 MB/sec
10.000000 6168.08 MB/sec
Avg xfer: 3.2KB, 41.8KB in 1.0060 millisecs, 41.52 MB/sec
AF_UNIX sock stream bandwidth: 9921.68 MB/sec
Pipe bandwidth: 4649.96 MB/sec
[ mingo: Prettified the changelog a bit. ]
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20250309170955.48919-1-ubizjak@gmail.com
Use asm_inline() to instruct the compiler that the size of asm()
is the minimum size of one instruction, ignoring how many instructions
the compiler thinks it is. The ALTERNATIVE macro, which expands to
several pseudo directives, causes the instruction length estimate to
count more than 20 instructions.
bloat-o-meter reports slight increase of the code size
for x86_64 defconfig object file, compiled with gcc-14.2:
add/remove: 0/2 grow/shrink: 3/0 up/down: 190/-59 (131)
Function old new delta
__copy_user_flushcache 166 247 +81
__memcpy_flushcache 369 437 +68
arch_wb_cache_pmem 6 47 +41
__pfx_clean_cache_range 16 - -16
clean_cache_range 43 - -43
Total: Before=22807167, After=22807298, chg +0.00%
The compiler now inlines and removes the clean_cache_range() function.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250313102715.333142-2-ubizjak@gmail.com
Current minimum required version of binutils is 2.25,
which supports CLFLUSHOPT and CLWB instruction mnemonics.
Replace the byte-wise specification of CLFLUSHOPT and
CLWB with these proper mnemonics.
No functional change intended.
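Illustrative before/after (not the exact hunk):
        /* before: byte-encoded so that pre-2.25 binutils could assemble it
         * (a 0x66 prefix turns CLFLUSH into CLFLUSHOPT) */
        asm volatile(".byte 0x66; clflush %0" : "+m" (*(volatile char *)p));

        /* after: binutils >= 2.25 understands the mnemonic directly */
        asm volatile("clflushopt %0" : "+m" (*(volatile char *)p));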
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250313102715.333142-1-ubizjak@gmail.com
Use asm_inline() to instruct the compiler that the size of asm()
is the minimum size of one instruction, ignoring how many instructions
the compiler thinks it is. The ALTERNATIVE macro, which expands to
several pseudo directives, causes the instruction length estimate to
count more than 20 instructions.
bloat-o-meter reports slight reduction of the code size
for x86_64 defconfig object file, compiled with gcc-14.2:
add/remove: 6/12 grow/shrink: 59/50 up/down: 3389/-3560 (-171)
Total: Before=22734393, After=22734222, chg -0.00%
where 29 instances of code blocks involving POPCNT now gets inlined,
resulting in the removal of several functions:
format_is_yuv_semiplanar.part.isra 41 - -41
cdclk_divider 69 - -69
intel_joiner_adjust_timings 140 - -140
nl80211_send_wowlan_tcp_caps 369 - -369
nl80211_send_iftype_data 579 - -579
__do_sys_pidfd_send_signal 809 - -809
One noticeable change is:
pcpu_page_first_chunk 1075 1060 -15
where the compiler now inlines 4 more instances of POPCNT insns,
but still manages to compile to a function with smaller code size.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250312123905.149298-3-ubizjak@gmail.com
Use ASM_CALL_CONSTRAINT to prevent inline asm() that includes call
instruction from being scheduled before the frame pointer gets set
up by the containing function. This unconstrained scheduling might
cause objtool to print a "call without frame pointer save/setup"
warning. Current versions of compilers don't seem to trigger this
condition, but without this constraint there's nothing to prevent
the compiler from scheduling the insn in front of frame creation.
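Example usage (a sketch; the called helper is a placeholder): the extra
output operand makes the asm() depend on the stack pointer, so it cannot be
hoisted above frame-pointer setup:
        long ret, arg = 0;

        asm volatile("call placeholder_helper"
                     : "=a" (ret), ASM_CALL_CONSTRAINT  /* "+r" (current_stack_pointer) */
                     : "D" (arg)
                     : "memory");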
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250312123905.149298-2-ubizjak@gmail.com
The __ref_stack_chk_guard symbol doesn't exist on UP:
<stdin>:4:15: error: ‘__ref_stack_chk_guard’ undeclared here (not in a function)
Fix the #ifdef around the entry.S export.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250123190747.745588-8-brgerst@gmail.com
Clang versions before 17 will not honour -fdirect-access-external-data
for the load of the stack cookie emitted into each function's prologue
and epilogue, and will emit a GOT based reference instead, e.g.,
4c 8b 2d 00 00 00 00 mov 0x0(%rip),%r13
18a: R_X86_64_REX_GOTPCRELX __ref_stack_chk_guard-0x4
65 49 8b 45 00 mov %gs:0x0(%r13),%rax
This is inefficient, but at least, the linker will usually follow the
rules of the x86 psABI, and relax the GOT load into a RIP-relative LEA
instruction. This is still suboptimal, as the per-CPU load could use a
RIP-relative reference directly, but at least it gets rid of the first
load from memory.
However, Boris reports that in some cases, when using distro builds of
Clang/LLD 15, the first load gets relaxed into
49 c7 c6 20 c0 55 86 mov $0xffffffff8655c020,%r14
ffffffff8373bf0f: R_X86_64_32S __ref_stack_chk_guard
65 49 8b 06 mov %gs:(%r14),%rax
instead, which is fine in principle, as MOV may be cheaper than LEA on
some micro-architectures. However, such absolute references assume that
the variable in question can be accessed via the kernel virtual mapping,
and this is not guaranteed for the startup code residing in .head.text.
This is therefore a true positive, that was caught using the recently
introduced relocs check for absolute references in the startup code:
Absolute reference to symbol '__ref_stack_chk_guard' not permitted in .head.text
Work around the issue by disabling the stack protector in the startup
code for Clang versions older than 17.
Fixes: 80d47defdd ("x86/stackprotector/64: Convert to normal per-CPU variable")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250312102740.602870-2-ardb+git@google.com
Merge common x86_32 and x86_64 code in crash_setup_regs()
using macros from <asm/asm.h>.
The compiled object files before and after the patch are unchanged.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250306145227.55819-1-ubizjak@gmail.com
Add an assembly macro to refer to a runtime constant. It hides the linker
magic and makes the assembly more readable.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304153342.2016569-1-kirill.shutemov@linux.intel.com
X86_FEATURE_CONSTANT_TSC is a Linux-defined, synthesized feature flag.
It is used across several vendors. Intel CPUs will set the feature when
the architectural CPUID.80000007.EDX[1] bit is set. There are also some
Intel CPUs that have the X86_FEATURE_CONSTANT_TSC behavior but don't
enumerate it with the architectural bit. Those currently have a model
range check.
Today, virtually all of the CPUs that have the CPUID bit *also* match
the "model >= 0x0e" check. This is confusing. Instead of an open-ended
check, pick some models (INTEL_IVYBRIDGE and P4_WILLAMETTE) as the end
of goofy CPUs that should enumerate the bit but don't. These models are
relatively arbitrary but conservative pick for this.
This makes it obvious that later CPUs (like Family 18+) no longer need
to synthesize X86_FEATURE_CONSTANT_TSC.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250219184133.816753-14-sohil.mehta@intel.com
Introduce markers and names for some Family 6 and Family 15 models and
replace x86_model checks with VFM ones.
Since the VFM checks are closed ended and only applicable to Intel, get
rid of the explicit Intel vendor check as well.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250219184133.816753-13-sohil.mehta@intel.com
X86_FEATURE_REP_GOOD is a Linux-defined feature flag to track whether
fast string operations should be used for copy_page(). It is also used
as a second alternative for clear_page() if enhanced fast string
operations (ERMS) are not available.
X86_FEATURE_ERMS is an Intel-specific hardware-defined feature flag that
tracks hardware support for Enhanced Fast strings. It is used to track
whether Fast strings should be used for similar memory copy and memory
clearing operations.
On top of these, there is a FAST_STRING enable bit in the
IA32_MISC_ENABLE MSR. It is typically controlled by the BIOS to provide
a hint to the hardware and the OS on whether fast string operations are
preferred.
Commit:
161ec53c70 ("x86, mem, intel: Initialize Enhanced REP MOVSB/STOSB")
introduced a mechanism to honor the BIOS preference for fast string
operations and clear the above feature flags if needed.
Unfortunately, the current initialization code for Intel to set and
clear these bits is confusing at best and likely incorrect.
X86_FEATURE_REP_GOOD is cleared in early_init_intel() if
MISC_ENABLE.FAST_STRING is 0. But it gets set later on unconditionally
for all Family 6 processors in init_intel(). This not only overrides the
BIOS preference but also contradicts the earlier check.
Fix this by combining the related checks and always relying on the BIOS
provided preference for fast string operations. This simplification
makes sure the upcoming Intel Family 18 and 19 models are covered as
well.
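A sketch of the consolidated check, using the existing MSR bit definitions
('c' is the struct cpuinfo_x86 being initialized; the exact placement in
early_init_intel() may differ):
        u64 misc_enable;

        rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);

        if (misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING) {
                set_cpu_cap(c, X86_FEATURE_REP_GOOD);   /* honor the BIOS preference */
        } else {
                pr_info("Disabled fast string operations\n");
                setup_clear_cpu_cap(X86_FEATURE_REP_GOOD);
                setup_clear_cpu_cap(X86_FEATURE_ERMS);
        }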
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250219184133.816753-12-sohil.mehta@intel.com
Some old crusty CPUs need an extra delay that slows down booting. See
the comment above 'init_udelay' for details. Newer CPUs don't need the
delay.
Right now, for Intel, Family 6 and only Family 6 skips the delay. That
leaves out both the Family 15 (Pentium 4s) and brand new Family 18/19
models.
The omission of Family 15 (Pentium 4s) seems like an oversight and 18/19
do not need the delay.
Skip the delay on all Intel processors Family 6 and beyond.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250219184133.816753-11-sohil.mehta@intel.com
Very old multiprocessor systems required a 10 msec delay between
asserting and de-asserting INIT but modern processors do not require
this delay.
Over time the usage of the "quirk" wording while setting the INIT delay
has become misleading. The code comments suggest that modern processors
need to be quirked, which clears the default init_udelay of 10 msec,
while legacy processors don't need the quirk and continue to use the
default init_udelay.
With a lot more modern processors, the wording should be inverted if at
all needed. Instead, simplify the comments and the code by getting rid
of "quirk" usage altogether and clarifying the following:
- Old legacy processors -> Set the "legacy" 10 msec delay
- Modern processors -> Do not set any delay
No functional change.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250219184133.816753-10-sohil.mehta@intel.com
Update the Intel Family checks to consistently use Family 15 instead of
Family 0xF. Also, get rid of one of last usages of x86_model by using
the new VFM checks.
Update the incorrect comment since the check has changed since the
initial commit:
ee1ca48fae ("ACPI: Disable ARB_DISABLE on platforms where it is not needed")
The two changes were:
- 3e2ada5867 ("ACPI: fix Compaq Evo N800c (Pentium 4m) boot hang regression")
removed the P4 - Family 15.
- 03a05ed115 ("ACPI: Use the ARB_DISABLE for the CPU which model id is less than 0x0f.")
got rid of CORE_YONAH - Family 6, model E.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-9-sohil.mehta@intel.com
Introduce names for some Family 5 models and convert some of the checks
to be VFM based.
Also, to keep the file sorted by family, move Family 5 to the top of the
header file.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-8-sohil.mehta@intel.com
Introduce names for some old Pentium 4 models and replace the x86_model
checks with VFM ones.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-7-sohil.mehta@intel.com
Introduce names for some old Pentium models and replace the x86_model
checks with VFM ones.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-6-sohil.mehta@intel.com
Simplify one of the last few Intel x86_model checks in arch/x86 by
substituting it with a VFM one.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-5-sohil.mehta@intel.com
The Family model check to read the processor flag MSR is misleading and
potentially incorrect. It doesn't consider Family while comparing the
model number. The original check did have a Family number but it got
lost/moved during refactoring.
intel_collect_cpu_info() is called through multiple paths such as early
initialization, CPU hotplug as well as IFS image load. Some of these
flows would be error prone due to the ambiguous check.
Correct the processor flag scan check to use a Family number and update
it to a VFM based one to make it more readable.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-4-sohil.mehta@intel.com
The alignment preference for 32-bit MOVSL based bulk memory move has
been 8-byte for a long time. However this preference is only set for
Family 6 and 15 processors.
Use the same preference for upcoming Family numbers 18 and 19. Also, use
a simpler VFM based check instead of switching based on Family numbers.
Refresh the comment to reflect the new check.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250219184133.816753-3-sohil.mehta@intel.com
APIC detection is currently limited to a few specific Families and will
not match the upcoming Families >=18.
Extend the check to include all Families 6 or greater. Also convert it
to a VFM check to make it simpler.
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-2-sohil.mehta@intel.com
Use u32 instead of uint32_t in hypervisor_cpuid_base().
Yes, uint32_t is used in Xen code et al, but this is a core x86
architecture header and we should standardize on the type that
is being used overwhelmingly in related x86 architecture code.
The two types are the same so there should be no build warnings.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: "Ahmed S. Darwish" <darwi@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250317221824.3738853-6-mingo@kernel.org
Convert all uses of 'unsigned int' to 'u32' in <asm/cpuid/api.h>.
This is how a lot of the call sites are doing it, and the two
types are equivalent in the C sense - but 'u32' better expresses
that these are expressions of an immutable hardware ABI.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: "Ahmed S. Darwish" <darwi@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250317221824.3738853-5-mingo@kernel.org
- Include <asm/cpuid/types.h> first, as is customary. This also has
the side effect of build-testing the header dependency assumptions
in the types header.
- No newline necessary after the SPDX line
- Newline necessary after inline function definitions
- Rename native_cpuid_reg() to NATIVE_CPUID_REG(): it's a CPP macro,
whose name we capitalize in such cases.
- Prettify the CONFIG_PARAVIRT_XXL inclusion block a bit
- Standardize register references in comments to EAX/EBX/ECX/etc.,
from the hodgepodge of references.
- s/cpus/CPUs because why add noise to common acronyms?
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: "Ahmed S. Darwish" <darwi@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250317221824.3738853-4-mingo@kernel.org
- We have 0x0d, 0x9 and 0x1d as literals for the CPUID_LEAF definitions,
pick a single, consistent style of 0xZZ literals.
- Likewise, harmonize the style of the 'struct cpuid_regs' list of
registers with that of 'enum cpuid_regs_idx'. Because while computers
don't care about unnecessary visual noise, humans do.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: "Ahmed S. Darwish" <darwi@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250317221824.3738853-3-mingo@kernel.org
In preparation for future commits where CPUID headers will be expanded,
refactor the CPUID header <asm/cpuid.h> into:
asm/cpuid/
├── api.h
└── types.h
Move the CPUID data structures into <asm/cpuid/types.h> and the access
APIs into <asm/cpuid/api.h>. Let <asm/cpuid.h> be just an include of
<asm/cpuid/api.h> so that existing call sites do not break.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: "Ahmed S. Darwish" <darwi@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250317221824.3738853-2-mingo@kernel.org
Move sys_ni_syscall() to kernel/process.c, and remove the now empty
entry/common.c
No functional changes.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250314151220.862768-6-brgerst@gmail.com
Since commit:
2e958a8a51 ("x86/entry/x32: Rename __x32_compat_sys_* to __x64_compat_sys_*")
the ABI prefix for x32 syscalls is the same as native 64-bit
syscalls. Move the x32 syscall table to syscall_64.c
No functional changes.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250314151220.862768-5-brgerst@gmail.com
Some of the ACP drivers will poll registers through SMN using
read_poll_timeout() which requires returning the result of the register read
as the argument.
Add a helper to do just that.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250217231747.1656228-2-superm1@kernel.org
There are certain registers on AMD Zen systems that can only be accessed
through SMN.
Introduce a new interface that provides debugfs files for accessing SMN. As
this introduces the capability for userspace to manipulate the hardware in
unpredictable ways, taint the kernel when writing.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250130-wip-x86-amd-nb-cleanup-v4-3-b5cc997e471b@amd.com
Offsets 0x60 and 0x64 are used internally by kernel drivers that call
the amd_smn_read() and amd_smn_write() functions. If userspace accesses
the regions at the same time as the kernel it may cause malfunctions in
drivers using the offsets.
Add these offsets to the exclusions so that the kernel is tainted if a
non-locked-down userspace tries to access them.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250130-wip-x86-amd-nb-cleanup-v4-2-b5cc997e471b@amd.com
The HSMP interface is just an SMN interface with different offsets.
Define an HSMP wrapper in the SMN code and have the HSMP platform driver
use that rather than a local solution.
Also, remove the "root" member from AMD_NB, since there are no more
users of it.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Carlos Bilbao <carlos.bilbao@kernel.org>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20250130-wip-x86-amd-nb-cleanup-v4-1-b5cc997e471b@amd.com
ia32_emulation_override_cmdline() is an early_param() handler, and these
are only needed at boot time. In fact, all other early_param() functions
in arch/x86 seem to have the '__init' annotation;
ia32_emulation_override_cmdline() is the only exception.
Fixes: a11e097504 ("x86: Make IA32_EMULATION boot time configurable")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/all/20241210151650.1746022-1-vkuznets%40redhat.com
Currently, the cpuid_deps[] table is only exercised when a particular
feature is explicitly disabled and clear_cpu_cap() is called. However,
some of these listed dependencies might already be missing during boot.
These types of errors shouldn't generally happen in production
environments, but they could sometimes sneak through, especially when
VMs and Kconfigs are in the mix. Also, the kernel might introduce
artificial dependencies between unrelated features, such as making LAM
depend on LASS.
Unexpected failures can occur when the kernel tries to use such
features. Add a simple boot-time scan of the cpuid_deps[] table to
detect the missing dependencies. One option is to disable all such
features during boot, but that may cause regressions in existing
systems. For now, just warn about the missing dependencies to create
awareness.
As a trade-off between spamming the kernel log and keeping track of all
the features that have been warned about, only warn about the first
missing dependency. Any subsequent unmet dependency will only be logged
after the first one has been resolved.
Features are typically represented through unsigned integers within the
kernel, though some of them have user-friendly names if they are exposed
via /proc/cpuinfo.
Show the friendlier name if available, otherwise display the
X86_FEATURE_* numerals to make it easier to identify the feature.
Suggested-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250313201608.3304135-1-sohil.mehta@intel.com
The affected CPU table (cpu_vuln_blacklist) marks Alderlake and Raptorlake
P-only parts affected by RFDS. This is not true because only E-cores are
affected by RFDS. With the current family/model matching it is not possible
to differentiate the unaffected parts, as the affected and unaffected
hybrid variants have the same model number.
Add a cpu-type match as well for such parts so as to exclude P-only parts
being marked as affected.
Note, family/model and cpu-type enumeration could be inaccurate in
virtualized environments. In a guest, the affected status is decided by
the RFDS_NO and RFDS_CLEAR bits exposed by VMMs.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250311-add-cpu-type-v8-5-e8514dcaaff2@linux.intel.com
Non-hybrid CPU variants that share the same Family/Model could be
differentiated by their cpu-type. x86_match_cpu() currently does not use
cpu-type for CPU matching.
Dave Hansen suggested to use below conditions to match CPU-type:
1. If CPU_TYPE_ANY (the wildcard), then matched
2. If hybrid, then matched
3. If !hybrid, look at the boot CPU and compare the cpu-type to determine
if it is a match.
This special case for hybrid systems allows a more compact vulnerability
list. Imagine that "Haswell" CPUs might or might not be hybrid and that
only Atom cores are vulnerable to Meltdown. That means there are three
possibilities:
1. P-core only
2. Atom only
3. Atom + P-core (aka. hybrid)
One might be tempted to code up the vulnerability list like this:
MATCH( HASWELL, X86_FEATURE_HYBRID, MELTDOWN)
MATCH_TYPE(HASWELL, ATOM, MELTDOWN)
Logically, this matches #2 and #3. But that's a little silly. You would
only ask for the "ATOM" match in cases where there *WERE* hybrid cores in
play. You shouldn't have to _also_ ask for hybrid cores explicitly.
In short, assume that processors that enumerate Hybrid==1 have a
vulnerable core type.
Update x86_match_cpu() to also match cpu-type. Also treat hybrid systems as
special, and match them to any cpu-type.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250311-add-cpu-type-v8-4-e8514dcaaff2@linux.intel.com
In addition to matching vendor/family/model/feature, for hybrid variants it is
required to also match cpu-type. For example, some CPU vulnerabilities like
RFDS only affect a specific cpu-type.
To be able to also match CPUs based on their type, add a new field "type" to
struct x86_cpu_id which is used by the CPU-matching tables. Introduce
X86_CPU_TYPE_ANY for the cases that don't care about the cpu-type.
[ bp: Massage commit message. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250311-add-cpu-type-v8-3-e8514dcaaff2@linux.intel.com
To add cpu-type to the existing CPU matching infrastructure, the base macro
X86_MATCH_VENDOR_FAM_MODEL_STEPPINGS_FEATURE needs to append _CPU_TYPE. This
makes an already long name longer, and somewhat incomprehensible.
To avoid this, rename the base macro to X86_MATCH_CPU. The macro name
doesn't need to explicitly tell everything that it matches. The arguments
to the macro already hint at that.
For consistency, use this base macro to define X86_MATCH_VFM and friends.
Remove unused X86_MATCH_VENDOR_FAM_MODEL_FEATURE while at it.
[ bp: Massage commit message. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250311-add-cpu-type-v8-2-e8514dcaaff2@linux.intel.com
The functionalities of {disabled,required}-features.h have been replaced with
the auto-generated generated/<asm/cpufeaturemasks.h> header.
Thus they are no longer needed and can be removed.
None of the macros defined in {disabled,required}-features.h is used in
tools; delete them too.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250305184725.3341760-4-xin@zytor.com
Introduce an AWK script to auto-generate the <asm/cpufeaturemasks.h> header
with required and disabled feature masks based on <asm/cpufeatures.h>
and the current build config.
Thus for any CPU feature with a build config, e.g., X86_FRED, simply add:
config X86_DISABLED_FEATURE_FRED
def_bool y
depends on !X86_FRED
to arch/x86/Kconfig.cpufeatures, instead of adding a conditional CPU
feature disable flag, e.g., DISABLE_FRED.
Lastly, the generated required and disabled feature masks will be added to
their corresponding feature masks for this particular compile-time
configuration.
[ Xin: build integration improvements ]
[ mingo: Improved changelog and comments ]
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250305184725.3341760-3-xin@zytor.com
Required and disabled feature masks completely rely on build configs,
i.e., once a build config is fixed, so are the feature masks.
To prepare for auto-generating the <asm/cpufeaturemasks.h> header
with required and disabled feature masks based on a build config,
add feature Kconfig items:
- X86_REQUIRED_FEATURE_x
- X86_DISABLED_FEATURE_x
each of which may be set to "y" if and only if its preconditions from
the current build config are met.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250228082338.73859-3-xin@zytor.com
The current calculation of the 'next' virtual address in the
page table initialization functions in arch/x86/mm/ident_map.c
doesn't protect against wrapping to zero.
This is a theoretical issue that cannot happen currently:
the problematic case is possible only if the user sets a
high enough x86_mapping_info::offset value - which no
current code in the upstream kernel does.
( The wrapping to zero only occurs if the top PGD entry is accessed.
There are no such users upstream. Only hibernate_64.c uses
x86_mapping_info::offset, and it operates on the direct mapping
range, which is not the top PGD entry. )
Should such an overflow happen, it can result in page table
corruption and a hang.
To future-proof this code, replace the manual 'next' calculation
with p?d_addr_end() which handles wrapping correctly.
[ Backporter's note: there's no need to backport this patch. ]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20241016111458.846228-2-kirill.shutemov@linux.intel.com
The init_transition_pgtable() function maps the page with
asm_acpi_mp_play_dead() into an identity mapping.
Replace open-coded manual page table initialization with
kernel_ident_mapping_init() to avoid code duplication.
Use x86_mapping_info::offset to get the page mapped at the
correct location.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20241016111458.846228-3-kirill.shutemov@linux.intel.com
When executing the INVLPGB instruction on a bare-metal host or hypervisor, if
the ASID valid bit is not set, the instruction will flush the TLB entries that
match the specified criteria for any ASID, not just those of the host. If
virtual machines are running on the system, this may result in inadvertent
flushes of guest TLB entries.
When executing the INVLPGB instruction in a guest and the INVLPGB instruction is
not intercepted by the hypervisor, the hardware will replace the requested ASID
with the guest ASID and set the ASID valid bit before doing the broadcast
invalidation. Thus a guest is only able to flush its own TLB entries.
So to limit the host TLB flushing reach, always set the ASID valid bit using an
ASID value of 0 (which represents the host/hypervisor). This will result in
the desired effect in both host and guest.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304120449.GHZ8bsYYyEBOKQIxBm@fat_crate.local
With AMD TCE (translation cache extensions) only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.
This can help reduce the TLB miss rate, by keeping more intermediate mappings
in the cache.
From the AMD manual:
Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit to
1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on TLB
entries. When this bit is 0, these instructions remove the target PTE from the
TLB as well as all upper-level table entries that are cached in the TLB,
whether or not they are associated with the target PTE. When this bit is set,
these instructions will remove the target PTE and only those upper-level
entries that lead to the target PTE in the page table hierarchy, leaving
unrelated upper-level entries intact.
[ bp: use cpu_has()... I know, it is a mess. ]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-13-riel@surriel.com
There is not enough room in the 12-bit ASID address space to hand out
broadcast ASIDs to every process. Only hand out broadcast ASIDs to processes
when they are observed to be simultaneously running on 4 or more CPUs.
This also allows single-threaded processes to continue using the cheaper,
local TLB invalidation instructions like INVLPG.
Due to the structure of flush_tlb_mm_range(), the INVLPGB flushing is done in
a generically named broadcast_tlb_flush() function which can later also be
used for Intel RAR.
Combined with the removal of unnecessary lru_add_drain() calls (see
https://lore.kernel.org/r/20241219153253.3da9e8aa@fangorn) this results in
a nice performance boost for the will-it-scale tlb_flush2_threads test on an
AMD Milan system with 36 cores:
- vanilla kernel: 527k loops/second
- lru_add_drain removal: 731k loops/second
- only INVLPGB: 527k loops/second
- lru_add_drain + INVLPGB: 1157k loops/second
Profiling with only the INVLPGB changes showed that while TLB invalidation
went down from 40% of the total CPU time to only around 4% of CPU time,
the contention simply moved to the LRU lock.
Fixing both at the same time roughly doubles the number of iterations per
second in this case.
Comparing will-it-scale tlb_flush2_threads with several different numbers of
threads on a 72 CPU AMD Milan shows similar results. The number represents the
total number of loops per second across all the threads:
threads tip INVLPGB
1 315k 304k
2 423k 424k
4 644k 1032k
8 652k 1267k
16 737k 1368k
32 759k 1199k
64 636k 1094k
72 609k 993k
1 and 2 thread performance is similar with and without INVLPGB, because
INVLPGB is only used on processes using 4 or more CPUs simultaneously.
The number is the median across 5 runs.
Some numbers closer to real world performance can be found at Phoronix, thanks
to Michael:
https://www.phoronix.com/news/AMD-INVLPGB-Linux-Benefits
[ bp:
- Massage
- :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
- :%s/\<clear_asid_transition\>/mm_clear_asid_transition/cgi
- Fold in a 0day bot fix: https://lore.kernel.org/oe-kbuild-all/202503040000.GtiWUsBm-lkp@intel.com
]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Link: https://lore.kernel.org/r/20250226030129.530345-11-riel@surriel.com
A global ASID is allocated for the lifetime of a process. Free the global ASID
at process exit time.
[ bp: Massage, create helpers, hide details inside them. ]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-10-riel@surriel.com
Add context switch and TLB flush support for processes that use a global
ASID and PCID across all CPUs.
At both context switch time and TLB flush time, check whether a task is
switching to a global ASID and, if so, reload the TLB with the new ASID
as appropriate.
In both code paths, the TLB flush is avoided if a global ASID is used, because
the global ASIDs are always kept up to date across CPUs, even when the
process is not running on a CPU.
[ bp:
- Massage
- :%s/\<static_cpu_has\>/cpu_feature_enabled/cgi
]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-9-riel@surriel.com
Add functions to manage global ASID space. Multithreaded processes that are
simultaneously active on 4 or more CPUs can get a global ASID, resulting in the
same PCID being used for that process on every CPU.
This in turn will allow the kernel to use hardware-assisted TLB flushing
through AMD INVLPGB or Intel RAR for these processes.
[ bp:
- Extend use_global_asid() comment
- s/X86_BROADCAST_TLB_FLUSH/BROADCAST_TLB_FLUSH/g
- other touchups ]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-8-riel@surriel.com
Page reclaim tracks only the CPU(s) where the TLB needs to be flushed, rather
than all the individual mappings that may be getting invalidated.
Use broadcast TLB flushing when that is available.
[ bp: Massage commit message. ]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-7-riel@surriel.com
Use broadcast TLB invalidation for kernel addresses when available.
Remove the need to send IPIs for kernel TLB flushes.
[ bp: Integrate dhansen's comments additions, merge the
flush_tlb_all() change into this one too. ]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-5-riel@surriel.com
Add helper functions and definitions needed to use broadcast TLB
invalidation on AMD CPUs.
[ bp:
- Cleanup commit message
- Improve and expand comments
- push the preemption guards inside the invlpgb* helpers
- merge improvements from dhansen
- add !CONFIG_BROADCAST_TLB_FLUSH function stubs because Clang
can't do DCE properly yet and looks at the inline asm and
complains about it getting a u64 argument on 32-bit code ]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-4-riel@surriel.com
In addition, the CPU advertises the maximum number of pages that can be
shot down with one INVLPGB instruction in CPUID. Save that information
for later use.
[ bp: use cpu_has(), typos, massage. ]
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-3-riel@surriel.com
Reduce code duplication by consolidating the decision point for whether to do
individual invalidations or a full flush inside get_flush_tlb_info().
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lore.kernel.org/r/20250226030129.530345-2-riel@surriel.com
At least with CONFIG_PHYSICAL_START=0x100000, if there is < 4 MiB of
contiguous free memory available at this point, the kernel will crash
and burn because memblock_phys_alloc_range() returns 0 on failure,
which leads memblock_phys_free() to throw the first 4 MiB of physical
memory to the wolves.
At a minimum it should fail gracefully with a meaningful diagnostic,
but in fact everything seems to work fine without the weird reserve
allocation.
Signed-off-by: Philip Redkin <me@rarity.fan>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/94b3e98f-96a7-3560-1f76-349eb95ccf7f@rarity.fan
Doing this allows using registers as retrieved from an mcontext to be
pushed to a process using PTRACE_SETREGS.
It is not entirely clear to me why CSGSFS was masked. Doing so creates
issues when using the mcontext as process state in seccomp and simply
copying the register appears to work perfectly fine for ptrace.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
Link: https://patch.msgid.link/20250224181827.647129-2-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Minimum version of binutils required to compile the kernel is 2.25.
This version correctly handles the "lock" prefix, so it is possible
to remove the semicolon, which was used to support ancient versions
of GNU as.
Due to the semicolon, the compiler considers "lock; insn" as two
separate instructions. Removing the semicolon makes asm length
calculations more accurate, consequently making scheduling and
inlining decisions of the compiler more accurate.
Removing the semicolon also enables assembler checks involving the lock
prefix. Trying to assemble e.g. "lock andl %eax, %ebx" results in:
Error: expecting lockable instruction after `lock'
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://patch.msgid.link/20250228090058.2499163-1-ubizjak@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Instead of dynamically allocating the pointer to the vdso page during
boot, we can just allocate it statically. Doing so will reduce error
handling and make the code slightly more readable.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20250212045756.164977-1-tiwei.btw@antgroup.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Mark read-only data as actually read-only (a simple mprotect), and,
to be able to test it, also implement _nofault accesses. This
works by setting up a new "segv_continue" pointer in current,
and then when we hit a segfault we change the signal return
context so that we continue at that address. The code using
this sets it up so that it jumps to a label and then aborts
the access that way, returning -EFAULT.
It's possible to optimize the ___backtrack_faulted() thing by
using asm goto (compiler version dependent) and/or gcc's (not
sure if clang has it) &&label extension, but at least in one
attempt I made the && caused the compiler to not load -EFAULT
into the register in case of jumping to the &&label from the
fault handler. So leave it like this for now.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Co-developed-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20250210160926.420133-2-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In order to work around some issues with disabling SSE on older versions
of gcc (compilation would fail upon seeing a function declaration
containing a float, even if it was never called or defined), the
corresponding CFLAGS and RUSTFLAGS were only set when using clang.
However, this led to two problems:
- Newer gcc versions also wouldn't get the correct flags, despite not
having the bug.
- The RUSTFLAGS for setting the rust target definition were not set,
despite being unrelated. This works by chance for x86_64, as the
built-in default target is close enough, but not for 32-bit x86.
Move the target definition outside the conditional block, and update the
condition to take into account the gcc version.
Fixes: a3046a618a ("um: Only disable SSE on clang to work around old GCC bugs")
Signed-off-by: David Gow <davidgow@google.com>
Link: https://patch.msgid.link/20250210105353.2238769-2-davidgow@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Patch series "mm/hwpoison: Fix regressions in memory failure handling",
v4.
## 1. What am I trying to do:
This patchset resolves two critical regressions related to memory failure
handling that have appeared in the upstream kernel since version 5.17, as
compared to 5.10 LTS.
- copyin case: poison found in user page while kernel copying from user space
- instr case: poison found while instruction fetching in user space
## 2. What is the expected outcome and why
- For copyin case:
The kernel can recover from poison found where it is doing get_user() or
copy_from_user() if those places get an error return and the kernel returns
-EFAULT to the process instead of crashing. More specifically, the MCE handler
checks the fixup handler type to decide whether an in-kernel #MC can be
recovered. When EX_TYPE_UACCESS is found, the PC jumps to the recovery code
specified in _ASM_EXTABLE_FAULT() and returns -EFAULT to user space.
- For instr case:
If poison is found while fetching an instruction in user space, full recovery
is possible. The user process takes a #PF, and Linux allocates a new page and
fills it by reading from storage.
## 3. What actually happens and why
- For copyin case: kernel panic since v5.17
Commit 4c132d1d84 ("x86/futex: Remove .fixup usage") introduced a new
extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the
extable fixup type for copy-from-user operations, changing it from
EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. It breaks previous EX_TYPE_UACCESS
handling when poison is found in get_user() or copy_from_user().
- For instr case: user process is killed by a SIGBUS signal due to #CMCI
and #MCE race
When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a UCNA
signature, and the core reporting an SRAR signature machine check when
the data is about to be consumed.
### Background: why *UN*corrected errors are tied to *C*MCI on Intel platforms [1]
Prior to Icelake, memory controllers reported patrol scrub events that
detected a previously unseen uncorrected error in memory by signaling a
broadcast machine check with an SRAO (Software Recoverable Action
Optional) signature in the machine check bank. This was overkill because
it's not an urgent problem: no core is on the verge of consuming that
bad data. It was also found that multiple SRAO UCEs may cause nested MCE
interrupts and finally result in an IERR.
Hence, Intel downgrades the machine check bank signature of patrol scrub
from SRAO to UCNA (Uncorrected, No Action required), and the signal changed to
#CMCI. Just to add to the confusion, Linux does take an action (in
uc_decode_notifier()) to try to offline the page despite the UC*NA*
signature name.
### Background: why #CMCI and #MCE race when poison is being consumed on
an Intel platform [1]
Having decided that CMCI/UCNA is the best action for patrol scrub errors,
the memory controller uses it for reads too. But the memory controller is
executing asynchronously from the core, and can't tell the difference
between a "real" read and a speculative read. So it will do CMCI/UCNA if
an error is found in any read.
Thus:
1) Core is clever and thinks address A is needed soon, issues a
speculative read.
2) Core finds it is going to use address A soon after sending the read
request
3) The CMCI from the memory controller is in a race with MCE from the
core that will soon try to retire the load from address A.
Quite often (because speculation has got better) the CMCI from the memory
controller is delivered before the core is committed to the instruction
reading address A, so the interrupt is taken, and Linux offlines the page
(marking it as poison).
### Why the user process is killed in the instr case
Commit 046545a661 ("mm/hwpoison: fix error page recovered but reported
"not recovered"") tries to fix the noisy "Memory error not recovered"
message and skip duplicate SIGBUSes due to the race. But it also introduced
a bug where kill_accessing_process() returns -EHWPOISON for the instr case;
as a result, kill_me_maybe() sends a SIGBUS to the user process.
## 4. The fix, in my opinion, should be:
- For copyin case:
The key point is whether the error context is in a read from user memory.
We do not care about the ex-type if we know it's a MOV reading from
userspace.
is_copy_from_user() returns true when both of the following two checks are
true:
- the current instruction is copy
- source address is user memory
If copy_user is true, we set
m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
Then do_machine_check() will try fixup_exception() first.
- For instr case: let kill_accessing_process() return 0 to prevent a SIGBUS.
- For patch 3:
The return value of memory_failure() is quite important while discussing
the instr case regression with Tony and Miaohe for patch 2, so add a
comment about the return value.
This patch (of 3):
Commit 4c132d1d84 ("x86/futex: Remove .fixup usage") introduced a new
extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the
extable fixup type for copy-from-user operations, changing it from
EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG. Consequently, the error context
for copy-from-user operations no longer functions as an in-kernel
recovery context, resulting in kernel panics with the message:
"Machine check: Data load in unrecoverable area of kernel."
To address this, it is crucial to identify if an error context involves a
read operation from user memory. The function is_copy_from_user() can be
utilized to determine whether:
- the current operation is a copy
- it is reading from user memory
When these conditions are met, is_copy_from_user() will return true,
confirming that it is indeed a direct copy from user memory. This check
is essential for correctly handling the context of errors in these
operations without relying on the extable fixup types that previously
allowed for in-kernel recovery.
So, use is_copy_from_user() to directly determine whether a context is a
copy from user memory.
Link: https://lkml.kernel.org/r/20250312112852.82415-1-xueshuai@linux.alibaba.com
Link: https://lkml.kernel.org/r/20250312112852.82415-2-xueshuai@linux.alibaba.com
Fixes: 4c132d1d84 ("x86/futex: Remove .fixup usage")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Ruidong Tian <tianruidong@linux.alibaba.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The point where the memory is released from memblock to the buddy
allocator is hidden inside arch-specific mem_init()s and the call to
memblock_free_all() is needlessly duplicated in every architecture. After
the introduction of the arch_mm_preinit() hook, the mem_init() implementation
on many architectures only contains the call to memblock_free_all().
Pull memblock_free_all() call into mm_core_init() and drop mem_init() on
relevant architectures to make it more explicit where the free memory is
released from memblock to the buddy allocator and to reduce code
duplication in architecture specific code.
Link: https://lkml.kernel.org/r/20250313135003.836600-14-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, implementation of mem_init() in every architecture consists of
one or more of the following:
* initializations that must run before page allocator is active, for
instance swiotlb_init()
* a call to memblock_free_all() to release all the memory to the buddy
allocator
* initializations that must run after page allocator is ready and there is
no arch-specific hook other than mem_init() for that, like for example
register_page_bootmem_info() in x86 and sparc64 or simple setting of
mem_init_done = 1 in several architectures
* a bunch of semi-related stuff that apparently had no better place to
live, for example a ton of BUILD_BUG_ON()s in parisc.
Introduce arch_mm_preinit() that will be the first thing called from
mm_core_init(). On architectures that have initializations that must happen
before the page allocator is ready, move those into arch_mm_preinit() along
with the code that does not depend on ordering with page allocator setup.
On several architectures this results in reduction of mem_init() to a
single call to memblock_free_all() that allows its consolidation next.
Link: https://lkml.kernel.org/r/20250313135003.836600-13-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All architectures that support HIGHMEM have their code that frees high
memory pages to the buddy allocator while __free_memory_core() is limited
to freeing only low memory.
There is no actual reason for that. The memory map is completely ready by
the time memblock_free_all() is called and high pages can be released to
the buddy allocator along with low memory.
Remove low memory limit from __free_memory_core() and drop per-architecture
code that frees high memory pages.
Link: https://lkml.kernel.org/r/20250313135003.836600-12-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
high_memory defines the upper bound on the directly mapped memory. This bound
is defined by the beginning of ZONE_HIGHMEM when a system has high memory
and by the end of memory otherwise.
All this is known to generic memory management initialization code that
can set high_memory while initializing core mm structures.
Add a generic calculation of high_memory to free_area_init() and remove
per-architecture calculation except for the architectures that set and use
high_memory earlier than that.
Link: https://lkml.kernel.org/r/20250313135003.836600-11-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
max_mapnr is essentially the size of the memory map for systems that use
FLATMEM. There is no reason to calculate it in each and every architecture
when it's anyway calculated in alloc_node_mem_map().
Drop setting of max_mapnr from architecture code and set it once in
alloc_node_mem_map().
While at it, move the definition of mem_map and max_mapnr to mm/mm_init.c so
there won't be two copies for MMU and !MMU variants.
Link: https://lkml.kernel.org/r/20250313135003.836600-10-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Guest FPUs manage vCPU FPU states. They are allocated via
fpu_alloc_guest_fpstate() and are resized in fpstate_realloc() when XFD
features are enabled.
Since the introduction of guest FPUs, there have been inconsistencies in
the kernel buffer size and xfeatures:
1. fpu_alloc_guest_fpstate() uses fpu_user_cfg since its introduction. See:
69f6ed1d14 ("x86/fpu: Provide infrastructure for KVM FPU cleanup")
36487e6228 ("x86/fpu: Prepare guest FPU for dynamically enabled FPU features")
2. __fpstate_reset() references fpu_kernel_cfg to set storage attributes.
3. fpu->guest_perm uses fpu_kernel_cfg, affecting fpstate_realloc().
A recent commit in the tip:x86/fpu tree partially addressed the inconsistency
between (1) and (3) by using fpu_kernel_cfg for size calculation in (1),
but left fpu_guest->xfeatures and fpu_guest->perm still referencing
fpu_user_cfg:
https://lore.kernel.org/all/20250218141045.85201-1-stanspas@amazon.de/
1937e18cc3 ("x86/fpu: Fix guest FPU state buffer allocation size")
The inconsistencies within fpu_alloc_guest_fpstate() and across the
mentioned functions cause confusion.
Fix them by using fpu_kernel_cfg consistently in fpu_alloc_guest_fpstate(),
except for fields related to the UABI buffer. Referencing fpu_kernel_cfg
won't impact functionalities, as:
1. fpu_guest->perm is overwritten shortly in fpu_init_guest_permissions()
with fpstate->guest_perm, which already uses fpu_kernel_cfg.
2. fpu_guest->xfeatures is solely used to check if XFD features are enabled.
Including supervisor xfeatures doesn't affect the check.
Fixes: 36487e6228 ("x86/fpu: Prepare guest FPU for dynamically enabled FPU features")
Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Link: https://lore.kernel.org/r/20250317140613.1761633-1-chao.gao@intel.com
The IBS software filter is filtering kernel samples for regular users in
the PMI handler. It checks the instruction address in the IBS register to
determine if it was in kernel mode or not.
But it turns out that it's possible to report a kernel data address even
if the instruction address belongs to user-space. Matteo Rizzo
found that when an instruction raises an exception, IBS can report some
kernel data addresses like the IDT while holding the faulting instruction's
RIP. To prevent an information leak, it should double-check whether the data
address in PERF_SAMPLE_DATA is in kernel space as well.
[ mingo: Clarified the changelog ]
Suggested-by: Matteo Rizzo <matteorizzo@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250317163755.1842589-1-namhyung@kernel.org
Tie together the %[xa] in the XSAVE/XRSTOR definitions with the
respective usage in the asm macros so that it is perfectly clear.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The CONFIG_X86_ESPFIX64 version of exc_double_fault() can return to its
caller, but the !CONFIG_X86_ESPFIX64 version never does. In the latter
case the compiler and/or objtool may consider it to be implicitly
noreturn.
However, due to the currently inflexible way objtool detects noreturns,
a function's noreturn status needs to be consistent across configs.
The current workaround for this issue is to suppress unreachable
warnings for exc_double_fault()'s callers. Unfortunately that can
result in ORC coverage gaps and potentially worse issues like inert
static calls and silently disabled CPU mitigations.
Instead, prevent exc_double_fault() from ever being implicitly marked
noreturn by forcing a return behind a never-taken conditional.
Until a more integrated noreturn detection method exists, this is likely
the least objectionable workaround.
Fixes: 55eeab2a8a ("objtool: Ignore exc_double_fault() __noreturn warnings")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/r/d1f4026f8dc35d0de6cc61f2684e0cb6484009d1.1741975349.git.jpoimboe@kernel.org
After __die_header(), __die_body() is always invoked. There we have
show_regs() -> show_regs_print_info() which prints the current
preemption model.
Remove it from the initial line.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250314160810.2373416-8-bigeasy@linutronix.de
The PMU-specific data is saved in task_struct now. It doesn't need to
be swapped between contexts.
Remove swap_task_ctx() support.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250314172700.438923-6-kan.liang@linux.intel.com
In the system-wide mode, LBR callstacks are shorter in comparison to
the per-process mode.
LBR MSRs are reset during a context switch in the system-wide mode. For
the LBR call stack, the LBRs should always be saved/restored during a
context switch.
Use the space in task_struct to save/restore the LBR call stack data.
For a system-wide event, it's unnecessary to update the
lbr_callstack_users for each thread. Add a variable in x86_pmu to
indicate whether the system-wide event is active.
Fixes: 76cb2c617f ("perf/x86/intel: Save/restore LBR stack during context switch")
Reported-by: Andi Kleen <ak@linux.intel.com>
Reported-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Debugged-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250314172700.438923-5-kan.liang@linux.intel.com
To save/restore LBR call stack data in system-wide mode, the task_struct
information is required.
Extend the parameters of sched_task() to supply task_struct information.
When scheduling in, the LBR call stack data for the new task will be
restored. When scheduling out, the LBR call stack data for the old task
will be saved. Only the required task_struct information needs to be
passed.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250314172700.438923-4-kan.liang@linux.intel.com
No need for an 'else' statement after a 'return'.
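For illustration (a generic example, not the file being changed), the
cleanup amounts to:
  /* Before: the 'else' is redundant because the 'if' branch returns. */
  static int check_before(int cond)
  {
          if (cond)
                  return -1;
          else
                  return 0;
  }

  /* After: same behavior, simpler control flow. */
  static int check_after(int cond)
  {
          if (cond)
                  return -1;
          return 0;
  }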
[ mingo: Clarified the changelog ]
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Commit 49147beb0c ("x86/xen: allow nesting of same lazy mode") was added
as a solution for a core-mm code change where
arch_[enter|leave]_lazy_mmu_mode() started to be called in a nested
manner; see commit bcc6cc8325 ("mm: add default definition of
set_ptes()").
However, now that we have fixed the API to avoid nesting, we no longer
need this capability in the x86 implementation.
Additionally, from code review, I don't believe the fix was ever robust in
the case of preemption occurring while in the nested lazy mode. The
implementation usually deals with preemption by calling
arch_leave_lazy_mmu_mode() from xen_start_context_switch() for the
outgoing task if we are in the lazy mmu mode. Then in
xen_end_context_switch(), it restarts the lazy mode by calling
arch_enter_lazy_mmu_mode() for an incoming task that was in the lazy mode
when it was switched out. But arch_leave_lazy_mmu_mode() will only unwind
a single level of nesting. If we are in the double nest, then it's not
fully unwound and per-cpu variables are left in a bad state.
So the correct solution is to remove the possibility of nesting from the
higher level (which has now been done) and remove this x86-specific
solution.
Link: https://lkml.kernel.org/r/20250303141542.3371656-6-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Platforms subscribe into the generic ptdump implementation via
GENERIC_PTDUMP, but generic ptdump gets enabled via PTDUMP_CORE. This
combination of configs is confusing, as the names sound very similar and
do not differentiate between a platform's feature subscription and the
feature's enablement for ptdump. Rename the configs to ARCH_HAS_PTDUMP
and PTDUMP, making the distinction clearer and improving readability.
Link: https://lkml.kernel.org/r/20250226122404.1927473-6-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The cmdline argument is not used in reserve_crashkernel_generic(), so
remove it. Correspondingly, all the callers have been updated as well.
No functional change intended.
Link: https://lkml.kernel.org/r/20250131113830.925179-3-sourabhjain@linux.ibm.com
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that hugetlb bootmem pages are allocated earlier, and available for
section preinit (HVO-style), set ARCH_WANT_HUGETLB_VMEMMAP_PREINIT for
x86_64, so that it can be done.
This enables pre-HVO on x86_64.
Link: https://lkml.kernel.org/r/20250228182928.2645936-22-fvdl@google.com
Signed-off-by: Frank van der Linden <fvdl@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Call hugetlb_bootmem_alloc() in an earlier spot in setup, after
hugetlb_cma_reserve(). This will make vmemmap preinit of the sections
covered by the allocated hugetlb pages possible.
Link: https://lkml.kernel.org/r/20250228182928.2645936-21-fvdl@google.com
Signed-off-by: Frank van der Linden <fvdl@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
register_page_bootmem_memmap expects that vmemmap pages handed to it are
PMD-mapped, and that the number of pages to call get_page_bootmem on is
PMD-aligned.
This is currently a correct assumption, but will no longer be true once
pre-HVO of hugetlb pages is implemented.
Make it handle PTE-mapped vmemmap pages and a nr_pages argument that is
not necessarily PAGES_PER_SECTION.
Link: https://lkml.kernel.org/r/20250228182928.2645936-9-fvdl@google.com
Signed-off-by: Frank van der Linden <fvdl@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ioremap_prot() currently accepts the pgprot_val parameter as an unsigned
long, thus implicitly assuming that pgprot_val and pgprot_t could never
be bigger than unsigned long. But this assumption soon will not be true
on arm64 when using D128 pgtables: in the 128-bit page table
configuration, unsigned long is 64 bits, but pgprot_t is 128 bits.
Passing the platform-abstracted pgprot_t argument is better than a
size-based data type. Let's change the parameter to directly pass
pgprot_t, like the similar helper generic_ioremap_prot().
Without this change in place, the D128 configuration does not work on
arm64, as the top 64 bits get silently stripped when passing the
protection value to this function.
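A small user-space sketch of the truncation problem (the pgprot_t layout
here is assumed, purely for illustration):
  #include <stdio.h>

  typedef struct { unsigned long long lo, hi; } pgprot_t; /* assumed 128-bit layout */

  static void takes_ulong(unsigned long prot_val)
  {
          printf("callee sees: 0x%lx\n", prot_val);        /* upper half is gone */
  }

  static void takes_pgprot(pgprot_t prot)
  {
          printf("callee sees: 0x%llx:%016llx\n", prot.hi, prot.lo);
  }

  int main(void)
  {
          pgprot_t prot = { .lo = 0x1, .hi = 0xa0 };

          takes_ulong((unsigned long)prot.lo);  /* what the old interface amounts to */
          takes_pgprot(prot);                   /* full 128-bit value preserved */
          return 0;
  }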
Link: https://lkml.kernel.org/r/20250218101954.415331-1-anshuman.khandual@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch lays the groundwork for supporting batch PTE unmapping in
try_to_unmap_one(). It introduces range handling for TLB batch flushing,
with the range currently set to PAGE_SIZE.
The function __flush_tlb_range_nosync() is architecture-specific and is
only used within arch/arm64. This function requires the mm structure
instead of the vma structure. To allow its reuse by
arch_tlbbatch_add_pending(), which operates with mm but not vma, this
patch modifies the argument of __flush_tlb_range_nosync() to take mm as
its parameter.
Link: https://lkml.kernel.org/r/20250214093015.51024-3-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shaoqin Huang <shahuang@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mauricio Faria de Oliveira <mfo@canonical.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch declares percpu variables in the __seg_gs/__seg_fs named
address space (AS) and keeps them named-AS qualified until they are
dereferenced with a percpu accessor. This approach enables various
compiler checks for cross-namespace variable assignments.
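A minimal host-side sketch of the mechanics (not the kernel's percpu
macros; the offset-to-pointer cast mirrors what the accessors do):
  #ifdef __SEG_GS
  typedef int __seg_gs *gs_int_ptr;

  static int load_gs_slot(unsigned long offset)
  {
          gs_int_ptr slot = (gs_int_ptr)offset;  /* offset relative to the GS base */

          return *slot;                          /* emits a %gs:-prefixed load */
  }
  /* Assigning 'slot' to a plain 'int *' without a cast is now diagnosed. */
  #endif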
Link: https://lkml.kernel.org/r/20250127160709.80604-7-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the TYPEOF_UNQUAL() macro to declare the return type of the
*_cpu_ptr() accessors in the generic named address space, to avoid
"access to data from pointer to non-enclosed address space" type errors.
Link: https://lkml.kernel.org/r/20250127160709.80604-5-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use TYPEOF_UNQUAL() to declare variables as a corresponding type without
named address space qualifier to avoid "`__seg_gs' specified for auto
variable `var'" errors.
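A sketch of what this resolves to on compilers that provide
__typeof_unqual__() (assumed available; the helper is illustrative):
  #ifdef __SEG_GS
  static int snapshot(int __seg_gs *slot)
  {
          __typeof_unqual__(*slot) val;   /* plain 'int', not '__seg_gs int' */

          val = *slot;                    /* single %gs-relative read */
          return val;
  }
  #endif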
Link: https://lkml.kernel.org/r/20250127160709.80604-4-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Enable strict percpu address space checks", v4.
Enable strict percpu address space checks via x86 named address space
qualifiers. Percpu variables are declared in __seg_gs/__seg_fs named AS
and kept named AS qualified until they are dereferenced via percpu
accessor. This approach enables various compiler checks for
cross-namespace variable assignments.
Please note that the current version of sparse doesn't know anything
about the __typeof_unqual__() operator. Avoid using __typeof_unqual__()
when sparse checking is active to prevent sparse errors about the unknown
keyword.
The proposed patch by Dan Carpenter to implement __typeof_unqual__()
handling in sparse is located at:
https://lore.kernel.org/lkml/5b8d0dee-8fb6-45af-ba6c-7f74aff9a4b8@stanley.mountain/
This patch (of 6):
Use IS_ERR_PCPU() when checking the error pointer in the percpu address
space. This macro adds an intermediate cast to unsigned long when
switching named address spaces.
This avoids future build errors due to pointer address space mismatches
once strict percpu address space checks are enabled.
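A hedged usage sketch (the use site is hypothetical, and PTR_ERR_PCPU()
is assumed as the companion helper for extracting the errno):
  #include <linux/err.h>
  #include <linux/percpu.h>

  /* Reject an error-encoded percpu pointer before using it. */
  static int validate_counter(int __percpu *counter)
  {
          if (IS_ERR_PCPU(counter))               /* casts through unsigned long */
                  return PTR_ERR_PCPU(counter);   /* assumed companion helper */
          return 0;
  }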
Link: https://lkml.kernel.org/r/20250127160709.80604-1-ubizjak@gmail.com
Link: https://lkml.kernel.org/r/20250127160709.80604-2-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instead of generating the vmlinux.relocs file (needed by the
decompressor build to construct the KASLR relocation tables) as a
vmlinux postlink step, which is dubious because it depends on data that
is stripped from vmlinux before the build completes, generate it from
vmlinux.unstripped, which has been introduced specifically for this
purpose.
This ensures that each artifact is rebuilt as needed, rather than as a
side effect of another build rule.
This effectively reverts commit
9d9173e9ce ("x86/build: Avoid relocation information in final vmlinux")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
The imperative paradigm used to build vmlinux, extract some info from it
or perform some checks on it, and subsequently modify it again goes
against the declarative paradigm that is usually employed for defining
make rules.
In particular, the Makefile.postlink files that consume their input via
an output rule result in some dodgy logic in the decompressor makefiles
for RISC-V and x86, given that the vmlinux.relocs input file needed to
generate the arch-specific relocation tables may not exist or be out of
date, but cannot be constructed using the ordinary Make dependency based
rules, because the info needs to be extracted while vmlinux is in its
ephemeral, non-stripped form.
So instead, for architectures that require the static relocations that
are emitted into vmlinux when passing --emit-relocs to the linker, and
are subsequently stripped out again, introduce an intermediate vmlinux
target called vmlinux.unstripped, and organize the rest of the build
logic accordingly:
- vmlinux.unstripped is created only once, and not updated again
- build rules under arch/*/boot can depend on vmlinux.unstripped without
running the risk of the data disappearing or being out of date
- the final vmlinux generated by the build is not bloated with static
relocations that are never needed again after the build completes.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Some architectures build vmlinux with static relocations preserved, but
strip them again from the final vmlinux image. Arch specific tools
consume these static relocations in order to construct relocation tables
for KASLR.
The fact that vmlinux is created, consumed and subsequently updated goes
against the typical, declarative paradigm used by Make, which is based
on rules and dependencies. So as a first step towards cleaning this up,
introduce a Kconfig symbol to declare that the arch wants to consume the
static relocations emitted into vmlinux. This will be wired up further
in subsequent patches.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
The longest length of a symbol (KSYM_NAME_LEN) was increased to 512
in the reference [1]. This patch adds a KUnit test suite to check the
longest symbol length. These tests verify that the longest symbol length
defined is supported.
This test can also help other efforts for longer symbol lengths,
like [2].
The test suite defines one symbol with the longest possible length.
The first test verifies whether functions with the name of the created
symbol can be called or not.
The second test verifies whether the symbols are created (or
not) in the kernel symbol table.
[1] https://lore.kernel.org/lkml/20220802015052.10452-6-ojeda@kernel.org/
[2] https://lore.kernel.org/lkml/20240605032120.3179157-1-song@kernel.org/
Link: https://lore.kernel.org/r/20250302221518.76874-1-sergio.collado@gmail.com
Tested-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
Reviewed-by: Rae Moar <rmoar@google.com>
Signed-off-by: Sergio González Collado <sergio.collado@gmail.com>
Link: https://github.com/Rust-for-Linux/linux/issues/504
Reviewed-by: Rae Moar <rmoar@google.com>
Acked-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <shuah@kernel.org>
The kernel test robot reported a "call without frame pointer save/setup"
warning from objtool. The missing frame setup makes stack traces
unreliable with CONFIG_UNWINDER_FRAME_POINTER=y, although
CONFIG_UNWINDER_ORC=y is unaffected. Fix this by creating a stack frame
for the function.
Fixes: 2fb761823e ("bpf, x86: Add x86 JIT support for timed may_goto")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202503071350.QOhsHVaW-lkp@intel.com/
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250315013039.1625048-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Recently we introduced BPF load-acquire (BPF_LOAD_ACQ) and store-release
(BPF_STORE_REL) instructions. For x86-64, simply implement them as
regular BPF_LDX/BPF_STX loads and stores. The verifier always rejects
misaligned load-acquires/store-releases (even if BPF_F_ANY_ALIGNMENT is
set), so emitted MOV* instructions are guaranteed to be atomic.
Arena accesses are supported. 8- and 16-bit load-acquires are
zero-extending (i.e., MOVZBQ, MOVZWQ).
Rename emit_atomic{,_index}() to emit_atomic_rmw{,_index}() to make it
clear that they only handle read-modify-write atomics, and extend their
@atomic_op parameter from u8 to u32, since we are starting to use more
than the lowest 8 bits of the 'imm' field.
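For intuition, a host-side C11 analogy (not the JIT code): on x86-64 an
acquire load and a release store already compile to plain MOVs, which is
why mapping BPF_LOAD_ACQ/BPF_STORE_REL to ordinary loads and stores is
sufficient.
  #include <stdatomic.h>
  #include <stdint.h>

  uint64_t load_acquire(const _Atomic uint64_t *p)
  {
          return atomic_load_explicit(p, memory_order_acquire); /* mov (%rdi),%rax */
  }

  void store_release(_Atomic uint64_t *p, uint64_t v)
  {
          atomic_store_explicit(p, v, memory_order_release);    /* mov %rsi,(%rdi) */
  }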
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/d22bb3c69f126af1d962b7314f3489eff606a3b7.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce BPF instructions with load-acquire and store-release
semantics, as discussed in [1]. Define 2 new flags:
#define BPF_LOAD_ACQ 0x100
#define BPF_STORE_REL 0x110
A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm'
field set to BPF_LOAD_ACQ (0x100).
Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with
the 'imm' field set to BPF_STORE_REL (0x110).
Unlike existing atomic read-modify-write operations that only support
BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and
store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an
exception, however, 64-bit load-acquires/store-releases are not
supported on 32-bit architectures (to fix a build error reported by the
kernel test robot).
An 8- or 16-bit load-acquire zero-extends the value before writing it to
a 32-bit register, just like ARM64 instruction LDARH and friends.
Similar to existing atomic read-modify-write operations, misaligned
load-acquires/store-releases are not allowed (even if
BPF_F_ANY_ALIGNMENT is set).
As an example, consider the following 64-bit load-acquire BPF
instruction (assuming little-endian):
db 10 00 00 00 01 00 00 r0 = load_acquire((u64 *)(r1 + 0x0))
opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
imm (0x00000100): BPF_LOAD_ACQ
Similarly, a 16-bit BPF store-release:
cb 21 00 00 10 01 00 00 store_release((u16 *)(r1 + 0x0), w2)
opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
imm (0x00000110): BPF_STORE_REL
In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have
bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new
instructions, until the corresponding JIT compiler supports them in
arena.
[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement the arch_bpf_timed_may_goto function using inline assembly to
have control over which registers are spilled, and use our special
protocol of using BPF_REG_AX as an argument into the function, and as
the return value when going back.
Emit call depth accounting for the call made from this stub, and ensure
we don't have naked returns (when rethunk mitigations are enabled) by
falling back to the RET macro (instead of retq). After popping all saved
registers, the return address into the BPF program should be on top of
the stack.
Since the JIT support is now enabled, ensure selftests which are
checking the produced may_goto sequences do not break by adjusting them.
Make sure we still test the old may_goto sequence on other
architectures, while testing the new sequence on x86_64.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250304003239.2390751-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Mark the src.virt.addr field in struct skcipher_walk as a pointer
to const data. This guarantees that the user won't modify the data
which should be done through dst.virt.addr to ensure that flushing
is done when necessary.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Rather than returning the address and storing the length into an
argument pointer, add an address field to the walk struct and use
that to store the address. The length is returned directly.
Change the done functions to use this stored address instead of
getting them from the caller.
Split the address into two using a union. The user should only
access the const version so that it is never changed.
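The shape of the arrangement, as a sketch with hypothetical field names:
  struct walk_buf {
          union {
                  void *addr;              /* updated only by the walk code */
                  const void *const_addr;  /* the view handed to users */
          };
          unsigned int len;
  };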
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that all the necessary code for TDX is in place, it's ready to run
TDX guests. Advertise the VM type of KVM_X86_TDX_VM so that userspace
VMMs like QEMU can start to use it.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
---
TDX "the rest" v2:
- No change.
TDX "the rest" v1:
- Move down to the end of patch series.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Always honor guest PAT in KVM-managed EPTs on TDX enabled guests by
making self-snoop feature a hard dependency for TDX and making quirk
KVM_X86_QUIRK_IGNORE_GUEST_PAT not a valid quirk once TDX is enabled.
The quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT only affects memory type of
KVM-managed EPTs. For the TDX-module-managed private EPT, memory type is
always forced to WB now.
Honoring guest PAT in KVM-managed EPTs ensures KVM does not invoke
kvm_zap_gfn_range() when attaching/detaching non-coherent DMA devices,
which would cause mirrored EPTs for TDs to be zapped, leading to the
TDX-module-managed private EPT being incorrectly zapped.
As a new feature, TDX always comes with support for self-snoop, and does
not have to worry about unmodifiable but buggy guests. So, simply ignore
KVM_X86_QUIRK_IGNORE_GUEST_PAT on TDX guests just like kvm-amd.ko already
does.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250224071039.31511-1-yan.y.zhao@intel.com>
[Only apply to TDX guests. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The IGNORE_GUEST_PAT quirk is inapplicable, and thus always-disabled,
if shadow_memtype_mask is zero. As long as vmx_get_mt_mask is not
called for the shadow paging case, there is no need to consult
shadow_memtype_mask and it can be removed altogether.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.
On AMD platforms, KVM always honors guest PAT. On Intel however there are
two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop
is not supported. Second, UC access on certain Intel platforms can be very
slow[1] and honoring guest PAT on those platforms may break some old
guests that accidentally specify video RAM as UC. Those old guests may
never expect the slowness since KVM always forced WB previously. See [2].
So, introduce a quirk that KVM can enable by default on all Intel platforms
to avoid breaking old unmodifiable guests. Newer userspace can disable this
quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail
if self-snoop is not supported, i.e. if KVM cannot obey the wish.
The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is
another example of a quirk that is sometimes automatically disabled.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1]
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2]
Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
no-self-snoop cases, as well as to remove the shadow_memtype_mask check
from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce supported_quirks in kvm_caps to store platform-specific force-enabled
quirks.
No functional changes intended.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250224070832.31394-1-yan.y.zhao@intel.com>
[Remove unsupported quirks at KVM_ENABLE_CAP time. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In some cases, the handling of quirks is split between platform-specific
code and generic code, or it is done entirely in generic code, but the
relevant bug does not trigger on some platforms; for example,
this will be the case for "ignore guest PAT". Allow unaffected vendor
modules to disable handling of a quirk for all VMs via a new entry in
kvm_caps.
Such quirks remain available in KVM_CAP_DISABLE_QUIRKS2, because that API
tells userspace that KVM *knows* that some of its past behavior was bogus
or just undesirable. In other words, it's plausible for userspace to
refuse to run if a quirk is not listed by KVM_CAP_DISABLE_QUIRKS2, so
preserve that and make it part of the API.
As an example, mark KVM_X86_QUIRK_CD_NW_CLEARED as auto-disabled on
Intel systems.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allowing arbitrary re-enabling of quirks puts a limit on what the
quirks themselves can do, since you cannot assume that the quirk
prevents a particular state. More importantly, it also prevents
KVM from disabling a quirk at VM creation time, because userspace
can always go back and re-enable that.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow TDX guests to access MTRR MSRs as KVM does for normal VMs, i.e.,
KVM emulates accesses to MTRR MSRs but doesn't virtualize guest MTRR
memory types.
TDX module exposes MTRR feature to TDX guests unconditionally. KVM needs
to support MTRR MSRs accesses for TDX guests to match the architectural
behavior.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-19-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because guest TD memory is protected, it isn't possible for the VMM to
patch the guest binary for the hypercall instruction. Add a method to
ignore hypercall patching. Note: the guest TD kernel needs to be modified
to use TDG.VP.VMCALL for hypercalls.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-18-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The vmx_set_mce() function is VMX-specific and cannot be used for TDX.
Add a vt stub to ignore setting up MCE for TDX.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-17-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip instruction emulation and let the TDX guest retry for MMIO emulation
after installing the MMIO SPTE with suppress #VE bit cleared.
TDX protects TDX guest state from VMM, instructions in guest memory cannot
be emulated. MMIO emulation is the only case that triggers the instruction
emulation code path for TDX guest.
The MMIO emulation handling flow is as follows:
- The TDX guest issues a vMMIO instruction. (The GPA must be shared and is
not covered by KVM memory slot.)
- The default SPTE entry for shared-EPT by KVM has suppress #VE bit set. So
EPT violation causes TD exit to KVM.
- Trigger KVM page fault handler and install a new SPTE with suppress #VE
bit cleared.
- Skip instruction emulation and return X86EMU_RETRY_INSTR to let the vCPU
retry.
- TDX guest re-executes the vMMIO instruction.
- TDX guest gets #VE because KVM has cleared #VE suppress bit.
- TDX guest #VE handler converts MMIO into TDG.VP.VMCALL<MMIO>
Return X86EMU_RETRY_INSTR in the callback check_emulate_instruction() for
TDX guests to retry the MMIO instruction. Also, the instruction emulation
handling will be skipped, so that the callback check_intercept() will never
be called for TDX guest.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX protects TDX guest state from the VMM. Implement access methods for
TDX guest state that ignore the accesses or return zero. Because those
methods can be called by KVM ioctls to set/get CPU registers, they don't
have KVM_BUG_ON().
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement TDG.VP.VMCALL<GetTdVmCallInfo> hypercall. If the input value is
zero, return success code and zero in output registers.
TDG.VP.VMCALL<GetTdVmCallInfo> hypercall is a subleaf of TDG.VP.VMCALL to
enumerate which TDG.VP.VMCALL sub leaves are supported. This hypercall is
for future enhancement of the Guest-Host-Communication Interface (GHCI)
specification. The GHCI version of 344426-001US defines it to require
input R12 to be zero and to return zero in output registers, R11, R12, R13,
and R14 so that guest TD enumerates no enhancement.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow TDX guest to configure LMCE (Local Machine Check Event) by handling
MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL.
MCE and MCA are advertised via cpuid based on the TDX module spec. Guest
kernel can access IA32_FEAT_CTL to check whether LMCE is opted-in by the
platform or not. If LMCE is opted-in by the platform, guest kernel can
access IA32_MCG_EXT_CTL to enable/disable LMCE.
Handle MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL for TDX guests to avoid
failure when a guest accesses them with TDG.VP.VMCALL<MSR> on #VE. E.g.,
a Linux guest will treat the failure as a #GP(0).
The userspace VMM may not opt in to LMCE by default, e.g., QEMU disables
it by default; "-cpu lmce=on" is needed on the QEMU command line to opt in.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and
wire up KVM backend functions.
For complete_emulated_msr() callback, instead of injecting #GP on error,
implement tdx_complete_emulated_msr() to set return code on error. Also
set return value on MSR read according to the values from kvm x86
registers.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add functions to implement MSR related callbacks, .set_msr(), .get_msr(),
and .has_emulated_msr(), for preparation of handling hypercalls from TDX
guest for PV RDMSR and WRMSR. Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX.
There are three classes of MSR virtualization for TDX.
- Non-configurable: TDX module directly virtualizes it. VMM can't configure
it, the value set by KVM_SET_MSRS is ignored.
- Configurable: TDX module directly virtualizes it. VMM can configure it at
VM creation time. The value set by KVM_SET_MSRS is used.
- #VE case: TDX guest would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>
and VMM handles the MSR hypercall. The value set by KVM_SET_MSRS is used.
For the MSRs belonging to the #VE case, the TDX module injects #VE to the
TDX guest upon RDMSR or WRMSR. The exact list of such MSRs is defined in
TDX Module ABI Spec.
Upon #VE, the TDX guest may call TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>,
which are defined in GHCI (Guest-Host Communication Interface) so that the
host VMM (e.g. KVM) can virtualize the MSRs.
TDX doesn't allow VMM to configure interception of MSR accesses. Ignore
KVM_REQ_MSR_FILTER_CHANGED for TDX guest. If the userspace has set any
MSR filters, it will be applied when handling
TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> in a later patch.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250227012021.1778144-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move KVM_MAX_MCE_BANKS to header file so that it can be used for TDX in
a future patch.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle TDX PV HLT hypercall and the interrupt status due to it.
TDX guest state is protected, so KVM can't get the interrupt status of a
TDX guest and assumes interrupts are always allowed, unless the TDX guest
calls TDVMCALL with HLT, which passes the interrupt-blocked flag.
If the guest halted with interrupt enabled, also query pending RVI by
checking bit 0 of TD_VCPU_STATE_DETAILS_NON_ARCH field via a seamcall.
Update vt_interrupt_allowed() for TDX based on interrupt blocked flag
passed by HLT TDVMCALL. Do not wakeup TD vCPU if interrupt is blocked
for VT-d PI.
For NMIs, KVM cannot determine the NMI blocking status for TDX guests,
so KVM always assumes NMIs are not blocked. In the unlikely scenario
where a guest invokes the PV HLT hypercall within an NMI handler, this
could result in a spurious wakeup. The guest should implement the PV
HLT hypercall within a loop if it truly requires no interruptions, since
NMI could be unblocked by an IRET due to an exception occurring before
the PV HLT is executed in the NMI handler.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle TDX PV CPUID hypercall for the CPUIDs virtualized by VMM
according to TDX Guest Host Communication Interface (GHCI).
For TDX, most CPUID leaf/sub-leaf combinations are virtualized by
the TDX module while some trigger #VE. On #VE, TDX guest can issue
TDG.VP.VMCALL<INSTRUCTION.CPUID> (same value as EXIT_REASON_CPUID)
to request VMM to emulate CPUID operation.
Morph PV CPUID hypercall to EXIT_REASON_CPUID and wire up to the KVM
backend function.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rewrite changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Kick off all vCPUs and prevent tdh_vp_enter() from executing whenever
tdh_mem_range_block()/tdh_mem_track()/tdh_mem_page_remove() encounters
contention, since the page removal path does not expect error and is less
sensitive to the performance penalty caused by kicking off vCPUs.
Although KVM has protected SEPT zap-related SEAMCALLs with kvm->mmu_lock,
KVM may still encounter TDX_OPERAND_BUSY due to the contention in the TDX
module.
- tdh_mem_track() may contend with tdh_vp_enter().
- tdh_mem_range_block()/tdh_mem_page_remove() may contend with
tdh_vp_enter() and TDCALLs.
Resources SHARED users EXCLUSIVE users
------------------------------------------------------------
TDCS epoch tdh_vp_enter tdh_mem_track
------------------------------------------------------------
SEPT tree tdh_mem_page_remove tdh_vp_enter (0-step mitigation)
tdh_mem_range_block
------------------------------------------------------------
SEPT entry tdh_mem_range_block (Host lock)
tdh_mem_page_remove (Host lock)
tdg_mem_page_accept (Guest lock)
tdg_mem_page_attr_rd (Guest lock)
tdg_mem_page_attr_wr (Guest lock)
Use a TDX specific per-VM flag wait_for_sept_zap along with
KVM_REQ_OUTSIDE_GUEST_MODE to kick off vCPUs and prevent them from entering
TD, thereby avoiding the potential contention. Apply the kick-off and no
vCPU entering only after each SEAMCALL busy error to minimize the window of
no TD entry, as the contention due to 0-step mitigation or TDCALLs is
expected to be rare.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250227012021.1778144-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Retry locally in the TDX EPT violation handler for private memory to reduce
the chances for tdh_mem_sept_add()/tdh_mem_page_aug() to contend with
tdh_vp_enter().
TDX EPT violation installs private pages via tdh_mem_sept_add() and
tdh_mem_page_aug(). The two may have contention with tdh_vp_enter() or
TDCALLs.
Resources SHARED users EXCLUSIVE users
------------------------------------------------------------
SEPT tree tdh_mem_sept_add tdh_vp_enter(0-step mitigation)
tdh_mem_page_aug
------------------------------------------------------------
SEPT entry tdh_mem_sept_add (Host lock)
tdh_mem_page_aug (Host lock)
tdg_mem_page_accept (Guest lock)
tdg_mem_page_attr_rd (Guest lock)
tdg_mem_page_attr_wr (Guest lock)
Though the contention between tdh_mem_sept_add()/tdh_mem_page_aug() and
TDCALLs may be removed in future TDX module, their contention with
tdh_vp_enter() due to 0-step mitigation still persists.
The TDX module may trigger 0-step mitigation in SEAMCALL TDH.VP.ENTER,
which works as follows:
0. Each TDH.VP.ENTER records the guest RIP on TD entry.
1. When the TDX module encounters a VM exit with reason EPT_VIOLATION, it
   checks if the guest RIP is the same as the last guest RIP on TD entry.
   - If yes, the EPT violation is caused by the same instruction that
     caused the last VM exit. The TDX module then increases the guest RIP
     no-progress count. When the count increases from 0 to the threshold
     (currently 6), the TDX module records the faulting GPA into a
     last_epf_gpa_list.
   - If no, the guest RIP has made progress, so the TDX module resets the
     RIP no-progress count and the last_epf_gpa_list.
2. On the next TDH.VP.ENTER, the TDX module (after saving the guest RIP on
   TD entry) checks if the last_epf_gpa_list is empty.
   - If yes, TD entry continues without acquiring the lock on the SEPT
     tree.
   - If no, it triggers the 0-step mitigation by acquiring the exclusive
     lock on the SEPT tree and walking the EPT tree to check if all page
     faults caused by the GPAs in the last_epf_gpa_list have been resolved
     before continuing TD entry.
Since KVM TDP MMU usually re-enters guest whenever it exits to userspace
(e.g. for KVM_EXIT_MEMORY_FAULT) or encounters a BUSY, it is possible for a
tdh_vp_enter() to be called more than the threshold count before a page
fault is addressed, triggering contention when tdh_vp_enter() attempts to
acquire exclusive lock on SEPT tree.
Retry locally in TDX EPT violation handler to reduce the count of invoking
tdh_vp_enter(), hence reducing the possibility of its contention with
tdh_mem_sept_add()/tdh_mem_page_aug(). However, the 0-step mitigation and
the contention are still not eliminated due to KVM_EXIT_MEMORY_FAULT,
signals/interrupts, and cases when one instruction faults more GFNs than
the threshold count.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250227012021.1778144-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Detect SEPT violations that occur when an SEPT entry is in PENDING state
while the TD is configured not to receive #VE on SEPT violations.
A TD guest can be configured not to receive #VE by setting SEPT_VE_DISABLE
to 1 in tdh_mng_init() or modifying pending_ve_disable to 1 in TDCS when
flexible_pending_ve is permitted. In such cases, the TDX module will not
inject #VE into the TD upon encountering an EPT violation caused by an SEPT
entry in the PENDING state. Instead, TDX module will exit to VMM and set
extended exit qualification type to PENDING_EPT_VIOLATION and exit
qualification bit 6:3 to 0.
Since #VE will not be injected into such TDs, they cannot be notified to
accept a GPA. A TD accessing a private GPA before accepting it is
regarded as an error within the guest.
Detect such guest error by inspecting the (extended) exit qualification
bits and make such VM dead.
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For TDX, on EPT violation, call common __vmx_handle_ept_violation() to
trigger x86 MMU code; on EPT misconfiguration, bug the VM since it
shouldn't happen.
EPT violation due to instruction fetch should never be triggered from
shared memory in TDX guest. If such EPT violation occurs, treat it as
broken hardware.
EPT misconfiguration shouldn't happen on either shared or secure EPT for
TDX guests.
- TDX module guarantees no EPT misconfiguration on secure EPT. Per TDX
module v1.5 spec section 9.4 "Secure EPT Induced TD Exits":
"By design, since secure EPT is fully controlled by the TDX module, an
EPT misconfiguration on a private GPA indicates a TDX module bug and is
handled as a fatal error."
- For shared EPT, the MMIO caching optimization, which is the only case
where current KVM configures EPT entries to generate EPT
misconfiguration, is implemented in a different way for TDX guests. KVM
configures EPT entries to non-present value without suppressing #VE bit.
It causes #VE in the TDX guest and the guest will call TDG.VP.VMCALL to
request MMIO emulation.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
[binbin: rework changelog]
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle VM exit caused by "other SMI" for TDX, by returning back to
userspace for Machine Check System Management Interrupt (MSMI) case or
ignoring it and resume vCPU for non-MSMI case.
For VMX, SMM transition can happen in both VMX non-root mode and VMX
root mode. Unlike VMX, in SEAM root mode (TDX module), all interrupts
are blocked. If an SMI occurs in SEAM non-root mode (TD guest), the SMI
causes VM exit to TDX module, then SEAMRET to KVM. Once it exits to KVM,
SMI is delivered and handled by kernel handler right away.
An SMI can be "I/O SMI" or "other SMI". For TDX, there will be no I/O SMI
because I/O instructions inside TDX guest trigger #VE and TDX guest needs
to use TDVMCALL to request VMM to do I/O emulation.
For "other SMI", there are two cases:
- MSMI case. When BIOS eMCA MCE-SMI morphing is enabled, a #MC that occurs
in a TDX guest will be delivered as an MSMI. It causes an
EXIT_REASON_OTHER_SMI VM exit with MSMI (bit 0) set in the exit
qualification. On VM exit, TDX module checks whether the "other SMI" is
caused by an MSMI or not. If so, TDX module marks TD as fatal,
preventing further TD entries, and then completes the TD exit flow to KVM
with the TDH.VP.ENTER outputs indicating TDX_NON_RECOVERABLE_TD. After
TD exit, the MSMI is delivered and eventually handled by the kernel
machine check handler (7911f145de x86/mce: Implement recovery for
errors in TDX/SEAM non-root mode), i.e., the memory page is marked as
poisoned and it won't be freed to the free list when the TDX guest is
terminated. Since the TDX guest is dead, follow other non-recoverable
cases, exit to userspace.
- For non-MSMI case, KVM doesn't need to do anything, just continue TDX
vCPU execution.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-17-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle EXCEPTION_NMI and EXTERNAL_INTERRUPT exits for TDX.
NMI Handling: Just like the VMX case, NMI remains blocked after exiting
from TDX guest for NMI-induced exits [*]. Handle NMI-induced exits for
TDX guests in the same way as they are handled for VMX guests, i.e.,
handle NMI in tdx_vcpu_enter_exit() by calling the vmx_handle_nmi()
helper.
Interrupt and Exception Handling: Similar to the VMX case, external
interrupts and exceptions (machine check is the only exception type
KVM handles for TDX guests) are handled in the .handle_exit_irqoff()
callback.
For other exceptions, because TDX guest state is protected, exceptions in
TDX guests can't be intercepted. The TDX VMM isn't supposed to handle
these exceptions. If an unexpected exception occurs, exit to userspace with
KVM_EXIT_EXCEPTION.
For external interrupt, increase the statistics, same as the VMX case.
[*]: Some old TDX modules have a bug which makes NMI unblocked after
exiting from TDX guest for NMI-induced exits. This could potentially
lead to nested NMIs: a new NMI arrives when KVM is manually calling the
host NMI handler. This is an architectural violation, but it doesn't
have real harm until FRED is enabled together with TDX (for non-FRED,
the host NMI handler can handle nested NMIs). Given this is rare to
happen and has no real harm, ignore this for the initial TDX support.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-16-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a helper to handle NMI exits.
TDX handles NMI exits the same way as the VMX case. Add a helper to share
the code with TDX and expose the helper in common.h.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-15-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move emulation_required from struct vcpu_vmx to struct vcpu_vt so that
vmx_handle_exit_irqoff() can be reused by TDX code.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-14-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX protects TDX guest APIC state from the VMM. Implement the access
methods for TDX guest vAPIC state to ignore the accesses or return zero.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-13-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Force APICv active for TDX guests in KVM because APICv is always enabled
by TDX module.
From the view of KVM, whether APICv state is active or not is decided by:
1. APIC is hw enabled
2. VM and vCPU have no inhibit reasons set.
After TDX vCPU init, APIC is set to x2APIC mode. KVM_SET_{SREGS,SREGS2} are
rejected due to has_protected_state for TDs and guest_state_protected
for TDX vCPUs are set. Reject KVM_{GET,SET}_LAPIC from userspace since
migration is not supported yet, so that userspace cannot disable APIC.
For various APICv inhibit reasons:
- APICV_INHIBIT_REASON_DISABLED is impossible after checking enable_apicv
in tdx_bringup(). If !enable_apicv, TDX support will be disabled.
- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED is impossible since x2APIC is
mandatory, KVM emulates APIC_ID as read-only for x2APIC mode. (Note:
APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED could be set if the memory
allocation fails for KVM apic_map.)
- APICV_INHIBIT_REASON_HYPERV is impossible since TDX doesn't support
HyperV guest yet.
- APICV_INHIBIT_REASON_ABSENT is impossible since in-kernel LAPIC is
checked in tdx_vcpu_create().
- APICV_INHIBIT_REASON_BLOCKIRQ is impossible since TDX doesn't support
KVM_SET_GUEST_DEBUG.
- APICV_INHIBIT_REASON_APIC_ID_MODIFIED is impossible since x2APIC is
mandatory.
- APICV_INHIBIT_REASON_APIC_BASE_MODIFIED is impossible since KVM rejects
userspace to set APIC base.
- The remaining inhibit reasons are relevant only to AMD's AVIC, including
APICV_INHIBIT_REASON_NESTED, APICV_INHIBIT_REASON_IRQWIN,
APICV_INHIBIT_REASON_PIT_REINJ, APICV_INHIBIT_REASON_SEV, and
APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED.
(For APICV_INHIBIT_REASON_PIT_REINJ, similar to AVIC, KVM can't intercept
EOI for TDX guests either, but KVM enforces KVM_IRQCHIP_SPLIT for TDX
guests, which eliminates the in-kernel PIT.)
Implement vt_refresh_apicv_exec_ctrl() to call KVM_BUG_ON() if APICv is
disabled for TDX guests.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enforce KVM_IRQCHIP_SPLIT for TDX guests to disallow the in-kernel I/O APIC
while the in-kernel local APIC is still needed.
APICv is always enabled by the TDX module, and the TDX module doesn't allow
the hypervisor to modify the EOI-bitmap, i.e. all EOIs are accelerated and
never trigger exits. Level-triggered interrupts and other things that depend
on EOI VM-Exits can't be faithfully emulated in KVM. Also, the lazy check
of pending APIC EOI for RTC edge-triggered interrupts, which was introduced
as a workaround when EOI cannot be intercepted, doesn't work for TDX either
because kvm_apic_pending_eoi() checks vIRR and vISR, but both values are
invisible to KVM.
If the guest induces generation of a level-triggered interrupt, the VMM is
left with the choice of dropping the interrupt, sending it as-is, or
converting it to an edge-triggered interrupt. Ditto for KVM. All of those
options will make the guest unhappy. There's no architectural behavior KVM
can provide that's better than sending the interrupt and hoping for the
best.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-11-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Always block INIT and SIPI events for the TDX guest because the TDX module
doesn't provide an API for the VMM to inject an INIT IPI or SIPI.
TDX defines its own vCPU creation and initialization sequence consisting of
multiple SEAMCALLs. Also, it's only allowed during TD build time.
Given that the TDX guest is para-virtualized to boot BSP/APs, normally there
shouldn't be any INIT/SIPI event for a TDX guest. If one does arrive, there
are three options to handle it:
1. Always block INIT/SIPI requests.
2. (Silently) ignore INIT/SIPI requests during delivery.
3. Return an error to the guest TD somehow.
Choose option 1 for simplicity. Since INIT and SIPI are always blocked,
INIT handling and the OP vcpu_deliver_sipi_vector() won't be called, so
there is no need to add a new interface or helper function for INIT/SIPI
delivery.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle SMI requests the same way KVM does for CONFIG_KVM_SMM=n, i.e. return
-ENOTTY, and add KVM_BUG_ON() to SMI-related OPs for TDs.
TDX doesn't support system-management mode (SMM) or system-management
interrupts (SMI) in guest TDs. Because guest state (vCPU state, memory
state) is protected, any change to guest state must go through the TDX
module APIs. However, the TDX module doesn't provide a way for the VMM to
inject an SMI into a guest TD or to switch the guest vCPU mode into SMM.
MSR_IA32_SMBASE will not be emulated for TDX guests; -ENOTTY will be
returned when an SMI is requested.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Inject NMI to TDX guest by setting the PEND_NMI TDVPS field to 1, i.e. make
the NMI pending in the TDX module. If there is a further pending NMI in
KVM, collapse it to the one pending in the TDX module.
VMM can request the TDX module to inject a NMI into a TDX vCPU by setting
the PEND_NMI TDVPS field to 1. Following that, VMM can call TDH.VP.ENTER
to run the vCPU and the TDX module will attempt to inject the NMI as soon
as possible.
KVM has the following 3 cases in which it injects two NMIs when handling
simultaneous NMIs, and they need to be injected back to back. Otherwise, the
OS kernel may fire a warning about an unknown NMI [1]:
K1. One NMI is being handled in the guest and one NMI pending in KVM.
KVM requests NMI window exit to inject the pending NMI.
K2. Two NMIs are pending in KVM.
KVM injects the first NMI and requests NMI window exit to inject the
second NMI.
K3. A previous NMI needs to be rejected and one NMI pending in KVM.
KVM first requests force immediate exit followed by a VM entry to
complete the NMI rejection. Then, during the force immediate exit, KVM
requests NMI window exit to inject the pending NMI.
For TDX, the PEND_NMI TDVPS field is a 1-bit field, i.e. KVM can only pend
one NMI in the TDX module. Also, because the vCPU state is protected, KVM
doesn't know the NMI blocking state of a TDX vCPU and has to assume NMIs are
always unmasked and allowed. When KVM sees PEND_NMI is 1 after a TD exit, it
means the previous NMI needs to be re-injected.
Based on KVM's NMI handling flow, there are the following 6 cases:
        In NMI handler    TDX module    KVM
    T1. No                PEND_NMI=0    1 pending NMI
    T2. No                PEND_NMI=0    2 pending NMIs
    T3. No                PEND_NMI=1    1 pending NMI
    T4. Yes               PEND_NMI=0    1 pending NMI
    T5. Yes               PEND_NMI=0    2 pending NMIs
    T6. Yes               PEND_NMI=1    1 pending NMI
K1 is mapped to T4.
K2 is mapped to T2 or T5.
K3 is mapped to T3 or T6.
Note: KVM doesn't know whether an NMI is blocked by another NMI or not, so
cases T5 and T6 can happen.
When handling a pending NMI in KVM for a TDX guest, all KVM can do is make
an NMI pending in the TDX module when PEND_NMI is 0; T1 and T4 can be
handled this way. However, TDX doesn't allow KVM to request an NMI window
exit directly, so if PEND_NMI is already set and there is still a pending
NMI in KVM, the only thing KVM could try is to request a forced immediate
exit. But for cases T5 and T6, a forced immediate exit results in an
infinite loop because it makes no progress in the NMI handler, so the
pending NMI in the TDX module can never be injected.
Consider that on x86 bare metal, multiple NMIs can collapse into one NMI,
e.g. when an NMI is blocked by an SMI. It's the OS's responsibility to poll
all NMI sources in the NMI handler to avoid missing any NMI events.
Based on that, of the above 3 cases (K1-K3), only case K1 must inject the
second NMI: the guest NMI handler may have already polled some of the NMI
sources, which could include the source of the pending NMI, so the pending
NMI must be injected to avoid losing an NMI. For cases K2 and K3, the guest
OS will poll all NMI sources (including the sources caused by the second NMI
and any further collapsed NMIs) upon delivery of the first NMI, so KVM
doesn't need to inject the second NMI.
To handle the NMI injection properly for TDX, there are two options:
- Option 1: Modify the KVM's NMI handling common code, to collapse the
second pending NMI for K2 and K3.
- Option 2: Do it in TDX specific way. When the previous NMI is still
pending in the TDX module, i.e. it has not been delivered to TDX guest
yet, collapse the pending NMI in KVM into the previous one.
This patch goes with option 2 because it is simple and doesn't impact other
VM types. Option 1 may need more discussion.
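A minimal sketch of the option-2 collapsing described above, using
illustrative types and names (td_vcpu, pend_nmi and nmi_pending_in_kvm are
placeholders, not the actual KVM/TDX symbols):

  #include <stdbool.h>

  /* Sketch only: collapse a newly requested NMI into the one already
   * pending in the TDX module. */
  struct td_vcpu {
          bool nmi_pending_in_kvm;   /* KVM's software-pending NMI       */
          bool pend_nmi;             /* mirrors the 1-bit PEND_NMI field */
  };

  static void td_inject_nmi(struct td_vcpu *v)
  {
          if (v->pend_nmi) {
                  /*
                   * The previous NMI has not been delivered yet.  The
                   * guest NMI handler will poll all NMI sources anyway,
                   * so silently collapse the new request into it.
                   */
                  v->nmi_pending_in_kvm = false;
                  return;
          }
          /* Otherwise make the NMI pending in the TDX module. */
          v->pend_nmi = true;
          v->nmi_pending_in_kvm = false;
  }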
This is the first time vCPU-scope metadata in the "management" class needs
to be accessed. Make the needed accessors available.
[1] https://lore.kernel.org/all/1317409584-23662-5-git-send-email-dzickus@redhat.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call kvm_wait_lapic_expire() when POSTED_INTR_ON is set and the vector
for LVTT is set in PIR before TD entry.
KVM always assumes a timer IRQ was injected if APIC state is protected.
For TDX guests, the APIC state is protected and KVM injects the timer IRQ
via posted interrupt. To avoid unnecessary wait calls, only call
kvm_wait_lapic_expire() when a timer IRQ was injected, i.e., POSTED_INTR_ON
is set and the vector for LVTT is set in PIR.
Add a helper to test PIR.
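For illustration only, a self-contained sketch of such a PIR test; the
structure layout and helper names are assumptions of this sketch, not the
actual posted-interrupt descriptor definitions:

  #include <stdbool.h>
  #include <stdint.h>

  /* Sketch: 256-bit Posted Interrupt Request bitmap as 4 x 64-bit words. */
  struct pi_desc_sketch {
          uint64_t pir[4];
          uint64_t control;   /* bit 0 = ON (notification outstanding) */
  };

  static bool pi_test_on(const struct pi_desc_sketch *pid)
  {
          return pid->control & 1;
  }

  static bool pi_test_pir(const struct pi_desc_sketch *pid, uint8_t vector)
  {
          return (pid->pir[vector / 64] >> (vector % 64)) & 1;
  }

  /* Only bother waiting for the deadline if an LVTT IRQ was posted. */
  static bool timer_irq_posted(const struct pi_desc_sketch *pid,
                               uint8_t lvtt_vector)
  {
          return pi_test_on(pid) && pi_test_pir(pid, lvtt_vector);
  }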
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path. The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.
Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement non-NMI interrupt injection for TDX via posted interrupt.
As CPU state is protected and APICv is enabled for the TDX guest, TDX
supports non-NMI interrupt injection only by posted interrupt. Posted
interrupt descriptors (PIDs) are allocated in shared memory, KVM can
update them directly. If target vCPU is in non-root mode, send posted
interrupt notification to the vCPU and hardware will sync PIR to vIRR
atomically. Otherwise, kick it to pick up the interrupt from PID. To
post pending interrupts in the PID, KVM can generate a self-IPI with
notification vector prior to TD entry.
Since the guest status of TD vCPU is protected, assume interrupt is
always allowed. Ignore the code path for event injection mechanism or
LAPIC emulation for TDX.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014757.897978-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move posted interrupt delivery code to common header so that TDX can
leverage it.
No functional change intended.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014757.897978-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable PI wakeup for IPI virtualization (IPIv) case for TDX.
When a vCPU is being scheduled out, notification vector is switched and
pi_wakeup_handler() is enabled when the vCPU has interrupt enabled and
posted interrupt is used to wake up the vCPU.
For VMX, a blocked vCPU can be the target of posted interrupts when using
IPIv or VT-d PI. TDX doesn't support IPIv, disable PI wakeup for IPIv.
Also, since the guest status of a TD vCPU is protected, assume interrupts
are always enabled for the TD. (The PV HLT hypercall, through which the TDX
guest tells the VMM whether HLT is called with interrupts disabled or not,
is not supported yet.)
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add flag and hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ. For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM. As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM. To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector. And
to determine if TDX vCPU has a pending interrupt, KVM must check if there
is an outstanding notification.
Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is
pending, KVM can't do anything based on _which_ IRQ is pending.
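Roughly, with illustrative names (vcpu_sketch and the hook pointer are
placeholders, not KVM's actual structures), the flag and hook amount to:

  #include <stdbool.h>

  struct vcpu_sketch {
          bool apic_state_protected;                  /* new flag        */
          bool (*protected_apic_has_interrupt)(void); /* new vendor hook */
  };

  /* Generic query: which vector is pending?  Unknowable if protected. */
  static int apic_has_interrupt_sketch(struct vcpu_sketch *v)
  {
          if (v->apic_state_protected)
                  return -1;   /* "no interrupt": vIRR is not readable */
          /* ... normal path reads the highest unmasked bit in vIRR ... */
          return 0;
  }

  /* Runnable/blocking check: only "is something pending?" is answerable. */
  static bool vcpu_has_pending_irq_sketch(struct vcpu_sketch *v)
  {
          if (v->apic_state_protected)
                  return v->protected_apic_has_interrupt();
          return apic_has_interrupt_sketch(v) >= 0;
  }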
Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks. A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.
For TD exits not due to an HLT TDCALL, skip checking for a pending RVI in
tdx_protected_apic_has_interrupt(). Except for the guest being stupid
(e.g., non-HLT TDCALL in an interrupt shadow), it's not even possible to
have an interrupt in RVI that is fully unmasked. There are no CPU flows
that modify RVI in the middle of instruction execution. I.e. if RVI is
non-zero, then either the interrupt has been pending since before the TD
exit, or the instruction that caused the TD exit is in an STI/SS shadow.
KVM doesn't care about STI/SS shadows outside of the HALTED case. And if
the interrupt was pending before the TD exit, then it _must_ be blocked,
otherwise the interrupt would have been serviced at the instruction
boundary.
For the HLT TDCALL case, it will be handled in a future patch when HLT
TDCALL is supported.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle TDX PV MMIO hypercall when TDX guest calls TDVMCALL with the
leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION) according
to TDX Guest Host Communication Interface (GHCI) spec.
For TDX guests, the VMM is not allowed to access vCPU registers or private
memory, and the code instructions must be fetched from private memory.
So the MMIO emulation implemented for non-TDX VMs is not possible for TDX
guests.
In TDX the MMIO regions are instead configured by the VMM to trigger a #VE
exception in the guest. The #VE handler is supposed to emulate the MMIO
instruction inside the guest and convert it into a TDVMCALL with the
leaf #VE.RequestMMIO, which equals EXIT_REASON_EPT_VIOLATION.
The requested MMIO address must be in shared GPA space. The shared bit
is stripped after check because the existing code for MMIO emulation is
not aware of the shared bit.
The MMIO GPA shouldn't have a valid memslot, and the attribute of the GPA
should be shared. KVM could do the checks before exiting to userspace;
however, even if KVM does the checks, there will still be race conditions
between the check in KVM and the emulation of the MMIO access in userspace,
due to a memslot hotplug or a memory attribute conversion. If userspace
doesn't check the attribute of the GPA and the attribute happens to be
private, it will not pose a security risk or cause an MCE, but it can lead
to another issue. E.g., in QEMU, treating a GPA with the private attribute
as shared when it falls within RAM's range can result in extra memory
consumption during emulation of the access to the HVA of the GPA.
There are two options: 1) Do the checks both in KVM and userspace. 2) Do
the checks only in userspace (e.g., QEMU). This patch chooses option 2,
i.e. KVM omits the memslot and attribute checks, and expects userspace to
do the checks.
Similar to normal MMIO emulation, try to handle the MMIO in kernel first,
if kernel can't support it, forward the request to userspace. Export
needed symbols used for MMIO handling.
Fragment handling is not needed for TDX PV MMIO because the GPA is
provided; if an MMIO access crosses a page boundary, it is contiguous in
GPA. Also, the size is limited to 1, 2, 4, or 8 bytes, so no further split
is needed. Allow cross-page access because no extra handling is needed
after checking that both the start and end GPA are shared GPAs.
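A hedged sketch of the GPA validation and shared-bit stripping described
above; the function and parameter names are illustrative, not the actual
KVM code:

  #include <stdbool.h>
  #include <stdint.h>

  /* Sketch: validate and canonicalize an MMIO GPA from #VE.RequestMMIO.
   * 'shared_bit' is the VM's shared-GPA bit mask, whose position is
   * GPAW-dependent. */
  static bool tdx_mmio_gpa_ok(uint64_t *gpa, uint32_t size,
                              uint64_t shared_bit)
  {
          uint64_t start, end;

          if (size != 1 && size != 2 && size != 4 && size != 8)
                  return false;

          start = *gpa;
          end = start + size - 1;
          /* Both ends must be shared GPAs (cross-page access is fine). */
          if (!(start & shared_bit) || !(end & shared_bit))
                  return false;

          /* Strip the shared bit: MMIO emulation code is unaware of it. */
          *gpa = start & ~shared_bit;
          return true;
  }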
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014225.897298-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Emulate port I/O requested by TDX guest via TDVMCALL with leaf
Instruction.IO (same value as EXIT_REASON_IO_INSTRUCTION) according to
TDX Guest Host Communication Interface (GHCI).
All port I/O instructions inside the TDX guest trigger the #VE exception.
On #VE triggered by I/O instructions, TDX guest can call TDVMCALL with
leaf Instruction.IO to request VMM to emulate I/O instructions.
Similar to normal port I/O emulation, try to handle the port I/O in kernel
first, if kernel can't support it, forward the request to userspace.
Note string I/O operations are not supported in TDX. Guest should unroll
them before calling the TDVMCALL.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014225.897298-9-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert TDG.VP.VMCALL<ReportFatalError> to KVM_EXIT_SYSTEM_EVENT with
a new type KVM_SYSTEM_EVENT_TDX_FATAL and forward it to userspace for
handling.
A TD guest can use TDG.VP.VMCALL<ReportFatalError> to report a fatal
error it has experienced. This hypercall is special because the TD guest
is requesting termination along with the error information, so KVM needs
to forward the hypercall to userspace anyway. KVM doesn't do any parsing
or conversion; it just dumps the 16 general-purpose registers to userspace
and lets userspace decide what to do.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert TDG.VP.VMCALL<MapGPA> to KVM_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE and forward it to userspace for handling.
MapGPA is used by TDX guest to request to map a GPA range as private
or shared memory. It needs to exit to userspace for handling. KVM has
already implemented a similar hypercall KVM_HC_MAP_GPA_RANGE, which will
exit to userspace with exit reason KVM_EXIT_HYPERCALL. Do sanity checks,
convert TDVMCALL_MAP_GPA to KVM_HC_MAP_GPA_RANGE and forward the request
to userspace.
To prevent a TDG.VP.VMCALL<MapGPA> call from taking too long, the MapGPA
range is split into 2MB chunks and pending interrupts are checked between
chunks. This allows for timely injection of interrupts and prevents issues
with guest lockup detection. The TDX guest should retry the operation for
the GPA starting at the address specified in R11 when the TDVMCALL returns
TDVMCALL_RETRY as the status code.
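A minimal sketch of the chunked conversion loop, assuming hypothetical
convert/irq_pending callbacks in place of the real KVM internals:

  #include <stdbool.h>
  #include <stdint.h>

  #define CHUNK (2ULL * 1024 * 1024)            /* 2MB per iteration */

  /* Sketch: returns the first not-yet-converted GPA so the guest can
   * retry with that address in R11 when it sees TDVMCALL_RETRY. */
  static uint64_t map_gpa_chunked(uint64_t gpa, uint64_t size,
                                  bool (*convert)(uint64_t gpa, uint64_t len),
                                  bool (*irq_pending)(void))
  {
          uint64_t end = gpa + size;

          while (gpa < end) {
                  uint64_t len = (end - gpa) < CHUNK ? (end - gpa) : CHUNK;

                  if (!convert(gpa, len)) /* as KVM_HC_MAP_GPA_RANGE exit */
                          break;
                  gpa += len;
                  if (gpa < end && irq_pending())
                          break;          /* let the guest retry from gpa */
          }
          return gpa;
  }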
Note userspace needs to enable KVM_CAP_EXIT_HYPERCALL with
KVM_HC_MAP_GPA_RANGE bit set for TD VM.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-7-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle KVM hypercall for TDX according to TDX Guest-Host Communication
Interface (GHCI) specification.
The TDX GHCI specification defines the ABI for the guest TD to issue
hypercalls. When R10 is non-zero, it indicates the TDG.VP.VMCALL is
vendor-specific. KVM uses R10 as KVM hypercall number and R11-R14
as 4 arguments, while the error code is returned in R10.
Morph the TDG.VP.VMCALL with KVM hypercall to EXIT_REASON_VMCALL and
marshall r10~r14 from vp_enter_args to the appropriate x86 registers for
KVM hypercall handling.
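A small sketch of the register marshalling described above; the array
layout and names are illustrative, not the actual vcpu/vp_enter_args
definitions:

  #include <stdint.h>

  enum { R10, R11, R12, R13, R14, NR_REGS };

  struct tdvmcall_sketch {
          uint64_t vp_enter_args[NR_REGS]; /* guest TDVMCALL arguments   */
          uint64_t gprs[NR_REGS];          /* what hypercall code reads  */
  };

  /* R10 = hypercall number, R11..R14 = up to four arguments. */
  static void marshal_kvm_hypercall(struct tdvmcall_sketch *t)
  {
          for (int i = R10; i <= R14; i++)
                  t->gprs[i] = t->vp_enter_args[i];
  }

  /* The error/return code goes back to the guest in R10. */
  static void complete_kvm_hypercall(struct tdvmcall_sketch *t, uint64_t ret)
  {
          t->vp_enter_args[R10] = ret;
  }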
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-6-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a placeholder and related helper functions in preparation for
TDG.VP.VMCALL handling.
The TDX module specification defines TDG.VP.VMCALL API (TDVMCALL for short)
for the guest TD to call hypercall to VMM. When the guest TD issues a
TDVMCALL, the guest TD exits to VMM with a new exit reason. The arguments
from the guest TD and returned values from the VMM are passed in the guest
registers. The guest RCX register indicates which registers are used.
Define helper functions to access those registers.
A new VMX exit reason TDCALL is added to indicate the exit is due to a
TDVMCALL from the guest TD. Define the TDCALL exit reason and add a
placeholder to handle such exits.
Some leaves of TDCALL will be morphed to another VMX exit reason instead of
EXIT_REASON_TDCALL; add a helper tdcall_to_vmx_exit_reason() as a
placeholder to do the conversion.
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014225.897298-5-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce the wiring for handling TDX VM exits by implementing the
callbacks .get_exit_info(), .get_entry_info(), and .handle_exit().
Additionally, add error handling during the TDX VM exit flow, and add a
placeholder to handle various exit reasons.
Store VMX exit reason and exit qualification in struct vcpu_vt for TDX,
so that TDX/VMX can use the same helpers to get exit reason and exit
qualification. Store extended exit qualification and exit GPA info in
struct vcpu_tdx because they are used by TDX code only.
Contention Handling: The TDH.VP.ENTER operation may contend with TDH.MEM.*
operations due to secure EPT or TD EPOCH. If the contention occurs,
the return value will have TDX_OPERAND_BUSY set, prompting the vCPU to
attempt re-entry into the guest with EXIT_FASTPATH_EXIT_HANDLED,
not EXIT_FASTPATH_REENTER_GUEST, so that the interrupts pending during
IN_GUEST_MODE can be delivered for sure. Otherwise, the requester of
KVM_REQ_OUTSIDE_GUEST_MODE may be blocked endlessly.
Error Handling:
- TDX_SW_ERROR: This includes #UD caused by SEAMCALL instruction if the
CPU isn't in VMX operation, #GP caused by SEAMCALL instruction when TDX
isn't enabled by the BIOS, and TDX_SEAMCALL_VMFAILINVALID when SEAM
firmware is not loaded or disabled.
- TDX_ERROR: This indicates some check failed in the TDX module, preventing
the vCPU from running.
- Failed VM Entry: Exit to userspace with KVM_EXIT_FAIL_ENTRY. Handle it
separately before handling TDX_NON_RECOVERABLE because when off-TD debug
is not enabled, TDX_NON_RECOVERABLE is set.
- TDX_NON_RECOVERABLE: Set by the TDX module when the error is
non-recoverable, indicating that the TDX guest is dead or the vCPU is
disabled.
A special case is triple fault, which also sets TDX_NON_RECOVERABLE but
exits to userspace with KVM_EXIT_SHUTDOWN, aligning with the VMX case.
- Any unhandled VM exit reason will also return to userspace with
KVM_EXIT_INTERNAL_ERROR.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Message-ID: <20250222014225.897298-4-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move pv_unhalted check out of kvm_vcpu_has_events(), check pv_unhalted
explicitly when handling PV unhalt and expose kvm_vcpu_has_events().
kvm_vcpu_has_events() returns true if pv_unhalted is set, and pv_unhalted
is only cleared on transitions to KVM_MP_STATE_RUNNABLE. If the guest
initiates a spurious wakeup, pv_unhalted could be left set in perpetuity.
Currently, this is not problematic because kvm_vcpu_has_events() is only
called when handling PV unhalt. However, if kvm_vcpu_has_events() is used
for other purposes in the future, it could return unexpected results.
Export kvm_vcpu_has_events() for its usage in broader contexts.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
in via the macro.
When emulating KVM hypercalls via TDVMCALL, TDX will marshall registers of
TDVMCALL ABI into KVM's x86 registers to match the definition of KVM
hypercall ABI _before_ ____kvm_emulate_hypercall() gets called. Therefore,
____kvm_emulate_hypercall() can just read registers internally based on KVM
hypercall ABI, and those registers can be removed from the
__kvm_emulate_hypercall() macro.
Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(),
remove it from the __kvm_emulate_hypercall() macro as well.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20250222014225.897298-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest
DRs.
TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter,
and resets DRs to architectural INIT state on TD exit. Use the new
flag KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to
save/restore guest DRs. KVM still needs to restore host DRs after TD
exit if there are active breakpoints in the host, which is covered by
the existing code.
MOV-DR exiting is always cleared for TDX guests, so the handler for DR
access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add
a warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH
are set.
Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
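For illustration, a self-contained sketch of what the BIT()-based flags and
the mutual-exclusion warning could look like; only the flag names come from
the changelog, the bit positions are assumptions of this sketch:

  #include <stdio.h>

  #define BIT(n)                   (1UL << (n))

  #define KVM_DEBUGREG_BP_ENABLED  BIT(0)
  #define KVM_DEBUGREG_WONT_EXIT   BIT(1)
  #define KVM_DEBUGREG_AUTO_SWITCH BIT(2) /* new: TDX module switches DRs */

  int main(void)
  {
          unsigned long switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;

          /* The two flags are mutually exclusive for TDX guests. */
          if ((switch_db_regs & KVM_DEBUGREG_WONT_EXIT) &&
              (switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))
                  fprintf(stderr, "WARN: WONT_EXIT and AUTO_SWITCH set\n");

          /* Skip guest DR save/restore when the TDX module handles it. */
          if (!(switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH)) {
                  /* ... save/restore guest DR0-DR3, DR6 here ... */
          }
          return 0;
  }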
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Save the IA32_DEBUGCTL MSR before entering a TDX VCPU and restore it
afterwards. The TDX Module preserves bits 1, 12, and 14, so if no
other bits are set, no restore is done.
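A minimal sketch of that save/restore policy, assuming hypothetical
rd/wr/enter_td callbacks instead of the real MSR accessors:

  #include <stdint.h>

  /* Bits of IA32_DEBUGCTL preserved by the TDX Module per the changelog. */
  #define DEBUGCTL_PRESERVED ((1ULL << 1) | (1ULL << 12) | (1ULL << 14))

  /* Only restore if some non-preserved bit was set before entry. */
  static void td_debugctl_save_restore(uint64_t (*rd)(void),
                                       void (*wr)(uint64_t),
                                       void (*enter_td)(void))
  {
          uint64_t host_debugctl = rd();

          enter_td();

          if (host_debugctl & ~DEBUGCTL_PRESERVED)
                  wr(host_debugctl);
  }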
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-12-adrian.hunter@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Support for restoring IA32_TSX_CTRL MSR and IA32_UMWAIT_CONTROL MSR is not
yet implemented, so disable support for TSX and WAITPKG for now. Clear the
associated CPUID bits returned by KVM_TDX_CAPABILITIES, and return an error
if those bits are set in KVM_TDX_INIT_VM.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-11-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Several MSRs that are not used by Linux while in ring 0 are clobbered on
TD exit. Ensure the cached values of these MSRs are updated on vcpu_put,
and the MSRs themselves are restored before returning to ring 3.
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-10-adrian.hunter@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Several MSRs are constant and only used in userspace (ring 3), but VMs may
have different values. KVM uses kvm_set_user_return_msr() to switch to the
guest's values and leverages the user return notifier to restore them when
the kernel is about to return to userspace. To eliminate unnecessary wrmsr
instructions, KVM also caches the value it last wrote to an MSR.
The TDX module unconditionally resets some of these MSRs to architectural
INIT state on TD exit. This makes the cached values in kvm_user_return_msrs
inconsistent with the values in hardware. This inconsistency needs to be
fixed; otherwise, it may mislead kvm_on_user_return() into skipping the
restoration of some MSRs to the host's values. kvm_set_user_return_msr()
can correct this case, but it is not optimal as it always does a wrmsr.
So, introduce a variation of kvm_set_user_return_msr() that updates the
cached values and skips that wrmsr.
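A hedged sketch contrasting the existing update-and-write path with the new
cache-only variant; the slot layout and function names are illustrative,
not KVM's actual implementation:

  #include <stdbool.h>
  #include <stdint.h>

  struct user_return_msr_slot {
          uint64_t host_value;  /* value to restore on return to ring 3  */
          uint64_t curr_value;  /* last value known to be in the MSR     */
          bool     dirty;       /* needs restoring by the notifier       */
  };

  /* Existing behaviour: update the cache *and* write the MSR. */
  static void set_user_return_msr(struct user_return_msr_slot *s,
                                  uint64_t val, void (*wrmsr)(uint64_t))
  {
          if (s->curr_value != val) {
                  wrmsr(val);
                  s->curr_value = val;
                  s->dirty = (val != s->host_value);
          }
  }

  /* New variant: the TDX module already reset the MSR to a known value,
   * so only refresh the cache and mark it dirty; skip the redundant wrmsr. */
  static void update_user_return_msr_cache(struct user_return_msr_slot *s,
                                           uint64_t hw_value_after_exit)
  {
          s->curr_value = hw_value_after_exit;
          s->dirty = (hw_value_after_exit != s->host_value);
  }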
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-9-adrian.hunter@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On exiting from the guest TD, xsave state is clobbered; restore it.
Do not use kvm_load_host_xsave_state(), as it relies on vcpu->arch
to find out whether other KVM_RUN code has loaded guest state into
XCR0/PKRU/XSS or not. In the case of TDX, the exit values are known
independent of the guest CR0 and CR4, and in fact the latter are not
available.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-8-adrian.hunter@intel.com>
[Rewrite to not use kvm_load_host_xsave_state. - Paolo]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On entering/exiting a TDX vCPU, the preserved or clobbered CPU state differs
from the VMX case. Add TDX hooks to save/restore host/guest CPU state.
Save/restore kernel GS base MSR.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-7-adrian.hunter@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement callbacks to enter/exit a TDX VCPU by calling tdh_vp_enter().
Ensure the TDX VCPU is in a correct state to run.
Do not pass arguments from/to vcpu->arch.regs[] unconditionally. Instead,
marshall state to/from the appropriate x86 registers only when needed,
i.e., to handle some TDVMCALL sub-leaves following KVM's ABI to leverage
the existing code.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-6-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move common fields of struct vcpu_vmx and struct vcpu_tdx to struct
vcpu_vt, to share the code between VMX/TDX as much as possible and to make
TDX exit handling more VMX like.
No functional change intended.
[Adrian: move code that depends on struct vcpu_vmx back to vmx.h]
Suggested-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/Z1suNzg2Or743a7e@google.com
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250129095902.16391-5-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel TDX protects guest VMs from a malicious host and certain physical
attacks. TDX introduces a new operation mode, Secure Arbitration Mode
(SEAM), to isolate and protect guest VMs. A TDX guest VM runs in SEAM and,
unlike VMX, direct control and interaction with the guest by the host VMM
is not possible. Instead, Intel TDX Module, which also runs in SEAM,
provides a SEAMCALL API.
The SEAMCALL that provides the ability to enter a guest is TDH.VP.ENTER.
The TDX Module processes TDH.VP.ENTER, and enters the guest via VMX
VMLAUNCH/VMRESUME instructions. When a guest VM-exit requires host VMM
interaction, the TDH.VP.ENTER SEAMCALL returns to the host VMM (KVM).
Add tdh_vp_enter() to wrap the SEAMCALL invocation of TDH.VP.ENTER;
tdh_vp_enter() needs to be noinstr because VM entry in KVM is noinstr
as well, which is for two reasons:
* marking the area as CT_STATE_GUEST via guest_state_enter_irqoff() and
guest_state_exit_irqoff()
* IRET must be avoided between VM-exit and NMI handling, in order to
avoid prematurely releasing the NMI inhibit.
TDH.VP.ENTER is different from other SEAMCALLs in several ways: it
uses more arguments, and after it returns some host state may need to be
restored. Therefore tdh_vp_enter() uses __seamcall_saved_ret() instead of
__seamcall_ret(); since it is the only caller of __seamcall_saved_ret(),
it can be made noinstr also.
TDH.VP.ENTER arguments are passed through General Purpose Registers (GPRs).
For the special case of the TD guest invoking TDG.VP.VMCALL, nearly any GPR
can be used, as well as XMM0 to XMM15. Notably, RBP is not used, and Linux
mandates the TDX Module feature NO_RBP_MOD, which is enforced elsewhere.
Additionally, XMM registers are not required for the existing Guest
Hypervisor Communication Interface and are handled by existing KVM code
should they be modified by the guest.
There are 2 input formats and 5 output formats for TDH.VP.ENTER arguments.
Input #1 : Initial entry or following a previous async. TD Exit
Input #2 : Following a previous TDCALL(TDG.VP.VMCALL)
Output #1 : On Error (No TD Entry)
Output #2 : Async. Exits with a VMX Architectural Exit Reason
Output #3 : Async. Exits with a non-VMX TD Exit Status
Output #4 : Async. Exits with Cross-TD Exit Details
Output #5 : On TDCALL(TDG.VP.VMCALL)
Currently, to keep things simple, the wrapper function does not attempt
to support different formats, and just passes all the GPRs that could be
used. The GPR values are held by KVM in the area set aside for guest
GPRs. KVM code uses the guest GPR area (vcpu->arch.regs[]) to set up for
or process results of tdh_vp_enter().
Therefore changing tdh_vp_enter() to use more complex argument formats
would also alter the way KVM code interacts with tdh_vp_enter().
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20241121201448.36170-2-adrian.hunter@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the handling of SEPT zap errors caused by unsuccessful execution of
tdh_mem_page_add() in KVM_TDX_INIT_MEM_REGION from
tdx_sept_drop_private_spte() to tdx_sept_zap_private_spte(). Introduce a
new helper function tdx_is_sept_zap_err_due_to_premap() to detect this
specific error.
During the IOCTL KVM_TDX_INIT_MEM_REGION, KVM premaps leaf SPTEs in the
mirror page table before the corresponding entry in the private page table
is successfully installed by tdh_mem_page_add(). If an error occurs during
the invocation of tdh_mem_page_add(), a mismatch between the mirror and
private page tables results in SEAMCALLs for SEPT zap returning the error
code TDX_EPT_ENTRY_STATE_INCORRECT.
The error TDX_EPT_WALK_FAILED is not possible because, during
KVM_TDX_INIT_MEM_REGION, KVM only premaps leaf SPTEs after successfully
mapping non-leaf SPTEs. Unlike leaf SPTEs, there is no mismatch in non-leaf
PTEs between the mirror and private page tables. Therefore, during zap,
SEAMCALLs should find an empty leaf entry in the private EPT, leading to
the error TDX_EPT_ENTRY_STATE_INCORRECT instead of TDX_EPT_WALK_FAILED.
Since tdh_mem_range_block() is always invoked before tdh_mem_page_remove(),
move the handling of SEPT zap errors from tdx_sept_drop_private_spte() to
tdx_sept_zap_private_spte(). In tdx_sept_zap_private_spte(), return 0 for
errors due to premap to skip executing other SEAMCALLs for zap, which are
unnecessary. Return 1 to indicate no other errors, allowing the execution
of other zap SEAMCALLs to continue.
The failure of tdh_mem_page_add() is uncommon and has not been observed in
real workloads. Currently, this failure is only hypothetically triggered by
skipping the real SEAMCALL and faking the add error in the SEAMCALL
wrapper. Additionally, without this fix, there will be no host crashes or
other severe issues.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250217085642.19696-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wrap vmx_update_cpu_dirty_logging so as to ignore requests to update
CPU dirty logging for TDs, as basic TDX does not support the PML feature.
Invoking vmx_update_cpu_dirty_logging() for TDs would cause an incorrect
access to a kvm_vmx struct for a TDX VM, so block that before it happens.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make cpu_dirty_log_size (CPU's dirty log buffer size) a per-VM value and
set the per-VM cpu_dirty_log_size only for normal VMs when PML is enabled.
Do not set it for TDs.
Until now, cpu_dirty_log_size was a system-wide value that was used for
all VMs and was set to the PML buffer size when PML was enabled in VMX.
However, PML is not currently supported for TDs, though PML remains
available for normal VMs as long as the feature is supported by hardware
and enabled in VMX.
Making cpu_dirty_log_size a per-VM value allows it to be the PML buffer
size for normal VMs and 0 for TDs. This allows functions like
kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to
determine if PML is supported, in order to kick off vCPUs or request them
to update CPU dirty logging status (turn on/off PML in VMCS).
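A minimal sketch of the per-VM value and how callers key off it; the struct
and field names are illustrative, and the 512-entry PML size is an
assumption of this sketch:

  #include <stdbool.h>

  #define PML_LOG_NR_ENTRIES 512         /* assumed VMX PML buffer size */

  struct kvm_sketch {
          bool is_td;
          bool pml_supported;
          int  cpu_dirty_log_size;       /* 0 => CPU dirty logging unused */
  };

  static void init_cpu_dirty_log_size(struct kvm_sketch *kvm)
  {
          kvm->cpu_dirty_log_size =
                  (!kvm->is_td && kvm->pml_supported) ? PML_LOG_NR_ENTRIES
                                                      : 0;
  }

  /* Callers like kvm_mmu_update_cpu_dirty_logging() key off this. */
  static bool cpu_dirty_logging_enabled(const struct kvm_sketch *kvm)
  {
          return kvm->cpu_dirty_log_size != 0;
  }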
This fixes an issue first reported in [1], where QEMU attaches an
emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES
still works if the corresponding memslot does not have the
KVM_MEM_GUEST_MEMFD flag set.
KVM then invokes kvm_mmu_update_cpu_dirty_logging() and from there
vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx
struct for a TDX VM.
Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com>
Reported-by: Pedro Principeza <pedro.principeza@canonical.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Closes: https://github.com/canonical/tdx/issues/202
Link: https://github.com/canonical/tdx/issues/202 [1]
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a parameter "kvm" to kvm_mmu_page_ad_need_write_protect() and its
caller tdp_mmu_need_write_protect().
This is a preparation to make cpu_dirty_log_size a per-VM value rather than
a system-wide value.
No functional changes expected.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a parameter "kvm" to kvm_cpu_dirty_log_size() and down to its callers:
kvm_dirty_ring_get_rsvd_entries(), kvm_dirty_ring_alloc().
This is a preparation to make cpu_dirty_log_size a per-VM value rather than
a system-wide value.
No functional changes expected.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle vCPUs dissociations by invoking SEAMCALL TDH.VP.FLUSH which flushes
the address translation caches and cached TD VMCS of a TD vCPU in its
associated pCPU.
In TDX, a vCPU can only be associated with one pCPU at a time, which is
done by invoking SEAMCALL TDH.VP.ENTER. For a successful association, the
vCPU must be dissociated from its previously associated pCPU.
To facilitate vCPU dissociation, introduce a per-pCPU list
associated_tdvcpus. Add a vCPU into this list when it's loaded into a new
pCPU (i.e. when a vCPU is loaded for the first time or migrated to a new
pCPU).
vCPU dissociations can happen under the following conditions:
- When the op hardware_disable is called.
This op is called when virtualization is disabled on a given pCPU, e.g.
when hot-unplugging a pCPU or on machine shutdown/suspend.
In this case, dissociate all vCPUs from the pCPU by iterating its
per-pCPU list associated_tdvcpus.
- On vCPU migration to a new pCPU.
Before adding a vCPU into the associated_tdvcpus list of the new pCPU,
dissociation from its old pCPU is required, which is performed by issuing
an IPI and executing SEAMCALL TDH.VP.FLUSH on the old pCPU.
On a successful dissociation, the vCPU will be removed from the
associated_tdvcpus list of its previously associated pCPU.
- When tdx_mmu_release_hkid() is called.
TDX mandates that all vCPUs must be dissociated prior to the release of
an hkid. Therefore, dissociation of all vCPUs is a must before executing
the SEAMCALL TDH.MNG.VPFLUSHDONE and subsequently freeing the hkid.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073858.22312-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new VM-scoped KVM_MEMORY_ENCRYPT_OP IOCTL subcommand,
KVM_TDX_FINALIZE_VM, to perform TD Measurement Finalization.
Documentation for the API is added in another patch:
"Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)"
For the purpose of attestation, a measurement must be made of the TDX VM
initial state. This is referred to as TD Measurement Finalization, and
uses SEAMCALL TDH.MR.FINALIZE, after which:
1. The VMM adding TD private pages with arbitrary content is no longer
allowed
2. The TDX VM is runnable
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240904030751.117579-21-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new ioctl for the user space VMM to initialize guest memory with the
specified memory contents.
Because TDX protects the guest's memory, the creation of the initial guest
memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
directly copying the memory contents into the guest's memory in the case of
the default VM type.
Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
KVM_MEMORY_ENCRYPT_OP. Check if the GFN is already pre-allocated, assign
the guest page in Secure-EPT, copy the initial memory contents into the
guest memory, and encrypt the guest memory. Optionally, extend the memory
measurement of the TDX guest.
The ioctl uses the vCPU file descriptor because of the TDX module's
requirement that the memory is added to the S-EPT (via TDH.MEM.SEPT.ADD)
prior to initialization (TDH.MEM.PAGE.ADD). Accessing the MMU in turn
requires a vCPU file descriptor, just like for KVM_PRE_FAULT_MEMORY. In
fact, the post-populate callback is able to reuse the same logic used by
KVM_PRE_FAULT_MEMORY, so that userspace can do everything with a single
ioctl.
Note that this is the only way to invoke TDH.MEM.SEPT.ADD before the TD
is finalized, as userspace cannot use KVM_PRE_FAULT_MEMORY at that
point. This ensures that there cannot be pages in the S-EPT awaiting
TDH.MEM.PAGE.ADD, which would be treated incorrectly as spurious by
tdp_mmu_map_handle_target_level() (KVM would see the SPTE as PRESENT,
but the corresponding S-EPT entry will be !PRESENT).
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
- KVM_BUG_ON() for kvm_tdx->nr_premapped (Paolo)
- Use tdx_operand_busy()
- Merge first patch in SEPT SEAMCALL retry series in to this base
(Paolo)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In future changes, CoCo-specific code will need to call kvm_tdp_map_page()
from within its respective gmem_post_populate() callbacks. Export it
so this can be done from vendor-specific code. Since kvm_mmu_reload()
will be needed for this operation, export its callee kvm_mmu_load() as
well.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073827.22270-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bail out of the loop in kvm_tdp_map_page() when a VM is dead. Otherwise,
kvm_tdp_map_page() may get stuck in the kernel loop when there's only one
vCPU in the VM (or if the other vCPUs are not executing ioctls), even if
fatal errors have occurred.
kvm_tdp_map_page() is called by the ioctl KVM_PRE_FAULT_MEMORY or the TDX
ioctl KVM_TDX_INIT_MEM_REGION. It loops in the kernel whenever RET_PF_RETRY
is returned. In the TDP MMU, kvm_tdp_mmu_map() always returns RET_PF_RETRY,
regardless of the specific error code from tdp_mmu_set_spte_atomic(),
tdp_mmu_link_sp(), or tdp_mmu_split_huge_page(). While this is acceptable
in general cases where the only possible error code from these functions is
-EBUSY, TDX introduces an additional error code, -EIO, due to SEAMCALL
errors.
Since this -EIO error is also a fatal error, check for VM dead in the
kvm_tdp_map_page() to avoid unnecessary retries until a signal is pending.
The error -EIO is uncommon and has not been observed in real workloads.
Currently, it is only hypothetically triggered by bypassing the real
SEAMCALL and faking an error in the SEAMCALL wrapper.
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250220102728.24546-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement hook private_max_mapping_level for TDX to let TDP MMU core get
max mapping level of private pages.
The value is hardcoded to 4K since there is no huge page support for now.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073816.22256-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement hooks in TDX to propagate changes of mirror page table to private
EPT, including changes for page table page adding/removing, guest page
adding/removing.
TDX invokes corresponding SEAMCALLs in the hooks.
- Hook link_external_spt
propagates adding page table page into private EPT.
- Hook set_external_spte
tdx_sept_set_private_spte() in this patch only handles adding of guest
private page when TD is finalized.
Later patches will handle the case of adding guest private pages before
TD finalization.
- Hook free_external_spt
It is invoked when page table page is removed in mirror page table, which
currently must occur at TD tear down phase, after hkid is freed.
- Hook remove_external_spte
It is invoked when guest private page is removed in mirror page table,
which can occur when TD is active, e.g. during shared <-> private
conversion and slot move/deletion.
This hook is ensured to be triggered before hkid is freed, because
gmem fd is released along with all private leaf mappings zapped before
freeing hkid at VM destroy.
TDX invokes below SEAMCALLs sequentially:
1) TDH.MEM.RANGE.BLOCK (remove RWX bits from a private EPT entry),
2) TDH.MEM.TRACK (increases TD epoch)
3) TDH.MEM.PAGE.REMOVE (remove the private EPT entry and untrack the
guest page).
TDH.MEM.PAGE.REMOVE can't succeed without TDH.MEM.RANGE.BLOCK and
TDH.MEM.TRACK being called successfully.
SEAMCALL TDH.MEM.TRACK is called in function tdx_track() to enforce that
TLB tracking will be performed by TDX module for private EPT.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
- Remove TDX_ERROR_SEPT_BUSY and Add tdx_operand_busy() helper (Binbin)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle TLB tracking for TDX by introducing function tdx_track() for private
memory TLB tracking and implementing flush_tlb* hooks to flush TLBs for
shared memory.
Introduce function tdx_track() to do TLB tracking on private memory, which
basically does two things: calling TDH.MEM.TRACK to increase TD epoch and
kicking off all vCPUs. The private EPT will then be flushed when each vCPU
re-enters the TD. This function is unused temporarily in this patch and
will be called on a page-by-page basis on removal of private guest page in
a later patch.
In earlier revisions, tdx_track() relied on an atomic counter to coordinate
the synchronization between the actions of kicking off vCPUs, incrementing
the TD epoch, and the vCPUs waiting for the incremented TD epoch after
being kicked off.
However, the core MMU only actually needs to call tdx_track() while
already holding mmu_lock for write, so this synchronization can be
avoided. vCPUs are kicked off only after the successful execution of
TDH.MEM.TRACK, eliminating the need for vCPUs to wait for TDH.MEM.TRACK
completion after being kicked off. tdx_track() is therefore able to send
KVM_REQ_OUTSIDE_GUEST_MODE requests rather than KVM_REQ_TLB_FLUSH.
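A rough sketch of that ordering, with the SEAMCALL and the kick represented
by hypothetical callbacks rather than the real KVM/TDX functions:

  #include <stdbool.h>

  /* Sketch of the tracking sequence described above. */
  static int tdx_track_sketch(bool (*tdh_mem_track)(void),
                              void (*kick_vcpus_outside_guest_mode)(void))
  {
          /*
           * Caller already holds mmu_lock for write, so no extra counter
           * is needed to serialize against concurrent trackers.
           */
          if (!tdh_mem_track())   /* TDH.MEM.TRACK: bump the TD epoch */
                  return -1;

          /*
           * Kick vCPUs only after TDH.MEM.TRACK succeeded; re-entering
           * the TD with the new epoch flushes private EPT translations,
           * so no explicit TLB-flush request is required.
           */
          kick_vcpus_outside_guest_mode();
          return 0;
  }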
Hooks for flush_remote_tlb and flush_remote_tlbs_range are not necessary
for TDX, as tdx_track() will handle TLB tracking of private memory on
page-by-page basis when private guest pages are removed. There is no need
to invoke tdx_track() again in kvm_flush_remote_tlbs() even after changes
to the mirrored page table.
For the hooks flush_tlb_current and flush_tlb_all, which are invoked during
kvm_mmu_load() and vCPU load for normal VMs, let the VMM flush all EPTs in
the two hooks for simplicity, since TDX does not depend on the two
hooks to notify the TDX module to flush the private EPT in those cases.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073753.22228-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set per-VM shadow_mmio_value to 0 for TDX.
With enable_mmio_caching on, KVM installs MMIO SPTEs for TDs. To correctly
configure MMIO SPTEs, TDX requires the per-VM shadow_mmio_value to be set
to 0. This is necessary to override the default value of the suppress VE
bit in the SPTE, which is 1, and to ensure the RWX bits are 0.
For MMIO SPTE, the spte value changes as follows:
1. initial value (suppress VE bit is set)
2. Guest issues MMIO and triggers EPT violation
3. KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
4. Guest MMIO resumes. It triggers VE exception in guest TD
5. Guest VE handler issues TDG.VP.VMCALL<MMIO>
6. KVM handles MMIO
7. Guest VE handler resumes its execution after MMIO instruction
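For illustration, a tiny self-contained check of why shadow_mmio_value == 0
yields the required SPTE shape; the bit positions are illustrative and the
generation field is a made-up example, not the actual SPTE encoding:

  #include <assert.h>
  #include <stdint.h>

  #define SPTE_SUPPRESS_VE  (1ULL << 63)   /* illustrative bit position */
  #define SPTE_RWX_MASK     0x7ULL

  int main(void)
  {
          uint64_t shadow_mmio_value = 0;           /* per-VM value for TDs */
          uint64_t mmio_generation   = 0x3ULL << 3; /* illustrative field   */
          uint64_t spte = shadow_mmio_value | mmio_generation;

          /* RWX clear and suppress-VE clear => the access raises #VE in
           * the guest instead of being allowed or silently dropped. */
          assert(!(spte & SPTE_RWX_MASK));
          assert(!(spte & SPTE_SUPPRESS_VE));
          return 0;
  }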
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073743.22214-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Future changes will want to set shadow_mmio_value from TDX code. Add a
setter helper with a name that makes more sense from that context.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[split into new patch]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073730.22200-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable TDX support when TDP MMU or mmio caching or EPT A/D bits aren't
supported.
As the TDP MMU has become mainstream compared to the legacy MMU, legacy MMU
support for TDX isn't implemented.
TDX requires KVM MMIO caching. Without MMIO caching, KVM would fall back to
MMIO emulation without installing SPTEs for MMIOs. However, the TDX guest is
protected and KVM would hit errors when trying to emulate MMIOs for the TDX
guest during instruction decoding. So, the TDX guest relies on SPTEs being
installed for MMIOs, with no RWX bits and with the suppress-VE bit unset, to
inject #VE into the TDX guest. The TDX guest then issues a TDVMCALL in the
#VE handler to perform instruction decoding and have the host do the MMIO
emulation.
TDX also relies on EPT A/D bits, which have been supported by all CPUs since
Haswell. Relying on them avoids RWX bits being masked out in the mirror page
table for prefaulted entries.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
Requested by Sean at [1].
[1] https://lore.kernel.org/kvm/Zva4aORxE9ljlMNe@google.com/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make the direct root handle memslot GFNs at an alias with the TDX shared
bit set.
For TDX shared memory, the memslot GFNs need to be mapped at an alias with
the shared bit set. These shared mappings will be mapped on the KVM MMU's
"direct" root. The direct root has it's mappings shifted by applying
"gfn_direct_bits" as a mask. The concept of "GPAW" (guest physical address
width) determines the location of the shared bit. So set gfn_direct_bits
based on this, to map shared memory at the proper GPA.
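A hedged sketch of deriving gfn_direct_bits from GPAW; the bit positions
follow the common TDX convention (bit 51 for GPAW 52, bit 47 otherwise) but
should be treated as assumptions of this sketch:

  #include <stdint.h>

  static uint64_t td_gfn_direct_bits(int gpaw)
  {
          /* The shared-GPA bit is the top GPA bit for the given GPAW. */
          uint64_t shared_gpa_bit = (gpaw == 52) ? (1ULL << 51)
                                                 : (1ULL << 47);

          /* KVM tracks GFNs, i.e. GPA >> 12, so shift the mask too. */
          return shared_gpa_bit >> 12;
  }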
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073613.22100-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX uses two EPT pointers, one for the private half of the GPA space and
one for the shared half. The private half uses the normal EPT_POINTER vmcs
field, which is managed in a special way by the TDX module. For TDX, KVM is
not allowed to operate on it directly. The shared half uses a new
SHARED_EPT_POINTER field and will be managed by the conventional MMU
management operations that operate directly on the EPT root. This means for
TDX the .load_mmu_pgd() operation will need to know to use the
SHARED_EPT_POINTER field instead of the normal one. Add a new wrapper in
x86 ops for load_mmu_pgd() that either directs the write to the existing
vmx implementation or a TDX one.
tdx_load_mmu_pgd() is so much simpler than vmx_load_mmu_pgd() since for the
TDX mode of operation, EPT will always be used and KVM does not need to be
involved in virtualization of CR3 behavior. So tdx_load_mmu_pgd() can
simply write to SHARED_EPT_POINTER.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073601.22084-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX defines SEAMCALL APIs to access TDX control structures corresponding to
the VMX VMCS. Introduce helper accessors to hide its SEAMCALL ABI details.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073551.22070-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Teach the EPT violation helper to check the shared mask of a GPA to find
out whether the GPA is for private memory.
When EPT violation is triggered after TD accessing a private GPA, KVM will
exit to user space if the corresponding GFN's attribute is not private.
User space will then update GFN's attribute during its memory conversion
process. After that, TD will re-access the private GPA and trigger EPT
violation again. Only when the GFN's attribute matches private will KVM
fault in the private page, map it in the mirrored TDP root, and propagate
changes to the private EPT to resolve the EPT violation.
Relying on GFN's attribute tracking xarray to determine if a GFN is
private, as for KVM_X86_SW_PROTECTED_VM, may lead to endless EPT
violations.
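The check itself is small; a sketch, assuming a kvm_gfn_direct_bits()-style
helper exposes the shared-bit mask:

static inline bool tdx_is_private_gpa(struct kvm *kvm, gpa_t gpa)
{
        /* A GPA is private when the shared bit is clear. */
        return !(gpa_to_gfn(gpa) & kvm_gfn_direct_bits(kvm));
}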
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073539.22056-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
What differs for a TDX EPT violation is how the information, i.e. the GPA
and the exit qualification, is retrieved. To share the code that handles
EPT violations, split out the guts of the EPT violation handler so that the
VMX/TDX exit handlers can call it after retrieving the GPA and exit
qualification.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20241112073528.22042-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fail kvm_page_track_write_tracking_enabled() if the VM type is TDX to make
the external page track user fail in kvm_page_track_register_notifier(),
since TDX does not support write protection and hence page tracking.
No need to fail KVM internal users of page track (i.e. for shadow page),
because TDX is always with EPT enabled and currently TDX module does not
emulate and send VMLAUNCH/VMRESUME VMExits to VMM.
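A sketch of the shape of the check (the exact placement of the vm_type test
is an assumption):

bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
{
        /* TDX cannot write-protect guest pages, so external tracking fails. */
        if (kvm->arch.vm_type == KVM_X86_TDX_VM)
                return false;

        return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
               !tdp_enabled || kvm_shadow_root_allocated(kvm);
}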
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Yuan Yao <yuan.yao@linux.intel.com>
Message-ID: <20241112073515.22028-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Export a function to walk down the TDP without modifying it and simply
check if a GPA is mapped.
Future changes will support pre-populating TDX private memory. In order to
implement this KVM will need to check if a given GFN is already
pre-populated in the mirrored EPT. [1]
There is already a TDP MMU walker, kvm_tdp_mmu_get_walk() for use within
the KVM MMU that almost does what is required. However, to make sense of
the results, MMU internal PTE helpers are needed. Refactor the code to
provide a helper that can be used outside of the KVM MMU code.
Refactoring the KVM page fault handler to support this lookup usage was
also considered, but it was an awkward fit.
kvm_tdp_mmu_gpa_is_mapped() is based on a diff by Paolo Bonzini.
Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1]
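A hedged sketch of how a pre-population path might use the exported helper
(the caller name is hypothetical):

static int tdx_prefault_check(struct kvm_vcpu *vcpu, gfn_t gfn)
{
        gpa_t gpa = gfn_to_gpa(gfn);

        /* Skip GFNs that are already mapped in the mirror EPT. */
        if (kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa))
                return -EEXIST;

        return 0;
}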
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241112073457.22011-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Update attr_filter field to zap both private and shared mappings for TDX
when memslot is deleted.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073426.21997-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The TDX module measures the TD during the build process and saves the
measurement in TDCS.MRTD to facilitate TD attestation of the initial
contents of the TD. Wrap the SEAMCALL TDH.MR.EXTEND with tdh_mr_extend()
and TDH.MR.FINALIZE with tdh_mr_finalize() to enable the host kernel to
assist the TDX module in performing the measurement.
The measurement in TDCS.MRTD is a SHA-384 digest of the build process.
SEAMCALLs TDH.MNG.INIT and TDH.MEM.PAGE.ADD initialize and contribute to
the MRTD digest calculation.
The caller of tdh_mr_extend() should break the TD private page into chunks
of size TDX_EXTENDMR_CHUNKSIZE and invoke tdh_mr_extend() to add the page
content into the digest calculation. Failures are possible with
TDH.MR.EXTEND (e.g., due to SEPT walking). The caller of tdh_mr_extend()
can check the function return value and retrieve extended error information
from the function output parameters.
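A sketch of the expected chunking loop; the wrapper signature and error
mapping shown here are assumptions, with extended error handling elided:

static int tdx_extend_page(struct kvm_tdx *kvm_tdx, gpa_t gpa)
{
        u64 err, ext_err1, ext_err2;
        int i;

        for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
                err = tdh_mr_extend(&kvm_tdx->td, gpa + i, &ext_err1, &ext_err2);
                if (err)
                        return -EIO;    /* extended info is in ext_err1/ext_err2 */
        }
        return 0;
}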
Calling tdh_mr_finalize() completes the measurement. The TDX module then
turns the TD into the runnable state. Further TDH.MEM.PAGE.ADD and
TDH.MR.EXTEND calls will fail.
TDH.MR.FINALIZE may fail due to errors such as the TD having no vCPUs or
contention. Check the function return value when calling tdh_mr_finalize() to
determine the exact reason for failure. Take proper locks on the caller's
side to avoid contention failures, or handle the BUSY error in specific
ways (e.g., retry). Return the SEAMCALL error code directly to the caller.
Do not attempt to handle it in the core kernel.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073709.22171-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX architecture introduces the concept of private GPA vs shared GPA,
depending on the GPA.SHARED bit. The TDX module maintains a single Secure
EPT (S-EPT or SEPT) tree per TD to translate TD's private memory accessed
using a private GPA. Wrap the SEAMCALL TDH.MEM.PAGE.REMOVE with
tdh_mem_page_remove() and TDH_PHYMEM_PAGE_WBINVD with
tdh_phymem_page_wbinvd_hkid() to unmap a TD private page from the SEPT,
remove the TD private page from the TDX module and flush cache lines to
memory after removal of the private page.
Callers should specify "GPA" and "level" when calling tdh_mem_page_remove()
to indicate to the TDX module which TD private page to unmap and remove.
TDH.MEM.PAGE.REMOVE may fail, and the caller of tdh_mem_page_remove() can
check the function return value and retrieve extended error information
from the function output parameters. Follow the TLB tracking protocol
before calling tdh_mem_page_remove() to remove a TD private page to avoid
SEAMCALL failure.
After removing a TD's private page, the TDX module does not write back and
invalidate cache lines associated with the page and the page's keyID (i.e.,
the TD's guest keyID). Therefore, provide tdh_phymem_page_wbinvd_hkid() to
allow the caller to pass in the TD's guest keyID and invoke
TDH_PHYMEM_PAGE_WBINVD to perform this action.
Before reusing the page, the host kernel needs to map the page with keyID 0
and invoke movdir64b() to convert the TD private page to a normal shared
page.
TDH.MEM.PAGE.REMOVE and TDH_PHYMEM_PAGE_WBINVD may meet contentions inside
the TDX module for TDX's internal resources. To avoid staying in SEAM mode
for too long, TDX module will return a BUSY error code to the kernel
instead of spinning on the locks. The caller may need to handle this error
in specific ways (e.g., retry). The wrappers return the SEAMCALL error code
directly to the caller. Don't attempt to handle it in the core kernel.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073658.22157-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The TDX module defines a TLB tracking protocol to make sure that no logical
processor holds any stale Secure EPT (S-EPT or SEPT) TLB translations for a
given TD private GPA range. After a successful TDH.MEM.RANGE.BLOCK,
TDH.MEM.TRACK, and kicking all vCPUs out of the TD, the TDX module ensures
that the
subsequent TDH.VP.ENTER on each vCPU will flush all stale TLB entries for
the specified GPA ranges in TDH.MEM.RANGE.BLOCK. Wrap the
TDH.MEM.RANGE.BLOCK with tdh_mem_range_block() and TDH.MEM.TRACK with
tdh_mem_track() to enable the kernel to assist the TDX module in TLB
tracking management.
The caller of tdh_mem_range_block() needs to specify "GPA" and "level" to
request the TDX module to block the subsequent creation of TLB translation
for a GPA range. This GPA range can correspond to a SEPT page or a TD
private page at any level.
Contentions and errors are possible with the SEAMCALL TDH.MEM.RANGE.BLOCK.
Therefore, the caller of tdh_mem_range_block() needs to check the function
return value and retrieve extended error info from the function output
params.
Upon TDH.MEM.RANGE.BLOCK success, no new TLB entries will be created for
the specified private GPA range, though the existing TLB translations may
still persist. TDH.MEM.TRACK will then advance the TD's epoch counter to
ensure TDX module will flush TLBs in all vCPUs once the vCPUs re-enter
the TD. TDH.MEM.TRACK will fail to advance TD's epoch counter if there
are vCPUs still running in non-root mode at the previous TD epoch counter.
So to ensure private GPA translations are flushed, callers must first call
tdh_mem_range_block(), then tdh_mem_track(), and lastly send IPIs to kick
all the vCPUs and force them to re-enter, thus triggering the TLB flush.
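A sketch of that ordering as KVM might implement it (wrapper signatures and
the request used for the kick are assumptions):

static int tdx_track_range(struct kvm *kvm, struct kvm_tdx *kvm_tdx,
                           gpa_t gpa, int level)
{
        u64 err, entry, lvl;

        err = tdh_mem_range_block(&kvm_tdx->td, gpa, level, &entry, &lvl);
        if (err)
                return -EIO;

        err = tdh_mem_track(&kvm_tdx->td);
        if (err)
                return -EIO;

        /* Kick every vCPU out of the TD so stale S-EPT TLB entries get flushed. */
        kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
        return 0;
}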
Don't export a single operation and instead export functions that just
expose the block and track operations; this is for a couple of reasons:
1. The vCPU kick should use KVM's functionality for doing this, which can better
target sending IPIs to only the minimum required pCPUs.
2. tdh_mem_track() doesn't need to be executed if a vCPU has not entered a TD,
which is information only KVM knows.
3. Leaving the operations separate will allow for batching many
tdh_mem_range_block() calls before a tdh_mem_track(). While this batching will
not be done initially by KVM, it demonstrates that keeping mem block and track
as separate operations is a generally good design.
Contentions are also possible in TDH.MEM.TRACK. For example, TDH.MEM.TRACK
may contend with TDH.VP.ENTER when advancing the TD epoch counter.
tdh_mem_track() does not provide the retries for the caller. Callers can
choose to avoid contentions or retry on their own.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073648.22143-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX architecture introduces the concept of private GPA vs shared GPA,
depending on the GPA.SHARED bit. The TDX module maintains a Secure EPT
(S-EPT or SEPT) tree per TD to translate TD's private memory accessed
using a private GPA. Wrap the SEAMCALL TDH.MEM.PAGE.ADD with
tdh_mem_page_add() and TDH.MEM.PAGE.AUG with tdh_mem_page_aug() to add TD
private pages and map them to the TD's private GPAs in the SEPT.
Callers of tdh_mem_page_add() and tdh_mem_page_aug() allocate and provide
normal pages to the wrappers, which further pass those pages to the TDX
module. Before passing the pages to the TDX module, tdh_mem_page_add() and
tdh_mem_page_aug() perform a CLFLUSH on the page mapped with keyID 0 to
ensure that any dirty cache lines don't write back later and clobber TD
memory or control structures. Don't worry about the other MK-TME keyIDs
because the kernel doesn't use them. The TDX docs specify that this flush
is not needed unless the TDX module exposes the CLFLUSH_BEFORE_ALLOC
feature bit. Do the CLFLUSH unconditionally for two reasons: make the
solution simpler by having a single path that can handle both
!CLFLUSH_BEFORE_ALLOC and CLFLUSH_BEFORE_ALLOC cases. Avoid wading into any
correctness uncertainty by going with a conservative solution to start.
Call tdh_mem_page_add() to add a private page to a TD during the TD's build
time (i.e., before TDH.MR.FINALIZE). Specify which GPA the 4K private page
will map to. No need to specify level info since TDH.MEM.PAGE.ADD only adds
pages at 4K level. To provide initial contents to the TD, provide an additional
source page residing in memory managed by the host kernel itself (encrypted
with a shared keyID). The TDX module will copy the initial contents from
the source page in shared memory into the private page after mapping the
page in the SEPT to the specified private GPA. The TDX module allows the
source page to be the same page as the private page to be added. In that
case, the TDX module converts and encrypts the source page as a TD private
page.
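For example, a build-time caller might look roughly like the following (the
wrapper signature is an assumption; extended error details are returned via
the output parameters):

static int tdx_add_initial_page(struct kvm_tdx *kvm_tdx, gpa_t gpa,
                                struct page *page, struct page *src)
{
        u64 err, entry, level;

        /* 4K only: TDH.MEM.PAGE.ADD does not take a level argument. */
        err = tdh_mem_page_add(&kvm_tdx->td, gpa, page, src, &entry, &level);
        return err ? -EIO : 0;
}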
Call tdh_mem_page_aug() to add a private page to a TD during the TD's
runtime (i.e., after TDH.MR.FINALIZE). TDH.MEM.PAGE.AUG supports adding
huge pages. Specify which GPA the private page will map to, along with
level info embedded in the lower bits of the GPA. The TDX module will
recognize the added page as the TD's private page after the TD's acceptance
with TDCALL TDG.MEM.PAGE.ACCEPT.
tdh_mem_page_add() and tdh_mem_page_aug() may fail. Callers can check
function return value and retrieve extended error info from the function
output parameters.
The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs return a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller
may need to handle this error in specific ways (e.g., retry). Therefore,
return the SEAMCALL error code directly to the caller. Don't attempt to
handle it in the core kernel.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073636.22129-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX architecture introduces the concept of private GPA vs shared GPA,
depending on the GPA.SHARED bit. The TDX module maintains a Secure EPT
(S-EPT or SEPT) tree per TD for private GPA to HPA translation. Wrap the
TDH.MEM.SEPT.ADD SEAMCALL with tdh_mem_sept_add() to provide pages to the
TDX module for building a TD's SEPT tree. (Refer to these pages as SEPT
pages).
Callers need to allocate and provide a normal page to tdh_mem_sept_add(),
which then passes the page to the TDX module via the SEAMCALL
TDH.MEM.SEPT.ADD. The TDX module then installs the page into SEPT tree and
encrypts this SEPT page with the TD's guest keyID. The kernel cannot use
the SEPT page until after reclaiming it via TDH.MEM.SEPT.REMOVE or
TDH.PHYMEM.PAGE.RECLAIM.
Before passing the page to the TDX module, tdh_mem_sept_add() performs a
CLFLUSH on the page mapped with keyID 0 to ensure that any dirty cache
lines don't write back later and clobber TD memory or control structures.
Don't worry about the other MK-TME keyIDs because the kernel doesn't use
them. The TDX docs specify that this flush is not needed unless the TDX
module exposes the CLFLUSH_BEFORE_ALLOC feature bit. Do the CLFLUSH
unconditionally for two reasons: make the solution simpler by having a
single path that can handle both !CLFLUSH_BEFORE_ALLOC and
CLFLUSH_BEFORE_ALLOC cases. Avoid wading into any correctness uncertainty
by going with a conservative solution to start.
Callers should specify "GPA" and "level" for the TDX module to install the
SEPT page at the specified position in the SEPT. Do not include the root
page level in "level" since TDH.MEM.SEPT.ADD can only add non-root pages to
the SEPT. Ensure "level" is between 1 and 3 for a 4-level SEPT or between 1
and 4 for a 5-level SEPT.
Call tdh_mem_sept_add() during the TD's build time or during the TD's
runtime. Check for errors from the function return value and retrieve
extended error info from the function output parameters.
The TDX module has many internal locks. To avoid staying in SEAM mode for
too long, SEAMCALLs return a BUSY error code to the kernel instead of
spinning on the locks. Depending on the specific SEAMCALL, the caller
may need to handle this error in specific ways (e.g., retry). Therefore,
return the SEAMCALL error code directly to the caller. Don't attempt to
handle it in the core kernel.
TDH.MEM.SEPT.ADD effectively manages two internal resources of the TDX
module: it installs page table pages in the SEPT tree and also updates the
TDX module's page metadata (PAMT). Don't add a wrapper for the matching
SEAMCALL for removing a SEPT page (TDH.MEM.SEPT.REMOVE) because KVM, as the
only in-kernel user, will only tear down the SEPT tree when the TD is being
torn down. When this happens, it can just do other operations that reclaim
the SEPT pages for the host kernel to use, update the PAMT, and let the
SEPT get trashed.
[Kai: Switched from generic seamcall export]
[Yan: Re-wrote the changelog]
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241112073624.22114-1-yan.y.zhao@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX host key IDs (HKIDs) are a limited resource in a machine, and the misc
cgroup lets the machine owner track their usage and limit the possibility
of abusing them outside the owner's control.
The cgroup v2 miscellaneous subsystem was introduced to control the
resource of AMD SEV & SEV-ES ASIDs. Likewise introduce HKIDs as a misc
resource.
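A sketch of how a charge could look with the misc cgroup API, assuming a
MISC_CG_RES_TDX resource type and a misc_cg reference stored per TD:

static int tdx_misc_cg_charge_hkid(struct kvm_tdx *kvm_tdx)
{
        int ret;

        kvm_tdx->misc_cg = get_current_misc_cg();
        ret = misc_cg_try_charge(MISC_CG_RES_TDX, kvm_tdx->misc_cg, 1);
        if (ret)
                put_misc_cg(kvm_tdx->misc_cg);

        return ret;
}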
Signed-off-by: Zhiming Hu <zhiming.hu@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For TDX, the maxpa (CPUID.0x80000008.EAX[7:0]) is fixed as native and
the max_gpa (CPUID.0x80000008.EAX[23:16]) is configurable and used
to configure the EPT level and GPAW.
Use max_gpa to determine the TDP level.
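The mapping is essentially a threshold check; an illustrative sketch:

static int tdx_get_tdp_level(u8 max_gpa)
{
        /* GPAs wider than 48 bits require 5-level EPT (and GPAW = 52). */
        return max_gpa > 48 ? 5 : 4;
}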
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement an IOCTL to allow userspace to read the CPUID bit values for a
configured TD.
The TDX module doesn't provide the ability to set all CPUID bits. Instead
some are configured indirectly, or have fixed values. But it does allow
for the final resulting CPUID bits to be read. This information will be
useful for userspace to understand the configuration of the TD, and set
KVM's copy via KVM_SET_CPUID2.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix subleaf mask check (Binbin)
- Search all possible sub-leafs (Francesco Lavra)
- Reduce off-by-one error sensitive code (Francesco, Xiaoyao)
- Handle buffers too small from userspace (Xiaoyao)
- Read max CPUID from TD instead of using fixed values (Xiaoyao)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TD guest vcpu needs TDX specific initialization before running. Repurpose
KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
KVM_TDX_INIT_VCPU, and implement the callback for it.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/
(Sean)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement vcpu related stubs for TDX for create, reset and free.
For now, create only the features that do not require the TDX SEAMCALL.
The TDX specific vcpu initialization will be handled by KVM_TDX_INIT_VCPU.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Use lapic_in_kernel() (Nikolay Borisov)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Destroying a TDX guest requires that there is at least one CPU online for
each package, because reclaiming the guest's TDX KeyID (as part of the
teardown process) requires calling some SEAMCALL (on any CPU) on all
packages.
Do not offline the last CPU of a package when any TDX guest is running,
otherwise KVM may not be able to tear down the TDX guest, resulting in
leaking the TDX KeyID and other resources like TDX guest control
structure pages.
Implement the TDX version of 'offline_cpu()' to prevent a CPU from going
offline if it is the last CPU on its package.
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX KVM doesn't support PMU yet, it's future work of TDX KVM support as
another patch series. For now, handle TDX by updating vcpu_to_lbr_desc()
and vcpu_to_lbr_records() to return NULL.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Add pragma poison for to_vmx() (Paolo)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
After the crypto-protection key has been configured, TDX requires a
VM-scope initialization as a step of creating the TDX guest. This
"per-VM" TDX initialization does the global configurations/features that
the TDX guest can support, such as guest's CPUIDs (emulated by the TDX
module), the maximum number of vcpus etc.
Because there is no room in KVM_CREATE_VM to pass all the required
parameters, introduce a new ioctl KVM_TDX_INIT_VM and mark the VM as
TD_STATE_UNINITIALIZED until it is invoked.
This "per-VM" TDX initialization must be done before any "vcpu-scope" TDX
initialization; KVM_TDX_INIT_VM IOCTL must be invoked before the creation
of vCPUs.
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CPUID values are provided for TDX virtual machines as part of the
KVM_TDX_INIT_VM ioctl. Unlike KVM_SET_CPUID2, TDX will need to
examine the leaves, either to validate against the CPUIDs listed
in the TDX module's configuration or to fill other controls with
matching values.
Since there is an existing function to look up a leaf/index pair
into a given list of CPUID entries, export it as kvm_find_cpuid_entry2().
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change the KVM_CAP_MAX_VCPUS extension from being reported globally to
being reported per-VM, so that userspace can query the maximum vCPUs for a
TDX guest by checking the KVM_CAP_MAX_VCPUS extension on a per-VM basis.
Today KVM x86 reports KVM_MAX_VCPUS as the maximum vCPUs for all guests
globally, and userspace, i.e. Qemu, queries the KVM_CAP_MAX_VCPUS
extension globally but not on a per-VM basis.
TDX has its own limit of maximum vCPUs it can support for all TDX guests
in addition to KVM_MAX_VCPUS. TDX module reports this limit via the
MAX_VCPU_PER_TD global metadata. Different modules may report different
values. In practice, the reported value reflects the maximum logical
CPUs that ALL the platforms that the module supports can possibly have.
Note some old modules may also not support this metadata, in which case
the limit is U16_MAX.
The current way to always report KVM_MAX_VCPUS in the KVM_CAP_MAX_VCPUS
extension is not enough for TDX. To accommodate TDX, change to report
the KVM_CAP_MAX_VCPUS extension on per-VM basis.
Specifically, override kvm->max_vcpus in tdx_vm_init() for TDX guest,
and report kvm->max_vcpus in the KVM_CAP_MAX_VCPUS extension check.
Change to report "the number of logical CPUs the platform has" as the
maximum vCPUs for TDX guest. Simply forwarding the MAX_VCPU_PER_TD
reported by the TDX module would result in an unpredictable ABI because
the value reported to userspace would depend on the whims of TDX
modules.
This works in practice because the MAX_VCPU_PER_TD reported by the
TDX module will never be smaller than the one reported to userspace.
But to make sure KVM never reports an unsupported value, sanity check
the MAX_VCPU_PER_TD reported by TDX module is not smaller than the
number of logical CPUs the platform has, otherwise refuse to use TDX.
Note, when creating a TDX guest, TDX actually requires the "maximum
vCPUs for _this_ TDX guest" as an input to initialize the TDX guest.
But TDX guest's maximum vCPUs is not part of TDREPORT thus not part of
attestation, thus there's no need to allow userspace to explicitly
_configure_ the maximum vCPUs on per-VM basis. KVM will simply use
kvm->max_vcpus as input when initializing the TDX guest.
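A sketch of the per-VM override, assuming num_present_cpus() is the chosen
notion of "logical CPUs the platform has"; the KVM_CAP_MAX_VCPUS check then
simply returns kvm->max_vcpus when queried on a VM fd:

static void tdx_vm_init_max_vcpus(struct kvm *kvm)
{
        /* Report the platform's logical CPU count as this TD's maximum. */
        kvm->max_vcpus = min_t(int, kvm->max_vcpus, num_present_cpus());
}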
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement management of the TDX private KeyID and the create, destroy
and free operations for a TDX guest.
When creating a TDX guest, assign a TDX private KeyID for the TDX guest
for memory encryption, and allocate pages for the guest. These are used
for the Trust Domain Root (TDR) and Trust Domain Control Structure (TDCS).
On destruction, free the allocated pages, and the KeyID.
Before tearing down the private page tables, TDX requires the guest TD to
be destroyed by reclaiming the KeyID. Do it in the vm_pre_destroy() kvm_x86_ops
hook. The TDR control structures can be freed in the vm_destroy() hook,
which runs last.
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix build issue in kvm-coco-queue
- Init ret earlier to fix __tdx_td_init() error handling. (Chao)
- Standardize -EAGAIN for __tdx_td_init() retry errors (Rick)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX KVM needs system-wide information about the TDX module. Generate the
data based on tdx_sysinfo td_conf CPUID data.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
- Clarify comment about EAX[23:16] in td_init_cpuid_entry2() (Xiaoyao)
- Add comment for configurable CPUID bits (Xiaoyao)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific for
guest state-protected VM. It defined subcommands for technology-specific
operations under KVM_MEMORY_ENCRYPT_OP. Despite its name, the subcommands
are not limited to memory encryption, but various technology-specific
operations are defined. It's natural to repurpose KVM_MEMORY_ENCRYPT_OP
for TDX specific operations and define subcommands.
Add a placeholder function for the TDX specific VM-scoped ioctl as mem_enc_op.
TDX specific sub-commands will be added to retrieve/pass TDX specific
parameters. Make mem_enc_ioctl non-optional as it's always filled.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Drop the misleading "defined for consistency" line. It's a copy-paste
error introduced in the earlier patches. Earlier there was padding at
the end to match struct kvm_sev_cmd size. (Tony)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add helper functions to print out errors from the TDX module in a uniform
manner.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add error codes for the TDX SEAMCALLs both for TDX VMM side for TDH
SEAMCALL and TDX guest side for TDG.VP.VMCALL. KVM issues the TDX
SEAMCALLs and checks its error code. KVM handles hypercall from the TDX
guest and may return an error. So error code for the TDX guest is also
needed.
TDX SEAMCALL uses bits 31:0 to return more information, so these error
codes will only exactly match RAX[63:32]. Error codes for TDG.VP.VMCALL are
defined by the TDX Guest-Host-Communication Interface spec.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20241030190039.77971-14-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add architectural definitions for KVM to issue the TDX SEAMCALLs:
structures and values that are architecturally defined in the ABI Reference
chapter of the TDX module specifications.
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
- Drop old duplicate defines, the x86 core exports what's needed (Kai)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add TDX's own VM and vCPU structures as placeholder to manage and run
TDX guests. Also add helper functions to check whether a VM/vCPU is
TDX or normal VMX one, and add helpers to convert between TDX VM/vCPU
and KVM VM/vCPU.
TDX protects guest VMs from malicious host. Unlike VMX guests, TDX
guests are crypto-protected. KVM cannot access TDX guests' memory and
vCPU states directly. Instead, TDX requires KVM to use a set of TDX
architecture-defined firmware APIs (a.k.a TDX module SEAMCALLs) to
manage and run TDX guests.
In fact, the ways to manage and run TDX guests and normal VMX guests are
quite different. Because of that, the current structures
('struct kvm_vmx' and 'struct vcpu_vmx') to manage VMX guests are not
quite suitable for TDX guests. E.g., the majority of the members of
'struct vcpu_vmx' don't apply to TDX guests.
Introduce TDX's own VM and vCPU structures ('struct kvm_tdx' and 'struct
vcpu_tdx' respectively) for KVM to manage and run TDX guests. And
instead of building TDX's VM and vCPU structures based on VMX's, build
them directly based on 'struct kvm'.
As a result, TDX and VMX guests will have different VM size and vCPU
size/alignment.
Currently, kvm_arch_alloc_vm() uses 'kvm_x86_ops::vm_size' to allocate
enough space for the VM structure when creating guest. With TDX guests,
ideally, KVM should allocate the VM structure based on the VM type so
that the precise size can be allocated for VMX and TDX guests. But this
requires more extensive code change. For now, simply choose the maximum
size of 'struct kvm_tdx' and 'struct kvm_vmx' for VM structure
allocation for both VMX and TDX guests. This would result in small
memory waste for each VM which has smaller VM structure size but this is
acceptable.
For simplicity, use the same way for vCPU allocation too. Otherwise KVM
would need to maintain a separate 'kvm_vcpu_cache' for each VM type.
Note, updating the 'vt_x86_ops::vm_size' needs to be done before calling
kvm_ops_update(), which copies vt_x86_ops to kvm_x86_ops. However this
happens before TDX module initialization. Therefore theoretically it is
possible that 'kvm_x86_ops::vm_size' is set to size of 'struct kvm_tdx'
(when it's larger) but TDX actually fails to initialize at a later time.
Again, the worst case of this is wasting a couple of bytes of memory for each
VM. KVM could choose to update 'kvm_x86_ops::vm_size' at a later time
depending on TDX's status but that would require base KVM module to
export either kvm_x86_ops or kvm_ops_update().
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
- Make to_kvm_tdx() and to_tdx() private to tdx.c (Francesco, Tony)
- Add pragma poison for to_vmx() (Paolo)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM will need to consult some essential TDX global information to create
and run TDX guests. Get the global information after initializing TDX.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20241030190039.77971-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Before KVM can use TDX to create and run TDX guests, TDX needs to be
initialized from two perspectives: 1) the TDX module must be initialized
properly to a working state; 2) a per-CPU TDX initialization, a.k.a. the
TDH.SYS.LP.INIT SEAMCALL, must be done on a logical CPU before that CPU can
run any other TDX SEAMCALLs.
The TDX host core-kernel provides two functions to do the above two
respectively: tdx_enable() and tdx_cpu_enable().
There are two options in terms of when to initialize TDX: initialize TDX
at KVM module loading time, or when creating the first TDX guest.
Choose to initialize TDX during KVM module loading time:
Initializing the TDX module is both memory and CPU time consuming: 1) the
kernel needs to allocate a non-trivial amount (~1/256) of system memory
as metadata used by TDX module to track each TDX-usable memory page's
status; 2) the TDX module needs to initialize this metadata, one entry
for each TDX-usable memory page.
Also, the kernel uses alloc_contig_pages() to allocate those metadata
chunks, because they are large and need to be physically contiguous.
alloc_contig_pages() can fail. If TDX were initialized when creating the
first TDX guest, then there's a chance that KVM won't be able to run any
TDX guests even though KVM _declares_ that it is able to support TDX.
This isn't good for the user.
On the other hand, initializing TDX at KVM module loading time can make
sure KVM is providing a consistent view of whether KVM can support TDX
to the user.
Always only try to initialize TDX after VMX has been initialized. TDX
is based on VMX, and if VMX fails to initialize then TDX is likely to be
broken anyway. Also, in practice, supporting TDX will require part of
VMX and common x86 infrastructure in working order, so TDX cannot be
enabled alone w/o VMX support.
There are two cases that can result in failure to initialize TDX: 1) TDX
cannot be supported (e.g., because TDX is not supported or enabled by
hardware, or the module is not loaded, or some dependency is missing in
KVM's configuration); 2) any unexpected error during TDX bring-up. For the
first case only mark TDX as disabled but still allow the KVM module to be
loaded. For the second case just fail to load the KVM module so that
the user can be aware.
Because TDX costs additional memory, don't enable TDX by default. Add a
new module parameter 'enable_tdx' to allow the user to opt-in.
Note, the name tdx_init() has already been taken by the early boot code.
Use tdx_bringup() for initializing TDX (and tdx_cleanup() since KVM
doesn't actually tear down TDX). They don't match vt_init()/vt_exit(),
vmx_init()/vmx_exit() etc., but it's not the end of the world.
Also, once initialized, the TDX module cannot be disabled and enabled
again w/o the TDX module runtime update, which isn't supported by the
kernel. After TDX is enabled, nothing needs to be done when KVM
disables hardware virtualization, e.g., when offlining CPU, or during
suspend/resume. TDX host core-kernel code internally tracks TDX status
and can handle "multiple enabling" scenario.
Similar to KVM_AMD_SEV, add a new KVM_INTEL_TDX Kconfig to guide KVM TDX
code. Make it depend on INTEL_TDX_HOST but not replace INTEL_TDX_HOST
because in the longer term there's a use case that requires making
SEAMCALLs w/o KVM as mentioned by Dan [1].
Link: https://lore.kernel.org/6723fc2070a96_60c3294dc@dwillia2-mobl3.amr.corp.intel.com.notmuch/ [1]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-ID: <162f9dee05c729203b9ad6688db1ca2960b4b502.1731664295.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add vt_init() and vt_exit() as the new module init/exit functions and
refactor the existing vmx_init()/vmx_exit() as helpers to make room for TDX
specific initialization and teardown.
To support TDX, KVM will need to enable TDX during KVM module loading
time. Enabling TDX requires enabling hardware virtualization first so
that all online CPUs (and the new CPU going online) are in post-VMXON
state.
Currently, the vmx_init() flow is:
1) hv_init_evmcs(),
2) kvm_x86_vendor_init(),
3) Other VMX specific initialization,
4) kvm_init()
The kvm_x86_vendor_init() invokes kvm_x86_init_ops::hardware_setup() to
do VMX specific hardware setup and calls kvm_ops_update() to initialize
kvm_x86_ops to VMX's version.
TDX will have its own version for most of kvm_x86_ops callbacks. It
would be nice if kvm_x86_init_ops::hardware_setup() could also be used
for TDX, but in practice it cannot. The reason is, as mentioned above,
TDX initialization requires hardware virtualization to have been enabled,
which must happen after kvm_ops_update(), but hardware_setup() is done
before that.
Also, TDX is based on VMX, and it makes sense to only initialize TDX
after VMX has been initialized. If VMX fails to initialize, TDX is
likely broken anyway.
So the new flow of KVM module init function will be:
1) Current VMX initialization code in vmx_init() before kvm_init(),
2) TDX initialization,
3) kvm_init()
Split vmx_init() into two parts based on above 1) and 3) so that TDX
initialization can fit in between. Make part 1) the new helper
vmx_init(). Introduce vt_init() as the new module init function which
calls vmx_init() and kvm_init(). TDX initialization will be added
later.
Do the same thing for vmx_exit()/vt_exit().
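A sketch of the resulting init function, with the TDX step shown as a
placeholder for the later patch (the tdx_bringup()/tdx_cleanup() names
follow the later commit; vCPU size handling is simplified):

static int __init vt_init(void)
{
        int r;

        r = vmx_init();         /* part 1): existing VMX setup */
        if (r)
                return r;

        r = tdx_bringup();      /* part 2): added by a later patch */
        if (r)
                goto err_vmx;

        r = kvm_init(sizeof(struct vcpu_vmx), __alignof__(struct vcpu_vmx),
                     THIS_MODULE);
        if (r)
                goto err_tdx;

        return 0;

err_tdx:
        tdx_cleanup();
err_vmx:
        vmx_exit();
        return r;
}
module_init(vt_init);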
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-ID: <3f23f24098bdcf42e213798893ffff7cdc7103be.1731664295.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel TDX protects guest VMs from malicious host and certain physical
attacks. Pre-TDX Intel hardware has support for a memory encryption
architecture called MK-TME, which repurposes several high bits of
physical address as "KeyID". The BIOS reserves a sub-range of MK-TME
KeyIDs as "TDX private KeyIDs".
Each TDX guest must be assigned a unique TDX KeyID when it is
created. The kernel reserves the first TDX private KeyID for
crypto-protection of specific TDX module data which has a lifecycle that
exceeds the KeyID reserved for the TD's use. The rest of the KeyIDs are
left for TDX guests to use.
Create a small KeyID allocator. Export
tdx_guest_keyid_alloc()/tdx_guest_keyid_free() to allocate and free TDX
guest KeyID for KVM to use.
Don't provide the stub functions when CONFIG_INTEL_TDX_HOST=n since they
are not supposed to be called in this case.
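The allocator can be a thin wrapper around an IDA; a sketch, with the KeyID
range variables assumed to be filled in from the BIOS-reported range:

static DEFINE_IDA(tdx_guest_keyid_pool);
static u32 tdx_guest_keyid_start, tdx_nr_guest_keyids;

int tdx_guest_keyid_alloc(void)
{
        return ida_alloc_range(&tdx_guest_keyid_pool, tdx_guest_keyid_start,
                               tdx_guest_keyid_start + tdx_nr_guest_keyids - 1,
                               GFP_KERNEL);
}
EXPORT_SYMBOL_GPL(tdx_guest_keyid_alloc);

void tdx_guest_keyid_free(unsigned int keyid)
{
        ida_free(&tdx_guest_keyid_pool, keyid);
}
EXPORT_SYMBOL_GPL(tdx_guest_keyid_free);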
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20241030190039.77971-5-rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM needs two classes of global metadata to create and run TDX guests:
- "TD Control Structures"
- "TD Configurability"
The first class contains the sizes of TDX guest per-VM and per-vCPU
control structures. KVM will need to use them to allocate enough space
for those control structures.
The second class contains info which reports things like which features
are configurable to TDX guest etc. KVM will need to use them to
properly configure TDX guests.
Read them for KVM TDX to use.
The code change is auto-generated by re-running the script in [1] after
uncommenting the "td_conf" and "td_ctrl" part to regenerate the
tdx_global_metadata.{hc} and update them to the existing ones in the
kernel.
#python tdx.py global_metadata.json tdx_global_metadata.h \
tdx_global_metadata.c
The 'global_metadata.json' can be fetched from [2].
Note that as of this writing, the JSON file only allows a maximum of 32
CPUID entries. While this is enough for current contents of the CPUID
leaves, there were plans to change the JSON per TDX module release which
would change the ABI and potentially prevent future versions of the TDX
module from working with older kernels.
While discussions are ongoing with the TDX module team on what exactly
constitutes an ABI breakage, in the meantime the TDX module team has
agreed to not increase the number of CPUID entries beyond 128 without
an opt in. Therefore the file was tweaked by hand to change the maximum
number of CPUID_CONFIGs.
Link: https://lore.kernel.org/kvm/0853b155ec9aac09c594caa60914ed6ea4dc0a71.camel@intel.com/ [1]
Link: https://cdrdv2.intel.com/v1/dl/getContent/795381 [2]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20241030190039.77971-4-rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adding all the information that KVM needs increases the size of struct
tdx_sys_info, to the point that you can get warnings about the stack
size of init_tdx_module(). Since KVM also needs to read the TDX metadata
after init_tdx_module() returns, make the variable a global.
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel TDX protects guest VMs from malicious host and certain physical
attacks. The TDX module has the concept of flushing vCPUs. These flushes
include both a flush of the translation caches and also any other state
internal to the TDX module. Before freeing a KeyID, this flush operation
needs to be done. KVM will need to perform the flush on each pCPU
associated with the TD, and also perform a TD scoped operation that checks
if the flush has been done on all vCPU's associated with the TD.
Add a tdh_vp_flush() function to be used to call TDH.VP.FLUSH on each pCPU
associated with the TD during TD teardown. It will also be called when
disabling TDX and during vCPU migration between pCPUs.
Add tdh_mng_vpflushdone() to be used by KVM to call TDH.MNG.VPFLUSHDONE.
KVM will use this during TD teardown to verify that TDH.VP.FLUSH has been
called sufficiently, and advance the state machine that will allow for
reclaiming the TD's KeyID.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Message-ID: <20241203010317.827803-7-rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel TDX protects guest VMs from malicious host and certain physical
attacks. The TDX module has TD scoped and vCPU scoped "metadata fields".
These fields are a bit like VMCS fields, and stored in data structures
maintained by the TDX module. Export 3 SEAMCALLs for use in reading and
writing these fields:
Make tdh_mng_rd() use TDH.MNG.RD to read the TD scoped metadata.
Make tdh_vp_rd()/tdh_vp_wr() use TDH.VP.RD/WR to read/write the vCPU
scoped metadata.
KVM will use these by creating inline helpers that target various metadata
sizes. Export the raw SEAMCALL leaf, to avoid exporting the large number
of various sized helpers.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Message-ID: <20241203010317.827803-6-rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel TDX protects guest VMs from malicious host and certain physical
attacks. The TDX module uses pages provided by the host for both control
structures and for TD guest pages. These pages are encrypted using the
MK-TME encryption engine, with its special requirements around cache
invalidation. For its own security, the TDX module ensures pages are
flushed properly and tracks which usage they are currently assigned. For
creating and tearing down TD VMs and vCPUs KVM will need to use the
TDH.PHYMEM.PAGE.RECLAIM, TDH.PHYMEM.CACHE.WB, and TDH.PHYMEM.PAGE.WBINVD
SEAMCALLs.
Add tdh_phymem_page_reclaim() to enable KVM to call
TDH.PHYMEM.PAGE.RECLAIM to reclaim the page for use by the host kernel.
This effectively resets its state in the TDX module's page tracking
(PAMT), if the page is available to be reclaimed. This will be used by KVM
to reclaim the various types of pages owned by the TDX module. It will
have a small wrapper in KVM that retries in the case of a relevant error
code. Don't implement this wrapper in arch/x86 because KVM's solution
around retrying SEAMCALLs will be better located in a single place.
Add tdh_phymem_cache_wb() to enable KVM to call TDH.PHYMEM.CACHE.WB to do
a cache write back in a way that the TDX module can verify, before it
allows a KeyID to be freed. The KVM code will use this to have a small
wrapper that handles retries. Since the TDH.PHYMEM.CACHE.WB operation is
interruptible, have tdh_phymem_cache_wb() take a resume argument to pass
this info to the TDX module for restarts. It is worth noting that this
SEAMCALL uses a SEAM specific MSR to do the write back in sections. In
this way it does export some new functionality that affects CPU state.
Add tdh_phymem_page_wbinvd_tdr() to enable KVM to call
TDH.PHYMEM.PAGE.WBINVD to do a cache write back and invalidate of a TDR,
using the global KeyID. The underlying TDH.PHYMEM.PAGE.WBINVD SEAMCALL
requires the related KeyID to be encoded into the SEAMCALL args. Since the
global KeyID is not exposed to KVM, a dedicated wrapper is needed for TDR
focused TDH.PHYMEM.PAGE.WBINVD operations.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Message-ID: <20241203010317.827803-5-rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel TDX protects guest VMs from malicious host and certain physical
attacks. It defines various control structures that hold state for
virtualized components of the TD (i.e. VMs or vCPUs). These control
structures are stored in pages given to the TDX module and encrypted
with either the global KeyID or the guest KeyIDs.
To manipulate these control structures the TDX module defines a few
SEAMCALLs. KVM will use these during the process of creating a vCPU as
follows:
1) Call TDH.VP.CREATE to create a TD vCPU Root (TDVPR) page for each
vCPU.
2) Call TDH.VP.ADDCX to add per-vCPU control pages (TDCX) for each vCPU.
3) Call TDH.VP.INIT to initialize the TDCX for each vCPU.
To reclaim these pages for use by the kernel other SEAMCALLs are needed,
which will be added in future patches.
Export functions to allow KVM to make these SEAMCALLs. Export two
variants for TDH.VP.CREATE, in order to support the planned logic of KVM
to support TDX modules with and without the ENUM_TOPOLOGY feature. If
KVM can drop support for the !ENUM_TOPOLOGY case, this could go down to a
single version. Leave that for later discussion.
The TDX module provides SEAMCALLs to hand pages to the TDX module for
storing TDX controlled state. SEAMCALLs that operate on this state are
directed to the appropriate TD vCPU using references to the pages
originally provided for managing the vCPU's state. So the host kernel
needs to track these pages, both as an ID for specifying which vCPU to
operate on, and to allow them to be eventually reclaimed. The vCPU
associated pages are called TDVPR (Trust Domain Virtual Processor Root)
and TDCX (Trust Domain Control Extension).
Introduce "struct tdx_vp" for holding references to pages provided to the
TDX module for the TD vCPU associated state. Don't plan for any vCPU
associated state that is controlled by KVM to live in this struct. Only
expect it to hold data for concepts specific to the TDX architecture, for
which there can't already be preexisting storage in KVM.
Add both the TDVPR page and an array of TDCX pages, even though the
SEAMCALL wrappers will only need to know about the TDVPR pages for
directing the SEAMCALLs to the right vCPU. Adding the TDCX pages to this
struct will let all of the vCPU associated pages handed to the TDX module be
tracked in one location. For a type to specify physical pages, use KVM's
hpa_t type. Do this for KVM's benefit. This is the common type used to hold
physical addresses in KVM, so it will make interoperability easier.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Message-ID: <20241203010317.827803-4-rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel TDX protects guest VMs from malicious hosts and certain physical
attacks. It defines various control structures that hold state for things
like TDs or vCPUs. These control structures are stored in pages given to
the TDX module and encrypted with either the global KeyID or the guest
KeyIDs.
To manipulate these control structures the TDX module defines a few
SEAMCALLs. KVM will use these during the process of creating a TD as
follows:
1) Allocate a unique TDX KeyID for a new guest.
2) Call TDH.MNG.CREATE to create a "TD Root" (TDR) page, together with
the newly allocated KeyID. Unlike the rest of the TDX guest, the TDR
page is crypto-protected by the 'global KeyID'.
3) Call the previously added TDH.MNG.KEY.CONFIG on each package to
configure the KeyID for the guest. After this step, the KeyID to
protect the guest is ready and the rest of the guest will be protected
by this KeyID.
4) Call TDH.MNG.ADDCX to add TD Control Structure (TDCS) pages.
5) Call TDH.MNG.INIT to initialize the TDCS.
To reclaim these pages for use by the kernel other SEAMCALLs are needed,
which will be added in future patches.
Add tdh_mng_addcx(), tdh_mng_create() and tdh_mng_init() to export these
SEAMCALLs so that KVM can use them to create TDs.
For SEAMCALLs that give a page to the TDX module to be encrypted, CLFLUSH
the page mapped with KeyID 0, such that any dirty cache lines don't write
back later and clobber TD memory or control structures. Don't worry about
the other MK-TME KeyIDs because the kernel doesn't use them. The TDX docs
specify that this flush is not needed unless the TDX module exposes the
CLFLUSH_BEFORE_ALLOC feature bit. Be conservative and always flush. Add a
helper function to facilitate this.
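The helper amounts to a cache-line flush of the KeyID-0 mapping; a sketch:

static void tdx_clflush_page(struct page *page)
{
        /* Flush the KeyID-0 alias before the TDX module re-encrypts the page. */
        clflush_cache_range(page_to_virt(page), PAGE_SIZE);
}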
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Message-ID: <20241203010317.827803-3-rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel TDX protects guest VMs from malicious host and certain physical
attacks. Pre-TDX Intel hardware has support for a memory encryption
architecture called MK-TME, which repurposes several high bits of
physical address as "KeyID". TDX ends up with reserving a sub-range of
MK-TME KeyIDs as "TDX private KeyIDs".
Like MK-TME, these KeyIDs can be associated with an ephemeral key. For TDX
this association is done by the TDX module. It also has its own tracking
for which KeyIDs are in use. To do this ephemeral key setup and manipulate
the TDX module's internal tracking, KVM will use the following SEAMCALLs:
TDH.MNG.KEY.CONFIG: Mark the KeyID as in use, and initialize its
ephemeral key.
TDH.MNG.KEY.FREEID: Mark the KeyID as not in use.
These SEAMCALLs both operate on TDR structures, which are setup using the
previously added TDH.MNG.CREATE SEAMCALL. KVM's use of these operations
will go like:
- tdx_guest_keyid_alloc()
- Initialize TD and TDR page with TDH.MNG.CREATE (not yet-added), passing
KeyID
- TDH.MNG.KEY.CONFIG to initialize the key
- TD runs, teardown is started
- TDH.MNG.KEY.FREEID
- tdx_guest_keyid_free()
Don't try to combine the tdx_guest_keyid_alloc() and TDH.MNG.KEY.CONFIG
operations because TDH.MNG.CREATE and some locking need to be done in the
middle. Don't combine TDH.MNG.KEY.FREEID and tdx_guest_keyid_free() so they
are symmetrical with the creation path.
So implement tdh_mng_key_config() and tdh_mng_key_freeid() as functions
separate from tdx_guest_keyid_alloc() and tdx_guest_keyid_free().
The TDX module provides SEAMCALLs to hand pages to the TDX module for
storing TDX controlled state. SEAMCALLs that operate on this state are
directed to the appropriate TD VM using references to the pages originally
provided for managing the TD's state. So the host kernel needs to track
these pages, both as an ID for specifying which TD to operate on, and to
allow them to be eventually reclaimed. The TD VM associated pages are
called TDR (Trust Domain Root) and TDCS (Trust Domain Control Structure).
Introduce "struct tdx_td" for holding references to pages provided to the
TDX module for this TD VM associated state. Don't plan for any TD
associated state that is controlled by KVM to live in this struct. Only
expect it to hold data for concepts specific to the TDX architecture, for
which there is no preexisting storage in KVM.
Add both the TDR page and an array of TDCS pages, even though the SEAMCALL
wrappers will only need to know about the TDR pages for directing the
SEAMCALLs to the right TD. Adding the TDCS pages to this struct will let
all of the TD VM associated pages handed to the TDX module be tracked in
one location. For a type to specify physical pages, use KVM's hpa_t type.
Do this for KVM's benefit: it is the common type used to hold physical
addresses in KVM, so it will make interoperability easier.
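A sketch of the struct as described above (fields per this text; the exact
layout in the patch may differ):

  struct tdx_td {
          hpa_t tdr;      /* TD Root page handed to the TDX module  */
          hpa_t *tdcs;    /* array of TD Control Structure pages    */
  };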
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Message-ID: <20241203010317.827803-2-rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_CAP_SYNC_REGS does not make sense for VMs with protected guest state,
since the register values cannot actually be written. Return 0
when using the VM-level KVM_CHECK_EXTENSION ioctl, and accordingly
return -EINVAL from KVM_RUN if the valid/dirty fields are nonzero.
However, on exit from KVM_RUN userspace could have placed a nonzero
value into kvm_run->kvm_valid_regs, so check guest_state_protected
again and skip store_regs() in that case.
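A hedged sketch of the KVM_RUN side of the check (not the exact diff):

  if (vcpu->arch.guest_state_protected &&
      (kvm_run->kvm_valid_regs || kvm_run->kvm_dirty_regs))
          return -EINVAL;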
Cc: stable@vger.kernel.org
Fixes: 517987e3fb ("KVM: x86: add fields to struct kvm_arch for CoCo features")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250306202923.646075-1-pbonzini@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add guest_tsc_protected member to struct kvm_arch_vcpu and prohibit
changing TSC offset/multiplier when guest_tsc_protected is true.
X86 confidential computing technology defines a protected guest TSC so that
the VMM can't change the TSC offset/multiplier once the vCPU is initialized.
SEV-SNP defines Secure TSC as optional, whereas TDX mandates it.
KVM has common logic on x86 that tries to guess or adjust TSC
offset/multiplier for better guest TSC and TSC interrupt latency
at KVM vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration
over pCPU (kvm_arch_vcpu_load()), vCPU TSC device attributes
(kvm_arch_tsc_set_attr()) and guest/host writing to TSC or TSC adjust MSR
(kvm_set_msr_common()).
The current x86 KVM implementation conflicts with protected TSC because the
VMM can't change the TSC offset/multiplier.
Because KVM emulates the TSC timer or the TSC deadline timer with the TSC
offset/multiplier, the TSC timer interrupt is injected into the guest at the
wrong time if the KVM TSC offset is different from what the TDX module
determined.
Originally this issue was found by cyclic test of rt-test [1] as the
latency in TDX case is worse than VMX value + TDX SEAMCALL overhead. It
turned out that the KVM TSC offset is different from what the TDX module
determines.
Disable or ignore the KVM logic that changes/adjusts the TSC offset/multiplier,
thus keeping the KVM TSC offset/multiplier the same as the value set by the
TDX module. Writes to MSR_IA32_TSC are also blocked as they amount to a
change in the TSC offset.
[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
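A sketch of the kind of guard this adds around the offset/multiplier update
paths (illustrative, not the full diff):

  if (vcpu->arch.guest_tsc_protected)
          return;         /* the TDX module owns the TSC offset/multiplier */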
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Push down setting vcpu.arch.user_set_tsc to true from kvm_synchronize_tsc()
to __kvm_synchronize_tsc() so that the two callers don't have to modify
user_set_tsc directly as preparation.
A later patch will prohibit changing TSC synchronization for TDX guests by
modifying __kvm_synchronize_tsc(); pushing the update down means the caller
sites don't have to be touched and user_set_tsc handling doesn't change.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <62b1a7a35d6961844786b6e47e8ecb774af7a228.1728719037.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
TDX needs to free the TDR control structures last, after all paging structures
have been torn down; move the vm_destroy callback to a suitable place.
The new place is also okay for AMD; the main difference is that the
MMU has been torn down and, if anything, that is better done before
the SNP ASID is released.
Extracted from a patch by Yan Zhao.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, if architectures want to support HOTPLUG_SMT they need to
provide a topology_is_primary_thread() telling the framework which
thread in the SMT core cannot be offlined. However, arm64 doesn't have a
restriction on which thread in the SMT core cannot be offlined; the
simplest choice is to treat the first thread as the "primary" thread. So
make this the default implementation in the framework and let
architectures like x86 that have a special primary thread override this
function (which they've already done).
There's no need to provide a stub function if !CONFIG_SMP or
!CONFIG_HOTPLUG_SMT. In those cases the CPU being tested is already
the first CPU in the SMT core, so it is always the primary thread.
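A sketch of such a default implementation, assuming topology_sibling_cpumask()
is available (illustrative; the exact placement in the framework may differ):

  static inline bool topology_is_primary_thread(unsigned int cpu)
  {
          /*
           * With no architectural restriction, treat the first thread
           * of the SMT core as the primary thread.
           */
          return cpu == cpumask_first(topology_sibling_cpumask(cpu));
  }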
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20250311075143.61078-2-yangyicong@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
There are some failure modes which lead to triple-faults in the
relocate_kernel() function, which is pretty much undebuggable
for normal mortals.
Adding a GDT in the relocate_kernel() environment is step 1 towards
being able to catch faults and do something more useful.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250312144257.2348250-2-dwmw2@infradead.org
Under VMware hypervisors, SEV-SNP enabled VMs are fundamentally able to boot
without UEFI, but this regressed a year ago due to:
0f4a1e8098 ("x86/sev: Skip ROM range scans and validation for SEV-SNP guests")
In this case, mpparse_find_mptable() has to be called to parse the MP
tables, which contain the necessary boot information.
[ mingo: Updated the changelog. ]
Fixes: 0f4a1e8098 ("x86/sev: Skip ROM range scans and validation for SEV-SNP guests")
Co-developed-by: Ye Li <ye.li@broadcom.com>
Signed-off-by: Ye Li <ye.li@broadcom.com>
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ye Li <ye.li@broadcom.com>
Reviewed-by: Kevin Loughlin <kevinloughlin@google.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250313173111.10918-1-ajay.kaher@broadcom.com
The current minimum required binutils version is 2.25, which
supports the XSAVE{,OPT,C,S} and XRSTOR{,S} instruction mnemonics.
Replace the byte-wise specification of XSAVE{,OPT,C,S}
and XRSTOR{,S} with these proper mnemonics.
No functional change intended.
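As an illustration of the substitution (not the kernel's actual macros), the
raw bytes 0x48,0x0f,0xae,0x27 decode to "xsave64 (%rdi)", so a byte-wise
definition of the form:

  /* before: byte-wise encoding, state pointer in %rdi, mask in %edx:%eax */
  asm volatile(".byte 0x48,0x0f,0xae,0x27"
               : : "D" (xstate), "a" (lmask), "d" (hmask) : "memory");

can be expressed with the mnemonic that the minimum binutils version accepts:

  /* after: the same instruction, written as a mnemonic */
  asm volatile("xsave64 (%%rdi)"
               : : "D" (xstate), "a" (lmask), "d" (hmask) : "memory");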
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250313130251.383204-1-ubizjak@gmail.com
To permit the EFI stub to call this code even when building the kernel
without the legacy decompressor, move the trampoline out of the latter's
startup code.
This is part of an ongoing WIP effort on my part to make the existing,
generic EFI zboot format work on x86 as well.
No functional change intended.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250313120324.1095968-2-ardb+git@google.com
Even though no uses of the bzImage CRC-32 checksum are known, ensure
that the last 4 bytes of the image are unused zero bytes, so that the
checksum can be generated post-build if needed.
[ mingo: Added the 'obsolete' qualifier to the comment. ]
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Ian Campbell <ijc@hellion.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250312081204.521411-2-ardb+git@google.com
Each of get_{mon,ctrl}_domain_from_cpu() only has one caller.
Once the filesystem code is moved to /fs/, there is no equivalent to
core.c.
Move these functions to each live next to their caller. This allows
them to be made static and the header file entries to be removed.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-31-james.morse@arm.com
get_config_index() is used by the architecture specific code to map
a CLOSID+type pair to an index in the configuration arrays.
MPAM needs to do this too to preserve the ABI to user-space; there is no
reason to do it differently.
Move the helper to a header file to allow all architectures that either
use or emulate CDP to use the same pattern of CLOSID values. Moving
this to a header file means it must be marked inline, which matches
the existing compiler choice for this static function.
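For reference, a sketch of the mapping the helper performs, matching x86's
CDP layout of even/odd CLOSIDs (illustrative):

  static inline u32 resctrl_get_config_index(u32 closid,
                                             enum resctrl_conf_type type)
  {
          switch (type) {
          default:
          case CDP_NONE:
                  return closid;
          case CDP_CODE:
                  return closid * 2 + 1;
          case CDP_DATA:
                  return closid * 2;
          }
  }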
Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-30-james.morse@arm.com
Now that the visibility of throttle_mode is being managed by resctrl, it
should consider resources other than MBA that may have a throttle_mode. SMBA
is one such resource.
Extend thread_throttle_mode_init() to check SMBA for a throttle_mode.
Adding support for multiple resources means a platform with both MBA and
SMBA, but an undefined throttle_mode on one of them, can still make the
file visible.
Add the 'undefined' case to rdt_thread_throttle_mode_show().
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-29-james.morse@arm.com
resctrl_file_fflags_init() is called from the architecture specific code to
make the 'thread_throttle_mode' file visible. The architecture specific code
has already set the membw.throttle_mode in the rdt_resource.
This forces the RFTYPE flags used by resctrl to be exposed to the architecture
specific code.
This doesn't need to be specific to the architecture, the throttle_mode can be
used by resctrl to determine if the 'thread_throttle_mode' file should be
visible. This allows the RFTYPE flags to be private to resctrl.
Add thread_throttle_mode_init(), and use it to call resctrl_file_fflags_init()
from resctrl_init(). This avoids publishing an extra function between the
architecture and filesystem code.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-28-james.morse@arm.com
resctrl_arch_pseudo_lock_fn() has architecture specific behaviour,
and takes a struct rdtgroup as an argument.
After the filesystem code moves to /fs/, the definition of struct
rdtgroup will not be available to the architecture code.
The only reason resctrl_arch_pseudo_lock_fn() wants the rdtgroup is
for the CLOSID. Embed that in the pseudo_lock_region as a closid,
and move the definition of struct pseudo_lock_region to resctrl.h.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-27-james.morse@arm.com
prefetch_disable_bits is set by rdtgroup_locksetup_enter() from a value
provided by the architecture, but is largely read by other architecture
helpers.
Make resctrl_arch_get_prefetch_disable_bits() set prefetch_disable_bits so
that the variable can be isolated to the arch code, where the other arch-code
helpers can use its cached value.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-26-james.morse@arm.com
Pseudo-lock relies on knowledge of the micro-architecture to disable
prefetchers etc.
On arm64 these controls are typically secure only, meaning Linux can't access
them. Arm's cache-lockdown feature works in a very different way. Resctrl's
pseudo-lock isn't going to be used on arm64 platforms.
Add a Kconfig symbol that can be selected by the architecture. This enables or
disables building of the pseudo_lock.c file, and replaces the functions with
stubs. An additional IS_ENABLED() check is needed in rdtgroup_mode_write() so
that attempting to enable pseudo-lock reports an "Unknown or unsupported mode"
to user-space via the last_cmd_status file.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-25-james.morse@arm.com
resctrl's pseudo lock has some copy-to-cache and measurement functions that
are micro-architecture specific.
For example, pseudo_lock_fn() is not at all portable.
Label these 'resctrl_arch_' so they stay under /arch/x86. To expose these
functions to the filesystem code they need an entry in a header file, and
can't be marked static.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-24-james.morse@arm.com
The mbm_cfg_mask field lists the bits that user-space can set when configuring
an event. This value is output via the last_cmd_status file.
Once the filesystem parts of resctrl are moved to live in /fs/, the struct
rdt_hw_resource is inaccessible to the filesystem code. Because this value is
output to user-space, it has to be accessible to the filesystem code.
Move it to struct rdt_resource.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-23-james.morse@arm.com
mba_mbps_default_event is initialised based on whether mbm_local or mbm_total
is supported. In the case of both, it is initialised to mbm_local.
mba_mbps_default_event is initialised in core.c's get_rdt_mon_resources(),
while all the readers are in rdtgroup.c.
After this code is split into architecture-specific and filesystem code,
get_rdt_mon_resources() remains part of the architecture code, which would
mean mba_mbps_default_event has to be exposed by the filesystem code.
Move the initialisation to the filesystem's resctrl_mon_resource_init().
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-22-james.morse@arm.com
mon_event_config_{read,write}() are called via IPI and access model specific
registers to do their work.
To support another architecture, this needs abstracting.
Rename mon_event_config_{read,write}() to have a "resctrl_arch_" prefix, and
move their struct mon_config_info parameter into <linux/resctrl.h>. This
allows another architecture to supply an implementation of these.
As struct mon_config_info is now exposed globally, give it a 'resctrl_'
prefix. MPAM systems need access to the domain to do this work, so add the
resource and domain to struct resctrl_mon_config_info.
Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-21-james.morse@arm.com
When BMEC is supported the resctrl event can be configured in a number of
ways. This depends on architecture support. rdt_get_mon_l3_config() modifies
the struct mon_evt and calls resctrl_file_fflags_init() to create the files
that allow the configuration.
Splitting this into separate architecture and filesystem parts would require
the struct mon_evt and resctrl_file_fflags_init() to be exposed.
Instead, add resctrl_arch_is_evt_configurable(), and use this from
resctrl_mon_resource_init() to initialise struct mon_evt and call
resctrl_file_fflags_init().
resctrl_arch_is_evt_configurable() calls rdt_cpu_has() so it doesn't obviously
benefit from being inlined. Putting it in core.c will allow rdt_cpu_has() to
eventually become static.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-20-james.morse@arm.com
The architecture specific parts of resctrl provide helpers like
is_mbm_total_enabled() and is_mbm_local_enabled() to hide accesses to the
rdt_mon_features bitmap.
Exposing a group of helpers between the architecture and filesystem code is
preferable to a single unsigned-long like rdt_mon_features. Helpers can be more
readable and have a well defined behaviour, while allowing architectures to hide
more complex behaviour.
Once the filesystem parts of resctrl are moved, these existing helpers can no
longer live in internal.h. Move them to include/linux/resctrl.h. Once these are
exposed to the wider kernel, they should have a 'resctrl_arch_' prefix, to fit
the rest of the arch<->fs interface.
Move and rename the helpers that touch rdt_mon_features directly. is_mbm_event()
and is_mbm_enabled() are only called from rdtgroup.c, so can be moved into that
file.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-19-james.morse@arm.com
The for_each_*_rdt_resource() helpers walk the architecture's array of
structures, using the resctrl visible part as an iterator. These became
over-complex when the structures were split into a filesystem and
architecture-specific struct. This approach avoided the need to touch every
call site, and was done before there was a helper to retrieve a resource by
rid.
Once the filesystem parts of resctrl are moved to /fs/, both the arch's
resource array, and the definition of those structures is no longer
accessible. To support resctrl, each architecture would have to provide
equally complex macros.
Rewrite the macro to make use of resctrl_arch_get_resource(), and move these
to include/linux/resctrl.h so existing x86 arch code continues to use them.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-18-james.morse@arm.com
rdt_get_mon_l3_config() is called from the arch's resctrl_arch_late_init(),
and initialises both architecture specific fields, such as hw_res->mon_scale
and resctrl filesystem fields by calling dom_data_init().
To separate the filesystem and architecture parts of resctrl, this function
needs splitting up.
Add resctrl_mon_resource_init() to do the filesystem specific work, and call
it from resctrl_init(). This runs later, but is still before the filesystem is
mounted and the rmid_ptrs[] array can be used.
[ bp: Massage commit message. ]
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-17-james.morse@arm.com
rdt_put_mon_l3_config() is called via the architecture's resctrl_arch_exit()
call, and appears to free the rmid_ptrs[] and closid_num_dirty_rmid[] arrays.
In reality this code is marked __exit, and is removed by the linker as resctrl
can't be built as a module.
To separate the filesystem and architecture parts of resctrl, this free()ing
work needs to be triggered by the filesystem, as these structures belong to
the filesystem code.
Rename rdt_put_mon_l3_config() to resctrl_mon_resource_exit() and call it from
resctrl_exit(). The kfree() is currently dependent on r->mon_capable.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-16-james.morse@arm.com
On umount(), resctrl resets each resource back to its default configuration.
It only ever does this for all resources in one go.
reset_all_ctrls() is architecture specific as it works with struct
rdt_hw_resource.
Make reset_all_ctrls() an arch helper that resets one resource.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-15-james.morse@arm.com
When resctrl is fully factored into core and per-arch code, each arch will
need to use some resctrl common definitions in order to define its own
specializations and helpers. Following conventional practice, it would be
desirable to put the dependent arch definitions in an <asm/resctrl.h> header
that is included by the common <linux/resctrl.h> header. However, this can
make it awkward to avoid a circular dependency between <linux/resctrl.h> and
the arch header.
To avoid such dependencies, move the affected common types and constants into
a new header that does not need to depend on <linux/resctrl.h> or on the arch
headers.
The same logic applies to the monitor-configuration defines; move these too.
Some kind of enumeration for events is needed between the filesystem and
architecture code. Take the x86 definition, as it's convenient for x86.
The definition of enum resctrl_event_id is needed to allow the architecture
code to define resctrl_arch_mon_ctx_alloc() and resctrl_arch_mon_ctx_free().
The definition of enum resctrl_res_level is needed to allow the architecture
code to define resctrl_arch_set_cdp_enabled() and
resctrl_arch_get_cdp_enabled().
The bits for mbm_local_bytes_config et al are ABI, and must be the same on all
architectures. These are documented in Documentation/arch/x86/resctrl.rst
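A sketch of the kind of definitions that move into the new header, based on
the existing x86 values (illustrative):

  enum resctrl_res_level {
          RDT_RESOURCE_L3,
          RDT_RESOURCE_L2,
          RDT_RESOURCE_MBA,
          RDT_RESOURCE_SMBA,
          /* Must be the last */
          RDT_NUM_RESOURCES,
  };

  enum resctrl_event_id {
          QOS_L3_OCCUP_EVENT_ID     = 0x01,
          QOS_L3_MBM_TOTAL_EVENT_ID = 0x02,
          QOS_L3_MBM_LOCAL_EVENT_ID = 0x03,
  };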
The maintainers entry for these headers was missed when resctrl.h was created.
Add a wildcard entry to match both resctrl.h and resctrl_types.h.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-14-james.morse@arm.com
rdt_find_domain() finds a domain given a resource and a cache-id. This is
used by both the architecture code and the filesystem code.
After the filesystem code moves to live in /fs/, this helper is either
duplicated by all architectures, or needs exposing by the filesystem code.
Add the declaration to the global header file. As it's now globally visible,
and has only a handful of callers, swap the 'rdt' for 'resctrl'. Move the
function to live with its caller in ctrlmondata.c as the filesystem code will
not have anything corresponding to core.c.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-13-james.morse@arm.com
rdtgroup_init() needs exposing to the rest of the kernel so that arch code can
call it once it lives in core code. As this is one of the few functions
exposed, rename it to have "resctrl" in the name. The same goes for the exit
call.
Rename x86's arch code init functions for RDT to have an arch prefix to make
it clear these are part of the architecture code.
Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-12-james.morse@arm.com
update_cpu_closid_rmid() takes a struct rdtgroup as an argument, which it uses
to update the local CPU's default pqr values. This is a problem once the
resctrl parts move out to /fs/, as the arch code cannot poke around inside
struct rdtgroup.
Rename update_cpu_closid_rmid() as resctrl_arch_sync_cpus_defaults() to be
used as the target of an IPI, and pass the effective CLOSID and RMID in a new
struct.
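A sketch of the new IPI argument (struct and field names illustrative); the
IPI target writes these into the CPU's per-CPU defaults without needing
struct rdtgroup:

  struct resctrl_cpu_defaults {
          u32 closid;
          u32 rmid;
  };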
Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-11-james.morse@arm.com
rdtgroup_rmdir_ctrl() and rdtgroup_rmdir_mon() set the per-CPU pqr_state for
CPUs that were part of the rmdir()'d group.
Another architecture might not have a 'pqr_state', its hardware may need the
values in a different format. MPAM's equivalent of RMID values are not unique,
and always need the CLOSID to be provided too.
There is only one caller that modifies a single value (rdtgroup_rmdir_mon()).
MPAM always needs both CLOSID and RMID for the hardware value as these are
written to the same system register.
As rdtgroup_rmdir_mon() has the CLOSID on hand, only provide a helper to set
both values. These values are read by __resctrl_sched_in(), but may be written
by a different CPU without any locking; add READ_ONCE()/WRITE_ONCE() to avoid
torn values.
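A sketch of the helper shape this describes, assuming x86's per-CPU pqr_state
(the helper name is illustrative):

  void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 rmid)
  {
          /* Written without locking; readers use READ_ONCE() */
          WRITE_ONCE(per_cpu(pqr_state, cpu).default_closid, closid);
          WRITE_ONCE(per_cpu(pqr_state, cpu).default_rmid, rmid);
  }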
Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-10-james.morse@arm.com
The struct rdt_resource default_ctrl is used by both the architecture code for
resetting the hardware controls, and sometimes by the filesystem code as the
default value for the schema, unless the bandwidth software controller is in
use.
Having the default exposed by the architecture code causes unnecessary
duplication for each architecture as the default value must be specified, but
can be derived from other schema properties. Now that the maximum bandwidth is
explicitly described, resctrl can derive the default value from the schema
format and the other resource properties.
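A sketch of how the default can be derived (the schema-format field and enum
names here are illustrative, not necessarily those used by the series):

  static inline u32 resctrl_get_default_ctrl(struct rdt_resource *r)
  {
          switch (r->schema_fmt) {
          case RESCTRL_SCHEMA_BITMAP:
                  return BIT_MASK(r->cache.cbm_len) - 1;
          case RESCTRL_SCHEMA_RANGE:
                  return r->membw.max_bw;
          }
          return WARN_ON_ONCE(1);
  }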
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-9-james.morse@arm.com
__rdt_get_mem_config_amd() and __get_mem_config_intel() both use the
default_ctrl property as a maximum value. This is because the MBA schema works
differently between these platforms. Doing this complicates determining
whether the default_ctrl property belongs to the arch code, or can be derived
from the schema format.
Deriving the maximum or default value from the schema format would avoid the
architecture code having to tell resctrl such obvious things as the maximum
percentage is 100, and the maximum bitmap is all ones.
Maximum bandwidth is always going to vary per platform. Add max_bw as
a special case. This is currently used for the maximum MBA percentage on Intel
platforms, but can be removed from the architecture code if 'percentage'
becomes a schema format resctrl supports directly.
This value isn't needed for other schema formats.
This will allow the default_ctrl to be generated from the schema properties
when it is needed.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-8-james.morse@arm.com
The resctrl architecture code provides a data_width for the controls of each
resource. This is used to zero pad all control values in the schemata file so
they appear in columns. The same is done with the resource names to complete
the visual effect. e.g.
| SMBA:0=2048
| L3:0=00ff
AMD platforms discover their maximum bandwidth for the MB resource from
firmware, but hard-code the data_width to 4. If the maximum bandwidth requires
more digits, the tabular format is silently broken. This is also broken when
the mba_MBps mount option is used, as the field width isn't updated. If new
schemata are added, resctrl will need to be able to determine the maximum
width.
The benefit of this pretty-printing is questionable.
Instead of handling runtime discovery of the data_width for AMD platforms,
remove the feature. These fields are always zero padded, so removing the
padding should be harmless as long as the whole field has been treated as a
number. In the above example, this would now look like this:
| SMBA:0=2048
| L3:0=ff
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-7-james.morse@arm.com
Resctrl's architecture code gets to specify a format string that is
used when printing schema entries. This is expected to be one of two
values that the filesystem code supports.
Setting this format string allows the architecture code to change
the ABI resctrl presents to user-space.
Instead, use the schema format enum to choose which format string to
use.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-6-james.morse@arm.com
Resctrl's architecture code gets to specify a function pointer that is used
when parsing schema entries. This is expected to be one of two helpers from
the filesystem code.
Setting this function pointer allows the architecture code to change the ABI
resctrl presents to user-space, and forces resctrl to expose these helpers.
Instead, add a schema format enum to choose which schema parser to use. This
allows the helpers to be made static and the structs used for passing
arguments moved out of shared headers.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-5-james.morse@arm.com
The resctrl arch code specifies whether a resource controls a cache or memory
using the fflags field. This field is then used by resctrl to determine which
files should be exposed in the filesystem.
Allowing the architecture to pick this value means the RFTYPE_ flags have to
be in a shared header, and allows an architecture to create a combination that
resctrl does not support.
Remove the fflags field, and pick the value based on the resource id.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-4-james.morse@arm.com
Resctrl occasionally wants to know something about a specific resource; in
these cases it reaches into the arch code's rdt_resources_all[] array.
Once the filesystem parts of resctrl are moved to /fs/, this means it would
need visibility of the architecture specific struct rdt_hw_resource
definition, and the array of all resources. All architectures would also need
an r_resctrl member in this struct.
Instead, abstract this via a helper to allow architectures to do different
things here. Move the level enum to the resctrl header and add a helper to
retrieve the struct rdt_resource by 'rid'.
resctrl_arch_get_resource() should not return NULL for any value in the enum;
it may instead return a dummy resource that is !alloc_enabled && !mon_enabled.
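On x86 this can be a thin wrapper over the existing array, e.g. (sketch):

  struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l)
  {
          if (WARN_ON_ONCE(l >= RDT_NUM_RESOURCES))
                  return NULL;

          return &rdt_resources_all[l].r_resctrl;
  }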
Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-3-james.morse@arm.com
Commit
6eac36bb9e ("x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid")
added logic that causes resctrl to search for the CLOSID with the fewest dirty
cache lines when creating a new control group, if requested by the arch code.
This depends on the values read from the llc_occupancy counters. The logic is
applicable to architectures where the CLOSID effectively forms part of the
monitoring identifier, and which therefore do not allow complete freedom to
choose an unused monitoring identifier for a given CLOSID.
This support missed that some platforms may not have these counters. This
causes a NULL pointer dereference when creating a new control group as the
array was not allocated by dom_data_init().
As this feature isn't necessary on platforms that don't have cache occupancy
monitors, add this to the check that occurs when a new control group is
allocated.
Fixes: 6eac36bb9e ("x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-2-james.morse@arm.com
Instead of using a custom script to detect and fail on undefined
references, use --no-undefined for all VDSO linker invocations.
Drop the now unused checkundef.sh script.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250306-vdso-checkundef-v2-1-a26cc315fd73@linutronix.de
When extra warnings are enabled, the cc_mask definition in <asm/coco.h>
causes a build failure with GCC:
arch/x86/include/asm/coco.h:28:18: error: 'cc_mask' defined but not used [-Werror=unused-const-variable=]
28 | static const u64 cc_mask = 0;
Add a cc_get_mask() function mirroring cc_set_mask() for the one
user of the variable outside of the CoCo implementation.
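The helper can be as simple as (sketch):

  static inline u64 cc_get_mask(void)
  {
          return cc_mask;
  }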
Fixes: a0a8d15a79 ("x86/tdx: Preserve shared bit on mprotect()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250310131114.2635497-1-arnd@kernel.org
--
v2: use an inline helper instead of a __maybe_unused annotation.
Merge 6.14-rc6 into driver-core-next
We need the driver core fix in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For handling the 0 <= len < sizeof(unsigned long) bytes left at the end,
do a 4-2-1 step-down instead of a byte-at-a-time loop. This allows
taking advantage of wider CRC instructions. Note that crc32c-3way.S
already uses this same optimization too.
crc_kunit shows an improvement of about 25% for len=127.
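A sketch of the 4-2-1 step-down for the tail bytes, written with the SSE4.2
CRC32C builtins (not the kernel's exact code; requires -msse4.2):

  static u32 crc32c_tail(u32 crc, const u8 *p, size_t len)
  {
          /* len is the 0..sizeof(unsigned long)-1 bytes left over */
          if (len & 4) {
                  crc = __builtin_ia32_crc32si(crc, *(const u32 *)p);
                  p += 4;
          }
          if (len & 2) {
                  crc = __builtin_ia32_crc32hi(crc, *(const u16 *)p);
                  p += 2;
          }
          if (len & 1)
                  crc = __builtin_ia32_crc32qi(crc, *p);
          return crc;
  }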
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250304213216.108925-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Currently, load_microcode_amd() iterates over all NUMA nodes, retrieves their
CPU masks and unconditionally accesses per-CPU data for the first CPU of each
mask.
According to Documentation/admin-guide/mm/numaperf.rst:
"Some memory may share the same node as a CPU, and others are provided as
memory only nodes."
Therefore, some node CPU masks may be empty and wouldn't have a "first CPU".
On a machine with far memory (and therefore CPU-less NUMA nodes):
- cpumask_of_node(nid) is 0
- cpumask_first(0) is CONFIG_NR_CPUS
- cpu_data(CONFIG_NR_CPUS) accesses the cpu_info per-CPU array at an
index that is 1 out of bounds
This does not have any security implications since flashing microcode is
a privileged operation but I believe this has reliability implications by
potentially corrupting memory while flashing a microcode update.
When booting with CONFIG_UBSAN_BOUNDS=y on an AMD machine that flashes
a microcode update, I get the following splat:
UBSAN: array-index-out-of-bounds in arch/x86/kernel/cpu/microcode/amd.c:X:Y
index 512 is out of range for type 'unsigned long[512]'
[...]
Call Trace:
dump_stack
__ubsan_handle_out_of_bounds
load_microcode_amd
request_microcode_amd
reload_store
kernfs_fop_write_iter
vfs_write
ksys_write
do_syscall_64
entry_SYSCALL_64_after_hwframe
Change the loop to go over only NUMA nodes which have CPUs before determining
whether the first CPU on the respective node needs microcode update.
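A sketch of the fixed iteration (illustrative):

  for_each_node_with_cpus(nid) {
          cpu = cpumask_first(cpumask_of_node(nid));
          c = &cpu_data(cpu);
          /* ... check whether this node's first CPU needs the update ... */
  }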
[ bp: Massage commit message, fix typo. ]
Fixes: 7ff6edf4fe ("x86/microcode/AMD: Fix mixed steppings support")
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250310144243.861978-1-revest@chromium.org
The kernel requires X86_FEATURE_SGX_LC to be able to create SGX enclaves,
not just X86_FEATURE_SGX.
There is quite a lot of hardware which has X86_FEATURE_SGX but not
X86_FEATURE_SGX_LC. A kernel running on such hardware does not create
the /dev/sgx_enclave file, and does so silently.
Explicitly warn if X86_FEATURE_SGX_LC is not enabled to properly notify
users that the kernel disabled the SGX driver.
The X86_FEATURE_SGX_LC, a.k.a. SGX Launch Control, is a CPU feature
that enables LE (Launch Enclave) hash MSRs to be writable (with
additional opt-in required in the 'feature control' MSR) when running
enclaves, i.e. using a custom root key rather than the Intel proprietary
key for enclave signing.
I've hit this issue myself and have spent some time researching where
my /dev/sgx_enclave file went on SGX-enabled hardware.
Related links:
https://github.com/intel/linux-sgx/issues/837
https://patchwork.kernel.org/project/platform-driver-x86/patch/20180827185507.17087-3-jarkko.sakkinen@linux.intel.com/
[ mingo: Made the error message a bit more verbose, and added other cases
where the kernel fails to create the /dev/sgx_enclave device node. ]
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250309172215.21777-2-vdronov@redhat.com
__module_address() can be invoked within an RCU section; there is no
requirement to have preemption disabled.
Replace the preempt_disable() section around __module_address() with
RCU.
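A sketch of the pattern after the change (not the exact diff):

  rcu_read_lock();
  mod = __module_address(addr);
  /* ... inspect mod, if any, while inside the RCU read-side section ... */
  rcu_read_unlock();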
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250108090457.512198-23-bigeasy@linutronix.de
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
The hypercall in hv_mark_gpa_visibility() is invoked with an input
argument and an output argument. The output argument ostensibly returns
the number of pages that were processed. But in fact, the hypercall does
not provide any output, so the output argument is spurious.
The spurious argument is harmless because Hyper-V ignores it, but in the
interest of correctness and to avoid the potential for future problems,
remove it.
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Link: https://lore.kernel.org/r/20250226200612.2062-2-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20250226200612.2062-2-mhklinux@outlook.com>
* Fix a couple of bugs affecting pKVM's PSCI relay implementation
when running in the hVHE mode, resulting in the host being entered
with the MMU in an unknown state, and EL2 being in the wrong mode.
x86:
* Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow.
* Ensure DEBUGCTL is context switched on AMD to avoid running the guest with
the host's value, which can lead to unexpected bus lock #DBs.
* Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly
emulate BTF. KVM's lack of context switching has meant BTF has always been
broken to some extent.
* Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest
can enable DebugSwap without KVM's knowledge.
* Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO
memory" phase without actually generating a write-protection fault.
* Fix a printf() goof in the SEV smoke test that causes build failures with
-Werror.
* Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2
isn't supported by KVM.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmfNSeUUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNKngf/cLgQAT9AF4nFqcwh5b5uucKHVJ8W
uTiGlWqLAf2UN53L63eZ/7vKQWGQYkOTFvormR14Jam6IYtytsZw1xLBH4fGtUyB
qVjk0EPzaKGqn3LrgyneQNCXdyxJv7EBVBgoOKH0pvOksoW2E5ZizhhtRFtL7nCE
Yk8FQKpP0mIBk04RMsvzJVEFKIb4OZgJadWo0gryg1oF2aAv7mxQjyqUWsBDsb3q
99c0ElSBfV39FeT8xeok4k7S5jbBWii2KiaH72ZsNiBu0rYmEuLwIoygCNNWL9Wu
FPdQ+r//YrzfCJSXwGPfdUaRaF4p2642S6oiXQuusNNUmhK6/MRo3mZo8A==
=XQHm
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"arm64:
- Fix a couple of bugs affecting pKVM's PSCI relay implementation
when running in the hVHE mode, resulting in the host being entered
with the MMU in an unknown state, and EL2 being in the wrong mode
x86:
- Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow
- Ensure DEBUGCTL is context switched on AMD to avoid running the
guest with the host's value, which can lead to unexpected bus lock
#DBs
- Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't
properly emulate BTF. KVM's lack of context switching has meant BTF
has always been broken to some extent
- Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as
the guest can enable DebugSwap without KVM's knowledge
- Fix a bug in mmu_stress_tests where a vCPU could finish the "writes
to RO memory" phase without actually generating a write-protection
fault
- Fix a printf() goof in the SEV smoke test that causes build
failures with -Werror
- Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when
PERFMON_V2 isn't supported by KVM"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Explicitly zero EAX and EBX when PERFMON_V2 isn't supported by KVM
KVM: selftests: Fix printf() format goof in SEV smoke test
KVM: selftests: Ensure all vCPUs hit -EFAULT during initial RO stage
KVM: SVM: Don't rely on DebugSwap to restore host DR0..DR3
KVM: SVM: Save host DR masks on CPUs with DebugSwap
KVM: arm64: Initialize SCTLR_EL1 in __kvm_hyp_init_cpu()
KVM: arm64: Initialize HCR_EL2.E2H early
KVM: x86: Snapshot the host's DEBUGCTL after disabling IRQs
KVM: SVM: Manually context switch DEBUGCTL if LBR virtualization is disabled
KVM: x86: Snapshot the host's DEBUGCTL in common x86
KVM: SVM: Suppress DEBUGCTL.BTF on AMD
KVM: SVM: Drop DEBUGCTL[5:2] from guest's effective value
KVM: selftests: Assert that STI blocking isn't set after event injection
KVM: SVM: Set RFLAGS.IF=1 in C code, to get VMRUN out of the STI shadow
- Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow.
- Ensure DEBUGCTL is context switched on AMD to avoid running the guest with
the host's value, which can lead to unexpected bus lock #DBs.
- Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly
emulate BTF. KVM's lack of context switching has meant BTF has always been
broken to some extent.
- Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest
can enable DebugSwap without KVM's knowledge.
- Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO
memory" phase without actually generating a write-protection fault.
- Fix a printf() goof in the SEV smoke test that causes build failures with
-Werror.
- Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2
isn't supported by KVM.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmfLlhUACgkQOlYIJqCj
N/0x7w/+MhqJdHbshL7Gzw+rcXwCROiCkqsxFP+YoTXte8uaHS5CEfcMYjE8SuGp
KBpgLo4Lj1dVTXiCjemlY5sn6CDiuSs74X8A88ksuu5hVsFByJUgyWU9iw8J/crZ
B2vj8huhqa8OCEPe5JujWfnfyAkKE5tUA4GFi73vhHMcftTNj+ftxT33/Pfg7y7M
xOvWFWS6ZshKrouRKzI7ZFEYLwp0lr4U3dzO5rCRAd5J4MSBWRx6Dx2um5dyEYKJ
xgwl4ylM4S/+78u1+0nQnToM0UWHJ3e7x8nze6UXYTZIrBr/lSeKlbhOPnEWJcJB
Eemnur9ORI2BRPUReqBKluCZsSK+E5B/HPCVt5cxtuRIuUOD+kW17LPgnPyE4Sso
eVt+XAvQc7EjrpWDSHr3ZQZZM89l9zHhuSAQ0npO6y71s0FzEVZQoDamNmOLAPjH
Qg+qhBV2l6pyfqhqiLzADasYLOl57cJsfiMjM331ALLqAn57jzd+B8c4hdB2Xg4s
KPuy8w8uBaY9zpd9YDBLLr7JJVs35KexNZMjT2vqBYXcScyLgmAuSQXy3hub6Mzn
gI5ZXIKG8eO9v2jejfClI6/OEdtEwgSGEVwuBKB16pMrIxqpguMTMTWLVRn5G+oo
qA8anmKaac62GaB66JE/Wjy069OPIGYnHSU2nal0Tej6kG0xv6E=
=as6u
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-fixes-6.14-rcN.2' of https://github.com/kvm-x86/linux into HEAD
KVM x86 fixes for 6.14-rcN #2
- Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow.
- Ensure DEBUGCTL is context switched on AMD to avoid running the guest with
the host's value, which can lead to unexpected bus lock #DBs.
- Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly
emulate BTF. KVM's lack of context switching has meant BTF has always been
broken to some extent.
- Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the guest
can enable DebugSwap without KVM's knowledge.
- Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to RO
memory" phase without actually generating a write-protection fault.
- Fix a printf() goof in the SEV smoke test that causes build failures with
-Werror.
- Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when PERFMON_V2
isn't supported by KVM.
The test for the changeability of the AC and ID EFLAGS bits is used to
distinguish between i386 and i486 processors (AC) and to test
for CPUID instruction support (ID).
Skip these tests on x86_64 processors, as they always support CPUID.
Also change the return type of has_eflag() to bool.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20250307091022.181136-1-ubizjak@gmail.com
The use of -ffreestanding is a leftover that is only needed for certain
versions of Clang. Adjust this to be Clang-only. A later patch will make
this a versioned check.
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20250308042929.1753543-1-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be array of VDSO clocks. At the moment,
vdso_clock is simply a define which maps vdso_clock to vdso_time_data.
To prepare for the rework of the data structures, replace the struct
vdso_time_data pointer with a struct vdso_clock pointer where applicable.
No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-15-c1b5c69a166f@linutronix.de
Andy reported the following build warning from head_32.S:
In file included from arch/x86/kernel/head_32.S:29:
arch/x86/include/asm/pgtable_32.h:59:5: error: "PTRS_PER_PMD" is not defined, evaluates to 0 [-Werror=undef]
59 | #if PTRS_PER_PMD > 1
The reason is that on 2-level i386 paging the folded in PMD's
PTRS_PER_PMD constant is not defined in assembly headers,
only in generic MM C headers.
Instead of trying to fish out the definition from the generic
headers, just define it - it even has a comment for it already...
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/Z8oa8AUVyi2HWfo9@gmail.com
Apart from some sanity checks on the size of setup.bin, the only
remaining task carried out by the arch/x86/boot/tools/build.c build tool
is generating the CRC-32 checksum of the bzImage. This feature was added
in commit
7d6e737c8d ("x86: add a crc32 checksum to the kernel image.")
without any motivation (or any commit log text, for that matter). This
checksum is not verified by any known bootloader, and given that
a) the checksum of the entire bzImage is reported by most tools (zlib,
rhash) as 0xffffffff and not 0x0 as documented,
b) the checksum is corrupted when the image is signed for secure boot,
which means that no distro ships x86 images with valid CRCs,
it seems quite unlikely that this checksum is being used, so let's just
drop it, along with the tool that generates it.
Instead, use simple file concatenation and truncation to combine the two
pieces into bzImage, and replace the checks on the size of the setup
block with a couple of ASSERT()s in the linker script.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ian Campbell <ijc@hellion.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250307164801.885261-2-ardb+git@google.com
Compared to the SNP Guest Request, the "Extended" version adds data pages for
receiving certificates. If not enough pages provided, the HV can report to the
VM how much is needed so the VM can reallocate and repeat.
Commit
ae596615d9 ("virt: sev-guest: Reduce the scope of SNP command mutex")
moved handling of the allocated/desired page count out of the scope of said
mutex and created the possibility of a race (multiple instances trying to
trigger an Extended request in a VM), as there is just one instance of
snp_msg_desc per /dev/sev-guest and no locking other than snp_cmd_mutex.
Fix the issue by moving the data blob/size and the GHCB input struct
(snp_req_data) into snp_guest_req which is allocated on stack now and accessed
by the GHCB caller under that mutex.
Stop allocating SEV_FW_BLOB_MAX_SIZE in snp_msg_alloc() as only one of four
callers needs it. Free the received blob in get_ext_report() right after it is
copied to the userspace. Possible future users of snp_send_guest_request() are
likely to have different ideas about the buffer size anyways.
Fixes: ae596615d9 ("virt: sev-guest: Reduce the scope of SNP command mutex")
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250307013700.437505-3-aik@amd.com
Xen doesn't offer MSR_FAM10H_MMIO_CONF_BASE to all guests. This results
in the following warning:
unchecked MSR access error: RDMSR from 0xc0010058 at rIP: 0xffffffff8101d19f (xen_do_read_msr+0x7f/0xa0)
Call Trace:
xen_read_msr+0x1e/0x30
amd_get_mmconfig_range+0x2b/0x80
quirk_amd_mmconfig_area+0x28/0x100
pnp_fixup_device+0x39/0x50
__pnp_add_device+0xf/0x150
pnp_add_device+0x3d/0x100
pnpacpi_add_device_handler+0x1f9/0x280
acpi_ns_get_device_callback+0x104/0x1c0
acpi_ns_walk_namespace+0x1d0/0x260
acpi_get_devices+0x8a/0xb0
pnpacpi_init+0x50/0x80
do_one_initcall+0x46/0x2e0
kernel_init_freeable+0x1da/0x2f0
kernel_init+0x16/0x1b0
ret_from_fork+0x30/0x50
ret_from_fork_asm+0x1b/0x30
based on quirks for a "PNP0c01" device. Treating MMCFG as disabled is the
right course of action, so no change is needed there.
This was most likely exposed by fixing the Xen MSR accessors to not be
silently-safe.
Fixes: 3fac3734c4 ("xen/pv: support selecting safe/unsafe msr accesses")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250307002846.3026685-1-andrew.cooper3@citrix.com
If the warning mode is used with the mitigation disabled, then on each
CPU where a split lock occurs, detection will be disabled in order to
make progress, and delayed work will be scheduled to re-enable detection
afterwards.
Now it turns out that all CPUs use one global delayed work structure.
As a result, if a split lock occurs on several CPUs at the same time
(within 2 jiffies), only one CPU will schedule the delayed work; the
rest will not.
The return value of schedule_delayed_work_on() would have shown this,
but it is not checked in the code.
A diagram that can help to understand the bug reproduction:
- sld_update_msr() enables/disables SLD on both CPUs on the same core
- schedule_delayed_work_on() internally checks WORK_STRUCT_PENDING_BIT.
If a work has the 'pending' status, then schedule_delayed_work_on()
will return an error code and, most importantly, the work will not
be placed in the workqueue.
Let's say we have a multicore system on which split_lock_mitigate=0 and
a multithreaded application is running that triggers split locks from
multiple threads. Due to the fact that sld_update_msr() affects the
entire core (both CPUs), we will consider 2 CPUs from different cores.
Let the 2 threads of this application be scheduled to CPU 0 (core 0)
and to CPU 2 (core 1), then:
|                                 ||                                   |
|          CPU 0 (core 0)         ||          CPU 2 (core 1)           |
|_________________________________||___________________________________|
|                                 ||                                   |
| 1) SPLIT LOCK occurred          ||                                   |
|                                 ||                                   |
| 2) split_lock_warn()            ||                                   |
|                                 ||                                   |
| 3) sysctl_sld_mitigate == 0     ||                                   |
|    (work = &sl_reenable)        ||                                   |
|                                 ||                                   |
| 4) schedule_delayed_work_on()   ||                                   |
|    (reenable will be called     ||                                   |
|     after 2 jiffies on CPU 0)   ||                                   |
|                                 ||                                   |
| 5) disable SLD for core 0       ||                                   |
|                                 ||                                   |
|    -------------------------    ||                                   |
|                                 ||                                   |
|                                 || 6) SPLIT LOCK occurred            |
|                                 ||                                   |
|                                 || 7) split_lock_warn()              |
|                                 ||                                   |
|                                 || 8) sysctl_sld_mitigate == 0       |
|                                 ||    (work = &sl_reenable,          |
|                                 ||     the same address as in 3) )   |
|                                 ||                                   |
|            2 jiffies            || 9) schedule_delayed_work_on()     |
|                                 ||    fails because the work is in   |
|                                 ||    the pending state since 4).    |
|                                 ||    The work wasn't placed in the  |
|                                 ||    workqueue. reenable won't be   |
|                                 ||    called on CPU 2                |
|                                 ||                                   |
|                                 || 10) disable SLD for core 1        |
|                                 ||                                   |
|                                 ||     From now on SLD will          |
|                                 ||     never be reenabled on core 1  |
|                                 ||                                   |
|    -------------------------    ||                                   |
|                                 ||                                   |
| 11) enable SLD for core 0 by    ||                                   |
|     __split_lock_reenable       ||                                   |
|                                 ||                                   |
If the application threads can be scheduled to all processor cores,
then over time there will be only one core left on which SLD is still
enabled and split locks can still be detected; on all other cores SLD
will remain disabled.
Most likely, this bug has not been noticed for so long because the
default value of sysctl_sld_mitigate is 1, and in that case a semaphore
is used that does not allow 2 different cores to have SLD disabled at
the same time, i.e. strictly only one work item is placed in the
workqueue.
In order to fix the warning mode with disabled mitigation mode,
delayed work has to be per-CPU. Implement it.
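A minimal sketch of the per-CPU approach (simplified, not the literal patch;
identifiers follow the names already used in the diagram above):

  #include <linux/cpu.h>
  #include <linux/percpu.h>
  #include <linux/workqueue.h>

  /* One delayed_work per CPU, so each CPU has its own 'pending' state. */
  static DEFINE_PER_CPU(struct delayed_work, sl_reenable);

  static void __split_lock_reenable(struct work_struct *work)
  {
          sld_update_msr(true);   /* re-enable split lock detection */
  }

  static int __init sl_reenable_init(void)
  {
          int cpu;

          for_each_possible_cpu(cpu)
                  INIT_DELAYED_WORK(per_cpu_ptr(&sl_reenable, cpu),
                                    __split_lock_reenable);
          return 0;
  }

  /* Called from the warn path: schedule this CPU's own work instance. */
  static void sl_schedule_reenable(int cpu)
  {
          schedule_delayed_work_on(cpu, per_cpu_ptr(&sl_reenable, cpu), 2);
  }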
Fixes: 727209376f ("x86/split_lock: Add sysctl to control the misery mode")
Signed-off-by: Maksim Davydov <davydov-max@yandex-team.ru>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250115131704.132609-1-davydov-max@yandex-team.ru
bts_ctx might not be allocated, for example if the CPU has X86_FEATURE_PTI,
but intel_bts_disable/enable_local() and intel_bts_interrupt() are called
unconditionally from intel_pmu_handle_irq() and crash on bts_ctx.
So check if bts_ctx is allocated when calling BTS functions.
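Sketched shape of the guard, following the changelog's description (the rest
of the function body is unchanged and elided):

  static void intel_bts_disable_local(void)
  {
          if (!bts_ctx)           /* BTS context was never allocated */
                  return;

          /* ... existing code operating on this_cpu_ptr(bts_ctx) ... */
  }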
Fixes: 3acfcefa79 ("perf/x86/intel/bts: Allocate bts_ctx only if necessary")
Reported-by: Jiri Olsa <olsajiri@gmail.com>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Suggested-by: Adrian Hunter <adrian.hunter@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250306051102.2642-1-lirongqing@baidu.com
The 5-level paging code parses the command line to look for the 'no5lvl'
string, and does so very early, before sanitize_boot_params() has been
called and has been given the opportunity to wipe bogus data from the
fields in boot_params that are not covered by struct setup_header, and
are therefore supposed to be initialized to zero by the bootloader.
This triggers an early boot crash when using syslinux-efi to boot a
recent kernel built with CONFIG_X86_5LEVEL=y and CONFIG_EFI_STUB=n, as
the 0xff padding that now fills the unused PE/COFF header is copied into
boot_params by the bootloader, and interpreted as the top half of the
command line pointer.
Fix this by sanitizing the boot_params before use. Note that there is no
harm in calling this more than once; subsequent invocations are able to
spot that the boot_params have already been cleaned up.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # v6.1+
Link: https://lore.kernel.org/r/20250306155915.342465-2-ardb+git@google.com
Closes: https://lore.kernel.org/all/202503041549.35913.ulrich.gemkow@ikr.uni-stuttgart.de
A lot of code optimization to avoid cases where call paths will end up calling
the same writes multiple times and needlessly caching variables. To accomplish
this some of the writes are now made into an atomically written "perf" variable.
Locking has been overhauled to ensure it only applies to the necessary functions.
Tracing has been adjusted to ensure trace events only are used right before
writing out to the hardware.
NOTE: This is a redo of amd-pstate-v6.15-2025-03-03 with a fixed Fixes tag.
-----BEGIN PGP SIGNATURE-----
iQJOBAABCgA4FiEECwtuSU6dXvs5GA2aLRkspiR3AnYFAmfJ8w8aHG1hcmlvLmxp
bW9uY2llbGxvQGFtZC5jb20ACgkQLRkspiR3AnaugQ/9GvheK3/lWweGF3f4ixdl
xS0XTj3DYQBxDvP2H35QRBHaUQQADJnJLjpl1HHy0x9Hem6m3oLZH3QpqKDLb1xX
QWzI2JJZ9+KPZRKRwlBnzENcHtBraKX4M/+vvhTo6LSbLfSz47cdZhX6r+9siwPR
ofWfusfBlEVPPLaul2Q+oOaQzvwy9Or2q2y3xICzo+Vo3QilSk3pekN8bd2CVd9y
L1t1sAeN1StiOLBRbdbh8B2oyJfCCmSI85P8+sEKKR58G1q42Dv4NRl7fgzuJAfO
4ATgX0RjNZcI9yLPmLr5PFmHJYvpHr29MAWKAo5a8VwMu1cBTv0FFkEboRvUgiXl
3+hjQJGkyCZJTxIWhnvKl6Ci00N5apmwc1iMeoa1A1xvvTVkXLmn8RrYIJm2xMCH
27SPJAQx+7e2cRcgjb3orz2Ymj+I9WoGy68/wdmnWM+ZxF1QaKguhM0lrIKFbu7K
6CzVnndVgtYt9kFi/SQ4m1e8zr6RnvPWTxqQxTCVQZBNahe3b7DgnrrNrB/NBU86
QotiTBIxQed6oqHAtL2mFRb9SdTGh9PE5xGcR94qlEGqVtu9KZIFqCDG1OgZKkR5
9yLLai7sUGfMCLuEsdM5hXvf4fOHaGSQ0Cxx2qYWC3DDnS651fMNKh/taBo8RmS3
r18loH2dKvBvOi2beCA5sMk=
=VkdW
-----END PGP SIGNATURE-----
Merge tag 'amd-pstate-v6.15-2025-03-06' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Merge amd-pstate updates for 6.15 (3/6/25) from Mario Limonciello:
"A lot of code optimization to avoid cases where call paths will end up
calling the same writes multiple times and needlessly caching variables.
To accomplish this some of the writes are now made into an atomically
written "perf" variable. Locking has been overhauled to ensure it only
applies to the necessary functions. Tracing has been adjusted to ensure
trace events only are used right before writing out to the hardware."
NOTE: This is a redo of amd-pstate-v6.15-2025-03-03 with a fixed Fixes tag.
* tag 'amd-pstate-v6.15-2025-03-06' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux: (29 commits)
cpufreq/amd-pstate: Drop actions in amd_pstate_epp_cpu_offline()
cpufreq/amd-pstate: Stop caching EPP
cpufreq/amd-pstate: Rework CPPC enabling
cpufreq/amd-pstate: Drop debug statements for policy setting
cpufreq/amd-pstate: Update cppc_req_cached for shared mem EPP writes
cpufreq/amd-pstate: Move all EPP tracing into *_update_perf and *_set_epp functions
cpufreq/amd-pstate: Cache CPPC request in shared mem case too
cpufreq/amd-pstate: Replace all AMD_CPPC_* macros with masks
cpufreq/amd-pstate-ut: Adjust variable scope
cpufreq/amd-pstate-ut: Run on all of the correct CPUs
cpufreq/amd-pstate-ut: Drop SUCCESS and FAIL enums
cpufreq/amd-pstate-ut: Allow lowest nonlinear and lowest to be the same
cpufreq/amd-pstate-ut: Use _free macro to free put policy
cpufreq/amd-pstate: Drop `cppc_cap1_cached`
cpufreq/amd-pstate: Overhaul locking
cpufreq/amd-pstate: Move perf values into a union
cpufreq/amd-pstate: Drop min and max cached frequencies
cpufreq/amd-pstate: Show a warning when a CPU fails to setup
cpufreq/amd-pstate: Invalidate cppc_req_cached during suspend
cpufreq/amd-pstate: Fix the clamping of perf values
...
Bitfield masks are easier to follow and less error prone.
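A generic illustration of the masks-plus-FIELD_PREP()/FIELD_GET() pattern (the
mask names and bit positions below are invented for the example and are not
the real CPPC register layout):

  #include <linux/bitfield.h>
  #include <linux/bits.h>
  #include <linux/types.h>

  #define EXAMPLE_MIN_PERF_MASK	GENMASK(7, 0)
  #define EXAMPLE_MAX_PERF_MASK	GENMASK(15, 8)

  static u64 pack_perf(u8 min_perf, u8 max_perf)
  {
          /* FIELD_PREP() shifts and masks; no hand-written "& 0xff" or "<< 8". */
          return FIELD_PREP(EXAMPLE_MIN_PERF_MASK, min_perf) |
                 FIELD_PREP(EXAMPLE_MAX_PERF_MASK, max_perf);
  }

  static u8 unpack_max_perf(u64 val)
  {
          return FIELD_GET(EXAMPLE_MAX_PERF_MASK, val);
  }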
Reviewed-by: Dhananjay Ugwekar <dhananjay.ugwekar@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Background:
===========
Currently kernel-mode FPU is not always usable in softirq context on
x86, since softirqs can nest inside a kernel-mode FPU section in task
context, and nested use of kernel-mode FPU is not supported.
Therefore, x86 SIMD-optimized code that can be called in softirq context
has to sometimes fall back to non-SIMD code. There are two options for
the fallback, both of which are pretty terrible:
(a) Use a scalar fallback. This can be 10-100x slower than vectorized
code because it cannot use specialized instructions like AES, SHA,
or carryless multiplication.
(b) Execute the request asynchronously using a kworker. In other
words, use the "crypto SIMD helper" in crypto/simd.c.
Currently most of the x86 en/decryption code (skcipher and aead
algorithms) uses option (b), since this avoids the slow scalar fallback
and it is easier to wire up. But option (b) is still really bad for its
own reasons:
- Punting the request to a kworker is bad for performance too.
- It forces the algorithm to be marked as asynchronous
(CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
users who request a synchronous algorithm. That's another huge
performance problem, which is especially unfortunate for users who
don't even do en/decryption in softirq context.
- It makes all en/decryption operations take a detour through
crypto/simd.c. That involves additional checks and an additional
indirect call, which slow down en/decryption for *everyone*.
Fortunately, the skcipher and aead APIs are only usable in task and
softirq context in the first place. Thus, if kernel-mode FPU were to be
reliably usable in softirq context, no fallback would be needed.
Indeed, other architectures such as arm, arm64, and riscv have already
done this.
Changes implemented:
====================
Therefore, this patch updates x86 accordingly to reliably support
kernel-mode FPU in softirqs.
This is done by just disabling softirq processing in kernel-mode FPU
sections (when hardirqs are not already disabled), as that prevents the
nesting that was problematic.
This will delay some softirqs slightly, but only ones that would have
otherwise been nested inside a task context kernel-mode FPU section.
Any such softirqs would have taken the slow fallback path before if they
tried to do any en/decryption. Now these softirqs will just run at the
end of the task context kernel-mode FPU section (since local_bh_enable()
runs pending softirqs) and will no longer take the slow fallback path.
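A minimal sketch of the locking change (function names are placeholders; the
real kernel_fpu_begin()/kernel_fpu_end() also save and restore the FPU
register state, which is omitted here):

  #include <linux/bottom_half.h>
  #include <linux/preempt.h>

  static void kfpu_begin_sketch(void)
  {
          if (in_hardirq())
                  preempt_disable();
          else
                  local_bh_disable();  /* no softirq can nest inside this section */
  }

  static void kfpu_end_sketch(void)
  {
          if (in_hardirq())
                  preempt_enable();
          else
                  local_bh_enable();   /* runs any softirqs delayed meanwhile */
  }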
Alternatives considered:
========================
- Make kernel-mode FPU sections fully preemptible. This would require
growing task_struct by another struct fpstate which is more than 2K.
- Make softirqs save/restore the kernel-mode FPU state to a per-CPU
struct fpstate when nested use is detected. Somewhat interesting, but
seems unnecessary when a simpler solution exists.
Performance results:
====================
I did some benchmarks with AES-XTS encryption of 16-byte messages (which is
unrealistically small, but this makes it easier to see the overhead of
kernel-mode FPU...). The baseline was 384 MB/s. Removing the use of
crypto/simd.c, which this work makes possible, increases it to 487 MB/s,
a +27% improvement in throughput.
CPU was AMD Ryzen 9 9950X (Zen 5). No debugging options were enabled.
[ mingo: Prettified the changelog and added performance results. ]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250304204954.3901-1-ebiggers@kernel.org
Jann reported a possible issue when trampoline_check_ip() returns an
address near the bottom of the address space that is allowed to
call into the syscall if uretprobes are not set up:
https://lore.kernel.org/bpf/202502081235.5A6F352985@keescook/T/#m9d416df341b8fbc11737dacbcd29f0054413cbbf
Though the mmap minimum address restrictions will typically prevent
creating mappings there, let's make sure uretprobe syscall checks
for that.
Fixes: ff474a78ce ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250212220433.3624297-1-jolsa@kernel.org
The following build warning highlights some unused code:
arch/x86/platform/olpc/olpc_dt.c: In function ‘olpc_dt_compatible_match’:
arch/x86/platform/olpc/olpc_dt.c:222:12: warning: variable ‘len’ set but not used [-Wunused-but-set-variable]
The compiler is right, the local variable 'len' is set but never used,
so remove it.
Fixes: a7a9bacb9a ("x86/platform/olpc: Use a correct version when making up a battery node")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241025074203.1921344-1-zengheng4@huawei.com
Add the __counted_by() compiler attribute to the flexible array member
buf to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
No functional changes intended.
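For illustration, the attribute ties a flexible array member to the field
holding its element count (struct and field names here are made up):

  #include <linux/compiler_attributes.h>
  #include <linux/types.h>

  struct example_msg {
          u32 len;                        /* number of bytes in buf[] */
          u8  buf[] __counted_by(len);    /* bounds visible to FORTIFY/UBSAN */
  };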
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250305123134.215577-2-thorsten.blum@linux.dev
The total size calculated for EPC can overflow u64 given the added up page
for SECS. Further, the total size calculated for shmem can overflow even
when the EPC size stays within limits of u64, given that it adds the extra
space for 128 byte PCMD structures (one for each page).
Address this by pre-evaluating the micro-architectural requirement of
SGX: the address space size must be a power of two. This is eventually
checked by ECREATE, but the pre-check has the additional benefit of
making sure that there is some space for additional data.
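A sketch of the kind of pre-check described above (illustrative helper, not
the actual driver code):

  #include <linux/errno.h>
  #include <linux/log2.h>
  #include <linux/types.h>

  static int check_enclave_size(u64 size)
  {
          /*
           * ECREATE requires the enclave address space size to be a power
           * of two; rejecting other sizes early also guarantees headroom
           * for the SECS page and the per-page PCMD metadata in shmem.
           */
          if (!size || !is_power_of_2(size))
                  return -EINVAL;

          return 0;
  }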
Fixes: 888d249117 ("x86/sgx: Add SGX_IOC_ENCLAVE_CREATE")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250305050006.43896-1-jarkko@kernel.org
Closes: https://lore.kernel.org/linux-sgx/c87e01a0-e7dd-4749-a348-0980d3444f04@stanley.mountain/
- Other good cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAme+3r8ACgkQEsHwGGHe
VUq7thAAvpWb51ep3pegXtUcTF6hmB5rKfFmWSd5ntzcBudhxZyFMAIkoNGtjA/m
lwamrDowhcGMLQ16aWzruYLeXJao6ROTJHq0hiTdxdBqyemW7Z1exPc0s4OolHCD
1hL3pP7Z3ubBEwm+2jI1M0KWJ0ZKv54SLQEaTxIStFqvVNq4/mI+5zpsFM+iElZ+
bc3D5ISyCwdEGvVHLKAD0W6J1cWNFygLaAM74JvkGkY3ByCR7HnUJa9lYx/b8PEk
uWeDu5IWY+gwKDor9NgW1mI66x3jiGExVcwhi30r1V/jX8v+H0+zO0s84hF8YFra
VmAoa2YkmOmO1zqrs/S2gGb+kX2nKlck6jRALVFFq8eXOP1BsmyvGk12XbVCeBYQ
kl6M89AS4nSF62k7PCJdzvkuRpVx1DSCbaTUB+fnzxP9BcOCKY1eobIJs/rBwDSx
0AcD178j2eb5uP8nACnBUshJwHYmssHXX2sXH/NbCvSqulYfTrnYs/EYQuf7M4g+
yGkBsH/T24RswobT9VzdO4EyzGaL1SPn6pQ5yKnDBSAAZNMfjaNNYFzBpSttzviR
8j4PBJ6u6mCGohEsMVXAmCYs2EvfSr+lxSFHMRV9Ym09SFnI2muhG5Rd29ast1Mz
H0YSRs9pRMHLDHorMVU6+AgFq+8XIt0f13zkIyM/S3b2ZLRVgiQ=
=2mhO
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_v6.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull AMD microcode loading fixes from Borislav Petkov:
- Load only sha256-signed microcode patch blobs
- Other good cleanups
* tag 'x86_microcode_for_v6.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Load only SHA256-checksummed patches
x86/microcode/AMD: Add get_patch_level()
x86/microcode/AMD: Get rid of the _load_microcode_amd() forward declaration
x86/microcode/AMD: Merge early_apply_microcode() into its single callsite
x86/microcode/AMD: Remove unused save_microcode_in_initrd_amd() declarations
x86/microcode/AMD: Remove ugly linebreak in __verify_patch_section() signature
Make __per_cpu_hot_end marker point to the end of the percpu cache
hot data, not to the end of the percpu cache hot section.
This fixes the CONFIG_MPENTIUM4 case where X86_L1_CACHE_SHIFT
is set to 7 (128 bytes).
Also update the assert message accordingly.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Link: https://lore.kernel.org/r/20250304173455.89361-1-ubizjak@gmail.com
Closes: https://lore.kernel.org/lkml/Z8a-NVJs-pm5W-mG@gmail.com/
Also change the alignment of the percpu hot section:
- PERCPU_SECTION(INTERNODE_CACHE_BYTES)
+ PERCPU_SECTION(L1_CACHE_BYTES)
As vSMP will muck with INTERNODE_CACHE_BYTES, which invalidates the
too-large-section assert we do:
ASSERT(__per_cpu_hot_end - __per_cpu_hot_start <= 64, "percpu cache hot section too large")
[ mingo: Added INTERNODE_CACHE_BYTES fix & explanation. ]
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-3-brgerst@gmail.com
The return values of these functions are 0/1, but they use an int
type instead of bool:
check_stack_overflow()
execute_on_irq_stack()
Change the type of these functions to bool and adjust their return
values and affected helper variables.
[ mingo: Rewrote the changelog ]
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303155446.112769-5-ubizjak@gmail.com
Make code more readable by using the 'current_stack_pointer' global variable.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303155446.112769-4-ubizjak@gmail.com
i386 ABI declares %edx as a call-clobbered register.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303155446.112769-2-ubizjak@gmail.com
Also use inout "+" constraint modifier where appropriate.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303155446.112769-1-ubizjak@gmail.com
Remove dead/unreachable (and misguided) code in KVM's processing of
0x80000022. The case statement breaks early if PERFMON_V2 isn't supported,
i.e. kvm_cpu_cap_has(X86_FEATURE_PERFMON_V2) must be true when KVM reaches
the code to set up EBX.
Note, early versions of the patch that became commit 94cdeebd82 ("KVM:
x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022") didn't break
early on lack of PERFMON_V2 support, and instead enumerated the effective
number of counters KVM could emulate. All of that code was flawed, e.g.
the APM explicitly states EBX is valid only for v2.
Performance Monitoring Version 2 supported. When set,
CPUID_Fn8000_0022_EBX reports the number of available performance counters.
When the flaw of not respecting v2 support was addressed, the misguided
stuffing of the number of counters got left behind.
Link: https://lore.kernel.org/all/20220919093453.71737-4-likexu@tencent.com
Fixes: 94cdeebd82 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250304082314.472202-2-xiaoyao.li@intel.com
[sean: elaborate on the situation a bit more, add Fixes]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fix a goof where KVM sets CPUID.0x80000022.EAX to CPUID.0x80000022.EBX
instead of zeroing both when PERFMON_V2 isn't supported by KVM. In
practice, barring a buggy CPU (or vCPU model when running nested) only the
!enable_pmu case is affected, as KVM always supports PERFMON_V2 if it's
available in hardware, i.e. CPUID.0x80000022.EBX will be '0' if PERFMON_V2
is unsupported.
For the !enable_pmu case, the bug is relatively benign as KVM will refuse
to enable PMU capabilities, but a VMM that reflects KVM's supported CPUID
into the guest could inadvertently induce #GPs in the guest due to
advertising support for MSRs that KVM refuses to emulate.
Fixes: 94cdeebd82 ("KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022")
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250304082314.472202-3-xiaoyao.li@intel.com
[sean: massage shortlog and changelog, tag for stable]
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Separate the input from the clobbers in preparation for appending the
input.
Do this in preparation of changing the ASM_CALL_CONSTRAINT primitive.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Use named operands in inline asm to make it easier to change the
constraint order.
Do this in preparation of changing the ASM_CALL_CONSTRAINT primitive.
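A generic example of the named-operand syntax (unrelated to the Hyper-V code
itself):

  static inline unsigned long add_one(unsigned long val)
  {
          unsigned long ret;

          /* Operands are referenced by name, so reordering constraints is safe. */
          asm("lea 1(%[in]), %[out]"
              : [out] "=r" (ret)
              : [in] "r" (val));

          return ret;
  }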
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: linux-kernel@vger.kernel.org
Convert the non-asm-goto version of the inline asm in __vmcs_readl() to
use named operands, similar to its asm-goto version.
Do this in preparation of changing the ASM_CALL_CONSTRAINT primitive.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: linux-kernel@vger.kernel.org
Remove the headers at cacheinfo.c that are no longer required.
Alphabetically reorder what remains since more headers will be included
in further commits.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-13-darwi@linutronix.de
Commit 851026a2bf ("x86/cacheinfo: Remove unused trace variable") removed
the switch case for LVL_TRACE but did not get rid of the surrounding gunk.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-12-darwi@linutronix.de
Commit:
e0ba94f14f ("x86/tlb_info: get last level TLB entry number of CPU")
added the TLB table for parsing CPUID(0x2) descriptors, including strings
describing them. The string entries in the table were never used.
Convert them to comments.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-10-darwi@linutronix.de
smp_store_cpu_info() is just a wrapper around identify_secondary_cpu()
without further value.
Move the extra bits from smp_store_cpu_info() into identify_secondary_cpu()
and remove the wrapper.
[ darwi: Make it compile and fix up the xen/smp_pv.c instance ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-9-darwi@linutronix.de
Commit:
e0ba94f14f ("x86/tlb_info: get last level TLB entry number of CPU")
introduced u16 "info" arrays for each TLB type.
Since their introduction in 2012, each array has stored just one type of
information: the number of TLB entries for its respective TLB type.
Replace such arrays with simple variables.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-8-darwi@linutronix.de
The conditional statement "if (x < y) { x = y; }" appears 22 times in
the Intel leaf 0x2 descriptors parsing logic.
Replace each such instance with a max() expression to simplify
the code.
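For example, with the kernel's max() macro from <linux/minmax.h> (variable
names below are illustrative):

  /* Before */
  if (count < entries)
          count = entries;

  /* After */
  count = max(count, entries);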
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-7-darwi@linutronix.de
Remove the headers at intel.c that are no longer required.
Alphabetically reorder what remains since more headers will be included
in further commits.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-6-darwi@linutronix.de
<asm/cpuid.h> uses static_assert() at multiple locations but it does not
include the CPP macro's definition at linux/build_bug.h.
Include the needed header to make <asm/cpuid.h> self-sufficient.
This gets triggered when cpuid.h is included in new C files, which is to
be done in further commits.
Fixes: 43d86e3cd9 ("x86/cpu: Provide cpuid_read() et al.")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-5-darwi@linutronix.de
Since using these options is very dangerous, make details as visible as
possible:
- Instead of a single message for each of the cmdline options, print a
separate pr_warn() for each individual flag.
- Say explicitly whether the flag is a "feature" or a "bug".
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250303-setcpuid-taint-louder-v1-3-8d255032cb4c@google.com
Commit 814165e9fd ("x86/cpu: Add the 'setcpuid=' boot parameter")
recently expanded the user's ability to break their system horribly by
overriding effective CPU flags. This was reflected with updates to the
documentation to try and make people aware that this is dangerous.
To further reduce the risk of users mistaking this for a "real feature",
and try to help them figure out why their kernel is tainted if they do
use it:
- Upgrade the existing printk to pr_warn, to help ensure kernel logs
reflect what changes are in effect.
- Print an extra warning that tries to be as dramatic as possible, while
also highlighting the fact that it tainted the kernel.
Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250303-setcpuid-taint-louder-v1-2-8d255032cb4c@google.com
These macros used to abstract over CONFIG_X86_FEATURE_NAMES, but that
was removed in:
7583e8fbdc ("x86/cpu: Remove X86_FEATURE_NAMES")
Now they are just an unnecessary indirection, remove them.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250303-setcpuid-taint-louder-v1-1-8d255032cb4c@google.com
Retpoline mitigation for spectre-v2 uses thunks for indirect branches. To
support this mitigation compilers add a CS prefix with
-mindirect-branch-cs-prefix. For an indirect branch in asm, this needs to
be added manually.
CS prefix is already being added to indirect branches in asm files, but not
in inline asm. Add CS prefix to CALL_NOSPEC for inline asm as well. There
is no JMP_NOSPEC for inline asm.
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250228-call-nospec-v3-2-96599fed0f33@linux.intel.com
CALL_NOSPEC macro is used to generate Spectre-v2 mitigation friendly
indirect branches. At compile time the macro defaults to indirect branch,
and at runtime those can be patched to thunk based mitigations.
This approach is opposite of what is done for the rest of the kernel, where
the compile time default is to replace indirect calls with retpoline thunk
calls.
Make CALL_NOSPEC consistent with the rest of the kernel, default to
retpoline thunk at compile time when CONFIG_MITIGATION_RETPOLINE is
enabled.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250228-call-nospec-v3-1-96599fed0f33@linux.intel.com
Fix some related issues (done in a single patch to avoid introducing
intermediate bisect warnings):
1) The SMP version of mwait_play_dead() doesn't return, but its
!SMP counterpart does. Make its calling behavior consistent by
resolving the !SMP version to a BUG(). It should never be called
anyway, this just enforces that at runtime and enables its callers
to be marked as __noreturn.
2) While the SMP definition of mwait_play_dead() is annotated as
__noreturn, the declaration isn't. Nor is it listed in
tools/objtool/noreturns.h. Fix that.
3) Similar to #1, the SMP version of acpi_processor_ffh_play_dead()
doesn't return but its !SMP counterpart does. Make the !SMP
version a BUG(). It should never be called.
4) acpi_processor_ffh_play_dead() doesn't return, but is lacking any
__noreturn annotations. Fix that.
This fixes the following objtool warnings:
vmlinux.o: warning: objtool: acpi_processor_ffh_play_dead+0x67: mwait_play_dead() is missing a __noreturn annotation
vmlinux.o: warning: objtool: acpi_idle_play_dead+0x3c: acpi_processor_ffh_play_dead() is missing a __noreturn annotation
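A sketch of the resulting !SMP stub (the signature is assumed from the
changelog, not copied from the tree; acpi_processor_ffh_play_dead() gets the
same treatment):

  #include <linux/bug.h>
  #include <linux/compiler.h>

  /* Must never be called on !SMP; BUG() makes that explicit and lets the
   * declaration carry __noreturn for objtool. */
  static inline void __noreturn mwait_play_dead(unsigned int eax_hint)
  {
          BUG();
  }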
Fixes: a7dd183f0b ("x86/smp: Allow calling mwait_play_dead with an arbitrary hint")
Fixes: 541ddf31e3 ("ACPI/processor_idle: Add FFH state handling")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/e885c6fa9e96a61471b33e48c2162d28b15b14c5.1740962711.git.jpoimboe@kernel.org
Commit f388f60ca9 ("x86/cpu: Drop configuration options for early 64-bit CPUs")
removes the config symbols MCORE2 and MK8.
With that, the references to those two config symbols in xen's x86 Kconfig
are obsolete. Drop them.
Fixes: f388f60ca9 ("x86/cpu: Drop configuration options for early 64-bit CPUs")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20250303093759.371445-1-lukas.bulwahn@redhat.com
CPUID leaf 0x2's one-byte TLB descriptors report the number of entries
for specific TLB types, among other properties.
Typically, each emitted descriptor implies the same number of entries
for its respective TLB type(s). An emitted 0x63 descriptor is an
exception: it implies 4 data TLB entries for 1GB pages and 32 data TLB
entries for 2MB or 4MB pages.
For the TLB descriptors parsing code, the entry count for 1GB pages is
encoded at the intel_tlb_table[] mapping, but the 2MB/4MB entry count is
totally ignored.
Update leaf 0x2's parsing logic to account for 32 data TLB entries
for 2MB/4MB pages implied by the 0x63 descriptor.
Fixes: e0ba94f14f ("x86/tlb_info: get last level TLB entry number of CPU")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304085152.51092-4-darwi@linutronix.de
CPUID leaf 0x2 emits one-byte descriptors in its four output registers
EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most
significant bit (MSB) of each register must be clear.
Leaf 0x2 parsing at intel.c only validated the MSBs of EAX, EBX, and
ECX, but left EDX unchecked.
Validate EDX's most-significant bit as well.
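Illustratively, the per-register validity rule looks like this (helper name
made up):

  #include <linux/bits.h>
  #include <linux/types.h>

  /* A CPUID(0x2) output register carries valid descriptors iff bit 31 is 0. */
  static inline bool cpuid2_reg_valid(u32 reg)
  {
          return !(reg & BIT(31));
  }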
Fixes: e0ba94f14f ("x86/tlb_info: get last level TLB entry number of CPU")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304085152.51092-3-darwi@linutronix.de
CPUID leaf 0x2 emits one-byte descriptors in its four output registers
EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most
significant bit (MSB) of each register must be clear.
The historical Git commit:
019361a20f016 ("- pre6: Intel: start to add Pentium IV specific stuff (128-byte cacheline etc)...")
introduced leaf 0x2 output parsing. It only validated the MSBs of EAX,
EBX, and ECX, but left EDX unchecked.
Validate EDX's most-significant bit.
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304085152.51092-2-darwi@linutronix.de
The safe_smp_processor_id() function was originally implemented in:
dc2bc768a0 ("stack overflow safe kdump: safe_smp_processor_id()")
to mitigate the CPU number corruption on a stack overflow. At the time,
x86-32 stored the CPU number in thread_struct, which was located at the
bottom of the task stack and thus vulnerable to an overflow.
The CPU number is now located in percpu memory, so this workaround
is no longer needed.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250303170115.2176553-1-brgerst@gmail.com
Commit:
263042e463 ("Save user RSP in pt_regs->sp on SYSCALL64 fastpath")
simplified the 64-bit implementation of KSTK_ESP() which is
now identical to 32-bit. Merge them into a common definition.
No functional change.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250303183111.2245129-1-brgerst@gmail.com
Extract the checking of entry/exit pairs to a helper macro so that the
code can be reused to process the upcoming "secondary" exit controls (the
primary exit controls field is out of bits). Use a macro instead of a
function to support different sized variables (all secondary exit controls
will be optional and so the MSR doesn't have the fixed-0/fixed-1 split).
Taking the largest size as input is trivial, but handling the modification
of KVM's to-be-used controls is much trickier, e.g. would require bitmap
games to clear bits from a 32-bit bitmap vs. a 64-bit bitmap.
Opportunistically add sanity checks to ensure the size of the controls
match (yay, macro!), e.g. to detect bugs where KVM passes in the pairs for
primary exit controls, but its variable for the secondary exit controls.
To help users triage mismatches, print the control bits that are checked,
not just the actual value. For the foreseeable future, that provides
enough information for a user to determine which fields mismatched. E.g.
until secondary entry controls comes along, all entry bits and thus all
error messages are guaranteed to be unique.
To avoid returning from a macro, which can get quite dangerous, simply
process all pairs even if error_on_inconsistent_vmcs_config is set. The
speed at which KVM rejects module load is not at all interesting.
Keep the error message a "once" printk, even though it would be nice to
print out all mismatching pairs. In practice, the most likely scenario is
that a single pair will mismatch on all CPUs. Printing all mismatches
generates redundant messages in that situation, and can be extremely noisy
on systems with large numbers of CPUs. If a CPU has multiple mismatches,
not printing every bad pair is the least of the user's concerns.
Cc: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250227005353.3216123-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When processing an SNP AP Creation event, invalidate the "next" VMSA GPA
even if acquiring the page/pfn for the new VMSA fails. In practice, the
next GPA will never be used regardless of whether or not it's invalidated,
as the entire flow is guarded by snp_ap_waiting_for_reset, and said guard
and snp_vmsa_gpa are always written as a pair. But that's really hard to
see in the code.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use guard(mutex) in sev_snp_init_protected_guest_state() and pull in its
lock-protected inner helper. Without an unlock trampoline (and even with
one), there is no real need for an inner helper. Eliminating the helper
also avoids having to fixup the open coded "lockdep" WARN_ON().
Opportunistically drop the error message if KVM can't obtain the pfn for
the new target VMSA. The error message provides zero information that
can't be gleaned from the fact that the vCPU is stuck.
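For reference, the guard(mutex) pattern from <linux/cleanup.h> looks like
this (generic illustration, not the KVM code):

  #include <linux/cleanup.h>
  #include <linux/errno.h>
  #include <linux/mutex.h>

  static DEFINE_MUTEX(example_lock);

  static int update_state(int new_state, int *state)
  {
          guard(mutex)(&example_lock);    /* unlocked automatically on return */

          if (new_state < 0)
                  return -EINVAL;         /* no goto/unlock needed */

          *state = new_state;
          return 0;
  }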
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Mark the VMCB dirty, i.e. zero control.clean, prior to handling the new
VMSA. Nothing in the VALID_PAGE() case touches control.clean, and
isolating the VALID_PAGE() code will allow simplifying the overall logic.
Note, the VMCB probably doesn't need to be marked dirty when the VMSA is
invalid, as KVM will disallow running the vCPU in such a state. But it
also doesn't hurt anything.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use guard(mutex) in sev_snp_ap_creation() and modify the error paths to
return directly instead of jumping to a common exit point.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the local "kick" variable and the unnecessary "fallthrough" logic
from sev_snp_ap_creation(), and simply pivot on the request when deciding
whether or not to immediately force a state update on the target vCPU.
No functional change intended.
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When handling an "AP Create" event, return an error if the "requested" SEV
features for the vCPU don't exactly match KVM's view of the VM-scoped
features. There is no known use case for heterogeneous SEV features across
vCPUs, and while KVM can't actually enforce an exact match since the value
in RAX isn't guaranteed to match what the guest shoved into the VMSA, KVM
can at least avoid knowingly letting the guest run in an unsupported state.
E.g. if a VM is created with DebugSwap disabled, KVM will intercept #DBs
and DRs for all vCPUs, even if an AP is "created" with DebugSwap enabled in
its VMSA.
Note, the GHCB spec only "requires" that "AP use the same interrupt
injection mechanism as the BSP", but given the disaster that is DebugSwap
and SEV_FEATURES in general, it's safe to say that AMD didn't consider all
possible complications with mismatching features between the BSP and APs.
Opportunistically fold the check into the relevant request flavors; the
"request < AP_DESTROY" check is just a bizarre way of implementing the
AP_CREATE_ON_INIT => AP_CREATE fallthrough.
Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
If KVM rejects an AP Creation event, leave the target vCPU state as-is.
Nothing in the GHCB suggests the hypervisor is *allowed* to muck with vCPU
state on failure, let alone required to do so. Furthermore, kicking only
in the !ON_INIT case leads to divergent behavior, and even the "kick" case
is non-deterministic.
E.g. if an ON_INIT request fails, the guest can successfully retry if the
fixed AP Creation request is made prior to sending INIT. And if a !ON_INIT
fails, the guest can successfully retry if the fixed AP Creation request is
handled before the target vCPU processes KVM's
KVM_REQ_UPDATE_PROTECTED_GUEST_STATE.
Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicitly reject KVM_RUN with KVM_EXIT_FAIL_ENTRY if userspace "coerces"
KVM into running an SEV-ES+ guest with an invalid VMSA, e.g. by modifying
a vCPU's mp_state to be RUNNABLE after an SNP vCPU has undergone a Destroy
event. On Destroy or failed Create, KVM marks the vCPU HALTED so that
*KVM* doesn't run the vCPU, but nothing prevents a misbehaving VMM from
manually making the vCPU RUNNABLE via KVM_SET_MP_STATE.
Attempting VMRUN with an invalid VMSA should be harmless, but knowingly
executing VMRUN with bad control state is at best dodgy.
Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Never rely on the CPU to restore/load host DR0..DR3 values, even if the
CPU supports DebugSwap, as there are no guarantees that SNP guests will
actually enable DebugSwap on APs. E.g. if KVM were to rely on the CPU to
load DR0..DR3 and skipped them during hw_breakpoint_restore(), KVM would
run with clobbered-to-zero DRs if an SNP guest created APs without
DebugSwap enabled.
Update the comment to explain the dangers, and hopefully prevent breaking
KVM in the future.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When running SEV-SNP guests on a CPU that supports DebugSwap, always save
the host's DR0..DR3 mask MSR values irrespective of whether or not
DebugSwap is enabled, to ensure the host values aren't clobbered by the
CPU. And for now, also save DR0..DR3, even though doing so isn't
necessary (see below).
SVM_VMGEXIT_AP_CREATE is deeply flawed in that it allows the *guest* to
create a VMSA with guest-controlled SEV_FEATURES. A well behaved guest
can inform the hypervisor, i.e. KVM, of its "requested" features, but on
CPUs without ALLOWED_SEV_FEATURES support, nothing prevents the guest from
lying about which SEV features are being enabled (or not!).
If a misbehaving guest enables DebugSwap in a secondary vCPU's VMSA, the
CPU will load the DR0..DR3 mask MSRs on #VMEXIT, i.e. will clobber the
MSRs with '0' if KVM doesn't save its desired value.
Note, DR0..DR3 themselves are "ok", as DR7 is reset on #VMEXIT, and KVM
restores all DRs in common x86 code as needed via hw_breakpoint_restore().
I.e. there is no risk of host DR0..DR3 being clobbered (when it matters).
However, there is a flaw in the opposite direction; because the guest can
lie about enabling DebugSwap, i.e. can *disable* DebugSwap without KVM's
knowledge, KVM must not rely on the CPU to restore DRs. Defer fixing
that wart, as it's more of a documentation issue than a bug in the code.
Note, KVM added support for DebugSwap on commit d1f85fbe83 ("KVM: SEV:
Enable data breakpoints in SEV-ES"), but that is not an appropriate Fixes,
as the underlying flaw exists in hardware, not in KVM. I.e. all kernels
that support SEV-SNP need to be patched, not just kernels with KVM's full
support for DebugSwap (ignoring that DebugSwap support landed first).
Opportunistically fix an incorrect statement in the comment; on CPUs
without DebugSwap, the CPU does NOT save or load debug registers.
Fixes: e366f92ea9 ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Cc: stable@vger.kernel.org
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Kim Phillips <kim.phillips@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Alexey Kardashevskiy <aik@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250227012541.3234589-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Change the default value of spectre v2 in user mode to respect the
CONFIG_MITIGATION_SPECTRE_V2 config option.
Currently, user mode spectre v2 is set to auto
(SPECTRE_V2_USER_CMD_AUTO) by default, even if
CONFIG_MITIGATION_SPECTRE_V2 is disabled.
Set the spectre_v2 value to auto (SPECTRE_V2_USER_CMD_AUTO) if the
Spectre v2 config (CONFIG_MITIGATION_SPECTRE_V2) is enabled, otherwise
set the value to none (SPECTRE_V2_USER_CMD_NONE).
Important to say the command line argument "spectre_v2_user" overwrites
the default value in both cases.
When CONFIG_MITIGATION_SPECTRE_V2 is not set, users have the flexibility
to opt-in for specific mitigations independently. In this scenario,
setting spectre_v2= will not enable spectre_v2_user=, and command line
options spectre_v2_user and spectre_v2 are independent when
CONFIG_MITIGATION_SPECTRE_V2=n.
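A minimal user-space sketch of the new default selection; the enum value and
Kconfig names come from the text above, everything else is illustrative:
  #include <stdio.h>

  /* Illustrative stand-in for CONFIG_MITIGATION_SPECTRE_V2; flip to 0 to see
   * the other default. The real kernel code tests the Kconfig symbol with
   * IS_ENABLED() instead of a plain define. */
  #define MITIGATION_SPECTRE_V2 1

  enum spectre_v2_user_cmd {
      SPECTRE_V2_USER_CMD_NONE,
      SPECTRE_V2_USER_CMD_AUTO,
  };

  int main(void)
  {
      /* Default follows the config option; a spectre_v2_user= command line
       * argument would still override this in both cases. */
      enum spectre_v2_user_cmd cmd = MITIGATION_SPECTRE_V2 ?
              SPECTRE_V2_USER_CMD_AUTO : SPECTRE_V2_USER_CMD_NONE;

      printf("spectre_v2_user default: %s\n",
             cmd == SPECTRE_V2_USER_CMD_AUTO ? "auto" : "none");
      return 0;
  }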
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Kaplan <David.Kaplan@amd.com>
Link: https://lore.kernel.org/r/20241031-x86_bugs_last_v2-v2-2-b7ff1dab840e@debian.org
There is a helper function to check if SMT is available. Use this helper
instead of performing the check manually.
The helper function cpu_smt_possible() does exactly the same thing as
was being done manually inside spectre_v2_user_select_mitigation().
Specifically, it returns false if CONFIG_SMP is disabled, otherwise
it checks the cpu_smt_control global variable.
This change improves code consistency and reduces duplication.
No change in functionality intended.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: David Kaplan <David.Kaplan@amd.com>
Link: https://lore.kernel.org/r/20241031-x86_bugs_last_v2-v2-1-b7ff1dab840e@debian.org
Retpoline mitigation for spectre-v2 uses thunks for indirect branches. To
support this mitigation compilers add a CS prefix with
-mindirect-branch-cs-prefix. For an indirect branch in asm, this needs to
be added manually.
CS prefix is already being added to indirect branches in asm files, but not
in inline asm. Add CS prefix to CALL_NOSPEC for inline asm as well. There
is no JMP_NOSPEC for inline asm.
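As a rough, x86-64-only user-space illustration (not the kernel macro itself):
the CS segment-override prefix is simply the byte 0x2e emitted ahead of the
indirect call issued from inline asm:
  #include <stdio.h>

  static volatile int hits;

  static void target(void)
  {
      hits++;        /* deliberately trivial */
  }

  int main(void)
  {
      void (*fn)(void) = target;

      /* Illustrative only: a CS prefix (0x2e) in front of an indirect call.
       * Caller-saved GPRs are clobbered because a real call is made behind
       * the compiler's back. */
      asm volatile(".byte 0x2e\n\t"
                   "call *%[f]"
                   : : [f] "r" (fn)
                   : "rax", "rcx", "rdx", "rsi", "rdi",
                     "r8", "r9", "r10", "r11", "memory", "cc");

      printf("indirect call hit the target %d time(s)\n", hits);
      return 0;
  }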
Reported-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250228-call-nospec-v3-2-96599fed0f33@linux.intel.com
CALL_NOSPEC macro is used to generate Spectre-v2 mitigation friendly
indirect branches. At compile time the macro defaults to indirect branch,
and at runtime those can be patched to thunk based mitigations.
This approach is opposite of what is done for the rest of the kernel, where
the compile time default is to replace indirect calls with retpoline thunk
calls.
Make CALL_NOSPEC consistent with the rest of the kernel, default to
retpoline thunk at compile time when CONFIG_MITIGATION_RETPOLINE is
enabled.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250228-call-nospec-v3-1-96599fed0f33@linux.intel.com
The last use of paravirt_disable_iospace() was removed in 2015 by
commit d1c29465b8 ("lguest: don't disable iospace.")
Remove it.
Note the comment above it about 'entry.S' is unrelated to this
but stayed when intervening code got deleted.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20250303004441.250451-1-linux@treblig.org
The ARCH_MAY_HAVE patch missed arm64, mips and s390. But it may
also lead to arch options being enabled but ineffective because
of modular/built-in conflicts.
As the primary user of all these options wireguard is selecting
the arch options anyway, make the same selections at the lib/crypto
option level and hide the arch options from the user.
Instead of selecting them centrally from lib/crypto, simply set
the default of each arch option as suggested by Eric Biggers.
Change the Crypto API generic algorithms to select the top-level
lib/crypto options instead of the generic one as otherwise there
is no way to enable the arch options (Eric Biggers). Introduce a
set of INTERNAL options to work around dependency cycles on the
CONFIG_CRYPTO symbol.
Fixes: 1047e21aec ("crypto: lib/Kconfig - Fix lib built-in failure when arch is modular")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Arnd Bergmann <arnd@kernel.org>
Closes: https://lore.kernel.org/oe-kbuild-all/202502232152.JC84YDLp-lkp@intel.com/
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In crypto_aegis128_aesni_process_ad(), use scatterwalk_next() which
consolidates scatterwalk_clamp() and scatterwalk_map(). Use
scatterwalk_done_src() which consolidates scatterwalk_unmap(),
scatterwalk_advance(), and scatterwalk_done().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In gcm_process_assoc(), use scatterwalk_next() which consolidates
scatterwalk_clamp() and scatterwalk_map(). Use scatterwalk_done_src()
which consolidates scatterwalk_unmap(), scatterwalk_advance(), and
scatterwalk_done().
Also rename some variables to avoid implying that anything is actually
mapped (it's not), or that the loop is going page by page (it is for
now, but nothing actually requires that to be the case).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix TCR_EL2 configuration to not use the ASID in TTBR1_EL2 and not
mess-up T1SZ/PS by using the HCR_EL2.E2H==0 layout.
- Bring back the VMID allocation to the vcpu_load phase, ensuring
that we only setup VTTBR_EL2 once on VHE. This cures an ugly race
that would lead to running with an unallocated VMID.
RISC-V:
- Fix hart status check in SBI HSM extension
- Fix hart suspend_type usage in SBI HSM extension
- Fix error returned by SBI IPI and TIME extensions for unsupported
function IDs
- Fix suspend_type usage in SBI SUSP extension
- Remove unnecessary vcpu kick after injecting interrupt via IMSIC
guest file
x86:
- Fix an nVMX bug where KVM fails to detect that, after nested
VM-Exit, L1 has a pending IRQ (or NMI).
- To avoid freeing the PIC while vCPUs are still around, which would
cause a NULL pointer access with the previous patch, destroy vCPUs
before any VM-level destruction.
- Handle failures to create vhost_tasks"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: retry nx_huge_page_recovery_thread creation
vhost: return task creation error instead of NULL
KVM: nVMX: Process events on nested VM-Exit if injectable IRQ or NMI is pending
KVM: x86: Free vCPUs before freeing VM state
riscv: KVM: Remove unnecessary vcpu kick
KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2
KVM: arm64: Fix tcr_el2 initialisation in hVHE mode
riscv: KVM: Fix SBI sleep_type use
riscv: KVM: Fix SBI TIME error generation
riscv: KVM: Fix SBI IPI error generation
riscv: KVM: Fix hart suspend_type use
riscv: KVM: Fix hart suspend status check
A VMM may send a non-fatal signal to its threads, including vCPU tasks,
at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU
task receives the signal while it's trying to spawn the huge page recovery
vhost task, then KVM_RUN will fail due to copy_process() returning
-ERESTARTNOINTR.
Rework call_once() to mark the call complete if and only if the called
function succeeds, and plumb the function's true error code back to the
call_once() invoker. This provides userspace with the correct, non-fatal
error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows
a subsequent KVM_RUN to succeed by virtue of retrying creation of the NX huge
page task.
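A hedged user-space sketch of the reworked call_once() semantics (only mark
the call done when the callback succeeds, and return the callback's error to
the caller so it can retry); the kernel helper differs in detail. Compile with
-pthread:
  #include <pthread.h>
  #include <stdio.h>

  struct once {
      pthread_mutex_t lock;
      int done;          /* set only after the callback succeeded */
  };

  #define ONCE_INIT { PTHREAD_MUTEX_INITIALIZER, 0 }

  /* Run @fn at most once successfully; on failure return its error code and
   * leave the state untouched so a later caller can retry. */
  static int call_once(struct once *once, int (*fn)(void))
  {
      int ret = 0;

      pthread_mutex_lock(&once->lock);
      if (!once->done) {
          ret = fn();
          if (!ret)
              once->done = 1;
      }
      pthread_mutex_unlock(&once->lock);
      return ret;
  }

  static int attempts;

  static int spawn_worker(void)
  {
      /* Pretend the first attempt fails, e.g. interrupted by a signal. */
      return ++attempts == 1 ? -1 : 0;
  }

  int main(void)
  {
      static struct once once = ONCE_INIT;

      printf("first try:  %d\n", call_once(&once, spawn_worker));
      printf("second try: %d\n", call_once(&once, spawn_worker));
      printf("third try:  %d\n", call_once(&once, spawn_worker));
      printf("attempts:   %d\n", attempts);   /* 2: done, no further attempts */
      return 0;
  }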
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
[implemented the kvm user side]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250227230631.303431-3-kbusch@meta.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lets callers distinguish why the vhost task creation failed. No one
currently cares why it failed, so no real runtime change from this
patch, but that will not be the case for long.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250227230631.303431-2-kbusch@meta.com>
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'x86-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix conflicts between devicetree and ACPI SMP discovery & setup
- Fix a warm-boot lockup on AMD SC1100 SoC systems
- Fix a W=1 build warning related to x86 IRQ trace event setup
- Fix a kernel-doc warning
* tag 'x86-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Fix kernel-doc warning
x86/irq: Define trace events conditionally
x86/CPU: Fix warm boot hang regression on AMD SC1100 SoC systems
x86/of: Don't use DTB for SMP setup if ACPI is enabled
When emulating HLT and a wake event is already pending, explicitly mark
the vCPU RUNNABLE (via kvm_set_mp_state()) instead of assuming the vCPU is
already in the appropriate state. Barring a KVM bug, it should be
impossible for the vCPU to be in a non-RUNNABLE state, but there is no
advantage to relying on that to hold true, and ensuring the vCPU is made
RUNNABLE avoids non-deterministic behavior with respect to pv_unhalted.
E.g. if the vCPU is not already RUNNABLE, then depending on when
pv_unhalted is set, KVM could either leave the vCPU in the non-RUNNABLE
state (set before __kvm_emulate_halt()), or transition the vCPU to HALTED
and then RUNNABLE (pv_unhalted set after the kvm_vcpu_has_events() check).
Link: https://lore.kernel.org/r/20250224174156.2362059-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
In preparation for strtomem*() checking that its destination is a
nonstring, rename and annotate message.bytes accordingly.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Snapshot the host's DEBUGCTL after disabling IRQs, as perf can toggle
debugctl bits from IRQ context, e.g. when enabling/disabling events via
smp_call_function_single(). Taking the snapshot (long) before IRQs are
disabled could result in KVM effectively clobbering DEBUGCTL due to using
a stale snapshot.
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Manually load the guest's DEBUGCTL prior to VMRUN (and restore the host's
value on #VMEXIT) if it diverges from the host's value and LBR
virtualization is disabled, as hardware only context switches DEBUGCTL if
LBR virtualization is fully enabled. Running the guest with the host's
value has likely been mildly problematic for quite some time, e.g. it will
result in undesirable behavior if BTF diverges (with the caveat that KVM
now suppresses guest BTF due to lack of support).
But the bug became fatal with the introduction of Bus Lock Trap ("Detect"
in kernel parlance) support for AMD (commit 408eb7417a
("x86/bus_lock: Add support for AMD")), as a bus lock in the guest will
trigger an unexpected #DB.
Note, suppressing the bus lock #DB, i.e. simply resuming the guest without
injecting a #DB, is not an option. It wouldn't address the general issue
with DEBUGCTL, e.g. for things like BTF, and there are other guest-visible
side effects if BusLockTrap is left enabled.
If BusLockTrap is disabled, then DR6.BLD is reserved-to-1; any attempts to
clear it by software are ignored. But if BusLockTrap is enabled, software
can clear DR6.BLD:
Software enables bus lock trap by setting DebugCtl MSR[BLCKDB] (bit 2)
to 1. When bus lock trap is enabled, ... The processor indicates that
this #DB was caused by a bus lock by clearing DR6[BLD] (bit 11). DR6[11]
previously had been defined to be always 1.
and clearing DR6.BLD is "sticky" in that it's not set (i.e. lowered) by
other #DBs:
All other #DB exceptions leave DR6[BLD] unmodified
E.g. leaving BusLockTrap enabled can confuse a legacy guest that writes '0'
to reset DR6.
Reported-by: rangemachine@gmail.com
Reported-by: whanos@sergal.fun
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219787
Closes: https://lore.kernel.org/all/bug-219787-28872@https.bugzilla.kernel.org%2F
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move KVM's snapshot of DEBUGCTL to kvm_vcpu_arch and take the snapshot in
common x86, so that SVM can also use the snapshot.
Opportunistically change the field to a u64. While bits 63:32 are reserved
on AMD, not mentioned at all in Intel's SDM, and managed as an "unsigned
long" by the kernel, DEBUGCTL is an MSR and therefore a 64-bit value.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Mark BTF as reserved in DEBUGCTL on AMD, as KVM doesn't actually support
BTF, and fully enabling BTF virtualization is non-trivial due to
interactions with the emulator, guest_debug, #DB interception, nested SVM,
etc.
Don't inject #GP if the guest attempts to set BTF, as there's no way to
communicate lack of support to the guest, and instead suppress the flag
and treat the WRMSR as (partially) unsupported.
In short, make KVM behave the same on AMD and Intel (VMX already squashes
BTF).
Note, due to other bugs in KVM's handling of DEBUGCTL, the only way BTF
has "worked" in any capacity is if the guest simultaneously enables LBRs.
Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop bits 5:2 from the guest's effective DEBUGCTL value, as AMD changed
the architectural behavior of the bits and broke backwards compatibility.
On CPUs without BusLockTrap (or at least, in APMs from before ~2023),
bits 5:2 controlled the behavior of external pins:
Performance-Monitoring/Breakpoint Pin-Control (PBi)—Bits 5:2, read/write.
Software uses these bits to control the type of information reported by
the four external performance-monitoring/breakpoint pins on the
processor. When a PBi bit is cleared to 0, the corresponding external pin
(BPi) reports performance-monitor information. When a PBi bit is set to
1, the corresponding external pin (BPi) reports breakpoint information.
With the introduction of BusLockTrap, presumably to be compatible with
Intel CPUs, AMD redefined bit 2 to be BLCKDB:
Bus Lock #DB Trap (BLCKDB)—Bit 2, read/write. Software sets this bit to
enable generation of a #DB trap following successful execution of a bus
lock when CPL is > 0.
and redefined bits 5:3 (and bit 6) as "6:3 Reserved MBZ".
Ideally, KVM would treat bits 5:2 as reserved. Defer that change to a
feature cleanup to avoid breaking existing guests in LTS kernels. For now,
drop the bits to retain backwards compatibility (of a sort).
Note, dropping bits 5:2 is still a guest-visible change, e.g. if the guest
is enabling LBRs *and* the legacy PBi bits, then the state of the PBi bits
is visible to the guest, whereas now the guest will always see '0'.
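A minimal sketch of the masking, with an invented macro name; only the bit
positions come from the text above:
  #include <stdint.h>
  #include <stdio.h>

  /* Bits 5:2 of DEBUGCTL: legacy PBi pin controls on older AMD parts,
   * partially redefined (bit 2 = BLCKDB) once BusLockTrap appeared. */
  #define DEBUGCTL_LEGACY_PB_BITS   (0xfULL << 2)

  static uint64_t sanitize_guest_debugctl(uint64_t data)
  {
      /* Drop bits 5:2 so the guest never sees the redefined encodings. */
      return data & ~DEBUGCTL_LEGACY_PB_BITS;
  }

  int main(void)
  {
      uint64_t wrmsr_value = 0x3f;   /* LBR + BTF + all four PBi bits set */

      printf("guest wrote %#llx\n", (unsigned long long)wrmsr_value);
      printf("KVM keeps   %#llx\n",
             (unsigned long long)sanitize_guest_debugctl(wrmsr_value));
      return 0;
  }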
Reported-by: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: stable@vger.kernel.org
Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250227222411.3490595-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Inject a #GP if the memory operand received by INVPCID is non-canonical.
The APM clearly states that the intercept takes priority over all #GP
checks except the CPL0 restriction.
Of course, that begs the question of how the CPU generates a linear
address in the first place. Tracing confirms that EXITINFO1 does hold a
linear address, at least for 64-bit mode guests (hooray GS prefix).
Unfortunately, the APM says absolutely nothing about the EXITINFO fields
for INVPCID intercepts, so it's not at all clear what's supposed to
happen.
Add a FIXME to call out that KVM still does the wrong thing for 32-bit
guests, and if the stack segment is used for the memory operand.
Cc: Babu Moger <babu.moger@amd.com>
Cc: Jim Mattson <jmattson@google.com>
Fixes: 4407a797e9 ("KVM: SVM: Enable INVPCID feature on AMD")
Link: https://lore.kernel.org/r/20250224174522.2363400-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Extend KVM's restrictions on userspace forcing "emulation required" at odd
times to cover stuffing invalid guest state while a nested run is pending.
Clobbering guest state while KVM is in the middle of emulating VM-Enter is
nonsensical, as it puts the vCPU into an architecturally impossible state,
and will trip KVM's sanity check that guards against KVM bugs, e.g. helps
detect missed VMX consistency checks.
WARNING: CPU: 3 PID: 6336 at arch/x86/kvm/vmx/vmx.c:6480 __vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6480 [inline]
WARNING: CPU: 3 PID: 6336 at arch/x86/kvm/vmx/vmx.c:6480 vmx_handle_exit+0x40f/0x1f70 arch/x86/kvm/vmx/vmx.c:6637
Modules linked in:
CPU: 3 UID: 0 PID: 6336 Comm: syz.0.73 Not tainted 6.13.0-rc1-syzkaller-00316-gb5f217084ab3 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:__vmx_handle_exit arch/x86/kvm/vmx/vmx.c:6480 [inline]
RIP: 0010:vmx_handle_exit+0x40f/0x1f70 arch/x86/kvm/vmx/vmx.c:6637
<TASK>
vcpu_enter_guest arch/x86/kvm/x86.c:11081 [inline]
vcpu_run+0x3047/0x4f50 arch/x86/kvm/x86.c:11242
kvm_arch_vcpu_ioctl_run+0x44a/0x1740 arch/x86/kvm/x86.c:11560
kvm_vcpu_ioctl+0x6ce/0x1520 virt/kvm/kvm_main.c:4340
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl fs/ioctl.c:892 [inline]
__x64_sys_ioctl+0x190/0x200 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
Reported-by: syzbot+ac0bc3a70282b4d586cc@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67598fb9.050a0220.17f54a.003b.GAE@google.com
Debugged-by: James Houghton <jthoughton@google.com>
Tested-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250224171409.2348647-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Enable/disable local IRQs, i.e. set/clear RFLAGS.IF, in the common
svm_vcpu_enter_exit() just after/before guest_state_{enter,exit}_irqoff()
so that VMRUN is not executed in an STI shadow. AMD CPUs have a quirk
(some would say "bug"), where the STI shadow bleeds into the guest's
intr_state field if a #VMEXIT occurs during injection of an event, i.e. if
the VMRUN doesn't complete before the subsequent #VMEXIT.
The spurious "interrupts masked" state is relatively benign, as it only
occurs during event injection and is transient. Because KVM is already
injecting an event, the guest can't be in HLT, and if KVM is querying IRQ
blocking for injection, then KVM would need to force an immediate exit
anyways since injecting multiple events is impossible.
However, because KVM copies int_state verbatim from vmcb02 to vmcb12, the
spurious STI shadow is visible to L1 when running a nested VM, which can
trip sanity checks, e.g. in VMware's VMM.
Hoist the STI+CLI all the way to C code, as the aforementioned calls to
guest_state_{enter,exit}_irqoff() already inform lockdep that IRQs are
enabled/disabled, and taking a fault on VMRUN with RFLAGS.IF=1 is already
possible. I.e. if there's kernel code that is confused by running with
RFLAGS.IF=1, then it's already a problem. In practice, since GIF=0 also
blocks NMIs, the only change in exposure to non-KVM code (relative to
surrounding VMRUN with STI+CLI) is exception handling code, and except for
the kvm_rebooting=1 case, all exceptions in the core VM-Enter/VM-Exit path
are fatal.
Use the "raw" variants to enable/disable IRQs to avoid tracing in the
"no instrumentation" code; the guest state helpers also take care of
tracing IRQ state.
Opportunistically document why KVM needs to do STI in the first place.
Reported-by: Doug Covelli <doug.covelli@broadcom.com>
Closes: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4uzJQ@mail.gmail.com
Fixes: f14eec0a32 ("KVM: SVM: move more vmentry code to assembly")
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250224165442.2338294-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
That macro acts as a different name for for_each_tdp_pte; apart from
adding cognitive load, it doesn't bring any value. Let's remove it.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250226074131.312565-1-nik.borisov@suse.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Define independent macros for the RWX protection bits that are enumerated
via EXIT_QUALIFICATION for EPT Violations, and tie them to the RWX bits in
EPT entries via compile-time asserts. Piggybacking the EPTE defines works
for now, but it creates holes in the EPT_VIOLATION_xxx macros and will
cause headaches if/when KVM emulates Mode-Based Execution (MBEC), or any
other features that introduces additional protection information.
Opportunistically rename EPT_VIOLATION_RWX_MASK to EPT_VIOLATION_PROT_MASK
so that it doesn't become stale if/when MBEC support is added.
No functional change intended.
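As a self-contained illustration of the scheme (independent defines plus
compile-time asserts tying them back to the EPT-entry RWX bits), with made-up
macro names; the bit positions are the architectural ones:
  #include <assert.h>
  #include <stdio.h>

  #define BIT(nr)   (1UL << (nr))

  /* Architectural RWX permission bits in an EPT entry. */
  #define EPT_ENTRY_READ        BIT(0)
  #define EPT_ENTRY_WRITE       BIT(1)
  #define EPT_ENTRY_EXEC        BIT(2)

  /* EXIT_QUALIFICATION bits for EPT violations: bits 2:0 describe the access,
   * bits 5:3 report the protections of the faulting translation. Defined on
   * their own rather than as "EPT entry bit << 3". */
  #define EPT_VIOLATION_ACC_READ    BIT(0)
  #define EPT_VIOLATION_ACC_WRITE   BIT(1)
  #define EPT_VIOLATION_ACC_INSTR   BIT(2)
  #define EPT_VIOLATION_PROT_READ   BIT(3)
  #define EPT_VIOLATION_PROT_WRITE  BIT(4)
  #define EPT_VIOLATION_PROT_EXEC   BIT(5)

  /* Keep the independent defines tied to the EPT-entry encoding. */
  static_assert(EPT_VIOLATION_PROT_READ  == (EPT_ENTRY_READ  << 3), "read");
  static_assert(EPT_VIOLATION_PROT_WRITE == (EPT_ENTRY_WRITE << 3), "write");
  static_assert(EPT_VIOLATION_PROT_EXEC  == (EPT_ENTRY_EXEC  << 3), "exec");

  int main(void)
  {
      unsigned long exit_qual = EPT_VIOLATION_ACC_WRITE | EPT_VIOLATION_PROT_READ;

      printf("write to a read-only mapping? %s\n",
             (exit_qual & EPT_VIOLATION_ACC_WRITE) &&
             !(exit_qual & EPT_VIOLATION_PROT_WRITE) ? "yes" : "no");
      return 0;
  }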
Cc: Jon Kohler <jon@nutanix.com>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250227000705.3199706-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Those defines are only used in the definition of the various
EPT_VIOLATION_ACC_* macros, which are then used to extract respective
bits from vmexit error qualifications. Remove the _BIT defines and
redefine the _ACC ones via BIT() macro. No functional changes.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250227000705.3199706-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Commit:
03b122da74 ("x86/sgx: Hook arch_memory_failure() into mainline code")
... added <linux/mm.h> to <asm/set_memory.h> to provide some helpers.
However the following commit:
b3fdf9398a ("x86/mce: relocate set{clear}_mce_nospec() functions")
... moved the inline definitions someplace else, and now <asm/set_memory.h>
just declares a bunch of mostly self-contained functions.
No need for the whole <linux/mm.h> inclusion to declare functions; just
remove that include. This helps avoid circular dependency headaches
(e.g. if <linux/mm.h> ends up including <linux/set_memory.h>).
This change requires a couple of include fixups not to break the
build:
* <asm/smp.h>: including <asm/thread_info.h> directly relies on
<linux/thread_info.h> having already been included, because the
former needs the BAD_STACK/NOT_STACK constants defined in the
latter. This is no longer the case when <asm/smp.h> is included from
some driver file - just include <linux/thread_info.h> to stay out
of trouble.
* sev-guest.c relies on <asm/set_memory.h> including <linux/mm.h>,
so we just need to make that include explicit.
[ mingo: Cleaned up the changelog ]
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20241212080904.2089632-3-kevin.brodsky@arm.com
__set_memory_prot() is unused since:
5c11f00b09 ("x86: remove memory hotplug support on X86_32")
Let's remove it.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20241212080904.2089632-2-kevin.brodsky@arm.com
Add AUTO mitigations for mds/taa/mmio/rfds to create consistent vulnerability
handling. These AUTO mitigations will be turned into the appropriate default
mitigations in the <vuln>_select_mitigation() functions. Later, these will be
used with the new attack vector controls to help select appropriate
mitigations.
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250108202515.385902-4-david.kaplan@amd.com
Move the mds, taa, mmio, and rfds mitigation enums earlier in the file to
prepare for restructuring of these mitigations as they are all inter-related.
No functional change.
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250108202515.385902-3-david.kaplan@amd.com
All CPU vulnerabilities with command line options map to a single X86_BUG bit
except for Spectre V2 where both the spectre_v2 and spectre_v2_user command
line options are related to the same bug.
The spectre_v2 command line options mostly relate to user->kernel and
guest->host mitigations, while the spectre_v2_user command line options relate
to user->user or guest->guest protections.
Define a new X86_BUG bit for spectre_v2_user so each *_select_mitigation()
function in bugs.c is related to a unique X86_BUG bit.
No functional changes.
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250108202515.385902-2-david.kaplan@amd.com
Replace X86_CMPXCHG64 with X86_CX8, as CX8 is the name of the CPUID
flag, thus to make it consistent with X86_FEATURE_CX8 defined in
<asm/cpufeatures.h>.
No functional change intended.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250228082338.73859-2-xin@zytor.com
Sometimes it can be very useful to run CPU vulnerability mitigations on
systems where they aren't known to mitigate any real-world
vulnerabilities. This can be handy for mundane reasons like debugging
HW-agnostic logic on whatever machine is to hand, but also for research
reasons: while some mitigations are focused on individual vulns and
uarches, others are fairly general, and it's strategically useful to
have an idea how they'd perform on systems where they aren't currently
needed.
As evidence for this being useful, a flag specifically for Retbleed was
added in:
5c9a92dec3 ("x86/bugs: Add retbleed=force").
Since CPU bugs are tracked using the same basic mechanism as features,
and there are already parameters for manipulating them by hand, extend
that mechanism to support bugs as well as capabilities.
With this patch and setcpuid=srso, a QEMU guest running on an Intel host
will boot with Safe-RET enabled.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-3-7dc71bce742a@google.com
In preparation for adding support to inject fake CPU bugs at boot-time,
add a general facility to force enablement of CPU flags.
The flag taints the kernel and the documentation attempts to be clear
that this is highly unsuitable for uses outside of kernel development
and platform experimentation.
The new arg is parsed just like clearcpuid, but instead of leading to
setup_clear_cpu_cap() it leads to setup_force_cpu_cap().
I've tested this by booting a nested QEMU guest on an Intel host, which
with setcpuid=svm will claim that it supports AMD virtualization.
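A rough user-space illustration of how such a comma-separated boot argument
can be parsed; setup_force_cpu_cap() is the kernel setter named above, while
the parsing helper and its behavior here are made up:
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Stand-in for the kernel's setup_force_cpu_cap(); just records the bit. */
  static void force_cpu_cap(int feature)
  {
      printf("forcing CPU capability bit %d\n", feature);
  }

  /* Parse a "setcpuid="-style comma-separated list of capability numbers.
   * The real parser also accepts feature names and taints the kernel. */
  static void parse_setcpuid(const char *arg)
  {
      char *dup = strdup(arg), *tok, *save = NULL;

      if (!dup)
          return;
      for (tok = strtok_r(dup, ",", &save); tok; tok = strtok_r(NULL, ",", &save))
          force_cpu_cap(atoi(tok));
      free(dup);
  }

  int main(void)
  {
      parse_setcpuid("94,281");   /* numbers are arbitrary examples */
      return 0;
  }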
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-2-7dc71bce742a@google.com
This is in preparation for a later commit that will reuse this code, to
make review convenient.
Factor out a helper function which does the full handling for this arg
including printing info to the console.
No functional change intended.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-1-7dc71bce742a@google.com
When running in a virtual machine, we might see the original hardware CPU
vendor string (i.e. "AuthenticAMD"), but a model and family ID set by the
hypervisor. In case we run on AMD hardware and the hypervisor sets a model
ID < 0x14, the LAHF CPU feature is eliminated from the list of CPU
capabilities presented, to circumvent a bug with some BIOSes in conjunction with
AMD K8 processors.
Parsing the flags list from /proc/cpuinfo seems to be happening mostly in
bash scripts and prebuilt Docker containers, as it does not need to have
additional tools present, even though more reliable ways like using "kcpuid",
which calls the CPUID instruction instead of parsing a list, should be preferred.
Scripts that use /proc/cpuinfo to determine whether the current CPU is
"compliant" with defined microarchitecture levels like x86-64-v2 will falsely
claim the CPU is incapable of modern CPU instructions when "lahf_lm" is missing
from that flags list.
This can prevent some Docker containers from starting or cause build scripts
to create unoptimized binaries.
Admittedly, this is more of a small inconvenience than a severe bug in the
kernel, and the shoddy scripts that rely on parsing /proc/cpuinfo should be
fixed instead.
This patch adds an additional check to see if we're running inside a
virtual machine (X86_FEATURE_HYPERVISOR is present), which, to my
understanding, can't be present on a real K8 processor as it was introduced
only with the later/other Athlon64 models.
Example output with the "lahf_lm" flag missing in the flags list
(should be shown between "hypervisor" and "abm"):
$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 6
model name : Common KVM processor
stepping : 1
microcode : 0x1000065
cpu MHz : 2599.998
cache size : 512 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp
lm rep_good nopl cpuid extd_apicid tsc_known_freq pni
pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c hypervisor abm
3dnowprefetch vmmcall bmi1 avx2 bmi2 xsaveopt
... while kcpuid shows the feature to be present in the CPU:
# kcpuid -d | grep lahf
lahf_lm - LAHF/SAHF available in 64-bit mode
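For comparison, the same information can be read straight from CPUID in a few
lines of user-space C (leaf 0x80000001, ECX bit 0), bypassing the /proc/cpuinfo
filtering described above; this is only an illustration, not part of the patch:
  #include <cpuid.h>
  #include <stdio.h>

  int main(void)
  {
      unsigned int eax, ebx, ecx, edx;

      if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
          puts("extended CPUID leaf not available");
          return 1;
      }
      /* CPUID.80000001H:ECX[0] == LAHF/SAHF available in 64-bit mode */
      printf("lahf_lm: %s\n", (ecx & 1) ? "yes" : "no");
      return 0;
  }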
[ mingo: Updated the comment a bit, incorporated Boris's review feedback. ]
Signed-off-by: Max Grobecker <max@grobecker.info>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Minimum version of binutils required to compile the kernel is 2.25.
This version correctly handles the "lock" prefix, so it is possible
to remove the semicolon, which was used to support ancient versions
of GNU as.
Due to the semicolon, the compiler considers "lock; insn" as two
separate instructions. Removing the semicolon makes asm length
calculations more accurate, consequently making scheduling and
inlining decisions of the compiler more accurate.
Removing the semicolon also enables assembler checks involving lock
prefix. Trying to assemble e.g. "lock andl %eax, %ebx" results in:
Error: expecting lockable instruction after `lock'
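The difference is easy to reproduce outside the kernel: with the semicolon the
assembler treats "lock" as a stand-alone statement, without it the prefix is
validated against the following instruction. An illustrative user-space
snippet:
  #include <stdio.h>

  int main(void)
  {
      int val = 0;

      /* One instruction: the prefix is validated against the insn. */
      asm volatile("lock incl %0" : "+m" (val));

      /* Also assembles, but as "lock" plus "incl" in separate statements,
       * so e.g. "lock; andl %eax, %ebx" would slip through unchecked,
       * while "lock andl %eax, %ebx" is rejected by the assembler. */
      asm volatile("lock; incl %0" : "+m" (val));

      printf("val = %d\n", val);   /* prints 2 */
      return 0;
  }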
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250228085149.2478245-1-ubizjak@gmail.com
The following commit:
1c811d403a ("x86/sev: Fix position dependent variable references in startup code")
introduced RIP_REL_REF() to force RIP-relative accesses to global variables,
as needed to prevent crashes during early SEV/SME startup code.
For completeness, RIP_REL_REF() should be used with additional variables during
sme_enable():
https://lore.kernel.org/all/CAMj1kXHnA0fJu6zh634=fbJswp59kSRAbhW+ubDGj1+NYwZJ-Q@mail.gmail.com/
Access these vars with RIP_REL_REF() to prevent problem reoccurence.
Fixes: 1c811d403a ("x86/sev: Fix position dependent variable references in startup code")
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241122202322.977678-1-kevinloughlin@google.com
Fix this sparse warning:
arch/x86/kernel/quirks.c:662:6: warning: symbol 'x86_apple_machine' was not declared. Should it be static?
Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241126015636.3463994-1-zhangkunbo@huawei.com
Fix this sparse warning:
arch/x86/kernel/i8259.c:57:15: warning: symbol 'io_apic_irqs' was not declared. Should it be static?
Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241126020511.3464664-1-zhangkunbo@huawei.com
The first GDT descriptor is reserved as 'NULL descriptor'. As bits 0
and 1 of a segment selector, i.e., the RPL bits, are NOT used to index
GDT, selector values 0~3 all point to the NULL descriptor, thus values
0, 1, 2 and 3 are all valid NULL selector values.
When a NULL selector value is to be loaded into a segment register,
reload_segments() sets its RPL bits. Later IRET zeros ES, FS, GS, and
DS segment registers if any of them is found to have any nonzero NULL
selector value. The two operations offset each other to actually effect
a nop.
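A tiny user-space illustration (not part of the patch) of why selector values
0 through 3 are all NULL selectors: bits 1:0 are the RPL, bit 2 is the table
indicator, and the descriptor index starts at bit 3:
  #include <stdio.h>

  int main(void)
  {
      for (unsigned int sel = 0; sel <= 4; sel++) {
          unsigned int rpl   = sel & 3;          /* requested privilege level */
          unsigned int ti    = (sel >> 2) & 1;   /* 0 = GDT, 1 = LDT */
          unsigned int index = sel >> 3;         /* descriptor index */

          printf("selector %u: index=%u ti=%u rpl=%u%s\n",
                 sel, index, ti, rpl,
                 index == 0 && ti == 0 ? "  -> NULL descriptor" : "");
      }
      return 0;
  }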
Besides, zeroing of RPL in NULL selector values is an information leak
in pre-FRED systems as userspace can spot any interrupt/exception by
loading a nonzero NULL selector, and waiting for it to become zero.
But there is nothing software can do to prevent it before FRED.
ERETU, the only legit instruction to return to userspace from kernel
under FRED, by design does NOT zero any segment register to avoid this
problem behavior.
As such, leave NULL selector values 0~3 unchanged and close the leak.
Do the same on 32-bit kernel as well.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241126184529.1607334-1-xin@zytor.com
print_xstate_features() currently invokes print_xstate_feature() multiple
times on separate lines, which can be simplified in a loop.
print_xstate_feature() already checks the feature's enabled status and is
only called within print_xstate_features(). Inline print_xstate_feature()
and iterate over features in a loop to streamline the enabling message.
No functional changes.
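In user-space terms the shape of the change is simply replacing per-feature
print calls with one loop over a feature table; all names and messages below
are invented:
  #include <stdio.h>
  #include <stdbool.h>

  struct xfeature {
      const char *name;
      bool enabled;
  };

  static const struct xfeature xfeatures[] = {
      { "x87 floating point registers", true  },
      { "SSE registers",                true  },
      { "AVX registers",                false },
  };

  static void print_xstate_features(void)
  {
      /* One loop instead of one call per feature. */
      for (unsigned int i = 0; i < sizeof(xfeatures) / sizeof(xfeatures[0]); i++) {
          if (!xfeatures[i].enabled)
              continue;
          printf("x86/fpu: Supporting XSAVE feature: '%s'\n", xfeatures[i].name);
      }
  }

  int main(void)
  {
      print_xstate_features();
      return 0;
  }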
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250227184502.10288-2-chang.seok.bae@intel.com
Before restoring xstate from the user space buffer, the kernel performs
sanity checks on these magic numbers: magic1 in the software reserved
area, and magic2 at the end of the XSAVE region.
The position of magic2 is calculated based on the xstate size derived
from the user space buffer. But, the in-kernel record is directly
available and reliable for this purpose.
This reliance on user space data is also inconsistent with the recent
fix in:
d877550eaf ("x86/fpu: Stop relying on userspace for info to fault in xsave buffer")
Simply use fpstate->user_size, and then get rid of unnecessary
size-evaluation code.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211014500.3738-1-chang.seok.bae@intel.com
Refactor parity calculations to use the standard parity8() helper. This
change eliminates redundant implementations and improves code
efficiency.
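A small stand-alone sketch of the before/after shape; parity8() below is a
local stand-in for the kernel helper of the same name:
  #include <stdio.h>
  #include <stdint.h>
  #include <stdbool.h>

  /* Stand-in for the kernel's parity8(): 1 if the byte has odd parity. */
  static bool parity8(uint8_t val)
  {
      val ^= val >> 4;
      val ^= val >> 2;
      val ^= val >> 1;
      return val & 1;
  }

  /* The kind of open-coded computation such a patch replaces. */
  static bool parity_open_coded(uint8_t val)
  {
      bool odd = false;

      for (int i = 0; i < 8; i++)
          odd ^= (val >> i) & 1;
      return odd;
  }

  int main(void)
  {
      for (unsigned int v = 0; v < 256; v++)
          if (parity8((uint8_t)v) != parity_open_coded((uint8_t)v))
              printf("mismatch at %u\n", v);
      puts("helpers agree on all byte values");
      return 0;
  }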
[ ubizjak: Updated the patch to apply to the latest x86 tree. ]
Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250227125616.2253774-1-ubizjak@gmail.com
Calls to get_this_hybrid_cpu_type() and get_this_hybrid_cpu_native_id()
are no longer required, because cpu-type and native-model-id are cached
at boot in the per-cpu struct cpuinfo_topology.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211-add-cpu-type-v5-4-2ae010f50370@linux.intel.com
get_this_hybrid_cpu_type() misses a case when cpu-type is populated
regardless of X86_FEATURE_HYBRID_CPU. This is particularly true for hybrid
variants that have P or E cores fused off.
Instead use the cpu-type cached in struct x86_topology, as it does not rely
on hybrid feature to enumerate cpu-type. This can also help avoid the
model-specific fixup get_hybrid_cpu_type(). Also replace the
get_this_hybrid_cpu_native_id() with its cached value in struct
x86_topology.
While at it, remove enum hybrid_cpu_type as it serves no purpose when we
have the exact cpu-types defined in enum intel_cpu_type. Also rename
atom_native_id to intel_native_id and move it to intel-family.h where
intel_cpu_type lives.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211-add-cpu-type-v5-3-2ae010f50370@linux.intel.com
The hex values in the CPU debug interface are not prefixed with 0x. This may
cause misinterpretation of values. Fix it.
[ mingo: Restore previous vertical alignment of the output. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211-add-cpu-type-v5-1-2ae010f50370@linux.intel.com
The CONFIG_EISA menu was cleaned up in 2018, but this inadvertently
brought the option back on 64-bit machines: ISA remains guarded by
a CONFIG_X86_32 check, but EISA no longer depends on ISA.
The last Intel machines with EISA support used an 82375EB PCI/EISA bridge
from 1993 that could be paired with the 440FX chipset on early Pentium-II
CPUs, long before the first x86-64 products.
Fixes: 6630a8e501 ("eisa: consolidate EISA Kconfig entry in drivers/eisa")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-11-arnd@kernel.org
ST ConneXt STA2x11 was an interface chip for Atom E6xx processors,
using a number of components usually found on Arm SoCs. Most of this
was merged upstream, but it was never complete enough to actually work
and has been abandoned for many years.
We already had an agreement on removing it in 2022, but nobody ever
submitted the patch to do it.
Without STA2x11, CONFIG_X86_32_NON_STANDARD no longer has any
use - remove it.
Suggested-by: Davide Ciminaghi <ciminaghi@gnudd.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-10-arnd@kernel.org
The X86_INTEL_MID code was originally introduced for the 32-bit
Moorestown/Medfield/Clovertrail platform, later the 64-bit
Merrifield/Moorefield variants were added, but the final Morganfield
14nm platform was canceled before it hit the market.
To help users understand what the option actually refers to, update the
help text, and add a dependency on 64-bit kernels.
Ferry confirmed that all the hardware can run 64-bit kernels these days,
but is still testing 32-bit kernels on the Intel Edison board, so this
remains possible, but is guarded by a CONFIG_EXPERT dependency now,
to gently push remaining users towards using CONFIG_64BIT.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Andy Shevchenko <andy@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-9-arnd@kernel.org
With the maximum amount of RAM now 4GB, there is very little point
in still keeping PTE pages in highmem. Drop this for simplification.
The only other architecture supporting HIGHPTE is 32-bit arm, and
once that feature is removed as well, the highpte logic can be
dropped from common code as well.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-8-arnd@kernel.org
Since kernels with and without CONFIG_X86_PAE are now limited
to the low 4GB of physical address space, there is no need to
use swiotlb any more, so stop selecting this.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-7-arnd@kernel.org
HIGHMEM64G support was added in linux-2.3.25 to support (then)
high-end Pentium Pro and Pentium III Xeon servers with more than 4GB of
addressing, back when NUMA and PCI-X slots started appearing.
I have found no evidence of this ever being used in regular dual-socket
servers or consumer devices, all the users seem obsolete these days,
even by i386 standards:
- Support for NUMA servers (NUMA-Q, IBM x440, unisys) was already
removed ten years ago.
- 4+ socket non-NUMA servers based on Intel 450GX/450NX, HP F8 and
ServerWorks ServerSet/GrandChampion could theoretically still work
with 8GB, but these were exceptionally rare even 20 years ago and
would usually have been equipped with less than the maximum amount of
RAM.
- Some SKUs of the Celeron D from 2004 had 64-bit mode fused off but
could still work in a Socket 775 mainboard designed for the later
Core 2 Duo and 8GB. Apparently most BIOSes at the time only allowed
64-bit CPUs.
- The rare Xeon LV "Sossaman" came on a few motherboards with
registered DDR2 memory support up to 16GB.
- In the early days of x86-64 hardware, there was sometimes the need
to run a 32-bit kernel to work around bugs in the hardware drivers,
or in the syscall emulation for 32-bit userspace. This likely still
works but there should never be a need for this any more.
PAE mode is still required to get access to the 'NX' bit on Atom,
'Pentium M' and 'Core Duo' CPUs.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-6-arnd@kernel.org
The x86 CPU selection menu is confusing for a number of reasons:
When configuring 32-bit kernels, it shows a small number of early 64-bit
microarchitectures (K8, Core 2) but not the regular generic 64-bit target
that is the normal default. There is no longer a reason to run 32-bit
kernels on production 64-bit systems, so only actual 32-bit CPUs need
to be shown here.
When configuring 64-bit kernels, the options are also pointless as there is
no way to pick any CPU from the past 15 years, leaving GENERIC_CPU as
the only sensible choice.
Address both of the above by removing the obsolete options and making
all 64-bit kernels run on both Intel and AMD CPUs from any generation.
Testing generic 32-bit kernels on 64-bit hardware remains possible,
just not building a 32-bit kernel that requires a 64-bit CPU.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-5-arnd@kernel.org
Building an x86-64 kernel with CONFIG_GENERIC_CPU is documented to
run on all CPUs, but the Makefile does not actually pass an -march=
argument, instead relying on the default that was used to configure
the toolchain.
In many cases, gcc will be configured to -march=x86-64 or -march=k8
for maximum compatibility, but in other cases a distribution default
may be either raised to a more recent ISA, or set to -march=native
to build for the CPU used for compilation. This still works in the
case of building a custom kernel for the local machine.
The point where it breaks down is building a kernel for another
machine that is older than the default target. Changing the default
to -march=x86-64 would make it work reliably, but possibly produce
worse code on distros that intentionally default to a newer ISA.
To allow reliably building a kernel for even the oldest x86-64
CPUs, pass the -march=x86-64 flag to the compiler. This was not
possible in early versions of x86-64 gcc, but works on all currently
supported versions down to at least gcc-5.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-4-arnd@kernel.org
The x86-32 kernel used to support multiple platforms with more than eight
logical CPUs, from the 1999-2003 timeframe: Sequent NUMA-Q, IBM Summit,
Unisys ES7000 and HP F8. Support for all except the latter was dropped
back in 2014, leaving only the F8 based DL740 and DL760 G2 machines in
this category, with up to eight single-core Socket-603 Xeon-MP processors
with hyperthreading.
Like the already removed machines, the HP F8 servers at the time cost
upwards of $100k in typical configurations, but were quickly obsoleted
by their 64-bit Socket-604 cousins and the AMD Opteron.
Earlier servers with up to 8 Pentium Pro or Xeon processors remain
fully supported as they had no hyperthreading. Similarly, the more
common 4-socket Xeon-MP machines with hyperthreading using Intel
or ServerWorks chipsets continue to work without this, and all the
multi-core Xeon processors also run 64-bit kernels.
While the "bigsmp" support can also be used to run on later 64-bit
machines (including VM guests), it seems best to discourage that
and get any remaining users to update their kernels to 64-bit builds
on these. As a side-effect of this, there is also no more need to
support NUMA configurations on 32-bit x86, as all true 32-bit
NUMA platforms are already gone.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-3-arnd@kernel.org