- Shrink 'struct instruction', to improve objtool performance & memory
footprint.
- Other maximum memory usage reductions - this makes the build both faster,
and fixes kernel build OOM failures on allyesconfig and similar configs
when they try to build the final (large) vmlinux.o.
- Fix ORC unwinding when a kprobe (INT3) is set on a stack-modifying
single-byte instruction (PUSH/POP or LEAVE). This requires the
extension of the ORC metadata structure with a 'signal' field.
- Misc fixes & cleanups.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmQAVp8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gV6A//YbWb4nNxYbRFBd1O3FnFfy4efrDQ4btI
hwkL6f7jka9RnIpIEatJvaLdNvyN5tuPCC/+B5eVnvFdd1JcBUmj5D+zYFt6H6qt
BG4M6TNHFkP1kOJVfFGn8UPRfoMz2oMiEqilpsc1Yuf7b3ldMJtGUoHaeZC9pyqe
RUisKNw4WHZp2G/gTBUWxW17xpWY3Awgch/w4HCu8wMnR+uEC44i0UCBfnAadl36
ar66PfhMJcQIv0XkK9wu43g7+HFnjpxHOx35JW3lRot0xRnwl/JcsmaX5iPkh0gt
HV8eLH80J0homeMZDY7vWIKJxGeLkIdfjO5gxwTdnFc9rQw3GwHp1B7WTS6J3Vwe
gM00kyaGly3CvkKMiz5QQBfViWCjE25nYS8X0i9Oz6Gk58IkRPGByaDTKRjNrDJB
BwH9DE9xb3dPVZRv/PejkTdggQWo+FDTrL8ulHIjUFK11M7VubwkskecNHkfpAOE
TRy5iLjMocF8u7hdyec6Mma2K6qEndC2Rw9ZMPQ7TeieMsBcl63cSRgSJLFfdRhr
/5c6Hr2SNQKU8xu+3j49GyBwFvp4CwCa+GPs9/o+l0uCvuKNIn9B788cm4TjxLJ9
C3PRzE6B/CaLhYvlC5k5cNM+I4YpoMU/mvSvY6HcC0Duj2nSAWS2VV60MVMDpqVX
8nK4xnla2tM=
=bpPY
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2023-03-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- Shrink 'struct instruction', to improve objtool performance & memory
footprint
- Other maximum memory usage reductions - this makes the build both
faster, and fixes kernel build OOM failures on allyesconfig and
similar configs when they try to build the final (large) vmlinux.o
- Fix ORC unwinding when a kprobe (INT3) is set on a stack-modifying
single-byte instruction (PUSH/POP or LEAVE). This requires the
extension of the ORC metadata structure with a 'signal' field
- Misc fixes & cleanups
* tag 'objtool-core-2023-03-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
objtool: Fix ORC 'signal' propagation
objtool: Remove instruction::list
x86: Fix FILL_RETURN_BUFFER
objtool: Fix overlapping alternatives
objtool: Union instruction::{call_dest,jump_table}
objtool: Remove instruction::reloc
objtool: Shrink instruction::{type,visited}
objtool: Make instruction::alts a single-linked list
objtool: Make instruction::stack_ops a single-linked list
objtool: Change arch_decode_instruction() signature
x86/entry: Fix unwinding from kprobe on PUSH/POP instruction
x86/unwind/orc: Add 'signal' field to ORC metadata
objtool: Optimize layout of struct special_alt
objtool: Optimize layout of struct symbol
objtool: Allocate multiple structures with calloc()
objtool: Make struct check_options static
objtool: Make struct entries[] static and const
objtool: Fix HOSTCC flag usage
objtool: Properly support make V=1
objtool: Install libsubcmd in build
...
When plain IBRS is enabled (not enhanced IBRS), the logic in
spectre_v2_user_select_mitigation() determines that STIBP is not needed.
The IBRS bit implicitly protects against cross-thread branch target
injection. However, with legacy IBRS, the IBRS bit is cleared on
returning to userspace for performance reasons which leaves userspace
threads vulnerable to cross-thread branch target injection against which
STIBP protects.
Exclude IBRS from the spectre_v2_in_ibrs_mode() check to allow for
enabling STIBP (through seccomp/prctl() by default or always-on, if
selected by spectre_v2_user kernel cmdline parameter).
[ bp: Massage. ]
Fixes: 7c693f54c8 ("x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS")
Reported-by: José Oliveira <joseloliveira11@gmail.com>
Reported-by: Rodrigo Branco <rodrigo@kernelhacking.com>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230220120127.1975241-1-kpsingh@kernel.org
Link: https://lore.kernel.org/r/20230221184908.2349578-1-kpsingh@kernel.org
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogenous systems. Non secure
software has no practical need to traverse the caches by set/way in
the first place.
- Add support for taking stage-2 access faults in parallel. This was an
accidental omission in the original parallel faults implementation,
but should provide a marginal improvement to machines w/o FEAT_HAFDBS
(such as hardware from the fruit company).
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception handling
and masking unsupported features for nested guests.
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM.
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at reducing
the trap overhead of running nested.
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems.
- Avoid VM-wide stop-the-world operations when a vCPU accesses its own
redistributor.
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions
in the host.
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
This also drags in arm64's 'for-next/sme2' branch, because both it and
the PSCI relay changes touch the EL2 initialization code.
RISC-V:
- Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE
- Correctly place the guest in S-mode after redirecting a trap to the guest
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
s390:
- Two patches sorting out confusion between virtual and physical
addresses, which currently are the same on s390.
- A new ioctl that performs cmpxchg on guest memory
- A few fixes
x86:
- Change tdp_mmu to a read-only parameter
- Separate TDP and shadow MMU page fault paths
- Enable Hyper-V invariant TSC control
- Fix a variety of APICv and AVIC bugs, some of them real-world,
some of them affecting architecurally legal but unlikely to
happen in practice
- Mark APIC timer as expired if its in one-shot mode and the count
underflows while the vCPU task was being migrated
- Advertise support for Intel's new fast REP string features
- Fix a double-shootdown issue in the emergency reboot code
- Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give SVM
similar treatment to VMX
- Update Xen's TSC info CPUID sub-leaves as appropriate
- Add support for Hyper-V's extended hypercalls, where "support" at this
point is just forwarding the hypercalls to userspace
- Clean up the kvm->lock vs. kvm->srcu sequences when updating the PMU and
MSR filters
- One-off fixes and cleanups
- Fix and cleanup the range-based TLB flushing code, used when KVM is
running on Hyper-V
- Add support for filtering PMU events using a mask. If userspace
wants to restrict heavily what events the guest can use, it can now
do so without needing an absurd number of filter entries
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel Sapphire Rapids
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
- Move several SVM-specific flags into vcpu_svm
x86 Intel:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't support
EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlighted MSR bitmaps
Generic:
- Clean up the hardware enable and initialization flow, which was
scattered around multiple arch-specific hooks. Instead, just
let the arch code call into generic code. Both x86 and ARM should
benefit from not having to fight common KVM code's notion of how
to do initialization.
- Account allocations in generic kvm_arch_alloc_vm()
- Fix a memory leak if coalesced MMIO unregistration fails
selftests:
- On x86, cache the CPU vendor (AMD vs. Intel) and use the info to emit
the correct hypercall instruction instead of relying on KVM to patch
in VMMCALL
- Use TAP interface for kvm_binary_stats_test and tsc_msrs_test
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmP2YA0UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPg/Qf+J6nT+TkIa+8Ei+fN1oMTDp4YuIOx
mXvJ9mRK9sQ+tAUVwvDz3qN/fK5mjsYbRHIDlVc5p2Q3bCrVGDDqXPFfCcLx1u+O
9U9xjkO4JxD2LS9pc70FYOyzVNeJ8VMGOBbC2b0lkdYZ4KnUc6e/WWFKJs96bK+H
duo+RIVyaMthnvbTwSv1K3qQb61n6lSJXplywS8KWFK6NZAmBiEFDAWGRYQE9lLs
VcVcG0iDJNL/BQJ5InKCcvXVGskcCm9erDszPo7w4Bypa4S9AMS42DHUaRZrBJwV
/WqdH7ckIz7+OSV0W1j+bKTHAFVTCjXYOM7wQykgjawjICzMSnnG9Gpskw==
=goe1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogenous systems. Non secure
software has no practical need to traverse the caches by set/way in
the first place
- Add support for taking stage-2 access faults in parallel. This was
an accidental omission in the original parallel faults
implementation, but should provide a marginal improvement to
machines w/o FEAT_HAFDBS (such as hardware from the fruit company)
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception
handling and masking unsupported features for nested guests
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at
reducing the trap overhead of running nested
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems
- Avoid VM-wide stop-the-world operations when a vCPU accesses its
own redistributor
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected
exceptions in the host
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
RISC-V:
- Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE
- Correctly place the guest in S-mode after redirecting a trap to the
guest
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
s390:
- Sort out confusion between virtual and physical addresses, which
currently are the same on s390
- A new ioctl that performs cmpxchg on guest memory
- A few fixes
x86:
- Change tdp_mmu to a read-only parameter
- Separate TDP and shadow MMU page fault paths
- Enable Hyper-V invariant TSC control
- Fix a variety of APICv and AVIC bugs, some of them real-world, some
of them affecting architecurally legal but unlikely to happen in
practice
- Mark APIC timer as expired if its in one-shot mode and the count
underflows while the vCPU task was being migrated
- Advertise support for Intel's new fast REP string features
- Fix a double-shootdown issue in the emergency reboot code
- Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give
SVM similar treatment to VMX
- Update Xen's TSC info CPUID sub-leaves as appropriate
- Add support for Hyper-V's extended hypercalls, where "support" at
this point is just forwarding the hypercalls to userspace
- Clean up the kvm->lock vs. kvm->srcu sequences when updating the
PMU and MSR filters
- One-off fixes and cleanups
- Fix and cleanup the range-based TLB flushing code, used when KVM is
running on Hyper-V
- Add support for filtering PMU events using a mask. If userspace
wants to restrict heavily what events the guest can use, it can now
do so without needing an absurd number of filter entries
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel Sapphire Rapids
- Fix a mostly benign overflow bug in SEV's
send|receive_update_data()
- Move several SVM-specific flags into vcpu_svm
x86 Intel:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't
support EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlighted MSR bitmaps
Generic:
- Clean up the hardware enable and initialization flow, which was
scattered around multiple arch-specific hooks. Instead, just let
the arch code call into generic code. Both x86 and ARM should
benefit from not having to fight common KVM code's notion of how to
do initialization
- Account allocations in generic kvm_arch_alloc_vm()
- Fix a memory leak if coalesced MMIO unregistration fails
selftests:
- On x86, cache the CPU vendor (AMD vs. Intel) and use the info to
emit the correct hypercall instruction instead of relying on KVM to
patch in VMMCALL
- Use TAP interface for kvm_binary_stats_test and tsc_msrs_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (325 commits)
KVM: SVM: hyper-v: placate modpost section mismatch error
KVM: x86/mmu: Make tdp_mmu_allowed static
KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID
KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes
KVM: arm64: nv: Filter out unsupported features from ID regs
KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2
KVM: arm64: nv: Allow a sysreg to be hidden from userspace only
KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor
KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
KVM: arm64: nv: Handle SMCs taken from virtual EL2
KVM: arm64: nv: Handle trapped ERET from virtual EL2
KVM: arm64: nv: Inject HVC exceptions to the virtual EL2
KVM: arm64: nv: Support virtual EL2 exceptions
KVM: arm64: nv: Handle HCR_EL2.NV system register traps
KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state
KVM: arm64: nv: Add EL2 system registers to vcpu context
KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x
KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set
KVM: arm64: nv: Introduce nested virtualization VCPU feature
KVM: arm64: Use the S2 MMU context to iterate over S2 table
...
- Prevent unexpected #VE's from:
- Hosts removing perfectly good guest mappings (SEPT_VE_DISABLE
- Excessive #VE notifications (NOTIFY_ENABLES) which are
delivered via a #VE.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmP1WzMACgkQaDWVMHDJ
krAqig/+MzIYmUIkuYbluektxPdzI6zhY/Z+eD5DDH9OFZX5e0WQrmHpQbJ3i4Q6
LT5JQ+yAI2ox/mhPfyCeDXqdRiatJJExUDUepc0qsOEW9gTsJ+edYwUsJg8HII61
+TLz/BiMSF6xCUk46b4CqzhoeEk1dupFAG204uc4vGwSfXdysN3buAcciJc1rOTS
7G9hI9fdLSjEJ8yyFebSDMPxSmdnjJPrDK3LF/leGJEpAQ/eMU0entG4ZH3Uyh2s
3EnDpOdRjX56LAEixB4e5igXyS7wesCun4ytOnwndzW8p4gPIsypcJUEbVt84BfA
HQaSWP35BFAn0JshJnFPmj4r4jV2EB8l630dVTOKdNSiIa3YjyB5nbzy+mMPFl4f
8vcrHEZ6boEcRhgz0zFG0RfnDsjdbqKgFBXdRt0vYB/CG+EfmYaPoDXsb/8A7dtc
8IQ9wLk2AqG0L8blZVS2kjFxNa/9lkDcMsAbfZmlORTQTF2WN2Jlbxri87vuBpRy
8sqMUhgvHoffd/SIiDzJJIBjOH5/RhXLKhGzXQHI1vpZdU6ps9KIvohiycgx1mUQ
lXXQwN5OWSHdUXZ7TFBIGXy9n32Ak/k5GCzCJSqvsMJDDdbycGVB+YCaKX6QK30+
HAHrPy/FQ3FFvZWdsDMD5Pn4RkF4LYH/k4QZwqBFMs9+/Sdzwxc=
=UpyL
-----END PGP SIGNATURE-----
Merge tag 'x86_tdx_for_6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull Intel Trust Domain Extensions (TDX) updates from Dave Hansen:
"Other than a minor fixup, the content here is to ensure that TDX
guests never see virtualization exceptions (#VE's) that might be
induced by the untrusted VMM.
This is a highly desirable property. Without it, #VE exception
handling would fall somewhere between NMIs, machine checks and total
insanity. With it, #VE handling remains pretty mundane.
Summary:
- Fixup comment typo
- Prevent unexpected #VE's from:
- Hosts removing perfectly good guest mappings (SEPT_VE_DISABLE)
- Excessive #VE notifications (NOTIFY_ENABLES) which are delivered
via a #VE"
* tag 'x86_tdx_for_6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall()
x86/tdx: Disable NOTIFY_ENABLES
x86/tdx: Relax SEPT_VE_DISABLE check for debug TD
x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLE
x86/tdx: Expand __tdx_hypercall() to handle more arguments
x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments
x86/tdx: Add more registers to struct tdx_hypercall_args
x86/tdx: Fix typo in comment in __tdx_hypercall()
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter". These filters provide users
with finer-grained control over DAMOS's actions. SeongJae has also done
some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series "mm:
support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his
series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings. The previous BPF-based approach had
shortcomings. See "mm: In-kernel support for memory-deny-write-execute
(MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a per-node
basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage during
compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in ths
series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series
"mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
"fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of
the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the series
"mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
=MlGs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
- Performance tweaks for efifb earlycon by Andy
- Preparatory refactoring and cleanup work in the efivar layer by Johan,
which is needed to accommodate the Snapdragon arm64 laptops that
expose their EFI variable store via a TEE secure world API.
- Enhancements to the EFI memory map handling so that Xen dom0 can
safely access EFI configuration tables (Demi Marie)
- Wire up the newly introduced IBT/BTI flag in the EFI memory attributes
table, so that firmware that is generated with ENDBR/BTI landing pads
will be mapped with enforcement enabled.
- Clean up how we check and print the EFI revision exposed by the
firmware.
- Incorporate EFI memory attributes protocol definition contributed by
Evgeniy and wire it up in the EFI zboot code. This ensures that these
images can execute under new and stricter rules regarding the default
memory permissions for EFI page allocations. (More work is in progress
here)
- CPER header cleanup by Dan Williams
- Use a raw spinlock to protect the EFI runtime services stack on arm64
to ensure the correct semantics under -rt. (Pierre)
- EFI framebuffer quirk for Lenovo Ideapad by Darrell.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmPzuwsACgkQw08iOZLZ
jyS7dwwAm95DlDxFIQi4FmTm2mqJws9PyDrkfaAK1CoyqCgeOLQT2FkVolgr8jne
pwpwCTXtYP8y0BZvdQEIjpAq/BHKaD3GJSPfl7lo+pnUu68PpsFWaV6EdT33KKfj
QeF0MnUvrqUeTFI77D+S0ZW2zxdo9eCcahF3TPA52/bEiiDHWBF8Qm9VHeQGklik
zoXA15ft3mgITybgjEA0ncGrVZiBMZrYoMvbdkeoedfw02GN/eaQn8d2iHBtTDEh
3XNlo7ONX0v50cjt0yvwFEA0AKo0o7R1cj+ziKH/bc4KjzIiCbINhy7blroSq+5K
YMlnPHuj8Nhv3I+MBdmn/nxRCQeQsE4RfRru04hfNfdcqjAuqwcBvRXvVnjWKZHl
CmUYs+p/oqxrQ4BjiHfw0JKbXRsgbFI6o3FeeLH9kzI9IDUPpqu3Ma814FVok9Ai
zbOCrJf5tEtg5tIavcUESEMBuHjEafqzh8c7j7AAqbaNjlihsqosDy9aYoarEi5M
f/tLec86
=+pOz
-----END PGP SIGNATURE-----
Merge tag 'efi-next-for-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
"A healthy mix of EFI contributions this time:
- Performance tweaks for efifb earlycon (Andy)
- Preparatory refactoring and cleanup work in the efivar layer, which
is needed to accommodate the Snapdragon arm64 laptops that expose
their EFI variable store via a TEE secure world API (Johan)
- Enhancements to the EFI memory map handling so that Xen dom0 can
safely access EFI configuration tables (Demi Marie)
- Wire up the newly introduced IBT/BTI flag in the EFI memory
attributes table, so that firmware that is generated with ENDBR/BTI
landing pads will be mapped with enforcement enabled
- Clean up how we check and print the EFI revision exposed by the
firmware
- Incorporate EFI memory attributes protocol definition and wire it
up in the EFI zboot code (Evgeniy)
This ensures that these images can execute under new and stricter
rules regarding the default memory permissions for EFI page
allocations (More work is in progress here)
- CPER header cleanup (Dan Williams)
- Use a raw spinlock to protect the EFI runtime services stack on
arm64 to ensure the correct semantics under -rt (Pierre)
- EFI framebuffer quirk for Lenovo Ideapad (Darrell)"
* tag 'efi-next-for-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits)
firmware/efi sysfb_efi: Add quirk for Lenovo IdeaPad Duet 3
arm64: efi: Make efi_rt_lock a raw_spinlock
efi: Add mixed-mode thunk recipe for GetMemoryAttributes
efi: x86: Wire up IBT annotation in memory attributes table
efi: arm64: Wire up BTI annotation in memory attributes table
efi: Discover BTI support in runtime services regions
efi/cper, cxl: Remove cxl_err.h
efi: Use standard format for printing the EFI revision
efi: Drop minimum EFI version check at boot
efi: zboot: Use EFI protocol to remap code/data with the right attributes
efi/libstub: Add memory attribute protocol definitions
efi: efivars: prevent double registration
efi: verify that variable services are supported
efivarfs: always register filesystem
efi: efivars: add efivars printk prefix
efi: Warn if trying to reserve memory under Xen
efi: Actually enable the ESRT under Xen
efi: Apply allowlist to EFI configuration tables when running under Xen
efi: xen: Implement memory descriptor lookup based on hypercall
efi: memmap: Disregard bogus entries instead of returning them
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmP01jEACgkQUqAMR0iA
lPIH1g/9G3CLt9+by3d0FiS5AsbK6vohZRzKxqCTyX9b2p2sLQuiTu8TodJ9IOur
axpaOuPOIjQ253yfqrYL2YZHdfr/632nSTGsT18p7k8a9m6ghQ6cW+uq23Ro9W43
uubjyvFMTeLBnkDTSJquciENdMtyPwyiTN0+ZvDOsAOKoBt7i3Rt7lcorjDuALwb
2SQ80qCjYLy1r6mkQyy3IhLQVSeRiaqkR6IAdxR5EeXmbSi0YYh0Jxz5AzPMeDw6
F1w7Whe6Vf8vf4JHVDqNazB3f+JH4JrmRvk6LxlGU33z9uJac4NIbboGDXzRP21h
A0UGqwJimeoi8dji9Y3cXRIYRc+HqyqL0IadSHHLCou746/CJDBQN1EjNNBgZgu6
cTELnRL9TsV9tWt9boqbMK0KfCFnOsGPMDAXXDH5G/ZUsvyDweGMNelgNTXURHFX
f1cTfdDj2T/iG3XDZHf35/rSI56BDQJcJ1G1fQc3sl+G9jDGxX3PBJYNAUpvC6cy
iewDeROcAAXmwUqaC8epJJfN2m9zJJpDgR1t4mZTnKkmRF090ZApyDA4M769kn0J
ioHvE8Pe9+GjY8syGzq3EJRyQnfXgTbAyKfiTpGWefLPrgkQ9LfFQRQ1S21sy483
Bz/KrNl1tMWFsSmngNYaiOZifjxVj8pYoIz1W2s0ore9sYu78YM=
=xdQJ
-----END PGP SIGNATURE-----
Merge tag 'livepatching-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Petr Mladek:
- Allow reloading a livepatched module by clearing livepatch-specific
relocations in the livepatch module.
Otherwise, the repeated load would fail on consistency checks.
* tag 'livepatching-for-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
livepatch,x86: Clear relocation targets on a module removal
x86/module: remove unused code in __apply_relocate_add
- Skip negative return code check for snprintf in eprobe.
- Add recursive call test cases for kprobe unit test
- Add 'char' type to probe events to show it as the character instead of value.
- Update kselftest kprobe-event testcase to ignore '__pfx_' symbols.
- Fix kselftest to check filter on eprobe event correctly.
- Add filter on eprobe to the README file in tracefs.
- Fix optprobes to check whether there is 'under unoptimizing' optprobe when optimizing another kprobe correctly.
- Fix optprobe to check whether there is 'under unoptimizing' optprobe when fetching the original instruction correctly.
- Fix optprobe to free 'forcibly unoptimized' optprobe correctly.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmP0JdYACgkQ2/sHvwUr
Pxt6sQf/TD9Kwqx3XG1tnLPev6yt2nuggUippHwWUFHlJtMyUaLV8aKFqByyEe+j
tCQvrFIIJq242xg0Jac/MAf2exlWG9jsmVZPmvC1YzepOAbjXu2eBkIS7LsbeHjF
JJypNnEceffWCpNoD6nlvR0xWXenqRbZJwdsGqo3u+fXnzTurEMY2GU2xOyv39tv
S1uNLPANJxdMb/2iUsUE3hMbe82dqr8zPcApqWFtTBB6QPHI3B2SjuQHpQxwbTPl
bzAl0yQkLSQXprVzT7xJ4xLnzbl1ljgJBci5aX8BFF+VD9oYkypdfYVczBH5VsP9
E3eT9T9lRf4Q99EqxNy5uw7NqQXGQg==
=CMPb
-----END PGP SIGNATURE-----
Merge tag 'probes-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull kprobes updates from Masami Hiramatsu:
- Skip negative return code check for snprintf in eprobe
- Add recursive call test cases for kprobe unit test
- Add 'char' type to probe events to show it as the character instead
of value
- Update kselftest kprobe-event testcase to ignore '__pfx_' symbols
- Fix kselftest to check filter on eprobe event correctly
- Add filter on eprobe to the README file in tracefs
- Fix optprobes to check whether there is 'under unoptimizing' optprobe
when optimizing another kprobe correctly
- Fix optprobe to check whether there is 'under unoptimizing' optprobe
when fetching the original instruction correctly
- Fix optprobe to free 'forcibly unoptimized' optprobe correctly
* tag 'probes-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/eprobe: no need to check for negative ret value for snprintf
test_kprobes: Add recursed kprobe test case
tracing/probe: add a char type to show the character value of traced arguments
selftests/ftrace: Fix probepoint testcase to ignore __pfx_* symbols
selftests/ftrace: Fix eprobe syntax test case to check filter support
tracing/eprobe: Fix to add filter on eprobe description in README file
x86/kprobes: Fix arch_check_optimized_kprobe check within optimized_kprobe range
x86/kprobes: Fix __recover_optprobed_insn check optimizing logic
kprobes: Fix to handle forcibly unoptimized kprobes on freeing_list
Add diagnostics to the x86 NMI handler to help detect NMI-handler bugs
on the one hand and failing hardware on the other.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmPq3x0THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jOwED/41rLVFORQqNpK5mitA2acqVzRmUG9J
sUHkJCPmHVr4sEDwqi2u+iBqHMqm8COaQOKA65tPsHJKI1PPcIjBG371QPPYsdRl
+qNq6oLCrD37Dgs7CmJPjIO+0P2Xb765GUmcNhR9aH9QnYGz2a3s7QfXV2WlFjq3
1LJ6Z8euEQBb5IE1syp3HHYf3IP4Z88gQxcU4kgV16uADnW0IKSw8F7p9B/EjSnB
IjIh8gkAbfqNh0VXpex/wzPkrXRbjcOr1s43YkoYS1t3ggIZc6MEGs1kTmXAjxo2
4S4CAPKfh4Btlez9VVIMwCDb56fHG6I5wyP+jH51dhNNKiuLqnSyHU3kWU3GFiYn
5Ix7BKtAtp/AzASrL1xildOYjN6gB2QdQijs+bvqzH8Rm8Nl1Yy1z6p6iqcJGe/q
cerzulajs+/UG2XKwRTWw6I3km4WkueEHYjEzmer+olK/Akx4COYiqMivdY5HIwK
M7cFVQz2EiHLP3fu2LWrOkNi/Dy96Vsuya0n3E8Ch7Xtdjez7QPAdWU7bLXY/OTd
jCRdd1MDPz87XQLSmbCV42nJzP5nNryBfijS7swqq9qL/D152ycctxOpIHa6/fJH
pze5nqRBxjwlhOatJds5mVbhtKwF01YV4wzcaHypCmBXXTz1zVj/hFSbzuUuFwEI
06c2sVNqzez4tw==
=MPMw
-----END PGP SIGNATURE-----
Merge tag 'nmi.2023.02.14a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull x86 NMI diagnostics from Paul McKenney:
"Add diagnostics to the x86 NMI handler to help detect NMI-handler bugs
on the one hand and failing hardware on the other"
* tag 'nmi.2023.02.14a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
x86/nmi: Print reasons why backtrace NMIs are ignored
x86/nmi: Accumulate NMI-progress evidence in exc_nmi()
API:
- Use kmap_local instead of kmap_atomic.
- Change request callback to take void pointer.
- Print FIPS status in /proc/crypto (when enabled).
Algorithms:
- Add rfc4106/gcm support on arm64.
- Add ARIA AVX2/512 support on x86.
Drivers:
- Add TRNG driver for StarFive SoC.
- Delete ux500/hash driver (subsumed by stm32/hash).
- Add zlib support in qat.
- Add RSA support in aspeed.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmPzAiwACgkQxycdCkmx
i6et8xAAoO3w5MZFGXMzWsYhfSZFdceXBEQfDR7JOCdHxpMIQhw0FLlb0uttFk6m
SeWrdP9wiifBDoCmw7qffFJml8ZftPL/XeXjob2d9v7jKbPyw3lDSIdsNfN/5EEL
oIc9915zwrgawvahPAa+PQ4Ue03qRjUyOcV42dpd1W3NYhzDVHoK5OUU+mEFYDvx
Sgw/YUugKf0VXkVDFzG5049+CPcheyRZqclAo9jyl2eZiXujgUyV33nxRCtqIA+t
7jlHKwi+6QzFHY0CX5BvShR8xyEuH5MLoU3H/jYGXnRb3nEpRYAEO4VZchIHqF0F
Y6pKIKc6Q8OyIVY8RsjQY3hioCqYnQFZ5Xtc1zGtOYEitVLbkmItMG0mVn0XOfyt
gJDi6gkEw5uPUbEQdI4R1xEgJ8eCckMsOJ+uRxqTm+uLqNDxPbsB9bohKniMogXV
lDlVXjU23AA9VeKtqU8FvWjfgqsN47X4aoq1j4/4aI7X9F7P9FOP21TZloP7+ssj
PFrzNaRXUrMEsvyS1wqPegIh987lj6WkH4hyU0wjzaIq4IQELidHsSXFS12iWIPH
kTEoC/trAVoYSr0zXKWUCs4h/x0FztVNbjs4KiDP2FLXX1RzeVZ0WlaXZhryHr+n
1+8yCuS6tVofAbSX0wNkZdf0x5+3CIBw4kqSIvjKDPYYEfIDaT0=
=dMYe
-----END PGP SIGNATURE-----
Merge tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
"API:
- Use kmap_local instead of kmap_atomic
- Change request callback to take void pointer
- Print FIPS status in /proc/crypto (when enabled)
Algorithms:
- Add rfc4106/gcm support on arm64
- Add ARIA AVX2/512 support on x86
Drivers:
- Add TRNG driver for StarFive SoC
- Delete ux500/hash driver (subsumed by stm32/hash)
- Add zlib support in qat
- Add RSA support in aspeed"
* tag 'v6.3-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (156 commits)
crypto: x86/aria-avx - Do not use avx2 instructions
crypto: aspeed - Fix modular aspeed-acry
crypto: hisilicon/qm - fix coding style issues
crypto: hisilicon/qm - update comments to match function
crypto: hisilicon/qm - change function names
crypto: hisilicon/qm - use min() instead of min_t()
crypto: hisilicon/qm - remove some unused defines
crypto: proc - Print fips status
crypto: crypto4xx - Call dma_unmap_page when done
crypto: octeontx2 - Fix objects shared between several modules
crypto: nx - Fix sparse warnings
crypto: ecc - Silence sparse warning
tls: Pass rec instead of aead_req into tls_encrypt_done
crypto: api - Remove completion function scaffolding
tls: Remove completion function scaffolding
tipc: Remove completion function scaffolding
net: ipv6: Remove completion function scaffolding
net: ipv4: Remove completion function scaffolding
net: macsec: Remove completion function scaffolding
dm: Remove completion function scaffolding
...
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmPzgDgTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXrc7CACfG4SSd8KkWU/y8Q66Irxdau0a3ETD
KL4UNRKGIyKujufgFsme79O6xVSSsCNSay449wk20hqn8lnwbSRi9pUwmLn29hfd
CMFleWIqgwGFfC1do5DRF1vrt1siuG/jVE07mWsEwuY2iHx/es+H7LiQKidhkndZ
DhXRqoi7VYiJv5fRSumpkUJrMZiI96o9Mk09HUksdMwCn3+7RQEqHnlTH5KOozKF
iMroDB72iNw5Na/USZwWL2EDRptENam3lFkPBeDPqNw0SbG4g65JGPR9DSa0Lkbq
AGCJQkdU33mcYQG5MY7R4K1evufpOl/apqLW7h92j45Znr9ok6Vr2c1R
=J1VT
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20230220' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- allow Linux to run as the nested root partition for Microsoft
Hypervisor (Jinank Jain and Nuno Das Neves)
- clean up the return type of callback functions (Dawei Li)
* tag 'hyperv-next-signed-20230220' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Fix hv_get/set_register for nested bringup
Drivers: hv: Make remove callback of hyperv driver void returned
Drivers: hv: Enable vmbus driver for nested root partition
x86/hyperv: Add an interface to do nested hypercalls
Drivers: hv: Setup synic registers in case of nested root partition
x86/hyperv: Add support for detecting nested hypervisor
where possible, when supporting a debug registers swap feature for
SEV-ES guests
- Add support for AMD's version of eIBRS called Automatic IBRS which is
a set-and-forget control of indirect branch restriction speculation
resources on privilege change
- Add support for a new x86 instruction - LKGS - Load kernel GS which is
part of the FRED infrastructure
- Reset SPEC_CTRL upon init to accomodate use cases like kexec which
rediscover
- Other smaller fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmP1RDIACgkQEsHwGGHe
VUohBw//ZB9ZRqsrKdm6D9YaP2x4Zb+kqKqo6rjYeWaYqyPyCwDujPwh+pb3Oq1t
aj62muDv1t/wEJc8mKNkfXkjEEtBVAOcpb5YIpKreoEvNKyevol83Ih0u5iJcTRE
E5qf8HDS8b/JZrcazJJLl6WQmQNH5RiKSu5bbCpRhoeOcyo5pRYR5MztK9vNmAQk
GMdwHsUSU+jN8uiE4HnpaOb/luhgFindRwZVTpdjJegQWLABS8cl3CKeTv4+PW45
isvv37XnQP248wsptIEVRHeG6g3g/HtvwRx7DikUw06QwUyUK7H9hJssOoSP8TL9
u4psRwfWnJ1OxU6klL+s0Ii+pjQ97wXmK/oqK7QkdUwhWqR/mQAW2e9kWHAngyDn
A6mKbzSM6HFAeSXQpB9cMb6uvYRD44SngDFe3WXtEK8jiiQ70ikUm4E28I5KJOPg
s+RyioHk0NFRHYSOOBqNG1NKz6ED7L3GbgbbzxkgMh21AAyI3X351t+PtGoLV5ew
eqOsM7lbg9Scg1LvPk1JcoALS8USWqgar397rz9qGUs+OkPWBtEBCmTdMz/Eb+2t
g/WHdLS5/ajSs5gNhT99W3DeqZMPDEkgBRSeyBBmY3CUD3gBL2wXEktRXv504zBR
RC4oyUPX3c9E2ib6GATLE3kBLbcz9hTWbMxF+X3lLJvTVd/Qc2o=
=v/ZC
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
- Cache the AMD debug registers in per-CPU variables to avoid MSR
writes where possible, when supporting a debug registers swap feature
for SEV-ES guests
- Add support for AMD's version of eIBRS called Automatic IBRS which is
a set-and-forget control of indirect branch restriction speculation
resources on privilege change
- Add support for a new x86 instruction - LKGS - Load kernel GS which
is part of the FRED infrastructure
- Reset SPEC_CTRL upon init to accomodate use cases like kexec which
rediscover
- Other smaller fixes and cleanups
* tag 'x86_cpu_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/amd: Cache debug register values in percpu variables
KVM: x86: Propagate the AMD Automatic IBRS feature to the guest
x86/cpu: Support AMD Automatic IBRS
x86/cpu, kvm: Add the SMM_CTL MSR not present feature
x86/cpu, kvm: Add the Null Selector Clears Base feature
x86/cpu, kvm: Move X86_FEATURE_LFENCE_RDTSC to its native leaf
x86/cpu, kvm: Add the NO_NESTED_DATA_BP feature
KVM: x86: Move open-coded CPUID leaf 0x80000021 EAX bit propagation code
x86/cpu, kvm: Add support for CPUID_80000021_EAX
x86/gsseg: Add the new <asm/gsseg.h> header to <asm/asm-prototypes.h>
x86/gsseg: Use the LKGS instruction if available for load_gs_index()
x86/gsseg: Move load_gs_index() to its own new header file
x86/gsseg: Make asm_load_gs_index() take an u16
x86/opcode: Add the LKGS instruction to x86-opcode-map
x86/cpufeature: Add the CPU feature bit for LKGS
x86/bugs: Reset speculation control settings on init
x86/cpu: Remove redundant extern x86_read_arch_cap_msr()
- Add EPP support to the AMD P-state cpufreq driver (Perry Yuan, Wyes
Karny, Arnd Bergmann, Bagas Sanjaya).
- Drop the custom cpufreq driver for loongson1 that is not necessary
any more and the corresponding cpufreq platform device (Keguang
Zhang).
- Remove "select SRCU" from system sleep, cpufreq and OPP Kconfig
entries (Paul E. McKenney).
- Enable thermal cooling for Tegra194 (Yi-Wei Wang).
- Register module device table and add missing compatibles for
cpufreq-qcom-hw (Nícolas F. R. A. Prado, Abel Vesa and Luca Weiss).
- Various dt binding updates for qcom-cpufreq-nvmem and opp-v2-kryo-cpu
(Christian Marangi).
- Make kobj_type structure in the cpufreq core constant (Thomas
Weißschuh).
- Make cpufreq_unregister_driver() return void (Uwe Kleine-König).
- Make the TEO cpuidle governor check CPU utilization in order to refine
idle state selection (Kajetan Puchalski).
- Make Kconfig select the haltpoll cpuidle governor when the haltpoll
cpuidle driver is selected and replace a default_idle() call in that
driver with arch_cpu_idle() to allow MWAIT to be used (Li RongQing).
- Add Emerald Rapids Xeon support to the intel_idle driver (Artem
Bityutskiy).
- Add ARCH_SUSPEND_POSSIBLE dependencies for ARMv4 cpuidle drivers to
avoid randconfig build failures (Arnd Bergmann).
- Make kobj_type structures used in the cpuidle sysfs interface
constant (Thomas Weißschuh).
- Make the cpuidle driver registration code update microsecond values
of idle state parameters in accordance with their nanosecond values
if they are provided (Rafael Wysocki).
- Make the PSCI cpuidle driver prevent topology CPUs from being
suspended on PREEMPT_RT (Krzysztof Kozlowski).
- Document that pm_runtime_force_suspend() cannot be used with
DPM_FLAG_SMART_SUSPEND (Richard Fitzgerald).
- Add EXPORT macros for exporting PM functions from drivers (Richard
Fitzgerald).
- Remove /** from non-kernel-doc comments in hibernation code (Randy
Dunlap).
- Fix possible name leak in powercap_register_zone() (Yang Yingliang).
- Add Meteor Lake and Emerald Rapids support to the intel_rapl power
capping driver (Zhang Rui).
- Modify the idle_inject power capping facility to support 100% idle
injection (Srinivas Pandruvada).
- Fix large time windows handling in the intel_rapl power capping
driver (Zhang Rui).
- Fix memory leaks with using debugfs_lookup() in the generic PM
domains and Energy Model code (Greg Kroah-Hartman).
- Add missing 'cache-unified' property in the example for kryo OPP
bindings (Rob Herring).
- Fix error checking in opp_migrate_dentry() (Qi Zheng).
- Let qcom,opp-fuse-level be a 2-long array for qcom SoCs (Konrad
Dybcio).
- Modify some power management utilities to use the canonical ftrace
path (Ross Zwisler).
- Correct spelling problems for Documentation/power/ as reported by
codespell (Randy Dunlap).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmPuJfMSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx/5kQAJNOVImLEPLerLP8xufw30//LuDU5Gi0
STsyDOMql/I2MpkeqeCcgrSbpy6NlEglOvg16gfpQ3qqTCLF9ypENxs9E5BGGvW0
aEdCzvaoqmvi9PCr/jmj0EPP70/U+rIX5m/k0QdjLh9x0aLoAEe3uRJTfR9QVqXf
I7JX0N9kjKi7YxpA5DlkHrS7J7GPPiWlesJ3p4wXuHMo3jf+6fgkoPFt8yRrGWeh
AHzGT2BLrsy7aAUjGZB65Qx9q3fnSXMmXOjmn0Xh2njQah+zRZDwrNzwoY2HTLL/
KQ6/Ww16USYRZtCS1fmGwAj9I+ddq6AOvhPCMn0vLXXmKVAMUrVVWnQS/0+vpm9y
suUMK9Tndkgxd1vjby2246ThJn27uDd/ERFan4ouQo2j22uICY+SDo3osj2hMXka
wq4zthXkY8KgjZ+MuXnZxPhcOvo8KRvfxAU0fy5efQnSkbtwY9UlMvjPBMBHm/RA
21/6kjQNtq5vMmI37oC8DH+oPrRQ7sUKuY7HNqwO9P3QNKWVmNe7cF5UtXXxME7Q
ULvP1d+u+TNNdHFLryPwCSzBO34wQEccdRZBjalZ8tBe6JiDWUFHC3giSURZSuzZ
GDvzVaNX6PkgToyv4inBTB8lTp6pAuUjaWNvNJzVvUXiEKHB0ihzg5vpJW5NdwlH
15Tn8cjH7pp0
=lZLx
-----END PGP SIGNATURE-----
Merge tag 'pm-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add EPP support to the AMD P-state cpufreq driver, add support
for new platforms to the Intel RAPL power capping driver, intel_idle
and the Qualcomm cpufreq driver, enable thermal cooling for Tegra194,
drop the custom cpufreq driver for loongson1 that is not necessary any
more (and the corresponding cpufreq platform device), fix assorted
issues and clean up code.
Specifics:
- Add EPP support to the AMD P-state cpufreq driver (Perry Yuan, Wyes
Karny, Arnd Bergmann, Bagas Sanjaya)
- Drop the custom cpufreq driver for loongson1 that is not necessary
any more and the corresponding cpufreq platform device (Keguang
Zhang)
- Remove "select SRCU" from system sleep, cpufreq and OPP Kconfig
entries (Paul E. McKenney)
- Enable thermal cooling for Tegra194 (Yi-Wei Wang)
- Register module device table and add missing compatibles for
cpufreq-qcom-hw (Nícolas F. R. A. Prado, Abel Vesa and Luca Weiss)
- Various dt binding updates for qcom-cpufreq-nvmem and
opp-v2-kryo-cpu (Christian Marangi)
- Make kobj_type structure in the cpufreq core constant (Thomas
Weißschuh)
- Make cpufreq_unregister_driver() return void (Uwe Kleine-König)
- Make the TEO cpuidle governor check CPU utilization in order to
refine idle state selection (Kajetan Puchalski)
- Make Kconfig select the haltpoll cpuidle governor when the haltpoll
cpuidle driver is selected and replace a default_idle() call in
that driver with arch_cpu_idle() to allow MWAIT to be used (Li
RongQing)
- Add Emerald Rapids Xeon support to the intel_idle driver (Artem
Bityutskiy)
- Add ARCH_SUSPEND_POSSIBLE dependencies for ARMv4 cpuidle drivers to
avoid randconfig build failures (Arnd Bergmann)
- Make kobj_type structures used in the cpuidle sysfs interface
constant (Thomas Weißschuh)
- Make the cpuidle driver registration code update microsecond values
of idle state parameters in accordance with their nanosecond values
if they are provided (Rafael Wysocki)
- Make the PSCI cpuidle driver prevent topology CPUs from being
suspended on PREEMPT_RT (Krzysztof Kozlowski)
- Document that pm_runtime_force_suspend() cannot be used with
DPM_FLAG_SMART_SUSPEND (Richard Fitzgerald)
- Add EXPORT macros for exporting PM functions from drivers (Richard
Fitzgerald)
- Remove /** from non-kernel-doc comments in hibernation code (Randy
Dunlap)
- Fix possible name leak in powercap_register_zone() (Yang Yingliang)
- Add Meteor Lake and Emerald Rapids support to the intel_rapl power
capping driver (Zhang Rui)
- Modify the idle_inject power capping facility to support 100% idle
injection (Srinivas Pandruvada)
- Fix large time windows handling in the intel_rapl power capping
driver (Zhang Rui)
- Fix memory leaks with using debugfs_lookup() in the generic PM
domains and Energy Model code (Greg Kroah-Hartman)
- Add missing 'cache-unified' property in the example for kryo OPP
bindings (Rob Herring)
- Fix error checking in opp_migrate_dentry() (Qi Zheng)
- Let qcom,opp-fuse-level be a 2-long array for qcom SoCs (Konrad
Dybcio)
- Modify some power management utilities to use the canonical ftrace
path (Ross Zwisler)
- Correct spelling problems for Documentation/power/ as reported by
codespell (Randy Dunlap)"
* tag 'pm-6.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (53 commits)
Documentation: amd-pstate: disambiguate user space sections
cpufreq: amd-pstate: Fix invalid write to MSR_AMD_CPPC_REQ
dt-bindings: opp: opp-v2-kryo-cpu: enlarge opp-supported-hw maximum
dt-bindings: cpufreq: qcom-cpufreq-nvmem: make cpr bindings optional
dt-bindings: cpufreq: qcom-cpufreq-nvmem: specify supported opp tables
PM: Add EXPORT macros for exporting PM functions
cpuidle: psci: Do not suspend topology CPUs on PREEMPT_RT
MIPS: loongson32: Drop obsolete cpufreq platform device
powercap: intel_rapl: Fix handling for large time window
cpuidle: driver: Update microsecond values of state parameters as needed
cpuidle: sysfs: make kobj_type structures constant
cpuidle: add ARCH_SUSPEND_POSSIBLE dependencies
PM: EM: fix memory leak with using debugfs_lookup()
PM: domains: fix memory leak with using debugfs_lookup()
cpufreq: Make kobj_type structure constant
cpufreq: davinci: Fix clk use after free
cpufreq: amd-pstate: avoid uninitialized variable use
cpufreq: Make cpufreq_unregister_driver() return void
OPP: fix error checking in opp_migrate_dentry()
dt-bindings: cpufreq: cpufreq-qcom-hw: Add SM8550 compatible
...
Core:
- Move the interrupt affinity spreading mechanism into lib/group_cpus
so it can be used for similar spreading requirements, e.g. in the
block multi-queue code.
This also contains a first usecase in the block multi-queue code which
Jens asked to take along with the librarization.
- Improve irqdomain locking to close a number race conditions which
can be observed with massive parallel device driver probing.
- Enforce and document the semantics of disable_irq() which cannot be
invoked safely from non-sleepable context.
- Move the IPI multiplexing code from the Apple AIC driver into the
core. so it can be reused by RISCV.
Drivers:
- Plug OF node refcounting leaks in various drivers.
- Correctly mark level triggered interrupts in the Broadcom L2 drivers.
- The usual small fixes and improvements.
- No new drivers for the record!
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmPzUSkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoY3DEAC9E4yLO7VxxTrs/KrAVCgL3SnHVXQU
nE42uFbQwpCILuNmnqP3uvTHLCsXZkbuBaZEbxLBxC2iyU6+31N1Is+e6cClGMjK
kX6U9g9EqiRCdX3fgJiEU16fCgE8D1AEg+7XKLjeasQhCfKQGGtCtE9/Gmg/Ji92
gcEY/bjvm1hcoNo9dh/vR4k0k63fb13716RLScozUkS/XYVlu+LrrG349gD2WEA9
lh1twDkXvZTWkiYKWAkLorxcNyKhcnJxJw8zEIGVF5b6pCCudK8gXjBbMD5abC7W
xano6B8F455eSKNsi2TWyW47ZHUkC60sqCNDgI2MBTsI7D72UpAJoDfe0VjbMoaH
RQJnrGsUQbviBUen+LEet7nWZBQJRKZHOVtYEjA8ndB3PJUXKKcLeODdw11odyjR
bgZk+0wnowMArIaoLfeItF2oSpfSzLVxh2i8Aeus5tBesvhVCOi4LABRBKGCWvMj
cpSlMhZ4znMnr5j5lOGpcAjKFlWVh1HmF70Y2deGZi5xC8EXFL/VsB7rH5LEEEuF
7I8CO8M1mXeOTJoCchCbuAYgZyuk1DIhKUyOiYQZblaPNGcVGvCIN31SFBRT9h/8
e0VwSvVL756GhotUp/LjgTdG7MoKspWqRG00+q84SsDalsKGXMW7zmHc+1NgGN/C
Yxio1Jlly9Rwyw==
=+pu3
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:
Core:
- Move the interrupt affinity spreading mechanism into lib/group_cpus
so it can be used for similar spreading requirements, e.g. in the
block multi-queue code
This also contains a first usecase in the block multi-queue code
which Jens asked to take along with the librarization
- Improve irqdomain locking to close a number race conditions which
can be observed with massive parallel device driver probing
- Enforce and document the semantics of disable_irq() which cannot be
invoked safely from non-sleepable context
- Move the IPI multiplexing code from the Apple AIC driver into the
core, so it can be reused by RISCV
Drivers:
- Plug OF node refcounting leaks in various drivers
- Correctly mark level triggered interrupts in the Broadcom L2
drivers
- The usual small fixes and improvements
- No new drivers for the record!"
* tag 'irq-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
irqchip/irq-bcm7120-l2: Set IRQ_LEVEL for level triggered interrupts
irqchip/irq-brcmstb-l2: Set IRQ_LEVEL for level triggered interrupts
irqdomain: Switch to per-domain locking
irqchip/mvebu-odmi: Use irq_domain_create_hierarchy()
irqchip/loongson-pch-msi: Use irq_domain_create_hierarchy()
irqchip/gic-v3-mbi: Use irq_domain_create_hierarchy()
irqchip/gic-v3-its: Use irq_domain_create_hierarchy()
irqchip/gic-v2m: Use irq_domain_create_hierarchy()
irqchip/alpine-msi: Use irq_domain_add_hierarchy()
x86/uv: Use irq_domain_create_hierarchy()
x86/ioapic: Use irq_domain_create_hierarchy()
irqdomain: Clean up irq_domain_push/pop_irq()
irqdomain: Drop leftover brackets
irqdomain: Drop dead domain-name assignment
irqdomain: Drop revmap mutex
irqdomain: Fix domain registration race
irqdomain: Fix mapping-creation race
irqdomain: Refactor __irq_domain_alloc_irqs()
irqdomain: Look for existing mapping only once
irqdomain: Drop bogus fwspec-mapping error handling
...
Core:
- Yet another round of improvements to make the clocksource watchdog
more robust:
- Relax the clocksource-watchdog skew criteria to match the NTP
criteria.
- Temporarily skip the watchdog when high memory latencies are
detected which can lead to false-positives.
- Provide an option to enable TSC skew detection even on systems
where TSC is marked as reliable.
Sigh!
- Initialize the restart block in the nanosleep syscalls to be directed
to the no restart function instead of doing a partial setup on entry.
This prevents an erroneous restart_syscall() invocation from
corrupting user space data. While such a situation is clearly a user
space bug, preventing this is a correctness issue and caters to the
least suprise principle.
- Ignore the hrtimer slack for realtime tasks in schedule_hrtimeout()
to align it with the nanosleep semantics.
Drivers:
- The obligatory new driver bindings for Mediatek, Rockchip and RISC-V
variants.
- Add support for the C3STOP misfeature to the RISC-V timer to handle
the case where the timer stops in deeper idle state.
- Set up a static key in the RISC-V timer correctly before first use.
- The usual small improvements and fixes all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmPzV+cTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYlDEACMrjN2F6qeiOW94t4nQ3qP1M9AMSgO
OihC04XuM14/3tEviu/cUOd60wYcUQ/kfI5C+IL35ezeP2w9lnuKqeFpG7aDOa33
5F3isDPamJdXZEZs44CW15brR6dqDlEi5acKee/TtFV9mN6xNhzxM64IaFqecPmW
P+BTwunB8xwquY8RzsHXor/GOGb6mqWQIPoHEPnywTDe/xQYWt0Exzi7ch6HQr5Z
ZzHG6X4h6UTNimjay6L4qsRQWILmPIg4Z5IlycWMQ8qDFM0lbnIJqkG4JwceolI6
aRQyLe3NQFcPYgq3ue+SNm4RckYn4NbAa1zFm0d5VDgKp4xW1sxvtkxOJuxjaOw2
/rLkHkmyuVvCeTMAySfxrwnszAoM505CHC6CEYc1xELbeCkROFUaymtVyNFnnTru
V/Jt/T2Gyx6tOrafX7u+djUjv9figddRpNbskVZvEi3Ztq4MQ069nK3oSUqtP5vO
INApNg4lq6s8aGqVE+Kp9+CKwGqZqI4MdxQMNMAmCRLPon6apActVawbj18qO/wS
qblQ0cbF8a16itlQ3V68qmhcPh6EZOuq8II4etNq6U0ulV9712WfMbat3z53LG94
QNkAmZ3/wui93I+Q2NPxhf5ybJFQZhR0SOtVO6xIdTgOntkODwzzGu9UapfD8mLb
k5BpWnH8CoUgiw==
=I67j
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for timekeeping, timers and clockevent/source drivers:
Core:
- Yet another round of improvements to make the clocksource watchdog
more robust:
- Relax the clocksource-watchdog skew criteria to match the NTP
criteria.
- Temporarily skip the watchdog when high memory latencies are
detected which can lead to false-positives.
- Provide an option to enable TSC skew detection even on systems
where TSC is marked as reliable.
Sigh!
- Initialize the restart block in the nanosleep syscalls to be
directed to the no restart function instead of doing a partial
setup on entry.
This prevents an erroneous restart_syscall() invocation from
corrupting user space data. While such a situation is clearly a
user space bug, preventing this is a correctness issue and caters
to the least suprise principle.
- Ignore the hrtimer slack for realtime tasks in schedule_hrtimeout()
to align it with the nanosleep semantics.
Drivers:
- The obligatory new driver bindings for Mediatek, Rockchip and
RISC-V variants.
- Add support for the C3STOP misfeature to the RISC-V timer to handle
the case where the timer stops in deeper idle state.
- Set up a static key in the RISC-V timer correctly before first use.
- The usual small improvements and fixes all over the place"
* tag 'timers-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
clocksource/drivers/timer-sun4i: Add CLOCK_EVT_FEAT_DYNIRQ
clocksource/drivers/em_sti: Mark driver as non-removable
clocksource/drivers/sh_tmu: Mark driver as non-removable
clocksource/drivers/riscv: Patch riscv_clock_next_event() jump before first use
clocksource/drivers/timer-microchip-pit64b: Add delay timer
clocksource/drivers/timer-microchip-pit64b: Select driver only on ARM
dt-bindings: timer: sifive,clint: add comaptibles for T-Head's C9xx
dt-bindings: timer: mediatek,mtk-timer: add MT8365
clocksource/drivers/riscv: Get rid of clocksource_arch_init() callback
clocksource/drivers/sh_cmt: Mark driver as non-removable
clocksource/drivers/timer-microchip-pit64b: Drop obsolete dependency on COMPILE_TEST
clocksource/drivers/riscv: Increase the clock source rating
clocksource/drivers/timer-riscv: Set CLOCK_EVT_FEAT_C3STOP based on DT
dt-bindings: timer: Add bindings for the RISC-V timer device
RISC-V: time: initialize hrtimer based broadcast clock event device
dt-bindings: timer: rk-timer: Add rktimer for rv1126
time/debug: Fix memory leak with using debugfs_lookup()
clocksource: Enable TSC watchdog checking of HPET and PMTMR only when requested
posix-timers: Use atomic64_try_cmpxchg() in __update_gt_cputime()
clocksource: Verify HPET and PMTMR when TSC unverified
...
- Correct the common copy and pasted mishandling of kstrtobool() in the
strict_sas_size() setup function.
- Make recalibrate_cpu_khz() an GPL only export.
- Check TSC feature before doing anything else which avoids pointless
code execution if TSC is not available.
- Remove or fixup stale and misleading comments.
- Remove unused or pointelessly duplicated variables.
- Spelling and typo fixes.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmPzWVkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodbEEAC7XjF7BkZ9nhmAMWgwThKbHhNb3QLk
oO0pcbbff2o7bhcP55Mb6R52G1a/kEvpFg/iF6+4/GcsbxHtLhILtG0PGOgmg28p
UcdXt8EvkMv+bICr3gYtnwqB50stc/1s8JhHVItaDIXbRjNOrkBHQzgcPx0qfC8w
INPhlqShSehGtzmaoP4AWMfVtBlqKXlCADpQGd8hcTojlNRAJwzBF9mZbWGdgopW
qa3yoa+s6kL3M2lXvwREuz/1JnmtKx7cav9ldWlSno2dgDbw1ioDZg9tJhARJo//
toF9Y9h12ASDBaqVoyVJgKmDQddsdxkBTrMCKQX8yRH21pEX9eeHM/re9lNtUbhl
4/0juvAKFyviatWAHHCPYGyuPGrSsrsj5sea2fNURnkc6TZ4pHHArDytpAOhYqh2
8CPpT2Qn/C6CqUsc9Z2fbDZBAOTKR/IF93NzE+HcjRjDyjm30ImeKEbwMHfEa7lX
V3/wvXH9+WIzvVC3EqbvVqkArG1YQTqQHBZIl9+Za2iEeLz8DGEWCH0b7w8/m2Cg
0mzUOzjJviy6ShO0B8fZK8LuCoDbPAmL4etfjp1t3q+EsuG5pYOrYtrnZ76XWYD7
TWxlBHhrYuqUBERpN7SCJgixqXgWVUe2/hZwstQqbmvH/jOe9TGgxrIu2MmvB1kK
5+ul2d2uwbd4cA==
=zlRy
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull miscellaneous x86 cleanups from Thomas Gleixner:
- Correct the common copy and pasted mishandling of kstrtobool() in the
strict_sas_size() setup function
- Make recalibrate_cpu_khz() an GPL only export
- Check TSC feature before doing anything else which avoids pointless
code execution if TSC is not available
- Remove or fixup stale and misleading comments
- Remove unused or pointelessly duplicated variables
- Spelling and typo fixes
* tag 'x86-cleanups-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hotplug: Remove incorrect comment about mwait_play_dead()
x86/tsc: Do feature check as the very first thing
x86/tsc: Make recalibrate_cpu_khz() export GPL only
x86/cacheinfo: Remove unused trace variable
x86/Kconfig: Fix spellos & punctuation
x86/signal: Fix the value returned by strict_sas_size()
x86/cpu: Remove misleading comment
x86/setup: Move duplicate boot_cpu_data definition out of the ifdeffery
x86/boot/e820: Fix typo in e820.c comment
- Some smaller fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPzusMACgkQEsHwGGHe
VUojfQ/7BOqXI0XsHTIwilF12w2bLQl1PeI4bSk6VY+iAN2YmQkq2qvNUgwt62e5
5Z95cDuCZ8sx6L3mDIoOgWBN9zdLbxNhezLFDykb+6as67PMaww9l9R6n3JoC2qm
ELso5JZnWvIZ7Cu7RRm9IzbSj93JAlN3Aypexe61NywMyge9CAvCiOEhvW+lkYSD
lhZqgbm5WAB14F1CeqFyC8kVvUez1GH9Dunbe7ozk7LqRfTRlf5YPH88iE4UKzdg
JXmbcHB2K4aQzfIW66OFPnl/4Cl+XxS/i5CR2NtWlB4/ANZBPoUr7QAS239OpC6u
3uwv/qPmMe7p/lYMaGXSUpzD/MOCHP1HPN8/CWgdyK+Mdmctpqr0FYh1qXXm1Nuu
v0SE3btHVIB5UfvImoOlV/RfCx3+TqxzqUU2erc0iD5VxlRfrqJEwJdJHOgRGxFU
vflRxMQOshhyI7+Q7et0S0QlgK4HvGEHmBUwBsUbfyptIxbqpOLK8INC6N8qwGKZ
gTuBxLNZ5yRE/NeOVe0cL2ooelfOlg7GKUI+gZbfzzQw8M5WZW9qEDS9y2wIuGey
wBFJNzjKXSkrTxc6Hd136N7DX7PlMjiJhXP42s+7rXJguPvgk1oVyEuaX540+xX4
HphXRC2QW0o0hCeFgP11Ai4oq/vRW1RFvdDimJjveJAv19bQNv0=
=Wg/8
-----END PGP SIGNATURE-----
Merge tag 'x86_vdso_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso updates from Borislav Petkov:
- Add getcpu support for the 32-bit version of the vDSO
- Some smaller fixes
* tag 'x86_vdso_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Fix -Wmissing-prototypes warnings
x86/vdso: Fake 32bit VDSO build on 64bit compile for vgetcpu
selftests: Emit a warning if getcpu() is missing on 32bit
x86/vdso: Provide getcpu for x86-32.
x86/cpu: Provide the full setup for getcpu() on x86-32
x86/vdso: Move VDSO image init to vdso2c generated code
the way
- Improve revision reporting
- Properly check CPUID capabilities after late microcode upgrade to
avoid false positives
- A garden variety of other small fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPzs/sACgkQEsHwGGHe
VUonGw//RgIVCZIkuytiZesFsAXD3sn4Mmji7WoRZvu3XooA0idOo+7ujBeNcJGw
aFGjf0K5b7eAfiREqTXPlFSymPid7aN+7cPJD7iURJ5UEoDXca1vVh9Jeq7lhvRL
M5CErroStya17vFqU5pz50EcUwGcao/N3wY+0rERk8Rkqu864PgI+KahS2V2D2PU
XolD4CH/+JZMAJPaTG5dSkSf3gJevW/owZ+F2oqKKYNlFsQ6aYd/JZYwIQ2X7W9T
HdVYzeASZs0tfBEPOsZUSobmIlqUU/MziefDyUuTYbO1DPJ525787RLpRyubhG9k
b/7DWUNymR56B8AUq/RV6YE/Dw2YpcrP3Eu0pSbD5xUfEy8eFCcIr+cUL5M9+I4W
iCZtYYGypNbDQf5NRkubtQu8xIwEbjOZNv444kMMBimZGzt/WDEGMHqgRbKpJ2MQ
F2HoBnNVC5O2BddS0ErTpQDWK8B/c0+S4L1ZTkbh/y9CNhzITZ10sLAEGQawvBEk
PBYeCQ98m72ijLcecz0vvVO81UHGicqyY86OqeqRx0FbGO9cZJg+8BqyTLxsRTSW
OgxtB/moURdanWAAOdxZ91yUw40CYWn7qXhYxilZDtGgkFT6sUdA126uMxLJ8u2v
WiOHmj/ymszHhkJiahcSMaD8gRFnLQ59jNatHNa/5Jyw0mi330g=
=z8rd
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loader updates from Borislav Petkov:
- Fix mixed steppings support on AMD which got broken somewhere along
the way
- Improve revision reporting
- Properly check CPUID capabilities after late microcode upgrade to
avoid false positives
- A garden variety of other small fixes
* tag 'x86_microcode_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/core: Return an error only when necessary
x86/microcode/AMD: Fix mixed steppings support
x86/microcode/AMD: Add a @cpu parameter to the reloading functions
x86/microcode/amd: Remove load_microcode_amd()'s bsp parameter
x86/microcode: Allow only "1" as a late reload trigger value
x86/microcode/intel: Print old and new revision during early boot
x86/microcode/intel: Pass the microcode revision to print_ucode_info() directly
x86/microcode: Adjust late loading result reporting message
x86/microcode: Check CPU capabilities after late microcode update correctly
x86/microcode: Add a parameter to microcode_check() to store CPU capabilities
x86/microcode: Use the DEVICE_ATTR_RO() macro
x86/microcode/AMD: Handle multiple glued containers properly
x86/microcode/AMD: Rename a couple of functions
allocation. Its goal is to control resource allocation in external slow
memory which is connected to the machine like for example through CXL devices,
accelerators etc
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPzmf4ACgkQEsHwGGHe
VUppKg//Tq+lHaMYO8aTvk4jgqbR9RVXJwPbtEOp2C0kSLs5QxBms/o21IXnxJ07
tdbIGOrfJGlbzSWP8ywysRRQwpKlwltWUVAjMOFqEfzEURLL042qtHZ8nxGKSGrc
IZFJLNTMyx1Zyjc7e9A/hANCOoQFoPHT8zHf1CNNo1LtzgHzNZG6kggLHh5tRKSz
Xi7wFbYBtmttsyIA/iAQjYAU0O9MnmdnktUb7XdPSFtTIZ3Nyw90We4gwYueEPzD
S/rQHKr8V7ROZMHXQ/BWpVWdcxGoHD8acUSVq8j20KW3W9/H8KL9TRVakvnf0aRW
g0efxKXdTjTRO49GgD7FUL8x1JdAOXeZwQYDzKPqW/GRESRdpOvsaMwcLDCEpIXK
PmEOVReklokJF0btFqaVYkY6wGE2KLKmp97g/RffuHdIeIomwI9lTpy9kyQsKakc
yJ+VsE85BlBEVkHNt49qFClO1L98G3IgZTTt6//EGv0EJl8pELfsddsbjG5uXun+
xFhr2i7gllQcV4B4HSFFdYRBLvZYnTfKlNR7Hs9pRJT7V28Jv2GURiCHBm4sRv9O
k3FX7sxytH2syBBwU1NNrMRMo+KgjVZurJwiHpTRbb39K6uCgLk/wbXfWh2SovW1
BRItz2T6LFu4bo6WIhakx31pNmH94P8vC0acO8LHECVji7qvXFM=
=8hmj
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Add support for a new AMD feature called slow memory bandwidth
allocation. Its goal is to control resource allocation in external
slow memory which is connected to the machine like for example
through CXL devices, accelerators etc
* tag 'x86_cache_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Fix a silly -Wunused-but-set-variable warning
Documentation/x86: Update resctrl.rst for new features
x86/resctrl: Add interface to write mbm_local_bytes_config
x86/resctrl: Add interface to write mbm_total_bytes_config
x86/resctrl: Add interface to read mbm_local_bytes_config
x86/resctrl: Add interface to read mbm_total_bytes_config
x86/resctrl: Support monitor configuration
x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()
x86/resctrl: Detect and configure Slow Memory Bandwidth Allocation
x86/resctrl: Include new features in command line options
x86/cpufeatures: Add Bandwidth Monitoring Event Configuration feature flag
x86/resctrl: Add a new resource type RDT_RESOURCE_SMBA
x86/cpufeatures: Add Slow Memory Bandwidth Allocation feature flag
x86/resctrl: Replace smp_call_function_many() with on_each_cpu_mask()
tall calls properly which can be static calls too
- Add proper struct alt_instr.flags which controls different aspects of
insn patching behavior
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPzkAcACgkQEsHwGGHe
VUpA3A//bALDnLosUQe/m8CTcj1AU12Y59fGInoLl5xArM3liOhRYWj9yu8+2r5N
j+89yjWoiaogu/9B18pV0+VnBrFUbALZmHxec0+4VAWyMqYuTbqN28Nj/2cZiHdP
I/9mwGu40I/Ira021D132EcdoZI7O/6bFlh+kEoAqxc7rsqhD5KKRMlrTTdEPVjH
aRbWIuzqDWNhbi7IwfgEBIPLiQZQKmIdH5hsFMD6yOMIdMRL6CwKmXVg2M1Zp8ta
5v2Aqgvu2nZYCIteP4GQck2AlUBlGR4ClGQQRII+U1o8c9dM0hfcIDgsbSYKvgrY
ANm9MQJaF7MRomk9y4E0EHPZAJEMLKUgiQXMxWpER3O1GOKgZPlyzNSe0gRCiL6O
NZWZ2cGtdhQMrko4EapE3GNryM1HoCY/QCuD1fCYwoc/pRBDhCxsSqjWUd8G/6wn
s3S/mD0v3nmTrxHg8sWvqhKshsd7B9V0LSkTpHktz3soFIJGXTxbrtty0CIS61pM
4iUMYB9SjunoEmdwC7+gCN3sCiRpRqfmIybqXdsW3d37QI+FM5aSBPw51xULubfY
Wsxo8SkH+IMYxXmfbQuUppsGZ+1QHzU08+MrlvNxGHUjS1aMnsrFF/fbfbbCnWvX
7hcyBPT0jxc9RPMNeKDm4ItapMMGxGdv6XiRmM8LiUtVG2fMaW4=
=XUqC
-----END PGP SIGNATURE-----
Merge tag 'x86_alternatives_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm alternatives updates from Borislav Petkov:
- Teach the static_call patching infrastructure to handle conditional
tall calls properly which can be static calls too
- Add proper struct alt_instr.flags which controls different aspects of
insn patching behavior
* tag 'x86_alternatives_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/static_call: Add support for Jcc tail-calls
x86/alternatives: Teach text_poke_bp() to patch Jcc.d32 instructions
x86/alternatives: Introduce int3_emulate_jcc()
x86/alternatives: Add alt_instr.flags
on newer AMD CPUs
- Mask out bits which don't belong to the address of the error being
reported
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPzUAEACgkQEsHwGGHe
VUr22RAAh7fi3s8sDP4B2WBe1LPKZystZamxlLObBG2eLT7g0YmSKV12+bHCGf/B
nGqz9iy+e/T1Khxv0gdEyuENwzuitXgiEOYgB4u70HimWy5422ZCzn1EiOMFtyST
g0ehOR+tU84YwMVR40ui3spI1DHgeVPqVBLHBARZ1OAaA58N8eVREC6MqJAeAzIU
+VYiBbn69quECTuU1P7yaT8NDnbm5G6pA1dhKLc7vLl9QWzoW1yWLLcp+oGFN6B8
rcGDKEDK1OYtdHScRCfhFrznkeYP6SVnSt4wlAgX+HVGPoMpvq8nJygxCWdE0yjd
aQGhdcVJkQlSqm1iJUv0MK9nkolqXVVSVTurpHunAq7ctul6Qm/X+fsfwBgSIXXn
Gdj3in374MLWCz/xGqeBS8IiiPxGxJA9s350jyk02LK6Np6sXeuc4PpR66+6FAKQ
Ypen+uWJ6oBof04bW7DBK0R14atA8EpOOLUrrGIsSkNSEIjLaCipMZOpRCbOw76N
bXcdnKKsaEDjKtHClvx/vZXklfzWk0OgF8qtY0nGF+khvDAi3pQaIIlCehf0Qemh
6j00TqIYBCXa0kuKktdPzVJSM7A7TZ5ftboa1IPhE+GYrFFee/VJ3yfgqz102FWI
RJsY8JXt+EP3VMSOQYqQ5KzcLBJ2uDiRYtgUo4P1CITNpRfZEMc=
=e9v9
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Add support for reporting more bits of the physical address on error,
on newer AMD CPUs
- Mask out bits which don't belong to the address of the error being
reported
* tag 'ras_core_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Mask out non-address bits from machine check bank
x86/mce: Add support for Extended Physical Address MCA changes
x86/mce: Define a function to extract ErrorAddr from MCA_ADDR
- Simplify add_rtc_cmos()
- Use strscpy() in the mcelog code
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzdU8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gS6w/8DFsYUIK4CGEtG0MYH+DUVsz/zyo4pejM
yukMpKxXMJKi7pZ6k+0he9LOawa2WeK/hzzJh0zP3EzJtF1RR831XSsGRNPu3OTQ
Q6mAuFdlLC1EAgJs6muqhIF5/bxfti6pFpZBN8Dwi/9VPpUQwayOgaJXysiRA2aN
r/czEgp5fGgExC4QLE6HIPIzhsyjUlngH2F4xNeO13cS6S0T3Ns1xcSJs9jJfwMW
2Vx0FCSo/cj8Hdr10NGEJNrqCzN9yvUuZuQ4utp6yf03zWyP1L4c2MB0E+aw6d1c
ygpYWm400vlEFHfT8x9UVnybR5wABG4GP8JNtSBHASk7rNPKl5cKfOIPHzdsCnFO
bHh1Xc4gduFVB8nUilUlcvoseum8GaYOqhi6ov2hwq+n47uam9H5QlCWn7OyLBbW
88Ajg+wqNxG/R3mhyPslXDMr/dccQ9mcZSxbPDX14LpG8bWAjvM3yP43N8w6myVs
1Br8Lsbf8lm5jiJ5hv2GBGQa9eDA0qLBFnkvBTe9zx7AV/K/KnTUXlK2DdeZVfO8
eqgyTrXANyyJqTC/s2GAOLFwySRZFx9EHw6Kg8cmjoG9o8VCpXljQ8qj71ZOtFlo
xBDlmg4Y6czRZC2kQEFC1kA30nLw+2UAnOEwr11JgGod9K+DqFtLKnVVEsrtxmGH
E7ccV3QRc/Q=
=Nuzs
-----END PGP SIGNATURE-----
Merge tag 'x86-platform-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform update from Ingo Molnar:
- Simplify add_rtc_cmos()
- Use strscpy() in the mcelog code
* tag 'x86-platform-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/dev-mcelog: use strscpy() to instead of strncpy()
x86/rtc: Simplify PNP ids check
- Replace zero-length array in struct xregs_state with flexible-array member,
to help the enabling of stricter compiler checks.
- Don't set TIF_NEED_FPU_LOAD for PF_IO_WORKER threads.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzc+cRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gJ2RAAni5xUIObYPeEH4gEL0hXaLDXXYM+iyvK
Fp4kI/9HLdgEOYjIFZG1H6nOlSDOSwFHg2mhpxXRujByAuohGT8cU2ax6Zh3ya1g
qJy5UepKa+Twvoifts4mmClJrK+tVCWfKthoqCc5C6D2lGroCplvp5hE2/KIs/9d
QcnpKaFAapYNat1rzLtBLQUovxVdd8yow5+pTEJhUnhF6EYMIn4M5cPncINPdx98
AnQi1v6pWjhgK5gP7G1maz7J0khr0nNzd17Dtx+iknLX4cVxa/Hx+wQtsom80Xwt
aeIwPYiAYTFs0ZjVrEtwwtV0ub29viHeS2x1v8nkLSA0LJAVrIX/1yuIMbUuFqS8
PW/mSKkA0phhtVbe6SYtUyXv8zLysWenFre6nD7PNpWFmjOyNUSJcw/clUYwBYVX
LbtwaOwI+shbd/BEGwAPC4235XOl1wq/g9gvV4bt3vTi0pE8tke0oHVkIG6bCAK1
+CB5pnAHfZRNC8H1CkyurbmHPUOMaS/PK4LpvPK1/G03Fe4s42SUNUd7DWCevxq3
agvfOmiH/XvoLdJejYmqeLAz700gBntYkNvBg2P1f30EVj/wGvVDEen8MBGMaIDe
5bef3pIUfP5Ye3DdzFRcjI5OjbijNHxQ7MnIqqW3Uuz2ujdq8ME88l8eUxJE7JZN
KZv7r+asPEM=
=az43
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Ingo Molnar:
- Replace zero-length array in struct xregs_state with flexible-array
member, to help the enabling of stricter compiler checks.
- Don't set TIF_NEED_FPU_LOAD for PF_IO_WORKER threads.
* tag 'x86-fpu-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Don't set TIF_NEED_FPU_LOAD for PF_IO_WORKER threads
x86/fpu: Replace zero-length array in struct xregs_state with flexible-array member
- Robustify/fix calling startup_{32,64}() from the decompressor code,
and removing x86 quirk from scripts/head-object-list.txt as
a result.
- Do not register processors that cannot be onlined for x2APIC
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzcNsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gFjhAAqxnVl1X413IK9sd4C56wWdoLlRo9uGvO
HtAYK1SRzibTOrFn+ByBhugYzYPyxIx634rM6hyp4nkVEnbgCXQ+Qc1xOwjyW8fh
gxR3FCxsQiqajg7/1DOOSoMc/rc3adU73RHCWTjcHV/Zo7KEtvVa6AFvMTd1xzt9
eMPqi7wsPflbdUV9wvf6cKkFPe3Nm3P1hOlUDHGmYZkDw30N8UlZmxvegwrBFDdV
SpiJ0ZLV90NGJ6k6O3XSd4pVDxMn9DlYd0v/0r+YAT56hiRhefSKR2/jQntutZqp
YlyZYjvwUjwEgOdUWPPRbndWWEfFsE2XQQclr4L+ZLQ/Gm8jTsT2b/IvXBmF4FzV
0kzjNdhkPObx3X6UQZ47r6J3x8SWA9qZ6JH+uqCd6w/UW1KIiMBZ2kuIXvJn6eSr
xFLabjPPeOeRXFpiQJjIZ31m7i3JlQbIsfb8IIxI1D55nEkNywjk9VqlLEVw23qD
p93l0+ehpnZ2YCjV0kts/EXMikSmVZorA5wkTzEmG5ER+2BuIDin+wuGPawXrKew
QCa2X7GoVmxf81Rcz7f/E+JnYcSTQ6AQzFkOxe3zb97bnRsckM/87buC0GktcPjW
C8iy3yZzEIhRj2ilKEZLl7jIK59B4jReUKJx+vsxk2k2p5fuRdMkMtPfIZDBwVHQ
PzRZGSDY4FI=
=p3z1
-----END PGP SIGNATURE-----
Merge tag 'x86-boot-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
- Robustify/fix calling startup_{32,64}() from the decompressor code,
and removing x86 quirk from scripts/head-object-list.txt as a result.
- Do not register processors that cannot be onlined for x2APIC
* tag 'x86-boot-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/acpi/boot: Do not register processors that cannot be onlined for x2APIC
scripts/head-object-list: Remove x86 from the list
x86/boot: Robustify calling startup_{32,64}() from the decompressor code
- Improve the scalability of the CFS bandwidth unthrottling logic
with large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with
the generic scheduler code. Add __cpuidle methods as noinstr to
objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS,
to query previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period,
to improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- ... Misc other cleanups, fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK
iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8
7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV
UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP
d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1
eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae
AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz
UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf
VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc
H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K
T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5
skkQXoONNaM=
=l1nN
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Improve the scalability of the CFS bandwidth unthrottling logic with
large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with the
generic scheduler code. Add __cpuidle methods as noinstr to objtool's
noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period, to
improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- Misc other cleanups, fixes
* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
sched/rt: pick_next_rt_entity(): check list_entry
sched/deadline: Add more reschedule cases to prio_changed_dl()
sched/fair: sanitize vruntime of entity being placed
sched/fair: Remove capacity inversion detection
sched/fair: unlink misfit task from cpu overutilized
objtool: mem*() are not uaccess safe
cpuidle: Fix poll_idle() noinstr annotation
sched/clock: Make local_clock() noinstr
sched/clock/x86: Mark sched_clock() noinstr
x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
x86/atomics: Always inline arch_atomic64*()
cpuidle: tracing, preempt: Squash _rcuidle tracing
cpuidle: tracing: Warn about !rcu_is_watching()
cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
cpuidle: drivers: firmware: psci: Dont instrument suspend code
KVM: selftests: Fix build of rseq test
exit: Detect and fix irq disabled state in oops
cpuidle, arm64: Fix the ARM64 cpuidle logic
cpuidle: mvebu: Fix duplicate flags assignment
sched/fair: Limit sched slice duration
...
- Optimize perf_sample_data layout
- Prepare sample data handling for BPF integration
- Update the x86 PMU driver for Intel Meteor Lake
- Restructure the x86 uncore code to fix a SPR (Sapphire Rapids)
discovery breakage
- Fix the x86 Zhaoxin PMU driver
- Cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzaHgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jYQg/+KRfobCevMQlZVnz09T3SsJ4ahJ587BL6
g2C6kobyUNfeChpFVroBkTR+yCb6Mq4xGr2nda9+2E978BYu9eanpx/u/bXNQ6NU
6YhLwgRrlFXonYn07kFfUJeELZ0W+zpPvymEN1KhTQWcrgXDfXRt2VfMwNsVxGRF
ZRyCWK+UOzSMU22FtW3I/xVLBB0vio9Y6wRC5QOpDVW5YtGwQGust7GJ53JPK43J
m2soJvWORauT+v0aqc7ggOtKd6pahVoXrDrbktxtq9N0ZGI+PubVCGevex++cXm/
B3QSf6VcMMuU6pfzxiEwRa8Whrc3XFeSDEfvMjC5v3becGNkdNBnGOJzYprwgRZJ
irb6/dSrv5P2lj6WphsO1Wzcm7EoWh8M7DVOMh/13Y/oODRdOrv48112Don9UURC
EPyvzAzizqdwdDopUmfiqUwuAXqb8uPZqCgmlz/NJkVz1/ijlfrmLgeDuf0vI7Aq
HznzzRwjFHzyCH7D+rtonFh3JDaqgaouY76tpC5yTtzKbZPlFT8kzeCvqkTMnGgH
czZnSNc/kBup0HDkNSlthK+TyrMXWKeVa8KQSY1E0NJHO4IBBCMzZywSoAaeofQK
hqfQyofX9XHmuHhCA4yIfv1XkZGlBTxpPAyDdHjgs9iJTsodSYMs8ESY08eW8DXn
Ld/35O6SylM=
=ztUT
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
- Optimize perf_sample_data layout
- Prepare sample data handling for BPF integration
- Update the x86 PMU driver for Intel Meteor Lake
- Restructure the x86 uncore code to fix a SPR (Sapphire Rapids)
discovery breakage
- Fix the x86 Zhaoxin PMU driver
- Cleanups
* tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
perf/x86/intel/uncore: Add Meteor Lake support
x86/perf/zhaoxin: Add stepping check for ZXC
perf/x86/intel/ds: Fix the conversion from TSC to perf time
perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table
perf/x86/uncore: Add a quirk for UPI on SPR
perf/x86/uncore: Ignore broken units in discovery table
perf/x86/uncore: Fix potential NULL pointer in uncore_get_alias_name
perf/x86/uncore: Factor out uncore_device_to_die()
perf/core: Call perf_prepare_sample() before running BPF
perf/core: Introduce perf_prepare_header()
perf/core: Do not pass header for sample ID init
perf/core: Set data->sample_flags in perf_prepare_sample()
perf/core: Add perf_sample_save_brstack() helper
perf/core: Add perf_sample_save_raw_data() helper
perf/core: Add perf_sample_save_callchain() helper
perf/core: Save the dynamic parts of sample data size
x86/kprobes: Use switch-case for 0xFF opcodes in prepare_emulation
perf/core: Change the layout of perf_sample_data
perf/x86/msr: Add Meteor Lake support
perf/x86/cstate: Add Meteor Lake support
...
When arch_prepare_optimized_kprobe calculating jump destination address,
it copies original instructions from jmp-optimized kprobe (see
__recover_optprobed_insn), and calculated based on length of original
instruction.
arch_check_optimized_kprobe does not check KPROBE_FLAG_OPTIMATED when
checking whether jmp-optimized kprobe exists.
As a result, setup_detour_execution may jump to a range that has been
overwritten by jump destination address, resulting in an inval opcode error.
For example, assume that register two kprobes whose addresses are
<func+9> and <func+11> in "func" function.
The original code of "func" function is as follows:
0xffffffff816cb5e9 <+9>: push %r12
0xffffffff816cb5eb <+11>: xor %r12d,%r12d
0xffffffff816cb5ee <+14>: test %rdi,%rdi
0xffffffff816cb5f1 <+17>: setne %r12b
0xffffffff816cb5f5 <+21>: push %rbp
1.Register the kprobe for <func+11>, assume that is kp1, corresponding optimized_kprobe is op1.
After the optimization, "func" code changes to:
0xffffffff816cc079 <+9>: push %r12
0xffffffff816cc07b <+11>: jmp 0xffffffffa0210000
0xffffffff816cc080 <+16>: incl 0xf(%rcx)
0xffffffff816cc083 <+19>: xchg %eax,%ebp
0xffffffff816cc084 <+20>: (bad)
0xffffffff816cc085 <+21>: push %rbp
Now op1->flags == KPROBE_FLAG_OPTIMATED;
2. Register the kprobe for <func+9>, assume that is kp2, corresponding optimized_kprobe is op2.
register_kprobe(kp2)
register_aggr_kprobe
alloc_aggr_kprobe
__prepare_optimized_kprobe
arch_prepare_optimized_kprobe
__recover_optprobed_insn // copy original bytes from kp1->optinsn.copied_insn,
// jump address = <func+14>
3. disable kp1:
disable_kprobe(kp1)
__disable_kprobe
...
if (p == orig_p || aggr_kprobe_disabled(orig_p)) {
ret = disarm_kprobe(orig_p, true) // add op1 in unoptimizing_list, not unoptimized
orig_p->flags |= KPROBE_FLAG_DISABLED; // op1->flags == KPROBE_FLAG_OPTIMATED | KPROBE_FLAG_DISABLED
...
4. unregister kp2
__unregister_kprobe_top
...
if (!kprobe_disabled(ap) && !kprobes_all_disarmed) {
optimize_kprobe(op)
...
if (arch_check_optimized_kprobe(op) < 0) // because op1 has KPROBE_FLAG_DISABLED, here not return
return;
p->kp.flags |= KPROBE_FLAG_OPTIMIZED; // now op2 has KPROBE_FLAG_OPTIMIZED
}
"func" code now is:
0xffffffff816cc079 <+9>: int3
0xffffffff816cc07a <+10>: push %rsp
0xffffffff816cc07b <+11>: jmp 0xffffffffa0210000
0xffffffff816cc080 <+16>: incl 0xf(%rcx)
0xffffffff816cc083 <+19>: xchg %eax,%ebp
0xffffffff816cc084 <+20>: (bad)
0xffffffff816cc085 <+21>: push %rbp
5. if call "func", int3 handler call setup_detour_execution:
if (p->flags & KPROBE_FLAG_OPTIMIZED) {
...
regs->ip = (unsigned long)op->optinsn.insn + TMPL_END_IDX;
...
}
The code for the destination address is
0xffffffffa021072c: push %r12
0xffffffffa021072e: xor %r12d,%r12d
0xffffffffa0210731: jmp 0xffffffff816cb5ee <func+14>
However, <func+14> is not a valid start instruction address. As a result, an error occurs.
Link: https://lore.kernel.org/all/20230216034247.32348-3-yangjihong1@huawei.com/
Fixes: f66c0447cc ("kprobes: Set unoptimized flag after unoptimizing code")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Since the following commit:
commit f66c0447cc ("kprobes: Set unoptimized flag after unoptimizing code")
modified the update timing of the KPROBE_FLAG_OPTIMIZED, a optimized_kprobe
may be in the optimizing or unoptimizing state when op.kp->flags
has KPROBE_FLAG_OPTIMIZED and op->list is not empty.
The __recover_optprobed_insn check logic is incorrect, a kprobe in the
unoptimizing state may be incorrectly determined as unoptimizing.
As a result, incorrect instructions are copied.
The optprobe_queued_unopt function needs to be exported for invoking in
arch directory.
Link: https://lore.kernel.org/all/20230216034247.32348-2-yangjihong1@huawei.com/
Fixes: f66c0447cc ("kprobes: Set unoptimized flag after unoptimizing code")
Cc: stable@vger.kernel.org
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
hv_get_nested_reg only translates SINT0, resulting in the wrong sint
being registered by nested vmbus.
Fix the issue with new utility function hv_is_sint_reg.
While at it, improve clarity of hv_set_non_nested_register and hv_is_synic_reg.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Jinank Jain <jinankjain@linux.microsoft.com>
Link: https://lore.kernel.org/r/1675980172-6851-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't support
EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlighted MSR bitmaps
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsL4USHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5fvQP/jOhCos/8VrNPJ5vtVkELSgbefbLFPbR
ThCefVekbnbqkJ4PdxLFOw/nzK+NREnoet02lvLXIeBt243qMeDWbNuGncZ+rqqX
ohJ6ALIS7ag2egG2ZllM4KTV1jWnm7bG6C8QsJwrFuG7wHe69lPYNnv34OD4l8QO
PR7reXBl7iyqQehF0rCqkYvtpv1QI+VlVd3ODCg/0vaZX4AOPZI6X1hX2+myUfTy
LI1oQ88SQ0ptoH5o7xBtNQw7Zknm/SP82Djzpg/1Wpr3cZv4g/1vblNcxpU/D+3I
Cp8xM42c82GIX1Zk6uYfsoCyQZ7Bd38llbgcLM5Fi0F2IuKJKoN1iuH77RJ1Juqf
v+SkRmFAr5Uf3UYVp4ck4FxTRH58ooDPeIv2jdaguyE9NgRfbVsghn6QDC+gkZMx
z4xdwCBXgsAAkhJwOLqDy98ZrQZoy1kxlmG2jLIe5RvOzI4WQDUJkJAd+u0dBD/W
U7PRqbqfJFtvvKN784BTa4CO8m+KwaNz4Xrwcg4YtBnKBgDU/cxFUZnnSEHdi9Pr
tVf462dfReK3slK3rkPw5HlacFdQsgfX6+VX1eeFKY+gE88TXoZks/X7ma24MXEw
hOu6VZMat1NjkfcgUcMffBVZlejK7C6oXTeVmqVKsx6hUs+zzqxpQ05AT4TMbAUI
gPTHJyy9ByHO
=WXGG
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-vmx-6.3' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.3:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't support
EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlighted MSR bitmaps
Merge cpuidle updates, PM core updates and changes related to system
sleep handling for 6.3-rc1:
- Make the TEO cpuidle governor check CPU utilization in order to refine
idle state selection (Kajetan Puchalski).
- Make Kconfig select the haltpoll cpuidle governor when the haltpoll
cpuidle driver is selected and replace a default_idle() call in that
driver with arch_cpu_idle() which allows MWAIT to be used (Li
RongQing).
- Add Emerald Rapids Xeon support to the intel_idle driver (Artem
Bityutskiy).
- Add ARCH_SUSPEND_POSSIBLE dependencies for ARMv4 cpuidle drivers to
avoid randconfig build failures (Arnd Bergmann).
- Make kobj_type structures used in the cpuidle sysfs interface
constant (Thomas Weißschuh).
- Make the cpuidle driver registration code update microsecond values
of idle state parameters in accordance with their nanosecond values
if they are provided (Rafael Wysocki).
- Make the PSCI cpuidle driver prevent topology CPUs from being
suspended on PREEMPT_RT (Krzysztof Kozlowski).
- Document that pm_runtime_force_suspend() cannot be used with
DPM_FLAG_SMART_SUSPEND (Richard Fitzgerald).
- Add EXPORT macros for exporting PM functions from drivers (Richard
Fitzgerald).
- Drop "select SRCU" from system sleep Kconfig (Paul E. McKenney).
- Remove /** from non-kernel-doc comments in hibernation code (Randy
Dunlap).
* pm-cpuidle:
cpuidle: psci: Do not suspend topology CPUs on PREEMPT_RT
cpuidle: driver: Update microsecond values of state parameters as needed
cpuidle: sysfs: make kobj_type structures constant
cpuidle: add ARCH_SUSPEND_POSSIBLE dependencies
intel_idle: add Emerald Rapids Xeon support
cpuidle-haltpoll: Replace default_idle() with arch_cpu_idle()
cpuidle-haltpoll: select haltpoll governor
cpuidle: teo: Introduce util-awareness
cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()
* pm-core:
PM: Add EXPORT macros for exporting PM functions
PM: runtime: Document that force_suspend() is incompatible with SMART_SUSPEND
* pm-sleep:
PM: sleep: Remove "select SRCU"
PM: hibernate: swap: don't use /** for non-kernel-doc comments
The comment that says mwait_play_dead() returns only on failure is a bit
misleading because mwait_play_dead() could actually return for valid
reasons (such as mwait not being supported by the platform) that do not
indicate a failure of the CPU offline operation. So, remove the comment.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230128003751.141317-1-srivatsa@csail.mit.edu
predictions bug. When running in SMT mode and one of the sibling threads
transitions out of C0 state, the other thread gets access to twice as many
entries in the RSB, but unfortunately the predictions of the now-halted
logical processor are not purged. Therefore, the executing processor
could speculatively execute from locations that the now-halted processor
had trained the RSB on.
The Spectre v2 mitigations cover the Linux kernel, as it fills the RSB
when context switching to the idle thread. However, KVM allows a VMM to
prevent exiting guest mode when transitioning out of C0 using the
KVM_CAP_X86_DISABLE_EXITS capability can be used by a VMM to change this
behavior. To mitigate the cross-thread return address predictions bug,
a VMM must not be allowed to override the default behavior to intercept
C0 transitions.
These patches introduce a KVM module parameter that, if set, will prevent
the user from disabling the HLT, MWAIT and CSTATE exits.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmPrvAAUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMQAgf8CirL+yrngzQ+39UTFQgXj3IS1UyR
o2mF39w3bVlQhLNf5MBHF965FUeOV/8A16x73hJGjhiiuisphtQWS/6xKR6uYbq7
0Qi821skqN6XpRsWTWqFHMsdY+n0skr8QeXG4k/GJu7Ghb3tqs4eTGgnf2WBfI8/
K1UgTmjd9+ikM5gKZoVLpcqZnti0gx3lM+cvZGdfrIUaXB+i+hNd2NfRTiGsTOiK
fX7vZtLvOeje2TPoKLhzekTbh8kTU07HRWID9aVXT8bLy6Zd6tg2CHlv11noKpwv
DFVV+RsJ1SiAQYSwT+4IvWfIG4oq4onBQ972g2a27pP2cxF+38GXzt4NQw==
=xscg
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Certain AMD processors are vulnerable to a cross-thread return address
predictions bug. When running in SMT mode and one of the sibling
threads transitions out of C0 state, the other thread gets access to
twice as many entries in the RSB, but unfortunately the predictions of
the now-halted logical processor are not purged. Therefore, the
executing processor could speculatively execute from locations that
the now-halted processor had trained the RSB on.
The Spectre v2 mitigations cover the Linux kernel, as it fills the RSB
when context switching to the idle thread. However, KVM allows a VMM
to prevent exiting guest mode when transitioning out of C0 using the
KVM_CAP_X86_DISABLE_EXITS capability can be used by a VMM to change
this behavior. To mitigate the cross-thread return address predictions
bug, a VMM must not be allowed to override the default behavior to
intercept C0 transitions.
These patches introduce a KVM module parameter that, if set, will
prevent the user from disabling the HLT, MWAIT and CSTATE exits"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
Documentation/hw-vuln: Add documentation for Cross-Thread Return Predictions
KVM: x86: Mitigate the cross-thread return address predictions bug
x86/speculation: Identify processors vulnerable to SMT RSB predictions
Use the irq_domain_create_hierarchy() helper to create the hierarchical
domain, which both serves as documentation and avoids poking at
irqdomain internals.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Tested-by: Hsin-Yi Wang <hsinyi@chromium.org>
Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230213104302.17307-13-johan+linaro@kernel.org
This pull request contains the following:
o Improvements to clocksource-watchdog console messages.
o Loosening of the clocksource-watchdog skew criteria to match
those of NTP (500 parts per million, relaxed from 400 parts
per million). If it is good enough for NTP, it is good enough
for the clocksource watchdog.
o Suspend clocksource-watchdog checking temporarily when high
memory latencies are detected. This avoids the false-positive
clock-skew events that have been seen on production systems
running memory-intensive workloads.
o On systems where the TSC is deemed trustworthy, use it as the
watchdog timesource, but only when specifically requested using
the tsc=watchdog kernel boot parameter. This permits clock-skew
events to be detected, but avoids forcing workloads to use the
slow HPET and ACPI PM timers. These last two timers are slow
enough to cause systems to be needlessly marked bad on the one
hand, and real skew does sometimes happen on production systems
running production workloads on the other. And sometimes it is
the fault of the TSC, or at least of the firmware that told the
kernel to program the TSC with the wrong frequency.
o Add a tsc=revalidate kernel boot parameter to allow the kernel
to diagnose cases where the TSC hardware works fine, but was told
by firmware to tick at the wrong frequency. Such cases are rare,
but they really have happened on production systems.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmPhnhkTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jClDD/9gTo62MakVQz2wzBRBcWunzX4BAfy2
2ORqZYqq8cJ4ccFVWtSq7gZ+0bxiT+J4jaVyJpmUPzaiCSfNUT+GLjWyLGzF9Xq+
xLWpFJOhFhKYjYN2m1ottuQ81V7aTlorC8AJt/o+oCJFGUCb/heg/UrmoZ6DweHw
H7uXS9yenKdKgYoMENW+8IVsy16sT4D5Fe8XAD/2J6vBBUbgBzKWhi8XSgSHB/Xw
GCP4UfXVGl5QRG9Xu4ZgrFV1t4azxtmdBghFm7/Kep/j6ttSY78yoS43AbI57bhD
fWB5mfAQvO+Zo5/9rLjcDzeZCp/PSdARD41aycPMiei08K278tIN9T/fmfSoG6rV
lVRdFxTHrQcqc9d+g+mGASQBezCF8pxonm9HYLBpNjyfYHnKV70SPXywO4oqAJ1I
7dCm+uv3Y8KaJdVnPUWOHJjvQLx9NWK5/pXBYjsYnLR+69EVmGDgPZ+/ulQxkWBj
DtrQgs+sHQ8gngNpAilxuu/lrUXzrC8N4mtxXKBFQoCPYQMFBkr9S+aAEHIgZT9H
1dWwR1QxeR5uxt7U+3DmTyJ1XKfYjDyyScesILlLMLbdKgZtTS5wGaK4QdJ3QW2z
z4zqPDccWDDZKZy9W4QBnFBx6Rn49C8xThy7f6Loc+2cKAT10hrEmRJsn79AOCDc
6hV0S2U9a6ypQg==
=OWY2
-----END PGP SIGNATURE-----
Merge tag 'clocksource.2023.02.06b' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into timers/core
Pull clocksource watchdog changes from Paul McKenney:
o Improvements to clocksource-watchdog console messages.
o Loosening of the clocksource-watchdog skew criteria to match
those of NTP (500 parts per million, relaxed from 400 parts
per million). If it is good enough for NTP, it is good enough
for the clocksource watchdog.
o Suspend clocksource-watchdog checking temporarily when high
memory latencies are detected. This avoids the false-positive
clock-skew events that have been seen on production systems
running memory-intensive workloads.
o On systems where the TSC is deemed trustworthy, use it as the
watchdog timesource, but only when specifically requested using
the tsc=watchdog kernel boot parameter. This permits clock-skew
events to be detected, but avoids forcing workloads to use the
slow HPET and ACPI PM timers. These last two timers are slow
enough to cause systems to be needlessly marked bad on the one
hand, and real skew does sometimes happen on production systems
running production workloads on the other. And sometimes it is
the fault of the TSC, or at least of the firmware that told the
kernel to program the TSC with the wrong frequency.
o Add a tsc=revalidate kernel boot parameter to allow the kernel
to diagnose cases where the TSC hardware works fine, but was told
by firmware to tick at the wrong frequency. Such cases are rare,
but they really have happened on production systems.
Link: https://lore.kernel.org/r/20230210193640.GA3325193@paulmck-ThinkPad-P17-Gen-1
Add a 'signal' field which allows unwind hints to specify whether the
instruction pointer should be taken literally (like for most interrupts
and exceptions) rather than decremented (like for call stack return
addresses) when used to find the next ORC entry.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/d2c5ec4d83a45b513d8fd72fab59f1a8cfa46871.1676068346.git.jpoimboe@kernel.org
Do the feature check as the very first thing in the function. Everything
else comes after that and is meaningless work if the TSC CPUID bit is
not even set. Switch to cpu_feature_enabled() too, while at it.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/Y5990CUCuWd5jfBH@zn.tnic
A quick search doesn't reveal any use outside of the kernel - which
would be questionable to begin with anyway - so make the export GPL
only.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/Y599miBzWRAuOwhg@zn.tnic
Certain AMD processors are vulnerable to a cross-thread return address
predictions bug. When running in SMT mode and one of the sibling threads
transitions out of C0 state, the other sibling thread could use return
target predictions from the sibling thread that transitioned out of C0.
The Spectre v2 mitigations cover the Linux kernel, as it fills the RSB
when context switching to the idle thread. However, KVM allows a VMM to
prevent exiting guest mode when transitioning out of C0. A guest could
act maliciously in this situation, so create a new x86 BUG that can be
used to detect if the processor is vulnerable.
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <91cec885656ca1fcd4f0185ce403a53dd9edecb7.1675956146.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
UEFI v2.10 extends the EFI memory attributes table with a flag that
indicates whether or not all RuntimeServicesCode regions were
constructed with ENDBR landing pads, permitting the OS to map these
regions with IBT restrictions enabled.
So let's take this into account on x86 as well.
Suggested-by: Peter Zijlstra <peterz@infradead.org> # ibt_save() changes
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Commit 3bc753c06d ("kbuild: treat char as always unsigned") broke
kprobes. Setting a probe-point on 1 byte conditional jump can cause the
kernel to crash when the (signed) relative jump offset gets treated as
unsigned.
Fix by replacing the unsigned 'immediate.bytes' (plus a cast) with the
signed 'immediate.value' when assigning to the relative jump offset.
[ dhansen: clarified changelog ]
Fixes: 3bc753c06d ("kbuild: treat char as always unsigned")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20230208071708.4048-1-namit%40vmware.com
Unconditionally enabling TSC watchdog checking of the HPET and PMTMR
clocksources can degrade latency and performance. Therefore, provide
a new "watchdog" option to the tsc= boot parameter that opts into such
checking. Note that tsc=watchdog is overridden by a tsc=nowatchdog
regardless of their relative positions in the list of boot parameters.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
setup_getcpu() configures two things:
- it writes the current CPU & node information into MSR_TSC_AUX
- it writes the same information as a GDT entry.
By using the "full" setup_getcpu() on i386 it is possible to read the CPU
information in userland via RDTSCP() or via LSL from the GDT.
Provide an GDT_ENTRY_CPUNODE for x86-32 and make the setup function
unconditionally available.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org>
Link: https://lore.kernel.org/r/20221125094216.3663444-2-bigeasy@linutronix.de
Return an error from the late loading function which is run on each CPU
only when an error has actually been encountered during the update.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230130161709.11615-5-bp@alien8.de
The AMD side of the loader has always claimed to support mixed
steppings. But somewhere along the way, it broke that by assuming that
the cached patch blob is a single one instead of it being one per
*node*.
So turn it into a per-node one so that each node can stash the blob
relevant for it.
[ NB: Fixes tag is not really the exactly correct one but it is good
enough. ]
Fixes: fe055896c0 ("x86/microcode: Merge the early microcode loader")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org> # 2355370cd9 ("x86/microcode/amd: Remove load_microcode_amd()'s bsp parameter")
Cc: <stable@kernel.org> # a5ad92134b ("x86/microcode/AMD: Add a @cpu parameter to the reloading functions")
Link: https://lore.kernel.org/r/20230130161709.11615-4-bp@alien8.de
Josh reported a bug:
When the object to be patched is a module, and that module is
rmmod'ed and reloaded, it fails to load with:
module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 2, loc 00000000ba0302e9, val ffffffffa03e293c
livepatch: failed to initialize patch 'livepatch_nfsd' for module 'nfsd' (-8)
livepatch: patch 'livepatch_nfsd' failed for module 'nfsd', refusing to load module 'nfsd'
The livepatch module has a relocation which references a symbol
in the _previous_ loading of nfsd. When apply_relocate_add()
tries to replace the old relocation with a new one, it sees that
the previous one is nonzero and it errors out.
He also proposed three different solutions. We could remove the error
check in apply_relocate_add() introduced by commit eda9cec4c9
("x86/module: Detect and skip invalid relocations"). However the check
is useful for detecting corrupted modules.
We could also deny the patched modules to be removed. If it proved to be
a major drawback for users, we could still implement a different
approach. The solution would also complicate the existing code a lot.
We thus decided to reverse the relocation patching (clear all relocation
targets on x86_64). The solution is not
universal and is too much arch-specific, but it may prove to be simpler
in the end.
Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Originally-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230125185401.279042-2-song@kernel.org
This "#if 0" block has been untouched for many years. Remove it to clean
up the code.
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230125185401.279042-1-song@kernel.org
On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
TSC is disabled. This works well much of the time, but there is the
occasional production-level system that meets all of these criteria, but
which still has a TSC that skews significantly from atomic-clock time.
This is usually attributed to a firmware or hardware fault. Yes, the
various NTP daemons do express their opinions of userspace-to-atomic-clock
time skew, but they put them in various places, depending on the daemon
and distro in question. It would therefore be good for the kernel to
have some clue that there is a problem.
The old behavior of marking the TSC unstable is a non-starter because a
great many workloads simply cannot tolerate the overheads and latencies
of the various non-TSC clocksources. In addition, NTP-corrected systems
sometimes can tolerate significant kernel-space time skew as long as
the userspace time sources are within epsilon of atomic-clock time.
Therefore, when watchdog verification of TSC is disabled, enable it for
HPET and PMTMR (AKA ACPI PM timer). This provides the needed in-kernel
time-skew diagnostic without degrading the system's performance.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Waiman Long <longman@redhat.com>
Cc: <x86@kernel.org>
Tested-by: Feng Tang <feng.tang@intel.com>
The kernel assumes that the TSC frequency which is provided by the
hardware / firmware via MSRs or CPUID(0x15) is correct after applying
a few basic consistency checks. This disables the TSC recalibration
against HPET or PM timer.
As a result there is no mechanism to validate that frequency in cases
where a firmware or hardware defect is suspected. And there was case
that some user used atomic clock to measure the TSC frequency and
reported an inaccuracy issue, which was later fixed in firmware.
Add an option 'recalibrate' for 'tsc' kernel parameter to force the
tsc freq recalibration with HPET or PM timer, and warn if the
deviation from previous value is more than about 500 PPM, which
provides a way to verify the data from hardware / firmware.
There is no functional change to existing work flow.
Recently there was a real-world case: "The 40ms/s divergence between
TSC and HPET was observed on hardware that is quite recent" [1], on
that platform the TSC frequence 1896 MHz was got from CPUID(0x15),
and the force-reclibration with HPET/PMTIMER both calibrated out
value of 1975 MHz, which also matched with check from software
'chronyd', indicating it's a problem of BIOS or firmware.
[Thanks tglx for helping improving the commit log]
[ paulmck: Wordsmith Kconfig help text. ]
[1]. https://lore.kernel.org/lkml/20221117230910.GI4001@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: <x86@kernel.org>
Cc: <linux-doc@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reading DR[0-3]_ADDR_MASK MSRs takes about 250 cycles which is going to
be noticeable with the AMD KVM SEV-ES DebugSwap feature enabled. KVM is
going to store host's DR[0-3] and DR[0-3]_ADDR_MASK before switching to
a guest; the hardware is going to swap these on VMRUN and VMEXIT.
Store MSR values passed to set_dr_addr_mask() in percpu variables
(when changed) and return them via new amd_get_dr_addr_mask().
The gain here is about 10x.
As set_dr_addr_mask() uses the array too, change the @dr type to
unsigned to avoid checking for <0. And give it the amd_ prefix to match
the new helper as the whole DR_ADDR_MASK feature is AMD-specific anyway.
While at it, replace deprecated boot_cpu_has() with cpu_feature_enabled()
in set_dr_addr_mask().
Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230120031047.628097-2-aik@amd.com
Microcode gets reloaded late only if "1" is written to the reload file.
However, the code silently treats any other unsigned integer as a
successful write even though no actions are performed to load microcode.
Make the loader more strict to accept only "1" as a trigger value and
return an error otherwise.
[ bp: Massage commit message. ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230130213955.6046-3-ashok.raj@intel.com
Clang likes to create conditional tail calls like:
0000000000000350 <amd_pmu_add_event>:
350: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 351: R_X86_64_NONE __fentry__-0x4
355: 48 83 bf 20 01 00 00 00 cmpq $0x0,0x120(%rdi)
35d: 0f 85 00 00 00 00 jne 363 <amd_pmu_add_event+0x13> 35f: R_X86_64_PLT32 __SCT__amd_pmu_branch_add-0x4
363: e9 00 00 00 00 jmp 368 <amd_pmu_add_event+0x18> 364: R_X86_64_PLT32 __x86_return_thunk-0x4
Where 0x35d is a static call site that's turned into a conditional
tail-call using the Jcc class of instructions.
Teach the in-line static call text patching about this.
Notably, since there is no conditional-ret, in that case patch the Jcc
to point at an empty stub function that does the ret -- or the return
thunk when needed.
Reported-by: "Erhard F." <erhard_f@mailbox.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/Y9Kdg9QjHkr9G5b5@hirez.programming.kicks-ass.net
In order to re-write Jcc.d32 instructions text_poke_bp() needs to be
taught about them.
The biggest hurdle is that the whole machinery is currently made for 5
byte instructions and extending this would grow struct text_poke_loc
which is currently a nice 16 bytes and used in an array.
However, since text_poke_loc contains a full copy of the (s32)
displacement, it is possible to map the Jcc.d32 2 byte opcodes to
Jcc.d8 1 byte opcode for the int3 emulation.
This then leaves the replacement bytes; fudge that by only storing the
last 5 bytes and adding the rule that 'length == 6' instruction will
be prefixed with a 0x0f byte.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20230123210607.115718513@infradead.org
Move the kprobe Jcc emulation into int3_emulate_jcc() so it can be
used by more code -- specifically static_call() will need this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20230123210607.057678245@infradead.org
In order to use sched_clock() from noinstr code, mark it and all it's
implenentations noinstr.
The whole pvclock thing (used by KVM/Xen) is a bit of a pain,
since it calls out to watchdogs, create a
pvclock_clocksource_read_nowd() variant doesn't do that and can be
noinstr.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230126151323.702003578@infradead.org
Improve atomic update of last_value in pvclock_clocksource_read:
- Atomic update can be skipped if the "last_value" is already
equal to "ret".
- The detection of atomic update failure is not correct. The value,
returned by atomic64_cmpxchg should be compared to the old value
from the location to be updated. If these two are the same, then
atomic update succeeded and "last_value" location is updated to
"ret" in an atomic way. Otherwise, the atomic update failed and
it should be retried with the value from "last_value" - exactly
what atomic64_try_cmpxchg does in a correct and more optimal way.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20230118202330.3740-1-ubizjak@gmail.com
Link: https://lore.kernel.org/r/20230126151323.643408110@infradead.org
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ
E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg
nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG
TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp
s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER
ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh
SDc/Y/c=
=zpaD
-----END PGP SIGNATURE-----
Merge tag 'v6.2-rc6' into sched/core, to pick up fixes
Pick up fixes before merging another batch of cpuidle updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
16 will support it
- Fix a NULL ptr deref when suspending with Xen PV
- Have a SEV-SNP guest check explicitly for features enabled by the hypervisor
and fail gracefully if some are unsupported by the guest instead of failing in
a non-obvious and hard-to-debug way
- Fix a MSI descriptor leakage under Xen
- Mark Xen's MSI domain as supporting MSI-X
- Prevent legacy PIC interrupts from being resent in software by marking them
level triggered, as they should be, which lead to a NULL ptr deref
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPWdaAACgkQEsHwGGHe
VUrHUBAAvRh7cDVKvr2wPNJ+9RFjPFubcLq/+4yTe4Bu14wnYn8+gb2S7P2bPJWl
TxbZhJGV2IeYyB+aJybjGwX96AzE5lWD6epQLRLI/BOVmqLA8Eyw9te0NEQDd9Wq
hgY1/hhUmLdpXubq09YjIht5nlxVZ+JFfppouTR7LVtZl7MFvQbAOU8rWzpcbm7l
UlA4GLakFeOYpbI5g+zzsZ1a1M0cSz8wQ43plD5aAtNy2wKbWiA4QDXJo9J7+ZQg
1FmOHZ3ChPjEhf9k/N2uAZ6RUHVEDIJ7hDfsM3j/qsKGzCV4cho2Ify+0PFWo4pt
FO2bRh1yDFqdz4m6ulhnmuMnoRnEuswwrTzrG+HYu1ntpUt72dULmmRBpJ0s6C7W
BHpCThNWIlcfHbLNY+VAOJ2hiOqfU/8ld+8R0sn7xmomjgCRYHh9kDGOeuvh1hJl
jfITDuL2gjcj3Ph6+xh6KvOga41ff3EcxkfvqolZ/emRllPiDWsKYXVgUIl22FHt
23xH7gutPCp27MdoXM1EaWs5s/PQfctvm4LrW6XS8IjWURLo4RrkU6y7YD5VKVy8
KCKcpF+JMdE1Ao1WpMFfjDG4ObyMYyqnD790nQVM1e5kIhOpeWaa7WzkGFRyIo6Q
5BU3GGDyMT8SHy7NFhQL62YJe0P2ZctNIDjiTQlYDnWWhjmvk8I=
=4QMN
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Start checking for -mindirect-branch-cs-prefix clang support too now
that LLVM 16 will support it
- Fix a NULL ptr deref when suspending with Xen PV
- Have a SEV-SNP guest check explicitly for features enabled by the
hypervisor and fail gracefully if some are unsupported by the guest
instead of failing in a non-obvious and hard-to-debug way
- Fix a MSI descriptor leakage under Xen
- Mark Xen's MSI domain as supporting MSI-X
- Prevent legacy PIC interrupts from being resent in software by
marking them level triggered, as they should be, which lead to a NULL
ptr deref
* tag 'x86_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Move '-mindirect-branch-cs-prefix' out of GCC-only block
acpi: Fix suspend with Xen PV
x86/sev: Add SEV-SNP guest feature negotiation support
x86/pci/xen: Fixup fallout from the PCI/MSI overhaul
x86/pci/xen: Set MSI_FLAG_PCI_MSIX support in Xen MSI domain
x86/i8259: Mark legacy PIC interrupts with IRQ_LEVEL
struct tdx_hypercall_args is used to pass down hypercall arguments to
__tdx_hypercall() assembly routine.
Currently __tdx_hypercall() handles up to 6 arguments. In preparation to
changes in __tdx_hypercall(), expand the structure to 6 more registers
and generate asm offsets for them.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-3-kirill.shutemov%40linux.intel.com
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
__acpi_{acquire,release}_global_lock(). x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after CMPXCHG
(and related MOV instruction in front of CMPXCHG).
Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when CMPXCHG
fails. There is no need to re-read the value in the loop.
Note that the value from *ptr should be read using READ_ONCE() to prevent
the compiler from merging, refetching or reordering the read.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20230116162522.4072-1-ubizjak@gmail.com
clang correctly complains
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1456:6: warning: variable \
'h' set but not used [-Wunused-but-set-variable]
u32 h;
^
but it can't know whether this use is innocuous or really a problem.
There's a reason why those warning switches are behind a W=1 and not
enabled by default - yes, one needs to do:
make W=1 CC=clang HOSTCC=clang arch/x86/kernel/cpu/resctrl/
with clang 14 in order to trigger it.
I would normally not take a silly fix like that but this one is simple
and doesn't make the code uglier so...
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/202301242015.kbzkVteJ-lkp@intel.com
The AMD Zen4 core supports a new feature called Automatic IBRS.
It is a "set-and-forget" feature that means that, like Intel's Enhanced IBRS,
h/w manages its IBRS mitigation resources automatically across CPL transitions.
The feature is advertised by CPUID_Fn80000021_EAX bit 8 and is enabled by
setting MSR C000_0080 (EFER) bit 21.
Enable Automatic IBRS by default if the CPU feature is present. It typically
provides greater performance over the incumbent generic retpolines mitigation.
Reuse the SPECTRE_V2_EIBRS spectre_v2_mitigation enum. AMD Automatic IBRS and
Intel Enhanced IBRS have similar enablement. Add NO_EIBRS_PBRSB to
cpu_vuln_whitelist, since AMD Automatic IBRS isn't affected by PBRSB-eIBRS.
The kernel command line option spectre_v2=eibrs is used to select AMD Automatic
IBRS, if available.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-8-kim.phillips@amd.com
The Null Selector Clears Base feature was being open-coded for KVM.
Add it to its newly added native CPUID leaf 0x80000021 EAX proper.
Also drop the bit description comments now it's more self-describing.
[ bp: Convert test in check_null_seg_clears_base() too. ]
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-6-kim.phillips@amd.com
The LFENCE always serializing feature bit was defined as scattered
LFENCE_RDTSC and its native leaf bit position open-coded for KVM. Add
it to its newly added CPUID leaf 0x80000021 EAX proper. With
LFENCE_RDTSC in its proper place, the kernel's set_cpu_cap() will
effectively synthesize the feature for KVM going forward.
Also, DE_CFG[1] doesn't need to be set on such CPUs anymore.
[ bp: Massage and merge diff from Sean. ]
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-5-kim.phillips@amd.com
We don't set it on PF_KTHREAD threads as they never return to userspace,
and PF_IO_WORKER threads are identical in that regard. As they keep
running in the kernel until they die, skip setting the FPU flag on them.
More of a cosmetic thing that was found while debugging and
issue and pondering why the FPU flag is set on these threads.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/560c844c-f128-555b-40c6-31baff27537f@kernel.dk
Add support for CPUID leaf 80000021, EAX. The majority of the features will be
used in the kernel and thus a separate leaf is appropriate.
Include KVM's reverse_cpuid entry because features are used by VM guests, too.
[ bp: Massage commit message. ]
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-2-kim.phillips@amd.com
Use a dedicated entry for invoking the NMI handler from KVM VMX's VM-Exit
path for 32-bit even though using a dedicated entry for 32-bit isn't
strictly necessary. Exposing a single symbol will allow KVM to reference
the entry point in assembly code without having to resort to more #ifdefs
(or #defines). identry.h is intended to be included from asm files only
once, and so simply including idtentry.h in KVM assembly isn't an option.
Bypassing the ESP fixup and CR3 switching in the standard NMI entry code
is safe as KVM always handles NMIs that occur in the guest on a kernel
stack, with a kernel CR3.
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable SVM and more importantly force GIF=1 when halting a CPU or
rebooting the machine. Similar to VMX, SVM allows software to block
INITs via CLGI, and thus can be problematic for a crash/reboot. The
window for failure is smaller with SVM as INIT is only blocked while
GIF=0, i.e. between CLGI and STGI, but the window does exist.
Fixes: fba4f472b3 ("x86/reboot: Turn off KVM when halting a CPU")
Cc: stable@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130233650.1404148-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable SVM on all CPUs via NMI shootdown during an emergency reboot.
Like VMX, SVM can block INIT, e.g. if the emergency reboot is triggered
between CLGI and STGI, and thus can prevent bringing up other CPUs via
INIT-SIPI-SIPI.
Cc: stable@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130233650.1404148-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable virtualization in crash_nmi_callback() and rework the
emergency_vmx_disable_all() path to do an NMI shootdown if and only if a
shootdown has not already occurred. NMI crash shootdown fundamentally
can't support multiple invocations as responding CPUs are deliberately
put into halt state without unblocking NMIs. But, the emergency reboot
path doesn't have any work of its own, it simply cares about disabling
virtualization, i.e. so long as a shootdown occurred, emergency reboot
doesn't care who initiated the shootdown, or when.
If "crash_kexec_post_notifiers" is specified on the kernel command line,
panic() will invoke crash_smp_send_stop() and result in a second call to
nmi_shootdown_cpus() during native_machine_emergency_restart().
Invoke the callback _before_ disabling virtualization, as the current
VMCS needs to be cleared before doing VMXOFF. Note, this results in a
subtle change in ordering between disabling virtualization and stopping
Intel PT on the responding CPUs. While VMX and Intel PT do interact,
VMXOFF and writes to MSR_IA32_RTIT_CTL do not induce faults between one
another, which is all that matters when panicking.
Harden nmi_shootdown_cpus() against multiple invocations to try and
capture any such kernel bugs via a WARN instead of hanging the system
during a crash/dump, e.g. prior to the recent hardening of
register_nmi_handler(), re-registering the NMI handler would trigger a
double list_add() and hang the system if CONFIG_BUG_ON_DATA_CORRUPTION=y.
list_add double add: new=ffffffff82220800, prev=ffffffff8221cfe8, next=ffffffff82220800.
WARNING: CPU: 2 PID: 1319 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
Call Trace:
__register_nmi_handler+0xcf/0x130
nmi_shootdown_cpus+0x39/0x90
native_machine_emergency_restart+0x1c9/0x1d0
panic+0x237/0x29b
Extract the disabling logic to a common helper to deduplicate code, and
to prepare for doing the shootdown in the emergency reboot path if SVM
is supported.
Note, prior to commit ed72736183 ("x86/reboot: Force all cpus to exit
VMX root if VMX is supported"), nmi_shootdown_cpus() was subtly protected
against a second invocation by a cpu_vmx_enabled() check as the kdump
handler would disable VMX if it ran first.
Fixes: ed72736183 ("x86/reboot: Force all cpus to exit VMX root if VMX is supported")
Cc: stable@vger.kernel.org
Reported-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/all/20220427224924.592546-2-gpiccoli@igalia.com
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130233650.1404148-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
ARM:
* Fix the PMCR_EL0 reset value after the PMU rework
* Correctly handle S2 fault triggered by a S1 page table walk
by not always classifying it as a write, as this breaks on
R/O memslots
* Document why we cannot exit with KVM_EXIT_MMIO when taking
a write fault from a S1 PTW on a R/O memslot
* Put the Apple M2 on the naughty list for not being able to
correctly implement the vgic SEIS feature, just like the M1
before it
* Reviewer updates: Alex is stepping down, replaced by Zenghui
x86:
* Fix various rare locking issues in Xen emulation and teach lockdep
to detect them
* Documentation improvements
* Do not return host topology information from KVM_GET_SUPPORTED_CPUID
The event configuration for mbm_local_bytes can be changed by the
user by writing to the configuration file
/sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config.
The event configuration settings are domain specific and will affect all
the CPUs in the domain.
Following are the types of events supported:
==== ===========================================================
Bits Description
==== ===========================================================
6 Dirty Victims from the QOS domain to all types of memory
5 Reads to slow memory in the non-local NUMA domain
4 Reads to slow memory in the local NUMA domain
3 Non-temporal writes to non-local NUMA domain
2 Non-temporal writes to local NUMA domain
1 Reads to memory in the non-local NUMA domain
0 Reads to memory in the local NUMA domain
==== ===========================================================
For example, to change the mbm_local_bytes_config to count all the non-temporal
writes on domain 0, the bits 2 and 3 needs to be set which is 1100b (in hex
0xc).
Run the command:
$echo 0=0xc > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
To change the mbm_local_bytes to count only reads to local NUMA domain 1,
the bit 0 needs to be set which 1b (in hex 0x1). Run the command:
$echo 1=0x1 > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-13-babu.moger@amd.com
The event configuration for mbm_total_bytes can be changed by the user by
writing to the file /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config.
The event configuration settings are domain specific and affect all the
CPUs in the domain.
Following are the types of events supported:
==== ===========================================================
Bits Description
==== ===========================================================
6 Dirty Victims from the QOS domain to all types of memory
5 Reads to slow memory in the non-local NUMA domain
4 Reads to slow memory in the local NUMA domain
3 Non-temporal writes to non-local NUMA domain
2 Non-temporal writes to local NUMA domain
1 Reads to memory in the non-local NUMA domain
0 Reads to memory in the local NUMA domain
==== ===========================================================
For example:
To change the mbm_total_bytes to count only reads on domain 0, the bits
0, 1, 4 and 5 needs to be set, which is 110011b (in hex 0x33).
Run the command:
$echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
To change the mbm_total_bytes to count all the slow memory reads on domain 1,
the bits 4 and 5 needs to be set which is 110000b (in hex 0x30).
Run the command:
$echo 1=0x30 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-12-babu.moger@amd.com
The event configuration can be viewed by the user by reading the configuration
file /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config. The event
configuration settings are domain specific and will affect all the CPUs in the
domain.
Following are the types of events supported:
==== ===========================================================
Bits Description
==== ===========================================================
6 Dirty Victims from the QOS domain to all types of memory
5 Reads to slow memory in the non-local NUMA domain
4 Reads to slow memory in the local NUMA domain
3 Non-temporal writes to non-local NUMA domain
2 Non-temporal writes to local NUMA domain
1 Reads to memory in the non-local NUMA domain
0 Reads to memory in the local NUMA domain
==== ===========================================================
By default, the mbm_local_bytes_config is set to 0x15 to count all the local
event types.
For example:
$cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
0=0x15;1=0x15;2=0x15;3=0x15
In this case, the event mbm_local_bytes is configured with 0x15 on
domains 0 to 3.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-11-babu.moger@amd.com
The event configuration can be viewed by the user by reading the
configuration file /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config. The
event configuration settings are domain specific and will affect all the CPUs in
the domain.
Following are the types of events supported:
==== ===========================================================
Bits Description
==== ===========================================================
6 Dirty Victims from the QOS domain to all types of memory
5 Reads to slow memory in the non-local NUMA domain
4 Reads to slow memory in the local NUMA domain
3 Non-temporal writes to non-local NUMA domain
2 Non-temporal writes to local NUMA domain
1 Reads to memory in the non-local NUMA domain
0 Reads to memory in the local NUMA domain
==== ===========================================================
By default, the mbm_total_bytes_config is set to 0x7f to count all the
event types.
For example:
$cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
0=0x7f;1=0x7f;2=0x7f;3=0x7f
In this case, the event mbm_total_bytes is configured with 0x7f on
domains 0 to 3.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-10-babu.moger@amd.com
Add a new field in struct mon_evt to support Bandwidth Monitoring Event
Configuration (BMEC) and also update the "mon_features" display.
The resctrl file "mon_features" will display the supported events
and files that can be used to configure those events if monitor
configuration is supported.
Before the change:
$ cat /sys/fs/resctrl/info/L3_MON/mon_features
llc_occupancy
mbm_total_bytes
mbm_local_bytes
After the change when BMEC is supported:
$ cat /sys/fs/resctrl/info/L3_MON/mon_features
llc_occupancy
mbm_total_bytes
mbm_total_bytes_config
mbm_local_bytes
mbm_local_bytes_config
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-9-babu.moger@amd.com
In an upcoming change, rdt_get_mon_l3_config() needs to call rdt_cpu_has() to
query the monitor related features. It cannot be called right now because
rdt_cpu_has() has the __init attribute but rdt_get_mon_l3_config() doesn't.
Add the __init attribute to rdt_get_mon_l3_config() that is only called by
get_rdt_mon_resources() that already has the __init attribute. Also make
rdt_cpu_has() available to by rdt_get_mon_l3_config() via the internal header
file.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-8-babu.moger@amd.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The QoS slow memory configuration details are available via
CPUID_Fn80000020_EDX_x02. Detect the available details and
initialize the rest to defaults.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-7-babu.moger@amd.com
Newer AMD processors support the new feature Bandwidth Monitoring Event
Configuration (BMEC).
The feature support is identified via CPUID Fn8000_0020_EBX_x0[3]: EVT_CFG -
Bandwidth Monitoring Event Configuration (BMEC)
The bandwidth monitoring events mbm_total_bytes and mbm_local_bytes are set to
count all the total and local reads/writes, respectively. With the introduction
of slow memory, the two counters are not enough to count all the different types
of memory events. Therefore, BMEC provides the option to configure
mbm_total_bytes and mbm_local_bytes to count the specific type of events.
Each BMEC event has a configuration MSR which contains one field for each
bandwidth type that can be used to configure the bandwidth event to track any
combination of supported bandwidth types. The event will count requests from
every bandwidth type bit that is set in the corresponding configuration
register.
Following are the types of events supported:
==== ========================================================
Bits Description
==== ========================================================
6 Dirty Victims from the QOS domain to all types of memory
5 Reads to slow memory in the non-local NUMA domain
4 Reads to slow memory in the local NUMA domain
3 Non-temporal writes to non-local NUMA domain
2 Non-temporal writes to local NUMA domain
1 Reads to memory in the non-local NUMA domain
0 Reads to memory in the local NUMA domain
==== ========================================================
By default, the mbm_total_bytes configuration is set to 0x7F to count
all the event types and the mbm_local_bytes configuration is set to 0x15 to
count all the local memory events.
Feature description is available in the specification, "AMD64 Technology
Platform Quality of Service Extensions, Revision: 1.03 Publication" at
https://bugzilla.kernel.org/attachment.cgi?id=301365
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-5-babu.moger@amd.com
Add a new resource type RDT_RESOURCE_SMBA to handle the QoS enforcement
policies on the external slow memory.
Mostly initialization of the essentials. Setting fflags to RFTYPE_RES_MB
configures the SMBA resource to have the same resctrl files as the
existing MBA resource. The SMBA resource has identical properties to
the existing MBA resource. These properties will be enumerated in an
upcoming change and exposed via resctrl because of this flag.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-4-babu.moger@amd.com
Add the new AMD feature X86_FEATURE_SMBA. With it, the QOS enforcement policies
can be applied to external slow memory connected to the host. QOS enforcement is
accomplished by assigning a Class Of Service (COS) to a processor and specifying
allocations or limits for that COS for each resource to be allocated.
This feature is identified by the CPUID function 0x8000_0020_EBX_x0[2]:
L3SBE - L3 external slow memory bandwidth enforcement.
CXL.memory is the only supported "slow" memory device. With SMBA, the hardware
enables bandwidth allocation on the slow memory devices. If there are multiple
slow memory devices in the system, then the throttling logic groups all the slow
sources together and applies the limit on them as a whole.
The presence of the SMBA feature (with CXL.memory) is independent of whether
slow memory device is actually present in the system. If there is no slow memory
in the system, then setting a SMBA limit will have no impact on the performance
of the system.
Presence of CXL memory can be identified by the numactl command:
$numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
node 0 size: 63678 MB node 0 free: 59542 MB
node 1 cpus:
node 1 size: 16122 MB
node 1 free: 15627 MB
node distances:
node 0 1
0: 10 50
1: 50 10
CPU list for CXL memory will be empty. The cpu-cxl node distance is greater than
cpu-to-cpu distances. Node 1 has the CXL memory in this case. CXL memory can
also be identified using ACPI SRAT table and memory maps.
Feature description is available in the specification, "AMD64 Technology
Platform Quality of Service Extensions, Revision: 1.03 Publication # 56375
Revision: 1.03 Issue Date: February 2022" at
https://bugzilla.kernel.org/attachment.cgi?id=301365
See also https://www.amd.com/en/support/tech-docs/amd64-technology-platform-quality-service-extensions
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-3-babu.moger@amd.com
on_each_cpu_mask() runs the function on each CPU specified by cpumask,
which may include the local processor.
Replace smp_call_function_many() with on_each_cpu_mask() to simplify
the code.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-2-babu.moger@amd.com
get disabled due to a value error
- Fix a NULL pointer access on UP configs
- Use the proper locking when updating CPU capacity
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPNKRsACgkQEsHwGGHe
VUr8NA/9GTqGUpOq0iRQkEOE1oAdT4ZxJ9dQpaWjwWSNv40HT7XJBzjtvzAyiOBF
LTUVClBct5i1/j0tVcNq1zXEj3im2e24Ki6A1TWukejgGAMT/7siSkuChEDAMg2M
b79nKCMpuIZMzkJND3qTkW/aMPpAyU82G8BeLCjw7vgPBsbVkjgbxGFKYdHgpLZa
kdX/GhOufu40jcGeUxA4zWTpXfuXT3OG7JYLrlHeJ/HEdzy9kLCkWH4jHHllzPQw
c4JwG7UnNkDKD6zkG0Guzi1zQy39egh/kaj7FQmVap9sneq+x69T4ta0BfXdBXnQ
Vsqc/nnOybUzr9Gjg5W6KRZWk0k6hK9n3+cye88BfRvzMY0KAgxeCiZEn7cuKqZp
15Agzz77vcwt32QsxjSp+hKQxJtePcBCurDhkfuAZyPELNIBeCW6inrXArprfTIg
IfEF068GKsvDGu3I/z49VXFZ9YBoerQREHbN/xL1VNeB8VoAxAE227OMUvhMcIu1
jxVvkESwc9BIybTcJbUjgg2i9t+Fv3gtZBQ6GFfDboHneia4U5a9aTyN6QGjX+5P
SIXPhePbFCNrhW7JbSGqKBd96MczbFNQinRhgX1lBU0241cAnchY4nH9fUvv7Zrt
b/QDg58tb2bJkm1Z08L256ZETELOU9nLJVxUbtBNSSO9dfkvoBs=
=aNSK
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Make sure the scheduler doesn't use stale frequency scaling values
when latter get disabled due to a value error
- Fix a NULL pointer access on UP configs
- Use the proper locking when updating CPU capacity
* tag 'sched_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/aperfmperf: Erase stale arch_freq_scale values when disabling frequency invariance readings
sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
sched/fair: Fixes for capacity inversion detection
sched/uclamp: Fix a uninitialized variable warnings
Make early loading message match late loading message and print both old
and new revisions.
This is helpful to know what the BIOS loaded revision is before an early
update.
Cache the early BIOS revision before the microcode update and have
print_ucode_info() print both the old and new revision in the same
format as microcode_reload_late().
[ bp: Massage, remove useless comment. ]
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230120161923.118882-6-ashok.raj@intel.com
print_ucode_info() takes a struct ucode_cpu_info pointer as parameter.
Its sole purpose is to print the microcode revision.
The only available ucode_cpu_info always describes the currently loaded
microcode revision. After a microcode update is successful, this is the
new revision, or on failure it is the original revision.
In preparation for future changes, replace the struct ucode_cpu_info
pointer parameter with a plain integer which contains the revision
number and adjust the call sites accordingly.
No functional change.
[ bp:
- Fix + cleanup commit message.
- Revert arbitrary, unrelated change.
]
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230120161923.118882-5-ashok.raj@intel.com
During late microcode loading, the "Reload completed" message is issued
unconditionally, regardless of success or failure.
Adjust the message to report the result of the update.
[ bp: Massage. ]
Fixes: 9bd681251b ("x86/microcode: Announce reload operation's completion")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/lkml/874judpqqd.ffs@tglx/
The kernel caches each CPU's feature bits at boot in an x86_capability[]
structure. However, the capabilities in the BSP's copy can be turned off
as a result of certain command line parameters or configuration
restrictions, for example the SGX bit. This can cause a mismatch when
comparing the values before and after the microcode update.
Another example is X86_FEATURE_SRBDS_CTRL which gets added only after
microcode update:
--- cpuid.before 2023-01-21 14:54:15.652000747 +0100
+++ cpuid.after 2023-01-21 14:54:26.632001024 +0100
@@ -10,7 +10,7 @@ CPU:
0x00000004 0x04: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
0x00000005 0x00: eax=0x00000040 ebx=0x00000040 ecx=0x00000003 edx=0x11142120
0x00000006 0x00: eax=0x000027f7 ebx=0x00000002 ecx=0x00000001 edx=0x00000000
- 0x00000007 0x00: eax=0x00000000 ebx=0x029c6fbf ecx=0x40000000 edx=0xbc002400
+ 0x00000007 0x00: eax=0x00000000 ebx=0x029c6fbf ecx=0x40000000 edx=0xbc002e00
^^^
and which proves for a gazillionth time that late loading is a bad bad
idea.
microcode_check() is called after an update to report any previously
cached CPUID bits which might have changed due to the update.
Therefore, store the cached CPU caps before the update and compare them
with the CPU caps after the microcode update has succeeded.
Thus, the comparison is done between the CPUID *hardware* bits before
and after the upgrade instead of using the cached, possibly runtime
modified values in BSP's boot_cpu_data copy.
As a result, false warnings about CPUID bits changes are avoided.
[ bp:
- Massage.
- Add SRBDS_CTRL example.
- Add kernel-doc.
- Incorporate forgotten review feedback from dhansen.
]
Fixes: 1008c52c09 ("x86/CPU: Add a microcode loader callback")
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230109153555.4986-3-ashok.raj@intel.com
Add a parameter to store CPU capabilities before performing a microcode
update so that CPU capabilities can be compared before and after update.
[ bp: Massage. ]
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230109153555.4986-2-ashok.raj@intel.com
When a KVM guest has MWAIT, mwait_idle() is used as the default idle
function.
However, the cpuidle-haltpoll driver calls default_idle() from
default_enter_idle() directly and that one uses HLT instead of MWAIT,
which may affect performance adversely, because MWAIT is preferred to
HLT as explained by the changelog of commit aebef63cf7 ("x86: Remove
vendor checks from prefer_mwait_c1_over_halt").
Make default_enter_idle() call arch_cpu_idle(), which can use MWAIT,
instead of default_idle() to address this issue.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
[ rjw: Changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Instrument nmi_trigger_cpumask_backtrace() to dump out diagnostics based
on evidence accumulated by exc_nmi(). These diagnostics are dumped for
CPUs that ignored an NMI backtrace request for more than 10 seconds.
[ paulmck: Apply Ingo Molnar feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
CPUs ignoring NMIs is often a sign of those CPUs going bad, but there
are quite a few other reasons why a CPU might ignore NMIs. Therefore,
accumulate evidence within exc_nmi() as to what might be preventing a
given CPU from responding to an NMI.
[ paulmck: Apply Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Use DEVICE_ATTR_RO() helper instead of open-coded DEVICE_ATTR(),
which makes the code a bit shorter and easier to read.
No change in functionality.
Signed-off-by: Guangju Wang[baidu] <wgj900@163.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230118023554.1898-1-wgj900@163.com
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPEGkMeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGE4YH/1sxdve3nlWT7eyf
VDanaCZFc2u6Sk1td23HyhvOVbFjj1E39UokHK9YfzD4hhn+u9SFvDgoXHGhHzYq
IhDvNM3mLsKpEtjwNbbi/lt6tryJchhZ96O5leRiUxZaw/GBVnHrKGR56vCjC/DB
zq8td4I+TmTPMpsL3e3WZqJoX3bjTqgZFNShA1mZFYtuQRpAVkJkafMRwuAySIMQ
+FyK/zZ+MpGJ7y/UmWfaBmS5GtjBz0fva7fTbEkUqk8bDtNUWMBhsTKtgU1LBpBe
Yrkhs3sdQdpKSm9yNaNrlLhSjFwkusV5kR09p6/5w+wthdPmjnoUZuEnAZpYyydE
ZY7+Vqk=
=q9Ac
-----END PGP SIGNATURE-----
Merge tag 'v6.2-rc4' into perf/core, to pick up fixes
Move from the -rc1 base to the fresher -rc4 kernel that
has various fixes included, before applying a larger
patchset.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Child partitions are free to allocate SynIC message and event page but in
case of root partition it must use the pages allocated by Microsoft
Hypervisor (MSHV). Base address for these pages can be found using
synthetic MSRs exposed by MSHV. There is a slight difference in those MSRs
for nested vs non-nested root partition.
Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/cb951fb1ad6814996fc54f4a255c5841a20a151f.1672639707.git.jinankjain@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Baoquan reported that after triggering a crash the subsequent crash-kernel
fails to boot about half of the time. It triggers a NULL pointer
dereference in the periodic tick code.
This happens because the legacy timer interrupt (IRQ0) is resent in
software which happens in soft interrupt (tasklet) context. In this context
get_irq_regs() returns NULL which leads to the NULL pointer dereference.
The reason for the resend is a spurious APIC interrupt on the IRQ0 vector
which is captured and leads to a resend when the legacy timer interrupt is
enabled. This is wrong because the legacy PIC interrupts are level
triggered and therefore should never be resent in software, but nothing
ever sets the IRQ_LEVEL flag on those interrupts, so the core code does not
know about their trigger type.
Ensure that IRQ_LEVEL is set when the legacy PCI interrupts are set up.
Fixes: a4633adcdb ("[PATCH] genirq: add genirq sw IRQ-retrigger")
Reported-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/r/87mt6rjrra.ffs@tglx
Once disable_freq_invariance_work is called the scale_freq_tick function
will not compute or update the arch_freq_scale values.
However the scheduler will still read these values and use them.
The result is that the scheduler might perform unfair decisions based on stale
values.
This patch adds the step of setting the arch_freq_scale values for all
cpus to the default (max) value SCHED_CAPACITY_SCALE, Once all cpus
have the same arch_freq_scale value the scaling is meaningless.
Signed-off-by: Yair Podemsky <ypodemsk@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230110160206.75912-1-ypodemsk@redhat.com
Functions used with __setup() return 1 when the argument has been
successfully parsed.
Reverse the returned value so that 1 is returned when kstrtobool() is
successful (i.e. returns 0).
My understanding of these __setup() functions is that returning 1 or 0
does not change much anyway - so this is more of a cleanup than a
functional fix.
I spot it and found it spurious while looking at something else.
Even if the output is not perfect, you'll get the idea with:
$ git grep -B2 -A10 retu.*kstrtobool | grep __setup -B10
Fixes: 3aac3ebea0 ("x86/signal: Implement sigaltstack size validation")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/73882d43ebe420c9d8fb82d0560021722b243000.1673717552.git.christophe.jaillet@wanadoo.fr
The comment of the "#endif" after setup_disable_pku() is wrong.
As the related #ifdef is only a few lines above, just remove the
comment.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230113130126.1966-1-jgross@suse.com
objtool found a few cases where this code called out into instrumented
code:
vmlinux.o: warning: objtool: acpi_idle_enter_s2idle+0xde: call to wbinvd() leaves .noinstr.text section
vmlinux.o: warning: objtool: default_idle+0x4: call to arch_safe_halt() leaves .noinstr.text section
vmlinux.o: warning: objtool: xen_safe_halt+0xa: call to HYPERVISOR_sched_op.constprop.0() leaves .noinstr.text section
Solve this by:
- marking arch_safe_halt(), wbinvd(), native_wbinvd() and
HYPERVISOR_sched_op() as __always_inline().
- Explicitly uninlining xen_safe_halt() and pv_native_wbinvd() [they were
already uninlined by the compiler on use as function pointers] and
annotating them as 'noinstr'.
- Annotating pv_native_safe_halt() as 'noinstr'.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195541.171918174@infradead.org
Current arch_cpu_idle() is called with IRQs disabled, but will return
with IRQs enabled.
However, the very first thing the generic code does after calling
arch_cpu_idle() is raw_local_irq_disable(). This means that
architectures that can idle with IRQs disabled end up doing a
pointless 'enable-disable' dance.
Therefore, push this IRQ disabling into the idle function, meaning
that those architectures can avoid the pointless IRQ state flipping.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.618076436@infradead.org
Idle code is very like entry code in that RCU isn't available. As
such, add a little validation.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.373461409@infradead.org
Typical boot time setup; no need to suffer an indirect call for that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20230112195539.453613251@infradead.org
The LKGS instruction atomically loads a segment descriptor into the
%gs descriptor registers, *except* that %gs.base is unchanged, and the
base is instead loaded into MSR_IA32_KERNEL_GS_BASE, which is exactly
what we want this function to do.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230112072032.35626-6-xin3.li@intel.com
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Detect if Linux is running as a nested hypervisor in the root
partition for Microsoft Hypervisor, using flags provided by MSHV.
Expose a new variable hv_nested that is used later for decisions
specific to the nested use case.
Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/8e3e7112806e81d2292a66a56fe547162754ecea.1672639707.git.jinankjain@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
GS is a special segment on x86_64, move load_gs_index() to its own new
header file to simplify header inclusion.
No change in functionality.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230112072032.35626-5-xin3.li@intel.com
Currently, x86_spec_ctrl_base is read at boot time and speculative bits
are set if Kconfig items are enabled. For example, IBRS is enabled if
CONFIG_CPU_IBRS_ENTRY is configured, etc. These MSR bits are not cleared
if the mitigations are disabled.
This is a problem when kexec-ing a kernel that has the mitigation
disabled from a kernel that has the mitigation enabled. In this case,
the MSR bits are not cleared during the new kernel boot. As a result,
this might have some performance degradation that is hard to pinpoint.
This problem does not happen if the machine is (hard) rebooted because
the bit will be cleared by default.
[ bp: Massage. ]
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221128153148.1129350-1-leitao@debian.org
Both the if and else blocks define an exact same boot_cpu_data variable, move
the duplicate variable definition out of the if/else block.
In addition, do some other minor cleanups.
[ bp: Massage. ]
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20220601122914.820890-1-ytcoode@gmail.com
When creating a new monitoring group, the RMID allocated for it may have
been used by a group which was previously removed. In this case, the
hardware counters will have non-zero values which should be deducted
from what is reported in the new group's counts.
resctrl_arch_reset_rmid() initializes the prev_msr value for counters to
0, causing the initial count to be charged to the new group. Resurrect
__rmid_read() and use it to initialize prev_msr correctly.
Unlike before, __rmid_read() checks for error bits in the MSR read so
that callers don't need to.
Fixes: 1d81d15db3 ("x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()")
Signed-off-by: Peter Newman <peternewman@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20221220164132.443083-1-peternewman@google.com
When the user moves a running task to a new rdtgroup using the task's
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task(s) CPUs.
x86 allows reordering loads with prior stores, so if the task starts
running between a task_curr() check that the CPU hoisted before the
stores in the CLOSID/RMID update then it can start running with the old
CLOSID/RMID until it is switched again because __rdtgroup_move_task()
failed to determine that it needs to be interrupted to obtain the new
CLOSID/RMID.
Refer to the diagram below:
CPU 0 CPU 1
----- -----
__rdtgroup_move_task():
curr <- t1->cpu->rq->curr
__schedule():
rq->curr <- t1
resctrl_sched_in():
t1->{closid,rmid} -> {1,1}
t1->{closid,rmid} <- {2,2}
if (curr == t1) // false
IPI(t1->cpu)
A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.
In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.
It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, benchmarking
results below showed too little performance impact in the simple
approach to justify implementing the two-pass approach.
Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.
CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
# mkdir /sys/fs/resctrl/test
# echo $$ > /sys/fs/resctrl/test/tasks
# perf bench sched messaging -g 40 -l 100000
task-clock time ranges collected using:
# perf stat rmdir /sys/fs/resctrl/test
Baseline: 1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms
[ bp: Massage commit message. ]
Fixes: ae28d1aae4 ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be94 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
Signed-off-by: Peter Newman <peternewman@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
Section 5.2.12.12 Processor Local x2APIC Structure in the ACPI v6.5
spec mandates that both "enabled" and "online capable" Local APIC Flags
should be used to determine if the processor is usable or not.
However, Linux doesn't use the "online capable" flag for x2APIC to
determine if the processor is usable. As a result, cpu_possible_mask has
incorrect value and results in more memory getting allocated for per_cpu
variables than it is going to be used.
Make sure Linux parses both "enabled" and "online capable" flags for
x2APIC to correctly determine if the processor is usable.
Fixes: aa06e20f1b ("x86/ACPI: Don't add CPUs that are not online capable")
Reported-by: Leo Duran <leo.duran@amd.com>
Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20230105041059.39366-1-kvijayab@amd.com
The prototype for the x86_read_arch_cap_msr() function has moved to
arch/x86/include/asm/cpu.h - kill the redundant definition in arch/x86/kernel/cpu.h
and include the header.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Link: https://lore.kernel.org/r/20221128172451.792595-1-ashok.raj@intel.com
For the `FF /digit` opcodes in prepare_emulation, use switch-case
instead of hand-written code to make the logic easier to understand.
Signed-off-by: Chuang Wang <nashuiliang@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20221129084022.718355-1-nashuiliang@gmail.com
Systems that support various memory encryption schemes (MKTME, TDX, SEV)
use high order physical address bits to indicate which key should be
used for a specific memory location.
When a memory error is reported, some systems may report those key
bits in the IA32_MCi_ADDR machine check MSR.
The Intel SDM has a footnote for the contents of the address register
that says: "Useful bits in this field depend on the address methodology
in use when the register state is saved."
AMD Processor Programming Reference has a more explicit description
of the MCA_ADDR register:
"For physical addresses, the most significant bit is given by
Core::X86::Cpuid::LongModeInfo[PhysAddrSize]."
Add a new #define MCI_ADDR_PHYSADDR for the mask of valid physical
address bits within the machine check bank address register. Use this
mask for recoverable machine check handling and in the EDAC driver to
ignore any key bits that may be present.
[ Tony: Based on independent fixes proposed by Fan Du and Isaku Yamahata ]
Reported-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reported-by: Fan Du <fan.du@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20230109152936.397862-1-tony.luck@intel.com
Drop removed INT3 handling code from kprobe_int3_handler() because this
case (get_kprobe() doesn't return corresponding kprobe AND the INT3 is
removed) must not happen with the kprobe managed INT3, but can happen
with the non-kprobe INT3, which should be handled by other callbacks.
For the kprobe managed INT3, it is already safe. The commit 5c02ece818
("x86/kprobes: Fix ordering while text-patching") introduced
text_poke_sync() to the arch_disarm_kprobe() right after removing INT3.
Since this text_poke_sync() uses IPI to call sync_core() on all online
cpus, that ensures that all running INT3 exception handlers have done.
And, the unregister_kprobe() will remove the kprobe from the hash table
after arch_disarm_kprobe().
Thus, when the kprobe managed INT3 hits, kprobe_int3_handler() should
be able to find corresponding kprobe always by get_kprobe(). If it can
not find any kprobe, this means that is NOT a kprobe managed INT3.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/166981518895.1131462.4693062055762912734.stgit@devnote3
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL terminated strings.
Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/202212031419324523731@zte.com.cn
aria-avx assembly code accesses members of aria_ctx with magic number
offset. If the shape of struct aria_ctx is changed carelessly,
aria-avx will not work.
So, we need to ensure accessing members of aria_ctx with correct
offset values, not with magic numbers.
It adds ARIA_CTX_enc_key, ARIA_CTX_dec_key, and ARIA_CTX_rounds in the
asm-offsets.c So, correct offset definitions will be generated.
aria-avx assembly code can access members of aria_ctx safely with
these definitions.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
compare_pnp_id() already iterates over the single linked pnp_ids list
starting with the id past to it.
So there is no need for add_rtc_cmos() to call compare_pnp_id()
for each id on the list.
No change in functionality intended.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: linux-kernel@vger.kernel.org
Move the tests to the appropriate signal_$(BITS).c file.
Convert them to use static_assert(), removing the need for a dummy
function.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20221219193904.190220-2-brgerst@gmail.com
Cc: Al Viro <viro@zeniv.linux.org.uk>
Add a struct alt_instr.flags field which will contain different flags
controlling alternatives patching behavior.
The initial idea was to be able to specify it as a separate macro
parameter but that would mean touching all possible invocations of the
alternatives macros and thus a lot of churn.
What is more, as PeterZ suggested, being able to say ALT_NOT(feature) is
very readable and explains exactly what is meant.
So make the feature field a u32 where the patching flags are the upper
u16 part of the dword quantity while the lower u16 word is the feature.
The highest feature number currently is 0x26a (i.e., word 19) so there
is plenty of space. If that becomes insufficient, the field can be
extended to u64 which will then make struct alt_instr of the nice size
of 16 bytes (14 bytes currently).
There should be no functional changes resulting from this.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/Y6RCoJEtxxZWwotd@zn.tnic
We missed the window between the TIF flag update and the next reschedule.
Signed-off-by: Rodrigo Branco <bsdaemon@google.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Both <linux/mmiotrace.h> and <asm/insn-eval.h> define various MMIO_ enum constants,
whose namespace overlaps.
Rename the <asm/insn-eval.h> ones to have a INSN_ prefix, so that the headers can be
used from the same source file.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230101162910.710293-2-Jason@zx2c4.com
After
b3e34a47f9 ("x86/kexec: fix memory leak of elf header buffer"),
freeing image->elf_headers in the error path of crash_load_segments()
is not needed because kimage_file_post_load_cleanup() will take
care of that later. And not clearing it could result in a double-free.
Drop the superfluous vfree() call at the error path of
crash_load_segments().
Fixes: b3e34a47f9 ("x86/kexec: fix memory leak of elf header buffer")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20221122115122.13937-1-tiwai@suse.de
x86:
* Change tdp_mmu to a read-only parameter
* Separate TDP and shadow MMU page fault paths
* Enable Hyper-V invariant TSC control
selftests:
* Use TAP interface for kvm_binary_stats_test and tsc_msrs_test
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Avoid open coding BIT(0) of HV_X64_MSR_TSC_INVARIANT_CONTROL by adding
a dedicated define. While there's only one user at this moment, the
upcoming KVM implementation of Hyper-V Invariant TSC feature will need
to use it as well.
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Newer AMD CPUs support more physical address bits.
That is, the MCA_ADDR registers on Scalable MCA systems contain the
ErrorAddr in bits [56:0] instead of [55:0]. Hence, the existing LSB field
from bits [61:56] in MCA_ADDR must be moved around to accommodate the
larger ErrorAddr size.
MCA_CONFIG[McaLsbInStatusSupported] indicates this change. If set, the
LSB field will be found in MCA_STATUS rather than MCA_ADDR.
Each logical CPU has unique MCA bank in hardware and is not shared with
other logical CPUs. Additionally, on SMCA systems, each feature bit may
be different for each bank within same logical CPU.
Check for MCA_CONFIG[McaLsbInStatusSupported] for each MCA bank and for
each CPU.
Additionally, all MCA banks do not support maximum ErrorAddr bits in
MCA_ADDR. Some banks might support fewer bits but the remaining bits are
marked as reserved.
[ Yazen: Rebased and fixed up formatting.
bp: Massage comments. ]
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221206173607.1185907-5-yazen.ghannam@amd.com
Move MCA_ADDR[ErrorAddr] extraction into a separate helper function. This
will be further refactored to support extended ErrorAddr bits in MCA_ADDR
in newer AMD CPUs.
[ bp: Massage. ]
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/all/20220225193342.215780-3-Smita.KoralahalliChannabasappa@amd.com/
Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping
speculative execution after function return, kprobe jump optimization
always fails on the functions with such INT3 inside the function body.
(It already checks the INT3 padding between functions, but not inside
the function)
To avoid this issue, as same as kprobes, check whether the INT3 comes
from kgdb or not, and if so, stop decoding and make it fail. The other
INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be
treated as a one-byte instruction.
Fixes: e463a09af2 ("x86: Add straight-line-speculation mitigation")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/167146051929.1374301.7419382929328081706.stgit@devnote3
Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping
speculative execution after RET instruction, kprobes always failes to
check the probed instruction boundary by decoding the function body if
the probed address is after such sequence. (Note that some conditional
code blocks will be placed after function return, if compiler decides
it is not on the hot path.)
This is because kprobes expects kgdb puts the INT3 as a software
breakpoint and it will replace the original instruction.
But these INT3 are not such purpose, it doesn't need to recover the
original instruction.
To avoid this issue, kprobes checks whether the INT3 is owned by
kgdb or not, and if so, stop decoding and make it fail. The other
INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be
treated as a one-byte instruction.
Fixes: e463a09af2 ("x86: Add straight-line-speculation mitigation")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/167146051026.1374301.392728975473572291.stgit@devnote3
The addition of callthunks_translate_call_dest means that
skip_addr() and patch_dest() can no longer be discarded
as part of the __init section freeing:
WARNING: modpost: vmlinux.o: section mismatch in reference: callthunks_translate_call_dest.cold (section: .text.unlikely) -> skip_addr (section: .init.text)
WARNING: modpost: vmlinux.o: section mismatch in reference: callthunks_translate_call_dest.cold (section: .text.unlikely) -> patch_dest (section: .init.text)
WARNING: modpost: vmlinux.o: section mismatch in reference: is_callthunk.cold (section: .text.unlikely) -> skip_addr (section: .init.text)
ERROR: modpost: Section mismatches detected.
Set CONFIG_SECTION_MISMATCH_WARN_ONLY=y to allow them.
Fixes: b2e9dfe54b ("x86/bpf: Emit call depth accounting if required")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221215164334.968863-1-arnd@kernel.org
It can happen that - especially during testing - the microcode
blobs of all families are all glued together in the initrd. The
current code doesn't check whether the current container matched
a microcode patch and continues to the next one, which leads to
save_microcode_in_initrd_amd() to look at the next and thus wrong one:
microcode: parse_container: ucode: 0xffff88807e9d9082
microcode: verify_patch: buf: 0xffff88807e9d90ce, buf_size: 26428
microcode: verify_patch: proc_id: 0x8082, patch_fam: 0x17, this family: 0x17
microcode: verify_patch: buf: 0xffff88807e9d9d56, buf_size: 23220
microcode: verify_patch: proc_id: 0x8012, patch_fam: 0x17, this family: 0x17
microcode: parse_container: MATCH: eq_id: 0x8012, patch proc_rev_id: 0x8012
<-- matching patch found
microcode: verify_patch: buf: 0xffff88807e9da9de, buf_size: 20012
microcode: verify_patch: proc_id: 0x8310, patch_fam: 0x17, this family: 0x17
microcode: verify_patch: buf: 0xffff88807e9db666, buf_size: 16804
microcode: Invalid type field (0x414d44) in container file section header.
microcode: Patch section fail
<-- checking chokes on the microcode magic value of the next container.
microcode: parse_container: saving container 0xffff88807e9d9082
microcode: save_microcode_in_initrd_amd: scanned containers, data: 0xffff88807e9d9082, size: 9700a
and now if there's a next (and last container) it'll use that in
save_microcode_in_initrd_amd() and not find a proper patch, ofc.
Fix that by moving the out: label up, before the desc->mc check which
jots down the pointer of the matching patch and is used to signal to the
caller that it has found a matching patch in the current container.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221219210656.5140-2-bp@alien8.de
- Rename apply_microcode_early_amd() to early_apply_microcode():
simplify the name so that it is clear what it does and when does it do
it.
- Rename __load_ucode_amd() to find_blobs_in_containers(): the new name
actually explains what it does.
Document some.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221219210656.5140-1-bp@alien8.de
* Randomize the per-cpu entry areas
Cleanups:
* Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open
coding it
* Move to "native" set_memory_rox() helper
* Clean up pmd_get_atomic() and i386-PAE
* Remove some unused page table size macros
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOc53UACgkQaDWVMHDJ
krCUHw//SGZ+La0hLZLAiAiZTXLZZHpYkOmg1Oj1+11qSU11uZzTFqDpauhaKpRS
cJCSh+D+RXe5e2ipgt0+Zl0hESLt7pJf8258OE4ra0DL/IlyO9uqruAs9Kn3eRS/
Fk76nG8gdEU+JKJqpG02GqOLslYQuIy96n9hpuj1x25b614+uezPfC7S4XEat0NT
MbJQ+jnVDf16aJIJkzT+iSwhubDVeh+bSHeO0SSCzX23WLUqDeg5NvlyxoCHGbBh
UpUTWggV/0pYAkBKRHToeJs8qTWREwuuH/8JGewpe9A0tjdB5wyZfNL2PuracweN
9MauXC3T5f0+Ca4yIIaPq1fF7Ny/PR2dBFihk27rOD0N7tjaZxNwal2pB1sZcmvZ
+PAokjyTPVH5ZXjkMYGGAUe1jyjwr2+TgFSZxhTnDuGtyVQiY4pihGKOifLCX6tv
x6khvYeTBw7wfaDRtKEAf+2kLHYn+71HszHP/8bNKX9T03h+Zf0i1wdZu5xbM5Gc
VK2wR7bCC+UftJJYG0pldcHg2qaF19RBHK2tLwp7zngUv7lTbkKfkgKjre73KV2a
D4b76lrqdUMo6UYwYdw7WtDyarZS4OVLq2DcNhwwMddBCaX8kyN5a4AqwQlZYJ0u
dM+kuMofE8U3yMxmMhJimkZUsj09yLHIqfynY0jbAcU3nhKZZNY=
=wwVF
-----END PGP SIGNATURE-----
Merge tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Dave Hansen:
"New Feature:
- Randomize the per-cpu entry areas
Cleanups:
- Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it
- Move to "native" set_memory_rox() helper
- Clean up pmd_get_atomic() and i386-PAE
- Remove some unused page table size macros"
* tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
x86/mm: Ensure forced page table splitting
x86/kasan: Populate shadow for shared chunk of the CPU entry area
x86/kasan: Add helpers to align shadow addresses up and down
x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
x86/mm: Recompute physical address for every page of per-CPU CEA mapping
x86/mm: Rename __change_page_attr_set_clr(.checkalias)
x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
x86/mm: Add a few comments
x86/mm: Fix CR3_ADDR_MASK
x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
mm: Convert __HAVE_ARCH_P..P_GET to the new style
mm: Remove pointless barrier() after pmdp_get_lockless()
x86/mm/pae: Get rid of set_64bit()
x86_64: Remove pointless set_64bit() usage
x86/mm/pae: Be consistent with pXXp_get_and_clear()
x86/mm/pae: Use WRITE_ONCE()
x86/mm/pae: Don't (ab)use atomic64
mm/gup: Fix the lockless PMD access
...
Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass in
a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be used
no matter what the const value is. This prevents every subsystem from
having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject, objects
as being "non-mutable". The changes to the kobject and driver core in
this pull request are the result of that, as there are lots of paths
where kobjects and device pointers are not modified at all, so marking
them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object rules.
All of this has been bike-shedded in quite a lot of detail on lkml with
different names and implementations resulting in the tiny version we
have in here, much better than my original proposal. Lots of subsystem
maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with no
problems, OTHER than some merge issues with other trees that should be
obvious when you hit them (block tree deletes a driver that this tree
modifies, iommufd tree modifies code that this tree also touches). If
there are merge problems with these trees, please let me know.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY5wz3A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yks0ACeKYUlVgCsER8eYW+x18szFa2QTXgAn2h/VhZe
1Fp53boFaQkGBjl8mGF8
=v+FB
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass
in a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be
used no matter what the const value is. This prevents every subsystem
from having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject,
objects as being "non-mutable". The changes to the kobject and driver
core in this pull request are the result of that, as there are lots of
paths where kobjects and device pointers are not modified at all, so
marking them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object
rules.
All of this has been bike-shedded in quite a lot of detail on lkml
with different names and implementations resulting in the tiny version
we have in here, much better than my original proposal. Lots of
subsystem maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with
no problems"
* tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits)
device property: Fix documentation for fwnode_get_next_parent()
firmware_loader: fix up to_fw_sysfs() to preserve const
usb.h: take advantage of container_of_const()
device.h: move kobj_to_dev() to use container_of_const()
container_of: add container_of_const() that preserves const-ness of the pointer
driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion.
driver core: fix up missed scsi/cxlflash class.devnode() conversion.
driver core: fix up some missing class.devnode() conversions.
driver core: make struct class.devnode() take a const *
driver core: make struct class.dev_uevent() take a const *
cacheinfo: Remove of_node_put() for fw_token
device property: Add a blank line in Kconfig of tests
device property: Rename goto label to be more precise
device property: Move PROPERTY_ENTRY_BOOL() a bit down
device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*()
kernfs: fix all kernel-doc warnings and multiple typos
driver core: pass a const * into of_device_uevent()
kobject: kset_uevent_ops: make name() callback take a const *
kobject: kset_uevent_ops: make filter() callback take a const *
kobject: make kobject_namespace take a const *
...
- Add options to the osnoise tracer
o panic_on_stop option that panics the kernel if osnoise is greater than some
user defined threshold.
o preempt option, to test noise while preemption is disabled
o irq option, to test noise when interrupts are disabled
- Add .percent and .graph suffix to histograms to give different outputs
- Add nohitcount to disable showing hitcount in histogram output
- Add new __cpumask() to trace event fields to annotate that a unsigned long
array is a cpumask to user space and should be treated as one.
- Add trace_trigger kernel command line parameter to enable trace event
triggers at boot up. Useful to trace stack traces, disable tracing and take
snapshots.
- Fix x86/kmmio mmio tracer to work with the updates to lockdep
- Unify the panic and die notifiers
- Add back ftrace_expect reference that is used to extract more information in
the ftrace_bug() code.
- Have trigger filter parsing errors show up in the tracing error log.
- Updated MAINTAINERS file to add kernel tracing mailing list and patchwork
info
- Use IDA to keep track of event type numbers.
- And minor fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY5vIcxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlO7AQCmtZbriadAR6N7Llj092YXmYfzrxyi
1WS35vhpZsBJ8gEA8j68l+LrgNt51N2gXlTXEHgXzdBgL/TKAPSX4D99GQY=
=z1pe
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Add options to the osnoise tracer:
- 'panic_on_stop' option that panics the kernel if osnoise is
greater than some user defined threshold.
- 'preempt' option, to test noise while preemption is disabled
- 'irq' option, to test noise when interrupts are disabled
- Add .percent and .graph suffix to histograms to give different
outputs
- Add nohitcount to disable showing hitcount in histogram output
- Add new __cpumask() to trace event fields to annotate that a unsigned
long array is a cpumask to user space and should be treated as one.
- Add trace_trigger kernel command line parameter to enable trace event
triggers at boot up. Useful to trace stack traces, disable tracing
and take snapshots.
- Fix x86/kmmio mmio tracer to work with the updates to lockdep
- Unify the panic and die notifiers
- Add back ftrace_expect reference that is used to extract more
information in the ftrace_bug() code.
- Have trigger filter parsing errors show up in the tracing error log.
- Updated MAINTAINERS file to add kernel tracing mailing list and
patchwork info
- Use IDA to keep track of event type numbers.
- And minor fixes and clean ups
* tag 'trace-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (44 commits)
tracing: Fix cpumask() example typo
tracing: Improve panic/die notifiers
ftrace: Prevent RCU stall on PREEMPT_VOLUNTARY kernels
tracing: Do not synchronize freeing of trigger filter on boot up
tracing: Remove pointer (asterisk) and brackets from cpumask_t field
tracing: Have trigger filter parsing errors show up in error_log
x86/mm/kmmio: Remove redundant preempt_disable()
tracing: Fix infinite loop in tracing_read_pipe on overflowed print_trace_line
Documentation/osnoise: Add osnoise/options documentation
tracing/osnoise: Add preempt and/or irq disabled options
tracing/osnoise: Add PANIC_ON_STOP option
Documentation/osnoise: Escape underscore of NO_ prefix
tracing: Fix some checker warnings
tracing/osnoise: Make osnoise_options static
tracing: remove unnecessary trace_trigger ifdef
ring-buffer: Handle resize in early boot up
tracing/hist: Fix issue of losting command info in error_log
tracing: Fix issue of missing one synthetic field
tracing/hist: Fix out-of-bound write on 'action_data.var_ref_idx'
tracing/hist: Fix wrong return value in parse_action_params()
...
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
* Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
* Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a9:
"Fix a number of issues with MTE, such as races on the tags being
initialised vs the PG_mte_tagged flag as well as the lack of support
for VM_SHARED when KVM is involved. Patches from Catalin Marinas and
Peter Collingbourne").
* Merge the pKVM shadow vcpu state tracking that allows the hypervisor
to have its own view of a vcpu, keeping that state private.
* Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
* Fix a handful of minor issues around 52bit VA/PA support (64kB pages
only) as a prefix of the oncoming support for 4kB and 16kB pages.
* Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
* Second batch of the lazy destroy patches
* First batch of KVM changes for kernel virtual != physical address support
* Removal of a unused function
x86:
* Allow compiling out SMM support
* Cleanup and documentation of SMM state save area format
* Preserve interrupt shadow in SMM state save area
* Respond to generic signals during slow page faults
* Fixes and optimizations for the non-executable huge page errata fix.
* Reprogram all performance counters on PMU filter change
* Cleanups to Hyper-V emulation and tests
* Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
running on top of a L1 Hyper-V hypervisor)
* Advertise several new Intel features
* x86 Xen-for-KVM:
** Allow the Xen runstate information to cross a page boundary
** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
** Add support for 32-bit guests in SCHEDOP_poll
* Notable x86 fixes and cleanups:
** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
years back when eliminating unnecessary barriers when switching between
vmcs01 and vmcs02.
** Clean up vmread_error_trampoline() to make it more obvious that params
must be passed on the stack, even for x86-64.
** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
of the current guest CPUID.
** Fudge around a race with TSC refinement that results in KVM incorrectly
thinking a guest needs TSC scaling when running on a CPU with a
constant TSC, but no hardware-enumerated TSC frequency.
** Advertise (on AMD) that the SMM_CTL MSR is not supported
** Remove unnecessary exports
Generic:
* Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
* Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
* Fix build errors that occur in certain setups (unsure exactly what is
unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
* Introduce actual atomics for clear/set_bit() in selftests
* Add support for pinning vCPUs in dirty_log_perf_test.
* Rename the so called "perf_util" framework to "memstress".
* Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress tests.
* Add a common ucall implementation; code dedup and pre-work for running
SEV (and beyond) guests in selftests.
* Provide a common constructor and arch hook, which will eventually be
used by x86 to automatically select the right hypercall (AMD vs. Intel).
* A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
breakpoints, stage-2 faults and access tracking.
* x86-specific selftest changes:
** Clean up x86's page table management.
** Clean up and enhance the "smaller maxphyaddr" test, and add a related
test to cover generic emulation failure.
** Clean up the nEPT support checks.
** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
** Fix an ordering issue in the AMX test introduced by recent conversions
to use kvm_cpu_has(), and harden the code to guard against similar bugs
in the future. Anything that tiggers caching of KVM's supported CPUID,
kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
the caching occurs before the test opts in via prctl().
Documentation:
* Remove deleted ioctls from documentation
* Clean up the docs for the x86 MSR filter.
* Various fixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
=QAYX
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on (see merge
commit 382b5b87a9: "Fix a number of issues with MTE, such as
races on the tags being initialised vs the PG_mte_tagged flag as
well as the lack of support for VM_SHARED when KVM is involved.
Patches from Catalin Marinas and Peter Collingbourne").
- Merge the pKVM shadow vcpu state tracking that allows the
hypervisor to have its own view of a vcpu, keeping that state
private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB
pages only) as a prefix of the oncoming support for 4kB and 16kB
pages.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address
support
- Removal of a unused function
x86:
- Allow compiling out SMM support
- Cleanup and documentation of SMM state save area format
- Preserve interrupt shadow in SMM state save area
- Respond to generic signals during slow page faults
- Fixes and optimizations for the non-executable huge page errata
fix.
- Reprogram all performance counters on PMU filter change
- Cleanups to Hyper-V emulation and tests
- Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
guest running on top of a L1 Hyper-V hypervisor)
- Advertise several new Intel features
- x86 Xen-for-KVM:
- Allow the Xen runstate information to cross a page boundary
- Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
- Add support for 32-bit guests in SCHEDOP_poll
- Notable x86 fixes and cleanups:
- One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
- Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
a few years back when eliminating unnecessary barriers when
switching between vmcs01 and vmcs02.
- Clean up vmread_error_trampoline() to make it more obvious that
params must be passed on the stack, even for x86-64.
- Let userspace set all supported bits in MSR_IA32_FEAT_CTL
irrespective of the current guest CPUID.
- Fudge around a race with TSC refinement that results in KVM
incorrectly thinking a guest needs TSC scaling when running on a
CPU with a constant TSC, but no hardware-enumerated TSC
frequency.
- Advertise (on AMD) that the SMM_CTL MSR is not supported
- Remove unnecessary exports
Generic:
- Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
- Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
- Fix build errors that occur in certain setups (unsure exactly what
is unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
- Introduce actual atomics for clear/set_bit() in selftests
- Add support for pinning vCPUs in dirty_log_perf_test.
- Rename the so called "perf_util" framework to "memstress".
- Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress
tests.
- Add a common ucall implementation; code dedup and pre-work for
running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually
be used by x86 to automatically select the right hypercall (AMD vs.
Intel).
- A bunch of added/enabled/fixed selftests for ARM64, covering
memslots, breakpoints, stage-2 faults and access tracking.
- x86-specific selftest changes:
- Clean up x86's page table management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a
related test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
- Fix an ordering issue in the AMX test introduced by recent
conversions to use kvm_cpu_has(), and harden the code to guard
against similar bugs in the future. Anything that tiggers
caching of KVM's supported CPUID, kvm_cpu_has() in this case,
effectively hides opt-in XSAVE features if the caching occurs
before the test opts in via prctl().
Documentation:
- Remove deleted ioctls from documentation
- Clean up the docs for the x86 MSR filter.
- Various fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
KVM: arm64: selftests: Align VA space allocator with TTBR0
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: x86: Advertise that the SMM_CTL MSR is not supported
KVM: x86: remove unnecessary exports
KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
tools: KVM: selftests: Convert clear/set_bit() to actual atomics
tools: Drop "atomic_" prefix from atomic test_and_set_bit()
tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
perf tools: Use dedicated non-atomic clear/set bit helpers
tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
KVM: arm64: selftests: Enable single-step without a "full" ucall()
KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
KVM: Remove stale comment about KVM_REQ_UNHALT
KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
KVM: Reference to kvm_userspace_memory_region in doc and comments
KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
...
Other architectures and the common mm/ use P*D_MASK, and P*D_SIZE.
Remove the duplicated P*D_PAGE_MASK and P*D_PAGE_SIZE which are only
used in x86/*.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Link: https://lore.kernel.org/r/20220516185202.604654-1-tatashin@google.com
Now that text_poke is available before ftrace, remove the
SYSTEM_BOOTING exceptions.
Specifically, this cures a W+X case during boot.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221025201057.945960823@infradead.org
Seth found that the CPU-entry-area; the piece of per-cpu data that is
mapped into the userspace page-tables for kPTI is not subject to any
randomization -- irrespective of kASLR settings.
On x86_64 a whole P4D (512 GB) of virtual address space is reserved for
this structure, which is plenty large enough to randomize things a
little.
As such, use a straight forward randomization scheme that avoids
duplicates to spread the existing CPUs over the available space.
[ bp: Fix le build. ]
Reported-by: Seth Jenkins <sethjenkins@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
been long in the making. It is a lighterweight software-only fix for
Skylake-based cores where enabling IBRS is a big hammer and causes a
significant performance impact.
What it basically does is, it aligns all kernel functions to 16 bytes
boundary and adds a 16-byte padding before the function, objtool
collects all functions' locations and when the mitigation gets applied,
it patches a call accounting thunk which is used to track the call depth
of the stack at any time.
When that call depth reaches a magical, microarchitecture-specific value
for the Return Stack Buffer, the code stuffs that RSB and avoids its
underflow which could otherwise lead to the Intel variant of Retbleed.
This software-only solution brings a lot of the lost performance back,
as benchmarks suggest:
https://lore.kernel.org/all/20220915111039.092790446@infradead.org/
That page above also contains a lot more detailed explanation of the
whole mechanism
- Implement a new control flow integrity scheme called FineIBT which is
based on the software kCFI implementation and uses hardware IBT support
where present to annotate and track indirect branches using a hash to
validate them
- Other misc fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOZp5EACgkQEsHwGGHe
VUrZFxAAvi/+8L0IYSK4mKJvixGbTFjxN/Swo2JVOfs34LqGUT6JaBc+VUMwZxdb
VMTFIZ3ttkKEodjhxGI7oGev6V8UfhI37SmO2lYKXpQVjXXnMlv/M+Vw3teE38CN
gopi+xtGnT1IeWQ3tc/Tv18pleJ0mh5HKWiW+9KoqgXj0wgF9x4eRYDz1TDCDA/A
iaBzs56j8m/FSykZHnrWZ/MvjKNPdGlfJASUCPeTM2dcrXQGJ93+X2hJctzDte0y
Nuiw6Y0htfFBE7xoJn+sqm5Okr+McoUM18/CCprbgSKYk18iMYm3ZtAi6FUQZS1A
ua4wQCf49loGp15PO61AS5d3OBf5D3q/WihQRbCaJvTVgPp9sWYnWwtcVUuhMllh
ZQtBU9REcVJ/22bH09Q9CjBW0VpKpXHveqQdqRDViLJ6v/iI6EFGmD24SW/VxyRd
73k9MBGrL/dOf1SbEzdsnvcSB3LGzp0Om8o/KzJWOomrVKjBCJy16bwTEsCZEJmP
i406m92GPXeaN1GhTko7vmF0GnkEdJs1GVCZPluCAxxbhHukyxHnrjlQjI4vC80n
Ylc0B3Kvitw7LGJsPqu+/jfNHADC/zhx1qz/30wb5cFmFbN1aRdp3pm8JYUkn+l/
zri2Y6+O89gvE/9/xUhMohzHsWUO7xITiBavewKeTP9GSWybWUs=
=cRy1
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Borislav Petkov:
- Add the call depth tracking mitigation for Retbleed which has been
long in the making. It is a lighterweight software-only fix for
Skylake-based cores where enabling IBRS is a big hammer and causes a
significant performance impact.
What it basically does is, it aligns all kernel functions to 16 bytes
boundary and adds a 16-byte padding before the function, objtool
collects all functions' locations and when the mitigation gets
applied, it patches a call accounting thunk which is used to track
the call depth of the stack at any time.
When that call depth reaches a magical, microarchitecture-specific
value for the Return Stack Buffer, the code stuffs that RSB and
avoids its underflow which could otherwise lead to the Intel variant
of Retbleed.
This software-only solution brings a lot of the lost performance
back, as benchmarks suggest:
https://lore.kernel.org/all/20220915111039.092790446@infradead.org/
That page above also contains a lot more detailed explanation of the
whole mechanism
- Implement a new control flow integrity scheme called FineIBT which is
based on the software kCFI implementation and uses hardware IBT
support where present to annotate and track indirect branches using a
hash to validate them
- Other misc fixes and cleanups
* tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
x86/paravirt: Use common macro for creating simple asm paravirt functions
x86/paravirt: Remove clobber bitmask from .parainstructions
x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al
x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit
x86/Kconfig: Enable kernel IBT by default
x86,pm: Force out-of-line memcpy()
objtool: Fix weak hole vs prefix symbol
objtool: Optimize elf_dirty_reloc_sym()
x86/cfi: Add boot time hash randomization
x86/cfi: Boot time selection of CFI scheme
x86/ibt: Implement FineIBT
objtool: Add --cfi to generate the .cfi_sites section
x86: Add prefix symbols for function padding
objtool: Add option to generate prefix symbols
objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf
objtool: Slice up elf_create_section_symbol()
kallsyms: Revert "Take callthunks into account"
x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces
x86/retpoline: Fix crash printing warning
x86/paravirt: Fix a !PARAVIRT build warning
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmOYpTIUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxuZhAAhGjE8voLZeOYwxbvfL69hGTAZ+Me
x2hqRWVhh/IGWXTTaoSLwSjMMokcmAKN5S/wv8qdCG5sB8EN8FyTBIZDy8PuRRdl
8UlqlBMSL+d4oSRDCnYLxFNcynLRNnmx2dfcdw9tJ4zjTLN8Y4o8PHFogR6pJ3MT
sDC8S0myTQKXr4wAGzTZycKsiGManviYtByp6dCcKD3Oy5Q2uZ9OKO2DP2yQpn+F
c3IJSV9oDz3KR8JVJ5Q1iz9cdMXbGwjkM3JLlHpxhedwjN4ErLumPutKcebtzO5C
aTqabN7Nnzc4yJusAIfojFCWH7fgaYUyJ3pxcFyJ4tu4m9Last+2I5UB/kV2sYAD
jWiCYx3sA/mRopNXOnrBGae+Lgy+sQnt8or0grySr0bK+b+ArAGis4uT4A0uASGO
RUQdIQwz7zhHeQrwAladHWxnx4BEDNCatgfn38p4fklIYKydCY5nfZURMDvHezSR
G6Nu08hoE9ZXlmkWTFw+5F23wPWKcCpzZj0hf7OroIouXUp8vqSFSqatH5vGkbCl
bDswck9GdRJ2hl5SvFOeelaXkM42du45TMLU2JmIn6dYYFNrO93JgdvKSU7E2CpG
AmDIpg1Idxo8fEPPGH1I7RVU5+ilzmmPQQY7poQW+va4/dEd/QVp1+ZZTDnMC1qk
qi3ck22VdvPU2VU=
=KULr
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Squash portdrv_{core,pci}.c into portdrv.c to ease maintenance and
make more things static.
- Make portdrv bind to Switch Ports that have AER. Previously, if
these Ports lacked MSI/MSI-X, portdrv failed to bind, which meant
the Ports couldn't be suspended to low-power states. AER on these
Ports doesn't use interrupts, and the AER driver doesn't need to
claim them.
- Assign PCI domain IDs using ida_alloc(), which makes host bridge
add/remove work better.
Resource management:
- To work better with recent BIOSes that use EfiMemoryMappedIO for
PCI host bridge apertures, remove those regions from the E820 map
(E820 entries normally prevent us from allocating BARs). In v5.19,
we added some quirks to disable E820 checking, but that's not very
maintainable. EfiMemoryMappedIO means the OS needs to map the
region for use by EFI runtime services; it shouldn't prevent OS
from using it.
PCIe native device hotplug:
- Build pciehp by default if USB4 is enabled, since Thunderbolt/USB4
PCIe tunneling depends on native PCIe hotplug.
- Enable Command Completed Interrupt only if supported to avoid user
confusion from lspci output that says this is enabled but not
supported.
- Prevent pciehp from binding to Switch Upstream Ports; this happened
because of interaction with acpiphp and caused devices below the
Upstream Port to disappear.
Power management:
- Convert AGP drivers to generic power management. We hope to remove
legacy power management from the PCI core eventually.
Virtualization:
- Fix pci_device_is_present(), which previously always returned
"false" for VFs, causing virtio hangs when unbinding the driver.
Miscellaneous:
- Convert drivers to gpiod API to prepare for dropping some legacy
code.
- Fix DOE fencepost error for the maximum data object length.
Baikal-T1 PCIe controller driver:
- Add driver and DT bindings.
Broadcom STB PCIe controller driver:
- Enable Multi-MSI.
- Delay 100ms after PERST# deassert to allow power and clocks to
stabilize.
- Configure Read Completion Boundary to 64 bytes.
Freescale i.MX6 PCIe controller driver:
- Initialize PHY before deasserting core reset to fix a regression in
v6.0 on boards where the PHY provides the reference.
- Fix imx6sx and imx8mq clock names in DT schema.
Intel VMD host bridge driver:
- Fix Secondary Bus Reset on VMD bridges, which allows reset of NVMe
SSDs in VT-d pass-through scenarios.
- Disable MSI remapping, which gets re-enabled by firmware during
suspend/resume.
MediaTek PCIe Gen3 controller driver:
- Add MT7986 and MT8195 support.
Qualcomm PCIe controller driver:
- Add SC8280XP/SA8540P basic interconnect support.
Rockchip DesignWare PCIe controller driver:
- Base DT schema on common Synopsys schema.
Synopsys DesignWare PCIe core:
- Collect DT items shared between Root Port and Endpoint (PERST GPIO,
PHY info, clocks, resets, link speed, number of lanes, number of
iATU windows, interrupt info, etc) to snps,dw-pcie-common.yaml.
- Add dma-ranges support for Root Ports and Endpoints.
- Consolidate DT resource retrieval for "dbi", "dbi2", "atu", etc. to
reduce code duplication.
- Add generic names for clocks and resets to encourage more
consistent naming across drivers using DesignWare IP.
- Stop advertising PTM Responder role for Endpoints, which aren't
allowed to be responders.
TI J721E PCIe driver:
- Add j721s2 host mode ID to DT schema.
- Add interrupt properties to DT schema.
Toshiba Visconti PCIe controller driver:
- Fix interrupts array max constraints in DT schema"
* tag 'pci-v6.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (95 commits)
x86/PCI: Use pr_info() when possible
x86/PCI: Fix log message typo
x86/PCI: Tidy E820 removal messages
PCI: Skip allocate_resource() if too little space available
efi/x86: Remove EfiMemoryMappedIO from E820 map
PCI/portdrv: Allow AER service only for Root Ports & RCECs
PCI: xilinx-nwl: Fix coding style violations
PCI: mvebu: Switch to using gpiod API
PCI: pciehp: Enable Command Completed Interrupt only if supported
PCI: aardvark: Switch to using devm_gpiod_get_optional()
dt-bindings: PCI: mediatek-gen3: add support for mt7986
dt-bindings: PCI: mediatek-gen3: add SoC based clock config
dt-bindings: PCI: qcom: Allow 'dma-coherent' property
PCI: mt7621: Add sentinel to quirks table
PCI: vmd: Fix secondary bus reset for Intel bridges
PCI: endpoint: pci-epf-vntb: Fix sparse ntb->reg build warning
PCI: endpoint: pci-epf-vntb: Fix sparse build warning for epf_db
PCI: endpoint: pci-epf-vntb: Replace hardcoded 4 with sizeof(u32)
PCI: endpoint: pci-epf-vntb: Remove unused epf_db_phy struct member
PCI: endpoint: pci-epf-vntb: Fix call pci_epc_mem_free_addr() in error path
...
- More userfaultfs work from Peter Xu.
- Several convert-to-folios series from Sidhartha Kumar and Huang Ying.
- Some filemap cleanups from Vishal Moola.
- David Hildenbrand added the ability to selftest anon memory COW handling.
- Some cpuset simplifications from Liu Shixin.
- Addition of vmalloc tracing support by Uladzislau Rezki.
- Some pagecache folioifications and simplifications from Matthew Wilcox.
- A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it.
- Miguel Ojeda contributed some cleanups for our use of the
__no_sanitize_thread__ gcc keyword. This series shold have been in the
non-MM tree, my bad.
- Naoya Horiguchi improved the interaction between memory poisoning and
memory section removal for huge pages.
- DAMON cleanups and tuneups from SeongJae Park
- Tony Luck fixed the handling of COW faults against poisoned pages.
- Peter Xu utilized the PTE marker code for handling swapin errors.
- Hugh Dickins reworked compound page mapcount handling, simplifying it
and making it more efficient.
- Removal of the autonuma savedwrite infrastructure from Nadav Amit and
David Hildenbrand.
- zram support for multiple compression streams from Sergey Senozhatsky.
- David Hildenbrand reworked the GUP code's R/O long-term pinning so
that drivers no longer need to use the FOLL_FORCE workaround which
didn't work very well anyway.
- Mel Gorman altered the page allocator so that local IRQs can remnain
enabled during per-cpu page allocations.
- Vishal Moola removed the try_to_release_page() wrapper.
- Stefan Roesch added some per-BDI sysfs tunables which are used to
prevent network block devices from dirtying excessive amounts of
pagecache.
- David Hildenbrand did some cleanup and repair work on KSM COW
breaking.
- Nhat Pham and Johannes Weiner have implemented writeback in zswap's
zsmalloc backend.
- Brian Foster has fixed a longstanding corner-case oddity in
file[map]_write_and_wait_range().
- sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
Chen.
- Shiyang Ruan has done some work on fsdax, to make its reflink mode
work better under xfstests. Better, but still not perfect.
- Christoph Hellwig has removed the .writepage() method from several
filesystems. They only need .writepages().
- Yosry Ahmed wrote a series which fixes the memcg reclaim target
beancounting.
- David Hildenbrand has fixed some of our MM selftests for 32-bit
machines.
- Many singleton patches, as usual.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA
jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml
CodAgiA51qwzId3GRytIo/tfWZSezgA=
=d19R
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- More userfaultfs work from Peter Xu
- Several convert-to-folios series from Sidhartha Kumar and Huang Ying
- Some filemap cleanups from Vishal Moola
- David Hildenbrand added the ability to selftest anon memory COW
handling
- Some cpuset simplifications from Liu Shixin
- Addition of vmalloc tracing support by Uladzislau Rezki
- Some pagecache folioifications and simplifications from Matthew
Wilcox
- A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use
it
- Miguel Ojeda contributed some cleanups for our use of the
__no_sanitize_thread__ gcc keyword.
This series should have been in the non-MM tree, my bad
- Naoya Horiguchi improved the interaction between memory poisoning and
memory section removal for huge pages
- DAMON cleanups and tuneups from SeongJae Park
- Tony Luck fixed the handling of COW faults against poisoned pages
- Peter Xu utilized the PTE marker code for handling swapin errors
- Hugh Dickins reworked compound page mapcount handling, simplifying it
and making it more efficient
- Removal of the autonuma savedwrite infrastructure from Nadav Amit and
David Hildenbrand
- zram support for multiple compression streams from Sergey Senozhatsky
- David Hildenbrand reworked the GUP code's R/O long-term pinning so
that drivers no longer need to use the FOLL_FORCE workaround which
didn't work very well anyway
- Mel Gorman altered the page allocator so that local IRQs can remnain
enabled during per-cpu page allocations
- Vishal Moola removed the try_to_release_page() wrapper
- Stefan Roesch added some per-BDI sysfs tunables which are used to
prevent network block devices from dirtying excessive amounts of
pagecache
- David Hildenbrand did some cleanup and repair work on KSM COW
breaking
- Nhat Pham and Johannes Weiner have implemented writeback in zswap's
zsmalloc backend
- Brian Foster has fixed a longstanding corner-case oddity in
file[map]_write_and_wait_range()
- sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
Chen
- Shiyang Ruan has done some work on fsdax, to make its reflink mode
work better under xfstests. Better, but still not perfect
- Christoph Hellwig has removed the .writepage() method from several
filesystems. They only need .writepages()
- Yosry Ahmed wrote a series which fixes the memcg reclaim target
beancounting
- David Hildenbrand has fixed some of our MM selftests for 32-bit
machines
- Many singleton patches, as usual
* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
mm: mmu_gather: allow more than one batch of delayed rmaps
mm: fix typo in struct pglist_data code comment
kmsan: fix memcpy tests
mm: add cond_resched() in swapin_walk_pmd_entry()
mm: do not show fs mm pc for VM_LOCKONFAULT pages
selftests/vm: ksm_functional_tests: fixes for 32bit
selftests/vm: cow: fix compile warning on 32bit
selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
mm,thp,rmap: fix races between updates of subpages_mapcount
mm: memcg: fix swapcached stat accounting
mm: add nodes= arg to memory.reclaim
mm: disable top-tier fallback to reclaim on proactive reclaim
selftests: cgroup: make sure reclaim target memcg is unprotected
selftests: cgroup: refactor proactive reclaim code to reclaim_until()
mm: memcg: fix stale protection of reclaim target memcg
mm/mmap: properly unaccount memory on mas_preallocate() failure
omfs: remove ->writepage
jfs: remove ->writepage
...
driver in order to be able to run multiple different test patterns.
Rework things and remove the BROKEN dependency so that the driver can be
enabled (Jithu Joseph)
- Remove the subsys interface usage in the microcode loader because it
is not really needed
- A couple of smaller fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYjh8ACgkQEsHwGGHe
VUpu8xAAhY7ywLcAoG9p3AaGiXpryFwnXFBah13o1rkgkJGRaG/eVjPJ4KUUjOQs
Wo3WUHeeHwmFWq+F/OSRefNsptOLBQ3u/cSza9TDDjPoS3glO5cIFc34JqIItMTg
L1GMB4LfmD1+9lYpM6Td11/Dluqf7EjeEdF4qDmCRZ5i4YNsaAlM4HtgATavNkYc
6Bvsi1r7tv7tCNDAEYqEfsQLoc79Yca4W5s86HNIyrxtyk9RLrK75WvRkcpTSnK9
SEpgpYwZy4iRTtZmePC7BqqbHfV6NoeuRqIMR73FrNK9pQuauGFMPkIx08Sgl3BW
/YGpefleGBHhy6Dqa6rEPsYS9xHfhqYAde09zzECJWW4VSI0PuFKyfm67ep2O7q6
zbV2DjxEZ+8kWeO9cDJPedEd8pXC8Ua7H+KNl00npdfNlkBaVR9ZRjX7ZVoiFMi8
6SRmCr1MLngldSMkUr6cYiLpoXmRzM+7gnKhVzhO6yNa0eihYBAIZ5lei0n9Q01W
Soxvec2KKeSZraNLoQH0MSndEJY4sqx6lPjlXgFT6gGHzgfQZTg+9INdaPK9gbI7
tg5j1e0/1UyvWrxYxOdzThtRY1X7Y1QtdpQDcatkVOgR1uZi1CTDx1dxTrHP5jbZ
7MSKn/8/T61beG6ujjif+pC8kOwNISLNDBBZGNzeLRyx8t9/6jQ=
=Z2Nu
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode and IFS updates from Borislav Petkov:
"The IFS (In-Field Scan) stuff goes through tip because the IFS driver
uses the same structures and similar functionality as the microcode
loader and it made sense to route it all through this branch so that
there are no conflicts.
- Add support for multiple testing sequences to the Intel In-Field
Scan driver in order to be able to run multiple different test
patterns. Rework things and remove the BROKEN dependency so that
the driver can be enabled (Jithu Joseph)
- Remove the subsys interface usage in the microcode loader because
it is not really needed
- A couple of smaller fixes and cleanups"
* tag 'x86_microcode_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/microcode/intel: Do not retry microcode reloading on the APs
x86/microcode/intel: Do not print microcode revision and processor flags
platform/x86/intel/ifs: Add missing kernel-doc entry
Revert "platform/x86/intel/ifs: Mark as BROKEN"
Documentation/ABI: Update IFS ABI doc
platform/x86/intel/ifs: Add current_batch sysfs entry
platform/x86/intel/ifs: Remove reload sysfs entry
platform/x86/intel/ifs: Add metadata validation
platform/x86/intel/ifs: Use generic microcode headers and functions
platform/x86/intel/ifs: Add metadata support
x86/microcode/intel: Use a reserved field for metasize
x86/microcode/intel: Add hdr_type to intel_microcode_sanity_check()
x86/microcode/intel: Reuse microcode_sanity_check()
x86/microcode/intel: Use appropriate type in microcode_sanity_check()
x86/microcode/intel: Reuse find_matching_signature()
platform/x86/intel/ifs: Remove memory allocation from load path
platform/x86/intel/ifs: Remove image loading during init
platform/x86/intel/ifs: Return a more appropriate error code
platform/x86/intel/ifs: Remove unused selection
x86/microcode: Drop struct ucode_cpu_info.valid
...
guests which do not get MTRRs exposed but only PAT. (TDX guests do not
support the cache disabling dance when setting up MTRRs so they fall
under the same category.) This is a cleanup work to remove all the ugly
workarounds for such guests and init things separately (Juergen Gross)
- Add two new Intel CPUs to the list of CPUs with "normal" Energy
Performance Bias, leading to power savings
- Do not do bus master arbitration in C3 (ARB_DISABLE) on modern Centaur
CPUs
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYhIMACgkQEsHwGGHe
VUpxug//ZKw3hYFroKhsULJi/e0j2nGARiSlJrJcFHl2vgh9yGvDsnYUyM/rgjgt
cM3uCLbEG7nA6uhB3nupzaXZ8lBM1nU9kiEl/kjQ5oYf9nmJ48fLttvWGfxYN4s3
kj5fYVhlOZpntQXIWrwxnPqghUysumMnZmBJeKYiYNNfkj62l3xU2Ni4Gnjnp02I
9MmUhl7pj1aEyOQfM8rovy+wtYCg5WTOmXVlyVN+b9MwfYeK+stojvCZHxtJs9BD
fezpJjjG+78xKUC7vVZXCh1p1N5Qvj014XJkVl9Hg0n7qizKFZRtqi8I769G2ptd
exP8c2nDXKCqYzE8vK6ukWgDANQPs3d6Z7EqUKuXOCBF81PnMPSUMyNtQFGNM6Wp
S5YSvFfCgUjp50IunOpvkDABgpM+PB8qeWUq72UFQJSOymzRJg/KXtE2X+qaMwtC
0i6VLXfMddGcmqNKDppfGtCjq2W5VrNIIJedtAQQGyl+pl3XzZeNomhJpm/0mVfJ
8UrlXZeXl/EUQ7qk40gC/Ash27pU9ZDx4CMNMy1jDIQqgufBjEoRIDSFqQlghmZq
An5/BqMLhOMxUYNA7bRUnyeyxCBypetMdQt5ikBmVXebvBDmArXcuSNAdiy1uBFX
KD8P3Y1AnsHIklxkLNyZRUy7fb4mgMFenUbgc0vmbYHbFl0C0pQ=
=Zmgh
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- Split MTRR and PAT init code to accomodate at least Xen PV and TDX
guests which do not get MTRRs exposed but only PAT. (TDX guests do
not support the cache disabling dance when setting up MTRRs so they
fall under the same category)
This is a cleanup work to remove all the ugly workarounds for such
guests and init things separately (Juergen Gross)
- Add two new Intel CPUs to the list of CPUs with "normal" Energy
Performance Bias, leading to power savings
- Do not do bus master arbitration in C3 (ARB_DISABLE) on modern
Centaur CPUs
* tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
x86/mtrr: Make message for disabled MTRRs more descriptive
x86/pat: Handle TDX guest PAT initialization
x86/cpuid: Carve out all CPUID functionality
x86/cpu: Switch to cpu_feature_enabled() for X86_FEATURE_XENPV
x86/cpu: Remove X86_FEATURE_XENPV usage in setup_cpu_entry_area()
x86/cpu: Drop 32-bit Xen PV guest code in update_task_stack()
x86/cpu: Remove unneeded 64-bit dependency in arch_enter_from_user_mode()
x86/cpufeatures: Add X86_FEATURE_XENPV to disabled-features.h
x86/acpi/cstate: Optimize ARB_DISABLE on Centaur CPUs
x86/mtrr: Simplify mtrr_ops initialization
x86/cacheinfo: Switch cache_ap_init() to hotplug callback
x86: Decouple PAT and MTRR handling
x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()
x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init
x86/mtrr: Get rid of __mtrr_enabled bool
x86/mtrr: Simplify mtrr_bp_init()
x86/mtrr: Remove set_all callback from struct mtrr_ops
x86/mtrr: Disentangle MTRR init from PAT init
x86/mtrr: Move cache control code to cacheinfo.c
x86/mtrr: Split MTRR-specific handling from cache dis/enabling
...
EFI mixed-mode code to a separate compilation unit, the AMD memory
encryption early code where it belongs and fixing up build dependencies.
Make the deprecated EFI handover protocol optional with the goal of
removing it at some point (Ard Biesheuvel)
- Skip realmode init code on Xen PV guests as it is not needed there
- Remove an old 32-bit PIC code compiler workaround
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYaiMACgkQEsHwGGHe
VUrNVhAAk3lLagEsrBcQ24SnMMAyQvdKfRucn9fbs72jBCyWbDqXcE59qNgdbMS1
3rIL+EJdF8jlm5K28GjRS1WSvwUyYbyFEfUcYfqZl9L/5PAl7PlG7nNQw7/gXnw+
xS57w/Q3cONlo5LC0K2Zkbj/59RvDoBEs3nkhozkKR0npTDW/LK3Vl0zgKTkvqsV
DzRIHhWsqSEvpdowbQmQCyqFh/pOoQlZkQwjYVA9+SaQYdH3Yo1dpLd5i9I9eVmJ
dci/HDU+plwYYuZ1XhxwXr82PcdCUVYjJ/DTt9GkTVYq7u5EWx62puxTl+c+wbG2
H1WBXuZHBGdzNMFdnb1k9RuLCaYdaxKTNlZh3FPMMDtkjtjKTl/olXTlFUYFgI6E
FPv4hi15g6pMveS3K6YUAd0uGvpsjvLUZHPqMDVS2trhxLENQALc6Id/PwqzrQ1T
FzfPYcDyFFwMM3MDuWc8ClwEDD9wr0Z4m4Aek/ca2r85AKEX8ZtTTlWZoI4E9A4B
hEjUFnRhT/d6XLWwZqcOIKfwtbpKAjdsCN3ElFst8ogRFAXqW8luDoI4BRCkBC4p
T4RHdij4afkuFjSAxBacazpaavtcCsDqXwBpeL4YN+4fA7+NokVZGiQVh/3S8BPn
LlgIf6awFq6yQq7JyEGPdk+dWn5sknldixZ55m666ZLzSvQhvE8=
=VGZx
-----END PGP SIGNATURE-----
Merge tag 'x86_boot_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Borislav Petkov:
"A of early boot cleanups and fixes.
- Do some spring cleaning to the compressed boot code by moving the
EFI mixed-mode code to a separate compilation unit, the AMD memory
encryption early code where it belongs and fixing up build
dependencies. Make the deprecated EFI handover protocol optional
with the goal of removing it at some point (Ard Biesheuvel)
- Skip realmode init code on Xen PV guests as it is not needed there
- Remove an old 32-bit PIC code compiler workaround"
* tag 'x86_boot_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Remove x86_32 PIC using %ebx workaround
x86/boot: Skip realmode init code when running as Xen PV guest
x86/efi: Make the deprecated EFI handover protocol optional
x86/boot/compressed: Only build mem_encrypt.S if AMD_MEM_ENCRYPT=y
x86/boot/compressed: Adhere to calling convention in get_sev_encryption_bit()
x86/boot/compressed: Move startup32_check_sev_cbit() out of head_64.S
x86/boot/compressed: Move startup32_check_sev_cbit() into .text
x86/boot/compressed: Move startup32_load_idt() out of head_64.S
x86/boot/compressed: Move startup32_load_idt() into .text section
x86/boot/compressed: Pull global variable reference into startup32_load_idt()
x86/boot/compressed: Avoid touching ECX in startup32_set_idt_entry()
x86/boot/compressed: Simplify IDT/GDT preserve/restore in the EFI thunk
x86/boot/compressed, efi: Merge multiple definitions of image_offset into one
x86/boot/compressed: Move efi32_pe_entry() out of head_64.S
x86/boot/compressed: Move efi32_entry out of head_64.S
x86/boot/compressed: Move efi32_pe_entry into .text section
x86/boot/compressed: Move bootargs parsing out of 32-bit startup code
x86/boot/compressed: Move 32-bit entrypoint code into .text section
x86/boot/compressed: Rename efi_thunk_64.S to efi-mixed.S
- Refactor the zboot code so that it incorporates all the EFI stub
logic, rather than calling the decompressed kernel as a EFI app.
- Add support for initrd= command line option to x86 mixed mode.
- Allow initrd= to be used with arbitrary EFI accessible file systems
instead of just the one the kernel itself was loaded from.
- Move some x86-only handling and manipulation of the EFI memory map
into arch/x86, as it is not used anywhere else.
- More flexible handling of any random seeds provided by the boot
environment (i.e., systemd-boot) so that it becomes available much
earlier during the boot.
- Allow improved arch-agnostic EFI support in loaders, by setting a
uniform baseline of supported features, and adding a generic magic
number to the DOS/PE header. This should allow loaders such as GRUB or
systemd-boot to reduce the amount of arch-specific handling
substantially.
- (arm64) Run EFI runtime services from a dedicated stack, and use it to
recover from synchronous exceptions that might occur in the firmware
code.
- (arm64) Ensure that we don't allocate memory outside of the 48-bit
addressable physical range.
- Make EFI pstore record size configurable
- Add support for decoding CXL specific CPER records
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmOTQ1cACgkQw08iOZLZ
jyQRkAv+LqaZFWeVwhAQHiw/N3RnRM0nZHea6++D2p1y/ZbCpwv3pdLl2YHQ1KmW
wDG9Nr4C1ITLtfy1YZKeYpwloQtq9S1GZDWnFpVv/hdo7L924eRAwIlxowWn1OnP
ruxv2PaYXyb0plh1YD1f6E1BqrfUOtajET55Kxs9ZsxmnMtDpIX3NiYy4LKMBIZC
+Eywt41M3uBX+wgmSujFBMVVJjhOX60WhUYXqy0RXwDKOyrz/oW5td+eotSCreB6
FVbjvwQvUdtzn4s1FayOMlTrkxxLw4vLhsaUGAdDOHd3rg3sZT9Xh1HqFFD6nss6
ZAzAYQ6BzdiV/5WSB9meJe+BeG1hjTNKjJI6JPO2lctzYJqlnJJzI6JzBuH9vzQ0
dffLB8NITeEW2rphIh+q+PAKFFNbXWkJtV4BMRpqmzZ/w7HwupZbUXAzbWE8/5km
qlFpr0kmq8GlVcbXNOFjmnQVrJ8jPYn+O3AwmEiVAXKZJOsMH0sjlXHKsonme9oV
Sk71c6Em
=JEXz
-----END PGP SIGNATURE-----
Merge tag 'efi-next-for-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
"Another fairly sizable pull request, by EFI subsystem standards.
Most of the work was done by me, some of it in collaboration with the
distro and bootloader folks (GRUB, systemd-boot), where the main focus
has been on removing pointless per-arch differences in the way EFI
boots a Linux kernel.
- Refactor the zboot code so that it incorporates all the EFI stub
logic, rather than calling the decompressed kernel as a EFI app.
- Add support for initrd= command line option to x86 mixed mode.
- Allow initrd= to be used with arbitrary EFI accessible file systems
instead of just the one the kernel itself was loaded from.
- Move some x86-only handling and manipulation of the EFI memory map
into arch/x86, as it is not used anywhere else.
- More flexible handling of any random seeds provided by the boot
environment (i.e., systemd-boot) so that it becomes available much
earlier during the boot.
- Allow improved arch-agnostic EFI support in loaders, by setting a
uniform baseline of supported features, and adding a generic magic
number to the DOS/PE header. This should allow loaders such as GRUB
or systemd-boot to reduce the amount of arch-specific handling
substantially.
- (arm64) Run EFI runtime services from a dedicated stack, and use it
to recover from synchronous exceptions that might occur in the
firmware code.
- (arm64) Ensure that we don't allocate memory outside of the 48-bit
addressable physical range.
- Make EFI pstore record size configurable
- Add support for decoding CXL specific CPER records"
* tag 'efi-next-for-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (43 commits)
arm64: efi: Recover from synchronous exceptions occurring in firmware
arm64: efi: Execute runtime services from a dedicated stack
arm64: efi: Limit allocations to 48-bit addressable physical region
efi: Put Linux specific magic number in the DOS header
efi: libstub: Always enable initrd command line loader and bump version
efi: stub: use random seed from EFI variable
efi: vars: prohibit reading random seed variables
efi: random: combine bootloader provided RNG seed with RNG protocol output
efi/cper, cxl: Decode CXL Error Log
efi/cper, cxl: Decode CXL Protocol Error Section
efi: libstub: fix efi_load_initrd_dev_path() kernel-doc comment
efi: x86: Move EFI runtime map sysfs code to arch/x86
efi: runtime-maps: Clarify purpose and enable by default for kexec
efi: pstore: Add module parameter for setting the record size
efi: xen: Set EFI_PARAVIRT for Xen dom0 boot on all architectures
efi: memmap: Move manipulation routines into x86 arch tree
efi: memmap: Move EFI fake memmap support into x86 arch tree
efi: libstub: Undeprecate the command line initrd loader
efi: libstub: Add mixed mode support to command line initrd loader
efi: libstub: Permit mixed mode return types other than efi_status_t
...
direction misannotations and (hopefully) preventing
more of the same for the future.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
=bcvF
-----END PGP SIGNATURE-----
Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
/ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
=QRhK
-----END PGP SIGNATURE-----
Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
- Replace prandom_u32_max() and various open-coded variants of it,
there is now a new family of functions that uses fast rejection
sampling to choose properly uniformly random numbers within an
interval:
get_random_u32_below(ceil) - [0, ceil)
get_random_u32_above(floor) - (floor, U32_MAX]
get_random_u32_inclusive(floor, ceil) - [floor, ceil]
Coccinelle was used to convert all current users of
prandom_u32_max(), as well as many open-coded patterns, resulting in
improvements throughout the tree.
I'll have a "late" 6.1-rc1 pull for you that removes the now unused
prandom_u32_max() function, just in case any other trees add a new
use case of it that needs to converted. According to linux-next,
there may be two trivial cases of prandom_u32_max() reintroductions
that are fixable with a 's/.../.../'. So I'll have for you a final
conversion patch doing that alongside the removal patch during the
second week.
This is a treewide change that touches many files throughout.
- More consistent use of get_random_canary().
- Updates to comments, documentation, tests, headers, and
simplification in configuration.
- The arch_get_random*_early() abstraction was only used by arm64 and
wasn't entirely useful, so this has been replaced by code that works
in all relevant contexts.
- The kernel will use and manage random seeds in non-volatile EFI
variables, refreshing a variable with a fresh seed when the RNG is
initialized. The RNG GUID namespace is then hidden from efivarfs to
prevent accidental leakage.
These changes are split into random.c infrastructure code used in the
EFI subsystem, in this pull request, and related support inside of
EFISTUB, in Ard's EFI tree. These are co-dependent for full
functionality, but the order of merging doesn't matter.
- Part of the infrastructure added for the EFI support is also used for
an improvement to the way vsprintf initializes its siphash key,
replacing an sleep loop wart.
- The hardware RNG framework now always calls its correct random.c
input function, add_hwgenerator_randomness(), rather than sometimes
going through helpers better suited for other cases.
- The add_latent_entropy() function has long been called from the fork
handler, but is a no-op when the latent entropy gcc plugin isn't
used, which is fine for the purposes of latent entropy.
But it was missing out on the cycle counter that was also being mixed
in beside the latent entropy variable. So now, if the latent entropy
gcc plugin isn't enabled, add_latent_entropy() will expand to a call
to add_device_randomness(NULL, 0), which adds a cycle counter,
without the absent latent entropy variable.
- The RNG is now reseeded from a delayed worker, rather than on demand
when used. Always running from a worker allows it to make use of the
CPU RNG on platforms like S390x, whose instructions are too slow to
do so from interrupts. It also has the effect of adding in new inputs
more frequently with more regularity, amounting to a long term
transcript of random values. Plus, it helps a bit with the upcoming
vDSO implementation (which isn't yet ready for 6.2).
- The jitter entropy algorithm now tries to execute on many different
CPUs, round-robining, in hopes of hitting even more memory latencies
and other unpredictable effects. It also will mix in a cycle counter
when the entropy timer fires, in addition to being mixed in from the
main loop, to account more explicitly for fluctuations in that timer
firing. And the state it touches is now kept within the same cache
line, so that it's assured that the different execution contexts will
cause latencies.
* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
random: include <linux/once.h> in the right header
random: align entropy_timer_state to cache line
random: mix in cycle counter when jitter timer fires
random: spread out jitter callback to different CPUs
random: remove extraneous period and add a missing one in comments
efi: random: refresh non-volatile random seed when RNG is initialized
vsprintf: initialize siphash key using notifier
random: add back async readiness notifier
random: reseed in delayed work rather than on-demand
random: always mix cycle counter in add_latent_entropy()
hw_random: use add_hwgenerator_randomness() for early entropy
random: modernize documentation comment on get_random_bytes()
random: adjust comment to account for removed function
random: remove early archrandom abstraction
random: use random.trust_{bootloader,cpu} command line option only
stackprotector: actually use get_random_canary()
stackprotector: move get_random_canary() into stackprotector.h
treewide: use get_random_u32_inclusive() when possible
treewide: use get_random_u32_{above,below}() instead of manual loop
treewide: use get_random_u32_below() instead of deprecated function
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOXitsACgkQEsHwGGHe
VUp8IA//W0CiKuDcqZcMLzl5A16ZOSNXS1xuxBjFcDhS2JtCb3NZEeXLPRcEWglO
HwOZpLq5gc32kSboujT4XqKFnrtcdS94fO+BPwB/xxlM4Y4WWp4JRwbAylzGOOft
5NmZFB35zLMAKDpCogrigYtvav+usZqeCt2SRxAGrK8MuCXLk53OndQdChfJj0+O
VzZsd6gdhjCJ20lZzSYiAZWUYE1Ibfd6hch37A/T1bLD8crANWJPV97PCCJivWIH
PVx3NTWzCjSX507eX3+v1Nf8a+GpCGcJzJwu8+0o5T6lWrf4vyXF/Evz4jgdUe4i
8ZeeCDTsIgbDT7WhLpM6DS5SvkgrYkCamsSQzFLFdeXwpPlgvogyc6DmoQvPy1sw
WFPTYy+HOp/5Slz7GIPN2WdjE5RkfFDQ6a+w72R6YZDlZLYEKCWciZmRwOlrtmR6
K2ujo4ipEK9I7QW9EES2WAAvaM0AsxfjX545T9IDI4W+AXil+m08TFDPwmoag4Ja
q0MUTDFAfmvdhAa2rxJuSedbpjYflb/uuYHqX/kdR9syHhtJpiw1TypaVYD89hpL
AceNGn1gHMVTcC19+Ey9uoL9+Y3VfazD1UIFuzK/iFIgsieqK6zxTRKo8mZ0NHYE
fNnA2sdcteKgaW+d2aKsb0yF1OZ05nfg1YgmUZLbcE1oXmGEJ5A=
=expz
-----END PGP SIGNATURE-----
Merge tag 'x86_alternatives_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 alternative update from Borislav Petkov:
"A single alternatives patching fix for modules:
- Have alternatives patch the same sections in modules as in vmlinux"
* tag 'x86_alternatives_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternative: Consistently patch SMP locks in vmlinux and modules
- Add another MCE severity error case to the Intel error severity
table to promote UC and AR errors to panic severity and remove the
corresponding code condition doing that.
- Make sure the thresholding and deferred error interrupts on AMD SMCA
systems clear the all registers reporting an error so that there are no
multiple errors logged for the same event
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOXiPcACgkQEsHwGGHe
VUomHQ/9Gj6Go0ILvIEpn3VCC8bRc0nf/SFJ+4BnFBc+GdiN6ePhDwn1of3bPk/d
zNuPFGEa1rV7r97MDsggGetvNVVrA6zumoPmLrHPrZSvhRyCW620J43RH+3T4bzz
XfCMU6oRJg+F4dFUPnnAnp/6DTwLSe0ofpc2eARlmjdOxTo4MRfxWfwe5XALGGN1
q+8ycB3Gb8cvhjlB61PL7hhjuy6yH29v63vjUMqsyfDmVhXRxY3xymg+4SxalCBf
0Zz8/RRJFHSOwzUPsQUm9kVMN8phhJ/fN0B1wsLDWlt0K5Vx5D19l+wbH4aaTTcF
8bWMMmS43raFuARkcROAEbDrWM2kEo5Qe9eWhZ7HB9wLG9SicJfZIH5Th4ul40V8
4RARSj1ve4vfxzNmmhFf+RL8kdYWLpxlwVJUhdHkiqKTqN8SIQMb/dsNg/7KjFsV
N3PSZ0lOEQ5Q2l5fSZoL+auqXgJBD5BUy+Gjk0awZavzZCdI35/LK8xVzrgCgsRk
AlAcfvngpyZB7A0aeiwalIszcyjk1cnK8+RwIRtvM0CAYhKP+SVTvq4wHtRiO5HD
TfuVgaHSgOyoOD0NdUGM6PXgrBXqnyIYI2me8Gwtg7gzenizBsv/uh8cvWa3DQnG
5NItCYEG7Paa7p3VQDvFqxKldIC1Bjwi2pYYIDOgRiwWGbYzSCU=
=sKP5
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Fix confusing output from /sys/kernel/debug/ras/daemon_active
- Add another MCE severity error case to the Intel error severity table
to promote UC and AR errors to panic severity and remove the
corresponding code condition doing that.
- Make sure the thresholding and deferred error interrupts on AMD SMCA
systems clear the all registers reporting an error so that there are
no multiple errors logged for the same event
* tag 'ras_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
RAS: Fix return value from show_trace()
x86/mce: Use severity table to handle uncorrected errors in kernel
x86/MCE/AMD: Clear DFR errors found in THR handler
* Fix up ptrace interface to protection keys register (PKRU)
* Avoid undefined compiler behavior with TYPE_ALIGN
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOXYisACgkQaDWVMHDJ
krAJkA//QRChRwyKi1syinXt2SGoSa3mTzP23SyV0TunOfKBiBUreFJ2mMFjsX0h
V7SJcu82sCWLHAY6LZRdyiF8zK3Cfzpbgb1QfzBCefE/gU801FhCypqNbQO5Lpdr
PEo+naaDOzwDWDt0A6OkAArgb0zfaOGL+OBhuwT7mcUtBz6gCakFqG2BMgOzqD1z
SAp0RraoSsFnKFl5Gv44+gkThq8/8yL5tyrJtnGv1jAsbhw9zmloaOue6MNMPJhH
3sFQnML3qeNRozquWWeCPu/hxWuFDitPhwdmNRZrnQ3DyRdDhCZPOjv+tQmxI3EO
5c+UIkMIsRh2nZLwHcM+iO5cWE7lyiAWpgqqArB+r2CFXWK5q2lplhXngBodE9Kr
ki/NZ6oEitT3+bLXhCwyc7WKxohl2IlmclJ4AD3Qrp4bzPhfsZebL6nNs/3bxWuF
CxJWIKzjtIcgNSEJaDOzFA5CAImq74r/kCW4e11ZXwmOnx6PX1YG6p0C1yknrZYJ
bvy8WxureO7OJEcVZfwxpXLYbb+7Q/k/l2DkUdVAvKSCB81uWR4JzEp4oooDxf2j
6x9qT5Mi95FhAHOCmlxwkQJTBCB36LkVF/3ESEOqJmun4F5ghPbMX2JzpBa6jPCS
lzkBrzA8MAdmaLHhDO+nd5m8HVY3QBSXDVtRTycmuloeoSeyBno=
=An0n
-----END PGP SIGNATURE-----
Merge tag 'x86_fpu_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Dave Hansen:
"There are two little fixes in here, one to give better XSAVE warnings
and another to address some undefined behavior in offsetof().
There is also a collection of patches to fix some issues with ptrace
and the protection keys register (PKRU). PKRU is a real oddity because
it is exposed in the XSAVE-related ABIs, but it is generally managed
without using XSAVE in the kernel. This fix thankfully came with a
selftest to ward off future regressions.
Summary:
- Clarify XSAVE consistency warnings
- Fix up ptrace interface to protection keys register (PKRU)
- Avoid undefined compiler behavior with TYPE_ALIGN"
* tag 'x86_fpu_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Use _Alignof to avoid undefined behavior in TYPE_ALIGN
selftests/vm/pkeys: Add a regression test for setting PKRU through ptrace
x86/fpu: Emulate XRSTOR's behavior if the xfeatures PKRU bit is not set
x86/fpu: Allow PKRU to be (once again) written by ptrace.
x86/fpu: Add a pkru argument to copy_uabi_to_xstate()
x86/fpu: Add a pkru argument to copy_uabi_from_kernel_to_xstate().
x86/fpu: Take task_struct* in copy_sigframe_from_user_to_xstate()
x86/fpu/xstate: Fix XSTATE_WARN_ON() to emit relevant diagnostics
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOXYlcACgkQaDWVMHDJ
krB+IQ//fLNzHnNmTaFq3TlmF9HiJ926uCu2C3MMma0g8NlFXGLTbeI/UaaER12/
m/N4d+/WPSO1PnsMR36f10Byr8PN2fpbJHGMsHtZ4Y4MTnGycM6JxjDYeFuaSPB5
Cw2IRsO6c7X/dWEVW7hLbHhlG4MpsiX9APt2/PBGpGJm88wL1RDosMKst6430UQK
24JZtFbdyaPnUlo48ql85VkGtdgFHXRnebhM0sX95bVWdSLvNWUSpQAyETp+U9rn
CH75pnoKcJspKun5FmdN2n3gix8Rumz8OZuv9e4XAfBl94H4OZ+SeRN4YbKUvzJP
PtSCz7PT8VQNsJVCA58TQ+QdmhtKsT4ia0ylDvMhHiozzjUNeeS54qJQSUyPLOqK
dBl4hl6BmGMMH2fAZGeoxVmVZMIdLaE0PBECjBEuPAG15IqlxQwTdSeyo0k+S0wV
wYUtCqmxOItW3TA8y044zDjCcIN6wiFymBJtjKbAMxz54ONfnUqgAUluXLeE3xim
8UqL/uM869Ptu6sDO6sfROd1K8EA3KXrsmOGZV7s9hp+qGQcxsvUDhePT9EosS/G
JcmYspV211FO2fTAAOiCe5SJRkoPw/lRWufjNNNWWd3mawJhDeYujZ2fQAxEThC+
Mf8FyFsbxOdbJ1UatgWs/iLOnVwMJf/E1hraq7mdRuZHbNQm7H4=
=yyO+
-----END PGP SIGNATURE-----
Merge tag 'x86_splitlock_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 splitlock updates from Dave Hansen:
"Add a sysctl to control the split lock misery mode.
This enables users to reduce the penalty inflicted on split lock
users. There are some proprietary, binary-only games which became
entirely unplayable with the old penalty.
Anyone opting into the new mode is, of course, more exposed to the DoS
nasitness inherent with split locks, but they can play their games
again"
* tag 'x86_splitlock_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/split_lock: Add sysctl to control the misery mode
* Remove unnecessary arch_has_empty_bitmaps structure memory
* Move rescrtl MSR defines into msr-index.h, like normal MSRs
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOXYhsACgkQaDWVMHDJ
krA75w//XmOC929XGMOY7WQL6IZlH62xsJbtb3BhmM24Ho7RHSNQGPD+ukArCb0u
V/w50Q4crQrLsIxqWjXkyDQ7w66PvvsAIhYFBEV4kssRli9y173CzJQt/lQfUXL9
T7vG5WY1n4f+vtvmZfwcFaGOPkZ5edp8v1y8Grk3r93ci2VDSk+yvEiq80c+JQoX
ZnEYPxGPUpwAVuaysY8wkGCEc4Yln6gtTKzpVPXE18WAs82OeiCWBfldI/+95j3o
/5r5asYQpD8bVhtLHi1mepkBAGbeVNWhSJVlOE9HdU9WnzCkNKn1ZXRuXSBlvTeq
FPjg6vsBXuz8zQV4Dd3Jk3hWv3H/4sTWsgiyUFdHtz/VlE9M8NjGcE4caOgSuBqR
2ovI/HwdvdYyiZwvNN0fXrnzEn1MliSXDgAscNuxzovJXqdTP2BpUj0SVlZdVs0U
0xba5sZ5A6fh2SwKX7JQYYsEh4gudiixR+D2l5u7EUOiNyfw0DZgWi/ElpvX4ncy
QvDIIqlm29A/VkJQAdSHJc0ew+w39M7f3VNfQviLXxudGFuhrg+kXlI1UYGcX/cH
4LEjmE1KCymmq7v+7+zBrHwsVCxr5mi/CZnx+/4Y/2O+xOKJ1U7GQDXWzu/SC+aF
tEwqDCldYKjqrfdkmuGXSt2YipkNOC2EBLY32mW7rtTDDIXPSro=
=n2UE
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache resource control updates from Dave Hansen:
"These declare the resource control (rectrl) MSRs a bit more normally
and clean up an unnecessary structure member:
- Remove unnecessary arch_has_empty_bitmaps structure memory
- Move rescrtl MSR defines into msr-index.h, like normal MSRs"
* tag 'x86_cache_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Move MSR defines into msr-index.h
x86/resctrl: Remove arch_has_empty_bitmaps
for bare-metal enclaves and KVM guests to mitigate single-step
attacks
* Increase batching to speed up enclave release
* Replace kmap/kunmap_atomic() calls
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOXYkEACgkQaDWVMHDJ
krB5Og//Vn0oy0pGhda+LtHJgpa9/qPlzvoZCBxi/6SfLneadE5/g/q2KHbiCgVf
sQ6SEZ0MiVc2SrQcA6CntMO+stJIHG4LqYutygfKDoxXHGzxotzvzTmRV7Qxfhj5
LrPfl4cLWVO/jGDs0XQpOVFykKgdMcg1OjlnQYfriFiIiBkcClC7F0zYrOWAQWW0
z+4h3mlWzyAcBdxrZ9qPVqBMbM3qVKQWeE4D9K2Edfgx1lhQBmvtRdYXTplk08tV
DrfEkG5L189lrwlmbkKT5+pXSTmJqJzBoYyAGOH8n4Wb9aKLdagJErVg0ocXx8uV
ngPFU5vmaZza7EZcQheu8iRfM+zQCrcVjBImrRLyQPgCeMBX7o75axYvu4/bvPkP
3+1/JUL6/m738Fqom4wUKdeoJFw/HLGRyQ36yhZAEzH7wPv7/9Q1zpdxcypE6a+Q
B7UGQNVXV9g5Ivhe44gZIKx/3VL7AthtyCQvhwGQzzm4jX2SwnQKNXy0iKlJr2iI
LyREdYlJsRR1/wMdjnj2QqtnWPRZ5/rzl7bvWqiXa4xyvcgArrBowjMdZBttaItJ
cVK5Aj2bvR3Yc/e9GtPoLvwU5IwtoXgUe1B4DsJtoFoUq7gUGZZcEd5uAYRAk7PX
lyP2LQNxX5i150cxjlSYLLLTNmwvZQ+5PFq+V5+McKbAge8OD8g=
=bIXL
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 sgx updates from Dave Hansen:
"The biggest deal in this series is support for a new hardware feature
that allows enclaves to detect and mitigate single-stepping attacks.
There's also a minor performance tweak and a little piece of the
kmap_atomic() -> kmap_local() transition.
Summary:
- Introduce a new SGX feature (Asynchrounous Exit Notification) for
bare-metal enclaves and KVM guests to mitigate single-step attacks
- Increase batching to speed up enclave release
- Replace kmap/kunmap_atomic() calls"
* tag 'x86_sgx_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Replace kmap/kunmap_atomic() calls
KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest
x86/sgx: Allow enclaves to use Asynchrounous Exit Notification
x86/sgx: Reduce delay and interference of enclave release
- Reserve a new boot loader type for barebox which is usally used on ARM
and MIPS, but can also be utilized as EFI payload on x86 to provide
watchdog-supervised boot up.
- Consolidate the native and compat 32bit signal handling code and split
the 64bit version out into a separate source file
- Switch the ESPFIX random usage to get_random_long().
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUvMQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQmmD/9xVeaZbBInehnzbsZi4C4WyOMGUg4l
AoZC0QSzp2hFZRwpbu4Df1Zh2VN5nItAhQUvNLfdZv9/GL5VkhO+J5fPEHUbtnQ8
34TujaTHAssyib8uRFTAxxGSz3S2jPRrzUloZ71M+Whx7Fw7Fh8M/t8DmnvnaPtw
uYbBmZd9mZ0Y7BVMoXh70V0nd21PN8a8qQhYRaUD7lyb1w6Tcfzag4J1DXFfP8Lm
ovaf2AW3mgt+RmzIRNqP28weLt/VxFC38H/nZ9Jlc9npfnLTyGfwfOxE0CILfEo+
cYYVbMaIN+vs5kJQaVbvEJvk7oumLC9CvwE6oIL8J0XOs8dbBHkbZPQYW0yVF1/m
rXEd3LBSNhnZIF0aMUoJrBZAI++nGZo0izSu3eGwLZXSbWBVjlzPAqeBJQtqfQ/E
j87IisQjkWeOOSNvBas1bURWa7Gy5QFRCxbJQFfAZjIHhg+fIwxrK0HlSqxUXqK5
PRbc1LsWjUn9TspOC+mRIKrqAfetkohL7BGc+uuslH3uXiMQVAghg37+rSqvAjkn
50d8XxqOd7aC0NOVn8BfxhMf85Ge7z/0r7JJcaLcRY7/CP6S3vTCAgbSjN4+WzfN
sRu5W/m8oLuF8Q9DdgqtqiNrYezhoEKJHZsGoi/IGy6eAYjMxPX/Cl4YysdqV32N
Z55ZeEBwg9KC1g==
=AHdL
-----END PGP SIGNATURE-----
Merge tag 'x86-misc-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Thomas Gleixner:
"Updates for miscellaneous x86 areas:
- Reserve a new boot loader type for barebox which is usally used on
ARM and MIPS, but can also be utilized as EFI payload on x86 to
provide watchdog-supervised boot up.
- Consolidate the native and compat 32bit signal handling code and
split the 64bit version out into a separate source file
- Switch the ESPFIX random usage to get_random_long()"
* tag 'x86-misc-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/espfix: Use get_random_long() rather than archrandom
x86/signal/64: Move 64-bit signal code to its own file
x86/signal/32: Merge native and compat 32-bit signal code
x86/signal: Add ABI prefixes to frame setup functions
x86/signal: Merge get_sigframe()
x86: Remove __USER32_DS
signal/compat: Remove compat_sigset_t override
x86/signal: Remove sigset_t parameter from frame setup functions
x86/signal: Remove sig parameter from frame setup functions
Documentation/x86/boot: Reserve type_of_loader=13 for barebox
- Rework the handling of x86_regset for 32 and 64 bit. The original
implementation tried to minimize the allocation size with quite some
hard to understand and fragile tricks. Make it robust and straight
forward by separating the register enumerations for 32 and 64 bit
completely.
- Add a few missing static annotations
- Remove the stale unused setup_once() assembly function
- Address a few minor static analysis and kernel-doc warnings
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUu0ATHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUNzEACNn5XbRqxPQZak5XHeJ46/VNVTqTE0
Z7euwF8oP+aAybyDevvm18D2hB9Atn4vU9QJYhnTxBXbCLUNErKrH8FcXdNOBbeC
YdAX7nO5WH8IM+drCMySeK6Tv6rvhnDUtgBzdBSl4NdPXUSOnGo+jHqHfN/Q+/n0
yvbwSoVAjD01sxVZQqKQOrzDgDuR/zlISCVudfS+tR4Rm/CYj0cl+MQS9Z1VM3Z6
7pqyypd5+CyNAD6vTDY/q+ZK0ShfNnU9TIIoGmOB/pc0kLctwIu3MY76Uo2DUgGn
n/ItR9mvYu/QelCwX02VG3aRYJPLRfBa+DjQfZUwZapRz3rsjKtfa8ogpPZTLrSO
o4ht/jxlKKDyNOQKYeL2yy054JR4DkKziilEzw5GZHeH2y66XWudRuWfMwbTdrGc
esP5fSNfZ9uluYl6GCCw6S83RJzQ8aZXRcAy7CJgw2Qb4XE7IOA2jf18x5AYaDUp
4a6HCjbxYkEmKCkzkh9+w5koYruyizMBKMBBh5QsMzH4xp20s/vffHwbZ1tls9Za
eTDC/E+wW9Om3qynRynm0EmcHpa0j+RcmkHOhFcXj6SRLnhzktk4Rrr3vlhardS3
Pc8h3GnE5mFXqS8t3r6/hvMk+6svhSu3RbICiLNU72F/tVLU628ux/WoCKfXZloE
7HxWoVhkTF7eOw==
=DTBQ
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Thomas Gleixner:
"A set of x86 cleanups:
- Rework the handling of x86_regset for 32 and 64 bit.
The original implementation tried to minimize the allocation size
with quite some hard to understand and fragile tricks. Make it
robust and straight forward by separating the register enumerations
for 32 and 64 bit completely.
- Add a few missing static annotations
- Remove the stale unused setup_once() assembly function
- Address a few minor static analysis and kernel-doc warnings"
* tag 'x86-cleanups-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm/32: Remove setup_once()
x86/kaslr: Fix process_mem_region()'s return value
x86: Fix misc small issues
x86/boot: Repair kernel-doc for boot_kstrtoul()
x86: Improve formatting of user_regset arrays
x86: Separate out x86_regset for 32 and 64 bit
x86/i8259: Make default_legacy_pic static
x86/tsc: Make art_related_clocksource static
- Handle the case where x2APIC is enabled and locked by the BIOS on a
kernel with CONFIG_X86_X2APIC=n gracefully. Instead of a panic which
does not make it to the graphical console during very early boot,
simply disable the local APIC completely and boot with the PIC and very
limited functionality, which allows to diagnose the issue.
- Convert x86 APIC device tree bindings to YAML
- Extend x86 APIC device tree bindings to configure interrupt delivery
mode and handle this in during init. This allows to boot with device
tree on platforms which lack a legacy PIC.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUuYUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaTED/9D33bnJesbDVZs31HxLJc/jZED0/Do
dli0wRHWmQx9jpUmTXlKRhhIcUOjPy3Cdz44yoOH14wdJ96qUCBUj8sS9vFO4F7M
CS/eoO77GKG6oXpMvsNC5TcSaZnXAb4UYz5wCV21ZXL6P0izhOivKSqTR222jT6e
afEzQhwWhHZmrkX44F1YvMuc+HP6+swfO635vNtZhKtlA7NeKdHRijGZhrXEhNO/
Pue2xbYVMSLNaRTRtN0Mjm6UvShBLQhbmD/vXrVOCztfzhSfwq0LRC9xXcXmdWCY
XjflM+osQxIUs2WbpL1lohq5VUzTlWVNsZe4YkH5b0xMEO9HkD7apF03p03SIO4n
X37joMbrfPz9ZsmSdaN836YZd74IfQ5wnFFQTVL0BC0M4lZNeAnNcxVr3Mfio4yX
GvYahmyvxHlbWag4SYqVsy15QiNV/xZZZD6uIvBvMCfxoFKw8tBF+9/2Iy+3R+zj
n7q17Y9bLSXwh1Z/9xgwdTs+7SNCpIlZ/5nz8NpBhHaZF2BziICCv2TEKZUXmli3
HHkWM7ikj67zgFMiWLLOZpiYz/vgJEFE9nhlmXEH1RNMIfqom/JG8FN8GE1C9kYV
dmSjOE7x/CdZfJ83BRlTx5j2HfAs7RW4A7IMWPIxNdqEFmhxWnQIHasAfMrHcoIU
pAQ8u/qoduJA4A==
=dpZx
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic update from Thomas Gleixner:
"A set of changes for the x86 APIC code:
- Handle the case where x2APIC is enabled and locked by the BIOS on a
kernel with CONFIG_X86_X2APIC=n gracefully.
Instead of a panic which does not make it to the graphical console
during very early boot, simply disable the local APIC completely
and boot with the PIC and very limited functionality, which allows
to diagnose the issue
- Convert x86 APIC device tree bindings to YAML
- Extend x86 APIC device tree bindings to configure interrupt
delivery mode and handle this in during init. This allows to boot
with device tree on platforms which lack a legacy PIC"
* tag 'x86-apic-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/of: Add support for boot time interrupt delivery mode configuration
x86/of: Replace printk(KERN_LVL) with pr_lvl()
dt-bindings: x86: apic: Introduce new optional bool property for lapic
dt-bindings: x86: apic: Convert Intel's APIC bindings to YAML schema
x86/of: Remove unused early_init_dt_add_memory_arch()
x86/apic: Handle no CONFIG_X86_X2APIC on systems with x2APIC enabled by BIOS
- Core:
The bulk is the rework of the MSI subsystem to support per device MSI
interrupt domains. This solves conceptual problems of the current
PCI/MSI design which are in the way of providing support for PCI/MSI[-X]
and the upcoming PCI/IMS mechanism on the same device.
IMS (Interrupt Message Store] is a new specification which allows device
manufactures to provide implementation defined storage for MSI messages
contrary to the uniform and specification defined storage mechanisms for
PCI/MSI and PCI/MSI-X. IMS not only allows to overcome the size limitations
of the MSI-X table, but also gives the device manufacturer the freedom to
store the message in arbitrary places, even in host memory which is shared
with the device.
There have been several attempts to glue this into the current MSI code,
but after lengthy discussions it turned out that there is a fundamental
design problem in the current PCI/MSI-X implementation. This needs some
historical background.
When PCI/MSI[-X] support was added around 2003, interrupt management was
completely different from what we have today in the actively developed
architectures. Interrupt management was completely architecture specific
and while there were attempts to create common infrastructure the
commonalities were rudimentary and just providing shared data structures and
interfaces so that drivers could be written in an architecture agnostic
way.
The initial PCI/MSI[-X] support obviously plugged into this model which
resulted in some basic shared infrastructure in the PCI core code for
setting up MSI descriptors, which are a pure software construct for holding
data relevant for a particular MSI interrupt, but the actual association to
Linux interrupts was completely architecture specific. This model is still
supported today to keep museum architectures and notorious stranglers
alive.
In 2013 Intel tried to add support for hot-pluggable IO/APICs to the kernel,
which was creating yet another architecture specific mechanism and resulted
in an unholy mess on top of the existing horrors of x86 interrupt handling.
The x86 interrupt management code was already an incomprehensible maze of
indirections between the CPU vector management, interrupt remapping and the
actual IO/APIC and PCI/MSI[-X] implementation.
At roughly the same time ARM struggled with the ever growing SoC specific
extensions which were glued on top of the architected GIC interrupt
controller.
This resulted in a fundamental redesign of interrupt management and
provided the today prevailing concept of hierarchical interrupt
domains. This allowed to disentangle the interactions between x86 vector
domain and interrupt remapping and also allowed ARM to handle the zoo of
SoC specific interrupt components in a sane way.
The concept of hierarchical interrupt domains aims to encapsulate the
functionality of particular IP blocks which are involved in interrupt
delivery so that they become extensible and pluggable. The X86
encapsulation looks like this:
|--- device 1
[Vector]---[Remapping]---[PCI/MSI]--|...
|--- device N
where the remapping domain is an optional component and in case that it is
not available the PCI/MSI[-X] domains have the vector domain as their
parent. This reduced the required interaction between the domains pretty
much to the initialization phase where it is obviously required to
establish the proper parent relation ship in the components of the
hierarchy.
While in most cases the model is strictly representing the chain of IP
blocks and abstracting them so they can be plugged together to form a
hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the hardware
it's clear that the actual PCI/MSI[-X] interrupt controller is not a global
entity, but strict a per PCI device entity.
Here we took a short cut on the hierarchical model and went for the easy
solution of providing "global" PCI/MSI domains which was possible because
the PCI/MSI[-X] handling is uniform across the devices. This also allowed
to keep the existing PCI/MSI[-X] infrastructure mostly unchanged which in
turn made it simple to keep the existing architecture specific management
alive.
A similar problem was created in the ARM world with support for IP block
specific message storage. Instead of going all the way to stack a IP block
specific domain on top of the generic MSI domain this ended in a construct
which provides a "global" platform MSI domain which allows overriding the
irq_write_msi_msg() callback per allocation.
In course of the lengthy discussions we identified other abuse of the MSI
infrastructure in wireless drivers, NTB etc. where support for
implementation specific message storage was just mindlessly glued into the
existing infrastructure. Some of this just works by chance on particular
platforms but will fail in hard to diagnose ways when the driver is used
on platforms where the underlying MSI interrupt management code does not
expect the creative abuse.
Another shortcoming of today's PCI/MSI-X support is the inability to
allocate or free individual vectors after the initial enablement of
MSI-X. This results in an works by chance implementation of VFIO (PCI
pass-through) where interrupts on the host side are not set up upfront to
avoid resource exhaustion. They are expanded at run-time when the guest
actually tries to use them. The way how this is implemented is that the
host disables MSI-X and then re-enables it with a larger number of
vectors again. That works by chance because most device drivers set up
all interrupts before the device actually will utilize them. But that's
not universally true because some drivers allocate a large enough number
of vectors but do not utilize them until it's actually required,
e.g. for acceleration support. But at that point other interrupts of the
device might be in active use and the MSI-X disable/enable dance can
just result in losing interrupts and therefore hard to diagnose subtle
problems.
Last but not least the "global" PCI/MSI-X domain approach prevents to
utilize PCI/MSI[-X] and PCI/IMS on the same device due to the fact that IMS
is not longer providing a uniform storage and configuration model.
The solution to this is to implement the missing step and switch from
global PCI/MSI domains to per device PCI/MSI domains. The resulting
hierarchy then looks like this:
|--- [PCI/MSI] device 1
[Vector]---[Remapping]---|...
|--- [PCI/MSI] device N
which in turn allows to provide support for multiple domains per device:
|--- [PCI/MSI] device 1
|--- [PCI/IMS] device 1
[Vector]---[Remapping]---|...
|--- [PCI/MSI] device N
|--- [PCI/IMS] device N
This work converts the MSI and PCI/MSI core and the x86 interrupt
domains to the new model, provides new interfaces for post-enable
allocation/free of MSI-X interrupts and the base framework for PCI/IMS.
PCI/IMS has been verified with the work in progress IDXD driver.
There is work in progress to convert ARM over which will replace the
platform MSI train-wreck. The cleanup of VFIO, NTB and other creative
"solutions" are in the works as well.
- Drivers:
- Updates for the LoongArch interrupt chip drivers
- Support for MTK CIRQv2
- The usual small fixes and updates all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUsygTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYXiD/40tXKzCzf0qFIqUlZLia1N3RRrwrNC
DVTixuLtR9MrjwE+jWLQILa85SHInV8syXHSd35SzhsGDxkURFGi+HBgVWmysODf
br9VSh3Gi+kt7iXtIwAg8WNWviGNmS3kPksxCko54F0YnJhMY5r5bhQVUBQkwFG2
wES1C9Uzd4pdV2bl24Z+WKL85cSmZ+pHunyKw1n401lBABXnTF9c4f13zC14jd+y
wDxNrmOxeL3mEH4Pg6VyrDuTOURSf3TjJjeEq3EYqvUo0FyLt9I/cKX0AELcZQX7
fkRjrQQAvXNj39RJfeSkojDfllEPUHp7XSluhdBu5aIovSamdYGCDnuEoZ+l4MJ+
CojIErp3Dwj/uSaf5c7C3OaDAqH2CpOFWIcrUebShJE60hVKLEpUwd6W8juplaoT
gxyXRb1Y+BeJvO8VhMN4i7f3232+sj8wuj+HTRTTbqMhkElnin94tAx8rgwR1sgR
BiOGMJi4K2Y8s9Rqqp0Dvs01CW4guIYvSR4YY+WDbbi1xgiev89OYs6zZTJCJe4Y
NUwwpqYSyP1brmtdDdBOZLqegjQm+TwUb6oOaasFem4vT1swgawgLcDnPOx45bk5
/FWt3EmnZxMz99x9jdDn1+BCqAZsKyEbEY1avvhPVMTwoVIuSX2ceTBMLseGq+jM
03JfvdxnueM3gw==
=9erA
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt core and driver subsystem:
The bulk is the rework of the MSI subsystem to support per device MSI
interrupt domains. This solves conceptual problems of the current
PCI/MSI design which are in the way of providing support for
PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device.
IMS (Interrupt Message Store] is a new specification which allows
device manufactures to provide implementation defined storage for MSI
messages (as opposed to PCI/MSI and PCI/MSI-X that has a specified
message store which is uniform accross all devices). The PCI/MSI[-X]
uniformity allowed us to get away with "global" PCI/MSI domains.
IMS not only allows to overcome the size limitations of the MSI-X
table, but also gives the device manufacturer the freedom to store the
message in arbitrary places, even in host memory which is shared with
the device.
There have been several attempts to glue this into the current MSI
code, but after lengthy discussions it turned out that there is a
fundamental design problem in the current PCI/MSI-X implementation.
This needs some historical background.
When PCI/MSI[-X] support was added around 2003, interrupt management
was completely different from what we have today in the actively
developed architectures. Interrupt management was completely
architecture specific and while there were attempts to create common
infrastructure the commonalities were rudimentary and just providing
shared data structures and interfaces so that drivers could be written
in an architecture agnostic way.
The initial PCI/MSI[-X] support obviously plugged into this model
which resulted in some basic shared infrastructure in the PCI core
code for setting up MSI descriptors, which are a pure software
construct for holding data relevant for a particular MSI interrupt,
but the actual association to Linux interrupts was completely
architecture specific. This model is still supported today to keep
museum architectures and notorious stragglers alive.
In 2013 Intel tried to add support for hot-pluggable IO/APICs to the
kernel, which was creating yet another architecture specific mechanism
and resulted in an unholy mess on top of the existing horrors of x86
interrupt handling. The x86 interrupt management code was already an
incomprehensible maze of indirections between the CPU vector
management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X]
implementation.
At roughly the same time ARM struggled with the ever growing SoC
specific extensions which were glued on top of the architected GIC
interrupt controller.
This resulted in a fundamental redesign of interrupt management and
provided the today prevailing concept of hierarchical interrupt
domains. This allowed to disentangle the interactions between x86
vector domain and interrupt remapping and also allowed ARM to handle
the zoo of SoC specific interrupt components in a sane way.
The concept of hierarchical interrupt domains aims to encapsulate the
functionality of particular IP blocks which are involved in interrupt
delivery so that they become extensible and pluggable. The X86
encapsulation looks like this:
|--- device 1
[Vector]---[Remapping]---[PCI/MSI]--|...
|--- device N
where the remapping domain is an optional component and in case that
it is not available the PCI/MSI[-X] domains have the vector domain as
their parent. This reduced the required interaction between the
domains pretty much to the initialization phase where it is obviously
required to establish the proper parent relation ship in the
components of the hierarchy.
While in most cases the model is strictly representing the chain of IP
blocks and abstracting them so they can be plugged together to form a
hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the
hardware it's clear that the actual PCI/MSI[-X] interrupt controller
is not a global entity, but strict a per PCI device entity.
Here we took a short cut on the hierarchical model and went for the
easy solution of providing "global" PCI/MSI domains which was possible
because the PCI/MSI[-X] handling is uniform across the devices. This
also allowed to keep the existing PCI/MSI[-X] infrastructure mostly
unchanged which in turn made it simple to keep the existing
architecture specific management alive.
A similar problem was created in the ARM world with support for IP
block specific message storage. Instead of going all the way to stack
a IP block specific domain on top of the generic MSI domain this ended
in a construct which provides a "global" platform MSI domain which
allows overriding the irq_write_msi_msg() callback per allocation.
In course of the lengthy discussions we identified other abuse of the
MSI infrastructure in wireless drivers, NTB etc. where support for
implementation specific message storage was just mindlessly glued into
the existing infrastructure. Some of this just works by chance on
particular platforms but will fail in hard to diagnose ways when the
driver is used on platforms where the underlying MSI interrupt
management code does not expect the creative abuse.
Another shortcoming of today's PCI/MSI-X support is the inability to
allocate or free individual vectors after the initial enablement of
MSI-X. This results in an works by chance implementation of VFIO (PCI
pass-through) where interrupts on the host side are not set up upfront
to avoid resource exhaustion. They are expanded at run-time when the
guest actually tries to use them. The way how this is implemented is
that the host disables MSI-X and then re-enables it with a larger
number of vectors again. That works by chance because most device
drivers set up all interrupts before the device actually will utilize
them. But that's not universally true because some drivers allocate a
large enough number of vectors but do not utilize them until it's
actually required, e.g. for acceleration support. But at that point
other interrupts of the device might be in active use and the MSI-X
disable/enable dance can just result in losing interrupts and
therefore hard to diagnose subtle problems.
Last but not least the "global" PCI/MSI-X domain approach prevents to
utilize PCI/MSI[-X] and PCI/IMS on the same device due to the fact
that IMS is not longer providing a uniform storage and configuration
model.
The solution to this is to implement the missing step and switch from
global PCI/MSI domains to per device PCI/MSI domains. The resulting
hierarchy then looks like this:
|--- [PCI/MSI] device 1
[Vector]---[Remapping]---|...
|--- [PCI/MSI] device N
which in turn allows to provide support for multiple domains per
device:
|--- [PCI/MSI] device 1
|--- [PCI/IMS] device 1
[Vector]---[Remapping]---|...
|--- [PCI/MSI] device N
|--- [PCI/IMS] device N
This work converts the MSI and PCI/MSI core and the x86 interrupt
domains to the new model, provides new interfaces for post-enable
allocation/free of MSI-X interrupts and the base framework for
PCI/IMS. PCI/IMS has been verified with the work in progress IDXD
driver.
There is work in progress to convert ARM over which will replace the
platform MSI train-wreck. The cleanup of VFIO, NTB and other creative
"solutions" are in the works as well.
Drivers:
- Updates for the LoongArch interrupt chip drivers
- Support for MTK CIRQv2
- The usual small fixes and updates all over the place"
* tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (134 commits)
irqchip/ti-sci-inta: Fix kernel doc
irqchip/gic-v2m: Mark a few functions __init
irqchip/gic-v2m: Include arm-gic-common.h
irqchip/irq-mvebu-icu: Fix works by chance pointer assignment
iommu/amd: Enable PCI/IMS
iommu/vt-d: Enable PCI/IMS
x86/apic/msi: Enable PCI/IMS
PCI/MSI: Provide pci_ims_alloc/free_irq()
PCI/MSI: Provide IMS (Interrupt Message Store) support
genirq/msi: Provide constants for PCI/IMS support
x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN
PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X
PCI/MSI: Provide prepare_desc() MSI domain op
PCI/MSI: Split MSI-X descriptor setup
genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN
genirq/msi: Provide msi_domain_alloc_irq_at()
genirq/msi: Provide msi_domain_ops:: Prepare_desc()
genirq/msi: Provide msi_desc:: Msi_data
genirq/msi: Provide struct msi_map
x86/apic/msi: Remove arch_create_remap_msi_irq_domain()
...
- Remove a superfluous noinline which prevents GCC-7.3 to optimize a stub
function away.
- Allow uprobes on REP NOP and do not treat them like word-sized branch
instructions.
- Make the VDSO symbol export of __vdso_sgx_enter_enclave() depend on
CONFIG_X86_SGX to prevent build fails with newer LLVM versions which
rightfully detect that there is no function behind the symbol.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOW+sQTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWH5EACPYcRw9PNBLMC6L0MF5G0qCFmLcjqn
Fe8LxLywsKdyT6f1aAcOetIqkwDN/fuUyJHcioKqyqSkNlNeRV2hoZ9OlsBGJ7zC
6HH41ZCrY39liKzMM2JmfxU6XxT74zEt3Fly4G127d78HBi9DYwk8fT6GY8/BOk6
wkeWuczqRY1NNek1SBIciBn/FMZU8UShqjKzQsS1Bpj2Dm2ZvHdVh+P2okp2wl9Z
gMbFN0Jq+8jRWOb4BF0Hx2Fg+WjXZPhT8msDXh8Vnr0u7bchWCljbLvvFST2hfpo
+u/uKeOgOHm0XfUBOQa2WpEpev4M3ve1WFSkmP/0Qe3tcaRabMRDXGezZJSAdf1K
dZV0tQu+4rygzZwEf4ppskxejG7LSvyzrLdebPvzUYFT14C5E22jRxp1+Mpswq28
ZPiw6yc3XXUqboNV3JVNs3PDPBVucSCHfQfUNEfjUayaMhb4w5jQyy93WIffOzVU
0KnXe9XX0MA3e5zVJMXExW4907Iks/K+qNgXtx/8fJnqaECIJInxZfbPmj74ZpfT
6b0sJVt04eFX4uYKoLPpFoP9LFUvzU5eR7e7yuoiSGFh3D3p9bimyR5xhBxNqs8Y
j7XL2i0jY95w6v1kK3Kmgr2L+JCAN2v/JFJ+eIOYQAIb/VkhTfNq/MHL33bDJ1X3
2IrBEgo5tk7VNw==
=oJ/K
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Three small x86 fixes which did not make it into 6.1:
- Remove a superfluous noinline which prevents GCC-7.3 to optimize a
stub function away
- Allow uprobes on REP NOP and do not treat them like word-sized
branch instructions
- Make the VDSO symbol export of __vdso_sgx_enter_enclave() depend on
CONFIG_X86_SGX to prevent build failures with newer LLVM versions
which rightfully detect that there is no function behind the
symbol"
* tag 'x86-urgent-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Conditionally export __vdso_sgx_enter_enclave()
uprobes/x86: Allow to probe a NOP instruction with 0x66 prefix
x86/alternative: Remove noinline from __ibt_endbr_seal[_end]() stubs
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmORzR4THHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXkqCCACFwHz04iepLE7R8ZZ6BVUhD6uzfzDo
s1j7ozOUGUe3vI6q0DElHWVQZgzIzLypVsfWkZToe6jeOU6R48b0tZSFyJCUNwGM
ogmS7N8fBdHfY9SBFoUPoziBifXpf3kq4hhX/w+1Lge9CN5Ywc4KjuJb91EAInbs
lm47O4KQY8w8A7BbPBHYBueUVWLvgwPRPOS032zqxN1787m2tCxpqkfnImK39kh6
IsBBIZfYsok0H5wldhZXnsARpEOeFF6BoFBXpFPlmnbv2VcK2AfZgTYdA3ESyEgd
NyOFDfh6BO07gTR1xCH6gvOpkHwx6xKAkjE36RymdhXS6fhRCRsfahVB
=m78g
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20221208' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Drop unregister syscore from hyperv_cleanup to avoid hang (Gaurav
Kohli)
- Clean up panic path for Hyper-V framebuffer (Guilherme G. Piccoli)
- Allow IRQ remapping to work without x2apic (Nuno Das Neves)
- Fix comments (Olaf Hering)
- Expand hv_vp_assist_page definition (Saurabh Sengar)
- Improvement to page reporting (Shradha Gupta)
- Make sure TSC clocksource works when Linux runs as the root partition
(Stanislav Kinsburskiy)
* tag 'hyperv-next-signed-20221208' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Remove unregister syscore call from Hyper-V cleanup
iommu/hyper-v: Allow hyperv irq remapping without x2apic
clocksource: hyper-v: Add TSC page support for root partition
clocksource: hyper-v: Use TSC PFN getter to map vvar page
clocksource: hyper-v: Introduce TSC PFN getter
clocksource: hyper-v: Introduce a pointer to TSC page
x86/hyperv: Expand definition of struct hv_vp_assist_page
PCI: hv: update comment in x86 specific hv_arch_irq_unmask
hv: fix comment typo in vmbus_channel/low_latency
drivers: hv, hyperv_fb: Untangle and refactor Hyper-V panic notifiers
video: hyperv_fb: Avoid taking busy spinlock on panic path
hv_balloon: Add support for configurable order free page reporting
mm/page_reporting: Add checks for page_reporting_order param
These messages:
clipped [mem size 0x00000000 64bit] to [mem size 0xfffffffffffa0000 64bit] for e820 entry [mem 0x0009f000-0x000fffff]
aren't as useful as they could be because (a) the resource is often
IORESOURCE_UNSET, so we print the size instead of the start/end and (b) we
print the available resource even if it is empty after removing the E820
entry.
Print the available space by hand to avoid the IORESOURCE_UNSET problem and
only if it's non-empty. No functional change intended.
Link: https://lore.kernel.org/r/20221208190341.1560157-4-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
After someone reported a bug report with a failed modification due to the
expected value not matching what was found, it came to my attention that
the ftrace_expected is no longer set when that happens. This makes for
debugging the issue a bit more difficult.
Set ftrace_expected to the expected code before calling ftrace_bug, so
that it shows what was expected and why it failed.
Link: https://lore.kernel.org/all/CA+wXwBQ-VhK+hpBtYtyZP-NiX4g8fqRRWithFOHQW-0coQ3vLg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221209105247.01d4e51d@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "x86@kernel.org" <x86@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 768ae4406a ("x86/ftrace: Use text_poke()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Enable IMS in the domain init and allocation mapping code, but do not
enable it on the vector domain as discussed in various threads on LKML.
The interrupt remap domains can expand this setting like they do with
PCI multi MSI.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232327.022658817@linutronix.de
and related code which is not longer required now that the interrupt remap
code has been converted to MSI parent domains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.267353814@linutronix.de
Remove the global PCI/MSI irqdomain implementation and provide the required
MSI parent ops so the PCI/MSI code can detect the new parent and setup per
device domains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.209212272@linutronix.de
Remove the global PCI/MSI irqdomain implementation and provide the required
MSI parent ops so the PCI/MSI code can detect the new parent and setup per
device domains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.151226317@linutronix.de
Enable MSI parent domain support in the x86 vector domain and fixup the
checks in the iommu implementations to check whether device::msi::domain is
the default MSI parent domain. That keeps the existing logic to protect
e.g. devices behind VMD working.
The interrupt remap PCI/MSI code still works because the underlying vector
domain still provides the same functionality.
None of the other x86 PCI/MSI, e.g. XEN and HyperV, implementations are
affected either. They still work the same way both at the low level and the
PCI/MSI implementations they provide.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124232326.034672592@linutronix.de
The retries in load_ucode_intel_ap() were in place to support systems
with mixed steppings. Mixed steppings are no longer supported and there is
only one microcode image at a time. Any retries will simply reattempt to
apply the same image over and over without making progress.
[ bp: Zap the circumstantial reasoning from the commit message. ]
Fixes: 06b8534cb7 ("x86/microcode: Rework microcode loading")
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20221129210832.107850-3-ashok.raj@intel.com
It's truly a MSI only flag and for the upcoming per device MSI domains this
must be in the MSI flags so it can be set during domain setup without
exposing this quirk outside of x86.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221124230313.454246167@linutronix.de
Intel ICC -hotpatch inserts 2-byte "0x66 0x90" NOP at the start of each
function to reserve extra space for hot-patching, and currently it is not
possible to probe these functions because branch_setup_xol_ops() wrongly
rejects NOP with REP prefix as it treats them like word-sized branch
instructions.
Fixes: 250bbd12c2 ("uprobes/x86: Refuse to attach uprobe to "word-sized" branch insns")
Reported-by: Seiji Nishikawa <snishika@redhat.com>
Suggested-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20221204173933.GA31544@redhat.com
Instead of just saying "Disabled" when MTRRs are disabled for any
reason, tell what is disabled and why.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221205080433.16643-3-jgross@suse.com
collect_cpu_info() is used to collect the current microcode revision and
processor flags on every CPU.
It had a weird mechanism to try to mimick a "once" functionality in the
sense that, that information should be issued only when it is differing
from the previous CPU.
However (1):
the new calling sequence started doing that in parallel:
microcode_init()
|-> schedule_on_each_cpu(setup_online_cpu)
|-> collect_cpu_info()
resulting in multiple redundant prints:
microcode: sig=0x50654, pf=0x80, revision=0x2006e05
microcode: sig=0x50654, pf=0x80, revision=0x2006e05
microcode: sig=0x50654, pf=0x80, revision=0x2006e05
However (2):
dumping this here is not that important because the kernel does not
support mixed silicon steppings microcode. Finally!
Besides, there is already a pr_info() in microcode_reload_late() that
shows both the old and new revisions.
What is more, the CPU signature (sig=0x50654) and Processor Flags
(pf=0x80) above aren't that useful to the end user, they are available
via /proc/cpuinfo and they don't change anyway.
Remove the redundant pr_info().
[ bp: Heavily massage. ]
Fixes: b6f86689d5 ("x86/microcode: Rip out the subsys interface gunk")
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20221103175901.164783-2-ashok.raj@intel.com
The "force" argument to write_spec_ctrl_current() is currently ambiguous
as it does not guarantee the MSR write. This is due to the optimization
that writes to the MSR happen only when the new value differs from the
cached value.
This is fine in most cases, but breaks for S3 resume when the cached MSR
value gets out of sync with the hardware MSR value due to S3 resetting
it.
When x86_spec_ctrl_current is same as x86_spec_ctrl_base, the MSR write
is skipped. Which results in SPEC_CTRL mitigations not getting restored.
Move the MSR write from write_spec_ctrl_current() to a new function that
unconditionally writes to the MSR. Update the callers accordingly and
rename functions.
[ bp: Rework a bit. ]
Fixes: caa0ff24d5 ("x86/bugs: Keep a per-CPU IA32_SPEC_CTRL value")
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/806d39b0bfec2fe8f50dc5446dff20f5bb24a959.1669821572.git.pawan.kumar.gupta@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmap_local_page() is the preferred way to create temporary mappings when it
is feasible, because the mappings are thread-local and CPU-local.
kmap_local_page() uses per-task maps rather than per-CPU maps. This in
effect removes the need to disable preemption on the local CPU while the
mapping is active, and thus vastly reduces overall system latency. It is
also valid to take pagefaults within the mapped region.
The use of kmap_atomic() in the SGX code was not an explicit design choice
to disable page faults or preemption, and there is no compelling design
reason to using kmap_atomic() vs. kmap_local_page().
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/linux-sgx/Y0biN3%2FJsZMa0yUr@kernel.org/
Link: https://lore.kernel.org/r/20221115161627.4169428-1-kristen@linux.intel.com
Presently, init/boot time interrupt delivery mode is enumerated only for
ACPI enabled systems by parsing MADT table or for older systems by parsing
MP table. But for OF based x86 systems, it is assumed & hardcoded to be
legacy PIC mode. This causes a boot time crash for platforms which do not
provide a 8259 compliant legacy PIC.
Add support for configuration of init time interrupt delivery mode for x86
OF based systems by introducing a new optional boolean property
'intel,virtual-wire-mode' for the local APIC interrupt-controller
node. This property emulates IMCRP Bit 7 of MP feature info byte 2 of MP
floating pointer structure.
Defaults to legacy PIC mode if absent. Configures it to virtual wire
compatibility mode if present.
Signed-off-by: Rahul Tanwar <rtanwar@maxlinear.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20221124084143.21841-5-rtanwar@maxlinear.com
Use pr_lvl() instead of the deprecated printk(KERN_LVL).
Just a upgrade of print utilities usage. no functional changes.
Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rahul Tanwar <rtanwar@maxlinear.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20221124084143.21841-4-rtanwar@maxlinear.com
Recently objtool started complaining about dead code in the object files,
in particular
vmlinux.o: warning: objtool: early_init_dt_scan_memory+0x191: unreachable instruction
when CONFIG_OF=y.
Indeed, early_init_dt_scan() is not used on x86 and making it compile (with
help of CONFIG_OF) will abrupt the code flow since in the middle of it
there is a BUG() instruction.
Remove the pointless function.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221124184824.9548-1-andriy.shevchenko@linux.intel.com
A kernel that was compiled without CONFIG_X86_X2APIC was unable to boot on
platforms that have x2APIC already enabled in the BIOS before starting the
kernel.
The kernel was supposed to panic with an approprite error message in
validate_x2apic() due to the missing X2APIC support.
However, validate_x2apic() was run too late in the boot cycle, and the
kernel tried to initialize the APIC nonetheless. This resulted in an
earlier panic in setup_local_APIC() because the APIC was not registered.
In my experiments, a panic message in setup_local_APIC() was not visible
in the graphical console, which resulted in a hang with no indication
what has gone wrong.
Instead of calling panic(), disable the APIC, which results in a somewhat
working system with the PIC only (and no SMP). This way the user is able to
diagnose the problem more easily.
Disabling X2APIC mode is not an option because it's impossible on systems
with locked x2APIC.
The proper place to disable the APIC in this case is in check_x2apic(),
which is called early from setup_arch(). Doing this in
__apic_intr_mode_select() is too late.
Make check_x2apic() unconditionally available and remove the empty stub.
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reported-by: Robert Elliott (Servers) <elliott@hpe.com>
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/d573ba1c-0dc4-3016-712a-cc23a8a33d42@molgen.mpg.de
Link: https://lore.kernel.org/lkml/20220911084711.13694-3-mat.jonczyk@o2.pl
Link: https://lore.kernel.org/all/20221129215008.7247-1-mat.jonczyk@o2.pl
Due to the explicit 'noinline' GCC-7.3 is not able to optimize away the
argument setup of:
apply_ibt_endbr(__ibt_endbr_seal, __ibt_enbr_seal_end);
even when X86_KERNEL_IBT=n and the function is an empty stub, which leads
to link errors due to missing __ibt_endbr_seal* symbols:
ld: arch/x86/kernel/alternative.o: in function `alternative_instructions':
alternative.c:(.init.text+0x15d): undefined reference to `__ibt_endbr_seal_end'
ld: alternative.c:(.init.text+0x164): undefined reference to `__ibt_endbr_seal'
Remove the explicit 'noinline' to help gcc optimize them away.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221011113803.956808-1-linmiaohe@huawei.com
If x2apic is not available, hyperv-iommu skips remapping
irqs. This breaks root partition which always needs irqs
remapped.
Fix this by allowing irq remapping regardless of x2apic,
and change hyperv_enable_irq_remapping() to return
IRQ_REMAP_XAPIC_MODE in case x2apic is missing.
Tested with root and non-root hyperv partitions.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1668715899-8971-1-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
msr-index.h should contain all MSRs for easier grepping for MSR numbers
when dealing with unchecked MSR access warnings, for example.
Move the resctrl ones. Prefix IA32_PQR_ASSOC with "MSR_" while at it.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221106212923.20699-1-bp@alien8.de
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When running as a Xen PV guest there is no need for setting up the
realmode trampoline, as realmode isn't supported in this environment.
Trying to setup the trampoline has been proven to be problematic in
some cases, especially when trying to debug early boot problems with
Xen requiring to keep the EFI boot-services memory mapped (some
firmware variants seem to claim basically all memory below 1Mb for boot
services).
Introduce new x86_platform_ops operations for that purpose, which can
be set to a NOP by the Xen PV specific kernel boot code.
[ bp: s/call_init_real_mode/do_init_real_mode/ ]
Fixes: 084ee1c641 ("x86, realmode: Relocator for realmode code")
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221123114523.3467-1-jgross@suse.com
The devnode() in struct class should not be modifying the device that is
passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Justin Sanders <justin@coraid.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Cc: Liam Mark <lmark@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Brian Starkey <Brian.Starkey@arm.com>
Cc: John Stultz <jstultz@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sean Young <sean@mess.org>
Cc: Frank Haverkamp <haver@linux.ibm.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Xie Yongji <xieyongji@bytedance.com>
Cc: Gautam Dawar <gautam.dawar@xilinx.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Eli Cohen <elic@nvidia.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: alsa-devel@alsa-project.org
Cc: dri-devel@lists.freedesktop.org
Cc: kvm@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-block@vger.kernel.org
Cc: linux-input@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Link: https://lore.kernel.org/r/20221123122523.1332370-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are some paravirt assembler functions which are sharing a common
pattern. Introduce a macro DEFINE_PARAVIRT_ASM() for creating them.
Note that this macro is including explicit alignment of the generated
functions, leading to __raw_callee_save___kvm_vcpu_is_preempted(),
_paravirt_nop() and paravirt_ret0() to be aligned at 4 byte boundaries
now.
The explicit _paravirt_nop() prototype in paravirt.c isn't needed, as
it is included in paravirt_types.h already.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Link: https://lkml.kernel.org/r/20221109134418.6516-1-jgross@suse.com
WG14 N2350 specifies that it is an undefined behavior to have type
definitions within offsetof", see
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2350.htm
This specification is also part of C23.
Therefore, replace the TYPE_ALIGN macro with the _Alignof builtin to
avoid undefined behavior. (_Alignof itself is C11 and the kernel is
built with -gnu11).
ISO C11 _Alignof is subtly different from the GNU C extension
__alignof__. Latter is the preferred alignment and _Alignof the
minimal alignment. For long long on x86 these are 8 and 4
respectively.
The macro TYPE_ALIGN's behavior matches _Alignof rather than
__alignof__.
[ bp: Massage commit message. ]
Signed-off-by: YingChi Long <me@inclyc.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20220925153151.2467884-1-me@inclyc.cn
Convert the remaining cases of static_cpu_has(X86_FEATURE_XENPV) and
boot_cpu_has(X86_FEATURE_XENPV) to use cpu_feature_enabled(), allowing
more efficient code in case the kernel is configured without
CONFIG_XEN_PV.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20221104072701.20283-6-jgross@suse.com
alternatives_smp_module_add() restricts patching of SMP lock prefixes to
the text address range passed as an argument.
For vmlinux, patching all the instructions located between the _text and
_etext symbols is allowed. That includes the .text section but also
other sections such as .text.hot and .text.unlikely.
As per the comment inside the 'struct smp_alt_module' definition, the
original purpose of this restriction is to avoid patching the init code
because in the case when one boots with a single CPU, the LOCK prefixes
to the locking primitives are removed.
Later on, when other CPUs are onlined, those LOCK prefixes get added
back in but by that time the .init code is very likely removed so
patching that would be a bad idea.
For modules, the current code only allows patching instructions located
inside the .text segment, excluding other sections such as .text.hot or
.text.unlikely, which may need patching.
Make patching of the kernel core and modules more consistent by
allowing all text sections of modules except .init.text to be patched in
module_finalize().
For that, use mod->core_layout.base/mod->core_layout.text_size as the
address range allowed to be patched, which include all the code sections
except the init code.
[ bp: Massage and expand commit message. ]
Signed-off-by: Julian Pidancet <julian.pidancet@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221027204906.511277-1-julian.pidancet@oracle.com
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmN6wAgeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG0EYH/3/RO90NbrFItraN
Lzr+d3VdbGjTu8xd1M+PRTmwh3zxLpB+Jwqr0T0A2gzL9B/D+AUPUJdrCVbv9DqS
FLJAVqoeV20dNBAHSffOOLPsgCZ+Eu+LzlNN7Iqde0e8cyZICFMNktitui84Xm/i
1NgFVgz9OZ6+aieYvUj3FrFq0p8GTIaC/oybDZrxYKcO8ZzKVMJ11swRw10wwq0g
qOOECvV3w7wlQ8upQZkzFxItKFc7EexZI6R4elXeGSJJ9Hlc092dv/zsKB9dwV+k
WcwkJrZRoezYXzgGBFxUcQtzi+ethjrPjuJuM1rYLUSIcfIW/0lkaSLgRoBu8D+I
1GfXkXs=
=gt6P
-----END PGP SIGNATURE-----
Merge tag 'v6.1-rc6' into x86/core, to resolve conflicts
Resolve conflicts between these commits in arch/x86/kernel/asm-offsets.c:
# upstream:
debc5a1ec0 ("KVM: x86: use a separate asm-offsets.c file")
# retbleed work in x86/core:
5d8213864a ("x86/retbleed: Add SKL return thunk")
... and these commits in include/linux/bpf.h:
# upstram:
18acb7fac2 ("bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop")")
# x86/core commits:
931ab63664 ("x86/ibt: Implement FineIBT")
bea75b3389 ("x86/Kconfig: Introduce function padding")
The latter two modify BPF_DISPATCHER_ATTRIBUTES(), which was removed upstream.
Conflicts:
arch/x86/kernel/asm-offsets.c
include/linux/bpf.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Support for the TSX control MSR is enumerated in MSR_IA32_ARCH_CAPABILITIES.
This is different from how other CPU features are enumerated i.e. via
CPUID. Currently, a call to tsx_ctrl_is_supported() is required for
enumerating the feature. In the absence of a feature bit for TSX control,
any code that relies on checking feature bits directly will not work.
In preparation for adding a feature bit check in MSR save/restore
during suspend/resume, set a new feature bit X86_FEATURE_TSX_CTRL when
MSR_IA32_TSX_CTRL is present. Also make tsx_ctrl_is_supported() use the
new feature bit to avoid any overhead of reading the MSR.
[ bp: Remove tsx_ctrl_is_supported(), add room for two more feature
bits in word 11 which are coming up in the next merge window. ]
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/de619764e1d98afbb7a5fa58424f1278ede37b45.1668539735.git.pawan.kumar.gupta@linux.intel.com
fpregs lock disables preemption on RT but fpu_inherit_perms() does
spin_lock_irq(), which, on RT, uses rtmutexes and they need to be
preemptible.
- Check the page offset and the length of the data supplied by userspace
for overflow when specifying a set of pages to add to an SGX enclave
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmN6GjgACgkQEsHwGGHe
VUpzog/+OIX3ZAZ0EJqg9GgvhacPjww1oPr+DRcpXCFYjk1jTJ3seJc2we+uun0j
zYHbgO6BYyP3LdlrSjt8MgosMZGz1s14r9TXc46T8IhvUu0imbUkO9vLcxwL6pJl
LJgPIYvBu6IUoVIQVlVr7PrVvUj8nUPc3w/8qmjR91bJAWTeeFvFflvn713jlWBP
hLKiUvhdjA08Sp9gjF2drGl+NkSXPPLPHQetKa4BhVYqwDK5hRGBOt51CuDHdUOQ
QYaP5JRy435ZsoFGgYq0lOxCXIYDe8rWRBCnDWdi7kjXEYhnKJLj6Fi1SxjD+cZC
wDX+LQGFiShJFonGzxbeORBU04Owbz+nLsSeHCQsl/70kAv/W/44BLj+BPl0dit1
XBTUUCr9Wi9VdDTBVJT+EQbD3F5dBn1TO00Z0qzhv3D3gVruUNmv7SDHMoRUyYcy
9LueWCzF9YV1Se6V9gUox9vwTuc09J63IS2zkMm2ahCbfmWTSsx9P5BWLFK3E3Em
lPsdZWNJQ7F6f0B3AfRjTDXvaMyzBRYfuZHEaBMq5avDWDFBCyOhc3PqjpKt5wHS
URP6M/kOtz1zg8fy/XmMRCfCDBoAm+NfvF4zG9md1GYta7aP74Z824M+FMoXNv7f
YcR4mCzpeeiG0hXyywcL+QDpmjlsYCPhe24Gnh/Bb+1g7Huyyc8=
=VQD4
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Do not hold fpregs lock when inheriting FPU permissions because the
fpregs lock disables preemption on RT but fpu_inherit_perms() does
spin_lock_irq(), which, on RT, uses rtmutexes and they need to be
preemptible.
- Check the page offset and the length of the data supplied by
userspace for overflow when specifying a set of pages to add to an
SGX enclave
* tag 'x86_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Drop fpregs lock before inheriting FPU permissions
x86/sgx: Add overflow check in sgx_validate_offset_length()
IFS test images and microcode blobs use the same header format.
Microcode blobs use header type of 1, whereas IFS test images
will use header type of 2.
In preparation for IFS reusing intel_microcode_sanity_check(),
add header type as a parameter for sanity check.
[ bp: Touchups. ]
Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20221117035935.4136738-9-jithu.joseph@intel.com
IFS test image carries the same microcode header as regular Intel
microcode blobs.
Reuse microcode_sanity_check() in the IFS driver to perform sanity check
of the IFS test images too.
Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20221117035935.4136738-8-jithu.joseph@intel.com
The data type of the @print_err parameter used by microcode_sanity_check()
is int. In preparation for exporting this function to be used by
the IFS driver convert it to a more appropriate bool type for readability.
No functional change intended.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20221117035935.4136738-7-jithu.joseph@intel.com
IFS uses test images provided by Intel that can be regarded as firmware.
An IFS test image carries microcode header with an extended signature
table.
Reuse find_matching_signature() for verifying if the test image header
or the extended signature table indicate whether that image is fit to
run on a system.
No functional changes.
Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20221117035935.4136738-6-jithu.joseph@intel.com
The EFI fake memmap support is specific to x86, which manipulates the
EFI memory map in various different ways after receiving it from the EFI
stub. On other architectures, we have managed to push back on this, and
the EFI memory map is kept pristine.
So let's move the fake memmap code into the x86 arch tree, where it
arguably belongs.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
This has nothing to do with random.c and everything to do with stack
protectors. Yes, it uses randomness. But many things use randomness.
random.h and random.c are concerned with the generation of randomness,
not with each and every use. So move this function into the more
specific stackprotector.h file where it belongs.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
These cases were done with this Coccinelle:
@@
expression H;
expression L;
@@
- (get_random_u32_below(H) + L)
+ get_random_u32_inclusive(L, H + L - 1)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- + E
- - E
)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- - E
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- - E
+ F
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- + E
+ F
- - E
)
And then subsequently cleaned up by hand, with several automatic cases
rejected if it didn't make sense contextually.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Now that the PCI/MSI core code does early checking for multi-MSI support
X86_IRQ_ALLOC_CONTIGUOUS_VECTORS is not required anymore.
Remove the flag and rely on MSI_FLAG_MULTI_PCI_MSI.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20221111122015.865042356@linutronix.de
The hardware XRSTOR instruction resets the PKRU register to its hardware
init value (namely 0) if the PKRU bit is not set in the xfeatures mask.
Emulating that here restores the pre-5.14 behavior for PTRACE_SET_REGSET
with NT_X86_XSTATE, and makes sigreturn (which still uses XRSTOR) and
ptrace behave identically. KVM has never used XRSTOR and never had this
behavior, so KVM opts-out of this emulation by passing a NULL pkru pointer
to copy_uabi_to_xstate().
Fixes: e84ba47e31 ("x86/fpu: Hook up PKRU into ptrace()")
Signed-off-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221115230932.7126-6-khuey%40kylehuey.com
Move KVM's PKRU handling code in fpu_copy_uabi_to_guest_fpstate() to
copy_uabi_to_xstate() so that it is shared with other APIs that write the
XSTATE such as PTRACE_SETREGSET with NT_X86_XSTATE.
This restores the pre-5.14 behavior of ptrace. The regression can be seen
by running gdb and executing `p $pkru`, `set $pkru = 42`, and `p $pkru`.
On affected kernels (5.14+) the write to the PKRU register (which gdb
performs through ptrace) is ignored.
[ dhansen: removed stable@ tag for now. The ABI was broken for long
enough that this is not urgent material. Let's let it stew
in tip for a few weeks before it's submitted to stable
because there are so many ABIs potentially affected. ]
Fixes: e84ba47e31 ("x86/fpu: Hook up PKRU into ptrace()")
Signed-off-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221115230932.7126-5-khuey%40kylehuey.com
In preparation for moving PKRU handling code out of
fpu_copy_uabi_to_guest_fpstate() and into copy_uabi_to_xstate(), add an
argument that copy_uabi_from_kernel_to_xstate() can use to pass the
canonical location of the PKRU value. For
copy_sigframe_from_user_to_xstate() the kernel will actually restore the
PKRU value from the fpstate, but pass in the thread_struct's pkru location
anyways for consistency.
Signed-off-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221115230932.7126-4-khuey%40kylehuey.com
Both KVM (through KVM_SET_XSTATE) and ptrace (through PTRACE_SETREGSET
with NT_X86_XSTATE) ultimately call copy_uabi_from_kernel_to_xstate(),
but the canonical locations for the current PKRU value for KVM guests
and processes in a ptrace stop are different (in the kvm_vcpu_arch and
the thread_state structs respectively).
In preparation for eventually handling PKRU in
copy_uabi_to_xstate, pass in a pointer to the PKRU location.
Signed-off-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221115230932.7126-3-khuey%40kylehuey.com
This will allow copy_sigframe_from_user_to_xstate() to grab the address of
thread_struct's pkru value in a later patch.
Signed-off-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221115230932.7126-2-khuey%40kylehuey.com
DE_CFG contains the LFENCE serializing bit, restore it on resume too.
This is relevant to older families due to the way how they do S3.
Unify and correct naming while at it.
Fixes: e4d0e84e49 ("x86/cpu/AMD: Make LFENCE a serializing instruction")
Reported-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Reported-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
introduced post-6.0 or which aren't considered serious enough to justify a
-stable backport.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY27xPAAKCRDdBJ7gKXxA
juFXAP4tSmfNDrT6khFhV0l4cS43bluErVNLh32RfXBqse8GYgEA5EPvZkOssLqY
86ejRXFgAArxYC4caiNURUQL+IASvQo=
=YVOx
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"22 hotfixes.
Eight are cc:stable and the remainder address issues which were
introduced post-6.0 or which aren't considered serious enough to
justify a -stable backport"
* tag 'mm-hotfixes-stable-2022-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
docs: kmsan: fix formatting of "Example report"
mm/damon/dbgfs: check if rm_contexts input is for a real context
maple_tree: don't set a new maximum on the node when not reusing nodes
maple_tree: fix depth tracking in maple_state
arch/x86/mm/hugetlbpage.c: pud_huge() returns 0 when using 2-level paging
fs: fix leaked psi pressure state
nilfs2: fix use-after-free bug of ns_writer on remount
x86/traps: avoid KMSAN bugs originating from handle_bug()
kmsan: make sure PREEMPT_RT is off
Kconfig.debug: ensure early check for KMSAN in CONFIG_KMSAN_WARN
x86/uaccess: instrument copy_from_user_nmi()
kmsan: core: kmsan_in_runtime() should return true in NMI context
mm: hugetlb_vmemmap: include missing linux/moduleparam.h
mm/shmem: use page_mapping() to detect page cache for uffd continue
mm/memremap.c: map FS_DAX device memory as decrypted
Partly revert "mm/thp: carry over dirty bit when thp splits on pmd"
nilfs2: fix deadlock in nilfs_count_free_blocks()
mm/mmap: fix memory leak in mmap_region()
hugetlbfs: don't delete error page from pagecache
maple_tree: reorganize testing to restore module testing
...
On all recent Centaur platforms, ARB_DISABLE is handled by PMU
automatically while entering C3 type state. No need for OS to
issue the ARB_DISABLE, so set bm_control to zero to indicate that.
Signed-off-by: Tony W Wang-oc <TonyWWang-oc@zhaoxin.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/1667792089-4904-1-git-send-email-TonyWWang-oc%40zhaoxin.com
Commit b041b525da ("x86/split_lock: Make life miserable for split lockers")
changed the way the split lock detector works when in "warn" mode;
basically, it not only shows the warn message, but also intentionally
introduces a slowdown through sleeping plus serialization mechanism
on such task. Based on discussions in [0], seems the warning alone
wasn't enough motivation for userspace developers to fix their
applications.
This slowdown is enough to totally break some proprietary (aka.
unfixable) userspace[1].
Happens that originally the proposal in [0] was to add a new mode
which would warns + slowdown the "split locking" task, keeping the
old warn mode untouched. In the end, that idea was discarded and
the regular/default "warn" mode now slows down the applications. This
is quite aggressive with regards proprietary/legacy programs that
basically are unable to properly run in kernel with this change.
While it is understandable that a malicious application could DoS
by split locking, it seems unacceptable to regress old/proprietary
userspace programs through a default configuration that previously
worked. An example of such breakage was reported in [1].
Add a sysctl to allow controlling the "misery mode" behavior, as per
Thomas suggestion on [2]. This way, users running legacy and/or
proprietary software are allowed to still execute them with a decent
performance while still observing the warning messages on kernel log.
[0] https://lore.kernel.org/lkml/20220217012721.9694-1-tony.luck@intel.com/
[1] https://github.com/doitsujin/dxvk/issues/2938
[2] https://lore.kernel.org/lkml/87pmf4bter.ffs@tglx/
[ dhansen: minor changelog tweaks, including clarifying the actual
problem ]
Fixes: b041b525da ("x86/split_lock: Make life miserable for split lockers")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Andre Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/all/20221024200254.635256-1-gpiccoli%40igalia.com
Mike Galbraith reported the following against an old fork of preempt-rt
but the same issue also applies to the current preempt-rt tree.
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: systemd
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
Preemption disabled at:
fpu_clone
CPU: 6 PID: 1 Comm: systemd Tainted: G E (unreleased)
Call Trace:
<TASK>
dump_stack_lvl
? fpu_clone
__might_resched
rt_spin_lock
fpu_clone
? copy_thread
? copy_process
? shmem_alloc_inode
? kmem_cache_alloc
? kernel_clone
? __do_sys_clone
? do_syscall_64
? __x64_sys_rt_sigprocmask
? syscall_exit_to_user_mode
? do_syscall_64
? syscall_exit_to_user_mode
? do_syscall_64
? syscall_exit_to_user_mode
? do_syscall_64
? exc_page_fault
? entry_SYSCALL_64_after_hwframe
</TASK>
Mike says:
The splat comes from fpu_inherit_perms() being called under fpregs_lock(),
and us reaching the spin_lock_irq() therein due to fpu_state_size_dynamic()
returning true despite static key __fpu_state_size_dynamic having never
been enabled.
Mike's assessment looks correct. fpregs_lock on a PREEMPT_RT kernel disables
preemption so calling spin_lock_irq() in fpu_inherit_perms() is unsafe. This
problem exists since commit
9e798e9aa1 ("x86/fpu: Prepare fpu_clone() for dynamically enabled features").
Even though the original bug report should not have enabled the paths at
all, the bug still exists.
fpregs_lock is necessary when editing the FPU registers or a task's FP
state but it is not necessary for fpu_inherit_perms(). The only write
of any FP state in fpu_inherit_perms() is for the new child which is
not running yet and cannot context switch or be borrowed by a kernel
thread yet. Hence, fpregs_lock is not protecting anything in the new
child until clone() completes and can be dropped earlier. The siglock
still needs to be acquired by fpu_inherit_perms() as the read of the
parent's permissions has to be serialised.
[ bp: Cleanup splat. ]
Fixes: 9e798e9aa1 ("x86/fpu: Prepare fpu_clone() for dynamically enabled features")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20221110124400.zgymc2lnwqjukgfh@techsingularity.net
The way mtrr_if is initialized with the correct mtrr_ops structure is
quite weird.
Simplify that by dropping the vendor specific init functions and the
mtrr_ops[] array. Replace those with direct assignments of the related
vendor specific ops array to mtrr_if.
Note that a direct assignment is okay even for 64-bit builds, where the
symbol isn't present, as the related code will be subject to "dead code
elimination" due to how cpu_feature_enabled() is implemented.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-17-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Instead of explicitly calling cache_ap_init() in
identify_secondary_cpu() use a CPU hotplug callback instead. By
registering the callback only after having started the non-boot CPUs
and initializing cache_aps_delayed_init with "true", calling
set_cache_aps_delayed_init() at boot time can be dropped.
It should be noted that this change results in cache_ap_init() being
called a little bit later when hotplugging CPUs. By using a new
hotplug slot right at the start of the low level bringup this is not
problematic, as no operations requiring a specific caching mode are
performed that early in CPU initialization.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-15-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Today, PAT is usable only with MTRR being active, with some nasty tweaks
to make PAT usable when running as a Xen PV guest which doesn't support
MTRR.
The reason for this coupling is that both PAT MSR changes and MTRR
changes require a similar sequence and so full PAT support was added
using the already available MTRR handling.
Xen PV PAT handling can work without MTRR, as it just needs to consume
the PAT MSR setting done by the hypervisor without the ability and need
to change it. This in turn has resulted in a convoluted initialization
sequence and wrong decisions regarding cache mode availability due to
misguiding PAT availability flags.
Fix all of that by allowing to use PAT without MTRR and by reworking
the current PAT initialization sequence to match better with the newly
introduced generic cache initialization.
This removes the need of the recently added pat_force_disabled flag, so
remove the remnants of the patch adding it.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-14-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Instead of having a stop_machine() handler for either a specific
MTRR register or all state at once, add a handler just for calling
cache_cpu_init() if appropriate.
Add functions for calling stop_machine() with this handler as well.
Add a generic replacement for mtrr_bp_restore() and a wrapper for
mtrr_bp_init().
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-13-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
In order to prepare decoupling MTRR and PAT replace the MTRR-specific
mtrr_aps_delayed_init flag with a more generic cache_aps_delayed_init
one.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-12-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
There is no need for keeping __mtrr_enabled as it can easily be replaced
by testing mtrr_if to be not NULL.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-11-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
In case of the generic cache interface being used (Intel CPUs or a
64-bit system), the initialization sequence of the boot CPU is more
complicated than necessary:
- check if MTRR enabled, if yes, call mtrr_bp_pat_init() which will
disable caching, set the PAT MSR, and reenable caching
- call mtrr_cleanup(), in case that changed anything, call
cache_cpu_init() doing the same caching disable/enable dance as
above, but this time with setting the (modified) MTRR state (even
if MTRR was disabled) AND setting the PAT MSR (again even with
disabled MTRR)
The sequence can be simplified a lot while removing potential
inconsistencies:
- check if MTRR enabled, if yes, call mtrr_cleanup() and then
cache_cpu_init()
This ensures to:
- no longer disable/enable caching more than once
- avoid to set MTRRs and/or the PAT MSR on the boot processor in case
of MTRR cleanups even if MTRRs meant to be disabled
With that mtrr_bp_pat_init() can be removed.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-10-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Instead of using an indirect call to mtrr_if->set_all just call the only
possible target cache_cpu_init() directly. Remove the set_all function
pointer from struct mtrr_ops.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-9-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Add a main cache_cpu_init() init routine which initializes MTRR and/or
PAT support depending on what has been detected on the system.
Leave the MTRR-specific initialization in a MTRR-specific init function
where the smp_changes_mask setting happens now with caches disabled.
This global mask update was done with caches enabled before probably
because atomic operations while running uncached might have been quite
expensive.
But since only systems with a broken BIOS should ever require to set any
bit in smp_changes_mask, hurting those devices with a penalty of a few
microseconds during boot shouldn't be a real issue.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-8-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Prepare making PAT and MTRR support independent from each other by
moving some code needed by both out of the MTRR-specific sources.
[ bp: Massage commit message. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-7-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Split the MTRR-specific actions from cache_disable() and cache_enable()
into new functions mtrr_disable() and mtrr_enable().
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-6-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Rename the currently MTRR-specific functions prepare_set() and
post_set() in preparation to move them. Make them non-static and put
their prototypes into cacheinfo.h, where they will end after moving them
to their final position anyway.
Expand the comment before the functions with an introductory line and
rename two related static variables, too.
[ bp: Massage commit message. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-5-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
In MTRR code use_intel() is only used in one source file, and the
relevant use_intel_if member of struct mtrr_ops is set only in
generic_mtrr_ops.
Replace use_intel() with a single flag in cacheinfo.c which can be
set when assigning generic_mtrr_ops to mtrr_if. This allows to drop
use_intel_if from mtrr_ops, while preparing to decouple PAT from MTRR.
As another preparation for the PAT/MTRR decoupling use a bit for MTRR
control and one for PAT control. For now set both bits together, this
can be changed later.
As the new flag will be set only if mtrr_enabled is set, the test for
mtrr_enabled can be dropped at some places.
[ bp: Massage commit message. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-4-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Presumably, this was introduced due to a conflict resolution with
commit ef68017eb5 ("x86/kvm: Handle async page faults directly through
do_page_fault()"), given that the last posted version [1] of the blamed
commit was not based on the aforementioned commit.
[1] https://lore.kernel.org/kvm/20200525144125.143875-9-vkuznets@redhat.com/
Fixes: b1d405751c ("KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Message-Id: <20221021020113.922027-1-rafaelmendsr@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86_virt_spec_ctrl only deals with the paravirtualized
MSR_IA32_VIRT_SPEC_CTRL now and does not handle MSR_IA32_SPEC_CTRL
anymore; remove the corresponding, unused argument.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Restoration of the host IA32_SPEC_CTRL value is probably too late
with respect to the return thunk training sequence.
With respect to the user/kernel boundary, AMD says, "If software chooses
to toggle STIBP (e.g., set STIBP on kernel entry, and clear it on kernel
exit), software should set STIBP to 1 before executing the return thunk
training sequence." I assume the same requirements apply to the guest/host
boundary. The return thunk training sequence is in vmenter.S, quite close
to the VM-exit. On hosts without V_SPEC_CTRL, however, the host's
IA32_SPEC_CTRL value is not restored until much later.
To avoid this, move the restoration of host SPEC_CTRL to assembly and,
for consistency, move the restoration of the guest SPEC_CTRL as well.
This is not particularly difficult, apart from some care to cover both
32- and 64-bit, and to share code between SEV-ES and normal vmentry.
Cc: stable@vger.kernel.org
Fixes: a149180fbc ("x86: Add magic AMD return-thunk")
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This already removes an ugly #include "" from asm-offsets.c, but
especially it avoids a future error when trying to define asm-offsets
for KVM's svm/svm.h header.
This would not work for kernel/asm-offsets.c, because svm/svm.h
includes kvm_cache_regs.h which is not in the include path when
compiling asm-offsets.c. The problem is not there if the .c file is
in arch/x86/kvm.
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Fixes: a149180fbc ("x86: Add magic AMD return-thunk")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
"XSAVE consistency problem" has been reported under Xen, but that's the extent
of my divination skills.
Modify XSTATE_WARN_ON() to force the caller to provide relevant diagnostic
information, and modify each caller suitably.
For check_xstate_against_struct(), this removes a double WARN() where one will
do perfectly fine.
CC stable as this has been wonky debugging for 7 years and it is good to
have there too.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220810221909.12768-1-andrew.cooper3@citrix.com
There is a case in exc_invalid_op handler that is executed outside the
irqentry_enter()/irqentry_exit() region when an UD2 instruction is used to
encode a call to __warn().
In that case the `struct pt_regs` passed to the interrupt handler is never
unpoisoned by KMSAN (this is normally done in irqentry_enter()), which
leads to false positives inside handle_bug().
Use kmsan_unpoison_entry_regs() to explicitly unpoison those registers
before using them.
Link: https://lkml.kernel.org/r/20221102110611.1085175-5-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
sgx_validate_offset_length() function verifies "offset" and "length"
arguments provided by userspace, but was missing an overflow check on
their addition. Add it.
Fixes: c6d26d3707 ("x86/sgx: Add SGX_IOC_ENCLAVE_ADD_PAGES")
Signed-off-by: Borys Popławski <borysp@invisiblethingslab.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Cc: stable@vger.kernel.org # v5.11+
Link: https://lore.kernel.org/r/0d91ac79-6d84-abed-5821-4dbe59fa1a38@invisiblethingslab.com
The new Asynchronous Exit (AEX) notification mechanism (AEX-notify)
allows one enclave to receive a notification in the ERESUME after the
enclave exit due to an AEX. EDECCSSA is a new SGX user leaf function
(ENCLU[EDECCSSA]) to facilitate the AEX notification handling. The new
EDECCSSA is enumerated via CPUID(EAX=0x12,ECX=0x0):EAX[11].
Besides Allowing reporting the new AEX-notify attribute to KVM guests,
also allow reporting the new EDECCSSA user leaf function to KVM guests
so the guest can fully utilize the AEX-notify mechanism.
Similar to existing X86_FEATURE_SGX1 and X86_FEATURE_SGX2, introduce a
new scattered X86_FEATURE_SGX_EDECCSSA bit for the new EDECCSSA, and
report it in KVM's supported CPUIDs.
Note, no additional KVM enabling is required to allow the guest to use
EDECCSSA. It's impossible to trap ENCLU (without completely preventing
the guest from using SGX). Advertise EDECCSSA as supported purely so
that userspace doesn't need to special case EDECCSSA, i.e. doesn't need
to manually check host CPUID.
The inability to trap ENCLU also means that KVM can't prevent the guest
from using EDECCSSA, but that virtualization hole is benign as far as
KVM is concerned. EDECCSSA is simply a fancy way to modify internal
enclave state.
More background about how do AEX-notify and EDECCSSA work:
SGX maintains a Current State Save Area Frame (CSSA) for each enclave
thread. When AEX happens, the enclave thread context is saved to the
CSSA and the CSSA is increased by 1. For a normal ERESUME which doesn't
deliver AEX notification, it restores the saved thread context from the
previously saved SSA and decreases the CSSA. If AEX-notify is enabled
for one enclave, the ERESUME acts differently. Instead of restoring the
saved thread context and decreasing the CSSA, it acts like EENTER which
doesn't decrease the CSSA but establishes a clean slate thread context
using the CSSA for the enclave to handle the notification. After some
handling, the enclave must discard the "new-established" SSA and switch
back to the previously saved SSA (upon AEX). Otherwise, the enclave
will run out of SSA space upon further AEXs and eventually fail to run.
To solve this problem, the new EDECCSSA essentially decreases the CSSA.
It can be used by the enclave notification handler to switch back to the
previous saved SSA when needed, i.e. after it handles the notification.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/all/20221101022422.858944-1-kai.huang%40intel.com
Short Version:
Allow enclaves to use the new Asynchronous EXit (AEX)
notification mechanism. This mechanism lets enclaves run a
handler after an AEX event. These handlers can run mitigations
for things like SGX-Step[1].
AEX Notify will be made available both on upcoming processors and
on some older processors through microcode updates.
Long Version:
== SGX Attribute Background ==
The SGX architecture includes a list of SGX "attributes". These
attributes ensure consistency and transparency around specific
enclave features.
As a simple example, the "DEBUG" attribute allows an enclave to
be debugged, but also destroys virtually all of SGX security.
Using attributes, enclaves can know that they are being debugged.
Attributes also affect enclave attestation so an enclave can, for
instance, be denied access to secrets while it is being debugged.
The kernel keeps a list of known attributes and will only
initialize enclaves that use a known set of attributes. This
kernel policy eliminates the chance that a new SGX attribute
could cause undesired effects.
For example, imagine a new attribute was added called
"PROVISIONKEY2" that provided similar functionality to
"PROVISIIONKEY". A kernel policy that allowed indiscriminate use
of unknown attributes and thus PROVISIONKEY2 would undermine the
existing kernel policy which limits use of PROVISIONKEY enclaves.
== AEX Notify Background ==
"Intel Architecture Instruction Set Extensions and Future
Features - Version 45" is out[2]. There is a new chapter:
Asynchronous Enclave Exit Notify and the EDECCSSA User Leaf Function.
Enclaves exit can be either synchronous and consensual (EEXIT for
instance) or asynchronous (on an interrupt or fault). The
asynchronous ones can evidently be exploited to single step
enclaves[1], on top of which other naughty things can be built.
AEX Notify will be made available both on upcoming processors and
on some older processors through microcode updates.
== The Problem ==
These attacks are currently entirely opaque to the enclave since
the hardware does the save/restore under the covers. The
Asynchronous Enclave Exit Notify (AEX Notify) mechanism provides
enclaves an ability to detect and mitigate potential exposure to
these kinds of attacks.
== The Solution ==
Define the new attribute value for AEX Notification. Ensure the
attribute is cleared from the list reserved attributes. Instead
of adding to the open-coded lists of individual attributes,
add named lists of privileged (disallowed by default) and
unprivileged (allowed by default) attributes. Add the AEX notify
attribute as an unprivileged attribute, which will keep the kernel
from rejecting enclaves with it set.
1. https://github.com/jovanbulck/sgx-step
2. https://cdrdv2.intel.com/v1/dl/getContent/671368?explicitVersion=true
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/20220720191347.1343986-1-dave.hansen%40linux.intel.com
Intel processors support additional software hint called EPB ("Energy
Performance Bias") to guide the hardware heuristic of power management
features to favor increasing dynamic performance or conserve energy
consumption.
Since this EPB hint is processor specific, the same value of hint can
result in different behavior across generations of processors.
commit 4ecc933b7d ("x86: intel_epb: Allow model specific normal EPB
value")' introduced capability to update the default power up EPB
based on the CPU model and updated the default EPB to 7 for Alder Lake
mobile CPUs.
The same change is required for other Alder Lake-N and Raptor Lake-P
mobile CPUs as the current default of 6 results in higher uncore power
consumption. This increase in power is related to memory clock
frequency setting based on the EPB value.
Depending on the EPB the minimum memory frequency is set by the
firmware. At EPB = 7, the minimum memory frequency is 1/4th compared to
EPB = 6. This results in significant power saving for idle and
semi-idle workload on a Chrome platform.
For example Change in power and performance from EPB change from 6 to 7
on Alder Lake-N:
Workload Performance diff (%) power diff
----------------------------------------------------
VP9 FHD30 0 (FPS) -218 mw
Google meet 0 (FPS) -385 mw
This 200+ mw power saving is very significant for mobile platform for
battery life and thermal reasons.
But as the workload demands more memory bandwidth, the memory frequency
will be increased very fast. There is no power savings for such busy
workloads.
For example:
Workload Performance diff (%) from EPB 6 to 7
-------------------------------------------------------
Speedometer 2.0 -0.8
WebGL Aquarium 10K
Fish -0.5
Unity 3D 2018 0.2
WebXPRT3 -0.5
There are run to run variations for performance scores for
such busy workloads. So the difference is not significant.
Add a new define ENERGY_PERF_BIAS_NORMAL_POWERSAVE for EPB 7
and use it for Alder Lake-N and Raptor Lake-P mobile CPUs.
This modification is done originally by
Jeremy Compostella <jeremy.compostella@intel.com>.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/20221027220056.1534264-1-srinivas.pandruvada%40linux.intel.com
request_microcode_fw() can always request firmware now so drop this
superfluous argument.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20221028142638.28498-4-bp@alien8.de
Get rid of all the IPI-sending functions and their wrappers and use
those which are supposed to be called on each CPU.
Thus:
- microcode_init_cpu() gets called on each CPU on init, applying any new
microcode that the driver might've found on the filesystem.
- mc_cpu_starting() simply tries to apply cached microcode as this is
the cpuhp starting callback which gets called on CPU resume too.
Even if the driver init function is a late initcall, there is no
filesystem by then (not even a hdd driver has been loaded yet) so a new
firmware load attempt cannot simply be done.
It is pointless anyway - for that there's late loading if one really
needs it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20221028142638.28498-3-bp@alien8.de
This is a left-over from the old days when CPU hotplug wasn't as robust
as it is now. Currently, microcode gets loaded early on the CPU init
path and there's no need to attempt to load it again, which that subsys
interface callback is doing.
The only other thing that the subsys interface init path was doing is
adding the
/sys/devices/system/cpu/cpu*/microcode/
hierarchy.
So add a function which gets called on each CPU after all the necessary
driver setup has happened. Use schedule_on_each_cpu() which can block
because the sysfs creating code does kmem_cache_zalloc() which can block
too and the initial version of this where it did that setup in an IPI
handler of on_each_cpu() can cause a deadlock of the sort:
lock(fs_reclaim);
<Interrupt>
lock(fs_reclaim);
as the IPI handler runs in IRQ context.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/20221028142638.28498-2-bp@alien8.de
In fill_thread_core_info() the ptrace accessible registers are collected
for a core file to be written out as notes. The note array is allocated
from a size calculated by iterating the user regset view, and counting the
regsets that have a non-zero core_note_type. However, this only allows for
there to be non-zero core_note_type at the end of the regset view. If
there are any in the middle, fill_thread_core_info() will overflow the
note allocation, as it iterates over the size of the view and the
allocation would be smaller than that.
To apparently avoid this problem, x86_32_regsets and x86_64_regsets need
to be constructed in a special way. They both draw their indices from a
shared enum x86_regset, but 32 bit and 64 bit don't all support the same
regsets and can be compiled in at the same time in the case of
IA32_EMULATION. So this enum has to be laid out in a special way such that
there are no gaps for both x86_32_regsets and x86_64_regsets. This
involves ordering them just right by creating aliases for enum’s that
are only in one view or the other, or creating multiple versions like
REGSET32_IOPERM/REGSET64_IOPERM.
So the collection of the registers tries to minimize the size of the
allocation, but it doesn’t quite work. Then the x86 ptrace side works
around it by constructing the enum just right to avoid a problem. In the
end there is no functional problem, but it is somewhat strange and
fragile.
It could also be improved like this [1], by better utilizing the smaller
array, but this still wastes space in the regset array’s if they are not
carefully crafted to avoid gaps. Instead, just fully separate out the
enums and give them separate 32 and 64 enum names. Add some bitsize-free
defines for REGSET_GENERAL and REGSET_FP since they are the only two
referred to in bitsize generic code.
While introducing a bunch of new 32/64 enums, change the pattern of the
name from REGSET_FOO32 to REGSET32_FOO to better indicate that the 32 is
in reference to the CPU mode and not the register size, as suggested by
Eric Biederman.
This should have no functional change and is only changing how constants
are generated and referred to.
[1] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20221021221803.10910-2-rick.p.edgecombe%40intel.com
In order to avoid known hashes (from knowing the boot image),
randomize the CFI hashes with a per-boot random seed.
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221027092842.765195516@infradead.org
Add the "cfi=" boot parameter to allow people to select a CFI scheme
at boot time. Mostly useful for development / debugging.
Requested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221027092842.699804264@infradead.org
Implement an alternative CFI scheme that merges both the fine-grained
nature of kCFI but also takes full advantage of the coarse grained
hardware CFI as provided by IBT.
To contrast:
kCFI is a pure software CFI scheme and relies on being able to read
text -- specifically the instruction *before* the target symbol, and
does the hash validation *before* doing the call (otherwise control
flow is compromised already).
FineIBT is a software and hardware hybrid scheme; by ensuring every
branch target starts with a hash validation it is possible to place
the hash validation after the branch. This has several advantages:
o the (hash) load is avoided; no memop; no RX requirement.
o IBT WAIT-FOR-ENDBR state is a speculation stop; by placing
the hash validation in the immediate instruction after
the branch target there is a minimal speculation window
and the whole is a viable defence against SpectreBHB.
o Kees feels obliged to mention it is slightly more vulnerable
when the attacker can write code.
Obviously this patch relies on kCFI, but additionally it also relies
on the padding from the call-depth-tracking patches. It uses this
padding to place the hash-validation while the call-sites are
re-written to modify the indirect target to be 16 bytes in front of
the original target, thus hitting this new preamble.
Notably, there is no hardware that needs call-depth-tracking (Skylake)
and supports IBT (Tigerlake and onwards).
Suggested-by: Joao Moreira (Intel) <joao@overdrivepizza.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20221027092842.634714496@infradead.org
commit 8795359e35 ("x86/sgx: Silence softlockup detection when
releasing large enclaves") introduced a cond_resched() during enclave
release where the EREMOVE instruction is applied to every 4k enclave
page. Giving other tasks an opportunity to run while tearing down a
large enclave placates the soft lockup detector but Iqbal found
that the fix causes a 25% performance degradation of a workload
run using Gramine.
Gramine maintains a 1:1 mapping between processes and SGX enclaves.
That means if a workload in an enclave creates a subprocess then
Gramine creates a duplicate enclave for that subprocess to run in.
The consequence is that the release of the enclave used to run
the subprocess can impact the performance of the workload that is
run in the original enclave, especially in large enclaves when
SGX2 is not in use.
The workload run by Iqbal behaves as follows:
Create enclave (enclave "A")
/* Initialize workload in enclave "A" */
Create enclave (enclave "B")
/* Run subprocess in enclave "B" and send result to enclave "A" */
Release enclave (enclave "B")
/* Run workload in enclave "A" */
Release enclave (enclave "A")
The performance impact of releasing enclave "B" in the above scenario
is amplified when there is a lot of SGX memory and the enclave size
matches the SGX memory. When there is 128GB SGX memory and an enclave
size of 128GB, from the time enclave "B" starts the 128GB SGX memory
is oversubscribed with a combined demand for 256GB from the two
enclaves.
Before commit 8795359e35 ("x86/sgx: Silence softlockup detection when
releasing large enclaves") enclave release was done in a tight loop
without giving other tasks a chance to run. Even though the system
experienced soft lockups the workload (run in enclave "A") obtained
good performance numbers because when the workload started running
there was no interference.
Commit 8795359e35 ("x86/sgx: Silence softlockup detection when
releasing large enclaves") gave other tasks opportunity to run while an
enclave is released. The impact of this in this scenario is that while
enclave "B" is released and needing to access each page that belongs
to it in order to run the SGX EREMOVE instruction on it, enclave "A"
is attempting to run the workload needing to access the enclave
pages that belong to it. This causes a lot of swapping due to the
demand for the oversubscribed SGX memory. Longer latencies are
experienced by the workload in enclave "A" while enclave "B" is
released.
Improve the performance of enclave release while still avoiding the
soft lockup detector with two enhancements:
- Only call cond_resched() after XA_CHECK_SCHED iterations.
- Use the xarray advanced API to keep the xarray locked for
XA_CHECK_SCHED iterations instead of locking and unlocking
at every iteration.
This batching solution is copied from sgx_encl_may_map() that
also iterates through all enclave pages using this technique.
With this enhancement the workload experiences a 5%
performance degradation when compared to a kernel without
commit 8795359e35 ("x86/sgx: Silence softlockup detection when
releasing large enclaves"), an improvement to the reported 25%
degradation, while still placating the soft lockup detector.
Scenarios with poor performance are still possible even with these
enhancements. For example, short workloads creating sub processes
while running in large enclaves. Further performance improvements
are pursued in user space through avoiding to create duplicate enclaves
for certain sub processes, and using SGX2 that will do lazy allocation
of pages as needed so enclaves created for sub processes start quickly
and release quickly.
Fixes: 8795359e35 ("x86/sgx: Silence softlockup detection when releasing large enclaves")
Reported-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Md Iqbal Hossain <md.iqbal.hossain@intel.com>
Link: https://lore.kernel.org/all/00efa80dd9e35dc85753e1c5edb0344ac07bb1f0.1667236485.git.reinette.chatre%40intel.com
A call is made to arch_get_random_longs() and rdtsc(), rather than just
using get_random_long(), because this was written during a time when
very early boot would give abysmal entropy. These days, a call to
get_random_long() at early boot will incorporate RDRAND, RDTSC, and
more, without having to do anything bespoke.
In fact, the situation is now such that on the majority of x86 systems,
the pool actually is initialized at this point, even though it doesn't
need to be for get_random_long() to still return something better than
what this function currently does.
So simplify this to just call get_random_long() instead.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221029002613.143153-1-Jason@zx2c4.com
mce_severity_intel() has a special case to promote UC and AR errors
in kernel context to PANIC severity.
The "AR" case is already handled with separate entries in the severity
table for all instruction fetch errors, and those data fetch errors that
are not in a recoverable area of the kernel (i.e. have an extable fixup
entry).
Add an entry to the severity table for UC errors in kernel context that
reports severity = PANIC. Delete the special case code from
mce_severity_intel().
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220922195136.54575-2-tony.luck@intel.com
AMD's MCA Thresholding feature counts errors of all severity levels, not
just correctable errors. If a deferred error causes the threshold limit
to be reached (it was the error that caused the overflow), then both a
deferred error interrupt and a thresholding interrupt will be triggered.
The order of the interrupts is not guaranteed. If the threshold
interrupt handler is executed first, then it will clear MCA_STATUS for
the error. It will not check or clear MCA_DESTAT which also holds a copy
of the deferred error. When the deferred error interrupt handler runs it
will not find an error in MCA_STATUS, but it will find the error in
MCA_DESTAT. This will cause two errors to be logged.
Check for deferred errors when handling a threshold interrupt. If a bank
contains a deferred error, then clear the bank's MCA_DESTAT register.
Define a new helper function to do the deferred error check and clearing
of MCA_DESTAT.
[ bp: Simplify, convert comment to passive voice. ]
Fixes: 37d43acfd7 ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220621155943.33623-1-yazen.ghannam@amd.com
The first argument of WARN() is a condition, so this will use "addr"
as the format string and possibly crash.
Fixes: 3b6c1747da ("x86/retpoline: Add SKL retthunk retpolines")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/Y1gBoUZrRK5N%2FlCB@kili/
The field arch_has_empty_bitmaps is not required anymore. The field
min_cbm_bits is enough to validate the CBM (capacity bit mask) if the
architecture can support the zero CBM or not.
Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Link: https://lore.kernel.org/r/166430979654.372014.615622285687642644.stgit@bmoger-ubuntu
There's a conflict between the call-depth tracking commits in x86/core:
ee3e2469b3 ("x86/ftrace: Make it call depth tracking aware")
36b64f1012 ("x86/ftrace: Rebalance RSB")
eac828eaef ("x86/ftrace: Remove ftrace_epilogue()")
And these fixes in x86/urgent:
883bbbffa5 ("ftrace,kcfi: Separate ftrace_stub() and ftrace_stub_graph()")
b5f1fc3184 ("x86/ftrace: Remove ftrace_epilogue()")
It's non-trivial overlapping modifications - resolve them.
Conflicts:
arch/x86/kernel/ftrace_64.S
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When an extended state component is not present in fpstate, but in init
state, the function copies from init_fpstate via copy_feature().
But, dynamic states are not present in init_fpstate because of all-zeros
init states. Then retrieving them from init_fpstate will explode like this:
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:memcpy_erms+0x6/0x10
? __copy_xstate_to_uabi_buf+0x381/0x870
fpu_copy_guest_fpstate_to_uabi+0x28/0x80
kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
? __this_cpu_preempt_check+0x13/0x20
? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
? __fget_light+0xd4/0x130
__x64_sys_ioctl+0xe3/0x910
? debug_smp_processor_id+0x17/0x20
? fpregs_assert_state_consistent+0x27/0x50
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Adjust the 'mask' to zero out the userspace buffer for the features that
are not available both from fpstate and from init_fpstate.
The dynamic features depend on the compacted XSAVE format. Ensure it is
enabled before reading XCOMP_BV in init_fpstate.
Fixes: 2308ee57d9 ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
Reported-by: Yuan Yao <yuan.yao@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/lkml/BYAPR11MB3717EDEF2351C958F2C86EED95259@BYAPR11MB3717.namprd11.prod.outlook.com/
Link: https://lkml.kernel.org/r/20221021185844.13472-1-chang.seok.bae@intel.com
When a console stack dump is initiated with CONFIG_GCOV_PROFILE_ALL
enabled, show_trace_log_lvl() gets out of sync with the ORC unwinder,
causing the stack trace to show all text addresses as unreliable:
# echo l > /proc/sysrq-trigger
[ 477.521031] sysrq: Show backtrace of all active CPUs
[ 477.523813] NMI backtrace for cpu 0
[ 477.524492] CPU: 0 PID: 1021 Comm: bash Not tainted 6.0.0 #65
[ 477.525295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014
[ 477.526439] Call Trace:
[ 477.526854] <TASK>
[ 477.527216] ? dump_stack_lvl+0xc7/0x114
[ 477.527801] ? dump_stack+0x13/0x1f
[ 477.528331] ? nmi_cpu_backtrace.cold+0xb5/0x10d
[ 477.528998] ? lapic_can_unplug_cpu+0xa0/0xa0
[ 477.529641] ? nmi_trigger_cpumask_backtrace+0x16a/0x1f0
[ 477.530393] ? arch_trigger_cpumask_backtrace+0x1d/0x30
[ 477.531136] ? sysrq_handle_showallcpus+0x1b/0x30
[ 477.531818] ? __handle_sysrq.cold+0x4e/0x1ae
[ 477.532451] ? write_sysrq_trigger+0x63/0x80
[ 477.533080] ? proc_reg_write+0x92/0x110
[ 477.533663] ? vfs_write+0x174/0x530
[ 477.534265] ? handle_mm_fault+0x16f/0x500
[ 477.534940] ? ksys_write+0x7b/0x170
[ 477.535543] ? __x64_sys_write+0x1d/0x30
[ 477.536191] ? do_syscall_64+0x6b/0x100
[ 477.536809] ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 477.537609] </TASK>
This happens when the compiled code for show_stack() has a single word
on the stack, and doesn't use a tail call to show_stack_log_lvl().
(CONFIG_GCOV_PROFILE_ALL=y is the only known case of this.) Then the
__unwind_start() skip logic hits an off-by-one bug and fails to unwind
all the way to the intended starting frame.
Fix it by reverting the following commit:
f1d9a2abff ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
The original justification for that commit no longer exists. That
original issue was later fixed in a different way, with the following
commit:
f2ac57a4c4 ("x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels")
Fixes: f1d9a2abff ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
[jpoimboe: rewrite commit log]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Different function signatures means they needs to be different
functions; otherwise CFI gets upset.
As triggered by the ftrace boot tests:
[] CFI failure at ftrace_return_to_handler+0xac/0x16c (target: ftrace_stub+0x0/0x14; expected type: 0x0a5d5347)
Fixes: 3c516f89e1 ("x86: Add support for CONFIG_CFI_CLANG")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/Y06dg4e1xF6JTdQq@hirez.programming.kicks-ass.net
Remove the weird jumps to RET and simply use RET.
This then promotes ftrace_stub() to a real function; which becomes
important for kcfi.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.719080593@infradead.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
The Cyrix CPU specific MTRR function cyrix_set_all() will never be
called as the mtrr_ops->set_all() callback will only be called in the
use_intel() case, which would require the use_intel_if member of struct
mtrr_ops to be set, which isn't the case for Cyrix.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221004081023.32402-3-jgross@suse.com
There are significant differences between signal handling on 32-bit vs.
64-bit, like different structure layouts and legacy syscalls. Instead
of duplicating that code for native and compat, merge both versions
into one file.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-8-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Add ABI prefixes to the frame setup functions that didn't already have
them. To avoid compiler warnings and prepare for moving these functions
to separate files, make them non-static.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-7-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Adapt the native get_sigframe() function so that the compat signal code
can use it.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-6-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Push down the call to sigmask_to_save() into the frame setup functions.
Thus, remove the use of compat_sigset_t outside of the compat code.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-3-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Passing the signal number as a separate parameter is unnecessary, since
it is always ksig->sig.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20220606203802.158958-2-brgerst@gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
AMD systems support zero CBM (capacity bit mask) for cache allocation.
That is reflected in rdt_init_res_defs_amd() by:
r->cache.arch_has_empty_bitmaps = true;
However given the unified code in cbm_validate(), checking for:
val == 0 && !arch_has_empty_bitmaps
is not enough because of another check in cbm_validate():
if ((zero_bit - first_bit) < r->cache.min_cbm_bits)
The default value of r->cache.min_cbm_bits = 1.
Leading to:
$ cd /sys/fs/resctrl
$ mkdir foo
$ cd foo
$ echo L3:0=0 > schemata
-bash: echo: write error: Invalid argument
$ cat /sys/fs/resctrl/info/last_cmd_status
Need at least 1 bits in the mask
Initialize the min_cbm_bits to 0 for AMD. Also, remove the default
setting of min_cbm_bits and initialize it separately.
After the fix:
$ cd /sys/fs/resctrl
$ mkdir foo
$ cd foo
$ echo L3:0=0 > schemata
$ cat /sys/fs/resctrl/info/last_cmd_status
ok
Fixes: 316e7f901f ("x86/resctrl: Add struct rdt_cache::arch_has_{sparse, empty}_bitmaps")
Co-developed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Morse <james.morse@arm.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/lkml/20220517001234.3137157-1-eranian@google.com
Currently, the patch application logic checks whether the revision
needs to be applied on each logical CPU (SMT thread). Therefore, on SMT
designs where the microcode engine is shared between the two threads,
the application happens only on one of them as that is enough to update
the shared microcode engine.
However, there are microcode patches which do per-thread modification,
see Link tag below.
Therefore, drop the revision check and try applying on each thread. This
is what the BIOS does too so this method is very much tested.
Btw, change only the early paths. On the late loading paths, there's no
point in doing per-thread modification because if is it some case like
in the bugzilla below - removing a CPUID flag - the kernel cannot go and
un-use features it has detected are there early. For that, one should
use early loading anyway.
[ bp: Fixes does not contain the oldest commit which did check for
equality but that is good enough. ]
Fixes: 8801b3fcb5 ("x86/microcode/AMD: Rework container parsing")
Reported-by: Ștefan Talpalaru <stefantalpalaru@yahoo.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Ștefan Talpalaru <stefantalpalaru@yahoo.com>
Cc: <stable@vger.kernel.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216211
Today, core ID is assumed to be unique within each package.
But an AlderLake-N platform adds a Module level between core and package,
Linux excludes the unknown modules bits from the core ID, resulting in
duplicate core ID's.
To keep core ID unique within a package, Linux must include all APIC-ID
bits for known or unknown levels above the core and below the package
in the core ID.
It is important to understand that core ID's have always come directly
from the APIC-ID encoding, which comes from the BIOS. Thus there is no
guarantee that they start at 0, or that they are contiguous.
As such, naively using them for array indexes can be problematic.
[ dhansen: un-known -> unknown ]
Fixes: 7745f03eb3 ("x86/topology: Add CPUID.1F multi-die/package support")
Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20221014090147.1836-5-rui.zhang@intel.com
CPUID.1F/B does not enumerate Package level explicitly, instead, all the
APIC-ID bits above the enumerated levels are assumed to be package ID
bits.
Current code gets package ID by shifting out all the APIC-ID bits that
Linux supports, rather than shifting out all the APIC-ID bits that
CPUID.1F enumerates. This introduces problems when CPUID.1F enumerates a
level that Linux does not support.
For example, on a single package AlderLake-N, there are 2 Ecore Modules
with 4 atom cores in each module. Linux does not support the Module
level and interprets the Module ID bits as package ID and erroneously
reports a multi module system as a multi-package system.
Fix this by using APIC-ID bits above all the CPUID.1F enumerated levels
as package ID.
[ dhansen: spelling fix ]
Fixes: 7745f03eb3 ("x86/topology: Add CPUID.1F multi-die/package support")
Suggested-by: Len Brown <len.brown@intel.com>
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20221014090147.1836-4-rui.zhang@intel.com
The fully secure mitigation for RSB underflow on Intel SKL CPUs is IBRS,
which inflicts up to 30% penalty for pathological syscall heavy work loads.
Software based call depth tracking and RSB refill is not perfect, but
reduces the attack surface massively. The penalty for the pathological case
is about 8% which is still annoying but definitely more palatable than IBRS.
Add a retbleed=stuff command line option to enable the call depth tracking
and software refill of the RSB.
This gives admins a choice. IBeeRS are safe and cause headaches, call depth
tracking is considered to be s(t)ufficiently safe.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111149.029587352@infradead.org
Since ftrace has trampolines, don't use thunks for the __fentry__ site
but instead require that every function called from there includes
accounting. This very much includes all the direct-call functions.
Additionally, ftrace uses ROP tricks in two places:
- return_to_handler(), and
- ftrace_regs_caller() when pt_regs->orig_ax is set by a direct-call.
return_to_handler() already uses a retpoline to replace an
indirect-jump to defeat IBT, since this is a jump-type retpoline, make
sure there is no accounting done and ALTERNATIVE the RET into a ret.
ftrace_regs_caller() does much the same and gets the same treatment.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.927545073@infradead.org
ftrace_regs_caller() uses a PUSH;RET pattern to tail-call into a
direct-call function, this unbalances the RSB, fix that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.823216933@infradead.org
Remove the weird jumps to RET and simply use RET.
This then promotes ftrace_stub() to a real function; which becomes
important for kcfi.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.719080593@infradead.org
Ensure that calls in BPF jitted programs are emitting call depth accounting
when enabled to keep the call/return balanced. The return thunk jump is
already injected due to the earlier retbleed mitigations.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.615413406@infradead.org
Callthunks addresses on the stack would confuse the ORC unwinder. Handle
them correctly and tell ORC to proceed further down the stack.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.511637628@infradead.org
When indirect calls are switched to direct calls then it has to be ensured
that the call target is not the function, but the call thunk when call
depth tracking is enabled. But static calls are available before call
thunks have been set up.
Ensure a second run through the static call patching code after call thunks
have been created. When call thunks are not enabled this has no side
effects.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.306100465@infradead.org
Add a debuigfs mechanism to validate the accounting, e.g. vs. call/ret
balance and to gather statistics about the stuffing to call ratio.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111148.204285506@infradead.org
Ensure that retpolines do the proper call accounting so that the return
accounting works correctly.
Specifically; retpolines are used to replace both 'jmp *%reg' and
'call *%reg', however these two cases do not have the same accounting
requirements. Therefore split things up and provide two different
retpoline arrays for SKL.
The 'jmp *%reg' case needs no accounting, the
__x86_indirect_jump_thunk_array[] covers this. The retpoline is
changed to not use the return thunk; it's a simple call;ret construct.
[ strictly speaking it should do:
andq $(~0x1f), PER_CPU_VAR(__x86_call_depth)
but we can argue this can be covered by the fuzz we already have
in the accounting depth (12) vs the RSB depth (16) ]
The 'call *%reg' case does need accounting, the
__x86_indirect_call_thunk_array[] covers this. Again, this retpoline
avoids the use of the return-thunk, in this case to avoid double
accounting.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.996634749@infradead.org
To address the Intel SKL RSB underflow issue in software it's required to
do call depth tracking.
Provide a return thunk for call depth tracking on Intel SKL CPUs.
The tracking does not use a counter. It uses uses arithmetic shift
right on call entry and logical shift left on return.
The depth tracking variable is initialized to 0x8000.... when the call
depth is zero. The arithmetic shift right sign extends the MSB and
saturates after the 12th call. The shift count is 5 so the tracking covers
12 nested calls. On return the variable is shifted left logically so it
becomes zero again.
CALL RET
0: 0x8000000000000000 0x0000000000000000
1: 0xfc00000000000000 0xf000000000000000
...
11: 0xfffffffffffffff8 0xfffffffffffffc00
12: 0xffffffffffffffff 0xffffffffffffffe0
After a return buffer fill the depth is credited 12 calls before the next
stuffing has to take place.
There is a inaccuracy for situations like this:
10 calls
5 returns
3 calls
4 returns
3 calls
....
The shift count might cause this to be off by one in either direction, but
there is still a cushion vs. the RSB depth. The algorithm does not claim to
be perfect, but it should obfuscate the problem enough to make exploitation
extremly difficult.
The theory behind this is:
RSB is a stack with depth 16 which is filled on every call. On the return
path speculation "pops" entries to speculate down the call chain. Once the
speculative RSB is empty it switches to other predictors, e.g. the Branch
History Buffer, which can be mistrained by user space and misguide the
speculation path to a gadget.
Call depth tracking is designed to break this speculation path by stuffing
speculation trap calls into the RSB which are never getting a corresponding
return executed. This stalls the prediction path until it gets resteered,
The assumption is that stuffing at the 12th return is sufficient to break
the speculation before it hits the underflow and the fallback to the other
predictors. Testing confirms that it works. Johannes, one of the retbleed
researchers. tried to attack this approach but failed.
There is obviously no scientific proof that this will withstand future
research progress, but all we can do right now is to speculate about it.
The SAR/SHL usage was suggested by Andi Kleen.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.890071690@infradead.org
In preparation for call depth tracking on Intel SKL CPUs, make it possible
to patch in a SKL specific return thunk.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.680469665@infradead.org
As for the builtins create call thunks and patch the call sites to call the
thunk on Intel SKL CPUs for retbleed mitigation.
Note, that module init functions are ignored for sake of simplicity because
loading modules is not something which is done in high frequent loops and
the attacker has not really a handle on when this happens in order to
launch a matching attack. The depth tracking will still work for calls into
the builtins and because the call is not accounted it will underflow faster
and overstuff, but that's mitigated by the saturating counter and the side
effect is only temporary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.575673066@infradead.org
Mitigating the Intel SKL RSB underflow issue in software requires to
track the call depth. That is every CALL and every RET need to be
intercepted and additional code injected.
The existing retbleed mitigations already include means of redirecting
RET to __x86_return_thunk; this can be re-purposed and RET can be
redirected to another function doing RET accounting.
CALL accounting will use the function padding introduced in prior
patches. For each CALL instruction, the destination symbol's padding
is rewritten to do the accounting and the CALL instruction is adjusted
to call into the padding.
This ensures only affected CPUs pay the overhead of this accounting.
Unaffected CPUs will leave the padding unused and have their 'JMP
__x86_return_thunk' replaced with an actual 'RET' instruction.
Objtool has been modified to supply a .call_sites section that lists
all the 'CALL' instructions. Additionally the paravirt instruction
sites are iterated since they will have been patched from an indirect
call to direct calls (or direct instructions in which case it'll be
ignored).
Module handling and the actual thunk code for SKL will be added in
subsequent steps.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.470877038@infradead.org
The upcoming call thunk patching must hold text_mutex and needs access to
text_poke_copy(), which takes text_mutex.
Provide a _locked postfixed variant to expose the inner workings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111147.159977224@infradead.org
In preparation for call depth tracking provide a section which collects all
direct calls.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111146.016511961@infradead.org
It turns out that 'stack_canary_offset' is a variable name; shadowing
that with a #define is ripe of fail when the asm-offsets.h header gets
included. Rename the thing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Further extend struct pcpu_hot with the hard and soft irq stack
pointers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111145.599170752@infradead.org
Extend the struct pcpu_hot cacheline with current_top_of_stack;
another very frequently used value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111145.493038635@infradead.org
Also add cpu_number to the pcpu_hot structure, it is often referenced
and this cacheline is there.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111145.387678283@infradead.org
Add preempt_count to pcpu_hot, since it is once of the most used
per-cpu variables.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111145.284170644@infradead.org
The layout of per-cpu variables is at the mercy of the compiler. This
can lead to random performance fluctuations from build to build.
Create a structure to hold some of the hottest per-cpu variables,
starting with current_task.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111145.179707194@infradead.org
The section ordering in the text section is more than suboptimal:
ALIGN_ENTRY_TEXT_BEGIN
ENTRY_TEXT
ALIGN_ENTRY_TEXT_END
SOFTIRQENTRY_TEXT
STATIC_CALL_TEXT
INDIRECT_THUNK_TEXT
ENTRY_TEXT is in a seperate PMD so it can be mapped into the cpu entry area
when KPTI is enabled. That means the sections after it are also in a
seperate PMD. That's wasteful especially as the indirect thunk text is a
hotpath on retpoline enabled systems and the static call text is fairly hot
on 32bit.
Move the entry text section last so that the other sections share a PMD
with the text before it. This is obviously just best effort and not
guaranteed when the previous text is just at a PMD boundary.
The text section placement needs an overhaul in general. There is e.g. no
point to have debugfs, sysfs, cpuhotplug and other rarely used functions
next to hot path text.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111143.614728935@infradead.org
Instead of resetting permissions all over the place when freeing module
memory tell the vmalloc code to do so. Avoids the exercise for the next
upcoming user.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111143.406703869@infradead.org
Commit 5416c26635 ("x86: make sure load_percpu_segment has no
stackprotector") disabled the stackprotector for cpu/common.c because of
load_percpu_segment(). Back then the boot stack canary was initialized very
early in start_kernel(). Switching the per CPU area by loading the GDT
caused the stackprotector to fail with paravirt enabled kernels as the
GSBASE was not updated yet. In hindsight a wrong change because it would
have been sufficient to ensure that the canary is the same in both per CPU
areas.
Commit d55535232c ("random: move rand_initialize() earlier") moved the
stack canary initialization to a later point in the init sequence. As a
consequence the per CPU stack canary is 0 when switching the per CPU areas,
so there is no requirement anymore to exclude this file.
Add a comment to load_percpu_segment().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111143.303010511@infradead.org
The only place where switch_to_new_gdt() is required is early boot to
switch from the early GDT to the direct GDT. Any other invocation is
completely redundant because it does not change anything.
Secondary CPUs come out of the ASM code with GDT and GSBASE correctly set
up. The same is true for XEN_PV.
Remove all the voodoo invocations which are left overs from the ancient
past, rename the function to switch_gdt_and_percpu_base() and mark it init.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111143.198076128@infradead.org
On 32bit FS and on 64bit GS segments are already set up correctly, but
load_percpu_segment() still sets [FG]S after switching from the early GDT
to the direct GDT.
For 32bit the segment load has no side effects, but on 64bit it causes
GSBASE to become 0, which means that any per CPU access before GSBASE is
set to the new value is going to fault. That's the reason why the whole
file containing this code has stackprotector removed.
But that's a pointless exercise for both 32 and 64 bit as the relevant
segment selector is already correct. Loading the new GDT does not change
that.
Remove the segment loads and add comments.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111143.097052006@infradead.org
The symbol is not used outside of the file, so mark it static.
Fixes the following warning:
arch/x86/kernel/tsc.c:53:20: warning:
symbol 'art_related_clocksource' was not declared. Should it be static?
Signed-off-by: Chen Lifu <chenlifu@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220823021821.3052159-1-chenlifu@huawei.com
== Background ==
The XSTATE init code initializes all enabled and supported components.
Then, the init states are saved in the init_fpstate buffer that is
statically allocated in about one page.
The AMX TILE_DATA state is large (8KB) but its init state is zero. And the
feature comes only with the compacted format with these established
dependencies: AMX->XFD->XSAVES. So this state is excludable from
init_fpstate.
== Problem ==
But the buffer is formatted to include that large state. Then, this can be
the cause of a noisy splat like the below.
This came from XRSTORS for the task with init_fpstate in its XSAVE buffer.
It is reproducible on AMX systems when the running kernel is built with
CONFIG_DEBUG_PAGEALLOC=y and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y:
Bad FPU state detected at restore_fpregs_from_fpstate+0x57/0xd0, reinitializing FPU registers.
...
RIP: 0010:restore_fpregs_from_fpstate+0x57/0xd0
? restore_fpregs_from_fpstate+0x45/0xd0
switch_fpu_return+0x4e/0xe0
exit_to_user_mode_prepare+0x17b/0x1b0
syscall_exit_to_user_mode+0x29/0x40
do_syscall_64+0x67/0x80
? do_syscall_64+0x67/0x80
? exc_page_fault+0x86/0x180
entry_SYSCALL_64_after_hwframe+0x63/0xcd
== Solution ==
Adjust init_fpstate to exclude dynamic states. XRSTORS from init_fpstate
still initializes those states when their bits are set in the
requested-feature bitmap.
Fixes: 2308ee57d9 ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
Reported-by: Lin X Wang <lin.x.wang@intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Lin X Wang <lin.x.wang@intel.com>
Link: https://lore.kernel.org/r/20220824191223.1248-4-chang.seok.bae@intel.com
The init_fpstate buffer is statically allocated. Thus, the sanity test was
established to check whether the pre-allocated buffer is enough for the
calculated size or not.
The currently measured size is not strictly relevant. Fix to validate the
calculated init_fpstate size with the pre-allocated area.
Also, replace the sanity check function with open code for clarity. The
abstraction itself and the function naming do not tend to represent simply
what it does.
Fixes: 2ae996e0c1 ("x86/fpu: Calculate the default sizes independently")
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220824191223.1248-3-chang.seok.bae@intel.com
The init_fpstate setup code is spread out and out of order. The init image
is recorded before its scoped features and the buffer size are determined.
Determine the scope of init_fpstate components and its size before
recording the init state. Also move the relevant code together.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: neelnatu@google.com
Link: https://lore.kernel.org/r/20220824191223.1248-2-chang.seok.bae@intel.com
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)
@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@
- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
linux-next for a couple of months without, to my knowledge, any negative
reports (or any positive ones, come to that).
- Also the Maple Tree from Liam R. Howlett. An overlapping range-based
tree for vmas. It it apparently slight more efficient in its own right,
but is mainly targeted at enabling work to reduce mmap_lock contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
(https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
This has yet to be addressed due to Liam's unfortunately timed
vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down to
the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
=xfWx
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
From Phil Auld:
drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES
From me:
cpumask: cleanup nr_cpu_ids vs nr_cpumask_bits mess
This series cleans that mess and adds new config FORCE_NR_CPUS that
allows to optimize cpumask subsystem if the number of CPUs is known
at compile-time.
From me:
lib: optimize find_bit() functions
Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT() macros.
From me:
lib/find: add find_nth_bit()
Adds find_nth_bit(), which is ~70 times faster than bitcounting with
for_each() loop:
for_each_set_bit(bit, mask, size)
if (n-- == 0)
return bit;
Also adds bitmap_weight_and() to let people replace this pattern:
tmp = bitmap_alloc(nbits);
bitmap_and(tmp, map1, map2, nbits);
weight = bitmap_weight(tmp, nbits);
bitmap_free(tmp);
with a single bitmap_weight_and() call.
From me:
cpumask: repair cpumask_check()
After switching cpumask to use nr_cpu_ids, cpumask_check() started
generating many false-positive warnings. This series fixes it.
From Valentin Schneider:
bitmap,cpumask: Add for_each_cpu_andnot() and for_each_cpu_andnot()
Extends the API with one more function and applies it in sched/core.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmNBwmUACgkQsUSA/Tof
vshPRwv+KlqnZlKtuSPgbo/Kgswworpi/7TqfnN9GWlb8AJ2uhjBKI3GFwv4TDow
7KV6wdKdXYLr4pktcIhWy3qLrT+bDDExfarHRo3QI1A1W42EJ+ZiUaGnQGcnVMzD
5q/K1YMJYq0oaesHEw5PVUh8mm6h9qRD8VbX1u+riW/VCWBj3bho9Dp4mffQ48Q6
hVy/SnMGgClQwNYp+sxkqYx38xUqUGYoU5MzeziUmoS6pZQh+4lF33MULnI3EKmc
/ehXilPPtOV/Tm0RovDWFfm3rjNapV9FXHu8Ob2z/c+1A29EgXnE3pwrBDkAx001
TQrL9qbCANRDGPLzWQHw0dwFIaXvTdrSttCsfYYfU5hI4JbnJEe0Pqkaaohy7jqm
r0dW/TlyOG5T+k8Kwdx9w9A+jKs8TbKKZ8HOaN8BpkXswVnpbzpQbj3TITZI4aeV
6YR4URBQ5UkrVLEXFXbrOzwjL2zqDdyNoBdTJmGLJ+5b/n0HHzmyMVkegNIwLLM3
GR7sMQae
=Q/+F
-----END PGP SIGNATURE-----
Merge tag 'bitmap-6.1-rc1' of https://github.com/norov/linux
Pull bitmap updates from Yury Norov:
- Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES (Phil Auld)
- cleanup nr_cpu_ids vs nr_cpumask_bits mess (me)
This series cleans that mess and adds new config FORCE_NR_CPUS that
allows to optimize cpumask subsystem if the number of CPUs is known
at compile-time.
- optimize find_bit() functions (me)
Reworks find_bit() functions based on new FIND_{FIRST,NEXT}_BIT()
macros.
- add find_nth_bit() (me)
Adds find_nth_bit(), which is ~70 times faster than bitcounting with
for_each() loop:
for_each_set_bit(bit, mask, size)
if (n-- == 0)
return bit;
Also adds bitmap_weight_and() to let people replace this pattern:
tmp = bitmap_alloc(nbits);
bitmap_and(tmp, map1, map2, nbits);
weight = bitmap_weight(tmp, nbits);
bitmap_free(tmp);
with a single bitmap_weight_and() call.
- repair cpumask_check() (me)
After switching cpumask to use nr_cpu_ids, cpumask_check() started
generating many false-positive warnings. This series fixes it.
- Add for_each_cpu_andnot() and for_each_cpu_andnot() (Valentin
Schneider)
Extends the API with one more function and applies it in sched/core.
* tag 'bitmap-6.1-rc1' of https://github.com/norov/linux: (28 commits)
sched/core: Merge cpumask_andnot()+for_each_cpu() into for_each_cpu_andnot()
lib/test_cpumask: Add for_each_cpu_and(not) tests
cpumask: Introduce for_each_cpu_andnot()
lib/find_bit: Introduce find_next_andnot_bit()
cpumask: fix checking valid cpu range
lib/bitmap: add tests for for_each() loops
lib/find: optimize for_each() macros
lib/bitmap: introduce for_each_set_bit_wrap() macro
lib/find_bit: add find_next{,_and}_bit_wrap
cpumask: switch for_each_cpu{,_not} to use for_each_bit()
net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}
cpumask: add cpumask_nth_{,and,andnot}
lib/bitmap: remove bitmap_ord_to_pos
lib/bitmap: add tests for find_nth_bit()
lib: add find_nth{,_and,_andnot}_bit()
lib/bitmap: add bitmap_weight_and()
lib/bitmap: don't call __bitmap_weight() in kernel code
tools: sync find_bit() implementation
lib/find_bit: optimize find_next_bit() functions
lib/find_bit: create find_first_zero_bit_le()
...
Major changes:
- Changed location of tracing repo from personal git repo to:
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
- Added Masami Hiramatsu as co-maintainer
- Updated MAINTAINERS file to separate out FTRACE as it is
more than just TRACING.
Minor changes:
- Added Mark Rutland as FTRACE reviewer
- Updated user_events to make it on its way to remove the BROKEN tag.
The changes should now be acceptable but will run it through
a cycle and hopefully we can remove the BROKEN tag next release.
- Added filtering to eprobes
- Added a delta time to the benchmark trace event
- Have the histogram and filter callbacks called via a switch
statement instead of indirect functions. This speeds it up to
avoid retpolines.
- Add a way to wake up ring buffer waiters waiting for the
ring buffer to fill up to its watermark.
- New ioctl() on the trace_pipe_raw file to wake up ring buffer
waiters.
- Wake up waiters when the ring buffer is disabled.
A reader may block when the ring buffer is disabled,
but if it was blocked when the ring buffer is disabled
it should then wake up.
Fixes:
- Allow splice to read partially read ring buffer pages
Fixes splice never moving forward.
- Fix inverted compare that made the "shortest" ring buffer
wait queue actually the longest.
- Fix a race in the ring buffer between resetting a page when
a writer goes to another page, and the reader.
- Fix ftrace accounting bug when function hooks are added at
boot up before the weak functions are set to "disabled".
- Fix bug that freed a user allocated snapshot buffer when
enabling a tracer.
- Fix possible recursive locks in osnoise tracer
- Fix recursive locking direct functions
- And other minor clean ups and fixes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYz70cxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qpLKAP4+yOje7ZY/b3R4tTx0EIWiKdhqPx6t
Nvam2+WR2PN3QQEAqiK2A+oIbh3Zjp1MyhQWuulssWKtSTXhIQkbs7ioYAc=
=MsQw
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
"Major changes:
- Changed location of tracing repo from personal git repo to:
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
- Added Masami Hiramatsu as co-maintainer
- Updated MAINTAINERS file to separate out FTRACE as it is more than
just TRACING.
Minor changes:
- Added Mark Rutland as FTRACE reviewer
- Updated user_events to make it on its way to remove the BROKEN tag.
The changes should now be acceptable but will run it through a
cycle and hopefully we can remove the BROKEN tag next release.
- Added filtering to eprobes
- Added a delta time to the benchmark trace event
- Have the histogram and filter callbacks called via a switch
statement instead of indirect functions. This speeds it up to avoid
retpolines.
- Add a way to wake up ring buffer waiters waiting for the ring
buffer to fill up to its watermark.
- New ioctl() on the trace_pipe_raw file to wake up ring buffer
waiters.
- Wake up waiters when the ring buffer is disabled. A reader may
block when the ring buffer is disabled, but if it was blocked when
the ring buffer is disabled it should then wake up.
Fixes:
- Allow splice to read partially read ring buffer pages. This fixes
splice never moving forward.
- Fix inverted compare that made the "shortest" ring buffer wait
queue actually the longest.
- Fix a race in the ring buffer between resetting a page when a
writer goes to another page, and the reader.
- Fix ftrace accounting bug when function hooks are added at boot up
before the weak functions are set to "disabled".
- Fix bug that freed a user allocated snapshot buffer when enabling a
tracer.
- Fix possible recursive locks in osnoise tracer
- Fix recursive locking direct functions
- Other minor clean ups and fixes"
* tag 'trace-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (44 commits)
ftrace: Create separate entry in MAINTAINERS for function hooks
tracing: Update MAINTAINERS to reflect new tracing git repo
tracing: Do not free snapshot if tracer is on cmdline
ftrace: Still disable enabled records marked as disabled
tracing/user_events: Move pages/locks into groups to prepare for namespaces
tracing: Add Masami Hiramatsu as co-maintainer
tracing: Remove unused variable 'dups'
MAINTAINERS: add myself as a tracing reviewer
ring-buffer: Fix race between reset page and reading page
tracing/user_events: Update ABI documentation to align to bits vs bytes
tracing/user_events: Use bits vs bytes for enabled status page data
tracing/user_events: Use refcount instead of atomic for ref tracking
tracing/user_events: Ensure user provided strings are safely formatted
tracing/user_events: Use WRITE instead of READ for io vector import
tracing/user_events: Use NULL for strstr checks
tracing: Fix spelling mistake "preapre" -> "prepare"
tracing: Wake up waiters when tracing is disabled
tracing: Add ioctl() to force ring buffer waiters to wake up
tracing: Wake up ring buffer waiters on closing of the file
ring-buffer: Add ring_buffer_wake_waiters()
...
- Remove potentially incomplete targets when Kbuid is interrupted by
SIGINT etc. in case GNU Make may miss to do that when stderr is piped
to another program.
- Rewrite the single target build so it works more correctly.
- Fix rpm-pkg builds with V=1.
- List top-level subdirectories in ./Kbuild.
- Ignore auto-generated __kstrtab_* and __kstrtabns_* symbols in kallsyms.
- Avoid two different modules in lib/zstd/ having shared code, which
potentially causes building the common code as build-in and modular
back-and-forth.
- Unify two modpost invocations to optimize the build process.
- Remove head-y syntax in favor of linker scripts for placing particular
sections in the head of vmlinux.
- Bump the minimal GNU Make version to 3.82.
- Clean up misc Makefiles and scripts.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmM+4vcVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGY2IQAInr0JUNnkkxwUSXtOcQuA3IK8RJ
FbU9HXJRoV9H+7+l3SMlN7mIbrs5eE5fTY3iwQ3CVe139d1+1q7nvTMRv8owywJx
GBgzswncuu1lk7iQQ//CxiqMwSCG8GJdYn1uDVy4I5jg3o+DtFZJtyq2Wb7pqsMm
ZhZ4PozRN+idYQJSF6Vx/zEVLHI7quMBwfe4CME8/0Kg2+hnYzbXV/aUf0ED2emq
zdCMDQgIOK5AhY+8qgMXKYnBUJMTqBp6LoR4p3ApfUkwRFY0sGa0/LK3U/B22OE7
uWyR4fCUExGyerlcHEVev+9eBfmsLLPyqlchNwpSDOPf5OSdnKmgqJEBR/Cvx0eh
URerPk7EHxyH3G8yi+cU2GtofNTGc5RHPRgJE2ADsQEi5TAUKGmbXMlsFRL/51Vn
lTANZObBNa1d4enljF6TfTL5nuccOa+DKvXnH9fQ49t0QdtSikv6J/lGwilwm1Sr
BctmCsySPuURZfkpI9OQnLuouloMXl9f7Q/+S39haS/tSgvPpyITyO71nxDnXn/s
BbFObZJUk9QkqOACjBP1hNErTLt83uBxQ9z+rDCw/SbLIe4nw0wyneuygfHI5rI8
3RZB2DbGauuJHX2Zs6YGS14SLSY33IsLqKR1/Vy3LrPvOHuEvNiOR8LITq5E0YCK
OffK2Y5cIlXR0QWf
=DHiN
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Remove potentially incomplete targets when Kbuid is interrupted by
SIGINT etc in case GNU Make may miss to do that when stderr is piped
to another program.
- Rewrite the single target build so it works more correctly.
- Fix rpm-pkg builds with V=1.
- List top-level subdirectories in ./Kbuild.
- Ignore auto-generated __kstrtab_* and __kstrtabns_* symbols in
kallsyms.
- Avoid two different modules in lib/zstd/ having shared code, which
potentially causes building the common code as build-in and modular
back-and-forth.
- Unify two modpost invocations to optimize the build process.
- Remove head-y syntax in favor of linker scripts for placing
particular sections in the head of vmlinux.
- Bump the minimal GNU Make version to 3.82.
- Clean up misc Makefiles and scripts.
* tag 'kbuild-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (41 commits)
docs: bump minimal GNU Make version to 3.82
ia64: simplify esi object addition in Makefile
Revert "kbuild: Check if linker supports the -X option"
kbuild: rebuild .vmlinux.export.o when its prerequisite is updated
kbuild: move modules.builtin(.modinfo) rules to Makefile.vmlinux_o
zstd: Fixing mixed module-builtin objects
kallsyms: ignore __kstrtab_* and __kstrtabns_* symbols
kallsyms: take the input file instead of reading stdin
kallsyms: drop duplicated ignore patterns from kallsyms.c
kbuild: reuse mksysmap output for kallsyms
mksysmap: update comment about __crc_*
kbuild: remove head-y syntax
kbuild: use obj-y instead extra-y for objects placed at the head
kbuild: hide error checker logs for V=1 builds
kbuild: re-run modpost when it is updated
kbuild: unify two modpost invocations
kbuild: move vmlinux.o rule to the top Makefile
kbuild: move .vmlinux.objs rule to Makefile.modpost
kbuild: list sub-directories in ./Kbuild
Makefile.compiler: replace cc-ifversion with compiler-specific macros
...
- PMU driver updates:
- Add AMD Last Branch Record Extension Version 2 (LbrExtV2)
feature support for Zen 4 processors.
- Extend the perf ABI to provide branch speculation information,
if available, and use this on CPUs that have it (eg. LbrExtV2).
- Improve Intel PEBS TSC timestamp handling & integration.
- Add Intel Raptor Lake S CPU support.
- Add 'perf mem' and 'perf c2c' memory profiling support on
AMD CPUs by utilizing IBS tagged load/store samples.
- Clean up & optimize various x86 PMU details.
- HW breakpoints:
- Big rework to optimize the code for systems with hundreds of CPUs and
thousands of breakpoints:
- Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
per-CPU rwsem that is read-locked during most of the key operations.
- Improve the O(#cpus * #tasks) logic in toggle_bp_slot()
and fetch_bp_busy_slots().
- Apply micro-optimizations & cleanups.
- Misc cleanups & enhancements.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/2pMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIMA/+J+MCEVTt9kwZeBtHoPX7iZ5gnq1+McoQ
f6ALX19AO/ZSuA7EBA3cS3Ny5eyGy3ofYUnRW+POezu9CpflLW/5N27R2qkZFrWC
A09B86WH676ZrmXt+oI05rpZ2y/NGw4gJxLLa4/bWF3g9xLfo21i+YGKwdOnNFpl
DEdCVHtjlMcOAU3+on6fOYuhXDcYd7PKGcCfLE7muOMOAtwyj0bUDBt7m+hneZgy
qbZHzDU2DZ5L/LXiMyuZj5rC7V4xUbfZZfXglG38YCW1WTieS3IjefaU2tREhu7I
rRkCK48ULDNNJR3dZK8IzXJRxusq1ICPG68I+nm/K37oZyTZWtfYZWehW/d/TnPa
tUiTwimabz7UUqaGq9ZptxwINcAigax0nl6dZ3EseeGhkDE6j71/3kqrkKPz4jth
+fCwHLOrI3c4Gq5qWgPvqcUlUneKf3DlOMtzPKYg7sMhla2zQmFpYCPzKfm77U/Z
BclGOH3FiwaK6MIjPJRUXTePXqnUseqCR8PCH/UPQUeBEVHFcMvqCaa15nALed8x
dFi76VywR9mahouuLNq6sUNePlvDd2B124PygNwegLlBfY9QmKONg9qRKOnQpuJ6
UprRJjLOOucZ/N/jn6+ShHkqmXsnY2MhfUoHUoMQ0QAI+n++e+2AuePo251kKWr8
QlqKxd9PMQU=
=LcGg
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
"PMU driver updates:
- Add AMD Last Branch Record Extension Version 2 (LbrExtV2) feature
support for Zen 4 processors.
- Extend the perf ABI to provide branch speculation information, if
available, and use this on CPUs that have it (eg. LbrExtV2).
- Improve Intel PEBS TSC timestamp handling & integration.
- Add Intel Raptor Lake S CPU support.
- Add 'perf mem' and 'perf c2c' memory profiling support on AMD CPUs
by utilizing IBS tagged load/store samples.
- Clean up & optimize various x86 PMU details.
HW breakpoints:
- Big rework to optimize the code for systems with hundreds of CPUs
and thousands of breakpoints:
- Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
per-CPU rwsem that is read-locked during most of the key
operations.
- Improve the O(#cpus * #tasks) logic in toggle_bp_slot() and
fetch_bp_busy_slots().
- Apply micro-optimizations & cleanups.
- Misc cleanups & enhancements"
* tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
perf/hw_breakpoint: Annotate tsk->perf_event_mutex vs ctx->mutex
perf: Fix pmu_filter_match()
perf: Fix lockdep_assert_event_ctx()
perf/x86/amd/lbr: Adjust LBR regardless of filtering
perf/x86/utils: Fix uninitialized var in get_branch_type()
perf/uapi: Define PERF_MEM_SNOOPX_PEER in kernel header file
perf/x86/amd: Support PERF_SAMPLE_PHY_ADDR
perf/x86/amd: Support PERF_SAMPLE_ADDR
perf/x86/amd: Support PERF_SAMPLE_{WEIGHT|WEIGHT_STRUCT}
perf/x86/amd: Support PERF_SAMPLE_DATA_SRC
perf/x86/amd: Add IBS OP_DATA2 DataSrc bit definitions
perf/mem: Introduce PERF_MEM_LVLNUM_{EXTN_MEM|IO}
perf/x86/uncore: Add new Raptor Lake S support
perf/x86/cstate: Add new Raptor Lake S support
perf/x86/msr: Add new Raptor Lake S support
perf/x86: Add new Raptor Lake S support
bpf: Check flags for branch stack in bpf_read_branch_records helper
perf, hw_breakpoint: Fix use-after-free if perf_event_open() fails
perf: Use sample_flags for raw_data
perf: Use sample_flags for addr
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxj0gAKCRBZ7Krx/gZQ
66/1AQC/KfIAINNOPxozsZaxOaOKo0ouVJ7sJV4ZGsPKpU69gwD/UodJZCtyZ52h
wwkmfzTDjAgGt1QCKj96zk2XFqg4swE=
=u0pv
-----END PGP SIGNATURE-----
Merge tag 'pull-file_inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull file_inode() updates from Al Vrio:
"whack-a-mole: cropped up open-coded file_inode() uses..."
* tag 'pull-file_inode' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
orangefs: use ->f_mapping
_nfs42_proc_copy(): use ->f_mapping instead of file_inode()->i_mapping
dma_buf: no need to bother with file_inode()->i_mapping
nfs_finish_open(): don't open-code file_inode()
bprm_fill_uid(): don't open-code file_inode()
sgx: use ->f_mapping...
exfat_iterate(): don't open-code file_inode(file)
ibmvmc: don't open-code file_inode()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM8QskACgkQEsHwGGHe
VUq2ZQ/+KwgmCojK54P05UOClpvf96CLDJA7r4m6ydKiM7GDWFg9wCZdews4JRk1
/5hqfkFZsEAUlloRjRk3Qvd6PWRzDX8X/jjtHn3JyzRHT6ra31tyiZmD2LEb4eb6
D0jIHfZQRYjZP39p3rYSuSMFrdWWE8gCETJLZEflR96ACwHXlm1fH/wSRI2RUG4c
sH7nT/hGqtKiDsmOcb314yjmjraYEW1mKnLKRLfjUwksBET4mOiLTjH175MQ5Yv7
cXZs0LsYvdfCqWSH5uefv32TX/yLsIi8ygaALpXawkoyXTmLr5MwJJykrm60AogV
74gvxc3s3ItO0aKVM0J4ABTUWmU+wg+sjPcJD1MolafnJpsgGdfEKlWfTY4hjMV5
onjtgr7byEdgZU25JtuI0BzPoggahnHvK6LiIvGy9vw8LRdKziKPXsyxuRF4rvXw
0n9ofVRmBCuzUsRS8vbL65K2PcIS4oUmUUSEDmALtGQ9vG8j50k6vM3Fu6HayyJx
7qgjVRpREemqRO21wS7SmR6z1RkT5J+zWv4TdacyyrA9QRqyM6ny/yZGCsfOZA77
+LxBFzITwIXlTgfTDVYnLIi1ZPP2MCK74Gq0Buqsjxz8IOpV6yjB+PSajbJzZv35
gIdbWKc5oHgmcDkrpBCoZ6KQ5ZNvDy6glSdnegkDFjRfVm5eCu0=
=RqjF
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- The usual round of smaller fixes and cleanups all over the tree
* tag 'x86_cleanups_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Include the header of init_ia32_feat_ctl()'s prototype
x86/uaccess: Improve __try_cmpxchg64_user_asm() for x86_32
x86: Fix various duplicate-word comment typos
x86/boot: Remove superfluous type casting from arch/x86/boot/bitops.h
code from the architectural one with the endgoal of plugging ARM's MPAM
implementation into it too so that the user interface remains the same
- Properly restore the MSR_MISC_FEATURE_CONTROL value instead of blindly
overwriting it to 0
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM8QhAACgkQEsHwGGHe
VUqazA/8DIfBYMXe/M6qk+tZnLyBJPL3/3hzqOPc3fu2pmwzCHhb+1ksk7s0uLEO
xdV4CK3SDc8WQnsiF9l4Hta1PhvD2Uhf6duCVv1DT0dmBQ6m9tks8SwhbgSCNrIh
cQ8ABuTUsE0/PNW6Zx7x1JC0e2J6Yjhn55WGMGJD7kGl0eo1ClYSv8vnReBE/6cX
YhgjVnWAeUNgwKayokbN7PFXwuP0WjDGmrn+7e8AF4emHWvdDYYw9F1MHIOvZoVO
lLJi6f7ddjxCQSWPg3mG0KSvc4EXixhtEzq8Mk/16drkKlPdn89sHkqEyR7vP/jQ
lEahxtzoWEfZXwVDPGCIIbfjab/lvvr4lTumKzxUgHEha+ORtWZGaukr4kPg6BRf
IBrE12jCBKmYzzgE0e9EWGr0KCn6qXrnq37yzccQXVM0WxsBOUZWQXhInl6mSdz9
uus1rKR/swJBT58ybzvw2LGFYUow0bb0qY6XvQxmriiyA60EVmf9/Nt/KgatXa63
s9Q4mVii4W1tgxSmCjNVZnDFhXvvowclNU4TuJ6d+6kvEnrvoW5+vDRk2O7iJKqf
K2zSe56lf0TnBe9WaUlxRFaTZg+UXZt7a+e7/hQ90wT/7fkIMk1uxVpqnQW4vDPi
YskbKRPc5DlLBSJ+yxW9Ntff4QVIdUhhj0bcKBAo8nmd5Kj1hy4=
=1iEb
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache resource control updates from Borislav Petkov:
- More work by James Morse to disentangle the resctrl filesystem
generic code from the architectural one with the endgoal of plugging
ARM's MPAM implementation into it too so that the user interface
remains the same
- Properly restore the MSR_MISC_FEATURE_CONTROL value instead of
blindly overwriting it to 0
* tag 'x86_cache_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/resctrl: Make resctrl_arch_rmid_read() return values in bytes
x86/resctrl: Add resctrl_rmid_realloc_limit to abstract x86's boot_cpu_data
x86/resctrl: Rename and change the units of resctrl_cqm_threshold
x86/resctrl: Move get_corrected_mbm_count() into resctrl_arch_rmid_read()
x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()
x86/resctrl: Pass the required parameters into resctrl_arch_rmid_read()
x86/resctrl: Abstract __rmid_read()
x86/resctrl: Allow per-rmid arch private storage to be reset
x86/resctrl: Add per-rmid arch private storage for overflow and chunks
x86/resctrl: Calculate bandwidth from the previous __mon_event_count() chunks
x86/resctrl: Allow update_mba_bw() to update controls directly
x86/resctrl: Remove architecture copy of mbps_val
x86/resctrl: Switch over to the resctrl mbps_val list
x86/resctrl: Create mba_sc configuration in the rdt_domain
x86/resctrl: Abstract and use supports_mba_mbps()
x86/resctrl: Remove set_mba_sc()s control array re-initialisation
x86/resctrl: Add domain offline callback for resctrl work
x86/resctrl: Group struct rdt_hw_domain cleanup
x86/resctrl: Add domain online callback for resctrl work
x86/resctrl: Merge mon_capable and mon_enabled
...
- By popular demand, print the previous microcode revision an update
was done over
- Remove more code related to the now gone MICROCODE_OLD_INTERFACE
- Document the problems stemming from microcode late loading
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM8QL8ACgkQEsHwGGHe
VUoDxw/9FA3rOAZD7N0PI/vspMUxEDQVYV60tfuuynao72HZv+tfJbRTXe42p3ZO
B+kRPFud4lAOE1ykDHJ2A2OZzvthGfYlUnMyvk1IvK/gOwkkiSH4c6sVSrOYWtl7
uoIN/3J83BMZoWNOKqrg1OOzotzkTyeucPXdWF+sRkfVzBIgbDqtplbFFCP4abPK
WxatY2hkTfBCiN92OSOLaMGg0POpmycy+6roR2Qr5rWrC7nfREVNbKdOyEykZsfV
U2gPm0A953sZ3Ye6waFib+qjJdyR7zBQRCJVEGOB6g8BlNwqGv/TY7NIUWSVFT9Y
qcAnD3hI0g0UTYdToBUvYEpfD8zC9Wg3tZEpZSBRKh3AR2+Xt44VKQFO4L9uIt6g
hWFMBLsFiYnBmKW3arNLQcdamE34GRhwUfXm0OjHTvTWb3aFO1I9+NBCaHp19KVy
HD13wGSyj5V9SAVD0ztRFut4ZESejDyYBw9joB2IsjkY2IJmAAsRFgV0KXqUvQLX
TX13hnhm894UfQ+4KCXnA0UeEDoXhwAbYFxR89yGeOxoGe1oaPXr9C1/r88YLq0n
ekjIVZ3G97PIxmayj3cv9YrRIrrJi4PWF1Raey6go3Ma+rNBRnya5UF6Noch1lHh
HeF7t84BZ5Ub6GweWYaMHQZCA+wMCZMYYuCMNzN7b54yRtQuvCc=
=lWDD
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x75 microcode loader updates from Borislav Petkov:
- Get rid of a single ksize() usage
- By popular demand, print the previous microcode revision an update
was done over
- Remove more code related to the now gone MICROCODE_OLD_INTERFACE
- Document the problems stemming from microcode late loading
* tag 'x86_microcode_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/AMD: Track patch allocation size explicitly
x86/microcode: Print previous version of microcode after reload
x86/microcode: Remove ->request_microcode_user()
x86/microcode: Document the whole late loading problem
- Correct APM entry's Konfig help text
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM8PoMACgkQEsHwGGHe
VUpptBAAm31ZnZT1u4OMCWIfS8dUJYKQA0pg4Xb4Hj0epFFjqGvV15+JFLQrvNv8
kWbCZS7wryNsch0dnNyxico6mBD6BGqUSTsiflcpk5rWzy0JOpIfeCs5dsNYhIkV
Pqk0qXK25kfzYW/7I2HE/aNeE7jxAOTcKzLtW6J74UYj3jLWSmtlPP/o5qxMUeN+
vYlLg45jhjcTH5cBD1U2MmbJP3M8WRcRgHpI7ZGrvOXUgGbTqdu0m+dsCZIkIzps
ePDgX3mgAU6wjckjH333hbas256SAWWEtBYID/71ZIgK8EjabIu6FcwWAxyfexjp
OU/k2iBNOXHbzhO4lEzqar/lcNHyfI4edxw4gsmIQddmmNI2b62s0Nis7Or8DDU6
v5ZPrW6tcvVT7cZqP3naidRTNW2Cwo/+/Z09VQnhPQaZ7U8U5uCS4wgHiJuV3e0s
nH02QU+t/9zrn46UurVXCSSHwWNaGNA4Gb3a4ZdwRUKdxMrf5frmU+dfiFhdMMoo
I3lxQvMFtFb2rLNRA6zWDhsJFVU3F5cMu7zlGKHTNlD/TcO/qcKqiTkrAyM5trLG
NqbPdwBVzdwLR6MgYOUDI+nB5Ad1Uq2Cx+SfRLT+C8L6dZFB+zvNiBH/8Q+25ohC
c4ldgXSiZulfLApG3JY5chVvFWGa2H+EpoFexyQi+F9AAIRD14Q=
=cNp0
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Borislav Petkov:
- Drop misleading "RIP" from the opcodes dumping message
- Correct APM entry's Konfig help text
* tag 'x86_misc_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/dumpstack: Don't mention RIP in "Code: "
x86/Kconfig: Specify idle=poll instead of no-hlt
as both vendors suggest
- Clean up pciserial a bit
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM8AxoACgkQEsHwGGHe
VUrdqQ//c07XggioZRVWASkE5vXggSuM1BzCkf55kzx3Hp20F8ui26cgOW8nmNr9
8pKjcTSGZvZPyZFHPBmZDFqver+viSAA3y+/DopNMd6DIeQWOCtJMFdoGPuPqCMq
qYb3WkGu3i4AcjHxeiYKIlf76w+DfSomk+NspUYKLxGuPH6Hg5MFLwv32c0orh/g
sMM5YCFP4xKtMTrpZ8GXO8+81dU57KwCPnmLy6RrPPCjIHkEmS9KBaib2T9udFvf
DMQVZGEU4DQkl+SXoj5fOn03eQ56kDaH3OyV7JbHUwIXpASAEJZENwBD/HmP02aB
uN+tm9brjyJLPD92FldoJPkcsYBcwS1vx+TBH7/KLsaGZziZ1+Oc8hk6fbAd6yp9
W/ex491SCmdBA0wQelw/fZJ24/QXOT1PGIKl96srH0VPvhQtCESMDbKMdTY4p6Ow
s6GYF+HW6SV5d/jUQQzfmd1D7Ur6RxwvUnRvfDg8/AZ+INQRciMG0aFJa3E1SUJf
TkfXZ12XsL2LABK4MWOJqJsB+DSnEZbjOaPxWIKBhMUHxoc8nIny5ZFZbww5iVii
pUdKEMgicm1zuaiIoQBY9K6oOlR2ixZJDw++NzfId2jwsiR4MeJH78sB61mItMfs
WnwIkNxNfimgvcGtAogTMalJlnN56ukrjjwGrdhJGL3WovQ8qHc=
=Ggh3
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core fixes from Borislav Petkov:
- Make sure an INT3 is slapped after every unconditional retpoline JMP
as both vendors suggest
- Clean up pciserial a bit
* tag 'x86_core_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86,retpoline: Be sure to emit INT3 after JMP *%\reg
x86/earlyprintk: Clean up pciserial
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM7/C0ACgkQEsHwGGHe
VUp3jA//caeFnxfKU8F6B+D+sqXub+fRC88U/5j5TVgueiFdpgbHxbeb5isV2DW8
/1as5LN6PCy0V9ax873PblaRdL+ir2KVjzTOQhv1fsBWtRJgRAte6ltHfncST3jM
Qm9ELCSYlwX6wgPDFWrLd5/jCHLq/7ffvWATwt4ZS9/+ebwJJ3JRU82aydvSAe+M
tYDLA0RYjrTKB7bfhlxft8n3/ttOwmaQ9sPw6J9KWL9VZ8d7uA0cZgHMFgclr9xb
+cTuOQwJ+a+njS3tXYWH0lU4lLw6tRcWrl3/gHpiuiwU85xdoClyCcsmvZYMLy1s
cv4CmAQqs0dSlcII/F+6pTh+E8gcpzm2+uiP8k56dCK3c9OhtdKcPRcxm169JID+
hKZ9rcscx1e/Q3rzhRtC4w2LguDcARe2b0OSMsGH40uZ/UOHdnYVk8oD81O9LWrB
T4DgjJCcSf4y0KGDXtjJ39POfe0zFHgx8nr5JOL9Las6gZtXZHRmCzBacIU6JeJo
5VEl6Yx1ca7aiuNTTcsmmW3zbsR9J6VppFZ6ma9jTTM8HbINM6g4UDepkeubNUdv
tAJVbb6MEDGtT1ybH4oalrYkPfhYDwRkahqPtMY0ROtlzxO65q2DI3rH2gtyboas
oP4hfEDtzKiJj4FiIRMKbryKz7MBWImHMRovRGKjB/Y48mTxX3M=
=e8H8
-----END PGP SIGNATURE-----
Merge tag 'x86_apic_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC update from Borislav Petkov:
- Add support for locking the APIC in X2APIC mode to prevent SGX
enclave leaks
* tag 'x86_apic_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Don't disable x2APIC if locked
granularity of the memory error instead of hard-coding it
- Offline memory pages on Intel machines after 2 errors reported per page
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM7+ZAACgkQEsHwGGHe
VUp3Rg/9HkwMcl8HOEShQoN1HvhDh68DeR0pgZpFjooBgilqFuPopWn7rwOsebza
b6Z1XYg3pmWwkG35ztHedFP8WkRb8aZDJV7czpGxiaxtRUCuXpGZJk45CFzU+3l8
FD9XKryDf6rPSKmg6w/63FUPVXJXpM8+pAlEso9BVrb6xmUb3n6ul5A/kt3Z4OBZ
beNDI54s7pfKXmCYze7VbK2nqyEM9fmDuPAhaicB/qhVEqxSWMNKDYDhHs1wt/Um
dOlLC2paqcc9EaRhV/L/9Pvi1zasy7Q0+jSIXhgbag7EQmZRYQn3Pe0xqUJNhK2H
ulm65Dy+AZ3y15+qBkBOgoNF/By5eMwJ+C9bucC6FWkPG0XMjDVGiphf+DPZmAHk
msYSvyp++WidNYn/bXhbQv0epjheLoHj22lTvlHwLquI+eISuDV24DIcBQQpFFed
S/8sMw3RqoOgAd3LHr1NGcd/7eF+b/Lc1mtUx7qhg/W0V2cYVTGIcNJMAJ8USHMy
IxFiB+8x7ps0lRJGlaApylLQOGAhS9WxpH2UUIKIoURrmCi9Pf8OaHzfIA5jYK+t
xEQTzZE5ZUB0V5mvNhAWTm9HvlaW1WNbokt0vElAEJ9WAjRfMVYKya7shlAK2MYI
nrM3eqlUwOqTGLGCA7IhCS9Re7M5ZKZk5nsLygnYsadntqCmN0c=
=5tV9
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Fix the APEI MCE callback handler to consult the hardware about the
granularity of the memory error instead of hard-coding it
- Offline memory pages on Intel machines after 2 errors reported per
page
* tag 'ras_core_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Retrieve poison range from hardware
RAS/CEC: Reduce offline page threshold for Intel systems
using the respective functionality from the RTC library
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM77hwACgkQEsHwGGHe
VUr0MBAAto17sVeaQUcJ9bw6qNu/IWwCt35X+dPXxXDANOxzZUdN+EEyEVcxIP45
f9nL3IiMPXPBWCyLqVbxAHi5z1FUc8NC3EJdMDQuCvZ4/OXyt1jyoRZ1V7NOUHYm
EJBS52GJJOZk0DCmv7baSrGZ0d5qytMPmMlg7Phs8uaHzzL4/PRKDQvWWkYw9eoA
fqEVVrHXi9D18PDvFCwAA83jbDrtLk52KzLrqW57jx6+V1CdOO+GFytNNOgYvwef
FFRRo/BynSEONQNSX3RiDQ7NkBhgdpOaE9RglNyeHk0VctAiawkH0yEjyz2zOsyQ
PAe8KBpnzUW49PMUBYKbOTLaOa73OYhdTEbeuV2cntE1OrH35sCH7scKIjJXyAer
FFhz8nDkT36wXYtzHMgioE7Q8vUO04D3IGRAA3CADmX5BeIhgLlU7KDXoDVdJfAP
aJUVojCFM0CI3ASmZhQbwEH+otycrcIs6bhkq/X+qQmcnTD9Q+7GqU+WdSPfHTbg
29Q+gn4IY/WWqdIgCH5i+XV0PnROcWNtb8YsgkN5GRyGV9AIRhNut20CREdBB6vG
HVsWjimVJXbA6ysG4TNxcBENALaXYNyqICkgo/r0G1lOp90CseuOKhJI4JbZx+IF
VuPyKKvk3r86sP1OkmFn46i52kyxl4OTPwFFYFpHOkfEczm4/tU=
=9t5p
-----END PGP SIGNATURE-----
Merge tag 'x86_timers_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RTC cleanups from Borislav Petkov:
- Cleanup x86/rtc.c and delete duplicated functionality in favor of
using the respective functionality from the RTC library
* tag 'x86_timers_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/rtc: Rename mach_set_rtc_mmss() to mach_set_cmos_time()
x86/rtc: Rewrite & simplify mach_get_cmos_time() by deleting duplicated functionality
is running as a guest on the ACRN hypervisor
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM7614ACgkQEsHwGGHe
VUqwcBAAhCrQdAn7nV8MWfktmZ97/KyISdvX9zb6ecc55kgJgbFTpun9BuC7Tcso
s7qwgIo2vihCpdsK2mBdydpG/VAWIsGpxSmf2GXpwC6S2CnlylsB15yoWgnV3dEV
owvRHp9H+9NkIMPzWxTAN8RdmyHw05qyhvYfXUhSAlSYzDT2LcrUTIMMYVZ6rpXM
rGKlqEmUeo4tinDmJhHB09W+D1LK2Aex10O/ESq/VT/5BAZ/Ie2QN4+6ShLqg23T
sd+Q8ho+4nbKJmlrMaAsUqx1FfxNASbDhxKmdHSln4NWZBMDMoMrMBJVGcpgqFbk
/qGAV+SRBNAz5MTusgKwp/6Cka3ms5Q5Ild0NGCSZK3M6QBKpzeFi8UPRpYDnS9J
Gfy8CHOsfhc3g5AmPeiOnaJw9rKiinCUALf7nbLFyLcT4Kpr5QqC3qpKmmtHJjT/
ksTrEs3t4bCXQB6aayIPKWjmRAEEPvI/seGE8mkKMQY26ENdwv4ZkXHNrXmf0Z/L
YWplbvz4oBPqwPBGrzYmLdEqzhN9ywTfN5CF0pZ0HKhQyzJGHhxXEMURMM0loUQY
M886q3Ur8z46+PPJdlzCNOBS0SKSUn2HU2Dl1YNBCqrTeVLWOVN1cTOVHTlh8PCN
cB5Myz+eQoXD1uzfHUEfDCwGDWSmFw2aidx09KNL+HjJtXl2K1M=
=Ds89
-----END PGP SIGNATURE-----
Merge tag 'x86_platform_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform update from Borislav Petkov:
"A single x86/platform improvement when the kernel is running as an
ACRN guest:
- Get TSC and CPU frequency from CPUID leaf 0x40000010 when the
kernel is running as a guest on the ACRN hypervisor"
* tag 'x86_platform_for_v6.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/acrn: Set up timekeeping
This replaces the prior support for Clang's standard Control Flow
Integrity (CFI) instrumentation, which has required a lot of special
conditions (e.g. LTO) and work-arounds. The current implementation
("Kernel CFI") is specific to C, directly designed for the Linux kernel,
and takes advantage of architectural features like x86's IBT. This
series retains arm64 support and adds x86 support. Additional "generic"
architectural support is expected soon:
https://github.com/samitolvanen/llvm-project/commits/kcfi_generic
- treewide: Remove old CFI support details
- arm64: Replace Clang CFI support with Clang KCFI support
- x86: Introduce Clang KCFI support
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmM4aAUWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkgWD/4mUgb7xewNIG/+fuipGd620Iao
K0T8q4BNxLNRltOxNc3Q0WMDCggX0qJGCeds7EdFQJQOGxWcbifM8MAS4idAGM0G
fc3Gxl1imC/oF6goCAbQgndA6jYFIWXGsv8LsRjAXRidWLFr3GFAqVqYJyokSySr
8zMQsEDuF4I1gQnOhEWdtPZbV3MQ4ZjfFzpv+33agbq6Gb72vKvDh3G6g2VXlxjt
1qnMtS+eEpbBU65cJkOi4MSLgymWbnIAeTMb0dbsV4kJ08YoTl8uz1B+weeH6GgT
WP73ZJ4nqh1kkkT9EqS9oKozNB9fObhvCokEuAjuQ7i1eCEZsbShvRc0iL7OKTGG
UfuTJa5qQ4h7Z0JS35FCSJETa+fcG0lTyEd133nLXLMZP9K2antf+A6O//fd0J1V
Jg4VN7DQmZ+UNGOzRkL6dTtQUy4PkxhniIloaClfSYXxhNirA+v//sHTnTK3z2Bl
6qceYqmFmns2Laual7+lvnZgt6egMBcmAL/MOdbU74+KIR9Xw76wxQjifktHX+WF
FEUQkUJDB5XcUyKlbvHoqobRMxvEZ8RIlC5DIkgFiPRE3TI0MqfzNSFnQ/6+lFNg
Y0AS9HYJmcj8sVzAJ7ji24WPFCXzsbFn6baJa9usDNbWyQZokYeiv7ZPNPHPDVrv
YEBP6aYko0lVSUS9qw==
=Li4D
-----END PGP SIGNATURE-----
Merge tag 'kcfi-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kcfi updates from Kees Cook:
"This replaces the prior support for Clang's standard Control Flow
Integrity (CFI) instrumentation, which has required a lot of special
conditions (e.g. LTO) and work-arounds.
The new implementation ("Kernel CFI") is specific to C, directly
designed for the Linux kernel, and takes advantage of architectural
features like x86's IBT. This series retains arm64 support and adds
x86 support.
GCC support is expected in the future[1], and additional "generic"
architectural support is expected soon[2].
Summary:
- treewide: Remove old CFI support details
- arm64: Replace Clang CFI support with Clang KCFI support
- x86: Introduce Clang KCFI support"
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107048 [1]
Link: https://github.com/samitolvanen/llvm-project/commits/kcfi_generic [2]
* tag 'kcfi-v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (22 commits)
x86: Add support for CONFIG_CFI_CLANG
x86/purgatory: Disable CFI
x86: Add types to indirectly called assembly functions
x86/tools/relocs: Ignore __kcfi_typeid_ relocations
kallsyms: Drop CONFIG_CFI_CLANG workarounds
objtool: Disable CFI warnings
objtool: Preserve special st_shndx indexes in elf_update_symbol
treewide: Drop __cficanonical
treewide: Drop WARN_ON_FUNCTION_MISMATCH
treewide: Drop function_nocfi
init: Drop __nocfi from __init
arm64: Drop unneeded __nocfi attributes
arm64: Add CFI error handling
arm64: Add types to indirect called assembly functions
psci: Fix the function type for psci_initcall_t
lkdtm: Emit an indirect call for CFI tests
cfi: Add type helper macros
cfi: Switch to -fsanitize=kcfi
cfi: Drop __CFI_ADDRESSABLE
cfi: Remove CONFIG_CFI_CLANG_SHADOW
...
Upon function exit, KMSAN marks local variables as uninitialized. Further
function calls may result in the compiler creating the stack frame where
these local variables resided. This results in frame pointers being
marked as uninitialized data, which is normally correct, because they are
not stack-allocated.
However stack unwinding functions are supposed to read and dereference the
frame pointers, in which case KMSAN might be reporting uses of
uninitialized values.
To work around that, we mark update_stack_state(), unwind_next_frame() and
show_trace_log_lvl() with __no_kmsan_checks, preventing all KMSAN reports
inside those functions and making them return initialized values.
Link: https://lkml.kernel.org/r/20220915150417.722975-40-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When instrumenting functions, KMSAN obtains the per-task state (mostly
pointers to metadata for function arguments and return values) once per
function at its beginning, using the `current` pointer.
Every time the instrumented function calls another function, this state
(`struct kmsan_context_state`) is updated with shadow/origin data of the
passed and returned values.
When `current` changes in the low-level arch code, instrumented code can
not notice that, and will still refer to the old state, possibly
corrupting it or using stale data. This may result in false positive
reports.
To deal with that, we need to apply __no_kmsan_checks to the functions
performing context switching - this will result in skipping all KMSAN
shadow checks and marking newly created values as initialized, preventing
all false positive reports in those functions. False negatives are still
possible, but we expect them to be rare and impersistent.
Link: https://lkml.kernel.org/r/20220915150417.722975-34-glider@google.com
Suggested-by: Marco Elver <elver@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Instrumenting some files with KMSAN will result in kernel being unable to
link, boot or crashing at runtime for various reasons (e.g. infinite
recursion caused by instrumentation hooks calling instrumented code
again).
Completely omit KMSAN instrumentation in the following places:
- arch/x86/boot and arch/x86/realmode/rm, as KMSAN doesn't work for i386;
- arch/x86/entry/vdso, which isn't linked with KMSAN runtime;
- three files in arch/x86/kernel - boot problems;
- arch/x86/mm/cpu_entry_area.c - recursion.
Link: https://lkml.kernel.org/r/20220915150417.722975-33-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cause segfaults when lscpu accesses their representation in sysfs
- Fix for a race in the alternatives batch patching machinery when
kprobes are set
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmM5ZDkACgkQEsHwGGHe
VUrkaA//dXhnPu2AM9x/v7JMZw0BO2peKMNCmO7b6z4+xIXlxNGNYeO766ZqpjSd
eFJj5Hv9ESOZw4UG5cvPA1Vj14nSa6/03Lo9JBFthl2KLOZEgVrD+GNQEJMqxPi/
9s1+764NXYi8iILHj7N4epQmz+oIbCUlnHLWZRkmG5ys40cPPI/d5li/rKBK8yIQ
W89f+WgbqCmpn9Ha8PFYy5uuLxQJnN/McDVZyW2d4MSxJ/FukRl4x1agrfnJq1fb
xz9Y/ZpVRPQCc4fJbQcTTffyFyg42AAqC0O0jJ5ZsOJDjZoQS7WvkcKYO33FiwKv
/wo61B+7SxbNMcZYhQGP8BxaBeSPlXmMKaifW+xZDS6RN4zfCq/M1+ziVB45GdUq
S5hN699vhImciXM5t18wPw6mrpoBBkQYBv+xKkC9ykUw2vxEZ32DeFzwxrybdcGC
hWKZJAVTQpvzr1FlrUAbBtQnhUTxSAB6EAdTtIuHQ+ts+OcraR8JNe59GCsEdCVI
as+mfqMKB8lwoSyDwomkeMcx5yL9XYy+STLPsPTHLrYFjqwTBOZgWRGrVZzt0EBo
0z12tqxpaFc7RI48Vi0qifkeX2Fi63HSBI/Ba+i11a2jM6NT2d2EcO26rDpO6R2S
6K0N7cD3o0wO+QK2hwxBgGnX8e2aRUE8tjYmW40aclfxl4nh/08=
=MiiB
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add the respective UP last level cache mask accessors in order not to
cause segfaults when lscpu accesses their representation in sysfs
- Fix for a race in the alternatives batch patching machinery when
kprobes are set
* tag 'x86_urgent_for_v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cacheinfo: Add a cpu_llc_shared_mask() UP variant
x86/alternative: Fix race in try_get_desc()
The objects placed at the head of vmlinux need special treatments:
- arch/$(SRCARCH)/Makefile adds them to head-y in order to place
them before other archives in the linker command line.
- arch/$(SRCARCH)/kernel/Makefile adds them to extra-y instead of
obj-y to avoid them going into built-in.a.
This commit gets rid of the latter.
Create vmlinux.a to collect all the objects that are unconditionally
linked to vmlinux. The objects listed in head-y are moved to the head
of vmlinux.a by using 'ar m'.
With this, arch/$(SRCARCH)/kernel/Makefile can consistently use obj-y
for builtin objects.
There is no *.o that is directly linked to vmlinux. Drop unneeded code
in scripts/clang-tools/gen_compile_commands.py.
$(AR) mPi needs 'T' to workaround the llvm-ar bug. The fix was suggested
by Nathan Chancellor [1].
[1]: https://lore.kernel.org/llvm/YyjjT5gQ2hGMH0ni@dev-arch.thelio-3990X/
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
I encountered some occasional crashes of poke_int3_handler() when
kprobes are set, while accessing desc->vec.
The text poke mechanism claims to have an RCU-like behavior, but it
does not appear that there is any quiescent state to ensure that
nobody holds reference to desc. As a result, the following race
appears to be possible, which can lead to memory corruption.
CPU0 CPU1
---- ----
text_poke_bp_batch()
-> smp_store_release(&bp_desc, &desc)
[ notice that desc is on
the stack ]
poke_int3_handler()
[ int3 might be kprobe's
so sync events are do not
help ]
-> try_get_desc(descp=&bp_desc)
desc = __READ_ONCE(bp_desc)
if (!desc) [false, success]
WRITE_ONCE(bp_desc, NULL);
atomic_dec_and_test(&desc.refs)
[ success, desc space on the stack
is being reused and might have
non-zero value. ]
arch_atomic_inc_not_zero(&desc->refs)
[ might succeed since desc points to
stack memory that was freed and might
be reused. ]
Fix this issue with small backportable patch. Instead of trying to
make RCU-like behavior for bp_desc, just eliminate the unnecessary
level of indirection of bp_desc, and hold the whole descriptor as a
global. Anyhow, there is only a single descriptor at any given
moment.
Fixes: 1f676247f3 ("x86/alternatives: Implement a better poke_int3_handler() completion scheme")
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@kernel.org
Link: https://lkml.kernel.org/r/20220920224743.3089-1-namit@vmware.com
An unused macro reported by [-Wunused-macros].
This macro is used to access the sp in pt_regs because at that time
x86_32 can only get sp by kernel_stack_pointer(regs).
'3c88c692c287 ("x86/stackframe/32: Provide consistent pt_regs")'
This commit have unified the pt_regs and from them we can get sp from
pt_regs with regs->sp easily. Nowhere is using this macro anymore.
Refrencing pt_regs directly is more clear. Remove this macro for
code cleaning.
Link: https://lkml.kernel.org/r/20220924072629.104759-1-chenzhongjin@huawei.com
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Remove the RB tree and start using the maple tree for vm_area_struct
tracking.
Drop validate_mm() calls in expand_upwards() and expand_downwards() as the
lock is not held.
Link: https://lkml.kernel.org/r/20220906194824.2110408-18-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Start tracking the VMAs with the new maple tree structure in parallel with
the rb_tree. Add debug and trace events for maple tree operations and
duplicate the rb_tree that is created on forks into the maple tree.
The maple tree is added to the mm_struct including the mm_init struct,
added support in required mm/mmap functions, added tracking in kernel/fork
for process forking, and used to find the unmapped_area and checked
against what the rbtree finds.
This also moves the mmap_lock() in exit_mmap() since the oom reaper call
does walk the VMAs. Otherwise lockdep will be unhappy if oom happens.
When splitting a vma fails due to allocations of the maple tree nodes,
the error path in __split_vma() calls new->vm_ops->close(new). The page
accounting for hugetlb is actually in the close() operation, so it
accounts for the removal of 1/2 of the VMA which was not adjusted. This
results in a negative exit value. To avoid the negative charge, set
vm_start = vm_end and vm_pgoff = 0.
There is also a potential accounting issue in special mappings from
insert_vm_struct() failing to allocate, so reverse the charge there in
the failure scenario.
Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cpu idle hardware workaround.
* A new Intel model number. Folks like these upstream as soon as
possible so that each developer doing feature development doesn't
need to carry their own #define.
* SGX fixes for a userspace crash and a rare kernel warning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmMx6tAACgkQaDWVMHDJ
krClIQ//fSv5oE6XpRCGx9FuiTz6m1s6zebSyY1m1wyQ8j7InoBbgJnKc1GfBNvT
+RCudOkHI5mqLsB7S5FcitFESH/TxrUQ3LlIXaMTySvf3OqaBe6oOFpBBoDD6Nal
gzCoPfZ6dOLl7D6YjiYkSL3rWP3wMhsIm2I8dVwDvxD7iw9oRuTzON+DEFR/+b2L
RTPTSGbGEHLlEXVc5S3+KYAGDTVVxo5XifLauFVWCa3bWCi6Wq78aJQnyVmvoCu9
iHs3hb7TOzSL4hS3nFHBL8wd1QXNfg2e7/gxl+AVhiTAyoQL5atpa6NnL5MHehGE
+HVJtrskFs9GjakGJmCHlh5tJy7NeiHcggdrL+EtqUif4qOehhKytIPw99Vmq8Po
B7nxMMueZQJZfsnkLttYxMTBbPv4oYAzn3uCzdODDjbUQrPkJv//pcW7cWhwGtda
GIspz1jBF+CFMygke7/xNfhEiwxIcu8nZ7HywUhWbcoGv+N3IpAgeMHlYkAIqgXA
Qhluo5o09LaTFmIS6j1Ba+tEXzTPdQdQBpBQDC3u4A5U8KOSsXA9b1OA1pPowF1k
ur4PbJe5eq2LvXofmISorCAH9qw2lpJk3n+rWojU6Rml+SI4flrGWuiRPeqhJP2B
RuiVSjx9tS9ohKIo/tZOo7varj7Ct+W2ZO/M40hp3cB94sFGp5s=
=ULl1
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Dave Hansen:
- A performance fix for recent large AMD systems that avoids an ancient
cpu idle hardware workaround
- A new Intel model number. Folks like these upstream as soon as
possible so that each developer doing feature development doesn't
need to carry their own #define
- SGX fixes for a userspace crash and a rare kernel warning
* tag 'x86_urgent_for_v6.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ACPI: processor idle: Practically limit "Dummy wait" workaround to old Intel systems
x86/sgx: Handle VA page allocation failure for EAUG on PF.
x86/sgx: Do not fail on incomplete sanitization on premature stop of ksgxd
x86/cpu: Add CPU model numbers for Meteor Lake
Include the header containing the prototype of init_ia32_feat_ctl(),
solving the following warning:
$ make W=1 arch/x86/kernel/cpu/feat_ctl.o
arch/x86/kernel/cpu/feat_ctl.c:112:6: warning: no previous prototype for ‘init_ia32_feat_ctl’ [-Wmissing-prototypes]
112 | void init_ia32_feat_ctl(struct cpuinfo_x86 *c)
This warning appeared after commit
5d5103595e ("x86/cpu: Reinitialize IA32_FEAT_CTL MSR on BSP during wakeup")
had moved the function init_ia32_feat_ctl()'s prototype from
arch/x86/kernel/cpu/cpu.h to arch/x86/include/asm/cpu.h.
Note that, before the commit mentioned above, the header include "cpu.h"
(arch/x86/kernel/cpu/cpu.h) was added by commit
0e79ad863d ("x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl()")
solely to fix init_ia32_feat_ctl()'s missing prototype. So, the header
include "cpu.h" is no longer necessary.
[ bp: Massage commit message. ]
Fixes: 5d5103595e ("x86/cpu: Reinitialize IA32_FEAT_CTL MSR on BSP during wakeup")
Signed-off-by: Luciano Leão <lucianorsleao@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Nícolas F. R. A. Prado <n@nfraprado.net>
Link: https://lore.kernel.org/r/20220922200053.1357470-1-lucianorsleao@gmail.com
resctrl_arch_rmid_read() returns a value in chunks, as read from the
hardware. This needs scaling to bytes by mon_scale, as provided by
the architecture code.
Now that resctrl_arch_rmid_read() performs the overflow and corrections
itself, it may as well return a value in bytes directly. This allows
the accesses to the architecture specific 'hw' structure to be removed.
Move the mon_scale conversion into resctrl_arch_rmid_read().
mbm_bw_count() is updated to calculate bandwidth from bytes.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-22-james.morse@arm.com
resctrl_rmid_realloc_threshold can be set by user-space. The maximum
value is specified by the architecture.
Currently max_threshold_occ_write() reads the maximum value from
boot_cpu_data.x86_cache_size, which is not portable to another
architecture.
Add resctrl_rmid_realloc_limit to describe the maximum size in bytes
that user-space can set the threshold to.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-21-james.morse@arm.com
resctrl_cqm_threshold is stored in a hardware specific chunk size,
but exposed to user-space as bytes.
This means the filesystem parts of resctrl need to know how the hardware
counts, to convert the user provided byte value to chunks. The interface
between the architecture's resctrl code and the filesystem ought to
treat everything as bytes.
Change the unit of resctrl_cqm_threshold to bytes. resctrl_arch_rmid_read()
still returns its value in chunks, so this needs converting to bytes.
As all the users have been touched, rename the variable to
resctrl_rmid_realloc_threshold, which describes what the value is for.
Neither r->num_rmid nor hw_res->mon_scale are guaranteed to be a power
of 2, so the existing code introduces a rounding error from resctrl's
theoretical fraction of the cache usage. This behaviour is kept as it
ensures the user visible value matches the value read from hardware
when the rmid will be reallocated.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-20-james.morse@arm.com
resctrl_arch_rmid_read() is intended as the function that an
architecture agnostic resctrl filesystem driver can use to
read a value in bytes from a counter. Currently the function returns
the MBM values in chunks directly from hardware. When reading a bandwidth
counter, get_corrected_mbm_count() must be used to correct the
value read.
get_corrected_mbm_count() is architecture specific, this work should be
done in resctrl_arch_rmid_read().
Move the function calls. This allows the resctrl filesystems's chunks
value to be removed in favour of the architecture private version.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-19-james.morse@arm.com
resctrl_arch_rmid_read() is intended as the function that an
architecture agnostic resctrl filesystem driver can use to
read a value in bytes from a counter. Currently the function returns
the MBM values in chunks directly from hardware. When reading a bandwidth
counter, mbm_overflow_count() must be used to correct for any possible
overflow.
mbm_overflow_count() is architecture specific, its behaviour should
be part of resctrl_arch_rmid_read().
Move the mbm_overflow_count() calls into resctrl_arch_rmid_read().
This allows the resctrl filesystems's prev_msr to be removed in
favour of the architecture private version.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-18-james.morse@arm.com
resctrl_arch_rmid_read() is intended as the function that an
architecture agnostic resctrl filesystem driver can use to
read a value in bytes from a hardware register. Currently the function
returns the MBM values in chunks directly from hardware.
To convert this to bytes, some correction and overflow calculations
are needed. These depend on the resource and domain structures.
Overflow detection requires the old chunks value. None of this
is available to resctrl_arch_rmid_read(). MPAM requires the
resource and domain structures to find the MMIO device that holds
the registers.
Pass the resource and domain to resctrl_arch_rmid_read(). This makes
rmid_dirty() too big. Instead merge it with its only caller, and the
name is kept as a local variable.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-17-james.morse@arm.com
__rmid_read() selects the specified eventid and returns the counter
value from the MSR. The error handling is architecture specific, and
handled by the callers, rdtgroup_mondata_show() and __mon_event_count().
Error handling should be handled by architecture specific code, as
a different architecture may have different requirements. MPAM's
counters can report that they are 'not ready', requiring a second
read after a short delay. This should be hidden from resctrl.
Make __rmid_read() the architecture specific function for reading
a counter. Rename it resctrl_arch_rmid_read() and move the error
handling into it.
A read from a counter that hardware supports but resctrl does not
now returns -EINVAL instead of -EIO from the default case in
__mon_event_count(). It isn't possible for user-space to see this
change as resctrl doesn't expose counters it doesn't support.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-16-james.morse@arm.com
In preparation for reducing the use of ksize(), record the actual
allocation size for later memcpy(). This avoids copying extra
(uninitialized!) bytes into the patch buffer when the requested
allocation size isn't exactly the size of a kmalloc bucket.
Additionally, fix potential future issues where runtime bounds checking
will notice that the buffer was allocated to a smaller value than
returned by ksize().
Fixes: 757885e94a ("x86, microcode, amd: Early microcode patch loading support for AMD")
Suggested-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/lkml/CA+DvKQ+bp7Y7gmaVhacjv9uF6Ar-o4tet872h4Q8RPYPJjcJQA@mail.gmail.com/
To abstract the rmid counters into a helper that returns the number
of bytes counted, architecture specific per-rmid state is needed.
It needs to be possible to reset this hidden state, as the values
may outlive the life of an rmid, or the mount time of the filesystem.
mon_event_read() is called with first = true when an rmid is first
allocated in mkdir_mondata_subdir(). Add resctrl_arch_reset_rmid()
and call it from __mon_event_count()'s rr->first check.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-15-james.morse@arm.com
A renamed __rmid_read() is intended as the function that an
architecture agnostic resctrl filesystem driver can use to
read a value in bytes from a counter. Currently the function returns
the MBM values in chunks directly from hardware. For bandwidth
counters the resctrl filesystem uses this to calculate the number of
bytes ever seen.
MPAM's scaling of counters can be changed at runtime, reducing the
resolution but increasing the range. When this is changed the prev_msr
values need to be converted by the architecture code.
Add an array for per-rmid private storage. The prev_msr and chunks
values will move here to allow resctrl_arch_rmid_read() to always
return the number of bytes read by this counter without assistance
from the filesystem. The values are moved in later patches when
the overflow and correction calls are moved into __rmid_read().
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-14-james.morse@arm.com
mbm_bw_count() is only called by the mbm_handle_overflow() worker once a
second. It reads the hardware register, calculates the bandwidth and
updates m->prev_bw_msr which is used to hold the previous hardware register
value.
Operating directly on hardware register values makes it difficult to make
this code architecture independent, so that it can be moved to /fs/,
making the mba_sc feature something resctrl supports with no additional
support from the architecture.
Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware
register using __mon_event_count().
Change mbm_bw_count() to use the current chunks value most recently saved
by __mon_event_count(). This removes an extra call to __rmid_read().
Instead of using m->prev_msr to calculate the number of chunks seen,
use the rr->val that was updated by __mon_event_count(). This removes an
extra call to mbm_overflow_count() and get_corrected_mbm_count().
Calculating bandwidth like this means mbm_bw_count() no longer operates
on hardware register values directly.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-13-james.morse@arm.com
update_mba_bw() calculates a new control value for the MBA resource
based on the user provided mbps_val and the current measured
bandwidth. Some control values need remapping by delay_bw_map().
It does this by calling wrmsrl() directly. This needs splitting
up to be done by an architecture specific helper, so that the
remainder can eventually be moved to /fs/.
Add resctrl_arch_update_one() to apply one configuration value
to the provided resource and domain. This avoids the staging
and cross-calling that is only needed with changes made by
user-space. delay_bw_map() moves to be part of the arch code,
to maintain the 'percentage control' view of MBA resources
in resctrl.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-12-james.morse@arm.com
The resctrl arch code provides a second configuration array mbps_val[]
for the MBA software controller.
Since resctrl switched over to allocating and freeing its own array
when needed, nothing uses the arch code version.
Remove it.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-11-james.morse@arm.com
Updates to resctrl's software controller follow the same path as
other configuration updates, but they don't modify the hardware state.
rdtgroup_schemata_write() uses parse_line() and the resource's
parse_ctrlval() function to stage the configuration.
resctrl_arch_update_domains() then updates the mbps_val[] array
instead, and resctrl_arch_update_domains() skips the rdt_ctrl_update()
call that would update hardware.
This complicates the interface between resctrl's filesystem parts
and architecture specific code. It should be possible for mba_sc
to be completely implemented by the filesystem parts of resctrl. This
would allow it to work on a second architecture with no additional code.
resctrl_arch_update_domains() using the mbps_val[] array prevents this.
Change parse_bw() to write the configuration value directly to the
mbps_val[] array in the domain structure. Change rdtgroup_schemata_write()
to skip the call to resctrl_arch_update_domains(), meaning all the
mba_sc specific code in resctrl_arch_update_domains() can be removed.
On the read-side, show_doms() and update_mba_bw() are changed to read
the mbps_val[] array from the domain structure. With this,
resctrl_arch_get_config() no longer needs to consider mba_sc resources.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-10-james.morse@arm.com
To support resctrl's MBA software controller, the architecture must provide
a second configuration array to hold the mbps_val[] from user-space.
This complicates the interface between the architecture specific code and
the filesystem portions of resctrl that will move to /fs/, to allow
multiple architectures to support resctrl.
Make the filesystem parts of resctrl create an array for the mba_sc
values. The software controller can be changed to use this, allowing
the architecture code to only consider the values configured in hardware.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-9-james.morse@arm.com
To determine whether the mba_MBps option to resctrl should be supported,
resctrl tests the boot CPUs' x86_vendor.
This isn't portable, and needs abstracting behind a helper so this check
can be part of the filesystem code that moves to /fs/.
Re-use the tests set_mba_sc() does to determine if the mba_sc is supported
on this system. An 'alloc_capable' test is added so that support for the
controls isn't implied by the 'delay_linear' property, which is always
true for MPAM. Because mbm_update() only update mba_sc if the mbm_local
counters are enabled, supports_mba_mbps() checks is_mbm_local_enabled().
(instead of using is_mbm_enabled(), which checks both).
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-8-james.morse@arm.com
set_mba_sc() enables the 'software controller' to regulate the bandwidth
based on the byte counters. This can be managed entirely in the parts
of resctrl that move to /fs/, without any extra support from the
architecture specific code. set_mba_sc() is called by rdt_enable_ctx()
during mount and unmount. It currently resets the arch code's ctrl_val[]
and mbps_val[] arrays.
The ctrl_val[] was already reset when the domain was created, and by
reset_all_ctrls() when the filesystem was last unmounted. Doing the work
in set_mba_sc() is not necessary as the values are already at their
defaults due to the creation of the domain, or were previously reset
during umount(), or are about to reset during umount().
Add a reset of the mbps_val[] in reset_all_ctrls(), allowing the code in
set_mba_sc() that reaches in to the architecture specific structures to
be removed.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-7-james.morse@arm.com
Because domains are exposed to user-space via resctrl, the filesystem
must update its state when CPU hotplug callbacks are triggered.
Some of this work is common to any architecture that would support
resctrl, but the work is tied up with the architecture code to
free the memory.
Move the monitor subdir removal and the cancelling of the mbm/limbo
works into a new resctrl_offline_domain() call. These bits are not
specific to the architecture. Grouping them in one function allows
that code to be moved to /fs/ and re-used by another architecture.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-6-james.morse@arm.com
domain_add_cpu() and domain_remove_cpu() need to kfree() the child
arrays that were allocated by domain_setup_ctrlval().
As this memory is moved around, and new arrays are created, adjusting
the error handling cleanup code becomes noisier.
To simplify this, move all the kfree() calls into a domain_free() helper.
This depends on struct rdt_hw_domain being kzalloc()d, allowing it to
unconditionally kfree() all the child arrays.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-5-james.morse@arm.com
Because domains are exposed to user-space via resctrl, the filesystem
must update its state when CPU hotplug callbacks are triggered.
Some of this work is common to any architecture that would support
resctrl, but the work is tied up with the architecture code to
allocate the memory.
Move domain_setup_mon_state(), the monitor subdir creation call and the
mbm/limbo workers into a new resctrl_online_domain() call. These bits
are not specific to the architecture. Grouping them in one function
allows that code to be moved to /fs/ and re-used by another architecture.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-4-james.morse@arm.com
mon_enabled and mon_capable are always set as a pair by
rdt_get_mon_l3_config().
There is no point having two values.
Merge them together.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-3-james.morse@arm.com
rdt_resources_all[] used to have extra entries for L2CODE/L2DATA.
These were hidden from resctrl by the alloc_enabled value.
Now that the L2/L2CODE/L2DATA resources have been merged together,
alloc_enabled doesn't mean anything, it always has the same value as
alloc_capable which indicates allocation is supported by this resource.
Remove alloc_enabled and its helpers.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <quic_jiles@quicinc.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xin Hao <xhao@linux.alibaba.com>
Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Tested-by: Cristian Marussi <cristian.marussi@arm.com>
Link: https://lore.kernel.org/r/20220902154829.30399-2-james.morse@arm.com
Commit
238c91115c ("x86/dumpstack: Fix misleading instruction pointer error message")
changed the "Code:" line in bug reports when RIP is an invalid pointer.
In particular, the report currently says (for example):
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:0x0
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
That
Unable to access opcode bytes at RIP 0xffffffffffffffd6.
is quite confusing as RIP value is 0, not -42. That -42 comes from
"regs->ip - PROLOGUE_SIZE", because Code is dumped with some prologue
(and epilogue).
So do not mention "RIP" on this line in this context.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/b772c39f-c5ae-8f17-fe6e-6a2bc4d1f83b@kernel.org
In preparation to support compile-time nr_cpu_ids, add a setter for
the variable.
This is a no-op for all arches.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Both AMD and Intel recommend using INT3 after an indirect JMP. Make sure
to emit one when rewriting the retpoline JMP irrespective of compiler
SLS options or even CONFIG_SLS.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Link: https://lkml.kernel.org/r/Yxm+QkFPOhrVSH6q@hirez.programming.kicks-ass.net
VM_FAULT_NOPAGE is expected behaviour for -EBUSY failure path, when
augmenting a page, as this means that the reclaimer thread has been
triggered, and the intention is just to round-trip in ring-3, and
retry with a new page fault.
Fixes: 5a90d2c3f5 ("x86/sgx: Support adding of pages to an initialized enclave")
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220906000221.34286-3-jarkko@kernel.org
Unsanitized pages trigger WARN_ON() unconditionally, which can panic the
whole computer, if /proc/sys/kernel/panic_on_warn is set.
In sgx_init(), if misc_register() fails or misc_register() succeeds but
neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be
prematurely stopped. This may leave unsanitized pages, which will result a
false warning.
Refine __sgx_sanitize_pages() to return:
1. Zero when the sanitization process is complete or ksgxd has been
requested to stop.
2. The number of unsanitized pages otherwise.
Fixes: 51ab30eb2a ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list")
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u
Link: https://lkml.kernel.org/r/20220906000221.34286-2-jarkko@kernel.org
Remove the CONFIG_PREEMPT_RT symbol from the ifdef around
do_softirq_own_stack() and move it to Kconfig instead.
Enable softirq stacks based on SOFTIRQ_ON_OWN_STACK which depends on
HAVE_SOFTIRQ_ON_OWN_STACK and its default value is set to !PREEMPT_RT.
This ensures that softirq stacks are not used on PREEMPT_RT and avoids
a 'select' statement on an option which has a 'depends' statement.
Link: https://lore.kernel.org/YvN5E%2FPrHfUhggr7@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Print both old and new versions of microcode after a reload is complete
because knowing the previous microcode version is sometimes important
from a debugging perspective.
[ bp: Massage commit message. ]
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20220829181030.722891-1-ashok.raj@intel.com
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The APIC supports two modes, legacy APIC (or xAPIC), and Extended APIC
(or x2APIC). X2APIC mode is mostly compatible with legacy APIC, but
it disables the memory-mapped APIC interface in favor of one that uses
MSRs. The APIC mode is controlled by the EXT bit in the APIC MSR.
The MMIO/xAPIC interface has some problems, most notably the APIC LEAK
[1]. This bug allows an attacker to use the APIC MMIO interface to
extract data from the SGX enclave.
Introduce support for a new feature that will allow the BIOS to lock
the APIC in x2APIC mode. If the APIC is locked in x2APIC mode and the
kernel tries to disable the APIC or revert to legacy APIC mode a GP
fault will occur.
Introduce support for a new MSR (IA32_XAPIC_DISABLE_STATUS) and handle
the new locked mode when the LEGACY_XAPIC_DISABLED bit is set by
preventing the kernel from trying to disable the x2APIC.
On platforms with the IA32_XAPIC_DISABLE_STATUS MSR, if SGX or TDX are
enabled the LEGACY_XAPIC_DISABLED will be set by the BIOS. If
legacy APIC is required, then it SGX and TDX need to be disabled in the
BIOS.
[1]: https://aepicleak.com/aepicleak.pdf
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Link: https://lkml.kernel.org/r/20220816231943.1152579-1-daniel.sneddon@linux.intel.com
The current pseudo_lock.c code overwrites the value of the
MSR_MISC_FEATURE_CONTROL to 0 even if the original value is not 0.
Therefore, modify it to save and restore the original values.
Fixes: 018961ae55 ("x86/intel_rdt: Pseudo-lock region creation/removal core")
Fixes: 443810fe61 ("x86/intel_rdt: Create debugfs files for pseudo-locking testing")
Fixes: 8a2fc0e1bc ("x86/intel_rdt: More precise L2 hit/miss measurements")
Signed-off-by: Kohei Tarumizu <tarumizu.kohei@fujitsu.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/eb660f3c2010b79a792c573c02d01e8e841206ad.1661358182.git.reinette.chatre@intel.com
While working on a GRUB patch to support PCI-serial, a number of
cleanups were suggested that apply to the code I took inspiration from.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lkml.kernel.org/r/YwdeyCEtW+wa+QhH@worktop.programming.kicks-ass.net
When memory poison consumption machine checks fire, MCE notifier
handlers like nfit_handle_mce() record the impacted physical address
range which is reported by the hardware in the MCi_MISC MSR. The error
information includes data about blast radius, i.e. how many cachelines
did the hardware determine are impacted. A recent change
7917f9cdb5 ("acpi/nfit: rely on mce->misc to determine poison granularity")
updated nfit_handle_mce() to stop hard coding the blast radius value of
1 cacheline, and instead rely on the blast radius reported in 'struct
mce' which can be up to 4K (64 cachelines).
It turns out that apei_mce_report_mem_error() had a similar problem in
that it hard coded a blast radius of 4K rather than reading the blast
radius from the error information. Fix apei_mce_report_mem_error() to
convey the proper poison granularity.
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8502@oracle.com
Link: https://lore.kernel.org/r/20220826233851.1319100-1-jane.chu@oracle.com
- Fix PAT on Xen, which caused i915 driver failures
- Fix compat INT 80 entry crash on Xen PV guests
- Fix 'MMIO Stale Data' mitigation status reporting on older Intel CPUs
- Fix RSB stuffing regressions
- Fix ORC unwinding on ftrace trampolines
- Add Intel Raptor Lake CPU model number
- Fix (work around) a SEV-SNP bootloader bug providing bogus values in
boot_params->cc_blob_address, by ignoring the value on !SEV-SNP bootups.
- Fix SEV-SNP early boot failure
- Fix the objtool list of noreturn functions and annotate snp_abort(),
which bug confused objtool on gcc-12.
- Fix the documentation for retbleed
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmMLgUERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hU1hAAj9bKO+D07gROpkRhXpLvbqm+mUqk6It8
qEuyLkC/xvD9N//Pya0ZPPE+l23EHQxM/L0tXCYwsc8Zlx3fr678FQktHZuxULaP
qlGmwub8PvsyMesXBGtgHJ4RT8L07FelBR+E/TpqCSbv/geM7mcc0ojyOKo+OmbD
vbsUTDNqsMYndzePUBrv/vJRqVLGlbd1a22DKE/YiYBbZu5hughNUB0bHYhYdsml
mutcRsVTKRbZOi4wiuY/pnveMf/z9wAwKOWd9EBaDWILygVSHkBJJQyhHigBh3F8
H1hFn1q4ybULdrAUT4YND9Tjb6U8laU0fxg8Z43bay7bFXXElqoj3W2qka9NjAim
QvWSFvTYQRTIWt6sCshVsUBWj6fBZdHxcJ8Fh+ucJJp+JWl9/aD0A41vK2Jx0LIt
p8YzBRKEqd1Q/7QD855BArN7HNtQGgShNt4oPVZ7nPnjuQt0+lngu8NR+6X3RKpX
r8rordyPvzgPL6W+1uV+8hnz1w+YU2xplAbK+zTijwgJVgyf8khSlZQNpFKULrou
zjtzo/2nB+4C4bvfetNnaOGhi1/AdCHZHyZE35rotpd73SLHvdOrH0Ll9oCVfVrC
UWbC1E67cHQw97Ni/4CrCsJRBULK01uyszVCxlEkSYInf0UsKlnmy+TqxizwsCVy
reYQ0ePyWg0=
=ZpCJ
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix PAT on Xen, which caused i915 driver failures
- Fix compat INT 80 entry crash on Xen PV guests
- Fix 'MMIO Stale Data' mitigation status reporting on older Intel CPUs
- Fix RSB stuffing regressions
- Fix ORC unwinding on ftrace trampolines
- Add Intel Raptor Lake CPU model number
- Fix (work around) a SEV-SNP bootloader bug providing bogus values in
boot_params->cc_blob_address, by ignoring the value on !SEV-SNP
bootups.
- Fix SEV-SNP early boot failure
- Fix the objtool list of noreturn functions and annotate snp_abort(),
which bug confused objtool on gcc-12.
- Fix the documentation for retbleed
* tag 'x86-urgent-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/ABI: Mention retbleed vulnerability info file for sysfs
x86/sev: Mark snp_abort() noreturn
x86/sev: Don't use cc_platform_has() for early SEV-SNP calls
x86/boot: Don't propagate uninitialized boot_params->cc_blob_address
x86/cpu: Add new Raptor Lake CPU model number
x86/unwind/orc: Unwind ftrace trampolines with correct ORC entry
x86/nospec: Fix i386 RSB stuffing
x86/nospec: Unwreck the RSB stuffing
x86/bugs: Add "unknown" reporting for MMIO Stale Data
x86/entry: Fix entry_INT80_compat for Xen PV guests
x86/PAT: Have pat_enabled() properly reflect state when running on Xen
CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new performance
monitoring features for AMD processors.
Bit 1 of EAX indicates support for Last Branch Record Extension Version 2
(LbrExtV2) features. If found to be set during PMU initialization, the EBX
bits of the same leaf can be used to determine the number of available LBR
entries.
For better utilization of feature words, LbrExtV2 is added as a scattered
feature bit.
[peterz: Rename to AMD_LBR_V2]
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/172d2b0df39306ed77221c45ee1aa62e8ae0548d.1660211399.git.sandipan.das@amd.com
181b6f40e9 ("x86/microcode: Rip out the OLD_INTERFACE")
removed the old microcode loading interface but forgot to remove the
related ->request_microcode_user() functionality which it uses.
Rip it out now too.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220825075445.28171-1-bp@alien8.de
Mark both the function prototype and definition as noreturn in order to
prevent the compiler from doing transformations which confuse objtool
like so:
vmlinux.o: warning: objtool: sme_enable+0x71: unreachable instruction
This triggers with gcc-12.
Add it and sev_es_terminate() to the objtool noreturn tracking array
too. Sort it while at it.
Suggested-by: Michael Matz <matz@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220824152420.20547-1-bp@alien8.de
When running identity-mapped and depending on the kernel configuration,
it is possible that the compiler uses jump tables when generating code
for cc_platform_has().
This causes a boot failure because the jump table uses un-mapped kernel
virtual addresses, not identity-mapped addresses. This has been seen
with CONFIG_RETPOLINE=n.
Similar to sme_encrypt_kernel(), use an open-coded direct check for the
status of SNP rather than trying to eliminate the jump table. This
preserves any code optimization in cc_platform_has() that can be useful
post boot. It also limits the changes to SEV-specific files so that
future compiler features won't necessarily require possible build changes
just because they are not compatible with running identity-mapped.
[ bp: Massage commit message. ]
Fixes: 5e5ccff60a ("x86/sev: Add helper for validating pages in early enc attribute changes")
Reported-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 5.19.x
Link: https://lore.kernel.org/all/YqfabnTRxFSM+LoX@google.com/
installed at such instructions, possibly resulting in
incorrect execution (the wrong branch taken).
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmMCmzARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jt1RAAwd8PJDg9gkWMq+96B1RM5nAumTnbkngc
4ylQr7+8bhFj4O+uKuBYC0S4D2ox+LeEPAD7x4+pFSLahB78HTytAtixKtKe1l0B
h+AZBm35PfkUlHw52B/aCUW6DWKZOa6KrLQa8L1MqIGK8oiMiL+Q6+xXFQ8a6Gtt
ZYR2yi1esM2tHzw6CdMasRYswGO1KnZlCzGB5lHyDdU/y7xTfOgCTd00K07augXA
mLQOBqTZ9dWDaT9mdc+Uh9llHOOhhfzMDYlWxSpAUsdk6vCWiUz6Y+ybsYRlebEh
yEjRzUkvchraGj0L04m35ulqvoYTSCjc4aP0JvNB7mZoIyicajngfoKEc8DJ0BMf
dgZj3CoKuTrKbbp//hfK5rt5J0m0ueiixOFI4EM0kJVUAOG1zzWAbxEQTrWnJlZh
/8l/4n4zUY3Ss9ecG9PzNs++YH8I+gDbQd2kGr6YMffqUW9HOb5j2yk3mXs8gH0n
O8HBGP7I94w45Q8L1CDSD/Sdn+IrhtaGzpYTV7rCEiM13S4FwrnzpVoi3SJiVODM
kAxM4HScWXygTQTqvz4I5T9ibOVGp8iRQRHfLRbNiX+xB8k8DSQWAE1vxsw2v5o+
L8/N6mt71u/j5DyTzybTp1kepum+UjVH2RdH3v9ma3BQODkM+kbWq3cTBD6qOUh9
HH8KnagQcgc=
=eXIr
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2022-08-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 kprobes fix from Ingo Molnar:
"Fix a kprobes bug in JNG/JNLE emulation when a kprobe is installed at
such instructions, possibly resulting in incorrect execution (the
wrong branch taken)"
* tag 'perf-urgent-2022-08-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kprobes: Fix JNG/JNLE emulation
When meeting ftrace trampolines in ORC unwinding, unwinder uses address
of ftrace_{regs_}call address to find the ORC entry, which gets next frame at
sp+176.
If there is an IRQ hitting at sub $0xa8,%rsp, the next frame should be
sp+8 instead of 176. It makes unwinder skip correct frame and throw
warnings such as "wrong direction" or "can't access registers", etc,
depending on the content of the incorrect frame address.
By adding the base address ftrace_{regs_}caller with the offset
*ip - ops->trampoline*, we can get the correct address to find the ORC entry.
Also change "caller" to "tramp_addr" to make variable name conform to
its content.
[ mingo: Clarified the changelog a bit. ]
Fixes: 6be7fa3c74 ("ftrace, orc, x86: Handle ftrace dynamically allocated trampolines")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220819084334.244016-1-chenzhongjin@huawei.com
Older Intel CPUs that are not in the affected processor list for MMIO
Stale Data vulnerabilities currently report "Not affected" in sysfs,
which may not be correct. Vulnerability status for these older CPUs is
unknown.
Add known-not-affected CPUs to the whitelist. Report "unknown"
mitigation status for CPUs that are not in blacklist, whitelist and also
don't enumerate MSR ARCH_CAPABILITIES bits that reflect hardware
immunity to MMIO Stale Data vulnerabilities.
Mitigation is not deployed when the status is unknown.
[ bp: Massage, fixup. ]
Fixes: 8d50cdf8b8 ("x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data")
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/a932c154772f2121794a5f2eded1a11013114711.1657846269.git.pawan.kumar.gupta@linux.intel.com
[ mingo: Consolidated 4 very similar patches into one, it's silly to spread this out. ]
Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220715044809.20572-1-wangborong@cdjrlc.com
Modify the comments for sgx_encl_lookup_backing() and for
sgx_encl_alloc_backing() to indicate that they take a reference
which must be dropped with a call to sgx_encl_put_backing().
Make sgx_encl_lookup_backing() static for now, and change the
name of sgx_encl_get_backing() to __sgx_encl_get_backing() to
make it more clear that sgx_encl_get_backing() is an internal
function.
Signed-off-by: Kristen Carlson Accardi <kristen.c.accardi@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/all/YtUs3MKLzFg+rqEV@zn.tnic/
When kprobes emulates JNG/JNLE instructions on x86 it uses the wrong
condition. For JNG (opcode: 0F 8E), according to Intel SDM, the jump is
performed if (ZF == 1 or SF != OF). However the kernel emulation
currently uses 'and' instead of 'or'.
As a result, setting a kprobe on JNG/JNLE might cause the kernel to
behave incorrectly whenever the kprobe is hit.
Fix by changing the 'and' to 'or'.
Fixes: 6256e668b7 ("x86/kprobes: Use int3 instead of debug trap for single-step")
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220813225943.143767-1-namit@vmware.com
Once upon a time, before this commit in 2013:
3195ef59cb ("x86: Do full rtc synchronization with ntp")
... the mach_set_rtc_mmss() function set only the minutes and seconds
registers of the CMOS RTC - hence the '_mmss' postfix.
This is no longer true, so rename the function to mach_set_cmos_time().
[ mingo: Expanded changelog a bit. ]
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220813131034.768527-2-mat.jonczyk@o2.pl
There are functions in drivers/rtc/rtc-mc146818-lib.c that handle
reading from / writing to the CMOS RTC clock. mach_get_cmos_time() in
arch/x86/kernel/rtc.c did not use them and was mostly a duplicate of
mc146818_get_time(). Modify mach_get_cmos_time() to use
mc146818_get_time() and remove the duplicated functionality.
mach_get_cmos_time() used a different algorithm than
mc146818_get_time(), but these functions are equivalent. The major
differences are:
- mc146818_get_time() is better refined and handles various edge
conditions,
- when the UIP ("Update in progress") bit of the RTC is set,
mach_get_cmos_time() was busy waiting with cpu_relax() while
mc146818_get_time() is using mdelay(1) in every loop iteration.
(However, there is my commit merged for Linux 5.20 / 6.0 to decrease
this period to 100us:
commit d2a632a8a1 ("rtc: mc146818-lib: reduce RTC_UIP polling period")
),
- mach_get_cmos_time() assumed that the RTC year is >= 2000, which
may not be true on some old boxes with a dead battery,
- mach_get_cmos_time() was holding the rtc_lock for a long time
and could hang if the RTC is broken or not present.
The RTC writing counterpart, mach_set_rtc_mmss() is already using
mc146818_get_time() from drivers/rtc. This was done in
commit 3195ef59cb ("x86: Do full rtc synchronization with ntp")
It appears that mach_get_cmos_time() was simply forgotten.
mach_get_cmos_time() is really used only in read_persistent_clock64(),
which is called only in a few places in kernel/time/timekeeping.c .
[ mingo: These changes are not supposed to change behavior, but they are
not identity transformations either, as mc146818_get_time() is a
better but different implementation of the same logic - so
regressions are possible in principle. ]
Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://lore.kernel.org/r/20220813131034.768527-1-mat.jonczyk@o2.pl
(not turned on by default), which also need STIBP enabled (if
available) to be '100% safe' on even the shortest speculation
windows.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmL3fqcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gnuw/6AighFp+Gp4qXP1DIVU+acVnZsxbdt7GA
WGs/JJfKYsKpWvDGFxnwtF2V1Imq8XVRPVPyFKvLQiBs2h8vNcVkgIvJsdeTFsqQ
uUwUaYgDXuhLYaFpnMGouoeA3iw2zf/CY5ZJX79Nl/CwNwT7FxiLbu+JF/I2Yc0V
yddiQ8xgT0VJhaBcUTsD2qFl8wjpxer7gNBFR4ujiYWXHag3qKyZuaySmqCz4xhd
4nyhJCp34548MsTVXDys2gnYpgLWweB9zOPvH4+GgtiFF3UJxRMhkB9NzfZq1l5W
tCjgGupb3vVoXOVb/xnXyZlPbdFNqSAja7iOXYdmNUSURd7LC0PYHpVxN0rkbFcd
V6noyU3JCCp86ceGTC0u3Iu6LLER6RBGB0gatVlzomWLjTEiC806eo23CVE22cnk
poy7FO3RWa+q1AqWsEzc3wr14ZgSKCBZwwpn6ispT/kjx9fhAFyKtH2/Sznx26GH
yKOF7pPCIXjCpcMnNoUu8cVyzfk0g3kOWQtKjaL9WfeyMtBaHhctngR0s1eCxZNJ
rBlTs+YO7fO42unZEExgvYekBzI70aThIkvxahKEWW48owWph+i/sn5gzdVF+ynR
R4PGeylfd8ZXr21cG2rG9250JLwqzhsxnAGvNjYg1p/hdyrzLTGWHIc9r9BU9000
mmOP9uY6Cjc=
=Ac6x
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"Fix the 'IBPB mitigated RETBleed' mode of operation on AMD CPUs (not
turned on by default), which also need STIBP enabled (if available) to
be '100% safe' on even the shortest speculation windows"
* tag 'x86-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Enable STIBP for IBPB mitigated RETBleed
Intel eIBRS machines do not sufficiently mitigate against RET
mispredictions when doing a VM Exit therefore an additional RSB,
one-entry stuffing is needed.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLqsGsACgkQEsHwGGHe
VUpXGg//ZEkxhf3Ri7X9PknAWNG6eIEqigKqWcdnOw+Oq/GMVb6q7JQsqowK7KBZ
AKcY5c/KkljTJNohditnfSOePyCG5nDTPgfkjzIawnaVdyJWMRCz/L4X2cv6ykDl
2l2EvQm4Ro8XAogYhE7GzDg/osaVfx93OkLCQj278VrEMWgM/dN2RZLpn+qiIkNt
DyFlQ7cr5UASh/svtKLko268oT4JwhQSbDHVFLMJ52VaLXX36yx4rValZHUKFdox
ZDyj+kiszFHYGsI94KAD0dYx76p6mHnwRc4y/HkVcO8vTacQ2b9yFYBGTiQatITf
0Nk1RIm9m3rzoJ82r/U0xSIDwbIhZlOVNm2QtCPkXqJZZFhopYsZUnq2TXhSWk4x
GQg/2dDY6gb/5MSdyLJmvrTUtzResVyb/hYL6SevOsIRnkwe35P6vDDyp15F3TYK
YvidZSfEyjtdLISBknqYRQD964dgNZu9ewrj+WuJNJr+A2fUvBzUebXjxHREsugN
jWp5GyuagEKTtneVCvjwnii+ptCm6yfzgZYLbHmmV+zhinyE9H1xiwVDvo5T7DDS
ZJCBgoioqMhp5qR59pkWz/S5SNGui2rzEHbAh4grANy8R/X5ASRv7UHT9uAo6ve1
xpw6qnE37CLzuLhj8IOdrnzWwLiq7qZ/lYN7m+mCMVlwRWobbOo=
=a8em
-----END PGP SIGNATURE-----
Merge tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 eIBRS fixes from Borislav Petkov:
"More from the CPU vulnerability nightmares front:
Intel eIBRS machines do not sufficiently mitigate against RET
mispredictions when doing a VM Exit therefore an additional RSB,
one-entry stuffing is needed"
* tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Add LFENCE to RSB fill sequence
x86/speculation: Add RSB VM Exit protections
AMD's "Technical Guidance for Mitigating Branch Type Confusion,
Rev. 1.0 2022-07-12" whitepaper, under section 6.1.2 "IBPB On
Privileged Mode Entry / SMT Safety" says:
Similar to the Jmp2Ret mitigation, if the code on the sibling thread
cannot be trusted, software should set STIBP to 1 or disable SMT to
ensure SMT safety when using this mitigation.
So, like already being done for retbleed=unret, and now also for
retbleed=ibpb, force STIBP on machines that have it, and report its SMT
vulnerability status accordingly.
[ bp: Remove the "we" and remove "[AMD]" applicability parameter which
doesn't work here. ]
Fixes: 3ebc170068 ("x86/bugs: Add retbleed=ibpb")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org # 5.10, 5.15, 5.19
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: https://lore.kernel.org/r/20220804192201.439596-1-kim.phillips@amd.com
fatfs, autofs, squashfs, procfs, etc.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYu9BeQAKCRDdBJ7gKXxA
jp1DAP4mjCSvAwYzXklrIt+Knv3CEY5oVVdS+pWOAOGiJpldTAD9E5/0NV+VmlD9
kwS/13j38guulSlXRzDLmitbg81zAAI=
=Zfum
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"Updates to various subsystems which I help look after. lib, ocfs2,
fatfs, autofs, squashfs, procfs, etc. A relatively small amount of
material this time"
* tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
scripts/gdb: ensure the absolute path is generated on initial source
MAINTAINERS: kunit: add David Gow as a maintainer of KUnit
mailmap: add linux.dev alias for Brendan Higgins
mailmap: update Kirill's email
profile: setup_profiling_timer() is moslty not implemented
ocfs2: fix a typo in a comment
ocfs2: use the bitmap API to simplify code
ocfs2: remove some useless functions
lib/mpi: fix typo 'the the' in comment
proc: add some (hopefully) insightful comments
bdi: remove enum wb_congested_state
kernel/hung_task: fix address space of proc_dohung_task_timeout_secs
lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t()
squashfs: support reading fragments in readahead call
squashfs: implement readahead
squashfs: always build "file direct" version of page actor
Revert "squashfs: provide backing_dev_info in order to disable read-ahead"
fs/ocfs2: Fix spelling typo in comment
ia64: old_rr4 added under CONFIG_HUGETLB_PAGE
proc: fix test for "vsyscall=xonly" boot option
...
- an old(er) binutils build fix,
- a new-GCC build fix,
- and a kexec boot environment fix.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuv4URHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1it3A//fGfrzHGtjHraiBy0H1Erlz0dUa4q/r6v
xPQVFYteGwL/Ynv2rOJreiEXNhv9pRv0cXXNS5iWh8IcP8IUNw6rfYmgr1aDpXdq
WkbJvwouX6JSo3g/CMekKd+Mf7NgA4O1OO65E80c4WJnxgd0AYvr6IxJRLR7X0C7
HwU6p6PmP/RHWT5T170z6sgun+6QdDEYSwFYOhxawL+BJaKEBYnQ0LLQgJazhe7z
uVxONQA9OdWBwMzvZygbOuTzc990jCHRPYgvYQhSZ8CUPuVzaa7IB9KUXh6lu93d
a7nqM3GlWTowBULY6Xq7gWJaJ7jsVWXjqo8SWVlb6YwoLR9dgGSW5bCGV0rOA6o3
yPjQhIQ9H4NOx126wPcCRBh3osGFjqlWUXVw7W51aNgd7hCvlbpWWmREeI/Pm1Ew
WBjQqpf4l0S+0On5FEFaF7swAG3b6KSNSKw7WBmpmTNt5eWOot0EtnjGW75ATpxM
+j2fj/1MIZ/Zp+wYaNK/+abM4sXHhYvU9gpPdJslRr+r2AVjy9gCZ/0zuUIVytwC
gOdV9KhqzlXPJCTm+py7fBt2qM2P5rKT2HBQYiJwIquB2njI0kjUBOJWXsGQ/F/y
hGd6WY8uDuwzzg5JtyfwE6fPGovxL5GCc4w9CYz0DbP0txPYuhMOdkHtAYLyraAj
wtdalMt3cT8=
=EM/G
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- build fix for old(er) binutils
- build fix for new GCC
- kexec boot environment fix
* tag 'x86-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
x86/numa: Use cpumask_available instead of hardcoded NULL check
x86/bus_lock: Don't assume the init value of DEBUGCTLMSR.BUS_LOCK_DETECT to be zero
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuu5MRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gFbA//ZppMR0/26/d+KqhdbVND6wtuTzGb5krZ
m3QynlRQ+x7CZJNJeiNSTo/Dup/KwBUpJFT5sKLtpfOQILxlEt0hdYMiD+/oxgxd
K0Vb0QrZhwFCju+OpcDAVlWNcQ5P8MMdoGUkOr5ekZ9FFalabW+bVUuM2Yf0Cok8
e20MGoZa2jcd+AZkp9jPUtCTURpW3Ew1WcVuJIgLH3EUMNrQNiPdia6xBzFyOPAw
L0G14RDkd/POGF90dUGY1Ta4WeQCNYp2Rgu5DLo6l3eJJ/oeqoIUBUoNRT9AOJHH
0SVNHkrrNlRJe9HD/Jdc6RVBMM+FFNU4rw1uxOPU2OtG0MyMsj39Nzw+xmvB9QsG
mwnMoeeDOJmFRnAyhETe4meR5mA8cPQDoNNlHL51I9JTJTUutIrfd+gQIgVgYrM2
oVfLW7Y0Eew8qYbAd2kfGnFNHDSH90RHG4beTz4zW3y4shembKhiPU7bgJ8lkke7
u4NgDOE+qTmtC1DznuV4Av8/27W6OMt/j1IWeR78IN7YBko99Ekog3zsWrAJgA/E
Y08JVrUpUU47tMl4uC9Y0AUvm1Tb2ZyDqcdlEEzF9txtdNa6cAJtJkPaO6nUrr4+
qLCbhBBADP+oQNESi6vRHRmxmk5Z/m2ybfnAuYNNraWY01Imp4kNvLFvB01ARGaF
Qin7dCjqz+E=
=S41z
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes to kprobes and the faddr2line script, plus a cleanup"
* tag 'perf-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix ';;' typo
scripts/faddr2line: Add CONFIG_DEBUG_INFO check
scripts/faddr2line: Fix vmlinux detection on arm64
x86/kprobes: Update kcb status flag after singlestepping
kprobes: Forbid probing on trampoline and BPF code areas
Including:
- Most intrusive patch is small and changes the default
allocation policy for DMA addresses. Before the change the
allocator tried its best to find an address in the first 4GB.
But that lead to performance problems when that space gets
exhaused, and since most devices are capable of 64-bit DMA
these days, we changed it to search in the full DMA-mask
range from the beginning. This change has the potential to
uncover bugs elsewhere, in the kernel or the hardware. There
is a Kconfig option and a command line option to restore the
old behavior, but none of them is enabled by default.
- Add Robin Murphy as reviewer of IOMMU code and maintainer for
the dma-iommu and iova code
- Chaning IOVA magazine size from 1032 to 1024 bytes to save
memory
- Some core code cleanups and dead-code removal
- Support for ACPI IORT RMR node
- Support for multiple PCI domains in the AMD-Vi driver
- ARM SMMU changes from Will Deacon:
- Add even more Qualcomm device-tree compatible strings
- Support dumping of IMP DEF Qualcomm registers on TLB sync
timeout
- Fix reference count leak on device tree node in Qualcomm
driver
- Intel VT-d driver updates from Lu Baolu:
- Make intel-iommu.h private
- Optimize the use of two locks
- Extend the driver to support large-scale platforms
- Cleanup some dead code
- MediaTek IOMMU refactoring and support for TTBR up to 35bit
- Basic support for Exynos SysMMU v7
- VirtIO IOMMU driver gets a map/unmap_pages() implementation
- Other smaller cleanups and fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmLs3DIACgkQK/BELZcB
GuMizhAAguAnLLOkOLlR9/MhrTZfNXCUX+bfrEIevjFXMw4iPNfCCr4ydQ7EdVK6
ZA/3Z89huYl0d0x/FELolnQi+HOeqYrfTDe4rB7TgNgwZnWa+fdHcyYkgBGyfPaV
ilgjNcx8o//9o4NasyB6kU395jVmFxb735gMTTb+tcO9fr+/qIB6hxrHuCklxrNr
C7wK6kkoDPi5n0QuXCSjXEx2Hk245pAWKPLwqxsUYzHGlLfl7ULOxw65BUBGvn/H
uCsTfJFu7u+ErwQYf0qPuOwRBnRdsx9g5EAnfab8p074SoKWvbNnftIxgIRp8ZEM
YgCbhYa1GOFI4r+XzqRzEbc0/vPSttims4Jqz0KxYs7pr5EoVifrWLJFjJdCdc2h
Tio1gTvOq8HbH63kwYNKJhg4iSC6zVd37ihEhvfFO6LcgFl4iCfd2o9zK7oY40J4
XoOxofVnJ2e3tzdhZ/n5quCXiudHixm6WuVa7QYKscF7Ud0tY1wWKuibdlMQTeNM
68MvtlteKcfs1BrWzZyrFMrFeAfIY8LI82y6jdJuoNMU5LE9+5yelXBdJhnVygZ+
Jglv1TIt6W/z1H5JgXtNVZ1wWgBm7rurOqNyfN8XCd8eP1z321CLfX8ujkhKrIWP
ApG15cwvpnh1JX630+UFiEikTGU0fb2orMdPwYmwuu8DAsoLVHE=
=hI2K
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.20-or-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- The most intrusive patch is small and changes the default allocation
policy for DMA addresses.
Before the change the allocator tried its best to find an address in
the first 4GB. But that lead to performance problems when that space
gets exhaused, and since most devices are capable of 64-bit DMA these
days, we changed it to search in the full DMA-mask range from the
beginning.
This change has the potential to uncover bugs elsewhere, in the
kernel or the hardware. There is a Kconfig option and a command line
option to restore the old behavior, but none of them is enabled by
default.
- Add Robin Murphy as reviewer of IOMMU code and maintainer for the
dma-iommu and iova code
- Chaning IOVA magazine size from 1032 to 1024 bytes to save memory
- Some core code cleanups and dead-code removal
- Support for ACPI IORT RMR node
- Support for multiple PCI domains in the AMD-Vi driver
- ARM SMMU changes from Will Deacon:
- Add even more Qualcomm device-tree compatible strings
- Support dumping of IMP DEF Qualcomm registers on TLB sync
timeout
- Fix reference count leak on device tree node in Qualcomm driver
- Intel VT-d driver updates from Lu Baolu:
- Make intel-iommu.h private
- Optimize the use of two locks
- Extend the driver to support large-scale platforms
- Cleanup some dead code
- MediaTek IOMMU refactoring and support for TTBR up to 35bit
- Basic support for Exynos SysMMU v7
- VirtIO IOMMU driver gets a map/unmap_pages() implementation
- Other smaller cleanups and fixes
* tag 'iommu-updates-v5.20-or-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (116 commits)
iommu/amd: Fix compile warning in init code
iommu/amd: Add support for AVIC when SNP is enabled
iommu/amd: Simplify and Consolidate Virtual APIC (AVIC) Enablement
ACPI/IORT: Fix build error implicit-function-declaration
drivers: iommu: fix clang -wformat warning
iommu/arm-smmu: qcom_iommu: Add of_node_put() when breaking out of loop
iommu/arm-smmu-qcom: Add SM6375 SMMU compatible
dt-bindings: arm-smmu: Add compatible for Qualcomm SM6375
MAINTAINERS: Add Robin Murphy as IOMMU SUBSYTEM reviewer
iommu/amd: Do not support IOMMUv2 APIs when SNP is enabled
iommu/amd: Do not support IOMMU_DOMAIN_IDENTITY after SNP is enabled
iommu/amd: Set translation valid bit only when IO page tables are in use
iommu/amd: Introduce function to check and enable SNP
iommu/amd: Globally detect SNP support
iommu/amd: Process all IVHDs before enabling IOMMU features
iommu/amd: Introduce global variable for storing common EFR and EFR2
iommu/amd: Introduce Support for Extended Feature 2 Register
iommu/amd: Change macro for IOMMU control register bit shift to decimal value
iommu/exynos: Enable default VM instance on SysMMU v7
iommu/exynos: Add SysMMU v7 register set
...
dynamic. For instance, enclaves can now change enclave page
permissions on the fly.
- Removal of an unused structure member
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmLq2M8ACgkQaDWVMHDJ
krCbAw/+J4nHXxZNMQQX1c8CYJ7XHIr+YtsqNFYwH58rJJstHO/YwQf+mesVOeeu
08BYn+T5cdAbShKcxdkowPB17S6w/WzACtUfVhaoRQC7Md40cBiyc45UiC2e1u9g
W3Osk5+fTVcSYA9WiizPntIQkjVs9e7hcNKjTyVPnSw8W8mFCLg+ZiPb7YvKERTO
o8Wi2+zzX1BGDNOyBEqvnstz9uXDbCbFUTYX6zToBUk+Y1ZPXHwuHgNTtrAqGYaL
qyi0O2zoWnfOUmplzjJ/1aPmzPJDPgDNImC+gjTpYXGmg05Ryds+VZAc64IIjqYn
K+/5674PZFdsp5/YfctubdsQm0l0xen94sccAacd7KfsVurcHs3E2bdQPDw0htxv
svCX0Sai/qv52tPNzw+n9EJRcQsiwd9Pn0rWwx2i8hQcgMFiwCus6DBKhU7uh2Jp
oTwlspqJy2NHu9bici78tmsOio9CORjrh1WOfWX+yHEux4dtQAl889Gw5qzId6V1
Bh1MgoAu/pQ78feo96f3h5yOultOtpbTGyXEC8t4MTSpIVgZ2NzfUxe4RhOCBnhA
kdftVNfZLGOzwBbgFy0gYTe/ukt1DkP4BNHQilf2I+bUP/kZFlN8wfxBipWzr0bs
Skrz4+brBIaTdGoFgzhgt3g5YH16DSasmy/HCkIeV7eaAHFRLoE=
=Y7YA
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_v6.0-2022-08-03.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Dave Hansen:
"A set of x86/sgx changes focused on implementing the "SGX2" features,
plus a minor cleanup:
- SGX2 ISA support which makes enclave memory management much more
dynamic. For instance, enclaves can now change enclave page
permissions on the fly.
- Removal of an unused structure member"
* tag 'x86_sgx_for_v6.0-2022-08-03.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
x86/sgx: Drop 'page_index' from sgx_backing
selftests/sgx: Page removal stress test
selftests/sgx: Test reclaiming of untouched page
selftests/sgx: Test invalid access to removed enclave page
selftests/sgx: Test faulty enclave behavior
selftests/sgx: Test complete changing of page type flow
selftests/sgx: Introduce TCS initialization enclave operation
selftests/sgx: Introduce dynamic entry point
selftests/sgx: Test two different SGX2 EAUG flows
selftests/sgx: Add test for TCS page permission changes
selftests/sgx: Add test for EPCM permission changes
Documentation/x86: Introduce enclave runtime management section
x86/sgx: Free up EPC pages directly to support large page ranges
x86/sgx: Support complete page removal
x86/sgx: Support modifying SGX page type
x86/sgx: Tighten accessible memory range after enclave initialization
x86/sgx: Support adding of pages to an initialized enclave
x86/sgx: Support restricting of enclave page permissions
x86/sgx: Support VA page allocation without reclaiming
x86/sgx: Export sgx_encl_page_alloc()
...
- Runtime verification infrastructure
This is the biggest change for this pull request. It introduces the
runtime verification that is necessary for running Linux on safety
critical systems. It allows for deterministic automata models to be
inserted into the kernel that will attach to tracepoints, where the
information on these tracepoints will move the model from state to state.
If a state is encountered that does not belong to the model, it will then
activate a given reactor, that could just inform the user or even panic
the kernel (for which safety critical systems will detect and can recover
from).
- Two monitor models are also added: Wakeup In Preemptive (WIP - not to be
confused with "work in progress"), and Wakeup While Not Running (WWNR).
- Added __vstring() helper to the TRACE_EVENT() macro to replace several
vsnprintf() usages that were all doing it wrong.
- eprobes now can have their event autogenerated when the event name is left
off.
- The rest is various cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYu0yzRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qj4HAP4tQtV55rjj4DQ5XIXmtI3/64PmyRSJ
+y4DEXi1UvEUCQD/QAuQfWoT/7gh35ltkfeS4t3ockzy14rrkP5drZigiQA=
=kEtM
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Runtime verification infrastructure
This is the biggest change here. It introduces the runtime
verification that is necessary for running Linux on safety critical
systems.
It allows for deterministic automata models to be inserted into the
kernel that will attach to tracepoints, where the information on
these tracepoints will move the model from state to state.
If a state is encountered that does not belong to the model, it will
then activate a given reactor, that could just inform the user or
even panic the kernel (for which safety critical systems will detect
and can recover from).
- Two monitor models are also added: Wakeup In Preemptive (WIP - not to
be confused with "work in progress"), and Wakeup While Not Running
(WWNR).
- Added __vstring() helper to the TRACE_EVENT() macro to replace
several vsnprintf() usages that were all doing it wrong.
- eprobes now can have their event autogenerated when the event name is
left off.
- The rest is various cleanups and fixes.
* tag 'trace-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits)
rv: Unlock on error path in rv_unregister_reactor()
tracing: Use alignof__(struct {type b;}) instead of offsetof()
tracing/eprobe: Show syntax error logs in error_log file
scripts/tracing: Fix typo 'the the' in comment
tracepoints: It is CONFIG_TRACEPOINTS not CONFIG_TRACEPOINT
tracing: Use free_trace_buffer() in allocate_trace_buffers()
tracing: Use a struct alignof to determine trace event field alignment
rv/reactor: Add the panic reactor
rv/reactor: Add the printk reactor
rv/monitor: Add the wwnr monitor
rv/monitor: Add the wip monitor
rv/monitor: Add the wip monitor skeleton created by dot2k
Documentation/rv: Add deterministic automata instrumentation documentation
Documentation/rv: Add deterministic automata monitor synthesis documentation
tools/rv: Add dot2k
Documentation/rv: Add deterministic automaton documentation
tools/rv: Add dot2c
Documentation/rv: Add a basic documentation
rv/include: Add instrumentation helper functions
rv/include: Add deterministic automata monitor definition via C macros
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmLr+2wUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxfZg//eChkC2EUdT6K3zuQDbJJhsGcuOQF
lnZuUyDn4xw7BkEoZf8V6YdAnp7VvgKhLOq1/q3Geu/LBbCaczoEogOCaR/WcVOs
C+MsN0RWZQtgfuZKncQoqp25NeLPK9PFToeiIX/xViAYZF7NVjDY7XQiZHQ6JkEA
/7cUqv/4nS3KCMsKjfmiOxGnqohMWtICiw9qjFvJ40PEDnNB1b53rkiVTxBFePpI
ePfsRfi/C7klE3xNfoiEgrPp+Jfw+oShsCwXUsId7bEL2oLBc7ClqP05ZYZD3bTK
QQYyZ12Cq8TysciYpUGBjBnywUHS5DIO5YaV3wxyVAR2Z+6GY2/QVjOa2kKvoK0o
Hba6TJf8bL58AhSI8Q62pBM0sS7dqJSff+9c2BGpZvII5spP/rQQLlJO56TJjwkw
Dlf0d3thhZOc9vSKjKw+0v0FdAyc4L11EOwUsw95jZeT5WWgqJYGFnWPZwqBI1KM
DI1E5wVO5tA2H3NEn+BTTHbLWL+UppqyXPXBHiW52b2q5Bt8fJWMsFvnEEjclxmG
pYCI7VgF8jqbYKxjobxPFY2x6PH9hfaGMxwzZSdOX6e/Eh+1esgyyaC5APpCO+Pp
e4OkJaOzCmggrD0jYeLWu+yDm5KRrYo5cdfKHrKgAof0Am41lAa1OhJ2iH4ckNqP
1qmHereDOe0zNVw=
=9TAR
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Consolidate duplicated 'next function' scanning and extend to allow
'isolated functions' on s390, similar to existing hypervisors
(Niklas Schnelle)
Resource management:
- Implement pci_iobar_pfn() for sparc, which allows us to remove the
sparc-specific pci_mmap_page_range() and pci_mmap_resource_range().
This removes the ability to map the entire PCI I/O space using
/proc/bus/pci, but we believe that's already been broken since
v2.6.28 (Arnd Bergmann)
- Move common PCI definitions to asm-generic/pci.h and rework others
to be be more specific and more encapsulated in arches that need
them (Stafford Horne)
Power management:
- Convert drivers to new *_PM_OPS macros to avoid need for '#ifdef
CONFIG_PM_SLEEP' or '__maybe_unused' (Bjorn Helgaas)
Virtualization:
- Add ACS quirk for Broadcom BCM5750x multifunction NICs that isolate
the functions but don't advertise an ACS capability (Pavan Chebbi)
Error handling:
- Clear PCI Status register during enumeration in case firmware left
errors logged (Kai-Heng Feng)
- When we have native control of AER, enable error reporting for all
devices that support AER. Previously only a few drivers enabled
this (Stefan Roese)
- Keep AER error reporting enabled for switches. Previously we
enabled this during enumeration but immediately disabled it (Stefan
Roese)
- Iterate over error counters instead of error strings to avoid
printing junk in AER sysfs counters (Mohamed Khalfella)
ASPM:
- Remove pcie_aspm_pm_state_change() so ASPM config changes, e.g.,
via sysfs, are not lost across power state changes (Kai-Heng Feng)
Endpoint framework:
- Don't stop an EPC when unbinding an EPF from it (Shunsuke Mie)
Endpoint embedded DMA controller driver:
- Simplify and clean up support for the DesignWare embedded DMA
(eDMA) controller (Frank Li, Serge Semin)
Broadcom STB PCIe controller driver:
- Avoid config space accesses when link is down because we can't
recover from the CPU aborts these cause (Jim Quinlan)
- Look for power regulators described under Root Ports in DT and
enable them before scanning the secondary bus (Jim Quinlan)
- Disable/enable regulators in suspend/resume (Jim Quinlan)
Freescale i.MX6 PCIe controller driver:
- Simplify and clean up clock and PHY management (Richard Zhu)
- Disable/enable regulators in suspend/resume (Richard Zhu)
- Set PCIE_DBI_RO_WR_EN before writing DBI registers (Richard Zhu)
- Allow speeds faster than Gen2 (Richard Zhu)
- Make link being down a non-fatal error so controller probe doesn't
fail if there are no Endpoints connected (Richard Zhu)
Loongson PCIe controller driver:
- Add ACPI and MCFG support for Loongson LS7A (Huacai Chen)
- Avoid config reads to non-existent LS2K/LS7A devices because a
hardware defect causes machine hangs (Huacai Chen)
- Work around LS7A integrated devices that report incorrect Interrupt
Pin values (Jianmin Lv)
Marvell Aardvark PCIe controller driver:
- Add support for AER and Slot capability on emulated bridge (Pali
Rohár)
MediaTek PCIe controller driver:
- Add Airoha EN7532 to DT binding (John Crispin)
- Allow building of driver for ARCH_AIROHA (Felix Fietkau)
MediaTek PCIe Gen3 controller driver:
- Print decoded LTSSM state when the link doesn't come up (Jianjun
Wang)
NVIDIA Tegra194 PCIe controller driver:
- Convert DT binding to json-schema (Vidya Sagar)
- Add DT bindings and driver support for Tegra234 Root Port and
Endpoint mode (Vidya Sagar)
- Fix some Root Port interrupt handling issues (Vidya Sagar)
- Set default Max Payload Size to 256 bytes (Vidya Sagar)
- Fix Data Link Feature capability programming (Vidya Sagar)
- Extend Endpoint mode support to devices beyond Controller-5 (Vidya
Sagar)
Qualcomm PCIe controller driver:
- Rework clock, reset, PHY power-on ordering to avoid hangs and
improve consistency (Robert Marko, Christian Marangi)
- Move pipe_clk handling to PHY drivers (Dmitry Baryshkov)
- Add IPQ60xx support (Selvam Sathappan Periakaruppan)
- Allow ASPM L1 and substates for 2.7.0 (Krishna chaitanya chundru)
- Add support for more than 32 MSI interrupts (Dmitry Baryshkov)
Renesas R-Car PCIe controller driver:
- Convert DT binding to json-schema (Herve Codina)
- Add Renesas RZ/N1D (R9A06G032) to rcar-gen2 DT binding and driver
(Herve Codina)
Samsung Exynos PCIe controller driver:
- Fix phy-exynos-pcie driver so it follows the 'phy_init() before
phy_power_on()' PHY programming model (Marek Szyprowski)
Synopsys DesignWare PCIe controller driver:
- Simplify and clean up the DWC core extensively (Serge Semin)
- Fix an issue with programming the ATU for regions that cross a 4GB
boundary (Serge Semin)
- Enable the CDM check if 'snps,enable-cdm-check' exists; previously
we skipped it if 'num-lanes' was absent (Serge Semin)
- Allocate a 32-bit DMA-able page to be MSI target instead of using a
driver data structure that may not be addressable with 32-bit
address (Will McVicker)
- Add DWC core support for more than 32 MSI interrupts (Dmitry
Baryshkov)
Xilinx Versal CPM PCIe controller driver:
- Add DT binding and driver support for Versal CPM5 Gen5 Root Port
(Bharat Kumar Gogada)"
* tag 'pci-v5.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (150 commits)
PCI: imx6: Support more than Gen2 speed link mode
PCI: imx6: Set PCIE_DBI_RO_WR_EN before writing DBI registers
PCI: imx6: Reformat suspend callback to keep symmetric with resume
PCI: imx6: Move the imx6_pcie_ltssm_disable() earlier
PCI: imx6: Disable clocks in reverse order of enable
PCI: imx6: Do not hide PHY driver callbacks and refine the error handling
PCI: imx6: Reduce resume time by only starting link if it was up before suspend
PCI: imx6: Mark the link down as non-fatal error
PCI: imx6: Move regulator enable out of imx6_pcie_deassert_core_reset()
PCI: imx6: Turn off regulator when system is in suspend mode
PCI: imx6: Call host init function directly in resume
PCI: imx6: Disable i.MX6QDL clock when disabling ref clocks
PCI: imx6: Propagate .host_init() errors to caller
PCI: imx6: Collect clock enables in imx6_pcie_clk_enable()
PCI: imx6: Factor out ref clock disable to match enable
PCI: imx6: Move imx6_pcie_clk_disable() earlier
PCI: imx6: Move imx6_pcie_enable_ref_clk() earlier
PCI: imx6: Move PHY management functions together
PCI: imx6: Move imx6_pcie_grp_offset(), imx6_pcie_configure_type() earlier
PCI: imx6: Convert to NOIRQ_SYSTEM_SLEEP_PM_OPS()
...
* Unwinder implementations for both nVHE modes (classic and
protected), complete with an overflow stack
* Rework of the sysreg access from userspace, with a complete
rewrite of the vgic-v3 view to allign with the rest of the
infrastructure
* Disagregation of the vcpu flags in separate sets to better track
their use model.
* A fix for the GICv2-on-v3 selftest
* A small set of cosmetic fixes
RISC-V:
* Track ISA extensions used by Guest using bitmap
* Added system instruction emulation framework
* Added CSR emulation framework
* Added gfp_custom flag in struct kvm_mmu_memory_cache
* Added G-stage ioremap() and iounmap() functions
* Added support for Svpbmt inside Guest
s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes
x86:
* Permit guests to ignore single-bit ECC errors
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Use try_cmpxchg64 instead of cmpxchg64
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* Allow NX huge page mitigation to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs
* Miscellaneous cleanups:
** MCE MSR emulation
** Use separate namespaces for guest PTEs and shadow PTEs bitmasks
** PIO emulation
** Reorganize rmap API, mostly around rmap destruction
** Do not workaround very old KVM bugs for L0 that runs with nesting enabled
** new selftests API for CPUID
Generic:
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmLnyo4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMtQQf/XjVWiRcWLPR9dqzRM/vvRXpiG+UL
jU93R7m6ma99aqTtrxV/AE+kHgamBlma3Cwo+AcWk9uCVNbIhFjv2YKg6HptKU0e
oJT3zRYp+XIjEo7Kfw+TwroZbTlG6gN83l1oBLFMqiFmHsMLnXSI2mm8MXyi3dNB
vR2uIcTAl58KIprqNNsYJ2dNn74ogOMiXYx9XzoA9/5Xb6c0h4rreHJa5t+0s9RO
Gz7Io3PxumgsbJngjyL1Ve5oxhlIAcZA8DU0PQmjxo3eS+k6BcmavGFd45gNL5zg
iLpCh4k86spmzh8CWkAAwWPQE4dZknK6jTctJc0OFVad3Z7+X7n0E8TFrA==
=PM8o
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Quite a large pull request due to a selftest API overhaul and some
patches that had come in too late for 5.19.
ARM:
- Unwinder implementations for both nVHE modes (classic and
protected), complete with an overflow stack
- Rework of the sysreg access from userspace, with a complete rewrite
of the vgic-v3 view to allign with the rest of the infrastructure
- Disagregation of the vcpu flags in separate sets to better track
their use model.
- A fix for the GICv2-on-v3 selftest
- A small set of cosmetic fixes
RISC-V:
- Track ISA extensions used by Guest using bitmap
- Added system instruction emulation framework
- Added CSR emulation framework
- Added gfp_custom flag in struct kvm_mmu_memory_cache
- Added G-stage ioremap() and iounmap() functions
- Added support for Svpbmt inside Guest
s390:
- add an interface to provide a hypervisor dump for secure guests
- improve selftests to use TAP interface
- enable interpretive execution of zPCI instructions (for PCI
passthrough)
- First part of deferred teardown
- CPU Topology
- PV attestation
- Minor fixes
x86:
- Permit guests to ignore single-bit ECC errors
- Intel IPI virtualization
- Allow getting/setting pending triple fault with
KVM_GET/SET_VCPU_EVENTS
- PEBS virtualization
- Simplify PMU emulation by just using PERF_TYPE_RAW events
- More accurate event reinjection on SVM (avoid retrying
instructions)
- Allow getting/setting the state of the speaker port data bit
- Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls
are inconsistent
- "Notify" VM exit (detect microarchitectural hangs) for Intel
- Use try_cmpxchg64 instead of cmpxchg64
- Ignore benign host accesses to PMU MSRs when PMU is disabled
- Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
- Allow NX huge page mitigation to be disabled on a per-vm basis
- Port eager page splitting to shadow MMU as well
- Enable CMCI capability by default and handle injected UCNA errors
- Expose pid of vcpu threads in debugfs
- x2AVIC support for AMD
- cleanup PIO emulation
- Fixes for LLDT/LTR emulation
- Don't require refcounted "struct page" to create huge SPTEs
- Miscellaneous cleanups:
- MCE MSR emulation
- Use separate namespaces for guest PTEs and shadow PTEs bitmasks
- PIO emulation
- Reorganize rmap API, mostly around rmap destruction
- Do not workaround very old KVM bugs for L0 that runs with nesting enabled
- new selftests API for CPUID
Generic:
- Fix races in gfn->pfn cache refresh; do not pin pages tracked by
the cache
- new selftests API using struct kvm_vcpu instead of a (vm, id)
tuple"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (606 commits)
selftests: kvm: set rax before vmcall
selftests: KVM: Add exponent check for boolean stats
selftests: KVM: Provide descriptive assertions in kvm_binary_stats_test
selftests: KVM: Check stat name before other fields
KVM: x86/mmu: remove unused variable
RISC-V: KVM: Add support for Svpbmt inside Guest/VM
RISC-V: KVM: Use PAGE_KERNEL_IO in kvm_riscv_gstage_ioremap()
RISC-V: KVM: Add G-stage ioremap() and iounmap() functions
KVM: Add gfp_custom flag in struct kvm_mmu_memory_cache
RISC-V: KVM: Add extensible CSR emulation framework
RISC-V: KVM: Add extensible system instruction emulation framework
RISC-V: KVM: Factor-out instruction emulation into separate sources
RISC-V: KVM: move preempt_disable() call in kvm_arch_vcpu_ioctl_run
RISC-V: KVM: Make kvm_riscv_guest_timer_init a void function
RISC-V: KVM: Fix variable spelling mistake
RISC-V: KVM: Improve ISA extension by using a bitmap
KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs()
KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog
KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT
KVM: x86: Do not block APIC write for non ICR registers
...
ACRN Hypervisor reports timing information via CPUID leaf 0x40000010.
Get the TSC and CPU frequency via CPUID leaf 0x40000010 and set the
kernel values accordingly.
Signed-off-by: Fei Li <fei1.li@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Conghui <conghui.chen@intel.com>
Link: https://lore.kernel.org/r/20220804055903.365211-1-fei1.li@intel.com
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.
== Background ==
Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.
To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced. eIBRS is an "always on" IBRS, in other words, just turn
it on once instead of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.
== Problem ==
Here's a simplification of how guests are run on Linux' KVM:
void run_kvm_guest(void)
{
// Prepare to run guest
VMRESUME();
// Clean up after guest runs
}
The execution flow for that would look something like this to the
processor:
1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()
Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:
* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.
* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB except in this PBRSB situation where the software needs to stuff
the last RSB entry "by hand".
IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.
However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.
Balanced CALL/RET instruction pairs such as in step #5 are not affected.
== Solution ==
The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e.,
eIBRS systems which enable legacy IBRS explicitly.
However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT
and most of them need a new mitigation.
Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.
The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.
In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.
There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.
[ bp: Massage, incorporate review comments from Andy Cooper. ]
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmLnDOwACgkQSfxwEqXe
A65Fiw//Z0YaPejSslQIGitQ1b0XzdWBhyJArYDieaaiQRXMqlaSKlIUqHz38xb7
+FykUY51/SJLjHV2riPxq1OK3/MPmk6VlTd0HHihcHVmg77oZcFcv2tPnDpZoqND
TsBOujLbXKwxP8tNFedRY/4+K7w+ue9BTfDjuH7aCtz7uWd+4cNJmPg3x9FCfkMA
+hbcRluwE9W3Pg4OCKwv+qxL0JF3qQtNKEOp1wpnjGAZZW/I9gFNgFBEkykvcAsj
TkIRDc3agPFj6QgDeRIgLdnf9KCsLubKAg5oJneeCvQztJJUCSkn8nQXxpx+4sLo
GsRgvCdfL/GyJqfSAzQJVYDHKtKMkJiCiWCC/oOALR8dzHJfSlULDAjbY1m/DAr9
at+vi4678Or7TNx2ZSaUlCXXKZ+UT7yWMlQWax9JuxGk1hGYP5/eT1AH5SGjqUwF
w1q8oyzxt1vUcnOzEddFXPFirnqqhAk4dQFtu83+xKM4ZssMVyeB4NZdEhAdW0ng
MX+RjrVj4l5gWWuoS0Cx3LUxDCgV6WT0dN+Vl9axAZkoJJbcXLEmXwQ6NbzTLPWg
1/MT7qFTxNcTCeAArMdZvvFbeh7pOBXO42pafrK/7vDRnTMUIw9tqXNLQUfvdFQp
F5flPgiVRHDU2vSzKIFtnPTyXU0RBBGvNb4n0ss2ehH2DSsCxYE=
=Zy3d
-----END PGP SIGNATURE-----
Merge tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"Though there's been a decent amount of RNG-related development during
this last cycle, not all of it is coming through this tree, as this
cycle saw a shift toward tackling early boot time seeding issues,
which took place in other trees as well.
Here's a summary of the various patches:
- The CONFIG_ARCH_RANDOM .config option and the "nordrand" boot
option have been removed, as they overlapped with the more widely
supported and more sensible options, CONFIG_RANDOM_TRUST_CPU and
"random.trust_cpu". This change allowed simplifying a bit of arch
code.
- x86's RDRAND boot time test has been made a bit more robust, with
RDRAND disabled if it's clearly producing bogus results. This would
be a tip.git commit, technically, but I took it through random.git
to avoid a large merge conflict.
- The RNG has long since mixed in a timestamp very early in boot, on
the premise that a computer that does the same things, but does so
starting at different points in wall time, could be made to still
produce a different RNG state. Unfortunately, the clock isn't set
early in boot on all systems, so now we mix in that timestamp when
the time is actually set.
- User Mode Linux now uses the host OS's getrandom() syscall to
generate a bootloader RNG seed and later on treats getrandom() as
the platform's RDRAND-like faculty.
- The arch_get_random_{seed_,}_long() family of functions is now
arch_get_random_{seed_,}_longs(), which enables certain platforms,
such as s390, to exploit considerable performance advantages from
requesting multiple CPU random numbers at once, while at the same
time compiling down to the same code as before on platforms like
x86.
- A small cleanup changing a cmpxchg() into a try_cmpxchg(), from
Uros.
- A comment spelling fix"
More info about other random number changes that come in through various
architecture trees in the full commentary in the pull request:
https://lore.kernel.org/all/20220731232428.2219258-1-Jason@zx2c4.com/
* tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
random: correct spelling of "overwrites"
random: handle archrandom with multiple longs
um: seed rng using host OS rng
random: use try_cmpxchg in _credit_init_bits
timekeeping: contribute wall clock to rng on time change
x86/rdrand: Remove "nordrand" flag in favor of "random.trust_cpu"
random: remove CONFIG_ARCH_RANDOM
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCYulqTBQcem9oYXJAbGlu
dXguaWJtLmNvbQAKCRDLwZzRsCrn5SBBAP9nbAW1SPa/hDqbrclHdDrS59VkSVwv
6ZO2yAmxJAptHwD+JzyJpJiZsqVN/Tu85V1PqeAt9c8az8f3CfDBp2+w7AA=
=Ad+c
-----END PGP SIGNATURE-----
Merge tag 'integrity-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"Aside from the one EVM cleanup patch, all the other changes are kexec
related.
On different architectures different keyrings are used to verify the
kexec'ed kernel image signature. Here are a number of preparatory
cleanup patches and the patches themselves for making the keyrings -
builtin_trusted_keyring, .machine, .secondary_trusted_keyring, and
.platform - consistent across the different architectures"
* tag 'integrity-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
kexec, KEYS, s390: Make use of built-in and secondary keyring for signature verification
arm64: kexec_file: use more system keyrings to verify kernel image signature
kexec, KEYS: make the code in bzImage64_verify_sig generic
kexec: clean up arch_kexec_kernel_verify_sig
kexec: drop weak attribute from functions
kexec_file: drop weak attribute from functions
evm: Use IS_ENABLED to initialize .enabled
It's possible that this kernel has been kexec'd from a kernel that
enabled bus lock detection, or (hypothetically) BIOS/firmware has set
DEBUGCTLMSR_BUS_LOCK_DETECT.
Disable bus lock detection explicitly if not wanted.
Fixes: ebb1064e7c ("x86/traps: Handle #DB for bus lock")
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20220802033206.21333-1-chenyi.qiang@intel.com
- lockdep: Fix a handful of the more complex lockdep_init_map_*() primitives
that can lose the lock_type & cause false reports. No such mishap was
observed in the wild.
- jump_label improvements: simplify the cross-arch support of
initial NOP patching by making it arch-specific code (used on MIPS only),
and remove the s390 initial NOP patching that was superfluous.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn3jERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hzeg/7BTC90XeMANhTiL23iiH7dOYZwqdFeB12
VBqdaPaGC8i+mJzVAdGyPFwCFDww6Ak6P33PcHkemuIO5+DhWis8hfw5krHEOO1k
AyVSMOZuWJ8/g6ZenjgNFozQ8C+3NqURrpdqN55d7jhMazPWbsNLLqUgvSSqo6DY
Ah2O+EKrDfGNCxT6/YaTAmUryctotxafSyFDQxv3RKPfCoIIVv9b3WApYqTOqFIu
VYTPr+aAcMsU20hPMWQI4kbQaoCxFqr3bZiZtAiS/IEunqi+PlLuWjrnCUpLwVTC
+jOCkNJHt682FPKTWelUnCnkOg9KhHRujRst5mi1+2tWAOEvKltxfe05UpsZYC3b
jhzddREMwBt3iYsRn65LxxsN4AMK/C/41zjejHjZpf+Q5kwDsc6Ag3L5VifRFURS
KRwAy9ejoVYwnL7CaVHM2zZtOk4YNxPeXmiwoMJmOufpdmD1LoYbNUbpSDf+goIZ
yPJpxFI5UN8gi8IRo3DMe4K2nqcFBC3wFn8tNSAu+44gqDwGJAJL6MsLpkLSZkk8
3QN9O11UCRTJDkURjoEWPgRRuIu9HZ4GKNhiblDy6gNM/jDE/m5OG4OYfiMhojgc
KlMhsPzypSpeApL55lvZ+AzxH8mtwuUGwm8lnIdZ2kIse1iMwapxdWXWq9wQr8eW
jLWHgyZ6rcg=
=4B89
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"This was a fairly quiet cycle for the locking subsystem:
- lockdep: Fix a handful of the more complex lockdep_init_map_*()
primitives that can lose the lock_type & cause false reports. No
such mishap was observed in the wild.
- jump_label improvements: simplify the cross-arch support of initial
NOP patching by making it arch-specific code (used on MIPS only),
and remove the s390 initial NOP patching that was superfluous"
* tag 'locking-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Fix lockdep_init_map_*() confusion
jump_label: make initial NOP patching the special case
jump_label: mips: move module NOP patching into arch code
jump_label: s390: avoid pointless initial NOP patching
loader
- Add the ability to pass the IMA measurement of kernel and bootloader
to the kexec-ed kernel
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLn770ACgkQEsHwGGHe
VUoOyA//R7ljAspkzqE+kY02GOXCvVo+Ix/WFbpeUMouSb71vxjyqJED6lMrWKvM
HPzXwuQ5C1bXIbvWW424l66q9O48Iu3FvnURGc05ngBvgnyLxw+IdfWREr3rhVtR
ZKdaMHCzj1RsxCRYXie4NIyW86D1Bd4V4W7KFG/u26LSo9VL2oY1JXd0vxXrh0e6
F4pwJsS+5TrgaFPwfSLm66HWlM2oxmqBVD/Fi8Pmzq7/ewb3KSgIWralOjew5X13
f4ob9GVLojM9yVPLSww0p2CRitlxypO5pv3rsrcwo77UhikflFk4Ruc4IeMd4792
ZszDCyWWCzFHZDizo2tni4IbcKtOx1lL389sYj/ZVsAYarGzeRRNYpN5TE6cSFXK
6hqurMMTDrmeczScBK3uQ4BFkMzWYGCYWy6JNrTmD43Onb5fe2usWIbpz+oFB0Kd
26Oa85lAKUhOUTnU1yM5aeRYBYiouyD80BRKgve5pcN00BXwO0OOny5sijFt3hvC
266k2g/+zY6wNawnEesNfLFkUvR09416xEbe5W3l64vlCGsjt9doB4vPKLkHBXq4
YilUVFFT3/djTvfLy50L2ta9oNdYXK7ECfGj0t2UCcnj0IrO4E0Cm0BlPN8r/a6L
gwE9I4txaYZmT8VRBG2kiyUljUSqZUj1UFHevMuCS09dzLonJN4=
=s9Om
-----END PGP SIGNATURE-----
Merge tag 'x86_kdump_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 kdump updates from Borislav Petkov:
- Add the ability to pass early an RNG seed to the kernel from the boot
loader
- Add the ability to pass the IMA measurement of kernel and bootloader
to the kexec-ed kernel
* tag 'x86_kdump_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/setup: Use rng seeds from setup_data
x86/kexec: Carry forward IMA measurement log on kexec
- Other Kbuild improvements and fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnvZYACgkQEsHwGGHe
VUqCFA//T2b0IkFHk63x4Ln+kDwEhB+NFjeuwhEd9XTktEgT0tAU6pZcUruSAHMJ
MeQFNYGbqbqTNztRDvc9WSbJVQWxxTLJYbXi2HbJpk7cOBpGwQQW1dBeLH0W0eTt
YyAZN/bhpjQK7+/rTyFZlUxqjr8VgBWCrmK7OFV65kZxZW3K7Qjg54NotAHLwU7/
oywR+REyiU4soKPUJPtIA42kNXkxzrQDjMu42/ywdQHBS5PMCUp16rnvP8uw1ERN
9aO+yGZ907Xl6xQXTxnbo2ehlhbu25xK8IZMbgEXE66iSeEtWb27H3xGMktYBETn
LsG4OkiqQ0z7lYHCcWPduAlP1ZrHAOEjD6sJhDOtbGffHRyJG4AHHngQkxfeWxPR
evRsPJssEUui65fW30cB+zY0glnGXpsfb+Ma9i8/02B2DNq5N4ERv8q4rfrLJX9Y
l9hMGw8FqI57ALoDiXJWCTGHZbEvEHiLOUQh9lLEKQmf5hTqb4Trulv+uhlc/JVS
vrfUzfhVdpWU6XY0ba0jXdbkYBJV8xERG9dsBSYqvjGaXJES6TXILiZ7CeMHoXhN
srBPCM3QZUGp/gzhua2s7SctQfyMmx6CrCg39sMligBM2wl1o3vJML5y1h/RwN7g
LTVZTObvMPYDU6dn1APOZUh77DlW7JWg//DcsB0BrlfZCySRkfE=
=SMFp
-----END PGP SIGNATURE-----
Merge tag 'x86_build_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Borislav Petkov:
- Fix stack protector builds when cross compiling with Clang
- Other Kbuild improvements and fixes
* tag 'x86_build_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/purgatory: Omit use of bin2c
x86/purgatory: Hard-code obj-y in Makefile
x86/build: Remove unused OBJECT_FILES_NON_STANDARD_test_nx.o
x86/Kconfig: Fix CONFIG_CC_HAS_SANE_STACKPROTECTOR when cross compiling with clang
- Free the pmem platform device on the registration error path
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnuOIACgkQEsHwGGHe
VUrY0Q/+JPlcOVTbneHaUazeG7i0EYugNLslPa42FzcM8B0NMnK241XjP8fubZML
cHh76LWWu7MjUdkpiXjiQoBwp9fJHF1amkUNJ2JBJ3Vcz8phvfnCeaiotMeKIrDZ
XsG7+UuQcUAL5C5C9UOJYWHAA32IlcuvaKA0xV76TF3qMo6vmI62AYdc8ddcQj7k
qvtdzwVwoJ1RAJ1jD54c5nHZHu1yuPAx9unYsGnE7v6X8geU9aZ6EsNsLO/vAmt7
0qmIuUYmR12QL3PwfAlOqIev/yIOjcac/sA3zEDOwOjZt4fS77d8GLG3mnj9x4GY
9v9n0d2ZJd8iDb6g27mzZTqzqTTcm60TckiW+asdlc+tWi+bLH6btpdmZ5RZD9M1
GgsbEXsXpnG33rMucLiGKWPseFhKty7jWH51ux0KWygHajFSRqZHyPL8LNSmohZJ
X45wvRKB5VuTbITHZ+Ix7mTna34IFYu1TiLT+Bd0oGFS4WUfee5vb7JDpIOXfV3W
0Fi7xM7JCJQnvj1K5ernx0W37t35AfWahPKLPMtZu0EFodXxyHDfGG0yWeV9THD4
zkrZTYudZbtUNLO1vep/ikPnFCJ/EhntVNhk4J2MvSZ2xdUMbyFpd8h3kVR+UL0r
aprdPGwhWEDmjycO41sp+ifC/t1e5QBxTilp5WzUzoVEH05rIUk=
=IY+U
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
- Add a bunch of PCI IDs for new AMD CPUs and use them in k10temp
- Free the pmem platform device on the registration error path
* tag 'x86_misc_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hwmon: (k10temp): Add support for new family 17h and 19h models
x86/amd_nb: Add AMD PCI IDs for SMN communication
x86/pmem: Fix platform-device leak in error path
- Respect idle=nomwait when supplied on the kernel cmdline
- Two small cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLntx0ACgkQEsHwGGHe
VUqlRxAAkULobsk6Dx3wrQcYlpA8Mt/ctttTQXWiIQwhK1j7uP0zlGWBqImr5Wsk
T04g1s29azulnPs3PydCF2QlLqSyF4v2PyyUwnpKfTP6CPM+MLtz98Gm6Xcbkt+s
f28ISYgNP+15tskWdNqB5XIVGkuyBdNne9TiFwtnVrJYF47FSwqEWRyqMH+bIOGT
wSZUCfjcw7PtKwfIAmYq4beS2+wbY9bsfVyIz+H0ks2EVFQdjYWb/kH9PgUYEQFe
VEOBsPvTHDOJt0QXEXSJjmoSRUS77Wduw56Y3L2T4jWdXXQFWJ79rqNYDBvXGAdh
Y8BKM5IYFZpzrmfw2RB6jbDY/JWO5PPFvHTXogQf9+wttSerZEffVQdOeTwjT8VD
wc9/ZnNkT7915033VI90V+hdFkwarq8FXuFH8TkzcxP9DQNYG8CRTZBceq0UWBl0
5RpIDwNX9JxGrR+frJi0D24qxz//wLe56UqW9hLp73NP8QtEYEW1nb1q30Q2eM3N
iQblgmh63qQ/dy6JV1GFb3aePiWMUNQwcTrj1pd8YDfNlp4IsFsSswnsdAZWtr1A
l9qewHkBZbbzyTQkBjExUsaIdiaMywFwnUmcQNL+fHqznZIvMhJC/oCJeS0Pe/RH
alTUrYsk6Y87HFpxoXpd85a9+20m8yrA64uY8cSQguGZ9i5Lm8g=
=jkpj
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- Remove the vendor check when selecting MWAIT as the default idle
state
- Respect idle=nomwait when supplied on the kernel cmdline
- Two small cleanups
* tag 'x86_cpu_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Use MSR_IA32_MISC_ENABLE constants
x86: Fix comment for X86_FEATURE_ZEN
x86: Remove vendor checks from prefer_mwait_c1_over_halt
x86: Handle idle=nomwait cmdline properly for x86_idle
be able to enter deeper low-power state
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnsksACgkQEsHwGGHe
VUpOOw//WAfkouWFd7kmACSiWtkgEQfXgImhhM7tw5Zzks+aEMtL2RrKqFYzkFg5
hJK+lMI8QDkBFU/bgI/nAZfFiAS7iBMPY4T2Uw4+jZCPLr3TmUheJ2Pe1CxlIzQC
MfjXQm/j5uTZcB2jEORjPT5dVE3p6k1KpSbvf5ZKCc9YTwdylv3VeYcfv5WEkihR
61bWU+T7Yse4A3Bx32ewabLmk7lwOcdS1vbfsqdvkpI1vE1gI8CThgTuNAt8JWij
27GIxiF2BQkyw3d/IPt3wGIPOgVowISXWdtMgpCr17Mw1m+44vXG9cjSuAKfqAUY
wNXrBzirdqzJgN85WVJEFIoJasFJicrz/oNLYbcHQa8+AruRu6in22cSkPYPvVGc
iNgSlQOZdoY9Vl6izEV4OawCccYnKjskEW7nEVIqfENrwRPYWB/IAnGxkla7q3Ch
q+T8dyOAWToumuPK13c5VoX0nd02bfwSJACYRxN+M22zq8s7+Jv1fNtQeAGLnmD1
jG3HR0wJWBOVVyira7AbFI7Mx667HayslIesftEGU33FfY0gZTcwZ7jsZ9GTSyOi
AgHN3PvHyJYQ648T8JzbyuNJe3dyDKf81OLaPHP6+nV9Dy3aCrERTML0jo8xWv2N
rDA61BV/q+hdQS3vzmLRVPzLLZksGRNCS2ZzIbkR4dGxLQAAB2M=
=w/wH
-----END PGP SIGNATURE-----
Merge tag 'x86_fpu_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu update from Borislav Petkov:
- Add machinery to initialize AMX register state in order for
AMX-capable CPUs to be able to enter deeper low-power state
* tag 'x86_fpu_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
intel_idle: Add a new flag to initialize the AMX state
x86/fpu: Add a helper to prepare AMX state for low-power CPU idle
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnjdMACgkQEsHwGGHe
VUoNfw//W/eJCIdTZ4bYku0KVRvA2tP8xXqsevBaLGhi0yh4knoMI+b7pMUnUEYX
SlV2dAF0m85ICB7dN52TB6Bn0eyt1nGj9AHmgyiZ345R2IH+bvC5qig88JOR91gd
5o2HE+CICjXVvItOwwt+FMm8GrykZ2FrciAo92CTTt5TIcZyrkUXWJKwn9c1YNKd
bZFPOmAnrLUcMlweqeoZBTCVxu+yFm/CIYEs3eXISVitCEJ1JRVqxygJicycBwmw
kN1U7glF66ptJ5l1bas5ScsgKeDUbyFFiwKXrBMJI+T/FWU6YxYQW868+5E0/8g3
uhoKpDh4hECH36DdCO/DdEcpt2sBrPskx/3f1gY+LzX/uxWNB8+1996AQlOWyJSQ
W12hZED4HpyamJr6Z5BiVjSmCKhFG8kLk09D0dB35MBIsneBpFVbm4PHmnGm2X1e
0Cm92qMeIRj4unjGEK8rybJV1uy0b6mNzUgqdyXMzRagqespwi0/4rwNTn5uU9uW
gk5gsd7oV0HmbWKw83fHxE9MWj/L4t+9fW8UnVAYJMjehXhJohIUMK+B/dLQk61I
F0mX7XQDmrKgPOyBURGM36vkWqlgUPKISl2BlC/b7qgDOUnEDZmIdnv7Fnrplwt1
Ktwzsk7eTigi9iC4lpZ8mVs+m1ZXUlQnFlibXi2HB8fZe/4pWn4=
=e088
-----END PGP SIGNATURE-----
Merge tag 'x86_vmware_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vmware cleanup from Borislav Petkov:
- A single statement simplification by using the BIT() macro
* tag 'x86_vmware_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vmware: Use BIT() macro for shifting
when injecting errors on AMD platforms. In some cases, the platform
could prohibit those.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLni2gACgkQEsHwGGHe
VUq1uQ//SeO7mVATL+gtwbh3NGBUsLhYJeZkNOGaIxbiKSxEUiCuHwdUmIZukLIL
dTOAY60Wa9O7wuO9g1p2oeAK8SQO3ZyoIbKX5KZxy+eiCw0lgVyRv12l9qatj/bt
KL+ImDGkoUYp1GMrZP7Lp1B9vVc4lm73qkHSRseNrnjv8EKJbty62Ed6bhgjU+CN
jw+mbTHYGIO8M7XSPvzQhDmIBUSy1N6XVIUcBD2IqWoQCEgecW6woPUHvkoWlI/B
OwQ8KJjM5oRre/AqNN8t7COP5erYY1Qi3xX1+1QnFYlxx8/Z5w4V09X00MDN7NpG
1sJZPIctJ5lcEv6kSG+mI4D2TpmiMWDlWL1ifyZjY/p4Fu7bXEvtCpGTFGlsTWzN
kdiLEjjhA9D+ag2Ah52FBBgL3FpfJxrjDPoL8fYsVkxpzETiwXugqHr7MUh5HeHE
rQldU3aUdXvH94ilQn5Mx9bVwvVMY/egwCXMKQnz/Xzt+V4NnXPYs4didcPNsnDB
QlPpeiCkDmFsqdVQB+GDFq/bh9TeIHh6I+3zY+Esvi2y1m1IjzGbwwqjZgqhpmf3
9dVH7+bucn1muekA7uQL6R34AaPR6cST5QEEM2Lzp/77XnuQ35uvXLH80gHUT4BZ
a3UUiVXRELT5+xjx57efnnJj56NVuGsdTreC2QSA11fIPW91L84=
=Qz6G
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS update from Borislav Petkov:
"A single RAS change:
- Probe whether hardware error injection (direct MSR writes) is
possible when injecting errors on AMD platforms. In some cases, the
platform could prohibit those"
* tag 'ras_core_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Check whether writes to MCA_STATUS are getting ignored
KVM/s390, KVM/x86 and common infrastructure changes for 5.20
x86:
* Permit guests to ignore single-bit ECC errors
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Cleanups for MCE MSR emulation
s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes
Generic:
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple
x86:
* Use try_cmpxchg64 instead of cmpxchg64
* Bugfixes
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs
x86 cleanups:
* Use separate namespaces for guest PTEs and shadow PTEs bitmasks
* PIO emulation
* Reorganize rmap API, mostly around rmap destruction
* Do not workaround very old KVM bugs for L0 that runs with nesting enabled
* new selftests API for CPUID
When a ftrace_bug happens (where ftrace fails to modify a location) it is
helpful to have what was at that location as well as what was expected to
be there.
But with the conversion to text_poke() the variable that assigns the
expected for debugging was dropped. Unfortunately, I noticed this when I
needed it. Add it back.
Link: https://lkml.kernel.org/r/20220726101851.069d2e70@gandalf.local.home
Cc: "x86@kernel.org" <x86@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 768ae4406a ("x86/ftrace: Use text_poke()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The setup_profiling_timer() is mostly un-implemented by many
architectures. In many places it isn't guarded by CONFIG_PROFILE which is
needed for it to be used. Make it a weak symbol in kernel/profile.c and
remove the 'return -EINVAL' implementations from the kenrel.
There are a couple of architectures which do return 0 from the
setup_profiling_timer() function but they don't seem to do anything else
with it. To keep the /proc compatibility for now, leave these for a
future update or removal.
On ARM, this fixes the following sparse warning:
arch/arm/kernel/smp.c:793:5: warning: symbol 'setup_profiling_timer' was not declared. Should it be static?
Link: https://lkml.kernel.org/r/20220721195509.418205-1-ben-linux@fluff.org
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 007faec014.
Now that hyperv does its own protocol negotiation:
49d6a3c062 ("x86/Hyper-V: Add SEV negotiate protocol support in Isolation VM")
revert this exposure of the sev_es_ghcb_hv_call() helper.
Cc: Wei Liu <wei.liu@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by:Tianyu Lan <tiala@microsoft.com>
Link: https://lore.kernel.org/r/20220614014553.1915929-1-ltykernel@gmail.com
x86/kernel/cpu/cyrix.c now needs to include <linux/isa-dma.h> since the
'isa_dma_bridge_buggy' variable was moved to it.
Fixes this build error:
../arch/x86/kernel/cpu/cyrix.c: In function ‘init_cyrix’:
../arch/x86/kernel/cpu/cyrix.c:277:17: error: ‘isa_dma_bridge_buggy’ undeclared (first use in this function)
277 | isa_dma_bridge_buggy = 2;
Fixes: abb4970ac3 ("PCI: Move isa_dma_bridge_buggy out of asm/dma.h")
Link: https://lore.kernel.org/r/20220725202224.29269-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Stafford Horne <shorne@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
The archrandom interface was originally designed for x86, which supplies
RDRAND/RDSEED for receiving random words into registers, resulting in
one function to generate an int and another to generate a long. However,
other architectures don't follow this.
On arm64, the SMCCC TRNG interface can return between one and three
longs. On s390, the CPACF TRNG interface can return arbitrary amounts,
with four longs having the same cost as one. On UML, the os_getrandom()
interface can return arbitrary amounts.
So change the api signature to take a "max_longs" parameter designating
the maximum number of longs requested, and then return the number of
longs generated.
Since callers need to check this return value and loop anyway, each arch
implementation does not bother implementing its own loop to try again to
fill the maximum number of longs. Additionally, all existing callers
pass in a constant max_longs parameter. Taken together, these two things
mean that the codegen doesn't really change much for one-word-at-a-time
platforms, while performance is greatly improved on platforms such as
s390.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
IBRS mitigation for spectre_v2 forces write to MSR_IA32_SPEC_CTRL at
every kernel entry/exit. On Enhanced IBRS parts setting
MSR_IA32_SPEC_CTRL[IBRS] only once at boot is sufficient. MSR writes at
every kernel entry/exit incur unnecessary performance loss.
When Enhanced IBRS feature is present, print a warning about this
unnecessary performance loss.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/2a5eaf54583c2bfe0edc4fea64006656256cca17.1657814857.git.pawan.kumar.gupta@linux.intel.com
Debugging missing return thunks is easier if we can see where they're
happening.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/Ys66hwtFcGbYmoiZ@hirez.programming.kicks-ass.net/
Add support for SMN communication on family 17h model A0h and family 19h
models 60h-70h.
[ bp: Merge into a single patch. ]
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h
Acked-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20220719195256.1516-1-mario.limonciello@amd.com
Instead of the magic numbers 1<<11 and 1<<12 use the constants
from msr-index.h. This makes it obvious where those bits
of MSR_IA32_MISC_ENABLE are consumed (and in fact that Linux
consumes them at all) to simple minds that grep for
MSR_IA32_MISC_ENABLE_.*_UNAVAIL.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220719174714.2410374-1-pbonzini@redhat.com
When a CPU enters an idle state, a non-initialized AMX register state may
be the cause of preventing a deeper low-power state. Other extended
register states whether initialized or not do not impact the CPU idle
state.
The new helper can ensure the AMX state is initialized before the CPU is
idle, and it will be used by the intel idle driver.
Check the AMX_TILE feature bit before using XGETBV1 as a chain of
dependencies was established via cpuid_deps[]: AMX->XFD->XGETBV1.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20220608164748.11864-2-chang.seok.bae@intel.com
On AMD IBRS does not prevent Retbleed; as such use IBPB before a
firmware call to flush the branch history state.
And because in order to do an EFI call, the kernel maps a whole lot of
the kernel page table into the EFI page table, do an IBPB just in case
in order to prevent the scenario of poisoning the BTB and causing an EFI
call using the unprotected RET there.
[ bp: Massage. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220715194550.793957-1-cascardo@canonical.com
The decision of whether or not to trust RDRAND is controlled by the
"random.trust_cpu" boot time parameter or the CONFIG_RANDOM_TRUST_CPU
compile time default. The "nordrand" flag was added during the early
days of RDRAND, when there were worries that merely using its values
could compromise the RNG. However, these days, RDRAND values are not
used directly but always go through the RNG's hash function, making
"nordrand" no longer useful.
Rather, the correct switch is "random.trust_cpu", which not only handles
the relevant trust issue directly, but also is general to multiple CPU
types, not just x86.
However, x86 RDRAND does have a history of being occasionally
problematic. Prior, when the kernel would notice something strange, it'd
warn in dmesg and suggest enabling "nordrand". We can improve on that by
making the test a little bit better and then taking the step of
automatically disabling RDRAND if we detect it's problematic.
Also disable RDSEED if the RDRAND test fails.
Cc: x86@kernel.org
Cc: Theodore Ts'o <tytso@mit.edu>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Suggested-by: Borislav Petkov <bp@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
When RDRAND was introduced, there was much discussion on whether it
should be trusted and how the kernel should handle that. Initially, two
mechanisms cropped up, CONFIG_ARCH_RANDOM, a compile time switch, and
"nordrand", a boot-time switch.
Later the thinking evolved. With a properly designed RNG, using RDRAND
values alone won't harm anything, even if the outputs are malicious.
Rather, the issue is whether those values are being *trusted* to be good
or not. And so a new set of options were introduced as the real
ones that people use -- CONFIG_RANDOM_TRUST_CPU and "random.trust_cpu".
With these options, RDRAND is used, but it's not always credited. So in
the worst case, it does nothing, and in the best case, maybe it helps.
Along the way, CONFIG_ARCH_RANDOM's meaning got sort of pulled into the
center and became something certain platforms force-select.
The old options don't really help with much, and it's a bit odd to have
special handling for these instructions when the kernel can deal fine
with the existence or untrusted existence or broken existence or
non-existence of that CPU capability.
Simplify the situation by removing CONFIG_ARCH_RANDOM and using the
ordinary asm-generic fallback pattern instead, keeping the two options
that are actually used. For now it leaves "nordrand" for now, as the
removal of that will take a different route.
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Patch series "cpumask: Fix invalid uniprocessor assumptions", v4.
On uniprocessor builds, it is currently assumed that any cpumask will
contain the single CPU: cpu0. This assumption is used to provide
optimised implementations.
The current assumption also appears to be wrong, by ignoring the fact that
users can provide empty cpumasks. This can result in bugs as explained in
[1] - for_each_cpu() will run one iteration of the loop even when passed
an empty cpumask.
This series introduces some basic tests, and updates the optimisations for
uniprocessor builds.
The x86 patch was written after the kernel test robot [2] ran into a
failed build. I have tried to list the files potentially affected by the
changes to cpumask.h, in an attempt to find any other cases that fail on
!SMP. I've gone through some of the files manually, and ran a few cross
builds, but nothing else popped up. I (build) checked about half of the
potientally affected files, but I do not have the resources to do them
all. I hope we can fix other issues if/when they pop up later.
[1] https://lore.kernel.org/all/20220530082552.46113-1-sander@svanheule.net/
[2] https://lore.kernel.org/all/202206060858.wA0FOzRy-lkp@intel.com/
This patch (of 5):
The maps to keep track of shared caches between CPUs on SMP systems are
declared in asm/smp.h, among them specifically cpu_llc_shared_map. These
maps are externally defined in cpu/smpboot.c. The latter is only compiled
on CONFIG_SMP=y, which means the declared extern symbols from asm/smp.h do
not have a corresponding definition on uniprocessor builds.
The inline cpu_llc_shared_mask() function from asm/smp.h refers to the map
declaration mentioned above. This function is referenced in cacheinfo.c
inside for_each_cpu() loop macros, to provide cpumask for the loop. On
uniprocessor builds, the symbol for the cpu_llc_shared_map does not exist.
However, the current implementation of for_each_cpu() also (wrongly)
ignores the provided mask.
By sheer luck, the compiler thus optimises out this unused reference to
cpu_llc_shared_map, and the linker therefore does not require the
cpu_llc_shared_mask to actually exist on uniprocessor builds. Only on SMP
bulids does smpboot.o exist to provide the required symbols.
To no longer rely on compiler optimisations for successful uniprocessor
builds, move the definitions of cpu_llc_shared_map and cpu_l2c_shared_map
from smpboot.c to cacheinfo.c.
Link: https://lkml.kernel.org/r/cover.1656777646.git.sander@svanheule.net
Link: https://lkml.kernel.org/r/e8167ddb570f56744a3dc12c2149a660a324d969.1656777646.git.sander@svanheule.net
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
can accomodate a XenPV guest due to how the latter is setting up the PAT
machinery
Now that the retbleed nightmare is public, here's the first round of
fallout fixes:
- Fix a build failure on 32-bit due to missing include
- Remove an untraining point in espfix64 return path
- other small cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLTudcACgkQEsHwGGHe
VUrY4BAAtWm7wC6T8rzovbsyticj6kcehRMBEXxtlEP5LOeltR0dbNaIGskrS2Li
Q9YxxtQhbZPXqzqB+xeHVhDPThzsd3+wRvvetmR4fW/c3XCYr+fLLFjHj0NEvX0P
lQzuY8GKWGU/QTrjKSKclGvqyB692Fvdu4YImlnrGSbR6ywwVttditd3YNJR0w1q
7S4zq90uwWuX6cTLqXIcbKbhssOjcR1Agj9+bE8i+rzyB2VtNoihJCJh0pTJAn3P
RaXnxI/7J6Y+5imPf5/ywu8gxhvGBTy5MU/1v2pw939EurU9tmhVkNVWdO2g/qYY
V+Y1nj9xV0ucL4hlUBqAFdM+5jFC89Ey1X2tSgUgSl+44L/d8IIjVCp6inVoyCzE
Olbc6q7A/V8PNXfo4g6gDwVc3Ii53Fwgtu8xVHkwPGfjly6+yZ9O/RUBXcBAOnpU
jfS9LSc/Ro7kxqFy32beUgB7wwhMpkYuHe6ECxrvXj1IK13y3OkdxFzm03ty7S/E
BwjrkltDia9BQ6i4Ywy+qSBYkSH6+sxxt4pboB+ft6/p2JIw4YJIp9PRqzPAG0jx
JTjcZ9YAr7zPNlWp5e30BjGawgbuKPq0wkF1r6QD+3VzNf9+SSDmtkYGWeAyTTP2
SkzLy5QCNBTeeq0FZVYZFm/vcF7wccQhQdwNcezxygKAmihH0xw=
=ND3p
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.19_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Improve the check whether the kernel supports WP mappings so that it
can accomodate a XenPV guest due to how the latter is setting up the
PAT machinery
- Now that the retbleed nightmare is public, here's the first round of
fallout fixes:
* Fix a build failure on 32-bit due to missing include
* Remove an untraining point in espfix64 return path
* other small cleanups
* tag 'x86_urgent_for_v5.19_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Remove apostrophe typo
um: Add missing apply_returns()
x86/entry: Remove UNTRAIN_RET from native_irq_return_ldt
x86/bugs: Mark retbleed_strings static
x86/pat: Fix x86_has_pat_wp()
x86/asm/32: Fix ANNOTATE_UNRET_SAFE use on 32-bit
Fix more fallout from recent changes of the ACPI CPPC handling on AMD
platforms (Mario Limonciello).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmLRsmYSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxw7AP/i0f3x8wTGby77H9aFHxqYD1az5/nUco
Ymp8+l1Bm9TNbfBa1EqmwGwleho5X2GXUzFX2K3cKqNNp3fG/R0fsp7/+/h8YDmd
DzwidRaIQPMsqdUzjIncB83UNIhFeh3/gix0n+iZM+cVw2dBdluXHiq88HtDjOFz
ZkCF0ka5QcJyLJ4DgIbF5/CV/q3hd0ZT3xKLC0HfedZ7YwPRTL1yefUK3sosLZTW
qDWYL/86+XwoFBntBQ6gfGYS/xQ7nQ50QKa+SD8cqLCSrQFYswLkRnQpDg1XJlMT
XPc5WRrlbMC5VV+daY5/uMflRVdDLSuBKt5uTyFrvgguvw9S0Bgct8l/tABa8KRL
AGDMoqW6V0Lc3z1Jyu4xjjTY6lrhyTB+any8K/roGrAxSYqZveMiAgyyMLbZfxNa
dU1IAIw41g/ucBB/pE0T1jfrVrAIsLVoKXSl4ixpu50yC1DjJrv85zQfOyEVI9jM
/okPONeRLTbXnB7+xRCA8AB5nXgPOqPqpLYPW7IDHghskacqZiAo6fm7tWEC5ijX
epiI/JvwU7Pbt9UuS/7Kui9yC/33ErZEdYRC4QOhBQS4vLVpzQMRFRjdF8d50Jzk
Qm73OzFKDZNm/8ladAWSdyvmvl/Ch8PjAqIBqmZNQYAoBYmWjN1LhuN+d17/ro2I
jE05pwY8TXQL
=Y0dz
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Fix more fallout from recent changes of the ACPI CPPC handling on AMD
platforms (Mario Limonciello)"
* tag 'acpi-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: CPPC: Fix enabling CPPC on AMD systems with shared memory
Remove a superfluous ' in the mitigation string.
Fixes: e8ec1b6e08 ("x86/bugs: Enable STIBP for JMP2RET")
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
commit 278311e417 ("kexec, KEYS: Make use of platform keyring for
signature verify") adds platform keyring support on x86 kexec but not
arm64.
The code in bzImage64_verify_sig uses the keys on the
.builtin_trusted_keys, .machine, if configured and enabled,
.secondary_trusted_keys, also if configured, and .platform keyrings
to verify the signed kernel image as PE file.
Cc: kexec@lists.infradead.org
Cc: keyrings@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Reviewed-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Coiby Xu <coxu@redhat.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
tboot_force_iommu() is only called by the Intel IOMMU driver. Move the
helper into that driver. No functional change intended.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20220514014322.2927339-7-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
When commit 72f2ecb7ec ("ACPI: bus: Set CPPC _OSC bits for all
and when CPPC_LIB is supported") was introduced, we found collateral
damage that a number of AMD systems that supported CPPC but
didn't advertise support in _OSC stopped having a functional
amd-pstate driver. The _OSC was only enforced on Intel systems at that
time.
This was fixed for the MSR based designs by commit 8b356e536e
("ACPI: CPPC: Don't require _OSC if X86_FEATURE_CPPC is supported")
but some shared memory based designs also support CPPC but haven't
advertised support in the _OSC. Add support for those designs as well by
hardcoding the list of systems.
Fixes: 72f2ecb7ec ("ACPI: bus: Set CPPC _OSC bits for all and when CPPC_LIB is supported")
Fixes: 8b356e536e ("ACPI: CPPC: Don't require _OSC if X86_FEATURE_CPPC is supported")
Link: https://lore.kernel.org/all/3559249.JlDtxWtqDm@natalenko.name/
Cc: 5.18+ <stable@vger.kernel.org> # 5.18+
Reported-and-tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The build on x86_32 currently fails after commit
9bb2ec608a (objtool: Update Retpoline validation)
with:
arch/x86/kernel/../../x86/xen/xen-head.S:35: Error: no such instruction: `annotate_unret_safe'
ANNOTATE_UNRET_SAFE is defined in nospec-branch.h. And head_32.S is
missing this include. Fix this.
Fixes: 9bb2ec608a ("objtool: Update Retpoline validation")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/63e23f80-033f-f64e-7522-2816debbc367@kernel.org
solved and the nightmare is complete, here's the next one: speculating
after RET instructions and leaking privileged information using the now
pretty much classical covert channels.
It is called RETBleed and the mitigation effort and controlling
functionality has been modelled similar to what already existing
mitigations provide.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLNdDYACgkQEsHwGGHe
VUrNAw/+OTFF7md0+17Ju6vvagc/nXfUxk/r0lWU9/KzbRXvPTZdPKTW4NN5c0IS
VnogyUGFFpzU3dKU2os9ejTD4kHNx0oLuBfQt4w7t4qR+g3+nAH0ywNjH/N1VTJt
iDpww7CxqloV+i9RCsWV+zQPMPfc2VMUhe6xqNB2CgEDrruzFrDASZR6zzarsKxY
x4rwHn0ZkV7zNJfcNpV2323qktqHgBtAFf7GlZK8hBsgsiSk+xDk9CODkfxfWIV7
o4BNvNmaUKDJL51hpuzvIzYwDSiRO5AXdjxHG/0CHc3r3dtA6Xt1elHbERAyUMuM
P+6XievP5ZV/xXXjoZ5Vla67o3bbGKmTo2WluvVGeg8ahzQEwyPGqeXn77hk+of+
BtasZyLgfdwSeWExxp0n5Nhh972TMpy5K4gqOFXcxvPSuTl6tTw77F1u0UQLaVVH
QzHNu+RO/2iQ/P30cOM11IbZ9sfcBOj+5mjfoDoR4qCtoCQfyfHK+HlwXjZ+uk98
xU/FnQbOKPRVxiyCVhrbKFxjW7iL7AIb0nRgxHzGGoIJ6A71Tbwa/5gGakE7WEBz
e7ce8NW2JFucGBFYyiBab6I6fB7lbvmqbNPerYEVoU5YxZkMu+xxyToqBnsyPfHZ
lxgEGREUaY8aZmGDfrD9EYyhhtQU/MwdpN+FY3xXQdUJkvkNaLg=
=0Ca0
-----END PGP SIGNATURE-----
Merge tag 'x86_bugs_retbleed' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull lockdep fix for x86 retbleed from Borislav Petkov:
- Fix lockdep complaint for __static_call_fixup()
* tag 'x86_bugs_retbleed' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/static_call: Serialize __static_call_fixup() properly
__static_call_fixup() invokes __static_call_transform() without holding
text_mutex, which causes lockdep to complain in text_poke_bp().
Adding the proper locking cures that, but as this is either used during
early boot or during module finalizing, it's not required to use
text_poke_bp(). Add an argument to __static_call_transform() which tells
it to use text_poke_early() for it.
Fixes: ee88d363d1 ("x86,static_call: Use alternative RET encoding")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
solved and the nightmare is complete, here's the next one: speculating
after RET instructions and leaking privileged information using the now
pretty much classical covert channels.
It is called RETBleed and the mitigation effort and controlling
functionality has been modelled similar to what already existing
mitigations provide.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLKqAgACgkQEsHwGGHe
VUoM5w/8CSvwPZ3otkhmu8MrJPtWc7eLDPjYN4qQP+19e+bt094MoozxeeWG2wmp
hkDJAYHT2Oik/qDuEdhFgNYwS7XGgbV3Py3B8syO4//5SD5dkOSG+QqFXvXMdFri
YsVqqNkjJOWk/YL9Ql5RS/xQewsrr0OqEyWWocuI6XAvfWV4kKvlRSd+6oPqtZEO
qYlAHTXElyIrA/gjmxChk1HTt5HZtK3uJLf4twNlUfzw7LYFf3+sw3bdNuiXlyMr
WcLXMwGpS0idURwP3mJa7JRuiVBzb4+kt8mWwWqA02FkKV45FRRRFhFUsy667r00
cdZBaWdy+b7dvXeliO3FN/x1bZwIEUxmaNy1iAClph4Ifh0ySPUkxAr8EIER7YBy
bstDJEaIqgYg8NIaD4oF1UrG0ZbL0ImuxVaFdhG1hopQsh4IwLSTLgmZYDhfn/0i
oSqU0Le+A7QW9s2A2j6qi7BoAbRW+gmBuCgg8f8ECYRkFX1ZF6mkUtnQxYrU7RTq
rJWGW9nhwM9nRxwgntZiTjUUJ2HtyXEgYyCNjLFCbEBfeG5QTg7XSGFhqDbgoymH
85vsmSXYxgTgQ/kTW7Fs26tOqnP2h1OtLJZDL8rg49KijLAnISClEgohYW01CWQf
ZKMHtz3DM0WBiLvSAmfGifScgSrLB5AjtvFHT0hF+5/okEkinVk=
=09fW
-----END PGP SIGNATURE-----
Merge tag 'x86_bugs_retbleed' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 retbleed fixes from Borislav Petkov:
"Just when you thought that all the speculation bugs were addressed and
solved and the nightmare is complete, here's the next one: speculating
after RET instructions and leaking privileged information using the
now pretty much classical covert channels.
It is called RETBleed and the mitigation effort and controlling
functionality has been modelled similar to what already existing
mitigations provide"
* tag 'x86_bugs_retbleed' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
x86/speculation: Disable RRSBA behavior
x86/kexec: Disable RET on kexec
x86/bugs: Do not enable IBPB-on-entry when IBPB is not supported
x86/entry: Move PUSH_AND_CLEAR_REGS() back into error_entry
x86/bugs: Add Cannon lake to RETBleed affected CPU list
x86/retbleed: Add fine grained Kconfig knobs
x86/cpu/amd: Enumerate BTC_NO
x86/common: Stamp out the stepping madness
KVM: VMX: Prevent RSB underflow before vmenter
x86/speculation: Fill RSB on vmexit for IBRS
KVM: VMX: Fix IBRS handling after vmexit
KVM: VMX: Prevent guest RSB poisoning attacks with eIBRS
KVM: VMX: Convert launched argument to flags
KVM: VMX: Flatten __vmx_vcpu_run()
objtool: Re-add UNWIND_HINT_{SAVE_RESTORE}
x86/speculation: Remove x86_spec_ctrl_mask
x86/speculation: Use cached host SPEC_CTRL value for guest entry/exit
x86/speculation: Fix SPEC_CTRL write on SMT state change
x86/speculation: Fix firmware entry SPEC_CTRL handling
x86/speculation: Fix RSB filling with CONFIG_RETPOLINE=n
...
Currently, the only way x86 can get an early boot RNG seed is via EFI,
which is generally always used now for physical machines, but is very
rarely used in VMs, especially VMs that are optimized for starting
"instantaneously", such as Firecracker's MicroVM. For tiny fast booting
VMs, EFI is not something you generally need or want.
Rather, the image loader or firmware should be able to pass a single
random seed, exactly as device tree platforms do with the "rng-seed"
property. Additionally, this is something that bootloaders can append,
with their own seed file management, which is something every other
major OS ecosystem has that Linux does not (yet).
Add SETUP_RNG_SEED, similar to the other eight setup_data entries that
are parsed at boot. It also takes care to zero out the seed immediately
after using, in order to retain forward secrecy. This all takes about 7
trivial lines of code.
Then, on kexec_file_load(), a new fresh seed is generated and passed to
the next kernel, just as is done on device tree architectures when
using kexec. And, importantly, I've tested that QEMU is able to properly
pass SETUP_RNG_SEED as well, making this work for every step of the way.
This code too is pretty straight forward.
Together these measures ensure that VMs and nested kexec()'d kernels
always receive a proper boot time RNG seed at the earliest possible
stage from their parents:
- Host [already has strongly initialized RNG]
- QEMU [passes fresh seed in SETUP_RNG_SEED field]
- Linux [uses parent's seed and gathers entropy of its own]
- kexec [passes this in SETUP_RNG_SEED field]
- Linux [uses parent's seed and gathers entropy of its own]
- kexec [passes this in SETUP_RNG_SEED field]
- Linux [uses parent's seed and gathers entropy of its own]
- kexec [passes this in SETUP_RNG_SEED field]
- ...
I've verified in several scenarios that this works quite well from a
host kernel to QEMU and down inwards, mixing and matching loaders, with
every layer providing a seed to the next.
[ bp: Massage commit message. ]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lore.kernel.org/r/20220630113300.1892799-1-Jason@zx2c4.com
failures where the hypervisor verifies page tables and uninitialized
data in that range leads to bogus failures in those checks
- Add any potential setup_data entries supplied at boot to the identity
pagetable mappings to prevent kexec kernel boot failures. Usually, this
is not a problem for the normal kernel as those mappings are part of
the initially mapped 2M pages but if kexec gets to allocate the second
kernel somewhere else, those setup_data entries need to be mapped there
too.
- Fix objtool not to discard text references from the __tracepoints
section so that ENDBR validation still works
- Correct the setup_data types limit as it is user-visible, before 5.19
releases
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLKpf8ACgkQEsHwGGHe
VUrc5w/8DIVLQ8w+Balf2TGfp5Sl3mPkg+eoARH29qtXhvVBs5KJB9sbT1IGnxao
nE4yNeiIKhH5SEd17l11E7eWuUtNgZENLsUb3aiAdsItNS+MzOWQuEOPbnAwgJmk
oKdxiI1SHiVoPy5KVXOcyAS90PSJIkhhxwgR5MInGdmpSUzEFsx5SY82ZfOjOkZU
L7zCsJzeDfhJdWiR4N0MXWRaFbIvRxI1uXyqgv+Lo6JK5l8dyUUSEdWyLUqZ7E4M
GFo6LwR3lskQM2bE9vBWS0h1X00d5oDMzfono8kZzRGA/11plZHRI007PCez8yZh
4sUnnxsfCy2YF8/8hs4IhrHZdcWW9XoN4gTUsjD0wekGTHhOEqu5qpAnVSrXbvvM
ZfPF8vM+DLPTWQqAT0a4aj1vd1RflDIQPSXKDzJDjeF49zouAj1ae/3KSOYJDzN9
V6NGiKBnzj1rbtm0+8jOsTQusmh/oDage7uLlmel3hTfNOc2Ay0LXrJWcvqhj66V
4CtCd12sLeavin+mGptni6lXbsue61EolRtH44RvZJsXLVY8iclM4onl728xOrxj
CBtJo6bd3oQYy0SQsysXGDVR7BSXtwAYfArYR8BrMTtgHxuyULt/BDoew4r7XADB
Xxz7ADJZ3DI3Gqza5H6r89Tj6Oi3yXiBWUVUNXFCMYc6ZrqvZc0=
=tOvF
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.19_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Prepare for and clear .brk early in order to address XenPV guests
failures where the hypervisor verifies page tables and uninitialized
data in that range leads to bogus failures in those checks
- Add any potential setup_data entries supplied at boot to the identity
pagetable mappings to prevent kexec kernel boot failures. Usually,
this is not a problem for the normal kernel as those mappings are
part of the initially mapped 2M pages but if kexec gets to allocate
the second kernel somewhere else, those setup_data entries need to be
mapped there too.
- Fix objtool not to discard text references from the __tracepoints
section so that ENDBR validation still works
- Correct the setup_data types limit as it is user-visible, before 5.19
releases
* tag 'x86_urgent_for_v5.19_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Fix the setup data types max limit
x86/ibt, objtool: Don't discard text references from tracepoint section
x86/compressed/64: Add identity mappings for setup_data entries
x86: Fix .brk attribute in linker script
x86: Clear .brk area at early boot
x86/xen: Use clear_bss() for Xen PV guests
Some Intel processors may use alternate predictors for RETs on
RSB-underflow. This condition may be vulnerable to Branch History
Injection (BHI) and intramode-BTI.
Kernel earlier added spectre_v2 mitigation modes (eIBRS+Retpolines,
eIBRS+LFENCE, Retpolines) which protect indirect CALLs and JMPs against
such attacks. However, on RSB-underflow, RET target prediction may
fallback to alternate predictors. As a result, RET's predicted target
may get influenced by branch history.
A new MSR_IA32_SPEC_CTRL bit (RRSBA_DIS_S) controls this fallback
behavior when in kernel mode. When set, RETs will not take predictions
from alternate predictors, hence mitigating RETs as well. Support for
this is enumerated by CPUID.7.2.EDX[RRSBA_CTRL] (bit2).
For spectre v2 mitigation, when a user selects a mitigation that
protects indirect CALLs and JMPs against BHI and intramode-BTI, set
RRSBA_DIS_S also to protect RETs for RSB-underflow case.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
All the invocations unroll to __x86_return_thunk and this file
must be PIC independent.
This fixes kexec on 64-bit AMD boxes.
[ bp: Fix 32-bit build. ]
Reported-by: Edward Tran <edward.tran@oracle.com>
Reported-by: Awais Tanveer <awais.tanveer@oracle.com>
Suggested-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Storing the 'page_index' value in the sgx_backing struct is
dead code and no longer needed.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220708162124.8442-1-kristen@linux.intel.com
There are some VM configurations which have Skylake model but do not
support IBPB. In those cases, when using retbleed=ibpb, userspace is going
to be killed and kernel is going to panic.
If the CPU does not support IBPB, warn and proceed with the auto option. Also,
do not fallback to IBPB on AMD/Hygon systems if it is not supported.
Fixes: 3ebc170068 ("x86/bugs: Add retbleed=ibpb")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
The page reclaimer ensures availability of EPC pages across all
enclaves. In support of this it runs independently from the
individual enclaves in order to take locks from the different
enclaves as it writes pages to swap.
When needing to load a page from swap an EPC page needs to be
available for its contents to be loaded into. Loading an existing
enclave page from swap does not reclaim EPC pages directly if
none are available, instead the reclaimer is woken when the
available EPC pages are found to be below a watermark.
When iterating over a large number of pages in an oversubscribed
environment there is a race between the reclaimer woken up and
EPC pages reclaimed fast enough for the page operations to proceed.
Ensure there are EPC pages available before attempting to load
a page that may potentially be pulled from swap into an available
EPC page.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/a0d8f037c4a075d56bf79f432438412985f7ff7a.1652137848.git.reinette.chatre@intel.com
The SGX2 page removal flow was introduced in previous patch and is
as follows:
1) Change the type of the pages to be removed to SGX_PAGE_TYPE_TRIM
using the ioctl() SGX_IOC_ENCLAVE_MODIFY_TYPES introduced in
previous patch.
2) Approve the page removal by running ENCLU[EACCEPT] from within
the enclave.
3) Initiate actual page removal using the ioctl()
SGX_IOC_ENCLAVE_REMOVE_PAGES introduced here.
Support the final step of the SGX2 page removal flow with ioctl()
SGX_IOC_ENCLAVE_REMOVE_PAGES. With this ioctl() the user specifies
a page range that should be removed. All pages in the provided
range should have the SGX_PAGE_TYPE_TRIM page type and the request
will fail with EPERM (Operation not permitted) if a page that does
not have the correct type is encountered. Page removal can fail
on any page within the provided range. Support partial success by
returning the number of pages that were successfully removed.
Since actual page removal will succeed even if ENCLU[EACCEPT] was not
run from within the enclave the ENCLU[EMODPR] instruction with RWX
permissions is used as a no-op mechanism to ensure ENCLU[EACCEPT] was
successfully run from within the enclave before the enclave page is
removed.
If the user omits running SGX_IOC_ENCLAVE_REMOVE_PAGES the pages will
still be removed when the enclave is unloaded.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/b75ee93e96774e38bb44a24b8e9bbfb67b08b51b.1652137848.git.reinette.chatre@intel.com
Every enclave contains one or more Thread Control Structures (TCS). The
TCS contains meta-data used by the hardware to save and restore thread
specific information when entering/exiting the enclave. With SGX1 an
enclave needs to be created with enough TCSs to support the largest
number of threads expecting to use the enclave and enough enclave pages
to meet all its anticipated memory demands. In SGX1 all pages remain in
the enclave until the enclave is unloaded.
SGX2 introduces a new function, ENCLS[EMODT], that is used to change
the type of an enclave page from a regular (SGX_PAGE_TYPE_REG) enclave
page to a TCS (SGX_PAGE_TYPE_TCS) page or change the type from a
regular (SGX_PAGE_TYPE_REG) or TCS (SGX_PAGE_TYPE_TCS)
page to a trimmed (SGX_PAGE_TYPE_TRIM) page (setting it up for later
removal).
With the existing support of dynamically adding regular enclave pages
to an initialized enclave and changing the page type to TCS it is
possible to dynamically increase the number of threads supported by an
enclave.
Changing the enclave page type to SGX_PAGE_TYPE_TRIM is the first step
of dynamically removing pages from an initialized enclave. The complete
page removal flow is:
1) Change the type of the pages to be removed to SGX_PAGE_TYPE_TRIM
using the SGX_IOC_ENCLAVE_MODIFY_TYPES ioctl() introduced here.
2) Approve the page removal by running ENCLU[EACCEPT] from within
the enclave.
3) Initiate actual page removal using the ioctl() introduced in the
following patch.
Add ioctl() SGX_IOC_ENCLAVE_MODIFY_TYPES to support changing SGX
enclave page types within an initialized enclave. With
SGX_IOC_ENCLAVE_MODIFY_TYPES the user specifies a page range and the
enclave page type to be applied to all pages in the provided range.
The ioctl() itself can return an error code based on failures
encountered by the kernel. It is also possible for SGX specific
failures to be encountered. Add a result output parameter to
communicate the SGX return code. It is possible for the enclave page
type change request to fail on any page within the provided range.
Support partial success by returning the number of pages that were
successfully changed.
After the page type is changed the page continues to be accessible
from the kernel perspective with page table entries and internal
state. The page may be moved to swap. Any access until ENCLU[EACCEPT]
will encounter a page fault with SGX flag set in error code.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Link: https://lkml.kernel.org/r/babe39318c5bf16fc65fbfb38896cdee72161575.1652137848.git.reinette.chatre@intel.com
Before an enclave is initialized the enclave's memory range is unknown.
The enclave's memory range is learned at the time it is created via the
SGX_IOC_ENCLAVE_CREATE ioctl() where the provided memory range is
obtained from an earlier mmap() of /dev/sgx_enclave. After an enclave
is initialized its memory can be mapped into user space (mmap()) from
where it can be entered at its defined entry points.
With the enclave's memory range known after it is initialized there is
no reason why it should be possible to map memory outside this range.
Lock down access to the initialized enclave's memory range by denying
any attempt to map memory outside its memory range.
Locking down the memory range also makes adding pages to an initialized
enclave more efficient. Pages are added to an initialized enclave by
accessing memory that belongs to the enclave's memory range but not yet
backed by an enclave page. If it is possible for user space to map
memory that does not form part of the enclave then an access to this
memory would eventually fail. Failures range from a prompt general
protection fault if the access was an ENCLU[EACCEPT] from within the
enclave, or a page fault via the vDSO if it was another access from
within the enclave, or a SIGBUS (also resulting from a page fault) if
the access was from outside the enclave.
Disallowing invalid memory to be mapped in the first place avoids
preventable failures.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/6391460d75ae79cea2e81eef0f6ffc03c6e9cfe7.1652137848.git.reinette.chatre@intel.com
With SGX1 an enclave needs to be created with its maximum memory demands
allocated. Pages cannot be added to an enclave after it is initialized.
SGX2 introduces a new function, ENCLS[EAUG], that can be used to add
pages to an initialized enclave. With SGX2 the enclave still needs to
set aside address space for its maximum memory demands during enclave
creation, but all pages need not be added before enclave initialization.
Pages can be added during enclave runtime.
Add support for dynamically adding pages to an initialized enclave,
architecturally limited to RW permission at creation but allowed to
obtain RWX permissions after trusted enclave runs EMODPE. Add pages
via the page fault handler at the time an enclave address without a
backing enclave page is accessed, potentially directly reclaiming
pages if no free pages are available.
The enclave is still required to run ENCLU[EACCEPT] on the page before
it can be used. A useful flow is for the enclave to run ENCLU[EACCEPT]
on an uninitialized address. This will trigger the page fault handler
that will add the enclave page and return execution to the enclave to
repeat the ENCLU[EACCEPT] instruction, this time successful.
If the enclave accesses an uninitialized address in another way, for
example by expanding the enclave stack to a page that has not yet been
added, then the page fault handler would add the page on the first
write but upon returning to the enclave the instruction that triggered
the page fault would be repeated and since ENCLU[EACCEPT] was not run
yet it would trigger a second page fault, this time with the SGX flag
set in the page fault error code. This can only be recovered by entering
the enclave again and directly running the ENCLU[EACCEPT] instruction on
the now initialized address.
Accessing an uninitialized address from outside the enclave also
triggers this flow but the page will remain inaccessible (access will
result in #PF) until accepted from within the enclave via
ENCLU[EACCEPT].
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Link: https://lkml.kernel.org/r/a254a58eabea053803277449b24b6e4963a3883b.1652137848.git.reinette.chatre@intel.com
In the initial (SGX1) version of SGX, pages in an enclave need to be
created with permissions that support all usages of the pages, from the
time the enclave is initialized until it is unloaded. For example,
pages used by a JIT compiler or when code needs to otherwise be
relocated need to always have RWX permissions.
SGX2 includes a new function ENCLS[EMODPR] that is run from the kernel
and can be used to restrict the EPCM permissions of regular enclave
pages within an initialized enclave.
Introduce ioctl() SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS to support
restricting EPCM permissions. With this ioctl() the user specifies
a page range and the EPCM permissions to be applied to all pages in
the provided range. ENCLS[EMODPR] is run to restrict the EPCM
permissions followed by the ENCLS[ETRACK] flow that will ensure
no cached linear-to-physical address mappings to the changed
pages remain.
It is possible for the permission change request to fail on any
page within the provided range, either with an error encountered
by the kernel or by the SGX hardware while running
ENCLS[EMODPR]. To support partial success the ioctl() returns an
error code based on failures encountered by the kernel as well
as two result output parameters: one for the number of pages
that were successfully changed and one for the SGX return code.
The page table entry permissions are not impacted by the EPCM
permission changes. VMAs and PTEs will continue to allow the
maximum vetted permissions determined at the time the pages
are added to the enclave. The SGX error code in a page fault
will indicate if it was an EPCM permission check that prevented
an access attempt.
No checking is done to ensure that the permissions are actually
being restricted. This is because the enclave may have relaxed
the EPCM permissions from within the enclave without the kernel
knowing. An attempt to relax permissions using this call will
be ignored by the hardware.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Link: https://lkml.kernel.org/r/082cee986f3c1a2f4fdbf49501d7a8c5a98446f8.1652137848.git.reinette.chatre@intel.com
struct sgx_encl should be protected with the mutex
sgx_encl->lock. One exception is sgx_encl->page_cnt that
is incremented (in sgx_encl_grow()) when an enclave page
is added to the enclave. The reason the mutex is not held
is to allow the reclaimer to be called directly if there are
no EPC pages (in support of a new VA page) available at the time.
Incrementing sgx_encl->page_cnt without sgc_encl->lock held
is currently (before SGX2) safe from concurrent updates because
all paths in which sgx_encl_grow() is called occur before
enclave initialization and are protected with an atomic
operation on SGX_ENCL_IOCTL.
SGX2 includes support for dynamically adding pages after
enclave initialization where the protection of SGX_ENCL_IOCTL
is not available.
Make direct reclaim of EPC pages optional when new VA pages
are added to the enclave. Essentially the existing "reclaim"
flag used when regular EPC pages are added to an enclave
becomes available to the caller when used to allocate VA pages
instead of always being "true".
When adding pages without invoking the reclaimer it is possible
to do so with sgx_encl->lock held, gaining its protection against
concurrent updates to sgx_encl->page_cnt after enclave
initialization.
No functional change.
Reported-by: Haitao Huang <haitao.huang@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/42c5934c229982ee67982bb97c6ab34bde758620.1652137848.git.reinette.chatre@intel.com
Move sgx_encl_page_alloc() to encl.c and export it so that it can be
used in the implementation for support of adding pages to initialized
enclaves, which requires to allocate new enclave pages.
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/57ae71b4ea17998467670232e12d6617b95c6811.1652137848.git.reinette.chatre@intel.com
In order to use sgx_encl_{grow,shrink}() in the page augmentation code
located in encl.c, export these functions.
Suggested-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/d51730acf54b6565710b2261b3099517b38c2ec4.1652137848.git.reinette.chatre@intel.com
SGX2 functions are not allowed on all page types. For example,
ENCLS[EMODPR] is only allowed on regular SGX enclave pages and
ENCLS[EMODPT] is only allowed on TCS and regular pages. If these
functions are attempted on another type of page the hardware would
trigger a fault.
Keep a record of the SGX page type so that there is more
certainty whether an SGX2 instruction can succeed and faults
can be treated as real failures.
The page type is a property of struct sgx_encl_page
and thus does not cover the VA page type. VA pages are maintained
in separate structures and their type can be determined in
a different way. The SGX2 instructions needing the page type do not
operate on VA pages and this is thus not a scenario needing to
be covered at this time.
struct sgx_encl_page hosting this information is maintained for each
enclave page so the space consumed by the struct is important.
The existing sgx_encl_page->vm_max_prot_bits is already unsigned long
while only using three bits. Transition to a bitfield for the two
members to support the additional information without increasing
the space consumed by the struct.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/a0a6939eefe7ba26514f6c49723521cde372de64.1652137848.git.reinette.chatre@intel.com
User provided offset and length is validated when parsing the parameters
of the SGX_IOC_ENCLAVE_ADD_PAGES ioctl(). Extract this validation
(with consistent use of IS_ALIGNED) into a utility that can be used
by the SGX2 ioctl()s that will also provide these values.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/767147bc100047abed47fe27c592901adfbb93a2.1652137848.git.reinette.chatre@intel.com
The ETRACK function followed by an IPI to all CPUs within an enclave
is a common pattern with more frequent use in support of SGX2.
Make the (empty) IPI callback function available internally in
preparation for usage by SGX2.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/1179ed4a9c3c1c2abf49d51bfcf2c30b493181cc.1652137848.git.reinette.chatre@intel.com
The SGX reclaimer removes page table entries pointing to pages that are
moved to swap.
SGX2 enables changes to pages belonging to an initialized enclave, thus
enclave pages may have their permission or type changed while the page
is being accessed by an enclave. Supporting SGX2 requires page table
entries to be removed so that any cached mappings to changed pages
are removed. For example, with the ability to change enclave page types
a regular enclave page may be changed to a Thread Control Structure
(TCS) page that may not be accessed by an enclave.
Factor out the code removing page table entries to a separate function
sgx_zap_enclave_ptes(), fixing accuracy of comments in the process,
and make it available to the upcoming SGX2 code.
Place sgx_zap_enclave_ptes() with the rest of the enclave code in
encl.c interacting with the page table since this code is no longer
unique to the reclaimer.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/b010cdf01d7ce55dd0f00e883b7ccbd9db57160a.1652137848.git.reinette.chatre@intel.com
sgx_encl_ewb_cpumask() is no longer unique to the reclaimer where it
is used during the EWB ENCLS leaf function when EPC pages are written
out to main memory and sgx_encl_ewb_cpumask() is used to learn which
CPUs might have executed the enclave to ensure that TLBs are cleared.
Upcoming SGX2 enabling will use sgx_encl_ewb_cpumask() during the
EMODPR and EMODT ENCLS leaf functions that make changes to enclave
pages. The function is needed for the same reason it is used now: to
learn which CPUs might have executed the enclave to ensure that TLBs
no longer point to the changed pages.
Rename sgx_encl_ewb_cpumask() to sgx_encl_cpumask() to reflect the
broader usage.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/d4d08c449450a13d8dd3bb6c2b1af03895586d4f.1652137848.git.reinette.chatre@intel.com
Using sgx_encl_ewb_cpumask() to learn which CPUs might have executed
an enclave is useful to ensure that TLBs are cleared when changes are
made to enclave pages.
sgx_encl_ewb_cpumask() is used within the reclaimer when an enclave
page is evicted. The upcoming SGX2 support enables changes to be
made to enclave pages and will require TLBs to not refer to the
changed pages and thus will be needing sgx_encl_ewb_cpumask().
Relocate sgx_encl_ewb_cpumask() to be with the rest of the enclave
code in encl.c now that it is no longer unique to the reclaimer.
Take care to ensure that any future usage maintains the
current context requirement that ETRACK has been called first.
Expand the existing comments to highlight this while moving them
to a more prominent location before the function.
No functional change.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/05b60747fd45130cf9fc6edb1c373a69a18a22c5.1652137848.git.reinette.chatre@intel.com
sgx_encl_load_page() is used to find and load an enclave page into
enclave (EPC) memory, potentially loading it from the backing storage.
Both usages of sgx_encl_load_page() are during an access to the
enclave page from a VMA and thus the permissions of the VMA are
considered before the enclave page is loaded.
SGX2 functions operating on enclave pages belonging to an initialized
enclave requiring the page to be in EPC. It is thus required to
support loading enclave pages into the EPC independent from a VMA.
Split the current sgx_encl_load_page() to support the two usages:
A new call, sgx_encl_load_page_in_vma(), behaves exactly like the
current sgx_encl_load_page() that takes VMA permissions into account,
while sgx_encl_load_page() just loads an enclave page into EPC.
VMA, PTE, and EPCM permissions continue to dictate whether
the pages can be accessed from within an enclave.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/d4393513c1f18987c14a490bcf133bfb71a5dc43.1652137848.git.reinette.chatre@intel.com
Add a wrapper for the EAUG ENCLS leaf function used to
add a page to an initialized enclave.
EAUG:
1) Stores all properties of the new enclave page in the SGX
hardware's Enclave Page Cache Map (EPCM).
2) Sets the PENDING bit in the EPCM entry of the enclave page.
This bit is cleared by the enclave by invoking ENCLU leaf
function EACCEPT or EACCEPTCOPY.
Access from within the enclave to the new enclave page is not
possible until the PENDING bit is cleared.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/97a46754fe4764e908651df63694fb760f783d6e.1652137848.git.reinette.chatre@intel.com
Add a wrapper for the EMODT ENCLS leaf function used to
change the type of an enclave page as maintained in the
SGX hardware's Enclave Page Cache Map (EPCM).
EMODT:
1) Updates the EPCM page type of the enclave page.
2) Sets the MODIFIED bit in the EPCM entry of the enclave page.
This bit is reset by the enclave by invoking ENCLU leaf
function EACCEPT or EACCEPTCOPY.
Access from within the enclave to the enclave page is not possible
while the MODIFIED bit is set.
After changing the enclave page type by issuing EMODT the kernel
needs to collaborate with the hardware to ensure that no logical
processor continues to hold a reference to the changed page. This
is required to ensure no required security checks are circumvented
and is required for the enclave's EACCEPT/EACCEPTCOPY to succeed.
Ensuring that no references to the changed page remain is
accomplished with the ETRACK flow.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/dba63a8c0db1d510b940beee1ba2a8207efeb1f1.1652137848.git.reinette.chatre@intel.com
Add a wrapper for the EMODPR ENCLS leaf function used to
restrict enclave page permissions as maintained in the
SGX hardware's Enclave Page Cache Map (EPCM).
EMODPR:
1) Updates the EPCM permissions of an enclave page by treating
the new permissions as a mask. Supplying a value that attempts
to relax EPCM permissions has no effect on EPCM permissions
(PR bit, see below, is changed).
2) Sets the PR bit in the EPCM entry of the enclave page to
indicate that permission restriction is in progress. The bit
is reset by the enclave by invoking ENCLU leaf function
EACCEPT or EACCEPTCOPY.
The enclave may access the page throughout the entire process
if conforming to the EPCM permissions for the enclave page.
After performing the permission restriction by issuing EMODPR
the kernel needs to collaborate with the hardware to ensure that
all logical processors sees the new restricted permissions. This
is required for the enclave's EACCEPT/EACCEPTCOPY to succeed and
is accomplished with the ETRACK flow.
Expand enum sgx_return_code with the possible EMODPR return
values.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/d15e7a769e13e4ca671fa2d0a0d3e3aec5aedbd4.1652137848.git.reinette.chatre@intel.com
The SGX ENCLS instruction uses EAX to specify an SGX function and
may require additional registers, depending on the SGX function.
ENCLS invokes the specified privileged SGX function for managing
and debugging enclaves. Macros are used to wrap the ENCLS
functionality and several wrappers are used to wrap the macros to
make the different SGX functions accessible in the code.
The wrappers of the supported SGX functions are cryptic. Add short
descriptions of each as a comment.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/5e78a1126711cbd692d5b8132e0683873398f69e.1652137848.git.reinette.chatre@intel.com
Cannon lake is also affected by RETBleed, add it to the list.
Fixes: 6ad0ad2bf8 ("x86/bugs: Report Intel retbleed vulnerability")
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
commit 72f2ecb7ec ("ACPI: bus: Set CPPC _OSC bits for all and
when CPPC_LIB is supported") added support for claiming to
support CPPC in _OSC on non-Intel platforms.
This unfortunately caused a regression on a vartiety of AMD
platforms in the field because a number of AMD platforms don't set
the `_OSC` bit 5 or 6 to indicate CPPC or CPPC v2 support.
As these AMD platforms already claim CPPC support via a dedicated
MSR from `X86_FEATURE_CPPC`, use this enable this feature rather
than requiring the `_OSC` on platforms with a dedicated MSR.
If there is additional breakage on the shared memory designs also
missing this _OSC, additional follow up changes may be needed.
Fixes: 72f2ecb7ec ("Set CPPC _OSC bits for all and when CPPC_LIB is supported")
Reported-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On kexec file load, the Integrity Measurement Architecture (IMA)
subsystem may verify the IMA signature of the kernel and initramfs, and
measure it. The command line parameters passed to the kernel in the
kexec call may also be measured by IMA.
A remote attestation service can verify a TPM quote based on the TPM
event log, the IMA measurement list and the TPM PCR data. This can
be achieved only if the IMA measurement log is carried over from the
current kernel to the next kernel across the kexec call.
PowerPC and ARM64 both achieve this using device tree with a
"linux,ima-kexec-buffer" node. x86 platforms generally don't make use of
device tree, so use the setup_data mechanism to pass the IMA buffer to
the new kernel.
Signed-off-by: Jonathan McDowell <noodles@fb.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> # IMA function definitions
Link: https://lore.kernel.org/r/YmKyvlF3my1yWTvK@noodles-fedora-PC23Y6EG
Commit in Fixes added the "NOLOAD" attribute to the .brk section as a
"failsafe" measure.
Unfortunately, this leads to the linker no longer covering the .brk
section in a program header, resulting in the kernel loader not knowing
that the memory for the .brk section must be reserved.
This has led to crashes when loading the kernel as PV dom0 under Xen,
but other scenarios could be hit by the same problem (e.g. in case an
uncompressed kernel is used and the initrd is placed directly behind
it).
So drop the "NOLOAD" attribute. This has been verified to correctly
cover the .brk section by a program header of the resulting ELF file.
Fixes: e32683c6f7 ("x86/mm: Fix RESERVE_BRK() for older binutils")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20220630071441.28576-4-jgross@suse.com
The .brk section has the same properties as .bss: it is an alloc-only
section and should be cleared before being used.
Not doing so is especially a problem for Xen PV guests, as the
hypervisor will validate page tables (check for writable page tables
and hypervisor private bits) before accepting them to be used.
Make sure .brk is initially zero by letting clear_bss() clear the brk
area, too.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220630071441.28576-3-jgross@suse.com
Instead of clearing the bss area in assembly code, use the clear_bss()
function.
This requires to pass the start_info address as parameter to
xen_start_kernel() in order to avoid the xen_start_info being zeroed
again.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20220630071441.28576-2-jgross@suse.com
Do fine-grained Kconfig for all the various retbleed parts.
NOTE: if your compiler doesn't support return thunks this will
silently 'upgrade' your mitigation to IBPB, you might not like this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
The platform can sometimes - depending on its settings - cause writes
to MCA_STATUS MSRs to get ignored, regardless of HWCR[McStatusWrEn]'s
value.
For further info see
PPR for AMD Family 19h, Model 01h, Revision B1 Processors, doc ID 55898
at https://bugzilla.kernel.org/show_bug.cgi?id=206537.
Therefore, probe for ignored writes to MCA_STATUS to determine if hardware
error injection is at all possible.
[ bp: Heavily massage commit message and patch. ]
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220214233640.70510-2-Smita.KoralahalliChannabasappa@amd.com
BTC_NO indicates that hardware is not susceptible to Branch Type Confusion.
Zen3 CPUs don't suffer BTC.
Hypervisors are expected to synthesise BTC_NO when it is appropriate
given the migration pool, to prevent kernels using heuristics.
[ bp: Massage. ]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
The whole MMIO/RETBLEED enumeration went overboard on steppings. Get
rid of all that and simply use ANY.
If a future stepping of these models would not be affected, it had
better set the relevant ARCH_CAP_$FOO_NO bit in
IA32_ARCH_CAPABILITIES.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
On VMX, there are some balanced returns between the time the guest's
SPEC_CTRL value is written, and the vmenter.
Balanced returns (matched by a preceding call) are usually ok, but it's
at least theoretically possible an NMI with a deep call stack could
empty the RSB before one of the returns.
For maximum paranoia, don't allow *any* returns (balanced or otherwise)
between the SPEC_CTRL write and the vmenter.
[ bp: Fix 32-bit build. ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Prevent RSB underflow/poisoning attacks with RSB. While at it, add a
bunch of comments to attempt to document the current state of tribal
knowledge about RSB attacks and what exactly is being mitigated.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
On eIBRS systems, the returns in the vmexit return path from
__vmx_vcpu_run() to vmx_vcpu_run() are exposed to RSB poisoning attacks.
Fix that by moving the post-vmexit spec_ctrl handling to immediately
after the vmexit.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
This mask has been made redundant by kvm_spec_ctrl_test_value(). And it
doesn't even work when MSR interception is disabled, as the guest can
just write to SPEC_CTRL directly.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
There's no need to recalculate the host value for every entry/exit.
Just use the cached value in spec_ctrl_current().
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
If the SMT state changes, SSBD might get accidentally disabled. Fix
that.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Zen2 uarchs have an undocumented, unnamed, MSR that contains a chicken
bit for some speculation behaviour. It needs setting.
Note: very belatedly AMD released naming; it's now officially called
MSR_AMD64_DE_CFG2 and MSR_AMD64_DE_CFG2_SUPPRESS_NOBR_PRED_BIT
but shall remain the SPECTRAL CHICKEN.
Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Since entry asm is tricky, add a validation pass that ensures the
retbleed mitigation has been done before the first actual RET
instruction.
Entry points are those that either have UNWIND_HINT_ENTRY, which acts
as UNWIND_HINT_EMPTY but marks the instruction as an entry point, or
those that have UWIND_HINT_IRET_REGS at +0.
This is basically a variant of validate_branch() that is
intra-function and it will simply follow all branches from marked
entry points and ensures that all paths lead to ANNOTATE_UNRET_END.
If a path hits RET or an indirection the path is a fail and will be
reported.
There are 3 ANNOTATE_UNRET_END instances:
- UNTRAIN_RET itself
- exception from-kernel; this path doesn't need UNTRAIN_RET
- all early exceptions; these also don't need UNTRAIN_RET
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
When booting with retbleed=auto, if the kernel wasn't built with
CONFIG_CC_HAS_RETURN_THUNK, the mitigation falls back to IBPB. Make
sure a warning is printed in that case. The IBPB fallback check is done
twice, but it really only needs to be done once.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
jmp2ret mitigates the easy-to-attack case at relatively low overhead.
It mitigates the long speculation windows after a mispredicted RET, but
it does not mitigate the short speculation window from arbitrary
instruction boundaries.
On Zen2, there is a chicken bit which needs setting, which mitigates
"arbitrary instruction boundaries" down to just "basic block boundaries".
But there is no fix for the short speculation window on basic block
boundaries, other than to flush the entire BTB to evict all attacker
predictions.
On the spectrum of "fast & blurry" -> "safe", there is (on top of STIBP
or no-SMT):
1) Nothing System wide open
2) jmp2ret May stop a script kiddy
3) jmp2ret+chickenbit Raises the bar rather further
4) IBPB Only thing which can count as "safe".
Tentative numbers put IBPB-on-entry at a 2.5x hit on Zen2, and a 10x hit
on Zen1 according to lmbench.
[ bp: Fixup feature bit comments, document option, 32-bit build fix. ]
Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Having IBRS enabled while the SMT sibling is idle unnecessarily slows
down the running sibling. OTOH, disabling IBRS around idle takes two
MSR writes, which will increase the idle latency.
Therefore, only disable IBRS around deeper idle states. Shallow idle
states are bounded by the tick in duration, since NOHZ is not allowed
for them by virtue of their short target residency.
Only do this for mwait-driven idle, since that keeps interrupts disabled
across idle, which makes disabling IBRS vs IRQ-entry a non-issue.
Note: C6 is a random threshold, most importantly C1 probably shouldn't
disable IBRS, benchmarking needed.
Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
retbleed will depend on spectre_v2, while spectre_v2_user depends on
retbleed. Break this cycle.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
When changing SPEC_CTRL for user control, the WRMSR can be delayed
until return-to-user when KERNEL_IBRS has been enabled.
This avoids an MSR write during context switch.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Due to TIF_SSBD and TIF_SPEC_IB the actual IA32_SPEC_CTRL value can
differ from x86_spec_ctrl_base. As such, keep a per-CPU value
reflecting the current task's MSR content.
[jpoimboe: rename]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
For untrained return thunks to be fully effective, STIBP must be enabled
or SMT disabled.
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Add the "retbleed=<value>" boot parameter to select a mitigation for
RETBleed. Possible values are "off", "auto" and "unret"
(JMP2RET mitigation). The default value is "auto".
Currently, "retbleed=auto" will select the unret mitigation on
AMD and Hygon and no mitigation on Intel (JMP2RET is not effective on
Intel).
[peterz: rebase; add hygon]
[jpoimboe: cleanups]
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Note: needs to be in a section distinct from Retpolines such that the
Retpoline RET substitution cannot possibly use immediate jumps.
ORC unwinding for zen_untrain_ret() and __x86_return_thunk() is a
little tricky but works due to the fact that zen_untrain_ret() doesn't
have any stack ops and as such will emit a single ORC entry at the
start (+0x3f).
Meanwhile, unwinding an IP, including the __x86_return_thunk() one
(+0x40) will search for the largest ORC entry smaller or equal to the
IP, these will find the one ORC entry (+0x3f) and all works.
[ Alexandre: SVM part. ]
[ bp: Build fix, massages. ]
Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
In addition to teaching static_call about the new way to spell 'RET',
there is an added complication in that static_call() is allowed to
rewrite text before it is known which particular spelling is required.
In order to deal with this; have a static_call specific fixup in the
apply_return() 'alternative' patching routine that will rewrite the
static_call trampoline to match the definite sequence.
This in turn creates the problem of uniquely identifying static call
trampolines. Currently trampolines are 8 bytes, the first 5 being the
jmp.d32/ret sequence and the final 3 a byte sequence that spells out
'SCT'.
This sequence is used in __static_call_validate() to ensure it is
patching a trampoline and not a random other jmp.d32. That is,
false-positives shouldn't be plenty, but aren't a big concern.
OTOH the new __static_call_fixup() must not have false-positives, and
'SCT' decodes to the somewhat weird but semi plausible sequence:
push %rbx
rex.XB push %r12
Additionally, there are SLS concerns with immediate jumps. Combined it
seems like a good moment to change the signature to a single 3 byte
trap instruction that is unique to this usage and will not ever get
generated by accident.
As such, change the signature to: '0x0f, 0xb9, 0xcc', which decodes
to:
ud1 %esp, %ecx
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Instead of defaulting to patching NOP opcodes at init time, and leaving
it to the architectures to override this if this is not needed, switch
to a model where doing nothing is the default. This is the common case
by far, as only MIPS requires NOP patching at init time. On all other
architectures, the correct encodings are emitted by the compiler and so
no initial patching is needed.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220615154142.1574619-4-ardb@kernel.org
MIPS is the only remaining architecture that needs to patch jump label
NOP encodings to initialize them at load time. So let's move the module
patching part of that from generic code into arch/mips, and drop it from
the others.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220615154142.1574619-3-ardb@kernel.org
VMWARE_CMD_VCPU_RESERVED is bit 31 and that would mean undefined
behavior when shifting an int but the kernel is built with
-fno-strict-overflow which will wrap around using two's complement.
Use the BIT() macro to improve readability and avoid any potential
overflow confusion because it uses an unsigned long.
[ bp: Clarify commit message. ]
Signed-off-by: Shreenidhi Shedi <sshedi@vmware.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Link: https://lore.kernel.org/r/20220601101820.535031-1-sshedi@vmware.com
Make sure to free the platform device in the unlikely event that
registration fails.
Fixes: 7a67832c7e ("libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220620140723.9810-1-johan@kernel.org
kfree can handle NULL pointer as its argument.
According to coccinelle isnullfree check, remove NULL check
before kfree operation.
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Message-Id: <20220614133458.147314-1-dzm91@hust.edu.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Make RESERVE_BRK() work again with older binutils. The recent
'simplification' broke that.
- Make early #VE handling increment RIP when successful.
- Make the #VE code consistent vs. the RIP adjustments and add comments.
- Handle load_unaligned_zeropad() across page boundaries correctly in #VE
when the second page is shared.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKvIG4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaqCD/9NAUyHTjKDqdWuMD/ITU8ymDr+Ix8z
vUlysdXbxJg6MvT12ZbhJUFTKAsXskGAAnXz/EtZ8zTQQVzTjis/HooJh4XLeuO4
NLh9KV9FvH7w69e6Jg31MGkOUJU3BV+WYUx1f34zbQ8FHftxUwu+M47UYExPYKDR
VIbNeQIpqoBfjTSPVGXlWl/panuZG6RV+PRcvxV3yeRRA8zyCB/WTmNkoDjbw4fl
YCWwJF7/m4iT3LtoaFXWVGFzSRZoGHbhSdgEOZGIZ7sjvydoaQo402JuhW3WLI2m
oXLVZ+2wOPGBKp3WQ1t3mpfScBvCiN3SW4pSPDQ+E8fT/RQiRMb29c9S6ANdm3nT
27fYMJOq+xxex5gOYzdgLz7O99M08uOn2bxJwB+IBIr5jEFH9b4EffeEWsfdZBsi
1AzkXCi+Ib0ZYAndxUP068m+4iW0LtuApm0fg6LhtdDmBGquj+88OZOUK7Z/kW/N
IkjgCeqFgmdNb/+Z3XrdYobaAl6J4toIqA4A+O8yL6gJfn9PnaMGsYtA8c5yQchD
kFoTu5pCALY2KjZkKFRMuEbMH2oj3sjjb7f6mYAHxec6jikIx2c5HswA4sLmzHAN
GG2MDUH12bWoLfeA4IRwTRz/vh8IeZNq5ZzdCnS6KHUNk5OJRGLtRphKy8z+pOYx
+i9ThZFBV8pBzg==
=sRtG
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- Make RESERVE_BRK() work again with older binutils. The recent
'simplification' broke that.
- Make early #VE handling increment RIP when successful.
- Make the #VE code consistent vs. the RIP adjustments and add
comments.
- Handle load_unaligned_zeropad() across page boundaries correctly in
#VE when the second page is shared.
* tag 'x86-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
x86/tdx: Clarify RIP adjustments in #VE handler
x86/tdx: Fix early #VE handling
x86/mm: Fix RESERVE_BRK() for older binutils
- Remove obsolete CONFIG_X86_SMAP reference from objtool
- Fix overlapping text section failures in faddr2line for real
- Remove OBJECT_FILES_NON_STANDARD usage from x86 ftrace and replace it
with finegrained annotations so objtool can validate that code
correctly.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKvHbwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQlsEADC7BotE1Mi8HITScIlHT19bkZ7Bm6M
g46RCtUw0u+lo7BclIetNYGWQ0z+D+1DbmZTBv5D22N/Wd6MrwOuuwRVtJ2dEosF
l/EK1qKLxCGMdQ1x0KOmrjzHB4WmcUH0pQwDHa5N5s5IyrJRRFHtoLk3EYp6QIhi
mIZTJgGUEe/aS7BtFM5xARprm7JxdbhXgXVbnC1ctzpOi2y6O7F4Ek3x4j5zTGV/
49iuO2ljqmKdoFlPa1fhuw+O9sTUf+RmKLPEU+2nm7Wg+wlIwKIB3yDDBoZ9Q7lm
YdorW0/506x4qOwa8KAjxXEb2qgZkL/fQcMS0jQQKNNDD8GHU+3YdjCNbrmZjA+T
Ib0rs9YsxgLstnmqzU3QfarhgtdzIMhzF0cofoFDK0a8tHLqOhZF4j7wLmYo52ec
6Hrt8IvJPW2b3TC6KfRAF+uWsDDoaum0OOFVjVu1b9G6fFf0i8h5JGWdkJh8cau9
0qvAYg0sDjRf7zAZ6kbbGDVCljECyypIdCBF3ihA6yMhJX16VihEWTJyqIaoO+a0
0XzPvkhX+6mY0kqn+5HvvhLEY4POA8pFKeTPx4eABvKnw7m01I9cJfHa7HEBQHe6
ZJA7qWMBRrPfK2Ogrx8zohj2sOdHMHlIQU+6Ai+7+BHs13Nje4umGMu2YuZrnfIF
gQ1ayCN8OsUwVA==
=12KJ
-----END PGP SIGNATURE-----
Merge tag 'objtool-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull build tooling updates from Thomas Gleixner:
- Remove obsolete CONFIG_X86_SMAP reference from objtool
- Fix overlapping text section failures in faddr2line for real
- Remove OBJECT_FILES_NON_STANDARD usage from x86 ftrace and replace it
with finegrained annotations so objtool can validate that code
correctly.
* tag 'objtool-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ftrace: Remove OBJECT_FILES_NON_STANDARD usage
faddr2line: Fix overlapping text section failures, the sequel
objtool: Fix obsolete reference to CONFIG_X86_SMAP
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmKs17UUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vzlVQ//Qygje8u1x0qQqGFykMf3+cev2ECV
/sfOHtRoG9qZ4gKQOGShr9gyeV+9VFl0+CEJE4z9/YiAnoOr5JMAMixYRjHQU9MB
k3TFkDqD0KBgmXdAubVP5HQGZgA+mryvtEhr5rm45PooJJSjuh1ds87YYO+Z/t7s
c2AzvpHFLjECB6LRHqHieyp4CeWn5tw8im7uMUmfKkkXF5ckqw3e+7gwJzRrAukg
GgbLb2JZLzXSl1HOx/2GvPW8RzyXRbbJmpvu9LsNKeoqP006F3chDVIncqG3Q9QQ
LvrjudC/eY/2Fee3tpP6gjl9A5ALvXT/k4gTw4Lwm8OlaxrYZ3gwcBHZepp3F08s
qVCncjFooeHAMiJDkGvtf6N8k8VnOy4zvg2qDpKOl2NO3jH95nGi0LGMf0/GXvfh
P4bwqjGcWKSo9C2amagZ49rzaJBIQRms8ItM6WPvCYirjyYi3PeUMl/dylstbnuq
DQuprZtgOEGlPGUg/CO4fCpCIkKzpvOV6157z39mZS5HOT5ugvF4k+hwSxZd0hsM
rJI7Te48Z46Y/qoFesgOglJwwEZs4RqHAvMTGC+V5Ftj9gHe7wo2oOxlhCyFW1Qd
jec3UhXEzTVxnjsu7peHyfwwmGWFUkG16P17hWpncpG6azW1dybQOCsRy21/F+q2
i1nd61lXpuhKQKQ=
=5y1d
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.19-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci fix from Bjorn Helgaas:
"Revert clipping of PCI host bridge windows to avoid E820 regions,
which broke several machines by forcing unnecessary BAR reassignments
(Hans de Goede)"
* tag 'pci-v5.19-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
x86/PCI: Revert "x86/PCI: Clip only host bridge windows for E820 regions"
This reverts commit 4c5e242d3e.
Prior to 4c5e242d3e ("x86/PCI: Clip only host bridge windows for E820
regions"), E820 regions did not affect PCI host bridge windows. We only
looked at E820 regions and avoided them when allocating new MMIO space.
If firmware PCI bridge window and BAR assignments used E820 regions, we
left them alone.
After 4c5e242d3e, we removed E820 regions from the PCI host bridge
windows before looking at BARs, so firmware assignments in E820 regions
looked like errors, and we moved things around to fit in the space left
(if any) after removing the E820 regions. This unnecessary BAR
reassignment broke several machines.
Guilherme reported that Steam Deck fails to boot after 4c5e242d3e. We
clipped the window that contained most 32-bit BARs:
BIOS-e820: [mem 0x00000000a0000000-0x00000000a00fffff] reserved
acpi PNP0A08:00: clipped [mem 0x80000000-0xf7ffffff window] to [mem 0xa0100000-0xf7ffffff window] for e820 entry [mem 0xa0000000-0xa00fffff]
which forced us to reassign all those BARs, for example, this NVMe BAR:
pci 0000:00:01.2: PCI bridge to [bus 01]
pci 0000:00:01.2: bridge window [mem 0x80600000-0x806fffff]
pci 0000:01:00.0: BAR 0: [mem 0x80600000-0x80603fff 64bit]
pci 0000:00:01.2: can't claim window [mem 0x80600000-0x806fffff]: no compatible bridge window
pci 0000:01:00.0: can't claim BAR 0 [mem 0x80600000-0x80603fff 64bit]: no compatible bridge window
pci 0000:00:01.2: bridge window: assigned [mem 0xa0100000-0xa01fffff]
pci 0000:01:00.0: BAR 0: assigned [mem 0xa0100000-0xa0103fff 64bit]
All the reassignments were successful, so the devices should have been
functional at the new addresses, but some were not.
Andy reported a similar failure on an Intel MID platform. Benjamin
reported a similar failure on a VMWare Fusion VM.
Note: this is not a clean revert; this revert keeps the later change to
make the clipping dependent on a new pci_use_e820 bool, moving the checking
of this bool to arch_remove_reservations().
[bhelgaas: commit log, add more reporters and testers]
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216109
Reported-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Reported-by: Jongman Heo <jongman.heo@gmail.com>
Fixes: 4c5e242d3e ("x86/PCI: Clip only host bridge windows for E820 regions")
Link: https://lore.kernel.org/r/20220612144325.85366-1-hdegoede@redhat.com
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Stale Data.
They are a class of MMIO-related weaknesses which can expose stale data
by propagating it into core fill buffers. Data which can then be leaked
using the usual speculative execution methods.
Mitigations include this set along with microcode updates and are
similar to MDS and TAA vulnerabilities: VERW now clears those buffers
too.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKXMkMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWGPD/idalLIhhV5F2+hZIKm0WSnsBxAOh9K
7y8xBxpQQ5FUfW3vm7Pg3ro6VJp7w2CzKoD4lGXzGHriusn3qst3vkza9Ay8xu8g
RDwKe6hI+p+Il9BV9op3f8FiRLP9bcPMMReW/mRyYsOnJe59hVNwRAL8OG40PY4k
hZgg4Psfvfx8bwiye5efjMSe4fXV7BUCkr601+8kVJoiaoszkux9mqP+cnnB5P3H
zW1d1jx7d6eV1Y063h7WgiNqQRYv0bROZP5BJkufIoOHUXDpd65IRF3bDnCIvSEz
KkMYJNXb3qh7EQeHS53NL+gz2EBQt+Tq1VH256qn6i3mcHs85HvC68gVrAkfVHJE
QLJE3MoXWOqw+mhwzCRrEXN9O1lT/PqDWw8I4M/5KtGG/KnJs+bygmfKBbKjIVg4
2yQWfMmOgQsw3GWCRjgEli7aYbDJQjany0K/qZTq54I41gu+TV8YMccaWcXgDKrm
cXFGUfOg4gBm4IRjJ/RJn+mUv6u+/3sLVqsaFTs9aiib1dpBSSUuMGBh548Ft7g2
5VbFVSDaLjB2BdlcG7enlsmtzw0ltNssmqg7jTK/L7XNVnvxwUoXw+zP7RmCLEYt
UV4FHXraMKNt2ZketlomC8ui2hg73ylUp4pPdMXCp7PIXp9sVamRTbpz12h689VJ
/s55bWxHkR6S
=LBxT
-----END PGP SIGNATURE-----
Merge tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MMIO stale data fixes from Thomas Gleixner:
"Yet another hw vulnerability with a software mitigation: Processor
MMIO Stale Data.
They are a class of MMIO-related weaknesses which can expose stale
data by propagating it into core fill buffers. Data which can then be
leaked using the usual speculative execution methods.
Mitigations include this set along with microcode updates and are
similar to MDS and TAA vulnerabilities: VERW now clears those buffers
too"
* tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation/mmio: Print SMT warning
KVM: x86/speculation: Disable Fill buffer clear within guests
x86/speculation/mmio: Reuse SRBDS mitigation for SBDS
x86/speculation/srbds: Update SRBDS mitigation selection
x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data
x86/speculation/mmio: Enable CPU Fill buffer clearing on idle
x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations
x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data
x86/speculation: Add a common function for MD_CLEAR mitigation update
x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug
Documentation: Add documentation for Processor MMIO Stale Data
With binutils 2.26, RESERVE_BRK() causes a build failure:
/tmp/ccnGOKZ5.s: Assembler messages:
/tmp/ccnGOKZ5.s:98: Error: missing ')'
/tmp/ccnGOKZ5.s:98: Error: missing ')'
/tmp/ccnGOKZ5.s:98: Error: missing ')'
/tmp/ccnGOKZ5.s:98: Error: junk at end of line, first unrecognized
character is `U'
The problem is this line:
RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE)
Specifically, the INIT_PGT_BUF_SIZE macro which (via PAGE_SIZE's use
_AC()) has a "1UL", which makes older versions of the assembler unhappy.
Unfortunately the _AC() macro doesn't work for inline asm.
Inline asm was only needed here to convince the toolchain to add the
STT_NOBITS flag. However, if a C variable is placed in a section whose
name is prefixed with ".bss", GCC and Clang automatically set
STT_NOBITS. In fact, ".bss..page_aligned" already relies on this trick.
So fix the build failure (and simplify the macro) by allocating the
variable in C.
Also, add NOLOAD to the ".brk" output section clause in the linker
script. This is a failsafe in case the ".bss" prefix magic trick ever
stops working somehow. If there's a section type mismatch, the GNU
linker will force the ".brk" output section to be STT_NOBITS. The LLVM
linker will fail with a "section type mismatch" error.
Note this also changes the name of the variable from .brk.##name to
__brk_##name. The variable names aren't actually used anywhere, so it's
harmless.
Fixes: a1e2c031ec ("x86/mm: Simplify RESERVE_BRK()")
Reported-by: Joe Damato <jdamato@fastly.com>
Reported-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/r/22d07a44c80d8e8e1e82b9a806ddc8c6bbb2606e.1654759036.git.jpoimboe@kernel.org
Remove vendor checks from prefer_mwait_c1_over_halt function. Restore
the decision tree to support MWAIT C1 as the default idle state based on
CPUID checks as done by Thomas Gleixner in
commit 09fd4b4ef5 ("x86: use cpuid to check MWAIT support for C1")
The decision tree is removed in
commit 69fb3676df ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param")
Prefer MWAIT when the following conditions are satisfied:
1. CPUID_Fn00000001_ECX [Monitor] should be set
2. CPUID_Fn00000005 should be supported
3. If CPUID_Fn00000005_ECX [EMX] is set then there should be
at least one C1 substate available, indicated by
CPUID_Fn00000005_EDX [MWaitC1SubStates] bits.
Otherwise use HLT for default_idle function.
HPC customers who want to optimize for lower latency are known to
disable Global C-States in the BIOS. In fact, some vendors allow
choosing a BIOS 'performance' profile which explicitly disables
C-States. In this scenario, the cpuidle driver will not be loaded and
the kernel will continue with the default idle state chosen at boot
time. On AMD systems currently the default idle state is HLT which has
a higher exit latency compared to MWAIT.
The reason for the choice of HLT over MWAIT on AMD systems is:
1. Families prior to 10h didn't support MWAIT
2. Families 10h-15h supported MWAIT, but not MWAIT C1. Hence it was
preferable to use HLT as the default state on these systems.
However, AMD Family 17h onwards supports MWAIT as well as MWAIT C1. And
it is preferable to use MWAIT as the default idle state on these
systems, as it has lower exit latencies.
The below table represents the exit latency for HLT and MWAIT on AMD
Zen 3 system. Exit latency is measured by issuing a wakeup (IPI) to
other CPU and measuring how many clock cycles it took to wakeup. Each
iteration measures 10K wakeups by pinning source and destination.
HLT:
25.0000th percentile : 1900 ns
50.0000th percentile : 2000 ns
75.0000th percentile : 2300 ns
90.0000th percentile : 2500 ns
95.0000th percentile : 2600 ns
99.0000th percentile : 2800 ns
99.5000th percentile : 3000 ns
99.9000th percentile : 3400 ns
99.9500th percentile : 3600 ns
99.9900th percentile : 5900 ns
Min latency : 1700 ns
Max latency : 5900 ns
Total Samples 9999
MWAIT:
25.0000th percentile : 1400 ns
50.0000th percentile : 1500 ns
75.0000th percentile : 1700 ns
90.0000th percentile : 1800 ns
95.0000th percentile : 1900 ns
99.0000th percentile : 2300 ns
99.5000th percentile : 2500 ns
99.9000th percentile : 3200 ns
99.9500th percentile : 3500 ns
99.9900th percentile : 4600 ns
Min latency : 1200 ns
Max latency : 4600 ns
Total Samples 9997
Improvement (99th percentile): 21.74%
Below is another result for context_switch2 micro-benchmark, which
brings out the impact of improved wakeup latency through increased
context-switches per second.
with HLT:
-------------------------------
50.0000th percentile : 190184
75.0000th percentile : 191032
90.0000th percentile : 192314
95.0000th percentile : 192520
99.0000th percentile : 192844
MIN : 190148
MAX : 192852
with MWAIT:
-------------------------------
50.0000th percentile : 277444
75.0000th percentile : 278268
90.0000th percentile : 278888
95.0000th percentile : 279164
99.0000th percentile : 280504
MIN : 273278
MAX : 281410
Improvement(99th percentile): ~ 45.46%
Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://ozlabs.org/~anton/junkcode/context_switch2.c
Link: https://lkml.kernel.org/r/0cc675d8fd1f55e41b510e10abf2e21b6e9803d5.1654538381.git-series.wyes.karny@amd.com
When kernel is booted with idle=nomwait do not use MWAIT as the
default idle state.
If the user boots the kernel with idle=nomwait, it is a clear
direction to not use mwait as the default idle state.
However, the current code does not take this into consideration
while selecting the default idle state on x86.
Fix it by checking for the idle=nomwait boot option in
prefer_mwait_c1_over_halt().
Also update the documentation around idle=nomwait appropriately.
[ dhansen: tweak commit message ]
Signed-off-by: Wyes Karny <wyes.karny@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lkml.kernel.org/r/fdc2dc2d0a1bc21c2f53d989ea2d2ee3ccbc0dbe.1654538381.git-series.wyes.karny@amd.com
A new 64-bit control field "tertiary processor-based VM-execution
controls", is defined [1]. It's controlled by bit 17 of the primary
processor-based VM-execution controls.
Different from its brother VM-execution fields, this tertiary VM-
execution controls field is 64 bit. So it occupies 2 vmx_feature_leafs,
TERTIARY_CTLS_LOW and TERTIARY_CTLS_HIGH.
Its companion VMX capability reporting MSR,MSR_IA32_VMX_PROCBASED_CTLS3
(0x492), is also semantically different from its brothers, whose 64 bits
consist of all allow-1, rather than 32-bit allow-0 and 32-bit allow-1 [1][2].
Therefore, its init_vmx_capabilities() is a little different from others.
[1] ISE 6.2 "VMCS Changes"
https://www.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
[2] SDM Vol3. Appendix A.3
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Zeng Guang <guang.zeng@intel.com>
Message-Id: <20220419153240.11549-1-guang.zeng@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The file-wide OBJECT_FILES_NON_STANDARD annotation is used with
CONFIG_FRAME_POINTER to tell objtool to skip the entire file when frame
pointers are enabled. However that annotation is now deprecated because
it doesn't work with IBT, where objtool runs on vmlinux.o instead of
individual translation units.
Instead, use more fine-grained function-specific annotations:
- The 'save_mcount_regs' macro does funny things with the frame pointer.
Use STACK_FRAME_NON_STANDARD_FP to tell objtool to ignore the
functions using it.
- The return_to_handler() "function" isn't actually a callable function.
Instead of being called, it's returned to. The real return address
isn't on the stack, so unwinding is already doomed no matter which
unwinder is used. So just remove the STT_FUNC annotation, telling
objtool to ignore it. That also removes the implicit
ANNOTATE_NOENDBR, which now needs to be made explicit.
Fixes the following warning:
vmlinux.o: warning: objtool: __fentry__+0x16: return with modified stack frame
Fixes: ed53a0d971 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/b7a7a42fe306aca37826043dac89e113a1acdbac.1654268610.git.jpoimboe@kernel.org
- fixes for material merged during this merge window
- cc:stable fixes for more longstanding issues
- minor mailmap and MAINTAINERS updates
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYpz1+QAKCRDdBJ7gKXxA
jrudAP9EvjTg4KhmXDoUpgJYc2oPg27nIhu1LWT8VFdsVQ6mPwEA//HPvPhjah8u
C1M183VxKL9trZf22DBn2BbD3kBDIAo=
=9LgC
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm hotfixes from Andrew Morton:
"Fixups for various recently-added and longer-term issues and a few
minor tweaks:
- fixes for material merged during this merge window
- cc:stable fixes for more longstanding issues
- minor mailmap and MAINTAINERS updates"
* tag 'mm-hotfixes-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery
x86/kexec: fix memory leak of elf header buffer
mm/memremap: fix missing call to untrack_pfn() in pagemap_range()
mm: page_isolation: use compound_nr() correctly in isolate_single_pageblock()
mm: hugetlb_vmemmap: fix CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON
MAINTAINERS: add maintainer information for z3fold
mailmap: update Josh Poimboeuf's email
SGX enclave is accounted to the wrong memory control group.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcd1MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaalD/0TdNTH+LiM0BpEZ4VHIAFhE9mgfaU/
1HIZcXEvAzPqS+iLMYAPo2dS7hNKv1GCCD8HcuOdEwC/CyTdrcpvhCNeQXCagF38
BHtzVCMFd/Y6U7ERNVsaHiuHFSkF+3QHef4Gzljzblgj1FK7s55z9tlQmE3pElOg
UGfRoD32ODUtQPmOCjlOhFjsUUtFpdpXFCbjPPFdOqJ80LbdKR2s/0IBpHMk1xoz
ESmS10tVC3a5np1/4Ge8vRCZnewOpulL/Is84Q8MbCvxI8NQh9pD7Imom/wRjSAS
19N+sWh7ywuUtAOVqJ23dDc6SOL3yjM4HbmsEYRGPsgzuJ5crezLcrKgCFeGmz/4
4zbU3R9hzzXQy8ZqNjbj71FKswfUDcMLb26GA/62d2N6zR7O0TSzfIrpIYp+GwJ3
5KaM0LiKoW/LXGfwEdEBWpCkK1OKgMXmZ5IQlr5bRz3Qihqzkk65Dgfo66XRt+jb
DhMHW+cMfLwSX72QER6LyP2jPfUSCZgy5Pn8LfXUH30fc084gyrAPq2eqtVnf0lf
Hq5/r1nMosPE0CtxHM1vNRj5M052nQxXhDhdsTcoO6PVBrvEjJbkanj3XbNRk66T
FDWGWmdtDC2su6p0ezwbARMxYnsSS40GVsp/DoOu+SHxlAm9VkomY+QDJ2FuoJnb
K0XfW5vV9MEsvw==
=cuLD
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX fix from Thomas Gleixner:
"A single fix for x86/SGX to prevent that memory which is allocated for
an SGX enclave is accounted to the wrong memory control group"
* tag 'x86-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Set active memcg prior to shmem allocation
- Disable late microcode loading by default. Unless the HW people get
their act together and provide a required minimum version in the
microcode header for making a halfways informed decision its just
lottery and broken.
- Warn and taint the kernel when microcode is loaded late
- Remove the old unused microcode loader interface
- Remove a redundant perf callback from the microcode loader
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcdjETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUUTD/92cLMB0g7XP8yFN+ZiHl7uoaDtE0UR
WfapnMNL3tKKVnuwEMg83AjtyQ+O1JNZ6iS5K8jmiBnBByg6EQccz8pfe3jUQ6Ar
gOLWV1F4tRbJD2NqqiWOo/l5qs5hJaz/QeE+oyCP0fvw6DOZixepG5RzveFSSwAa
G2Q03GsGEu84SXlVAjagMSU6tYlBnlZBfKRB8NiNxkW8CLcJY0NfDCunbN6icEbH
AQHXeviM3GWMKJA9R9DeSvYq9PbN5o2UVmcFQWsDAzZ8Ne2qCqskjoGNjuQ9s+72
G35fm5d7dtIcrYg4PSJN0JDJP4HbcfSjhUrbdH4iAClTkGnNQERfuDV9O92/lYJE
hd9c8yCegD0NWQ4dMNNrM5PSbWbQK7ajqRYVvqqouJZpH+IDtajA1jxEe+1msB8P
xmXQDcdSMOyVs+Bw3Djf3tt8Qqhu4jxnf3y711oLklPwwh9lq9SvaWiX9ZFoYgdn
1HVtQUAOdgDmncs5BQ8dpuwtoYXH5p31n0wh57emyFXl7wA93eWouuFczQ6mSol9
LLdd9c+q9mBFIo0ult+fVhEOTDJF+27s3YXOpge6BAqei/SQIU4c5oq51CukV7ap
TPzWAayq0lsAwXn1k6r86Zkewh1C2SQTyk1J3zMehZlSVpwSWbEQASYlKywmh6YB
N+6/0XtHDAVK5Q==
=DQjr
-----END PGP SIGNATURE-----
Merge tag 'x86-microcode-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode updates from Thomas Gleixner:
- Disable late microcode loading by default. Unless the HW people get
their act together and provide a required minimum version in the
microcode header for making a halfways informed decision its just
lottery and broken.
- Warn and taint the kernel when microcode is loaded late
- Remove the old unused microcode loader interface
- Remove a redundant perf callback from the microcode loader
* tag 'x86-microcode-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Remove unnecessary perf callback
x86/microcode: Taint and warn on late loading
x86/microcode: Default-disable late loading
x86/microcode: Rip out the OLD_INTERFACE
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcdNYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYod9oD/9Y3JOKOKfXKz6VTZeIkaq8bwizjCxX
tzTdeDy6pCYHv1z63rhwTpB4lUl+dt5CUiaf9YjEU3bdgNBtba/C0j6rx2Zf7Qk0
hw2CjsPahEdFGRRgzbMF6fOUHxsOV/fZ5S4w9XcV7u5QAMkoj/w3rl7Eh2Vn9KbL
B8I7Cl+Vyec2feknWau3s9vt4GRt+EYR08YmjWL1bxzjFss8JTs0mpzVnuR/QurU
O20UIGS/167EvacmC15Ehht3EJEOykte3iWVFCMgEwFsZTOpByCQGGnDQCVik0o9
6gYESc6fRNfTbC+rRGMs2LWXwsYfJMAqYLkSQIfOIETqysxu2HoWoCFWhpmaKLYr
DEL3mVTy1LB7TqY6+C0P2UCeU9CuNr3fejQf4SsDIsNmTlUuv+FHrDCfi9cotX/G
gmRa/29BASMgoVzF/QnzrEUGvEqU5S7wJgBxAD0cTw4IwvXz80KgbHNEl9utOCjB
ceXoPh3zOaEnBZn7B5HXRq52r2KOA+T7dL/6blPfuYokZrKftq8z9fLDXEqKSLFY
2lSxtowzAZiUxZDI5Z6qoBmEeEXrbxK3r7ro42KXetvudDWAoCspf2Qz8kEgZOCV
ykDMPEnhesL8eE1LJzaJBSggz3LmQAslxIZ+CcZlFMAI+vOmbFEbMYVYxDzyJECN
LEK9uNoEY1mMeQ==
=gNpf
-----END PGP SIGNATURE-----
Merge tag 'x86-boot-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot update from Thomas Gleixner:
"Use strlcpy() instead of strscpy() in arch_setup()"
* tag 'x86-boot-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/setup: Use strscpy() to replace deprecated strlcpy()
of Peter Zijlstra was encountering with ptrace in his freezer rewrite
I identified some cleanups to ptrace_stop that make sense on their own
and move make resolving the other problems much simpler.
The biggest issue is the habbit of the ptrace code to change task->__state
from the tracer to suppress TASK_WAKEKILL from waking up the tracee. No
other code in the kernel does that and it is straight forward to update
signal_wake_up and friends to make that unnecessary.
Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
then it stores them by calling "wake_up_state(t, TASK_FROZEN)" relying
on the fact that all stopped states except the special stop states can
tolerate spurious wake up and recover their state.
The state of stopped and traced tasked is changed to be stored in
task->jobctl as well as in task->__state. This makes it possible for
the freezer to recover tasks in these special states, as well as
serving as a general cleanup. With a little more work in that
direction I believe TASK_STOPPED can learn to tolerate spurious wake
ups and become an ordinary stop state.
The TASK_TRACED state has to remain a special state as the registers for
a process are only reliably available when the process is stopped in
the scheduler. Fundamentally ptrace needs acess to the saved
register values of a task.
There are bunch of semi-random ptrace related cleanups that were found
while looking at these issues.
One cleanup that deserves to be called out is from commit 57b6de08b5
("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs"). This
makes a change that is technically user space visible, in the handling
of what happens to a tracee when a tracer dies unexpectedly.
According to our testing and our understanding of userspace nothing
cares that spurious SIGTRAPs can be generated in that case.
The entire discussion can be found at:
https://lkml.kernel.org/r/87a6bv6dl6.fsf_-_@email.froward.int.ebiederm.org
Eric W. Biederman (11):
signal: Rename send_signal send_signal_locked
signal: Replace __group_send_sig_info with send_signal_locked
ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
ptrace: Remove arch_ptrace_attach
signal: Use lockdep_assert_held instead of assert_spin_locked
ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
ptrace: Document that wait_task_inactive can't fail
ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
ptrace: Don't change __state
ptrace: Always take siglock in ptrace_resume
Peter Zijlstra (1):
sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
arch/ia64/include/asm/ptrace.h | 4 --
arch/ia64/kernel/ptrace.c | 57 ----------------
arch/um/include/asm/thread_info.h | 2 +
arch/um/kernel/exec.c | 2 +-
arch/um/kernel/process.c | 2 +-
arch/um/kernel/ptrace.c | 8 +--
arch/um/kernel/signal.c | 4 +-
arch/x86/kernel/step.c | 3 +-
arch/xtensa/kernel/ptrace.c | 4 +-
arch/xtensa/kernel/signal.c | 4 +-
drivers/tty/tty_jobctrl.c | 4 +-
include/linux/ptrace.h | 7 --
include/linux/sched.h | 10 ++-
include/linux/sched/jobctl.h | 8 +++
include/linux/sched/signal.h | 20 ++++--
include/linux/signal.h | 3 +-
kernel/ptrace.c | 87 ++++++++---------------
kernel/sched/core.c | 5 +-
kernel/signal.c | 140 +++++++++++++++++---------------------
kernel/time/posix-cpu-timers.c | 6 +-
20 files changed, 140 insertions(+), 240 deletions(-)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmKaXaYACgkQC/v6Eiaj
j0CgoA/+JncSQ6PY2D5Jh1apvHzmnRsFXzr3DRvtv/CVx4oIebOXRQFyVDeD5tRn
TmMgB29HpBlHRDLojlmlZRGAld1HR/aPEW9j8W1D3Sy/ZFO5L8lQitv9aDHO9Ntw
4lZvlhS1M0KhATudVVBqSPixiG6CnV5SsGmixqdOyg7xcXSY6G1l2nB7Zk9I3Tat
ZlmhuZ6R5Z5qsm4MEq0vUSrnsHiGxYrpk6uQOaVz8Wkv8ZFmbutt6XgxF0tsyZNn
mHSmWSiZzIgBjTlaibEmxi8urYJTPj3vGBeJQVYHblFwLFi6+Oy7bDxQbWjQvaZh
DsgWPScfBF4Jm0+8hhCiSYpvPp8XnZuklb4LNCeok/VFr+KfSmpJTIhn00kagQ1u
vxQDqLws8YLW4qsfGydfx9uUIFCbQE/V2VDYk5J3Re3gkUNDOOR1A56hPniKv6VB
2aqGO2Fl0RdBbUa3JF+XI5Pwq5y1WrqR93EUvj+5+u5W9rZL/8WLBHBMEz6gbmfD
DhwFE0y8TG2WRlWJVEDRId+5zo3di/YvasH0vJZ5HbrxhS2RE/yIGAd+kKGx/lZO
qWDJC7IHvFJ7Mw5KugacyF0SHeNdloyBM7KZW6HeXmgKn9IMJBpmwib92uUkRZJx
D8j/bHHqD/zsgQ39nO+c4M0MmhO/DsPLG/dnGKrRCu7v1tmEnkY=
=ZUuO
-----END PGP SIGNATURE-----
Merge tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace_stop cleanups from Eric Biederman:
"While looking at the ptrace problems with PREEMPT_RT and the problems
Peter Zijlstra was encountering with ptrace in his freezer rewrite I
identified some cleanups to ptrace_stop that make sense on their own
and move make resolving the other problems much simpler.
The biggest issue is the habit of the ptrace code to change
task->__state from the tracer to suppress TASK_WAKEKILL from waking up
the tracee. No other code in the kernel does that and it is straight
forward to update signal_wake_up and friends to make that unnecessary.
Peter's task freezer sets frozen tasks to a new state TASK_FROZEN and
then it stores them by calling "wake_up_state(t, TASK_FROZEN)" relying
on the fact that all stopped states except the special stop states can
tolerate spurious wake up and recover their state.
The state of stopped and traced tasked is changed to be stored in
task->jobctl as well as in task->__state. This makes it possible for
the freezer to recover tasks in these special states, as well as
serving as a general cleanup. With a little more work in that
direction I believe TASK_STOPPED can learn to tolerate spurious wake
ups and become an ordinary stop state.
The TASK_TRACED state has to remain a special state as the registers
for a process are only reliably available when the process is stopped
in the scheduler. Fundamentally ptrace needs acess to the saved
register values of a task.
There are bunch of semi-random ptrace related cleanups that were found
while looking at these issues.
One cleanup that deserves to be called out is from commit 57b6de08b5
("ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs"). This
makes a change that is technically user space visible, in the handling
of what happens to a tracee when a tracer dies unexpectedly. According
to our testing and our understanding of userspace nothing cares that
spurious SIGTRAPs can be generated in that case"
* tag 'ptrace_stop-cleanup-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
ptrace: Always take siglock in ptrace_resume
ptrace: Don't change __state
ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs
ptrace: Document that wait_task_inactive can't fail
ptrace: Reimplement PTRACE_KILL by always sending SIGKILL
signal: Use lockdep_assert_held instead of assert_spin_locked
ptrace: Remove arch_ptrace_attach
ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
signal: Replace __group_send_sig_info with send_signal_locked
signal: Rename send_signal send_signal_locked
ordinary user mode tasks.
In commit 40966e316f ("kthread: Ensure struct kthread is present for
all kthreads") caused init and the user mode helper threads that call
kernel_execve to have struct kthread allocated for them. This struct
kthread going away during execve in turned made a use after free of
struct kthread possible.
The commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
init and umh") is enough to fix the use after free and is simple enough
to be backportable.
The rest of the changes pass struct kernel_clone_args to clean things
up and cause the code to make sense.
In making init and the user mode helpers tasks purely user mode tasks
I ran into two complications. The function task_tick_numa was
detecting tasks without an mm by testing for the presence of
PF_KTHREAD. The initramfs code in populate_initrd_image was using
flush_delayed_fput to ensuere the closing of all it's file descriptors
was complete, and flush_delayed_fput does not work in a userspace thread.
I have looked and looked and more complications and in my code review
I have not found any, and neither has anyone else with the code sitting
in linux-next.
Link: https://lkml.kernel.org/r/87mtfu4up3.fsf@email.froward.int.ebiederm.org
Eric W. Biederman (8):
kthread: Don't allocate kthread_struct for init and umh
fork: Pass struct kernel_clone_args into copy_thread
fork: Explicity test for idle tasks in copy_thread
fork: Generalize PF_IO_WORKER handling
init: Deal with the init process being a user mode process
fork: Explicitly set PF_KTHREAD
fork: Stop allowing kthreads to call execve
sched: Update task_tick_numa to ignore tasks without an mm
arch/alpha/kernel/process.c | 13 ++++++------
arch/arc/kernel/process.c | 13 ++++++------
arch/arm/kernel/process.c | 12 ++++++-----
arch/arm64/kernel/process.c | 12 ++++++-----
arch/csky/kernel/process.c | 15 ++++++-------
arch/h8300/kernel/process.c | 10 ++++-----
arch/hexagon/kernel/process.c | 12 ++++++-----
arch/ia64/kernel/process.c | 15 +++++++------
arch/m68k/kernel/process.c | 12 ++++++-----
arch/microblaze/kernel/process.c | 12 ++++++-----
arch/mips/kernel/process.c | 13 ++++++------
arch/nios2/kernel/process.c | 12 ++++++-----
arch/openrisc/kernel/process.c | 12 ++++++-----
arch/parisc/kernel/process.c | 18 +++++++++-------
arch/powerpc/kernel/process.c | 15 +++++++------
arch/riscv/kernel/process.c | 12 ++++++-----
arch/s390/kernel/process.c | 12 ++++++-----
arch/sh/kernel/process_32.c | 12 ++++++-----
arch/sparc/kernel/process_32.c | 12 ++++++-----
arch/sparc/kernel/process_64.c | 12 ++++++-----
arch/um/kernel/process.c | 15 +++++++------
arch/x86/include/asm/fpu/sched.h | 2 +-
arch/x86/include/asm/switch_to.h | 8 +++----
arch/x86/kernel/fpu/core.c | 4 ++--
arch/x86/kernel/process.c | 18 +++++++++-------
arch/xtensa/kernel/process.c | 17 ++++++++-------
fs/exec.c | 8 ++++---
include/linux/sched/task.h | 8 +++++--
init/initramfs.c | 2 ++
init/main.c | 2 +-
kernel/fork.c | 46 +++++++++++++++++++++++++++++++++-------
kernel/sched/fair.c | 2 +-
kernel/umh.c | 6 +++---
33 files changed, 234 insertions(+), 160 deletions(-)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmKaR/MACgkQC/v6Eiaj
j0Aayg/7Bx66872d9c6igkJ+MPCTuh+v9QKCGwiYEmiU4Q5sVAFB0HPJO27qC14u
630X0RFNZTkPzNNEJNIW4kw6Dj8s8YRKf+FgQAVt4SzdRwT7eIPDjk1nGraopPJ3
O04pjvuTmUyidyViRyFcf2ptx/pnkrwP8jUSc+bGTgfASAKAgAokqKE5ecjewbBc
Y/EAkQ6QW7KxPjeSmpAHwI+t3BpBev9WEC4PbhRhsBCQFO2+PJiklvqdhVNBnIjv
qUezll/1xv9UYgniB15Q4Nb722SmnWSU3r8as1eFPugzTHizKhufrrpyP+KMK1A0
tdtEJNs5t2DZF7ZbGTFSPqJWmyTYLrghZdO+lOmnaSjHxK4Nda1d4NzbefJ0u+FE
tutewowvHtBX6AFIbx+H3O+DOJM2IgNMf+ReQDU/TyNyVf3wBrTbsr9cLxypIJIp
zze8npoLMlB7B4yxVo5ES5e63EXfi3iHl0L3/1EhoGwriRz1kWgVLUX/VZOUpscL
RkJHsW6bT8sqxPWAA5kyWjEN+wNR2PxbXi8OE4arT0uJrEBMUgDCzydzOv5tJB00
mSQdytxH9LVdsmxBKAOBp5X6WOLGA4yb1cZ6E/mEhlqXMpBDF1DaMfwbWqxSYi4q
sp5zU3SBAW0qceiZSsWZXInfbjrcQXNV/DkDRDO9OmzEZP4m1j0=
=x6fy
-----END PGP SIGNATURE-----
Merge tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull kthread updates from Eric Biederman:
"This updates init and user mode helper tasks to be ordinary user mode
tasks.
Commit 40966e316f ("kthread: Ensure struct kthread is present for
all kthreads") caused init and the user mode helper threads that call
kernel_execve to have struct kthread allocated for them. This struct
kthread going away during execve in turned made a use after free of
struct kthread possible.
Here, commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
init and umh") is enough to fix the use after free and is simple
enough to be backportable.
The rest of the changes pass struct kernel_clone_args to clean things
up and cause the code to make sense.
In making init and the user mode helpers tasks purely user mode tasks
I ran into two complications. The function task_tick_numa was
detecting tasks without an mm by testing for the presence of
PF_KTHREAD. The initramfs code in populate_initrd_image was using
flush_delayed_fput to ensuere the closing of all it's file descriptors
was complete, and flush_delayed_fput does not work in a userspace
thread.
I have looked and looked and more complications and in my code review
I have not found any, and neither has anyone else with the code
sitting in linux-next"
* tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sched: Update task_tick_numa to ignore tasks without an mm
fork: Stop allowing kthreads to call execve
fork: Explicitly set PF_KTHREAD
init: Deal with the init process being a user mode process
fork: Generalize PF_IO_WORKER handling
fork: Explicity test for idle tasks in copy_thread
fork: Pass struct kernel_clone_args into copy_thread
kthread: Don't allocate kthread_struct for init and umh
When the system runs out of enclave memory, SGX can reclaim EPC pages
by swapping to normal RAM. These backing pages are allocated via a
per-enclave shared memory area. Since SGX allows unlimited over
commit on EPC memory, the reclaimer thread can allocate a large
number of backing RAM pages in response to EPC memory pressure.
When the shared memory backing RAM allocation occurs during
the reclaimer thread context, the shared memory is charged to
the root memory control group, and the shmem usage of the enclave
is not properly accounted for, making cgroups ineffective at
limiting the amount of RAM an enclave can consume.
For example, when using a cgroup to launch a set of test
enclaves, the kernel does not properly account for 50% - 75% of
shmem page allocations on average. In the worst case, when
nearly all allocations occur during the reclaimer thread, the
kernel accounts less than a percent of the amount of shmem used
by the enclave's cgroup to the correct cgroup.
SGX stores a list of mm_structs that are associated with
an enclave. Pick one of them during reclaim and charge that
mm's memcg with the shmem allocation. The one that gets picked
is arbitrary, but this list almost always only has one mm. The
cases where there is more than one mm with different memcg's
are not worth considering.
Create a new function - sgx_encl_alloc_backing(). This function
is used whenever a new backing storage page needs to be
allocated. Previously the same function was used for page
allocation as well as retrieving a previously allocated page.
Prior to backing page allocation, if there is a mm_struct associated
with the enclave that is requesting the allocation, it is set
as the active memory control group.
[ dhansen: - fix merge conflict with ELDU fixes
- check against actual ksgxd_tsk, not ->mm ]
Cc: stable@vger.kernel.org
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lkml.kernel.org/r/20220520174248.4918-1-kristen@linux.intel.com
This is reported by kmemleak detector:
unreferenced object 0xffffc900002a9000 (size 4096):
comm "kexec", pid 14950, jiffies 4295110793 (age 373.951s)
hex dump (first 32 bytes):
7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 .ELF............
04 00 3e 00 01 00 00 00 00 00 00 00 00 00 00 00 ..>.............
backtrace:
[<0000000016a8ef9f>] __vmalloc_node_range+0x101/0x170
[<000000002b66b6c0>] __vmalloc_node+0xb4/0x160
[<00000000ad40107d>] crash_prepare_elf64_headers+0x8e/0xcd0
[<0000000019afff23>] crash_load_segments+0x260/0x470
[<0000000019ebe95c>] bzImage64_load+0x814/0xad0
[<0000000093e16b05>] arch_kexec_kernel_image_load+0x1be/0x2a0
[<000000009ef2fc88>] kimage_file_alloc_init+0x2ec/0x5a0
[<0000000038f5a97a>] __do_sys_kexec_file_load+0x28d/0x530
[<0000000087c19992>] do_syscall_64+0x3b/0x90
[<0000000066e063a4>] entry_SYSCALL_64_after_hwframe+0x44/0xae
In crash_prepare_elf64_headers(), a buffer is allocated via vmalloc() to
store elf headers. While it's not freed back to system correctly when
kdump kernel is reloaded or unloaded. Then memory leak is caused. Fix it
by introducing x86 specific function arch_kimage_file_post_load_cleanup(),
and freeing the buffer there.
And also remove the incorrect elf header buffer freeing code. Before
calling arch specific kexec_file loading function, the image instance has
been initialized. So 'image->elf_headers' must be NULL. It doesn't make
sense to free the elf header buffer in the place.
Three different people have reported three bugs about the memory leak on
x86_64 inside Redhat.
Link: https://lkml.kernel.org/r/20220223113225.63106-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Similar to MDS and TAA, print a warning if SMT is enabled for the MMIO
Stale Data vulnerability.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
c93dc84cbe ("perf/x86: Add a microcode revision check for SNB-PEBS")
checks whether the microcode revision has fixed PEBS issues.
This can happen either:
1. At PEBS init time, where the early microcode has been loaded already
2. During late loading, in the microcode_check() callback.
So remove the unnecessary call in the microcode loader init routine.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220525161232.14924-5-bp@alien8.de
Warn before it is attempted and taint the kernel. Late loading microcode
can lead to malfunction of the kernel when the microcode update changes
behaviour. There is no way for the kernel to determine whether its safe or
not.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220525161232.14924-4-bp@alien8.de
It is dangerous and it should not be used anyway - there's a nice early
loading already.
Requested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220525161232.14924-3-bp@alien8.de
- Add Tegra234 cpufreq support (Sumit Gupta).
- Clean up and enhance the Mediatek cpufreq driver (Wan Jiabing,
Rex-BC Chen, and Jia-Wei Chang).
- Fix up the CPPC cpufreq driver after recent changes (Zheng Bin,
Pierre Gondois).
- Minor update to dt-binding for Qcom's opp-v2-kryo-cpu (Yassine
Oudjana).
- Use list iterator only inside the list_for_each_entry loop (Xiaomeng
Tong, and Jakob Koschel).
- New APIs related to finding OPP based on interconnect bandwidth
(Krzysztof Kozlowski).
- Fix the missing of_node_put() in _bandwidth_supported() (Dan
Carpenter).
- Cleanups (Krzysztof Kozlowski, and Viresh Kumar).
- Add Out of Band mode description to the intel-speed-select utility
documentation (Srinivas Pandruvada).
- Add power sequences support to the system reboot and power off
code and make related platform-specific changes for multiple
platforms (Dmitry Osipenko, Geert Uytterhoeven).
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmKU8lESHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxVz0P91LNCbkDSt60jzNkXdEjsvUnI/YjJ+QJ
/+ta7iCwf90obb6s9soBkTyU8Ia7hJ/IWDJW/5xhdG0ySYF17hGNIGKK9xKGsJFK
tzzWtjFsvT3PeUZQERekqWp8OYskHYmQMj8o4jqqFF7DZD/AswTgkVLALUd7YhVL
UvLmcKsUA7eXy3ZrhtrGSzVSEbKOGXBLFyjy3IuWjfz6Uk/nGQRNKGf7byRWLM44
y7zb75/5+p4MPyyJP8M/uiXzEYDKuubRtfx9PdmLgBUSMbtho6eB1x47dZWooaxe
YKmcFjF80AmnwxHb+Te2rZHPeIYr+5hLBaEq7xaLQf/nAS3y5z1PIfI2wVQ5mXPz
D599jHHda/6oSAKCVTq2fKfnlR6fetm5j66xOQINpD+G5b5tNSpllXJDamFZxFgP
DiQAOFzdnRYnK7yTiLWVl1q76SVRxqsGz7/5Ak+NRj2OQK2wRkLzHuZfiV/8r0pk
ksi6Ew9TerXkstoTQsSToPQxB2VvosSajNU3Oy27pmM0oal1XxP0LIPz9sMor5/g
tfk5f6Yz/+FFIfXj3cZffZNdhsJgejmcqPdrSdCOV3sBrblnIMQNpHiYg4jGztoj
IjYKYPVpSaWiSZLQOaK2moTEvm9CfQz1TQCF+/Kz88LX6/7ZaDJFxHG2FDEob0sg
6KVbrZWweLI=
=PAh+
-----END PGP SIGNATURE-----
Merge tag 'pm-5.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These update the ARM cpufreq drivers and fix up the CPPC cpufreq
driver after recent changes, update the OPP code and PM documentation
and add power sequences support to the system reboot and power off
code.
Specifics:
- Add Tegra234 cpufreq support (Sumit Gupta)
- Clean up and enhance the Mediatek cpufreq driver (Wan Jiabing,
Rex-BC Chen, and Jia-Wei Chang)
- Fix up the CPPC cpufreq driver after recent changes (Zheng Bin,
Pierre Gondois)
- Minor update to dt-binding for Qcom's opp-v2-kryo-cpu (Yassine
Oudjana)
- Use list iterator only inside the list_for_each_entry loop
(Xiaomeng Tong, and Jakob Koschel)
- New APIs related to finding OPP based on interconnect bandwidth
(Krzysztof Kozlowski)
- Fix the missing of_node_put() in _bandwidth_supported() (Dan
Carpenter)
- Cleanups (Krzysztof Kozlowski, and Viresh Kumar)
- Add Out of Band mode description to the intel-speed-select utility
documentation (Srinivas Pandruvada)
- Add power sequences support to the system reboot and power off code
and make related platform-specific changes for multiple platforms
(Dmitry Osipenko, Geert Uytterhoeven)"
* tag 'pm-5.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (60 commits)
cpufreq: CPPC: Fix unused-function warning
cpufreq: CPPC: Fix build error without CONFIG_ACPI_CPPC_CPUFREQ_FIE
Documentation: admin-guide: PM: Add Out of Band mode
kernel/reboot: Change registration order of legacy power-off handler
m68k: virt: Switch to new sys-off handler API
kernel/reboot: Add devm_register_restart_handler()
kernel/reboot: Add devm_register_power_off_handler()
soc/tegra: pmc: Use sys-off handler API to power off Nexus 7 properly
reboot: Remove pm_power_off_prepare()
regulator: pfuze100: Use devm_register_sys_off_handler()
ACPI: power: Switch to sys-off handler API
memory: emif: Use kernel_can_power_off()
mips: Use do_kernel_power_off()
ia64: Use do_kernel_power_off()
x86: Use do_kernel_power_off()
sh: Use do_kernel_power_off()
m68k: Switch to new sys-off handler API
powerpc: Use do_kernel_power_off()
xen/x86: Use do_kernel_power_off()
parisc: Use do_kernel_power_off()
...
- The majority of the changes are for fixes and clean ups.
Noticeable changes:
- Rework trace event triggers code to be easier to interact with.
- Support for embedding bootconfig with the kernel (as suppose to having it
embedded in initram). This is useful for embedded boards without initram
disks.
- Speed up boot by parallelizing the creation of tracefs files.
- Allow absolute ring buffer timestamps handle timestamps that use more than
59 bits.
- Added new tracing clock "TAI" (International Atomic Time)
- Have weak functions show up in available_filter_function list as:
__ftrace_invalid_address___<invalid-offset>
instead of using the name of the function before it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYpOgXRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qjkKAQDbpemxvpFyJlZqT8KgEIXubu+ag2/q
p0XDHaPS0zF9OQEAjTxg6GMEbnFYl6fzxZtOoEbiaQ7ppfdhRI8t6sSMVA8=
=+nDD
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The majority of the changes are for fixes and clean ups.
Notable changes:
- Rework trace event triggers code to be easier to interact with.
- Support for embedding bootconfig with the kernel (as suppose to
having it embedded in initram). This is useful for embedded boards
without initram disks.
- Speed up boot by parallelizing the creation of tracefs files.
- Allow absolute ring buffer timestamps handle timestamps that use
more than 59 bits.
- Added new tracing clock "TAI" (International Atomic Time)
- Have weak functions show up in available_filter_function list as:
__ftrace_invalid_address___<invalid-offset> instead of using the
name of the function before it"
* tag 'trace-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (52 commits)
ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function
tracing: Fix comments for event_trigger_separate_filter()
x86/traceponit: Fix comment about irq vector tracepoints
x86,tracing: Remove unused headers
ftrace: Clean up hash direct_functions on register failures
tracing: Fix comments of create_filter()
tracing: Disable kcov on trace_preemptirq.c
tracing: Initialize integer variable to prevent garbage return value
ftrace: Fix typo in comment
ftrace: Remove return value of ftrace_arch_modify_*()
tracing: Cleanup code by removing init "char *name"
tracing: Change "char *" string form to "char []"
tracing/timerlat: Do not wakeup the thread if the trace stops at the IRQ
tracing/timerlat: Print stacktrace in the IRQ handler if needed
tracing/timerlat: Notify IRQ new max latency only if stop tracing is set
kprobes: Fix build errors with CONFIG_KRETPROBES=n
tracing: Fix return value of trace_pid_write()
tracing: Fix potential double free in create_var_ref()
tracing: Use strim() to remove whitespace instead of doing it manually
ftrace: Deal with error return code of the ftrace_process_locs() function
...
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmKSSbcTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXgJyCACeyMOcFws5lyqqdk0R0zGr2KFfKsJn
YQR9nvldT2p/1y0ykvU208UIq0HHmXOb9pD8gOUzGYGp4XlEaC1f4V37mmzgLcRu
vL/HcFqBl2cQEfaQxiXZrmsIIVszwbc57EGqpl93cS2er4hp/NXmredKCId7Mpt8
FjxjgVGzdhEUKbJZYjkDM5pYAnJ9QVwuK3MaarKMK86Oj1P5YtKgIb4ZSt/NHvsC
Mukx3nivSH29XfK3fRsFDJUQr9WNYh1cmTtyhB0tWVXQCYFc4angZRtCJwyXzkp2
P5GBIQoMZcXX2XWkUBTtA1w5g/aZZsBExb3YGhQjsQP+jb6MtDnvOEo9
=Z2E+
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20220528' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Harden hv_sock driver (Andrea Parri)
- Harden Hyper-V PCI driver (Andrea Parri)
- Fix multi-MSI for Hyper-V PCI driver (Jeffrey Hugo)
- Fix Hyper-V PCI to reduce boot time (Dexuan Cui)
- Remove code for long EOL'ed Hyper-V versions (Michael Kelley, Saurabh
Sengar)
- Fix balloon driver error handling (Shradha Gupta)
- Fix a typo in vmbus driver (Julia Lawall)
- Ignore vmbus IMC device (Michael Kelley)
- Add a new error message to Hyper-V DRM driver (Saurabh Sengar)
* tag 'hyperv-next-signed-20220528' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (28 commits)
hv_balloon: Fix balloon_probe() and balloon_remove() error handling
scsi: storvsc: Removing Pre Win8 related logic
Drivers: hv: vmbus: fix typo in comment
PCI: hv: Fix synchronization between channel callback and hv_pci_bus_exit()
PCI: hv: Add validation for untrusted Hyper-V values
PCI: hv: Fix interrupt mapping for multi-MSI
PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
drm/hyperv: Remove support for Hyper-V 2008 and 2008R2/Win7
video: hyperv_fb: Remove support for Hyper-V 2008 and 2008R2/Win7
scsi: storvsc: Remove support for Hyper-V 2008 and 2008R2/Win7
Drivers: hv: vmbus: Remove support for Hyper-V 2008 and Hyper-V 2008R2/Win7
x86/hyperv: Disable hardlockup detector by default in Hyper-V guests
drm/hyperv: Add error message for fb size greater than allocated
PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time
PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
Drivers: hv: vmbus: Refactor the ring-buffer iterator functions
Drivers: hv: vmbus: Accept hv_sock offers in isolated guests
hv_sock: Add validation for untrusted Hyper-V values
hv_sock: Copy packets sent by Hyper-V out of the ring buffer
hv_sock: Check hv_pkt_iter_first_raw()'s return value
...
- Add support for clearing memory error via pwrite(2) on DAX
- Fix 'security overwrite' support in the presence of media errors
- Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYpFPcQAKCRDfioYZHlFs
Z9A3AQCdfoT5sY3OK+I/3oTvJ//6lw2MtXrnXFM046ICKPi9sgD8CzR9mRAHA+vj
kxOtJEU2bA9naninXGORsDUndiNkwQo=
=gVIn
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and DAX updates from Dan Williams:
"New support for clearing memory errors when a file is in DAX mode,
alongside with some other fixes and cleanups.
Previously it was only possible to clear these errors using a truncate
or hole-punch operation to trigger the filesystem to reallocate the
block, now, any page aligned write can opportunistically clear errors
as well.
This change spans x86/mm, nvdimm, and fs/dax, and has received the
appropriate sign-offs. Thanks to Jane for her work on this.
Summary:
- Add support for clearing memory error via pwrite(2) on DAX
- Fix 'security overwrite' support in the presence of media errors
- Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)"
* tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
pmem: implement pmem_recovery_write()
pmem: refactor pmem_clear_poison()
dax: add .recovery_write dax_operation
dax: introduce DAX_RECOVERY_WRITE dax access mode
mce: fix set_mce_nospec to always unmap the whole page
x86/mce: relocate set{clear}_mce_nospec() functions
acpi/nfit: rely on mce->misc to determine poison granularity
testing: nvdimm: asm/mce.h is not needed in nfit.c
testing: nvdimm: iomap: make __nfit_test_ioremap a macro
nvdimm: Allow overwrite in the presence of disabled dimms
tools/testing/nvdimm: remove unneeded flush_workqueue
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmKP/tQUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vzH0xAAojQowrSWzZ5FKTqI+L/L9ZXoAb+e
9IvQljKc9taJldmXp+EB9wkS/5B+VtQcC2qUQuWEQXUoECF8qHlcB4l+XQyd1tWO
O0vZxETH22xjLLrjG2F3l5rrfkJZAf2nEugwbDk97YEgiimeOiRcv3bx6AUCtj6I
rPJ13Fop3Jke7sQMcXYJe3gQLT1o1AKiQGghiCFNi/gzx2lXI6mmHBgLxFoiqcby
WpfXbvbJti95HRaahUR3HaDFfHj4HVkQNLlTtIykJ3Tl2/rOhWEJjI8JOIQpAA+M
WBrWw9rfgbScTiGV+dZ3h7hKiPnHKl9YETIX7L0oA2sj0jZcIs0d6mSBZx0kYuI9
eAlx+qSK9xpbQQr/fdYaUdF1q4QdtU0BYOvOWOzWsqYCECMRJ1PUHFSMbmR/+PNB
P5lHnAbggRSoxdAtwFYv1HTr+VpGH9S+5oxHCz3ohpMjYy6mkCZwHpZn3doaU3ci
KG6yIoVKftm3fZdtFvL03qHl/I8+X24ZhT/T/278PRGjkhSyr56hZo8hg0gqqTct
ngip8qNABmSbqpr73/W6Vl42zAbYtNk1BykYahbKupgW8FbT7hqaZTB05V87pVu+
Ko1aJM6VoOP9rMlKHI9ba8eYCzDrZbLZUFn7ljNPDpzutf0tAwtgwzvZBXN3za6+
Z9+D5dxmvrZEIbA=
=hEti
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
"Resource management:
- Restrict E820 clipping to PCI host bridge windows (Bjorn Helgaas)
- Log E820 clipping better (Bjorn Helgaas)
- Add kernel cmdline options to enable/disable E820 clipping (Hans de
Goede)
- Disable E820 reserved region clipping for IdeaPads, Yoga, Yoga
Slip, Acer Spin 5, Clevo Barebone systems where clipping leaves no
usable address space for touchpads, Thunderbolt devices, etc (Hans
de Goede)
- Disable E820 clipping by default starting in 2023 (Hans de Goede)
PCI device hotplug:
- Include files to remove implicit dependencies (Christophe Leroy)
- Only put Root Ports in D3 if they can signal and wake from D3 so
AMD Yellow Carp doesn't miss hotplug events (Mario Limonciello)
Power management:
- Define pci_restore_standard_config() only for CONFIG_PM_SLEEP since
it's unused otherwise (Krzysztof Kozlowski)
- Power up devices completely, including anything platform firmware
needs to do, during runtime resume (Rafael J. Wysocki)
- Move pci_resume_bus() to PM callbacks so we observe the required
bridge power-up delays (Rafael J. Wysocki)
- Drop unneeded runtime_d3cold device flag (Rafael J. Wysocki)
- Split pci_raw_set_power_state() between pci_power_up() and a new
pci_set_low_power_state() (Rafael J. Wysocki)
- Set current_state to D3cold if config read returns ~0, indicating
the device is not accessible (Rafael J. Wysocki)
- Do not call pci_update_current_state() from pci_power_up() so BARs
and ASPM config are restored correctly (Rafael J. Wysocki)
- Write 0 to PMCSR in pci_power_up() in all cases (Rafael J. Wysocki)
- Split pci_power_up() to pci_set_full_power_state() to avoid some
redundant operations (Rafael J. Wysocki)
- Skip restoring BARs if device is not in D0 (Rafael J. Wysocki)
- Rearrange and clarify pci_set_power_state() (Rafael J. Wysocki)
- Remove redundant BAR restores from pci_pm_thaw_noirq() (Rafael J.
Wysocki)
Virtualization:
- Acquire device lock before config space access lock to avoid AB/BA
deadlock with sriov_numvfs_store() (Yicong Yang)
Error handling:
- Clear MULTI_ERR_COR/UNCOR_RCV bits, which a race could previously
leave permanently set (Kuppuswamy Sathyanarayanan)
Peer-to-peer DMA:
- Whitelist Intel Skylake-E Root Ports regardless of which devfn they
are (Shlomo Pongratz)
ASPM:
- Override L1 acceptable latency advertised by Intel DG2 so ASPM L1
can be enabled (Mika Westerberg)
Cadence PCIe controller driver:
- Set up device-specific register to allow PTM Responder to be
enabled by the normal architected bit (Christian Gmeiner)
- Override advertised FLR support since the controller doesn't
implement FLR correctly (Parshuram Thombare)
Cadence PCIe endpoint driver:
- Correct bitmap size for the ob_region_map of outbound window usage
(Dan Carpenter)
Freescale i.MX6 PCIe controller driver:
- Fix PERST# assertion/deassertion so we observe the required delays
before accessing device (Francesco Dolcini)
Freescale Layerscape PCIe controller driver:
- Add "big-endian" DT property (Hou Zhiqiang)
- Update SCFG DT property (Hou Zhiqiang)
- Add "aer", "pme", "intr" DT properties (Li Yang)
- Add DT compatible strings for ls1028a (Xiaowei Bao)
Intel VMD host bridge driver:
- Assign VMD IRQ domain before enumeration to avoid IOMMU interrupt
remapping errors when MSI-X remapping is disabled (Nirmal Patel)
- Revert VMD workaround that kept MSI-X remapping enabled when IOMMU
remapping was enabled (Nirmal Patel)
Marvell MVEBU PCIe controller driver:
- Add of_pci_get_slot_power_limit() to parse the
'slot-power-limit-milliwatt' DT property (Pali Rohár)
- Add mvebu support for sending Set_Slot_Power_Limit message (Pali
Rohár)
MediaTek PCIe controller driver:
- Fix refcount leak in mtk_pcie_subsys_powerup() (Miaoqian Lin)
MediaTek PCIe Gen3 controller driver:
- Reset PHY and MAC at probe time (AngeloGioacchino Del Regno)
Microchip PolarFlare PCIe controller driver:
- Add chained_irq_enter()/chained_irq_exit() calls to mc_handle_msi()
and mc_handle_intx() to avoid lost interrupts (Conor Dooley)
- Fix interrupt handling race (Daire McNamara)
NVIDIA Tegra194 PCIe controller driver:
- Drop tegra194 MSI register save/restore, which is unnecessary since
the DWC core does it (Jisheng Zhang)
Qualcomm PCIe controller driver:
- Add SM8150 SoC DT binding and support (Bhupesh Sharma)
- Fix pipe clock imbalance (Johan Hovold)
- Fix runtime PM imbalance on probe errors (Johan Hovold)
- Fix PHY init imbalance on probe errors (Johan Hovold)
- Convert DT binding to YAML (Dmitry Baryshkov)
- Update DT binding to show that resets aren't required for
MSM8996/APQ8096 platforms (Dmitry Baryshkov)
- Add explicit register names per chipset in DT binding (Dmitry
Baryshkov)
- Add sc7280-specific clock and reset definitions to DT binding
(Dmitry Baryshkov)
Rockchip PCIe controller driver:
- Fix bitmap size when searching for free outbound region (Dan
Carpenter)
Rockchip DesignWare PCIe controller driver:
- Remove "snps,dw-pcie" from rockchip-dwc DT "compatible" property
because it's not fully compatible with rockchip (Peter Geis)
- Reset rockchip-dwc controller at probe (Peter Geis)
- Add rockchip-dwc INTx support (Peter Geis)
Synopsys DesignWare PCIe controller driver:
- Return error instead of success if DMA mapping of MSI area fails
(Jiantao Zhang)
Miscellaneous:
- Change pci_set_dma_mask() documentation references to
dma_set_mask() (Alex Williamson)"
* tag 'pci-v5.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (64 commits)
dt-bindings: PCI: qcom: Add schema for sc7280 chipset
dt-bindings: PCI: qcom: Specify reg-names explicitly
dt-bindings: PCI: qcom: Do not require resets on msm8996 platforms
dt-bindings: PCI: qcom: Convert to YAML
PCI: qcom: Fix unbalanced PHY init on probe errors
PCI: qcom: Fix runtime PM imbalance on probe errors
PCI: qcom: Fix pipe clock imbalance
PCI: qcom: Add SM8150 SoC support
dt-bindings: pci: qcom: Document PCIe bindings for SM8150 SoC
x86/PCI: Disable E820 reserved region clipping starting in 2023
x86/PCI: Disable E820 reserved region clipping via quirks
x86/PCI: Add kernel cmdline options to use/ignore E820 reserved regions
PCI: microchip: Fix potential race in interrupt handling
PCI/AER: Clear MULTI_ERR_COR/UNCOR_RCV bits
PCI: cadence: Clear FLR in device capabilities register
PCI: cadence: Allow PTM Responder to be enabled
PCI: vmd: Revert 2565e5b69c ("PCI: vmd: Do not disable MSI-X remapping if interrupt remapping is enabled by IOMMU.")
PCI: vmd: Assign VMD IRQ domain before enumeration
PCI: Avoid pci_dev_lock() AB/BA deadlock with sriov_numvfs_store()
PCI: rockchip-dwc: Add legacy interrupt support
...
subsystems. Most notably some maintenance work in ocfs2 and initramfs.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo/6xQAKCRDdBJ7gKXxA
jkD9AQCPczLBbRWpe1edL+5VHvel9ePoHQmvbHQnufdTh9rB5QEAu0Uilxz4q9cx
xSZypNhj2n9f8FCYca/ZrZneBsTnAA8=
=AJEO
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"The non-MM patch queue for this merge window.
Not a lot of material this cycle. Many singleton patches against
various subsystems. Most notably some maintenance work in ocfs2
and initramfs"
* tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (65 commits)
kcov: update pos before writing pc in trace function
ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock
ocfs2: dlmfs: don't clear USER_LOCK_ATTACHED when destroying lock
fs/ntfs: remove redundant variable idx
fat: remove time truncations in vfat_create/vfat_mkdir
fat: report creation time in statx
fat: ignore ctime updates, and keep ctime identical to mtime in memory
fat: split fat_truncate_time() into separate functions
MAINTAINERS: add Muchun as a memcg reviewer
proc/sysctl: make protected_* world readable
ia64: mca: drop redundant spinlock initialization
tty: fix deadlock caused by calling printk() under tty_port->lock
relay: remove redundant assignment to pointer buf
fs/ntfs3: validate BOOT sectors_per_clusters
lib/string_helpers: fix not adding strarray to device's resource list
kernel/crash_core.c: remove redundant check of ck_cmdline
ELF, uapi: fixup ELF_ST_TYPE definition
ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
ipc: update semtimedop() to use hrtimer
ipc/sem: remove redundant assignments
...
Commit:
4b9a8dca0e ("x86/idt: Remove the tracing IDT completely")
removed the 'tracing IDT' from arch/x86/kernel/tracepoint.c,
but left related comment. So that the comment become anachronistic.
Just remove the comment.
Link: https://lkml.kernel.org/r/20220526110831.175743-1-sunliming@kylinos.cn
Signed-off-by: sunliming <sunliming@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Commit 4b9a8dca0e ("x86/idt: Remove the tracing IDT completely")
removed the tracing IDT from the file arch/x86/kernel/tracepoint.c,
but left the related headers unused, remove it.
Link: https://lkml.kernel.org/r/20220525012827.93464-1-sunliming@kylinos.cn
Signed-off-by: sunliming <sunliming@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
All instances of the function ftrace_arch_modify_prepare() and
ftrace_arch_modify_post_process() return zero. There's no point in
checking their return value. Just have them be void functions.
Link: https://lkml.kernel.org/r/20220518023639.4065-1-kunyu@nfschina.com
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
* ultravisor communication device driver
* fix TEID on terminating storage key ops
RISC-V:
* Added Sv57x4 support for G-stage page table
* Added range based local HFENCE functions
* Added remote HFENCE functions based on VCPU requests
* Added ISA extension registers in ONE_REG interface
* Updated KVM RISC-V maintainers entry to cover selftests support
ARM:
* Add support for the ARMv8.6 WFxT extension
* Guard pages for the EL2 stacks
* Trap and emulate AArch32 ID registers to hide unsupported features
* Ability to select and save/restore the set of hypercalls exposed
to the guest
* Support for PSCI-initiated suspend in collaboration with userspace
* GICv3 register-based LPI invalidation support
* Move host PMU event merging into the vcpu data structure
* GICv3 ITS save/restore fixes
* The usual set of small-scale cleanups and fixes
x86:
* New ioctls to get/set TSC frequency for a whole VM
* Allow userspace to opt out of hypercall patching
* Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
AMD SEV improvements:
* Add KVM_EXIT_SHUTDOWN metadata for SEV-ES
* V_TSC_AUX support
Nested virtualization improvements for AMD:
* Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
nested vGIF)
* Allow AVIC to co-exist with a nested guest running
* Fixes for LBR virtualizations when a nested guest is running,
and nested LBR virtualization support
* PAUSE filtering for nested hypervisors
Guest support:
* Decoupling of vcpu_is_preempted from PV spinlocks
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmKN9M4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNLeAf+KizAlQwxEehHHeNyTkZuKyMawrD6
zsqAENR6i1TxiXe7fDfPFbO2NR0ZulQopHbD9mwnHJ+nNw0J4UT7g3ii1IAVcXPu
rQNRGMVWiu54jt+lep8/gDg0JvPGKVVKLhxUaU1kdWT9PhIOC6lwpP3vmeWkUfRi
PFL/TMT0M8Nfryi0zHB0tXeqg41BiXfqO8wMySfBAHUbpv8D53D2eXQL6YlMM0pL
2quB1HxHnpueE5vj3WEPQ3PCdy1M2MTfCDBJAbZGG78Ljx45FxSGoQcmiBpPnhJr
C6UGP4ZDWpml5YULUoA70k5ylCbP+vI61U4vUtzEiOjHugpPV5wFKtx5nw==
=ozWx
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"S390:
- ultravisor communication device driver
- fix TEID on terminating storage key ops
RISC-V:
- Added Sv57x4 support for G-stage page table
- Added range based local HFENCE functions
- Added remote HFENCE functions based on VCPU requests
- Added ISA extension registers in ONE_REG interface
- Updated KVM RISC-V maintainers entry to cover selftests support
ARM:
- Add support for the ARMv8.6 WFxT extension
- Guard pages for the EL2 stacks
- Trap and emulate AArch32 ID registers to hide unsupported features
- Ability to select and save/restore the set of hypercalls exposed to
the guest
- Support for PSCI-initiated suspend in collaboration with userspace
- GICv3 register-based LPI invalidation support
- Move host PMU event merging into the vcpu data structure
- GICv3 ITS save/restore fixes
- The usual set of small-scale cleanups and fixes
x86:
- New ioctls to get/set TSC frequency for a whole VM
- Allow userspace to opt out of hypercall patching
- Only do MSR filtering for MSRs accessed by rdmsr/wrmsr
AMD SEV improvements:
- Add KVM_EXIT_SHUTDOWN metadata for SEV-ES
- V_TSC_AUX support
Nested virtualization improvements for AMD:
- Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
nested vGIF)
- Allow AVIC to co-exist with a nested guest running
- Fixes for LBR virtualizations when a nested guest is running, and
nested LBR virtualization support
- PAUSE filtering for nested hypervisors
Guest support:
- Decoupling of vcpu_is_preempted from PV spinlocks"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (199 commits)
KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest
KVM: selftests: x86: Sync the new name of the test case to .gitignore
Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND
x86, kvm: use correct GFP flags for preemption disabled
KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer
x86/kvm: Alloc dummy async #PF token outside of raw spinlock
KVM: x86: avoid calling x86 emulator without a decoded instruction
KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak
x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave)
s390/uv_uapi: depend on CONFIG_S390
KVM: selftests: x86: Fix test failure on arch lbr capable platforms
KVM: LAPIC: Trace LAPIC timer expiration on every vmentry
KVM: s390: selftest: Test suppression indication on key prot exception
KVM: s390: Don't indicate suppression on dirtying, failing memop
selftests: drivers/s390x: Add uvdevice tests
drivers/s390/char: Add Ultravisor io device
MAINTAINERS: Update KVM RISC-V entry to cover selftests support
RISC-V: KVM: Introduce ISA extension register
RISC-V: KVM: Cleanup stale TLB entries when host CPU changes
RISC-V: KVM: Add remote HFENCE functions based on VCPU requests
...
- don't over-decrypt memory (Robin Murphy)
- takes min align mask into account for the swiotlb max mapping size
(Tianyu Lan)
- use GFP_ATOMIC in dma-debug (Mikulas Patocka)
- fix DMA_ATTR_NO_KERNEL_MAPPING on xen/arm (me)
- don't fail on highmem CMA pages in dma_direct_alloc_pages (me)
- cleanup swiotlb initialization and share more code with swiotlb-xen
(me, Stefano Stabellini)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmKObTQLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYObmA//dIcDB/q4iFGD+WJh4MhM+asx0ZsdF2OJz42WEhgT
Z9duOrgcneEQundCamqJP9rNTs980LHDA8uWQC5rZEc9vxuRVOdS7bSgYRUwWh6B
r0ZjOsvQCn+ChoZML8uyk4rfmEINq+EvJuec3G5fgecZOhPuJS2i2uzzv5cHwqgP
ChC0fwyZlkfdECXgvZXbEoCJLfTgGNlziN6Ai8dirSoqgEQUoCsY89/M7OiEBvV2
R4XUWD7OvQERfB4t6xLuUHyzf9PAuWB+OiblRVNeAmK3lMjxVrc3k4kIowgklnzD
8hfmphAa9Zou3zdfi6Gd4fiQRHRVOwKVp1rtqUmJ+lPSiwyMzu64z9ld2+2qac0h
V4sSr/yJkhxnBT4/0MkTChvhnRobisackpUzNRpiM4ck7cNVb7eAvkISsbH+pWI9
aEexPhbyskjlV+GOyM4QL4ygG0dpXY0HSyoh6uaSVsaXMycnWIsJCPidXxV1HGV0
q2/RLHuHwYxia8cYCF01/DQvwOKSjwbU0zModxtRezGD5GYh2C0a+SrA1aX+qiTu
yGJCs2UHtSQstAt78tTVp499YeDeL/oGSQkPAu8zyRkSczzF+CncGTuXyoJbAWyK
otcgERWljgZ4scxjfu1uacfoVhKQ7nOu7hiJokL0U80FESAennLC3ZlocvB9h/ff
HNA=
=n2rk
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.19-2022-05-25' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- don't over-decrypt memory (Robin Murphy)
- takes min align mask into account for the swiotlb max mapping size
(Tianyu Lan)
- use GFP_ATOMIC in dma-debug (Mikulas Patocka)
- fix DMA_ATTR_NO_KERNEL_MAPPING on xen/arm (me)
- don't fail on highmem CMA pages in dma_direct_alloc_pages (me)
- cleanup swiotlb initialization and share more code with swiotlb-xen
(me, Stefano Stabellini)
* tag 'dma-mapping-5.19-2022-05-25' of git://git.infradead.org/users/hch/dma-mapping: (23 commits)
dma-direct: don't over-decrypt memory
swiotlb: max mapping size takes min align mask into account
swiotlb: use the right nslabs-derived sizes in swiotlb_init_late
swiotlb: use the right nslabs value in swiotlb_init_remap
swiotlb: don't panic when the swiotlb buffer can't be allocated
dma-debug: change allocation mode from GFP_NOWAIT to GFP_ATIOMIC
dma-direct: don't fail on highmem CMA pages in dma_direct_alloc_pages
swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm
x86: remove cruft from <asm/dma-mapping.h>
swiotlb: remove swiotlb_init_with_tbl and swiotlb_init_late_with_tbl
swiotlb: merge swiotlb-xen initialization into swiotlb
swiotlb: provide swiotlb_init variants that remap the buffer
swiotlb: pass a gfp_mask argument to swiotlb_init_late
swiotlb: add a SWIOTLB_ANY flag to lift the low memory restriction
swiotlb: make the swiotlb_init interface more useful
x86: centralize setting SWIOTLB_FORCE when guest memory encryption is enabled
x86: remove the IOMMU table infrastructure
MIPS/octeon: use swiotlb_init instead of open coding it
arm/xen: don't check for xen_initial_domain() in xen_create_contiguous_region
swiotlb: rename swiotlb_late_init_with_default_size
...
Core
----
- Support TCPv6 segmentation offload with super-segments larger than
64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP).
- Generalize skb freeing deferral to per-cpu lists, instead of
per-socket lists.
- Add a netdev statistic for packets dropped due to L2 address
mismatch (rx_otherhost_dropped).
- Continue work annotating skb drop reasons.
- Accept alternative netdev names (ALT_IFNAME) in more netlink
requests.
- Add VLAN support for AF_PACKET SOCK_RAW GSO.
- Allow receiving skb mark from the socket as a cmsg.
- Enable memcg accounting for veth queues, sysctl tables and IPv6.
BPF
---
- Add libbpf support for User Statically-Defined Tracing (USDTs).
- Speed up symbol resolution for kprobes multi-link attachments.
- Support storing typed pointers to referenced and unreferenced
objects in BPF maps.
- Add support for BPF link iterator.
- Introduce access to remote CPU map elements in BPF per-cpu map.
- Allow middle-of-the-road settings for the
kernel.unprivileged_bpf_disabled sysctl.
- Implement basic types of dynamic pointers e.g. to allow for
dynamically sized ringbuf reservations without extra memory copies.
Protocols
---------
- Retire port only listening_hash table, add a second bind table
hashed by port and address. Avoid linear list walk when binding
to very popular ports (e.g. 443).
- Add bridge FDB bulk flush filtering support allowing user space
to remove all FDB entries matching a condition.
- Introduce accept_unsolicited_na sysctl for IPv6 to implement
router-side changes for RFC9131.
- Support for MPTCP path manager in user space.
- Add MPTCP support for fallback to regular TCP for connections
that have never connected additional subflows or transmitted
out-of-sequence data (partial support for RFC8684 fallback).
- Avoid races in MPTCP-level window tracking, stabilize and improve
throughput.
- Support lockless operation of GRE tunnels with seq numbers enabled.
- WiFi support for host based BSS color collision detection.
- Add support for SO_TXTIME/SCM_TXTIME on CAN sockets.
- Support transmission w/o flow control in CAN ISOTP (ISO 15765-2).
- Support zero-copy Tx with TLS 1.2 crypto offload (sendfile).
- Allow matching on the number of VLAN tags via tc-flower.
- Add tracepoint for tcp_set_ca_state().
Driver API
----------
- Improve error reporting from classifier and action offload.
- Add support for listing line cards in switches (devlink).
- Add helpers for reporting page pool statistics with ethtool -S.
- Add support for reading clock cycles when using PTP virtual clocks,
instead of having the driver convert to time before reporting.
This makes it possible to report time from different vclocks.
- Support configuring low-latency Tx descriptor push via ethtool.
- Separate Clause 22 and Clause 45 MDIO accesses more explicitly.
New hardware / drivers
----------------------
- Ethernet:
- Marvell's Octeon NIC PCI Endpoint support (octeon_ep)
- Sunplus SP7021 SoC (sp7021_emac)
- Add support for Renesas RZ/V2M (in ravb)
- Add support for MediaTek mt7986 switches (in mtk_eth_soc)
- Ethernet PHYs:
- ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting)
- TI DP83TD510 PHY
- Microchip LAN8742/LAN88xx PHYs
- WiFi:
- Driver for pureLiFi X, XL, XC devices (plfxlc)
- Driver for Silicon Labs devices (wfx)
- Support for WCN6750 (in ath11k)
- Support Realtek 8852ce devices (in rtw89)
- Mobile:
- MediaTek T700 modems (Intel 5G 5000 M.2 cards)
- CAN:
- ctucanfd: add support for CTU CAN FD open-source IP core
from Czech Technical University in Prague
Drivers
-------
- Delete a number of old drivers still using virt_to_bus().
- Ethernet NICs:
- intel: support TSO on tunnels MPLS
- broadcom: support multi-buffer XDP
- nfp: support VF rate limiting
- sfc: use hardware tx timestamps for more than PTP
- mlx5: multi-port eswitch support
- hyper-v: add support for XDP_REDIRECT
- atlantic: XDP support (including multi-buffer)
- macb: improve real-time perf by deferring Tx processing to NAPI
- High-speed Ethernet switches:
- mlxsw: implement basic line card information querying
- prestera: add support for traffic policing on ingress and egress
- Embedded Ethernet switches:
- lan966x: add support for packet DMA (FDMA)
- lan966x: add support for PTP programmable pins
- ti: cpsw_new: enable bc/mc storm prevention
- Qualcomm 802.11ax WiFi (ath11k):
- Wake-on-WLAN support for QCA6390 and WCN6855
- device recovery (firmware restart) support
- support setting Specific Absorption Rate (SAR) for WCN6855
- read country code from SMBIOS for WCN6855/QCA6390
- enable keep-alive during WoWLAN suspend
- implement remain-on-channel support
- MediaTek WiFi (mt76):
- support Wireless Ethernet Dispatch offloading packet movement
between the Ethernet switch and WiFi interfaces
- non-standard VHT MCS10-11 support
- mt7921 AP mode support
- mt7921 IPv6 NS offload support
- Ethernet PHYs:
- micrel: ksz9031/ksz9131: cabletest support
- lan87xx: SQI support for T1 PHYs
- lan937x: add interrupt support for link detection
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmKNMPQACgkQMUZtbf5S
IrsRARAAuDyYs6jFYB3p+xazZdOnbF4iAgVv71+DQGvmsCl6CB9OrsNZMlvE85OL
Q3gjcRbgjrkN4lhgI8DmiGYbsUJnAvVjFdNjccz1Z/vTLYvuIM0ol54MUp5S+9WY
StncOJkOGJxxR/Gi5gzVmejPDsysU3Jik+hm/fpIcz8pybXxAsFKU5waY5qfl+/T
TZepfV0VCfqRDjqcF1qA5+jJZNU8pdodQlZ1+mh8bwu6Jk1ZkWkj6Ov8MWdwQldr
LnPeK/9hIGzkdJYHZfajxA3t8D0K5CHzSuih2bJ9ry8ZXgVBkXEThew778/R5izW
uB0YZs9COFlrIP7XHjtRTy/2xHOdYIPlj2nWhVdfuQDX8Crvt4VRN6EZ1rjko1ZJ
WanfG6WHF8NH5pXBRQbh3kIMKBnYn6OIzuCfCQSqd+niHcxFIM4vRiggeXI5C5TW
vJgEWfK6X+NfDiFVa3xyCrEmp5ieA/pNecpwd8rVkql+MtFAAw4vfsotLKOJEAru
J/XL6UE+YuLqIJV9ACZ9x1AFXXAo661jOxBunOo4VXhXVzWS9lYYz5r5ryIkgT/8
/Fr0zjANJWgfIuNdIBtYfQ4qG+LozGq038VA06RhFUAZ5tF9DzhqJs2Q2AFuWWBC
ewCePJVqo1j2Ceq2mGonXRt47OEnlePoOxTk9W+cKZb7ZWE+zEo=
=Wjii
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core
----
- Support TCPv6 segmentation offload with super-segments larger than
64k bytes using the IPv6 Jumbogram extension header (AKA BIG TCP).
- Generalize skb freeing deferral to per-cpu lists, instead of
per-socket lists.
- Add a netdev statistic for packets dropped due to L2 address
mismatch (rx_otherhost_dropped).
- Continue work annotating skb drop reasons.
- Accept alternative netdev names (ALT_IFNAME) in more netlink
requests.
- Add VLAN support for AF_PACKET SOCK_RAW GSO.
- Allow receiving skb mark from the socket as a cmsg.
- Enable memcg accounting for veth queues, sysctl tables and IPv6.
BPF
---
- Add libbpf support for User Statically-Defined Tracing (USDTs).
- Speed up symbol resolution for kprobes multi-link attachments.
- Support storing typed pointers to referenced and unreferenced
objects in BPF maps.
- Add support for BPF link iterator.
- Introduce access to remote CPU map elements in BPF per-cpu map.
- Allow middle-of-the-road settings for the
kernel.unprivileged_bpf_disabled sysctl.
- Implement basic types of dynamic pointers e.g. to allow for
dynamically sized ringbuf reservations without extra memory copies.
Protocols
---------
- Retire port only listening_hash table, add a second bind table
hashed by port and address. Avoid linear list walk when binding to
very popular ports (e.g. 443).
- Add bridge FDB bulk flush filtering support allowing user space to
remove all FDB entries matching a condition.
- Introduce accept_unsolicited_na sysctl for IPv6 to implement
router-side changes for RFC9131.
- Support for MPTCP path manager in user space.
- Add MPTCP support for fallback to regular TCP for connections that
have never connected additional subflows or transmitted
out-of-sequence data (partial support for RFC8684 fallback).
- Avoid races in MPTCP-level window tracking, stabilize and improve
throughput.
- Support lockless operation of GRE tunnels with seq numbers enabled.
- WiFi support for host based BSS color collision detection.
- Add support for SO_TXTIME/SCM_TXTIME on CAN sockets.
- Support transmission w/o flow control in CAN ISOTP (ISO 15765-2).
- Support zero-copy Tx with TLS 1.2 crypto offload (sendfile).
- Allow matching on the number of VLAN tags via tc-flower.
- Add tracepoint for tcp_set_ca_state().
Driver API
----------
- Improve error reporting from classifier and action offload.
- Add support for listing line cards in switches (devlink).
- Add helpers for reporting page pool statistics with ethtool -S.
- Add support for reading clock cycles when using PTP virtual clocks,
instead of having the driver convert to time before reporting. This
makes it possible to report time from different vclocks.
- Support configuring low-latency Tx descriptor push via ethtool.
- Separate Clause 22 and Clause 45 MDIO accesses more explicitly.
New hardware / drivers
----------------------
- Ethernet:
- Marvell's Octeon NIC PCI Endpoint support (octeon_ep)
- Sunplus SP7021 SoC (sp7021_emac)
- Add support for Renesas RZ/V2M (in ravb)
- Add support for MediaTek mt7986 switches (in mtk_eth_soc)
- Ethernet PHYs:
- ADIN1100 industrial PHYs (w/ 10BASE-T1L and SQI reporting)
- TI DP83TD510 PHY
- Microchip LAN8742/LAN88xx PHYs
- WiFi:
- Driver for pureLiFi X, XL, XC devices (plfxlc)
- Driver for Silicon Labs devices (wfx)
- Support for WCN6750 (in ath11k)
- Support Realtek 8852ce devices (in rtw89)
- Mobile:
- MediaTek T700 modems (Intel 5G 5000 M.2 cards)
- CAN:
- ctucanfd: add support for CTU CAN FD open-source IP core from
Czech Technical University in Prague
Drivers
-------
- Delete a number of old drivers still using virt_to_bus().
- Ethernet NICs:
- intel: support TSO on tunnels MPLS
- broadcom: support multi-buffer XDP
- nfp: support VF rate limiting
- sfc: use hardware tx timestamps for more than PTP
- mlx5: multi-port eswitch support
- hyper-v: add support for XDP_REDIRECT
- atlantic: XDP support (including multi-buffer)
- macb: improve real-time perf by deferring Tx processing to NAPI
- High-speed Ethernet switches:
- mlxsw: implement basic line card information querying
- prestera: add support for traffic policing on ingress and egress
- Embedded Ethernet switches:
- lan966x: add support for packet DMA (FDMA)
- lan966x: add support for PTP programmable pins
- ti: cpsw_new: enable bc/mc storm prevention
- Qualcomm 802.11ax WiFi (ath11k):
- Wake-on-WLAN support for QCA6390 and WCN6855
- device recovery (firmware restart) support
- support setting Specific Absorption Rate (SAR) for WCN6855
- read country code from SMBIOS for WCN6855/QCA6390
- enable keep-alive during WoWLAN suspend
- implement remain-on-channel support
- MediaTek WiFi (mt76):
- support Wireless Ethernet Dispatch offloading packet movement
between the Ethernet switch and WiFi interfaces
- non-standard VHT MCS10-11 support
- mt7921 AP mode support
- mt7921 IPv6 NS offload support
- Ethernet PHYs:
- micrel: ksz9031/ksz9131: cabletest support
- lan87xx: SQI support for T1 PHYs
- lan937x: add interrupt support for link detection"
* tag 'net-next-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1809 commits)
ptp: ocp: Add firmware header checks
ptp: ocp: fix PPS source selector debugfs reporting
ptp: ocp: add .init function for sma_op vector
ptp: ocp: vectorize the sma accessor functions
ptp: ocp: constify selectors
ptp: ocp: parameterize input/output sma selectors
ptp: ocp: revise firmware display
ptp: ocp: add Celestica timecard PCI ids
ptp: ocp: Remove #ifdefs around PCI IDs
ptp: ocp: 32-bit fixups for pci start address
Revert "net/smc: fix listen processing for SMC-Rv2"
ath6kl: Use cc-disable-warning to disable -Wdangling-pointer
selftests/bpf: Dynptr tests
bpf: Add dynptr data slices
bpf: Add bpf_dynptr_read and bpf_dynptr_write
bpf: Dynptr support for ring buffers
bpf: Add bpf_dynptr_from_mem for local dynptrs
bpf: Add verifier support for dynptrs
bpf: Suppress 'passing zero to PTR_ERR' warning
bpf: Introduce bpf_arch_text_invalidate for bpf_prog_pack
...
strlcpy() is marked deprecated and should not be used, because
it doesn't limit the source length.
The preferred interface for when strlcpy()'s return value is not
checked (truncation) is strscpy().
[ mingo: Tweaked the changelog ]
Signed-off-by: XueBing Chen <chenxuebing@jari.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/730f0fef.a33.180fa69880f.Coremail.chenxuebing@jari.cn
Commit ddd7ed842627 ("x86/kvm: Alloc dummy async #PF token outside of
raw spinlock") leads to the following Smatch static checker warning:
arch/x86/kernel/kvm.c:212 kvm_async_pf_task_wake()
warn: sleeping in atomic context
arch/x86/kernel/kvm.c
202 raw_spin_lock(&b->lock);
203 n = _find_apf_task(b, token);
204 if (!n) {
205 /*
206 * Async #PF not yet handled, add a dummy entry for the token.
207 * Allocating the token must be down outside of the raw lock
208 * as the allocator is preemptible on PREEMPT_RT kernels.
209 */
210 if (!dummy) {
211 raw_spin_unlock(&b->lock);
--> 212 dummy = kzalloc(sizeof(*dummy), GFP_KERNEL);
^^^^^^^^^^
Smatch thinks the caller has preempt disabled. The `smdb.py preempt
kvm_async_pf_task_wake` output call tree is:
sysvec_kvm_asyncpf_interrupt() <- disables preempt
-> __sysvec_kvm_asyncpf_interrupt()
-> kvm_async_pf_task_wake()
The caller is this:
arch/x86/kernel/kvm.c
290 DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt)
291 {
292 struct pt_regs *old_regs = set_irq_regs(regs);
293 u32 token;
294
295 ack_APIC_irq();
296
297 inc_irq_stat(irq_hv_callback_count);
298
299 if (__this_cpu_read(apf_reason.enabled)) {
300 token = __this_cpu_read(apf_reason.token);
301 kvm_async_pf_task_wake(token);
302 __this_cpu_write(apf_reason.token, 0);
303 wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
304 }
305
306 set_irq_regs(old_regs);
307 }
The DEFINE_IDTENTRY_SYSVEC() is a wrapper that calls this function
from the call_on_irqstack_cond(). It's inside the call_on_irqstack_cond()
where preempt is disabled (unless it's already disabled). The
irq_enter/exit_rcu() functions disable/enable preempt.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the raw spinlock in kvm_async_pf_task_wake() before allocating the
the dummy async #PF token, the allocator is preemptible on PREEMPT_RT
kernels and must not be called from truly atomic contexts.
Opportunistically document why it's ok to loop on allocation failure,
i.e. why the function won't get stuck in an infinite loop.
Reported-by: Yajun Deng <yajun.deng@linux.dev>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set the starting uABI size of KVM's guest FPU to 'struct kvm_xsave',
i.e. to KVM's historical uABI size. When saving FPU state for usersapce,
KVM (well, now the FPU) sets the FP+SSE bits in the XSAVE header even if
the host doesn't support XSAVE. Setting the XSAVE header allows the VM
to be migrated to a host that does support XSAVE without the new host
having to handle FPU state that may or may not be compatible with XSAVE.
Setting the uABI size to the host's default size results in out-of-bounds
writes (setting the FP+SSE bits) and data corruption (that is thankfully
caught by KASAN) when running on hosts without XSAVE, e.g. on Core2 CPUs.
WARN if the default size is larger than KVM's historical uABI size; all
features that can push the FPU size beyond the historical size must be
opt-in.
==================================================================
BUG: KASAN: slab-out-of-bounds in fpu_copy_uabi_to_guest_fpstate+0x86/0x130
Read of size 8 at addr ffff888011e33a00 by task qemu-build/681
CPU: 1 PID: 681 Comm: qemu-build Not tainted 5.18.0-rc5-KASAN-amd64 #1
Hardware name: /DG35EC, BIOS ECG3510M.86A.0118.2010.0113.1426 01/13/2010
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x45
print_report.cold+0x45/0x575
kasan_report+0x9b/0xd0
fpu_copy_uabi_to_guest_fpstate+0x86/0x130
kvm_arch_vcpu_ioctl+0x72a/0x1c50 [kvm]
kvm_vcpu_ioctl+0x47f/0x7b0 [kvm]
__x64_sys_ioctl+0x5de/0xc90
do_syscall_64+0x31/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
Allocated by task 0:
(stack is not available)
The buggy address belongs to the object at ffff888011e33800
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 0 bytes to the right of
512-byte region [ffff888011e33800, ffff888011e33a00)
The buggy address belongs to the physical page:
page:0000000089cd4adb refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11e30
head:0000000089cd4adb order:2 compound_mapcount:0 compound_pincount:0
flags: 0x4000000000010200(slab|head|zone=1)
raw: 4000000000010200 dead000000000100 dead000000000122 ffff888001041c80
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888011e33900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff888011e33980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888011e33a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff888011e33a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888011e33b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Fixes: be50b2065d ("kvm: x86: Add support for getting/setting expanded xstate buffer")
Fixes: c60427dd50 ("x86/fpu: Add uabi_size to guest_fpu")
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Zdenek Kaspar <zkaspar82@gmail.com>
Message-Id: <20220504001219.983513-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Added Sv57x4 support for G-stage page table
- Added range based local HFENCE functions
- Added remote HFENCE functions based on VCPU requests
- Added ISA extension registers in ONE_REG interface
- Updated KVM RISC-V maintainers entry to cover selftests support
-----BEGIN PGP SIGNATURE-----
iQIyBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmKHGu8ACgkQrUjsVaLH
LAe1sQ/40ltbl/v0cW+zkuUOem+apmJMhtoCfh2Pv00yUYftUNw01Uu+NN04T70x
PYwbu0O8j4dgIFNRPU7VQBVI+fJydkgEr3kpk8UOCCGKiE0NAcFoQv70ngPObc4W
L425i2RviZuQUXLTFsoLOb246p8V8lkfbEQKqWksFEROYWFbdNKmaLpfVqq3Bia2
+G8L2OyAHGjUXgIdOnflZHxowJg4ueGob3iH+4AhZNUpIQYtlKSfi/eo0vmzf5Uz
bD35o6y4G7NnZJyZoKb3QAEt0WQ55YDsNN62XrULQ7GEuWnpez+Jhw3jtrAr59Q7
m8n93NMKKJ9CbnsspFJ+4nHCd2Gb4i99Py70IW6Ro22DL8KRrLDv2ZQi3dJCGrAT
MtER+12coglkgjhDmLn6MMEjWkgbXXxQCEs4OQ8VMORtHAsOQEszu5TCEnihXr2q
+uUZ5O0G6eDowctOVMTdqVMtj1u1AT7fZ68evvk4omNnoFWjkQzd4sVPNDJtK+nC
7mA9IUyC2LSvr/oNNpcuIZsKU6OzQUQ5ISTMpbP/HJInFcvYbJTl0I8UcvjzlImo
81CZTUQOY9kQE+VUTHcGqPr0TjN/YlfF//koiCfeTycN0jbRZZ9rpcRQ38R8sDsS
yy7JQqwpi/x8me9ldt5r19ky5zMlCKpnQfGX6ws+umhqVEHBKw==
=Xznv
-----END PGP SIGNATURE-----
Merge tag 'kvm-riscv-5.19-1' of https://github.com/kvm-riscv/linux into HEAD
KVM/riscv changes for 5.19
- Added Sv57x4 support for G-stage page table
- Added range based local HFENCE functions
- Added remote HFENCE functions based on VCPU requests
- Added ISA extension registers in ONE_REG interface
- Updated KVM RISC-V maintainers entry to cover selftests support
- Add support for the ARMv8.6 WFxT extension
- Guard pages for the EL2 stacks
- Trap and emulate AArch32 ID registers to hide unsupported features
- Ability to select and save/restore the set of hypercalls exposed
to the guest
- Support for PSCI-initiated suspend in collaboration with userspace
- GICv3 register-based LPI invalidation support
- Move host PMU event merging into the vcpu data structure
- GICv3 ITS save/restore fixes
- The usual set of small-scale cleanups and fixes
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmKGAGsPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDB/gQAMhyZ+wCG0OMEZhwFF6iDfxVEX2Kw8L41NtD
a/e6LDWuIOGihItpRkYROc5myG74D7XckF2Bz3G7HJoU4vhwHOV/XulE26GFizoC
O1GVRekeSUY81wgS1yfo0jojLupBkTjiq3SjTHoDP7GmCM0qDPBtA0QlMRzd2bMs
Kx0+UUXZUHFSTXc7Lp4vqNH+tMp7se+yRx7hxm6PCM5zG+XYJjLxnsZ0qpchObgU
7f6YFojsLUs1SexgiUqJ1RChVQ+FkgICh5HyzORvGtHNNzK6D2sIbsW6nqMGAMql
Kr3A5O/VOkCztSYnLxaa76/HqD21mvUrXvr3grhabNc7rOmuzWV0dDgr6c6wHKHb
uNCtH4d7Ra06gUrEOrfsgLOLn0Zqik89y6aIlMsnTudMg9gMNgFHy1jz4LM7vMkY
FS5AVj059heg2uJcfgTvzzcqneyuBLBmF3dS4coowO6oaj8SycpaEmP5e89zkPMI
1kk8d0e6RmXuCh/2AJ8GxxnKvBPgqp2mMKXOCJ8j4AmHEDX/CKpEBBqIWLKkplUU
8DGiOWJUtRZJg398dUeIpiVLoXJthMODjAnkKkuhiFcQbXomlwgg7YSnNAz6TRED
Z7KR2leC247kapHnnagf02q2wED8pBeyrxbQPNdrHtSJ9Usm4nTkY443HgVTJW3s
aTwPZAQ7
=mh7W
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.19
- Add support for the ARMv8.6 WFxT extension
- Guard pages for the EL2 stacks
- Trap and emulate AArch32 ID registers to hide unsupported features
- Ability to select and save/restore the set of hypercalls exposed
to the guest
- Support for PSCI-initiated suspend in collaboration with userspace
- GICv3 register-based LPI invalidation support
- Move host PMU event merging into the vcpu data structure
- GICv3 ITS save/restore fixes
- The usual set of small-scale cleanups and fixes
[Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated
from 4 to 6. - Paolo]
- Update the Energy Model support code to allow the Energy Model to be
artificial, which means that the power values may not be on a uniform
scale with other devices providing power information, and update the
cpufreq_cooling and devfreq_cooling thermal drivers to support
artificial Energy Models (Lukasz Luba).
- Make DTPM check the Energy Model type (Lukasz Luba).
- Fix policy counter decrementation in cpufreq if Energy Model is in
use (Pierre Gondois).
- Add CPU-based scaling support to passive devfreq governor (Saravana
Kannan, Chanwoo Choi).
- Update the rk3399_dmc devfreq driver (Brian Norris).
- Export dev_pm_ops instead of suspend() and resume() in the IIO
chemical scd30 driver (Jonathan Cameron).
- Add namespace variants of EXPORT[_GPL]_SIMPLE_DEV_PM_OPS and
PM-runtime counterparts (Jonathan Cameron).
- Move symbol exports in the IIO chemical scd30 driver into the
IIO_SCD30 namespace (Jonathan Cameron).
- Avoid device PM-runtime usage count underflows (Rafael Wysocki).
- Allow dynamic debug to control printing of PM messages (David
Cohen).
- Fix some kernel-doc comments in hibernation code (Yang Li, Haowen
Bai).
- Preserve ACPI-table override during hibernation (Amadeusz Sławiński).
- Improve support for suspend-to-RAM for PSCI OSI mode (Ulf Hansson).
- Make Intel RAPL power capping driver support the RaptorLake and
AlderLake N processors (Zhang Rui, Sumeet Pawnikar).
- Remove redundant store to value after multiply in the RAPL power
capping driver (Colin Ian King).
- Add AlderLake processor support to the intel_idle driver (Zhang Rui).
- Fix regression leading to no genpd governor in the PSCI cpuidle
driver and fix the riscv-sbi cpuidle driver to allow a genpd
governor to be used (Ulf Hansson).
- Fix cpufreq governor clean up code to avoid using kfree() directly
to free kobject-based items (Kevin Hao).
- Prepare cpufreq for powerpc's asm/prom.h cleanup (Christophe Leroy).
- Make intel_pstate notify frequency invariance code when no_turbo is
turned on and off (Chen Yu).
- Add Sapphire Rapids OOB mode support to intel_pstate (Srinivas
Pandruvada).
- Make cpufreq avoid unnecessary frequency updates due to mismatch
between hardware and the frequency table (Viresh Kumar).
- Make remove_cpu_dev_symlink() clear the real_cpus mask to simplify
code (Viresh Kumar).
- Rearrange cpufreq_offline() and cpufreq_remove_dev() to make the
calling convention for some driver callbacks consistent (Rafael
Wysocki).
- Avoid accessing half-initialized cpufreq policies from the show()
and store() sysfs functions (Schspa Shi).
- Rearrange cpufreq_offline() to make the calling convention for some
driver callbacks consistent (Schspa Shi).
- Update CPPC handling in cpufreq (Pierre Gondois).
- Extend dev_pm_domain_detach() doc (Krzysztof Kozlowski).
- Move genpd's time-accounting to ktime_get_mono_fast_ns() (Ulf
Hansson).
- Improve the way genpd deals with its governors (Ulf Hansson).
- Update the turbostat utility to version 2022.04.16 (Len Brown,
Dan Merillat, Sumeet Pawnikar, Zephaniah E. Loss-Cutler-Hull, Chen
Yu).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmKL3hsSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxW4oP/RzMh6dclWXs3J/gUCKTqRepq6cb80tq
Q2r9xRRHwy6ZH/PVddGDHmhQ7d3NAv13s4srA9kznZognF3hzuxnGau226ilDqHh
qxVSBRjWY9ijxRBvkcCaa6HZm4Chb91pUX0CLpdYSl9BTgIdk66HZYaMsKhHU/di
j7KKHPdKyyQkssWnMjGEyuaF+UebiEgISCF3+X0eb6c1m7GHXpgLJVxNy0pKkUdK
j+n6+ms12OlVLtg1eIl0J5824w/rkK3ZdqfEXJSq++mNMqSj/KCI3yWpzsLKp9AB
xxhox/tPgJVyON8Vtbb2IkWkiQUKeSrAGIUYXWmnwIZYLPSGD7BPzr82Cxr7S/ez
imMB+1Qd3SsOQ9EdI9rGYgNsEF2vOs1xjMehSdUdmTz148IzBOBt4YyQeb/mfXqH
nh9eVuFCzqH1lAayYt6iP1+V5gQn9as/+rR91k4k4A6OKXomuQUGORLeHfuKMfNH
eBZ72tdXqiq6z+ag3lY3pBAMSm11epCOa3VR6QNaC7hrlY3AZP+o3tIUL6W813b+
V3l1gWApGHZE1hiDM95dll/dIt9IZpTRd3dlqF/YnFW7fPDrz71EGvhrZpO7vdO0
/G6eJcCDjqJVcbCE8Y77I6/AXjpVQ7PRPeNx6aW7jPcQhpVIgcsF2BGjk9anjXDs
3yHJs9R/HMmA
=Hewm
-----END PGP SIGNATURE-----
Merge tag 'pm-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These add support for 'artificial' Energy Models in which power
numbers for different entities may be in different scales, add support
for some new hardware, fix bugs and clean up code in multiple places.
Specifics:
- Update the Energy Model support code to allow the Energy Model to
be artificial, which means that the power values may not be on a
uniform scale with other devices providing power information, and
update the cpufreq_cooling and devfreq_cooling thermal drivers to
support artificial Energy Models (Lukasz Luba).
- Make DTPM check the Energy Model type (Lukasz Luba).
- Fix policy counter decrementation in cpufreq if Energy Model is in
use (Pierre Gondois).
- Add CPU-based scaling support to passive devfreq governor (Saravana
Kannan, Chanwoo Choi).
- Update the rk3399_dmc devfreq driver (Brian Norris).
- Export dev_pm_ops instead of suspend() and resume() in the IIO
chemical scd30 driver (Jonathan Cameron).
- Add namespace variants of EXPORT[_GPL]_SIMPLE_DEV_PM_OPS and
PM-runtime counterparts (Jonathan Cameron).
- Move symbol exports in the IIO chemical scd30 driver into the
IIO_SCD30 namespace (Jonathan Cameron).
- Avoid device PM-runtime usage count underflows (Rafael Wysocki).
- Allow dynamic debug to control printing of PM messages (David
Cohen).
- Fix some kernel-doc comments in hibernation code (Yang Li, Haowen
Bai).
- Preserve ACPI-table override during hibernation (Amadeusz
Sławiński).
- Improve support for suspend-to-RAM for PSCI OSI mode (Ulf Hansson).
- Make Intel RAPL power capping driver support the RaptorLake and
AlderLake N processors (Zhang Rui, Sumeet Pawnikar).
- Remove redundant store to value after multiply in the RAPL power
capping driver (Colin Ian King).
- Add AlderLake processor support to the intel_idle driver (Zhang
Rui).
- Fix regression leading to no genpd governor in the PSCI cpuidle
driver and fix the riscv-sbi cpuidle driver to allow a genpd
governor to be used (Ulf Hansson).
- Fix cpufreq governor clean up code to avoid using kfree() directly
to free kobject-based items (Kevin Hao).
- Prepare cpufreq for powerpc's asm/prom.h cleanup (Christophe
Leroy).
- Make intel_pstate notify frequency invariance code when no_turbo is
turned on and off (Chen Yu).
- Add Sapphire Rapids OOB mode support to intel_pstate (Srinivas
Pandruvada).
- Make cpufreq avoid unnecessary frequency updates due to mismatch
between hardware and the frequency table (Viresh Kumar).
- Make remove_cpu_dev_symlink() clear the real_cpus mask to simplify
code (Viresh Kumar).
- Rearrange cpufreq_offline() and cpufreq_remove_dev() to make the
calling convention for some driver callbacks consistent (Rafael
Wysocki).
- Avoid accessing half-initialized cpufreq policies from the show()
and store() sysfs functions (Schspa Shi).
- Rearrange cpufreq_offline() to make the calling convention for some
driver callbacks consistent (Schspa Shi).
- Update CPPC handling in cpufreq (Pierre Gondois).
- Extend dev_pm_domain_detach() doc (Krzysztof Kozlowski).
- Move genpd's time-accounting to ktime_get_mono_fast_ns() (Ulf
Hansson).
- Improve the way genpd deals with its governors (Ulf Hansson).
- Update the turbostat utility to version 2022.04.16 (Len Brown, Dan
Merillat, Sumeet Pawnikar, Zephaniah E. Loss-Cutler-Hull, Chen Yu)"
* tag 'pm-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (94 commits)
PM: domains: Trust domain-idle-states from DT to be correct by genpd
PM: domains: Measure power-on/off latencies in genpd based on a governor
PM: domains: Allocate governor data dynamically based on a genpd governor
PM: domains: Clean up some code in pm_genpd_init() and genpd_remove()
PM: domains: Fix initialization of genpd's next_wakeup
PM: domains: Fixup QoS latency measurements for IRQ safe devices in genpd
PM: domains: Measure suspend/resume latencies in genpd based on governor
PM: domains: Move the next_wakeup variable into the struct gpd_timing_data
PM: domains: Allocate gpd_timing_data dynamically based on governor
PM: domains: Skip another warning in irq_safe_dev_in_sleep_domain()
PM: domains: Rename irq_safe_dev_in_no_sleep_domain() in genpd
PM: domains: Don't check PM_QOS_FLAG_NO_POWER_OFF in genpd
PM: domains: Drop redundant code for genpd always-on governor
PM: domains: Add GENPD_FLAG_RPM_ALWAYS_ON for the always-on governor
powercap: intel_rapl: remove redundant store to value after multiply
cpufreq: CPPC: Enable dvfs_possible_from_any_cpu
cpufreq: CPPC: Enable fast_switch
ACPI: CPPC: Assume no transition latency if no PCCT
ACPI: bus: Set CPPC _OSC bits for all and when CPPC_LIB is supported
ACPI: CPPC: Check _OSC for flexible address space
...
- Update ACPICA code in the kernel to upstream revision 20220331
including the following changes:
* Add support for the Windows 11 _OSI string (Mario Limonciello)
* Add the CFMWS subtable to the CEDT table (Lawrence Hileman).
* iASL: NHLT: Treat Terminator as specific_config (Piotr Maziarz).
* iASL: NHLT: Fix parsing undocumented bytes at the end of Endpoint
Descriptor (Piotr Maziarz).
* iASL: NHLT: Rename linux specific strucures to device_info (Piotr
Maziarz).
* Add new ACPI 6.4 semantics to Load() and LoadTable() (Bob Moore).
* Clean up double word in comment (Tom Rix).
* Update copyright notices to the year 2022 (Bob Moore).
* Remove some tabs and // comments - automated cleanup (Bob Moore).
* Replace zero-length array with flexible-array member (Gustavo A. R.
Silva).
* Interpreter: Add units to time variable names (Paul Menzel).
* Add support for ARM Performance Monitoring Unit Table (Besar
Wicaksono).
* Inform users about ACPI spec violation related to sleep length (Paul
Menzel).
* iASL/MADT: Add OEM-defined subtable (Bob Moore).
* Interpreter: Fix some typo mistakes (Selvarasu Ganesan).
* Updates for revision E.d of IORT (Shameer Kolothum).
* Use ACPI_FORMAT_UINT64 for 64-bit output (Bob Moore).
- Improve debug messages in the ACPI device PM code (Rafael Wysocki).
- Block ASUS B1400CEAE from suspend to idle by default (Mario
Limonciello).
- Improve handling of PCI devices that are in D3cold during system
initialization (Rafael Wysocki).
- Fix BERT error region memory mapping (Lorenzo Pieralisi).
- Add support for NVIDIA 16550-compatible port subtype to the SPCR
parsing code (Jeff Brasen).
- Use static for BGRT_SHOW kobj_attribute defines (Tom Rix).
- Fix missing prototype warning for acpi_agdi_init() (Ilkka Koskinen).
- Fix missing ERST record ID in the APEI code (Liu Xinpeng).
- Make APEI error injection to refuse to inject into the zero
page (Tony Luck).
- Correct description of INT3407 / INT3532 DPTF attributes in sysfs
(Sumeet Pawnikar).
- Add support for high frequency impedance notification to the DPTF
driver (Sumeet Pawnikar).
- Make mp_config_acpi_gsi() a void function (Li kunyu).
- Unify Package () representation for properties in the ACPI device
properties documentation (Andy Shevchenko).
- Include UUID in _DSM evaluation warning (Michael Niewöhner).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmKL2NUSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxozgP/2RDYurr4b7welCumisfd26V8ldPTGfh
hZLQaNMYnzlPyoazOMkp6fi4PTaVxVjz75DJw7gjIYXO+ChMscZNHHvpHGXk6R+Q
H3wM1E6w7jf6Tffg8SuhC38Q1Oh3JBLqPXrzKmuku6Wma6GtqAKtCcxCIb6jj9Bc
l6xU+FT5MHz2AKtHRqDPrMYYY/v7w7Krnu7EbsWnqYgKjfYyE5CJZocPm5bLcqI4
ZMYcyca8wZu68cj0nR79O1sc1UY4RWDupNTzro8m6Nl2fSWzh+o6aWdjNXqY9fHb
TM3s4nIHH3WVppZSZutX0wnuz4NRFlRNF85m0NXDM5hKoy/hsahTjrWhtKcrKXzv
2G/1NoxMBgpr55oSvPPrFUnj/Dne4mnM9ftp7cGZj9lwEWg9qXSbBSRa3XYAATps
GoxIyd+cP5lGXtur/eqV/HfDQqJ4L7TlVL2HjH1UcH0AdL4D1CF5Ybjfb60xSVC1
MYRGZTCibWz3YL4191YebmalOFmvQohldC/U/RZgOlPL8QygaSo1+Unn7xp0txng
vtbePiEpaUAGzjlvwi8NscWXCOekspc0GImfkEgsvMrDOJJPqRlUl//m2zGsZzDK
VV9SRhy28Xm5oxJnayHFRDXHEBQUVYwsL4k8X1wiYBsq36X98C0tpF+5VXNyZS7i
UKUSJQpKptu6
=XH6/
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"These update the ACPICA kernel code to upstream revision 20220331,
improve handling of PCI devices that are in D3cold during system
initialization, add support for a few features, fix bugs and clean up
code.
Specifics:
- Update ACPICA code in the kernel to upstream revision 20220331
including the following changes:
- Add support for the Windows 11 _OSI string (Mario Limonciello)
- Add the CFMWS subtable to the CEDT table (Lawrence Hileman).
- iASL: NHLT: Treat Terminator as specific_config (Piotr
Maziarz).
- iASL: NHLT: Fix parsing undocumented bytes at the end of
Endpoint Descriptor (Piotr Maziarz).
- iASL: NHLT: Rename linux specific strucures to device_info
(Piotr Maziarz).
- Add new ACPI 6.4 semantics to Load() and LoadTable() (Bob
Moore).
- Clean up double word in comment (Tom Rix).
- Update copyright notices to the year 2022 (Bob Moore).
- Remove some tabs and // comments - automated cleanup (Bob
Moore).
- Replace zero-length array with flexible-array member (Gustavo
A. R. Silva).
- Interpreter: Add units to time variable names (Paul Menzel).
- Add support for ARM Performance Monitoring Unit Table (Besar
Wicaksono).
- Inform users about ACPI spec violation related to sleep length
(Paul Menzel).
- iASL/MADT: Add OEM-defined subtable (Bob Moore).
- Interpreter: Fix some typo mistakes (Selvarasu Ganesan).
- Updates for revision E.d of IORT (Shameer Kolothum).
- Use ACPI_FORMAT_UINT64 for 64-bit output (Bob Moore).
- Improve debug messages in the ACPI device PM code (Rafael Wysocki).
- Block ASUS B1400CEAE from suspend to idle by default (Mario
Limonciello).
- Improve handling of PCI devices that are in D3cold during system
initialization (Rafael Wysocki).
- Fix BERT error region memory mapping (Lorenzo Pieralisi).
- Add support for NVIDIA 16550-compatible port subtype to the SPCR
parsing code (Jeff Brasen).
- Use static for BGRT_SHOW kobj_attribute defines (Tom Rix).
- Fix missing prototype warning for acpi_agdi_init() (Ilkka
Koskinen).
- Fix missing ERST record ID in the APEI code (Liu Xinpeng).
- Make APEI error injection to refuse to inject into the zero page
(Tony Luck).
- Correct description of INT3407 / INT3532 DPTF attributes in sysfs
(Sumeet Pawnikar).
- Add support for high frequency impedance notification to the DPTF
driver (Sumeet Pawnikar).
- Make mp_config_acpi_gsi() a void function (Li kunyu).
- Unify Package () representation for properties in the ACPI device
properties documentation (Andy Shevchenko).
- Include UUID in _DSM evaluation warning (Michael Niewöhner)"
* tag 'acpi-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (41 commits)
Revert "ACPICA: executer/exsystem: Warn about sleeps greater than 10 ms"
ACPI: utils: include UUID in _DSM evaluation warning
ACPI: PM: Block ASUS B1400CEAE from suspend to idle by default
x86: ACPI: Make mp_config_acpi_gsi() a void function
ACPI: DPTF: Add support for high frequency impedance notification
ACPI: AGDI: Fix missing prototype warning for acpi_agdi_init()
ACPI: bus: Avoid non-ACPI device objects in walks over children
ACPI: DPTF: Correct description of INT3407 / INT3532 attributes
ACPI: BGRT: use static for BGRT_SHOW kobj_attribute defines
ACPI, APEI, EINJ: Refuse to inject into the zero page
ACPI: PM: Always print final debug message in acpi_device_set_power()
ACPI: SPCR: Add support for NVIDIA 16550-compatible port subtype
ACPI: docs: enumeration: Unify Package () for properties (part 2)
ACPI: APEI: Fix missing ERST record id
ACPICA: Update version to 20220331
ACPICA: exsystem.c: Use ACPI_FORMAT_UINT64 for 64-bit output
ACPICA: IORT: Updates for revision E.d
ACPICA: executer/exsystem: Fix some typo mistakes
ACPICA: iASL/MADT: Add OEM-defined subtable
ACPICA: executer/exsystem: Warn about sleeps greater than 10 ms
...
Platform PMU changes:
=====================
- x86/intel:
- Add new Intel Alder Lake and Raptor Lake support
- x86/amd:
- AMD Zen4 IBS extensions support
- Add AMD PerfMonV2 support
- Add AMD Fam19h Branch Sampling support
Generic changes:
================
- signal: Deliver SIGTRAP on perf event asynchronously if blocked
Perf instrumentation can be driven via SIGTRAP, but this causes a problem
when SIGTRAP is blocked by a task & terminate the task.
Allow user-space to request these signals asynchronously (after they get
unblocked) & also give the information to the signal handler when this
happens:
" To give user space the ability to clearly distinguish synchronous from
asynchronous signals, introduce siginfo_t::si_perf_flags and
TRAP_PERF_FLAG_ASYNC (opted for flags in case more binary information is
required in future).
The resolution to the problem is then to (a) no longer force the signal
(avoiding the terminations), but (b) tell user space via si_perf_flags
if the signal was synchronous or not, so that such signals can be
handled differently (e.g. let user space decide to ignore or consider
the data imprecise). "
- Unify/standardize the /sys/devices/cpu/events/* output format.
- Misc fixes & cleanups.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmKLuiURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ioSRAAgM3PneFHn5MFiuV/8ZfP3xMHNUOYOCgN
JhALRcUhDdL4N9pS0DSImfXvAlYPJ/TZK8qBRNDsRgygp5vjrbr9zH2HdZBW1gyV
qi3bpuNS+METnfNyumAoBeOYbMIvpm3NDUX+w68Xvkd1g8ykyno8Zc2H2hj3IDsW
cK3ErP0CZLsnBZsymy29/bxCYhfxsED6J06hOa8R3Tvl4XYg/27Z+tEuZ4GYeFS8
VikulYB9RhRWUbhkzwjyRSbTWyvsuXP+xD28ymUIxXaNCDOwxK8uYtVepUFIBO8X
cZgtwT2faV3y5ZAnz02M+/JZl+Jz5EPm037vNQp9aJsTuAbAGnxh/hL0cBVuDqhv
Nh9wkqS8FqwAbtpvg/IeamzqN5z/Yn2Q/Jyk/4oWipmeddXWUL7sYVoSduTGJJkz
cZz2ciNQbnOCzv0ZSjihrGMqPaT+/wI/iLW3ouLoZXpfTtVVRiiLuI1DDAZ1rd2r
D6djV8JjHIs71V/6E9ahVATxq8yMdikd7u734rA5K3XSxIBTYrdshbOhddzgeE7d
chQ7XvpQXDoFrZtxkHXP5iIeNF7fU9MWNWaEcsrZaWEB/8UpD6eL2if1Kl8mog+h
J4+zR1LWRHh8TNRfos3yCP2PSbbS6LPVsYLJzP+bb+pxgqdJ+urxfmxoCtY5trNI
zHT52xfdxSo=
=UqYA
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
"Platform PMU changes:
- x86/intel:
- Add new Intel Alder Lake and Raptor Lake support
- x86/amd:
- AMD Zen4 IBS extensions support
- Add AMD PerfMonV2 support
- Add AMD Fam19h Branch Sampling support
Generic changes:
- signal: Deliver SIGTRAP on perf event asynchronously if blocked
Perf instrumentation can be driven via SIGTRAP, but this causes a
problem when SIGTRAP is blocked by a task & terminate the task.
Allow user-space to request these signals asynchronously (after
they get unblocked) & also give the information to the signal
handler when this happens:
"To give user space the ability to clearly distinguish
synchronous from asynchronous signals, introduce
siginfo_t::si_perf_flags and TRAP_PERF_FLAG_ASYNC (opted for
flags in case more binary information is required in future).
The resolution to the problem is then to (a) no longer force the
signal (avoiding the terminations), but (b) tell user space via
si_perf_flags if the signal was synchronous or not, so that such
signals can be handled differently (e.g. let user space decide
to ignore or consider the data imprecise). "
- Unify/standardize the /sys/devices/cpu/events/* output format.
- Misc fixes & cleanups"
* tag 'perf-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
perf/x86/amd/core: Fix reloading events for SVM
perf/x86/amd: Run AMD BRS code only on supported hw
perf/x86/amd: Fix AMD BRS period adjustment
perf/x86/amd: Remove unused variable 'hwc'
perf/ibs: Fix comment
perf/amd/ibs: Advertise zen4_ibs_extensions as pmu capability attribute
perf/amd/ibs: Add support for L3 miss filtering
perf/amd/ibs: Use ->is_visible callback for dynamic attributes
perf/amd/ibs: Cascade pmu init functions' return value
perf/x86/uncore: Add new Alder Lake and Raptor Lake support
perf/x86/uncore: Clean up uncore_pci_ids[]
perf/x86/cstate: Add new Alder Lake and Raptor Lake support
perf/x86/msr: Add new Alder Lake and Raptor Lake support
perf/x86: Add new Alder Lake and Raptor Lake support
perf/amd/ibs: Use interrupt regs ip for stack unwinding
perf/x86/amd/core: Add PerfMonV2 overflow handling
perf/x86/amd/core: Add PerfMonV2 counter control
perf/x86/amd/core: Detect available counters
perf/x86/amd/core: Detect PerfMonV2 support
x86/msr: Add PerfCntrGlobal* registers
...
- Comprehensive interface overhaul:
=================================
Objtool's interface has some issues:
- Several features are done unconditionally, without any way to turn
them off. Some of them might be surprising. This makes objtool
tricky to use, and prevents porting individual features to other
arches.
- The config dependencies are too coarse-grained. Objtool enablement is
tied to CONFIG_STACK_VALIDATION, but it has several other features
independent of that.
- The objtool subcmds ("check" and "orc") are clumsy: "check" is really
a subset of "orc", so it has all the same options. The subcmd model
has never really worked for objtool, as it only has a single purpose:
"do some combination of things on an object file".
- The '--lto' and '--vmlinux' options are nonsensical and have
surprising behavior.
Overhaul the interface:
- get rid of subcmds
- make all features individually selectable
- remove and/or clarify confusing/obsolete options
- update the documentation
- fix some bugs found along the way
- Fix x32 regression
- Fix Kbuild cleanup bugs
- Add scripts/objdump-func helper script to disassemble a single function from an object file.
- Rewrite scripts/faddr2line to be section-aware, by basing it on 'readelf',
moving it away from 'nm', which doesn't handle multiple sections well,
which can result in decoding failure.
- Rewrite & fix symbol handling - which had a number of bugs wrt. object files
that don't have global symbols - which is rare but possible. Also fix a
bunch of symbol handling bugs found along the way.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmKLtcURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jVQg//QM8nCNadJAVS9exVGX1DZI9pnf3OJaA9
gOFML7Lv3MC+Lwdxt6Iv020rFVaeAnOcjPsis3dppFz62FZzzMWoemn5irg2BFiJ
dp++UtJWTfKxgU2BHydU9uXD0kcJkD4AjBCIaFsgmTjAz8QvMGa9j0smuUm3cDSL
0Bdid+LhkQqW3P2FiLWsSAzh4vqZmdwpXgERZRql8qD3NYk5hV4QDKs3gMguktat
9gos4kGt0uwKfiEvmeNEXkoAwUsTvE/vqaOy9cVxxCqcWrrC+yQeBpwSoqhHK526
dyHlwlYvBaPFqZnmquVUv21iv1MU6dUBJPhNIChke0NDTwVzSXdI75207FARyk5J
3igSFEfJcU9zMvhAAsAjzD/uQP2ATowg5qa/V2xyWwtyaRgBleRffYiDsbhgDoNc
R4/vI+vn/fQXouMhmmjPNYzu9uHQ+k89wQCJIY8Bswf7oNu6nKL3jJb/a/a7xhsH
ZNqv+M0KEENTZcjBU2UHGyImApmkTlsp2mxUiiHs7QoV1hTfz+TcTXKPM1mIuJB8
/HrVpv64CZ3S7p4JyGBUTNpci4mBjgBmwwAf16+dtaxyxxfoqReVWh3+bzsZbH+B
kRjezWHh7/yCsoyDm7/LPgyPKEbozLLzMsTsjVJeWgeTgZ+xuqku3PTVctyzAI21
DVL5oZe3iK4=
=ARdm
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- Comprehensive interface overhaul:
=================================
Objtool's interface has some issues:
- Several features are done unconditionally, without any way to
turn them off. Some of them might be surprising. This makes
objtool tricky to use, and prevents porting individual features
to other arches.
- The config dependencies are too coarse-grained. Objtool
enablement is tied to CONFIG_STACK_VALIDATION, but it has several
other features independent of that.
- The objtool subcmds ("check" and "orc") are clumsy: "check" is
really a subset of "orc", so it has all the same options.
The subcmd model has never really worked for objtool, as it only
has a single purpose: "do some combination of things on an object
file".
- The '--lto' and '--vmlinux' options are nonsensical and have
surprising behavior.
Overhaul the interface:
- get rid of subcmds
- make all features individually selectable
- remove and/or clarify confusing/obsolete options
- update the documentation
- fix some bugs found along the way
- Fix x32 regression
- Fix Kbuild cleanup bugs
- Add scripts/objdump-func helper script to disassemble a single
function from an object file.
- Rewrite scripts/faddr2line to be section-aware, by basing it on
'readelf', moving it away from 'nm', which doesn't handle multiple
sections well, which can result in decoding failure.
- Rewrite & fix symbol handling - which had a number of bugs wrt.
object files that don't have global symbols - which is rare but
possible. Also fix a bunch of symbol handling bugs found along the
way.
* tag 'objtool-core-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
objtool: Fix objtool regression on x32 systems
objtool: Fix symbol creation
scripts/faddr2line: Fix overlapping text section failures
scripts: Create objdump-func helper script
objtool: Remove libsubcmd.a when make clean
objtool: Remove inat-tables.c when make clean
objtool: Update documentation
objtool: Remove --lto and --vmlinux in favor of --link
objtool: Add HAVE_NOINSTR_VALIDATION
objtool: Rename "VMLINUX_VALIDATION" -> "NOINSTR_VALIDATION"
objtool: Make noinstr hacks optional
objtool: Make jump label hack optional
objtool: Make static call annotation optional
objtool: Make stack validation frame-pointer-specific
objtool: Add CONFIG_OBJTOOL
objtool: Extricate sls from stack validation
objtool: Rework ibt and extricate from stack validation
objtool: Make stack validation optional
objtool: Add option to print section addresses
objtool: Don't print parentheses in function addresses
...
- Initial support for the ARMv9 Scalable Matrix Extension (SME). SME
takes the approach used for vectors in SVE and extends this to provide
architectural support for matrix operations. No KVM support yet, SME
is disabled in guests.
- Support for crashkernel reservations above ZONE_DMA via the
'crashkernel=X,high' command line option.
- btrfs search_ioctl() fix for live-lock with sub-page faults.
- arm64 perf updates: support for the Hisilicon "CPA" PMU for monitoring
coherent I/O traffic, support for Arm's CMN-650 and CMN-700
interconnect PMUs, minor driver fixes, kerneldoc cleanup.
- Kselftest updates for SME, BTI, MTE.
- Automatic generation of the system register macros from a 'sysreg'
file describing the register bitfields.
- Update the type of the function argument holding the ESR_ELx register
value to unsigned long to match the architecture register size
(originally 32-bit but extended since ARMv8.0).
- stacktrace cleanups.
- ftrace cleanups.
- Miscellaneous updates, most notably: arm64-specific huge_ptep_get(),
avoid executable mappings in kexec/hibernate code, drop TLB flushing
from get_clear_flush() (and rename it to get_clear_contig()),
ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmKH19IACgkQa9axLQDI
XvEFWg//bf0p6zjeNaOJmBbyVFsXsVyYiEaLUpFPUs3oB+81s2YZ+9i1rgMrNCft
EIDQ9+/HgScKxJxnzWf68heMdcBDbk76VJtLALExbge6owFsjByQDyfb/b3v/bLd
ezAcGzc6G5/FlI1IP7ct4Z9MnQry4v5AG8lMNAHjnf6GlBS/tYNAqpmj8HpQfgRQ
ZbhfZ8Ayu3TRSLWL39NHVevpmxQm/bGcpP3Q9TtjUqg0r1FQ5sK/LCqOksueIAzT
UOgUVYWSFwTpLEqbYitVqgERQp9LiLoK5RmNYCIEydfGM7+qmgoxofSq5e2hQtH2
SZM1XilzsZctRbBbhMit1qDBqMlr/XAy/R5FO0GauETVKTaBhgtj6mZGyeC9nU/+
RGDljaArbrOzRwMtSuXF+Fp6uVo5spyRn1m8UT/k19lUTdrV9z6EX5Fzuc4Mnhed
oz4iokbl/n8pDObXKauQspPA46QpxUYhrAs10B/ELc3yyp/Qj3jOfzYHKDNFCUOq
HC9mU+YiO9g2TbYgCrrFM6Dah2E8fU6/cR0ZPMeMgWK4tKa+6JMEINYEwak9e7M+
8lZnvu3ntxiJLN+PrPkiPyG+XBh2sux1UfvNQ+nw4Oi9xaydeX7PCbQVWmzTFmHD
q7UPQ8220e2JNCha9pULS8cxDLxiSksce06DQrGXwnHc1Ir7T04=
=0DjE
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Initial support for the ARMv9 Scalable Matrix Extension (SME).
SME takes the approach used for vectors in SVE and extends this to
provide architectural support for matrix operations. No KVM support
yet, SME is disabled in guests.
- Support for crashkernel reservations above ZONE_DMA via the
'crashkernel=X,high' command line option.
- btrfs search_ioctl() fix for live-lock with sub-page faults.
- arm64 perf updates: support for the Hisilicon "CPA" PMU for
monitoring coherent I/O traffic, support for Arm's CMN-650 and
CMN-700 interconnect PMUs, minor driver fixes, kerneldoc cleanup.
- Kselftest updates for SME, BTI, MTE.
- Automatic generation of the system register macros from a 'sysreg'
file describing the register bitfields.
- Update the type of the function argument holding the ESR_ELx register
value to unsigned long to match the architecture register size
(originally 32-bit but extended since ARMv8.0).
- stacktrace cleanups.
- ftrace cleanups.
- Miscellaneous updates, most notably: arm64-specific huge_ptep_get(),
avoid executable mappings in kexec/hibernate code, drop TLB flushing
from get_clear_flush() (and rename it to get_clear_contig()),
ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE.
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (145 commits)
arm64/sysreg: Generate definitions for FAR_ELx
arm64/sysreg: Generate definitions for DACR32_EL2
arm64/sysreg: Generate definitions for CSSELR_EL1
arm64/sysreg: Generate definitions for CPACR_ELx
arm64/sysreg: Generate definitions for CONTEXTIDR_ELx
arm64/sysreg: Generate definitions for CLIDR_EL1
arm64/sve: Move sve_free() into SVE code section
arm64: Kconfig.platforms: Add comments
arm64: Kconfig: Fix indentation and add comments
arm64: mm: avoid writable executable mappings in kexec/hibernate code
arm64: lds: move special code sections out of kernel exec segment
arm64/hugetlb: Implement arm64 specific huge_ptep_get()
arm64/hugetlb: Use ptep_get() to get the pte value of a huge page
arm64: kdump: Do not allocate crash low memory if not needed
arm64/sve: Generate ZCR definitions
arm64/sme: Generate defintions for SVCR
arm64/sme: Generate SMPRI_EL1 definitions
arm64/sme: Automatically generate SMPRIMAP_EL2 definitions
arm64/sme: Automatically generate SMIDR_EL1 defines
arm64/sme: Automatically generate defines for SMCR
...
Highlights:
- New drivers:
- Intel "In Field Scan" (IFS) support
- Winmate FM07/FM07P buttons
- Mellanox SN2201 support
- AMD PMC driver enhancements
- Lots of various other small fixes and hardware-id additions
The following is an automated git shortlog grouped by driver:
Documentation:
- In-Field Scan
Documentation/ABI:
- Add new attributes for mlxreg-io sysfs interfaces
- sysfs-class-firmware-attributes: Misc. cleanups
- sysfs-class-firmware-attributes: Fix Sphinx errors
- sysfs-driver-intel_sdsi: Fix sphinx warnings
acerhdf:
- Cleanup str_starts_with()
amd-pmc:
- Fix build error unused-function
- Shuffle location of amd_pmc_get_smu_version()
- Avoid reading SMU version at probe time
- Move FCH init to first use
- Move SMU logging setup out of init
- Fix compilation without CONFIG_SUSPEND
amd_hsmp:
- Add HSMP protocol version 5 messages
asus-nb-wmi:
- Add keymap for MyASUS key
asus-wmi:
- Update unknown code message
- Use kobj_to_dev()
- Fix driver not binding when fan curve control probe fails
- Potential buffer overflow in asus_wmi_evaluate_method_buf()
barco-p50-gpio:
- Fix duplicate included linux/io.h
dell-laptop:
- Add quirk entry for Latitude 7520
gigabyte-wmi:
- Add support for Z490 AORUS ELITE AC and X570 AORUS ELITE WIFI
- added support for B660 GAMING X DDR4 motherboard
hp-wmi:
- Correct code style related issues
intel-hid:
- fix _DSM function index handling
intel-uncore-freq:
- Prevent driver loading in guests
intel_cht_int33fe:
- Set driver data
platform/mellanox:
- Add support for new SN2201 system
platform/surface:
- aggregator: Fix initialization order when compiling as builtin module
- gpe: Add support for Surface Pro 8
platform/x86/dell:
- add buffer allocation/free functions for SMI calls
platform/x86/intel:
- Fix 'rmmod pmt_telemetry' panic
- pmc/core: Use kobj_to_dev()
- pmc/core: change pmc_lpm_modes to static
platform/x86/intel/ifs:
- Add CPU_SUP_INTEL dependency
- add ABI documentation for IFS
- Add IFS sysfs interface
- Add scan test support
- Authenticate and copy to secured memory
- Check IFS Image sanity
- Read IFS firmware image
- Add stub driver for In-Field Scan
platform/x86/intel/sdsi:
- Fix bug in multi packet reads
- Poll on ready bit for writes
- Handle leaky bucket
platform_data/mlxreg:
- Add field for notification callback
pmc_atom:
- dont export pmc_atom_read - no modular users
- remove unused pmc_atom_write()
samsung-laptop:
- use kobj_to_dev()
- Fix an unsigned comparison which can never be negative
stop_machine:
- Add stop_core_cpuslocked() for per-core operations
think-lmi:
- certificate support clean ups
thinkpad_acpi:
- Correct dual fan probe
- Add a s2idle resume quirk for a number of laptops
- Convert btusb DMI list to quirks
tools/power/x86/intel-speed-select:
- Fix warning for perf_cap.cpu
- Display error on turbo mode disabled
- fix build failure when using -Wl,--as-needed
toshiba_acpi:
- use kobj_to_dev()
trace:
- platform/x86/intel/ifs: Add trace point to track Intel IFS operations
winmate-fm07-keys:
- Winmate FM07/FM07P buttons
wmi:
- replace usage of found with dedicated list iterator variable
x86/microcode/intel:
- Expose collect_cpu_info_early() for IFS
x86/msr-index:
- Define INTEGRITY_CAPABILITIES MSR
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEuvA7XScYQRpenhd+kuxHeUQDJ9wFAmKKlA0UHGhkZWdvZWRl
QHJlZGhhdC5jb20ACgkQkuxHeUQDJ9w0Iwf+PYoq7qtU6j6N2f8gL2s65JpKiSPP
CkgnCzTP+khvNnTWMQS8RW9VE6YrHXmN/+d3UAvRrHsOYm3nyZT5aPju9xJ6Xyfn
5ZdMVvYxz7cm3lC6ay8AQt0Cmy6im/+lzP5vA5K68IYh0fPX/dvuOU57pNvXYFfk
Yz5/Gm0t0C4CKVqkcdU/zkNawHP+2+SyQe+Ua2srz7S3DAqUci0lqLr/w9Xk2Yij
nCgEWFB1Qjd2NoyRRe44ksLQ0dXpD4ADDzED+KPp6VTGnw61Eznf9319Z5ONNa/O
VAaSCcDNKps8d3ZpfCpLb3Rs4ztBCkRnkLFczJBgPsBiuDmyTT2/yeEtNg==
=HdEG
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver updates from Hans de Goede:
"This includes some small changes to kernel/stop_machine.c and arch/x86
which are deps of the new Intel IFS support.
Highlights:
- New drivers:
- Intel "In Field Scan" (IFS) support
- Winmate FM07/FM07P buttons
- Mellanox SN2201 support
- AMD PMC driver enhancements
- Lots of various other small fixes and hardware-id additions"
* tag 'platform-drivers-x86-v5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (54 commits)
platform/x86/intel/ifs: Add CPU_SUP_INTEL dependency
platform/x86: intel_cht_int33fe: Set driver data
platform/x86: intel-hid: fix _DSM function index handling
platform/x86: toshiba_acpi: use kobj_to_dev()
platform/x86: samsung-laptop: use kobj_to_dev()
platform/x86: gigabyte-wmi: Add support for Z490 AORUS ELITE AC and X570 AORUS ELITE WIFI
tools/power/x86/intel-speed-select: Fix warning for perf_cap.cpu
tools/power/x86/intel-speed-select: Display error on turbo mode disabled
Documentation: In-Field Scan
platform/x86/intel/ifs: add ABI documentation for IFS
trace: platform/x86/intel/ifs: Add trace point to track Intel IFS operations
platform/x86/intel/ifs: Add IFS sysfs interface
platform/x86/intel/ifs: Add scan test support
platform/x86/intel/ifs: Authenticate and copy to secured memory
platform/x86/intel/ifs: Check IFS Image sanity
platform/x86/intel/ifs: Read IFS firmware image
platform/x86/intel/ifs: Add stub driver for In-Field Scan
stop_machine: Add stop_core_cpuslocked() for per-core operations
x86/msr-index: Define INTEGRITY_CAPABILITIES MSR
x86/microcode/intel: Expose collect_cpu_info_early() for IFS
...
pressure:
SGX uses normal RAM allocated from special shmem files as backing storage
when it runs out of SGX memory (EPC). The code was overly aggressive when
freeing shmem pages and was inadvertently freeing perfectly good data.
This resulted in failures in the SGX instructions used to swap data back
into SGX memory.
This turned out to be really hard to trigger in mainline. It was
originally encountered testing the out-of-tree "SGX2" patches, but later
reproduced on mainline.
Fix the data loss by being more careful about truncating pages out of
the backing storage and more judiciously setting pages dirty.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmKLqcgACgkQaDWVMHDJ
krA7rA//ZgNgOTzCp/jdntz2KSp9MPhwaSJg0MUnsa7wt0T/3sPXaEAu9wgSZod7
xqxH17LKUc27SyALtPrkvm68aVZ/Z0Nhq2gDndspXd/Zcl/CD/Cy+GI+ZpdNoYhz
Fuqiq1TrszzzqBksgiEal9S874+jum2uWqYBMHB45ODp+E7F479Zm42hI3dSp1VN
6n5zOi5u+unHgDRQ/rwMovu2XU61ZXrycqkbZvu4P4tRbEUH+EhAMKG2RyZOB2V9
XNqr1vBJ122CWMIxcdzEUEofPFFwVEtC9jK+rdgUW1ZYAPJDjVvcnXx7dpA9PHLb
DytBSWyeISllJKbea1pIMsdCT/IE4I3s0US2ZA3Ru7YAMgUIi+IGu++JJ2dWdDvx
GoJz6yBVw4r6cl7kLUfbtIUPsJLYkEMpTM4XODsxMwzd2/Jdbe2UfQskzEn9Auvc
1qGRspu/3VbqE5WFz5Npd94+B+8BOo7kKLcizBHqmX8U2PBkMnhRatxDMCu8frfL
DlrjosgUgMYQRkEp3Zugo33O8F2EAE0T1I9g7N4sullX0jGnFifjgiPipnWcnIB9
RnF5NHdrTMPwqhvzz+3o1yJgve56juZxESqn1khEIQEqgUtxFaEnrmYzdLlVkoGg
XbuY7TNp1hDC3s9OHeiCL2oUaSmyh0eKCokLiAuWowVzbuU69BU=
=pTAC
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Dave Hansen:
"A set of patches to prevent crashes in SGX enclaves under heavy memory
pressure:
SGX uses normal RAM allocated from special shmem files as backing
storage when it runs out of SGX memory (EPC). The code was overly
aggressive when freeing shmem pages and was inadvertently freeing
perfectly good data. This resulted in failures in the SGX instructions
used to swap data back into SGX memory.
This turned out to be really hard to trigger in mainline. It was
originally encountered testing the out-of-tree "SGX2" patches, but
later reproduced on mainline.
Fix the data loss by being more careful about truncating pages out of
the backing storage and more judiciously setting pages dirty"
* tag 'x86_sgx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Ensure no data in PCMD page after truncate
x86/sgx: Fix race between reclaimer and page fault handler
x86/sgx: Obtain backing storage page with enclave mutex held
x86/sgx: Mark PCMD page as dirty when modifying contents
x86/sgx: Disconnect backing page references from dirty status
- Remove function export
- Correct asm constraint
- Fix __setup handlers retval
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKL6VkACgkQEsHwGGHe
VUqs6g/+Ikpd4Mrou4P5Ul8QNdN9mEzwUfW6i8VpoA3h1L6mKkZxbUsbSz9xInjw
MAhrcevujW6GwdQdus2sUcSlX+jxl6c/IlMdf8RegNPY/JBPDX4dRA7rPetvZEDm
ZiIYVTiEzJoOzPDJeO7a3v5EHPsY6CjsCFhGz7hjIcrwQjzCLkL5MqG+WDAtebe+
QVdbllD2RlZNPDyHYE5Lqh1h+Y0e4n6kS7LCWxexfHlNOZ5KBRVyIJvz/xOZFZ1/
9oX0UDD2gfH5chLs8GKsr7cZYERMtNlKBPoxGzl8iKF4iUeiksdj3P5y+mdcFaDG
YbM7aXewmbyLyiCkh1zXU6Mw3lK1VfUtVXtEYj+qXf1jWp59ctNEJkc6/VAcaKh7
oS7MNG7Y44B8XwdH7MiqDE7eVCnqEjIR+BIiwjyXNLFP1AXZMAXuBzXPF/vZ3Gyf
3N5vzO4VNEN6Oa1TReSspKwYvq2uPtHMjLX2rT6Py2ru32mj2dCc5E7GD83RKL8V
vDIz4VGOZyGfjp6gClMBsyK4mYwSwgXbnOci7DJn56mMf2qzBJITILXc31zz4gX2
E9kiBu/4Mwjnrx9QRpCNXu7iddBA3YM2NMtNlwBcCgZOFaFz/yOx9TpnugF17WHQ
VVtQi8wlcsS+F05Y11b7euusMQyk1EpWabIrw8UQd+61Dwpz58Q=
=/WGB
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
"A variety of fixes which don't fit any other tip bucket:
- Remove unnecessary function export
- Correct asm constraint
- Fix __setup handlers retval"
* tag 'x86_misc_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Cleanup the control_va_addr_alignment() __setup handler
x86: Fix return value of __setup handlers
x86/delay: Fix the wrong asm constraint in delay_loop()
x86/amd_nb: Unexport amd_cache_northbridges()
- Make life miserable for apps using split locks by slowing them down
considerably while the rest of the system remains responsive. The hope
is it will hurt more and people will really fix their misaligned locks
apps. As a result, free a TIF bit.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKL5PQACgkQEsHwGGHe
VUrz1Q//QjAKyKsAwCzGSPergtnZp9drimSuNsZAz6/xL8wFnn2nfWJTxugNF5jg
n0Hal2oUGC8lg13mliB7NuDNu4RUWpkFzTzcIbPT8K9h7CUBdQPzqS7E3/p4s/eG
ZCHp8psBGNp8+/+/LFfu9yhzYsAH9ji5KWmOzTVx9UdP3ovgR8BuCI7FCVJSfRz7
cY690XgvcuKoXKckVNaCcoQXPJxykfk4Y1yt1TpITqivFbs2I0vLgzEhoRcTAhPA
nX3pR3uy6oaA6rZAapRt8lbLWOwIEWoI0Tt1v+r5p28+nFiCRfm1XdPYK6CDBlox
UuMBK4WyvSKjKHLu3wEdLCvYbs1kw2l9pXvS3hrqsKhbdeXKrxrNZ3zshwFMAYap
MY/nSTsKSWUUgMgUbWI084csapGFB+hxwY8OVr6JXbxE8YYD/yCbPGOe1cLI7MMt
/H3F6vNqSzdp1N3mAaaKVxiiT21lHIn6oJuSZcDE5sOvBwvpXsOp/w3FxhJCOX49
PXrZLZmSHkDQSbh1XnvT/a+rq3XX1TFXFz71HYZf1yDk+xTijECglNtGnGSdj2Za
iOw6M8VduV5Wy3ED9ubonruuHEJn6njpx/MH1B9+mAZsuLBpmuYFBxOn6AHOkXSb
MVJD4flHXj0ugYm4Q5Y3yi24iWLsRI9utTOU079VL6i6DmFXeZc=
=svvI
-----END PGP SIGNATURE-----
Merge tag 'x86_splitlock_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 splitlock updates from Borislav Petkov:
- Add Raptor Lake to the set of CPU models which support splitlock
- Make life miserable for apps using split locks by slowing them down
considerably while the rest of the system remains responsive. The
hope is it will hurt more and people will really fix their misaligned
locks apps. As a result, free a TIF bit.
* tag 'x86_splitlock_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/split_lock: Enable the split lock feature on Raptor Lake
x86/split-lock: Remove unused TIF_SLD bit
x86/split_lock: Make life miserable for split lockers
allocated and are present when later accessed ("nosmp" and x2APIC)
- Clarify the bit overlap between an old APIC and a modern, integrated one
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKL3BMACgkQEsHwGGHe
VUrkNRAAjqxwBP28EnYvHthvbxhfsuwws+OcSm2lt5SK5WGZK+p1pnDrPvxSawF8
t7O1oyIJfSaFmEPqs52Z/dj7noKJPBhDNoevmDVTmfQZkGvpDT1xjBATfABjbsnf
SGUXK6c8rg20afGiOO9GLL7DB/zArDRdf/2fpn6f+1I5tJCAurnjp9A1ssZw3KBl
m5plwaoQSsyCkqJtpT+Q5Mu9fyfaqTPPMBJrPi0tbRlVjryXJh7GW31TQfmHn3V7
wDUvtfD2kY9kzs/EHL3ilxmnlLfCya5f1kW76z5Yek3GkxCoMD0vFYJ0VUTd8KFf
mi7e2w4L1x6fyYiNKaMEeoml1aed03qifcdXF9Gv+t6fRdzmWwo1IgzQq+gu+WQ4
p8U6GfzbXPN92xQfEsq7n7jmiKNL5S0e+VHFHE0xV1YxmEELwH5nURnk7g/idjZS
IJWhR3xNBtsFxHr/JmfGbk8qPBMNX6B2W0sVkIC9Zc0gDr9v5Gw06fYh/venSiOC
ePOO/RsMDftFBsHipc8o5IdkZXmr487hThNyt1vFZCL7V0TE3Vsw+aU9btzpBoz9
t4QuZw+iO6Z0SZy6Jt/27cp43Ky5Jp/ry+HNQmfFwDaXnh0ZeYQOZOMVgvODBmaw
N4qblX8UDd8+gtR7W9EDyXu+9UK35Nh3VbUO8MfOCRp2EaZqk/U=
=fkbm
-----END PGP SIGNATURE-----
Merge tag 'x86_apic_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Borislav Petkov:
- Always do default APIC routing setup so that cpumasks are properly
allocated and are present when later accessed ("nosmp" and x2APIC)
- Clarify the bit overlap between an old APIC and a modern, integrated
one
* tag 'x86_apic_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Do apic driver probe for "nosmp" use case
x86/apic: Clarify i82489DX bit overlap in APIC_LVT0
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKL2RYACgkQEsHwGGHe
VUrfzQ/9H0Wr6TsJdagMank7nKfadunppVjGF7CSaJwxNwUJ3Vi3rQ2rIaPmMjLe
RSuUatB9B8kFL4zpeRSwBRDScff5ZWwAc5/Jl5XAxcQCj4jNdlcTREUcTqFSabUQ
mllm7wxtGRyBfKpHax7MnYTURnW23ZxfyEQ0NT7mqsvRQh3WoexUuj7Lc1XYwJge
R7hjPWTtWSDMFtTYFhp+LD8hfCOmRE8z0goO0aGO8GvQXskuD4MyjMGvtQ/jquaV
S6LPAZNGPheInHrQorxvcy1FS86F6v803Kmt/RFumGFodmjraegO8smQSf4trBwJ
w7EASc/VniPxK1mZf3QsH+r47UqWVo+jq/VM+SFyyyoY+nXLhKNfnMgs1PsPKBlN
few3Iv0Apbs/2LXAfG4y9Ah9yHBYKma1RXkoROx0Yc+qOApCV8x0BkfKWpdXPHrF
fQ8jD7uWnqv4KydqScza2iWSand8jVTgQQov4cHtIggr9ksV8ywEVRSqJGjieiNf
bd3cqj0SQD/OU9N7JZYZ7fFBWsK6AXHZg216NX0hCRDcvWq4KZageLvk2JVURY2i
qlZO2HfahnC9Ngogslj1354LCpT5rfofV3zm3+M2jCyPGbILUzWkeXH4KsNJ4R9m
G3tUoL6DTqRHHcib8TN6gmJENdmmlSMKFOZ6BIFW+SYn5DwVUPk=
=WNci
-----END PGP SIGNATURE-----
Merge tag 'x86_kdump_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 kdump fixlet from Borislav Petkov:
- A single debug message fix
* tag 'x86_kdump_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/crash: Fix minor typo/bug in debug message
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKL1CkACgkQEsHwGGHe
VUpA6Q//QmHDD5GIDkGOjp2BvZfFL/Lb/NqM6k5/1koKbxvWg3zge6w1DH4s2Ai3
U+QmGdHpbVY3Xag7RPru9Kuyh7f3GNPeXIw6JQ78NOAdoEpceSPTAs9r6GzHYLfH
n27hsSWrJQT3PLNUr+/ii/fXpypHCzAPpgpr8sYkY+TEYXuInWP18BrVIMBNRVV0
e9IhtkEL3wJh0FN9LtXWcfjzpTNArloFe204rVpzznpUIgHqK39WwhyRp0ppmhhX
uK9s2XJTD9DBszYZb/NjsxFAoDoB8MS7fVPmdnAKo2P/SzznVOC5TJQiMI/zCXpX
ShhKPJHsbXf//N4HxjbAuAUwYhwBp9nIvruosudZTXiqRwDUxCXRGipsBMQY8l/L
dUAgh3fmF4uw5wEZ6PNiKJ0m0VDgSbusZliLr1o//36/ZqyLf4vSx81K7J7p5u2U
HkP+GAvtWvNXGAAasiVL+D9wOWwgwXFsI44JrWnuTCCiWWdmAHc52b/PAC3bpxNH
f/X2OiA14UzYeV2oO9gznZlM8NFCfekKc/ND/aT3rYrvLqxMJcPg2YHKmgI4U7GO
m5Dfl+69iN22QzEQiMIe/s78zfBaPT0dVX+xjFGusR5V4RnKUUZ6D2oOzrIJ6ans
nx89vEHnudBF95loYrKlJiZqacUJOxPBZ3Z51CeMWfBBTkiHOHU=
=s0mT
-----END PGP SIGNATURE-----
Merge tag 'x86_platform_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Borislav Petkov:
- A couple of changes enabling SGI UV5 support
* tag 'x86_platform_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Log gap hole end size
x86/platform/uv: Update TSC sync state for UV5
x86/platform/uv: Update NMI Handler for UV5
thus allow for guests to use this compacted XSTATE variant when the
hypervisor exports that support
- A variable shadowing cleanup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLsPQACgkQEsHwGGHe
VUoA7hAAoAP6qWntADHcDcA8QMjX9fvOi3uFjiJyGeiYCRH2rmwAAg8Y0DdI/1UE
Wq+7tzTPdyDPulqaEe9PV7f3HRY72cGA/2jdkMxkGG5mGZfVganb0OWgFXecdo6r
CIWf9vMOPwULIT4XvcnaWF6fv+1ZbFZOks9NpxZQZTYA3WQhozgfQOWlkoFFSdC/
pIwWFCUOv/pBPWVSeizE/Y6Yfuaix3KiElwk9NMDTPCRhyBd6VmpkpcBer+n3JUA
HoppbGLYonZEw1PkMmTlQJuFHKJzqwThGGoVY3FDtlAMD4+vmGt1vXNbLlfvtqup
zYHAIG/hqql7Ai9bgXSC2ccYG9v1op+gIFzKTBhI7FkVwEc6R6JtV7uGF7GAr6SL
KPnweo9GCoRmnc6Ju0+IuT0JIMXjO3iQIC0J3uLX8gCbsXVM29qdqhkYcLC75vOc
sXjAUrdolkDIRXzwkJURTxWT/yeKaN9n8r1s7BCmZ7Pg6zZS3/K1nHQkFTWCjSfA
oEy7GmEeI2uFgQX9qpF7NRlNj+D3AxV6W5IURCTI7GsP32e20jhOdU4AyrqsTy2N
8PgUVP9baioUpjY6BKsMc3JiR0ihb0OM3wX9fThu8lu5uHE9Oar+S4OOlFtxPXth
kG7pIS0MqB4N6aKWDFxvLvlUVgAxSqSmnWL4rQSP+Ralu9CY4k0=
=eDaz
-----END PGP SIGNATURE-----
Merge tag 'x86_fpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Borislav Petkov:
- Add support for XSAVEC - the Compacted XSTATE saving variant - and
thus allow for guests to use this compacted XSTATE variant when the
hypervisor exports that support
- A variable shadowing cleanup
* tag 'x86_fpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Cleanup variable shadowing
x86/fpu/xsave: Support XSAVEC in the kernel
needed anymore
- Other misc improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLp74ACgkQEsHwGGHe
VUpqrhAAgNdNw/vNTTzeOH5ZSNxyIoTQapmrSNev0cXRW4tV2hxuYSa2wPZPJZXx
aYhnFxwL7rVy0er7jG/5KaOyzHmrh6PcmqgFdPVo8+yVrfcsPIUqg/4L5peFZh7T
ETV2pvFIiB4njkL/pR3mU5uAtTjyO89tD/LclKmc4ndv19vI8maj+k/dCDOnNnEz
m4wJMXYWh4bG47/izU5TcTYU7ttTLEiVQ/mC5kEuj7PQeUR0kXKvvLo4rX+lOI2v
dQRHgHg/qoNM7uVLd7vV/YdMWwcHchmKG5Y7+a/ogdlwR7a/X9e+lklFSeuxNvyH
8dOHIyzcb6lKTijpqhisZ3o9150ax3Q5FlSWuE3F/9Rcuc1T5eY82kTW2RTOTdV9
xsjob4y+hlpsUfuImupxJLHn685xsYAdqyiG/SPkcnJL++tNBlWiGHX9NqXF5cgw
bq4/94Aouxevl0OBxnFBeoQOJvOnf60OY3LHcYR78yEEJyi4iWsC0/TEmD+9IE+r
EpC1wz9bHCYbSwZ+yv8u2tNPd/rKxdspPL/6SxT9a+WAVrOZbQAN3VmlOIon6W9O
bW5ye6suqBbl/Q1FACVU1xxSNjLTJUTFsB1X3QKGm8E+Kr7/zD1ZtT0WQNvyLMfT
p/I4VRcdIxV3eDiYqeTfJ3sTS7IjKHSaZVBnpkZvRh869mMdqCg=
=CfX1
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Borislav Petkov:
- Remove all the code around GS switching on 32-bit now that it is not
needed anymore
- Other misc improvements
* tag 'x86_core_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bug: Use normal relative pointers in 'struct bug_entry'
x86/nmi: Make register_nmi_handler() more robust
x86/asm: Merge load_gs_index()
x86/32: Remove lazy GS macros
ELF: Remove elf_core_copy_kernel_regs()
x86/32: Simplify ELF_CORE_COPY_REGS
frequency invariance code along with removing the need for unnecessary IPIs
- Finally remove a.out support
- The usual trivial cleanups and fixes all over x86
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLn48ACgkQEsHwGGHe
VUpbkg/+PELrc0y/qxLM/+dyftKYY16Rhk6ZVAXfwqlh5ldyVQcLMUgKwDqYyTn2
XmgdI3cTcFlH2K7j6ANWLu0I9NPaviimUcEdMVcXt7aY5mGWk/q4hIyCYM8d41sV
qKx4OjNSdyoofG6MtwFLJDuoeVg99Bqgvm4nP9BuxL0dZJ2hfcUZ7MTxYCx9ZYjK
/3trx0NV287Yg/wm91EU0nLQzy9xbGS7WCmMnse6uxiUdm2vXbBt8oNFF4f747Dj
0cArfNrMgYq4Cv5bgt/Ki0NU/n4EOGDpJUSyQwlnjDKeN81ESPy7IWtTQ6cE/rJK
BZeUIPiGiYHwtqXv0UTAPGLG8cAqKeab8u0xAOyrFVDkTc0+WlPJRsUAOmRRGIGE
M8ZjoxrLeuFgxw6vKpVjaA+mDRj3qEpSH+IrTcekS98PN7gmVzvq03GobgGbT7YB
xmtbThJa+514FfUVckkyC0+A56BknUIgVxwFPqrthE2atzYTbH67hW4U0yVWXXr7
2VI7ttozBrYVgHCWhD9eoT0uhyD74Vl6pqHnqzY9ShIfKVUGvMgKHHg04nLLtF7W
hm87xV3Q5UEmXhTmDzT1rUZ99mBUxGbWxk227I9raMugIh7pp9wIr57+7O0LRYfX
TdnE2+tL8RMi7+XzRH5iLhnwkrvahBESeHSQ7GVI1Y2zMmmFN+0=
=Dks/
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- Serious sanitization and cleanup of the whole APERF/MPERF and
frequency invariance code along with removing the need for
unnecessary IPIs
- Finally remove a.out support
- The usual trivial cleanups and fixes all over x86
* tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86: Remove empty files
x86/speculation: Add missing srbds=off to the mitigations= help text
x86/prctl: Remove pointless task argument
x86/aperfperf: Make it correct on 32bit and UP kernels
x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
x86/aperfmperf: Replace arch_freq_get_on_cpu()
x86/aperfmperf: Replace aperfmperf_get_khz()
x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
x86/aperfmperf: Make parts of the frequency invariance code unconditional
x86/aperfmperf: Restructure arch_scale_freq_tick()
x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
x86/aperfmperf: Untangle Intel and AMD frequency invariance init
x86/aperfmperf: Separate AP/BP frequency invariance init
x86/smp: Move APERF/MPERF code where it belongs
x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()
x86/process: Fix kernel-doc warning due to a changed function name
x86: Remove a.out support
x86/mm: Replace nodes_weight() with nodes_empty() where appropriate
x86: Replace cpumask_weight() with cpumask_empty() where appropriate
x86/pkeys: Remove __arch_set_user_pkey_access() declaration
...
conventions so that former can be converted to C eventually
- Simplify PUSH_AND_CLEAR_REGS so that it can be used at the system call
entry paths instead of having opencoded, slightly different variants of it
everywhere
- Misc other fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLeQEACgkQEsHwGGHe
VUqFqQ/6AkVfWa9EMnmOcFcUYHjK7srsv7kzppc2P6ly98QOJFsCYagPRHVHXGZF
k4Dezk29j2d4AjVdGot/CpTlRezSe0dmPxTcH5QD+SpiJ8bSgMrnH/0La+No0ypi
VabOZgQaHWIUboccpE77oIRdglun/ZnePN3gRdBRtQWgmeQZVWxD6ly6L1Ptp1Lk
nBXVMpH2h5agLjulsw7j7PihrbM6RFf3qSw4GkaQAAxooxb2i7qb05sG347lm72l
3ppsHtP80MKCmJpe20O+V+O4Hvq1/XJ18Tin6p1bhqSe0PW2pS5QUN7ziF/5orvH
9p8PVWrrH6kTaK1NJilGYG4eIeyuWhSVnObgFqbe7RIITy5eCYXyaq5PLqVahWFD
qk1+Z3nsS6g6BLu10dFACnPq7O+6tVEWsoOZ2D4XJAV/zThbEwE75E4rW6x07gnm
s0BzXgtzb0s35L46jzTctc9RtdCRFjZmD+iHXSqjEfH/dyS1tsvXX6z5wBTb5qn3
FQE3sVtZs0e5yIFAfp19hzmweY/Mgu9b1p+IfkhQhInrLyJNwUVsMkpH1WFdkL5/
RZWtURuYO7lE6Iw1wwZPL691A7hx+1cE9YWuEBH2Il6byJa4UWP4azXCx1nbMFKk
E5ZDKL3iRsDPVI+k+D6NwBN19ih2LAmT2Mxcg1EOV434LLlkHsk=
=P80f
-----END PGP SIGNATURE-----
Merge tag 'x86_asm_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Borislav Petkov:
- A bunch of changes towards streamlining low level asm helpers'
calling conventions so that former can be converted to C eventually
- Simplify PUSH_AND_CLEAR_REGS so that it can be used at the system
call entry paths instead of having opencoded, slightly different
variants of it everywhere
- Misc other fixes
* tag 'x86_asm_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Fix register corruption in compat syscall
objtool: Fix STACK_FRAME_NON_STANDARD reloc type
linkage: Fix issue with missing symbol size
x86/entry: Remove skip_r11rcx
x86/entry: Use PUSH_AND_CLEAR_REGS for compat
x86/entry: Simplify entry_INT80_compat()
x86/mm: Simplify RESERVE_BRK()
x86/entry: Convert SWAPGS to swapgs and remove the definition of SWAPGS
x86/entry: Don't call error_entry() for XENPV
x86/entry: Move CLD to the start of the idtentry macro
x86/entry: Move PUSH_AND_CLEAR_REGS out of error_entry()
x86/entry: Switch the stack after error_entry() returns
x86/traps: Use pt_regs directly in fixup_bad_iret()
are not really needed anymore
- Misc fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLdfgACgkQEsHwGGHe
VUpB5Q//TIGVgmnSd0YYxY2cIe047lfcd34D+3oEGk0d2FidtirP/tjgBqIXRuY5
UncoveqBuI/6/7bodP/ANg9DNVXv2489eFYyZtEOLSGnfzV2AU10aw95cuQQG+BW
YIc6bGSsgfiNo8Vtj4L3xkVqxOrqaCYnh74GTSNNANht3i8KH8Qq9n3qZTuMiF6R
fH9xWak3TZB2nMzHdYrXh0sSR6eBHN3KYSiT0DsdlU9PUlavlSPFYQRiAlr6FL6J
BuYQdlUaCQbINvaviGW4SG7fhX32RfF/GUNaBajB40TO6H98KZLpBBvstWQ841xd
/o44o5wbghoGP1ne8OKwP+SaAV2bE6twd5eO1lpwcpXnQfATvjQ2imxvOiRhy5LY
pFPt/hko9gKWJ6SI0SQ4tiKJALFPLWD6561scHU6PoriFhv0SRIaPmJyEsDYynMz
bCXaPPsoovRwwwBfAxxQjljIlhQSBVt3gWZ8NWD1tYbNaqM+WK7xKBaONGh3OCw3
iK7lsbbljtM0zmANImYyeo7+Hr1NVOmMiK2WZYbxhxgzH3l8v/6EbDt3I70WU57V
9apCU3/nk/HFpX65SdW5qmuiWLVdH9NXrEqbvaUB4ApT18MdUUugewBhcGnf3Umu
wEtltzziqcIkxzDoXXpBGWpX31S7PsM2XVDqYC7dwuNttgEw2Fc=
=7AUX
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU feature updates from Borislav Petkov:
- Remove a bunch of chicken bit options to turn off CPU features which
are not really needed anymore
- Misc fixes and cleanups
* tag 'x86_cpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Add missing prototype for unpriv_ebpf_notify()
x86/pm: Fix false positive kmemleak report in msr_build_context()
x86/speculation/srbds: Do not try to turn mitigation off when not supported
x86/cpu: Remove "noclflush"
x86/cpu: Remove "noexec"
x86/cpu: Remove "nosmep"
x86/cpu: Remove CONFIG_X86_SMAP and "nosmap"
x86/cpu: Remove "nosep"
x86/cpu: Allow feature bit names from /proc/cpuinfo in clearcpuid=
This is the Intel version of a confidential computing solution called
Trust Domain Extensions (TDX). This series adds support to run the
kernel as part of a TDX guest. It provides similar guest protections to
AMD's SEV-SNP like guest memory and register state encryption, memory
integrity protection and a lot more.
Design-wise, it differs from AMD's solution considerably: it uses
a software module which runs in a special CPU mode called (Secure
Arbitration Mode) SEAM. As the name suggests, this module serves as sort
of an arbiter which the confidential guest calls for services it needs
during its lifetime.
Just like AMD's SNP set, this series reworks and streamlines certain
parts of x86 arch code so that this feature can be properly accomodated.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLbisACgkQEsHwGGHe
VUqZLg/7B55iygCwzz0W/KLcXL2cISatUpzGbFs1XTbE9DMz06BPkOsEjF2k8ckv
kfZjgqhSx3GvUI80gK0Tn2M2DfIj3nKuNSXd1pfextP7AxEf68FFJsQz1Ju7bHpT
pZaG+g8IK4+mnEHEKTCO9ANg/Zw8yqJLdtsCaCNE9SUGUfQ6m/ujTEfsambXDHNm
khyCAgpIGSOt51/4apoR9ebyrNCaeVbDawpIPjTy+iyFRc/WyaLFV9CQ8klw4gbw
r/90x2JYxvAf0/z/ifT9Wa+TnYiQ0d4VjFbfr0iJ4GcPn5L3EIoIKPE8vPGMpoSX
fLSzoNmAOT3ja57ytUUQ3o0edoRUIPEdixOebf9qWvE/aj7W37YRzrlJ8Ej/x9Jy
HcI4WZF6Dr1bh6FnI/xX2eVZRzLOL4j9gNyPCwIbvgr1NjDqQnxU7nhxVMmQhJrs
IdiEcP5WYerLKfka/uF//QfWUg5mDBgFa1/3xK57Z3j0iKWmgjaPpR0SWlOKjj8G
tr0gGN9ejikZTqXKGsHn8fv/R3bjXvbVD8z0IEcx+MIrRmZPnX2QBlg7UA1AXV5n
HoVwPFdH1QAtjZq1MRcL4hTOjz3FkS68rg7ZH0f2GWJAzWmEGytBIhECRnN/PFFq
VwRB4dCCt0bzqRxkiH5lzdgR+xqRe61juQQsMzg+Flv/trpXDqM=
=ac9K
-----END PGP SIGNATURE-----
Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull Intel TDX support from Borislav Petkov:
"Intel Trust Domain Extensions (TDX) support.
This is the Intel version of a confidential computing solution called
Trust Domain Extensions (TDX). This series adds support to run the
kernel as part of a TDX guest. It provides similar guest protections
to AMD's SEV-SNP like guest memory and register state encryption,
memory integrity protection and a lot more.
Design-wise, it differs from AMD's solution considerably: it uses a
software module which runs in a special CPU mode called (Secure
Arbitration Mode) SEAM. As the name suggests, this module serves as
sort of an arbiter which the confidential guest calls for services it
needs during its lifetime.
Just like AMD's SNP set, this series reworks and streamlines certain
parts of x86 arch code so that this feature can be properly
accomodated"
* tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
x86/tdx: Fix RETs in TDX asm
x86/tdx: Annotate a noreturn function
x86/mm: Fix spacing within memory encryption features message
x86/kaslr: Fix build warning in KASLR code in boot stub
Documentation/x86: Document TDX kernel architecture
ACPICA: Avoid cache flush inside virtual machines
x86/tdx/ioapic: Add shared bit for IOAPIC base address
x86/mm: Make DMA memory shared for TD guest
x86/mm/cpa: Add support for TDX shared memory
x86/tdx: Make pages shared in ioremap()
x86/topology: Disable CPU online/offline control for TDX guests
x86/boot: Avoid #VE during boot for TDX platforms
x86/boot: Set CR0.NE early and keep it set during the boot
x86/acpi/x86/boot: Add multiprocessor wake-up support
x86/boot: Add a trampoline for booting APs via firmware handoff
x86/tdx: Wire up KVM hypercalls
x86/tdx: Port I/O: Add early boot support
x86/tdx: Port I/O: Add runtime hypercalls
x86/boot: Port I/O: Add decompression-time support for TDX
x86/boot: Port I/O: Allow to hook up alternative helpers
...
Add to confidential guests the necessary memory integrity protection
against malicious hypervisor-based attacks like data replay, memory
remapping and others, thus achieving a stronger isolation from the
hypervisor.
At the core of the functionality is a new structure called a reverse
map table (RMP) with which the guest has a say in which pages get
assigned to it and gets notified when a page which it owns, gets
accessed/modified under the covers so that the guest can take an
appropriate action.
In addition, add support for the whole machinery needed to launch a SNP
guest, details of which is properly explained in each patch.
And last but not least, the series refactors and improves parts of the
previous SEV support so that the new code is accomodated properly and
not just bolted on.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLU2AACgkQEsHwGGHe
VUpb/Q//f4LGiJf4nw1flzpe90uIsHNwAafng3NOjeXmhI/EcOlqPf23WHPCgg3Z
2umfa4sRZyj4aZubDd7tYAoq4qWrQ7pO7viWCNTh0InxBAILOoMPMuq2jSAbq0zV
ASUJXeQ2bqjYxX4JV4N5f3HT2l+k68M0mpGLN0H+O+LV9pFS7dz7Jnsg+gW4ZP25
PMPLf6FNzO/1tU1aoYu80YDP1ne4eReLrNzA7Y/rx+S2NAetNwPn21AALVgoD4Nu
vFdKh4MHgtVbwaQuh0csb/+4vD+tDXAhc8lbIl+Abl9ZxJaDWtAJW5D9e2CnsHk1
NOkHwnrzizzhtGK1g56YPUVRFAWhZYMOI1hR0zGPLQaVqBnN4b+iahPeRiV0XnGE
PSbIHSfJdeiCkvLMCdIAmpE5mRshhRSUfl1CXTCdetMn8xV/qz/vG6bXssf8yhTV
cfLGPHU7gfVmsbR9nk5a8KZ78PaytxOxfIDXvCy8JfQwlIWtieaCcjncrj+sdMJy
0fdOuwvi4jma0cyYuPolKiS1Hn4ldeibvxXT7CZQlIx6jZShMbpfpTTJs11XdtHm
PdDAc1TY3AqI33mpy9DhDQmx/+EhOGxY3HNLT7evRhv4CfdQeK3cPVUWgo4bGNVv
ZnFz7nvmwpyufltW9K8mhEZV267174jXGl6/idxybnlVE7ESr2Y=
=Y8kW
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull AMD SEV-SNP support from Borislav Petkov:
"The third AMD confidential computing feature called Secure Nested
Paging.
Add to confidential guests the necessary memory integrity protection
against malicious hypervisor-based attacks like data replay, memory
remapping and others, thus achieving a stronger isolation from the
hypervisor.
At the core of the functionality is a new structure called a reverse
map table (RMP) with which the guest has a say in which pages get
assigned to it and gets notified when a page which it owns, gets
accessed/modified under the covers so that the guest can take an
appropriate action.
In addition, add support for the whole machinery needed to launch a
SNP guest, details of which is properly explained in each patch.
And last but not least, the series refactors and improves parts of the
previous SEV support so that the new code is accomodated properly and
not just bolted on"
* tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
x86/entry: Fixup objtool/ibt validation
x86/sev: Mark the code returning to user space as syscall gap
x86/sev: Annotate stack change in the #VC handler
x86/sev: Remove duplicated assignment to variable info
x86/sev: Fix address space sparse warning
x86/sev: Get the AP jump table address from secrets page
x86/sev: Add missing __init annotations to SEV init routines
virt: sevguest: Rename the sevguest dir and files to sev-guest
virt: sevguest: Change driver name to reflect generic SEV support
x86/boot: Put globals that are accessed early into the .data section
x86/boot: Add an efi.h header for the decompressor
virt: sevguest: Fix bool function returning negative value
virt: sevguest: Fix return value check in alloc_shared_pages()
x86/sev-es: Replace open-coded hlt-loop with sev_es_terminate()
virt: sevguest: Add documentation for SEV-SNP CPUID Enforcement
virt: sevguest: Add support to get extended report
virt: sevguest: Add support to derive key
virt: Add SEV-SNP guest driver
x86/sev: Register SEV-SNP guest request platform device
x86/sev: Provide support for SNP guest request NAEs
...
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-05-23
We've added 113 non-merge commits during the last 26 day(s) which contain
a total of 121 files changed, 7425 insertions(+), 1586 deletions(-).
The main changes are:
1) Speed up symbol resolution for kprobes multi-link attachments, from Jiri Olsa.
2) Add BPF dynamic pointer infrastructure e.g. to allow for dynamically sized ringbuf
reservations without extra memory copies, from Joanne Koong.
3) Big batch of libbpf improvements towards libbpf 1.0 release, from Andrii Nakryiko.
4) Add BPF link iterator to traverse links via seq_file ops, from Dmitrii Dolgov.
5) Add source IP address to BPF tunnel key infrastructure, from Kaixi Fan.
6) Refine unprivileged BPF to disable only object-creating commands, from Alan Maguire.
7) Fix JIT blinding of ld_imm64 when they point to subprogs, from Alexei Starovoitov.
8) Add BPF access to mptcp_sock structures and their meta data, from Geliang Tang.
9) Add new BPF helper for access to remote CPU's BPF map elements, from Feng Zhou.
10) Allow attaching 64-bit cookie to BPF link of fentry/fexit/fmod_ret, from Kui-Feng Lee.
11) Follow-ups to typed pointer support in BPF maps, from Kumar Kartikeya Dwivedi.
12) Add busy-poll test cases to the XSK selftest suite, from Magnus Karlsson.
13) Improvements in BPF selftest test_progs subtest output, from Mykola Lysenko.
14) Fill bpf_prog_pack allocator areas with illegal instructions, from Song Liu.
15) Add generic batch operations for BPF map-in-map cases, from Takshak Chahande.
16) Make bpf_jit_enable more user friendly when permanently on 1, from Tiezhu Yang.
17) Fix an array overflow in bpf_trampoline_get_progs(), from Yuntao Wang.
====================
Link: https://lore.kernel.org/r/20220523223805.27931-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce a memset like API for text_poke. This will be used to fill the
unused RX memory with illegal instructions.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20220520235758.1858153-3-song@kernel.org
Merge PM core changes, updates related to system sleep and power capping
updates for 5.19-rc1:
- Export dev_pm_ops instead of suspend() and resume() in the IIO
chemical scd30 driver (Jonathan Cameron).
- Add namespace variants of EXPORT[_GPL]_SIMPLE_DEV_PM_OPS and
PM-runtime counterparts (Jonathan Cameron).
- Move symbol exports in the IIO chemical scd30 driver into the
IIO_SCD30 namespace (Jonathan Cameron).
- Avoid device PM-runtime usage count underflows (Rafael Wysocki).
- Allow dynamic debug to control printing of PM messages (David
Cohen).
- Fix some kernel-doc comments in hibernation code (Yang Li, Haowen
Bai).
- Preserve ACPI-table override during hibernation (Amadeusz Sławiński).
- Improve support for suspend-to-RAM for PSCI OSI mode (Ulf Hansson).
- Make Intel RAPL power capping driver support the RaptorLake and
AlderLake N processors (Zhang Rui, Sumeet Pawnikar).
- Remove redundant store to value after multiply in the RAPL power
capping driver (Colin Ian King).
* pm-core:
PM: runtime: Avoid device usage count underflows
iio: chemical: scd30: Move symbol exports into IIO_SCD30 namespace
PM: core: Add NS varients of EXPORT[_GPL]_SIMPLE_DEV_PM_OPS and runtime pm equiv
iio: chemical: scd30: Export dev_pm_ops instead of suspend() and resume()
* pm-sleep:
cpuidle: PSCI: Improve support for suspend-to-RAM for PSCI OSI mode
PM: runtime: Allow to call __pm_runtime_set_status() from atomic context
PM: hibernate: Don't mark comment as kernel-doc
x86/ACPI: Preserve ACPI-table override during hibernation
PM: hibernate: Fix some kernel-doc comments
PM: sleep: enable dynamic debug support within pm_pr_dbg()
PM: sleep: Narrow down -DDEBUG on kernel/power/ files
* powercap:
powercap: intel_rapl: remove redundant store to value after multiply
powercap: intel_rapl: add support for ALDERLAKE_N
powercap: RAPL: Add Power Limit4 support for RaptorLake
powercap: intel_rapl: add support for RaptorLake
Merge APEI material, changes related to DPTF, ACPI-related x86 cleanup
and documentation improvement for 5.19-rc1:
- Fix missing ERST record ID in the APEI code (Liu Xinpeng).
- Make APEI error injection to refuse to inject into the zero
page (Tony Luck).
- Correct description of INT3407 / INT3532 DPTF attributes in sysfs
(Sumeet Pawnikar).
- Add support for high frequency impedance notification to the DPTF
driver (Sumeet Pawnikar).
- Make mp_config_acpi_gsi() a void function (Li kunyu).
- Unify Package () representation for properties in the ACPI device
properties documentation (Andy Shevchenko).
* acpi-apei:
ACPI, APEI, EINJ: Refuse to inject into the zero page
ACPI: APEI: Fix missing ERST record id
* acpi-dptf:
ACPI: DPTF: Add support for high frequency impedance notification
ACPI: DPTF: Correct description of INT3407 / INT3532 attributes
* acpi-x86:
x86: ACPI: Make mp_config_acpi_gsi() a void function
* acpi-docs:
ACPI: docs: enumeration: Unify Package () for properties (part 2)
The Shared Buffers Data Sampling (SBDS) variant of Processor MMIO Stale
Data vulnerabilities may expose RDRAND, RDSEED and SGX EGETKEY data.
Mitigation for this is added by a microcode update.
As some of the implications of SBDS are similar to SRBDS, SRBDS mitigation
infrastructure can be leveraged by SBDS. Set X86_BUG_SRBDS and use SRBDS
mitigation.
Mitigation is enabled by default; use srbds=off to opt-out. Mitigation
status can be checked from below file:
/sys/devices/system/cpu/vulnerabilities/srbds
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Currently, Linux disables SRBDS mitigation on CPUs not affected by
MDS and have the TSX feature disabled. On such CPUs, secrets cannot
be extracted from CPU fill buffers using MDS or TAA. Without SRBDS
mitigation, Processor MMIO Stale Data vulnerabilities can be used to
extract RDRAND, RDSEED, and EGETKEY data.
Do not disable SRBDS mitigation by default when CPU is also affected by
Processor MMIO Stale Data vulnerabilities.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Add the sysfs reporting file for Processor MMIO Stale Data
vulnerability. It exposes the vulnerability and mitigation state similar
to the existing files for the other hardware vulnerabilities.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
When the CPU is affected by Processor MMIO Stale Data vulnerabilities,
Fill Buffer Stale Data Propagator (FBSDP) can propagate stale data out
of Fill buffer to uncore buffer when CPU goes idle. Stale data can then
be exploited with other variants using MMIO operations.
Mitigate it by clearing the Fill buffer before entering idle state.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
MDS, TAA and Processor MMIO Stale Data mitigations rely on clearing CPU
buffers. Moreover, status of these mitigations affects each other.
During boot, it is important to maintain the order in which these
mitigations are selected. This is especially true for
md_clear_update_mitigation() that needs to be called after MDS, TAA and
Processor MMIO Stale Data mitigation selection is done.
Introduce md_clear_select_mitigation(), and select all these mitigations
from there. This reflects relationships between these mitigations and
ensures proper ordering.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Processor MMIO Stale Data is a class of vulnerabilities that may
expose data after an MMIO operation. For details please refer to
Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst.
These vulnerabilities are broadly categorized as:
Device Register Partial Write (DRPW):
Some endpoint MMIO registers incorrectly handle writes that are
smaller than the register size. Instead of aborting the write or only
copying the correct subset of bytes (for example, 2 bytes for a 2-byte
write), more bytes than specified by the write transaction may be
written to the register. On some processors, this may expose stale
data from the fill buffers of the core that created the write
transaction.
Shared Buffers Data Sampling (SBDS):
After propagators may have moved data around the uncore and copied
stale data into client core fill buffers, processors affected by MFBDS
can leak data from the fill buffer.
Shared Buffers Data Read (SBDR):
It is similar to Shared Buffer Data Sampling (SBDS) except that the
data is directly read into the architectural software-visible state.
An attacker can use these vulnerabilities to extract data from CPU fill
buffers using MDS and TAA methods. Mitigate it by clearing the CPU fill
buffers using the VERW instruction before returning to a user or a
guest.
On CPUs not affected by MDS and TAA, user application cannot sample data
from CPU fill buffers using MDS or TAA. A guest with MMIO access can
still use DRPW or SBDR to extract data architecturally. Mitigate it with
VERW instruction to clear fill buffers before VMENTER for MMIO capable
guests.
Add a kernel parameter mmio_stale_data={off|full|full,nosmt} to control
the mitigation.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Processor MMIO Stale Data mitigation uses similar mitigation as MDS and
TAA. In preparation for adding its mitigation, add a common function to
update all mitigations that depend on MD_CLEAR.
[ bp: Add a newline in md_clear_update_mitigation() to separate
statements better. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Processor MMIO Stale Data is a class of vulnerabilities that may
expose data after an MMIO operation. For more details please refer to
Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
Add the Processor MMIO Stale Data bug enumeration. A microcode update
adds new bits to the MSR IA32_ARCH_CAPABILITIES, define them.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Kernel now supports chained power-off handlers. Use do_kernel_power_off()
that invokes chained power-off handlers. It also invokes legacy
pm_power_off() for now, which will be removed once all drivers will
be converted to the new sys-off API.
Reviewed-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Variable info is being assigned the same value twice, remove the
redundant assignment. Also assign variable v in the declaration.
Cleans up clang scan warning:
warning: Value stored to 'info' during its initialization is never read [deadcode.DeadStores]
No code changed:
# arch/x86/kernel/sev.o:
text data bss dec hex filename
19878 4487 4112 28477 6f3d sev.o.before
19878 4487 4112 28477 6f3d sev.o.after
md5:
bfbaa515af818615fd01fea91e7eba1b sev.o.before.asm
bfbaa515af818615fd01fea91e7eba1b sev.o.after.asm
[ bp: Running the before/after check on sev.c because sev-shared.c
gets included into it. ]
Fixes: 597cfe4821 ("x86/boot/compressed/64: Setup a GHCB-based VC Exception handler")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220516184215.51841-1-colin.i.king@gmail.com
register_nmi_handler() has no sanity check whether a handler has been
registered already. Such an unintended double-add leads to list corruption
and hard to diagnose problems during the next NMI handling.
Init the list head in the static NMI action struct and check it for being
empty in register_nmi_handler().
[ bp: Fixups. ]
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/lkml/20220511234332.3654455-1-seanjc@google.com
A PCMD (Paging Crypto MetaData) page contains the PCMD
structures of enclave pages that have been encrypted and
moved to the shmem backing store. When all enclave pages
sharing a PCMD page are loaded in the enclave, there is no
need for the PCMD page and it can be truncated from the
backing store.
A few issues appeared around the truncation of PCMD pages. The
known issues have been addressed but the PCMD handling code could
be made more robust by loudly complaining if any new issue appears
in this area.
Add a check that will complain with a warning if the PCMD page is not
actually empty after it has been truncated. There should never be data
in the PCMD page at this point since it is was just checked to be empty
and truncated with enclave mutex held and is updated with the
enclave mutex held.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lkml.kernel.org/r/6495120fed43fafc1496d09dd23df922b9a32709.1652389823.git.reinette.chatre@intel.com
Haitao reported encountering a WARN triggered by the ENCLS[ELDU]
instruction faulting with a #GP.
The WARN is encountered when the reclaimer evicts a range of
pages from the enclave when the same pages are faulted back right away.
Consider two enclave pages (ENCLAVE_A and ENCLAVE_B)
sharing a PCMD page (PCMD_AB). ENCLAVE_A is in the
enclave memory and ENCLAVE_B is in the backing store. PCMD_AB contains
just one entry, that of ENCLAVE_B.
Scenario proceeds where ENCLAVE_A is being evicted from the enclave
while ENCLAVE_B is faulted in.
sgx_reclaim_pages() {
...
/*
* Reclaim ENCLAVE_A
*/
mutex_lock(&encl->lock);
/*
* Get a reference to ENCLAVE_A's
* shmem page where enclave page
* encrypted data will be stored
* as well as a reference to the
* enclave page's PCMD data page,
* PCMD_AB.
* Release mutex before writing
* any data to the shmem pages.
*/
sgx_encl_get_backing(...);
encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
mutex_unlock(&encl->lock);
/*
* Fault ENCLAVE_B
*/
sgx_vma_fault() {
mutex_lock(&encl->lock);
/*
* Get reference to
* ENCLAVE_B's shmem page
* as well as PCMD_AB.
*/
sgx_encl_get_backing(...)
/*
* Load page back into
* enclave via ELDU.
*/
/*
* Release reference to
* ENCLAVE_B' shmem page and
* PCMD_AB.
*/
sgx_encl_put_backing(...);
/*
* PCMD_AB is found empty so
* it and ENCLAVE_B's shmem page
* are truncated.
*/
/* Truncate ENCLAVE_B backing page */
sgx_encl_truncate_backing_page();
/* Truncate PCMD_AB */
sgx_encl_truncate_backing_page();
mutex_unlock(&encl->lock);
...
}
mutex_lock(&encl->lock);
encl_page->desc &=
~SGX_ENCL_PAGE_BEING_RECLAIMED;
/*
* Write encrypted contents of
* ENCLAVE_A to ENCLAVE_A shmem
* page and its PCMD data to
* PCMD_AB.
*/
sgx_encl_put_backing(...)
/*
* Reference to PCMD_AB is
* dropped and it is truncated.
* ENCLAVE_A's PCMD data is lost.
*/
mutex_unlock(&encl->lock);
}
What happens next depends on whether it is ENCLAVE_A being faulted
in or ENCLAVE_B being evicted - but both end up with ENCLS[ELDU] faulting
with a #GP.
If ENCLAVE_A is faulted then at the time sgx_encl_get_backing() is called
a new PCMD page is allocated and providing the empty PCMD data for
ENCLAVE_A would cause ENCLS[ELDU] to #GP
If ENCLAVE_B is evicted first then a new PCMD_AB would be allocated by the
reclaimer but later when ENCLAVE_A is faulted the ENCLS[ELDU] instruction
would #GP during its checks of the PCMD value and the WARN would be
encountered.
Noting that the reclaimer sets SGX_ENCL_PAGE_BEING_RECLAIMED at the time
it obtains a reference to the backing store pages of an enclave page it
is in the process of reclaiming, fix the race by only truncating the PCMD
page after ensuring that no page sharing the PCMD page is in the process
of being reclaimed.
Cc: stable@vger.kernel.org
Fixes: 08999b2489 ("x86/sgx: Free backing memory after faulting the enclave page")
Reported-by: Haitao Huang <haitao.huang@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lkml.kernel.org/r/ed20a5db516aa813873268e125680041ae11dfcf.1652389823.git.reinette.chatre@intel.com
Haitao reported encountering a WARN triggered by the ENCLS[ELDU]
instruction faulting with a #GP.
The WARN is encountered when the reclaimer evicts a range of
pages from the enclave when the same pages are faulted back
right away.
The SGX backing storage is accessed on two paths: when there
are insufficient free pages in the EPC the reclaimer works
to move enclave pages to the backing storage and as enclaves
access pages that have been moved to the backing storage
they are retrieved from there as part of page fault handling.
An oversubscribed SGX system will often run the reclaimer and
page fault handler concurrently and needs to ensure that the
backing store is accessed safely between the reclaimer and
the page fault handler. This is not the case because the
reclaimer accesses the backing store without the enclave mutex
while the page fault handler accesses the backing store with
the enclave mutex.
Consider the scenario where a page is faulted while a page sharing
a PCMD page with the faulted page is being reclaimed. The
consequence is a race between the reclaimer and page fault
handler, the reclaimer attempting to access a PCMD at the
same time it is truncated by the page fault handler. This
could result in lost PCMD data. Data may still be
lost if the reclaimer wins the race, this is addressed in
the following patch.
The reclaimer accesses pages from the backing storage without
holding the enclave mutex and runs the risk of concurrently
accessing the backing storage with the page fault handler that
does access the backing storage with the enclave mutex held.
In the scenario below a PCMD page is truncated from the backing
store after all its pages have been loaded in to the enclave
at the same time the PCMD page is loaded from the backing store
when one of its pages are reclaimed:
sgx_reclaim_pages() { sgx_vma_fault() {
...
mutex_lock(&encl->lock);
...
__sgx_encl_eldu() {
...
if (pcmd_page_empty) {
/*
* EPC page being reclaimed /*
* shares a PCMD page with an * PCMD page truncated
* enclave page that is being * while requested from
* faulted in. * reclaimer.
*/ */
sgx_encl_get_backing() <----------> sgx_encl_truncate_backing_page()
}
mutex_unlock(&encl->lock);
} }
In this scenario there is a race between the reclaimer and the page fault
handler when the reclaimer attempts to get access to the same PCMD page
that is being truncated. This could result in the reclaimer writing to
the PCMD page that is then truncated, causing the PCMD data to be lost,
or in a new PCMD page being allocated. The lost PCMD data may still occur
after protecting the backing store access with the mutex - this is fixed
in the next patch. By ensuring the backing store is accessed with the mutex
held the enclave page state can be made accurate with the
SGX_ENCL_PAGE_BEING_RECLAIMED flag accurately reflecting that a page
is in the process of being reclaimed.
Consistently protect the reclaimer's backing store access with the
enclave's mutex to ensure that it can safely run concurrently with the
page fault handler.
Cc: stable@vger.kernel.org
Fixes: 1728ab54b4 ("x86/sgx: Add a page reclaimer")
Reported-by: Haitao Huang <haitao.huang@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lkml.kernel.org/r/fa2e04c561a8555bfe1f4e7adc37d60efc77387b.1652389823.git.reinette.chatre@intel.com
Recent commit 08999b2489 ("x86/sgx: Free backing memory
after faulting the enclave page") expanded __sgx_encl_eldu()
to clear an enclave page's PCMD (Paging Crypto MetaData)
from the PCMD page in the backing store after the enclave
page is restored to the enclave.
Since the PCMD page in the backing store is modified the page
should be marked as dirty to ensure the modified data is retained.
Cc: stable@vger.kernel.org
Fixes: 08999b2489 ("x86/sgx: Free backing memory after faulting the enclave page")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lkml.kernel.org/r/00cd2ac480db01058d112e347b32599c1a806bc4.1652389823.git.reinette.chatre@intel.com
SGX uses shmem backing storage to store encrypted enclave pages
and their crypto metadata when enclave pages are moved out of
enclave memory. Two shmem backing storage pages are associated with
each enclave page - one backing page to contain the encrypted
enclave page data and one backing page (shared by a few
enclave pages) to contain the crypto metadata used by the
processor to verify the enclave page when it is loaded back into
the enclave.
sgx_encl_put_backing() is used to release references to the
backing storage and, optionally, mark both backing store pages
as dirty.
Managing references and dirty status together in this way results
in both backing store pages marked as dirty, even if only one of
the backing store pages are changed.
Additionally, waiting until the page reference is dropped to set
the page dirty risks a race with the page fault handler that
may load outdated data into the enclave when a page is faulted
right after it is reclaimed.
Consider what happens if the reclaimer writes a page to the backing
store and the page is immediately faulted back, before the reclaimer
is able to set the dirty bit of the page:
sgx_reclaim_pages() { sgx_vma_fault() {
...
sgx_encl_get_backing();
... ...
sgx_reclaimer_write() {
mutex_lock(&encl->lock);
/* Write data to backing store */
mutex_unlock(&encl->lock);
}
mutex_lock(&encl->lock);
__sgx_encl_eldu() {
...
/*
* Enclave backing store
* page not released
* nor marked dirty -
* contents may not be
* up to date.
*/
sgx_encl_get_backing();
...
/*
* Enclave data restored
* from backing store
* and PCMD pages that
* are not up to date.
* ENCLS[ELDU] faults
* because of MAC or PCMD
* checking failure.
*/
sgx_encl_put_backing();
}
...
/* set page dirty */
sgx_encl_put_backing();
...
mutex_unlock(&encl->lock);
} }
Remove the option to sgx_encl_put_backing() to set the backing
pages as dirty and set the needed pages as dirty right after
receiving important data while enclave mutex is held. This ensures that
the page fault handler can get up to date data from a page and prepares
the code for a following change where only one of the backing pages
need to be marked as dirty.
Cc: stable@vger.kernel.org
Fixes: 1728ab54b4 ("x86/sgx: Add a page reclaimer")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Haitao Huang <haitao.huang@intel.com>
Link: https://lore.kernel.org/linux-sgx/8922e48f-6646-c7cc-6393-7c78dcf23d23@intel.com/
Link: https://lkml.kernel.org/r/fa9f98986923f43e72ef4c6702a50b2a0b3c42e3.1652389823.git.reinette.chatre@intel.com
The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out when "The VMM unmapped the bad page from
guest physical space and passed the machine check to the guest."
"The guest gets virtual #MC on an access to that page. When
the guest tries to do set_memory_uc() and instructs cpa_flush()
to do clean caches that results in taking another fault / exception
perhaps because the VMM unmapped the page from the guest."
Since the driver has special knowledge to handle NP or UC,
mark the poisoned page with NP and let driver handle it when
it comes down to repair.
Please refer to discussions here for more details.
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
Now since poisoned page is marked as not-present, in order to
avoid writing to a not-present page and trigger kernel Oops,
also fix pmem_do_write().
Fixes: 284ce4011b ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The functions invoked via do_arch_prctl_common() can only operate on
the current task and none of these function uses the task argument.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/87lev7vtxj.ffs@tglx
IFS is a CPU feature that allows a binary blob, similar to microcode,
to be loaded and consumed to perform low level validation of CPU
circuitry. In fact, it carries the same Processor Signature
(family/model/stepping) details that are contained in Intel microcode
blobs.
In support of an IFS driver to trigger loading, validation, and running
of these tests blobs, make the functionality of cpu_signatures_match()
and collect_cpu_info_early() available outside of the microcode driver.
Add an "intel_" prefix and drop the "_early" suffix from
collect_cpu_info_early() and EXPORT_SYMBOL_GPL() it. Add
declaration to x86 <asm/cpu.h>
Make cpu_signatures_match() an inline function in x86 <asm/cpu.h>,
and also give it an "intel_" prefix.
No functional change intended.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20220506225410.1652287-2-tony.luck@intel.com
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
The current implementation of PTRACE_KILL is buggy and has been for
many years as it assumes it's target has stopped in ptrace_stop. At a
quick skim it looks like this assumption has existed since ptrace
support was added in linux v1.0.
While PTRACE_KILL has been deprecated we can not remove it as
a quick search with google code search reveals many existing
programs calling it.
When the ptracee is not stopped at ptrace_stop some fields would be
set that are ignored except in ptrace_stop. Making the userspace
visible behavior of PTRACE_KILL a noop in those case.
As the usual rules are not obeyed it is not clear what the
consequences are of calling PTRACE_KILL on a running process.
Presumably userspace does not do this as it achieves nothing.
Replace the implementation of PTRACE_KILL with a simple
send_sig_info(SIGKILL) followed by a return 0. This changes the
observable user space behavior only in that PTRACE_KILL on a process
not stopped in ptrace_stop will also kill it. As that has always
been the intent of the code this seems like a reasonable change.
Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20220505182645.497868-7-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Because the return value of mp_config_acpi_gsi() is not use, change it
into a void function.
Signed-off-by: Li kunyu <kunyu@nfschina.com>
[ rjw: Subject and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In newer versions of Hyper-V, the x86/x64 PMU can be virtualized
into guest VMs by explicitly enabling it. Linux kernels are typically
built to automatically enable the hardlockup detector if the PMU is
found. To prevent the possibility of false positives due to the
vagaries of VM scheduling, disable the PMU-based hardlockup detector
by default in a VM on Hyper-V. The hardlockup detector can still be
enabled by overriding the default with the nmi_watchdog=1 option on
the kernel boot line or via sysctl at runtime.
This change mimics the approach taken with KVM guests in
commit 692297d8f9 ("watchdog: introduce the hardlockup_detector_disable()
function").
Linux on ARM64 does not provide a PMU-based hardlockup detector, so
there's no corresponding disable in the Hyper-V init code on ARM64.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1652111063-6535-1-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
drm/i915 feature pull #2 for v5.19:
Features and functionality:
- Add first set of DG2 PCI IDs for "motherboard down" designs (Matt Roper)
- Add initial RPL-P PCI IDs as ADL-P subplatform (Matt Atwood)
Refactoring and cleanups:
- Power well refactoring and cleanup (Imre)
- GVT-g refactor and mdev API cleanup (Christoph, Jason, Zhi)
- DPLL refactoring and cleanup (Ville)
- VBT panel specific data parsing cleanup (Ville)
- Use drm_mode_init() for on-stack modes (Ville)
Fixes:
- Fix PSR state pipe A/B confusion by clearing more state on disable (José)
- Fix FIFO underruns caused by not taking DRAM channel into account (Vinod)
- Fix FBC flicker on display 11+ by enabling a workaround (José)
- Fix VBT seamless DRRS min refresh rate check (Ville)
- Fix panel type assumption on bogus VBT data (Ville)
- Fix panel data parsing for VBT that misses panel data pointers block (Ville)
- Fix spurious AUX timeout/hotplug handling on LTTPR links (Imre)
Merges:
- Backmerge drm-next (Jani)
- GVT changes (Jani)
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/87bkwbkkdo.fsf@intel.com
Add fn and fn_arg members into struct kernel_clone_args and test for
them in copy_thread (instead of testing for PF_KTHREAD | PF_IO_WORKER).
This allows any task that wants to be a user space task that only runs
in kernel mode to use this functionality.
The code on x86 is an exception and still retains a PF_KTHREAD test
because x86 unlikely everything else handles kthreads slightly
differently than user space tasks that start with a function.
The functions that created tasks that start with a function
have been updated to set ".fn" and ".fn_arg" instead of
".stack" and ".stack_size". These functions are fork_idle(),
create_io_thread(), kernel_thread(), and user_mode_thread().
Link: https://lkml.kernel.org/r/20220506141512.516114-4-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With io_uring we have started supporting tasks that are for most
purposes user space tasks that exclusively run code in kernel mode.
The kernel task that exec's init and tasks that exec user mode
helpers are also user mode tasks that just run kernel code
until they call kernel execve.
Pass kernel_clone_args into copy_thread so these oddball
tasks can be supported more cleanly and easily.
v2: Fix spelling of kenrel_clone_args on h8300
Link: https://lkml.kernel.org/r/20220506141512.516114-2-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The FPU usage related to task FPU management is either protected by
disabling interrupts (switch_to, return to user) or via fpregs_lock() which
is a wrapper around local_bh_disable(). When kernel code wants to use the
FPU then it has to check whether it is possible by calling irq_fpu_usable().
But the condition in irq_fpu_usable() is wrong. It allows FPU to be used
when:
!in_interrupt() || interrupted_user_mode() || interrupted_kernel_fpu_idle()
The latter is checking whether some other context already uses FPU in the
kernel, but if that's not the case then it allows FPU to be used
unconditionally even if the calling context interrupted a fpregs_lock()
critical region. If that happens then the FPU state of the interrupted
context becomes corrupted.
Allow in kernel FPU usage only when no other context has in kernel FPU
usage and either the calling context is not hard interrupt context or the
hard interrupt did not interrupt a local bottomhalf disabled region.
It's hard to find a proper Fixes tag as the condition was broken in one way
or the other for a very long time and the eager/lazy FPU changes caused a
lot of churn. Picked something remotely connected from the history.
This survived undetected for quite some time as FPU usage in interrupt
context is rare, but the recent changes to the random code unearthed it at
least on a kernel which had FPU debugging enabled. There is probably a
higher rate of silent corruption as not all issues can be detected by the
FPU debugging code. This will be addressed in a subsequent change.
Fixes: 5d2bd7009f ("x86, fpu: decouple non-lazy/eager fpu restore from xsave")
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220501193102.588689270@linutronix.de
Clean up control_va_addr_alignment():
a. Make '=' required instead of optional (as documented).
b. Print a warning if an invalid option value is used.
c. Return 1 from the __setup handler when an invalid option value is
used. This prevents the kernel from polluting init's (limited)
environment space with the entire string.
Fixes: dfb09f9b7a ("x86, amd: Avoid cache aliasing penalties on AMD family 15h")
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Link: https://lore.kernel.org/r/20220315001045.7680-1-rdunlap@infradead.org
__setup() handlers should return 1 to obsolete_checksetup() in
init/main.c to indicate that the boot option has been handled. A return
of 0 causes the boot option/value to be listed as an Unknown kernel
parameter and added to init's (limited) argument (no '=') or environment
(with '=') strings. So return 1 from these x86 __setup handlers.
Examples:
Unknown kernel command line parameters "apicpmtimer
BOOT_IMAGE=/boot/bzImage-517rc8 vdso=1 ring3mwait=disable", will be
passed to user space.
Run /sbin/init as init process
with arguments:
/sbin/init
apicpmtimer
with environment:
HOME=/
TERM=linux
BOOT_IMAGE=/boot/bzImage-517rc8
vdso=1
ring3mwait=disable
Fixes: 2aae950b21 ("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu")
Fixes: 77b52b4c5c ("x86: add "debugpat" boot option")
Fixes: e16fd002af ("x86/cpufeature: Enable RING3MWAIT for Knights Landing")
Fixes: b8ce335906 ("x86_64: convert to clock events")
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Link: https://lore.kernel.org/r/20220314012725.26661-1-rdunlap@infradead.org
Raptor Lake supports the split lock detection feature. Add it to
the split_lock_cpu_ids[] array.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220427231059.293086-1-tony.luck@intel.com
CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some
new performance monitoring features for AMD processors.
Bit 0 of EAX indicates support for Performance Monitoring
Version 2 (PerfMonV2) features. If found to be set during
PMU initialization, the EBX bits of the same CPUID function
can be used to determine the number of available PMCs for
different PMU types. Additionally, Core PMCs can be managed
using new global control and status registers.
For better utilization of feature words, PerfMonV2 is added
as a scattered feature bit.
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/c70e497e22f18e7f05b025bb64ca21cc12b17792.1650515382.git.sandipan.das@amd.com
Always stash the address error_entry() is going to return to, in %r12
and get rid of the void *error_entry_ret; slot in struct bad_iret_stack
which was supposed to account for it and pt_regs pushed on the stack.
After this, both fixup_bad_iret() and sync_regs() can work on a struct
pt_regs pointer directly.
[ bp: Rewrite commit message, touch ups. ]
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220503032107.680190-2-jiangshanlai@gmail.com
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmJu9FYeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGAyEH/16xtJSpLmLwrQzG
o+4ToQxSQ+/9UHyu0RTEvHg2THm9/8emtIuYyc/5FgdoWctcSa3AaDcveWmuWmkS
KYcdhfJsaEqjNHS3OPYXN84fmo9Hel7263shu5+IYmP/sN0DfQp6UWTryX1q4B3Q
4Pdutkuq63Uwd8nBZ5LXQBumaBrmkkuMgWEdT4+6FOo1mPzwdIGBxCuz1UsNNl5k
chLWxkQfe2eqgWbYJrgCQfrVdORXVtoU2fGilZUNrHRVGkkldXkkz5clJfapyZD3
odmZCEbrE4GPKgZwCmDERMfD1hzhZDtYKiHfOQ506szH5ykJjPBcOjHed7dA60eB
J3+wdek=
=39Ca
-----END PGP SIGNATURE-----
Backmerge tag 'v5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-next
Linux 5.18-rc5
There was a build fix for arm I wanted in drm-next, so backmerge rather then cherry-pick.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Addresses: warning: Local variable 'mask' shadows outer variable
Remove extra variable declaration and switch the bit mask assignment to use
BIT_ULL() while at it.
Fixes: 522e92743b ("x86/fpu: Deduplicate copy_uabi_from_user/kernel_to_xstate()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/202204262032.jFYKit5j-lkp@intel.com
For the "nosmp" use case, the APIC initialization code selects
"APIC_SYMMETRIC_IO_NO_ROUTING" as the default interrupt mode and avoids
probing APIC drivers.
This works well for the default APIC modes, but for the x2APIC case the
probe function is required to allocate the cluster_hotplug mask. So in the
APIC_SYMMETRIC_IO_NO_ROUTING case when the x2APIC is initialized it
dereferences a NULL pointer and the kernel crashes.
This was observed on a TDX platform where x2APIC is enabled and "nosmp"
command line option is allowed.
To fix this issue, probe APIC drivers via default_setup_apic_routing() for
the APIC_SYMMETRIC_IO_NO_ROUTING interrupt mode too.
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/a64f864e1114bcd63593286aaf61142cfce384ea.1650076869.git.sathyanarayanan.kuppuswamy@intel.com
solely controlled by the hypervisor
- A build fix to make the function prototype (__warn()) as visible as
the definition itself
- A bunch of objtool annotation fixes which have accumulated over time
- An ORC unwinder fix to handle bad input gracefully
- Well, we thought the microcode gets loaded in time in order to restore
the microcode-emulated MSRs but we thought wrong. So there's a fix for
that to have the ordering done properly
- Add new Intel model numbers
- A spelling fix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmJucwMACgkQEsHwGGHe
VUpgiw/8CuOXJhHSuYscEfAmPGoiG9+oLTYVc1NEfJEIyNuZULcr+aYlddTF79hm
V+Flq6FyA3NU220F8t5s3jOaDkWjWJ8nZGPUUxo5+yNHugIGYh/kLy6w8LC8SgLq
GqqYX4fd28tqFSgIBCrr+9GgpTE7bvzBGYLByKj9AO6ecLvWJmc+bENQCTaTRFgl
og6xenzyECWxgbWIql0UeB1xw2AJ8UfYVeLKzOHpc95ZF209+mg7JLL5yIxwwgNV
/CGoh28+twjX5SA1rr3cUx9gmFzrYubYZMglhgugBsShkdfuMLhis4woU7lF7cV9
HnxH6mkvN4R0Im7DZXgQPJ63ZFLJ8tN3RyLQDYBRd71w0Epr/K2aacYeQkWTflcx
4Ia+AiJ7rpKx0cUbUHX7pf3lzna/c8u/xPnlAIbR6rfwXO5mACupaofN5atAdx9T
9rPCPIdroM5XzBTiN4aNJHEsADL1h/oQdzrziTwryyezbTtnNC5KW53hnqyf5Bqo
gBlbfVsnwM0AfLHSPE1D0liOR2spwuB+/bWrsOCzEYENC44nDxHE/MUUjg7/l+Vr
6N5syrQ7QsIPqUaEM+bQdKHGaXSU6amF8OWpFMjzkleQw5m7/X8LzyZsBlB4yeqv
63hUEpdmFyR/6bLdEvjUXeAPcbA41WHwOMdNPaKDqn3zhwYZaa4=
=poyP
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- A fix to disable PCI/MSI[-X] masking for XEN_HVM guests as that is
solely controlled by the hypervisor
- A build fix to make the function prototype (__warn()) as visible as
the definition itself
- A bunch of objtool annotation fixes which have accumulated over time
- An ORC unwinder fix to handle bad input gracefully
- Well, we thought the microcode gets loaded in time in order to
restore the microcode-emulated MSRs but we thought wrong. So there's
a fix for that to have the ordering done properly
- Add new Intel model numbers
- A spelling fix
* tag 'x86_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pci/xen: Disable PCI/MSI[-X] masking for XEN_HVM guests
bug: Have __warn() prototype defined unconditionally
x86/Kconfig: fix the spelling of 'becoming' in X86_KERNEL_IBT config
objtool: Use offstr() to print address of missing ENDBR
objtool: Print data address for "!ENDBR" data warnings
x86/xen: Add ANNOTATE_NOENDBR to startup_xen()
x86/uaccess: Add ENDBR to __put_user_nocheck*()
x86/retpoline: Add ANNOTATE_NOENDBR for retpolines
x86/static_call: Add ANNOTATE_NOENDBR to static call trampoline
objtool: Enable unreachable warnings for CLANG LTO
x86,objtool: Explicitly mark idtentry_body()s tail REACHABLE
x86,objtool: Mark cpu_startup_entry() __noreturn
x86,xen,objtool: Add UNWIND hint
lib/strn*,objtool: Enforce user_access_begin() rules
MAINTAINERS: Add x86 unwinding entry
x86/unwind/orc: Recheck address range after stack info was updated
x86/cpu: Load microcode during restore_processor_state()
x86/cpu: Add new Alderlake and Raptorlake CPU model numbers
Remove the read_from_oldmem() wrapper introduced earlier and convert all
the remaining callers to pass an iov_iter.
Link: https://lkml.kernel.org/r/20220408090636.560886-4-bhe@redhat.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Amit Daniel Kachhap <amit.kachhap@arm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Convert vmcore to use an iov_iter", v5.
For some reason several people have been sending bad patches to fix
compiler warnings in vmcore recently. Here's how it should be done.
Compile-tested only on x86. As noted in the first patch, s390 should take
this conversion a bit further, but I'm not inclined to do that work
myself.
This patch (of 3):
Instead of passing in a 'buf' and 'userbuf' argument, pass in an iov_iter.
s390 needs more work to pass the iov_iter down further, or refactor, but
I'd be more comfortable if someone who can test on s390 did that work.
It's more convenient to convert the whole of read_from_oldmem() to take an
iov_iter at the same time, so rename it to read_from_oldmem_iter() and add
a temporary read_from_oldmem() wrapper that creates an iov_iter.
Link: https://lkml.kernel.org/r/20220408090636.560886-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20220408090636.560886-2-bhe@redhat.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The ftrace_[enable,disable]_ftrace_graph_caller() are used to do
special hooks for graph tracer, which are not needed on some ARCHs
that use graph_ops:func function to install return_hooker.
So introduce the weak version in ftrace core code to cleanup
in x86.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20220420160006.17880-1-zhouchengming@bytedance.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Due to the avoidance of IPIs to idle CPUs arch_freq_get_on_cpu() can return
0 when the last sample was too long ago.
show_cpuinfo() has a fallback to cpufreq_quick_get() and if that fails to
return cpu_khz, but the readout code for the per CPU scaling frequency in
sysfs does not.
Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same
when reading /proc/cpuinfo and /sys/..../cur_scaling_freq.
Suggested-by: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Doug Smythies <dsmythies@telus.net>
Link: https://lore.kernel.org/r/87pml5180p.ffs@tglx
Reading the current CPU frequency from /sys/..../scaling_cur_freq involves
in the worst case two IPIs due to the ad hoc sampling.
The frequency invariance infrastructure provides the APERF/MPERF samples
already. Utilize them and consolidate this with the /proc/cpuinfo readout.
The sample is considered valid for 20ms. So for idle or isolated NOHZ full
CPUs the function returns 0, which is matching the previous behaviour.
The resulting text size vs. the original APERF/MPERF plus the separate
frequency invariance code:
text: 2411 -> 723
init.text: 0 -> 767
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.934040006@linutronix.de
The frequency invariance infrastructure provides the APERF/MPERF samples
already. Utilize them for the cpu frequency display in /proc/cpuinfo.
The sample is considered valid for 20ms. So for idle or isolated NOHZ full
CPUs the function returns 0, which is matching the previous behaviour.
This gets rid of the mass IPIs and a delay of 20ms for stabilizing observed
by Eric when reading /proc/cpuinfo.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.875029458@linutronix.de
Now that the MSR readout is unconditional, store the results in the per CPU
data structure along with a jiffies timestamp for the CPU frequency readout
code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.817702355@linutronix.de
The frequency invariance support is currently limited to x86/64 and SMP,
which is the vast majority of machines.
arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF
and MPERF MSRs. The CPU frequency getters function do the same via dedicated
IPIs.
While it could be argued that on systems where frequency invariance support
is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can
be avoided, it does not make sense to keep the extra code and the resulting
runtime issues of mass IPIs around.
As a first step split out the non frequency invariance specific
initialization code and the read MSR portion of arch_scale_freq_tick(). The
rest of the code is still conditional and guarded with a static key.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.761988704@linutronix.de
Preparation for sharing code with the CPU frequency portion of the
aperf/mperf code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.706185092@linutronix.de
Preparation for sharing code with the CPU frequency portion of the
aperf/mperf code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.648485667@linutronix.de
AMD boot CPU initialization happens late via ACPI/CPPC which prevents the
Intel parts from being marked __init.
Split out the common code and provide a dedicated interface for the AMD
initialization and mark the Intel specific code and data __init.
The remaining text size is almost cut in half:
text: 2614 -> 1350
init.text: 0 -> 786
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.592465719@linutronix.de
This code is convoluted and because it can be invoked post init via the
ACPI/CPPC code, all of the initialization functionality is built in instead
of being part of init text and init data.
As a first step create separate calls for the boot and the application
processors.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.536733494@linutronix.de
as this can share code with the preexisting APERF/MPERF code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.478362457@linutronix.de
aperfmperf_get_khz() already excludes idle CPUs from APERF/MPERF sampling
and that's a reasonable decision. There is no point in sending up to two
IPIs to an idle CPU just because someone reads a sysfs file.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.419880163@linutronix.de
Changes to the "warn" mode of split lock handling mean that TIF_SLD is
never set.
Remove the bit, and the functions that use it.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220310204854.31752-3-tony.luck@intel.com
In https://lore.kernel.org/all/87y22uujkm.ffs@tglx/ Thomas
said:
Its's simply wishful thinking that stuff gets fixed because of a
WARN_ONCE(). This has never worked. The only thing which works is to
make stuff fail hard or slow it down in a way which makes it annoying
enough to users to complain.
He was talking about WBINVD. But it made me think about how we use the
split lock detection feature in Linux.
Existing code has three options for applications:
1) Don't enable split lock detection (allow arbitrary split locks)
2) Warn once when a process uses split lock, but let the process
keep running with split lock detection disabled
3) Kill process that use split locks
Option 2 falls into the "wishful thinking" territory that Thomas warns does
nothing. But option 3 might not be viable in a situation with legacy
applications that need to run.
Hence make option 2 much stricter to "slow it down in a way which makes
it annoying".
Primary reason for this change is to provide better quality of service to
the rest of the applications running on the system. Internal testing shows
that even with many processes splitting locks, performance for the rest of
the system is much more responsive.
The new "warn" mode operates like this. When an application tries to
execute a bus lock the #AC handler.
1) Delays (interruptibly) 10 ms before moving to next step.
2) Blocks (interruptibly) until it can get the semaphore
If interrupted, just return. Assume the signal will either
kill the task, or direct execution away from the instruction
that is trying to get the bus lock.
3) Disables split lock detection for the current core
4) Schedules a work queue to re-enable split lock detect in 2 jiffies
5) Returns
The work queue that re-enables split lock detection also releases the
semaphore.
There is a corner case where a CPU may be taken offline while split lock
detection is disabled. A CPU hotplug handler handles this case.
Old behaviour was to only print the split lock warning on the first
occurrence of a split lock from a task. Preserve that by adding a flag to
the task structure that suppresses subsequent split lock messages from that
task.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220310204854.31752-2-tony.luck@intel.com
The GHCB specification section 2.7 states that when SEV-SNP is enabled,
a guest should not rely on the hypervisor to provide the address of the
AP jump table. Instead, if a guest BIOS wants to provide an AP jump
table, it should record the address in the SNP secrets page so the guest
operating system can obtain it directly from there.
Fix this on the guest kernel side by having SNP guests use the AP jump
table address published in the secrets page rather than issuing a GHCB
request to get it.
[ mroth:
- Improve error handling when ioremap()/memremap() return NULL
- Don't mix function calls with declarations
- Add missing __init
- Tweak commit message ]
Fixes: 0afb6b660a ("x86/sev: Use SEV-SNP AP creation to start secondary CPUs")
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220422135624.114172-3-michael.roth@amd.com
Currently, get_secrets_page() is only reachable from the following call
chain:
__init snp_init_platform_device():
get_secrets_page()
so mark it as __init as well. This is also needed since it calls
early_memremap(), which is also an __init routine.
Similarly, get_jump_table_addr() is only reachable from the following
call chain:
__init setup_real_mode():
sme_sev_setup_real_mode():
sev_es_setup_ap_jump_table():
get_jump_table_addr()
so mark get_jump_table_addr() and everything up that call chain as
__init as well. This is also needed since future patches will add a
call to get_secrets_page(), which needs to be __init due to the reasons
stated above.
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220422135624.114172-2-michael.roth@amd.com
Need to bring commit d8bb92e70a ("drm/dp: Factor out a function to
probe a DPCD address") back as a dependency to further work in
drm-intel-next.
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
XSAVEC is the user space counterpart of XSAVES which cannot save supervisor
state. In virtualization scenarios the hypervisor does not expose XSAVES
but XSAVEC to the guest, though the kernel does not make use of it.
That's unfortunate because XSAVEC uses the compacted format of saving the
XSTATE. This is more efficient in terms of storage space vs. XSAVE[OPT] as
it does not create holes for XSTATE components which are not supported or
enabled by the kernel but are available in hardware. There is room for
further optimizations when XSAVEC/S and XGETBV1 are supported.
In order to support XSAVEC:
- Define the XSAVEC ASM macro as it's not yet supported by the required
minimal toolchain.
- Create a software defined X86_FEATURE_XCOMPACTED to select the compacted
XSTATE buffer format for both XSAVEC and XSAVES.
- Make XSAVEC an option in the 'XSAVE' ASM alternatives
Requested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220404104820.598704095@linutronix.de
When a machine error is graded as PANIC by the AMD grading logic, the
MCE handler calls mce_panic(). The notification chain does not come
into effect so the AMD EDAC driver does not decode the errors. In these
cases, the messages displayed to the user are more cryptic and miss
information that might be relevant, like the context in which the error
took place.
Add messages to the grading logic for machine errors so that it is clear
what error it was.
[ bp: Massage commit message. ]
Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20220405183212.354606-3-carlos.bilbao@amd.com
The MCE handler needs to understand the severity of the machine errors to
act accordingly. Simplify the AMD grading logic following a logic that
closely resembles the descriptions of the public PPR documents. This will
help include more fine-grained grading of errors in the future.
[ bp: Touchups. ]
Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20220405183212.354606-2-carlos.bilbao@amd.com
Now that stack validation is an optional feature of objtool, add
CONFIG_OBJTOOL and replace most usages of CONFIG_STACK_VALIDATION with
it.
CONFIG_STACK_VALIDATION can now be considered to be frame-pointer
specific. CONFIG_UNWINDER_ORC is already inherently valid for live
patching, so no need to "validate" it.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Link: https://lkml.kernel.org/r/939bf3d85604b2a126412bf11af6e3bd3b872bcb.1650300597.git.jpoimboe@redhat.com
With SIGTRAP on perf events, we have encountered termination of
processes due to user space attempting to block delivery of SIGTRAP.
Consider this case:
<set up SIGTRAP on a perf event>
...
sigset_t s;
sigemptyset(&s);
sigaddset(&s, SIGTRAP | <and others>);
sigprocmask(SIG_BLOCK, &s, ...);
...
<perf event triggers>
When the perf event triggers, while SIGTRAP is blocked, force_sig_perf()
will force the signal, but revert back to the default handler, thus
terminating the task.
This makes sense for error conditions, but not so much for explicitly
requested monitoring. However, the expectation is still that signals
generated by perf events are synchronous, which will no longer be the
case if the signal is blocked and delivered later.
To give user space the ability to clearly distinguish synchronous from
asynchronous signals, introduce siginfo_t::si_perf_flags and
TRAP_PERF_FLAG_ASYNC (opted for flags in case more binary information is
required in future).
The resolution to the problem is then to (a) no longer force the signal
(avoiding the terminations), but (b) tell user space via si_perf_flags
if the signal was synchronous or not, so that such signals can be
handled differently (e.g. let user space decide to ignore or consider
the data imprecise).
The alternative of making the kernel ignore SIGTRAP on perf events if
the signal is blocked may work for some usecases, but likely causes
issues in others that then have to revert back to interception of
sigprocmask() (which we want to avoid). [ A concrete example: when using
breakpoint perf events to track data-flow, in a region of code where
signals are blocked, data-flow can no longer be tracked accurately.
When a relevant asynchronous signal is received after unblocking the
signal, the data-flow tracking logic needs to know its state is
imprecise. ]
Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/r/20220404111204.935357-1-elver@google.com
MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to
host-side polling after suspend/resume. Non-bootstrap CPUs are
restored correctly by the haltpoll driver because they are hot-unplugged
during suspend and hot-plugged during resume; however, the BSP
is not hotpluggable and remains in host-sde polling mode after
the guest resume. The makes the guest pay for the cost of vmexits
every time the guest enters idle.
Fix it by recording BSP's haltpoll state and resuming it during guest
resume.
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1650267752-46796-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
During patch review, it was decided the SNP guest driver name should not
be SEV-SNP specific, but should be generic for use with anything SEV.
However, this feedback was missed and the driver name, and many of the
driver functions and structures, are SEV-SNP name specific. Rename the
driver to "sev-guest" (to match the misc device that is created) and
update some of the function and structure names, too.
While in the file, adjust the one pr_err() message to be a dev_err()
message so that the message, if issued, uses the driver name.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/307710bb5515c9088a19fd0b930268c7300479b2.1650464054.git.thomas.lendacky@amd.com
Adding initial PCI ids for RPL-P.
RPL-P behaves identically to ADL-P from i915's point of view.
Changes since V1 :
- SUBPLATFORM ADL_N and RPL_P clash as both are ADLP
based - Matthew R
Bspec: 55376
Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com>
Signed-off-by: Madhumitha Tolakanahalli Pradeep <madhumitha.tolakanahalli.pradeep@intel.com>
Signed-off-by: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
[mattrope: Corrected comment formatting to match coding style]
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220418062157.2974665-1-tejaskumarx.surendrakumar.upadhyay@intel.com
A crash was observed in the ORC unwinder:
BUG: stack guard page was hit at 000000000dd984a2 (stack is 00000000d1caafca..00000000613712f0)
kernel stack overflow (page fault): 0000 [#1] SMP NOPTI
CPU: 93 PID: 23787 Comm: context_switch1 Not tainted 5.4.145 #1
RIP: 0010:unwind_next_frame
Call Trace:
<NMI>
perf_callchain_kernel
get_perf_callchain
perf_callchain
perf_prepare_sample
perf_event_output_forward
__perf_event_overflow
perf_ibs_handle_irq
perf_ibs_nmi_handler
nmi_handle
default_do_nmi
do_nmi
end_repeat_nmi
This was really two bugs:
1) The perf IBS code passed inconsistent regs to the unwinder.
2) The unwinder didn't handle the bad input gracefully.
Fix the latter bug. The ORC unwinder needs to be immune against bad
inputs. The problem is that stack_access_ok() doesn't recheck the
validity of the full range of registers after switching to the next
valid stack with get_stack_info(). Fix that.
[ jpoimboe: rewrote commit log ]
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/1650353656-956624-1-git-send-email-dmtrmonakhov@yandex-team.ru
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
When resuming from system sleep state, restore_processor_state()
restores the boot CPU MSRs. These MSRs could be emulated by microcode.
If microcode is not loaded yet, writing to emulated MSRs leads to
unchecked MSR access error:
...
PM: Calling lapic_suspend+0x0/0x210
unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0...0) at rIP: ... (native_write_msr)
Call Trace:
<TASK>
? restore_processor_state
x86_acpi_suspend_lowlevel
acpi_suspend_enter
suspend_devices_and_enter
pm_suspend.cold
state_store
kobj_attr_store
sysfs_kf_write
kernfs_fop_write_iter
new_sync_write
vfs_write
ksys_write
__x64_sys_write
do_syscall_64
entry_SYSCALL_64_after_hwframe
RIP: 0033:0x7fda13c260a7
To ensure microcode emulated MSRs are available for restoration, load
the microcode on the boot CPU before restoring these MSRs.
[ Pawan: write commit message and productize it. ]
Fixes: e2a1256b17 ("x86/speculation: Restore speculation related MSRs during S3 resume")
Reported-by: Kyle D. Pelton <kyle.d.pelton@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Tested-by: Kyle D. Pelton <kyle.d.pelton@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215841
Link: https://lore.kernel.org/r/4350dfbf785cd482d3fafa72b2b49c83102df3ce.1650386317.git.pawan.kumar.gupta@linux.intel.com
Reuse the generic swiotlb initialization for xen-swiotlb. For ARM/ARM64
this works trivially, while for x86 xen_swiotlb_fixup needs to be passed
as the remap argument to swiotlb_init_remap/swiotlb_init_late.
Note that the lower bound of the swiotlb size is changed to the smaller
IO_TLB_MIN_SLABS based value with this patch, but that is fine as the
2MB value used in Xen before was just an optimization and is not the
hard lower bound.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Pass a boolean flag to indicate if swiotlb needs to be enabled based on
the addressing needs, and replace the verbose argument with a set of
flags, including one to force enable bounce buffering.
Note that this patch removes the possibility to force xen-swiotlb use
with the swiotlb=force parameter on the command line on x86 (arm and
arm64 never supported that), but this interface will be restored shortly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Move enabling SWIOTLB_FORCE for guest memory encryption into common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
The IOMMU table tries to separate the different IOMMUs into different
backends, but actually requires various cross calls.
Rewrite the code to do the generic swiotlb/swiotlb-xen setup directly
in pci-dma.c and then just call into the IOMMU drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
- Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX to
cover all CPUs which allow to disable it.
- Disable TSX development mode at boot so that a microcode update which
provides TSX development mode does not suddenly make the system
vulnerable to TSX Asynchronous Abort.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmJb5LYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVVbD/9cxZWkFctCiymedUZqLabkfpYSki65
MngdpCPzCNaaIdlp44lwCido5+gJsY9unXdm3OAUzLjv6SsxxpDr5njz1/C6TM1l
XmWjlkLEbG2QDPd1Ybd/lpYQORBmiukyo8v8x0yFT7ZzwvSddoDZAbeUtkQBrIin
sDTeExsewKzL2X5qXhttrHLHu1PYgurn4ThIrrG+eg2e4FNk6UUFUS3TOyMvzJDg
NWJ7N5pGy9YkR7CISq1q+qdnH55pGaUrgonDi2qBTt3EaH0fQtZP2ZtIOYr3O4nI
YCx6isrIiGUB6kSygofxmk4B+22CaUJXd2OcUxMZ/Th/a2aCK+35BtGVPXQGi6nU
d7m+ZWB7dShOiejFygS59ty+5L5kliKXYZfUASsq1CLoXH8K1xUwBMkbY5FQ2WH1
Ue4KUvjguNqsgSRAfeHdOi6B36oot0Xf9JO013Wm3V/r9hsGPtSOjWwFuVvT/euw
a9iFtruATxDssBxH/l0djCKnwwm5yuOt1OpyizcIMFnlCgRD06h/6zgAvsJK7c8d
dh6lC4D2mXP1e2wtEyZelve1tmRJ/FeReyG2V5FNU7m1mWYGm1rJZ4AEvnbrzcbC
ePwFva0lPu8GVKG6HRgHfR8PjuQ7TFmKPKytT7fboIqQpTIY+1Q75wYD4eXkSu8Q
/ltzXQz/8lz7bA==
=UQaW
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two x86 fixes related to TSX:
- Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX
to cover all CPUs which allow to disable it.
- Disable TSX development mode at boot so that a microcode update
which provides TSX development mode does not suddenly make the
system vulnerable to TSX Asynchronous Abort"
* tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsx: Disable TSX development mode at boot
x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
Commit 3ee48b6af4 ("mm, x86: Saving vmcore with non-lazy freeing of
vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
purge the vmap areas instead of doing it lazily.
Commit 690467c81b ("mm/vmalloc: Move draining areas out of caller
context") moved the purging from the vunmap() caller to a worker thread.
Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
(possibly forever). For example, consider the following scenario:
1. Thread reads from /proc/vmcore. This eventually calls
__copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
vmap_lazy_nr to lazy_max_pages() + 1.
2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
pages (one page plus the guard page) to the purge list and
vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
drain_vmap_work is scheduled.
3. Thread returns from the kernel and is scheduled out.
4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
frees the 2 pages on the purge list. vmap_lazy_nr is now
lazy_max_pages() + 1.
5. This is still over the threshold, so it tries to purge areas again,
but doesn't find anything.
6. Repeat 5.
If the system is running with only one CPU (which is typicial for kdump)
and preemption is disabled, then this will never make forward progress:
there aren't any more pages to purge, so it hangs. If there is more
than one CPU or preemption is enabled, then the worker thread will spin
forever in the background. (Note that if there were already pages to be
purged at the time that set_iounmap_nonlazy() was called, this bug is
avoided.)
This can be reproduced with anything that reads from /proc/vmcore
multiple times. E.g., vmcore-dmesg /proc/vmcore.
It turns out that improvements to vmap() over the years have obsoleted
the need for this "optimization". I benchmarked `dd if=/proc/vmcore
of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
5.18-rc1 with set_iounmap_nonlazy() removed entirely:
|5.17 |5.18+fix|5.18+removal
4k|40.86s| 40.09s| 26.73s
1M|24.47s| 23.98s| 21.84s
The removal was the fastest (by a wide margin with 4k reads). This
patch removes set_iounmap_nonlazy().
Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
Fixes: 690467c81b ("mm/vmalloc: Move draining areas out of caller context")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
GS is always a user segment now.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220325153953.162643-4-brgerst@gmail.com
Read a record is cleared by others, but the deleted record cache entry is
still created by erst_get_record_id_next. When next enumerate the records,
get the cached deleted record, then erst_read() return -ENOENT and try to
get next record, loop back to first ID will return 0 in function
__erst_record_id_cache_add_one and then set record_id as
APEI_ERST_INVALID_RECORD_ID, finished this time read operation.
It will result in read the records just in the cache hereafter.
This patch cleared the deleted record cache, fix the issue that
"./erst-inject -p" shows record counts not equal to "./erst-inject -n".
A reproducer of the problem(retry many times):
[root@localhost erst-inject]# ./erst-inject -c 0xaaaaa00011
[root@localhost erst-inject]# ./erst-inject -p
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00012
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00013
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00014
[root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000006
[root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000007
[root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000008
[root@localhost erst-inject]# ./erst-inject -p
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00012
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00013
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00014
[root@localhost erst-inject]# ./erst-inject -n
total error record count: 6
Signed-off-by: Liu Xinpeng <liuxp11@chinatelecom.cn>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The pr_debug() intends to display the memsz member, but the
parameter is actually the bufsz member (which is already
displayed). Correct this to display memsz value.
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/r/20220413164237.20845-2-eric.devolder@oracle.com
When the "nopv" command line parameter is used, it should not waste
memory for kvmclock.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1646727529-11774-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge branch for features that did not make it into 5.18:
* New ioctls to get/set TSC frequency for a whole VM
* Allow userspace to opt out of hypercall patching
Nested virtualization improvements for AMD:
* Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE,
nested vGIF)
* Allow AVIC to co-exist with a nested guest running
* Fixes for LBR virtualizations when a nested guest is running,
and nested LBR virtualization support
* PAUSE filtering for nested hypervisors
Guest support:
* Decoupling of vcpu_is_preempted from PV spinlocks
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Daniel stumbled over the bit overlap of the i82498DX external APIC and the
TSC deadline timer configuration bit in modern APICs, which is neither
documented in the code nor in the current SDM. Maciej provided links to
the original i82489DX/486 documentation. See Link.
Remove the i82489DX macro maze, use a i82489DX specific define in the apic
code and document the overlap in a comment.
Reported-by: Daniel Vacek <neelx@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Link: https://lore.kernel.org/r/87ee22f3ci.ffs@tglx
When overriding NHLT ACPI-table tests show that on some platforms
there is problem that NHLT contains garbage after hibernation/resume
cycle.
Problem stems from the fact that ACPI override performs early memory
allocation using memblock_phys_alloc_range() in
memblock_phys_alloc_range(). This memory block is later being marked as
ACPI memory block in arch_reserve_mem_area(). Later when memory areas
are considered for hibernation it is being marked as nosave in
e820__register_nosave_regions().
Fix this by marking ACPI override memory area as ACPI NVS
(Non-Volatile-Sleeping), which according to specification needs to be
saved on entering S4 and restored when leaving and is implemented as
such in kernel.
Signed-off-by: Amadeusz Sławiński <amadeuszx.slawinski@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Sync up with v5.18-rc1, in particular to get 5e3094cfd9
("drm/i915/xehpsdv: Add has_flat_ccs to device info").
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
A microcode update on some Intel processors causes all TSX transactions
to always abort by default[*]. Microcode also added functionality to
re-enable TSX for development purposes. With this microcode loaded, if
tsx=on was passed on the cmdline, and TSX development mode was already
enabled before the kernel boot, it may make the system vulnerable to TSX
Asynchronous Abort (TAA).
To be on safer side, unconditionally disable TSX development mode during
boot. If a viable use case appears, this can be revisited later.
[*]: Intel TSX Disable Update for Selected Processors, doc ID: 643557
[ bp: Drop unstable web link, massage heavily. ]
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/347bd844da3a333a9793c6687d4e4eb3b2419a3e.1646943780.git.pawan.kumar.gupta@linux.intel.com
tsx_clear_cpuid() uses MSR_TSX_FORCE_ABORT to clear CPUID.RTM and
CPUID.HLE. Not all CPUs support MSR_TSX_FORCE_ABORT, alternatively use
MSR_IA32_TSX_CTRL when supported.
[ bp: Document how and why TSX gets disabled. ]
Fixes: 293649307e ("x86/tsx: Clear CPUID bits when TSX always force aborts")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/5b323e77e251a9c8bcdda498c5cc0095be1e1d3c.1646943780.git.pawan.kumar.gupta@linux.intel.com
In some cases, x86 code calls cpumask_weight() to check if any bit of a
given cpumask is set.
This can be done more efficiently with cpumask_empty() because
cpumask_empty() stops traversing the cpumask as soon as it finds first set
bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20220210224933.379149-17-yury.norov@gmail.com
ACPI firmware advertises PCI host bridge resources via PNP0A03 _CRS
methods. Some BIOSes include non-window address space in _CRS, and if we
allocate that non-window space for PCI devices, they don't work.
4dc2287c18 ("x86: avoid E820 regions when allocating address space")
works around this issue by clipping out any regions mentioned in the E820
table in the allocate_resource() path, but the implementation has a couple
issues:
- The clipping is done for *all* allocations, not just those for PCI
address space, and
- The clipping is done at each allocation instead of being done once when
setting up the host bridge windows.
Rework the implementation so we only clip PCI host bridge windows, and we
do it once when setting them up.
Example output changes:
BIOS-e820: [mem 0x00000000b0000000-0x00000000c00fffff] reserved
+ acpi PNP0A08:00: clipped [mem 0xc0000000-0xfebfffff window] to [mem 0xc0100000-0xfebfffff window] for e820 entry [mem 0xb0000000-0xc00fffff]
- pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
+ pci_bus 0000:00: root bus resource [mem 0xc0100000-0xfebfffff window]
Link: https://lore.kernel.org/r/20220304035110.988712-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When remove_e820_regions() clips a resource because an E820 region overlaps
it, log a note in dmesg to add in debugging.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Replace the halt loop in handle_vc_boot_ghcb() with an
sev_es_terminate(). The HLT gives the system no indication the guest is
unhappy. The termination request will signal there was an error during
VC handling during boot.
[ bp: Update it to pass the reason set too. ]
Signed-off-by: Peter Gonda <pgonda@google.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20220317211913.1397427-1-pgonda@google.com
The kernel interacts with each bare-metal IOAPIC with a special
MMIO page. When running under KVM, the guest's IOAPICs are
emulated by KVM.
When running as a TDX guest, the guest needs to mark each IOAPIC
mapping as "shared" with the host. This ensures that TDX private
protections are not applied to the page, which allows the TDX host
emulation to work.
ioremap()-created mappings such as virtio will be marked as
shared by default. However, the IOAPIC code does not use ioremap() and
instead uses the fixmap mechanism.
Introduce a special fixmap helper just for the IOAPIC code. Ensure
that it marks IOAPIC pages as "shared". This replaces
set_fixmap_nocache() with __set_fixmap() since __set_fixmap()
allows custom 'prot' values.
AMD SEV gets IOAPIC pages shared because FIXMAP_PAGE_NOCACHE has _ENC
bit clear. TDX has to set bit to share the page with the host.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-29-kirill.shutemov@linux.intel.com
Intel TDX protects guest memory from VMM access. Any memory that is
required for communication with the VMM must be explicitly shared.
It is a two-step process: the guest sets the shared bit in the page
table entry and notifies VMM about the change. The notification happens
using MapGPA hypercall.
Conversion back to private memory requires clearing the shared bit,
notifying VMM with MapGPA hypercall following with accepting the memory
with AcceptPage hypercall.
Provide a TDX version of x86_platform.guest.* callbacks. It makes
__set_memory_enc_pgtable() work right in TDX guest.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-27-kirill.shutemov@linux.intel.com
There are a few MSRs and control register bits that the kernel
normally needs to modify during boot. But, TDX disallows
modification of these registers to help provide consistent security
guarantees. Fortunately, TDX ensures that these are all in the correct
state before the kernel loads, which means the kernel does not need to
modify them.
The conditions to avoid are:
* Any writes to the EFER MSR
* Clearing CR4.MCE
This theoretically makes the guest boot more fragile. If, for instance,
EFER was set up incorrectly and a WRMSR was performed, it will trigger
early exception panic or a triple fault, if it's before early
exceptions are set up. However, this is likely to trip up the guest
BIOS long before control reaches the kernel. In any case, these kinds
of problems are unlikely to occur in production environments, and
developers have good debug tools to fix them quickly.
Change the common boot code to work on TDX and non-TDX systems.
This should have no functional effect on non-TDX systems.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-24-kirill.shutemov@linux.intel.com
Secondary CPU startup is currently performed with something called
the "INIT/SIPI protocol". This protocol requires assistance from
VMMs to boot guests. As should be a familiar story by now, that
support can not be provded to TDX guests because TDX VMMs are
not trusted by guests.
To remedy this situation a new[1] "Multiprocessor Wakeup Structure"
has been added to to an existing ACPI table (MADT). This structure
provides the physical address of a "mailbox". A write to the mailbox
then steers the secondary CPU to the boot code.
Add ACPI MADT wake structure parsing support and wake support. Use
this support to wake CPUs whenever it is present instead of INIT/SIPI.
While this structure can theoretically be used on 32-bit kernels,
there are no 32-bit TDX guest kernels. It has not been tested and
can not practically *be* tested on 32-bit. Make it 64-bit only.
1. Details about the new structure can be found in ACPI v6.4, in the
"Multiprocessor Wakeup Structure" section.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-22-kirill.shutemov@linux.intel.com
Historically, x86 platforms have booted secondary processors (APs)
using INIT followed by the start up IPI (SIPI) messages. In regular
VMs, this boot sequence is supported by the VMM emulation. But such a
wakeup model is fatal for secure VMs like TDX in which VMM is an
untrusted entity. To address this issue, a new wakeup model was added
in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
the APs. More details about this wakeup model can be found in ACPI
specification v6.4, the section titled "Multiprocessor Wakeup Structure".
Since the existing trampoline code requires processors to boot in real
mode with 16-bit addressing, it will not work for this wakeup model
(because it boots the AP in 64-bit mode). To handle it, extend the
trampoline code to support 64-bit mode firmware handoff. Also, extend
IDT and GDT pointers to support 64-bit mode hand off.
There is no TDX-specific detection for this new boot method. The kernel
will rely on it as the sole boot method whenever the new ACPI structure
is present.
The ACPI table parser for the MADT multiprocessor wake up structure and
the wakeup method that uses this structure will be added by the following
patch in this series.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-21-kirill.shutemov@linux.intel.com
TDX guests cannot do port I/O directly. The TDX module triggers a #VE
exception to let the guest kernel emulate port I/O by converting them
into TDCALLs to call the host.
But before IDT handlers are set up, port I/O cannot be emulated using
normal kernel #VE handlers. To support the #VE-based emulation during
this boot window, add a minimal early #VE handler support in early
exception handlers. This is similar to what AMD SEV does. This is
mainly to support earlyprintk's serial driver, as well as potentially
the VGA driver.
The early handler only supports I/O-related #VE exceptions. Unhandled or
failed exceptions will be handled via early_fixup_exceptions() (like
normal exception failures). At runtime I/O-related #VE exceptions (along
with other types) handled by virt_exception_kernel().
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-19-kirill.shutemov@linux.intel.com
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).
To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].
In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.
To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.
Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:
* Specific instructions (WBINVD, for example)
* Specific MSR accesses
* Specific CPUID leaf accesses
* Access to specific guest physical addresses
Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.
For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.
Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.
During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.
TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag. This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.
Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.
For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
TDX brings a new exception -- Virtualization Exception (#VE). Handling
of #VE structurally very similar to handling #GP.
Extract two helpers from exc_general_protection() that can be reused for
handling #VE.
No functional changes.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-7-kirill.shutemov@linux.intel.com
Guests communicate with VMMs with hypercalls. Historically, these
are implemented using instructions that are known to cause VMEXITs
like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer
expose the guest state to the host. This prevents the old hypercall
mechanisms from working. So, to communicate with VMM, TDX
specification defines a new instruction called TDCALL.
In a TDX based VM, since the VMM is an untrusted entity, an intermediary
layer -- TDX module -- facilitates secure communication between the host
and the guest. TDX module is loaded like a firmware into a special CPU
mode called SEAM. TDX guests communicate with the TDX module using the
TDCALL instruction.
A guest uses TDCALL to communicate with both the TDX module and VMM.
The value of the RAX register when executing the TDCALL instruction is
used to determine the TDCALL type. A leaf of TDCALL used to communicate
with the VMM is called TDVMCALL.
Add generic interfaces to communicate with the TDX module and VMM
(using the TDCALL instruction).
__tdx_module_call() - Used to communicate with the TDX module (via
TDCALL instruction).
__tdx_hypercall() - Used by the guest to request services from
the VMM (via TDVMCALL leaf of TDCALL).
Also define an additional wrapper _tdx_hypercall(), which adds error
handling support for the TDCALL failure.
The __tdx_module_call() and __tdx_hypercall() helper functions are
implemented in assembly in a .S file. The TDCALL ABI requires
shuffling arguments in and out of registers, which proved to be
awkward with inline assembly.
Just like syscalls, not all TDVMCALL use cases need to use the same
number of argument registers. The implementation here picks the current
worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer
than 4 arguments, there will end up being a few superfluous (cheap)
instructions. But, this approach maximizes code reuse.
For registers used by the TDCALL instruction, please check TDX GHCI
specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL
Interface".
Based on previous patch by Sean Christopherson.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-4-kirill.shutemov@linux.intel.com
Secure Arbitration Mode (SEAM) is an extension of VMX architecture. It
defines a new VMX root operation (SEAM VMX root) and a new VMX non-root
operation (SEAM VMX non-root) which are both isolated from the legacy
VMX operation where the host kernel runs.
A CPU-attested software module (called 'TDX module') runs in SEAM VMX
root to manage and protect VMs running in SEAM VMX non-root. SEAM VMX
root is also used to host another CPU-attested software module (called
'P-SEAMLDR') to load and update the TDX module.
Host kernel transits to either P-SEAMLDR or TDX module via the new
SEAMCALL instruction, which is essentially a VMExit from VMX root mode
to SEAM VMX root mode. SEAMCALLs are leaf functions defined by
P-SEAMLDR and TDX module around the new SEAMCALL instruction.
A guest kernel can also communicate with TDX module via TDCALL
instruction.
TDCALLs and SEAMCALLs use an ABI different from the x86-64 system-v ABI.
RAX is used to carry both the SEAMCALL leaf function number (input) and
the completion status (output). Additional GPRs (RCX, RDX, R8-R11) may
be further used as both input and output operands in individual leaf.
TDCALL and SEAMCALL share the same ABI and require the largely same
code to pass down arguments and retrieve results.
Define an assembly macro that can be used to implement C wrapper for
both TDCALL and SEAMCALL.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-3-kirill.shutemov@linux.intel.com
In preparation of extending cc_platform_has() API to support TDX guest,
use CPUID instruction to detect support for TDX guests in the early
boot code (via tdx_early_init()). Since copy_bootdata() is the first
user of cc_platform_has() API, detect the TDX guest status before it.
Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this
bit in a valid TDX guest platform.
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-2-kirill.shutemov@linux.intel.com
Show value of gap end in the kernel log which equates to number of physical
address bits used by system.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220406195149.228164-4-steve.wahl@hpe.com
The UV5 platform synchronizes the TSCs among all chassis, and will not
proceed to OS boot without achieving synchronization. Previous UV
platforms provided a register indicating successful synchronization.
This is no longer available on UV5. On this platform TSC_ADJUST
should not be reset by the kernel.
Signed-off-by: Mike Travis <mike.travis@hpe.com>
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220406195149.228164-3-steve.wahl@hpe.com
Version 2 of the GHCB specification provides a Non Automatic Exit (NAE)
event type that can be used by the SEV-SNP guest to communicate with the
PSP without risk from a malicious hypervisor who wishes to read, alter,
drop or replay the messages sent.
SNP_LAUNCH_UPDATE can insert two special pages into the guest’s memory:
the secrets page and the CPUID page. The PSP firmware populates the
contents of the secrets page. The secrets page contains encryption keys
used by the guest to interact with the firmware. Because the secrets
page is encrypted with the guest’s memory encryption key, the hypervisor
cannot read the keys. See SEV-SNP firmware spec for further details on
the secrets page format.
Create a platform device that the SEV-SNP guest driver can bind to get
the platform resources such as encryption key and message id to use to
communicate with the PSP. The SEV-SNP guest driver provides a userspace
interface to get the attestation report, key derivation, extended
attestation report etc.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-43-brijesh.singh@amd.com
Version 2 of GHCB specification provides SNP_GUEST_REQUEST and
SNP_EXT_GUEST_REQUEST NAE that can be used by the SNP guest to
communicate with the PSP.
While at it, add a snp_issue_guest_request() helper that will be used by
driver or other subsystem to issue the request to PSP.
See SEV-SNP firmware and GHCB spec for more details.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-42-brijesh.singh@amd.com
For debugging purposes it is very useful to have a way to see the full
contents of the SNP CPUID table provided to a guest. Add an sev=debug
kernel command-line option to do so.
Also introduce some infrastructure so that additional options can be
specified via sev=option1[,option2] over time in a consistent manner.
[ bp: Massage, simplify string parsing. ]
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-41-brijesh.singh@amd.com
SEV-SNP guests will be provided the location of special 'secrets' and
'CPUID' pages via the Confidential Computing blob. This blob is
provided to the run-time kernel either through a boot_params field that
was initialized by the boot/compressed kernel, or via a setup_data
structure as defined by the Linux Boot Protocol.
Locate the Confidential Computing blob from these sources and, if found,
use the provided CPUID page/table address to create a copy that the
run-time kernel will use when servicing CPUID instructions via a #VC
handler.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-40-brijesh.singh@amd.com
Initial/preliminary detection of SEV-SNP is done via the Confidential
Computing blob. Check for it prior to the normal SEV/SME feature
initialization, and add some sanity checks to confirm it agrees with
SEV-SNP CPUID/MSR bits.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-39-brijesh.singh@amd.com
CPUID instructions generate a #VC exception for SEV-ES/SEV-SNP guests,
for which early handlers are currently set up to handle. In the case
of SEV-SNP, guests can use a configurable location in guest memory
that has been pre-populated with a firmware-validated CPUID table to
look up the relevant CPUID values rather than requesting them from
hypervisor via a VMGEXIT. Add the various hooks in the #VC handlers to
allow CPUID instructions to be handled via the table. The code to
actually configure/enable the table will be added in a subsequent
commit.
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-33-brijesh.singh@amd.com
This code will also be used later for SEV-SNP-validated CPUID code in
some cases, so move it to a common helper.
While here, also add a check to terminate in cases where the CPUID
function/subfunction is indexed and the subfunction is non-zero, since
the GHCB MSR protocol does not support non-zero subfunctions.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-32-brijesh.singh@amd.com
Due to
103a4908ad ("x86/head/64: Disable stack protection for head$(BITS).o")
kernel/head{32,64}.c are compiled with -fno-stack-protector to allow
a call to set_bringup_idt_handler(), which would otherwise have stack
protection enabled with CONFIG_STACKPROTECTOR_STRONG.
While sufficient for that case, there may still be issues with calls to
any external functions that were compiled with stack protection enabled
that in-turn make stack-protected calls, or if the exception handlers
set up by set_bringup_idt_handler() make calls to stack-protected
functions.
Subsequent patches for SEV-SNP CPUID validation support will introduce
both such cases. Attempting to disable stack protection for everything
in scope to address that is prohibitive since much of the code, like the
SEV-ES #VC handler, is shared code that remains in use after boot and
could benefit from having stack protection enabled. Attempting to inline
calls is brittle and can quickly balloon out to library/helper code
where that's not really an option.
Instead, re-enable stack protection for head32.c/head64.c, and make the
appropriate changes to ensure the segment used for the stack canary is
initialized in advance of any stack-protected C calls.
For head64.c:
- The BSP will enter from startup_64() and call into C code
(startup_64_setup_env()) shortly after setting up the stack, which
may result in calls to stack-protected code. Set up %gs early to allow
for this safely.
- APs will enter from secondary_startup_64*(), and %gs will be set up
soon after. There is one call to C code prior to %gs being setup
(__startup_secondary_64()), but it is only to fetch 'sme_me_mask'
global, so just load 'sme_me_mask' directly instead, and remove the
now-unused __startup_secondary_64() function.
For head32.c:
- BSPs/APs will set %fs to __BOOT_DS prior to any C calls. In recent
kernels, the compiler is configured to access the stack canary at
%fs:__stack_chk_guard [1], which overlaps with the initial per-cpu
'__stack_chk_guard' variable in the initial/"master" .data..percpu
area. This is sufficient to allow access to the canary for use
during initial startup, so no changes are needed there.
[1] 3fb0fdb3bb ("x86/stackprotector/32: Make the canary into a regular percpu variable")
[ bp: Massage commit message. ]
Suggested-by: Joerg Roedel <jroedel@suse.de> #for 64-bit %gs set up
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-24-brijesh.singh@amd.com
To provide a more secure way to start APs under SEV-SNP, use the SEV-SNP
AP Creation NAE event. This allows for guest control over the AP register
state rather than trusting the hypervisor with the SEV-ES Jump Table
address.
During native_smp_prepare_cpus(), invoke an SEV-SNP function that, if
SEV-SNP is active, will set/override apic->wakeup_secondary_cpu. This
will allow the SEV-SNP AP Creation NAE event method to be used to boot
the APs. As a result of installing the override when SEV-SNP is active,
this method of starting the APs becomes the required method. The override
function will fail to start the AP if the hypervisor does not have
support for AP creation.
[ bp: Work in forgotten review comments. ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-23-brijesh.singh@amd.com
Add the needed functionality to change pages state from shared
to private and vice-versa using the Page State Change VMGEXIT as
documented in the GHCB spec.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-22-brijesh.singh@amd.com
probe_roms() accesses the memory range (0xc0000 - 0x10000) to probe
various ROMs. The memory range is not part of the E820 system RAM range.
The memory range is mapped as private (i.e encrypted) in the page table.
When SEV-SNP is active, all the private memory must be validated before
accessing. The ROM range was not part of E820 map, so the guest BIOS
did not validate it. An access to invalidated memory will cause a
exception yet, so validate the ROM memory regions before it is accessed.
[ bp: Massage commit message. ]
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-21-brijesh.singh@amd.com