Currently, all valid MCA_ADDR values are assumed to be usable on AMD
systems. However, this is not correct in most cases. Notifiers expecting
usable addresses may then operate on inappropriate values.
Define a helper function to do AMD-specific checks for a usable memory
address. List out all known cases.
[ bp: Tone down the capitalized words. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-3-yazen.ghannam@amd.com
Define helper functions for legacy and SMCA systems in order to reuse
individual checks in later changes.
Describe what each function is checking for, and correct the XEC bitmask
for SMCA.
No functional change intended.
[ bp: Use "else in amd_mce_is_memory_error() to make the conditional
balanced, for readability. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613141142.36801-2-yazen.ghannam@amd.com
This reverts commit 45e34c8af5, and the
two subsequent fixes to it:
3f874c9b2a ("x86/smp: Don't send INIT to non-present and non-booted CPUs")
b1472a60a5 ("x86/smp: Don't send INIT to boot CPU")
because it seems to result in hung machines at shutdown. Particularly
some Dell machines, but Thomas says
"The rest seems to be Lenovo and Sony with Alderlake/Raptorlake CPUs -
at least that's what I could figure out from the various bug reports.
I don't know which CPUs the DELL machines have, so I can't say it's a
pattern.
I agree with the revert for now"
Ashok Raj chimes in:
"There was a report (probably this same one), and it turns out it was a
bug in the BIOS SMI handler.
The client BIOS's were waiting for the lowest APICID to be the SMI
rendevous master. If this is MeteorLake, the BSP wasn't the one with
the lowest APIC and it triped here.
The BIOS change is also being pushed to others for assimilation :)
Server BIOS's had this correctly for a while now"
and it does look likely to be some bad interaction between SMI and the
non-BSP cores having put into INIT (and thus unresponsive until reset).
Link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
Link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
Link: https://forum.artixlinux.org/index.php/topic,5997.0.html
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2241279
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
prototypes between architectures.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUrobMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ggUw/7BGHe360tsrsMAcOHcwvvGhnQ0UKuoqLs
IJl3dfsdX+JnL7cpbNcRBVDqgH2seIwdQFa73gALColcxntEBbnC/gVS4QLLSxSp
HIq5C1OELT/jPMOjc6aimJx/qPvW/CLgo2WJx78rv0ykkf1RJIzqCTVKf8VQX6Vu
t0/9jEhBNuL8DZthJ5ZV448WtlJcdnWXVGxq/UHEheV219Rjtp3NGf8s/K+WMzF3
x9Zhmb+/UPgjhaZtrQDP2mf7ZYgmVhLvJTRSQdQNrcDe/ZaNrCrEGOwHuOpQ0vXw
v42rd1AVGV/xgIhfBOABLdb5snBbQMDvYLcma04bkBd6H6WPFJZ1PvnGovTagxUO
FP4117VBA5ARwZemxwGEPJkNF9lVEPSBVDv7bx2OO0zVCViuAbKJXxGzCW/GiSGA
BRk5FogxJ7TcjWsYWWaZfYlq8RFI5UI3K/IEQIUpQKtC9OMhScdp532xEP48dc8u
pHjPjVoYCXtoD/ZD73ZJJduN6Hn30HEE/IJ6+YKJRo4EZquUrGtbgG46QbFBqxqW
xIPPTx7OPAaAfq020+c6BuUnta1iEY6I4De/+XbRdQf+AIWqfLsWNI8en8aO4t5p
rtlrYD2Si1F0KBWtDWR7JhCm8CD2klWVWrD9d4DLpz9ljHXKa9d7BYp3BxkvgcQm
x8f1D9yC9X4=
=20Na
-----END PGP SIGNATURE-----
Merge tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
"Fix a Longsoon build warning by harmonizing the
arch_[un]register_cpu() prototypes between architectures"
* tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu-hotplug: Provide prototypes for arch CPU registration
Fei has reported that KASAN triggers during apply_alternatives() on
a 5-level paging machine:
BUG: KASAN: out-of-bounds in rcu_is_watching()
Read of size 4 at addr ff110003ee6419a0 by task swapper/0/0
...
__asan_load4()
rcu_is_watching()
trace_hardirqs_on()
text_poke_early()
apply_alternatives()
...
On machines with 5-level paging, cpu_feature_enabled(X86_FEATURE_LA57)
gets patched. It includes KASAN code, where KASAN_SHADOW_START depends on
__VIRTUAL_MASK_SHIFT, which is defined with cpu_feature_enabled().
KASAN gets confused when apply_alternatives() patches the
KASAN_SHADOW_START users. A test patch that makes KASAN_SHADOW_START
static, by replacing __VIRTUAL_MASK_SHIFT with 56, works around the issue.
Fix it for real by disabling KASAN while the kernel is patching alternatives.
[ mingo: updated the changelog ]
Fixes: 6657fca06e ("x86/mm: Allow to boot without LA57 if CONFIG_X86_5LEVEL=y")
Reported-by: Fei Yang <fei.yang@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231012100424.1456-1-kirill.shutemov@linux.intel.com
Mask off xfeatures that aren't exposed to the guest only when saving guest
state via KVM_GET_XSAVE{2} instead of modifying user_xfeatures directly.
Preserving the maximal set of xfeatures in user_xfeatures restores KVM's
ABI for KVM_SET_XSAVE, which prior to commit ad856280dd ("x86/kvm/fpu:
Limit guest user_xfeatures to supported bits of XCR0") allowed userspace
to load xfeatures that are supported by the host, irrespective of what
xfeatures are exposed to the guest.
There is no known use case where userspace *intentionally* loads xfeatures
that aren't exposed to the guest, but the bug fixed by commit ad856280dd
was specifically that KVM_GET_SAVE{2} would save xfeatures that weren't
exposed to the guest, e.g. would lead to userspace unintentionally loading
guest-unsupported xfeatures when live migrating a VM.
Restricting KVM_SET_XSAVE to guest-supported xfeatures is especially
problematic for QEMU-based setups, as QEMU has a bug where instead of
terminating the VM if KVM_SET_XSAVE fails, QEMU instead simply stops
loading guest state, i.e. resumes the guest after live migration with
incomplete guest state, and ultimately results in guest data corruption.
Note, letting userspace restore all host-supported xfeatures does not fix
setups where a VM is migrated from a host *without* commit ad856280dd,
to a target with a subset of host-supported xfeatures. However there is
no way to safely address that scenario, e.g. KVM could silently drop the
unsupported features, but that would be a clear violation of KVM's ABI and
so would require userspace to opt-in, at which point userspace could
simply be updated to sanitize the to-be-loaded XSAVE state.
Reported-by: Tyler Stachecki <stachecki.tyler@gmail.com>
Closes: https://lore.kernel.org/all/20230914010003.358162-1-tstachecki@bloomberg.net
Fixes: ad856280dd ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Message-Id: <20230928001956.924301-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Plumb an xfeatures mask into __copy_xstate_to_uabi_buf() so that KVM can
constrain which xfeatures are saved into the userspace buffer without
having to modify the user_xfeatures field in KVM's guest_fpu state.
KVM's ABI for KVM_GET_XSAVE{2} is that features that are not exposed to
guest must not show up in the effective xstate_bv field of the buffer.
Saving only the guest-supported xfeatures allows userspace to load the
saved state on a different host with a fewer xfeatures, so long as the
target host supports the xfeatures that are exposed to the guest.
KVM currently sets user_xfeatures directly to restrict KVM_GET_XSAVE{2} to
the set of guest-supported xfeatures, but doing so broke KVM's historical
ABI for KVM_SET_XSAVE, which allows userspace to load any xfeatures that
are supported by the *host*.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928001956.924301-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain
for the package mask. :-)
Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the
name less ambiguous.
[ Shrikanth Hegde: rename on s390 as well. ]
[ Valentin Schneider: also rename it in the comments. ]
[ mingo: port to recent kernels & find all remaining occurances. ]
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net
The ->idt_seq and ->recv_jiffies variables added by:
1a3ea611fc ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()")
... place the exit-time check of the bottom bit of ->idt_seq after the
this_cpu_dec_return() that re-enables NMI nesting. This can result in
the following sequence of events on a given CPU in kernels built with
CONFIG_NMI_CHECK_CPU=y:
o An NMI arrives, and ->idt_seq is incremented to an odd number.
In addition, nmi_state is set to NMI_EXECUTING==1.
o The NMI is processed.
o The this_cpu_dec_return(nmi_state) zeroes nmi_state and returns
NMI_EXECUTING==1, thus opting out of the "goto nmi_restart".
o Another NMI arrives and ->idt_seq is incremented to an even
number, triggering the warning. But all is just fine, at least
assuming we don't get so many closely spaced NMIs that the stack
overflows or some such.
Experience on the fleet indicates that the MTBF of this false positive
is about 70 years. Or, for those who are not quite that patient, the
MTBF appears to be about one per week per 4,000 systems.
Fix this false-positive warning by moving the "nmi_restart" label before
the initial ->idt_seq increment/check and moving the this_cpu_dec_return()
to follow the final ->idt_seq increment/check. This way, all nested NMIs
that get past the NMI_NOT_RUNNING check get a clean ->idt_seq slate.
And if they don't get past that check, they will set nmi_state to
NMI_LATCHED, which will cause the this_cpu_dec_return(nmi_state)
to restart.
Fixes: 1a3ea611fc ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/0cbff831-6e3d-431c-9830-ee65ee7787ff@paulmck-laptop
When compiling the x86 kernel, if X86_LOCAL_APIC is not enabled but
GENERIC_MSI_IRQ is selected in '.config', the following compilation
error will occur:
include/linux/gpio/driver.h:38:19: error:
field 'msiinfo' has incomplete type
kernel/irq/msi.c:752:5: error: invalid use of incomplete typedef
'msi_alloc_info_t' {aka 'struct irq_alloc_info'}
kernel/irq/msi.c:740:1: error: control reaches end of non-void function
This is because file such as 'kernel/irq/msi.c' only depends on
'GENERIC_MSI_IRQ', and uses 'struct msi_alloc_info_t'. However,
this struct depends on 'X86_LOCAL_APIC'.
When enable 'GENERIC_MSI_IRQ' or 'X86_LOCAL_APIC' will select
'IRQ_DOMAIN_HIERARCHY', so exposing this struct using
'IRQ_DOMAIN_HIERARCHY' rather than 'X86_LOCAL_APIC'.
Under the above conditions, if 'HPET_TIMER' is selected, the following
compilation error will occur:
arch/x86/kernel/hpet.c:550:13: error: ‘x86_vector_domain’ undeclared
arch/x86/kernel/hpet.c:600:9: error: implicit declaration of
function ‘init_irq_alloc_info’
This is because 'x86_vector_domain' is defined in 'kernel/apic/vector.c'
which is compiled only when 'X86_LOCAL_APIC' is enabled. Besides,
function 'msi_domain_set_affinity' is defined in 'include/linux/msi.h'
which depends on 'GENERIC_MSI_IRQ'. So use 'X86_LOCAL_APIC' and
'GENERIC_MSI_IRQ' to expose these code.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Lu Yao <yaolu@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231012032659.323251-1-yaolu@kylinos.cn
Add the interface in resctrl FS to show if sparse cache allocation
bit masks are supported on the platform. Reading the file returns
either a "1" if non-contiguous 1s are supported and "0" otherwise.
The file path is /sys/fs/resctrl/info/{resource}/sparse_masks, where
{resource} can be either "L2" or "L3".
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Peter Newman <peternewman@google.com>
Link: https://lore.kernel.org/r/7300535160beba41fd8aa073749ec1ee29b4621f.1696934091.git.maciej.wieczor-retman@intel.com
The setting for non-contiguous 1s support in Intel CAT is
hardcoded to false. On these systems, writing non-contiguous
1s into the schemata file will fail before resctrl passes
the value to the hardware.
In Intel CAT CPUID.0x10.1:ECX[3] and CPUID.0x10.2:ECX[3] stopped
being reserved and now carry information about non-contiguous 1s
value support for L3 and L2 cache respectively. The CAT
capacity bitmask (CBM) supports a non-contiguous 1s value if
the bit is set.
The exception are Haswell systems where non-contiguous 1s value
support needs to stay disabled since they can't make use of CPUID
for Cache allocation.
Originally-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Peter Newman <peternewman@google.com>
Link: https://lore.kernel.org/r/1849b487256fe4de40b30f88450cba3d9abc9171.1696934091.git.maciej.wieczor-retman@intel.com
Provide common prototypes for arch_register_cpu() and
arch_unregister_cpu(). These are called by acpi_processor.c, with weak
versions, so the prototype for this is already set. It is generally not
necessary for function prototypes to be conditional on preprocessor macros.
Some architectures (e.g. Loongarch) are missing the prototype for this, and
rather than add it to Loongarch's asm/cpu.h, do the job once for everyone.
Since this covers everyone, remove the now unnecessary prototypes in
asm/cpu.h, and therefore remove the 'static' from one of ia64's
arch_register_cpu() definitions.
[ tglx: Bring back the ia64 part and remove the ACPI prototypes ]
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1qkoRr-0088Q8-Da@rmk-PC.armlinux.org.uk
Fix erratum #1485 on Zen4 parts where running with STIBP disabled can
cause an #UD exception. The performance impact of the fix is negligible.
Reported-by: René Rebe <rene@exactcode.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: René Rebe <rene@exactcode.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/D99589F4-BC5D-430B-87B2-72C20370CF57@exactcode.com
Since commit:
4d96f91091 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()")
... the SWIOTLB bounce buffer size adjustment and restricted virtio memory
setting also inadvertently apply to TDX: the code is using
cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) as a gatekeeping condition,
which is also true for TDX, and this is also what we want.
To reflect this, move the corresponding code to generic mem_encrypt.c.
No functional changes intended.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20231010145220.3960055-2-alexander.shishkin@linux.intel.com
The kernel test robot reported kernel-doc warnings here:
arch/x86/kernel/cpu/resctrl/rdtgroup.c:915: warning: Function parameter or member 'of' not described in 'rdt_bit_usage_show'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:915: warning: Function parameter or member 'seq' not described in 'rdt_bit_usage_show'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:915: warning: Function parameter or member 'v' not described in 'rdt_bit_usage_show'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1144: warning: Function parameter or member 'type' not described in '__rdtgroup_cbm_overlaps'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1224: warning: Function parameter or member 'rdtgrp' not described in 'rdtgroup_mode_test_exclusive'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'of' not described in 'rdtgroup_mode_write'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'buf' not described in 'rdtgroup_mode_write'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'nbytes' not described in 'rdtgroup_mode_write'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'off' not described in 'rdtgroup_mode_write'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1370: warning: Function parameter or member 'of' not described in 'rdtgroup_size_show'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1370: warning: Function parameter or member 's' not described in 'rdtgroup_size_show'
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1370: warning: Function parameter or member 'v' not described in 'rdtgroup_size_show'
The first two functions are missing an argument description while the
other three are file callbacks and don't require a kernel-doc comment.
Closes: https://lore.kernel.org/oe-kbuild-all/202310070434.mD8eRNAz-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Newman <peternewman@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20231011064843.246592-1-maciej.wieczor-retman@intel.com
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
Remove sentinel element from sld_sysctl and itmt_kern_table. This
removal is safe because register_sysctl_init and register_sysctl
implicitly use the array size in addition to checking for the sentinel.
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Provide debug files which dump the topology related information of
cpuinfo_x86. This is useful to validate the upcoming conversion of the
topology evaluation for correctness or bug compatibility.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.353191313@linutronix.de
Per CPU cpuinfo is used to persist the logical package and die IDs. That's
really not the right place simply because cpuinfo is subject to be
reinitialized when a CPU goes through an offline/online cycle.
This works by chance today, but that's far from correct and neither obvious
nor documented.
Add a per cpu datastructure which persists those logical IDs, which allows
to cleanup the CPUID evaluation code.
This is a temporary workaround until the larger topology management is in
place, which makes all of this logical management mechanics obsolete.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.292947071@linutronix.de
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.
Make it all consistently use u32 because that reflects the hardware
register width.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.233274223@linutronix.de
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.
Make it all consistently use u32 because that reflects the hardware
register width.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.172569282@linutronix.de
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.
Make it all consistently use u32 because that reflects the hardware
register width even if that callback going to be removed soonish.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.113097126@linutronix.de
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.
Make it all consistently use u32 because that reflects the hardware
register width and fixup a few related usage sites for consistency sake.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.054064391@linutronix.de
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.
Make it all consistently use u32 because that reflects the hardware
register width and move the default implementation to local.h as there are
no users outside the apic directory.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.981956102@linutronix.de
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.
Make it all consistently use u32 because that reflects the hardware
register width and fixup the most obvious usage sites of that.
The APIC callbacks will be addressed separately.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.922905727@linutronix.de
APIC ID checks compare with BAD_APICID all over the place, but some
initializers and some code which fiddles with global data structure use
-1[U] instead. That simply cannot work at all.
Fix it up and use BAD_APICID consistently all over the place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.862835121@linutronix.de
The topology IDs which identify the LLC and L2 domains clearly belong to
the per CPU topology information.
Move them into cpuinfo_x86::cpuinfo_topo and get rid of the extra per CPU
data and the related exports.
This also paves the way to do proper topology evaluation during early boot
because it removes the only per CPU dependency for that.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.803864641@linutronix.de
Yet another topology related data pair. Rename logical_proc_id to
logical_pkg_id so it fits the common naming conventions.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.745139505@linutronix.de
cpuinfo_x86::x86_coreid_bits is only used by the AMD numa topology code. No
point in evaluating it on non AMD systems.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.687588373@linutronix.de
Rename it to core_id and stick it to the other ID fields.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.566519388@linutronix.de
Rename it to pkg_id which is the terminology used in the kernel.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.329006989@linutronix.de
The topology related information is randomly scattered across cpuinfo_x86.
Create a new structure cpuinfo_topo and move in a first step initial_apicid
and apicid into it.
Aside of being better readable this is in preparation for replacing the
horribly fragile CPU topology evaluation code further down the road.
Consolidate APIC ID fields to u32 as that represents the hardware type.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.269787744@linutronix.de
The SMT control mechanism got added as speculation attack vector
mitigation. The implemented logic relies on the primary thread mask to
be set up properly.
This turns out to be an issue with XEN/PV guests because their CPU hotplug
mechanics do not enumerate APICs and therefore the mask is never correctly
populated.
This went unnoticed so far because by chance XEN/PV ends up with
smp_num_siblings == 2. So cpu_smt_control stays at its default value
CPU_SMT_ENABLED and the primary thread mask is never evaluated in the
context of CPU hotplug.
This stopped "working" with the upcoming overhaul of the topology
evaluation which legitimately provides a fake topology for XEN/PV. That
sets smp_num_siblings to 1, which causes the core CPU hot-plug core to
refuse to bring up the APs.
This happens because cpu_smt_control is set to CPU_SMT_NOT_SUPPORTED which
causes cpu_bootable() to evaluate the unpopulated primary thread mask with
the conclusion that all non-boot CPUs are not valid to be plugged.
The core code has already been made more robust against this kind of fail,
but the primary thread mask really wants to be populated to avoid other
issues all over the place.
Just fake the mask by pretending that all XEN/PV vCPUs are primary threads,
which is consistent because all of XEN/PVs topology is fake or non-existent.
Fixes: 6a4d2657e0 ("x86/smp: Provide topology_is_primary_thread()")
Fixes: f54d4434c2 ("x86/apic: Provide cpu_primary_thread mask")
Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.210011520@linutronix.de
Hygon processors with a model ID > 3 have CPUID leaf 0xB correctly
populated and don't need the fixed package ID shift workaround. The fixup
is also incorrect when running in a guest.
Fixes: e0ceeae708 ("x86/CPU/hygon: Fix phys_proc_id calculation logic for multi-die processors")
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/tencent_594804A808BD93A4EBF50A994F228E3A7F07@qq.com
Link: https://lore.kernel.org/r/20230814085112.089607918@linutronix.de
Check the IO permission bitmap (if present) before emulating IOIO #VC
exceptions for user-space. These permissions are checked by hardware
already before the #VC is raised, but due to the VC-handler decoding
race it needs to be checked again in software.
Fixes: 25189d08e5 ("x86/sev-es: Add support for handling IOIO exceptions")
Reported-by: Tom Dohrmann <erbse.13@gmx.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Dohrmann <erbse.13@gmx.de>
Cc: <stable@kernel.org>
A virt scenario can be constructed where MMIO memory can be user memory.
When that happens, a race condition opens between when the hardware
raises the #VC and when the #VC handler gets to emulate the instruction.
If the MOVS is replaced with a MOVS accessing kernel memory in that
small race window, then write to kernel memory happens as the access
checks are not done at emulation time.
Disable MMIO emulation in user mode temporarily until a sensible use
case appears and justifies properly handling the race window.
Fixes: 0118b604c2 ("x86/sev-es: Handle MMIO String Instructions")
Reported-by: Tom Dohrmann <erbse.13@gmx.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Dohrmann <erbse.13@gmx.de>
Cc: <stable@kernel.org>
The kernel test robot reported kernel-doc warnings here:
monitor.c:34: warning: Cannot understand * @rmid_free_lru A least recently used list of free RMIDs on line 34 - I thought it was a doc line
monitor.c:41: warning: Cannot understand * @rmid_limbo_count count of currently unused but (potentially) on line 41 - I thought it was a doc line
monitor.c:50: warning: Cannot understand * @rmid_entry - The entry in the limbo and free lists. on line 50 - I thought it was a doc line
We don't have a syntax for documenting individual data items via
kernel-doc, so remove the "/**" kernel-doc markers and add a hyphen
for consistency.
Fixes: 6a445edce6 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
Fixes: 24247aeeab ("x86/intel_rdt/cqm: Improve limbo list processing")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231006235132.16227-1-rdunlap@infradead.org
Commit bf5835bcdb ("intel_idle: Disable IBRS during long idle")
disables IBRS when the CPU enters long idle. However, when a CPU
becomes offline, the IBRS bit is still set when X86_FEATURE_KERNEL_IBRS
is enabled. That will impact the performance of a sibling CPU. Mitigate
this performance impact by clearing all the mitigation bits in SPEC_CTRL
MSR when offline. When the CPU is online again, it will be re-initialized
and so restoring the SPEC_CTRL value isn't needed.
Add a comment to say that native_play_dead() is a __noreturn function,
but it can't be marked as such to avoid confusion about the missing
MSR restoration code.
When DPDK is running on an isolated CPU thread processing network packets
in user space while its sibling thread is idle. The performance of the
busy DPDK thread with IBRS on and off in the sibling idle thread are:
IBRS on IBRS off
------- --------
packets/second: 7.8M 10.4M
avg tsc cycles/packet: 282.26 209.86
This is a 25% performance degradation. The test system is a Intel Xeon
4114 CPU @ 2.20GHz.
[ mingo: Extended the changelog with performance data from the 0/4 mail. ]
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20230727184600.26768-3-longman@redhat.com
rcu_report_dead() and rcutree_migrate_callbacks() have their headers in
rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug
functions.
Also rcu_cpu_starting() and rcu_report_dead() have different naming
conventions while they mirror each other's effects.
Fix the headers and propose a naming that relates both functions and
aligns with the prefix of other rcutree CPU-hotplug functions.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
With the help of newly changed function parse_crashkernel() and generic
reserve_crashkernel_generic(), crashkernel reservation can be simplified
by steps:
1) Add a new header file <asm/crash_core.h>, and define CRASH_ALIGN,
CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX and
DEFAULT_CRASH_KERNEL_LOW_SIZE in <asm/crash_core.h>;
2) Add arch_reserve_crashkernel() to call parse_crashkernel() and
reserve_crashkernel_generic(), and do the ARCH specific work if
needed.
3) Add ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION Kconfig in
arch/x86/Kconfig.
When adding DEFAULT_CRASH_KERNEL_LOW_SIZE, add crash_low_size_default() to
calculate crashkernel low memory because x86_64 has special requirement.
The old reserve_crashkernel_low() and reserve_crashkernel() can be
removed.
[bhe@redhat.com: move crash_low_size_default() code into <asm/crash_core.h>]
Link: https://lkml.kernel.org/r/ZQpeAjOmuMJBFw1/@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20230914033142.676708-7-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add two parameters 'low_size' and 'high' to function parse_crashkernel(),
later crashkernel=,high|low parsing will be added. Make adjustments in
all call sites of parse_crashkernel() in arch.
Link: https://lkml.kernel.org/r/20230914033142.676708-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The usage of '&' before the array parameter is redundant because '&array'
is equivalent to 'array'. Therefore, there is no need to include '&'
before the array parameter. In fact, using '&' can cause more confusion,
especially for individuals who are not familiar with the address-of
operation for arrays. They might mistakenly believe that one is different
from the other and spend additional time realizing that they are actually
the same.
Harmonizing the style by removing the unnecessary '&' would save time for
those individuals.
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/ZMt24BGEX9IhPSY6@fedora
The following commit:
ddb5cdbafa ("kbuild: generate KSYMTAB entries by modpost")
deprecated <asm/export.h>, which is now a wrapper of <linux/export.h>.
Use <linux/export.h> in *.S as well as in *.c files.
After all the <asm/export.h> lines are replaced, <asm/export.h> and
<asm-generic/export.h> will be removed.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230806145958.380314-2-masahiroy@kernel.org
Since the size value is added to the base address to yield the last valid
byte address of the GDT, the current size value of startup_gdt_descr is
incorrect (too large by one), fix it.
[ mingo: This probably never mattered, because startup_gdt[] is only used
in a very controlled fashion - but make it consistent nevertheless. ]
Fixes: 866b556efa ("x86/head/64: Install startup GDT")
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20230807084547.217390-1-ytcoode@gmail.com
c->x86_cache_alignment is initialized from c->x86_clflush_size.
However, commit fbf6449f84 moved c->x86_clflush_size initialization
to later in boot without moving the c->x86_cache_alignment assignment:
fbf6449f84 ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach")
This presumably left c->x86_cache_alignment set to zero for longer
than it should be.
The result was an oops on 32-bit kernels while accessing a pointer
at 0x20. The 0x20 came from accessing a structure member at offset
0x10 (buffer->cpumask) from a ZERO_SIZE_PTR=0x10. kmalloc() can
evidently return ZERO_SIZE_PTR when it's given 0 as its alignment
requirement.
Move the c->x86_cache_alignment initialization to be after
c->x86_clflush_size has an actual value.
Fixes: fbf6449f84 ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20231002220045.1014760-1-dave.hansen@linux.intel.com
Fetching the device tree configuration before initmem_init() is necessary
to allow the parsing of NUMA node information. However moving the entire
x86_dtb_init() call before initmem_init() is not correct as APIC/IO-APIC enumeration
has to be after initmem_init().
Thus, move the x86_flattree_get_config() call out of x86_dtb_init(),
into setup_arch(), to call it before initmem_init(), and
leave the ACPI/IOAPIC registration sequence as-is.
[ mingo: Updated the changelog for clarity. ]
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/1692949657-16446-1-git-send-email-ssengar@linux.microsoft.com
In snp_accept_memory(), the npages variables value is calculated from
phys_addr_t variables but is an unsigned int. A very large range passed
into snp_accept_memory() could lead to truncating npages to zero. This
doesn't happen at the moment but let's be prepared.
Fixes: 6c32117963 ("x86/sev: Add SNP-specific unaccepted memory support")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/6d511c25576494f682063c9fb6c705b526a3757e.1687441505.git.thomas.lendacky@amd.com
SNP retrieves the majority of CPUID information from the SNP CPUID page.
But there are times when that information needs to be supplemented by the
hypervisor, for example, obtaining the initial APIC ID of the vCPU from
leaf 1.
The current implementation uses the MSR protocol to retrieve the data from
the hypervisor, even when a GHCB exists. The problem arises when an NMI
arrives on return from the VMGEXIT. The NMI will be immediately serviced
and may generate a #VC requiring communication with the hypervisor.
Since a GHCB exists in this case, it will be used. As part of using the
GHCB, the #VC handler will write the GHCB physical address into the GHCB
MSR and the #VC will be handled.
When the NMI completes, processing resumes at the site of the VMGEXIT
which is expecting to read the GHCB MSR and find a CPUID MSR protocol
response. Since the NMI handling overwrote the GHCB MSR response, the
guest will see an invalid reply from the hypervisor and self-terminate.
Fix this problem by using the GHCB when it is available. Any NMI
received is properly handled because the GHCB contents are copied into
a backup page and restored on NMI exit, thus preserving the active GHCB
request or result.
[ bp: Touchups. ]
Fixes: ee0bfa08a3 ("x86/compressed/64: Add support for SEV-SNP CPUID table in #VC handlers")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/a5856fa1ebe3879de91a8f6298b6bbd901c61881.1690578565.git.thomas.lendacky@amd.com
The SGX EPC reclaimer (ksgxd) may reclaim the SECS EPC page for an
enclave and set secs.epc_page to NULL. The SECS page is used for EAUG
and ELDU in the SGX page fault handler. However, the NULL check for
secs.epc_page is only done for ELDU, not EAUG before being used.
Fix this by doing the same NULL check and reloading of the SECS page as
needed for both EAUG and ELDU.
The SECS page holds global enclave metadata. It can only be reclaimed
when there are no other enclave pages remaining. At that point,
virtually nothing can be done with the enclave until the SECS page is
paged back in.
An enclave can not run nor generate page faults without a resident SECS
page. But it is still possible for a #PF for a non-SECS page to race
with paging out the SECS page: when the last resident non-SECS page A
triggers a #PF in a non-resident page B, and then page A and the SECS
both are paged out before the #PF on B is handled.
Hitting this bug requires that race triggered with a #PF for EAUG.
Following is a trace when it happens.
BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:sgx_encl_eaug_page+0xc7/0x210
Call Trace:
? __kmem_cache_alloc_node+0x16a/0x440
? xa_load+0x6e/0xa0
sgx_vma_fault+0x119/0x230
__do_fault+0x36/0x140
do_fault+0x12f/0x400
__handle_mm_fault+0x728/0x1110
handle_mm_fault+0x105/0x310
do_user_addr_fault+0x1ee/0x750
? __this_cpu_preempt_check+0x13/0x20
exc_page_fault+0x76/0x180
asm_exc_page_fault+0x27/0x30
Fixes: 5a90d2c3f5 ("x86/sgx: Support adding of pages to an initialized enclave")
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20230728051024.33063-1-haitao.huang%40linux.intel.com
Instead of setting x86_virt_bits to a possibly-correct value and then
correcting it later, do all the necessary checks before setting it.
At this point, the #VC handler references boot_cpu_data.x86_virt_bits,
and in the previous version, it would be triggered by the CPUIDs between
the point at which it is set to 48 and when it is set to the correct
value.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Adam Dunlap <acdunlap@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jacob Xu <jacobhxu@google.com>
Link: https://lore.kernel.org/r/20230912002703.3924521-3-acdunlap@google.com
Add new Root, Device 18h Function 3, and Function 4 PCI IDS
for AMD F19h Model 90h-9fh (MI300A).
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Suma Hegde <suma.hegde@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20230926051932.193239-1-suma.hegde@amd.com
Set CR4.PSE in secondary_startup_64: the Intel SDM is clear that it does
not matter whether it's 0 or 1 when 4-level-pts are enabled, but it's
distracting to find CR4 different on BSP and auxiliaries - on x86_64,
BSP alone got to add the PSE bit, in probe_page_size_mask().
Peter Zijlstra adds:
"I think the point is that PSE bit is completely without
meaning in long mode.
But yes, having the same CR4 bits set across BSP and APs is
definitely sane."
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/103ad03a-8c93-c3e2-4226-f79af4d9a074@google.com
When compiled with W=1, the following warning is generated:
arch/x86/kernel/kgdb.c:698: warning: Cannot understand *
on line 698 - I thought it was a doc line
Remove the corresponding empty comment line to fix the warning.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/aad659537c1d4ebd86912a6f0be458676c8e69af.1695401178.git.christophe.jaillet@wanadoo.fr
- Fix a kexec bug,
- Fix an UML build bug,
- Fix a handful of SRSO related bugs,
- Fix a shadow stacks handling bug & robustify related code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUNbQIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jVIg/9EChW7qFTda8joR41Uayg07VIOpGirDLu
7hjzOnt4Ni93cfFNUBkKDKXoWxGAiOD+cRDnT6+zsJAvAZR26Y3UNVLYlAy+lFKK
9kRxeDM7nOEKqCC+zneinMFcIKmZRttMLpj8O901jB2S08x4UarnNx5SaWEcqYbn
rf1XIEuEvlxqMfafNueS/TRadV52qVU8Y+2+inkDnM7dDNwt+rCs5tQ9ebJos8QO
tsAoQes1G+0mjPrpqAgsIic5e3QCHliwVr8ASQrKZ9KR+fokEJRbSBNjgHUCNeVN
0bHBhuDJBSniC7QmAQGEizrZWmHl2HxwYYnCE0gd0g24b7sDTwWBFSXWratCrPdX
e4qYq36xonWdQcbpVF8ijMXF/R810vDyis/yc/BTeo5EBWf6aM/cx1dmS9DUxRpA
QsIW2amSCPVYwYE5MAC+K/DM9nq1tk8YnKi4Mob3L28+W3CmVwSwT9S86z2QLlZu
+KgVV4yBtJY1FJqVur5w3awhFtqLfBdfIvs6uQCd9VZXVPbBfS8+rOQmmhFixEDu
FSPlAChmXYTAM2R+UxcEvw1Zckrtd2BCOa8UrY2lq57mduBK1EymdpfjlrUAChLG
x7fQBOGNgOTLwYcsurIdS5jAqiudpnJ/KDG8ZAmKsVoW96JCPp9B3tVZMp9tT30C
8HRwSPX4384=
=58St
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix a kexec bug
- Fix an UML build bug
- Fix a handful of SRSO related bugs
- Fix a shadow stacks handling bug & robustify related code
* tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/shstk: Add warning for shadow stack double unmap
x86/shstk: Remove useless clone error handling
x86/shstk: Handle vfork clone failure correctly
x86/srso: Fix SBPB enablement for spec_rstack_overflow=off
x86/srso: Don't probe microcode in a guest
x86/srso: Set CPUID feature bits independently of bug or mitigation status
x86/srso: Fix srso_show_state() side effect
x86/asm: Fix build of UML with KASAN
x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()
Commit
7825451fa4 ("static_call: Add call depth tracking support")
failed to realize the problem fixed there is not specific to call depth
tracking but applies to all return-thunk uses.
Move the fix to the appropriate place and condition.
Fixes: ee88d363d1 ("x86,static_call: Use alternative RET encoding")
Reported-by: David Kaplan <David.Kaplan@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
The following commit
095b8303f3 ("x86/alternative: Make custom return thunk unconditional")
made '__x86_return_thunk' a placeholder value. All code setting
X86_FEATURE_RETHUNK also changes the value of 'x86_return_thunk'. So
the optimization at the beginning of apply_returns() is dead code.
Also, before the above-mentioned commit, the optimization actually had a
bug It bypassed __static_call_fixup(), causing some raw returns to
remain unpatched in static call trampolines. Thus the 'Fixes' tag.
Fixes: d2408e043e ("x86/alternative: Optimize returns patching")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/16d19d2249d4485d8380fb215ffaae81e6b8119e.1693889988.git.jpoimboe@kernel.org
When SVM is disabled by BIOS, one cannot use KVM but the
SVM feature is still shown in the output of /proc/cpuinfo.
On Intel machines, VMX is cleared by init_ia32_feat_ctl(),
so do the same on AMD and Hygon processors.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230921114940.957141-1-pbonzini@redhat.com
The 'mid' pointer is being initialized with a value that is never read,
it is being re-assigned and used inside a for-loop. Remove the
redundant initialization.
Cleans up clang scan build warning:
arch/x86/kernel/unwind_orc.c:88:7: warning: Value stored to 'mid' during its initialization is never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20230920114141.118919-1-colin.i.king@gmail.com
There are several ways a thread's shadow stacks can get unmapped. This
can happen on exit or exec, as well as error handling in exec or clone.
The task struct already keeps track of the thread's shadow stack. Use the
size variable to keep track of if the shadow stack has already been freed.
When an attempt to double unmap the thread shadow stack is caught, warn
about it and abort the operation.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lore.kernel.org/all/20230908203655.543765-4-rick.p.edgecombe%40intel.com
When clone fails after the shadow stack is allocated, any allocated shadow
stack is cleaned up in exit_thread() in copy_process(). So the logic in
copy_thread() is unneeded, and also will not handle failures that happen
outside of copy_thread().
In addition, since there is a second attempt to unmap the same shadow
stack, there is a race where an newly mapped region could get unmapped.
So remove the logic in copy_thread() and rely on exit_thread() to handle
clone failure.
Fixes: b2926a36b9 ("x86/shstk: Handle thread shadow stack")
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lore.kernel.org/all/20230908203655.543765-3-rick.p.edgecombe%40intel.com
Shadow stacks are allocated automatically and freed on exit, depending
on the clone flags. The two cases where new shadow stacks are not
allocated are !CLONE_VM (fork()) and CLONE_VFORK (vfork()). For
!CLONE_VM, although a new stack is not allocated, it can be freed normally
because it will happen in the child's copy of the VM.
However, for CLONE_VFORK the parent and the child are actually using the
same shadow stack. So the kernel doesn't need to allocate *or* free a
shadow stack for a CLONE_VFORK child. CLONE_VFORK children already need
special tracking to avoid returning to userspace until the child exits or
execs. Shadow stack uses this same tracking to avoid freeing CLONE_VFORK
shadow stacks.
However, the tracking is not setup until the clone has succeeded
(internally). Which means, if a CLONE_VFORK fails, the existing logic will
not know it is a CLONE_VFORK and proceed to unmap the parents shadow stack.
This error handling cleanup logic runs via exit_thread() in the
bad_fork_cleanup_thread label in copy_process(). The issue was seen in
the glibc test "posix/tst-spawn3-pidfd" while running with shadow stack
using currently out-of-tree glibc patches.
Fix it by not unmapping the vfork shadow stack in the error case as well.
Since clone is implemented in core code, it is not ideal to pass the clone
flags along the error path in order to have shadow stack code have
symmetric logic in the freeing half of the thread shadow stack handling.
Instead use the existing state for thread shadow stacks to track whether
the thread is managing its own shadow stack. For CLONE_VFORK, simply set
shstk->base and shstk->size to 0, and have it mean the thread is not
managing a shadow stack and so should skip cleanup work. Implement this
by breaking up the CLONE_VFORK and !CLONE_VM cases in
shstk_alloc_thread_stack() to separate conditionals since, the logic is
now different between them. In the case of CLONE_VFORK && !CLONE_VM, the
existing behavior is to not clean up the shadow stack in the child (which
should go away quickly with either be exit or exec), so maintain that
behavior by handling the CLONE_VFORK case first in the allocation path.
This new logioc cleanly handles the case of normal, successful
CLONE_VFORK's skipping cleaning up their shadow stack's on exit as well.
So remove the existing, vfork shadow stack freeing logic. This is in
deactivate_mm() where vfork_done is used to tell if it is a vfork child
that can skip cleaning up the thread shadow stack.
Fixes: b2926a36b9 ("x86/shstk: Handle thread shadow stack")
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lore.kernel.org/all/20230908203655.543765-2-rick.p.edgecombe%40intel.com
If the user has requested no SRSO mitigation, other mitigations can use
the lighter-weight SBPB instead of IBPB.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/b20820c3cfd1003171135ec8d762a0b957348497.1693889988.git.jpoimboe@kernel.org
To support live migration, the hypervisor sets the "lowest common
denominator" of features. Probing the microcode isn't allowed because
any detected features might go away after a migration.
As Andy Cooper states:
"Linux must not probe microcode when virtualised. What it may see
instantaneously on boot (owing to MSR_PRED_CMD being fully passed
through) is not accurate for the lifetime of the VM."
Rely on the hypervisor to set the needed IBPB_BRTYPE and SBPB bits.
Fixes: 1b5277c0ea ("x86/srso: Add SRSO_NO support")
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/3938a7209606c045a3f50305d201d840e8c834c7.1693889988.git.jpoimboe@kernel.org
Booting with mitigations=off incorrectly prevents the
X86_FEATURE_{IBPB_BRTYPE,SBPB} CPUID bits from getting set.
Also, future CPUs without X86_BUG_SRSO might still have IBPB with branch
type prediction flushing, in which case SBPB should be used instead of
IBPB. The current code doesn't allow for that.
Also, cpu_has_ibpb_brtype_microcode() has some surprising side effects
and the setting of these feature bits really doesn't belong in the
mitigation code anyway. Move it to earlier.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/869a1709abfe13b673bdd10c2f4332ca253a40bc.1693889988.git.jpoimboe@kernel.org
Reading the 'spec_rstack_overflow' sysfs file can trigger an unnecessary
MSR write, and possibly even a (handled) exception if the microcode
hasn't been updated.
Avoid all that by just checking X86_FEATURE_IBPB_BRTYPE instead, which
gets set by srso_select_mitigation() if the updated microcode exists.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/27d128899cb8aee9eb2b57ddc996742b0c1d776b.1693889988.git.jpoimboe@kernel.org
Only Xen is using the paravirt lazy mode code, so it can be moved to
Xen specific sources.
This allows to make some of the functions static or to merge them into
their only call sites.
While at it do a rename from "paravirt" to "xen" for all moved
specifiers.
No functional change.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20230913113828.18421-3-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
The code calling ima_free_kexec_buffer() runs long after the memblock
allocator has already been torn down, potentially resulting in a use
after free in memblock_isolate_range().
With KASAN or KFENCE, this use after free will result in a BUG
from the idle task, and a subsequent kernel panic.
Switch ima_free_kexec_buffer() over to memblock_free_late() to avoid
that bug.
Fixes: fee3ff99bc ("powerpc: Move arch independent ima kexec functions to drivers/of/kexec.c")
Suggested-by: Mike Rappoport <rppt@kernel.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230817135558.67274c83@imladris.surriel.com
- Fix an UV boot crash,
- Skip spurious ENDBR generation on _THIS_IP_,
- Fix ENDBR use in putuser() asm methods,
- Fix corner case boot crashes on 5-level paging,
- and fix a false positive WARNING on LTO kernels.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUHOrwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j6Jw/+PjUfh/4+KM/Z8VzcBy2UhY3NMF2ptGCN
47FPLy+8ADqOvIfgsPsBEO9VXwdvkHfH64YWRUlULjvPNOSs+37drBYMe9AI9xKE
u6NhoBHmsnOtoLkBFIQYlJys9GyAeoMlwSSHxzRwQS+3VztRjoH636jiBcg/h7DR
zhakfnJD1SSOZuEyyDPnO0uIUarrcqC2jdZwucSqZnvZFdA/pexUHQEe2RtMXLln
EIA5kuEo7UdADcSr/Lbty7MKO+6xpRTjxF0yNItPtwPWsnxrSAC7P+dQ37uB975U
Z0CJ/kw54XG5Sdoech7XCWYmJzDxSPhiziA1USKpZJMo5phAG/apTJK6NG4TG94r
U+3QhLopUoSd8N74/VtZn0FjOrMsk7YKD5o8twlTbpCd2VaiJk4X5oBIns6Tvq05
0Vmsx15XO3mEzg1wWbbdjhjHW0czRgBpikS9QyexZKVkukYcW8QT6bk1gK1ypg94
do4PHKB53QIt31dedy/ddpQj4u1sJ4+a9/+IG29kjUB33M4+uFedtw11vfe+CDN0
XLRc6HfPblogIIEO/figJgwSq/TPCOsNHTwKkejq+D1Ey6DsrnX9Gg4BWVz/3dDW
84SW7TaO2FGEDh4NkR2ijkZlbpofFnCvhCh/uohodPlqDrTVTuRKCZW9I5plmGVa
qeXUpNDFs1E=
=BMjT
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- Fix an UV boot crash
- Skip spurious ENDBR generation on _THIS_IP_
- Fix ENDBR use in putuser() asm methods
- Fix corner case boot crashes on 5-level paging
- and fix a false positive WARNING on LTO kernels"
* tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/purgatory: Remove LTO flags
x86/boot/compressed: Reserve more memory for page tables
x86/ibt: Avoid duplicate ENDBR in __put_user_nocheck*()
x86/ibt: Suppress spurious ENDBR
x86/platform/uv: Use alternate source for socket to node data
balancing bug, and a topology setup bug on (Intel) hybrid processors.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUHOVQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iOahAAj3YsoNbT/k6m9yp622n1OopaNEQvsK+/
F2Q5g/hJrm3+W5764rF8CvjhDbmrv6owjp3yUyZLDIfSAFZYMvwoNody3a373Yr3
VFBMJ00jNIv/TAFCJZYeybg3yViwObKKfpu4JBj//QU+4uGWCoBMolkVekU2bBti
r50fMxBPgg2Yic57DCC8Y+JZzHI/ydQ3rvVXMzkrTZCO/zY4/YmERM9d+vp4wl4B
uG9cfXQ4Yf/1gZo0XDlTUkOJUXPnkMgi+N4eHYlGuyOCoIZOfATI24hRaPBoQcdx
PDwHcKmyNxH9iaRppNQMvi797g3KrKVEmZwlZg1IfsILhKC0F4GsQ85tw8qQWE8j
brFPkWVUxAUSl4LXoqVInaxDHmJWR2UC3RA7b+BxFF/GMLTow0z4a+JMC6eKlNyR
uBisZnuEuecqwF9TLhyD3KBHh7PihUPz8PuFHk+Um5sggaUE82I+VwX6uxEi5y8r
ke2kAkpuMxPWT5lwDmFPAXWfvpZz5vvTIRUxGGj2+s4d8b0YfLtZsx5+uOIacaub
Gw+wYFfSowph72tR/SUVq0An/UTSPPBxty8eFIVeE6sW9bw3ghTtkf8300xjV7Rj
sKVxXy/podAo8wG7R6aZfTfsCpohmeEjskiatYdThYamPPx7V0R5pq4twmTXTHLJ
bFvQ1GFCOu0=
=jIeN
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix a performance regression on large SMT systems, an Intel SMT4
balancing bug, and a topology setup bug on (Intel) hybrid processors"
* tag 'sched-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sched: Restore the SD_ASYM_PACKING flag in the DIE domain
sched/fair: Fix SMT4 group_smt_balance handling
sched/fair: Optimize should_we_balance() for large SMT systems
Another major aspect of supporting running of 32bit processes is the
ability to access 32bit syscalls. Such syscalls can be invoked by
using the legacy int 0x80 handler and sysenter/syscall instructions.
If IA32 emulation is disabled ensure that each of those 3 distinct
mechanisms are also disabled. For int 0x80 a #GP exception would be
generated since the respective descriptor is not going to be loaded at
all. Invoking sysenter will also result in a #GP since IA32_SYSENTER_CS
contains an invalid segment. Finally, syscall instruction cannot really
be disabled so it's configured to execute a minimal handler.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230623111409.3047467-6-nik.borisov@suse.com
The SYSCALL instruction cannot really be disabled in compatibility mode.
The best that can be done is to configure the CSTAR msr to point to a
minimal handler. Currently this handler has a rather misleading name -
ignore_sysret() as it's not really doing anything with sysret.
Give it a more descriptive name.
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230623111409.3047467-3-nik.borisov@suse.com
Commit 8f2d6c41e5 ("x86/sched: Rewrite topology setup") dropped the
SD_ASYM_PACKING flag in the DIE domain added in commit 044f0e27de
("x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid
processors"). Restore it on hybrid processors.
The die-level domain does not depend on any build configuration and now
x86_sched_itmt_flags() is always needed. Remove the build dependency on
CONFIG_SCHED_[SMT|CLUSTER|MC].
Fixes: 8f2d6c41e5 ("x86/sched: Rewrite topology setup")
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Caleb Callaway <caleb.callaway@intel.com>
Link: https://lkml.kernel.org/r/20230815035747.11529-1-ricardo.neri-calderon@linux.intel.com
Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are
almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro
share similar code pattern too. The __tdx_hypercall() and __tdcall()
should be unified to use the same assembly code.
As a preparation to unify them, simplify the TDX_HYPERCALL to make it
more like the TDX_MODULE_CALL.
The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as
function call argument, and does below extra things comparing to the
TDX_MODULE_CALL:
1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally;
2) It sets RCX to the (fixed) bitmap of shared registers internally;
3) It calls __tdx_hypercall_failed() internally (and panics) when the
TDCALL instruction itself fails;
4) After TDCALL, it moves R10 to RAX to return the return code of the
VMCALL leaf, regardless the '\ret' asm macro argument;
Firstly, change the TDX_HYPERCALL to take the same function call
arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer
to 'struct tdx_module_args'. Then 1) and 2) can be moved to the
caller:
- TDG.VP.VMCALL leaf ID can be passed via the function call argument;
- 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus
the bitmap of shared registers can be passed via RCX in the
structure.
Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL
always save output registers to the structure. The caller then can:
- Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error;
- Return R10 in the structure as the return code of the VMCALL leaf;
With above changes, change the asm function from __tdx_hypercall() to
__tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper
of it. This avoids having to add another wrapper of __tdx_hypercall()
(_tdx_hypercall() is already taken).
The __tdcall_hypercall() will be replaced with a __tdcall() variant
using TDX_MODULE_CALL in a later commit as the final goal is to have one
assembly to handle both TDCALL and TDVMCALL.
Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep
this unchanged, annotate __tdx_hypercall(), which is a C function now,
as 'noinstr'.
Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so.
Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the
compressed code.
Opportunistically fix a checkpatch error complaining using space around
parenthesis '(' and ')' while moving the bitmap of shared registers to
<asm/shared/tdx.h>.
[ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com
The TDX guest live migration support (TDX 1.5) adds new TDCALL/SEAMCALL
leaf functions. Those new TDCALLs/SEAMCALLs take additional registers
for input (R10-R13) and output (R12-R13). TDG.SERVTD.RD is an example.
Also, the current TDX_MODULE_CALL doesn't aim to handle TDH.VP.ENTER
SEAMCALL, which monitors the TDG.VP.VMCALL in input/output registers
when it returns in case of VMCALL from TDX guest.
With those new TDCALLs/SEAMCALLs and the TDH.VP.ENTER covered, the
TDX_MODULE_CALL macro basically needs to handle the same input/output
registers as the TDX_HYPERCALL does. And as a result, they also share
similar logic in the assembly, thus should be unified to use one common
assembly.
Extend the TDX_MODULE_CALL asm to support the new TDCALLs/SEAMCALLs and
also the TDH.VP.ENTER SEAMCALL. Eventually it will be unified with the
TDX_HYPERCALL.
The new input/output registers fit with the "callee-saved" registers in
the x86 calling convention. Add a new "saved" parameter to support
those new TDCALLs/SEAMCALLs and TDH.VP.ENTER and keep the existing
TDCALLs/SEAMCALLs minimally impacted.
For TDH.VP.ENTER, after it returns the registers shared by the guest
contain guest's values. Explicitly clear them to prevent speculative
use of guest's values.
Note most TDX live migration related SEAMCALLs may also clobber AVX*
state ("AVX, AVX2 and AVX512 state: may be reset to the architectural
INIT state" -- see TDH.EXPORT.MEM for example). And TDH.VP.ENTER also
clobbers XMM0-XMM15 when the corresponding bit is set in RCX. Don't
handle them in the TDX_MODULE_CALL macro but let the caller save and
restore when needed.
This is basically based on Peter's code.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/d4785de7c392f7c5684407f6c24a73b92148ec49.1692096753.git.kai.huang%40intel.com
Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and
SEAMCALL, takes one parameter for each input register and an optional
'struct tdx_module_output' (a collection of output registers) as output.
This is different from the TDX_HYPERCALL macro which uses a single
'struct tdx_hypercall_args' to carry all input/output registers.
The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more
input/output registers. Also, the TDH.VP.ENTER (which isn't covered
by the current TDX_MODULE_CALL macro) basically can use all registers
that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't
extendible to cover those cases.
Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro
to use a single structure 'struct tdx_module_args' to carry all the
input/output registers. Currently, R10/R11 are only used as output
register but not as input by any TDCALL/SEAMCALL. Change to also use
R10/R11 as input register to make input/output registers symmetric.
Currently, the TDX_MODULE_CALL macro depends on the caller to pass a
non-NULL 'struct tdx_module_output' to get additional output registers.
Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to
take a new 'ret' macro argument to indicate whether to save the output
registers to the 'struct tdx_module_args'. Also introduce a new
__tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret().
Note the tdcall(), which is a wrapper of __tdcall(), is called by three
callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init().
The former two need the additional output but the last one doesn't. For
simplicity, make tdcall() always call __tdcall_ret() to avoid another
"_ret()" wrapper. The last caller tdx_early_init() isn't performance
critical anyway.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com
The UV code attempts to build a set of tables to allow it to do
bidirectional socket<=>node lookups.
But when nr_cpus is set to a smaller number than actually present, the
cpu_to_node() mapping information for unused CPUs is not available to
build_socket_tables(). This results in skipping some nodes or sockets
when creating the tables and leaving some -1's for later code to trip.
over, causing oopses.
The problem is that the socket<=>node lookups are created by doing a
loop over all CPUs, then looking up the CPU's APICID and socket. But
if a CPU is not present, there is no way to start this lookup.
Instead of looping over all CPUs, take CPUs out of the equation
entirely. Loop over all APICIDs which are mapped to a valid NUMA node.
Then just extract the socket-id from the APICID.
This avoid tripping over disabled CPUs.
Fixes: 8a50c58519 ("x86/platform/uv: UV support for sub-NUMA clustering")
Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20230807141730.1117278-1-steve.wahl%40hpe.com
fix a ld.lld linker (in)compatibility quirk and make the x86 SMP init code a bit
more conservative to fix kexec() lockups.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmT97boRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jObA//X7nug+d+IMLIs+c4579z4ZhkltMRxJVI
Btf8sdHpwgTUtKaOLmJnGiJ7f0GK5NtoaNtGUJF28aQETVOhco0Fvg/R8k1FE2Tc
CJqw6oy2FjVqD9qzZPCXh6QCvTtGjN5GF+xmoUbf7eZ9U8IVvOxBG+7yDMorQw3P
zzjIccLLg/aDvNLN/yZD2oqw6UGHZuh/Qr/0Q4PkZ7zL+yWV8EC+HOx3rlQklq0x
hh6YMwa4LGr3przUObHsfNS11EDzLDhg2WtTQMr12vlnpUB2eXnXWLklr6rpWjcz
qJiMxkrEkygB7seXnuQ0b4KHN/17zdkJ+t6vZoznUTXs1ohIDLWdiNTSl03qCs9B
V98a1x3MPT6aro9O/5ywyAJwPb0hvsg2S0ODFWab0Z3oRUbIG/k6dTEYlP7qZw8v
EFMtLy6M2EILXetj8q2ZGcA0rKz7pj/z9SosWDzqNj76w7xGwDKrSWoKJckkCwG+
j+ycBuKfrpxVYOF4ywvONSf35QTIW8BR0sM9Lg1GZuwaeincFwLf0cmS4ljGRyZ1
Vsi0SfpIgVQkeY/17onTa1C5X6c2wIE9nq253M58Xnc9B2EWpYImr+4PVZk6s4GI
GExvdPC/rIIwYa0LmvYTTlpHEd7f5qIAhfcEtMAuGSjVDLvmdDGFkaU7TgJ6Jcw2
D12wKSAAgPU=
=S38E
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Fix preemption delays in the SGX code, remove unnecessarily
UAPI-exported code, fix a ld.lld linker (in)compatibility quirk and
make the x86 SMP init code a bit more conservative to fix kexec()
lockups"
* tag 'x86-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Break up long non-preemptible delays in sgx_vepc_release()
x86: Remove the arch_calc_vm_prot_bits() macro from the UAPI
x86/build: Fix linker fill bytes quirk/incompatibility for ld.lld
x86/smp: Don't send INIT to non-present and non-booted CPUs
* Clean up vCPU targets, always returning generic v8 as the preferred target
* Trap forwarding infrastructure for nested virtualization (used for traps
that are taken from an L2 guest and are needed by the L1 hypervisor)
* FEAT_TLBIRANGE support to only invalidate specific ranges of addresses
when collapsing a table PTE to a block PTE. This avoids that the guest
refills the TLBs again for addresses that aren't covered by the table PTE.
* Fix vPMU issues related to handling of PMUver.
* Don't unnecessary align non-stack allocations in the EL2 VA space
* Drop HCR_VIRT_EXCP_MASK, which was never used...
* Don't use smp_processor_id() in kvm_arch_vcpu_load(),
but the cpu parameter instead
* Drop redundant call to kvm_set_pfn_accessed() in user_mem_abort()
* Remove prototypes without implementations
RISC-V:
* Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for guest
* Added ONE_REG interface for SATP mode
* Added ONE_REG interface to enable/disable multiple ISA extensions
* Improved error codes returned by ONE_REG interfaces
* Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V
* Added get-reg-list selftest for KVM RISC-V
s390:
* PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch)
Allows a PV guest to use crypto cards. Card access is governed by
the firmware and once a crypto queue is "bound" to a PV VM every
other entity (PV or not) looses access until it is not bound
anymore. Enablement is done via flags when creating the PV VM.
* Guest debug fixes (Ilya)
x86:
* Clean up KVM's handling of Intel architectural events
* Intel bugfixes
* Add support for SEV-ES DebugSwap, allowing SEV-ES guests to use debug
registers and generate/handle #DBs
* Clean up LBR virtualization code
* Fix a bug where KVM fails to set the target pCPU during an IRTE update
* Fix fatal bugs in SEV-ES intrahost migration
* Fix a bug where the recent (architecturally correct) change to reinject
#BP and skip INT3 broke SEV guests (can't decode INT3 to skip it)
* Retry APIC map recalculation if a vCPU is added/enabled
* Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
"emergency disabling" behavior to KVM actually being loaded, and move all of
the logic within KVM
* Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
ratio MSR cannot diverge from the default when TSC scaling is disabled
up related code
* Add a framework to allow "caching" feature flags so that KVM can check if
the guest can use a feature without needing to search guest CPUID
* Rip out the ancient MMU_DEBUG crud and replace the useful bits with
CONFIG_KVM_PROVE_MMU
* Fix KVM's handling of !visible guest roots to avoid premature triple fault
injection
* Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the API surface
that is needed by external users (currently only KVMGT), and fix a variety
of issues in the process
This last item had a silly one-character bug in the topic branch that
was sent to me. Because it caused pretty bad selftest failures in
some configurations, I decided to squash in the fix. So, while the
exact commit ids haven't been in linux-next, the code has (from the
kvm-x86 tree).
Generic:
* Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass
action specific data without needing to constantly update the main handlers.
* Drop unused function declarations
Selftests:
* Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs
* Add support for printf() in guest code and covert all guest asserts to use
printf-based reporting
* Clean up the PMU event filter test and add new testcases
* Include x86 selftests in the KVM x86 MAINTAINERS entry
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmT1m0kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMNgggAiN7nz6UC423FznuI+yO3TLm8tkx1
CpKh5onqQogVtchH+vrngi97cfOzZb1/AtifY90OWQi31KEWhehkeofcx7G6ERhj
5a9NFADY1xGBsX4exca/VHDxhnzsbDWaWYPXw5vWFWI6erft9Mvy3tp1LwTvOzqM
v8X4aWz+g5bmo/DWJf4Wu32tEU6mnxzkrjKU14JmyqQTBawVmJ3RYvHVJ/Agpw+n
hRtPAy7FU6XTdkmq/uCT+KUHuJEIK0E/l1js47HFAqSzwdW70UDg14GGo1o4ETxu
RjZQmVNvL57yVgi6QU38/A0FWIsWQm5IlaX1Ug6x8pjZPnUKNbo9BY4T1g==
=W+4p
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Clean up vCPU targets, always returning generic v8 as the preferred
target
- Trap forwarding infrastructure for nested virtualization (used for
traps that are taken from an L2 guest and are needed by the L1
hypervisor)
- FEAT_TLBIRANGE support to only invalidate specific ranges of
addresses when collapsing a table PTE to a block PTE. This avoids
that the guest refills the TLBs again for addresses that aren't
covered by the table PTE.
- Fix vPMU issues related to handling of PMUver.
- Don't unnecessary align non-stack allocations in the EL2 VA space
- Drop HCR_VIRT_EXCP_MASK, which was never used...
- Don't use smp_processor_id() in kvm_arch_vcpu_load(), but the cpu
parameter instead
- Drop redundant call to kvm_set_pfn_accessed() in user_mem_abort()
- Remove prototypes without implementations
RISC-V:
- Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for guest
- Added ONE_REG interface for SATP mode
- Added ONE_REG interface to enable/disable multiple ISA extensions
- Improved error codes returned by ONE_REG interfaces
- Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V
- Added get-reg-list selftest for KVM RISC-V
s390:
- PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch)
Allows a PV guest to use crypto cards. Card access is governed by
the firmware and once a crypto queue is "bound" to a PV VM every
other entity (PV or not) looses access until it is not bound
anymore. Enablement is done via flags when creating the PV VM.
- Guest debug fixes (Ilya)
x86:
- Clean up KVM's handling of Intel architectural events
- Intel bugfixes
- Add support for SEV-ES DebugSwap, allowing SEV-ES guests to use
debug registers and generate/handle #DBs
- Clean up LBR virtualization code
- Fix a bug where KVM fails to set the target pCPU during an IRTE
update
- Fix fatal bugs in SEV-ES intrahost migration
- Fix a bug where the recent (architecturally correct) change to
reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to
skip it)
- Retry APIC map recalculation if a vCPU is added/enabled
- Overhaul emergency reboot code to bring SVM up to par with VMX, tie
the "emergency disabling" behavior to KVM actually being loaded,
and move all of the logic within KVM
- Fix user triggerable WARNs in SVM where KVM incorrectly assumes the
TSC ratio MSR cannot diverge from the default when TSC scaling is
disabled up related code
- Add a framework to allow "caching" feature flags so that KVM can
check if the guest can use a feature without needing to search
guest CPUID
- Rip out the ancient MMU_DEBUG crud and replace the useful bits with
CONFIG_KVM_PROVE_MMU
- Fix KVM's handling of !visible guest roots to avoid premature
triple fault injection
- Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the
API surface that is needed by external users (currently only
KVMGT), and fix a variety of issues in the process
Generic:
- Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier
events to pass action specific data without needing to constantly
update the main handlers.
- Drop unused function declarations
Selftests:
- Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs
- Add support for printf() in guest code and covert all guest asserts
to use printf-based reporting
- Clean up the PMU event filter test and add new testcases
- Include x86 selftests in the KVM x86 MAINTAINERS entry"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (279 commits)
KVM: x86/mmu: Include mmu.h in spte.h
KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots
KVM: x86/mmu: Disallow guest from using !visible slots for page tables
KVM: x86/mmu: Harden TDP MMU iteration against root w/o shadow page
KVM: x86/mmu: Harden new PGD against roots without shadow pages
KVM: x86/mmu: Add helper to convert root hpa to shadow page
drm/i915/gvt: Drop final dependencies on KVM internal details
KVM: x86/mmu: Handle KVM bookkeeping in page-track APIs, not callers
KVM: x86/mmu: Drop @slot param from exported/external page-track APIs
KVM: x86/mmu: Bug the VM if write-tracking is used but not enabled
KVM: x86/mmu: Assert that correct locks are held for page write-tracking
KVM: x86/mmu: Rename page-track APIs to reflect the new reality
KVM: x86/mmu: Drop infrastructure for multiple page-track modes
KVM: x86/mmu: Use page-track notifiers iff there are external users
KVM: x86/mmu: Move KVM-only page-track declarations to internal header
KVM: x86: Remove the unused page-track hook track_flush_slot()
drm/i915/gvt: switch from ->track_flush_slot() to ->track_remove_region()
KVM: x86: Add a new page-track hook to handle memslot deletion
drm/i915/gvt: Don't bother removing write-protection on to-be-deleted slot
KVM: x86: Reject memslot MOVE operations if KVMGT is attached
...
On large enclaves we hit the softlockup warning with following call trace:
xa_erase()
sgx_vepc_release()
__fput()
task_work_run()
do_exit()
The latency issue is similar to the one fixed in:
8795359e35 ("x86/sgx: Silence softlockup detection when releasing large enclaves")
The test system has 64GB of enclave memory, and all is assigned to a single VM.
Release of 'vepc' takes a longer time and causes long latencies, which triggers
the softlockup warning.
Add cond_resched() to give other tasks a chance to run and reduce
latencies, which also avoids the softlockup detector.
[ mingo: Rewrote the changelog. ]
Fixes: 540745ddbc ("x86/sgx: Introduce virtual EPC for use by KVM guests")
Reported-by: Yu Zhang <yu.zhang@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Yu Zhang <yu.zhang@ionos.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Haitao Huang <haitao.huang@linux.intel.com>
Cc: stable@vger.kernel.org
With ":text =0xcccc", ld.lld fills unused text area with 0xcccc0000.
Example objdump -D output:
ffffffff82b04203: 00 00 add %al,(%rax)
ffffffff82b04205: cc int3
ffffffff82b04206: cc int3
ffffffff82b04207: 00 00 add %al,(%rax)
ffffffff82b04209: cc int3
ffffffff82b0420a: cc int3
Replace it with ":text =0xcccccccc", so we get the following instead:
ffffffff82b04203: cc int3
ffffffff82b04204: cc int3
ffffffff82b04205: cc int3
ffffffff82b04206: cc int3
ffffffff82b04207: cc int3
ffffffff82b04208: cc int3
gcc/ld doesn't seem to have the same issue. The generated code stays the
same for gcc/ld.
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Fixes: 7705dc8557 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")
Link: https://lore.kernel.org/r/20230906175215.2236033-1-song@kernel.org
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmT0EE8THHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXg5FCACGJ6n2ikhtRHAENHIVY/mTh+HbhO07
ERzjADfqKF43u1Nt9cslgT4MioqwLjQsAu/A0YcJgVxVSOtg7dnbDmurRAjrGT/3
iKqcVvnaiwSV44TkF8evpeMttZSOg29ImmpyQjoZJJvDMfpxleEK53nuKB9EsjKL
Mz/0gSPoNc79bWF+85cVhgPnGIh9nBarxHqVsuWjMhc+UFhzjf9mOtk34qqPfJ1Q
4RsKGEjkVkeXoG6nGd6Gl/+8WoTpenOZQLchhInocY+k9FlAzW1Kr+ICLDx+Topw
8OJ6fv2rMDOejT9aOaA3/imf7LMer0xSUKb6N0sqQAQX8KzwcOYyKtQJ
=rC/v
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20230902' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Support for SEV-SNP guests on Hyper-V (Tianyu Lan)
- Support for TDX guests on Hyper-V (Dexuan Cui)
- Use SBRM API in Hyper-V balloon driver (Mitchell Levy)
- Avoid dereferencing ACPI root object handle in VMBus driver (Maciej
Szmigiero)
- A few misecllaneous fixes (Jiapeng Chong, Nathan Chancellor, Saurabh
Sengar)
* tag 'hyperv-next-signed-20230902' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
x86/hyperv: Remove duplicate include
x86/hyperv: Move the code in ivm.c around to avoid unnecessary ifdef's
x86/hyperv: Remove hv_isolation_type_en_snp
x86/hyperv: Use TDX GHCI to access some MSRs in a TDX VM with the paravisor
Drivers: hv: vmbus: Bring the post_msg_page back for TDX VMs with the paravisor
x86/hyperv: Introduce a global variable hyperv_paravisor_present
Drivers: hv: vmbus: Support >64 VPs for a fully enlightened TDX/SNP VM
x86/hyperv: Fix serial console interrupts for fully enlightened TDX guests
Drivers: hv: vmbus: Support fully enlightened TDX guests
x86/hyperv: Support hypercalls for fully enlightened TDX guests
x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests
x86/hyperv: Fix undefined reference to isolation_type_en_snp without CONFIG_HYPERV
x86/hyperv: Add missing 'inline' to hv_snp_boot_ap() stub
hv: hyperv.h: Replace one-element array with flexible-array member
Drivers: hv: vmbus: Don't dereference ACPI root object handle
x86/hyperv: Add hyperv-specific handling for VMMCALL under SEV-ES
x86/hyperv: Add smp support for SEV-SNP guest
clocksource: hyper-v: Mark hyperv tsc page unencrypted in sev-snp enlightened guest
x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
drivers: hv: Mark percpu hvcall input arg page unencrypted in SEV-SNP enlightened guest
...
Vasant reported that kexec() can hang or reset the machine when it tries to
park CPUs via INIT. This happens when the kernel is using extended APIC,
but the present mask has APIC IDs >= 0x100 enumerated.
As extended APIC can only handle 8 bit of APIC ID sending INIT to APIC ID
0x100 sends INIT to APIC ID 0x0. That's the boot CPU which is special on
x86 and INIT causes the system to hang or resets the machine.
Prevent this by sending INIT only to those CPUs which have been booted
once.
Fixes: 45e34c8af5 ("x86/smp: Put CPUs into INIT on shutdown if possible")
Reported-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/87cyzwjbff.ffs@tglx
* Fix PKRU covert channel
* Fix -Wmissing-variable-declarations warning for ia32_xyz_class
* Fix kernel-doc annotation warning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTyK2sACgkQaDWVMHDJ
krCPyA//WbT65hhGZt5UHayoW3Cv8NmS4kRf6FTVMmt1yIvbOtVmWVZeOGMypprC
jYBPOdTf1SdoUxeD1cTTd/rtbYknczaBLC5VGilGb2/663yQl9ThT3ePc6OgUpPW
28ItLslfHGchHbrOgCgUK8Aiv8xRdfNUJoByMTOoce6eGUQsgYaNOq74z1YaLsZ4
afbitcd/vWO9eJfW5DNl26rIP+ksOgA/Iue8uRiEczoPltnbUxGGLCtuuR9B2Xxi
FqfttvcnUy9yFJBnELx4721VtTcmK4Dq5O/ZoJnmVQqRPY9aXpkbwZnH9co8r/uh
GueQj30l6VhEO6XnwPvjsOghXI2vOMrxlL3zGMw5y6pjfUWa4W6f33EV6mByYfml
dfPTXIz9K1QxznHtk6xVhn1X8CFw7L1L7CC/2CPQPETiuqyjWrHqvFBXcsWpKq4N
1GjnpoveNZmWcE7UY9qEHAXGh7ndI+EIux/0WfJJHOE/fzCr06GphQdxNwPGdi5O
T3JkZf6R4GBvfi6e0F6Qn2B02sDl18remmAMi9bAn/M31g+mvAoHqKLlI9EzAGGz
91PzJdHecdvalU6zsCKqi0ETzq11WG7zCfSXs+jeHBe1jsUITpcZCkpVi2pGA2aq
X7qLjJyW+Sx0hdE2IWFhCOfiwGFpXxfegCl1uax3HrZg6rrORbw=
=prt2
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Dave Hansen:
"The most important fix here adds a missing CPU model to the recent
Gather Data Sampling (GDS) mitigation list to ensure that mitigations
are available on that CPU.
There are also a pair of warning fixes, and closure of a covert
channel that pops up when protection keys are disabled.
Summary:
- Mark all Skylake CPUs as vulnerable to GDS
- Fix PKRU covert channel
- Fix -Wmissing-variable-declarations warning for ia32_xyz_class
- Fix kernel-doc annotation warning"
* tag 'x86-urgent-2023-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu/xstate: Fix PKRU covert channel
x86/irq/i8259: Fix kernel-doc annotation warning
x86/speculation: Mark all Skylake CPUs as vulnerable to GDS
x86/audit: Fix -Wmissing-variable-declarations warning for ia32_xyz_class
Here is the big set of char/misc and other small driver subsystem
changes for 6.6-rc1.
Stuff all over the place here, lots of driver updates and changes and
new additions. Short summary is:
- new IIO drivers and updates
- Interconnect driver updates
- fpga driver updates and additions
- fsi driver updates
- mei driver updates
- coresight driver updates
- nvmem driver updates
- counter driver updates
- lots of smaller misc and char driver updates and additions
All of these have been in linux-next for a long time with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZPH64g8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynr2QCfd3RKeR+WnGzyEOFhksl30UJJhiIAoNZtYT5+
t9KG0iMDXRuTsOqeEQbd
=tVnk
-----END PGP SIGNATURE-----
Merge tag 'char-misc-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big set of char/misc and other small driver subsystem
changes for 6.6-rc1.
Stuff all over the place here, lots of driver updates and changes and
new additions. Short summary is:
- new IIO drivers and updates
- Interconnect driver updates
- fpga driver updates and additions
- fsi driver updates
- mei driver updates
- coresight driver updates
- nvmem driver updates
- counter driver updates
- lots of smaller misc and char driver updates and additions
All of these have been in linux-next for a long time with no reported
problems"
* tag 'char-misc-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (267 commits)
nvmem: core: Notify when a new layout is registered
nvmem: core: Do not open-code existing functions
nvmem: core: Return NULL when no nvmem layout is found
nvmem: core: Create all cells before adding the nvmem device
nvmem: u-boot-env:: Replace zero-length array with DECLARE_FLEX_ARRAY() helper
nvmem: sec-qfprom: Add Qualcomm secure QFPROM support
dt-bindings: nvmem: sec-qfprom: Add bindings for secure qfprom
dt-bindings: nvmem: Add compatible for QCM2290
nvmem: Kconfig: Fix typo "drive" -> "driver"
nvmem: Explicitly include correct DT includes
nvmem: add new NXP QorIQ eFuse driver
dt-bindings: nvmem: Add t1023-sfp efuse support
dt-bindings: nvmem: qfprom: Add compatible for MSM8226
nvmem: uniphier: Use devm_platform_get_and_ioremap_resource()
nvmem: qfprom: do some cleanup
nvmem: stm32-romem: Use devm_platform_get_and_ioremap_resource()
nvmem: rockchip-efuse: Use devm_platform_get_and_ioremap_resource()
nvmem: meson-mx-efuse: Convert to devm_platform_ioremap_resource()
nvmem: lpc18xx_otp: Convert to devm_platform_ioremap_resource()
nvmem: brcm_nvram: Use devm_platform_get_and_ioremap_resource()
...
Here is a small set of driver core updates and additions for 6.6-rc1.
Included in here are:
- stable kernel documentation updates
- class structure const work from Ivan on various subsystems
- kernfs tweaks
- driver core tests!
- kobject sanity cleanups
- kobject structure reordering to save space
- driver core error code handling fixups
- other minor driver core cleanups
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZPH77Q8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ylZMACePk8SitfaJc6FfFf5I7YK7Nq0V8MAn0nUjgsR
i8NcNpu/Yv4HGrDgTdh/
=PJbk
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is a small set of driver core updates and additions for 6.6-rc1.
Included in here are:
- stable kernel documentation updates
- class structure const work from Ivan on various subsystems
- kernfs tweaks
- driver core tests!
- kobject sanity cleanups
- kobject structure reordering to save space
- driver core error code handling fixups
- other minor driver core cleanups
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
driver core: Call in reversed order in device_platform_notify_remove()
driver core: Return proper error code when dev_set_name() fails
kobject: Remove redundant checks for whether ktype is NULL
kobject: Add sanity check for kset->kobj.ktype in kset_register()
drivers: base: test: Add missing MODULE_* macros to root device tests
drivers: base: test: Add missing MODULE_* macros for platform devices tests
drivers: base: Free devm resources when unregistering a device
drivers: base: Add basic devm tests for platform devices
drivers: base: Add basic devm tests for root devices
kernfs: fix missing kernfs_iattr_rwsem locking
docs: stable-kernel-rules: mention that regressions must be prevented
docs: stable-kernel-rules: fine-tune various details
docs: stable-kernel-rules: make the examples for option 1 a proper list
docs: stable-kernel-rules: move text around to improve flow
docs: stable-kernel-rules: improve structure by changing headlines
base/node: Remove duplicated include
kernfs: attach uuid for every kernfs and report it in fsid
kernfs: add stub helper for kernfs_generic_poll()
x86/resctrl: make pseudo_lock_class a static const structure
x86/MSR: make msr_class a static const structure
...
When XCR0[9] is set, PKRU can be read and written from userspace with
XSAVE and XRSTOR, even when CR4.PKE is clear.
Clear XCR0[9] when protection keys are disabled.
Reported-by: Tavis Ormandy <taviso@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20230831043228.1194256-1-jmattson@google.com
Convert IBT selftest to asm to fix objtool warning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTv1QQACgkQaDWVMHDJ
krAUwhAAn6TOwHJK8BSkHeiQhON1nrlP3c5cv0AyZ2NP8RYDrZrSZvhpYBJ6wgKC
Cx5CGq5nn9twYsYS3KsktLKDfR3lRdsQ7K9qtyFtYiaeaVKo+7gEKl/K+klwai8/
gninQWHk0zmSCja8Vi77q52WOMkQKapT8+vaON9EVDO8dVEi+CvhAIfPwMafuiwO
Rk4X86SzoZu9FP79LcCg9XyGC/XbM2OG9eNUTSCKT40qTTKm5y4gix687NvAlaHR
ko5MTsdl0Wfp6Qk0ohT74LnoA2c1g/FluvZIM33ci/2rFpkf9Hw7ip3lUXqn6CPx
rKiZ+pVRc0xikVWkraMfIGMJfUd2rhelp8OyoozD7DB7UZw40Q4RW4N5tgq9Fhe9
MQs3p1v9N8xHdRKl365UcOczUxNAmv4u0nV5gY/4FMC6VjldCl2V9fmqYXyzFS4/
Ogg4FSd7c2JyGFKPs+5uXyi+RY2qOX4+nzHOoKD7SY616IYqtgKoz5usxETLwZ6s
VtJOmJL0h//z0A7tBliB0zd+SQ5UQQBDC2XouQH2fNX2isJMn0UDmWJGjaHgK6Hh
8jVp6LNqf+CEQS387UxckOyj7fu438hDky1Ggaw4YqowEOhQeqLVO4++x+HITrbp
AupXfbJw9h9cMN63Yc0gVxXQ9IMZ+M7UxLtZ3Cd8/PVztNy/clA=
=3UUm
-----END PGP SIGNATURE-----
Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 shadow stack support from Dave Hansen:
"This is the long awaited x86 shadow stack support, part of Intel's
Control-flow Enforcement Technology (CET).
CET consists of two related security features: shadow stacks and
indirect branch tracking. This series implements just the shadow stack
part of this feature, and just for userspace.
The main use case for shadow stack is providing protection against
return oriented programming attacks. It works by maintaining a
secondary (shadow) stack using a special memory type that has
protections against modification. When executing a CALL instruction,
the processor pushes the return address to both the normal stack and
to the special permission shadow stack. Upon RET, the processor pops
the shadow stack copy and compares it to the normal stack copy.
For more information, refer to the links below for the earlier
versions of this patch set"
Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/
* tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
x86/shstk: Change order of __user in type
x86/ibt: Convert IBT selftest to asm
x86/shstk: Don't retry vm_munmap() on -EINTR
x86/kbuild: Fix Documentation/ reference
x86/shstk: Move arch detail comment out of core mm
x86/shstk: Add ARCH_SHSTK_STATUS
x86/shstk: Add ARCH_SHSTK_UNLOCK
x86: Add PTRACE interface for shadow stack
selftests/x86: Add shadow stack test
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/shstk: Wire in shadow stack interface
x86: Expose thread features in /proc/$PID/status
x86/shstk: Support WRSS for userspace
x86/shstk: Introduce map_shadow_stack syscall
x86/shstk: Check that signal frame is shadow stack mem
x86/shstk: Check that SSP is aligned on sigreturn
x86/shstk: Handle signals for shadow stack
x86/shstk: Introduce routines modifying shstk
x86/shstk: Handle thread shadow stack
x86/shstk: Add user-mode shadow stack support
...
Fix this warning:
arch/x86/kernel/i8259.c:235: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* ELCR registers (0x4d0, 0x4d1) control edge/level of IRQ
CC arch/x86/kernel/irqinit.o
Signed-off-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230830131211.88226-1-vincenzopalazzodev@gmail.com
The Gather Data Sampling (GDS) vulnerability is common to all Skylake
processors. However, the "client" Skylakes* are now in this list:
https://www.intel.com/content/www/us/en/support/articles/000022396/processors.html
which means they are no longer included for new vulnerabilities here:
https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
or in other GDS documentation. Thus, they were not included in the
original GDS mitigation patches.
Mark SKYLAKE and SKYLAKE_L as vulnerable to GDS to match all the
other Skylake CPUs (which include Kaby Lake). Also group the CPUs
so that the ones that share the exact same vulnerabilities are next
to each other.
Last, move SRBDS to the end of each line. This makes it clear at a
glance that SKYLAKE_X is unique. Of the five Skylakes, it is the
only "server" CPU and has a different implementation from the
clients of the "special register" hardware, making it immune to SRBDS.
This makes the diff much harder to read, but the resulting table is
worth it.
I very much appreciate the report from Michael Zhivich about this
issue. Despite what level of support a hardware vendor is providing,
the kernel very much needs an accurate and up-to-date list of
vulnerable CPUs. More reports like this are very welcome.
* Client Skylakes are CPUID 406E3/506E3 which is family 6, models
0x4E and 0x5E, aka INTEL_FAM6_SKYLAKE and INTEL_FAM6_SKYLAKE_L.
Reported-by: Michael Zhivich <mzhivich@akamai.com>
Fixes: 8974eb5882 ("x86/speculation: Add Gather Data Sampling mitigation")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
- Misc cleanups
- Retry APIC optimized recalculation if a vCPU is added/enabled
- Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
"emergency disabling" behavior to KVM actually being loaded, and move all of
the logic within KVM
- Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
up related code
- Add a framework to allow "caching" feature flags so that KVM can check if
the guest can use a feature without needing to search guest CPUID
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTueMwSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5hp4P/i/UmIJEJupryUrD/ZXcSjqmupCtv4JS
Z2o1KIAPbM5GUX4iyF1cnZrI4Ac5zMtULN8Tp3ATOp3AqKy72AqB1Z82e+v6SKis
KfSXlDFCPFisrwv3Ys7JEu9vIS8oqITHmSBk8OAmElwujdQ5jYLZjwGbCXbM9qas
yCFGLqD4fjX8XqkZLmXggjT99MPSgiTPoKL592Wq4JR8mY4hyQqJzBepDjb94sT7
wrsAv1B+BchGDguk0+nOdmHM4emGrZU7fVqi3OFPofSlwAAdkqZObleb422KB058
5bcpNow+9VH5pzgq8XSAU7DLNgH9aXH0PcVU8ASU6P0D9fceKoOFuL47nnFbwz0t
vKafcXNWFs8xHE4iyzvAAsZK/X8GR0ngNByPnamATMsjt2tTmsa5BOyAPkIN+GpT
DzZCIk27SbdGC3lGYlSV+5ob/+sOr6m384DkvSZnU6JiiFLlZiTxURj1/9Zvfka8
2co2wnf8cJxnKFUThFfuxs9XpKgvhkOE8LauwCSo4MAQM95Pen+NAK960RBWj0xl
wof5kIGmKbwmMXyg2Sr+EKqe5KRPba22Yi3x24tURAXafKK/AW7T8dgEEXOll7dp
pKmTPAevwUk9wYIGultjhEBXKYgMOeD2BVoTa5je5h1Da28onrSJ7aLQUixHHs0J
gLdtzs8M9K9t
=yGM1
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD
KVM x86 changes for 6.6:
- Misc cleanups
- Retry APIC optimized recalculation if a vCPU is added/enabled
- Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
"emergency disabling" behavior to KVM actually being loaded, and move all of
the logic within KVM
- Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
up related code
- Add a framework to allow "caching" feature flags so that KVM can check if
the guest can use a feature without needing to search guest CPUID
coalescing lots of silly duplicates.
* Use static_calls() instead of indirect calls for apic->foo()
* Tons of cleanups an crap removal along the way
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTvfO8ACgkQaDWVMHDJ
krAP2A//ccii/LuvtTnNEIMMR5w2rwTdHv91ancgFkC8pOeNk37Z8sSLq8tKuLFA
vgjBIysVIqunuRcNCJ+eqwIIxYfU+UGCWHppzLwO+DY3Q7o9EoTL0BgytdAqxpQQ
ntEVarqWq25QYXKFoAqbUTJ1UXa42/8HfiXAX/jvP+ACXfilkGPZre6ASxlXeOhm
XbgPuNQPmXi2WYQH9GCQEsz2Nh80hKap8upK2WbQzzJ3lXsm+xA//4klab0HCYwl
Uc302uVZozyXRMKbAlwmgasTFOLiV8KKriJ0oHoktBpWgkpdR9uv/RDeSaFR3DAl
aFmecD4k/Hqezg4yVl+4YpEn2KjxiwARCm4PMW5AV7lpWBPBHAOOai65yJlAi9U6
bP8pM0+aIx9xg7oWfsTnQ7RkIJ+GZ0w+KZ9LXFM59iu3eV1pAJE3UVyUehe/J1q9
n8OcH0UeHRlAb8HckqVm1AC7IPvfHw4OAPtUq7z3NFDwbq6i651Tu7f+i2bj31cX
77Ames+fx6WjxUjyFbJwaK44E7Qez3waztdBfn91qw+m0b+gnKE3ieDNpJTqmm5b
mKulV7KJwwS6cdqY3+Kr+pIlN+uuGAv7wGzVLcaEAXucDsVn/YAMJHY2+v97xv+n
J9N+yeaYtmSXVlDsJ6dndMrTQMmcasK1CVXKxs+VYq5Lgf+A68w=
=eoKm
-----END PGP SIGNATURE-----
Merge tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Dave Hansen:
"This includes a very thorough rework of the 'struct apic' handlers.
Quite a variety of them popped up over the years, especially in the
32-bit days when odd apics were much more in vogue.
The end result speaks for itself, which is a removal of a ton of code
and static calls to replace indirect calls.
If there's any breakage here, it's likely to be around the 32-bit
museum pieces that get light to no testing these days.
Summary:
- Rework apic callbacks, getting rid of unnecessary ones and
coalescing lots of silly duplicates.
- Use static_calls() instead of indirect calls for apic->foo()
- Tons of cleanups an crap removal along the way"
* tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
x86/apic: Turn on static calls
x86/apic: Provide static call infrastructure for APIC callbacks
x86/apic: Wrap IPI calls into helper functions
x86/apic: Mark all hotpath APIC callback wrappers __always_inline
x86/xen/apic: Mark apic __ro_after_init
x86/apic: Convert other overrides to apic_update_callback()
x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb()
x86/apic: Provide apic_update_callback()
x86/xen/apic: Use standard apic driver mechanism for Xen PV
x86/apic: Provide common init infrastructure
x86/apic: Wrap apic->native_eoi() into a helper
x86/apic: Nuke ack_APIC_irq()
x86/apic: Remove pointless arguments from [native_]eoi_write()
x86/apic/noop: Tidy up the code
x86/apic: Remove pointless NULL initializations
x86/apic: Sanitize APIC ID range validation
x86/apic: Prepare x2APIC for using apic::max_apic_id
x86/apic: Simplify X2APIC ID validation
x86/apic: Add max_apic_id member
x86/apic: Wrap APIC ID validation into an inline
...
- Prevent kprobes on compiler generated CFI checking code.
The compiler generates a instruction sequence for indirect call
checks. If this sequence is modified with a kprobe, then the check
fails. So the instructions must be protected against probing.
- A few minor cleanups for the SMP code
Thanks,
tglx
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTvO50THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoSS/D/4xIz99FKNzowrhR9LYQpvjBLmrFY2L
Hz/l+iHarNU8VE9eeTpYt0F3Ho7cUaMbC4juqxNthfjPOtkGotO6FLLOxWfkVkzz
k54N2o4cUX7Hwi/HEpKwjzQ1ggLwxEUafeczvFg5aU3PlKcpzscT5todxxRrw+vz
WA9JIuMlwVRwl+Jodzp+LsUGWaF1gT1aVBvFZGXFJ12GB58NMn+Nm9g77BmsYN2H
jUQjsKLq32S1aU8Bq+YNIr9XWagTvDicJ617cwyzy1HivKadeBF2bj5+uJioxLFZ
8YN9PyXtICkYJ/TrOsVJ5XxuNWNpyZ2x8OwCVdYboqeooylOJTHx1pUBufNQIpAr
qx2npX6NOK29QFVXiTJoP3K+4sv/G4EL6FO3Sf2J92/4HPiTlhINGCjm/UAqbJcJ
5Y3QfnIqB6ohCm7/e648ht5U+vdiNMzMAFj/jgNyqV8KBUQoXG58ygNP42V+Qq8+
Is5oDv4M1BgynUG8FaIQV3LXFHFLA6dielgdJwTy7TZry0O9rU5qieUDskD6GBVv
6ayiEqYKyXyJZw6YEzY/kDUxnpIRnqq0Lzbdn7TjH5cetnlfImc477jv8m3Um8ZP
FMVe20MN79yuKPVlvkt2tCPEYttDXufQDHtDJp2WCp1/TizmGEW51sdsacw+YXrS
M1uUljSmEeSM6Q==
=nODz
-----END PGP SIGNATURE-----
Merge tag 'x86-core-2023-08-30-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Thomas Gleixner:
- Prevent kprobes on compiler generated CFI checking code.
The compiler generates an instruction sequence for indirect call
checks. If this sequence is modified with a kprobe, then the check
fails. So the instructions must be protected against probing.
- A few minor cleanups for the SMP code
* tag 'x86-core-2023-08-30-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kprobes: Prohibit probing on compiler generated CFI checking code
x86/smpboot: Change smp_store_boot_cpu_info() to static
x86/smp: Remove a non-existent function declaration
x86/smpboot: Remove a stray comment about CPU hotplug
- allow dynamic sizing of the swiotlb buffer, to cater for secure
virtualization workloads that require all I/O to be bounce buffered
(Petr Tesarik)
- move a declaration to a header (Arnd Bergmann)
- check for memory region overlap in dma-contiguous (Binglei Wang)
- remove the somewhat dangerous runtime swiotlb-xen enablement and
unexport is_swiotlb_active (Christoph Hellwig, Juergen Gross)
- per-node CMA improvements (Yajun Deng)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmTuDHkLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYOqvhAApMk2/ceTgVH17sXaKE822+xKvgv377O6TlggMeGG
W4zA0KD69DNz0AfaaCc5U5f7n8Ld/YY1RsvkHW4b3jgw+KRTeQr0jjitBgP5kP2M
A1+qxdyJpCTwiPt9s2+JFVPeyZ0s52V6OJODKRG3s0ore55R+U09VySKtASON+q3
GMKfWqQteKC+thg7NkrQ7JUixuo84oICws+rZn4K9ifsX2O0HYW6aMW0feRfZjJH
r0TgqZc4RdPTSaF22oapR9Ls39+7hp/pBvoLm5sBNA3cl5C3X4VWo9ERMU1jW9h+
VYQv39NycUspgskWJmpbU06/+ooYqQlwHSR/vdNusmFIvxo4tf6/UX72YO5F8Dar
ap0wYGauiEwTjSnhVxPTXk3obWyWEsgFAeRnPdTlH2CNmv38QZU2HLb8eU1pcXxX
j+WI2Ewy9z22uBVYiPOKpdW1jkSfmlmfPp/8SbAdua7I3YQ90rQN6AvU06zAi/cL
NQTgO81E4jPkygqAVgS/LeYziWAQ73yM7m9ExThtTgqFtHortwhJ4Fd8XKtvtvEb
viXAZ/WZtQBv/CIKAW98NhgIDP/SPOT8ym6V35WK+kkNFMS6LMSQUfl9GgbHGyFa
n9icMm7BmbDtT1+AKNafG9En4DtAf9M9QNidAVOyfrsIk6S0gZoZwvIStkA7on8a
cNY=
=kVVr
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-6.6-2023-08-29' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-maping updates from Christoph Hellwig:
- allow dynamic sizing of the swiotlb buffer, to cater for secure
virtualization workloads that require all I/O to be bounce buffered
(Petr Tesarik)
- move a declaration to a header (Arnd Bergmann)
- check for memory region overlap in dma-contiguous (Binglei Wang)
- remove the somewhat dangerous runtime swiotlb-xen enablement and
unexport is_swiotlb_active (Christoph Hellwig, Juergen Gross)
- per-node CMA improvements (Yajun Deng)
* tag 'dma-mapping-6.6-2023-08-29' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: optimize get_max_slots()
swiotlb: move slot allocation explanation comment where it belongs
swiotlb: search the software IO TLB only if the device makes use of it
swiotlb: allocate a new memory pool when existing pools are full
swiotlb: determine potential physical address limit
swiotlb: if swiotlb is full, fall back to a transient memory pool
swiotlb: add a flag whether SWIOTLB is allowed to grow
swiotlb: separate memory pool data from other allocator data
swiotlb: add documentation and rename swiotlb_do_find_slots()
swiotlb: make io_tlb_default_mem local to swiotlb.c
swiotlb: bail out of swiotlb_init_late() if swiotlb is already allocated
dma-contiguous: check for memory region overlap
dma-contiguous: support numa CMA for specified node
dma-contiguous: support per-numa CMA for all architectures
dma-mapping: move arch_dma_set_mask() declaration to header
swiotlb: unexport is_swiotlb_active
x86: always initialize xen-swiotlb when xen-pcifront is enabling
xen/pci: add flag for PCI passthrough being possible
("refactor Kconfig to consolidate KEXEC and CRASH options").
- kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
couple of macros to args.h").
- gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
commands").
- vsprintf inclusion rationalization from Andy Shevchenko
("lib/vsprintf: Rework header inclusions").
- Switch the handling of kdump from a udev scheme to in-kernel handling,
by Eric DeVolder ("crash: Kernel handling of CPU and memory hot
un/plug").
- Many singleton patches to various parts of the tree
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO2GpAAKCRDdBJ7gKXxA
juW3AQD1moHzlSN6x9I3tjm5TWWNYFoFL8af7wXDJspp/DWH/AD/TO0XlWWhhbYy
QHy7lL0Syha38kKLMXTM+bN6YQHi9AU=
=WJQa
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- An extensive rework of kexec and crash Kconfig from Eric DeVolder
("refactor Kconfig to consolidate KEXEC and CRASH options")
- kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
couple of macros to args.h")
- gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
commands")
- vsprintf inclusion rationalization from Andy Shevchenko
("lib/vsprintf: Rework header inclusions")
- Switch the handling of kdump from a udev scheme to in-kernel
handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory
hot un/plug")
- Many singleton patches to various parts of the tree
* tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits)
document while_each_thread(), change first_tid() to use for_each_thread()
drivers/char/mem.c: shrink character device's devlist[] array
x86/crash: optimize CPU changes
crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
crash: hotplug support for kexec_load()
x86/crash: add x86 crash hotplug support
crash: memory and CPU hotplug sysfs attributes
kexec: exclude elfcorehdr from the segment digest
crash: add generic infrastructure for crash hotplug support
crash: move a few code bits to setup support of crash hotplug
kstrtox: consistently use _tolower()
kill do_each_thread()
nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
scripts/bloat-o-meter: count weak symbol sizes
treewide: drop CONFIG_EMBEDDED
lockdep: fix static memory detection even more
lib/vsprintf: declare no_hash_pointers in sprintf.h
lib/vsprintf: split out sprintf() and friends
kernel/fork: stop playing lockless games for exe_file replacement
adfs: delete unused "union adfs_dirtail" definition
...
The following commit deserves special mention:
22dc02f81c Revert "sched/fair: Move unused stub functions to header"
This is in x86/cleanups, because the revert is a re-application of a
number of cleanups that got removed inadvertedly.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDkoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jCMw//UvQGM8yxsTa57r0/ZpJHS2++P5pJxOsz
45kBb3aBiDV6idArce4EHpthp3MvF3Pycibp9w0qg//NOtIHTKeagXv52abxsu1W
hmS6gXJZDXZvjO1BFaUlmv97iYtzGfKnQppj32C4tMr9SaP49h3KvOHH1Z8CR3mP
1nZaJJwYIi2qBh7msnmLGG+F0drb85O/dfHdoLX6iVJw9UP4n5nu9u8u1E0iC7J7
2GC6AwP60A0EBRTK9EHQQEYwy9uvdS/TG5f2Qk1VP87KA9TTocs8MyapMG4DQu79
hZKVEGuVQAlV3rYe9cJBNpDx1mTu3rmuMH0G71KEe3T6UcG5QRUiAPm8UfA9prPD
uWjY4zm5o0W3tUio4V1MqqiLFIaBU63WmTY9RyM0QH8Ms8r8GugWKmnrTIuHfEC3
9D+Uhyb5d8ID6qFGLTOvPm0g+v64lnH71qq83PcVJgsmZvUb2XvFA3d/A0h9JzLT
2In/yfU10UsLUFTiNRyAgcLccjaGhliDB2oke9Kp0OyOTSQRcWmiq8kByVxCPImP
auOWWcNXjcuOgjlnziEkMTDuRY12MgUB2If4zhELvdEFibIaaNW5sNCbY2msWaN1
CUD7fcj0L3HZvzujUm72l5hxL2brJMuPwVNJfuOe4T8wzy569d6VJULrd1URBM1B
vfaPs1Dz46Q=
=kiAA
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 cleanups from Ingo Molnar:
"The following commit deserves special mention:
22dc02f81c Revert "sched/fair: Move unused stub functions to header"
This is in x86/cleanups, because the revert is a re-application of a
number of cleanups that got removed inadvertedly"
[ This also effectively undoes the amd_check_microcode() microcode
declaration change I had done in my microcode loader merge in commit
42a7f6e3ff ("Merge tag 'x86_microcode_for_v6.6_rc1' [...]").
I picked the declaration change by Arnd from this branch instead,
which put it in <asm/processor.h> instead of <asm/microcode.h> like I
had done in my merge resolution - Linus ]
* tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Refactor code using deprecated strncpy() interface to use strscpy()
x86/hpet: Refactor code using deprecated strncpy() interface to use strscpy()
x86/platform/uv: Refactor code using deprecated strcpy()/strncpy() interfaces to use strscpy()
x86/qspinlock-paravirt: Fix missing-prototype warning
x86/paravirt: Silence unused native_pv_lock_init() function warning
x86/alternative: Add a __alt_reloc_selftest() prototype
x86/purgatory: Include header for warn() declaration
x86/asm: Avoid unneeded __div64_32 function definition
Revert "sched/fair: Move unused stub functions to header"
x86/apic: Hide unused safe_smp_processor_id() on 32-bit UP
x86/cpu: Fix amd_check_microcode() declaration
- The biggest change is introduction of a new iteration of the
SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
Deadline First") scheduler.
EEVDF too is a virtual-time scheduler, with two parameters (weight
and relative deadline), compared to CFS that had weight only.
It completely reworks the base scheduler: placement, preemption,
picking -- everything.
LWN.net, as usual, has a terrific writeup about EEVDF:
https://lwn.net/Articles/925371/
Preemption (both tick and wakeup) is driven by testing against
a fresh pick. Because the tree is now effectively an interval
tree, and the selection is no longer the 'leftmost' task,
over-scheduling is less of a problem. A lot of the CFS
heuristics are removed or replaced by more natural latency-space
parameters & constructs.
In terms of expected performance regressions: we'll and can fix
everything where a 'good' workload misbehaves with the new scheduler,
but EEVDF inevitably changes workload scheduling in a binary fashion,
hopefully for the better in the overwhelming majority of cases,
but in some cases it won't, especially in adversarial loads that
got lucky with the previous code, such as some variants of hackbench.
We are trying hard to err on the side of fixing all performance
regressions, but we expect some inevitable post-release iterations
of that process.
- Improve load-balancing on hybrid x86 systems: enable cluster
scheduling (again).
- Improve & fix bandwidth-scheduling on nohz systems.
- Improve bandwidth-throttling.
- Use lock guards to simplify and de-goto-ify control flow.
- Misc improvements, cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDOgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iS4g//b9yewVW9OPxetKoN8zIJA0TjFYuuOVHK
BlCJi5dbzXeCTrtENI65BRA7kPbTQ3AjwLRQ2BallAZ4dJceK0RhlZJvcrMNsm4e
Adcpoch/FbqPKCrtAJQY04Ln1B244n/KyVifYett9220dMgTFQGJJYxrTc2G2+Kp
F44vdUHzRczIE+KeOgBild1CwfKv5Zn5xgaXgtuoPLZtWBE0C1fSSzbK/PTINcUx
bS4NVxK0CpOqSiNjnugV8KsYb71/0U6IgShBVjfHsrlBYigOH2NbVTH5xyjF8f83
WxiGstlhxj+N6Kv4L6FOJIAr2BIggH82j3FaPACmv4c8pzEoBBbvlAJkfinLEgbn
Povg3OF2t6uZ8NoHjeu3WxOjBsphbpkFz7H5nno1ibXSIR/JyUH5MdBPSx93QITB
QoUKQpr/L8zWauWDOEzSaJjEsZbl8rkcIVq5Bk0bR3qn2xkZsIeVte+vCEu3+tBc
b4JOZjq7AuPDqPnsBLvuyiFZ7zwsAfm+pOD5UF3/zbLjPn1N/7wTNQZ29zjc04jl
SifpCZGgF1KlG8m8wNTlSfVvq0ksppCzJt+C6VFuejZ191IGpirQHn4Vp0sluMhC
WRzXhb7v37Bq5JY10GMfeKb/jAiRs68kozhzqVPsBSAPS6I6jJssONgedq+LbQdC
tFsmE9n09do=
=XtCD
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- The biggest change is introduction of a new iteration of the
SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
Deadline First") scheduler
EEVDF too is a virtual-time scheduler, with two parameters (weight
and relative deadline), compared to CFS that had weight only. It
completely reworks the base scheduler: placement, preemption, picking
-- everything
LWN.net, as usual, has a terrific writeup about EEVDF:
https://lwn.net/Articles/925371/
Preemption (both tick and wakeup) is driven by testing against a
fresh pick. Because the tree is now effectively an interval tree, and
the selection is no longer the 'leftmost' task, over-scheduling is
less of a problem. A lot of the CFS heuristics are removed or
replaced by more natural latency-space parameters & constructs
In terms of expected performance regressions: we will and can fix
everything where a 'good' workload misbehaves with the new scheduler,
but EEVDF inevitably changes workload scheduling in a binary fashion,
hopefully for the better in the overwhelming majority of cases, but
in some cases it won't, especially in adversarial loads that got
lucky with the previous code, such as some variants of hackbench. We
are trying hard to err on the side of fixing all performance
regressions, but we expect some inevitable post-release iterations of
that process
- Improve load-balancing on hybrid x86 systems: enable cluster
scheduling (again)
- Improve & fix bandwidth-scheduling on nohz systems
- Improve bandwidth-throttling
- Use lock guards to simplify and de-goto-ify control flow
- Misc improvements, cleanups and fixes
* tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
sched/eevdf/doc: Modify the documented knob to base_slice_ns as well
sched/eevdf: Curb wakeup-preemption
sched: Simplify sched_core_cpu_{starting,deactivate}()
sched: Simplify try_steal_cookie()
sched: Simplify sched_tick_remote()
sched: Simplify sched_exec()
sched: Simplify ttwu()
sched: Simplify wake_up_if_idle()
sched: Simplify: migrate_swap_stop()
sched: Simplify sysctl_sched_uclamp_handler()
sched: Simplify get_nohz_timer_target()
sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
sched/rt: Fix sysctl_sched_rr_timeslice intial value
sched/fair: Block nohz tick_stop when cfs bandwidth in use
sched, cgroup: Restore meaning to hierarchical_quota
MAINTAINERS: Add Peter explicitly to the psi section
sched/psi: Select KERNFS as needed
sched/topology: Align group flags when removing degenerate domain
sched/fair: remove util_est boosting
sched/fair: Propagate enqueue flags into place_entity()
...
working on. This part makes the loader core code as it is practically
enabled on pretty much every baremetal machine so there's no need to
have the Kconfig items. In addition, there are cleanups which prepare
for future feature enablement.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTskfcACgkQEsHwGGHe
VUo/hBAAiqVqdc0WASHgVYoO9mD1M2T3oHFz8ceX2pGKf/raZ5UJyit7ybWEZEWG
rWXFORRlqOKoKImQm6JhyJsu29Xmi9sTb1WNwEyT8YMdhx8v57hOch3alX7sm2BF
9eOl/77hxt7Pt8keN6gY5w5cydEgBvi8bVe8sfU3hJMwieAMH0q5syRx7fDizcVd
qoTicHRjfj5Q8BL5NXtdPEMENYPyV89DVjnUM1HVPpCkoHxmOujewgjs4gY7PsGp
qAGB1+IG3aqOWHM9SDIJp5U9tNX2huhqRTcZsNEe8qHTXCv8F8zRzK0J8giM1wed
5aAGC4AfMh/gjryXeMj1nnwoAf5TQw4dK+y+BYXIykdQDV1up+HdDtjrBmJ5Kslf
n/8uGLdLLqMQVEE2hT3r1Ft1RqVf3UwWOxzc+KASjKCj0F5djt+F2JDNGoN0sMD9
JDj3Dtvo2e1+aZlcvXWSmMCVMT0By1mbFqirEXT3i1sYwHDx23s+KwY71CdT8gx8
nbEWsaADPRynNbTAI5Z5TFq0Cohs9AUNuotRHYCc0Et5NBlzoN8yAKNW39twUDEt
a/Knq1Vnybrp18pE/rDphm+p/K261OuEXfFFVTASSlvgMnVM0UAZZZka7A0DmN+g
mvZ2A9hByFk6sELm3QeNrOdex8djeichyY7+EQ13K25wMd/YsX0=
=QXDh
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading updates from Borislav Petkov:
"The first, cleanup part of the microcode loader reorg tglx has been
working on. The other part wasn't fully ready in time so it will
follow on later.
This part makes the loader core code as it is practically enabled on
pretty much every baremetal machine so there's no need to have the
Kconfig items.
In addition, there are cleanups which prepare for future feature
enablement"
* tag 'x86_microcode_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Remove remaining references to CONFIG_MICROCODE_AMD
x86/microcode/intel: Remove pointless mutex
x86/microcode/intel: Remove debug code
x86/microcode: Move core specific defines to local header
x86/microcode/intel: Rename get_datasize() since its used externally
x86/microcode: Make reload_early_microcode() static
x86/microcode: Include vendor headers into microcode.h
x86/microcode/intel: Move microcode functions out of cpu/intel.c
x86/microcode: Hide the config knob
x86/mm: Remove unused microcode.h include
x86/microcode: Remove microcode_mutex
x86/microcode/AMD: Rip out static buffers
range whose SEV encryption status needs to change, is not page aligned
so that callers which round up the number of pages to be decrypted,
would mark a trailing page as decrypted and thus cause corruption
during live migration.
- Return an error from the #VC handler on AMD SEV-* guests when the debug
registers swapping is enabled as a DR7 access should not happen then
- that register is guest/host switched.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTsje4ACgkQEsHwGGHe
VUrKWxAAqiTpQSjJCB32ReioSsLv3kl7vtLO3xtE42VpF0F7pAAPzRsh+bgDjGSM
uqcEgbX1YtPlb8wK6yh5dyNLLvtzxhaAQkUfbfuEN2oqbvIEcJmhWAm/xw1yCsh2
GDphFPtvqgT4KUCkEHj8tC9eQzG+L0bwymPzqXooVDnm4rL0ulEl6ONffhHfJFVg
bmL8UjmJNodFcO6YBfosQIDDfc4ayuwm9f/rGltNFl+jwCi62kMJaVdU1112agsV
LE73DRoRpfHKLslj9o9ubRcvaHKS24y2Amflnj1tas0h8I2uXBRwIgxjQXl5vtXV
pu5/5VHM9X13x8XKpKVkEohXkBzFRigs8yfHq+JlpyWXXB/ymW8Acbqqnvll12r4
JSy+XfBNa6V5Y/NDS/1faJiX6XSi5ZyZHZG70sf52XVoBYhzoms5kxqTJnHHisnY
X50677/tQF3V9WsmKD0aj0Um2ztiq0/TNMI7FT3lzYRDNJb1ln3ZK9f04i8L5jA4
bsrSV5oCVpLkW4eQaAJwxttTB+dRb5MwwkeS7D/eTuJ1pgUmJMIbZp2YbJH7NP2F
6FShQdwHi8KYN7mxUM+WwOk7goaBm5L61w5UtRlt6aDE7LdEbMAeSSdmD3HlEZHR
ntBqcEx4SkAT+Ru0izVXjsoWmtkn8+DY44oUC2X6eZxUSAT4Cm4=
=td9F
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Handle the case where the beginning virtual address of the address
range whose SEV encryption status needs to change, is not page
aligned so that callers which round up the number of pages to be
decrypted, would mark a trailing page as decrypted and thus cause
corruption during live migration.
- Return an error from the #VC handler on AMD SEV-* guests when the
debug registers swapping is enabled as a DR7 access should not happen
then - that register is guest/host switched.
* tag 'x86_sev_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Make enc_dec_hypercall() accept a size instead of npages
x86/sev: Do not handle #VC for DR7 read/write
consumption MCEs are not delivered synchronously but still within the
same context, which can lead to erroneously increased error severity
and unneeded kernel panics
- Do not log errors caught by polling shared MCA banks as they
materialize as duplicated error records otherwise
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTsNcoACgkQEsHwGGHe
VUq4MBAApcFUWxsA8co+v235W05mqAvsb8pt4RLFCYBQ6INCKFPMwP3ShEjSgnvZ
owQdPC2j7qzUiMFeDmBn1GXoENgv/Azc1A/0AKa5tKctHrS1Z/ZvDBncydY1HzTu
o4JYRN+KQWoaf0Nz8iYgBNjblPxw057Lg4fmOyCu1F6mmWypdBjC43fGoDTdTIZd
4uhxVzS09ns1GhBpDoJbj6SXSpbOtvnMguGVrXvuhw+NBfaYpJ6Fb5gH2TrT8rWE
jd5uSOSxYVPIjt5XjLfhu2eQheecJiYIxTWbNlVRUZHgmvVgtRon5WwMVSrFxJL2
vLawaKvnHjgsOIewW0d9hEe6PVgvcUEMwQNmI86vDzCi+RGM8pbRZYqCInYyDtBd
e6W1ZsfqVBWO9LKr7T9LEMM7HlGSe8aPkaeTfmCv18+hgvEkkjXY19dcLYe+ExmW
2JvsxF08wqXPAIBDy7cN4DHWdRTd3g91Qd10Ex6bUMovifP9Jt3KXWAuX7qWPjY2
YvLASs/04z5sGNk3XB+f2EOPMJRHjHneNppQLuSBIzhOFXOHDA70aObNGfXw8oGK
fGhPTEXFJWhTH7fL7FZCwGEEARXkuOWBpIX1HNYst2zFwKNTNqzaxkxAMYwdv6j5
K30hNMrCQj912t82NWOoerPt0uRLdXDKKTJV0VJNfcP8oaA3nec=
=RMCN
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Add a quirk for AMD Zen machines where Instruction Fetch unit poison
consumption MCEs are not delivered synchronously but still within the
same context, which can lead to erroneously increased error severity
and unneeded kernel panics
- Do not log errors caught by polling shared MCA banks as they
materialize as duplicated error records otherwise
* tag 'ras_core_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/MCE: Always save CS register on AMD Zen IF Poison errors
x86/mce: Prevent duplicate error records
the respective drivers
- Update HPE Superdome Flex maintainers list
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTsMxUACgkQEsHwGGHe
VUo8aw/+PyoGF2/96Nb5A3DT2yXrYzT75UITFFrqi8YtRNdlHLV/uFWZjBhCGm8V
q5a/Pu/K9qjTDkWZG9UkDlg1VfCMqAsRSWlgSq7qShC07oyOqdzfZLJmAKlGYt+g
BQXk7RAtUpKzK79D3r/uq6RkFAvLXflIuGClXg42Bw4SKjuAiwbW5bwU4xsMzEdw
Rk0Egn3Nic/reqSlWIaRQAVy5/4Jcoi9HvcJEKJatT+sqJtbR2krpr/rMkL6G9YR
2nOONkRYt7T2+Db5Tnld4FurgQ77F6HeJvAPgUdZVoqUWGCtB4piryd8zuXrj216
Hu4Z7DHjbdMPy1Gl9e9GHmHu62Cx7I2RlcZ5hsLZIjOjRQb+Yo9NgUGiSQyp+Uny
4Rh3abpuZZjfsHUf6M5Mv+oKqzGPtJXP95E1IScuHh8/mTT254p3ozYLSusTS/tM
M65eTSuq6FEbSmO3+dtC05V2b+TxAEYFr6yUCvymUqpaNO3XXhxi55asc0JznEwc
rV18i03U/A8+9NeUCy2j92F4GHP2QEGscAmdS9CHlDCzsG5RjLx8jJHQIKy6+3W7
cOUqCprxfUBNuxFQT/Mb1B1/CdXmpTaiKRzfqjoCS4sqApMYiWnNa4s7ioOn8ewY
9r+P4SlhC3qvJQhw3JLWUvy69DnJzCDpScnJ1lw3/zHWDkmR8ZQ=
=e0EQ
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
- Add PCI device IDs for a new AMD family 0x1a CPUs and use them in the
respective drivers
- Update HPE Superdome Flex maintainers list
* tag 'x86_misc_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/uv: Update HPE Superdome Flex Maintainers
EDAC/amd64: Add support for AMD family 1Ah models 00h-1Fh and 40h-4Fh
hwmon: (k10temp) Add thermal support for AMD Family 1Ah-based models
x86/amd_nb: Add PCI IDs for AMD Family 1Ah-based models
This is mandated by the current tightening of EFI executables
requirements when used in a secure boot scenario. More specifically,
an EFI executable cannot have a single section with RWX permissions,
which conflicts with the in-place kernel decompression that is done
today. Instead, the things required by the booting kernel image are
done in the EFI stub now. Work by Ard Biesheuvel.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTsMGAACgkQEsHwGGHe
VUqqGg/+LJq7dapnyMiqafO9Xdpm+h7wTylChdfev7UB3wri9IJGXXFkwZZUNF8Z
p2c/ECk75CG+z7ZSv+WirZLWlmlBT1D51YjI1lHmg3Vmh3YXZWJHkwaG1DI4euL7
C4gvd/gmpXGMOxjZLV/D5d7wwoO2REyVkDeCy0lqHCisvd4GRXqUJHe6hJ3GxVv/
VqbT7ZHMP4BlbWK4/yTq9+wKSC9T0OfAOW2b+K/SAh19TpRqFBpkMOxaZx0nj2Eh
UxQUd2hntGtEGpPf71HPQMYCkWyvJD1LpTYfwld7tNvzXSt2xg2YZAf7WzJ0Od3b
kxg2mItHhVRCcwEc9OWlphvgAYCzYECzeMniesLFeJYLMvHNtORzxNPkDLrOYjT6
0K05DEWJlEqZhIWtaKJ/SSCj7JnfmgubfK3tG3cerCelQZIvUix6ysvDBwPnqdUr
3O1d7iMlM1UtqMo0GLeIp0q1aRKMsNfRtUvqjaiQ2NjSuyJqZl8t0jfKd6DHcnuy
WidBMhREULa5qYGE9dxosdTJK1wOhqomTmBSC8EeDUPevLV+DNQUn+oBOVhFvJVk
gpmsr2mi9+hwC9iNh1K4wslLl9Jc/BlIY0wwC0ysh6eTWQfrvpfXrtHRpmTvwcVW
g2lt+shYiPDa7q3bvqQL7iqFU6bkMGCs89EFGJEpNbBO076uNZY=
=2Oej
-----END PGP SIGNATURE-----
Merge tag 'x86_boot_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Borislav Petkov:
"Avoid the baremetal decompressor code when booting on an EFI machine.
This is mandated by the current tightening of EFI executables
requirements when used in a secure boot scenario. More specifically,
an EFI executable cannot have a single section with RWX permissions,
which conflicts with the in-place kernel decompression that is done
today.
Instead, the things required by the booting kernel image are done in
the EFI stub now.
Work by Ard Biesheuvel"
* tag 'x86_boot_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/efistub: Avoid legacy decompressor when doing EFI boot
x86/efistub: Perform SNP feature test while running in the firmware
efi/libstub: Add limit argument to efi_random_alloc()
x86/decompressor: Factor out kernel decompression and relocation
x86/decompressor: Move global symbol references to C code
decompress: Use 8 byte alignment
x86/efistub: Prefer EFI memory attributes protocol over DXE services
x86/efistub: Perform 4/5 level paging switch from the stub
x86/decompressor: Merge trampoline cleanup with switching code
x86/decompressor: Pass pgtable address to trampoline directly
x86/decompressor: Only call the trampoline when changing paging levels
x86/decompressor: Call trampoline directly from C code
x86/decompressor: Avoid the need for a stack in the 32-bit trampoline
x86/decompressor: Use standard calling convention for trampoline
x86/decompressor: Call trampoline as a normal function
x86/decompressor: Assign paging related global variables earlier
x86/decompressor: Store boot_params pointer in callee save register
x86/efistub: Clear BSS in EFI handover protocol entrypoint
x86/decompressor: Avoid magic offsets for EFI handover entrypoint
x86/efistub: Simplify and clean up handover entry code
...
- Support partial SMT enablement.
So far the sysfs SMT control only allows to toggle between SMT on and
off. That's sufficient for x86 which usually has at max two threads
except for the Xeon PHI platform which has four threads per core.
Though PowerPC has up to 16 threads per core and so far it's only
possible to control the number of enabled threads per core via a
command line option. There is some way to control this at runtime, but
that lacks enforcement and the usability is awkward.
This update expands the sysfs interface and the core infrastructure to
accept numerical values so PowerPC can build SMT runtime control for
partial SMT enablement on top.
The core support has also been provided to the PowerPC maintainers who
added the PowerPC related changes on top.
- Minor cleanups and documentation updates.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTsj4wTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaszEADKMd/6m7/Bq7RU2OJ+IXw8yfMEF9nS
6HPrFu71a4cDufb/G8UckQOvkwdTFWD7bZ0snJe2sBDFTOtzK/inYkgPZTxlm7si
JcJmFnHKUM7OTwNZb7Tv1bd9Csz4JhggAYUw6P8CqsCmhQ+p6ECemx3bHDlYiywm
5eW2yzI9EM4dbsHPwUOvjI0WazGvAf0esSDAS8JTnhBXbd8FAckbMV+xuRPcCUK+
dBqbqr+3Nf4/wcXTro/gZIc7sEATAHH6m7zHlLVBSyVPnBxre8NLz6KciW4SezyJ
GWFnDV03mmG2KxQ2ugwI8n6M3zDJQtfEJFwW/x4t2M5RK+ka2a6G6GtCLHYOXLWR
akIuBXtTAC57BgpqzBihGej9eiC1BJ1QMa9ZK+6WDXSZtMTFOLlbwdY2/qyfxpfw
LfepWb+UMtFy5YyW84S1O5/AqpOtKD2kPTqfDjvDxWIAigispU+qwAKxcMzMjtwz
aAlf2Z/iX0R9DkRzGD2gaFG5AUsRich8RtVO7u+WDwYSsi8ywrvryiPlZrDDBkSQ
sRzdoHeXNGVY/FgkbZmEyBj4udrypymkR6ivqn6C2OrysgznSiv5NC983uS6TfJX
cVqdUv6CNYYNiNu0x0Qf0MluYT2s5c1Fa4bjCBJL+KwORwjM3+TCN9RA1KtFrW2T
G3Ta1KqI6wRonA==
=JQRJ
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
"Updates for the CPU hotplug core:
- Support partial SMT enablement.
So far the sysfs SMT control only allows to toggle between SMT on
and off. That's sufficient for x86 which usually has at max two
threads except for the Xeon PHI platform which has four threads per
core
Though PowerPC has up to 16 threads per core and so far it's only
possible to control the number of enabled threads per core via a
command line option. There is some way to control this at runtime,
but that lacks enforcement and the usability is awkward
This update expands the sysfs interface and the core infrastructure
to accept numerical values so PowerPC can build SMT runtime control
for partial SMT enablement on top
The core support has also been provided to the PowerPC maintainers
who added the PowerPC related changes on top
- Minor cleanups and documentation updates"
* tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation: core-api/cpuhotplug: Fix state names
cpu/hotplug: Remove unused function declaration cpu_set_state_online()
cpu/SMT: Fix cpu_smt_possible() comment
cpu/SMT: Allow enabling partial SMT states via sysfs
cpu/SMT: Create topology_smt_thread_allowed()
cpu/SMT: Remove topology_smt_supported()
cpu/SMT: Store the current/max number of threads
cpu/SMT: Move smt/control simple exit cases earlier
cpu/SMT: Move SMT prototypes into cpu_smt.h
cpu/hotplug: Remove dependancy against cpu_primary_thread_mask
This pull reqeust contains the following:
o Handle negative skews in "skew is too large" messages.
o Extend watchdog check exemption to 4-Socket platforms
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmTcF94THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jPeHEACXvYFMwno+1DPlt2PtFJglnkuYyOUQ
SwRRiOLWXdzsikzjWIvdrSOqslbdIjhZmSSyPDjyo4GPYFSSOyKNsqczwsX8R49u
O/2Yzkrx2OmIWcKiAkII8Iw4fFedoITtzC59wbZHoo/+upEpbZYP3u7AjJTird7y
du3WcWcGc1eEt5+7MNwbZfwpzo2t5Rb3Wqfgs6vnKTG7Abc/23uChsCBzPavX7X/
djNd1bA5YmEldKKxSoF5XSW/F1TWIA4fXMDkBwgRKHBx1Y7xU+nJMtam4ogAzN6a
4zgthgy5wQ7/VnTBv2rmQQb6ae3Blm09Yg1ac8zt95RLgmGkyX73lZGnRBTdYqyh
kb47Tfw3a+e+VPTD+W3rY1NOSOwbstdDHVckK+0bFvqNyXOoaEJL+EEOhm9rPXxv
le+T6Ct1VPAF9lHPUz7lVCVXN91vP4Gqrxjmeq5rqWNOvRW3jBcCLnEpFzWJtu50
JjQBi3HA0HW+Bxqov22W0llFAa0gVm8xyxXfNSSL7VoCinnS6/qyQvD1GoG0brk3
l5orOk38m2/acvTyvw2tnvAAuqmOr+oGlcQhJOOVl7jDz+sae6RMTwWCMaVxKUDo
YW7v5YFQePmzC9J4M5XFMdCCTb3cUCjVLMdsPgONM3kn9ALEDhhTBSw76N//bGLg
4/OEsT7Aq6LHkA==
=6/HI
-----END PGP SIGNATURE-----
Merge tag 'clocksource.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull clocksource watchdog updates from Paul McKenney:
- Handle negative skews in "skew is too large" messages
- Extend watchdog check exemption to 4-Socket platforms
* tag 'clocksource.2023.08.15a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
x86/tsc: Extend watchdog check exemption to 4-Sockets platform
clocksource: Handle negative skews in "skew is too large" messages
- Carve out the new CONFIG_LIST_HARDENED as a more focused subset of
CONFIG_DEBUG_LIST (Marco Elver).
- Fix kallsyms lookup failure under Clang LTO (Yonghong Song).
- Clarify documentation for CONFIG_UBSAN_TRAP (Jann Horn).
- Flexible array member conversion not carried in other tree (Gustavo
A. R. Silva).
- Various strlcpy() and strncpy() removals not carried in other trees
(Azeem Shaikh, Justin Stitt).
- Convert nsproxy.count to refcount_t (Elena Reshetova).
- Add handful of __counted_by annotations not carried in other trees,
as well as an LKDTM test.
- Fix build failure with gcc-plugins on GCC 14+.
- Fix selftests to respect SKIP for signal-delivery tests.
- Fix CFI warning for paravirt callback prototype.
- Clarify documentation for seq_show_option_n() usage.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmTs6ZAWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkpjD/9AeST5Imc2t0t71Qd+wPxW3jT3
kDZPlHH8wHmuxSpRscX82m21SozvEMvybo6Cp7FSH4qr863FnBWMlo8acr7rKxUf
0f7Y9qgY/hKADiVx5p0pbnCgcy+l4pwsxIqVCGuhjvNCbWHrdGqLM4UjIfaVz5Ws
+55a/C3S1KVwB1s1+6to43jtKqQAx6yrqYWOaT3wEfCzHC87f9PUHhIGnFQVwPGP
WpjQI/BQKpH7+MDCoJOPrZqXaE/4lWALxR6+5BBheGbvLoWifpJEYHX6bDUzkgBz
liQDkgr4eAw5EXSOS7mX3EApfeMKakznJt9Mcmn0h3pPRlM3ZSVD64Xrou2Brpje
exS2JRuh6HwIiXY9nTHc6YMGcAWG1syAR/hM2fQdujM0CWtBUk9+kkuYWsqF6nIK
3tOxYLB/Ph4p+tShd+v5R3mEmp/6snYRKJoUk+9Fk67i54NnK4huyxaCO4zui+ML
3vHuGp8KgFHUjJaYmYXHs3TRZnKSFUkPGc4MbpiGtmJ9zhfSwlhhF+yfBJCsvmTf
ZajA+sPupT4OjLxU6vUD/ZNkXAEjWzktyX2v9YBA7FHh7SqPtX9ARRIxh417AjEJ
tBPHhW/iRw9ftBIAKDmI7gPLynngd/zvjhvk6O5egHYjjgRM1/WAJZ4V26XR6+hf
TWfQb7VRzdZIqwOEUA==
=9ZWP
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"As has become normal, changes are scattered around the tree (either
explicitly maintainer Acked or for trivial stuff that went ignored):
- Carve out the new CONFIG_LIST_HARDENED as a more focused subset of
CONFIG_DEBUG_LIST (Marco Elver)
- Fix kallsyms lookup failure under Clang LTO (Yonghong Song)
- Clarify documentation for CONFIG_UBSAN_TRAP (Jann Horn)
- Flexible array member conversion not carried in other tree (Gustavo
A. R. Silva)
- Various strlcpy() and strncpy() removals not carried in other trees
(Azeem Shaikh, Justin Stitt)
- Convert nsproxy.count to refcount_t (Elena Reshetova)
- Add handful of __counted_by annotations not carried in other trees,
as well as an LKDTM test
- Fix build failure with gcc-plugins on GCC 14+
- Fix selftests to respect SKIP for signal-delivery tests
- Fix CFI warning for paravirt callback prototype
- Clarify documentation for seq_show_option_n() usage"
* tag 'hardening-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
LoadPin: Annotate struct dm_verity_loadpin_trusted_root_digest with __counted_by
kallsyms: Change func signature for cleanup_symbol_name()
kallsyms: Fix kallsyms_selftest failure
nsproxy: Convert nsproxy.count to refcount_t
integrity: Annotate struct ima_rule_opt_list with __counted_by
lkdtm: Add FAM_BOUNDS test for __counted_by
Compiler Attributes: counted_by: Adjust name and identifier expansion
um: refactor deprecated strncpy to memcpy
um: vector: refactor deprecated strncpy
alpha: Replace one-element array with flexible-array member
hardening: Move BUG_ON_DATA_CORRUPTION to hardening options
list: Introduce CONFIG_LIST_HARDENED
list_debug: Introduce inline wrappers for debug checks
compiler_types: Introduce the Clang __preserve_most function attribute
gcc-plugins: Rename last_stmt() for GCC 14+
selftests/harness: Actually report SKIP for signal tests
x86/paravirt: Fix tlb_remove_table function callback prototype warning
EISA: Replace all non-returning strlcpy with strscpy
perf: Replace strlcpy with strscpy
um: Remove strlcpy declaration
...
Commit e6bcfdd75d ("x86/microcode: Hide the config knob") removed the
MICROCODE_AMD config, but left some references in defconfigs and comments,
that have no effect on any kernel build around.
Clean up those remaining config references. No functional change.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230825141226.13566-1-lukas.bulwahn@gmail.com
enc_dec_hypercall() accepted a page count instead of a size, which
forced its callers to round up. As a result, non-page aligned
vaddrs caused pages to be spuriously marked as decrypted via the
encryption status hypercall, which in turn caused consistent
corruption of pages during live migration. Live migration requires
accurate encryption status information to avoid migrating pages
from the wrong perspective.
Fixes: 064ce6c550 ("mm: x86: Invoke hypercall when page encryption status is changed")
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Tested-by: Ben Hillier <bhillier@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230824223731.2055016-1-srutherford@google.com
In ms_hyperv_init_platform(), do not distinguish between a SNP VM with
the paravisor and a SNP VM without the paravisor.
Replace hv_isolation_type_en_snp() with
!ms_hyperv.paravisor_present && hv_isolation_type_snp().
The hv_isolation_type_en_snp() in drivers/hv/hv.c and
drivers/hv/hv_common.c can be changed to hv_isolation_type_snp() since
we know !ms_hyperv.paravisor_present is true there.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-10-decui@microsoft.com
When the paravisor is present, a SNP VM must use GHCB to access some
special MSRs, including HV_X64_MSR_GUEST_OS_ID and some SynIC MSRs.
Similarly, when the paravisor is present, a TDX VM must use TDX GHCI
to access the same MSRs.
Implement hv_tdx_msr_write() and hv_tdx_msr_read(), and use the helper
functions hv_ivm_msr_read() and hv_ivm_msr_write() to access the MSRs
in a unified way for SNP/TDX VMs with the paravisor.
Do not export hv_tdx_msr_write() and hv_tdx_msr_read(), because we never
really used hv_ghcb_msr_write() and hv_ghcb_msr_read() in any module.
Update arch/x86/include/asm/mshyperv.h so that the kernel can still build
if CONFIG_AMD_MEM_ENCRYPT or CONFIG_INTEL_TDX_GUEST is not set, or
neither is set.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-9-decui@microsoft.com
The new variable hyperv_paravisor_present is set only when the VM
is a SNP/TDX VM with the paravisor running: see ms_hyperv_init_platform().
We introduce hyperv_paravisor_present because we can not use
ms_hyperv.paravisor_present in arch/x86/include/asm/mshyperv.h:
struct ms_hyperv_info is defined in include/asm-generic/mshyperv.h, which
is included at the end of arch/x86/include/asm/mshyperv.h, but at the
beginning of arch/x86/include/asm/mshyperv.h, we would already need to use
struct ms_hyperv_info in hv_do_hypercall().
We use hyperv_paravisor_present only in include/asm-generic/mshyperv.h,
and use ms_hyperv.paravisor_present elsewhere. In the future, we'll
introduce a hypercall function structure for different VM types, and
at boot time, the right function pointers would be written into the
structure so that runtime testing of TDX vs. SNP vs. normal will be
avoided and hyperv_paravisor_present will no longer be needed.
Call hv_vtom_init() when it's a VBS VM or when ms_hyperv.paravisor_present
is true, i.e. the VM is a SNP VM or TDX VM with the paravisor.
Enhance hv_vtom_init() for a TDX VM with the paravisor.
In hv_common_cpu_init(), don't decrypt the hyperv_pcpu_input_arg
for a TDX VM with the paravisor, just like we don't decrypt the page
for a SNP VM with the paravisor.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-7-decui@microsoft.com
When a fully enlightened TDX guest runs on Hyper-V, the UEFI firmware sets
the HW_REDUCED flag and consequently ttyS0 interrupts can't work. Fix the
issue by overriding x86_init.acpi.reduced_hw_early_init().
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-5-decui@microsoft.com
Add Hyper-V specific code so that a fully enlightened TDX guest (i.e.
without the paravisor) can run on Hyper-V:
Don't use hv_vp_assist_page. Use GHCI instead.
Don't try to use the unsupported HV_REGISTER_CRASH_CTL.
Don't trust (use) Hyper-V's TLB-flushing hypercalls.
Don't use lazy EOI.
Share the SynIC Event/Message pages with the hypervisor.
Don't use the Hyper-V TSC page for now, because non-trivial work is
required to share the page with the hypervisor.
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-4-decui@microsoft.com
No logic change to SNP/VBS guests.
hv_isolation_type_tdx() will be used to instruct a TDX guest on Hyper-V to
do some TDX-specific operations, e.g. for a fully enlightened TDX guest
(i.e. without the paravisor), hv_do_hypercall() should use
__tdx_hypercall() and such a guest on Hyper-V should handle the Hyper-V
Event/Message/Monitor pages specially.
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230824080712.30327-2-decui@microsoft.com
crash_prepare_elf64_headers() writes into the elfcorehdr an ELF PT_NOTE
for all possible CPUs. As such, subsequent changes to CPUs (ie. hot
un/plug, online/offline) do not need to rewrite the elfcorehdr.
The kimage->file_mode term covers kdump images loaded via the
kexec_file_load() syscall. Since crash_prepare_elf64_headers() wrote the
initial elfcorehdr, no update to the elfcorehdr is needed for CPU changes.
The kimage->elfcorehdr_updated term covers kdump images loaded via the
kexec_load() syscall. At least one memory or CPU change must occur to
cause crash_prepare_elf64_headers() to rewrite the elfcorehdr.
Afterwards, no update to the elfcorehdr is needed for CPU changes.
This code is intentionally *NOT* hoisted into crash_handle_hotplug_event()
as it would prevent the arch-specific handler from running for CPU
changes. This would break PPC, for example, which needs to update other
information besides the elfcorehdr, on CPU changes.
Link: https://lkml.kernel.org/r/20230814214446.6659-9-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The hotplug support for kexec_load() requires changes to the userspace
kexec-tools and a little extra help from the kernel.
Given a kdump capture kernel loaded via kexec_load(), and a subsequent
hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites
it to reflect the hotplug change. That is the desired outcome, however,
at kernel panic time, the purgatory integrity check fails (because the
elfcorehdr changed), and the capture kernel does not boot and no vmcore is
generated.
Therefore, the userspace kexec-tools/kexec must indicate to the kernel
that the elfcorehdr can be modified (because the kexec excluded the
elfcorehdr from the digest, and sized the elfcorehdr memory buffer
appropriately).
To facilitate hotplug support with kexec_load():
- a new kexec flag KEXEC_UPATE_ELFCOREHDR indicates that it is
safe for the kernel to modify the kexec_load()'d elfcorehdr
- the /sys/kernel/crash_elfcorehdr_size node communicates the
preferred size of the elfcorehdr memory buffer
- The sysfs crash_hotplug nodes (ie.
/sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
take into account kexec_file_load() vs kexec_load() and
KEXEC_UPDATE_ELFCOREHDR.
This is critical so that the udev rule processing of crash_hotplug
is all that is needed to determine if the userspace unload-then-load
of the kdump image is to be skipped, or not. The proposed udev
rule change looks like:
# The kernel updates the crash elfcorehdr for CPU and memory changes
SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):
Kernel |Kexec
-------+-----+----
Old |Old |New
| a | a
-------+-----+----
New | a | b
-------+-----+----
where kexec 'old' and 'new' delineate kexec-tools has the needed
modifications for the crash hotplug feature, and kernel 'old' and 'new'
delineate the kernel supports this crash hotplug feature.
Behavior 'a' indicates the unload-then-reload of the entire kdump image.
For the kexec 'old' column, the unload-then-reload occurs due to the
missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with 'new' kexec)
does not present the crash_hotplug sysfs node, which leads to the
unload-then-reload of the kdump image.
Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload of
the kdump image.
If the udev rule is not updated with crash_hotplug node check, then no
matter any combination of kernel or kexec is new or old, the kdump image
continues to be unload-then-reload on hotplug changes.
To fully support crash hotplug feature, there needs to be a rollout of
kernel, kexec-tools and udev rule changes. However, the order of the
rollout of these pieces does not matter; kexec_load()'d kdump images still
function for hotplug as-is.
Link: https://lkml.kernel.org/r/20230814214446.6659-7-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Suggested-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When CPU or memory is hot un/plugged, or off/onlined, the crash
elfcorehdr, which describes the CPUs and memory in the system, must also
be updated.
A new elfcorehdr is generated from the available CPUs and memory and
replaces the existing elfcorehdr. The segment containing the elfcorehdr
is identified at run-time in crash_core:crash_handle_hotplug_event().
No modifications to purgatory (see 'kexec: exclude elfcorehdr from the
segment digest') or boot_params (as the elfcorehdr= capture kernel command
line parameter pointer remains unchanged and correct) are needed, just
elfcorehdr.
For kexec_file_load(), the elfcorehdr segment size is based on NR_CPUS and
CRASH_MAX_MEMORY_RANGES in order to accommodate a growing number of CPU
and memory resources.
For kexec_load(), the userspace kexec utility needs to size the elfcorehdr
segment in the same/similar manner.
To accommodate kexec_load() syscall in the absence of kexec_file_load()
syscall support, prepare_elf_headers() and dependents are moved outside of
CONFIG_KEXEC_FILE.
[eric.devolder@oracle.com: correct unused function build error]
Link: https://lkml.kernel.org/r/20230821182644.2143-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230814214446.6659-6-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
`strncpy` is deprecated for use on NUL-terminated destination strings [1].
A suitable replacement is `strscpy` [2] due to the fact that it
guarantees NUL-termination on its destination buffer argument which is
_not_ the case for `strncpy`!
In this case, it means we can drop the `...-1` from:
| strncpy(to, from, len-1);
as well as remove the comment mentioning NUL-termination as `strscpy`
implicitly grants us this behavior.
There should be no functional change as I don't believe the padding from
`strncpy` is needed here. If it turns out that the padding is necessary
we should use `strscpy_pad` as a direct replacement.
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Link: www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings[1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/r/20230822-strncpy-arch-x86-kernel-apic-x2apic_uv_x-v1-1-91d681d0b3f3@google.com
`strncpy` is deprecated for use on NUL-terminated destination strings [1].
A suitable replacement is `strscpy` [2] due to the fact that it
guarantees NUL-termination on its destination buffer argument which is
_not_ the case for `strncpy`!
In this case, it is a simple swap from `strncpy` to `strscpy`. There is
one slight difference, though. If NUL-padding is a functional
requirement here we should opt for `strscpy_pad`. It seems like this
shouldn't be needed as I see no obvious signs of any padding being
required.
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings[1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/r/20230822-strncpy-arch-x86-kernel-hpet-v1-1-2c7d3be86f4a@google.com
0-Day found a 34.6% regression in stress-ng's 'af-alg' test case, and
bisected it to commit b81fac906a ("x86/fpu: Move FPU initialization into
arch_cpu_finalize_init()"), which optimizes the FPU init order, and moves
the CR4_OSXSAVE enabling into a later place:
arch_cpu_finalize_init
identify_boot_cpu
identify_cpu
generic_identify
get_cpu_cap --> setup cpu capability
...
fpu__init_cpu
fpu__init_cpu_xstate
cr4_set_bits(X86_CR4_OSXSAVE);
As the FPU is not yet initialized the CPU capability setup fails to set
X86_FEATURE_OSXSAVE. Many security module like 'camellia_aesni_avx_x86_64'
depend on this feature and therefore fail to load, causing the regression.
Cure this by setting X86_FEATURE_OSXSAVE feature right after OSXSAVE
enabling.
[ tglx: Moved it into the actual BSP FPU initialization code and added a comment ]
Fixes: b81fac906a ("x86/fpu: Move FPU initialization into arch_cpu_finalize_init()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/202307192135.203ac24e-oliver.sang@intel.com
Link: https://lore.kernel.org/lkml/20230823065747.92257-1-feng.tang@intel.com
The thread flag TIF_NEED_FPU_LOAD indicates that the FPU saved state is
valid and should be reloaded when returning to userspace. However, the
kernel will skip doing this if the FPU registers are already valid as
determined by fpregs_state_valid(). The logic embedded there considers
the state valid if two cases are both true:
1: fpu_fpregs_owner_ctx points to the current tasks FPU state
2: the last CPU the registers were live in was the current CPU.
This is usually correct logic. A CPU’s fpu_fpregs_owner_ctx is set to
the current FPU during the fpregs_restore_userregs() operation, so it
indicates that the registers have been restored on this CPU. But this
alone doesn’t preclude that the task hasn’t been rescheduled to a
different CPU, where the registers were modified, and then back to the
current CPU. To verify that this was not the case the logic relies on the
second condition. So the assumption is that if the registers have been
restored, AND they haven’t had the chance to be modified (by being
loaded on another CPU), then they MUST be valid on the current CPU.
Besides the lazy FPU optimizations, the other cases where the FPU
registers might not be valid are when the kernel modifies the FPU register
state or the FPU saved buffer. In this case the operation modifying the
FPU state needs to let the kernel know the correspondence has been
broken. The comment in “arch/x86/kernel/fpu/context.h” has:
/*
...
* If the FPU register state is valid, the kernel can skip restoring the
* FPU state from memory.
*
* Any code that clobbers the FPU registers or updates the in-memory
* FPU state for a task MUST let the rest of the kernel know that the
* FPU registers are no longer valid for this task.
*
* Either one of these invalidation functions is enough. Invalidate
* a resource you control: CPU if using the CPU for something else
* (with preemption disabled), FPU for the current task, or a task that
* is prevented from running by the current task.
*/
However, this is not completely true. When the kernel modifies the
registers or saved FPU state, it can only rely on
__fpu_invalidate_fpregs_state(), which wipes the FPU’s last_cpu
tracking. The exec path instead relies on fpregs_deactivate(), which sets
the CPU’s FPU context to NULL. This was observed to fail to restore the
reset FPU state to the registers when returning to userspace in the
following scenario:
1. A task is executing in userspace on CPU0
- CPU0’s FPU context points to tasks
- fpu->last_cpu=CPU0
2. The task exec()’s
3. While in the kernel the task is preempted
- CPU0 gets a thread executing in the kernel (such that no other
FPU context is activated)
- Scheduler sets task’s fpu->last_cpu=CPU0 when scheduling out
4. Task is migrated to CPU1
5. Continuing the exec(), the task gets to
fpu_flush_thread()->fpu_reset_fpregs()
- Sets CPU1’s fpu context to NULL
- Copies the init state to the task’s FPU buffer
- Sets TIF_NEED_FPU_LOAD on the task
6. The task reschedules back to CPU0 before completing the exec() and
returning to userspace
- During the reschedule, scheduler finds TIF_NEED_FPU_LOAD is set
- Skips saving the registers and updating task’s fpu→last_cpu,
because TIF_NEED_FPU_LOAD is the canonical source.
7. Now CPU0’s FPU context is still pointing to the task’s, and
fpu->last_cpu is still CPU0. So fpregs_state_valid() returns true even
though the reset FPU state has not been restored.
So the root cause is that exec() is doing the wrong kind of invalidate. It
should reset fpu->last_cpu via __fpu_invalidate_fpregs_state(). Further,
fpu__drop() doesn't really seem appropriate as the task (and FPU) are not
going away, they are just getting reset as part of an exec. So switch to
__fpu_invalidate_fpregs_state().
Also, delete the misleading comment that says that either kind of
invalidate will be enough, because it’s not always the case.
Fixes: 33344368cb ("x86/fpu: Clean up the fpu__clear() variants")
Reported-by: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Lijun Pan <lijun.pan@intel.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Lijun Pan <lijun.pan@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230818170305.502891-1-rick.p.edgecombe@intel.com
When CONFIG_HYPERV is not set, arch/x86/hyperv/ivm.c is not built (see
arch/x86/Kbuild), so 'isolation_type_en_snp' in the ivm.c is not defined,
and this failure happens:
ld: arch/x86/kernel/cpu/mshyperv.o: in function `ms_hyperv_init_platform':
arch/x86/kernel/cpu/mshyperv.c:417: undefined reference to `isolation_type_en_snp'
Fix the failure by testing hv_get_isolation_type() and
ms_hyperv.paravisor_present for a fully enlightened SNP VM: when
CONFIG_HYPERV is not set, hv_get_isolation_type() is defined as a
static inline function that always returns HV_ISOLATION_TYPE_NONE
(see include/asm-generic/mshyperv.h), so the compiler won't generate any
code for the ms_hyperv.paravisor_present and static_branch_enable().
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Closes: https://lore.kernel.org/lkml/b4979997-23b9-0c43-574e-e4a3506500ff@amd.com/
Fixes: d6e2d65244 ("x86/hyperv: Add sev-snp enlightened guest static key")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230823032008.18186-1-decui@microsoft.com
Add Hyperv-specific handling for faults caused by VMMCALL
instructions.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230818102919.1318039-9-ltykernel@gmail.com
In the AMD SEV-SNP guest, AP needs to be started up via sev es
save area and Hyper-V requires to call HVCALL_START_VP hypercall
to pass the gpa of sev es save area with AP's vp index and VTL(Virtual
trust level) parameters. Override wakeup_secondary_cpu_64 callback
with hv_snp_boot_ap.
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20230818102919.1318039-8-ltykernel@gmail.com
Currently kcov instrument is disabled for object files under
arch/x86/kernel folder.
For object files under arch/x86/kernel, actually just disabling the kcov
instrument of files:"head32.o or head64.o and sev.o" could achieve
successful booting and provide kcov coverage for object files that do not
disable kcov instrument. The additional kcov coverage collected from
arch/x86/kernel folder helps kernel fuzzing efforts to find bugs.
Link to related improvement discussion is below:
https://groups.google.com/g/syzkaller/c/Dsl-RYGCqs8/m/x-tfpTyFBAAJ Related
ticket is as follow: https://bugzilla.kernel.org/show_bug.cgi?id=198443
Link: https://lkml.kernel.org/r/06c0bb7b5f61e5884bf31180e8c122648c752010.1690771380.git.pengfei.xu@intel.com
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Pengfei Xu <pengfei.xu@intel.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: <heng.su@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kees Cook <keescook@google.com>,
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The APIs that allow backtracing across CPUs have always had a way to
exclude the current CPU. This convenience means callers didn't need to
find a place to allocate a CPU mask just to handle the common case.
Let's extend the API to take a CPU ID to exclude instead of just a
boolean. This isn't any more complex for the API to handle and allows the
hardlockup detector to exclude a different CPU (the one it already did a
trace for) without needing to find space for a CPU mask.
Arguably, this new API also encourages safer behavior. Specifically if
the caller wants to avoid tracing the current CPU (maybe because they
already traced the current CPU) this makes it more obvious to the caller
that they need to make sure that the current CPU ID can't change.
[akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub]
Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The Instruction Fetch (IF) units on current AMD Zen-based systems do not
guarantee a synchronous #MC is delivered for poison consumption errors.
Therefore, MCG_STATUS[EIPV|RIPV] will not be set. However, the
microarchitecture does guarantee that the exception is delivered within
the same context. In other words, the exact rIP is not known, but the
context is known to not have changed.
There is no architecturally-defined method to determine this behavior.
The Code Segment (CS) register is always valid on such IF unit poison
errors regardless of the value of MCG_STATUS[EIPV|RIPV].
Add a quirk to save the CS register for poison consumption from the IF
unit banks.
This is needed to properly determine the context of the error.
Otherwise, the severity grading function will assume the context is
IN_KERNEL due to the m->cs value being 0 (the initialized value). This
leads to unnecessary kernel panics on data poison errors due to the
kernel believing the poison consumption occurred in kernel context.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230814200853.29258-1-yazen.ghannam@amd.com
Specify how is SRSO mitigated when SMT is disabled. Also, correct the
SMT check for that.
Fixes: e9fbc47b81 ("x86/srso: Disable the mitigation on unaffected configurations")
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20230814200813.p5czl47zssuej7nv@treble
The following warning is reported when frame pointers and kernel IBT are
enabled:
vmlinux.o: warning: objtool: ibt_selftest+0x11: sibling call from callable instruction with modified stack frame
The problem is that objtool interprets the indirect branch in
ibt_selftest() as a sibling call, and GCC inserts a (partial) frame
pointer prologue before it:
0000 000000000003f550 <ibt_selftest>:
0000 3f550: f3 0f 1e fa endbr64
0004 3f554: e8 00 00 00 00 call 3f559 <ibt_selftest+0x9> 3f555: R_X86_64_PLT32 __fentry__-0x4
0009 3f559: 55 push %rbp
000a 3f55a: 48 8d 05 02 00 00 00 lea 0x2(%rip),%rax # 3f563 <ibt_selftest_ip>
0011 3f561: ff e0 jmp *%rax
Note the inline asm is missing ASM_CALL_CONSTRAINT, so the 'push %rbp'
happens before the indirect branch and the 'mov %rsp, %rbp' happens
afterwards.
Simplify the generated code and make it easier to understand for both
tools and humans by moving the selftest to proper asm.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/99a7e16b97bda97bf0a04aa141d6241cd8a839a2.1680912949.git.jpoimboe@kernel.org
Christian reported spurious module load crashes after some of Song's
module memory layout patches.
Turns out that if the very last instruction on the very last page of the
module is a 'JMP __x86_return_thunk' then __static_call_fixup() will
trip a fault and die.
And while the module rework made this slightly more likely to happen,
it's always been possible.
Fixes: ee88d363d1 ("x86,static_call: Use alternative RET encoding")
Reported-by: Christian Bricart <christian@bricart.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lkml.kernel.org/r/20230816104419.GA982867@hirez.programming.kicks-ass.net
Similar to how it doesn't make sense to have UNTRAIN_RET have two
untrain calls, it also doesn't make sense for VMEXIT to have an extra
IBPB call.
This cures VMEXIT doing potentially unret+IBPB or double IBPB.
Also, the (SEV) VMEXIT case seems to have been overlooked.
Redefine the meaning of the synthetic IBPB flags to:
- ENTRY_IBPB -- issue IBPB on entry (was: entry + VMEXIT)
- IBPB_ON_VMEXIT -- issue IBPB on VMEXIT
And have 'retbleed=ibpb' set *BOTH* feature flags to ensure it retains
the previous behaviour and issues IBPB on entry+VMEXIT.
The new 'srso=ibpb_vmexit' option only sets IBPB_ON_VMEXIT.
Create UNTRAIN_RET_VM specifically for the VMEXIT case, and have that
check IBPB_ON_VMEXIT.
All this avoids having the VMEXIT case having to check both ENTRY_IBPB
and IBPB_ON_VMEXIT and simplifies the alternatives.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121149.109557833@infradead.org
Since there can only be one active return_thunk, there only needs be
one (matching) untrain_ret. It fundamentally doesn't make sense to
allow multiple untrain_ret at the same time.
Fold all the 3 different untrain methods into a single (temporary)
helper stub.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121149.042774962@infradead.org
Rename the original retbleed return thunk and untrain_ret to
retbleed_return_thunk() and retbleed_untrain_ret().
No functional changes.
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.909378169@infradead.org
Use the existing configurable return thunk. There is absolute no
justification for having created this __x86_return_thunk alternative.
To clarify, the whole thing looks like:
Zen3/4 does:
srso_alias_untrain_ret:
nop2
lfence
jmp srso_alias_return_thunk
int3
srso_alias_safe_ret: // aliasses srso_alias_untrain_ret just so
add $8, %rsp
ret
int3
srso_alias_return_thunk:
call srso_alias_safe_ret
ud2
While Zen1/2 does:
srso_untrain_ret:
movabs $foo, %rax
lfence
call srso_safe_ret (jmp srso_return_thunk ?)
int3
srso_safe_ret: // embedded in movabs instruction
add $8,%rsp
ret
int3
srso_return_thunk:
call srso_safe_ret
ud2
While retbleed does:
zen_untrain_ret:
test $0xcc, %bl
lfence
jmp zen_return_thunk
int3
zen_return_thunk: // embedded in the test instruction
ret
int3
Where Zen1/2 flush the BTB entry using the instruction decoder trick
(test,movabs) Zen3/4 use BTB aliasing. SRSO adds a return sequence
(srso_safe_ret()) which forces the function return instruction to
speculate into a trap (UD2). This RET will then mispredict and
execution will continue at the return site read from the top of the
stack.
Pick one of three options at boot (evey function can only ever return
once).
[ bp: Fixup commit message uarch details and add them in a comment in
the code too. Add a comment about the srso_select_mitigation()
dependency on retbleed_select_mitigation(). Add moar ifdeffery for
32-bit builds. Add a dummy srso_untrain_ret_alias() definition for
32-bit alternatives needing the symbol. ]
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230814121148.842775684@infradead.org