mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-11-18 01:46:58 +00:00
 - Ensure that the WBINVD in stop_this_cpu() has been completed before the
   control CPU proceeds.

   stop_this_cpu() is used for kexec(), reboot and shutdown to park the APs
   in a HLT loop.

   The control CPU sends an IPI to the APs and waits for their CPU online bits
   to be cleared. Once they all are marked "offline" it proceeds.

   But stop_this_cpu() clears the CPU online bit before issuing WBINVD,
   which means there is no guarantee that the AP has reached the HLT loop.

   This was reported to cause intermittent reboot/shutdown failures due to
   some dubious interaction with the firmware.

   This is not only a problem of WBINVD. The code to actually "stop" the
   CPU which runs between clearing the online bit and reaching the HLT loop
   can cause large enough delays on its own (think virtualization). That's
   especially dangerous for kexec() as kexec() expects that all APs are in
   a safe state and not executing code while the boot CPU jumps to the new
   kernel. There are more issues vs. kexec() which are addressed separately.

   Cure this by implementing an explicit synchronization point right before
   the AP reaches HLT. This guarantees that the AP has completed the full
   stop procedure.

 - Fix the condition for WBINVD in stop_this_cpu().

   The WBINVD in stop_this_cpu() is required for ensuring that when
   switching to or from memory encryption no dirty data is left in the
   cache lines which might cause a write back in the wrong mode later.

   This checks CPUID directly because the feature bit might have been
   cleared due to a command line option.

   But that CPUID check accesses leaf 0x8000001f::EAX unconditionally. Intel
   CPUs return the content of the highest supported leaf when a non-existing
   leaf is read, while AMD CPUs return all zeros for unsupported leaves.

   So the result of the test on Intel CPUs is a lottery and on AMD it's just
   correct by chance.

   While harmless it's incorrect and causes the conditional wbinvd() to be
   issued where not required, which caused the above issue to be unearthed.

 - Make kexec() robust against AP code execution

   Ashok observed triple faults when doing kexec() on a system which had
   been booted with "nosmt".

   It turned out that the SMT siblings which had been brought up partially
   are parked in mwait_play_dead() to enable power savings.

   mwait_play_dead() is monitoring the thread flags of the AP's idle task,
   which has been chosen as it's unlikely to be written to.

   But kexec() can overwrite the previous kernel text and data including
   page tables etc. When it overwrites the cache lines monitored by an AP
   that AP resumes execution after the MWAIT on possibly overwritten
   text, stack and page tables, which can easily end up in a triple
   fault.

   Make this more robust in several steps:

    1) Use an explicit per CPU cache line for monitoring.

    2) Write a command to these cache lines to kick APs out of MWAIT before
       proceeding with kexec(), shutdown or reboot.

       The APs confirm the wakeup by writing status back and then enter a
       HLT loop.

    3) If the system uses INIT/INIT/STARTUP for AP bringup, park the APs
       in INIT state.

       HLT is not a guarantee that an AP won't wake up and resume
       execution. HLT is woken up by NMI and SMI. SMI puts the CPU back
       into HLT (+/- firmware bugs), but NMI is delivered to the CPU which
       executes the NMI handler. Same issue as the MWAIT scenario described
       above.

       Sending an INIT/INIT sequence to the APs puts them into wait for
       STARTUP state, which is safe against NMI.

   There is still an issue remaining which can't be fixed: #MCE

   If the AP sits in HLT and receives a broadcast #MCE it will try to
   handle it with the obvious consequences.

   INIT/INIT clears CR4.MCE in the AP which will cause a broadcast #MCE to
   shut down the machine.

   So there is a choice between fire (HLT) and frying pan (INIT). Frying
   pan has been chosen as it's at least preventing the NMI issue.

   On systems which are not using INIT/INIT/STARTUP there is not much
   which can be done right now, but at least the obvious and easy to
   trigger MWAIT issue has been addressed.
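
The CPUID problem described in the second fix can be illustrated with a small userspace sketch. It is not the kernel code: the `cpuid_fn` callback, the two `sme_check_*()` helpers and the fake CPUs are hypothetical names invented for this illustration, and the register values the fakes return are made up. The only facts taken from the text are that leaf 0x8000001f must not be read unless the CPU reports it as supported, and that an unsupported leaf yields unrelated data on Intel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Register values returned by one CPUID invocation. */
struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/* Hypothetical callback type so the check can run against fake CPUs. */
typedef struct cpuid_regs (*cpuid_fn)(uint32_t leaf);

#define LEAF_EXT_MAX		0x80000000u	/* EAX = highest extended leaf */
#define LEAF_MEM_ENCRYPT	0x8000001fu	/* EAX bit 0 = SME supported */

/* Broken variant: reads the leaf without checking that it exists. */
static bool sme_check_unguarded(cpuid_fn cpuid)
{
	return cpuid(LEAF_MEM_ENCRYPT).eax & 1u;
}

/* Fixed variant: only trust the leaf if the CPU reports it as supported. */
static bool sme_check_guarded(cpuid_fn cpuid)
{
	if (cpuid(LEAF_EXT_MAX).eax < LEAF_MEM_ENCRYPT)
		return false;

	return cpuid(LEAF_MEM_ENCRYPT).eax & 1u;
}

/*
 * Fake Intel-like CPU (assumption for this sketch): extended leaves end
 * at 0x80000008 and an out-of-range leaf returns unrelated data which
 * happens to have bit 0 set.
 */
static struct cpuid_regs fake_intel_cpuid(uint32_t leaf)
{
	struct cpuid_regs r = { 0, 0, 0, 0 };

	if (leaf == LEAF_EXT_MAX)
		r.eax = 0x80000008u;
	else if (leaf > 0x80000008u)
		r.eax = 0x00000007u;	/* made-up content of another leaf */

	return r;
}

/* Fake AMD-like CPU (assumption for this sketch) which implements SME. */
static struct cpuid_regs fake_amd_sme_cpuid(uint32_t leaf)
{
	struct cpuid_regs r = { 0, 0, 0, 0 };

	if (leaf == LEAF_EXT_MAX)
		r.eax = 0x80000021u;
	else if (leaf == LEAF_MEM_ENCRYPT)
		r.eax = 0x00000001u;

	return r;
}
```

With these fakes, `sme_check_unguarded(fake_intel_cpuid)` wrongly reports SME while `sme_check_guarded(fake_intel_cpuid)` does not; both variants agree on the fake AMD CPU.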
Merge tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Thomas Gleixner:
"A set of fixes for kexec(), reboot and shutdown issues:
the full description is identical to the tag message above"
* tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/smp: Put CPUs into INIT on shutdown if possible
x86/smp: Split sending INIT IPI out into a helper function
x86/smp: Cure kexec() vs. mwait_play_dead() breakage
x86/smp: Use dedicated cache-line for mwait_play_dead()
x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
x86/smp: Dont access non-existing CPUID leaf
x86/smp: Make stop_other_cpus() more robust
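
The ordering problem behind "x86/smp: Make stop_other_cpus() more robust" can be shown with a toy, single-threaded model. This is a sketch under loud assumptions, not the kernel implementation: the step counter, `NR_APS` and `aps_busy_when_control_proceeds()` are hypothetical. Each AP runs three steps in order (clear its online bit, run WBINVD and the rest of the teardown, signal "stopped" right before HLT). The control CPU either waits for the online bits only (old behaviour) or for the explicit "stopped" signal (new behaviour).

```c
#include <stdbool.h>

#define NR_APS		3

/* Number of completed steps after which an AP counts as ... */
#define STEP_OFFLINE	1	/* ... marked offline (online bit cleared) */
#define STEP_STOPPED	3	/* ... fully stopped, about to enter HLT */

/*
 * Returns how many APs are still executing teardown code at the moment
 * the control CPU decides to proceed.
 */
static int aps_busy_when_control_proceeds(bool wait_for_stopped)
{
	int steps_done[NR_APS] = { 0 };
	int wait_for = wait_for_stopped ? STEP_STOPPED : STEP_OFFLINE;

	for (;;) {
		bool all_ready = true;
		int busy = 0;

		/* Control CPU polls the state of every AP. */
		for (int i = 0; i < NR_APS; i++) {
			if (steps_done[i] < wait_for)
				all_ready = false;
			if (steps_done[i] < STEP_STOPPED)
				busy++;
		}

		if (all_ready)
			return busy;

		/* Every AP which is not yet stopped makes one step. */
		for (int i = 0; i < NR_APS; i++) {
			if (steps_done[i] < STEP_STOPPED)
				steps_done[i]++;
		}
	}
}
```

In this model `aps_busy_when_control_proceeds(false)` returns 3, i.e. all APs are still running code when the control CPU moves on, while `aps_busy_when_control_proceeds(true)` returns 0. The `cpus_stop_mask` declared in the header below presumably serves as that "stopped" signal in the real code; the model does not depend on how it is implemented.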
101 lines
2.4 KiB
C
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_X86_CPU_H
#define _ASM_X86_CPU_H

#include <linux/device.h>
#include <linux/cpu.h>
#include <linux/topology.h>
#include <linux/nodemask.h>
#include <linux/percpu.h>
#include <asm/ibt.h>

#ifdef CONFIG_SMP

extern void prefill_possible_map(void);

#else /* CONFIG_SMP */

static inline void prefill_possible_map(void) {}

#define cpu_physical_id(cpu)			boot_cpu_physical_apicid
#define cpu_acpi_id(cpu)			0
#define safe_smp_processor_id()			0

#endif /* CONFIG_SMP */

struct x86_cpu {
	struct cpu cpu;
};

#ifdef CONFIG_HOTPLUG_CPU
extern int arch_register_cpu(int num);
extern void arch_unregister_cpu(int);
extern void soft_restart_cpu(void);
#endif

extern void ap_init_aperfmperf(void);

int mwait_usable(const struct cpuinfo_x86 *);

unsigned int x86_family(unsigned int sig);
unsigned int x86_model(unsigned int sig);
unsigned int x86_stepping(unsigned int sig);
#ifdef CONFIG_CPU_SUP_INTEL
extern void __init sld_setup(struct cpuinfo_x86 *c);
extern bool handle_user_split_lock(struct pt_regs *regs, long error_code);
extern bool handle_guest_split_lock(unsigned long ip);
extern void handle_bus_lock(struct pt_regs *regs);
u8 get_this_hybrid_cpu_type(void);
#else
static inline void __init sld_setup(struct cpuinfo_x86 *c) {}
static inline bool handle_user_split_lock(struct pt_regs *regs, long error_code)
{
	return false;
}

static inline bool handle_guest_split_lock(unsigned long ip)
{
	return false;
}

static inline void handle_bus_lock(struct pt_regs *regs) {}

static inline u8 get_this_hybrid_cpu_type(void)
{
	return 0;
}
#endif
#ifdef CONFIG_IA32_FEAT_CTL
void init_ia32_feat_ctl(struct cpuinfo_x86 *c);
#else
static inline void init_ia32_feat_ctl(struct cpuinfo_x86 *c) {}
#endif

extern __noendbr void cet_disable(void);

struct ucode_cpu_info;

int intel_cpu_collect_info(struct ucode_cpu_info *uci);

static inline bool intel_cpu_signatures_match(unsigned int s1, unsigned int p1,
					      unsigned int s2, unsigned int p2)
{
	if (s1 != s2)
		return false;

	/* Processor flags are either both 0 ... */
	if (!p1 && !p2)
		return true;

	/* ... or they intersect. */
	return p1 & p2;
}

extern u64 x86_read_arch_cap_msr(void);
int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);

extern struct cpumask cpus_stop_mask;

#endif /* _ASM_X86_CPU_H */
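
The matching rule in intel_cpu_signatures_match() is easiest to see with concrete values. The following is a userspace copy of the helper, given only so it can be compiled outside the kernel; the signature and processor flag values in the usage note are arbitrary examples, not taken from any real microcode file.

```c
#include <stdbool.h>

/* Userspace copy of the helper from the header above, for illustration. */
static inline bool intel_cpu_signatures_match(unsigned int s1, unsigned int p1,
					      unsigned int s2, unsigned int p2)
{
	/* Different CPU signatures never match. */
	if (s1 != s2)
		return false;

	/* Processor flags are either both 0 ... */
	if (!p1 && !p2)
		return true;

	/* ... or they intersect, i.e. share at least one set bit. */
	return p1 & p2;
}
```

For the same signature, flags 0x02 and 0x22 match because they share bit 1, flags 0x02 and 0x20 do not match, and two zero flag values match. If only one side is zero the result is false, because the intersection is empty.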