Commit Graph

765 Commits

Author SHA1 Message Date
Thomas Gleixner
c8b7fb09d1 x86/smpboot: Remove cpu_callin_mask
Now that TSC synchronization is SMP function call based there is no reason
to wait for the AP to be set in smp_callin_mask. The control CPU waits for
the AP to set itself in the online mask anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.206394064@linutronix.de
2023-05-15 13:44:53 +02:00
Thomas Gleixner
9d349d47f0 x86/smpboot: Make TSC synchronization function call based
Spin-waiting on the control CPU until the AP reaches the TSC
synchronization is just a waste especially in the case that there is no
synchronization required.

As the synchronization has to run with interrupts disabled the control CPU
part can just be done from a SMP function call. The upcoming AP issues that
call async only in the case that synchronization is required.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.148255496@linutronix.de
2023-05-15 13:44:53 +02:00
Thomas Gleixner
d4f28f07c2 x86/smpboot: Move synchronization masks to SMP boot code
The usage is in smpboot.c and not in the CPU initialization code.

The XEN_PV usage of cpu_callout_mask is obsolete as cpu_init() not longer
waits and cacheinfo has its own CPU mask now, so cpu_callout_mask can be
made static too.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205256.091511483@linutronix.de
2023-05-15 13:44:52 +02:00
Thomas Gleixner
e94cd1503b x86/smpboot: Get rid of cpu_init_secondary()
The synchronization of the AP with the control CPU is a SMP boot problem
and has nothing to do with cpu_init().

Open code cpu_init_secondary() in start_secondary() and move
wait_for_master_cpu() into the SMP boot code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.981999763@linutronix.de
2023-05-15 13:44:51 +02:00
David Woodhouse
2b3be65d2e x86/smpboot: Split up native_cpu_up() into separate phases and document them
There are four logical parts to what native_cpu_up() does on the BSP (or
on the controlling CPU for a later hotplug):

 1) Wake the AP by sending the INIT/SIPI/SIPI sequence.

 2) Wait for the AP to make it as far as wait_for_master_cpu() which
    sets that CPU's bit in cpu_initialized_mask, then sets the bit in
    cpu_callout_mask to let the AP proceed through cpu_init().

 3) Wait for the AP to finish cpu_init() and get as far as the
    smp_callin() call, which sets that CPU's bit in cpu_callin_mask.

 4) Perform the TSC synchronization and wait for the AP to actually
    mark itself online in cpu_online_mask.

In preparation to allow these phases to operate in parallel on multiple
APs, split them out into separate functions and document the interactions
a little more clearly in both the BP and AP code paths.

No functional change intended.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.928917242@linutronix.de
2023-05-15 13:44:51 +02:00
Thomas Gleixner
c7f15dd3f0 x86/smpboot: Remove unnecessary barrier()
Peter stumbled over the barrier() after the invocation of smp_callin() in
start_secondary():

  "...this barrier() and it's comment seem weird vs smp_callin(). That
   function ends with an atomic bitop (it has to, at the very least it must
   not be weaker than store-release) but also has an explicit wmb() to order
   setup vs CPU_STARTING.

   There is no way the smp_processor_id() referred to in this comment can land
   before cpu_init() even without the barrier()."

The barrier() along with the comment was added in 2003 with commit
d8f19f2cac70 ("[PATCH] x86-64 merge") in the history tree. One of those
well documented combo patches of that time which changes world and some
more. The context back then was:

	/*
	 * Dont put anything before smp_callin(), SMP
	 * booting is too fragile that we want to limit the
	 * things done here to the most necessary things.
	 */
	cpu_init();
	smp_callin();

+	/* otherwise gcc will move up smp_processor_id before the cpu_init */
+ 	barrier();

	Dprintk("cpu %d: waiting for commence\n", smp_processor_id());

Even back in 2003 the compiler was not allowed to reorder that
smp_processor_id() invocation before the cpu_init() function call.
Especially not as smp_processor_id() resolved to:

  asm volatile("movl %%gs:%c1,%0":"=r" (ret__):"i"(pda_offset(field)):"memory");

There is no trace of this change in any mailing list archive including the
back then official x86_64 list discuss@x86-64.org, which would explain the
problem this change solved.

The debug prints are gone by now and the the only smp_processor_id()
invocation today is farther down in start_secondary() after locking
vector_lock which itself prevents reordering.

Even if the compiler would be allowed to reorder this, the code would still
be correct as GSBASE is set up early in the assembly code and is valid when
the CPU reaches start_secondary(), while the code at the time when this
barrier was added did the GSBASE setup in cpu_init().

As the barrier has zero value, remove it.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.875713771@linutronix.de
2023-05-15 13:44:50 +02:00
Thomas Gleixner
5475abbde7 x86/smpboot: Remove the CPU0 hotplug kludge
This was introduced with commit e1c467e690 ("x86, hotplug: Wake up CPU0
via NMI instead of INIT, SIPI, SIPI") to eventually support physical
hotplug of CPU0:

 "We'll change this code in the future to wake up hard offlined CPU0 if
  real platform and request are available."

11 years later this has not happened and physical hotplug is not officially
supported. Remove the cruft.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.768845190@linutronix.de
2023-05-15 13:44:49 +02:00
Thomas Gleixner
134a12827b x86/smpboot: Avoid pointless delay calibration if TSC is synchronized
When TSC is synchronized across sockets then there is no reason to
calibrate the delay for the first CPU which comes up on a socket.

Just reuse the existing calibration value.

This removes 100ms pointlessly wasted time from CPU hotplug per socket.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.608773568@linutronix.de
2023-05-15 13:44:48 +02:00
Thomas Gleixner
ba831b7b1a cpu/hotplug: Mark arch_disable_smp_support() and bringup_nonboot_cpus() __init
No point in keeping them around.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.551974164@linutronix.de
2023-05-15 13:44:47 +02:00
Thomas Gleixner
5107e3ebb8 x86/smpboot: Cleanup topology_phys_to_logical_pkg()/die()
Make topology_phys_to_logical_pkg_die() static as it's only used in
smpboot.c and fixup the kernel-doc warnings for both functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck
Link: https://lore.kernel.org/r/20230512205255.493750666@linutronix.de
2023-05-15 13:44:47 +02:00
Chen Yu
044f0e27de x86/sched: Add the SD_ASYM_PACKING flag to the die domain of hybrid processors
Intel Meteor Lake hybrid processors have cores in two separate dies. The
cores in one of the dies have higher maximum frequency. Use the SD_ASYM_
PACKING flag to give higher priority to the die with CPUs of higher maximum
frequency.

Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230406203148.19182-13-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:38 +02:00
Ricardo Neri
995998ebde x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags
There is no difference between any of the SMT siblings of a physical core.
Do not do asym_packing load balancing at this level.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Link: https://lore.kernel.org/r/20230406203148.19182-11-ricardo.neri-calderon@linux.intel.com
2023-05-08 10:58:37 +02:00
Linus Torvalds
2aff7c706c Objtool changes for v6.4:
- Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did
    this inconsistently follow this new, common convention, and fix all the fallout
    that objtool can now detect statically.
 
  - Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity,
    split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it.
 
  - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code.
 
  - Generate ORC data for __pfx code
 
  - Add more __noreturn annotations to various kernel startup/shutdown/panic functions.
 
  - Misc improvements & fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK1x0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ghxQ/+IkCynMYtdF5OG9YwbcGJqsPSfOPMEcEM
 pUSFYg+gGPBDT/fJfcVSqvUtdnWbLC2kXt9yiswXz3X3J2nmNkBk5YKQftsNDcul
 TmKeqIIAK51XTncpegKH0EGnOX63oZ9Vxa8CTPdDlb+YF23Km2FoudGRI9F5qbUd
 LoraXqGYeiaeySkGyWmZVl6Uc8dIxnMkTN3H/oI9aB6TOrsi059hAtFcSaFfyemP
 c4LqXXCH7k2baiQt+qaLZ8cuZVG/+K5r2N2cmjO5kmJc6ynIaFnfMe4XxZLjp5LT
 /PulYI15bXkvSARKx5CRh/CDHMOx5Blw+ASO0RhWbdy0WH4ZhhcaVF5AeIpPW86a
 1LBcz97rMp72WmvKgrJeVO1r9+ll4SI6/YKGJRsxsCMdP3hgFpqntXyVjTFNdTM1
 0gH6H5v55x06vJHvhtTk8SR3PfMTEM2fRU5jXEOrGowoGifx+wNUwORiwj6LE3KQ
 SKUdT19RNzoW3VkFxhgk65ThK1S7YsJUKRoac3YdhttpqqqtFV//erenrZoR4k/p
 vzvKy68EQ7RCNyD5wNWNFe0YjeJl5G8gQ8bUm4Xmab7djjgz+pn4WpQB8yYKJLAo
 x9dqQ+6eUbw3Hcgk6qQ9E+r/svbulnAL0AeALAWK/91DwnZ2mCzKroFkLN7napKi
 fRho4CqzrtM=
 =NwEV
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - Mark arch_cpu_idle_dead() __noreturn, make all architectures &
   drivers that did this inconsistently follow this new, common
   convention, and fix all the fallout that objtool can now detect
   statically

 - Fix/improve the ORC unwinder becoming unreliable due to
   UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK
   and UNWIND_HINT_UNDEFINED to resolve it

 - Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code

 - Generate ORC data for __pfx code

 - Add more __noreturn annotations to various kernel startup/shutdown
   and panic functions

 - Misc improvements & fixes

* tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/hyperv: Mark hv_ghcb_terminate() as noreturn
  scsi: message: fusion: Mark mpt_halt_firmware() __noreturn
  x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
  btrfs: Mark btrfs_assertfail() __noreturn
  objtool: Include weak functions in global_noreturns check
  cpu: Mark nmi_panic_self_stop() __noreturn
  cpu: Mark panic_smp_self_stop() __noreturn
  arm64/cpu: Mark cpu_park_loop() and friends __noreturn
  x86/head: Mark *_start_kernel() __noreturn
  init: Mark start_kernel() __noreturn
  init: Mark [arch_call_]rest_init() __noreturn
  objtool: Generate ORC data for __pfx code
  x86/linkage: Fix padding for typed functions
  objtool: Separate prefix code from stack validation code
  objtool: Remove superfluous dead_end_function() check
  objtool: Add symbol iteration helpers
  objtool: Add WARN_INSN()
  scripts/objdump-func: Support multiple functions
  context_tracking: Fix KCSAN noinstr violation
  objtool: Add stackleak instrumentation to uaccess safe list
  ...
2023-04-28 14:02:54 -07:00
Josh Poimboeuf
52668badd3 x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
Fixes the following warning:

  vmlinux.o: warning: objtool: resume_play_dead+0x21: unreachable instruction

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/ce1407c4bf88b1334fe40413126343792a77ca50.1681342859.git.jpoimboe@kernel.org
2023-04-14 17:31:27 +02:00
David Woodhouse
805ae9dc3b x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()
When bringing up a secondary CPU from do_boot_cpu(), the warm reset flag
is set in CMOS and the starting IP for the trampoline written inside the
BDA at 0x467. Once the CPU is running, the CMOS flag is unset and the
value in the BDA cleared.

To allow for parallel bringup of CPUs, add a reference count to track the
number of CPUs currently bring brought up, and clear the state only when
the count reaches zero.

Since the RTC spinlock is required to write to the CMOS, it can be used
for mutual exclusion on the refcount too.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Link: https://lore.kernel.org/r/20230316222109.1940300-5-usama.arif@bytedance.com
2023-03-21 13:35:53 +01:00
Brian Gerst
8f6be6d870 x86/smpboot: Remove initial_gs
Given its CPU#, each CPU can find its own per-cpu offset, and directly set
GSBASE accordingly. The global variable can be eliminated.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230316222109.1940300-9-usama.arif@bytedance.com
2023-03-21 13:35:53 +01:00
Brian Gerst
c253b64020 x86/smpboot: Remove early_gdt_descr on 64-bit
Build the GDT descriptor on the stack instead.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230316222109.1940300-8-usama.arif@bytedance.com
2023-03-21 13:35:53 +01:00
Brian Gerst
3adee777ad x86/smpboot: Remove initial_stack on 64-bit
In order to facilitate parallel startup, start to eliminate some of the
global variables passing information to CPUs in the startup path.

However, start by introducing one more: smpboot_control. For now this
merely holds the CPU# of the CPU which is coming up. Each CPU can then
find its own per-cpu data, and everything else it needs can be found
from there, allowing the other global variables to be removed.

First to be removed is initial_stack. Each CPU can load %rsp from its
current_task->thread.sp instead. That is already set up with the correct
idle thread for APs. Set up the .sp field in INIT_THREAD on x86 so that
the BSP also finds a suitable stack pointer in the static per-cpu data
when coming up on first boot.

On resume from S3, the CPU needs a temporary stack because its idle task
is already active. Instead of setting initial_stack, the sleep code can
simply set its own current->thread.sp to point to the temporary stack.
Nobody else cares about ->thread.sp for a thread which is currently on
a CPU, because the true value is actually in the %rsp register. Which
is restored with the rest of the CPU context in do_suspend_lowlevel().

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Usama Arif <usama.arif@bytedance.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230316222109.1940300-7-usama.arif@bytedance.com
2023-03-21 13:35:53 +01:00
Srivatsa S. Bhat (VMware)
fcb3a81d22 x86/hotplug: Remove incorrect comment about mwait_play_dead()
The comment that says mwait_play_dead() returns only on failure is a bit
misleading because mwait_play_dead() could actually return for valid
reasons (such as mwait not being supported by the platform) that do not
indicate a failure of the CPU offline operation. So, remove the comment.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230128003751.141317-1-srivatsa@csail.mit.edu
2023-02-14 23:44:34 +01:00
Linus Torvalds
94a855111e - Add the call depth tracking mitigation for Retbleed which has
been long in the making. It is a lighterweight software-only fix for
 Skylake-based cores where enabling IBRS is a big hammer and causes a
 significant performance impact.
 
 What it basically does is, it aligns all kernel functions to 16 bytes
 boundary and adds a 16-byte padding before the function, objtool
 collects all functions' locations and when the mitigation gets applied,
 it patches a call accounting thunk which is used to track the call depth
 of the stack at any time.
 
 When that call depth reaches a magical, microarchitecture-specific value
 for the Return Stack Buffer, the code stuffs that RSB and avoids its
 underflow which could otherwise lead to the Intel variant of Retbleed.
 
 This software-only solution brings a lot of the lost performance back,
 as benchmarks suggest:
 
   https://lore.kernel.org/all/20220915111039.092790446@infradead.org/
 
 That page above also contains a lot more detailed explanation of the
 whole mechanism
 
 - Implement a new control flow integrity scheme called FineIBT which is
 based on the software kCFI implementation and uses hardware IBT support
 where present to annotate and track indirect branches using a hash to
 validate them
 
 - Other misc fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOZp5EACgkQEsHwGGHe
 VUrZFxAAvi/+8L0IYSK4mKJvixGbTFjxN/Swo2JVOfs34LqGUT6JaBc+VUMwZxdb
 VMTFIZ3ttkKEodjhxGI7oGev6V8UfhI37SmO2lYKXpQVjXXnMlv/M+Vw3teE38CN
 gopi+xtGnT1IeWQ3tc/Tv18pleJ0mh5HKWiW+9KoqgXj0wgF9x4eRYDz1TDCDA/A
 iaBzs56j8m/FSykZHnrWZ/MvjKNPdGlfJASUCPeTM2dcrXQGJ93+X2hJctzDte0y
 Nuiw6Y0htfFBE7xoJn+sqm5Okr+McoUM18/CCprbgSKYk18iMYm3ZtAi6FUQZS1A
 ua4wQCf49loGp15PO61AS5d3OBf5D3q/WihQRbCaJvTVgPp9sWYnWwtcVUuhMllh
 ZQtBU9REcVJ/22bH09Q9CjBW0VpKpXHveqQdqRDViLJ6v/iI6EFGmD24SW/VxyRd
 73k9MBGrL/dOf1SbEzdsnvcSB3LGzp0Om8o/KzJWOomrVKjBCJy16bwTEsCZEJmP
 i406m92GPXeaN1GhTko7vmF0GnkEdJs1GVCZPluCAxxbhHukyxHnrjlQjI4vC80n
 Ylc0B3Kvitw7LGJsPqu+/jfNHADC/zhx1qz/30wb5cFmFbN1aRdp3pm8JYUkn+l/
 zri2Y6+O89gvE/9/xUhMohzHsWUO7xITiBavewKeTP9GSWybWUs=
 =cRy1
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Borislav Petkov:

 - Add the call depth tracking mitigation for Retbleed which has been
   long in the making. It is a lighterweight software-only fix for
   Skylake-based cores where enabling IBRS is a big hammer and causes a
   significant performance impact.

   What it basically does is, it aligns all kernel functions to 16 bytes
   boundary and adds a 16-byte padding before the function, objtool
   collects all functions' locations and when the mitigation gets
   applied, it patches a call accounting thunk which is used to track
   the call depth of the stack at any time.

   When that call depth reaches a magical, microarchitecture-specific
   value for the Return Stack Buffer, the code stuffs that RSB and
   avoids its underflow which could otherwise lead to the Intel variant
   of Retbleed.

   This software-only solution brings a lot of the lost performance
   back, as benchmarks suggest:

       https://lore.kernel.org/all/20220915111039.092790446@infradead.org/

   That page above also contains a lot more detailed explanation of the
   whole mechanism

 - Implement a new control flow integrity scheme called FineIBT which is
   based on the software kCFI implementation and uses hardware IBT
   support where present to annotate and track indirect branches using a
   hash to validate them

 - Other misc fixes and cleanups

* tag 'x86_core_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
  x86/paravirt: Use common macro for creating simple asm paravirt functions
  x86/paravirt: Remove clobber bitmask from .parainstructions
  x86/debug: Include percpu.h in debugreg.h to get DECLARE_PER_CPU() et al
  x86/cpufeatures: Move X86_FEATURE_CALL_DEPTH from bit 18 to bit 19 of word 11, to leave space for WIP X86_FEATURE_SGX_EDECCSSA bit
  x86/Kconfig: Enable kernel IBT by default
  x86,pm: Force out-of-line memcpy()
  objtool: Fix weak hole vs prefix symbol
  objtool: Optimize elf_dirty_reloc_sym()
  x86/cfi: Add boot time hash randomization
  x86/cfi: Boot time selection of CFI scheme
  x86/ibt: Implement FineIBT
  objtool: Add --cfi to generate the .cfi_sites section
  x86: Add prefix symbols for function padding
  objtool: Add option to generate prefix symbols
  objtool: Avoid O(bloody terrible) behaviour -- an ode to libelf
  objtool: Slice up elf_create_section_symbol()
  kallsyms: Revert "Take callthunks into account"
  x86: Unconfuse CONFIG_ and X86_FEATURE_ namespaces
  x86/retpoline: Fix crash printing warning
  x86/paravirt: Fix a !PARAVIRT build warning
  ...
2022-12-14 15:03:00 -08:00
Linus Torvalds
3ef3ace4e2 - Split MTRR and PAT init code to accomodate at least Xen PV and TDX
guests which do not get MTRRs exposed but only PAT. (TDX guests do not
 support the cache disabling dance when setting up MTRRs so they fall
 under the same category.) This is a cleanup work to remove all the ugly
 workarounds for such guests and init things separately (Juergen Gross)
 
 - Add two new Intel CPUs to the list of CPUs with "normal" Energy
 Performance Bias, leading to power savings
 
 - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern Centaur
 CPUs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOYhIMACgkQEsHwGGHe
 VUpxug//ZKw3hYFroKhsULJi/e0j2nGARiSlJrJcFHl2vgh9yGvDsnYUyM/rgjgt
 cM3uCLbEG7nA6uhB3nupzaXZ8lBM1nU9kiEl/kjQ5oYf9nmJ48fLttvWGfxYN4s3
 kj5fYVhlOZpntQXIWrwxnPqghUysumMnZmBJeKYiYNNfkj62l3xU2Ni4Gnjnp02I
 9MmUhl7pj1aEyOQfM8rovy+wtYCg5WTOmXVlyVN+b9MwfYeK+stojvCZHxtJs9BD
 fezpJjjG+78xKUC7vVZXCh1p1N5Qvj014XJkVl9Hg0n7qizKFZRtqi8I769G2ptd
 exP8c2nDXKCqYzE8vK6ukWgDANQPs3d6Z7EqUKuXOCBF81PnMPSUMyNtQFGNM6Wp
 S5YSvFfCgUjp50IunOpvkDABgpM+PB8qeWUq72UFQJSOymzRJg/KXtE2X+qaMwtC
 0i6VLXfMddGcmqNKDppfGtCjq2W5VrNIIJedtAQQGyl+pl3XzZeNomhJpm/0mVfJ
 8UrlXZeXl/EUQ7qk40gC/Ash27pU9ZDx4CMNMy1jDIQqgufBjEoRIDSFqQlghmZq
 An5/BqMLhOMxUYNA7bRUnyeyxCBypetMdQt5ikBmVXebvBDmArXcuSNAdiy1uBFX
 KD8P3Y1AnsHIklxkLNyZRUy7fb4mgMFenUbgc0vmbYHbFl0C0pQ=
 =Zmgh
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Borislav Petkov:

 - Split MTRR and PAT init code to accomodate at least Xen PV and TDX
   guests which do not get MTRRs exposed but only PAT. (TDX guests do
   not support the cache disabling dance when setting up MTRRs so they
   fall under the same category)

   This is a cleanup work to remove all the ugly workarounds for such
   guests and init things separately (Juergen Gross)

 - Add two new Intel CPUs to the list of CPUs with "normal" Energy
   Performance Bias, leading to power savings

 - Do not do bus master arbitration in C3 (ARB_DISABLE) on modern
   Centaur CPUs

* tag 'x86_cpu_for_v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  x86/mtrr: Make message for disabled MTRRs more descriptive
  x86/pat: Handle TDX guest PAT initialization
  x86/cpuid: Carve out all CPUID functionality
  x86/cpu: Switch to cpu_feature_enabled() for X86_FEATURE_XENPV
  x86/cpu: Remove X86_FEATURE_XENPV usage in setup_cpu_entry_area()
  x86/cpu: Drop 32-bit Xen PV guest code in update_task_stack()
  x86/cpu: Remove unneeded 64-bit dependency in arch_enter_from_user_mode()
  x86/cpufeatures: Add X86_FEATURE_XENPV to disabled-features.h
  x86/acpi/cstate: Optimize ARB_DISABLE on Centaur CPUs
  x86/mtrr: Simplify mtrr_ops initialization
  x86/cacheinfo: Switch cache_ap_init() to hotplug callback
  x86: Decouple PAT and MTRR handling
  x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()
  x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init
  x86/mtrr: Get rid of __mtrr_enabled bool
  x86/mtrr: Simplify mtrr_bp_init()
  x86/mtrr: Remove set_all callback from struct mtrr_ops
  x86/mtrr: Disentangle MTRR init from PAT init
  x86/mtrr: Move cache control code to cacheinfo.c
  x86/mtrr: Split MTRR-specific handling from cache dis/enabling
  ...
2022-12-13 14:56:56 -08:00
Jason A. Donenfeld
b3883a9a1f stackprotector: move get_random_canary() into stackprotector.h
This has nothing to do with random.c and everything to do with stack
protectors. Yes, it uses randomness. But many things use randomness.
random.h and random.c are concerned with the generation of randomness,
not with each and every use. So move this function into the more
specific stackprotector.h file where it belongs.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18 02:18:10 +01:00
Juergen Gross
30f89e524b x86/cacheinfo: Switch cache_ap_init() to hotplug callback
Instead of explicitly calling cache_ap_init() in
identify_secondary_cpu() use a CPU hotplug callback instead. By
registering the callback only after having started the non-boot CPUs
and initializing cache_aps_delayed_init with "true", calling
set_cache_aps_delayed_init() at boot time can be dropped.

It should be noted that this change results in cache_ap_init() being
called a little bit later when hotplugging CPUs. By using a new
hotplug slot right at the start of the low level bringup this is not
problematic, as no operations requiring a specific caching mode are
performed that early in CPU initialization.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-15-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10 13:12:45 +01:00
Juergen Gross
0b9a6a8bed x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()
Instead of having a stop_machine() handler for either a specific
MTRR register or all state at once, add a handler just for calling
cache_cpu_init() if appropriate.

Add functions for calling stop_machine() with this handler as well.

Add a generic replacement for mtrr_bp_restore() and a wrapper for
mtrr_bp_init().

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-13-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10 13:12:45 +01:00
Juergen Gross
955d0e0805 x86/mtrr: Let cache_aps_delayed_init replace mtrr_aps_delayed_init
In order to prepare decoupling MTRR and PAT replace the MTRR-specific
mtrr_aps_delayed_init flag with a more generic cache_aps_delayed_init
one.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221102074713.21493-12-jgross@suse.com
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-11-10 13:12:45 +01:00
Thomas Gleixner
c063a217bc x86/percpu: Move current_top_of_stack next to current_task
Extend the struct pcpu_hot cacheline with current_top_of_stack;
another very frequently used value.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111145.493038635@infradead.org
2022-10-17 16:41:05 +02:00
Thomas Gleixner
e57ef2ed97 x86: Put hot per CPU variables into a struct
The layout of per-cpu variables is at the mercy of the compiler. This
can lead to random performance fluctuations from build to build.

Create a structure to hold some of the hottest per-cpu variables,
starting with current_task.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111145.179707194@infradead.org
2022-10-17 16:41:03 +02:00
Thomas Gleixner
1f19e2d50b x86/cpu: Get rid of redundant switch_to_new_gdt() invocations
The only place where switch_to_new_gdt() is required is early boot to
switch from the early GDT to the direct GDT. Any other invocation is
completely redundant because it does not change anything.

Secondary CPUs come out of the ASM code with GDT and GSBASE correctly set
up. The same is true for XEN_PV.

Remove all the voodoo invocations which are left overs from the ancient
past, rename the function to switch_gdt_and_percpu_base() and mark it init.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220915111143.198076128@infradead.org
2022-10-17 16:40:56 +02:00
Yury Norov
38bef8e57f smp: add set_nr_cpu_ids()
In preparation to support compile-time nr_cpu_ids, add a setter for
the variable.

This is a no-op for all arches.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
2022-09-19 17:51:53 -07:00
Sander Vanheule
adbcaef840 x86/cacheinfo: move shared cache map definitions
Patch series "cpumask: Fix invalid uniprocessor assumptions", v4.

On uniprocessor builds, it is currently assumed that any cpumask will
contain the single CPU: cpu0.  This assumption is used to provide
optimised implementations.

The current assumption also appears to be wrong, by ignoring the fact that
users can provide empty cpumasks.  This can result in bugs as explained in
[1] - for_each_cpu() will run one iteration of the loop even when passed
an empty cpumask.

This series introduces some basic tests, and updates the optimisations for
uniprocessor builds.

The x86 patch was written after the kernel test robot [2] ran into a
failed build.  I have tried to list the files potentially affected by the
changes to cpumask.h, in an attempt to find any other cases that fail on
!SMP.  I've gone through some of the files manually, and ran a few cross
builds, but nothing else popped up.  I (build) checked about half of the
potientally affected files, but I do not have the resources to do them
all.  I hope we can fix other issues if/when they pop up later.

[1] https://lore.kernel.org/all/20220530082552.46113-1-sander@svanheule.net/
[2] https://lore.kernel.org/all/202206060858.wA0FOzRy-lkp@intel.com/


This patch (of 5):

The maps to keep track of shared caches between CPUs on SMP systems are
declared in asm/smp.h, among them specifically cpu_llc_shared_map.  These
maps are externally defined in cpu/smpboot.c.  The latter is only compiled
on CONFIG_SMP=y, which means the declared extern symbols from asm/smp.h do
not have a corresponding definition on uniprocessor builds.

The inline cpu_llc_shared_mask() function from asm/smp.h refers to the map
declaration mentioned above.  This function is referenced in cacheinfo.c
inside for_each_cpu() loop macros, to provide cpumask for the loop.  On
uniprocessor builds, the symbol for the cpu_llc_shared_map does not exist.
However, the current implementation of for_each_cpu() also (wrongly)
ignores the provided mask.

By sheer luck, the compiler thus optimises out this unused reference to
cpu_llc_shared_map, and the linker therefore does not require the
cpu_llc_shared_mask to actually exist on uniprocessor builds.  Only on SMP
bulids does smpboot.o exist to provide the required symbols.

To no longer rely on compiler optimisations for successful uniprocessor
builds, move the definitions of cpu_llc_shared_map and cpu_l2c_shared_map
from smpboot.c to cacheinfo.c.

Link: https://lkml.kernel.org/r/cover.1656777646.git.sander@svanheule.net
Link: https://lkml.kernel.org/r/e8167ddb570f56744a3dc12c2149a660a324d969.1656777646.git.sander@svanheule.net
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Yury Norov <yury.norov@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17 17:31:40 -07:00
Linus Torvalds
a13dc4d409 - Serious sanitization and cleanup of the whole APERF/MPERF and
frequency invariance code along with removing the need for unnecessary IPIs
 
 - Finally remove a.out support
 
 - The usual trivial cleanups and fixes all over x86
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLn48ACgkQEsHwGGHe
 VUpbkg/+PELrc0y/qxLM/+dyftKYY16Rhk6ZVAXfwqlh5ldyVQcLMUgKwDqYyTn2
 XmgdI3cTcFlH2K7j6ANWLu0I9NPaviimUcEdMVcXt7aY5mGWk/q4hIyCYM8d41sV
 qKx4OjNSdyoofG6MtwFLJDuoeVg99Bqgvm4nP9BuxL0dZJ2hfcUZ7MTxYCx9ZYjK
 /3trx0NV287Yg/wm91EU0nLQzy9xbGS7WCmMnse6uxiUdm2vXbBt8oNFF4f747Dj
 0cArfNrMgYq4Cv5bgt/Ki0NU/n4EOGDpJUSyQwlnjDKeN81ESPy7IWtTQ6cE/rJK
 BZeUIPiGiYHwtqXv0UTAPGLG8cAqKeab8u0xAOyrFVDkTc0+WlPJRsUAOmRRGIGE
 M8ZjoxrLeuFgxw6vKpVjaA+mDRj3qEpSH+IrTcekS98PN7gmVzvq03GobgGbT7YB
 xmtbThJa+514FfUVckkyC0+A56BknUIgVxwFPqrthE2atzYTbH67hW4U0yVWXXr7
 2VI7ttozBrYVgHCWhD9eoT0uhyD74Vl6pqHnqzY9ShIfKVUGvMgKHHg04nLLtF7W
 hm87xV3Q5UEmXhTmDzT1rUZ99mBUxGbWxk227I9raMugIh7pp9wIr57+7O0LRYfX
 TdnE2+tL8RMi7+XzRH5iLhnwkrvahBESeHSQ7GVI1Y2zMmmFN+0=
 =Dks/
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:

 - Serious sanitization and cleanup of the whole APERF/MPERF and
   frequency invariance code along with removing the need for
   unnecessary IPIs

 - Finally remove a.out support

 - The usual trivial cleanups and fixes all over x86

* tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  x86: Remove empty files
  x86/speculation: Add missing srbds=off to the mitigations= help text
  x86/prctl: Remove pointless task argument
  x86/aperfperf: Make it correct on 32bit and UP kernels
  x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
  x86/aperfmperf: Replace arch_freq_get_on_cpu()
  x86/aperfmperf: Replace aperfmperf_get_khz()
  x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
  x86/aperfmperf: Make parts of the frequency invariance code unconditional
  x86/aperfmperf: Restructure arch_scale_freq_tick()
  x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
  x86/aperfmperf: Untangle Intel and AMD frequency invariance init
  x86/aperfmperf: Separate AP/BP frequency invariance init
  x86/smp: Move APERF/MPERF code where it belongs
  x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()
  x86/process: Fix kernel-doc warning due to a changed function name
  x86: Remove a.out support
  x86/mm: Replace nodes_weight() with nodes_empty() where appropriate
  x86: Replace cpumask_weight() with cpumask_empty() where appropriate
  x86/pkeys: Remove __arch_set_user_pkey_access() declaration
  ...
2022-05-23 18:17:09 -07:00
Linus Torvalds
3a755ebcc2 Intel Trust Domain Extensions
This is the Intel version of a confidential computing solution called
 Trust Domain Extensions (TDX). This series adds support to run the
 kernel as part of a TDX guest. It provides similar guest protections to
 AMD's SEV-SNP like guest memory and register state encryption, memory
 integrity protection and a lot more.
 
 Design-wise, it differs from AMD's solution considerably: it uses
 a software module which runs in a special CPU mode called (Secure
 Arbitration Mode) SEAM. As the name suggests, this module serves as sort
 of an arbiter which the confidential guest calls for services it needs
 during its lifetime.
 
 Just like AMD's SNP set, this series reworks and streamlines certain
 parts of x86 arch code so that this feature can be properly accomodated.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLbisACgkQEsHwGGHe
 VUqZLg/7B55iygCwzz0W/KLcXL2cISatUpzGbFs1XTbE9DMz06BPkOsEjF2k8ckv
 kfZjgqhSx3GvUI80gK0Tn2M2DfIj3nKuNSXd1pfextP7AxEf68FFJsQz1Ju7bHpT
 pZaG+g8IK4+mnEHEKTCO9ANg/Zw8yqJLdtsCaCNE9SUGUfQ6m/ujTEfsambXDHNm
 khyCAgpIGSOt51/4apoR9ebyrNCaeVbDawpIPjTy+iyFRc/WyaLFV9CQ8klw4gbw
 r/90x2JYxvAf0/z/ifT9Wa+TnYiQ0d4VjFbfr0iJ4GcPn5L3EIoIKPE8vPGMpoSX
 fLSzoNmAOT3ja57ytUUQ3o0edoRUIPEdixOebf9qWvE/aj7W37YRzrlJ8Ej/x9Jy
 HcI4WZF6Dr1bh6FnI/xX2eVZRzLOL4j9gNyPCwIbvgr1NjDqQnxU7nhxVMmQhJrs
 IdiEcP5WYerLKfka/uF//QfWUg5mDBgFa1/3xK57Z3j0iKWmgjaPpR0SWlOKjj8G
 tr0gGN9ejikZTqXKGsHn8fv/R3bjXvbVD8z0IEcx+MIrRmZPnX2QBlg7UA1AXV5n
 HoVwPFdH1QAtjZq1MRcL4hTOjz3FkS68rg7ZH0f2GWJAzWmEGytBIhECRnN/PFFq
 VwRB4dCCt0bzqRxkiH5lzdgR+xqRe61juQQsMzg+Flv/trpXDqM=
 =ac9K
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Intel TDX support from Borislav Petkov:
 "Intel Trust Domain Extensions (TDX) support.

  This is the Intel version of a confidential computing solution called
  Trust Domain Extensions (TDX). This series adds support to run the
  kernel as part of a TDX guest. It provides similar guest protections
  to AMD's SEV-SNP like guest memory and register state encryption,
  memory integrity protection and a lot more.

  Design-wise, it differs from AMD's solution considerably: it uses a
  software module which runs in a special CPU mode called (Secure
  Arbitration Mode) SEAM. As the name suggests, this module serves as
  sort of an arbiter which the confidential guest calls for services it
  needs during its lifetime.

  Just like AMD's SNP set, this series reworks and streamlines certain
  parts of x86 arch code so that this feature can be properly
  accomodated"

* tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  x86/tdx: Fix RETs in TDX asm
  x86/tdx: Annotate a noreturn function
  x86/mm: Fix spacing within memory encryption features message
  x86/kaslr: Fix build warning in KASLR code in boot stub
  Documentation/x86: Document TDX kernel architecture
  ACPICA: Avoid cache flush inside virtual machines
  x86/tdx/ioapic: Add shared bit for IOAPIC base address
  x86/mm: Make DMA memory shared for TD guest
  x86/mm/cpa: Add support for TDX shared memory
  x86/tdx: Make pages shared in ioremap()
  x86/topology: Disable CPU online/offline control for TDX guests
  x86/boot: Avoid #VE during boot for TDX platforms
  x86/boot: Set CR0.NE early and keep it set during the boot
  x86/acpi/x86/boot: Add multiprocessor wake-up support
  x86/boot: Add a trampoline for booting APs via firmware handoff
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Port I/O: Add early boot support
  x86/tdx: Port I/O: Add runtime hypercalls
  x86/boot: Port I/O: Add decompression-time support for TDX
  x86/boot: Port I/O: Allow to hook up alternative helpers
  ...
2022-05-23 17:51:12 -07:00
Thomas Gleixner
bb6e89df90 x86/aperfmperf: Make parts of the frequency invariance code unconditional
The frequency invariance support is currently limited to x86/64 and SMP,
which is the vast majority of machines.

arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF
and MPERF MSRs. The CPU frequency getters function do the same via dedicated
IPIs.

While it could be argued that on systems where frequency invariance support
is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can
be avoided, it does not make sense to keep the extra code and the resulting
runtime issues of mass IPIs around.

As a first step split out the non frequency invariance specific
initialization code and the read MSR portion of arch_scale_freq_tick(). The
rest of the code is still conditional and guarded with a static key.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.761988704@linutronix.de
2022-04-27 20:22:19 +02:00
Thomas Gleixner
0dfaf3f6ec x86/aperfmperf: Untangle Intel and AMD frequency invariance init
AMD boot CPU initialization happens late via ACPI/CPPC which prevents the
Intel parts from being marked __init.

Split out the common code and provide a dedicated interface for the AMD
initialization and mark the Intel specific code and data __init.

The remaining text size is almost cut in half:

  text:		2614	->	1350
  init.text:	   0	->	 786

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.592465719@linutronix.de
2022-04-27 20:22:19 +02:00
Thomas Gleixner
138a7f9c6b x86/aperfmperf: Separate AP/BP frequency invariance init
This code is convoluted and because it can be invoked post init via the
ACPI/CPPC code, all of the initialization functionality is built in instead
of being part of init text and init data.

As a first step create separate calls for the boot and the application
processors.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.536733494@linutronix.de
2022-04-27 15:51:08 +02:00
Thomas Gleixner
55cb0b7074 x86/smp: Move APERF/MPERF code where it belongs
as this can share code with the preexisting APERF/MPERF code.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20220415161206.478362457@linutronix.de
2022-04-27 15:51:08 +02:00
Sean Christopherson
ff2e64684f x86/boot: Add a trampoline for booting APs via firmware handoff
Historically, x86 platforms have booted secondary processors (APs)
using INIT followed by the start up IPI (SIPI) messages. In regular
VMs, this boot sequence is supported by the VMM emulation. But such a
wakeup model is fatal for secure VMs like TDX in which VMM is an
untrusted entity. To address this issue, a new wakeup model was added
in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot
the APs. More details about this wakeup model can be found in ACPI
specification v6.4, the section titled "Multiprocessor Wakeup Structure".

Since the existing trampoline code requires processors to boot in real
mode with 16-bit addressing, it will not work for this wakeup model
(because it boots the AP in 64-bit mode). To handle it, extend the
trampoline code to support 64-bit mode firmware handoff. Also, extend
IDT and GDT pointers to support 64-bit mode hand off.

There is no TDX-specific detection for this new boot method. The kernel
will rely on it as the sole boot method whenever the new ACPI structure
is present.

The ACPI table parser for the MADT multiprocessor wake up structure and
the wakeup method that uses this structure will be added by the following
patch in this series.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-21-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Tom Lendacky
0afb6b660a x86/sev: Use SEV-SNP AP creation to start secondary CPUs
To provide a more secure way to start APs under SEV-SNP, use the SEV-SNP
AP Creation NAE event. This allows for guest control over the AP register
state rather than trusting the hypervisor with the SEV-ES Jump Table
address.

During native_smp_prepare_cpus(), invoke an SEV-SNP function that, if
SEV-SNP is active, will set/override apic->wakeup_secondary_cpu. This
will allow the SEV-SNP AP Creation NAE event method to be used to boot
the APs. As a result of installing the override when SEV-SNP is active,
this method of starting the APs becomes the required method. The override
function will fail to start the AP if the hypervisor does not have
support for AP creation.

  [ bp: Work in forgotten review comments. ]

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220307213356.2797205-23-brijesh.singh@amd.com
2022-04-06 17:06:49 +02:00
Huang Rui
eb5616d4ad x86/ACPI: CPPC: Move init_freq_invariance_cppc() into x86 CPPC
The init_freq_invariance_cppc code actually doesn't need the SMP
functionality. So setting the CONFIG_SMP as the check condition for
init_freq_invariance_cppc may cause the confusion to misunderstand the
CPPC. And the x86 CPPC file is better space to store the CPPC related
functions, while the init_freq_invariance_cppc is out of smpboot, that
means, the CONFIG_SMP won't be mandatory condition any more. And It's more
clear than before.

Signed-off-by: Huang Rui <ray.huang@amd.com>
[ rjw: Subject adjustment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-03-08 19:16:43 +01:00
Huang Rui
666f6ecf35 x86: Expose init_freq_invariance() to topology header
The function init_freq_invariance will be used on x86 CPPC, so expose it in
the topology header.

Signed-off-by: Huang Rui <ray.huang@amd.com>
[ rjw: Subject adjustment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-03-08 19:16:43 +01:00
Huang Rui
82d8936914 x86/ACPI: CPPC: Move AMD maximum frequency ratio setting function into x86 CPPC
The AMD maximum frequency ratio setting function depends on CPPC, so the
x86 CPPC implementation file is better space for this function.

Signed-off-by: Huang Rui <ray.huang@amd.com>
[ rjw: Subject adjustment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-03-08 19:16:43 +01:00
Peter Zijlstra
cabdc3a847 sched,x86: Don't use cluster topology for x86 hybrid CPUs
For x86 hybrid CPUs like Alder Lake, the order of CPU selection should
be based strictly on CPU priority.  Don't include cluster topology for
hybrid CPUs to avoid interference with such CPU selection order.

On Alder Lake, the Atom CPU cluster has more capacity (4 Atom CPUs) vs
Big core cluster (2 hyperthread CPUs). This could potentially bias CPU
selection towards Atom over Big Core, when Big core CPU has higher
priority.

Fixes: 66558b730f ("sched: Add cluster scheduler level for x86")
Suggested-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/20211204091402.GM16608@worktop.programming.kicks-ass.net
2021-12-08 22:15:37 +01:00
Boris Ostrovsky
ce2612b670 x86/smp: Factor out parts of native_smp_prepare_cpus()
Commit 66558b730f ("sched: Add cluster scheduler level for x86")
introduced cpu_l2c_shared_map mask which is expected to be initialized
by smp_op.smp_prepare_cpus(). That commit only updated
native_smp_prepare_cpus() version but not xen_pv_smp_prepare_cpus().
As result Xen PV guests crash in set_cpu_sibling_map().

While the new mask can be allocated in xen_pv_smp_prepare_cpus() one can
see that both versions of smp_prepare_cpus ops share a number of common
operations that can be factored out. So do that instead.

Fixes: 66558b730f ("sched: Add cluster scheduler level for x86")
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lkml.kernel.org/r/1635896196-18961-1-git-send-email-boris.ostrovsky@oracle.com
2021-11-11 13:09:32 +01:00
Linus Torvalds
18398bb825 The usual round of random minor fixes and cleanups all over the place.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/ucwACgkQEsHwGGHe
 VUpLMQ//d4xim4zD4hQVleYkGWqA2nB050QtutIto1nvsiZdjrUSMjJGZnos2nLd
 9tY3NZtgrFfAyjUkal098L+2zed+U6UemIV6kT1F3TnWg4dYByxYABNutOsQGUgw
 o4sTwjG7ELC273yPt/WY9TMwfMCiX7t80QkjoSeWbkApdfB0aZoxB0CvdLBKwCl/
 bxdfX1uvqW7sc6fatcI634hC1HDw8GJThym4/lrMHq2Pr8n/U6pEWoBFsdlprnLk
 pqb3IGX3kNnpjTmCpZxvd4ZQV8xUlMcJkdEFjKDf7BLtWjwZxPIdGcfnxrpf2EJQ
 yVZklcabaBNz/zNkoQeyD6Ix1ZCFSxcHRhg0BJpvvhzQ91My2pGZgLuzUYz3Fk7G
 GjWZje8WZcL3ViL9oGbOYMLSw76wov95+8WMiyKqPaNuzZbS3py5C/ThgqpCdg5b
 WyQe0GhUvthzLsVz9Gu7OFrbZl6VBz8q7/bxuo+vpFhgC1EiOj2yPSZNUJBRKdcd
 cFSfybcjk3Qyf7YXmZ/NcD9TQARQO1ediRY6ZNeZr7JYPzyebY+wTfHqDvdX65S5
 i/zgeAX4XAuX4pl28nJvDe8x1P7t5T8L6Qno9Lnd1xMG7jWift9RSEOo29rUp0sw
 gA9xV/BsmApvyM8pgD/lAqxAFzGkYfSy8bB6uav8HccHprVfJE0=
 =4BVs
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Borislav Petkov:
 "The usual round of random minor fixes and cleanups all over the place"

* tag 'x86_cleanups_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/Makefile: Remove unneeded whitespaces before tabs
  x86/of: Kill unused early_init_dt_scan_chosen_arch()
  x86: Fix misspelled Kconfig symbols
  x86/Kconfig: Remove references to obsolete Kconfig symbols
  x86/smp: Remove unnecessary assignment to local var freq_scale
2021-11-01 15:25:08 -07:00
Linus Torvalds
8cb1ae19bf x86/fpu updates:
- Cleanup of extable fixup handling to be more robust, which in turn
    allows to make the FPU exception fixups more robust as well.
 
  - Change the return code for signal frame related failures from explicit
    error codes to a boolean fail/success as that's all what the calling
    code evaluates.
 
  - A large refactoring of the FPU code to prepare for adding AMX support:
 
    - Distangle the public header maze and remove especially the misnomed
      kitchen sink internal.h which is despite it's name included all over
      the place.
 
    - Add a proper abstraction for the register buffer storage (struct
      fpstate) which allows to dynamically size the buffer at runtime by
      flipping the pointer to the buffer container from the default
      container which is embedded in task_struct::tread::fpu to a
      dynamically allocated container with a larger register buffer.
 
    - Convert the code over to the new fpstate mechanism.
 
    - Consolidate the KVM FPU handling by moving the FPU related code into
      the FPU core which removes the number of exports and avoids adding
      even more export when AMX has to be supported in KVM. This also
      removes duplicated code which was of course unnecessary different and
      incomplete in the KVM copy.
 
    - Simplify the KVM FPU buffer handling by utilizing the new fpstate
      container and just switching the buffer pointer from the user space
      buffer to the KVM guest buffer when entering vcpu_run() and flipping
      it back when leaving the function. This cuts the memory requirements
      of a vCPU for FPU buffers in half and avoids pointless memory copy
      operations.
 
      This also solves the so far unresolved problem of adding AMX support
      because the current FPU buffer handling of KVM inflicted a circular
      dependency between adding AMX support to the core and to KVM.  With
      the new scheme of switching fpstate AMX support can be added to the
      core code without affecting KVM.
 
    - Replace various variables with proper data structures so the extra
      information required for adding dynamically enabled FPU features (AMX)
      can be added in one place
 
  - Add AMX (Advanved Matrix eXtensions) support (finally):
 
     AMX is a large XSTATE component which is going to be available with
     Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD)
     which allows to trap the (first) use of an AMX related instruction,
     which has two benefits:
 
     1) It allows the kernel to control access to the feature
 
     2) It allows the kernel to dynamically allocate the large register
        state buffer instead of burdening every task with the the extra 8K
        or larger state storage.
 
     It would have been great to gain this kind of control already with
     AVX512.
 
     The support comes with the following infrastructure components:
 
     1) arch_prctl() to
        - read the supported features (equivalent to XGETBV(0))
        - read the permitted features for a task
        - request permission for a dynamically enabled feature
 
        Permission is granted per process, inherited on fork() and cleared
        on exec(). The permission policy of the kernel is restricted to
        sigaltstack size validation, but the syscall obviously allows
        further restrictions via seccomp etc.
 
     2) A stronger sigaltstack size validation for sys_sigaltstack(2) which
        takes granted permissions and the potentially resulting larger
        signal frame into account. This mechanism can also be used to
        enforce factual sigaltstack validation independent of dynamic
        features to help with finding potential victims of the 2K
        sigaltstack size constant which is broken since AVX512 support was
        added.
 
     3) Exception handling for #NM traps to catch first use of a extended
        feature via a new cause MSR. If the exception was caused by the use
        of such a feature, the handler checks permission for that
        feature. If permission has not been granted, the handler sends a
        SIGILL like the #UD handler would do if the feature would have been
        disabled in XCR0. If permission has been granted, then a new fpstate
        which fits the larger buffer requirement is allocated.
 
        In the unlikely case that this allocation fails, the handler sends
        SIGSEGV to the task. That's not elegant, but unavoidable as the
        other discussed options of preallocation or full per task
        permissions come with their own set of horrors for kernel and/or
        userspace. So this is the lesser of the evils and SIGSEGV caused by
        unexpected memory allocation failures is not a fundamentally new
        concept either.
 
        When allocation succeeds, the fpstate properties are filled in to
        reflect the extended feature set and the resulting sizes, the
        fpu::fpstate pointer is updated accordingly and the trap is disarmed
        for this task permanently.
 
     4) Enumeration and size calculations
 
     5) Trap switching via MSR_XFD
 
        The XFD (eXtended Feature Disable) MSR is context switched with the
        same life time rules as the FPU register state itself. The mechanism
        is keyed off with a static key which is default disabled so !AMX
        equipped CPUs have zero overhead. On AMX enabled CPUs the overhead
        is limited by comparing the tasks XFD value with a per CPU shadow
        variable to avoid redundant MSR writes. In case of switching from a
        AMX using task to a non AMX using task or vice versa, the extra MSR
        write is obviously inevitable.
 
        All other places which need to be aware of the variable feature sets
        and resulting variable sizes are not affected at all because they
        retrieve the information (feature set, sizes) unconditonally from
        the fpstate properties.
 
     6) Enable the new AMX states
 
   Note, this is relatively new code despite the fact that AMX support is in
   the works for more than a year now.
 
   The big refactoring of the FPU code, which allowed to do a proper
   integration has been started exactly 3 weeks ago. Refactoring of the
   existing FPU code and of the original AMX patches took a week and has
   been subject to extensive review and testing. The only fallout which has
   not been caught in review and testing right away was restricted to AMX
   enabled systems, which is completely irrelevant for anyone outside Intel
   and their early access program. There might be dragons lurking as usual,
   but so far the fine grained refactoring has held up and eventual yet
   undetected fallout is bisectable and should be easily addressable before
   the 5.16 release. Famous last words...
 
   Many thanks to Chang Bae and Dave Hansen for working hard on this and
   also to the various test teams at Intel who reserved extra capacity to
   follow the rapid development of this closely which provides the
   confidence level required to offer this rather large update for inclusion
   into 5.16-rc1.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/NkITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodDkEADH4+/nN/QoSUHIuuha5Zptj3g2b16a
 /3TxT9fhwPen/kzMGsUk70s3iWJMA+I5dCfkSZexJ2hfhcRe9cBzZIa1HCawKwf3
 YCISTsO/M+LpeORuZ+TpfFLJKnxNr1SEOl+EYffGhq0AkCjifb9Cnr0JZuoMUzGU
 jpfJZ2bj28ri5lG812DtzSMBM9E3SAwgJv+GNjmZbxZKb9mAfhbAMdBUXHirX7Ej
 jmx6koQjYOKwYIW8w1BrdC270lUKQUyJTbQgdRkN9Mh/HnKyFixQ18JqGlgaV2cT
 EtYePUfTEdaHdAhUINLIlEug1MfOslHU+HyGsdywnoChNB4GHPQuePC5Tz60VeFN
 RbQ9aKcBUu8r95rjlnKtAtBijNMA4bjGwllVxNwJ/ZoA9RPv1SbDZ07RX3qTaLVY
 YhVQl8+shD33/W24jUTJv1kMMexpHXIlv0gyfMryzpwI7uzzmGHRPAokJdbYKctC
 dyMPfdE90rxTiMUdL/1IQGhnh3awjbyfArzUhHyQ++HyUyzCFh0slsO0CD18vUy8
 FofhCugGBhjuKw3XwLNQ+KsWURz5qHctSzBc3qMOSyqFHbAJCVRANkhsFvWJo2qL
 75+Z7OTRebtsyOUZIdq26r4roSxHrps3dupWTtN70HWx2NhQG1nLEw986QYiQu1T
 hcKvDmehQLrUvg==
 =x3WL
 -----END PGP SIGNATURE-----

Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fpu updates from Thomas Gleixner:

 - Cleanup of extable fixup handling to be more robust, which in turn
   allows to make the FPU exception fixups more robust as well.

 - Change the return code for signal frame related failures from
   explicit error codes to a boolean fail/success as that's all what the
   calling code evaluates.

 - A large refactoring of the FPU code to prepare for adding AMX
   support:

      - Distangle the public header maze and remove especially the
        misnomed kitchen sink internal.h which is despite it's name
        included all over the place.

      - Add a proper abstraction for the register buffer storage (struct
        fpstate) which allows to dynamically size the buffer at runtime
        by flipping the pointer to the buffer container from the default
        container which is embedded in task_struct::tread::fpu to a
        dynamically allocated container with a larger register buffer.

      - Convert the code over to the new fpstate mechanism.

      - Consolidate the KVM FPU handling by moving the FPU related code
        into the FPU core which removes the number of exports and avoids
        adding even more export when AMX has to be supported in KVM.
        This also removes duplicated code which was of course
        unnecessary different and incomplete in the KVM copy.

      - Simplify the KVM FPU buffer handling by utilizing the new
        fpstate container and just switching the buffer pointer from the
        user space buffer to the KVM guest buffer when entering
        vcpu_run() and flipping it back when leaving the function. This
        cuts the memory requirements of a vCPU for FPU buffers in half
        and avoids pointless memory copy operations.

        This also solves the so far unresolved problem of adding AMX
        support because the current FPU buffer handling of KVM inflicted
        a circular dependency between adding AMX support to the core and
        to KVM. With the new scheme of switching fpstate AMX support can
        be added to the core code without affecting KVM.

      - Replace various variables with proper data structures so the
        extra information required for adding dynamically enabled FPU
        features (AMX) can be added in one place

 - Add AMX (Advanced Matrix eXtensions) support (finally):

   AMX is a large XSTATE component which is going to be available with
   Saphire Rapids XEON CPUs. The feature comes with an extra MSR
   (MSR_XFD) which allows to trap the (first) use of an AMX related
   instruction, which has two benefits:

    1) It allows the kernel to control access to the feature

    2) It allows the kernel to dynamically allocate the large register
       state buffer instead of burdening every task with the the extra
       8K or larger state storage.

   It would have been great to gain this kind of control already with
   AVX512.

   The support comes with the following infrastructure components:

    1) arch_prctl() to
        - read the supported features (equivalent to XGETBV(0))
        - read the permitted features for a task
        - request permission for a dynamically enabled feature

       Permission is granted per process, inherited on fork() and
       cleared on exec(). The permission policy of the kernel is
       restricted to sigaltstack size validation, but the syscall
       obviously allows further restrictions via seccomp etc.

    2) A stronger sigaltstack size validation for sys_sigaltstack(2)
       which takes granted permissions and the potentially resulting
       larger signal frame into account. This mechanism can also be used
       to enforce factual sigaltstack validation independent of dynamic
       features to help with finding potential victims of the 2K
       sigaltstack size constant which is broken since AVX512 support
       was added.

    3) Exception handling for #NM traps to catch first use of a extended
       feature via a new cause MSR. If the exception was caused by the
       use of such a feature, the handler checks permission for that
       feature. If permission has not been granted, the handler sends a
       SIGILL like the #UD handler would do if the feature would have
       been disabled in XCR0. If permission has been granted, then a new
       fpstate which fits the larger buffer requirement is allocated.

       In the unlikely case that this allocation fails, the handler
       sends SIGSEGV to the task. That's not elegant, but unavoidable as
       the other discussed options of preallocation or full per task
       permissions come with their own set of horrors for kernel and/or
       userspace. So this is the lesser of the evils and SIGSEGV caused
       by unexpected memory allocation failures is not a fundamentally
       new concept either.

       When allocation succeeds, the fpstate properties are filled in to
       reflect the extended feature set and the resulting sizes, the
       fpu::fpstate pointer is updated accordingly and the trap is
       disarmed for this task permanently.

    4) Enumeration and size calculations

    5) Trap switching via MSR_XFD

       The XFD (eXtended Feature Disable) MSR is context switched with
       the same life time rules as the FPU register state itself. The
       mechanism is keyed off with a static key which is default
       disabled so !AMX equipped CPUs have zero overhead. On AMX enabled
       CPUs the overhead is limited by comparing the tasks XFD value
       with a per CPU shadow variable to avoid redundant MSR writes. In
       case of switching from a AMX using task to a non AMX using task
       or vice versa, the extra MSR write is obviously inevitable.

       All other places which need to be aware of the variable feature
       sets and resulting variable sizes are not affected at all because
       they retrieve the information (feature set, sizes) unconditonally
       from the fpstate properties.

    6) Enable the new AMX states

   Note, this is relatively new code despite the fact that AMX support
   is in the works for more than a year now.

   The big refactoring of the FPU code, which allowed to do a proper
   integration has been started exactly 3 weeks ago. Refactoring of the
   existing FPU code and of the original AMX patches took a week and has
   been subject to extensive review and testing. The only fallout which
   has not been caught in review and testing right away was restricted
   to AMX enabled systems, which is completely irrelevant for anyone
   outside Intel and their early access program. There might be dragons
   lurking as usual, but so far the fine grained refactoring has held up
   and eventual yet undetected fallout is bisectable and should be
   easily addressable before the 5.16 release. Famous last words...

   Many thanks to Chang Bae and Dave Hansen for working hard on this and
   also to the various test teams at Intel who reserved extra capacity
   to follow the rapid development of this closely which provides the
   confidence level required to offer this rather large update for
   inclusion into 5.16-rc1

* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  Documentation/x86: Add documentation for using dynamic XSTATE features
  x86/fpu: Include vmalloc.h for vzalloc()
  selftests/x86/amx: Add context switch test
  selftests/x86/amx: Add test cases for AMX state management
  x86/fpu/amx: Enable the AMX feature in 64-bit mode
  x86/fpu: Add XFD handling for dynamic states
  x86/fpu: Calculate the default sizes independently
  x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
  x86/fpu/xstate: Add fpstate_realloc()/free()
  x86/fpu/xstate: Add XFD #NM handler
  x86/fpu: Update XFD state where required
  x86/fpu: Add sanity checks for XFD
  x86/fpu: Add XFD state to fpstate
  x86/msr-index: Add MSRs for XFD
  x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
  x86/fpu: Reset permission and fpstate on exec()
  x86/fpu: Prepare fpu_clone() for dynamically enabled features
  x86/fpu/signal: Prepare for variable sigframe length
  x86/signal: Use fpu::__state_user_size for sigalt stack validation
  ...
2021-11-01 14:03:56 -07:00
Peter Zijlstra
55409ac5c3 sched,x86: Fix L2 cache mask
Currently AMD/Hygon do not populate l2c_id, this means that for SMT
enabled systems they report an L2 per thread. This is ofcourse not
true but was harmless so far.

However, since commit: 66558b730f ("sched: Add cluster scheduler
level for x86") the scheduler topology setup requires:

  SMT <= L2 <= LLC

Which leads to noisy warnings and possibly weird behaviour on affected
chips.

Therefore change the topology generation such that if l2c_id is not
populated it follows the SMT topology, thereby satisfying the
constraint.

Fixes: 66558b730f ("sched: Add cluster scheduler level for x86")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
2021-10-22 18:21:28 +02:00
Thomas Gleixner
b56d2795b2 x86/fpu: Replace the includes of fpu/internal.h
Now that the file is empty, fixup all references with the proper includes
and delete the former kitchen sink.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011540.001197214@linutronix.de
2021-10-20 15:27:29 +02:00
Tim Chen
66558b730f sched: Add cluster scheduler level for x86
There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is
shared among a cluster of cores instead of being exclusive to one
single core.

To prevent oversubscription of L2 cache, load should be balanced
between such L2 clusters, especially for tasks with no shared data.
On benchmark such as SPECrate mcf test, this change provides a boost
to performance especially on medium load system on Jacobsville.  on a
Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4
cores each, the benchmark number is as follow:

 Improvement over baseline kernel for mcf_r
 copies		run time	base rate
 1		-0.1%		-0.2%
 6		25.1%		25.1%
 12		18.8%		19.0%
 24		0.3%		0.3%

So this looks pretty good. In terms of the system's task distribution,
some pretty bad clumping can be seen for the vanilla kernel without
the L2 cluster domain for the 6 and 12 copies case. With the extra
domain for cluster, the load does get evened out between the clusters.

Note this patch isn't an universal win as spreading isn't necessarily
a win, particually for those workload who can benefit from packing.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
2021-10-15 11:25:16 +02:00
Tim Gardner
85784470ef x86/smp: Remove unnecessary assignment to local var freq_scale
Coverity warns of an unused value in arch_scale_freq_tick():

  CID 100778 (#1 of 1): Unused value (UNUSED_VALUE)
  assigned_value: Assigning value 1024ULL to freq_scale here, but that stored
  value is overwritten before it can be used.

It was introduced by commit:

  e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")

Remove the variable initializer.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lkml.kernel.org/r/20210910184405.24422-1-tim.gardner@canonical.com
2021-09-17 21:20:34 +02:00
Balbir Singh
c52787b590 x86/smp: Add a per-cpu view of SMT state
A new field smt_active in cpuinfo_x86 identifies if the current core/cpu
is in SMT mode or not.

This is helpful when the system has some of its cores with threads offlined
and can be used for cases where action is taken based on the state of SMT.

The upcoming support for paranoid L1D flush will make use of this information.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-2-sblbir@amazon.com
2021-07-28 11:42:23 +02:00
Linus Torvalds
e5a0fc4e20 CPU setup code changes:
- Clean up & simplify AP exception handling setup.
 
  - Consolidate the disjoint IDT setup code living in
    idt_setup_traps() and idt_setup_ist_traps() into
    a single idt_setup_traps() initialization function
    and call it before cpu_init().
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZdu0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gWgA//ecu2FqCFT2gpZP7ABdJqAhtYA8f7rg/a
 BMKNNfTha/L9Dot2PSqMq8oVi8NA++EbKjuWeFujAIoU/2vso+YrmHc84O35nOOF
 u2Zgps7UK+ffKo11Yu1Vpb911ctJClAwL0CemsC30QDbpGVHPecLuxOIgY6E6BwC
 qhNqNLp0K4bFRq0ya27O8RPiz9LjCzUILHHvWSAl5m5tWqovED8aXdjrDJcFXqwY
 u9nuuRpUpQWqCldZP9X7+pdo4Z2HZjvIBjqHD/wl3VMjV6q+k+su6AjV9p1D8hoz
 otY96i8MQjD/sgIa1H+tUc2ZusGzDls+EpYiGaPmqeXMitKEwOFpVDAaT8SelUms
 bR4VQ9IYB1NG7Qbco3NQHMV1sWuvUJcLG6ILYFWXgH0hP1EDHFn/TvOn0rfJysbE
 AmCpwmUo0b8Bj6nbKkVcXxoX1FdeqiM5+cPxHxGVgxVoR0Umz13EX4y4cBzSIRht
 eYwT6H1CxR9a4TIr8cMBsN14QsnV3f6lv/RNfVdmZEJVVr0boRI90L2xMLBB9RkP
 z03g7VvfMuSWnKyOFheP4ae9ul2qxAT380+g1oHQH0XIFtj9yIhzJHpoUCzhgCra
 Ui2Z71Dhq0R1UNpPsPfc1XkQI9chiahn8gc1u2zvN4SzZa6DZH22VvGNK0ghoIxq
 5WFho50hNIk=
 =BPbv
 -----END PGP SIGNATURE-----

Merge tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 exception handling updates from Ingo Molnar:

 - Clean up & simplify AP exception handling setup.

 - Consolidate the disjoint IDT setup code living in idt_setup_traps()
   and idt_setup_ist_traps() into a single idt_setup_traps()
   initialization function and call it before cpu_init().

* tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/idt: Rework IDT setup for boot CPU
  x86/cpu: Init AP exception handling from cpu_init_secondary()
2021-06-28 12:46:30 -07:00
Ingo Molnar
a9e906b71f Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-03 19:00:49 +02:00
Borislav Petkov
b1efd0ff4b x86/cpu: Init AP exception handling from cpu_init_secondary()
SEV-ES guests require properly setup task register with which the TSS
descriptor in the GDT can be located so that the IST-type #VC exception
handler which they need to function properly, can be executed.

This setup needs to happen before attempting to load microcode in
ucode_cpu_init() on secondary CPUs which can cause such #VC exceptions.

Simplify the machinery by running that exception setup from a new function
cpu_init_secondary() and explicitly call cpu_init_exception_handling() for
the boot CPU before cpu_init(). The latter prepares for fixing and
simplifying the exception/IST setup on the boot CPU.

There should be no functional changes resulting from this patch.

[ tglx: Reworked it so cpu_init_exception_handling() stays seperate ]

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>                                                                                                                                                                                                                        
Link: https://lore.kernel.org/r/87k0o6gtvu.ffs@nanos.tec.linutronix.de
2021-05-18 14:49:21 +02:00
Huang Rui
3743d55b28 x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations
Some AMD Ryzen generations has different calculation method on maximum
performance. 255 is not for all ASICs, some specific generations should use 166
as the maximum performance. Otherwise, it will report incorrect frequency value
like below:

  ~ → lscpu | grep MHz
  CPU MHz:                         3400.000
  CPU max MHz:                     7228.3198
  CPU min MHz:                     2200.0000

[ mingo: Tidied up whitespace use. ]
[ Alexander Monakov <amonakov@ispras.ru>: fix 225 -> 255 typo. ]

Fixes: 41ea667227 ("x86, sched: Calculate frequency invariance for AMD systems")
Fixes: 3c55e94c0a ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
Reported-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Fixed-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210425073451.2557394-1-ray.huang@amd.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211791
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-13 12:10:24 +02:00
Valentin Schneider
f1a0a376ca sched/core: Initialize the idle task with preemption disabled
As pointed out by commit

  de9b8f5dcb ("sched: Fix crash trying to dequeue/enqueue the idle thread")

init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.

As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().

Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> idle_init().

Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().

Secondary startups were patched via coccinelle:

  @begone@
  @@

  -preempt_disable();
  ...
  cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
2021-05-12 13:01:45 +02:00
Wan Jiabing
3cf4524ce4 x86/smpboot: Remove duplicate includes
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210427063835.9039-1-wanjiabing@vivo.com
2021-05-05 21:50:13 +02:00
Linus Torvalds
c6536676c7 - turn the stack canary into a normal __percpu variable on 32-bit which
gets rid of the LAZY_GS stuff and a lot of code.
 
 - Add an insn_decode() API which all users of the instruction decoder
 should preferrably use. Its goal is to keep the details of the
 instruction decoder away from its users and simplify and streamline how
 one decodes insns in the kernel. Convert its users to it.
 
 - kprobes improvements and fixes
 
 - Set the maximum DIE per package variable on Hygon
 
 - Rip out the dynamic NOP selection and simplify all the machinery around
 selecting NOPs. Use the simplified NOPs in objtool now too.
 
 - Add Xeon Sapphire Rapids to list of CPUs that support PPIN
 
 - Simplify the retpolines by folding the entire thing into an
 alternative now that objtool can handle alternatives with stack
 ops. Then, have objtool rewrite the call to the retpoline with the
 alternative which then will get patched at boot time.
 
 - Document Intel uarch per models in intel-family.h
 
 - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the
 exception on Intel.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCHyJQACgkQEsHwGGHe
 VUpjiRAAwPZdwwp08ypZuMHR4EhLNru6gYhbAoALGgtYnQjLtn5onQhIeieK+R4L
 cmZpxHT9OFp5dXHk4kwygaQBsD4pPOiIpm60kye1dN3cSbOORRdkwEoQMpKMZ+5Y
 kvVsmn7lrwRbp600KdE4G6L5+N6gEgr0r6fMFWWGK3mgVAyCzPexVHgydcp131ch
 iYMo6/pPDcNkcV/hboVKgx7GISdQ7L356L1MAIW/Sxtw6uD/X4qGYW+kV2OQg9+t
 nQDaAo7a8Jqlop5W5TQUdMLKQZ1xK8SFOSX/nTS15DZIOBQOGgXR7Xjywn1chBH/
 PHLwM5s4XF6NT5VlIA8tXNZjWIZTiBdldr1kJAmdDYacrtZVs2LWSOC0ilXsd08Z
 EWtvcpHfHEqcuYJlcdALuXY8xDWqf6Q2F7BeadEBAxwnnBg+pAEoLXI/1UwWcmsj
 wpaZTCorhJpYo2pxXckVdHz2z0LldDCNOXOjjaWU8tyaOBKEK6MgAaYU7e0yyENv
 mVc9n5+WuvXuivC6EdZ94Pcr/KQsd09ezpJYcVfMDGv58YZrb6XIEELAJIBTu2/B
 Ua8QApgRgetx+1FKb8X6eGjPl0p40qjD381TADb4rgETPb1AgKaQflmrSTIik+7p
 O+Eo/4x/GdIi9jFk3K+j4mIznRbUX0cheTJgXoiI4zXML9Jv94w=
 =bm4S
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 updates from Borislav Petkov:

 - Turn the stack canary into a normal __percpu variable on 32-bit which
   gets rid of the LAZY_GS stuff and a lot of code.

 - Add an insn_decode() API which all users of the instruction decoder
   should preferrably use. Its goal is to keep the details of the
   instruction decoder away from its users and simplify and streamline
   how one decodes insns in the kernel. Convert its users to it.

 - kprobes improvements and fixes

 - Set the maximum DIE per package variable on Hygon

 - Rip out the dynamic NOP selection and simplify all the machinery
   around selecting NOPs. Use the simplified NOPs in objtool now too.

 - Add Xeon Sapphire Rapids to list of CPUs that support PPIN

 - Simplify the retpolines by folding the entire thing into an
   alternative now that objtool can handle alternatives with stack ops.
   Then, have objtool rewrite the call to the retpoline with the
   alternative which then will get patched at boot time.

 - Document Intel uarch per models in intel-family.h

 - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the
   exception on Intel.

* tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  x86, sched: Treat Intel SNC topology as default, COD as exception
  x86/cpu: Comment Skylake server stepping too
  x86/cpu: Resort and comment Intel models
  objtool/x86: Rewrite retpoline thunk calls
  objtool: Skip magical retpoline .altinstr_replacement
  objtool: Cache instruction relocs
  objtool: Keep track of retpoline call sites
  objtool: Add elf_create_undef_symbol()
  objtool: Extract elf_symbol_add()
  objtool: Extract elf_strtab_concat()
  objtool: Create reloc sections implicitly
  objtool: Add elf_create_reloc() helper
  objtool: Rework the elf_rebuild_reloc_section() logic
  objtool: Fix static_call list generation
  objtool: Handle per arch retpoline naming
  objtool: Correctly handle retpoline thunk calls
  x86/retpoline: Simplify retpolines
  x86/alternatives: Optimize optimize_nops()
  x86: Add insn_decode_kernel()
  x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declaration
  ...
2021-04-27 17:45:09 -07:00
Linus Torvalds
ea5bc7b977 Trivial cleanups and fixes all over the place.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGmYIACgkQEsHwGGHe
 VUr45w/8CSXr7MXaFBj4To0hTWJXSZyF6YGqlZOSJXFcFh4cWTNwfVOoFaV47aDo
 +HsCNTkGENcKhLrDUWDRiG/Uo46jxtOtl1vhq7U4pGemSYH871XWOKfb5k5XNMwn
 /uhaHMI4aEfd6bUFnF518NeyRIsD0BdqFj4tB7RbAiyFwdETDX9Tkj/uBKnQ4zon
 4tEDoXgThuK5YKK9zVQg5pa7aFp2zg1CAdX/WzBkS8BHVBPXSV0CF97AJYQOM/V+
 lUHv+BN3wp97GYHPQMPsbkNr8IuFoe2mIvikwjxg8iOFpzEU1G1u09XV9R+PXByX
 LclFTRqK/2uU5hJlcsBiKfUuidyErYMRYImbMAOREt2w0ogWVu2zQ7HkjVve25h1
 sQPwPudbAt6STbqRxvpmB3yoV4TCYwnF91FcWgEy+rcEK2BDsHCnScA45TsK5I1C
 kGR1K17pHXprgMZFPveH+LgxewB6smDv+HllxQdSG67LhMJXcs2Epz0TsN8VsXw8
 dlD3lGReK+5qy9FTgO7mY0xhiXGz1IbEdAPU4eRBgih13puu03+jqgMaMabvBWKD
 wax+BWJUrPtetwD5fBPhlS/XdJDnd8Mkv2xsf//+wT0s4p+g++l1APYxeB8QEehm
 Pd7Mvxm4GvQkfE13QEVIPYQRIXCMH/e9qixtY5SHUZDBVkUyFM0=
 =bO1i
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 cleanups from Borislav Petkov:
 "Trivial cleanups and fixes all over the place"

* tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Remove me from IDE/ATAPI section
  x86/pat: Do not compile stubbed functions when X86_PAT is off
  x86/asm: Ensure asm/proto.h can be included stand-alone
  x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files
  x86/msr: Make locally used functions static
  x86/cacheinfo: Remove unneeded dead-store initialization
  x86/process/64: Move cpu_current_top_of_stack out of TSS
  tools/turbostat: Unmark non-kernel-doc comment
  x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
  x86/fpu/math-emu: Fix function cast warning
  x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
  x86: Fix various typos in comments, take #2
  x86: Remove unusual Unicode characters from comments
  x86/kaslr: Return boolean values from a function returning bool
  x86: Fix various typos in comments
  x86/setup: Remove unused RESERVE_BRK_ARRAY()
  stacktrace: Move documentation for arch_stack_walk_reliable() to header
  x86: Remove duplicate TSC DEADLINE MSR definitions
2021-04-26 09:25:47 -07:00
Alison Schofield
2c88d45edb x86, sched: Treat Intel SNC topology as default, COD as exception
Commit 1340ccfa9a ("x86,sched: Allow topologies where NUMA nodes
share an LLC") added a vendor and model specific check to never
call topology_sane() for Intel Skylake Server systems where NUMA
nodes share an LLC.

Intel Ice Lake and Sapphire Rapids CPUs also enumerate an LLC that is
shared by multiple NUMA nodes. The LLC on these CPUs is shared for
off-package data access but private to the NUMA node for on-package
access. Rather than managing a list of allowable SNC topologies, make
this SNC topology the default, and treat Intel's Cluster-On-Die (COD)
topology as the exception.

In SNC mode, Sky Lake, Ice Lake, and Sapphire Rapids servers do not
emit this warning:

sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210310190233.31752-1-alison.schofield@intel.com
2021-04-15 18:34:20 +02:00
Vitaly Kuznetsov
fa26d0c778 ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=m
Commit 8cdddd182b ("ACPI: processor: Fix CPU0 wakeup in
acpi_idle_play_dead()") tried to fix CPU0 hotplug breakage by copying
wakeup_cpu0() + start_cpu0() logic from hlt_play_dead()//mwait_play_dead()
into acpi_idle_play_dead(). The problem is that these functions are not
exported to modules so when CONFIG_ACPI_PROCESSOR=m build fails.

The issue could've been fixed by exporting both wakeup_cpu0()/start_cpu0()
(the later from assembly) but it seems putting the whole pattern into a
new function and exporting it instead is better.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 8cdddd182b ("CPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()")
Cc: <stable@vger.kernel.org> # 5.10+
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-07 19:02:43 +02:00
Vitaly Kuznetsov
8cdddd182b ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
Commit 496121c021 ("ACPI: processor: idle: Allow probing on platforms
with one ACPI C-state") broke CPU0 hotplug on certain systems, e.g.
I'm observing the following on AWS Nitro (e.g r5b.xlarge but other
instance types are affected as well):

 # echo 0 > /sys/devices/system/cpu/cpu0/online
 # echo 1 > /sys/devices/system/cpu/cpu0/online
 <10 seconds delay>
 -bash: echo: write error: Input/output error

In fact, the above mentioned commit only revealed the problem and did
not introduce it. On x86, to wakeup CPU an NMI is being used and
hlt_play_dead()/mwait_play_dead() loops are prepared to handle it:

	/*
	 * If NMI wants to wake up CPU0, start CPU0.
	 */
	if (wakeup_cpu0())
		start_cpu0();

cpuidle_play_dead() -> acpi_idle_play_dead() (which is now being called on
systems where it wasn't called before the above mentioned commit) serves
the same purpose but it doesn't have a path for CPU0. What happens now on
wakeup is:
 - NMI is sent to CPU0
 - wakeup_cpu0_nmi() works as expected
 - we get back to while (1) loop in acpi_idle_play_dead()
 - safe_halt() puts CPU0 to sleep again.

The straightforward/minimal fix is add the special handling for CPU0 on x86
and that's what the patch is doing.

Fixes: 496121c021 ("ACPI: processor: idle: Allow probing on platforms with one ACPI C-state")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-01 13:37:55 +02:00
Ingo Molnar
d9f6e12fb0 x86: Fix various typos in comments
Fix ~144 single-word typos in arch/x86/ code comments.

Doing this in a single commit should reduce the churn.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
2021-03-18 15:31:53 +01:00
Rafael J. Wysocki
d11a1d08a0 cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
If the maximum performance level taken for computing the
arch_max_freq_ratio value used in the x86 scale-invariance code is
higher than the one corresponding to the cpuinfo.max_freq value
coming from the acpi_cpufreq driver, the scale-invariant utilization
falls below 100% even if the CPU runs at cpuinfo.max_freq or slightly
faster, which causes the schedutil governor to select a frequency
below cpuinfo.max_freq.  That frequency corresponds to a frequency
table entry below the maximum performance level necessary to get to
the "boost" range of CPU frequencies which prevents "boost"
frequencies from being used in some workloads.

While this issue is related to scale-invariance, it may be amplified
by commit db865272d9 ("cpufreq: Avoid configuring old governors as
default with intel_pstate") from the 5.10 development cycle which
made it extremely easy to default to schedutil even if the preferred
driver is acpi_cpufreq as long as intel_pstate is built too, because
the mere presence of the latter effectively removes the ondemand
governor from the defaults.  Distro kernels are likely to include
both intel_pstate and acpi_cpufreq on x86, so their users who cannot
use intel_pstate or choose to use acpi_cpufreq may easily be
affectecd by this issue.

If CPPC is available, it can be used to address this issue by
extending the frequency tables created by acpi_cpufreq to cover the
entire available frequency range (including "boost" frequencies) for
each CPU, but if CPPC is not there, acpi_cpufreq has no idea what
the maximum "boost" frequency is and the frequency tables created by
it cannot be extended in a meaningful way, so in that case make it
ask the arch scale-invariance code to to use the "nominal" performance
level for CPU utilization scaling in order to avoid the issue at hand.

Fixes: db865272d9 ("cpufreq: Avoid configuring old governors as default with intel_pstate")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-02-08 13:45:51 +01:00
Rafael J. Wysocki
9c7d9017a4 x86: PM: Register syscore_ops for scale invariance
On x86 scale invariace tends to be disabled during resume from
suspend-to-RAM, because the MPERF or APERF MSR values are not as
expected then due to updates taking place after the platform
firmware has been invoked to complete the suspend transition.

That, of course, is not desirable, especially if the schedutil
scaling governor is in use, because the lack of scale invariance
causes it to be less reliable.

To counter that effect, modify init_freq_invariance() to register
a syscore_ops object for scale invariance with the ->resume callback
pointing to init_counter_refs() which will run on the CPU starting
the resume transition (the other CPUs will be taken care of the
"online" operations taking place later).

Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lkml.kernel.org/r/1803209.Mvru99baaF@kreacher
2021-01-19 17:04:03 +01:00
Linus Torvalds
148842c98a Yet another large set of x86 interrupt management updates:
- Simplification and distangling of the MSI related functionality
 
    - Let IO/APIC construct the RTE entries from an MSI message instead of
      having IO/APIC specific code in the interrupt remapping drivers
 
    - Make the retrieval of the parent interrupt domain (vector or remap
      unit) less hardcoded and use the relevant irqdomain callbacks for
      selection.
 
    - Allow the handling of more than 255 CPUs without a virtualized IOMMU
      when the hypervisor supports it. This has made been possible by the
      above modifications and also simplifies the existing workaround in the
      HyperV specific virtual IOMMU.
 
    - Cleanup of the historical timer_works() irq flags related
      inconsistencies.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xxd8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYpOD/9C5TppNlPMUyx2SflH6bxt37pJEpln
 +hYTKsk+jSThntr5mfj+GifGvgmHOVBTGnlDUnUnrpN7TQmLFBzwTOtnBLW53AO2
 16/u0+Xci4LNCtEkaymf0Rq4MfsfriXHPJr0A/CnZ0tpHSf5QKHAiitSiGujdMlb
 gbq43+zXd+jNkH7vkOLPX/7dZVI1hNASQEevJu2tRR4xYTuXFdBxvLgYkHtYKKrK
 R1sbs6nI6yIzye2u4m4xGu29SxgUft+zdUf+UehJKM3yFmf51d9qpkX+kLaTWuaL
 VPsMItbn0kdvxwXQWO6DYnIAAnVKCklyHQJTZCoNq9Fe91OoByak1CEVspSOa1av
 JmycNSch4IYWasR4vVCB1gbb+V9SejcKu5SV3CDrEDqwkOIpfiqpriUXSCJTLlFd
 QOEDOLuuk/79Qs//J/tb/nJ4IuKv8WPudDfIlMro8wUsAr67DjD4mnXprZ+svwWx
 Ct/0/Memk+BSa0cw6pvg24BUZGN6zrufkBu2HKT9GOXRUdNkdLkiPhT8mK4T/O0l
 f90QCLjPSOJ/K/pLEWdUHEPmgC5Q9RsXOmwVGqX+RbjfP7mYTJXlmWnBb+cFNch0
 xFIH3SxVGylxxT06NX3SkvinrHj10CoAlmneefBlLtx6dF+2P84DAMZSF0OFToVI
 c2KMg5zoesI4bg==
 =8Gfs
 -----END PGP SIGNATURE-----

Merge tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 apic updates from Thomas Gleixner:
 "Yet another large set of x86 interrupt management updates:

   - Simplification and distangling of the MSI related functionality

   - Let IO/APIC construct the RTE entries from an MSI message instead
     of having IO/APIC specific code in the interrupt remapping drivers

   - Make the retrieval of the parent interrupt domain (vector or remap
     unit) less hardcoded and use the relevant irqdomain callbacks for
     selection.

   - Allow the handling of more than 255 CPUs without a virtualized
     IOMMU when the hypervisor supports it. This has made been possible
     by the above modifications and also simplifies the existing
     workaround in the HyperV specific virtual IOMMU.

   - Cleanup of the historical timer_works() irq flags related
     inconsistencies"

* tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  x86/ioapic: Cleanup the timer_works() irqflags mess
  iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()
  iommu/amd: Fix IOMMU interrupt generation in X2APIC mode
  iommu/amd: Don't register interrupt remapping irqdomain when IR is disabled
  iommu/amd: Fix union of bitfields in intcapxt support
  x86/ioapic: Correct the PCI/ISA trigger type selection
  x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index
  x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
  x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected
  iommu/hyper-v: Disable IRQ pseudo-remapping if 15 bit APIC IDs are available
  x86/apic: Support 15 bits of APIC ID in MSI where available
  x86/ioapic: Handle Extended Destination ID field in RTE
  iommu/vt-d: Simplify intel_irq_remapping_select()
  x86: Kill all traces of irq_remapping_get_irq_domain()
  x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain
  x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain
  iommu/hyper-v: Implement select() method on remapping irqdomain
  iommu/vt-d: Implement select() method on remapping irqdomain
  iommu/amd: Implement select() method on remapping irqdomain
  x86/apic: Add select() method on vector irqdomain
  ...
2020-12-14 18:59:53 -08:00
Linus Torvalds
adb35e8dc9 Scheduler updates:
- migrate_disable/enable() support which originates from the RT tree and
    is now a prerequisite for the new preemptible kmap_local() API which aims
    to replace kmap_atomic().
 
  - A fair amount of topology and NUMA related improvements
 
  - Improvements for the frequency invariant calculations
 
  - Enhanced robustness for the global CPU priority tracking and decision
    making
 
  - The usual small fixes and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
 Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
 /QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
 3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
 sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
 dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
 u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
 Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
 CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
 MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
 e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
 Gsq0rJW5iiu/OQ==
 =Oz1V
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - migrate_disable/enable() support which originates from the RT tree
   and is now a prerequisite for the new preemptible kmap_local() API
   which aims to replace kmap_atomic().

 - A fair amount of topology and NUMA related improvements

 - Improvements for the frequency invariant calculations

 - Enhanced robustness for the global CPU priority tracking and decision
   making

 - The usual small fixes and enhancements all over the place

* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  sched/fair: Trivial correction of the newidle_balance() comment
  sched/fair: Clear SMT siblings after determining the core is not idle
  sched: Fix kernel-doc markup
  x86: Print ratio freq_max/freq_base used in frequency invariance calculations
  x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
  x86, sched: Calculate frequency invariance for AMD systems
  irq_work: Optimize irq_work_single()
  smp: Cleanup smp_call_function*()
  irq_work: Cleanup
  sched: Limit the amount of NUMA imbalance that can exist at fork time
  sched/numa: Allow a floating imbalance between NUMA nodes
  sched: Avoid unnecessary calculation of load imbalance at clone time
  sched/numa: Rename nr_running and break out the magic number
  sched: Make migrate_disable/enable() independent of RT
  sched/topology: Condition EAS enablement on FIE support
  arm64: Rebuild sched domains on invariance status changes
  sched/topology,schedutil: Wrap sched domains rebuild
  sched/uclamp: Allow to reset a task uclamp constraint value
  sched/core: Fix typos in comments
  Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
  ...
2020-12-14 18:29:11 -08:00
Giovanni Gherdovich
3149cd5530 x86: Print ratio freq_max/freq_base used in frequency invariance calculations
The value freq_max/freq_base is a fundamental component of frequency
invariance calculations. It may come from a variety of sources such as MSRs
or ACPI data, tracking it down when troubleshooting a system could be
non-trivial. It is worth saving it in the kernel logs.

 # dmesg | grep 'Estimated ratio of average max'
 [   14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
2020-12-11 10:30:23 +01:00
Giovanni Gherdovich
976df7e573 x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
Frequency invariant accounting calculations need the ratio
freq_curr/freq_max, but freq_max is unknown as it depends on dynamic power
allocation between cores: AMD EPYC CPUs implement "Core Performance Boost".
Three candidates are considered to estimate this value:

- maximum non-boost frequency
- maximum boost frequency
- the mid point between the above two

Experimental data on an AMD EPYC Zen2 machine slightly favors the third
option, which is applied with this patch.

The analysis uses the ondemand cpufreq governor as baseline, and compares
it with schedutil in a number of configurations. Using the freq_max value
described above offers a moderate advantage in performance and efficiency:

sugov-max (freq_max=max_boost) performs the worst on tbench: less
throughput and reduced efficiency than the other invariant-schedutil
options (see "Data Overview" below). Consider that tbench is generally a
problematic case as no schedutil version currently is better than ondemand.

sugov-P0 (freq_max=max_P) is the worst on dbench, while the other sugov's
can surpass ondemand with less filesystem latency and slightly increased
efficiency.

1. DATA OVERVIEW
2. DETAILED PERFORMANCE TABLES
3. POWER CONSUMPTION TABLE

1. DATA OVERVIEW
================

sugov-noinv : non-invariant schedutil governor
sugov-max   : invariant schedutil, freq_max=max_boost
sugov-mid   : invariant schedutil, freq_max=midpoint
sugov-P0    : invariant schedutil, freq_max=max_P
perfgov     : performance governor

driver      : acpi_cpufreq
machine     : AMD EPYC 7742 (Zen2, aka "Rome"), dual socket,
              128 cores / 256 threads, SATA SSD storage, 250G of memory,
	      XFS filesystem

Benchmarks are described in the next section.
Tilde (~) means the value is the same as baseline.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
            ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0  better if
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                        PERFORMANCE RATIOS
tbench        1.00       1.44       0.90       0.87       0.93       0.93      higher
dbench        1.00       0.91       0.95       0.94       0.94       1.06      lower
kernbench     1.00       0.93       ~          ~          ~          0.97      lower
gitsource     1.00       0.66       0.97       0.96       ~          0.95      lower
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                    PERFORMANCE-PER-WATT RATIOS
tbench        1.00       1.16       0.84       0.84       0.88       0.85      higher
dbench        1.00       1.03       1.02       1.02       1.02       0.93      higher
kernbench     1.00       1.05       ~          ~          ~          ~         higher
gitsource     1.00       1.46       1.04       1.04       ~          1.05      higher

2. DETAILED PERFORMANCE TABLES
==============================

Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        427.19  +- 0.16% (        )     778.35  +- 0.10% (  82.20%)     346.92  +- 0.14% ( -18.79%)
Hmean  2        853.82  +- 0.09% (        )    1536.23  +- 0.03% (  79.93%)     694.36  +- 0.05% ( -18.68%)
Hmean  4       1657.54  +- 0.12% (        )    2938.18  +- 0.12% (  77.26%)    1362.81  +- 0.11% ( -17.78%)
Hmean  8       3301.87  +- 0.06% (        )    5679.10  +- 0.04% (  72.00%)    2693.35  +- 0.04% ( -18.43%)
Hmean  16      6139.65  +- 0.05% (        )    9498.81  +- 0.04% (  54.71%)    4889.97  +- 0.17% ( -20.35%)
Hmean  32     11170.28  +- 0.09% (        )   17393.25  +- 0.08% (  55.71%)    9104.55  +- 0.09% ( -18.49%)
Hmean  64     19322.97  +- 0.17% (        )   31573.91  +- 0.08% (  63.40%)   18552.52  +- 0.40% (  -3.99%)
Hmean  128    30383.71  +- 0.11% (        )   37416.91  +- 0.15% (  23.15%)   25938.70  +- 0.41% ( -14.63%)
Hmean  256    31143.96  +- 0.41% (        )   30908.76  +- 0.88% (  -0.76%)   29754.32  +- 0.24% (  -4.46%)
Hmean  512    30858.49  +- 0.26% (        )   38524.60  +- 1.19% (  24.84%)   42080.39  +- 0.56% (  36.37%)
Hmean  1024   39187.37  +- 0.19% (        )   36213.86  +- 0.26% (  -7.59%)   39555.98  +- 0.12% (   0.94%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        352.59  +- 1.03% ( -17.46%)     352.08  +- 0.75% ( -17.58%)     352.31  +- 1.48% ( -17.53%)
Hmean  2        697.32  +- 0.08% ( -18.33%)     700.16  +- 0.20% ( -18.00%)     696.79  +- 0.06% ( -18.39%)
Hmean  4       1369.88  +- 0.04% ( -17.35%)    1369.72  +- 0.07% ( -17.36%)    1365.91  +- 0.05% ( -17.59%)
Hmean  8       2696.79  +- 0.04% ( -18.33%)    2711.06  +- 0.04% ( -17.89%)    2715.10  +- 0.61% ( -17.77%)
Hmean  16      4725.03  +- 0.03% ( -23.04%)    4875.65  +- 0.02% ( -20.59%)    4953.05  +- 0.28% ( -19.33%)
Hmean  32      9231.65  +- 0.10% ( -17.36%)    8704.89  +- 0.27% ( -22.07%)   10562.02  +- 0.36% (  -5.45%)
Hmean  64     15364.27  +- 0.19% ( -20.49%)   17786.64  +- 0.15% (  -7.95%)   19665.40  +- 0.22% (   1.77%)
Hmean  128    42100.58  +- 0.13% (  38.56%)   34946.28  +- 0.13% (  15.02%)   38635.79  +- 0.06% (  27.16%)
Hmean  256    30660.23  +- 1.08% (  -1.55%)   32307.67  +- 0.54% (   3.74%)   31153.27  +- 0.12% (   0.03%)
Hmean  512    24604.32  +- 0.14% ( -20.27%)   40408.50  +- 1.10% (  30.95%)   38800.29  +- 1.23% (  25.74%)
Hmean  1024   35535.47  +- 0.28% (  -9.32%)   41070.38  +- 2.56% (   4.81%)   31308.29  +- 2.52% ( -20.11%)

Benchmark          : dbench (filesystem stressor)
Varying parameter  : number of clients
Unit               : seconds (lower is better)

NOTE-1: This dbench version measures the average latency of a set of filesystem
        operations, as we found the traditional dbench metric (throughput) to be
	misleading.
NOTE-2: Due to high variability, we partition the original dataset and apply
        statistical bootrapping (a resampling method). Accuracy is reported in the
	form of 95% confidence intervals.

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SubAmean  1         98.79  +- 0.92 (        )      83.36  +- 0.82 (  15.62%)      84.82  +- 0.92 (  14.14%)
SubAmean  2        116.00  +- 0.89 (        )     102.12  +- 0.77 (  11.96%)     109.63  +- 0.89 (   5.49%)
SubAmean  4        149.90  +- 1.03 (        )     132.12  +- 0.91 (  11.86%)     143.90  +- 1.15 (   4.00%)
SubAmean  8        182.41  +- 1.13 (        )     159.86  +- 0.93 (  12.36%)     165.82  +- 1.03 (   9.10%)
SubAmean  16       237.83  +- 1.23 (        )     219.46  +- 1.14 (   7.72%)     229.28  +- 1.19 (   3.59%)
SubAmean  32       334.34  +- 1.49 (        )     309.94  +- 1.42 (   7.30%)     321.19  +- 1.36 (   3.93%)
SubAmean  64       576.61  +- 2.16 (        )     540.75  +- 2.00 (   6.22%)     551.27  +- 1.99 (   4.39%)
SubAmean  128     1350.07  +- 4.14 (        )    1205.47  +- 3.20 (  10.71%)    1280.26  +- 3.75 (   5.17%)
SubAmean  256     3444.42  +- 7.97 (        )    3698.00 +- 27.43 (  -7.36%)    3494.14  +- 7.81 (  -1.44%)
SubAmean  2048   39457.89 +- 29.01 (        )   34105.33 +- 41.85 (  13.57%)   39688.52 +- 36.26 (  -0.58%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SubAmean  1         85.68  +- 1.04 (  13.27%)      84.16  +- 0.84 (  14.81%)      83.99  +- 0.90 (  14.99%)
SubAmean  2        108.42  +- 0.95 (   6.54%)     109.91  +- 1.39 (   5.24%)     112.06  +- 0.91 (   3.39%)
SubAmean  4        136.90  +- 1.04 (   8.67%)     137.59  +- 0.93 (   8.21%)     136.55  +- 0.95 (   8.91%)
SubAmean  8        163.15  +- 0.96 (  10.56%)     166.07  +- 1.02 (   8.96%)     165.81  +- 0.99 (   9.10%)
SubAmean  16       224.86  +- 1.12 (   5.45%)     223.83  +- 1.06 (   5.89%)     230.66  +- 1.19 (   3.01%)
SubAmean  32       320.51  +- 1.38 (   4.13%)     322.85  +- 1.49 (   3.44%)     321.96  +- 1.46 (   3.70%)
SubAmean  64       553.25  +- 1.93 (   4.05%)     554.19  +- 2.08 (   3.89%)     562.26  +- 2.22 (   2.49%)
SubAmean  128     1264.35  +- 3.72 (   6.35%)    1256.99  +- 3.46 (   6.89%)    2018.97 +- 18.79 ( -49.55%)
SubAmean  256     3466.25  +- 8.25 (  -0.63%)    3450.58  +- 8.44 (  -0.18%)    5032.12 +- 38.74 ( -46.09%)
SubAmean  2048   39133.10 +- 45.71 (   0.82%)   39905.95 +- 34.33 (  -1.14%)   53811.86 +-193.04 ( -36.38%)

Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        471.71 +- 26.61% (        )     409.88 +- 16.99% (  13.11%)     430.63  +- 0.18% (   8.71%)
Amean  4        211.87  +- 0.58% (        )     194.03  +- 0.74% (   8.42%)     215.33  +- 0.64% (  -1.63%)
Amean  8        109.79  +- 1.27% (        )     101.43  +- 1.53% (   7.61%)     111.05  +- 1.95% (  -1.15%)
Amean  16        59.50  +- 1.28% (        )      55.61  +- 1.35% (   6.55%)      59.65  +- 1.78% (  -0.24%)
Amean  32        34.94  +- 1.22% (        )      32.36  +- 1.95% (   7.41%)      35.44  +- 0.63% (  -1.43%)
Amean  64        22.58  +- 0.38% (        )      20.97  +- 1.28% (   7.11%)      22.41  +- 1.73% (   0.74%)
Amean  128       17.72  +- 0.44% (        )      16.68  +- 0.32% (   5.88%)      17.65  +- 0.96% (   0.37%)
Amean  256       16.44  +- 0.53% (        )      15.76  +- 0.32% (   4.18%)      16.76  +- 0.60% (  -1.93%)
Amean  512       16.54  +- 0.21% (        )      15.62  +- 0.41% (   5.53%)      16.84  +- 0.85% (  -1.83%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        421.30  +- 0.24% (  10.69%)     419.26  +- 0.15% (  11.12%)     414.38  +- 0.33% (  12.15%)
Amean  4  	217.81  +- 5.53% (  -2.80%)     211.63  +- 0.99% (   0.12%)     208.43  +- 0.47% (   1.63%)
Amean  8  	108.80  +- 0.43% (   0.90%)     108.48  +- 1.44% (   1.19%)     108.59  +- 3.08% (   1.09%)
Amean  16 	 58.84  +- 0.74% (   1.12%)      58.37  +- 0.94% (   1.91%)      57.78  +- 0.78% (   2.90%)
Amean  32 	 34.04  +- 2.00% (   2.59%)      34.28  +- 1.18% (   1.91%)      33.98  +- 2.21% (   2.75%)
Amean  64 	 22.22  +- 1.69% (   1.60%)      22.27  +- 1.60% (   1.38%)      22.25  +- 1.41% (   1.47%)
Amean  128	 17.55  +- 0.24% (   0.97%)      17.53  +- 0.94% (   1.04%)      17.49  +- 0.43% (   1.30%)
Amean  256	 16.51  +- 0.46% (  -0.40%)      16.48  +- 0.48% (  -0.19%)      16.44  +- 1.21% (   0.00%)
Amean  512	 16.50  +- 0.35% (   0.19%)      16.35  +- 0.42% (   1.14%)      16.37  +- 0.33% (   0.99%)

Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean          1035.76  +- 0.30% (        )     688.21  +- 0.04% (  33.56%)    1003.85  +- 0.14% (   3.08%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean           995.82  +- 0.08% (   3.86%)    1011.98  +- 0.03% (   2.30%)     986.87  +- 0.19% (   4.72%)

3. POWER CONSUMPTION TABLE
==========================

Average power consumption (watts).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
            ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
tbench4     227.25     281.83     244.17     236.76     241.50     247.99
dbench4     151.97     161.87     157.08     158.10     158.06     153.73
kernbench   162.78     167.22     162.90     164.19     164.65     164.72
gitsource   133.65     139.00     133.04     134.43     134.18     134.32

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-3-ggherdovich@suse.cz
2020-12-11 10:29:55 +01:00
Nathan Fontenot
41ea667227 x86, sched: Calculate frequency invariance for AMD systems
This is the first pass in creating the ability to calculate the
frequency invariance on AMD systems. This approach uses the CPPC
highest performance and nominal performance values that range from
0 - 255 instead of a high and base frquency. This is because we do
not have the ability on AMD to get a highest frequency value.

On AMD systems the highest performance and nominal performance
vaues do correspond to the highest and base frequencies for the system
so using them should produce an appropriate ratio but some tweaking
is likely necessary.

Due to CPPC being initialized later in boot than when the frequency
invariant calculation is currently made, I had to create a callback
from the CPPC init code to do the calculation after we have CPPC
data.

Special thanks to "kernel test robot <lkp@intel.com>" for reporting that
compilation of drivers/acpi/cppc_acpi.c is conditional to
CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI.

[ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ]

Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
2020-12-11 10:26:00 +01:00
Paul E. McKenney
29368e0939 x86/smpboot: Move rcu_cpu_starting() earlier
The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough
in the CPU-hotplug onlining process, which results in lockdep splats
as follows:

=============================
WARNING: suspicious RCU usage
5.9.0+ #268 Not tainted
-----------------------------
kernel/kprobes.c:300 RCU-list traversed in non-reader section!!

other info that might help us debug this:

RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x77/0x97
 __is_insn_slot_addr+0x15d/0x170
 kernel_text_address+0xba/0xe0
 ? get_stack_info+0x22/0xa0
 __kernel_text_address+0x9/0x30
 show_trace_log_lvl+0x17d/0x380
 ? dump_stack+0x77/0x97
 dump_stack+0x77/0x97
 __lock_acquire+0xdf7/0x1bf0
 lock_acquire+0x258/0x3d0
 ? vprintk_emit+0x6d/0x2c0
 _raw_spin_lock+0x27/0x40
 ? vprintk_emit+0x6d/0x2c0
 vprintk_emit+0x6d/0x2c0
 printk+0x4d/0x69
 start_secondary+0x1c/0x100
 secondary_startup_64_no_verify+0xb8/0xbb

This is avoided by moving the call to rcu_cpu_starting up near
the beginning of the start_secondary() function.  Note that the
raw_smp_processor_id() is required in order to avoid calling into lockdep
before RCU has declared the CPU to be watched for readers.

Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Reported-by: Qian Cai <cai@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
Thomas Gleixner
8c44963b60 x86/apic: Cleanup destination mode
apic::irq_dest_mode is actually a boolean, but defined as u32 and named in
a way which does not explain what it means.

Make it a boolean and rename it to 'dest_mode_logical'

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-9-dwmw2@infradead.org
2020-10-28 20:26:25 +01:00
Thomas Gleixner
e57d04e5fa x86/apic: Get rid of apic:: Dest_logical
struct apic has two members which store information about the destination
mode: dest_logical and irq_dest_mode.

dest_logical contains a mask which was historically used to set the
destination mode in IPI messages. Over time the usage was reduced and the
logical/physical functions were seperated.

There are only a few places which still use 'dest_logical' but they can
use 'irq_dest_mode' instead.

irq_dest_mode is actually a boolean where 0 means physical destination mode
and 1 means logical destination mode. Of course the name does not reflect
the functionality. This will be cleaned up in a subsequent change.

Remove apic::dest_logical and fixup the remaining users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-8-dwmw2@infradead.org
2020-10-28 20:26:24 +01:00
Joerg Roedel
520d030852 x86/smpboot: Load TSS and getcpu GDT entry before loading IDT
The IDT on 64-bit contains vectors which use paranoid_entry() and/or IST
stacks. To make these vectors work, the TSS and the getcpu GDT entry need
to be set up before the IDT is loaded.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-68-joro@8bytes.org
2020-09-09 11:33:20 +02:00
Ashok Raj
52d6b926aa x86/hotplug: Silence APIC only after all interrupts are migrated
There is a race when taking a CPU offline. Current code looks like this:

native_cpu_disable()
{
	...
	apic_soft_disable();
	/*
	 * Any existing set bits for pending interrupt to
	 * this CPU are preserved and will be sent via IPI
	 * to another CPU by fixup_irqs().
	 */
	cpu_disable_common();
	{
		....
		/*
		 * Race window happens here. Once local APIC has been
		 * disabled any new interrupts from the device to
		 * the old CPU are lost
		 */
		fixup_irqs(); // Too late to capture anything in IRR.
		...
	}
}

The fix is to disable the APIC *after* cpu_disable_common().

Testing was done with a USB NIC that provided a source of frequent
interrupts. A script migrated interrupts to a specific CPU and
then took that CPU offline.

Fixes: 60dcaad573 ("x86/hotplug: Silence APIC and NMI when CPU is dead")
Reported-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Tested-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com
2020-08-27 09:29:23 +02:00
Linus Torvalds
335ad94c21 Misc changes:
- Prepare for Intel's new SERIALIZE instruction
  - Enable split-lock debugging on more CPUs
  - Add more Intel CPU models
  - Optimize stack canary initialization a bit
  - Simplify the Spectre logic a bit
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8oTsQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gueQ//Vh9sTi8+q5ZCxXnJQOi59SZsFy1quC2Q
 6bFoSQ46npMBoYyC2eDQ4exBncWLqorT8Vq/evlW3XPldUzHKOk7b4Omonwyrrj5
 dg5fqcRjpjU8ni6egmy4ElMjab53gDuv0yNazjONeBGeWuBGu4vI2bP2eY3Addfm
 2eo2d5ZIMRCdShrUNwToJWWt6q4DzL/lcrVZAlX0LwlWVLqUCdIARALRM7V1XDsC
 udxS8KnvhTaJ7l63BSJREe3AGksLQd9P4UkJS4IE4t0zINBIrME043BYBMTh2Vvk
 y3jykKegIbmhPquGXG8grJbPDUF2/3FxmGKTIhpoo++agb2fxt921y5kqMJwniNS
 H/Gk032iGzjjwWnOoWE56UeuDTOlweSIrm4EG22HyEDK7kOMJusjYAV5fB4Sv7vj
 TBy5q0PCIutjXDTL1hIWf0WDiQt6eGNQS/yt3FlapLBGVRQwMU/pKYVVIOIaFtNs
 szx1ZeiT358Ww8a2fQlb8pqv50Upmr2wqFkAsMbm+NN3N92cqK6gJlo1p7fnxIuG
 +YVASobjsqbn0S62v/9SB/KRJz07adlZ6Tl/O/ILRvWyqik7COCCHDVJ62Zzaz5z
 LqR2daVM5H+Lp6jGZuIoq/JiUkxUe2K990eWHb3PUpOC4Rh73PvtMc7WFhbAjbye
 XV3eOEDi65c=
 =sL2Q
 -----END PGP SIGNATURE-----

Merge tag 'x86-cpu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Ingo Molar:

 - prepare for Intel's new SERIALIZE instruction

 - enable split-lock debugging on more CPUs

 - add more Intel CPU models

 - optimize stack canary initialization a bit

 - simplify the Spectre logic a bit

* tag 'x86-cpu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Refactor sync_core() for readability
  x86/cpu: Relocate sync_core() to sync_core.h
  x86/cpufeatures: Add enumeration for SERIALIZE instruction
  x86/split_lock: Enable the split lock feature on Sapphire Rapids and Alder Lake CPUs
  x86/cpu: Add Lakefield, Alder Lake and Rocket Lake models to the to Intel CPU family
  x86/stackprotector: Pre-initialize canary for secondary CPUs
  x86/speculation: Merge one test in spectre_v2_user_select_mitigation()
2020-08-03 17:08:02 -07:00
Brian Gerst
c9a1ff316b x86/stackprotector: Pre-initialize canary for secondary CPUs
The idle tasks created for each secondary CPU already have a random stack
canary generated by fork().  Copy the canary to the percpu variable before
starting the secondary CPU which removes the need to call
boot_init_stack_canary().

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200617225624.799335-1-brgerst@gmail.com
2020-06-18 13:09:17 +02:00
Giovanni Gherdovich
f4291df103 x86, sched: Bail out of frequency invariance if turbo_freq/base_freq gives 0
Be defensive against the case where the processor reports a base_freq
larger than turbo_freq (the ratio would be zero).

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200531182453.15254-4-ggherdovich@suse.cz
2020-06-15 14:10:02 +02:00
Giovanni Gherdovich
51beea8862 x86, sched: Bail out of frequency invariance if turbo frequency is unknown
There may be CPUs that support turbo boost but don't declare any turbo
ratio, i.e. their MSR_TURBO_RATIO_LIMIT is all zeroes. In that condition
scale-invariant calculations can't be performed.

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/20200531182453.15254-3-ggherdovich@suse.cz
2020-06-15 14:10:02 +02:00
Giovanni Gherdovich
e2b0d619b4 x86, sched: check for counters overflow in frequency invariant accounting
The product mcnt * arch_max_freq_ratio can overflows u64.

For context, a large value for arch_max_freq_ratio would be 5000,
corresponding to a turbo_freq/base_freq ratio of 5 (normally it's more like
1500-2000). A large increment frequency for the MPERF counter would be 5GHz
(the base clock of all CPUs on the market today is less than that). With
these figures, a CPU would need to go without a scheduler tick for around 8
days for the u64 overflow to happen. It is unlikely, but the check is
warranted.

Under similar conditions, the difference acnt of two consecutive APERF
readings can overflow as well.

In these circumstances is appropriate to disable frequency invariant
accounting: the feature relies on measures of the clock frequency done at
every scheduler tick, which need to be "fresh" to be at all meaningful.

A note on i386: prior to version 5.1, the GCC compiler didn't have the
builtin function __builtin_mul_overflow. In these GCC versions the macro
check_mul_overflow needs __udivdi3() to do (u64)a/b, which the kernel
doesn't provide. For this reason this change fails to build on i386 if
GCC<5.1, and we protect the entire frequency invariant code behind
CONFIG_X86_64 (special thanks to "kbuild test robot" <lkp@intel.com>).

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200531182453.15254-2-ggherdovich@suse.cz
2020-06-15 14:10:02 +02:00
Mike Rapoport
65fddcfca8 mm: reorder includes after introduction of linux/pgtable.h
The replacement of <asm/pgrable.h> with <linux/pgtable.h> made the include
of the latter in the middle of asm includes.  Fix this up with the aid of
the below script and manual adjustments here and there.

	import sys
	import re

	if len(sys.argv) is not 3:
	    print "USAGE: %s <file> <header>" % (sys.argv[0])
	    sys.exit(1)

	hdr_to_move="#include <linux/%s>" % sys.argv[2]
	moved = False
	in_hdrs = False

	with open(sys.argv[1], "r") as f:
	    lines = f.readlines()
	    for _line in lines:
		line = _line.rstrip('
')
		if line == hdr_to_move:
		    continue
		if line.startswith("#include <linux/"):
		    in_hdrs = True
		elif not moved and in_hdrs:
		    moved = True
		    print hdr_to_move
		print line

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Mike Rapoport
ca5999fde0 mm: introduce include/linux/pgtable.h
The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.

Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Linus Torvalds
17e0a7cb6a Misc cleanups, with an emphasis on removing obsolete/dead code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VLcQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iFnhAArGBqco3C2RPQugv7UDDbKEaMvxOGrc5B
 kwnyOS/k/yeIkfhT9u11oBuLcaj/Zgw8YCjFyRfaNsorRqnytLyZzZ6PvdCCE3YU
 X3DVYgulcdAQnM4bS2e3Kt9ciJvFxB27XNm0AfuyLMUxMqCD+iIO4gJ6TuQNBYy3
 dfUMfB1R9OUDW13GCrASe+p1Dw76uaqVngdFWJhnC8Rm49E6gFXq7CLQp5Cka81I
 KZeJ8I6ug9p3gqhOIXdi+S6g5CM5jf86Wkk7dOHwHFH7CceFb3FIz7z0n1je4Wgd
 L5rYX7+PwfNeZ73GIuvEBN+agJH2K0H/KmnlWNWeZHzc+J12MeruSdSMBIkBOEpn
 iSbYAOmDpQLzBjTdZjC8bDqTZf472WrTh4VwN9NxHLucjdC+IqGoTAvnyyEOmZ5o
 R7sv7Q++316CVwRhYVXbzwZcqtiinCDE1EkP5nKTo9z3z0kMF5+ce/k7wn5sgZIk
 zJq3LXtaToiDoDRAPGxcvFPts9MdC0EI1aKTIjaK/n6i2h/SpJfrTKgANWaldYTe
 XJIqlSB43saqf5YAQ3/sY+wnpCRBmmCU+sfKja4C8bH7RuggI3mZS19uhFs0Qctq
 Yx5bIXVSBAIqjJtgzQ0WAAZ5LrCpNNyAzb35ZYefQlGyJlx1URKXVBmxa6S99biU
 KiYX7Dk5uhQ=
 =0ZQd
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups, with an emphasis on removing obsolete/dead code"

* tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/spinlock: Remove obsolete ticket spinlock macros and types
  x86/mm: Drop deprecated DISCONTIGMEM support for 32-bit
  x86/apb_timer: Drop unused declaration and macro
  x86/apb_timer: Drop unused TSC calibration
  x86/io_apic: Remove unused function mp_init_irq_at_boot()
  x86/mm: Stop printing BRK addresses
  x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()
  x86/nmi: Remove edac.h include leftover
  mm: Remove MPX leftovers
  x86/mm/mmap: Fix -Wmissing-prototypes warnings
  x86/early_printk: Remove unused includes
  crash_dump: Remove no longer used saved_max_pfn
  x86/smpboot: Remove the last ICPU() macro
2020-06-01 13:47:10 -07:00
Linus Torvalds
d861f6e682 Misc cleanups in the SMP hotplug and cross-call code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VJfsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ihcA/+Ko18kdGRPAlShM9qkDWO5N80p1LEp7F0
 ku1OxPAz9ii7K/jlnGr9wYYPxsIL3lbFeqFE7q5q5socXufaN8MUj9sVCmN7ScmR
 zO84aTHtxrJJhKIPM6HkUTbVl5KrQaud3F/J56CCjuKPsJWy9iuCGnKtfKK38bx+
 qJEfVKVm95Bv0NSEvqvci3DKKPYjzpKzuuttHXQ8Z80zG94FEkwj0JwZzttIjLl1
 rgRMgWTH7+3tQCMnZEfXG8xBxbXS9i3hKyr/v5QTNgIICyXGquPkf5MiwjJFS2Xb
 wpPqNh8HTo5kUJstYygRjcftatU7K72h2Rz/CoUkN2roNYlvRAhdBaBMwN0cGaG8
 pPhnLHHHRYZjl4fiROgRwVV3A6LcAHSrIcKzwGrvpCSpqyVozPGsmD/e8ZG1JYpC
 vxESTZbCDywng2Ls8jqQBut+dFGElvopXl1s004bCak89IFR4p15qojMJK2MSsqu
 BxhjIoqp8/f1fsAX+1p0RBEYnEr1KFtWa+nY8aVKL6bEx+Y7Qyq0ypMGtKavP06X
 VMcPMm1gYeXoGpLaTLYBRL5t7Rmm7i+xufuDQKUJetenfh2YS4aQ9lfV+rsQH1YE
 wavQrbwThfBZ9K1XkEmOkSqONysZ2YAtK9slKzciQIZvY3V8NbKAmBudCgqTgarp
 xqeW9NFfeFc=
 =Rr2n
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP updates from Ingo Molnar:
 "Misc cleanups in the SMP hotplug and cross-call code"

* tag 'smp-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Remove __freeze_secondary_cpus()
  cpu/hotplug: Remove disable_nonboot_cpus()
  cpu/hotplug: Fix a typo in comment "broadacasted"->"broadcasted"
  smp: Use smp_call_func_t in on_each_cpu()
2020-06-01 13:38:55 -07:00
Borislav Petkov
a9a3ed1eff x86: Fix early boot crash on gcc-10, third try
... or the odyssey of trying to disable the stack protector for the
function which generates the stack canary value.

The whole story started with Sergei reporting a boot crash with a kernel
built with gcc-10:

  Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139
  Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013
  Call Trace:
    dump_stack
    panic
    ? start_secondary
    __stack_chk_fail
    start_secondary
    secondary_startup_64
  -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary

This happens because gcc-10 tail-call optimizes the last function call
in start_secondary() - cpu_startup_entry() - and thus emits a stack
canary check which fails because the canary value changes after the
boot_init_stack_canary() call.

To fix that, the initial attempt was to mark the one function which
generates the stack canary with:

  __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)

however, using the optimize attribute doesn't work cumulatively
as the attribute does not add to but rather replaces previously
supplied optimization options - roughly all -fxxx options.

The key one among them being -fno-omit-frame-pointer and thus leading to
not present frame pointer - frame pointer which the kernel needs.

The next attempt to prevent compilers from tail-call optimizing
the last function call cpu_startup_entry(), shy of carving out
start_secondary() into a separate compilation unit and building it with
-fno-stack-protector, was to add an empty asm("").

This current solution was short and sweet, and reportedly, is supported
by both compilers but we didn't get very far this time: future (LTO?)
optimization passes could potentially eliminate this, which leads us
to the third attempt: having an actual memory barrier there which the
compiler cannot ignore or move around etc.

That should hold for a long time, but hey we said that about the other
two solutions too so...

Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Kalle Valo <kvalo@codeaurora.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
2020-05-15 11:48:01 +02:00
Qais Yousef
5655585589 cpu/hotplug: Remove disable_nonboot_cpus()
The single user could have called freeze_secondary_cpus() directly.

Since this function was a source of confusion, remove it as it's
just a pointless wrapper.

While at it, rename enable_nonboot_cpus() to thaw_secondary_cpus() to
preserve the naming symmetry.

Done automatically via:

	git grep -l enable_nonboot_cpus | xargs sed -i 's/enable_nonboot_cpus/thaw_secondary_cpus/g'

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20200430114004.17477-1-qais.yousef@arm.com
2020-05-07 15:18:40 +02:00
Giovanni Gherdovich
db441bd9f6 x86, sched: Move check for CPU type to caller function
Improve readability of the function intel_set_max_freq_ratio() by moving
the check for KNL CPUs there, together with checks for GLM and SKX.

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200416054745.740-5-ggherdovich@suse.cz
2020-04-22 23:10:13 +02:00
Peter Zijlstra (Intel)
b56e7d45e8 x86, sched: Don't enable static key when starting secondary CPUs
The static key arch_scale_freq_key only needs to be enabled once (at
boot). This change fixes a bug by which the key was enabled every time cpu0
is started, even as a secondary CPU during cpu hotplug. Secondary CPUs are
started from the idle thread: setting a static key from there means
acquiring a lock and may result in sleeping in the idle task, causing CPU
lockup.

Another consequence of this change is that init_counter_refs() is now
called on each CPU correctly; previously the function on_each_cpu() was
used, but it was called at boot when the only online cpu is cpu0.

[ggherdovich@suse.cz: Tested and wrote changelog]
Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200416054745.740-4-ggherdovich@suse.cz
2020-04-22 23:10:13 +02:00
Giovanni Gherdovich
23ccee22e8 x86, sched: Account for CPUs with less than 4 cores in freq. invariance
If a CPU has less than 4 physical cores, MSR_TURBO_RATIO_LIMIT will
rightfully report that the 4C turbo ratio is zero. In such cases, use the
1C turbo ratio instead for frequency invariance calculations.

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Reported-by: Like Xu <like.xu@linux.intel.com>
Reported-by: Neil Rickert <nwr10cst-oslnx@yahoo.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Link: https://lkml.kernel.org/r/20200416054745.740-3-ggherdovich@suse.cz
2020-04-22 23:10:13 +02:00
Giovanni Gherdovich
9a6c2c3c7a x86, sched: Bail out of frequency invariance if base frequency is unknown
Some hypervisors such as VMWare ESXi 5.5 advertise support for
X86_FEATURE_APERFMPERF but then fill all MSR's with zeroes. In particular,
MSR_PLATFORM_INFO set to zero tricks the code that wants to know the base
clock frequency of the CPU (highest non-turbo frequency), producing a
division by zero when computing the ratio turbo_freq/base_freq necessary
for frequency invariant accounting.

It is to be noted that even if MSR_PLATFORM_INFO contained the appropriate
data, APERF and MPERF are constantly zero on ESXi 5.5, thus freq-invariance
couldn't be done in principle (not that it would make a lot of sense in a
VM anyway). The real problem is advertising X86_FEATURE_APERFMPERF. This
appears to be fixed in more recent versions: ESXi 6.7 doesn't advertise
that feature.

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200416054745.740-2-ggherdovich@suse.cz
2020-04-22 23:10:13 +02:00
Borislav Petkov
2fa9a3cf30 x86/smpboot: Remove the last ICPU() macro
Now all is using the shiny new macros.

No code changed:

  # arch/x86/kernel/smpboot.o:

   text    data     bss     dec     hex filename
  16432    2649      40   19121    4ab1 smpboot.o.before
  16432    2649      40   19121    4ab1 smpboot.o.after

md5:
   a58104003b72c1de533095bc5a4c30a9  smpboot.o.before.asm
   a58104003b72c1de533095bc5a4c30a9  smpboot.o.after.asm

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200324185836.GI22931@zn.tnic
2020-04-13 10:34:09 +02:00
Linus Torvalds
fdf5563a72 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "This topic tree contains more commits than usual:

   - most of it are uaccess cleanups/reorganization by Al

   - there's a bunch of prototype declaration (--Wmissing-prototypes)
     cleanups

   - misc other cleanups all around the map"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/mm/set_memory: Fix -Wmissing-prototypes warnings
  x86/efi: Add a prototype for efi_arch_mem_reserve()
  x86/mm: Mark setup_emu2phys_nid() static
  x86/jump_label: Move 'inline' keyword placement
  x86/platform/uv: Add a missing prototype for uv_bau_message_interrupt()
  kill uaccess_try()
  x86: unsafe_put-style macro for sigmask
  x86: x32_setup_rt_frame(): consolidate uaccess areas
  x86: __setup_rt_frame(): consolidate uaccess areas
  x86: __setup_frame(): consolidate uaccess areas
  x86: setup_sigcontext(): list user_access_{begin,end}() into callers
  x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit)
  x86: ia32_setup_rt_frame(): consolidate uaccess areas
  x86: ia32_setup_frame(): consolidate uaccess areas
  x86: ia32_setup_sigcontext(): lift user_access_{begin,end}() into the callers
  x86/alternatives: Mark text_poke_loc_init() static
  x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl()
  x86/mm: Drop pud_mknotpresent()
  x86: Replace setup_irq() by request_irq()
  x86/configs: Slightly reduce defconfigs
  ...
2020-03-31 11:04:05 -07:00
Linus Torvalds
642e53ead6 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle are:

   - Various NUMA scheduling updates: harmonize the load-balancer and
     NUMA placement logic to not work against each other. The intended
     result is better locality, better utilization and fewer migrations.

   - Introduce Thermal Pressure tracking and optimizations, to improve
     task placement on thermally overloaded systems.

   - Implement frequency invariant scheduler accounting on (some) x86
     CPUs. This is done by observing and sampling the 'recent' CPU
     frequency average at ~tick boundaries. The CPU provides this data
     via the APERF/MPERF MSRs. This hopefully makes our capacity
     estimates more precise and keeps tasks on the same CPU better even
     if it might seem overloaded at a lower momentary frequency. (As
     usual, turbo mode is a complication that we resolve by observing
     the maximum frequency and renormalizing to it.)

   - Add asymmetric CPU capacity wakeup scan to improve capacity
     utilization on asymmetric topologies. (big.LITTLE systems)

   - PSI fixes and optimizations.

   - RT scheduling capacity awareness fixes & improvements.

   - Optimize the CONFIG_RT_GROUP_SCHED constraints code.

   - Misc fixes, cleanups and optimizations - see the changelog for
     details"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
  threads: Update PID limit comment according to futex UAPI change
  sched/fair: Fix condition of avg_load calculation
  sched/rt: cpupri_find: Trigger a full search as fallback
  kthread: Do not preempt current task if it is going to call schedule()
  sched/fair: Improve spreading of utilization
  sched: Avoid scale real weight down to zero
  psi: Move PF_MEMSTALL out of task->flags
  MAINTAINERS: Add maintenance information for psi
  psi: Optimize switching tasks inside shared cgroups
  psi: Fix cpu.pressure for cpu.max and competing cgroups
  sched/core: Distribute tasks within affinity masks
  sched/fair: Fix enqueue_task_fair warning
  thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code
  sched/rt: Remove unnecessary push for unfit tasks
  sched/rt: Allow pulling unfitting task
  sched/rt: Optimize cpupri_find() on non-heterogenous systems
  sched/rt: Re-instate old behavior in select_task_rq_rt()
  sched/rt: cpupri_find: Implement fallback mechanism for !fit case
  sched/fair: Fix reordering of enqueue/dequeue_task_fair()
  sched/fair: Fix runnable_avg for throttled cfs
  ...
2020-03-30 17:01:51 -07:00
Thomas Gleixner
adefe55e72 x86/kernel: Convert to new CPU match macros
The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.

Get rid the of the local macro wrappers for consistency.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20200320131509.250559388@linutronix.de
2020-03-24 21:28:26 +01:00
Martin Molnar
4d1d0977a2 x86: Fix a handful of typos
Fix a couple of typos in code comments.

 [ bp: While at it: s/IRQ's/IRQs/. ]

Signed-off-by: Martin Molnar <martin.molnar.programming@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/0819a044-c360-44a4-f0b6-3f5bafe2d35c@gmail.com
2020-02-16 20:58:06 +01:00
Giovanni Gherdovich
918229cdd5 x86/intel_pstate: Handle runtime turbo disablement/enablement in frequency invariance
On some platforms such as the Dell XPS 13 laptop the firmware disables turbo
when the machine is disconnected from AC, and viceversa it enables it again
when it's reconnected. In these cases a _PPC ACPI notification is issued.

The scheduler needs to know freq_max for frequency-invariant calculations.
To account for turbo availability to come and go, record freq_max at boot as
if turbo was available and store it in a helper variable. Use a setter
function to swap between freq_base and freq_max every time turbo goes off or on.

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-7-ggherdovich@suse.cz
2020-01-28 21:37:06 +01:00
Giovanni Gherdovich
298c6f99bf x86, sched: Add support for frequency invariance on ATOM
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the 1-core
turbo ratio.

We intended to perform tests validating that this patch doesn't regress in
terms of energy efficiency, given that this is the primary concern on Atom
processors. Alas, we found out that turbostat doesn't support reading RAPL
interfaces on our test machine (Airmont), and we don't have external equipment
to measure power consumption; all we have is the performance results of the
benchmarks we ran.

Test machine:

Platform    : Dell Wyse 3040 Thin Client[1]
CPU Model   : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont)
Fam/Mod/Ste : 6:76:4
Topology    : 1 socket, 4 cores / 4 threads
Memory      : 2G
Storage     : onboard flash, XFS filesystem

[1] https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client

Base frequency and available turbo levels (MHz):

    Min Operating Freq   266 |***
    Low Freq Mode        800 |********
    Base Freq           2400 |************************
    4 Cores             2800 |****************************
    3 Cores             2800 |****************************
    2 Cores             3200 |********************************
    1 Core              3200 |********************************

Tested kernels:

Baseline      : v5.4-rc1,              intel_pstate passive,  schedutil
Comparison #1 : v5.4-rc1,              intel_pstate active ,  powersave
Comparison #2 : v5.4-rc1, this patch,  intel_pstate passive,  schedutil

tbench, hackbench and kernbench performed the same under all three kernels;
dbench ran faster with intel_pstate/powersave and the git unit tests were a
lot faster with intel_pstate/powersave and invariant schedutil wrt the
baseline. Not that any of this is terrbily interesting anyway, one doesn't buy
an Atom system to go fast. Power consumption regressions aren't expected but
we lack the equipment to make that measurement. Turbostat seems to think that
reading RAPL on this machine isn't a good idea and we're trusting that
decision.

comparison ratio of performance with baseline; 1.00 means neutral,
lower is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench                0.90             ~
    kernbench             0.98          0.97
    gitsource             0.63          0.43

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-6-ggherdovich@suse.cz
2020-01-28 21:37:05 +01:00
Giovanni Gherdovich
eacf0474ae x86, sched: Add support for frequency invariance on ATOM_GOLDMONT*
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On GOLDMONT (aka Apollo Lake), GOLDMONT_D (aka Denverton) and
GOLDMONT_PLUS CPUs (aka Gemini Lake) set freq_max to the highest frequency
reported by the CPU.

The encoding of turbo ratios for GOLDMONT* is identical to the one for
SKYLAKE_X, but we treat the Atom case apart because we want to set freq_max to
a higher value, thus the ratio freq_curr/freq_max to be lower, leading to more
conservative frequency selections (favoring power efficiency).

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-5-ggherdovich@suse.cz
2020-01-28 21:37:04 +01:00
Giovanni Gherdovich
8bea0dfb4a x86, sched: Add support for frequency invariance on XEON_PHI_KNL/KNM
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
reported by the CPU.

Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
one or two turbo frequencies; in the former case that's 100 MHz above the base
frequency, in the latter case the two levels are 100 MHz and 200 MHz above
base frequency.

We set freq_max to the second-highest frequency reported by the CPU. This
could be the base frequency (if only one turbo level is available) or the first
turbo level (if two levels are available). The rationale is to compromise
between power efficiency or performance -- going straight to max turbo would
favor efficiency and blindly using base freq would favor performance.

For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
to get the available frequencies (taken from a comment in turbostat's sources):

    [0] -- Reserved
    [7:1] -- Base value of number of active cores of bucket 1.
    [15:8] -- Base value of freq ratio of bucket 1.
    [20:16] -- +ve delta of number of active cores of bucket 2.
    i.e. active cores of bucket 2 =
    active cores of bucket 1 + delta
    [23:21] -- Negative delta of freq ratio of bucket 2.
    i.e. freq ratio of bucket 2 =
    freq ratio of bucket 1 - delta
    [28:24]-- +ve delta of number of active cores of bucket 3.
    [31:29]-- -ve delta of freq ratio of bucket 3.
    [36:32]-- +ve delta of number of active cores of bucket 4.
    [39:37]-- -ve delta of freq ratio of bucket 4.
    [44:40]-- +ve delta of number of active cores of bucket 5.
    [47:45]-- -ve delta of freq ratio of bucket 5.
    [52:48]-- +ve delta of number of active cores of bucket 6.
    [55:53]-- -ve delta of freq ratio of bucket 6.
    [60:56]-- +ve delta of number of active cores of bucket 7.
    [63:61]-- -ve delta of freq ratio of bucket 7.

1. PERFORMANCE EVALUATION: TBENCH +5%
2. NEUTRAL BENCHMARKS (ALL OTHERS)
3. TEST SETUP

1. PERFORMANCE EVALUATION: TBENCH +5%
-------------------------------------

A performance evaluation was conducted on a Knights Mill machine (see "Test
Setup" below), were the frequency-invariance patch (on schedutil) is compared
to both non-invariant schedutil and active intel_pstate with powersave: all
three tested kernels behave the same performance-wise and with regard to power
consumption (performance per watt). The only notable difference is tbench:

comparison ratio of performance with baseline; 1.00 means neutral,
higher is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench                1.04          1.05

performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench                1.03          1.04

which essentially means that frequency-invariant schedutil is 5% better than
baseline, the same as intel_pstate+powersave.

As the results above are averaged over the varying parameter, here the detailed
table.

Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                    5.2.0 vanilla (BASELINE)                 5.2.0 intel_pstate                     5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean   1         49.06  +- 2.12% (        )         51.66  +- 1.52% (   5.30%)         52.87  +- 0.88% (   7.76%)
Hmean   2         93.82  +- 0.45% (        )        103.24  +- 0.70% (  10.05%)        105.90  +- 0.70% (  12.88%)
Hmean   4        192.46  +- 1.15% (        )        215.95  +- 0.60% (  12.21%)        215.78  +- 1.43% (  12.12%)
Hmean   8        406.74  +- 2.58% (        )        438.58  +- 0.36% (   7.83%)        437.61  +- 0.97% (   7.59%)
Hmean   16       857.70  +- 1.22% (        )        890.26  +- 0.72% (   3.80%)        889.11  +- 0.73% (   3.66%)
Hmean   32      1760.10  +- 0.92% (        )       1791.70  +- 0.44% (   1.79%)       1787.95  +- 0.44% (   1.58%)
Hmean   64      3183.50  +- 0.34% (        )       3183.19  +- 0.36% (  -0.01%)       3187.53  +- 0.36% (   0.13%)
Hmean   128     4830.96  +- 0.31% (        )       4846.53  +- 0.30% (   0.32%)       4855.86  +- 0.30% (   0.52%)
Hmean   256     5467.98  +- 0.38% (        )       5793.80  +- 0.28% (   5.96%)       5821.94  +- 0.17% (   6.47%)
Hmean   512     5398.10  +- 0.06% (        )       5745.56  +- 0.08% (   6.44%)       5503.68  +- 0.07% (   1.96%)
Hmean   1024    5290.43  +- 0.63% (        )       5221.07  +- 0.47% (  -1.31%)       5277.22  +- 0.80% (  -0.25%)
Hmean   1088    5139.71  +- 0.57% (        )       5236.02  +- 0.71% (   1.87%)       5190.57  +- 0.41% (   0.99%)

2. NEUTRAL BENCHMARKS (ALL OTHERS)
----------------------------------

* pgbench (both read/write and read-only)
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf
* dbench
* kernbench
* gitsource (git unit test suite)

3. TEST SETUP
-------------

Test machine:

CPU Model   : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. Knights Mill)
Fam/Mod/Ste : 6:133:0
Topology    : 1 socket, 68 cores / 272 threads
Memory      : 96G
Storage     : rotary, XFS filesystem

Max EFFICiency, BASE frequency and available turbo levels (MHz):

    EFFIC   1000 |**********
    BASE    1100 |***********
    68C     1100 |***********
    30C     1200 |************

Tested kernels:

Baseline      : v5.2,              intel_pstate passive,  schedutil
Comparison #1 : v5.2,              intel_pstate active ,  powersave
Comparison #2 : v5.2, this patch,  intel_pstate passive,  schedutil

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-4-ggherdovich@suse.cz
2020-01-28 21:37:02 +01:00
Giovanni Gherdovich
2a0abc5969 x86, sched: Add support for frequency invariance on SKYLAKE_X
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On SKYLAKE_X CPUs set freq_max to the highest frequency that can
be sustained by a group of at least 4 cores.

From the changelog of commit 31e07522be ("tools/power turbostat: fix
decoding for GLM, DNV, SKX turbo-ratio limits"):

 >   Newer processors do not hard-code the the number of cpus in each bin
 >   to {1, 2, 3, 4, 5, 6, 7, 8}  Rather, they can specify any number
 >   of CPUS in each of the 8 bins:
 >
 >   eg.
 >
 >   ...
 >   37 * 100.0 = 3600.0 MHz max turbo 4 active cores
 >   38 * 100.0 = 3700.0 MHz max turbo 3 active cores
 >   39 * 100.0 = 3800.0 MHz max turbo 2 active cores
 >   39 * 100.0 = 3900.0 MHz max turbo 1 active cores
 >
 >   could now look something like this:
 >
 >   ...
 >   37 * 100.0 = 3600.0 MHz max turbo 16 active cores
 >   38 * 100.0 = 3700.0 MHz max turbo 8 active cores
 >   39 * 100.0 = 3800.0 MHz max turbo 4 active cores
 >   39 * 100.0 = 3900.0 MHz max turbo 2 active cores

This encoding of turbo levels applies to both SKYLAKE_X and GOLDMONT/GOLDMONT_D,
but we treat these two classes in separate commits because their freq_max
values need to be different. For SKX we prefer a lower freq_max in the ratio
freq_curr/freq_max, allowing load and utilization to overshoot and the
schedutil governor to be more performance-oriented. Models from the Atom
series (such as GOLDMONT*) are handled in a forthcoming commit as they have to
favor power-efficiency over performance.

Results from a performance evaluation follow.

1. TEST SETUP
2. NEUTRAL BENCHMARKS
3. NON-NEUTRAL BENCHMARKS
4. DETAILED TABLES

1. TEST SETUP
-------------

Test machine:

CPU Model   : Intel Xeon Platinum 8260L CPU @ 2.40GHz (a.k.a. Cascade Lake)
Fam/Mod/Ste : 6:85:6
Topology    : 2 sockets, 24 cores / 48 threads each socket
Memory      : 192G
Storage     : SSD, XFS filesystem

Max EFFICiency, BASE frequency and available turbo levels (MHz):

    EFFIC   1000 |**********
    BASE    2400 |************************
    24C     3100 |*******************************
    20C     3300 |*********************************
    16C     3600 |************************************
    12C     3600 |************************************
    8C      3600 |************************************
    4C      3700 |*************************************
    2C      3900 |***************************************

Tested kernels:

Baseline      : v5.2,              intel_pstate passive,  schedutil
Comparison #1 : v5.2,              intel_pstate active ,  powersave+HWP
Comparison #2 : v5.2, this patch,  intel_pstate passive,  schedutil

2. NEUTRAL BENCHMARKS
---------------------

* pgbench read/write
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf

3. NON-NEUTRAL BENCHMARKS
-------------------------

comparison ratio with baseline; 1.00 means neutral, higher is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    pgbench read-only     1.10             ~
    tbench                1.82          1.14

comparison ratio with baseline; 1.00 means neutral, lower is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench                   ~          0.97
    kernbench             0.88          0.78
    gitsource[*]             ~          0.46

[*] "gitsource" consists in running git's unit tests
tilde (~) means 1.00, ie result identical to baseline

Performance per watt:

performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:

		      I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench                0.92          0.91
    tbench                1.26          1.04
    kernbench             0.95          0.96
    gitsource             1.03          1.30

Similarly to earlier Xeons, measurable performance gains over non-invariant
schedutil are observed on dbench, tbench, kernel compilation and running the
git unit tests suite. Looking at the detailed tables show that the patch
scores the largest difference when the machine is lightly loaded. Power
efficiency suffers lightly on kernbench and a bit more on dbench, but largely
improves on gitsource (which also runs considerably faster). For reference, we
also report results using active intel_pstate with powersave and HWP; the
largest gap between non-invariant schedutil and intel_pstate+powersave is
still tbench, which runs 82% better and with 26% improved efficiency on the
latter configuration -- this divide isn't closed yet by frequency-invariant
schedutil.

4. DETAILED TABLES
------------------

Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                     5.2.0 vanilla (BASELINE)            5.2.0 intel_pstate/HWP                    5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean   1         183.56  +- 0.21% (        )       516.12  +- 0.57% ( 181.18%)       185.59  +- 0.59% (   1.11%)
Hmean   2         365.75  +- 0.25% (        )      1015.14  +- 0.33% ( 177.55%)       402.59  +- 4.48% (  10.07%)
Hmean   4         720.99  +- 0.44% (        )      1951.75  +- 0.28% ( 170.70%)       738.39  +- 1.72% (   2.41%)
Hmean   8        1449.93  +- 0.34% (        )      3830.56  +- 0.24% ( 164.19%)      1750.36  +- 4.65% (  20.72%)
Hmean   16       2874.26  +- 0.57% (        )      7381.62  +- 0.53% ( 156.82%)      4348.35  +- 2.22% (  51.29%)
Hmean   32       6116.17  +- 5.10% (        )     13013.05  +- 0.08% ( 112.76%)      8980.35  +- 0.66% (  46.83%)
Hmean   64      14485.04  +- 3.46% (        )     17835.12  +- 0.35% (  23.13%)     16540.73  +- 0.51% (  14.19%)
Hmean   128     30779.16  +- 3.20% (        )     32796.94  +- 2.13% (   6.56%)     31512.58  +- 0.20% (   2.38%)
Hmean   256     34664.66  +- 0.81% (        )     34604.67  +- 0.46% (  -0.17%)     34943.70  +- 0.25% (   0.80%)
Hmean   384     33957.51  +- 0.11% (        )     34091.50  +- 0.14% (   0.39%)     33921.41  +- 0.09% (  -0.11%)

Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                    5.2.0 vanilla (BASELINE)             5.2.0 intel_pstate/HWP                     5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean   2        332.94  +- 0.40% (        )        260.16  +- 0.45% (  21.86%)        233.56  +- 0.21% (  29.85%)
Amean   4        173.04  +- 0.43% (        )        138.76  +- 0.03% (  19.81%)        123.59  +- 0.11% (  28.58%)
Amean   8         89.65  +- 0.20% (        )         73.54  +- 0.09% (  17.97%)         65.69  +- 0.10% (  26.72%)
Amean   16        48.08  +- 1.41% (        )         41.64  +- 1.61% (  13.40%)         36.00  +- 1.80% (  25.11%)
Amean   32        28.78  +- 0.72% (        )         26.61  +- 1.99% (   7.55%)         23.19  +- 1.68% (  19.43%)
Amean   64        20.46  +- 1.85% (        )         19.76  +- 0.35% (   3.42%)         17.38  +- 0.92% (  15.06%)
Amean   128       18.69  +- 1.70% (        )         17.59  +- 1.04% (   5.90%)         15.73  +- 1.40% (  15.85%)
Amean   192       18.82  +- 1.01% (        )         17.76  +- 0.77% (   5.67%)         15.57  +- 1.80% (  17.28%)

Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                 5.2.0 vanilla (BASELINE)           5.2.0 intel_pstate/HWP                    5.2.0 freq-inv
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         792.49  +- 0.20% (        )      779.35  +- 0.24% (   1.66%)      427.14  +- 0.16% (   46.10%)

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-3-ggherdovich@suse.cz
2020-01-28 21:37:01 +01:00
Giovanni Gherdovich
1567c3e346 x86, sched: Add support for frequency invariance
Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.

The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on-par with, the powersave governor
from the intel_pstate driver/framework.

Large performance gains are obtained when the machine is lightly loaded and
no regression are observed at saturation. The benchmarks with the largest
gains are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.

1. FREQUENCY INVARIANCE: MOTIVATION
   * Without it, a task looks larger if the CPU runs slower

2. PECULIARITIES OF X86
   * freq invariance accounting requires knowing the ratio freq_curr/freq_max
   2.1 CURRENT FREQUENCY
       * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
   2.2 MAX FREQUENCY
       * It varies with time (turbo). As an approximation, we set it to a
         constant, i.e. 4-cores turbo frequency.

3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
   * The invariant schedutil's formula has no feedback loop and reacts faster
     to utilization changes

4. KNOWN LIMITATIONS
   * In some cases tasks can't reach max util despite how hard they try

5. PERFORMANCE TESTING
   5.1 MACHINES
       * Skylake, Broadwell, Haswell
   5.2 SETUP
       * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
         active cores turbo w/ invariant schedutil, and intel_pstate/powersave
   5.3 BENCHMARK RESULTS
       5.3.1 NEUTRAL BENCHMARKS
             * NAS Parallel Benchmark (HPC), hackbench
       5.3.2 NON-NEUTRAL BENCHMARKS
             * tbench (10-30% better), kernbench (10-15% better),
               shell-intensive-scripts (30-50% better)
             * no regressions
       5.3.3 SELECTION OF DETAILED RESULTS
       5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
             * dbench (5% worse on one machine), kernbench (3% worse),
               tbench (5-10% better), shell-intensive-scripts (10-40% better)

6. MICROARCH'ES ADDRESSED HERE
   * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum
     etc have different MSRs semantic for querying turbo levels)

7. REFERENCES
   * MMTests performance testing framework, github.com/gormanm/mmtests

 +-------------------------------------------------------------------------+
 | 1. FREQUENCY INVARIANCE: MOTIVATION
 +-------------------------------------------------------------------------+

For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*]. In a nutshell, without frequency scale-invariance tasks look
larger just because the CPU is running slower.

[*] (footnote: this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities its the best approximation
we can make.)

 +-------------------------------------------------------------------------+
 | 2. PECULIARITIES OF X86
 +-------------------------------------------------------------------------+

Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.

2.1 CURRENT FREQUENCY
====================

Since modern x86 has hardware control over the actual frequency we run
at (because amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.

Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, don't
do so every time we need the value, but amortize the cost by doing it
every tick.

2.2 MAX FREQUENCY
=================

Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum frequency
available to a given core changes with time.

The approach taken in this change is to arbitrarily set freq_max to a constant
value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
microarchitectures, after evaluating the following candidates:

    * 1-core (1C) turbo frequency (the fastest turbo state available)
    * around base frequency (a.k.a. max P-state)
    * something in between, such as 4C turbo

To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is), viceversa with a smaller denominator PELT
signals will tend to overshoot. Given that PELT drives frequency selection
in the schedutil governor, we will have:

    freq_max set to     | effect on DVFS
    --------------------+------------------
    1C turbo            | power efficiency (lower freq choices)
    base freq           | performance (higher util_avg, higher freq requests)
    4C turbo            | a bit of both

4C turbo proves to be a good compromise in a number of benchmarks (see below).

 +-------------------------------------------------------------------------+
 | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
 +-------------------------------------------------------------------------+

Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula from

    freq_next = 1.25 * freq_curr * util            [non-invariant util signal]

to

    freq_next = 1.25 * freq_max * util             [invariant util signal]

where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.

Compare it to the update formula of intel_pstate/powersave:

    freq_next = 1.25 * freq_max * Busy%

where again freq_max is 1C turbo and Busy% is the percentage of time not spent
idling (calculated with delta_MPERF / delta_TSC); essentially the same as
invariant schedutil, and largely responsible for intel_pstate/powersave good
reputation. The non-invariant schedutil formula is derived from the invariant
one by approximating util_inv with util_raw * freq_curr / freq_max, but this
has limitations.

Testing shows improved performances due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.

 +-------------------------------------------------------------------------+
 | 4. KNOWN LIMITATIONS
 +-------------------------------------------------------------------------+

It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).

If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those task
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.

While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweights the cost of this limitation.

 [Lelli-2018]
 https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/

 +-------------------------------------------------------------------------+
 | 5. PERFORMANCE TESTING
 +-------------------------------------------------------------------------+

5.1 MACHINES
============

We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.

* 8x-SKYLAKE-UMA:
  Single socket E3-1240 v5, Skylake 4 cores/8 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC    800 |********
    BASE    3500 |***********************************
    4C      3700 |*************************************
    3C      3800 |**************************************
    2C      3900 |***************************************
    1C      3900 |***************************************

* 80x-BROADWELL-NUMA:
  Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2200 |**********************
    8C      2900 |*****************************
    7C      3000 |******************************
    6C      3100 |*******************************
    5C      3200 |********************************
    4C      3300 |*********************************
    3C      3400 |**********************************
    2C      3600 |************************************
    1C      3600 |************************************

* 48x-HASWELL-NUMA
  Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2300 |***********************
    12C     2600 |**************************
    11C     2600 |**************************
    10C     2600 |**************************
    9C      2600 |**************************
    8C      2600 |**************************
    7C      2600 |**************************
    6C      2600 |**************************
    5C      2700 |***************************
    4C      2800 |****************************
    3C      2900 |*****************************
    2C      3100 |*******************************
    1C      3100 |*******************************

5.2 SETUP
=========

* The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
  driver in passive mode.
* The rationale for choosing the various freq_max values to test have been to
  try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
  on all machines), plus one more value closer to base_freq but still in the
  turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
* In addition we've run all tests with intel_pstate/powersave for comparison.
* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
  with active intel_pstate on this machine use that.

This gives, in terms of combinations tested on each machine:

* 8x-SKYLAKE-UMA
  * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
  * intel_pstate active + powersave + HWP
  * invariant schedutil, freq_max = 1C turbo
  * invariant schedutil, freq_max = 3C turbo
  * invariant schedutil, freq_max = 4C turbo

* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
  * [same as 8x-SKYLAKE-UMA, but no HWP capable]
  * invariant schedutil, freq_max = 8C turbo
    (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")

5.3 BENCHMARK RESULTS
=====================

5.3.1 NEUTRAL BENCHMARKS
------------------------

Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:

* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
  computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)

5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------

What follow are summary tables where each benchmark result is given a score.

* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
  means a score of 1.00.
* The results in the score ratio are the geometric means of results running
  the benchmark with different parameters (eg: for kernbench: using 1, 2, 4,
  ... number of processes; for pgbench: varying the number of clients, and so
  on).
* The first three tables show higher-is-better kind of tests (i.e. measured in
  operations/second), the subsequent three show lower-is-better kind of tests
  (i.e. the workload is fixed and we measure elapsed time, think kernbench).
* "gitsource" is a name we made up for the test consisting in running the
  entire unit tests suite of the Git SCM and measuring how long it takes. We
  take it as a typical example of shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
  columns show invariant schedutil for different values of freq_max. 4C turbo
  is circled as it's the value we've chosen for the final implementation.

80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
pgbench-ro           1.14   ~      ~     | 1.11 |  1.14
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.06   ~      1.06  | 1.05 |  1.07
netperf-tcp          ~      1.03   ~     | 1.01 |  1.02
tbench4              1.57   1.18   1.22  | 1.30 |  1.56
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
pgbench-ro           ~      ~      ~     | ~    |
pgbench-rw           ~      ~      ~     | ~    |
netperf-udp          ~      ~      ~     | ~    |
netperf-tcp          ~      ~      ~     | ~    |
tbench4              1.30   1.14   1.14  | 1.16 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  12C
pgbench-ro           1.15   ~      ~     | 1.06 |  1.16
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.05   0.97   1.04  | 1.04 |  1.02
netperf-tcp          0.96   1.01   1.01  | 1.01 |  1.01
tbench4              1.50   1.05   1.13  | 1.13 |  1.25
                                         +------+

In the table above we see that active intel_pstate is slightly better than our
4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
read-only pgbench and much better on tbench. Both cases are notable in which
it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil to get closer.

If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.

80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              1.23   0.95   0.95  | 0.95 |  0.95
kernbench            0.93   0.83   0.83  | 0.83 |  0.82
gitsource            0.98   0.49   0.49  | 0.49 |  0.48
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
dbench4              ~      ~      ~     | ~    |
kernbench            ~      ~      ~     | ~    |
gitsource            0.92   0.55   0.55  | 0.55 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              ~      ~      ~     | ~    |  ~
kernbench            0.94   0.90   0.89  | 0.90 |  0.90
gitsource            0.97   0.69   0.69  | 0.69 |  0.69
                                         +------+

dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
non-invariant schedutil. We repeated that run getting consistent results. Out
of scope for the patch at hand, but deserving future investigation. Other than
that, we previously ran this campaign with Linux v5.0 and saw the patch doing
better on dbench a the time. We haven't checked closely and can only speculate
at this point.

On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
the detailed tables that the gains concentrate on low process counts (lightly
loaded machines).

The test we call "gitsource" (running the git unit test suite, a long-running
single-threaded shell script) appears rather spectacular in this table (gains
of 30-50% depending on the machine). It is to be noted, however, that
gitsource has no adjustable parameters (such as the number of jobs in
kernbench, which we average over in order to get a single-number summary
score) and is exactly the kind of low-parallelism workload that benefits the
most from this patch. When looking at the detailed tables of kernbench or
tbench4, at low process or client counts one can see similar numbers.

5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------

Machine            : 48x-HASWELL-NUMA
Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        126.73  +- 0.31% (        )      315.91  +- 0.66% ( 149.28%)      125.03  +- 0.76% (  -1.34%)
Hmean  2        258.04  +- 0.62% (        )      614.16  +- 0.51% ( 138.01%)      269.58  +- 1.45% (   4.47%)
Hmean  4        514.30  +- 0.67% (        )     1146.58  +- 0.54% ( 122.94%)      533.84  +- 1.99% (   3.80%)
Hmean  8       1111.38  +- 2.52% (        )     2159.78  +- 0.38% (  94.33%)     1359.92  +- 1.56% (  22.36%)
Hmean  16      2286.47  +- 1.36% (        )     3338.29  +- 0.21% (  46.00%)     2720.20  +- 0.52% (  18.97%)
Hmean  32      4704.84  +- 0.35% (        )     4759.03  +- 0.43% (   1.15%)     4774.48  +- 0.30% (   1.48%)
Hmean  64      7578.04  +- 0.27% (        )     7533.70  +- 0.43% (  -0.59%)     7462.17  +- 0.65% (  -1.53%)
Hmean  128     6998.52  +- 0.16% (        )     6987.59  +- 0.12% (  -0.16%)     6909.17  +- 0.14% (  -1.28%)
Hmean  192     6901.35  +- 0.25% (        )     6913.16  +- 0.10% (   0.17%)     6855.47  +- 0.21% (  -0.66%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                  5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        128.43  +- 0.28% (   1.34%)      130.64  +- 3.81% (   3.09%)      153.71  +- 5.89% (  21.30%)
Hmean  2        311.70  +- 6.15% (  20.79%)      281.66  +- 3.40% (   9.15%)      305.08  +- 5.70% (  18.23%)
Hmean  4        641.98  +- 2.32% (  24.83%)      623.88  +- 5.28% (  21.31%)      906.84  +- 4.65% (  76.32%)
Hmean  8       1633.31  +- 1.56% (  46.96%)     1714.16  +- 0.93% (  54.24%)     2095.74  +- 0.47% (  88.57%)
Hmean  16      3047.24  +- 0.42% (  33.27%)     3155.02  +- 0.30% (  37.99%)     3634.58  +- 0.15% (  58.96%)
Hmean  32      4734.31  +- 0.60% (   0.63%)     4804.38  +- 0.23% (   2.12%)     4674.62  +- 0.27% (  -0.64%)
Hmean  64      7699.74  +- 0.35% (   1.61%)     7499.72  +- 0.34% (  -1.03%)     7659.03  +- 0.25% (   1.07%)
Hmean  128     6935.18  +- 0.15% (  -0.91%)     6942.54  +- 0.10% (  -0.80%)     7004.85  +- 0.12% (   0.09%)
Hmean  192     6901.62  +- 0.12% (   0.00%)     6856.93  +- 0.10% (  -0.64%)     6978.74  +- 0.10% (   1.12%)

This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
visible up to 16 clients and the saturated scenario is the same as baseline.

The scores in the summary table from the previous sections are ratios of
geometric means of the results over different clients, as seen in this table.

Machine            : 80x-BROADWELL-NUMA
Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        379.68  +- 0.06% (        )      330.20  +- 0.43% (  13.03%)      285.93  +- 0.07% (  24.69%)
Amean  4        200.15  +- 0.24% (        )      175.89  +- 0.22% (  12.12%)      153.78  +- 0.25% (  23.17%)
Amean  8        106.20  +- 0.31% (        )       95.54  +- 0.23% (  10.03%)       86.74  +- 0.10% (  18.32%)
Amean  16        56.96  +- 1.31% (        )       53.25  +- 1.22% (   6.50%)       48.34  +- 1.73% (  15.13%)
Amean  32        34.80  +- 2.46% (        )       33.81  +- 0.77% (   2.83%)       30.28  +- 1.59% (  12.99%)
Amean  64        26.11  +- 1.63% (        )       25.04  +- 1.07% (   4.10%)       22.41  +- 2.37% (  14.16%)
Amean  128       24.80  +- 1.36% (        )       23.57  +- 1.23% (   4.93%)       21.44  +- 1.37% (  13.55%)
Amean  160       24.85  +- 0.56% (        )       23.85  +- 1.17% (   4.06%)       21.25  +- 1.12% (  14.49%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                   5.2.0 8C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        284.08  +- 0.13% (  25.18%)      283.96  +- 0.51% (  25.21%)      285.05  +- 0.21% (  24.92%)
Amean  4        153.18  +- 0.22% (  23.47%)      154.70  +- 1.64% (  22.71%)      153.64  +- 0.30% (  23.24%)
Amean  8         87.06  +- 0.28% (  18.02%)       86.77  +- 0.46% (  18.29%)       86.78  +- 0.22% (  18.28%)
Amean  16        48.03  +- 0.93% (  15.68%)       47.75  +- 1.99% (  16.17%)       47.52  +- 1.61% (  16.57%)
Amean  32        30.23  +- 1.20% (  13.14%)       30.08  +- 1.67% (  13.57%)       30.07  +- 1.67% (  13.60%)
Amean  64        22.59  +- 2.02% (  13.50%)       22.63  +- 0.81% (  13.32%)       22.42  +- 0.76% (  14.12%)
Amean  128       21.37  +- 0.67% (  13.82%)       21.31  +- 1.15% (  14.07%)       21.17  +- 1.93% (  14.63%)
Amean  160       21.68  +- 0.57% (  12.76%)       21.18  +- 1.74% (  14.77%)       21.22  +- 1.00% (  14.61%)

The patch outperform active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.

Machine            : 8x-SKYLAKE-UMA
Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                            5.2.0 vanilla           5.2.0 intel_pstate/hwp         5.2.0 1C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         858.85  +- 1.16% (        )      791.94  +- 0.21% (   7.79%)      474.95 (  44.70%)

                           5.2.0 3C-turbo                   5.2.0 4C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         475.26  +- 0.20% (  44.66%)      474.34  +- 0.13% (  44.77%)

In this test, which is of interest as representing shell-intensive
(i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.

5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------

The following table shows average power consumption in watt for each
benchmark. Data comes from turbostat (package average), which in turn is read
from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
it's reasonable to ignore other power consumers (such as memory or I/O). Also,
we don't have a power meter available in the lab so RAPL is the best we have.

turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. with don't
have detail on a per-parameter granularity, only on whole benchmarks).

80x-BROADWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |      8C
pgbench-ro       130.01   142.77   131.11   132.45  | 134.65 |  136.84
pgbench-rw        68.30    60.83    71.45    71.70  |  71.65 |   72.54
dbench4           90.25    59.06   101.43    99.89  | 101.10 |  102.94
netperf-udp       65.70    69.81    66.02    68.03  |  68.27 |   68.95
netperf-tcp       88.08    87.96    88.97    88.89  |  88.85 |   88.20
tbench4          142.32   176.73   153.02   163.91  | 165.58 |  176.07
kernbench         92.94   101.95   114.91   115.47  | 115.52 |  115.10
gitsource         40.92    41.87    75.14    75.20  |  75.40 |   75.70
                                                    +--------+
8x-SKYLAKE-UMA (power consumption, watts)
                                                    +--------+
              BASELINE I_PSTATE/HWP    1C       3C  |     4C |
pgbench-ro        46.49    46.68    46.56    46.59  |  46.52 |
pgbench-rw        29.34    31.38    30.98    31.00  |  31.00 |
dbench4           27.28    27.37    27.49    27.41  |  27.38 |
netperf-udp       22.33    22.41    22.36    22.35  |  22.36 |
netperf-tcp       27.29    27.29    27.30    27.31  |  27.33 |
tbench4           41.13    45.61    43.10    43.33  |  43.56 |
kernbench         42.56    42.63    43.01    43.01  |  43.01 |
gitsource         13.32    13.69    17.33    17.30  |  17.35 |
                                                    +--------+
48x-HASWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |     12C
pgbench-ro       128.84   136.04   129.87   132.43  | 132.30 |  134.86
pgbench-rw        37.68    37.92    37.17    37.74  |  37.73 |   37.31
dbench4           28.56    28.73    28.60    28.73  |  28.70 |   28.79
netperf-udp       56.70    60.44    56.79    57.42  |  57.54 |   57.52
netperf-tcp       75.49    75.27    75.87    76.02  |  76.01 |   75.95
tbench4          115.44   139.51   119.53   123.07  | 123.97 |  130.22
kernbench         83.23    91.55    95.58    95.69  |  95.72 |   96.04
gitsource         36.79    36.99    39.99    40.34  |  40.35 |   40.23
                                                    +--------+

A lower power consumption isn't necessarily better, it depends on what is done
with that energy. Here are tables with the ratio of performance-per-watt on
each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).

80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |    8C
pgbench-ro       1.04   1.06   0.94  | 1.07 |  1.08
pgbench-rw       1.10   0.97   0.96  | 0.96 |  0.97
dbench4          1.24   0.94   0.95  | 0.94 |  0.92
netperf-udp      ~      1.02   1.02  | ~    |  1.02
netperf-tcp      ~      1.02   ~     | ~    |  1.02
tbench4          1.26   1.10   1.06  | 1.12 |  1.26
kernbench        0.98   0.97   0.97  | 0.97 |  0.98
gitsource        ~      1.11   1.11  | 1.11 |  1.13
                                     +------+

8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                     +------+
         I_PSTATE/HWP     1C     3C  |   4C |
pgbench-ro       ~      ~      ~     | ~    |
pgbench-rw       0.95   0.97   0.96  | 0.96 |
dbench4          ~      ~      ~     | ~    |
netperf-udp      ~      ~      ~     | ~    |
netperf-tcp      ~      ~      ~     | ~    |
tbench4          1.17   1.09   1.08  | 1.10 |
kernbench        ~      ~      ~     | ~    |
gitsource        1.06   1.40   1.40  | 1.40 |
                                     +------+

48x-HASWELL-NUMA  (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |   12C
pgbench-ro       1.09   ~      1.09  | 1.03 |  1.11
pgbench-rw       ~      0.86   ~     | ~    |  0.86
dbench4          ~      1.02   1.02  | 1.02 |  ~
netperf-udp      ~      0.97   1.03  | 1.02 |  ~
netperf-tcp      0.96   ~      ~     | ~    |  ~
tbench4          1.24   ~      1.06  | 1.05 |  1.11
kernbench        0.97   0.97   0.98  | 0.97 |  0.96
gitsource        1.03   1.33   1.32  | 1.32 |  1.33
                                     +------+

These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench and
dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
percentage points (it has a 10-15% performance improvement, but apparently the
increase in power consumption is larger than that). tbench4 and gitsource, which
benefit the most from the patch, keep a positive score in this table which is
a welcome surprise; that suggests that in those particular workloads the
non-invariant schedutil (and active intel_pstate, too) makes some rather
suboptimal frequency selections.

+-------------------------------------------------------------------------+
| 6. MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+

The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
respectively. This excludes the recent Xeon Scalable Performance processors
line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.

Subsequent patches will address:

* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont

+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+

Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmark used are:

    db-pgbench-timed-ro-small-xfs
    db-pgbench-timed-rw-small-xfs
    io-dbench4-async-xfs
    network-netperf-unbound
    network-tbench
    scheduler-unbound
    workload-kerndevel-xfs
    workload-shellscripts-xfs
    hpc-nas-c-class-mpi-full-xfs
    hpc-nas-c-class-omp-full

All those benchmarks are generally available on the web:

pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
2020-01-28 21:36:59 +01:00
Linus Torvalds
c5f12fdb8b Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Thomas Gleixner:

 - Cleanup the apic IPI implementation by removing duplicated code and
   consolidating the functions into the APIC core.

 - Implement a safe variant of the IPI broadcast mode. Contrary to
   earlier attempts this uses the core tracking of which CPUs have been
   brought online at least once so that a broadcast does not end up in
   some dead end in BIOS/SMM code when the CPU is still waiting for
   init. Once all CPUs have been brought up once, IPI broadcasting is
   enabled. Before that regular one by one IPIs are issued.

 - Drop the paravirt CR8 related functions as they have no user anymore

 - Initialize the APIC TPR to block interrupt 16-31 as they are reserved
   for CPU exceptions and should never be raised by any well behaving
   device.

 - Emit a warning when vector space exhaustion breaks the admin set
   affinity of an interrupt.

 - Make sure to use the NMI fallback when shutdown via reboot vector IPI
   fails. The original code had conditions which prevent the code path
   to be reached.

 - Annotate various APIC config variables as RO after init.

[ The ipi broadcase change came in earlier through the cpu hotplug
  branch, but I left the explanation in the commit message since it was
  shared between the two different branches    - Linus ]

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  x86/apic/vector: Warn when vector space exhaustion breaks affinity
  x86/apic: Annotate global config variables as "read-only after init"
  x86/apic/x2apic: Implement IPI shorthands support
  x86/apic/flat64: Remove the IPI shorthand decision logic
  x86/apic: Share common IPI helpers
  x86/apic: Remove the shorthand decision logic
  x86/smp: Enhance native_send_call_func_ipi()
  x86/smp: Move smp_function_call implementations into IPI code
  x86/apic: Provide and use helper for send_IPI_allbutself()
  x86/apic: Add static key to Control IPI shorthands
  x86/apic: Move no_ipi_broadcast() out of 32bit
  x86/apic: Add NMI_VECTOR wait to IPI shorthand
  x86/apic: Remove dest argument from __default_send_IPI_shortcut()
  x86/hotplug: Silence APIC and NMI when CPU is dead
  x86/cpu: Move arch_smt_update() to a neutral place
  x86/apic/uv: Make x2apic_extra_bits static
  x86/apic: Consolidate the apic local headers
  x86/apic: Move apic_flat_64 header into apic directory
  x86/apic: Move ipi header into apic directory
  x86/apic: Cleanup the include maze
  ...
2019-09-17 12:04:39 -07:00
Thomas Gleixner
60dcaad573 x86/hotplug: Silence APIC and NMI when CPU is dead
In order to support IPI/NMI broadcasting via the shorthand mechanism side
effects of shorthands need to be mitigated:

 Shorthand IPIs and NMIs hit all CPUs including unplugged CPUs

Neither of those can be handled on unplugged CPUs for obvious reasons.

It would be trivial to just fully disable the APIC via the enable bit in
MSR_APICBASE. But that's not possible because clearing that bit on systems
based on the 3 wire APIC bus would require a hardware reset to bring it
back as the APIC would lose track of bus arbitration. On systems with FSB
delivery APICBASE could be disabled, but it has to be guaranteed that no
interrupt is sent to the APIC while in that state and it's not clear from
the SDM whether it still responds to INIT/SIPI messages.

Therefore stay on the safe side and switch the APIC into soft disabled mode
so it won't deliver any regular vector to the CPU.

NMIs are still propagated to the 'dead' CPUs. To mitigate that add a check
for the CPU being offline on early nmi entry and if so bail.

Note, this cannot use the stop/restart_nmi() magic which is used in the
alternatives code. A dead CPU cannot invoke nmi_enter() or anything else
due to RCU and other reasons.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907241723290.1791@nanos.tec.linutronix.de
2019-07-25 16:11:59 +02:00
Pingfan Liu
6973210242 x86/realmode: Remove trampoline_status
There is no reader of trampoline_status, it's only written.

It turns out that after commit ce4b1b1650 ("x86/smpboot: Initialize
secondary CPU only if master CPU will wait for it"), trampoline_status is
not needed any more.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1563266424-3472-1-git-send-email-kernelfans@gmail.com
2019-07-22 11:30:18 +02:00
Zhenzhong Duan
090d54bcbc Revert "x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized"
This reverts commit ca5d376e17.

Commit 8990cac6e5 ("x86/jump_label: Initialize static branching
early") adds jump_label_init() call in setup_arch() to make static
keys initialized early, so we could use the original simpler code
again.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:57 +02:00
Thomas Gleixner
7652ac9201 x86/asm: Move native_write_cr0/4() out of line
The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading
the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n.

The reason is that the static key which controls the pinning is marked RO
after init. The kvm_intel module contains a CR4 write which requires to
update the static key entry list. That obviously does not work when the key
is in a RO section.

With CONFIG_PARAVIRT enabled this does not happen because the CR4 write
uses the paravirt indirection and the actual write function is built in.

As the key is intended to be immutable after init, move
native_write_cr0/4() out of line.

While at it consolidate the update of the cr4 shadow variable and store the
value right away when the pinning is initialized on a booting CPU. No point
in reading it back 20 instructions later. This allows to confine the static
key and the pinning variable to cpu/common and allows to mark them static.

Fixes: 8dbec27a24 ("x86/asm: Pin sensitive CR0 bits")
Fixes: 873d50d58f ("x86/asm: Pin sensitive CR4 bits")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Xi Ruoyao <xry111@mengyan1223.wang>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Xi Ruoyao <xry111@mengyan1223.wang>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907102140340.1758@nanos.tec.linutronix.de
2019-07-10 22:15:05 +02:00
Linus Torvalds
222a21d295 Merge branch 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 topology updates from Ingo Molnar:
 "Implement multi-die topology support on Intel CPUs and expose the die
  topology to user-space tooling, by Len Brown, Kan Liang and Zhang Rui.

  These changes should have no effect on the kernel's existing
  understanding of topologies, i.e. there should be no behavioral impact
  on cache, NUMA, scheduler, perf and other topologies and overall
  system performance"

* 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support
  perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg support
  hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages
  thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages
  perf/x86/intel/cstate: Support multi-die/package
  perf/x86/intel/rapl: Support multi-die/package
  perf/x86/intel/uncore: Support multi-die/package
  topology: Create core_cpus and die_cpus sysfs attributes
  topology: Create package_cpus sysfs attribute
  hwmon/coretemp: Support multi-die/package
  powercap/intel_rapl: Update RAPL domain name and debug messages
  thermal/x86_pkg_temp_thermal: Support multi-die/package
  powercap/intel_rapl: Support multi-die/package
  powercap/intel_rapl: Simplify rapl_find_package()
  x86/topology: Define topology_logical_die_id()
  x86/topology: Define topology_die_id()
  cpu/topology: Export die_id
  x86/topology: Create topology_max_die_per_package()
  x86/topology: Add CPUID.1F multi-die/package support
2019-07-08 18:28:44 -07:00
Kees Cook
873d50d58f x86/asm: Pin sensitive CR4 bits
Several recent exploits have used direct calls to the native_write_cr4()
function to disable SMEP and SMAP before then continuing their exploits
using userspace memory access.

Direct calls of this form can be mitigate by pinning bits of CR4 so that
they cannot be changed through a common function. This is not intended to
be a general ROP protection (which would require CFI to defend against
properly), but rather a way to avoid trivial direct function calling (or
CFI bypasses via a matching function prototype) as seen in:

https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308)

The goals of this change:

 - Pin specific bits (SMEP, SMAP, and UMIP) when writing CR4.

 - Avoid setting the bits too early (they must become pinned only after
   CPU feature detection and selection has finished).

 - Pinning mask needs to be read-only during normal runtime.

 - Pinning needs to be checked after write to validate the cr4 state

Using __ro_after_init on the mask is done so it can't be first disabled
with a malicious write.

Since these bits are global state (once established by the boot CPU and
kernel boot parameters), they are safe to write to secondary CPUs before
those CPUs have finished feature detection. As such, the bits are set at
the first cr4 write, so that cr4 write bugs can be detected (instead of
silently papered over). This uses a few bytes less storage of a location we
don't have: read-only per-CPU data.

A check is performed after the register write because an attack could just
skip directly to the register write. Such a direct jump is possible because
of how this function may be built by the compiler (especially due to the
removal of frame pointers) where it doesn't add a stack frame (function
exit may only be a retq without pops) which is sufficient for trivial
exploitation like in the timer overwrites mentioned above).

The asm argument constraints gain the "+" modifier to convince the compiler
that it shouldn't make ordering assumptions about the arguments or memory,
and treat them as changed.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: kernel-hardening@lists.openwall.com
Link: https://lkml.kernel.org/r/20190618045503.39105-3-keescook@chromium.org
2019-06-22 11:55:22 +02:00
Thomas Gleixner
9ff554e9be treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 82
Based on 1 normalized pattern(s):

  this code is released under the gnu general public license version 2
  or later

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520075211.232210963@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:37:52 +02:00
Len Brown
2e4c54dac7 topology: Create core_cpus and die_cpus sysfs attributes
Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list"

These attributes represent all of the logical CPUs that share the
same core.

These attriutes is synonymous with the existing "thread_siblings" and
"thread_siblings_list" attribute, which will be deprecated.

Create CPU topology sysfs attributes: "die_cpus" and "die_cpus_list".
These attributes represent all of the logical CPUs that share the
same die.

Suggested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/071c23a298cd27ede6ed0b6460cae190d193364f.1557769318.git.len.brown@intel.com
2019-05-23 10:08:34 +02:00
Len Brown
212bf4fdb7 x86/topology: Define topology_logical_die_id()
Define topology_logical_die_id() ala existing topology_logical_package_id()

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/2f3526e25ae14fbeff26fb26e877d159df8946d9.1557769318.git.len.brown@intel.com
2019-05-23 10:08:32 +02:00
Len Brown
7745f03eb3 x86/topology: Add CPUID.1F multi-die/package support
Some new systems have multiple software-visible die within each package.

Update Linux parsing of the Intel CPUID "Extended Topology Leaf" to handle
either CPUID.B, or the new CPUID.1F.

Add cpuinfo_x86.die_id and cpuinfo_x86.max_dies to store the result.

die_id will be non-zero only for multi-die/package systems.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-doc@vger.kernel.org
Link: https://lkml.kernel.org/r/7b23d2d26d717b8e14ba137c94b70943f1ae4b5c.1557769318.git.len.brown@intel.com
2019-05-23 10:08:30 +02:00
Linus Torvalds
948a64995a Merge branch 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 topology updates from Ingo Molnar:
 "Two main changes: preparatory changes for Intel multi-die topology
  support, plus a syslog message tweak"

* 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/topology: Make DEBUG_HOTPLUG_CPU0 pr_info() more descriptive
  x86/smpboot: Rename match_die() to match_pkg()
  topology: Simplify cputopology.txt formatting and wording
  x86/topology: Fix documentation typo
2019-05-06 16:33:06 -07:00
Len Brown
169d086996 x86/smpboot: Rename match_die() to match_pkg()
Syntax only, no functional or semantic change.

This routine matches packages, not die, so name it thus.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/7ca18c4ae7816a1f9eda37414725df676e63589d.1551160674.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 10:56:05 +02:00
Thomas Gleixner
66c7ceb47f x86/irq/32: Handle irq stack allocation failure proper
irq_ctx_init() crashes hard on page allocation failures. While that's ok
during early boot, it's just wrong in the CPU hotplug bringup code.

Check the page allocation failure and return -ENOMEM and handle it at the
call sites. On early boot the only way out is to BUG(), but on CPU hotplug
there is no reason to crash, so just abort the operation.

Rename the function to something more sensible while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Shaokun Zhang <zhangshaokun@hisilicon.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: x86-ml <x86@kernel.org>
Cc: xen-devel@lists.xenproject.org
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Yi Wang <wang.yi59@zte.com.cn>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: https://lkml.kernel.org/r/20190414160146.089060584@linutronix.de
2019-04-17 15:31:42 +02:00
Linus Torvalds
bcd49c3dd1 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "Various cleanups and simplifications, none of them really stands out,
  they are all over the place"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/uaccess: Remove unused __addr_ok() macro
  x86/smpboot: Remove unused phys_id variable
  x86/mm/dump_pagetables: Remove the unused prev_pud variable
  x86/fpu: Move init_xstate_size() to __init section
  x86/cpu_entry_area: Move percpu_setup_debug_store() to __init section
  x86/mtrr: Remove unused variable
  x86/boot/compressed/64: Explain paging_prepare()'s return value
  x86/resctrl: Remove duplicate MSR_MISC_FEATURE_CONTROL definition
  x86/asm/suspend: Drop ENTRY from local data
  x86/hw_breakpoints, kprobes: Remove kprobes ifdeffery
  x86/boot: Save several bytes in decompressor
  x86/trap: Remove useless declaration
  x86/mm/tlb: Remove unused cpu variable
  x86/events: Mark expected switch-case fall-throughs
  x86/asm-prototypes: Remove duplicate include <asm/page.h>
  x86/kernel: Mark expected switch-case fall-throughs
  x86/insn-eval: Mark expected switch-case fall-through
  x86/platform/UV: Replace kmalloc() and memset() with k[cz]alloc() calls
  x86/e820: Replace kmalloc() + memcpy() with kmemdup()
2019-03-07 16:36:57 -08:00
Anshuman Khandual
98fa15f34c mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code.  Please let me know if this
might have missed some instances.  This might also have replaced some
false positives.  I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where invalid node number is
encoded as -1.  Even though implicitly understood it is always better to
have macros in there.  Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE.  This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Shaokun Zhang
f91fecc09e x86/smpboot: Remove unused phys_id variable
The 'phys_id' local variable became unused after commit

  ce4b1b1650 ("x86/smpboot: Initialize secondary CPU only if master CPU will wait for it").

Remove it.

Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: https://lkml.kernel.org/r/1550495101-41755-1-git-send-email-zhangshaokun@hisilicon.com
2019-02-18 17:09:24 +01:00
Hui Wang
aa02ef099c x86/topology: Use total_cpus for max logical packages calculation
nr_cpu_ids can be limited on the command line via nr_cpus=. This can break the
logical package management because it results in a smaller number of packages
while in kdump kernel.

Check below case:
There is a two sockets system, each socket has 8 cores, which has 16 logical
cpus while HT was turn on.

 0  1  2  3  4  5  6  7     |    16 17 18 19 20 21 22 23
 cores on socket 0               threads on socket 0
 8  9 10 11 12 13 14 15     |    24 25 26 27 28 29 30 31
 cores on socket 1               threads on socket 1

While starting the kdump kernel with command line option nr_cpus=16 panic
was triggered on one of the cpus 24-31 eg. 26, then online cpu will be
1-15, 26(cpu 0 was disabled in kdump), ncpus will be 16 and
__max_logical_packages will be 1, but actually two packages were booted on.

This issue can reproduced by set kdump option nr_cpus=<real physical core
numbers>, and then trigger panic on last socket's thread, for example:

taskset -c 26 echo c > /proc/sysrq-trigger

Use total_cpus which will not be limited by nr_cpus command line to calculate
the value of __max_logical_packages.

Signed-off-by: Hui Wang <john.wanghui@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <guijianfeng@huawei.com>
Cc: <wencongyang2@huawei.com>
Cc: <douliyang1@huawei.com>
Cc: <qiaonuohan@huawei.com>
Link: https://lkml.kernel.org/r/20181107023643.22174-1-john.wanghui@huawei.com
2018-12-18 13:38:37 +01:00
Mike Rapoport
57c8a661d9 mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
  Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Pu Wen
0b13bec787 x86/smpboot: Do not use BSP INIT delay and MWAIT to idle on Dhyana
The Hygon Dhyana CPU uses no delay in smp_quirk_init_udelay(), and does
HLT on idle just like AMD does.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: bp@alien8.de
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: x86@kernel.org
Cc: thomas.lendacky@amd.com
Link: https://lkml.kernel.org/r/87000fa82e273f5967c908448414228faf61e077.1537533369.git.puwen@hygon.cn
2018-09-27 18:28:57 +02:00
Thomas Gleixner
f2701b77bb Merge 4.18-rc7 into master to pick up the KVM dependcy
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05 16:39:29 +02:00
Nicolai Stange
447ae31667 x86: Don't include linux/irq.h from asm/hardirq.h
The next patch in this series will have to make the definition of
irq_cpustat_t available to entering_irq().

Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
dependencies like

  asm/smp.h
    asm/apic.h
      asm/hardirq.h
        linux/irq.h
          linux/topology.h
            linux/smp.h
              asm/smp.h

or

  linux/gfp.h
    linux/mmzone.h
      asm/mmzone.h
        asm/mmzone_64.h
          asm/smp.h
            asm/apic.h
              asm/hardirq.h
                linux/irq.h
                  linux/irqdesc.h
                    linux/kobject.h
                      linux/sysfs.h
                        linux/kernfs.h
                          linux/idr.h
                            linux/gfp.h

and others.

This causes compilation errors because of the header guards becoming
effective in the second inclusion: symbols/macros that had been defined
before wouldn't be available to intermediate headers in the #include chain
anymore.

A possible workaround would be to move the definition of irq_cpustat_t
into its own header and include that from both, asm/hardirq.h and
asm/apic.h.

However, this wouldn't solve the real problem, namely asm/harirq.h
unnecessarily pulling in all the linux/irq.h cruft: nothing in
asm/hardirq.h itself requires it. Also, note that there are some other
archs, like e.g. arm64, which don't have that #include in their
asm/hardirq.h.

Remove the linux/irq.h #include from x86' asm/hardirq.h.

Fix resulting compilation errors by adding appropriate #includes to *.c
files as needed.

Note that some of these *.c files could be cleaned up a bit wrt. to their
set of #includes, but that should better be done from separate patches, if
at all.

Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05 09:53:13 +02:00
Zhenzhong Duan
4fb5f58e8d x86/mm/32: Initialize the CR4 shadow before __flush_tlb_all()
On 32-bit kernels, __flush_tlb_all() may have read the CR4 shadow before the
initialization of CR4 shadow in cpu_init().

Fix it by adding an explicit cr4_init_shadow() call into start_secondary()
which is the first function called on non-boot SMP CPUs - ahead of the
__flush_tlb_all() call.

( This is somewhat of a layering violation, but start_secondary() does
  CR4 bootstrap in the PCID case anyway. )

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/b07b6ae9-4b57-4b40-b9bc-50c2c67f1d91@default
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-03 09:26:10 +02:00
Thomas Gleixner
f048c399e0 x86/topology: Provide topology_smt_supported()
Provide information whether SMT is supoorted by the CPUs. Preparatory patch
for SMT control mechanism.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 14:20:57 +02:00
Thomas Gleixner
6a4d2657e0 x86/smp: Provide topology_is_primary_thread()
If the CPU is supporting SMT then the primary thread can be found by
checking the lower APIC ID bits for zero. smp_num_siblings is used to build
the mask for the APIC ID bits which need to be taken into account.

This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
than the initial APIC ID in CPUID. But according to AMD the lower bits have
to be consistent. Intel gave a tentative confirmation as well.

Preparatory patch to support disabling SMT at boot/runtime.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 14:20:57 +02:00
Linus Torvalds
5cef8c2a22 Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:

 - Centaur CPU updates (David Wang)

 - AMD and other CPU topology enumeration improvements and fixes
   (Borislav Petkov, Thomas Gleixner, Suravee Suthikulpanit)

 - Continued 5-level paging work (Kirill A. Shutemov)

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Mark __pgtable_l5_enabled __initdata
  x86/mm: Mark p4d_offset() __always_inline
  x86/mm: Introduce the 'no5lvl' kernel parameter
  x86/mm: Stop pretending pgtable_l5_enabled is a variable
  x86/mm: Unify pgtable_l5_enabled usage in early boot code
  x86/boot/compressed/64: Fix trampoline page table address calculation
  x86/CPU: Move x86_cpuinfo::x86_max_cores assignment to detect_num_cpu_cores()
  x86/Centaur: Report correct CPU/cache topology
  x86/CPU: Move cpu_detect_cache_sizes() into init_intel_cacheinfo()
  x86/CPU: Make intel_num_cpu_cores() generic
  x86/CPU: Move cpu local function declarations to local header
  x86/CPU/AMD: Derive CPU topology from CPUID function 0xB when available
  x86/CPU: Modify detect_extended_topology() to return result
  x86/CPU/AMD: Calculate last level cache ID from number of sharing threads
  x86/CPU: Rename intel_cacheinfo.c to cacheinfo.c
  perf/events/amd/uncore: Fix amd_uncore_llc ID to use pre-defined cpu_llc_id
  x86/CPU/AMD: Have smp_num_siblings and cpu_llc_id always be present
  x86/Centaur: Initialize supported CPU features properly
2018-06-04 18:19:18 -07:00
Ingo Molnar
177bfd725b Merge branches 'x86/urgent' and 'core/urgent' into x86/boot, to pick up fixes and avoid conflicts
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-19 08:18:56 +02:00
Thomas Gleixner
1f50ddb4f4 x86/speculation: Handle HT correctly on AMD
The AMD64_LS_CFG MSR is a per core MSR on Family 17H CPUs. That means when
hyperthreading is enabled the SSBD bit toggle needs to take both cores into
account. Otherwise the following situation can happen:

CPU0		CPU1

disable SSB
		disable SSB
		enable  SSB <- Enables it for the Core, i.e. for CPU0 as well

So after the SSB enable on CPU1 the task on CPU0 runs with SSB enabled
again.

On Intel the SSBD control is per core as well, but the synchronization
logic is implemented behind the per thread SPEC_CTRL MSR. It works like
this:

  CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL

i.e. if one of the threads enables a mitigation then this affects both and
the mitigation is only disabled in the core when both threads disabled it.

Add the necessary synchronization logic for AMD family 17H. Unfortunately
that requires a spinlock to serialize the access to the MSR, but the locks
are only shared between siblings.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-05-17 17:09:18 +02:00
Borislav Petkov
f8b64d08dd x86/CPU/AMD: Have smp_num_siblings and cpu_llc_id always be present
Move smp_num_siblings and cpu_llc_id to cpu/common.c so that they're
always present as symbols and not only in the CONFIG_SMP case. Then,
other code using them doesn't need ugly ifdeffery anymore. Get rid of
some ifdeffery.

Signed-off-by: Borislav Petkov <bpetkov@suse.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1524864877-111962-2-git-send-email-suravee.suthikulpanit@amd.com
2018-05-06 12:49:14 +02:00
Yazen Ghannam
da6fa7ef67 x86/smpboot: Don't use mwait_play_dead() on AMD systems
Recent AMD systems support using MWAIT for C1 state. However, MWAIT will
not allow deeper cstates than C1 on current systems.

play_dead() expects to use the deepest state available.  The deepest state
available on AMD systems is reached through SystemIO or HALT. If MWAIT is
available, it is preferred over the other methods, so the CPU never reaches
the deepest possible state.

Don't try to use MWAIT to play_dead() on AMD systems. Instead, use CPUIDLE
to enter the deepest state advertised by firmware. If CPUIDLE is not
available then fallback to HALT.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Link: https://lkml.kernel.org/r/20180403140228.58540-1-Yazen.Ghannam@amd.com
2018-04-26 16:06:19 +02:00
Alison Schofield
1340ccfa9a x86,sched: Allow topologies where NUMA nodes share an LLC
Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
into two "slices", each containing half the cores, half the LLC, and one
memory controller and each slice is enumerated to Linux as a NUMA
node. This is similar to how the cores and LLC were arranged for the
Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This means
that the portion of the LLC *available* to a CPU depends on the data being
accessed:

    Remote socket: entire package LLC is shared
    Local socket->local slice: data goes into local slice LLC
    Local socket->remote slice: data goes into remote-slice LLC. Slightly
                    		higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One of
the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates the LLC
as being shared by the entire package. This is not 100% precise because the
entire cache is not usable by all accesses. But, it *is* the way the
hardware enumerates itself, and this is not likely to change.

The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As noted
above, this is not true for local socket accesses. This patch does not
correct the sysfs info. It is the same, pre and post patch.

The current code emits the following warning:

 sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

The warning is coming from the topology_sane() check in smpboot.c because
the topology is not matching the expectations of the model for obvious
reasons.

To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
the "coregroup" sched_domain_topology_level and use NUMA information from
the SRAT alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package level,
which are also NUMA boundaries.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: brice.goglin@gmail.com
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180407002130.GA18984@alison-desk.jf.intel.com
2018-04-17 15:39:55 +02:00
Samuel Neves
4596749339 x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
Without this fix, /proc/cpuinfo will display an incorrect amount
of CPU cores, after bringing them offline and online again, as
exemplified below:

  $ cat /proc/cpuinfo | grep cores
  cpu cores	: 4
  cpu cores	: 8
  cpu cores	: 8
  cpu cores	: 20
  cpu cores	: 4
  cpu cores	: 3
  cpu cores	: 2
  cpu cores	: 2

This patch fixes this by always zeroing the booted_cores variable
upon turning off a logical CPU.

Tested-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jgross@suse.com
Cc: luto@kernel.org
Cc: prarit@redhat.com
Cc: vkuznets@redhat.com
Link: http://lkml.kernel.org/r/20180221205036.5244-1-sneves@dei.uc.pt
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-23 08:47:47 +01:00
Prarit Bhargava
63e708f826 x86/xen: Calculate __max_logical_packages on PV domains
The kernel panics on PV domains because native_smp_cpus_done() is
only called for HVM domains.

Calculate __max_logical_packages for PV domains.

Fixes: b4c0a7326f ("x86/smpboot: Fix __max_logical_packages estimate")
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Tested-and-reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: xen-devel@lists.xenproject.org
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2018-02-17 09:40:45 +01:00
Masayoshi Mizuma
295cc7eb31 x86/smpboot: Fix uncore_pci_remove() indexing bug when hot-removing a physical CPU
When a physical CPU is hot-removed, the following warning messages
are shown while the uncore device is removed in uncore_pci_remove():

  WARNING: CPU: 120 PID: 5 at arch/x86/events/intel/uncore.c:988
  uncore_pci_remove+0xf1/0x110
  ...
  CPU: 120 PID: 5 Comm: kworker/u1024:0 Not tainted 4.15.0-rc8 #1
  Workqueue: kacpi_hotplug acpi_hotplug_work_fn
  ...
  Call Trace:
  pci_device_remove+0x36/0xb0
  device_release_driver_internal+0x145/0x210
  pci_stop_bus_device+0x76/0xa0
  pci_stop_root_bus+0x44/0x60
  acpi_pci_root_remove+0x1f/0x80
  acpi_bus_trim+0x54/0x90
  acpi_bus_trim+0x2e/0x90
  acpi_device_hotplug+0x2bc/0x4b0
  acpi_hotplug_work_fn+0x1a/0x30
  process_one_work+0x141/0x340
  worker_thread+0x47/0x3e0
  kthread+0xf5/0x130

When uncore_pci_remove() runs, it tries to get the package ID to
clear the value of uncore_extra_pci_dev[].dev[] by using
topology_phys_to_logical_pkg(). The warning messesages are
shown because topology_phys_to_logical_pkg() returns -1.

  arch/x86/events/intel/uncore.c:
  static void uncore_pci_remove(struct pci_dev *pdev)
  {
  ...
          phys_id = uncore_pcibus_to_physid(pdev->bus);
  ...
                  pkg = topology_phys_to_logical_pkg(phys_id); // returns -1
                  for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
                          if (uncore_extra_pci_dev[pkg].dev[i] == pdev) {
                                  uncore_extra_pci_dev[pkg].dev[i] = NULL;
                                  break;
                          }
                  }
                  WARN_ON_ONCE(i >= UNCORE_EXTRA_PCI_DEV_MAX); // <=========== HERE!!

topology_phys_to_logical_pkg() tries to find
cpuinfo_x86->phys_proc_id that matches the phys_pkg argument.

  arch/x86/kernel/smpboot.c:
  int topology_phys_to_logical_pkg(unsigned int phys_pkg)
  {
          int cpu;

          for_each_possible_cpu(cpu) {
                  struct cpuinfo_x86 *c = &cpu_data(cpu);

                  if (c->initialized && c->phys_proc_id == phys_pkg)
                          return c->logical_proc_id;
          }
          return -1;
  }

However, the phys_proc_id was already set to 0 by remove_siblinginfo()
when the CPU was offlined.

So, topology_phys_to_logical_pkg() cannot find the correct
logical_proc_id and always returns -1.

As the result, uncore_pci_remove() calls WARN_ON_ONCE() and the warning
messages are shown.

What is worse is that the bogus 'pkg' index results in two bugs:

 - We dereference uncore_extra_pci_dev[] with a negative index
 - We fail to clean up a stale pointer in uncore_extra_pci_dev[][]

To fix these bugs, remove the clearing of ->phys_proc_id from remove_siblinginfo().

This should not cause any problems, because ->phys_proc_id is not
used after it is hot-removed and it is re-set while hot-adding.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: yasu.isimatu@gmail.com
Cc: <stable@vger.kernel.org>
Fixes: 30bb981185 ("x86/topology: Avoid wasting 128k for package id array")
Link: http://lkml.kernel.org/r/ed738d54-0f01-b38b-b794-c31dc118c207@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13 12:47:28 +01:00
Linus Torvalds
3ccabd6d9d Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Remove unused IOMMU_STRESS Kconfig
  x86/extable: Mark exception handler functions visible
  x86/timer: Don't inline __const_udelay
  x86/headers: Remove duplicate #includes
2018-01-30 13:01:09 -08:00
Jan Kiszka
e348caef8b x86/platform: Control warm reset setup via legacy feature flag
Allow to turn off the setup of BIOS-managed warm reset via a new flag in
x86_legacy_features. Besides the UV1, the upcoming jailhose guest support
needs this switched off.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: jailhouse-dev@googlegroups.com
Link: https://lkml.kernel.org/r/44376558129d70a2c1527959811371ef4b82e829.1511770314.git.jan.kiszka@siemens.com
2018-01-14 21:11:53 +01:00
Linus Torvalds
52c90f2d32 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 page table isolation fixes from Thomas Gleixner:
 "Four patches addressing the PTI fallout as discussed and debugged
  yesterday:

   - Remove stale and pointless TLB flush invocations from the hotplug
     code

   - Remove stale preempt_disable/enable from __native_flush_tlb()

   - Plug the memory leak in the write_ldt() error path"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ldt: Make LDT pgtable free conditional
  x86/ldt: Plug memory leak in error path
  x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()
  x86/smpboot: Remove stale TLB flush invocations
2017-12-31 13:03:05 -08:00
Thomas Gleixner
322f8b8b34 x86/smpboot: Remove stale TLB flush invocations
smpboot_setup_warm_reset_vector() and smpboot_restore_warm_reset_vector()
invoke local_flush_tlb() for no obvious reason.

Digging in history revealed that the original code in the 2.1 era added
those because the code manipulated a swapper_pg_dir pagetable entry. The
pagetable manipulation was removed long ago in the 2.3 timeframe, but the
TLB flush invocations stayed around forever.

Remove them along with the pointless pr_debug()s which come from the same 2.1
change.

Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171230211829.586548655@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-31 12:12:51 +01:00
Linus Torvalds
caf9a82657 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PTI preparatory patches from Thomas Gleixner:
 "Todays Advent calendar window contains twentyfour easy to digest
  patches. The original plan was to have twenty three matching the date,
  but a late fixup made that moot.

   - Move the cpu_entry_area mapping out of the fixmap into a separate
     address space. That's necessary because the fixmap becomes too big
     with NRCPUS=8192 and this caused already subtle and hard to
     diagnose failures.

     The top most patch is fresh from today and cures a brain slip of
     that tall grumpy german greybeard, who ignored the intricacies of
     32bit wraparounds.

   - Limit the number of CPUs on 32bit to 64. That's insane big already,
     but at least it's small enough to prevent address space issues with
     the cpu_entry_area map, which have been observed and debugged with
     the fixmap code

   - A few TLB flush fixes in various places plus documentation which of
     the TLB functions should be used for what.

   - Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
     more than sysenter now and keeping the name makes backtraces
     confusing.

   - Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
     which is only invoked on fork().

   - Make vysycall more robust.

   - A few fixes and cleanups of the debug_pagetables code. Check
     PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
     C89 initialization of the address hint array which already was out
     of sync with the index enums.

   - Move the ESPFIX init to a different place to prepare for PTI.

   - Several code moves with no functional change to make PTI
     integration simpler and header files less convoluted.

   - Documentation fixes and clarifications"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
  init: Invoke init_espfix_bsp() from mm_init()
  x86/cpu_entry_area: Move it out of the fixmap
  x86/cpu_entry_area: Move it to a separate unit
  x86/mm: Create asm/invpcid.h
  x86/mm: Put MMU to hardware ASID translation in one place
  x86/mm: Remove hard-coded ASID limit checks
  x86/mm: Move the CR3 construction functions to tlbflush.h
  x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
  x86/mm: Remove superfluous barriers
  x86/mm: Use __flush_tlb_one() for kernel memory
  x86/microcode: Dont abuse the TLB-flush interface
  x86/uv: Use the right TLB-flush API
  x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
  x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
  x86/mm/64: Improve the memory map documentation
  x86/ldt: Prevent LDT inheritance on exec
  x86/ldt: Rework locking
  arch, mm: Allow arch_dup_mmap() to fail
  x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
  ...
2017-12-23 11:53:04 -08:00
Thomas Gleixner
613e396bc0 init: Invoke init_espfix_bsp() from mm_init()
init_espfix_bsp() needs to be invoked before the page table isolation
initialization. Move it into mm_init() which is the place where pti_init()
will be added.

While at it get rid of the #ifdeffery and provide proper stub functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-22 20:13:05 +01:00
Ingo Molnar
0fd2e9c53d Merge commit 'upstream-x86-entry' into WIP.x86/mm
Pull in a minimal set of v4.15 entry code changes, for a base for the MM isolation patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-17 12:58:53 +01:00
Pravin Shedge
81bf665d00 x86/headers: Remove duplicate #includes
These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.

Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: ard.biesheuvel@linaro.org
Cc: boris.ostrovsky@oracle.com
Cc: geert@linux-m68k.org
Cc: jgross@suse.com
Cc: linux-efi@vger.kernel.org
Cc: luto@kernel.org
Cc: matt@codeblueprint.co.uk
Cc: thomas.lendacky@amd.com
Cc: tim.c.chen@linux.intel.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1513024951-9221-1-git-send-email-pravin.shedge4linux@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-12 11:32:24 +01:00
Prarit Bhargava
947134d9b0 x86/smpboot: Do not use smp_num_siblings in __max_logical_packages calculation
Documentation/x86/topology.txt defines smp_num_siblings as "The number of
threads in a core".  Since commit bbb65d2d36 ("x86: use cpuid vector 0xb
when available for detecting cpu topology") smp_num_siblings is the
maximum number of threads in a core.  If Simultaneous MultiThreading
(SMT) is disabled on a system, smp_num_siblings is 2 and not 1 as
expected.

Use topology_max_smt_threads(), which contains the active numer of threads,
in the __max_logical_packages calculation.

On a single socket, single core, single thread system __max_smt_threads has
not been updated when the __max_logical_packages calculation happens, so its
zero which makes the package estimate fail. Initialize it to one, which is
the minimum number of threads on a core.

[ tglx: Folded the __max_smt_threads fix in ]

Fixes: b4c0a7326f ("x86/smpboot: Fix __max_logical_packages estimate")
Reported-by: Jakub Kicinski <kubakici@wp.pl>
Signed-off-by: Prarit Bhargava <prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jakub Kicinski <kubakici@wp.pl>
Cc: netdev@vger.kernel.org
Cc: "netdev@vger.kernel.org"
Cc: Clark Williams <williams@redhat.com>
Link: https://lkml.kernel.org/r/20171204164521.17870-1-prarit@redhat.com
2017-12-07 10:28:22 +01:00
Chunyu Hu
55d2d0ad2f x86/idt: Load idt early in start_secondary
On a secondary, idt is first loaded in cpu_init() with load_current_idt(),
i.e. no exceptions can be handled before that point.

The conversion of WARN() to use UD requires the IDT being loaded earlier as
any warning between start_secondary() and load_curren_idt() in cpu_init()
will result in an unhandled @UD exception and therefore fail the bringup of
the CPU.

Install the IDT handlers right in start_secondary() before calling cpu_init().

[ tglx: Massaged changelog ]

Fixes: 9a93848fe7 ("x86/debug: Implement __WARN() using UD0")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: peterz@infradead.org
Cc: bp@alien8.de
Cc: rostedt@goodmis.org
Cc: luto@kernel.org
Link: https://lkml.kernel.org/r/1511792499-4073-1-git-send-email-chuhu@redhat.com
2017-11-28 08:15:40 +01:00
Prarit Bhargava
b4c0a7326f x86/smpboot: Fix __max_logical_packages estimate
A system booted with a small number of cores enabled per package
panics because the estimate of __max_logical_packages is too low.

This occurs when the total number of active cores across all packages is
less than the maximum core count for a single package. e.g.:

  On a 4 package system with 20 cores/package where only 4 cores are
  enabled on each package, the value of __max_logical_packages is
  calculated as DIV_ROUND_UP(16 / 20) = 1 and not 4.

Calculate __max_logical_packages after the cpu enumeration has completed.
Use the boot cpu's data to extrapolate the number of packages.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: He Chen <he.chen@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Piotr Luc <piotr.luc@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20171114124257.22013-4-prarit@redhat.com
2017-11-17 16:22:31 +01:00
Andi Kleen
30bb981185 x86/topology: Avoid wasting 128k for package id array
Analyzing large early boot allocations unveiled the logical package id
storage as a prominent memory waste. Since commit 1f12e32f4c
("x86/topology: Create logical package id") every 64-bit system allocates a
128k array to convert logical package ids.

This happens because the array is sized for MAX_LOCAL_APIC which is always
32k on 64bit systems, and it needs 4 bytes for each entry.

This is fairly wasteful, especially for the common case of having only one
socket, which uses exactly 4 byte out of 128K.

There is no user of the package id map which is performance critical, so
the lookup is not required to be O(1). Store the logical processor id in
cpu_data and use a loop based lookup.

To keep the mapping stable accross cpu hotplug operations, add a flag to
cpu_data which is set when the CPU is brought up the first time. When the
flag is set, then cpu_data is not reinitialized by copying boot_cpu_data on
subsequent bringups.

[ tglx: Rename the flag to 'initialized', use proper pointers instead of
  	repeated cpu_data(x) evaluation and massage changelog. ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: He Chen <he.chen@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Piotr Luc <piotr.luc@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20171114124257.22013-3-prarit@redhat.com
2017-11-17 16:22:30 +01:00
Linus Torvalds
b18d62891a Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
 "This update provides a major overhaul of the APIC initialization and
  vector allocation code:

   - Unification of the APIC and interrupt mode setup which was
     scattered all over the place and was hard to follow. This also
     distangles the timer setup from the APIC initialization which
     brings a clear separation of functionality.

     Great detective work from Dou Lyiang!

   - Refactoring of the x86 vector allocation mechanism. The existing
     code was based on nested loops and rather convoluted APIC callbacks
     which had a horrible worst case behaviour and tried to serve all
     different use cases in one go. This led to quite odd hacks when
     supporting the new managed interupt facility for multiqueue devices
     and made it more or less impossible to deal with the vector space
     exhaustion which was a major roadblock for server hibernation.

     Aside of that the code dealing with cpu hotplug and the system
     vectors was disconnected from the actual vector management and
     allocation code, which made it hard to follow and maintain.

     Utilizing the new bitmap matrix allocator core mechanism, the new
     allocator and management code consolidates the handling of system
     vectors, legacy vectors, cpu hotplug mechanisms and the actual
     allocation which needs to be aware of system and legacy vectors and
     hotplug constraints into a single consistent entity.

     This has one visible change: The support for multi CPU targets of
     interrupts, which is only available on a certain subset of
     CPUs/APIC variants has been removed in favour of single interrupt
     targets. A proper analysis of the multi CPU target feature revealed
     that there is no real advantage as the vast majority of interrupts
     end up on the CPU with the lowest APIC id in the set of target CPUs
     anyway. That change was agreed on by the relevant folks and allowed
     to simplify the implementation significantly and to replace rather
     fragile constructs like the vector cleanup IPI with straight
     forward and solid code.

     Furthermore this allowed to cleanly separate the allocation details
     for legacy, normal and managed interrupts:

      * Legacy interrupts are not longer wasting 16 vectors
        unconditionally

      * Managed interrupts have now a guaranteed vector reservation, but
        the actual vector assignment happens when the interrupt is
        requested. It's guaranteed not to fail.

      * Normal interrupts no longer allocate vectors unconditionally
        when the interrupt is set up (IO/APIC init or MSI(X) enable).
        The mechanism has been switched to a best effort reservation
        mode. The actual allocation happens when the interrupt is
        requested. Contrary to managed interrupts the request can fail
        due to vector space exhaustion, but drivers must handle a fail
        of request_irq() anyway. When the interrupt is freed, the vector
        is handed back as well.

        This solves a long standing problem with large unconditional
        vector allocations for a certain class of enterprise devices
        which prevented server hibernation due to vector space
        exhaustion when the unused allocated vectors had to be migrated
        to CPU0 while unplugging all non boot CPUs.

     The code has been equipped with trace points and detailed debugfs
     information to aid analysis of the vector space"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  x86/vector/msi: Select CONFIG_GENERIC_IRQ_RESERVATION_MODE
  PCI/MSI: Set MSI_FLAG_MUST_REACTIVATE in core code
  genirq: Add config option for reservation mode
  x86/vector: Use correct per cpu variable in free_moved_vector()
  x86/apic/vector: Ignore set_affinity call for inactive interrupts
  x86/apic: Fix spelling mistake: "symmectic" -> "symmetric"
  x86/apic: Use dead_cpu instead of current CPU when cleaning up
  ACPI/init: Invoke early ACPI initialization earlier
  x86/vector: Respect affinity mask in irq descriptor
  x86/irq: Simplify hotplug vector accounting
  x86/vector: Switch IOAPIC to global reservation mode
  x86/vector/msi: Switch to global reservation mode
  x86/vector: Handle managed interrupts proper
  x86/io_apic: Reevaluate vector configuration on activate()
  iommu/amd: Reevaluate vector configuration on activate()
  iommu/vt-d: Reevaluate vector configuration on activate()
  x86/apic/msi: Force reactivation of interrupts at startup time
  x86/vector: Untangle internal state from irq_cfg
  x86/vector: Compile SMP only code conditionally
  x86/apic: Remove unused callbacks
  ...
2017-11-13 18:29:23 -08:00
Linus Torvalds
6a9f70b0a5 Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
 "Three smaller changes:

   - clang fix

   - boot message beautification

   - unnecessary header inclusion removal"

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Disable Clang warnings about GNU extensions
  x86/boot: Remove unnecessary #include <generated/utsrelease.h>
  x86/boot: Spell out "boot CPU" for BP
2017-11-13 16:32:30 -08:00
Linus Torvalds
d6ec9d9a4d Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Ingo Molnar:
 "Note that in this cycle most of the x86 topics interacted at a level
  that caused them to be merged into tip:x86/asm - but this should be a
  temporary phenomenon, hopefully we'll back to the usual patterns in
  the next merge window.

  The main changes in this cycle were:

  Hardware enablement:

   - Add support for the Intel UMIP (User Mode Instruction Prevention)
     CPU feature. This is a security feature that disables certain
     instructions such as SGDT, SLDT, SIDT, SMSW and STR. (Ricardo Neri)

     [ Note that this is disabled by default for now, there are some
       smaller enhancements in the pipeline that I'll follow up with in
       the next 1-2 days, which allows this to be enabled by default.]

   - Add support for the AMD SEV (Secure Encrypted Virtualization) CPU
     feature, on top of SME (Secure Memory Encryption) support that was
     added in v4.14. (Tom Lendacky, Brijesh Singh)

   - Enable new SSE/AVX/AVX512 CPU features: AVX512_VBMI2, GFNI, VAES,
     VPCLMULQDQ, AVX512_VNNI, AVX512_BITALG. (Gayatri Kammela)

  Other changes:

   - A big series of entry code simplifications and enhancements (Andy
     Lutomirski)

   - Make the ORC unwinder default on x86 and various objtool
     enhancements. (Josh Poimboeuf)

   - 5-level paging enhancements (Kirill A. Shutemov)

   - Micro-optimize the entry code a bit (Borislav Petkov)

   - Improve the handling of interdependent CPU features in the early
     FPU init code (Andi Kleen)

   - Build system enhancements (Changbin Du, Masahiro Yamada)

   - ... plus misc enhancements, fixes and cleanups"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (118 commits)
  x86/build: Make the boot image generation less verbose
  selftests/x86: Add tests for the STR and SLDT instructions
  selftests/x86: Add tests for User-Mode Instruction Prevention
  x86/traps: Fix up general protection faults caused by UMIP
  x86/umip: Enable User-Mode Instruction Prevention at runtime
  x86/umip: Force a page fault when unable to copy emulated result to user
  x86/umip: Add emulation code for UMIP instructions
  x86/cpufeature: Add User-Mode Instruction Prevention definitions
  x86/insn-eval: Add support to resolve 16-bit address encodings
  x86/insn-eval: Handle 32-bit address encodings in virtual-8086 mode
  x86/insn-eval: Add wrapper function for 32 and 64-bit addresses
  x86/insn-eval: Add support to resolve 32-bit address encodings
  x86/insn-eval: Compute linear address in several utility functions
  resource: Fix resource_size.cocci warnings
  X86/KVM: Clear encryption attribute when SEV is active
  X86/KVM: Decrypt shared per-cpu variables when SEV is active
  percpu: Introduce DEFINE_PER_CPU_DECRYPTED
  x86: Add support for changing memory encryption attribute in early boot
  x86/io: Unroll string I/O when SEV is active
  x86/boot: Add early boot support when running with SEV active
  ...
2017-11-13 14:13:48 -08:00
Linus Torvalds
8e9a2dba86 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle are:

   - Another attempt at enabling cross-release lockdep dependency
     tracking (automatically part of CONFIG_PROVE_LOCKING=y), this time
     with better performance and fewer false positives. (Byungchul Park)

   - Introduce lockdep_assert_irqs_enabled()/disabled() and convert
     open-coded equivalents to lockdep variants. (Frederic Weisbecker)

   - Add down_read_killable() and use it in the VFS's iterate_dir()
     method. (Kirill Tkhai)

   - Convert remaining uses of ACCESS_ONCE() to
     READ_ONCE()/WRITE_ONCE(). Most of the conversion was Coccinelle
     driven. (Mark Rutland, Paul E. McKenney)

   - Get rid of lockless_dereference(), by strengthening Alpha atomics,
     strengthening READ_ONCE() with smp_read_barrier_depends() and thus
     being able to convert users of lockless_dereference() to
     READ_ONCE(). (Will Deacon)

   - Various micro-optimizations:

        - better PV qspinlocks (Waiman Long),
        - better x86 barriers (Michael S. Tsirkin)
        - better x86 refcounts (Kees Cook)

   - ... plus other fixes and enhancements. (Borislav Petkov, Juergen
     Gross, Miguel Bernal Marin)"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE
  rcu: Use lockdep to assert IRQs are disabled/enabled
  netpoll: Use lockdep to assert IRQs are disabled/enabled
  timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled
  sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled
  irq_work: Use lockdep to assert IRQs are disabled/enabled
  irq/timings: Use lockdep to assert IRQs are disabled/enabled
  perf/core: Use lockdep to assert IRQs are disabled/enabled
  x86: Use lockdep to assert IRQs are disabled/enabled
  smp/core: Use lockdep to assert IRQs are disabled/enabled
  timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled
  timers/nohz: Use lockdep to assert IRQs are disabled/enabled
  workqueue: Use lockdep to assert IRQs are disabled/enabled
  irq/softirqs: Use lockdep to assert IRQs are disabled/enabled
  locking/lockdep: Add IRQs disabled/enabled assertion APIs: lockdep_assert_irqs_enabled()/disabled()
  locking/pvqspinlock: Implement hybrid PV queued/unfair locks
  locking/rwlocks: Fix comments
  x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized
  block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()
  workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes
  ...
2017-11-13 12:38:26 -08:00