Commit Graph

765 Commits

Author SHA1 Message Date
Linus Torvalds
e5a0fc4e20 CPU setup code changes:
- Clean up & simplify AP exception handling setup.
 
  - Consolidate the disjoint IDT setup code living in
    idt_setup_traps() and idt_setup_ist_traps() into
    a single idt_setup_traps() initialization function
    and call it before cpu_init().
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZdu0RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gWgA//ecu2FqCFT2gpZP7ABdJqAhtYA8f7rg/a
 BMKNNfTha/L9Dot2PSqMq8oVi8NA++EbKjuWeFujAIoU/2vso+YrmHc84O35nOOF
 u2Zgps7UK+ffKo11Yu1Vpb911ctJClAwL0CemsC30QDbpGVHPecLuxOIgY6E6BwC
 qhNqNLp0K4bFRq0ya27O8RPiz9LjCzUILHHvWSAl5m5tWqovED8aXdjrDJcFXqwY
 u9nuuRpUpQWqCldZP9X7+pdo4Z2HZjvIBjqHD/wl3VMjV6q+k+su6AjV9p1D8hoz
 otY96i8MQjD/sgIa1H+tUc2ZusGzDls+EpYiGaPmqeXMitKEwOFpVDAaT8SelUms
 bR4VQ9IYB1NG7Qbco3NQHMV1sWuvUJcLG6ILYFWXgH0hP1EDHFn/TvOn0rfJysbE
 AmCpwmUo0b8Bj6nbKkVcXxoX1FdeqiM5+cPxHxGVgxVoR0Umz13EX4y4cBzSIRht
 eYwT6H1CxR9a4TIr8cMBsN14QsnV3f6lv/RNfVdmZEJVVr0boRI90L2xMLBB9RkP
 z03g7VvfMuSWnKyOFheP4ae9ul2qxAT380+g1oHQH0XIFtj9yIhzJHpoUCzhgCra
 Ui2Z71Dhq0R1UNpPsPfc1XkQI9chiahn8gc1u2zvN4SzZa6DZH22VvGNK0ghoIxq
 5WFho50hNIk=
 =BPbv
 -----END PGP SIGNATURE-----

Merge tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 exception handling updates from Ingo Molnar:

 - Clean up & simplify AP exception handling setup.

 - Consolidate the disjoint IDT setup code living in idt_setup_traps()
   and idt_setup_ist_traps() into a single idt_setup_traps()
   initialization function and call it before cpu_init().

* tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/idt: Rework IDT setup for boot CPU
  x86/cpu: Init AP exception handling from cpu_init_secondary()
2021-06-28 12:46:30 -07:00
Ingo Molnar
a9e906b71f Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-06-03 19:00:49 +02:00
Borislav Petkov
b1efd0ff4b x86/cpu: Init AP exception handling from cpu_init_secondary()
SEV-ES guests require properly setup task register with which the TSS
descriptor in the GDT can be located so that the IST-type #VC exception
handler which they need to function properly, can be executed.

This setup needs to happen before attempting to load microcode in
ucode_cpu_init() on secondary CPUs which can cause such #VC exceptions.

Simplify the machinery by running that exception setup from a new function
cpu_init_secondary() and explicitly call cpu_init_exception_handling() for
the boot CPU before cpu_init(). The latter prepares for fixing and
simplifying the exception/IST setup on the boot CPU.

There should be no functional changes resulting from this patch.

[ tglx: Reworked it so cpu_init_exception_handling() stays seperate ]

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>                                                                                                                                                                                                                        
Link: https://lore.kernel.org/r/87k0o6gtvu.ffs@nanos.tec.linutronix.de
2021-05-18 14:49:21 +02:00
Huang Rui
3743d55b28 x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations
Some AMD Ryzen generations has different calculation method on maximum
performance. 255 is not for all ASICs, some specific generations should use 166
as the maximum performance. Otherwise, it will report incorrect frequency value
like below:

  ~ → lscpu | grep MHz
  CPU MHz:                         3400.000
  CPU max MHz:                     7228.3198
  CPU min MHz:                     2200.0000

[ mingo: Tidied up whitespace use. ]
[ Alexander Monakov <amonakov@ispras.ru>: fix 225 -> 255 typo. ]

Fixes: 41ea667227 ("x86, sched: Calculate frequency invariance for AMD systems")
Fixes: 3c55e94c0a ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
Reported-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Fixed-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210425073451.2557394-1-ray.huang@amd.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211791
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-05-13 12:10:24 +02:00
Valentin Schneider
f1a0a376ca sched/core: Initialize the idle task with preemption disabled
As pointed out by commit

  de9b8f5dcb ("sched: Fix crash trying to dequeue/enqueue the idle thread")

init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.

As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().

Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> idle_init().

Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().

Secondary startups were patched via coccinelle:

  @begone@
  @@

  -preempt_disable();
  ...
  cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
2021-05-12 13:01:45 +02:00
Wan Jiabing
3cf4524ce4 x86/smpboot: Remove duplicate includes
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210427063835.9039-1-wanjiabing@vivo.com
2021-05-05 21:50:13 +02:00
Linus Torvalds
c6536676c7 - turn the stack canary into a normal __percpu variable on 32-bit which
gets rid of the LAZY_GS stuff and a lot of code.
 
 - Add an insn_decode() API which all users of the instruction decoder
 should preferrably use. Its goal is to keep the details of the
 instruction decoder away from its users and simplify and streamline how
 one decodes insns in the kernel. Convert its users to it.
 
 - kprobes improvements and fixes
 
 - Set the maximum DIE per package variable on Hygon
 
 - Rip out the dynamic NOP selection and simplify all the machinery around
 selecting NOPs. Use the simplified NOPs in objtool now too.
 
 - Add Xeon Sapphire Rapids to list of CPUs that support PPIN
 
 - Simplify the retpolines by folding the entire thing into an
 alternative now that objtool can handle alternatives with stack
 ops. Then, have objtool rewrite the call to the retpoline with the
 alternative which then will get patched at boot time.
 
 - Document Intel uarch per models in intel-family.h
 
 - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the
 exception on Intel.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCHyJQACgkQEsHwGGHe
 VUpjiRAAwPZdwwp08ypZuMHR4EhLNru6gYhbAoALGgtYnQjLtn5onQhIeieK+R4L
 cmZpxHT9OFp5dXHk4kwygaQBsD4pPOiIpm60kye1dN3cSbOORRdkwEoQMpKMZ+5Y
 kvVsmn7lrwRbp600KdE4G6L5+N6gEgr0r6fMFWWGK3mgVAyCzPexVHgydcp131ch
 iYMo6/pPDcNkcV/hboVKgx7GISdQ7L356L1MAIW/Sxtw6uD/X4qGYW+kV2OQg9+t
 nQDaAo7a8Jqlop5W5TQUdMLKQZ1xK8SFOSX/nTS15DZIOBQOGgXR7Xjywn1chBH/
 PHLwM5s4XF6NT5VlIA8tXNZjWIZTiBdldr1kJAmdDYacrtZVs2LWSOC0ilXsd08Z
 EWtvcpHfHEqcuYJlcdALuXY8xDWqf6Q2F7BeadEBAxwnnBg+pAEoLXI/1UwWcmsj
 wpaZTCorhJpYo2pxXckVdHz2z0LldDCNOXOjjaWU8tyaOBKEK6MgAaYU7e0yyENv
 mVc9n5+WuvXuivC6EdZ94Pcr/KQsd09ezpJYcVfMDGv58YZrb6XIEELAJIBTu2/B
 Ua8QApgRgetx+1FKb8X6eGjPl0p40qjD381TADb4rgETPb1AgKaQflmrSTIik+7p
 O+Eo/4x/GdIi9jFk3K+j4mIznRbUX0cheTJgXoiI4zXML9Jv94w=
 =bm4S
 -----END PGP SIGNATURE-----

Merge tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 updates from Borislav Petkov:

 - Turn the stack canary into a normal __percpu variable on 32-bit which
   gets rid of the LAZY_GS stuff and a lot of code.

 - Add an insn_decode() API which all users of the instruction decoder
   should preferrably use. Its goal is to keep the details of the
   instruction decoder away from its users and simplify and streamline
   how one decodes insns in the kernel. Convert its users to it.

 - kprobes improvements and fixes

 - Set the maximum DIE per package variable on Hygon

 - Rip out the dynamic NOP selection and simplify all the machinery
   around selecting NOPs. Use the simplified NOPs in objtool now too.

 - Add Xeon Sapphire Rapids to list of CPUs that support PPIN

 - Simplify the retpolines by folding the entire thing into an
   alternative now that objtool can handle alternatives with stack ops.
   Then, have objtool rewrite the call to the retpoline with the
   alternative which then will get patched at boot time.

 - Document Intel uarch per models in intel-family.h

 - Make Sub-NUMA Clustering topology the default and Cluster-on-Die the
   exception on Intel.

* tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  x86, sched: Treat Intel SNC topology as default, COD as exception
  x86/cpu: Comment Skylake server stepping too
  x86/cpu: Resort and comment Intel models
  objtool/x86: Rewrite retpoline thunk calls
  objtool: Skip magical retpoline .altinstr_replacement
  objtool: Cache instruction relocs
  objtool: Keep track of retpoline call sites
  objtool: Add elf_create_undef_symbol()
  objtool: Extract elf_symbol_add()
  objtool: Extract elf_strtab_concat()
  objtool: Create reloc sections implicitly
  objtool: Add elf_create_reloc() helper
  objtool: Rework the elf_rebuild_reloc_section() logic
  objtool: Fix static_call list generation
  objtool: Handle per arch retpoline naming
  objtool: Correctly handle retpoline thunk calls
  x86/retpoline: Simplify retpolines
  x86/alternatives: Optimize optimize_nops()
  x86: Add insn_decode_kernel()
  x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declaration
  ...
2021-04-27 17:45:09 -07:00
Linus Torvalds
ea5bc7b977 Trivial cleanups and fixes all over the place.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGmYIACgkQEsHwGGHe
 VUr45w/8CSXr7MXaFBj4To0hTWJXSZyF6YGqlZOSJXFcFh4cWTNwfVOoFaV47aDo
 +HsCNTkGENcKhLrDUWDRiG/Uo46jxtOtl1vhq7U4pGemSYH871XWOKfb5k5XNMwn
 /uhaHMI4aEfd6bUFnF518NeyRIsD0BdqFj4tB7RbAiyFwdETDX9Tkj/uBKnQ4zon
 4tEDoXgThuK5YKK9zVQg5pa7aFp2zg1CAdX/WzBkS8BHVBPXSV0CF97AJYQOM/V+
 lUHv+BN3wp97GYHPQMPsbkNr8IuFoe2mIvikwjxg8iOFpzEU1G1u09XV9R+PXByX
 LclFTRqK/2uU5hJlcsBiKfUuidyErYMRYImbMAOREt2w0ogWVu2zQ7HkjVve25h1
 sQPwPudbAt6STbqRxvpmB3yoV4TCYwnF91FcWgEy+rcEK2BDsHCnScA45TsK5I1C
 kGR1K17pHXprgMZFPveH+LgxewB6smDv+HllxQdSG67LhMJXcs2Epz0TsN8VsXw8
 dlD3lGReK+5qy9FTgO7mY0xhiXGz1IbEdAPU4eRBgih13puu03+jqgMaMabvBWKD
 wax+BWJUrPtetwD5fBPhlS/XdJDnd8Mkv2xsf//+wT0s4p+g++l1APYxeB8QEehm
 Pd7Mvxm4GvQkfE13QEVIPYQRIXCMH/e9qixtY5SHUZDBVkUyFM0=
 =bO1i
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 cleanups from Borislav Petkov:
 "Trivial cleanups and fixes all over the place"

* tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Remove me from IDE/ATAPI section
  x86/pat: Do not compile stubbed functions when X86_PAT is off
  x86/asm: Ensure asm/proto.h can be included stand-alone
  x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files
  x86/msr: Make locally used functions static
  x86/cacheinfo: Remove unneeded dead-store initialization
  x86/process/64: Move cpu_current_top_of_stack out of TSS
  tools/turbostat: Unmark non-kernel-doc comment
  x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
  x86/fpu/math-emu: Fix function cast warning
  x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
  x86: Fix various typos in comments, take #2
  x86: Remove unusual Unicode characters from comments
  x86/kaslr: Return boolean values from a function returning bool
  x86: Fix various typos in comments
  x86/setup: Remove unused RESERVE_BRK_ARRAY()
  stacktrace: Move documentation for arch_stack_walk_reliable() to header
  x86: Remove duplicate TSC DEADLINE MSR definitions
2021-04-26 09:25:47 -07:00
Alison Schofield
2c88d45edb x86, sched: Treat Intel SNC topology as default, COD as exception
Commit 1340ccfa9a ("x86,sched: Allow topologies where NUMA nodes
share an LLC") added a vendor and model specific check to never
call topology_sane() for Intel Skylake Server systems where NUMA
nodes share an LLC.

Intel Ice Lake and Sapphire Rapids CPUs also enumerate an LLC that is
shared by multiple NUMA nodes. The LLC on these CPUs is shared for
off-package data access but private to the NUMA node for on-package
access. Rather than managing a list of allowable SNC topologies, make
this SNC topology the default, and treat Intel's Cluster-On-Die (COD)
topology as the exception.

In SNC mode, Sky Lake, Ice Lake, and Sapphire Rapids servers do not
emit this warning:

sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210310190233.31752-1-alison.schofield@intel.com
2021-04-15 18:34:20 +02:00
Vitaly Kuznetsov
fa26d0c778 ACPI: processor: Fix build when CONFIG_ACPI_PROCESSOR=m
Commit 8cdddd182b ("ACPI: processor: Fix CPU0 wakeup in
acpi_idle_play_dead()") tried to fix CPU0 hotplug breakage by copying
wakeup_cpu0() + start_cpu0() logic from hlt_play_dead()//mwait_play_dead()
into acpi_idle_play_dead(). The problem is that these functions are not
exported to modules so when CONFIG_ACPI_PROCESSOR=m build fails.

The issue could've been fixed by exporting both wakeup_cpu0()/start_cpu0()
(the later from assembly) but it seems putting the whole pattern into a
new function and exporting it instead is better.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 8cdddd182b ("CPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()")
Cc: <stable@vger.kernel.org> # 5.10+
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-07 19:02:43 +02:00
Vitaly Kuznetsov
8cdddd182b ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
Commit 496121c021 ("ACPI: processor: idle: Allow probing on platforms
with one ACPI C-state") broke CPU0 hotplug on certain systems, e.g.
I'm observing the following on AWS Nitro (e.g r5b.xlarge but other
instance types are affected as well):

 # echo 0 > /sys/devices/system/cpu/cpu0/online
 # echo 1 > /sys/devices/system/cpu/cpu0/online
 <10 seconds delay>
 -bash: echo: write error: Input/output error

In fact, the above mentioned commit only revealed the problem and did
not introduce it. On x86, to wakeup CPU an NMI is being used and
hlt_play_dead()/mwait_play_dead() loops are prepared to handle it:

	/*
	 * If NMI wants to wake up CPU0, start CPU0.
	 */
	if (wakeup_cpu0())
		start_cpu0();

cpuidle_play_dead() -> acpi_idle_play_dead() (which is now being called on
systems where it wasn't called before the above mentioned commit) serves
the same purpose but it doesn't have a path for CPU0. What happens now on
wakeup is:
 - NMI is sent to CPU0
 - wakeup_cpu0_nmi() works as expected
 - we get back to while (1) loop in acpi_idle_play_dead()
 - safe_halt() puts CPU0 to sleep again.

The straightforward/minimal fix is add the special handling for CPU0 on x86
and that's what the patch is doing.

Fixes: 496121c021 ("ACPI: processor: idle: Allow probing on platforms with one ACPI C-state")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: 5.10+ <stable@vger.kernel.org> # 5.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-04-01 13:37:55 +02:00
Ingo Molnar
d9f6e12fb0 x86: Fix various typos in comments
Fix ~144 single-word typos in arch/x86/ code comments.

Doing this in a single commit should reduce the churn.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
2021-03-18 15:31:53 +01:00
Rafael J. Wysocki
d11a1d08a0 cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
If the maximum performance level taken for computing the
arch_max_freq_ratio value used in the x86 scale-invariance code is
higher than the one corresponding to the cpuinfo.max_freq value
coming from the acpi_cpufreq driver, the scale-invariant utilization
falls below 100% even if the CPU runs at cpuinfo.max_freq or slightly
faster, which causes the schedutil governor to select a frequency
below cpuinfo.max_freq.  That frequency corresponds to a frequency
table entry below the maximum performance level necessary to get to
the "boost" range of CPU frequencies which prevents "boost"
frequencies from being used in some workloads.

While this issue is related to scale-invariance, it may be amplified
by commit db865272d9 ("cpufreq: Avoid configuring old governors as
default with intel_pstate") from the 5.10 development cycle which
made it extremely easy to default to schedutil even if the preferred
driver is acpi_cpufreq as long as intel_pstate is built too, because
the mere presence of the latter effectively removes the ondemand
governor from the defaults.  Distro kernels are likely to include
both intel_pstate and acpi_cpufreq on x86, so their users who cannot
use intel_pstate or choose to use acpi_cpufreq may easily be
affectecd by this issue.

If CPPC is available, it can be used to address this issue by
extending the frequency tables created by acpi_cpufreq to cover the
entire available frequency range (including "boost" frequencies) for
each CPU, but if CPPC is not there, acpi_cpufreq has no idea what
the maximum "boost" frequency is and the frequency tables created by
it cannot be extended in a meaningful way, so in that case make it
ask the arch scale-invariance code to to use the "nominal" performance
level for CPU utilization scaling in order to avoid the issue at hand.

Fixes: db865272d9 ("cpufreq: Avoid configuring old governors as default with intel_pstate")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-02-08 13:45:51 +01:00
Rafael J. Wysocki
9c7d9017a4 x86: PM: Register syscore_ops for scale invariance
On x86 scale invariace tends to be disabled during resume from
suspend-to-RAM, because the MPERF or APERF MSR values are not as
expected then due to updates taking place after the platform
firmware has been invoked to complete the suspend transition.

That, of course, is not desirable, especially if the schedutil
scaling governor is in use, because the lack of scale invariance
causes it to be less reliable.

To counter that effect, modify init_freq_invariance() to register
a syscore_ops object for scale invariance with the ->resume callback
pointing to init_counter_refs() which will run on the CPU starting
the resume transition (the other CPUs will be taken care of the
"online" operations taking place later).

Fixes: e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lkml.kernel.org/r/1803209.Mvru99baaF@kreacher
2021-01-19 17:04:03 +01:00
Linus Torvalds
148842c98a Yet another large set of x86 interrupt management updates:
- Simplification and distangling of the MSI related functionality
 
    - Let IO/APIC construct the RTE entries from an MSI message instead of
      having IO/APIC specific code in the interrupt remapping drivers
 
    - Make the retrieval of the parent interrupt domain (vector or remap
      unit) less hardcoded and use the relevant irqdomain callbacks for
      selection.
 
    - Allow the handling of more than 255 CPUs without a virtualized IOMMU
      when the hypervisor supports it. This has made been possible by the
      above modifications and also simplifies the existing workaround in the
      HyperV specific virtual IOMMU.
 
    - Cleanup of the historical timer_works() irq flags related
      inconsistencies.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/Xxd8THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYpOD/9C5TppNlPMUyx2SflH6bxt37pJEpln
 +hYTKsk+jSThntr5mfj+GifGvgmHOVBTGnlDUnUnrpN7TQmLFBzwTOtnBLW53AO2
 16/u0+Xci4LNCtEkaymf0Rq4MfsfriXHPJr0A/CnZ0tpHSf5QKHAiitSiGujdMlb
 gbq43+zXd+jNkH7vkOLPX/7dZVI1hNASQEevJu2tRR4xYTuXFdBxvLgYkHtYKKrK
 R1sbs6nI6yIzye2u4m4xGu29SxgUft+zdUf+UehJKM3yFmf51d9qpkX+kLaTWuaL
 VPsMItbn0kdvxwXQWO6DYnIAAnVKCklyHQJTZCoNq9Fe91OoByak1CEVspSOa1av
 JmycNSch4IYWasR4vVCB1gbb+V9SejcKu5SV3CDrEDqwkOIpfiqpriUXSCJTLlFd
 QOEDOLuuk/79Qs//J/tb/nJ4IuKv8WPudDfIlMro8wUsAr67DjD4mnXprZ+svwWx
 Ct/0/Memk+BSa0cw6pvg24BUZGN6zrufkBu2HKT9GOXRUdNkdLkiPhT8mK4T/O0l
 f90QCLjPSOJ/K/pLEWdUHEPmgC5Q9RsXOmwVGqX+RbjfP7mYTJXlmWnBb+cFNch0
 xFIH3SxVGylxxT06NX3SkvinrHj10CoAlmneefBlLtx6dF+2P84DAMZSF0OFToVI
 c2KMg5zoesI4bg==
 =8Gfs
 -----END PGP SIGNATURE-----

Merge tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 apic updates from Thomas Gleixner:
 "Yet another large set of x86 interrupt management updates:

   - Simplification and distangling of the MSI related functionality

   - Let IO/APIC construct the RTE entries from an MSI message instead
     of having IO/APIC specific code in the interrupt remapping drivers

   - Make the retrieval of the parent interrupt domain (vector or remap
     unit) less hardcoded and use the relevant irqdomain callbacks for
     selection.

   - Allow the handling of more than 255 CPUs without a virtualized
     IOMMU when the hypervisor supports it. This has made been possible
     by the above modifications and also simplifies the existing
     workaround in the HyperV specific virtual IOMMU.

   - Cleanup of the historical timer_works() irq flags related
     inconsistencies"

* tag 'x86-apic-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  x86/ioapic: Cleanup the timer_works() irqflags mess
  iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()
  iommu/amd: Fix IOMMU interrupt generation in X2APIC mode
  iommu/amd: Don't register interrupt remapping irqdomain when IR is disabled
  iommu/amd: Fix union of bitfields in intcapxt support
  x86/ioapic: Correct the PCI/ISA trigger type selection
  x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index
  x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it
  x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detected
  iommu/hyper-v: Disable IRQ pseudo-remapping if 15 bit APIC IDs are available
  x86/apic: Support 15 bits of APIC ID in MSI where available
  x86/ioapic: Handle Extended Destination ID field in RTE
  iommu/vt-d: Simplify intel_irq_remapping_select()
  x86: Kill all traces of irq_remapping_get_irq_domain()
  x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain
  x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain
  iommu/hyper-v: Implement select() method on remapping irqdomain
  iommu/vt-d: Implement select() method on remapping irqdomain
  iommu/amd: Implement select() method on remapping irqdomain
  x86/apic: Add select() method on vector irqdomain
  ...
2020-12-14 18:59:53 -08:00
Linus Torvalds
adb35e8dc9 Scheduler updates:
- migrate_disable/enable() support which originates from the RT tree and
    is now a prerequisite for the new preemptible kmap_local() API which aims
    to replace kmap_atomic().
 
  - A fair amount of topology and NUMA related improvements
 
  - Improvements for the frequency invariant calculations
 
  - Enhanced robustness for the global CPU priority tracking and decision
    making
 
  - The usual small fixes and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
 Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
 /QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
 3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
 sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
 dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
 u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
 Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
 CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
 MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
 e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
 Gsq0rJW5iiu/OQ==
 =Oz1V
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - migrate_disable/enable() support which originates from the RT tree
   and is now a prerequisite for the new preemptible kmap_local() API
   which aims to replace kmap_atomic().

 - A fair amount of topology and NUMA related improvements

 - Improvements for the frequency invariant calculations

 - Enhanced robustness for the global CPU priority tracking and decision
   making

 - The usual small fixes and enhancements all over the place

* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  sched/fair: Trivial correction of the newidle_balance() comment
  sched/fair: Clear SMT siblings after determining the core is not idle
  sched: Fix kernel-doc markup
  x86: Print ratio freq_max/freq_base used in frequency invariance calculations
  x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
  x86, sched: Calculate frequency invariance for AMD systems
  irq_work: Optimize irq_work_single()
  smp: Cleanup smp_call_function*()
  irq_work: Cleanup
  sched: Limit the amount of NUMA imbalance that can exist at fork time
  sched/numa: Allow a floating imbalance between NUMA nodes
  sched: Avoid unnecessary calculation of load imbalance at clone time
  sched/numa: Rename nr_running and break out the magic number
  sched: Make migrate_disable/enable() independent of RT
  sched/topology: Condition EAS enablement on FIE support
  arm64: Rebuild sched domains on invariance status changes
  sched/topology,schedutil: Wrap sched domains rebuild
  sched/uclamp: Allow to reset a task uclamp constraint value
  sched/core: Fix typos in comments
  Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
  ...
2020-12-14 18:29:11 -08:00
Giovanni Gherdovich
3149cd5530 x86: Print ratio freq_max/freq_base used in frequency invariance calculations
The value freq_max/freq_base is a fundamental component of frequency
invariance calculations. It may come from a variety of sources such as MSRs
or ACPI data, tracking it down when troubleshooting a system could be
non-trivial. It is worth saving it in the kernel logs.

 # dmesg | grep 'Estimated ratio of average max'
 [   14.024036] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1289

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-4-ggherdovich@suse.cz
2020-12-11 10:30:23 +01:00
Giovanni Gherdovich
976df7e573 x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
Frequency invariant accounting calculations need the ratio
freq_curr/freq_max, but freq_max is unknown as it depends on dynamic power
allocation between cores: AMD EPYC CPUs implement "Core Performance Boost".
Three candidates are considered to estimate this value:

- maximum non-boost frequency
- maximum boost frequency
- the mid point between the above two

Experimental data on an AMD EPYC Zen2 machine slightly favors the third
option, which is applied with this patch.

The analysis uses the ondemand cpufreq governor as baseline, and compares
it with schedutil in a number of configurations. Using the freq_max value
described above offers a moderate advantage in performance and efficiency:

sugov-max (freq_max=max_boost) performs the worst on tbench: less
throughput and reduced efficiency than the other invariant-schedutil
options (see "Data Overview" below). Consider that tbench is generally a
problematic case as no schedutil version currently is better than ondemand.

sugov-P0 (freq_max=max_P) is the worst on dbench, while the other sugov's
can surpass ondemand with less filesystem latency and slightly increased
efficiency.

1. DATA OVERVIEW
2. DETAILED PERFORMANCE TABLES
3. POWER CONSUMPTION TABLE

1. DATA OVERVIEW
================

sugov-noinv : non-invariant schedutil governor
sugov-max   : invariant schedutil, freq_max=max_boost
sugov-mid   : invariant schedutil, freq_max=midpoint
sugov-P0    : invariant schedutil, freq_max=max_P
perfgov     : performance governor

driver      : acpi_cpufreq
machine     : AMD EPYC 7742 (Zen2, aka "Rome"), dual socket,
              128 cores / 256 threads, SATA SSD storage, 250G of memory,
	      XFS filesystem

Benchmarks are described in the next section.
Tilde (~) means the value is the same as baseline.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
            ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0  better if
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                        PERFORMANCE RATIOS
tbench        1.00       1.44       0.90       0.87       0.93       0.93      higher
dbench        1.00       0.91       0.95       0.94       0.94       1.06      lower
kernbench     1.00       0.93       ~          ~          ~          0.97      lower
gitsource     1.00       0.66       0.97       0.96       ~          0.95      lower
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                    PERFORMANCE-PER-WATT RATIOS
tbench        1.00       1.16       0.84       0.84       0.88       0.85      higher
dbench        1.00       1.03       1.02       1.02       1.02       0.93      higher
kernbench     1.00       1.05       ~          ~          ~          ~         higher
gitsource     1.00       1.46       1.04       1.04       ~          1.05      higher

2. DETAILED PERFORMANCE TABLES
==============================

Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        427.19  +- 0.16% (        )     778.35  +- 0.10% (  82.20%)     346.92  +- 0.14% ( -18.79%)
Hmean  2        853.82  +- 0.09% (        )    1536.23  +- 0.03% (  79.93%)     694.36  +- 0.05% ( -18.68%)
Hmean  4       1657.54  +- 0.12% (        )    2938.18  +- 0.12% (  77.26%)    1362.81  +- 0.11% ( -17.78%)
Hmean  8       3301.87  +- 0.06% (        )    5679.10  +- 0.04% (  72.00%)    2693.35  +- 0.04% ( -18.43%)
Hmean  16      6139.65  +- 0.05% (        )    9498.81  +- 0.04% (  54.71%)    4889.97  +- 0.17% ( -20.35%)
Hmean  32     11170.28  +- 0.09% (        )   17393.25  +- 0.08% (  55.71%)    9104.55  +- 0.09% ( -18.49%)
Hmean  64     19322.97  +- 0.17% (        )   31573.91  +- 0.08% (  63.40%)   18552.52  +- 0.40% (  -3.99%)
Hmean  128    30383.71  +- 0.11% (        )   37416.91  +- 0.15% (  23.15%)   25938.70  +- 0.41% ( -14.63%)
Hmean  256    31143.96  +- 0.41% (        )   30908.76  +- 0.88% (  -0.76%)   29754.32  +- 0.24% (  -4.46%)
Hmean  512    30858.49  +- 0.26% (        )   38524.60  +- 1.19% (  24.84%)   42080.39  +- 0.56% (  36.37%)
Hmean  1024   39187.37  +- 0.19% (        )   36213.86  +- 0.26% (  -7.59%)   39555.98  +- 0.12% (   0.94%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        352.59  +- 1.03% ( -17.46%)     352.08  +- 0.75% ( -17.58%)     352.31  +- 1.48% ( -17.53%)
Hmean  2        697.32  +- 0.08% ( -18.33%)     700.16  +- 0.20% ( -18.00%)     696.79  +- 0.06% ( -18.39%)
Hmean  4       1369.88  +- 0.04% ( -17.35%)    1369.72  +- 0.07% ( -17.36%)    1365.91  +- 0.05% ( -17.59%)
Hmean  8       2696.79  +- 0.04% ( -18.33%)    2711.06  +- 0.04% ( -17.89%)    2715.10  +- 0.61% ( -17.77%)
Hmean  16      4725.03  +- 0.03% ( -23.04%)    4875.65  +- 0.02% ( -20.59%)    4953.05  +- 0.28% ( -19.33%)
Hmean  32      9231.65  +- 0.10% ( -17.36%)    8704.89  +- 0.27% ( -22.07%)   10562.02  +- 0.36% (  -5.45%)
Hmean  64     15364.27  +- 0.19% ( -20.49%)   17786.64  +- 0.15% (  -7.95%)   19665.40  +- 0.22% (   1.77%)
Hmean  128    42100.58  +- 0.13% (  38.56%)   34946.28  +- 0.13% (  15.02%)   38635.79  +- 0.06% (  27.16%)
Hmean  256    30660.23  +- 1.08% (  -1.55%)   32307.67  +- 0.54% (   3.74%)   31153.27  +- 0.12% (   0.03%)
Hmean  512    24604.32  +- 0.14% ( -20.27%)   40408.50  +- 1.10% (  30.95%)   38800.29  +- 1.23% (  25.74%)
Hmean  1024   35535.47  +- 0.28% (  -9.32%)   41070.38  +- 2.56% (   4.81%)   31308.29  +- 2.52% ( -20.11%)

Benchmark          : dbench (filesystem stressor)
Varying parameter  : number of clients
Unit               : seconds (lower is better)

NOTE-1: This dbench version measures the average latency of a set of filesystem
        operations, as we found the traditional dbench metric (throughput) to be
	misleading.
NOTE-2: Due to high variability, we partition the original dataset and apply
        statistical bootrapping (a resampling method). Accuracy is reported in the
	form of 95% confidence intervals.

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SubAmean  1         98.79  +- 0.92 (        )      83.36  +- 0.82 (  15.62%)      84.82  +- 0.92 (  14.14%)
SubAmean  2        116.00  +- 0.89 (        )     102.12  +- 0.77 (  11.96%)     109.63  +- 0.89 (   5.49%)
SubAmean  4        149.90  +- 1.03 (        )     132.12  +- 0.91 (  11.86%)     143.90  +- 1.15 (   4.00%)
SubAmean  8        182.41  +- 1.13 (        )     159.86  +- 0.93 (  12.36%)     165.82  +- 1.03 (   9.10%)
SubAmean  16       237.83  +- 1.23 (        )     219.46  +- 1.14 (   7.72%)     229.28  +- 1.19 (   3.59%)
SubAmean  32       334.34  +- 1.49 (        )     309.94  +- 1.42 (   7.30%)     321.19  +- 1.36 (   3.93%)
SubAmean  64       576.61  +- 2.16 (        )     540.75  +- 2.00 (   6.22%)     551.27  +- 1.99 (   4.39%)
SubAmean  128     1350.07  +- 4.14 (        )    1205.47  +- 3.20 (  10.71%)    1280.26  +- 3.75 (   5.17%)
SubAmean  256     3444.42  +- 7.97 (        )    3698.00 +- 27.43 (  -7.36%)    3494.14  +- 7.81 (  -1.44%)
SubAmean  2048   39457.89 +- 29.01 (        )   34105.33 +- 41.85 (  13.57%)   39688.52 +- 36.26 (  -0.58%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SubAmean  1         85.68  +- 1.04 (  13.27%)      84.16  +- 0.84 (  14.81%)      83.99  +- 0.90 (  14.99%)
SubAmean  2        108.42  +- 0.95 (   6.54%)     109.91  +- 1.39 (   5.24%)     112.06  +- 0.91 (   3.39%)
SubAmean  4        136.90  +- 1.04 (   8.67%)     137.59  +- 0.93 (   8.21%)     136.55  +- 0.95 (   8.91%)
SubAmean  8        163.15  +- 0.96 (  10.56%)     166.07  +- 1.02 (   8.96%)     165.81  +- 0.99 (   9.10%)
SubAmean  16       224.86  +- 1.12 (   5.45%)     223.83  +- 1.06 (   5.89%)     230.66  +- 1.19 (   3.01%)
SubAmean  32       320.51  +- 1.38 (   4.13%)     322.85  +- 1.49 (   3.44%)     321.96  +- 1.46 (   3.70%)
SubAmean  64       553.25  +- 1.93 (   4.05%)     554.19  +- 2.08 (   3.89%)     562.26  +- 2.22 (   2.49%)
SubAmean  128     1264.35  +- 3.72 (   6.35%)    1256.99  +- 3.46 (   6.89%)    2018.97 +- 18.79 ( -49.55%)
SubAmean  256     3466.25  +- 8.25 (  -0.63%)    3450.58  +- 8.44 (  -0.18%)    5032.12 +- 38.74 ( -46.09%)
SubAmean  2048   39133.10 +- 45.71 (   0.82%)   39905.95 +- 34.33 (  -1.14%)   53811.86 +-193.04 ( -36.38%)

Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        471.71 +- 26.61% (        )     409.88 +- 16.99% (  13.11%)     430.63  +- 0.18% (   8.71%)
Amean  4        211.87  +- 0.58% (        )     194.03  +- 0.74% (   8.42%)     215.33  +- 0.64% (  -1.63%)
Amean  8        109.79  +- 1.27% (        )     101.43  +- 1.53% (   7.61%)     111.05  +- 1.95% (  -1.15%)
Amean  16        59.50  +- 1.28% (        )      55.61  +- 1.35% (   6.55%)      59.65  +- 1.78% (  -0.24%)
Amean  32        34.94  +- 1.22% (        )      32.36  +- 1.95% (   7.41%)      35.44  +- 0.63% (  -1.43%)
Amean  64        22.58  +- 0.38% (        )      20.97  +- 1.28% (   7.11%)      22.41  +- 1.73% (   0.74%)
Amean  128       17.72  +- 0.44% (        )      16.68  +- 0.32% (   5.88%)      17.65  +- 0.96% (   0.37%)
Amean  256       16.44  +- 0.53% (        )      15.76  +- 0.32% (   4.18%)      16.76  +- 0.60% (  -1.93%)
Amean  512       16.54  +- 0.21% (        )      15.62  +- 0.41% (   5.53%)      16.84  +- 0.85% (  -1.83%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        421.30  +- 0.24% (  10.69%)     419.26  +- 0.15% (  11.12%)     414.38  +- 0.33% (  12.15%)
Amean  4  	217.81  +- 5.53% (  -2.80%)     211.63  +- 0.99% (   0.12%)     208.43  +- 0.47% (   1.63%)
Amean  8  	108.80  +- 0.43% (   0.90%)     108.48  +- 1.44% (   1.19%)     108.59  +- 3.08% (   1.09%)
Amean  16 	 58.84  +- 0.74% (   1.12%)      58.37  +- 0.94% (   1.91%)      57.78  +- 0.78% (   2.90%)
Amean  32 	 34.04  +- 2.00% (   2.59%)      34.28  +- 1.18% (   1.91%)      33.98  +- 2.21% (   2.75%)
Amean  64 	 22.22  +- 1.69% (   1.60%)      22.27  +- 1.60% (   1.38%)      22.25  +- 1.41% (   1.47%)
Amean  128	 17.55  +- 0.24% (   0.97%)      17.53  +- 0.94% (   1.04%)      17.49  +- 0.43% (   1.30%)
Amean  256	 16.51  +- 0.46% (  -0.40%)      16.48  +- 0.48% (  -0.19%)      16.44  +- 1.21% (   0.00%)
Amean  512	 16.50  +- 0.35% (   0.19%)      16.35  +- 0.42% (   1.14%)      16.37  +- 0.33% (   0.99%)

Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                  5.9.0-ondemand (BASELINE)                   5.9.0-perfgov               5.9.0-sugov-noinv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean          1035.76  +- 0.30% (        )     688.21  +- 0.04% (  33.56%)    1003.85  +- 0.14% (   3.08%)

                            5.9.0-sugov-max                 5.9.0-sugov-mid                  5.9.0-sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean           995.82  +- 0.08% (   3.86%)    1011.98  +- 0.03% (   2.30%)     986.87  +- 0.19% (   4.72%)

3. POWER CONSUMPTION TABLE
==========================

Average power consumption (watts).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
            ondemand  perfgov  sugov-noinv  sugov-max  sugov-mid  sugov-P0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
tbench4     227.25     281.83     244.17     236.76     241.50     247.99
dbench4     151.97     161.87     157.08     158.10     158.06     153.73
kernbench   162.78     167.22     162.90     164.19     164.65     164.72
gitsource   133.65     139.00     133.04     134.43     134.18     134.32

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-3-ggherdovich@suse.cz
2020-12-11 10:29:55 +01:00
Nathan Fontenot
41ea667227 x86, sched: Calculate frequency invariance for AMD systems
This is the first pass in creating the ability to calculate the
frequency invariance on AMD systems. This approach uses the CPPC
highest performance and nominal performance values that range from
0 - 255 instead of a high and base frquency. This is because we do
not have the ability on AMD to get a highest frequency value.

On AMD systems the highest performance and nominal performance
vaues do correspond to the highest and base frequencies for the system
so using them should produce an appropriate ratio but some tweaking
is likely necessary.

Due to CPPC being initialized later in boot than when the frequency
invariant calculation is currently made, I had to create a callback
from the CPPC init code to do the calculation after we have CPPC
data.

Special thanks to "kernel test robot <lkp@intel.com>" for reporting that
compilation of drivers/acpi/cppc_acpi.c is conditional to
CONFIG_ACPI_CPPC_LIB, not just CONFIG_ACPI.

[ ggherdovich@suse.cz: made safe under CPU hotplug, edited changelog. ]

Signed-off-by: Nathan Fontenot <nathan.fontenot@amd.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20201112182614.10700-2-ggherdovich@suse.cz
2020-12-11 10:26:00 +01:00
Paul E. McKenney
29368e0939 x86/smpboot: Move rcu_cpu_starting() earlier
The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough
in the CPU-hotplug onlining process, which results in lockdep splats
as follows:

=============================
WARNING: suspicious RCU usage
5.9.0+ #268 Not tainted
-----------------------------
kernel/kprobes.c:300 RCU-list traversed in non-reader section!!

other info that might help us debug this:

RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/1/0.

stack backtrace:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x77/0x97
 __is_insn_slot_addr+0x15d/0x170
 kernel_text_address+0xba/0xe0
 ? get_stack_info+0x22/0xa0
 __kernel_text_address+0x9/0x30
 show_trace_log_lvl+0x17d/0x380
 ? dump_stack+0x77/0x97
 dump_stack+0x77/0x97
 __lock_acquire+0xdf7/0x1bf0
 lock_acquire+0x258/0x3d0
 ? vprintk_emit+0x6d/0x2c0
 _raw_spin_lock+0x27/0x40
 ? vprintk_emit+0x6d/0x2c0
 vprintk_emit+0x6d/0x2c0
 printk+0x4d/0x69
 start_secondary+0x1c/0x100
 secondary_startup_64_no_verify+0xb8/0xbb

This is avoided by moving the call to rcu_cpu_starting up near
the beginning of the start_secondary() function.  Note that the
raw_smp_processor_id() is required in order to avoid calling into lockdep
before RCU has declared the CPU to be watched for readers.

Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Reported-by: Qian Cai <cai@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
Thomas Gleixner
8c44963b60 x86/apic: Cleanup destination mode
apic::irq_dest_mode is actually a boolean, but defined as u32 and named in
a way which does not explain what it means.

Make it a boolean and rename it to 'dest_mode_logical'

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-9-dwmw2@infradead.org
2020-10-28 20:26:25 +01:00
Thomas Gleixner
e57d04e5fa x86/apic: Get rid of apic:: Dest_logical
struct apic has two members which store information about the destination
mode: dest_logical and irq_dest_mode.

dest_logical contains a mask which was historically used to set the
destination mode in IPI messages. Over time the usage was reduced and the
logical/physical functions were seperated.

There are only a few places which still use 'dest_logical' but they can
use 'irq_dest_mode' instead.

irq_dest_mode is actually a boolean where 0 means physical destination mode
and 1 means logical destination mode. Of course the name does not reflect
the functionality. This will be cleaned up in a subsequent change.

Remove apic::dest_logical and fixup the remaining users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201024213535.443185-8-dwmw2@infradead.org
2020-10-28 20:26:24 +01:00
Joerg Roedel
520d030852 x86/smpboot: Load TSS and getcpu GDT entry before loading IDT
The IDT on 64-bit contains vectors which use paranoid_entry() and/or IST
stacks. To make these vectors work, the TSS and the getcpu GDT entry need
to be set up before the IDT is loaded.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-68-joro@8bytes.org
2020-09-09 11:33:20 +02:00
Ashok Raj
52d6b926aa x86/hotplug: Silence APIC only after all interrupts are migrated
There is a race when taking a CPU offline. Current code looks like this:

native_cpu_disable()
{
	...
	apic_soft_disable();
	/*
	 * Any existing set bits for pending interrupt to
	 * this CPU are preserved and will be sent via IPI
	 * to another CPU by fixup_irqs().
	 */
	cpu_disable_common();
	{
		....
		/*
		 * Race window happens here. Once local APIC has been
		 * disabled any new interrupts from the device to
		 * the old CPU are lost
		 */
		fixup_irqs(); // Too late to capture anything in IRR.
		...
	}
}

The fix is to disable the APIC *after* cpu_disable_common().

Testing was done with a USB NIC that provided a source of frequent
interrupts. A script migrated interrupts to a specific CPU and
then took that CPU offline.

Fixes: 60dcaad573 ("x86/hotplug: Silence APIC and NMI when CPU is dead")
Reported-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Tested-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Evan Green <evgreen@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/875zdarr4h.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/r/1598501530-45821-1-git-send-email-ashok.raj@intel.com
2020-08-27 09:29:23 +02:00
Linus Torvalds
335ad94c21 Misc changes:
- Prepare for Intel's new SERIALIZE instruction
  - Enable split-lock debugging on more CPUs
  - Add more Intel CPU models
  - Optimize stack canary initialization a bit
  - Simplify the Spectre logic a bit
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8oTsQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gueQ//Vh9sTi8+q5ZCxXnJQOi59SZsFy1quC2Q
 6bFoSQ46npMBoYyC2eDQ4exBncWLqorT8Vq/evlW3XPldUzHKOk7b4Omonwyrrj5
 dg5fqcRjpjU8ni6egmy4ElMjab53gDuv0yNazjONeBGeWuBGu4vI2bP2eY3Addfm
 2eo2d5ZIMRCdShrUNwToJWWt6q4DzL/lcrVZAlX0LwlWVLqUCdIARALRM7V1XDsC
 udxS8KnvhTaJ7l63BSJREe3AGksLQd9P4UkJS4IE4t0zINBIrME043BYBMTh2Vvk
 y3jykKegIbmhPquGXG8grJbPDUF2/3FxmGKTIhpoo++agb2fxt921y5kqMJwniNS
 H/Gk032iGzjjwWnOoWE56UeuDTOlweSIrm4EG22HyEDK7kOMJusjYAV5fB4Sv7vj
 TBy5q0PCIutjXDTL1hIWf0WDiQt6eGNQS/yt3FlapLBGVRQwMU/pKYVVIOIaFtNs
 szx1ZeiT358Ww8a2fQlb8pqv50Upmr2wqFkAsMbm+NN3N92cqK6gJlo1p7fnxIuG
 +YVASobjsqbn0S62v/9SB/KRJz07adlZ6Tl/O/ILRvWyqik7COCCHDVJ62Zzaz5z
 LqR2daVM5H+Lp6jGZuIoq/JiUkxUe2K990eWHb3PUpOC4Rh73PvtMc7WFhbAjbye
 XV3eOEDi65c=
 =sL2Q
 -----END PGP SIGNATURE-----

Merge tag 'x86-cpu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpu updates from Ingo Molar:

 - prepare for Intel's new SERIALIZE instruction

 - enable split-lock debugging on more CPUs

 - add more Intel CPU models

 - optimize stack canary initialization a bit

 - simplify the Spectre logic a bit

* tag 'x86-cpu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Refactor sync_core() for readability
  x86/cpu: Relocate sync_core() to sync_core.h
  x86/cpufeatures: Add enumeration for SERIALIZE instruction
  x86/split_lock: Enable the split lock feature on Sapphire Rapids and Alder Lake CPUs
  x86/cpu: Add Lakefield, Alder Lake and Rocket Lake models to the to Intel CPU family
  x86/stackprotector: Pre-initialize canary for secondary CPUs
  x86/speculation: Merge one test in spectre_v2_user_select_mitigation()
2020-08-03 17:08:02 -07:00
Brian Gerst
c9a1ff316b x86/stackprotector: Pre-initialize canary for secondary CPUs
The idle tasks created for each secondary CPU already have a random stack
canary generated by fork().  Copy the canary to the percpu variable before
starting the secondary CPU which removes the need to call
boot_init_stack_canary().

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200617225624.799335-1-brgerst@gmail.com
2020-06-18 13:09:17 +02:00
Giovanni Gherdovich
f4291df103 x86, sched: Bail out of frequency invariance if turbo_freq/base_freq gives 0
Be defensive against the case where the processor reports a base_freq
larger than turbo_freq (the ratio would be zero).

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200531182453.15254-4-ggherdovich@suse.cz
2020-06-15 14:10:02 +02:00
Giovanni Gherdovich
51beea8862 x86, sched: Bail out of frequency invariance if turbo frequency is unknown
There may be CPUs that support turbo boost but don't declare any turbo
ratio, i.e. their MSR_TURBO_RATIO_LIMIT is all zeroes. In that condition
scale-invariant calculations can't be performed.

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/20200531182453.15254-3-ggherdovich@suse.cz
2020-06-15 14:10:02 +02:00
Giovanni Gherdovich
e2b0d619b4 x86, sched: check for counters overflow in frequency invariant accounting
The product mcnt * arch_max_freq_ratio can overflows u64.

For context, a large value for arch_max_freq_ratio would be 5000,
corresponding to a turbo_freq/base_freq ratio of 5 (normally it's more like
1500-2000). A large increment frequency for the MPERF counter would be 5GHz
(the base clock of all CPUs on the market today is less than that). With
these figures, a CPU would need to go without a scheduler tick for around 8
days for the u64 overflow to happen. It is unlikely, but the check is
warranted.

Under similar conditions, the difference acnt of two consecutive APERF
readings can overflow as well.

In these circumstances is appropriate to disable frequency invariant
accounting: the feature relies on measures of the clock frequency done at
every scheduler tick, which need to be "fresh" to be at all meaningful.

A note on i386: prior to version 5.1, the GCC compiler didn't have the
builtin function __builtin_mul_overflow. In these GCC versions the macro
check_mul_overflow needs __udivdi3() to do (u64)a/b, which the kernel
doesn't provide. For this reason this change fails to build on i386 if
GCC<5.1, and we protect the entire frequency invariant code behind
CONFIG_X86_64 (special thanks to "kbuild test robot" <lkp@intel.com>).

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200531182453.15254-2-ggherdovich@suse.cz
2020-06-15 14:10:02 +02:00
Mike Rapoport
65fddcfca8 mm: reorder includes after introduction of linux/pgtable.h
The replacement of <asm/pgrable.h> with <linux/pgtable.h> made the include
of the latter in the middle of asm includes.  Fix this up with the aid of
the below script and manual adjustments here and there.

	import sys
	import re

	if len(sys.argv) is not 3:
	    print "USAGE: %s <file> <header>" % (sys.argv[0])
	    sys.exit(1)

	hdr_to_move="#include <linux/%s>" % sys.argv[2]
	moved = False
	in_hdrs = False

	with open(sys.argv[1], "r") as f:
	    lines = f.readlines()
	    for _line in lines:
		line = _line.rstrip('
')
		if line == hdr_to_move:
		    continue
		if line.startswith("#include <linux/"):
		    in_hdrs = True
		elif not moved and in_hdrs:
		    moved = True
		    print hdr_to_move
		print line

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Mike Rapoport
ca5999fde0 mm: introduce include/linux/pgtable.h
The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.

Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:13 -07:00
Linus Torvalds
17e0a7cb6a Misc cleanups, with an emphasis on removing obsolete/dead code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VLcQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iFnhAArGBqco3C2RPQugv7UDDbKEaMvxOGrc5B
 kwnyOS/k/yeIkfhT9u11oBuLcaj/Zgw8YCjFyRfaNsorRqnytLyZzZ6PvdCCE3YU
 X3DVYgulcdAQnM4bS2e3Kt9ciJvFxB27XNm0AfuyLMUxMqCD+iIO4gJ6TuQNBYy3
 dfUMfB1R9OUDW13GCrASe+p1Dw76uaqVngdFWJhnC8Rm49E6gFXq7CLQp5Cka81I
 KZeJ8I6ug9p3gqhOIXdi+S6g5CM5jf86Wkk7dOHwHFH7CceFb3FIz7z0n1je4Wgd
 L5rYX7+PwfNeZ73GIuvEBN+agJH2K0H/KmnlWNWeZHzc+J12MeruSdSMBIkBOEpn
 iSbYAOmDpQLzBjTdZjC8bDqTZf472WrTh4VwN9NxHLucjdC+IqGoTAvnyyEOmZ5o
 R7sv7Q++316CVwRhYVXbzwZcqtiinCDE1EkP5nKTo9z3z0kMF5+ce/k7wn5sgZIk
 zJq3LXtaToiDoDRAPGxcvFPts9MdC0EI1aKTIjaK/n6i2h/SpJfrTKgANWaldYTe
 XJIqlSB43saqf5YAQ3/sY+wnpCRBmmCU+sfKja4C8bH7RuggI3mZS19uhFs0Qctq
 Yx5bIXVSBAIqjJtgzQ0WAAZ5LrCpNNyAzb35ZYefQlGyJlx1URKXVBmxa6S99biU
 KiYX7Dk5uhQ=
 =0ZQd
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups, with an emphasis on removing obsolete/dead code"

* tag 'x86-cleanups-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/spinlock: Remove obsolete ticket spinlock macros and types
  x86/mm: Drop deprecated DISCONTIGMEM support for 32-bit
  x86/apb_timer: Drop unused declaration and macro
  x86/apb_timer: Drop unused TSC calibration
  x86/io_apic: Remove unused function mp_init_irq_at_boot()
  x86/mm: Stop printing BRK addresses
  x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()
  x86/nmi: Remove edac.h include leftover
  mm: Remove MPX leftovers
  x86/mm/mmap: Fix -Wmissing-prototypes warnings
  x86/early_printk: Remove unused includes
  crash_dump: Remove no longer used saved_max_pfn
  x86/smpboot: Remove the last ICPU() macro
2020-06-01 13:47:10 -07:00
Linus Torvalds
d861f6e682 Misc cleanups in the SMP hotplug and cross-call code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl7VJfsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ihcA/+Ko18kdGRPAlShM9qkDWO5N80p1LEp7F0
 ku1OxPAz9ii7K/jlnGr9wYYPxsIL3lbFeqFE7q5q5socXufaN8MUj9sVCmN7ScmR
 zO84aTHtxrJJhKIPM6HkUTbVl5KrQaud3F/J56CCjuKPsJWy9iuCGnKtfKK38bx+
 qJEfVKVm95Bv0NSEvqvci3DKKPYjzpKzuuttHXQ8Z80zG94FEkwj0JwZzttIjLl1
 rgRMgWTH7+3tQCMnZEfXG8xBxbXS9i3hKyr/v5QTNgIICyXGquPkf5MiwjJFS2Xb
 wpPqNh8HTo5kUJstYygRjcftatU7K72h2Rz/CoUkN2roNYlvRAhdBaBMwN0cGaG8
 pPhnLHHHRYZjl4fiROgRwVV3A6LcAHSrIcKzwGrvpCSpqyVozPGsmD/e8ZG1JYpC
 vxESTZbCDywng2Ls8jqQBut+dFGElvopXl1s004bCak89IFR4p15qojMJK2MSsqu
 BxhjIoqp8/f1fsAX+1p0RBEYnEr1KFtWa+nY8aVKL6bEx+Y7Qyq0ypMGtKavP06X
 VMcPMm1gYeXoGpLaTLYBRL5t7Rmm7i+xufuDQKUJetenfh2YS4aQ9lfV+rsQH1YE
 wavQrbwThfBZ9K1XkEmOkSqONysZ2YAtK9slKzciQIZvY3V8NbKAmBudCgqTgarp
 xqeW9NFfeFc=
 =Rr2n
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP updates from Ingo Molnar:
 "Misc cleanups in the SMP hotplug and cross-call code"

* tag 'smp-core-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Remove __freeze_secondary_cpus()
  cpu/hotplug: Remove disable_nonboot_cpus()
  cpu/hotplug: Fix a typo in comment "broadacasted"->"broadcasted"
  smp: Use smp_call_func_t in on_each_cpu()
2020-06-01 13:38:55 -07:00
Borislav Petkov
a9a3ed1eff x86: Fix early boot crash on gcc-10, third try
... or the odyssey of trying to disable the stack protector for the
function which generates the stack canary value.

The whole story started with Sergei reporting a boot crash with a kernel
built with gcc-10:

  Kernel panic — not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5—00235—gfffb08b37df9 #139
  Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M—D3H, BIOS F12 11/14/2013
  Call Trace:
    dump_stack
    panic
    ? start_secondary
    __stack_chk_fail
    start_secondary
    secondary_startup_64
  -—-[ end Kernel panic — not syncing: stack—protector: Kernel stack is corrupted in: start_secondary

This happens because gcc-10 tail-call optimizes the last function call
in start_secondary() - cpu_startup_entry() - and thus emits a stack
canary check which fails because the canary value changes after the
boot_init_stack_canary() call.

To fix that, the initial attempt was to mark the one function which
generates the stack canary with:

  __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)

however, using the optimize attribute doesn't work cumulatively
as the attribute does not add to but rather replaces previously
supplied optimization options - roughly all -fxxx options.

The key one among them being -fno-omit-frame-pointer and thus leading to
not present frame pointer - frame pointer which the kernel needs.

The next attempt to prevent compilers from tail-call optimizing
the last function call cpu_startup_entry(), shy of carving out
start_secondary() into a separate compilation unit and building it with
-fno-stack-protector, was to add an empty asm("").

This current solution was short and sweet, and reportedly, is supported
by both compilers but we didn't get very far this time: future (LTO?)
optimization passes could potentially eliminate this, which leads us
to the third attempt: having an actual memory barrier there which the
compiler cannot ignore or move around etc.

That should hold for a long time, but hey we said that about the other
two solutions too so...

Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Kalle Valo <kvalo@codeaurora.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
2020-05-15 11:48:01 +02:00
Qais Yousef
5655585589 cpu/hotplug: Remove disable_nonboot_cpus()
The single user could have called freeze_secondary_cpus() directly.

Since this function was a source of confusion, remove it as it's
just a pointless wrapper.

While at it, rename enable_nonboot_cpus() to thaw_secondary_cpus() to
preserve the naming symmetry.

Done automatically via:

	git grep -l enable_nonboot_cpus | xargs sed -i 's/enable_nonboot_cpus/thaw_secondary_cpus/g'

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Link: https://lkml.kernel.org/r/20200430114004.17477-1-qais.yousef@arm.com
2020-05-07 15:18:40 +02:00
Giovanni Gherdovich
db441bd9f6 x86, sched: Move check for CPU type to caller function
Improve readability of the function intel_set_max_freq_ratio() by moving
the check for KNL CPUs there, together with checks for GLM and SKX.

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200416054745.740-5-ggherdovich@suse.cz
2020-04-22 23:10:13 +02:00
Peter Zijlstra (Intel)
b56e7d45e8 x86, sched: Don't enable static key when starting secondary CPUs
The static key arch_scale_freq_key only needs to be enabled once (at
boot). This change fixes a bug by which the key was enabled every time cpu0
is started, even as a secondary CPU during cpu hotplug. Secondary CPUs are
started from the idle thread: setting a static key from there means
acquiring a lock and may result in sleeping in the idle task, causing CPU
lockup.

Another consequence of this change is that init_counter_refs() is now
called on each CPU correctly; previously the function on_each_cpu() was
used, but it was called at boot when the only online cpu is cpu0.

[ggherdovich@suse.cz: Tested and wrote changelog]
Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200416054745.740-4-ggherdovich@suse.cz
2020-04-22 23:10:13 +02:00
Giovanni Gherdovich
23ccee22e8 x86, sched: Account for CPUs with less than 4 cores in freq. invariance
If a CPU has less than 4 physical cores, MSR_TURBO_RATIO_LIMIT will
rightfully report that the 4C turbo ratio is zero. In such cases, use the
1C turbo ratio instead for frequency invariance calculations.

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Reported-by: Like Xu <like.xu@linux.intel.com>
Reported-by: Neil Rickert <nwr10cst-oslnx@yahoo.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Link: https://lkml.kernel.org/r/20200416054745.740-3-ggherdovich@suse.cz
2020-04-22 23:10:13 +02:00
Giovanni Gherdovich
9a6c2c3c7a x86, sched: Bail out of frequency invariance if base frequency is unknown
Some hypervisors such as VMWare ESXi 5.5 advertise support for
X86_FEATURE_APERFMPERF but then fill all MSR's with zeroes. In particular,
MSR_PLATFORM_INFO set to zero tricks the code that wants to know the base
clock frequency of the CPU (highest non-turbo frequency), producing a
division by zero when computing the ratio turbo_freq/base_freq necessary
for frequency invariant accounting.

It is to be noted that even if MSR_PLATFORM_INFO contained the appropriate
data, APERF and MPERF are constantly zero on ESXi 5.5, thus freq-invariance
couldn't be done in principle (not that it would make a lot of sense in a
VM anyway). The real problem is advertising X86_FEATURE_APERFMPERF. This
appears to be fixed in more recent versions: ESXi 6.7 doesn't advertise
that feature.

Fixes: 1567c3e346 ("x86, sched: Add support for frequency invariance")
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200416054745.740-2-ggherdovich@suse.cz
2020-04-22 23:10:13 +02:00
Borislav Petkov
2fa9a3cf30 x86/smpboot: Remove the last ICPU() macro
Now all is using the shiny new macros.

No code changed:

  # arch/x86/kernel/smpboot.o:

   text    data     bss     dec     hex filename
  16432    2649      40   19121    4ab1 smpboot.o.before
  16432    2649      40   19121    4ab1 smpboot.o.after

md5:
   a58104003b72c1de533095bc5a4c30a9  smpboot.o.before.asm
   a58104003b72c1de533095bc5a4c30a9  smpboot.o.after.asm

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200324185836.GI22931@zn.tnic
2020-04-13 10:34:09 +02:00
Linus Torvalds
fdf5563a72 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "This topic tree contains more commits than usual:

   - most of it are uaccess cleanups/reorganization by Al

   - there's a bunch of prototype declaration (--Wmissing-prototypes)
     cleanups

   - misc other cleanups all around the map"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/mm/set_memory: Fix -Wmissing-prototypes warnings
  x86/efi: Add a prototype for efi_arch_mem_reserve()
  x86/mm: Mark setup_emu2phys_nid() static
  x86/jump_label: Move 'inline' keyword placement
  x86/platform/uv: Add a missing prototype for uv_bau_message_interrupt()
  kill uaccess_try()
  x86: unsafe_put-style macro for sigmask
  x86: x32_setup_rt_frame(): consolidate uaccess areas
  x86: __setup_rt_frame(): consolidate uaccess areas
  x86: __setup_frame(): consolidate uaccess areas
  x86: setup_sigcontext(): list user_access_{begin,end}() into callers
  x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit)
  x86: ia32_setup_rt_frame(): consolidate uaccess areas
  x86: ia32_setup_frame(): consolidate uaccess areas
  x86: ia32_setup_sigcontext(): lift user_access_{begin,end}() into the callers
  x86/alternatives: Mark text_poke_loc_init() static
  x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl()
  x86/mm: Drop pud_mknotpresent()
  x86: Replace setup_irq() by request_irq()
  x86/configs: Slightly reduce defconfigs
  ...
2020-03-31 11:04:05 -07:00
Linus Torvalds
642e53ead6 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle are:

   - Various NUMA scheduling updates: harmonize the load-balancer and
     NUMA placement logic to not work against each other. The intended
     result is better locality, better utilization and fewer migrations.

   - Introduce Thermal Pressure tracking and optimizations, to improve
     task placement on thermally overloaded systems.

   - Implement frequency invariant scheduler accounting on (some) x86
     CPUs. This is done by observing and sampling the 'recent' CPU
     frequency average at ~tick boundaries. The CPU provides this data
     via the APERF/MPERF MSRs. This hopefully makes our capacity
     estimates more precise and keeps tasks on the same CPU better even
     if it might seem overloaded at a lower momentary frequency. (As
     usual, turbo mode is a complication that we resolve by observing
     the maximum frequency and renormalizing to it.)

   - Add asymmetric CPU capacity wakeup scan to improve capacity
     utilization on asymmetric topologies. (big.LITTLE systems)

   - PSI fixes and optimizations.

   - RT scheduling capacity awareness fixes & improvements.

   - Optimize the CONFIG_RT_GROUP_SCHED constraints code.

   - Misc fixes, cleanups and optimizations - see the changelog for
     details"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
  threads: Update PID limit comment according to futex UAPI change
  sched/fair: Fix condition of avg_load calculation
  sched/rt: cpupri_find: Trigger a full search as fallback
  kthread: Do not preempt current task if it is going to call schedule()
  sched/fair: Improve spreading of utilization
  sched: Avoid scale real weight down to zero
  psi: Move PF_MEMSTALL out of task->flags
  MAINTAINERS: Add maintenance information for psi
  psi: Optimize switching tasks inside shared cgroups
  psi: Fix cpu.pressure for cpu.max and competing cgroups
  sched/core: Distribute tasks within affinity masks
  sched/fair: Fix enqueue_task_fair warning
  thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code
  sched/rt: Remove unnecessary push for unfit tasks
  sched/rt: Allow pulling unfitting task
  sched/rt: Optimize cpupri_find() on non-heterogenous systems
  sched/rt: Re-instate old behavior in select_task_rq_rt()
  sched/rt: cpupri_find: Implement fallback mechanism for !fit case
  sched/fair: Fix reordering of enqueue/dequeue_task_fair()
  sched/fair: Fix runnable_avg for throttled cfs
  ...
2020-03-30 17:01:51 -07:00
Thomas Gleixner
adefe55e72 x86/kernel: Convert to new CPU match macros
The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.

Get rid the of the local macro wrappers for consistency.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lkml.kernel.org/r/20200320131509.250559388@linutronix.de
2020-03-24 21:28:26 +01:00
Martin Molnar
4d1d0977a2 x86: Fix a handful of typos
Fix a couple of typos in code comments.

 [ bp: While at it: s/IRQ's/IRQs/. ]

Signed-off-by: Martin Molnar <martin.molnar.programming@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/0819a044-c360-44a4-f0b6-3f5bafe2d35c@gmail.com
2020-02-16 20:58:06 +01:00
Giovanni Gherdovich
918229cdd5 x86/intel_pstate: Handle runtime turbo disablement/enablement in frequency invariance
On some platforms such as the Dell XPS 13 laptop the firmware disables turbo
when the machine is disconnected from AC, and viceversa it enables it again
when it's reconnected. In these cases a _PPC ACPI notification is issued.

The scheduler needs to know freq_max for frequency-invariant calculations.
To account for turbo availability to come and go, record freq_max at boot as
if turbo was available and store it in a helper variable. Use a setter
function to swap between freq_base and freq_max every time turbo goes off or on.

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-7-ggherdovich@suse.cz
2020-01-28 21:37:06 +01:00
Giovanni Gherdovich
298c6f99bf x86, sched: Add support for frequency invariance on ATOM
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the 1-core
turbo ratio.

We intended to perform tests validating that this patch doesn't regress in
terms of energy efficiency, given that this is the primary concern on Atom
processors. Alas, we found out that turbostat doesn't support reading RAPL
interfaces on our test machine (Airmont), and we don't have external equipment
to measure power consumption; all we have is the performance results of the
benchmarks we ran.

Test machine:

Platform    : Dell Wyse 3040 Thin Client[1]
CPU Model   : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont)
Fam/Mod/Ste : 6:76:4
Topology    : 1 socket, 4 cores / 4 threads
Memory      : 2G
Storage     : onboard flash, XFS filesystem

[1] https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client

Base frequency and available turbo levels (MHz):

    Min Operating Freq   266 |***
    Low Freq Mode        800 |********
    Base Freq           2400 |************************
    4 Cores             2800 |****************************
    3 Cores             2800 |****************************
    2 Cores             3200 |********************************
    1 Core              3200 |********************************

Tested kernels:

Baseline      : v5.4-rc1,              intel_pstate passive,  schedutil
Comparison #1 : v5.4-rc1,              intel_pstate active ,  powersave
Comparison #2 : v5.4-rc1, this patch,  intel_pstate passive,  schedutil

tbench, hackbench and kernbench performed the same under all three kernels;
dbench ran faster with intel_pstate/powersave and the git unit tests were a
lot faster with intel_pstate/powersave and invariant schedutil wrt the
baseline. Not that any of this is terrbily interesting anyway, one doesn't buy
an Atom system to go fast. Power consumption regressions aren't expected but
we lack the equipment to make that measurement. Turbostat seems to think that
reading RAPL on this machine isn't a good idea and we're trusting that
decision.

comparison ratio of performance with baseline; 1.00 means neutral,
lower is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench                0.90             ~
    kernbench             0.98          0.97
    gitsource             0.63          0.43

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-6-ggherdovich@suse.cz
2020-01-28 21:37:05 +01:00
Giovanni Gherdovich
eacf0474ae x86, sched: Add support for frequency invariance on ATOM_GOLDMONT*
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On GOLDMONT (aka Apollo Lake), GOLDMONT_D (aka Denverton) and
GOLDMONT_PLUS CPUs (aka Gemini Lake) set freq_max to the highest frequency
reported by the CPU.

The encoding of turbo ratios for GOLDMONT* is identical to the one for
SKYLAKE_X, but we treat the Atom case apart because we want to set freq_max to
a higher value, thus the ratio freq_curr/freq_max to be lower, leading to more
conservative frequency selections (favoring power efficiency).

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-5-ggherdovich@suse.cz
2020-01-28 21:37:04 +01:00
Giovanni Gherdovich
8bea0dfb4a x86, sched: Add support for frequency invariance on XEON_PHI_KNL/KNM
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency
reported by the CPU.

Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either
one or two turbo frequencies; in the former case that's 100 MHz above the base
frequency, in the latter case the two levels are 100 MHz and 200 MHz above
base frequency.

We set freq_max to the second-highest frequency reported by the CPU. This
could be the base frequency (if only one turbo level is available) or the first
turbo level (if two levels are available). The rationale is to compromise
between power efficiency or performance -- going straight to max turbo would
favor efficiency and blindly using base freq would favor performance.

For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi
to get the available frequencies (taken from a comment in turbostat's sources):

    [0] -- Reserved
    [7:1] -- Base value of number of active cores of bucket 1.
    [15:8] -- Base value of freq ratio of bucket 1.
    [20:16] -- +ve delta of number of active cores of bucket 2.
    i.e. active cores of bucket 2 =
    active cores of bucket 1 + delta
    [23:21] -- Negative delta of freq ratio of bucket 2.
    i.e. freq ratio of bucket 2 =
    freq ratio of bucket 1 - delta
    [28:24]-- +ve delta of number of active cores of bucket 3.
    [31:29]-- -ve delta of freq ratio of bucket 3.
    [36:32]-- +ve delta of number of active cores of bucket 4.
    [39:37]-- -ve delta of freq ratio of bucket 4.
    [44:40]-- +ve delta of number of active cores of bucket 5.
    [47:45]-- -ve delta of freq ratio of bucket 5.
    [52:48]-- +ve delta of number of active cores of bucket 6.
    [55:53]-- -ve delta of freq ratio of bucket 6.
    [60:56]-- +ve delta of number of active cores of bucket 7.
    [63:61]-- -ve delta of freq ratio of bucket 7.

1. PERFORMANCE EVALUATION: TBENCH +5%
2. NEUTRAL BENCHMARKS (ALL OTHERS)
3. TEST SETUP

1. PERFORMANCE EVALUATION: TBENCH +5%
-------------------------------------

A performance evaluation was conducted on a Knights Mill machine (see "Test
Setup" below), were the frequency-invariance patch (on schedutil) is compared
to both non-invariant schedutil and active intel_pstate with powersave: all
three tested kernels behave the same performance-wise and with regard to power
consumption (performance per watt). The only notable difference is tbench:

comparison ratio of performance with baseline; 1.00 means neutral,
higher is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench                1.04          1.05

performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench                1.03          1.04

which essentially means that frequency-invariant schedutil is 5% better than
baseline, the same as intel_pstate+powersave.

As the results above are averaged over the varying parameter, here the detailed
table.

Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                    5.2.0 vanilla (BASELINE)                 5.2.0 intel_pstate                     5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean   1         49.06  +- 2.12% (        )         51.66  +- 1.52% (   5.30%)         52.87  +- 0.88% (   7.76%)
Hmean   2         93.82  +- 0.45% (        )        103.24  +- 0.70% (  10.05%)        105.90  +- 0.70% (  12.88%)
Hmean   4        192.46  +- 1.15% (        )        215.95  +- 0.60% (  12.21%)        215.78  +- 1.43% (  12.12%)
Hmean   8        406.74  +- 2.58% (        )        438.58  +- 0.36% (   7.83%)        437.61  +- 0.97% (   7.59%)
Hmean   16       857.70  +- 1.22% (        )        890.26  +- 0.72% (   3.80%)        889.11  +- 0.73% (   3.66%)
Hmean   32      1760.10  +- 0.92% (        )       1791.70  +- 0.44% (   1.79%)       1787.95  +- 0.44% (   1.58%)
Hmean   64      3183.50  +- 0.34% (        )       3183.19  +- 0.36% (  -0.01%)       3187.53  +- 0.36% (   0.13%)
Hmean   128     4830.96  +- 0.31% (        )       4846.53  +- 0.30% (   0.32%)       4855.86  +- 0.30% (   0.52%)
Hmean   256     5467.98  +- 0.38% (        )       5793.80  +- 0.28% (   5.96%)       5821.94  +- 0.17% (   6.47%)
Hmean   512     5398.10  +- 0.06% (        )       5745.56  +- 0.08% (   6.44%)       5503.68  +- 0.07% (   1.96%)
Hmean   1024    5290.43  +- 0.63% (        )       5221.07  +- 0.47% (  -1.31%)       5277.22  +- 0.80% (  -0.25%)
Hmean   1088    5139.71  +- 0.57% (        )       5236.02  +- 0.71% (   1.87%)       5190.57  +- 0.41% (   0.99%)

2. NEUTRAL BENCHMARKS (ALL OTHERS)
----------------------------------

* pgbench (both read/write and read-only)
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf
* dbench
* kernbench
* gitsource (git unit test suite)

3. TEST SETUP
-------------

Test machine:

CPU Model   : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. Knights Mill)
Fam/Mod/Ste : 6:133:0
Topology    : 1 socket, 68 cores / 272 threads
Memory      : 96G
Storage     : rotary, XFS filesystem

Max EFFICiency, BASE frequency and available turbo levels (MHz):

    EFFIC   1000 |**********
    BASE    1100 |***********
    68C     1100 |***********
    30C     1200 |************

Tested kernels:

Baseline      : v5.2,              intel_pstate passive,  schedutil
Comparison #1 : v5.2,              intel_pstate active ,  powersave
Comparison #2 : v5.2, this patch,  intel_pstate passive,  schedutil

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-4-ggherdovich@suse.cz
2020-01-28 21:37:02 +01:00
Giovanni Gherdovich
2a0abc5969 x86, sched: Add support for frequency invariance on SKYLAKE_X
The scheduler needs the ratio freq_curr/freq_max for frequency-invariant
accounting. On SKYLAKE_X CPUs set freq_max to the highest frequency that can
be sustained by a group of at least 4 cores.

From the changelog of commit 31e07522be ("tools/power turbostat: fix
decoding for GLM, DNV, SKX turbo-ratio limits"):

 >   Newer processors do not hard-code the the number of cpus in each bin
 >   to {1, 2, 3, 4, 5, 6, 7, 8}  Rather, they can specify any number
 >   of CPUS in each of the 8 bins:
 >
 >   eg.
 >
 >   ...
 >   37 * 100.0 = 3600.0 MHz max turbo 4 active cores
 >   38 * 100.0 = 3700.0 MHz max turbo 3 active cores
 >   39 * 100.0 = 3800.0 MHz max turbo 2 active cores
 >   39 * 100.0 = 3900.0 MHz max turbo 1 active cores
 >
 >   could now look something like this:
 >
 >   ...
 >   37 * 100.0 = 3600.0 MHz max turbo 16 active cores
 >   38 * 100.0 = 3700.0 MHz max turbo 8 active cores
 >   39 * 100.0 = 3800.0 MHz max turbo 4 active cores
 >   39 * 100.0 = 3900.0 MHz max turbo 2 active cores

This encoding of turbo levels applies to both SKYLAKE_X and GOLDMONT/GOLDMONT_D,
but we treat these two classes in separate commits because their freq_max
values need to be different. For SKX we prefer a lower freq_max in the ratio
freq_curr/freq_max, allowing load and utilization to overshoot and the
schedutil governor to be more performance-oriented. Models from the Atom
series (such as GOLDMONT*) are handled in a forthcoming commit as they have to
favor power-efficiency over performance.

Results from a performance evaluation follow.

1. TEST SETUP
2. NEUTRAL BENCHMARKS
3. NON-NEUTRAL BENCHMARKS
4. DETAILED TABLES

1. TEST SETUP
-------------

Test machine:

CPU Model   : Intel Xeon Platinum 8260L CPU @ 2.40GHz (a.k.a. Cascade Lake)
Fam/Mod/Ste : 6:85:6
Topology    : 2 sockets, 24 cores / 48 threads each socket
Memory      : 192G
Storage     : SSD, XFS filesystem

Max EFFICiency, BASE frequency and available turbo levels (MHz):

    EFFIC   1000 |**********
    BASE    2400 |************************
    24C     3100 |*******************************
    20C     3300 |*********************************
    16C     3600 |************************************
    12C     3600 |************************************
    8C      3600 |************************************
    4C      3700 |*************************************
    2C      3900 |***************************************

Tested kernels:

Baseline      : v5.2,              intel_pstate passive,  schedutil
Comparison #1 : v5.2,              intel_pstate active ,  powersave+HWP
Comparison #2 : v5.2, this patch,  intel_pstate passive,  schedutil

2. NEUTRAL BENCHMARKS
---------------------

* pgbench read/write
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf

3. NON-NEUTRAL BENCHMARKS
-------------------------

comparison ratio with baseline; 1.00 means neutral, higher is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    pgbench read-only     1.10             ~
    tbench                1.82          1.14

comparison ratio with baseline; 1.00 means neutral, lower is better:

                      I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench                   ~          0.97
    kernbench             0.88          0.78
    gitsource[*]             ~          0.46

[*] "gitsource" consists in running git's unit tests
tilde (~) means 1.00, ie result identical to baseline

Performance per watt:

performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:

		      I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench                0.92          0.91
    tbench                1.26          1.04
    kernbench             0.95          0.96
    gitsource             1.03          1.30

Similarly to earlier Xeons, measurable performance gains over non-invariant
schedutil are observed on dbench, tbench, kernel compilation and running the
git unit tests suite. Looking at the detailed tables show that the patch
scores the largest difference when the machine is lightly loaded. Power
efficiency suffers lightly on kernbench and a bit more on dbench, but largely
improves on gitsource (which also runs considerably faster). For reference, we
also report results using active intel_pstate with powersave and HWP; the
largest gap between non-invariant schedutil and intel_pstate+powersave is
still tbench, which runs 82% better and with 26% improved efficiency on the
latter configuration -- this divide isn't closed yet by frequency-invariant
schedutil.

4. DETAILED TABLES
------------------

Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                     5.2.0 vanilla (BASELINE)            5.2.0 intel_pstate/HWP                    5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean   1         183.56  +- 0.21% (        )       516.12  +- 0.57% ( 181.18%)       185.59  +- 0.59% (   1.11%)
Hmean   2         365.75  +- 0.25% (        )      1015.14  +- 0.33% ( 177.55%)       402.59  +- 4.48% (  10.07%)
Hmean   4         720.99  +- 0.44% (        )      1951.75  +- 0.28% ( 170.70%)       738.39  +- 1.72% (   2.41%)
Hmean   8        1449.93  +- 0.34% (        )      3830.56  +- 0.24% ( 164.19%)      1750.36  +- 4.65% (  20.72%)
Hmean   16       2874.26  +- 0.57% (        )      7381.62  +- 0.53% ( 156.82%)      4348.35  +- 2.22% (  51.29%)
Hmean   32       6116.17  +- 5.10% (        )     13013.05  +- 0.08% ( 112.76%)      8980.35  +- 0.66% (  46.83%)
Hmean   64      14485.04  +- 3.46% (        )     17835.12  +- 0.35% (  23.13%)     16540.73  +- 0.51% (  14.19%)
Hmean   128     30779.16  +- 3.20% (        )     32796.94  +- 2.13% (   6.56%)     31512.58  +- 0.20% (   2.38%)
Hmean   256     34664.66  +- 0.81% (        )     34604.67  +- 0.46% (  -0.17%)     34943.70  +- 0.25% (   0.80%)
Hmean   384     33957.51  +- 0.11% (        )     34091.50  +- 0.14% (   0.39%)     33921.41  +- 0.09% (  -0.11%)

Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                    5.2.0 vanilla (BASELINE)             5.2.0 intel_pstate/HWP                     5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean   2        332.94  +- 0.40% (        )        260.16  +- 0.45% (  21.86%)        233.56  +- 0.21% (  29.85%)
Amean   4        173.04  +- 0.43% (        )        138.76  +- 0.03% (  19.81%)        123.59  +- 0.11% (  28.58%)
Amean   8         89.65  +- 0.20% (        )         73.54  +- 0.09% (  17.97%)         65.69  +- 0.10% (  26.72%)
Amean   16        48.08  +- 1.41% (        )         41.64  +- 1.61% (  13.40%)         36.00  +- 1.80% (  25.11%)
Amean   32        28.78  +- 0.72% (        )         26.61  +- 1.99% (   7.55%)         23.19  +- 1.68% (  19.43%)
Amean   64        20.46  +- 1.85% (        )         19.76  +- 0.35% (   3.42%)         17.38  +- 0.92% (  15.06%)
Amean   128       18.69  +- 1.70% (        )         17.59  +- 1.04% (   5.90%)         15.73  +- 1.40% (  15.85%)
Amean   192       18.82  +- 1.01% (        )         17.76  +- 0.77% (   5.67%)         15.57  +- 1.80% (  17.28%)

Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                 5.2.0 vanilla (BASELINE)           5.2.0 intel_pstate/HWP                    5.2.0 freq-inv
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         792.49  +- 0.20% (        )      779.35  +- 0.24% (   1.66%)      427.14  +- 0.16% (   46.10%)

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-3-ggherdovich@suse.cz
2020-01-28 21:37:01 +01:00
Giovanni Gherdovich
1567c3e346 x86, sched: Add support for frequency invariance
Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.

The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on-par with, the powersave governor
from the intel_pstate driver/framework.

Large performance gains are obtained when the machine is lightly loaded and
no regression are observed at saturation. The benchmarks with the largest
gains are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.

1. FREQUENCY INVARIANCE: MOTIVATION
   * Without it, a task looks larger if the CPU runs slower

2. PECULIARITIES OF X86
   * freq invariance accounting requires knowing the ratio freq_curr/freq_max
   2.1 CURRENT FREQUENCY
       * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
   2.2 MAX FREQUENCY
       * It varies with time (turbo). As an approximation, we set it to a
         constant, i.e. 4-cores turbo frequency.

3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
   * The invariant schedutil's formula has no feedback loop and reacts faster
     to utilization changes

4. KNOWN LIMITATIONS
   * In some cases tasks can't reach max util despite how hard they try

5. PERFORMANCE TESTING
   5.1 MACHINES
       * Skylake, Broadwell, Haswell
   5.2 SETUP
       * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
         active cores turbo w/ invariant schedutil, and intel_pstate/powersave
   5.3 BENCHMARK RESULTS
       5.3.1 NEUTRAL BENCHMARKS
             * NAS Parallel Benchmark (HPC), hackbench
       5.3.2 NON-NEUTRAL BENCHMARKS
             * tbench (10-30% better), kernbench (10-15% better),
               shell-intensive-scripts (30-50% better)
             * no regressions
       5.3.3 SELECTION OF DETAILED RESULTS
       5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
             * dbench (5% worse on one machine), kernbench (3% worse),
               tbench (5-10% better), shell-intensive-scripts (10-40% better)

6. MICROARCH'ES ADDRESSED HERE
   * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum
     etc have different MSRs semantic for querying turbo levels)

7. REFERENCES
   * MMTests performance testing framework, github.com/gormanm/mmtests

 +-------------------------------------------------------------------------+
 | 1. FREQUENCY INVARIANCE: MOTIVATION
 +-------------------------------------------------------------------------+

For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*]. In a nutshell, without frequency scale-invariance tasks look
larger just because the CPU is running slower.

[*] (footnote: this assumes a linear frequency/performance relation; which
everybody knows to be false, but given realities its the best approximation
we can make.)

 +-------------------------------------------------------------------------+
 | 2. PECULIARITIES OF X86
 +-------------------------------------------------------------------------+

Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.

2.1 CURRENT FREQUENCY
====================

Since modern x86 has hardware control over the actual frequency we run
at (because amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.

Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, don't
do so every time we need the value, but amortize the cost by doing it
every tick.

2.2 MAX FREQUENCY
=================

Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum frequency
available to a given core changes with time.

The approach taken in this change is to arbitrarily set freq_max to a constant
value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
microarchitectures, after evaluating the following candidates:

    * 1-core (1C) turbo frequency (the fastest turbo state available)
    * around base frequency (a.k.a. max P-state)
    * something in between, such as 4C turbo

To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is), viceversa with a smaller denominator PELT
signals will tend to overshoot. Given that PELT drives frequency selection
in the schedutil governor, we will have:

    freq_max set to     | effect on DVFS
    --------------------+------------------
    1C turbo            | power efficiency (lower freq choices)
    base freq           | performance (higher util_avg, higher freq requests)
    4C turbo            | a bit of both

4C turbo proves to be a good compromise in a number of benchmarks (see below).

 +-------------------------------------------------------------------------+
 | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
 +-------------------------------------------------------------------------+

Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula from

    freq_next = 1.25 * freq_curr * util            [non-invariant util signal]

to

    freq_next = 1.25 * freq_max * util             [invariant util signal]

where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.

Compare it to the update formula of intel_pstate/powersave:

    freq_next = 1.25 * freq_max * Busy%

where again freq_max is 1C turbo and Busy% is the percentage of time not spent
idling (calculated with delta_MPERF / delta_TSC); essentially the same as
invariant schedutil, and largely responsible for intel_pstate/powersave good
reputation. The non-invariant schedutil formula is derived from the invariant
one by approximating util_inv with util_raw * freq_curr / freq_max, but this
has limitations.

Testing shows improved performances due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.

 +-------------------------------------------------------------------------+
 | 4. KNOWN LIMITATIONS
 +-------------------------------------------------------------------------+

It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).

If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those task
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.

While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweights the cost of this limitation.

 [Lelli-2018]
 https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/

 +-------------------------------------------------------------------------+
 | 5. PERFORMANCE TESTING
 +-------------------------------------------------------------------------+

5.1 MACHINES
============

We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.

* 8x-SKYLAKE-UMA:
  Single socket E3-1240 v5, Skylake 4 cores/8 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC    800 |********
    BASE    3500 |***********************************
    4C      3700 |*************************************
    3C      3800 |**************************************
    2C      3900 |***************************************
    1C      3900 |***************************************

* 80x-BROADWELL-NUMA:
  Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2200 |**********************
    8C      2900 |*****************************
    7C      3000 |******************************
    6C      3100 |*******************************
    5C      3200 |********************************
    4C      3300 |*********************************
    3C      3400 |**********************************
    2C      3600 |************************************
    1C      3600 |************************************

* 48x-HASWELL-NUMA
  Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2300 |***********************
    12C     2600 |**************************
    11C     2600 |**************************
    10C     2600 |**************************
    9C      2600 |**************************
    8C      2600 |**************************
    7C      2600 |**************************
    6C      2600 |**************************
    5C      2700 |***************************
    4C      2800 |****************************
    3C      2900 |*****************************
    2C      3100 |*******************************
    1C      3100 |*******************************

5.2 SETUP
=========

* The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
  driver in passive mode.
* The rationale for choosing the various freq_max values to test have been to
  try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
  on all machines), plus one more value closer to base_freq but still in the
  turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
* In addition we've run all tests with intel_pstate/powersave for comparison.
* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
  with active intel_pstate on this machine use that.

This gives, in terms of combinations tested on each machine:

* 8x-SKYLAKE-UMA
  * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
  * intel_pstate active + powersave + HWP
  * invariant schedutil, freq_max = 1C turbo
  * invariant schedutil, freq_max = 3C turbo
  * invariant schedutil, freq_max = 4C turbo

* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
  * [same as 8x-SKYLAKE-UMA, but no HWP capable]
  * invariant schedutil, freq_max = 8C turbo
    (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")

5.3 BENCHMARK RESULTS
=====================

5.3.1 NEUTRAL BENCHMARKS
------------------------

Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:

* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
  computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)

5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------

What follow are summary tables where each benchmark result is given a score.

* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
  means a score of 1.00.
* The results in the score ratio are the geometric means of results running
  the benchmark with different parameters (eg: for kernbench: using 1, 2, 4,
  ... number of processes; for pgbench: varying the number of clients, and so
  on).
* The first three tables show higher-is-better kind of tests (i.e. measured in
  operations/second), the subsequent three show lower-is-better kind of tests
  (i.e. the workload is fixed and we measure elapsed time, think kernbench).
* "gitsource" is a name we made up for the test consisting in running the
  entire unit tests suite of the Git SCM and measuring how long it takes. We
  take it as a typical example of shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
  columns show invariant schedutil for different values of freq_max. 4C turbo
  is circled as it's the value we've chosen for the final implementation.

80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
pgbench-ro           1.14   ~      ~     | 1.11 |  1.14
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.06   ~      1.06  | 1.05 |  1.07
netperf-tcp          ~      1.03   ~     | 1.01 |  1.02
tbench4              1.57   1.18   1.22  | 1.30 |  1.56
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
pgbench-ro           ~      ~      ~     | ~    |
pgbench-rw           ~      ~      ~     | ~    |
netperf-udp          ~      ~      ~     | ~    |
netperf-tcp          ~      ~      ~     | ~    |
tbench4              1.30   1.14   1.14  | 1.16 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  12C
pgbench-ro           1.15   ~      ~     | 1.06 |  1.16
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.05   0.97   1.04  | 1.04 |  1.02
netperf-tcp          0.96   1.01   1.01  | 1.01 |  1.01
tbench4              1.50   1.05   1.13  | 1.13 |  1.25
                                         +------+

In the table above we see that active intel_pstate is slightly better than our
4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
read-only pgbench and much better on tbench. Both cases are notable in which
it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil to get closer.

If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.

80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              1.23   0.95   0.95  | 0.95 |  0.95
kernbench            0.93   0.83   0.83  | 0.83 |  0.82
gitsource            0.98   0.49   0.49  | 0.49 |  0.48
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
dbench4              ~      ~      ~     | ~    |
kernbench            ~      ~      ~     | ~    |
gitsource            0.92   0.55   0.55  | 0.55 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              ~      ~      ~     | ~    |  ~
kernbench            0.94   0.90   0.89  | 0.90 |  0.90
gitsource            0.97   0.69   0.69  | 0.69 |  0.69
                                         +------+

dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
non-invariant schedutil. We repeated that run getting consistent results. Out
of scope for the patch at hand, but deserving future investigation. Other than
that, we previously ran this campaign with Linux v5.0 and saw the patch doing
better on dbench a the time. We haven't checked closely and can only speculate
at this point.

On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
the detailed tables that the gains concentrate on low process counts (lightly
loaded machines).

The test we call "gitsource" (running the git unit test suite, a long-running
single-threaded shell script) appears rather spectacular in this table (gains
of 30-50% depending on the machine). It is to be noted, however, that
gitsource has no adjustable parameters (such as the number of jobs in
kernbench, which we average over in order to get a single-number summary
score) and is exactly the kind of low-parallelism workload that benefits the
most from this patch. When looking at the detailed tables of kernbench or
tbench4, at low process or client counts one can see similar numbers.

5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------

Machine            : 48x-HASWELL-NUMA
Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        126.73  +- 0.31% (        )      315.91  +- 0.66% ( 149.28%)      125.03  +- 0.76% (  -1.34%)
Hmean  2        258.04  +- 0.62% (        )      614.16  +- 0.51% ( 138.01%)      269.58  +- 1.45% (   4.47%)
Hmean  4        514.30  +- 0.67% (        )     1146.58  +- 0.54% ( 122.94%)      533.84  +- 1.99% (   3.80%)
Hmean  8       1111.38  +- 2.52% (        )     2159.78  +- 0.38% (  94.33%)     1359.92  +- 1.56% (  22.36%)
Hmean  16      2286.47  +- 1.36% (        )     3338.29  +- 0.21% (  46.00%)     2720.20  +- 0.52% (  18.97%)
Hmean  32      4704.84  +- 0.35% (        )     4759.03  +- 0.43% (   1.15%)     4774.48  +- 0.30% (   1.48%)
Hmean  64      7578.04  +- 0.27% (        )     7533.70  +- 0.43% (  -0.59%)     7462.17  +- 0.65% (  -1.53%)
Hmean  128     6998.52  +- 0.16% (        )     6987.59  +- 0.12% (  -0.16%)     6909.17  +- 0.14% (  -1.28%)
Hmean  192     6901.35  +- 0.25% (        )     6913.16  +- 0.10% (   0.17%)     6855.47  +- 0.21% (  -0.66%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                  5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        128.43  +- 0.28% (   1.34%)      130.64  +- 3.81% (   3.09%)      153.71  +- 5.89% (  21.30%)
Hmean  2        311.70  +- 6.15% (  20.79%)      281.66  +- 3.40% (   9.15%)      305.08  +- 5.70% (  18.23%)
Hmean  4        641.98  +- 2.32% (  24.83%)      623.88  +- 5.28% (  21.31%)      906.84  +- 4.65% (  76.32%)
Hmean  8       1633.31  +- 1.56% (  46.96%)     1714.16  +- 0.93% (  54.24%)     2095.74  +- 0.47% (  88.57%)
Hmean  16      3047.24  +- 0.42% (  33.27%)     3155.02  +- 0.30% (  37.99%)     3634.58  +- 0.15% (  58.96%)
Hmean  32      4734.31  +- 0.60% (   0.63%)     4804.38  +- 0.23% (   2.12%)     4674.62  +- 0.27% (  -0.64%)
Hmean  64      7699.74  +- 0.35% (   1.61%)     7499.72  +- 0.34% (  -1.03%)     7659.03  +- 0.25% (   1.07%)
Hmean  128     6935.18  +- 0.15% (  -0.91%)     6942.54  +- 0.10% (  -0.80%)     7004.85  +- 0.12% (   0.09%)
Hmean  192     6901.62  +- 0.12% (   0.00%)     6856.93  +- 0.10% (  -0.64%)     6978.74  +- 0.10% (   1.12%)

This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
visible up to 16 clients and the saturated scenario is the same as baseline.

The scores in the summary table from the previous sections are ratios of
geometric means of the results over different clients, as seen in this table.

Machine            : 80x-BROADWELL-NUMA
Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        379.68  +- 0.06% (        )      330.20  +- 0.43% (  13.03%)      285.93  +- 0.07% (  24.69%)
Amean  4        200.15  +- 0.24% (        )      175.89  +- 0.22% (  12.12%)      153.78  +- 0.25% (  23.17%)
Amean  8        106.20  +- 0.31% (        )       95.54  +- 0.23% (  10.03%)       86.74  +- 0.10% (  18.32%)
Amean  16        56.96  +- 1.31% (        )       53.25  +- 1.22% (   6.50%)       48.34  +- 1.73% (  15.13%)
Amean  32        34.80  +- 2.46% (        )       33.81  +- 0.77% (   2.83%)       30.28  +- 1.59% (  12.99%)
Amean  64        26.11  +- 1.63% (        )       25.04  +- 1.07% (   4.10%)       22.41  +- 2.37% (  14.16%)
Amean  128       24.80  +- 1.36% (        )       23.57  +- 1.23% (   4.93%)       21.44  +- 1.37% (  13.55%)
Amean  160       24.85  +- 0.56% (        )       23.85  +- 1.17% (   4.06%)       21.25  +- 1.12% (  14.49%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                   5.2.0 8C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        284.08  +- 0.13% (  25.18%)      283.96  +- 0.51% (  25.21%)      285.05  +- 0.21% (  24.92%)
Amean  4        153.18  +- 0.22% (  23.47%)      154.70  +- 1.64% (  22.71%)      153.64  +- 0.30% (  23.24%)
Amean  8         87.06  +- 0.28% (  18.02%)       86.77  +- 0.46% (  18.29%)       86.78  +- 0.22% (  18.28%)
Amean  16        48.03  +- 0.93% (  15.68%)       47.75  +- 1.99% (  16.17%)       47.52  +- 1.61% (  16.57%)
Amean  32        30.23  +- 1.20% (  13.14%)       30.08  +- 1.67% (  13.57%)       30.07  +- 1.67% (  13.60%)
Amean  64        22.59  +- 2.02% (  13.50%)       22.63  +- 0.81% (  13.32%)       22.42  +- 0.76% (  14.12%)
Amean  128       21.37  +- 0.67% (  13.82%)       21.31  +- 1.15% (  14.07%)       21.17  +- 1.93% (  14.63%)
Amean  160       21.68  +- 0.57% (  12.76%)       21.18  +- 1.74% (  14.77%)       21.22  +- 1.00% (  14.61%)

The patch outperform active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.

Machine            : 8x-SKYLAKE-UMA
Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                            5.2.0 vanilla           5.2.0 intel_pstate/hwp         5.2.0 1C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         858.85  +- 1.16% (        )      791.94  +- 0.21% (   7.79%)      474.95 (  44.70%)

                           5.2.0 3C-turbo                   5.2.0 4C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         475.26  +- 0.20% (  44.66%)      474.34  +- 0.13% (  44.77%)

In this test, which is of interest as representing shell-intensive
(i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.

5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------

The following table shows average power consumption in watt for each
benchmark. Data comes from turbostat (package average), which in turn is read
from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
it's reasonable to ignore other power consumers (such as memory or I/O). Also,
we don't have a power meter available in the lab so RAPL is the best we have.

turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. with don't
have detail on a per-parameter granularity, only on whole benchmarks).

80x-BROADWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |      8C
pgbench-ro       130.01   142.77   131.11   132.45  | 134.65 |  136.84
pgbench-rw        68.30    60.83    71.45    71.70  |  71.65 |   72.54
dbench4           90.25    59.06   101.43    99.89  | 101.10 |  102.94
netperf-udp       65.70    69.81    66.02    68.03  |  68.27 |   68.95
netperf-tcp       88.08    87.96    88.97    88.89  |  88.85 |   88.20
tbench4          142.32   176.73   153.02   163.91  | 165.58 |  176.07
kernbench         92.94   101.95   114.91   115.47  | 115.52 |  115.10
gitsource         40.92    41.87    75.14    75.20  |  75.40 |   75.70
                                                    +--------+
8x-SKYLAKE-UMA (power consumption, watts)
                                                    +--------+
              BASELINE I_PSTATE/HWP    1C       3C  |     4C |
pgbench-ro        46.49    46.68    46.56    46.59  |  46.52 |
pgbench-rw        29.34    31.38    30.98    31.00  |  31.00 |
dbench4           27.28    27.37    27.49    27.41  |  27.38 |
netperf-udp       22.33    22.41    22.36    22.35  |  22.36 |
netperf-tcp       27.29    27.29    27.30    27.31  |  27.33 |
tbench4           41.13    45.61    43.10    43.33  |  43.56 |
kernbench         42.56    42.63    43.01    43.01  |  43.01 |
gitsource         13.32    13.69    17.33    17.30  |  17.35 |
                                                    +--------+
48x-HASWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |     12C
pgbench-ro       128.84   136.04   129.87   132.43  | 132.30 |  134.86
pgbench-rw        37.68    37.92    37.17    37.74  |  37.73 |   37.31
dbench4           28.56    28.73    28.60    28.73  |  28.70 |   28.79
netperf-udp       56.70    60.44    56.79    57.42  |  57.54 |   57.52
netperf-tcp       75.49    75.27    75.87    76.02  |  76.01 |   75.95
tbench4          115.44   139.51   119.53   123.07  | 123.97 |  130.22
kernbench         83.23    91.55    95.58    95.69  |  95.72 |   96.04
gitsource         36.79    36.99    39.99    40.34  |  40.35 |   40.23
                                                    +--------+

A lower power consumption isn't necessarily better, it depends on what is done
with that energy. Here are tables with the ratio of performance-per-watt on
each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).

80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |    8C
pgbench-ro       1.04   1.06   0.94  | 1.07 |  1.08
pgbench-rw       1.10   0.97   0.96  | 0.96 |  0.97
dbench4          1.24   0.94   0.95  | 0.94 |  0.92
netperf-udp      ~      1.02   1.02  | ~    |  1.02
netperf-tcp      ~      1.02   ~     | ~    |  1.02
tbench4          1.26   1.10   1.06  | 1.12 |  1.26
kernbench        0.98   0.97   0.97  | 0.97 |  0.98
gitsource        ~      1.11   1.11  | 1.11 |  1.13
                                     +------+

8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                     +------+
         I_PSTATE/HWP     1C     3C  |   4C |
pgbench-ro       ~      ~      ~     | ~    |
pgbench-rw       0.95   0.97   0.96  | 0.96 |
dbench4          ~      ~      ~     | ~    |
netperf-udp      ~      ~      ~     | ~    |
netperf-tcp      ~      ~      ~     | ~    |
tbench4          1.17   1.09   1.08  | 1.10 |
kernbench        ~      ~      ~     | ~    |
gitsource        1.06   1.40   1.40  | 1.40 |
                                     +------+

48x-HASWELL-NUMA  (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |   12C
pgbench-ro       1.09   ~      1.09  | 1.03 |  1.11
pgbench-rw       ~      0.86   ~     | ~    |  0.86
dbench4          ~      1.02   1.02  | 1.02 |  ~
netperf-udp      ~      0.97   1.03  | 1.02 |  ~
netperf-tcp      0.96   ~      ~     | ~    |  ~
tbench4          1.24   ~      1.06  | 1.05 |  1.11
kernbench        0.97   0.97   0.98  | 0.97 |  0.96
gitsource        1.03   1.33   1.32  | 1.32 |  1.33
                                     +------+

These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench and
dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
percentage points (it has a 10-15% performance improvement, but apparently the
increase in power consumption is larger than that). tbench4 and gitsource, which
benefit the most from the patch, keep a positive score in this table which is
a welcome surprise; that suggests that in those particular workloads the
non-invariant schedutil (and active intel_pstate, too) makes some rather
suboptimal frequency selections.

+-------------------------------------------------------------------------+
| 6. MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+

The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
respectively. This excludes the recent Xeon Scalable Performance processors
line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.

Subsequent patches will address:

* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont

+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+

Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmark used are:

    db-pgbench-timed-ro-small-xfs
    db-pgbench-timed-rw-small-xfs
    io-dbench4-async-xfs
    network-netperf-unbound
    network-tbench
    scheduler-unbound
    workload-kerndevel-xfs
    workload-shellscripts-xfs
    hpc-nas-c-class-mpi-full-xfs
    hpc-nas-c-class-omp-full

All those benchmarks are generally available on the web:

pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Doug Smythies <dsmythies@telus.net>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz
2020-01-28 21:36:59 +01:00
Linus Torvalds
c5f12fdb8b Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Thomas Gleixner:

 - Cleanup the apic IPI implementation by removing duplicated code and
   consolidating the functions into the APIC core.

 - Implement a safe variant of the IPI broadcast mode. Contrary to
   earlier attempts this uses the core tracking of which CPUs have been
   brought online at least once so that a broadcast does not end up in
   some dead end in BIOS/SMM code when the CPU is still waiting for
   init. Once all CPUs have been brought up once, IPI broadcasting is
   enabled. Before that regular one by one IPIs are issued.

 - Drop the paravirt CR8 related functions as they have no user anymore

 - Initialize the APIC TPR to block interrupt 16-31 as they are reserved
   for CPU exceptions and should never be raised by any well behaving
   device.

 - Emit a warning when vector space exhaustion breaks the admin set
   affinity of an interrupt.

 - Make sure to use the NMI fallback when shutdown via reboot vector IPI
   fails. The original code had conditions which prevent the code path
   to be reached.

 - Annotate various APIC config variables as RO after init.

[ The ipi broadcase change came in earlier through the cpu hotplug
  branch, but I left the explanation in the commit message since it was
  shared between the two different branches    - Linus ]

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  x86/apic/vector: Warn when vector space exhaustion breaks affinity
  x86/apic: Annotate global config variables as "read-only after init"
  x86/apic/x2apic: Implement IPI shorthands support
  x86/apic/flat64: Remove the IPI shorthand decision logic
  x86/apic: Share common IPI helpers
  x86/apic: Remove the shorthand decision logic
  x86/smp: Enhance native_send_call_func_ipi()
  x86/smp: Move smp_function_call implementations into IPI code
  x86/apic: Provide and use helper for send_IPI_allbutself()
  x86/apic: Add static key to Control IPI shorthands
  x86/apic: Move no_ipi_broadcast() out of 32bit
  x86/apic: Add NMI_VECTOR wait to IPI shorthand
  x86/apic: Remove dest argument from __default_send_IPI_shortcut()
  x86/hotplug: Silence APIC and NMI when CPU is dead
  x86/cpu: Move arch_smt_update() to a neutral place
  x86/apic/uv: Make x2apic_extra_bits static
  x86/apic: Consolidate the apic local headers
  x86/apic: Move apic_flat_64 header into apic directory
  x86/apic: Move ipi header into apic directory
  x86/apic: Cleanup the include maze
  ...
2019-09-17 12:04:39 -07:00
Thomas Gleixner
60dcaad573 x86/hotplug: Silence APIC and NMI when CPU is dead
In order to support IPI/NMI broadcasting via the shorthand mechanism side
effects of shorthands need to be mitigated:

 Shorthand IPIs and NMIs hit all CPUs including unplugged CPUs

Neither of those can be handled on unplugged CPUs for obvious reasons.

It would be trivial to just fully disable the APIC via the enable bit in
MSR_APICBASE. But that's not possible because clearing that bit on systems
based on the 3 wire APIC bus would require a hardware reset to bring it
back as the APIC would lose track of bus arbitration. On systems with FSB
delivery APICBASE could be disabled, but it has to be guaranteed that no
interrupt is sent to the APIC while in that state and it's not clear from
the SDM whether it still responds to INIT/SIPI messages.

Therefore stay on the safe side and switch the APIC into soft disabled mode
so it won't deliver any regular vector to the CPU.

NMIs are still propagated to the 'dead' CPUs. To mitigate that add a check
for the CPU being offline on early nmi entry and if so bail.

Note, this cannot use the stop/restart_nmi() magic which is used in the
alternatives code. A dead CPU cannot invoke nmi_enter() or anything else
due to RCU and other reasons.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907241723290.1791@nanos.tec.linutronix.de
2019-07-25 16:11:59 +02:00
Pingfan Liu
6973210242 x86/realmode: Remove trampoline_status
There is no reader of trampoline_status, it's only written.

It turns out that after commit ce4b1b1650 ("x86/smpboot: Initialize
secondary CPU only if master CPU will wait for it"), trampoline_status is
not needed any more.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1563266424-3472-1-git-send-email-kernelfans@gmail.com
2019-07-22 11:30:18 +02:00
Zhenzhong Duan
090d54bcbc Revert "x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized"
This reverts commit ca5d376e17.

Commit 8990cac6e5 ("x86/jump_label: Initialize static branching
early") adds jump_label_init() call in setup_arch() to make static
keys initialized early, so we could use the original simpler code
again.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:57 +02:00
Thomas Gleixner
7652ac9201 x86/asm: Move native_write_cr0/4() out of line
The pinning of sensitive CR0 and CR4 bits caused a boot crash when loading
the kvm_intel module on a kernel compiled with CONFIG_PARAVIRT=n.

The reason is that the static key which controls the pinning is marked RO
after init. The kvm_intel module contains a CR4 write which requires to
update the static key entry list. That obviously does not work when the key
is in a RO section.

With CONFIG_PARAVIRT enabled this does not happen because the CR4 write
uses the paravirt indirection and the actual write function is built in.

As the key is intended to be immutable after init, move
native_write_cr0/4() out of line.

While at it consolidate the update of the cr4 shadow variable and store the
value right away when the pinning is initialized on a booting CPU. No point
in reading it back 20 instructions later. This allows to confine the static
key and the pinning variable to cpu/common and allows to mark them static.

Fixes: 8dbec27a24 ("x86/asm: Pin sensitive CR0 bits")
Fixes: 873d50d58f ("x86/asm: Pin sensitive CR4 bits")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Xi Ruoyao <xry111@mengyan1223.wang>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Xi Ruoyao <xry111@mengyan1223.wang>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907102140340.1758@nanos.tec.linutronix.de
2019-07-10 22:15:05 +02:00
Linus Torvalds
222a21d295 Merge branch 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 topology updates from Ingo Molnar:
 "Implement multi-die topology support on Intel CPUs and expose the die
  topology to user-space tooling, by Len Brown, Kan Liang and Zhang Rui.

  These changes should have no effect on the kernel's existing
  understanding of topologies, i.e. there should be no behavioral impact
  on cache, NUMA, scheduler, perf and other topologies and overall
  system performance"

* 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support
  perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg support
  hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages
  thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages
  perf/x86/intel/cstate: Support multi-die/package
  perf/x86/intel/rapl: Support multi-die/package
  perf/x86/intel/uncore: Support multi-die/package
  topology: Create core_cpus and die_cpus sysfs attributes
  topology: Create package_cpus sysfs attribute
  hwmon/coretemp: Support multi-die/package
  powercap/intel_rapl: Update RAPL domain name and debug messages
  thermal/x86_pkg_temp_thermal: Support multi-die/package
  powercap/intel_rapl: Support multi-die/package
  powercap/intel_rapl: Simplify rapl_find_package()
  x86/topology: Define topology_logical_die_id()
  x86/topology: Define topology_die_id()
  cpu/topology: Export die_id
  x86/topology: Create topology_max_die_per_package()
  x86/topology: Add CPUID.1F multi-die/package support
2019-07-08 18:28:44 -07:00
Kees Cook
873d50d58f x86/asm: Pin sensitive CR4 bits
Several recent exploits have used direct calls to the native_write_cr4()
function to disable SMEP and SMAP before then continuing their exploits
using userspace memory access.

Direct calls of this form can be mitigate by pinning bits of CR4 so that
they cannot be changed through a common function. This is not intended to
be a general ROP protection (which would require CFI to defend against
properly), but rather a way to avoid trivial direct function calling (or
CFI bypasses via a matching function prototype) as seen in:

https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
(https://github.com/xairy/kernel-exploits/tree/master/CVE-2017-7308)

The goals of this change:

 - Pin specific bits (SMEP, SMAP, and UMIP) when writing CR4.

 - Avoid setting the bits too early (they must become pinned only after
   CPU feature detection and selection has finished).

 - Pinning mask needs to be read-only during normal runtime.

 - Pinning needs to be checked after write to validate the cr4 state

Using __ro_after_init on the mask is done so it can't be first disabled
with a malicious write.

Since these bits are global state (once established by the boot CPU and
kernel boot parameters), they are safe to write to secondary CPUs before
those CPUs have finished feature detection. As such, the bits are set at
the first cr4 write, so that cr4 write bugs can be detected (instead of
silently papered over). This uses a few bytes less storage of a location we
don't have: read-only per-CPU data.

A check is performed after the register write because an attack could just
skip directly to the register write. Such a direct jump is possible because
of how this function may be built by the compiler (especially due to the
removal of frame pointers) where it doesn't add a stack frame (function
exit may only be a retq without pops) which is sufficient for trivial
exploitation like in the timer overwrites mentioned above).

The asm argument constraints gain the "+" modifier to convince the compiler
that it shouldn't make ordering assumptions about the arguments or memory,
and treat them as changed.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: kernel-hardening@lists.openwall.com
Link: https://lkml.kernel.org/r/20190618045503.39105-3-keescook@chromium.org
2019-06-22 11:55:22 +02:00
Thomas Gleixner
9ff554e9be treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 82
Based on 1 normalized pattern(s):

  this code is released under the gnu general public license version 2
  or later

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520075211.232210963@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:37:52 +02:00
Len Brown
2e4c54dac7 topology: Create core_cpus and die_cpus sysfs attributes
Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list"

These attributes represent all of the logical CPUs that share the
same core.

These attriutes is synonymous with the existing "thread_siblings" and
"thread_siblings_list" attribute, which will be deprecated.

Create CPU topology sysfs attributes: "die_cpus" and "die_cpus_list".
These attributes represent all of the logical CPUs that share the
same die.

Suggested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/071c23a298cd27ede6ed0b6460cae190d193364f.1557769318.git.len.brown@intel.com
2019-05-23 10:08:34 +02:00
Len Brown
212bf4fdb7 x86/topology: Define topology_logical_die_id()
Define topology_logical_die_id() ala existing topology_logical_package_id()

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/2f3526e25ae14fbeff26fb26e877d159df8946d9.1557769318.git.len.brown@intel.com
2019-05-23 10:08:32 +02:00
Len Brown
7745f03eb3 x86/topology: Add CPUID.1F multi-die/package support
Some new systems have multiple software-visible die within each package.

Update Linux parsing of the Intel CPUID "Extended Topology Leaf" to handle
either CPUID.B, or the new CPUID.1F.

Add cpuinfo_x86.die_id and cpuinfo_x86.max_dies to store the result.

die_id will be non-zero only for multi-die/package systems.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-doc@vger.kernel.org
Link: https://lkml.kernel.org/r/7b23d2d26d717b8e14ba137c94b70943f1ae4b5c.1557769318.git.len.brown@intel.com
2019-05-23 10:08:30 +02:00
Linus Torvalds
948a64995a Merge branch 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 topology updates from Ingo Molnar:
 "Two main changes: preparatory changes for Intel multi-die topology
  support, plus a syslog message tweak"

* 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/topology: Make DEBUG_HOTPLUG_CPU0 pr_info() more descriptive
  x86/smpboot: Rename match_die() to match_pkg()
  topology: Simplify cputopology.txt formatting and wording
  x86/topology: Fix documentation typo
2019-05-06 16:33:06 -07:00
Len Brown
169d086996 x86/smpboot: Rename match_die() to match_pkg()
Syntax only, no functional or semantic change.

This routine matches packages, not die, so name it thus.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/7ca18c4ae7816a1f9eda37414725df676e63589d.1551160674.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19 10:56:05 +02:00
Thomas Gleixner
66c7ceb47f x86/irq/32: Handle irq stack allocation failure proper
irq_ctx_init() crashes hard on page allocation failures. While that's ok
during early boot, it's just wrong in the CPU hotplug bringup code.

Check the page allocation failure and return -ENOMEM and handle it at the
call sites. On early boot the only way out is to BUG(), but on CPU hotplug
there is no reason to crash, so just abort the operation.

Rename the function to something more sensible while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Shaokun Zhang <zhangshaokun@hisilicon.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: x86-ml <x86@kernel.org>
Cc: xen-devel@lists.xenproject.org
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Yi Wang <wang.yi59@zte.com.cn>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: https://lkml.kernel.org/r/20190414160146.089060584@linutronix.de
2019-04-17 15:31:42 +02:00
Linus Torvalds
bcd49c3dd1 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "Various cleanups and simplifications, none of them really stands out,
  they are all over the place"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/uaccess: Remove unused __addr_ok() macro
  x86/smpboot: Remove unused phys_id variable
  x86/mm/dump_pagetables: Remove the unused prev_pud variable
  x86/fpu: Move init_xstate_size() to __init section
  x86/cpu_entry_area: Move percpu_setup_debug_store() to __init section
  x86/mtrr: Remove unused variable
  x86/boot/compressed/64: Explain paging_prepare()'s return value
  x86/resctrl: Remove duplicate MSR_MISC_FEATURE_CONTROL definition
  x86/asm/suspend: Drop ENTRY from local data
  x86/hw_breakpoints, kprobes: Remove kprobes ifdeffery
  x86/boot: Save several bytes in decompressor
  x86/trap: Remove useless declaration
  x86/mm/tlb: Remove unused cpu variable
  x86/events: Mark expected switch-case fall-throughs
  x86/asm-prototypes: Remove duplicate include <asm/page.h>
  x86/kernel: Mark expected switch-case fall-throughs
  x86/insn-eval: Mark expected switch-case fall-through
  x86/platform/UV: Replace kmalloc() and memset() with k[cz]alloc() calls
  x86/e820: Replace kmalloc() + memcpy() with kmemdup()
2019-03-07 16:36:57 -08:00
Anshuman Khandual
98fa15f34c mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code.  Please let me know if this
might have missed some instances.  This might also have replaced some
false positives.  I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where invalid node number is
encoded as -1.  Even though implicitly understood it is always better to
have macros in there.  Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE.  This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Shaokun Zhang
f91fecc09e x86/smpboot: Remove unused phys_id variable
The 'phys_id' local variable became unused after commit

  ce4b1b1650 ("x86/smpboot: Initialize secondary CPU only if master CPU will wait for it").

Remove it.

Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: https://lkml.kernel.org/r/1550495101-41755-1-git-send-email-zhangshaokun@hisilicon.com
2019-02-18 17:09:24 +01:00
Hui Wang
aa02ef099c x86/topology: Use total_cpus for max logical packages calculation
nr_cpu_ids can be limited on the command line via nr_cpus=. This can break the
logical package management because it results in a smaller number of packages
while in kdump kernel.

Check below case:
There is a two sockets system, each socket has 8 cores, which has 16 logical
cpus while HT was turn on.

 0  1  2  3  4  5  6  7     |    16 17 18 19 20 21 22 23
 cores on socket 0               threads on socket 0
 8  9 10 11 12 13 14 15     |    24 25 26 27 28 29 30 31
 cores on socket 1               threads on socket 1

While starting the kdump kernel with command line option nr_cpus=16 panic
was triggered on one of the cpus 24-31 eg. 26, then online cpu will be
1-15, 26(cpu 0 was disabled in kdump), ncpus will be 16 and
__max_logical_packages will be 1, but actually two packages were booted on.

This issue can reproduced by set kdump option nr_cpus=<real physical core
numbers>, and then trigger panic on last socket's thread, for example:

taskset -c 26 echo c > /proc/sysrq-trigger

Use total_cpus which will not be limited by nr_cpus command line to calculate
the value of __max_logical_packages.

Signed-off-by: Hui Wang <john.wanghui@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <guijianfeng@huawei.com>
Cc: <wencongyang2@huawei.com>
Cc: <douliyang1@huawei.com>
Cc: <qiaonuohan@huawei.com>
Link: https://lkml.kernel.org/r/20181107023643.22174-1-john.wanghui@huawei.com
2018-12-18 13:38:37 +01:00
Mike Rapoport
57c8a661d9 mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
  Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Pu Wen
0b13bec787 x86/smpboot: Do not use BSP INIT delay and MWAIT to idle on Dhyana
The Hygon Dhyana CPU uses no delay in smp_quirk_init_udelay(), and does
HLT on idle just like AMD does.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: bp@alien8.de
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: x86@kernel.org
Cc: thomas.lendacky@amd.com
Link: https://lkml.kernel.org/r/87000fa82e273f5967c908448414228faf61e077.1537533369.git.puwen@hygon.cn
2018-09-27 18:28:57 +02:00
Thomas Gleixner
f2701b77bb Merge 4.18-rc7 into master to pick up the KVM dependcy
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05 16:39:29 +02:00
Nicolai Stange
447ae31667 x86: Don't include linux/irq.h from asm/hardirq.h
The next patch in this series will have to make the definition of
irq_cpustat_t available to entering_irq().

Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
dependencies like

  asm/smp.h
    asm/apic.h
      asm/hardirq.h
        linux/irq.h
          linux/topology.h
            linux/smp.h
              asm/smp.h

or

  linux/gfp.h
    linux/mmzone.h
      asm/mmzone.h
        asm/mmzone_64.h
          asm/smp.h
            asm/apic.h
              asm/hardirq.h
                linux/irq.h
                  linux/irqdesc.h
                    linux/kobject.h
                      linux/sysfs.h
                        linux/kernfs.h
                          linux/idr.h
                            linux/gfp.h

and others.

This causes compilation errors because of the header guards becoming
effective in the second inclusion: symbols/macros that had been defined
before wouldn't be available to intermediate headers in the #include chain
anymore.

A possible workaround would be to move the definition of irq_cpustat_t
into its own header and include that from both, asm/hardirq.h and
asm/apic.h.

However, this wouldn't solve the real problem, namely asm/harirq.h
unnecessarily pulling in all the linux/irq.h cruft: nothing in
asm/hardirq.h itself requires it. Also, note that there are some other
archs, like e.g. arm64, which don't have that #include in their
asm/hardirq.h.

Remove the linux/irq.h #include from x86' asm/hardirq.h.

Fix resulting compilation errors by adding appropriate #includes to *.c
files as needed.

Note that some of these *.c files could be cleaned up a bit wrt. to their
set of #includes, but that should better be done from separate patches, if
at all.

Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-08-05 09:53:13 +02:00
Zhenzhong Duan
4fb5f58e8d x86/mm/32: Initialize the CR4 shadow before __flush_tlb_all()
On 32-bit kernels, __flush_tlb_all() may have read the CR4 shadow before the
initialization of CR4 shadow in cpu_init().

Fix it by adding an explicit cr4_init_shadow() call into start_secondary()
which is the first function called on non-boot SMP CPUs - ahead of the
__flush_tlb_all() call.

( This is somewhat of a layering violation, but start_secondary() does
  CR4 bootstrap in the PCID case anyway. )

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/b07b6ae9-4b57-4b40-b9bc-50c2c67f1d91@default
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-03 09:26:10 +02:00
Thomas Gleixner
f048c399e0 x86/topology: Provide topology_smt_supported()
Provide information whether SMT is supoorted by the CPUs. Preparatory patch
for SMT control mechanism.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 14:20:57 +02:00
Thomas Gleixner
6a4d2657e0 x86/smp: Provide topology_is_primary_thread()
If the CPU is supporting SMT then the primary thread can be found by
checking the lower APIC ID bits for zero. smp_num_siblings is used to build
the mask for the APIC ID bits which need to be taken into account.

This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
than the initial APIC ID in CPUID. But according to AMD the lower bits have
to be consistent. Intel gave a tentative confirmation as well.

Preparatory patch to support disabling SMT at boot/runtime.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 14:20:57 +02:00
Linus Torvalds
5cef8c2a22 Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:

 - Centaur CPU updates (David Wang)

 - AMD and other CPU topology enumeration improvements and fixes
   (Borislav Petkov, Thomas Gleixner, Suravee Suthikulpanit)

 - Continued 5-level paging work (Kirill A. Shutemov)

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Mark __pgtable_l5_enabled __initdata
  x86/mm: Mark p4d_offset() __always_inline
  x86/mm: Introduce the 'no5lvl' kernel parameter
  x86/mm: Stop pretending pgtable_l5_enabled is a variable
  x86/mm: Unify pgtable_l5_enabled usage in early boot code
  x86/boot/compressed/64: Fix trampoline page table address calculation
  x86/CPU: Move x86_cpuinfo::x86_max_cores assignment to detect_num_cpu_cores()
  x86/Centaur: Report correct CPU/cache topology
  x86/CPU: Move cpu_detect_cache_sizes() into init_intel_cacheinfo()
  x86/CPU: Make intel_num_cpu_cores() generic
  x86/CPU: Move cpu local function declarations to local header
  x86/CPU/AMD: Derive CPU topology from CPUID function 0xB when available
  x86/CPU: Modify detect_extended_topology() to return result
  x86/CPU/AMD: Calculate last level cache ID from number of sharing threads
  x86/CPU: Rename intel_cacheinfo.c to cacheinfo.c
  perf/events/amd/uncore: Fix amd_uncore_llc ID to use pre-defined cpu_llc_id
  x86/CPU/AMD: Have smp_num_siblings and cpu_llc_id always be present
  x86/Centaur: Initialize supported CPU features properly
2018-06-04 18:19:18 -07:00
Ingo Molnar
177bfd725b Merge branches 'x86/urgent' and 'core/urgent' into x86/boot, to pick up fixes and avoid conflicts
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-19 08:18:56 +02:00
Thomas Gleixner
1f50ddb4f4 x86/speculation: Handle HT correctly on AMD
The AMD64_LS_CFG MSR is a per core MSR on Family 17H CPUs. That means when
hyperthreading is enabled the SSBD bit toggle needs to take both cores into
account. Otherwise the following situation can happen:

CPU0		CPU1

disable SSB
		disable SSB
		enable  SSB <- Enables it for the Core, i.e. for CPU0 as well

So after the SSB enable on CPU1 the task on CPU0 runs with SSB enabled
again.

On Intel the SSBD control is per core as well, but the synchronization
logic is implemented behind the per thread SPEC_CTRL MSR. It works like
this:

  CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL

i.e. if one of the threads enables a mitigation then this affects both and
the mitigation is only disabled in the core when both threads disabled it.

Add the necessary synchronization logic for AMD family 17H. Unfortunately
that requires a spinlock to serialize the access to the MSR, but the locks
are only shared between siblings.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-05-17 17:09:18 +02:00
Borislav Petkov
f8b64d08dd x86/CPU/AMD: Have smp_num_siblings and cpu_llc_id always be present
Move smp_num_siblings and cpu_llc_id to cpu/common.c so that they're
always present as symbols and not only in the CONFIG_SMP case. Then,
other code using them doesn't need ugly ifdeffery anymore. Get rid of
some ifdeffery.

Signed-off-by: Borislav Petkov <bpetkov@suse.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1524864877-111962-2-git-send-email-suravee.suthikulpanit@amd.com
2018-05-06 12:49:14 +02:00
Yazen Ghannam
da6fa7ef67 x86/smpboot: Don't use mwait_play_dead() on AMD systems
Recent AMD systems support using MWAIT for C1 state. However, MWAIT will
not allow deeper cstates than C1 on current systems.

play_dead() expects to use the deepest state available.  The deepest state
available on AMD systems is reached through SystemIO or HALT. If MWAIT is
available, it is preferred over the other methods, so the CPU never reaches
the deepest possible state.

Don't try to use MWAIT to play_dead() on AMD systems. Instead, use CPUIDLE
to enter the deepest state advertised by firmware. If CPUIDLE is not
available then fallback to HALT.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Link: https://lkml.kernel.org/r/20180403140228.58540-1-Yazen.Ghannam@amd.com
2018-04-26 16:06:19 +02:00
Alison Schofield
1340ccfa9a x86,sched: Allow topologies where NUMA nodes share an LLC
Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
into two "slices", each containing half the cores, half the LLC, and one
memory controller and each slice is enumerated to Linux as a NUMA
node. This is similar to how the cores and LLC were arranged for the
Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This means
that the portion of the LLC *available* to a CPU depends on the data being
accessed:

    Remote socket: entire package LLC is shared
    Local socket->local slice: data goes into local slice LLC
    Local socket->remote slice: data goes into remote-slice LLC. Slightly
                    		higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One of
the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates the LLC
as being shared by the entire package. This is not 100% precise because the
entire cache is not usable by all accesses. But, it *is* the way the
hardware enumerates itself, and this is not likely to change.

The userspace visible impact of all the above is that the sysfs info
reports the entire LLC as being available to the entire package. As noted
above, this is not true for local socket accesses. This patch does not
correct the sysfs info. It is the same, pre and post patch.

The current code emits the following warning:

 sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

The warning is coming from the topology_sane() check in smpboot.c because
the topology is not matching the expectations of the model for obvious
reasons.

To fix this, add a vendor and model specific check to never call
topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
the "coregroup" sched_domain_topology_level and use NUMA information from
the SRAT alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package level,
which are also NUMA boundaries.

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: brice.goglin@gmail.com
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lkml.kernel.org/r/20180407002130.GA18984@alison-desk.jf.intel.com
2018-04-17 15:39:55 +02:00
Samuel Neves
4596749339 x86/topology: Update the 'cpu cores' field in /proc/cpuinfo correctly across CPU hotplug operations
Without this fix, /proc/cpuinfo will display an incorrect amount
of CPU cores, after bringing them offline and online again, as
exemplified below:

  $ cat /proc/cpuinfo | grep cores
  cpu cores	: 4
  cpu cores	: 8
  cpu cores	: 8
  cpu cores	: 20
  cpu cores	: 4
  cpu cores	: 3
  cpu cores	: 2
  cpu cores	: 2

This patch fixes this by always zeroing the booted_cores variable
upon turning off a logical CPU.

Tested-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jgross@suse.com
Cc: luto@kernel.org
Cc: prarit@redhat.com
Cc: vkuznets@redhat.com
Link: http://lkml.kernel.org/r/20180221205036.5244-1-sneves@dei.uc.pt
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-23 08:47:47 +01:00
Prarit Bhargava
63e708f826 x86/xen: Calculate __max_logical_packages on PV domains
The kernel panics on PV domains because native_smp_cpus_done() is
only called for HVM domains.

Calculate __max_logical_packages for PV domains.

Fixes: b4c0a7326f ("x86/smpboot: Fix __max_logical_packages estimate")
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Tested-and-reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: xen-devel@lists.xenproject.org
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2018-02-17 09:40:45 +01:00
Masayoshi Mizuma
295cc7eb31 x86/smpboot: Fix uncore_pci_remove() indexing bug when hot-removing a physical CPU
When a physical CPU is hot-removed, the following warning messages
are shown while the uncore device is removed in uncore_pci_remove():

  WARNING: CPU: 120 PID: 5 at arch/x86/events/intel/uncore.c:988
  uncore_pci_remove+0xf1/0x110
  ...
  CPU: 120 PID: 5 Comm: kworker/u1024:0 Not tainted 4.15.0-rc8 #1
  Workqueue: kacpi_hotplug acpi_hotplug_work_fn
  ...
  Call Trace:
  pci_device_remove+0x36/0xb0
  device_release_driver_internal+0x145/0x210
  pci_stop_bus_device+0x76/0xa0
  pci_stop_root_bus+0x44/0x60
  acpi_pci_root_remove+0x1f/0x80
  acpi_bus_trim+0x54/0x90
  acpi_bus_trim+0x2e/0x90
  acpi_device_hotplug+0x2bc/0x4b0
  acpi_hotplug_work_fn+0x1a/0x30
  process_one_work+0x141/0x340
  worker_thread+0x47/0x3e0
  kthread+0xf5/0x130

When uncore_pci_remove() runs, it tries to get the package ID to
clear the value of uncore_extra_pci_dev[].dev[] by using
topology_phys_to_logical_pkg(). The warning messesages are
shown because topology_phys_to_logical_pkg() returns -1.

  arch/x86/events/intel/uncore.c:
  static void uncore_pci_remove(struct pci_dev *pdev)
  {
  ...
          phys_id = uncore_pcibus_to_physid(pdev->bus);
  ...
                  pkg = topology_phys_to_logical_pkg(phys_id); // returns -1
                  for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
                          if (uncore_extra_pci_dev[pkg].dev[i] == pdev) {
                                  uncore_extra_pci_dev[pkg].dev[i] = NULL;
                                  break;
                          }
                  }
                  WARN_ON_ONCE(i >= UNCORE_EXTRA_PCI_DEV_MAX); // <=========== HERE!!

topology_phys_to_logical_pkg() tries to find
cpuinfo_x86->phys_proc_id that matches the phys_pkg argument.

  arch/x86/kernel/smpboot.c:
  int topology_phys_to_logical_pkg(unsigned int phys_pkg)
  {
          int cpu;

          for_each_possible_cpu(cpu) {
                  struct cpuinfo_x86 *c = &cpu_data(cpu);

                  if (c->initialized && c->phys_proc_id == phys_pkg)
                          return c->logical_proc_id;
          }
          return -1;
  }

However, the phys_proc_id was already set to 0 by remove_siblinginfo()
when the CPU was offlined.

So, topology_phys_to_logical_pkg() cannot find the correct
logical_proc_id and always returns -1.

As the result, uncore_pci_remove() calls WARN_ON_ONCE() and the warning
messages are shown.

What is worse is that the bogus 'pkg' index results in two bugs:

 - We dereference uncore_extra_pci_dev[] with a negative index
 - We fail to clean up a stale pointer in uncore_extra_pci_dev[][]

To fix these bugs, remove the clearing of ->phys_proc_id from remove_siblinginfo().

This should not cause any problems, because ->phys_proc_id is not
used after it is hot-removed and it is re-set while hot-adding.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: yasu.isimatu@gmail.com
Cc: <stable@vger.kernel.org>
Fixes: 30bb981185 ("x86/topology: Avoid wasting 128k for package id array")
Link: http://lkml.kernel.org/r/ed738d54-0f01-b38b-b794-c31dc118c207@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13 12:47:28 +01:00
Linus Torvalds
3ccabd6d9d Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "Misc cleanups"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Remove unused IOMMU_STRESS Kconfig
  x86/extable: Mark exception handler functions visible
  x86/timer: Don't inline __const_udelay
  x86/headers: Remove duplicate #includes
2018-01-30 13:01:09 -08:00
Jan Kiszka
e348caef8b x86/platform: Control warm reset setup via legacy feature flag
Allow to turn off the setup of BIOS-managed warm reset via a new flag in
x86_legacy_features. Besides the UV1, the upcoming jailhose guest support
needs this switched off.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: jailhouse-dev@googlegroups.com
Link: https://lkml.kernel.org/r/44376558129d70a2c1527959811371ef4b82e829.1511770314.git.jan.kiszka@siemens.com
2018-01-14 21:11:53 +01:00
Linus Torvalds
52c90f2d32 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 page table isolation fixes from Thomas Gleixner:
 "Four patches addressing the PTI fallout as discussed and debugged
  yesterday:

   - Remove stale and pointless TLB flush invocations from the hotplug
     code

   - Remove stale preempt_disable/enable from __native_flush_tlb()

   - Plug the memory leak in the write_ldt() error path"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ldt: Make LDT pgtable free conditional
  x86/ldt: Plug memory leak in error path
  x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()
  x86/smpboot: Remove stale TLB flush invocations
2017-12-31 13:03:05 -08:00
Thomas Gleixner
322f8b8b34 x86/smpboot: Remove stale TLB flush invocations
smpboot_setup_warm_reset_vector() and smpboot_restore_warm_reset_vector()
invoke local_flush_tlb() for no obvious reason.

Digging in history revealed that the original code in the 2.1 era added
those because the code manipulated a swapper_pg_dir pagetable entry. The
pagetable manipulation was removed long ago in the 2.3 timeframe, but the
TLB flush invocations stayed around forever.

Remove them along with the pointless pr_debug()s which come from the same 2.1
change.

Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171230211829.586548655@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-31 12:12:51 +01:00
Linus Torvalds
caf9a82657 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PTI preparatory patches from Thomas Gleixner:
 "Todays Advent calendar window contains twentyfour easy to digest
  patches. The original plan was to have twenty three matching the date,
  but a late fixup made that moot.

   - Move the cpu_entry_area mapping out of the fixmap into a separate
     address space. That's necessary because the fixmap becomes too big
     with NRCPUS=8192 and this caused already subtle and hard to
     diagnose failures.

     The top most patch is fresh from today and cures a brain slip of
     that tall grumpy german greybeard, who ignored the intricacies of
     32bit wraparounds.

   - Limit the number of CPUs on 32bit to 64. That's insane big already,
     but at least it's small enough to prevent address space issues with
     the cpu_entry_area map, which have been observed and debugged with
     the fixmap code

   - A few TLB flush fixes in various places plus documentation which of
     the TLB functions should be used for what.

   - Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
     more than sysenter now and keeping the name makes backtraces
     confusing.

   - Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
     which is only invoked on fork().

   - Make vysycall more robust.

   - A few fixes and cleanups of the debug_pagetables code. Check
     PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
     C89 initialization of the address hint array which already was out
     of sync with the index enums.

   - Move the ESPFIX init to a different place to prepare for PTI.

   - Several code moves with no functional change to make PTI
     integration simpler and header files less convoluted.

   - Documentation fixes and clarifications"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
  init: Invoke init_espfix_bsp() from mm_init()
  x86/cpu_entry_area: Move it out of the fixmap
  x86/cpu_entry_area: Move it to a separate unit
  x86/mm: Create asm/invpcid.h
  x86/mm: Put MMU to hardware ASID translation in one place
  x86/mm: Remove hard-coded ASID limit checks
  x86/mm: Move the CR3 construction functions to tlbflush.h
  x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
  x86/mm: Remove superfluous barriers
  x86/mm: Use __flush_tlb_one() for kernel memory
  x86/microcode: Dont abuse the TLB-flush interface
  x86/uv: Use the right TLB-flush API
  x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
  x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
  x86/mm/64: Improve the memory map documentation
  x86/ldt: Prevent LDT inheritance on exec
  x86/ldt: Rework locking
  arch, mm: Allow arch_dup_mmap() to fail
  x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
  ...
2017-12-23 11:53:04 -08:00
Thomas Gleixner
613e396bc0 init: Invoke init_espfix_bsp() from mm_init()
init_espfix_bsp() needs to be invoked before the page table isolation
initialization. Move it into mm_init() which is the place where pti_init()
will be added.

While at it get rid of the #ifdeffery and provide proper stub functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-22 20:13:05 +01:00
Ingo Molnar
0fd2e9c53d Merge commit 'upstream-x86-entry' into WIP.x86/mm
Pull in a minimal set of v4.15 entry code changes, for a base for the MM isolation patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-17 12:58:53 +01:00
Pravin Shedge
81bf665d00 x86/headers: Remove duplicate #includes
These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.

Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: ard.biesheuvel@linaro.org
Cc: boris.ostrovsky@oracle.com
Cc: geert@linux-m68k.org
Cc: jgross@suse.com
Cc: linux-efi@vger.kernel.org
Cc: luto@kernel.org
Cc: matt@codeblueprint.co.uk
Cc: thomas.lendacky@amd.com
Cc: tim.c.chen@linux.intel.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1513024951-9221-1-git-send-email-pravin.shedge4linux@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-12 11:32:24 +01:00
Prarit Bhargava
947134d9b0 x86/smpboot: Do not use smp_num_siblings in __max_logical_packages calculation
Documentation/x86/topology.txt defines smp_num_siblings as "The number of
threads in a core".  Since commit bbb65d2d36 ("x86: use cpuid vector 0xb
when available for detecting cpu topology") smp_num_siblings is the
maximum number of threads in a core.  If Simultaneous MultiThreading
(SMT) is disabled on a system, smp_num_siblings is 2 and not 1 as
expected.

Use topology_max_smt_threads(), which contains the active numer of threads,
in the __max_logical_packages calculation.

On a single socket, single core, single thread system __max_smt_threads has
not been updated when the __max_logical_packages calculation happens, so its
zero which makes the package estimate fail. Initialize it to one, which is
the minimum number of threads on a core.

[ tglx: Folded the __max_smt_threads fix in ]

Fixes: b4c0a7326f ("x86/smpboot: Fix __max_logical_packages estimate")
Reported-by: Jakub Kicinski <kubakici@wp.pl>
Signed-off-by: Prarit Bhargava <prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jakub Kicinski <kubakici@wp.pl>
Cc: netdev@vger.kernel.org
Cc: "netdev@vger.kernel.org"
Cc: Clark Williams <williams@redhat.com>
Link: https://lkml.kernel.org/r/20171204164521.17870-1-prarit@redhat.com
2017-12-07 10:28:22 +01:00
Chunyu Hu
55d2d0ad2f x86/idt: Load idt early in start_secondary
On a secondary, idt is first loaded in cpu_init() with load_current_idt(),
i.e. no exceptions can be handled before that point.

The conversion of WARN() to use UD requires the IDT being loaded earlier as
any warning between start_secondary() and load_curren_idt() in cpu_init()
will result in an unhandled @UD exception and therefore fail the bringup of
the CPU.

Install the IDT handlers right in start_secondary() before calling cpu_init().

[ tglx: Massaged changelog ]

Fixes: 9a93848fe7 ("x86/debug: Implement __WARN() using UD0")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: peterz@infradead.org
Cc: bp@alien8.de
Cc: rostedt@goodmis.org
Cc: luto@kernel.org
Link: https://lkml.kernel.org/r/1511792499-4073-1-git-send-email-chuhu@redhat.com
2017-11-28 08:15:40 +01:00
Prarit Bhargava
b4c0a7326f x86/smpboot: Fix __max_logical_packages estimate
A system booted with a small number of cores enabled per package
panics because the estimate of __max_logical_packages is too low.

This occurs when the total number of active cores across all packages is
less than the maximum core count for a single package. e.g.:

  On a 4 package system with 20 cores/package where only 4 cores are
  enabled on each package, the value of __max_logical_packages is
  calculated as DIV_ROUND_UP(16 / 20) = 1 and not 4.

Calculate __max_logical_packages after the cpu enumeration has completed.
Use the boot cpu's data to extrapolate the number of packages.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: He Chen <he.chen@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Piotr Luc <piotr.luc@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20171114124257.22013-4-prarit@redhat.com
2017-11-17 16:22:31 +01:00
Andi Kleen
30bb981185 x86/topology: Avoid wasting 128k for package id array
Analyzing large early boot allocations unveiled the logical package id
storage as a prominent memory waste. Since commit 1f12e32f4c
("x86/topology: Create logical package id") every 64-bit system allocates a
128k array to convert logical package ids.

This happens because the array is sized for MAX_LOCAL_APIC which is always
32k on 64bit systems, and it needs 4 bytes for each entry.

This is fairly wasteful, especially for the common case of having only one
socket, which uses exactly 4 byte out of 128K.

There is no user of the package id map which is performance critical, so
the lookup is not required to be O(1). Store the logical processor id in
cpu_data and use a loop based lookup.

To keep the mapping stable accross cpu hotplug operations, add a flag to
cpu_data which is set when the CPU is brought up the first time. When the
flag is set, then cpu_data is not reinitialized by copying boot_cpu_data on
subsequent bringups.

[ tglx: Rename the flag to 'initialized', use proper pointers instead of
  	repeated cpu_data(x) evaluation and massage changelog. ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: He Chen <he.chen@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Piotr Luc <piotr.luc@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20171114124257.22013-3-prarit@redhat.com
2017-11-17 16:22:30 +01:00
Linus Torvalds
b18d62891a Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
 "This update provides a major overhaul of the APIC initialization and
  vector allocation code:

   - Unification of the APIC and interrupt mode setup which was
     scattered all over the place and was hard to follow. This also
     distangles the timer setup from the APIC initialization which
     brings a clear separation of functionality.

     Great detective work from Dou Lyiang!

   - Refactoring of the x86 vector allocation mechanism. The existing
     code was based on nested loops and rather convoluted APIC callbacks
     which had a horrible worst case behaviour and tried to serve all
     different use cases in one go. This led to quite odd hacks when
     supporting the new managed interupt facility for multiqueue devices
     and made it more or less impossible to deal with the vector space
     exhaustion which was a major roadblock for server hibernation.

     Aside of that the code dealing with cpu hotplug and the system
     vectors was disconnected from the actual vector management and
     allocation code, which made it hard to follow and maintain.

     Utilizing the new bitmap matrix allocator core mechanism, the new
     allocator and management code consolidates the handling of system
     vectors, legacy vectors, cpu hotplug mechanisms and the actual
     allocation which needs to be aware of system and legacy vectors and
     hotplug constraints into a single consistent entity.

     This has one visible change: The support for multi CPU targets of
     interrupts, which is only available on a certain subset of
     CPUs/APIC variants has been removed in favour of single interrupt
     targets. A proper analysis of the multi CPU target feature revealed
     that there is no real advantage as the vast majority of interrupts
     end up on the CPU with the lowest APIC id in the set of target CPUs
     anyway. That change was agreed on by the relevant folks and allowed
     to simplify the implementation significantly and to replace rather
     fragile constructs like the vector cleanup IPI with straight
     forward and solid code.

     Furthermore this allowed to cleanly separate the allocation details
     for legacy, normal and managed interrupts:

      * Legacy interrupts are not longer wasting 16 vectors
        unconditionally

      * Managed interrupts have now a guaranteed vector reservation, but
        the actual vector assignment happens when the interrupt is
        requested. It's guaranteed not to fail.

      * Normal interrupts no longer allocate vectors unconditionally
        when the interrupt is set up (IO/APIC init or MSI(X) enable).
        The mechanism has been switched to a best effort reservation
        mode. The actual allocation happens when the interrupt is
        requested. Contrary to managed interrupts the request can fail
        due to vector space exhaustion, but drivers must handle a fail
        of request_irq() anyway. When the interrupt is freed, the vector
        is handed back as well.

        This solves a long standing problem with large unconditional
        vector allocations for a certain class of enterprise devices
        which prevented server hibernation due to vector space
        exhaustion when the unused allocated vectors had to be migrated
        to CPU0 while unplugging all non boot CPUs.

     The code has been equipped with trace points and detailed debugfs
     information to aid analysis of the vector space"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  x86/vector/msi: Select CONFIG_GENERIC_IRQ_RESERVATION_MODE
  PCI/MSI: Set MSI_FLAG_MUST_REACTIVATE in core code
  genirq: Add config option for reservation mode
  x86/vector: Use correct per cpu variable in free_moved_vector()
  x86/apic/vector: Ignore set_affinity call for inactive interrupts
  x86/apic: Fix spelling mistake: "symmectic" -> "symmetric"
  x86/apic: Use dead_cpu instead of current CPU when cleaning up
  ACPI/init: Invoke early ACPI initialization earlier
  x86/vector: Respect affinity mask in irq descriptor
  x86/irq: Simplify hotplug vector accounting
  x86/vector: Switch IOAPIC to global reservation mode
  x86/vector/msi: Switch to global reservation mode
  x86/vector: Handle managed interrupts proper
  x86/io_apic: Reevaluate vector configuration on activate()
  iommu/amd: Reevaluate vector configuration on activate()
  iommu/vt-d: Reevaluate vector configuration on activate()
  x86/apic/msi: Force reactivation of interrupts at startup time
  x86/vector: Untangle internal state from irq_cfg
  x86/vector: Compile SMP only code conditionally
  x86/apic: Remove unused callbacks
  ...
2017-11-13 18:29:23 -08:00
Linus Torvalds
6a9f70b0a5 Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
 "Three smaller changes:

   - clang fix

   - boot message beautification

   - unnecessary header inclusion removal"

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Disable Clang warnings about GNU extensions
  x86/boot: Remove unnecessary #include <generated/utsrelease.h>
  x86/boot: Spell out "boot CPU" for BP
2017-11-13 16:32:30 -08:00
Linus Torvalds
d6ec9d9a4d Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Ingo Molnar:
 "Note that in this cycle most of the x86 topics interacted at a level
  that caused them to be merged into tip:x86/asm - but this should be a
  temporary phenomenon, hopefully we'll back to the usual patterns in
  the next merge window.

  The main changes in this cycle were:

  Hardware enablement:

   - Add support for the Intel UMIP (User Mode Instruction Prevention)
     CPU feature. This is a security feature that disables certain
     instructions such as SGDT, SLDT, SIDT, SMSW and STR. (Ricardo Neri)

     [ Note that this is disabled by default for now, there are some
       smaller enhancements in the pipeline that I'll follow up with in
       the next 1-2 days, which allows this to be enabled by default.]

   - Add support for the AMD SEV (Secure Encrypted Virtualization) CPU
     feature, on top of SME (Secure Memory Encryption) support that was
     added in v4.14. (Tom Lendacky, Brijesh Singh)

   - Enable new SSE/AVX/AVX512 CPU features: AVX512_VBMI2, GFNI, VAES,
     VPCLMULQDQ, AVX512_VNNI, AVX512_BITALG. (Gayatri Kammela)

  Other changes:

   - A big series of entry code simplifications and enhancements (Andy
     Lutomirski)

   - Make the ORC unwinder default on x86 and various objtool
     enhancements. (Josh Poimboeuf)

   - 5-level paging enhancements (Kirill A. Shutemov)

   - Micro-optimize the entry code a bit (Borislav Petkov)

   - Improve the handling of interdependent CPU features in the early
     FPU init code (Andi Kleen)

   - Build system enhancements (Changbin Du, Masahiro Yamada)

   - ... plus misc enhancements, fixes and cleanups"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (118 commits)
  x86/build: Make the boot image generation less verbose
  selftests/x86: Add tests for the STR and SLDT instructions
  selftests/x86: Add tests for User-Mode Instruction Prevention
  x86/traps: Fix up general protection faults caused by UMIP
  x86/umip: Enable User-Mode Instruction Prevention at runtime
  x86/umip: Force a page fault when unable to copy emulated result to user
  x86/umip: Add emulation code for UMIP instructions
  x86/cpufeature: Add User-Mode Instruction Prevention definitions
  x86/insn-eval: Add support to resolve 16-bit address encodings
  x86/insn-eval: Handle 32-bit address encodings in virtual-8086 mode
  x86/insn-eval: Add wrapper function for 32 and 64-bit addresses
  x86/insn-eval: Add support to resolve 32-bit address encodings
  x86/insn-eval: Compute linear address in several utility functions
  resource: Fix resource_size.cocci warnings
  X86/KVM: Clear encryption attribute when SEV is active
  X86/KVM: Decrypt shared per-cpu variables when SEV is active
  percpu: Introduce DEFINE_PER_CPU_DECRYPTED
  x86: Add support for changing memory encryption attribute in early boot
  x86/io: Unroll string I/O when SEV is active
  x86/boot: Add early boot support when running with SEV active
  ...
2017-11-13 14:13:48 -08:00
Linus Torvalds
8e9a2dba86 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle are:

   - Another attempt at enabling cross-release lockdep dependency
     tracking (automatically part of CONFIG_PROVE_LOCKING=y), this time
     with better performance and fewer false positives. (Byungchul Park)

   - Introduce lockdep_assert_irqs_enabled()/disabled() and convert
     open-coded equivalents to lockdep variants. (Frederic Weisbecker)

   - Add down_read_killable() and use it in the VFS's iterate_dir()
     method. (Kirill Tkhai)

   - Convert remaining uses of ACCESS_ONCE() to
     READ_ONCE()/WRITE_ONCE(). Most of the conversion was Coccinelle
     driven. (Mark Rutland, Paul E. McKenney)

   - Get rid of lockless_dereference(), by strengthening Alpha atomics,
     strengthening READ_ONCE() with smp_read_barrier_depends() and thus
     being able to convert users of lockless_dereference() to
     READ_ONCE(). (Will Deacon)

   - Various micro-optimizations:

        - better PV qspinlocks (Waiman Long),
        - better x86 barriers (Michael S. Tsirkin)
        - better x86 refcounts (Kees Cook)

   - ... plus other fixes and enhancements. (Borislav Petkov, Juergen
     Gross, Miguel Bernal Marin)"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE
  rcu: Use lockdep to assert IRQs are disabled/enabled
  netpoll: Use lockdep to assert IRQs are disabled/enabled
  timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled
  sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled
  irq_work: Use lockdep to assert IRQs are disabled/enabled
  irq/timings: Use lockdep to assert IRQs are disabled/enabled
  perf/core: Use lockdep to assert IRQs are disabled/enabled
  x86: Use lockdep to assert IRQs are disabled/enabled
  smp/core: Use lockdep to assert IRQs are disabled/enabled
  timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled
  timers/nohz: Use lockdep to assert IRQs are disabled/enabled
  workqueue: Use lockdep to assert IRQs are disabled/enabled
  irq/softirqs: Use lockdep to assert IRQs are disabled/enabled
  locking/lockdep: Add IRQs disabled/enabled assertion APIs: lockdep_assert_irqs_enabled()/disabled()
  locking/pvqspinlock: Implement hybrid PV queued/unfair locks
  locking/rwlocks: Fix comments
  x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized
  block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()
  workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes
  ...
2017-11-13 12:38:26 -08:00
Frederic Weisbecker
7a10e2a919 x86: Use lockdep to assert IRQs are disabled/enabled
Use lockdep to check that IRQs are enabled or disabled as expected. This
way the sanity check only shows overhead when concurrency correctness
debug code is enabled.

It also makes no more sense to fix the IRQ flags when a bug is detected
as the assertion is now pure config-dependent debugging. And to quote
Peter Zijlstra:

	The whole if !disabled, disable logic is uber paranoid programming,
	but I don't think we've ever seen that WARN trigger, and if it does
	(and then burns the kernel) we at least know what happend.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David S . Miller <davem@davemloft.net>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1509980490-4285-8-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-08 11:13:50 +01:00
Pavel Tatashin
76ce7cfe35 x86/smpboot: Make optimization of delay calibration work correctly
If the TSC has constant frequency then the delay calibration can be skipped
when it has been calibrated for a package already. This is checked in
calibrate_delay_is_known(), but that function is buggy in two aspects:

It returns 'false' if

  (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC)

which is obviously the reverse of the intended check and the check for the
sibling mask cannot work either because the topology links have not been
set up yet.

Correct the condition and move the call to set_cpu_sibling_map() before
invoking calibrate_delay() so the sibling check works correctly.

[ tglx: Rewrote changelong ]

Fixes: c25323c073 ("x86/tsc: Use topology functions")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: bob.picco@oracle.com
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20171028001100.26603-1-pasha.tatashin@oracle.com
2017-11-07 16:04:54 +01:00
Andy Lutomirski
cd493a6deb x86/entry/32: Fix cpu_current_top_of_stack initialization at boot
cpu_current_top_of_stack's initialization forgot about
TOP_OF_KERNEL_STACK_PADDING.  This bug didn't matter because the
idle threads never enter user mode.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e5e370a7e6e4fddd1c4e4cf619765d96bb874b21.1509609304.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-02 11:04:47 +01:00
Dou Liyang
ca5d376e17 x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized
Commit:

  9043442b43 ("locking/paravirt: Use new static key for controlling call of virt_spin_lock()")

sets the static virt_spin_lock_key to a value before jump_label_init()
has been called, which will result in a WARN().

Reorder the initialization sequence:

 - Move the native_pv_lock_init() into native_smp_prepare_cpus()
 - set the value in xen_init_lock_cpu()

to avoid calling into the not yet initialized static keys subsystem.

Suggested-by: Juergen Gross <jgross@suse.com>
Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: boris.ostrovsky@oracle.com
Cc: bp@suse.de
Cc: luto@kernel.org
Cc: vkuznets@redhat.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1509170804-3813-1-git-send-email-douly.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-30 09:17:29 +01:00
Juergen Gross
9043442b43 locking/paravirt: Use new static key for controlling call of virt_spin_lock()
There are cases where a guest tries to switch spinlocks to bare metal
behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
has the downside of falling back to unfair test and set scheme for
qspinlocks due to virt_spin_lock() detecting the virtualized
environment.

Add a static key controlling whether virt_spin_lock() should be
called or not. When running on bare metal set the new key to false.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akataria@vmware.com
Cc: boris.ostrovsky@oracle.com
Cc: chrisw@sous-sol.org
Cc: hpa@zytor.com
Cc: jeremy@goop.org
Cc: rusty@rustcorp.com.au
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20170906173625.18158-2-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10 11:50:12 +02:00
Jean Delvare
a1652bb8a0 x86/boot: Spell out "boot CPU" for BP
It's not obvious to everybody that BP stands for boot processor. At
least it was not for me. And BP is also a CPU register on x86, so it
is ambiguous. Spell out "boot CPU" everywhere instead.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-03 18:41:23 +02:00
Thomas Gleixner
2cffad7bad x86/irq: Simplify hotplug vector accounting
Before a CPU is taken offline the number of active interrupt vectors on the
outgoing CPU and the number of vectors which are available on the other
online CPUs are counted and compared. If the active vectors are more than
the available vectors on the other CPUs then the CPU hot-unplug operation
is aborted. This again uses loop based search and is inaccurate.

The bitmap matrix allocator has accurate accounting information and can
tell exactly whether the vector space is sufficient or not.

Emit a message when the number of globaly reserved (unallocated) vectors is
larger than the number of available vectors after offlining a CPU because
after that point request_irq() might fail.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Yu Chen <yu.c.chen@intel.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rui Zhang <rui.zhang@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Len Brown <lenb@kernel.org>
Link: https://lkml.kernel.org/r/20170913213156.351193962@linutronix.de
2017-09-25 20:52:02 +02:00
Thomas Gleixner
8ed4f3e666 x86/smpboot: Set online before setting up vectors
There is no reason to set the CPU online after establishing the vectors on
the upcoming CPU. The vector space is protected by the vector lock so no
changes can happen.

Marking the CPU online before setting up the vector space makes tracing
work in the early vector management cpu online code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Yu Chen <yu.c.chen@intel.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rui Zhang <rui.zhang@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Len Brown <lenb@kernel.org>
Link: https://lkml.kernel.org/r/20170913213155.264311994@linutronix.de
2017-09-25 20:51:57 +02:00
Thomas Gleixner
0fa115da40 x86/irq/vector: Initialize matrix allocator
Initialize the matrix allocator and add the proper accounting points to the
code.

No functional change, just preparation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Yu Chen <yu.c.chen@intel.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rui Zhang <rui.zhang@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Len Brown <lenb@kernel.org>
Link: https://lkml.kernel.org/r/20170913213155.108410660@linutronix.de
2017-09-25 20:51:56 +02:00
Thomas Gleixner
ef9e56d894 x86/ioapic: Remove obsolete post hotplug update
With single CPU affinities the post SMP boot vector update is pointless as
it will just leave the affinities on the same vectors and the same CPUs.

Remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Yu Chen <yu.c.chen@intel.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rui Zhang <rui.zhang@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Len Brown <lenb@kernel.org>
Link: https://lkml.kernel.org/r/20170913213154.308697243@linutronix.de
2017-09-25 20:51:52 +02:00
Dou Liyang
935356cecd x86/apic: Initialize interrupt mode after timer init
A cold or warm boot through BIOS sets the APIC in default interrupt
delivery mode. A dump-capture kernel will not go through a BIOS reset and
leave the interrupt delivery mode in the state which was active on the
crashed kernel, but the dump kernel startup code assumes default delivery
mode which can result in interrupt delivery/handling to fail.

To solve this problem, it's required to set up the final interrupt delivery
mode as soon as possible. As IOAPIC setup needs the timer initialized for
verifying the timer interrupt delivery mode, the earliest point is right
after timer setup in late_time_init().

That results in the following init order:

  1) Set up the legacy timer, if applicable on the platform

  2) Set up APIC/IOAPIC which includes the verification of the legacy timer
     interrupt delivery.

  3) TSC calibration

  4) Local APIC timer setup


Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yinghai@kernel.org
Cc: bhe@redhat.com
Link: https://lkml.kernel.org/r/1505293975-26005-12-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25 15:03:17 +02:00
Dou Liyang
34fba3e6b1 x86/init: Add intr_mode_init to x86_init_ops
X86 and XEN initialize interrupt delivery mode in different way.

To avoid conditionals, add a new x86_init_ops function which defaults to
the standard function and can be overridden by the early XEN platform code.

[ tglx: Folded the XEN part which was a separate patch to preserve
  	bisectability ]

Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yinghai@kernel.org
Cc: bhe@redhat.com
Link: https://lkml.kernel.org/r/1505293975-26005-10-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25 15:03:17 +02:00
Dou Liyang
4f45ed9f84 x86/apic: Mark the apic_intr_mode extern for sanity check cleanup
Calling native_smp_prepare_cpus() to prepare for SMP bootup, does some
sanity checking, enables APIC mode and disables SMP feature.

Now, APIC mode setup has been unified to apic_intr_mode_init(), some sanity
checks are redundant and need to be cleanup.

Mark the apic_intr_mode extern to refine the switch and remove the
redundant sanity check.

Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yinghai@kernel.org
Cc: bhe@redhat.com
Link: https://lkml.kernel.org/r/1505293975-26005-7-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25 15:03:16 +02:00
Dou Liyang
3e730dad3b x86/apic: Unify interrupt mode setup for SMP-capable system
On a SMP-capable system, the kernel enables and sets up the APIC interrupt
delivery mode in native_smp_prepare_cpus(). The decision how to setup the
APIC is intermingled with the decision of setting up SMP or not.

Split the initialization of the APIC interrupt mode independent from other
decisions and have a separate apic_intr_mode_init() function for it.

The invocation time stays the same for now.

Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yinghai@kernel.org
Cc: bhe@redhat.com
Link: https://lkml.kernel.org/r/1505293975-26005-6-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25 15:03:15 +02:00
Dou Liyang
4b1244b45c x86/apic: Move logical APIC ID away from apic_bsp_setup()
apic_bsp_setup() sets and returns logical APIC ID for initializing
cpu0_logical_apicid in a SMP-capable system.

The id has nothing to do with the initialization of local APIC and I/O
APIC. And apic_bsp_setup() should be called for interrupt mode setup only.

Move the id setup into a separate helper function for cleanup and mark
apic_bsp_setup() void.

Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yinghai@kernel.org
Cc: bhe@redhat.com
Link: https://lkml.kernel.org/r/1505293975-26005-5-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25 15:03:15 +02:00
Dou Liyang
a2510d156e x86/apic: Split local APIC timer setup from the APIC setup
apic_bsp_setup() sets up the local APIC, I/O APIC and APIC timer.

The local APIC and I/O APIC setup belongs to interrupt delivery mode
setup. Setting up the local APIC timer for booting CPU is another job
and has nothing to do with interrupt delivery mode setup.

Split local APIC timer setup from the APIC setup, keep it in the original
position for SMP and UP kernel for now.

Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yinghai@kernel.org
Cc: bhe@redhat.com
Link: https://lkml.kernel.org/r/1505293975-26005-4-git-send-email-douly.fnst@cn.fujitsu.com
2017-09-25 15:03:14 +02:00
Andy Lutomirski
4ba55e65f4 x86/mm/32: Load a sane CR3 before cpu_init() on secondary CPUs
For unknown historical reasons (i.e. Borislav doesn't recall),
32-bit kernels invoke cpu_init() on secondary CPUs with
initial_page_table loaded into CR3.  Then they set
current->active_mm to &init_mm and call enter_lazy_tlb() before
fixing CR3.  This means that the x86 TLB code gets invoked while CR3
is inconsistent, and, with the improved PCID sanity checks I added,
we warn.

Fix it by loading swapper_pg_dir (i.e. init_mm.pgd) earlier.

Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 72c0098d92 ("x86/mm: Reinitialize TLB state on hotplug and resume")
Link: http://lkml.kernel.org/r/30cdfea504682ba3b9012e77717800a91c22097f.1505663533.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-17 18:59:09 +02:00
Andy Lutomirski
c7ad5ad297 x86/mm/64: Initialize CR4.PCIDE early
cpu_init() is weird: it's called rather late (after early
identification and after most MMU state is initialized) on the boot
CPU but is called extremely early (before identification) on secondary
CPUs.  It's called just late enough on the boot CPU that its CR4 value
isn't propagated to mmu_cr4_features.

Even if we put CR4.PCIDE into mmu_cr4_features, we'd hit two
problems.  First, we'd crash in the trampoline code.  That's
fixable, and I tried that.  It turns out that mmu_cr4_features is
totally ignored by secondary_start_64(), though, so even with the
trampoline code fixed, it wouldn't help.

This means that we don't currently have CR4.PCIDE reliably initialized
before we start playing with cpu_tlbstate.  This is very fragile and
tends to cause boot failures if I make even small changes to the TLB
handling code.

Make it more robust: initialize CR4.PCIDE earlier on the boot CPU
and propagate it to secondary CPUs in start_secondary().

( Yes, this is ugly.  I think we should have improved mmu_cr4_features
  to actually control CR4 during secondary bootup, but that would be
  fairly intrusive at this stage. )

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reported-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Tested-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: 660da7c922 ("x86/mm: Enable CR4.PCIDE on supported systems")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-13 09:54:43 +02:00
Alexey Dobriyan
9b130ad5bb treewide: make "nr_cpu_ids" unsigned
First, number of CPUs can't be negative number.

Second, different signnnedness leads to suboptimal code in the following
cases:

1)
	kmalloc(nr_cpu_ids * sizeof(X));

"int" has to be sign extended to size_t.

2)
	while (loff_t *pos < nr_cpu_ids)

MOVSXD is 1 byte longed than the same MOV.

Other cases exist as well. Basically compiler is told that nr_cpu_ids
can't be negative which can't be deduced if it is "int".

Code savings on allyesconfig kernel: -3KB

	add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
	function                                     old     new   delta
	coretemp_cpu_online                          450     512     +62
	rcu_init_one                                1234    1272     +38
	pci_device_probe                             374     399     +25

				...

	pgdat_reclaimable_pages                      628     556     -72
	select_fallback_rq                           446     369     -77
	task_numa_find_cpu                          1923    1807    -116

Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:48 -07:00
Vitaly Kuznetsov
10e66760fa x86/smpboot: Unbreak CPU0 hotplug
A hang on CPU0 onlining after a preceding offlining is observed. Trace
shows that CPU0 is stuck in check_tsc_sync_target() waiting for source
CPU to run check_tsc_sync_source() but this never happens. Source CPU,
in its turn, is stuck on synchronize_sched() which is called from
native_cpu_up() -> do_boot_cpu() -> unregister_nmi_handler().

So it's a classic ABBA deadlock, due to the use of synchronize_sched() in
unregister_nmi_handler().

Fix the bug by moving unregister_nmi_handler() from do_boot_cpu() to
native_cpu_up() after cpu onlining is done.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170803105818.9934-1-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 17:02:47 +02:00
Linus Torvalds
7a69f9c60b Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Continued work to add support for 5-level paging provided by future
     Intel CPUs. In particular we switch the x86 GUP code to the generic
     implementation. (Kirill A. Shutemov)

   - Continued work to add PCID CPU support to native kernels as well.
     In this round most of the focus is on reworking/refreshing the TLB
     flush infrastructure for the upcoming PCID changes. (Andy
     Lutomirski)"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  x86/mm: Delete a big outdated comment about TLB flushing
  x86/mm: Don't reenter flush_tlb_func_common()
  x86/KASLR: Fix detection 32/64 bit bootloaders for 5-level paging
  x86/ftrace: Exclude functions in head64.c from function-tracing
  x86/mmap, ASLR: Do not treat unlimited-stack tasks as legacy mmap
  x86/mm: Remove reset_lazy_tlbstate()
  x86/ldt: Simplify the LDT switching logic
  x86/boot/64: Put __startup_64() into .head.text
  x86/mm: Add support for 5-level paging for KASLR
  x86/mm: Make kernel_physical_mapping_init() support 5-level paging
  x86/mm: Add sync_global_pgds() for configuration with 5-level paging
  x86/boot/64: Add support of additional page table level during early boot
  x86/boot/64: Rename init_level4_pgt and early_level4_pgt
  x86/boot/64: Rewrite startup_64() in C
  x86/boot/compressed: Enable 5-level paging during decompression stage
  x86/boot/efi: Define __KERNEL32_CS GDT on 64-bit configurations
  x86/boot/efi: Fix __KERNEL_CS definition of GDT entry on 64-bit configurations
  x86/boot/efi: Cleanup initialization of GDT entries
  x86/asm: Fix comment in return_from_SYSCALL_64()
  x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
  ...
2017-07-03 14:45:09 -07:00
Andy Lutomirski
d54368127a x86/mm: Remove reset_lazy_tlbstate()
The only call site also calls idle_task_exit(), and idle_task_exit()
puts us into a clean state by explicitly switching to init_mm.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/3acc7ad02a2ec060d2321a1e0f6de1cb90069517.1498022414.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-22 10:57:50 +02:00
Thomas Gleixner
719b3680d1 x86/smp: Adjust system_state check
To enable smp_processor_id() and might_sleep() debug checks earlier, it's
required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

Adjust the system_state check in announce_cpu() to handle the extra states.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170516184735.191715856@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-23 10:01:35 +02:00
Thomas Garnier
69218e4799 x86: Remap GDT tables in the fixmap section
Each processor holds a GDT in its per-cpu structure. The sgdt
instruction gives the base address of the current GDT. This address can
be used to bypass KASLR memory randomization. With another bug, an
attacker could target other per-cpu structures or deduce the base of
the main memory section (PAGE_OFFSET).

This patch relocates the GDT table for each processor inside the
fixmap section. The space is reserved based on number of supported
processors.

For consistency, the remapping is done by default on 32 and 64-bit.

Each processor switches to its remapped GDT at the end of
initialization. For hibernation, the main processor returns with the
original GDT and switches back to the remapping at completion.

This patch was tested on both architectures. Hibernation and KVM were
both tested specially for their usage of the GDT.

Thanks to Boris Ostrovsky <boris.ostrovsky@oracle.com> for testing and
recommending changes for Xen support.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Luis R . Rodriguez <mcgrof@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: kernel-hardening@lists.openwall.com
Cc: kvm@vger.kernel.org
Cc: lguest@lists.ozlabs.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-pm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: zijun_hu <zijun_hu@htc.com>
Link: http://lkml.kernel.org/r/20170314170508.100882-2-thgarnie@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:06:35 +01:00
Ingo Molnar
68db0cf106 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:36 +01:00
Ingo Molnar
ef8bd77f33 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/hotplug.h>
We are going to split <linux/sched/hotplug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/hotplug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:35 +01:00
Ingo Molnar
105ab3d8ce sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h>
We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:26 +01:00
Linus Torvalds
c945d0227d Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Ingo Molnar:
 "Misc platform updates: SGI UV4 support additions, intel-mid Merrifield
  enhancements and purge of old code"

* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  x86/platform/UV/NMI: Fix uneccessary kABI breakage
  x86/platform/UV: Clean up the NMI code to match current coding style
  x86/platform/UV: Ensure uv_system_init is called when necessary
  x86/platform/UV: Initialize PCH GPP_D_0 NMI Pin to be NMI source
  x86/platform/UV: Verify NMI action is valid, default is standard
  x86/platform/UV: Add basic CPU NMI health check
  x86/platform/UV: Add Support for UV4 Hubless NMIs
  x86/platform/UV: Add Support for UV4 Hubless systems
  x86/platform/UV: Clean up the UV APIC code
  x86/platform/intel-mid: Move watchdog registration to arch_initcall()
  x86/platform/intel-mid: Don't shadow error code of mp_map_gsi_to_irq()
  x86/platform/intel-mid: Allocate RTC interrupt for Merrifield
  x86/ioapic: Return suitable error code in mp_map_gsi_to_irq()
  x86/platform/UV: Fix 2 socket config problem
  x86/platform/UV: Fix panic with missing UVsystab support
  x86/platform/intel-mid: Enable RTC on Intel Merrifield
  x86/platform/intel: Remove PMIC GPIO block support
  x86/platform/intel-mid: Make intel_scu_device_register() static
  x86/platform/intel-mid: Enable GPIO keys on Merrifield
  x86/platform/intel-mid: Get rid of duplication of IPC handler
  ...
2017-02-20 16:26:57 -08:00
Borislav Petkov
79a8b9aa38 x86/CPU/AMD: Bring back Compute Unit ID
Commit:

  a33d331761 ("x86/CPU/AMD: Fix Bulldozer topology")

restored the initial approach we had with the Fam15h topology of
enumerating CU (Compute Unit) threads as cores. And this is still
correct - they're beefier than HT threads but still have some
shared functionality.

Our current approach has a problem with the Mad Max Steam game, for
example. Yves Dionne reported a certain "choppiness" while playing on
v4.9.5.

That problem stems most likely from the fact that the CU threads share
resources within one CU and when we schedule to a thread of a different
compute unit, this incurs latency due to migrating the working set to a
different CU through the caches.

When the thread siblings mask mirrors that aspect of the CUs and
threads, the scheduler pays attention to it and tries to schedule within
one CU first. Which takes care of the latency, of course.

Reported-by: Yves Dionne <yves.dionne@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.9
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20170205105022.8705-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-05 12:18:45 +01:00
travis@sgi.com
9ec808a022 x86/platform/UV: Ensure uv_system_init is called when necessary
Move the check to whether this is a UV system that needs initialization
from is_uv_system() to the internal uv_system_init() function.  This is
because on a UV system without a HUB the is_uv_system() returns false.
But we still need some specific UV system initialization.  See the
uv_system_init() for change to a quick check if UV is applicable. This
change should not increase overhead since is_uv_system() also called
into this same area.

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Russ Anderson <rja@hpe.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Dimitri Sivanich <sivanich@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20170125163518.256403963@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 10:21:00 +01:00
Thomas Gleixner
427d77a323 x86/smpboot: Prevent false positive out of bounds cpumask access warning
prefill_possible_map() reinitializes the cpu_possible_map by setting the
possible cpu bits and clearing all other bits up to NR_CPUS.

This is technically always correct because cpu_possible_map is statically
allocated and sized NR_CPUS. With CPUMASK_OFFSTACK and DEBUG_PER_CPU_MAPS
enabled the bounds check of cpu masks happens on nr_cpu_ids. nr_cpu_ids is
initialized to NR_CPUS and only limited after the set/clear bit loops have
been executed. 

But if the system was booted with "nr_cpus=N" on the command line, where N
is < NR_CPUS then nr_cpu_ids is limited in the parameter parsing function
before prefill_possible_map() is invoked. As a consequence the cpumask
bounds check triggers when clearing the bits past nr_cpu_ids.

Add a helper which allows to reset cpu_possible_map w/o the bounds check
and then set only the possible bits which are well inside bounds.

Reported-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: 0x7f454c46@gmail.com
Cc: Jan Beulich <JBeulich@novell.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1612131836050.3415@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-15 11:32:31 +01:00
Thomas Gleixner
9d85eb9119 x86/smpboot: Make logical package management more robust
The logical package management has several issues:

 - The APIC ids provided by ACPI are not required to be the same as the
   initial APIC id which can be retrieved by CPUID. The APIC ids provided
   by ACPI are those which are written by the BIOS into the APIC. The
   initial id is set by hardware and can not be changed. The hardware
   provided ids contain the real hardware package information.

   Especially AMD sets the effective APIC id different from the hardware id
   as they need to reserve space for the IOAPIC ids starting at id 0.

   As a consequence those machines trigger the currently active firmware
   bug printouts in dmesg, These are obviously wrong.

 - Virtual machines have their own interesting of enumerating APICs and
   packages which are not reliably covered by the current implementation.

The sizing of the mapping array has been tweaked to be generously large to
handle systems which provide a wrong core count when HT is disabled so the
whole magic which checks for space in the physical hotplug case is not
needed anymore.

Simplify the whole machinery and do the mapping when the CPU starts and the
CPUID derived physical package information is available. This solves the
observed problems on AMD machines and works for the virtualization issues
as well.

Remove the extra call from XEN cpu bringup code as it is not longer
required.

Fixes: d49597fd3b ("x86/cpu: Deal with broken firmware (VMWare/XEN)")
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: M. Vefa Bicakci <m.v.b@runbox.com>
Cc: xen-devel <xen-devel@lists.xen.org>
Cc: Charles (Chas) Williams <ciwillia@brocade.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Alok Kataria <akataria@vmware.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1612121102260.3429@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-13 10:22:39 +01:00
Linus Torvalds
212f30008a Merge branch 'x86-idle-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 idle updates from Ingo Molnar:
 "There were two bigger changes in this development cycle:

   - remove idle notifiers:

       32 files changed, 74 insertions(+), 803 deletions(-)

     These notifiers were of questionable value and the main usecase,
     the i7300 driver, was essentially unmaintained and can be removed,
     plus modern power management concepts don't need the callback - so
     use this golden opportunity and get rid of this opaque and fragile
     callback from a latency sensitive code path.

     (Len Brown, Thomas Gleixner)

   - improve the AMD Erratum 400 workaround that used high overhead MSR
     polling in the idle loop (Borisla Petkov, Thomas Gleixner)"

* 'x86-idle-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Remove empty idle.h header
  x86/amd: Simplify AMD E400 aware idle routine
  x86/amd: Check for the C1E bug post ACPI subsystem init
  x86/bugs: Separate AMD E400 erratum and C1E bug
  x86/cpufeature: Provide helper to set bugs bits
  x86/idle: Remove enter_idle(), exit_idle()
  x86: Remove x86_test_and_clear_bit_percpu()
  x86/idle: Remove is_idle flag
  x86/idle: Remove idle_notifier
  i7300_idle: Remove this driver
2016-12-12 14:55:04 -08:00
Linus Torvalds
518bacf5a5 Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - do a large round of simplifications after all CPUs do 'eager' FPU
     context switching in v4.9: remove CR0 twiddling, remove leftover
     eager/lazy bts, etc (Andy Lutomirski)

   - more FPU code simplifications: remove struct fpu::counter, clarify
     nomenclature, remove unnecessary arguments/functions and better
     structure the code (Rik van Riel)"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Remove clts()
  x86/fpu: Remove stts()
  x86/fpu: Handle #NM without FPU emulation as an error
  x86/fpu, lguest: Remove CR0.TS support
  x86/fpu, kvm: Remove host CR0.TS manipulation
  x86/fpu: Remove irq_ts_save() and irq_ts_restore()
  x86/fpu: Stop saving and restoring CR0.TS in fpu__init_check_bugs()
  x86/fpu: Get rid of two redundant clts() calls
  x86/fpu: Finish excising 'eagerfpu'
  x86/fpu: Split old_fpu & new_fpu handling into separate functions
  x86/fpu: Remove 'cpu' argument from __cpu_invalidate_fpregs_state()
  x86/fpu: Split old & new FPU code paths
  x86/fpu: Remove __fpregs_(de)activate()
  x86/fpu: Rename lazy restore functions to "register state valid"
  x86/fpu, kvm: Remove KVM vcpu->fpu_counter
  x86/fpu: Remove struct fpu::counter
  x86/fpu: Remove use_eager_fpu()
  x86/fpu: Remove the XFEATURE_MASK_EAGER/LAZY distinction
  x86/fpu: Hard-disable lazy FPU mode
  x86/crypto, x86/fpu: Remove X86_FEATURE_EAGER_FPU #ifdef from the crc32c code
2016-12-12 14:27:49 -08:00
Linus Torvalds
535b2f73f6 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU updates from Ingo Molnar:
 "The changes in this development cycle were:

   - AMD CPU topology enhancements that are cleanups on current CPUs but
     which enable future Fam17 hardware. (Yazen Ghannam)

   - unify bugs.c and bugs_64.c (Borislav Petkov)

   - remove the show_msr= boot option (Borislav Petkov)

   - simplify a boot message (Borislav Petkov)"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/AMD: Clean up cpu_llc_id assignment per topology feature
  x86/cpu: Get rid of the show_msr= boot option
  x86/cpu: Merge bugs.c and bugs_64.c
  x86/cpu: Remove the printk format specifier in "CPU0: "
2016-12-12 14:25:21 -08:00
Linus Torvalds
5645688f9d Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
 "The main changes in this development cycle were:

   - a large number of call stack dumping/printing improvements: higher
     robustness, better cross-context dumping, improved output, etc.
     (Josh Poimboeuf)

   - vDSO getcpu() performance improvement for future Intel CPUs with
     the RDPID instruction (Andy Lutomirski)

   - add two new Intel AVX512 features and the CPUID support
     infrastructure for it: AVX512IFMA and AVX512VBMI. (Gayatri Kammela,
     He Chen)

   - more copy-user unification (Borislav Petkov)

   - entry code assembly macro simplifications (Alexander Kuleshov)

   - vDSO C/R support improvements (Dmitry Safonov)

   - misc fixes and cleanups (Borislav Petkov, Paul Bolle)"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  scripts/decode_stacktrace.sh: Fix address line detection on x86
  x86/boot/64: Use defines for page size
  x86/dumpstack: Make stack name tags more comprehensible
  selftests/x86: Add test_vdso to test getcpu()
  x86/vdso: Use RDPID in preference to LSL when available
  x86/dumpstack: Handle NULL stack pointer in show_trace_log_lvl()
  x86/cpufeatures: Enable new AVX512 cpu features
  x86/cpuid: Provide get_scattered_cpuid_leaf()
  x86/cpuid: Cleanup cpuid_regs definitions
  x86/copy_user: Unify the code by removing the 64-bit asm _copy_*_user() variants
  x86/unwind: Ensure stack grows down
  x86/vdso: Set vDSO pointer only after success
  x86/prctl/uapi: Remove #ifdef for CHECKPOINT_RESTORE
  x86/unwind: Detect bad stack return address
  x86/dumpstack: Warn on stack recursion
  x86/unwind: Warn on bad frame pointer
  x86/decoder: Use stderr if insn sanity test fails
  x86/decoder: Use stdout if insn decoder test is successful
  mm/page_alloc: Remove kernel address exposure in free_reserved_area()
  x86/dumpstack: Remove raw stack dump
  ...
2016-12-12 13:49:57 -08:00
Linus Torvalds
92c020d08d Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main scheduler changes in this cycle were:

   - support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducig a
     notion of 'better cores', which the scheduler will prefer to
     schedule single threaded workloads on. (Tim Chen, Srinivas
     Pandruvada)

   - enhance the handling of asymmetric capacity CPUs further (Morten
     Rasmussen)

   - improve/fix load handling when moving tasks between task groups
     (Vincent Guittot)

   - simplify and clean up the cputime code (Stanislaw Gruszka)

   - improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent
     Guittot)

   - make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)

   - add uaccess atomicity debugging (when using access_ok() in the
     wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)

   - implement various fixes, cleanups and other enhancements (Daniel
     Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  sched/core: Use load_avg for selecting idlest group
  sched/core: Fix find_idlest_group() for fork
  kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
  kthread: Don't use to_live_kthread() in kthread_[un]park()
  kthread: Don't use to_live_kthread() in kthread_stop()
  Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
  kthread: Make struct kthread kmalloc'ed
  x86/uaccess, sched/preempt: Verify access_ok() context
  sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
  sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
  x86/sched: Use #include <linux/mutex.h> instead of #include <asm/mutex.h>
  cpufreq/intel_pstate: Use CPPC to get max performance
  acpi/bus: Set _OSC for diverse core support
  acpi/bus: Enable HWP CPPC objects
  x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
  x86/sysctl: Add sysctl for ITMT scheduling feature
  x86: Enable Intel Turbo Boost Max Technology 3.0
  x86/topology: Define x86's arch_update_cpu_topology
  sched: Extend scheduler's asym packing
  sched/fair: Clean up the tunable parameter definitions
  ...
2016-12-12 12:15:10 -08:00
Thomas Gleixner
34bc3560c6 x86: Remove empty idle.h header
One include less is always a good thing(tm). Good riddance.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-6-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-09 21:23:22 +01:00
Borislav Petkov
07c94a3812 x86/amd: Simplify AMD E400 aware idle routine
Reorganize the E400 detection now that we have everything in place:
switch the CPUs to broadcast mode after the LAPIC has been initialized
and remove the facilities that were used previously on the idle path.

Unfortunately static_cpu_has_bug() cannpt be used in the E400 idle routine
because alternatives have been applied when the actual detection happens,
so the static switching does not take effect and the test will stay
false. Use boot_cpu_has_bug() instead which is definitely an improvement
over the RDMSR and the cpumask handling.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/20161209182912.2726-5-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-09 21:23:21 +01:00
Tim Chen
d3d37d850d x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
Some Intel cores in a package can be boosted to a higher turbo frequency
with ITMT 3.0 technology. The scheduler can use the asymmetric packing
feature to move tasks to the more capable cores.

If ITMT is enabled, add SD_ASYM_PACKING flag to the thread and core
sched domains to enable asymmetric packing.

Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/9bbb885bedbef4eb50e197305eb16b160cff0831.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-24 20:44:20 +01:00
Tim Chen
7d25127cef x86/topology: Define x86's arch_update_cpu_topology
The scheduler calls arch_update_cpu_topology() to check whether the
scheduler domains have to be rebuilt.

So far x86 has no requirement for this, but the upcoming ITMT support
makes this necessary.

Request the rebuild when the x86 internal update flag is set.

Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/bfbf5591276ec60b2af2da798adc1060df1e2a5f.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-24 20:44:19 +01:00
Ingo Molnar
c29c716662 Merge branch 'core/urgent' into x86/fpu, to merge fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:47:40 +01:00
Ingo Molnar
05b93c19d5 Merge branch 'linus' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:41:06 +01:00
Michael Ellerman
92b2327829 kernel/smp: Make the SMP boot message common on all arches
Currently after bringing up secondary CPUs all arches print "Brought up
%d CPUs". On x86 they also print the number of nodes that were brought
online.

It would be nice to also print the number of nodes on other arches.
Although we could override smp_announce() on the other ~10 NUMA aware
arches, it seems simpler to just always print the number of nodes. On
non-NUMA arches there is just always 1 node.

Having done that, smp_announce() is no longer weak, and seems small
enough to just pull directly into smp_init().

Also update the printing of "%d CPUs" to be smart when an SMP kernel is
booted on a single CPU system, or when only one CPU is available, eg:

   smp: Brought up 2 nodes, 1 CPU

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: akpm@osdl.org
Cc: jgross@suse.com
Cc: ak@linux.intel.com
Cc: tim.c.chen@linux.intel.com
Cc: len.brown@intel.com
Cc: peterz@infradead.org
Cc: richard@nod.at
Cc: jolsa@redhat.com
Cc: boris.ostrovsky@oracle.com
Cc: mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1477460275-8266-2-git-send-email-mpe@ellerman.id.au
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-26 12:02:35 +02:00
Borislav Petkov
d54ff31dd8 x86/cpu: Remove the printk format specifier in "CPU0: "
We're using a literal, move it into the string.

No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161024173844.23038-2-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 11:48:49 +02:00
Ville Syrjälä
ff8560512b x86/boot/smp: Don't try to poke disabled/non-existent APIC
Apparently trying to poke a disabled or non-existent APIC
leads to a box that doesn't even boot. Let's not do that.

No real clue if this is the right fix, but at least my
P3 machine boots again.

Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: dyoung@redhat.com
Cc: kexec@lists.infradead.org
Cc: stable@vger.kernel.org
Fixes: 2a51fe083e ("arch/x86: Handle non enumerated CPU after physical hotplug")
Link: http://lkml.kernel.org/r/1477102684-5092-1-git-send-email-ville.syrjala@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-22 10:47:54 +02:00
Josh Poimboeuf
b9b1a9c363 x86/boot/smp/32: Fix initial idle stack location on 32-bit kernels
On 32-bit kernels, the initial idle stack calculation doesn't take into
account the TOP_OF_KERNEL_STACK_PADDING, making the stack end address
inconsistent with other tasks on 32-bit.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/6cf569410bfa84cf923902fc4d628444cace94be.1474480779.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-20 09:15:23 +02:00
Ingo Molnar
4d69f155d5 Linux 4.9-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYAoDuAAoJEHm+PkMAQRiGUeEH/03/cUjHeY5aJkcJ0JeHkoU5
 GR5nRGcjfFF6cGujw2cSXBf5NzZTcrvBBFSgGNJ/rqm4EeDBsmf6T8qSfEKky/SY
 3CNWSzayFU8Na3C8Z/a/xPTPicneX9zVnAi8XMAKXwWPmu21JCLR/hkKaxQ29qGr
 Nqe4kEdLEF80d5lFRfNjK3CX4bD6w6P7aTBaM6wuRe4u5AXKJlSF+j838o5+/tSQ
 Q1V7fyXlX+kwNmH4gViim8im0PLm7/7Li8e24pL3cAR2G6DHrUzcsYYoRMHpk5bv
 HdBeCgZL6TnIaJc0ui2FRqQsifaVfM5J+pK81wr/JhBP2hmuWIN7NMupfCYtCcM=
 =Mown
 -----END PGP SIGNATURE-----

Merge tag 'v4.9-rc1' into x86/fpu, to resolve conflict

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-16 13:04:34 +02:00
Rik van Riel
317b622cb2 x86/fpu: Remove 'cpu' argument from __cpu_invalidate_fpregs_state()
The __{fpu,cpu}_invalidate_fpregs_state() functions can only be used
to invalidate a resource they control.  Document that, and change
the API a little bit to reflect that.

Go back to open coding the fpu_fpregs_owner_ctx write in the CPU
hotplug code, which should be the exception, and move __kernel_fpu_begin()
to this API.

This patch has no functional changes to the current code.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1476447331-21566-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-16 11:38:31 +02:00
Prarit Bhargava
2a51fe083e arch/x86: Handle non enumerated CPU after physical hotplug
When a CPU is physically added to a system then the MADT table is not
updated.

If subsequently a kdump kernel is started on that physically added CPU then
the ACPI enumeration fails to provide the information for this CPU which is
now the boot CPU of the kdump kernel.

As a consequence, generic_processor_info() is not invoked for that CPU so
the number of enumerated processors is 0 and none of the initializations,
including the logical package id management, are performed.

We have code which relies on the correctness of the logical package map and
other information which is initialized via generic_processor_info().
Executing such code will result in undefined behaviour or kernel crashes.

This problem applies only to the kdump kernel because a normal kexec will
switch to the original boot CPU, which is enumerated in MADT, before
jumping into the kexec kernel.

The boot code already has a check for num_processors equal 0 in
prefill_possible_map(). We can use that check as an indicator that the
enumeration of the boot CPU did not happen and invoke generic_processor_info()
for it. That initializes the relevant data for the boot CPU and therefore
prevents subsequent failure.

[ tglx: Refined the code and rewrote the changelog ]

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Fixes: 1f12e32f4c ("x86/topology: Create logical package id")
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: dyoung@redhat.com
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/1475514432-27682-1-git-send-email-prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-07 15:22:15 +02:00
Rik van Riel
25d83b531c x86/fpu: Rename lazy restore functions to "register state valid"
Name the functions after the state they track, rather than the function
they currently enable. This should make it more obvious when we use the
fpu_register_state_valid() function for something else in the future.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pbonzini@redhat.com
Link: http://lkml.kernel.org/r/1475627678-20788-8-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-07 11:14:41 +02:00
Linus Torvalds
597f03f9d1 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
 "Yet another batch of cpu hotplug core updates and conversions:

   - Provide core infrastructure for multi instance drivers so the
     drivers do not have to keep custom lists.

   - Convert custom lists to the new infrastructure. The block-mq custom
     list conversion comes through the block tree and makes the diffstat
     tip over to more lines removed than added.

   - Handle unbalanced hotplug enable/disable calls more gracefully.

   - Remove the obsolete CPU_STARTING/DYING notifier support.

   - Convert another batch of notifier users.

   The relayfs changes which conflicted with the conversion have been
   shipped to me by Andrew.

   The remaining lot is targeted for 4.10 so that we finally can remove
   the rest of the notifiers"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  cpufreq: Fix up conversion to hotplug state machine
  blk/mq: Reserve hotplug states for block multiqueue
  x86/apic/uv: Convert to hotplug state machine
  s390/mm/pfault: Convert to hotplug state machine
  mips/loongson/smp: Convert to hotplug state machine
  mips/octeon/smp: Convert to hotplug state machine
  fault-injection/cpu: Convert to hotplug state machine
  padata: Convert to hotplug state machine
  cpufreq: Convert to hotplug state machine
  ACPI/processor: Convert to hotplug state machine
  virtio scsi: Convert to hotplug state machine
  oprofile/timer: Convert to hotplug state machine
  block/softirq: Convert to hotplug state machine
  lib/irq_poll: Convert to hotplug state machine
  x86/microcode: Convert to hotplug state machine
  sh/SH-X3 SMP: Convert to hotplug state machine
  ia64/mca: Convert to hotplug state machine
  ARM/OMAP/wakeupgen: Convert to hotplug state machine
  ARM/shmobile: Convert to hotplug state machine
  arm64/FP/SIMD: Convert to hotplug state machine
  ...
2016-10-03 19:43:08 -07:00
Linus Torvalds
1a4a2bc460 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull low-level x86 updates from Ingo Molnar:
 "In this cycle this topic tree has become one of those 'super topics'
  that accumulated a lot of changes:

   - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
     x86 - preceded by an array of changes. v4.8 saw preparatory changes
     in this area already - this is the rest of the work. Includes the
     thread stack caching performance optimization. (Andy Lutomirski)

   - switch_to() cleanups and all around enhancements. (Brian Gerst)

   - A large number of dumpstack infrastructure enhancements and an
     unwinder abstraction. The secret long term plan is safe(r) live
     patching plus maybe another attempt at debuginfo based unwinding -
     but all these current bits are standalone enhancements in a frame
     pointer based debug environment as well. (Josh Poimboeuf)

   - More __ro_after_init and const annotations. (Kees Cook)

   - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

[ The virtually mapped stack changes are pretty fundamental, and not
  x86-specific per se, even if they are only used on x86 right now. ]

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  x86/asm: Get rid of __read_cr4_safe()
  thread_info: Use unsigned long for flags
  x86/alternatives: Add stack frame dependency to alternative_call_2()
  x86/dumpstack: Fix show_stack() task pointer regression
  x86/dumpstack: Remove dump_trace() and related callbacks
  x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
  oprofile/x86: Convert x86_backtrace() to use the new unwinder
  x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
  perf/x86: Convert perf_callchain_kernel() to use the new unwinder
  x86/unwind: Add new unwind interface and implementations
  x86/dumpstack: Remove NULL task pointer convention
  fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
  sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  lib/syscall: Pin the task stack in collect_syscall()
  x86/process: Pin the target stack in get_wchan()
  x86/dumpstack: Pin the target stack when dumping it
  kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
  sched/core: Add try_get_task_stack() and put_task_stack()
  x86/entry/64: Fix a minor comment rebase error
  iommu/amd: Don't put completion-wait semaphore on stack
  ...
2016-10-03 16:13:28 -07:00
Linus Torvalds
110a9e42b6 Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Ingo Molnar:
 "The main changes are:

   - Persistent CPU/node numbering across CPU hotplug/unplug events.
     This is a pretty involved series of changes that first fetches all
     the information during bootup and then uses it for the various
     hotplug/unplug methods. (Gu Zheng, Dou Liyang)

   - IO-APIC hot-add/remove fixes and enhancements. (Rui Wang)

   - ... various fixes, cleanups and enhancements"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/apic: Fix silent & fatal merge conflict in __generic_processor_info()
  acpi: Fix broken error check in map_processor()
  acpi: Validate processor id when mapping the processor
  acpi: Provide mechanism to validate processors in the ACPI tables
  x86/acpi: Set persistent cpuid <-> nodeid mapping when booting
  x86/acpi: Enable MADT APIs to return disabled apicids
  x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping
  x86/acpi: Enable acpi to register all possible cpus at boot time
  x86/numa: Online memory-less nodes at boot time
  x86/apic: Get rid of apic_version[] array
  x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq()
  x86/ioapic: Ignore root bridges without a companion ACPI device
  x86/apic: Update comment about disabling processor focus
  x86/smpboot: Check APIC ID before setting up default routing
  x86/ioapic: Fix IOAPIC failing to request resource
  x86/ioapic: Fix lost IOAPIC resource after hot-removal and hotadd
  x86/ioapic: Fix setup_res() failing to get resource
  x86/ioapic: Support hot-removal of IOAPICs present during boot
  x86/ioapic: Change prototype of acpi_ioapic_add()
  x86/apic, ACPI: Fix incorrect assignment when handling apic/x2apic entries
  ...
2016-10-03 15:36:06 -07:00
Tim Chen
8f37961cf2 sched/core, x86/topology: Fix NUMA in package topology bug
Current code can call set_cpu_sibling_map() and invoke sched_set_topology()
more than once (e.g. on CPU hot plug).  When this happens after
sched_init_smp() has been called, we lose the NUMA topology extension to
sched_domain_topology in sched_init_numa().  This results in incorrect
topology when the sched domain is rebuilt.

This patch fixes the bug and issues warning if we call sched_set_topology()
after sched_init_smp().

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1474485552-141429-2-git-send-email-srinivas.pandruvada@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:53:18 +02:00
Thomas Gleixner
1e1b37273c Merge branch 'x86/urgent' into x86/apic
Bring in the upstream modifications so we can fixup the silent merge
conflict which is introduced by this merge.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-26 15:47:03 -04:00
Denys Vlasenko
cff9ab2b29 x86/apic: Get rid of apic_version[] array
The array has a size of MAX_LOCAL_APIC, which can be as large as 32k, so it
can consume up to 128k.

The array has been there forever and was never used for anything useful
other than a version mismatch check which was introduced in 2009.

There is no reason to store the version in an array. The kernel is not
prepared to handle different APIC versions anyway, so the real important
part is to detect a version mismatch and warn about it, which can be done
with a single variable as well.

[ tglx: Massaged changelog ]

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Borislav Petkov <bp@alien8.de>
CC: Brian Gerst <brgerst@gmail.com>
CC: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/20160913181232.30815-1-dvlasenk@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-20 00:31:19 +02:00
Thomas Gleixner
0cb7bf61b1 Merge branch 'linus' into smp/hotplug
Apply upstream changes to avoid conflicts with pending patches.
2016-09-01 18:33:46 +02:00
Brian Gerst
0100301bfd sched/x86: Rewrite the switch_to() code
Move the low-level context switch code to an out-of-line asm stub instead of
using complex inline asm.  This allows constructing a new stack frame for the
child process to make it seamlessly flow to ret_from_fork without an extra
test and branch in __switch_to().  It also improves code generation for
__schedule() by using the C calling convention instead of clobbering all
registers.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471106302-10159-5-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 12:31:41 +02:00
Wei Jiangang
384d9fe374 x86/smpboot: Check APIC ID before setting up default routing
This is not a bugfix, but code optimization.

If the BSP's APIC ID in local APIC is unexpected,
a kernel panic will occur and the system will halt.
That means no need to enable APIC mode, and no reason
to set up the default routing for APIC.

The combination of default_setup_apic_routing() and
apic_bsp_setup() are used to enable APIC mode.
They two should be kept together, rather than being
separated by the codes of checking APIC ID.
Just like their usage in APIC_init_uniprocessor().

Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/1471576957-12961-1-git-send-email-weijg.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-24 11:24:33 +02:00
Josh Poimboeuf
b32f96c75d x86/asm/head: Rename 'stack_start' -> 'initial_stack'
The 'stack_start' variable is similar in usage to 'initial_code' and
'initial_gs': they're all stored in head_64.S and they're all updated by
SMP and ACPI suspend before starting a CPU.

Rename it to 'initial_stack' to be consistent with the others.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nilay Vaish <nilayvaish@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/87063d773a3212051b77e17b0ee427f6582a5050.1471535549.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 18:41:29 +02:00
Jiri Olsa
7b0501b1e7 x86/smp: Fix __max_logical_packages value setup
Frank reported kernel panic when he disabled several cores in BIOS
via following option:

  Core Disable Bitmap(Hex)   [0]

with number 0xFFE, which leaves 16 CPUs in system (out of 48).

The kernel panic below goes along with following messages:

 smpboot: Max logical packages: 2^M
 smpboot: APIC(0) Converting physical 0 to logical package 0^M
 smpboot: APIC(20) Converting physical 1 to logical package 1^M
 smpboot: APIC(40) Package 2 exceeds logical package map^M
 smpboot: CPU 8 APICId 40 disabled^M
 smpboot: APIC(60) Package 3 exceeds logical package map^M
 smpboot: CPU 12 APICId 60 disabled^M
 ...
 general protection fault: 0000 [#1] SMP^M
 Modules linked in:^M
 CPU: 15 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #1^M
 Hardware name: SGI UV300/UV300, BIOS SGI UV 300 series BIOS 05/25/2016^M
 task: ffff8801673e0000 ti: ffff8801673ac000 task.ti: ffff8801673ac000^M
 RIP: 0010:[<ffffffff81014d54>]  [<ffffffff81014d54>] uncore_change_context+0xd4/0x180^M
 ...
  [<ffffffff810158ac>] uncore_event_init_cpu+0x6c/0x70^M
  [<ffffffff81d8c91c>] intel_uncore_init+0x1c2/0x2dd^M
  [<ffffffff81d8c75a>] ? uncore_cpu_setup+0x17/0x17^M
  [<ffffffff81002190>] do_one_initcall+0x50/0x190^M
  [<ffffffff810ab193>] ? parse_args+0x293/0x480^M
  [<ffffffff81d87365>] kernel_init_freeable+0x1a5/0x249^M
  [<ffffffff81d86a35>] ? set_debug_rodata+0x12/0x12^M
  [<ffffffff816dc19e>] kernel_init+0xe/0x110^M
  [<ffffffff816e93bf>] ret_from_fork+0x1f/0x40^M
  [<ffffffff816dc190>] ? rest_init+0x80/0x80^M

The reason for the panic is wrong value of __max_logical_packages,
which lets logical_package_map uninitialized and the uncore code
relying on this map being properly initialized (maybe we should
add some safety checks there as well).

The __max_logical_packages is computed as:

  DIV_ROUND_UP(total_cpus, ncpus);
  - ncpus being number of cores

With above BIOS setup we get total_cpus == 16 which set
__max_logical_packages to 2 (ncpus is 12).

Once topology_update_package_map processes CPU with logical
pkg over 2 we display above messages and fail to initialize
the physical_to_logical_pkg map, which makes the uncore code
crash.

The fix is to remove logical_package_map bitmap completely
and keep and update the logical_packages number instead.

After we enumerate all the present CPUs, we check if the
enumerated logical packages count is within its computed
maximum from BIOS data.

If it's not the case, we set this maximum to the new enumerated
value and freeze any new addition of logical packages.

The freeze is because lot of init code like uncore/rapl/cqm
depends on having maximum logical package value set to allocate
their data, so we can't change it later on.

Prarit Bhargava tested the patch and confirms that it solves
the problem:

  From dmidecode:
          Core Count: 24
          Core Enabled: 24
          Thread Count: 48

Orig kernel boot log:

 [    0.464981] smpboot: Max logical packages: 19
 [    0.469861] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.477261] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.484760] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.492258] smpboot: APIC(c0) Converting physical 3 to logical package 3

1.  nr_cpus=8, should stop enumerating in package 0:

 [    0.533664] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.539596] smpboot: Max logical packages: 19

2.  max_cpus=8, should still enumerate all packages:

 [    0.526494] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.532428] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.538456] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.544486] smpboot: APIC(c0) Converting physical 3 to logical package 3
 [    0.550524] smpboot: Max logical packages: 19

3.  nr_cpus=49 ( 2 socket + 1 core on 3rd socket), should stop enumerating in
    package 2:

 [    0.521378] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.527314] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.533345] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.539368] smpboot: Max logical packages: 19

4.  maxcpus=49, should still enumerate all packages:

 [    0.525591] smpboot: APIC(0) Converting physical 0 to logical package 0
 [    0.531525] smpboot: APIC(40) Converting physical 1 to logical package 1
 [    0.537547] smpboot: APIC(80) Converting physical 2 to logical package 2
 [    0.543579] smpboot: APIC(c0) Converting physical 3 to logical package 3
 [    0.549624] smpboot: Max logical packages: 19

5.  kdump (nr_cpus=1) works as well.

Reported-by: Frank Ramsay <framsay@redhat.com>
Tested-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Reviewed-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160815101700.GA30090@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-18 10:14:48 +02:00
Boris Ostrovsky
aa877175e7 cpu/hotplug: Prevent alloc/free of irq descriptors during CPU up/down (again)
Now that Xen no longer allocates irqs in _cpu_up() we can restore
commit:

  a899418167 ("hotplug: Prevent alloc/free of irq descriptors during cpu up/down")

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: david.vrabel@citrix.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1470244948-17674-3-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 15:42:57 +02:00
Linus Torvalds
aeb35d6b74 Merge branch 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 header cleanups from Ingo Molnar:
 "This tree is a cleanup of the x86 tree reducing spurious uses of
  module.h - which should improve build performance a bit"

* 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, crypto: Restore MODULE_LICENSE() to glue_helper.c so it loads
  x86/apic: Remove duplicated include from probe_64.c
  x86/ce4100: Remove duplicated include from ce4100.c
  x86/headers: Include spinlock_types.h in x8664_ksyms_64.c for missing spinlock_t
  x86/platform: Delete extraneous MODULE_* tags fromm ts5500
  x86: Audit and remove any remaining unnecessary uses of module.h
  x86/kvm: Audit and remove any unnecessary uses of module.h
  x86/xen: Audit and remove any unnecessary uses of module.h
  x86/platform: Audit and remove any unnecessary uses of module.h
  x86/lib: Audit and remove any unnecessary uses of module.h
  x86/kernel: Audit and remove any unnecessary uses of module.h
  x86/mm: Audit and remove any unnecessary uses of module.h
  x86: Don't use module.h just for AUTHOR / LICENSE tags
2016-08-01 14:23:42 -04:00
Linus Torvalds
6453dbdda3 Power management material for v4.8-rc1
- Rework the cpufreq governor interface to make it more straightforward
    and modify the conservative governor to avoid using transition
    notifications (Rafael Wysocki).
 
  - Rework the handling of frequency tables by the cpufreq core to make
    it more efficient (Viresh Kumar).
 
  - Modify the schedutil governor to reduce the number of wakeups it
    causes to occur in cases when the CPU frequency doesn't need to be
    changed (Steve Muckle, Viresh Kumar).
 
  - Fix some minor issues and clean up code in the cpufreq core and
    governors (Rafael Wysocki, Viresh Kumar).
 
  - Add Intel Broxton support to the intel_pstate driver (Srinivas
    Pandruvada).
 
  - Fix problems related to the config TDP feature and to the validity
    of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
    Srinivas Pandruvada).
 
  - Make intel_pstate update the cpu_frequency tracepoint even if
    the frequency doesn't change to avoid confusing powertop (Rafael
    Wysocki).
 
  - Clean up the usage of __init/__initdata in intel_pstate, mark some
    of its internal variables as __read_mostly and drop an unused
    structure element from it (Jisheng Zhang, Carsten Emde).
 
  - Clean up the usage of some duplicate MSR symbols in intel_pstate
    and turbostat (Srinivas Pandruvada).
 
  - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
    Adiga, Viresh Kumar, Ben Dooks).
 
  - Fix a regression (introduced during the 4.5 cycle) in the
    pcc-cpufreq driver by reverting the problematic commit (Andreas
    Herrmann).
 
  - Add support for Intel Denverton to intel_idle, clean up Broxton
    support in it and make it explicitly non-modular (Jacob Pan,
    Jan Beulich, Paul Gortmaker).
 
  - Add support for Denverton and Ivy Bridge server to the Intel RAPL
    power capping driver and make it more careful about the handing
    of MSRs that may not be present (Jacob Pan, Xiaolong Wang).
 
  - Fix resume from hibernation on x86-64 by making the CPU offline
    during resume avoid using MONITOR/MWAIT in the "play dead" loop
    which may lead to an inadvertent "revival" of a "dead" CPU and
    a page fault leading to a kernel crash from it (Rafael Wysocki).
 
  - Make memory management during resume from hibernation more
    straightforward (Rafael Wysocki).
 
  - Add debug features that should help to detect problems related
    to hibernation and resume from it (Rafael Wysocki, Chen Yu).
 
  - Clean up hibernation core somewhat (Rafael Wysocki).
 
  - Prevent KASAN from instrumenting the hibernation core which leads
    to large numbers of false-positives from it (James Morse).
 
  - Prevent PM (hibernate and suspend) notifiers from being called
    during the cleanup phase if they have not been called during the
    corresponding preparation phase which is possible if one of the
    other notifiers returns an error at that time (Lianwei Wang).
 
  - Improve suspend-related debug printout in the tasks freezer and
    clean up suspend-related console handling (Roger Lu, Borislav
    Petkov).
 
  - Update the AnalyzeSuspend script in the kernel sources to
    version 4.2 (Todd Brandt).
 
  - Modify the generic power domains framework to make it handle
    system suspend/resume better (Ulf Hansson).
 
  - Make the runtime PM framework avoid resuming devices synchronously
    when user space changes the runtime PM settings for them and
    improve its error reporting (Rafael Wysocki, Linus Walleij).
 
  - Fix error paths in devfreq drivers (exynos, exynos-ppmu, exynos-bus)
    and in the core, make some devfreq code explicitly non-modular and
    change some of it into tristate (Bartlomiej Zolnierkiewicz,
    Peter Chen, Paul Gortmaker).
 
  - Add DT support to the generic PM clocks management code and make
    it export some more symbols (Jon Hunter, Paul Gortmaker).
 
  - Make the PCI PM core code slightly more robust against possible
    driver errors (Andy Shevchenko).
 
  - Make it possible to change DESTDIR and PREFIX in turbostat
    (Andy Shevchenko).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJXl7/dAAoJEILEb/54YlRx+VgQAIQJOWvxKew3Yl02c/sdj9OT
 5VNnFrzGzdcAPofvvG9qGq8B0Es1vYehJpwwOB21ri8EvYv0riIiU1yrqslObojQ
 oaZOkSBpbIoKjGR4CpYA/A+feE+8EqIBdPGd+lx5a6oRdUi7tRVHBG9lyLO3FB/i
 jan1q8dMpZsmu+Y+rVVHGnCVuIlIEqr2ZnZfCwDAulO2Arp/QFAh4kH08ELATvrl
 bkPa25vq7/VMP/vCDzrfZKD5mUuKogIRu/J5wx4py1nE+FB35cKKyqBOgklLwAeY
 UI8vjDhr/myNUs54AZlktOkq47TCYvjvhX9kmOxBjuWqFbRusU012IRek1fYPRIV
 ZqbkqNX7UEVQwunAEg9AyFwyzEtOht93dQDT5RLEd4QzKuM76gmHpLeTGGMzE+nu
 FnmF9JGl4DVwqpZl9yU2+hR2Mt3bP8OF8qYmNiGUB3KO4emPslhSd+6y8liA5Bx2
 SJf0Gb//vaHCh3/uMnwAonYPqRkZvBLOMwuL1VUjNQfRMnQtDdgHMYB1aT/EglPA
 8ww6j4J8rVRLAxvYQ3UEmNA/vBNclKXblRR18+JddEZP9/oX0ATfwnCCUpr839uk
 xxyQhrm4/AI60+PHWCX4GG80YrKdOGTkF7LXCQZanVWjjuyF17rufegZ2YWLT07v
 JU1Cmumfdy2jJluT8xsR
 =uVGz
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael  Wysocki:
 "Again, the majority of changes go into the cpufreq subsystem, but
  there are no big features this time.  The cpufreq changes that stand
  out somewhat are the governor interface rework and improvements
  related to the handling of frequency tables.  Apart from those, there
  are fixes and new device/CPU IDs in drivers, cleanups and an
  improvement of the new schedutil governor.

  Next, there are some changes in the hibernation core, including a fix
  for a nasty problem related to the MONITOR/MWAIT usage by CPU offline
  during resume from hibernation, a few core improvements related to
  memory management during resume, a couple of additional debug features
  and cleanups.

  Finally, we have some fixes and cleanups in the devfreq subsystem,
  generic power domains framework improvements related to system
  suspend/resume, support for some new chips in intel_idle and in the
  power capping RAPL driver, a new version of the AnalyzeSuspend utility
  and some assorted fixes and cleanups.

  Specifics:

   - Rework the cpufreq governor interface to make it more
     straightforward and modify the conservative governor to avoid using
     transition notifications (Rafael Wysocki).

   - Rework the handling of frequency tables by the cpufreq core to make
     it more efficient (Viresh Kumar).

   - Modify the schedutil governor to reduce the number of wakeups it
     causes to occur in cases when the CPU frequency doesn't need to be
     changed (Steve Muckle, Viresh Kumar).

   - Fix some minor issues and clean up code in the cpufreq core and
     governors (Rafael Wysocki, Viresh Kumar).

   - Add Intel Broxton support to the intel_pstate driver (Srinivas
     Pandruvada).

   - Fix problems related to the config TDP feature and to the validity
     of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
     Srinivas Pandruvada).

   - Make intel_pstate update the cpu_frequency tracepoint even if the
     frequency doesn't change to avoid confusing powertop (Rafael
     Wysocki).

   - Clean up the usage of __init/__initdata in intel_pstate, mark some
     of its internal variables as __read_mostly and drop an unused
     structure element from it (Jisheng Zhang, Carsten Emde).

   - Clean up the usage of some duplicate MSR symbols in intel_pstate
     and turbostat (Srinivas Pandruvada).

   - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
     Adiga, Viresh Kumar, Ben Dooks).

   - Fix a regression (introduced during the 4.5 cycle) in the
     pcc-cpufreq driver by reverting the problematic commit (Andreas
     Herrmann).

   - Add support for Intel Denverton to intel_idle, clean up Broxton
     support in it and make it explicitly non-modular (Jacob Pan, Jan
     Beulich, Paul Gortmaker).

   - Add support for Denverton and Ivy Bridge server to the Intel RAPL
     power capping driver and make it more careful about the handing of
     MSRs that may not be present (Jacob Pan, Xiaolong Wang).

   - Fix resume from hibernation on x86-64 by making the CPU offline
     during resume avoid using MONITOR/MWAIT in the "play dead" loop
     which may lead to an inadvertent "revival" of a "dead" CPU and a
     page fault leading to a kernel crash from it (Rafael Wysocki).

   - Make memory management during resume from hibernation more
     straightforward (Rafael Wysocki).

   - Add debug features that should help to detect problems related to
     hibernation and resume from it (Rafael Wysocki, Chen Yu).

   - Clean up hibernation core somewhat (Rafael Wysocki).

   - Prevent KASAN from instrumenting the hibernation core which leads
     to large numbers of false-positives from it (James Morse).

   - Prevent PM (hibernate and suspend) notifiers from being called
     during the cleanup phase if they have not been called during the
     corresponding preparation phase which is possible if one of the
     other notifiers returns an error at that time (Lianwei Wang).

   - Improve suspend-related debug printout in the tasks freezer and
     clean up suspend-related console handling (Roger Lu, Borislav
     Petkov).

   - Update the AnalyzeSuspend script in the kernel sources to version
     4.2 (Todd Brandt).

   - Modify the generic power domains framework to make it handle system
     suspend/resume better (Ulf Hansson).

   - Make the runtime PM framework avoid resuming devices synchronously
     when user space changes the runtime PM settings for them and
     improve its error reporting (Rafael Wysocki, Linus Walleij).

   - Fix error paths in devfreq drivers (exynos, exynos-ppmu,
     exynos-bus) and in the core, make some devfreq code explicitly
     non-modular and change some of it into tristate (Bartlomiej
     Zolnierkiewicz, Peter Chen, Paul Gortmaker).

   - Add DT support to the generic PM clocks management code and make it
     export some more symbols (Jon Hunter, Paul Gortmaker).

   - Make the PCI PM core code slightly more robust against possible
     driver errors (Andy Shevchenko).

   - Make it possible to change DESTDIR and PREFIX in turbostat (Andy
     Shevchenko)"

* tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
  Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
  PM / hibernate: Introduce test_resume mode for hibernation
  cpufreq: export cpufreq_driver_resolve_freq()
  cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
  PCI / PM: check all fields in pci_set_platform_pm()
  cpufreq: acpi-cpufreq: use cached frequency mapping when possible
  cpufreq: schedutil: map raw required frequency to driver frequency
  cpufreq: add cpufreq_driver_resolve_freq()
  cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
  intel_pstate: Update cpu_frequency tracepoint every time
  cpufreq: intel_pstate: clean remnant struct element
  PM / tools: scripts: AnalyzeSuspend v4.2
  x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
  cpufreq: powernv: Replacing pstate_id with frequency table index
  intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
  PM / hibernate: Image data protection during restoration
  PM / hibernate: Add missing braces in __register_nosave_region()
  PM / hibernate: Clean up comments in snapshot.c
  PM / hibernate: Clean up function headers in snapshot.c
  PM / hibernate: Add missing braces in hibernate_setup()
  ...
2016-07-26 17:29:07 -07:00
Linus Torvalds
0f657262d5 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Various x86 low level modifications:

   - preparatory work to support virtually mapped kernel stacks (Andy
     Lutomirski)

   - support for 64-bit __get_user() on 32-bit kernels (Benjamin
     LaHaise)

   - (involved) workaround for Knights Landing CPU erratum (Dave Hansen)

   - MPX enhancements (Dave Hansen)

   - mremap() extension to allow remapping of the special VDSO vma, for
     purposes of user level context save/restore (Dmitry Safonov)

   - hweight and entry code cleanups (Borislav Petkov)

   - bitops code generation optimizations and cleanups with modern GCC
     (H. Peter Anvin)

   - syscall entry code optimizations (Paolo Bonzini)"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  x86/mm/cpa: Add missing comment in populate_pdg()
  x86/mm/cpa: Fix populate_pgd(): Stop trying to deallocate failed PUDs
  x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2
  x86/smp: Remove unnecessary initialization of thread_info::cpu
  x86/smp: Remove stack_smp_processor_id()
  x86/uaccess: Move thread_info::addr_limit to thread_struct
  x86/dumpstack: Rename thread_struct::sig_on_uaccess_error to sig_on_uaccess_err
  x86/uaccess: Move thread_info::uaccess_err and thread_info::sig_on_uaccess_err to thread_struct
  x86/dumpstack: When OOPSing, rewind the stack before do_exit()
  x86/mm/64: In vmalloc_fault(), use CR3 instead of current->active_mm
  x86/dumpstack/64: Handle faults when printing the "Stack: " part of an OOPS
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
  x86/mm/cpa: In populate_pgd(), don't set the PGD entry until it's populated
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  x86/mm: Use pte_none() to test for empty PTE
  x86/mm: Disallow running with 32-bit PTEs to work around erratum
  x86/mm: Ignore A/D bits in pte/pmd/pud_none()
  x86/mm: Move swap offset/type up in PTE to work around erratum
  x86/entry: Inline enter_from_user_mode()
  ...
2016-07-25 15:34:18 -07:00
Rafael J. Wysocki
406f992e4a x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
On Intel hardware, native_play_dead() uses mwait_play_dead() by
default and only falls back to the other methods if that fails.
That also happens during resume from hibernation, when the restore
(boot) kernel runs disable_nonboot_cpus() to take all of the CPUs
except for the boot one offline.

However, that is problematic, because the address passed to
__monitor() in mwait_play_dead() is likely to be written to in the
last phase of hibernate image restoration and that causes the "dead"
CPU to start executing instructions again.  Unfortunately, the page
containing the address in that CPU's instruction pointer may not be
valid any more at that point.

First, that page may have been overwritten with image kernel memory
contents already, so the instructions the CPU attempts to execute may
simply be invalid.  Second, the page tables previously used by that
CPU may have been overwritten by image kernel memory contents, so the
address in its instruction pointer is impossible to resolve then.

A report from Varun Koyyalagunta and investigation carried out by
Chen Yu show that the latter sometimes happens in practice.

To prevent it from happening, temporarily change the smp_ops.play_dead
pointer during resume from hibernation so that it points to a special
"play dead" routine which uses hlt_play_dead() and avoids the
inadvertent "revivals" of "dead" CPUs this way.

A slightly unpleasant consequence of this change is that if the
system is hibernated with one or more CPUs offline, it will generally
draw more power after resume than it did before hibernation, because
the physical state entered by CPUs via hlt_play_dead() is higher-power
than the mwait_play_dead() one in the majority of cases.  It is
possible to work around this, but it is unclear how much of a problem
that's going to be in practice, so the workaround will be implemented
later if it turns out to be necessary.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=106371
Reported-by: Varun Koyyalagunta <cpudebug@centtech.com>
Original-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 22:42:48 +02:00
Andy Lutomirski
eb43e8f85f x86/smp: Remove unnecessary initialization of thread_info::cpu
It's statically initialized to zero -- no need to dynamically
initialize it to zero as well.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/6cf6314dce3051371a913ee19d1b88e29c68c560.1468527351.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:26:31 +02:00
Paul Gortmaker
186f43608a x86/kernel: Audit and remove any unnecessary uses of module.h
Historically a lot of these existed because we did not have
a distinction between what was modular code and what was providing
support to modules via EXPORT_SYMBOL and friends.  That changed
when we forked out support for the latter into the export.h file.

This means we should be able to reduce the usage of module.h
in code that is obj-y Makefile or bool Kconfig.  The advantage
in doing so is that module.h itself sources about 15 other headers;
adding significantly to what we feed cpp, and it can obscure what
headers we are effectively using.

Since module.h was the source for init.h (for __init) and for
export.h (for EXPORT_SYMBOL) we consider each obj-y/bool instance
for the presence of either and replace as needed.  Build testing
revealed some implicit header usage that was fixed up accordingly.

Note that some bool/obj-y instances remain since module.h is
the header for some exception table entry stuff, and for things
like __init_or_module (code that is tossed when MODULES=n).

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160714001901.31603-4-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 15:06:41 +02:00
Andi Kleen
70b8301f6b x86/topology: Add topology_max_smt_threads()
For SMT specific workarounds it is useful to know if SMT is active
on any online CPU in the system. This currently requires a loop
over all online CPUs.

Add a global variable that is updated with the maximum number
of smt threads on any CPU on online/offline, and use it for
topology_max_smt_threads()

The single call is easier to use than a loop.

Not exported to user space because user space already can use
the existing sibling interfaces to find this out.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Cc: jolsa@kernel.org
Link: http://lkml.kernel.org/r/1463703002-19686-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:41:21 +02:00
Linus Torvalds
168f1a7163 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
 "The main changes in this cycle were:

   - MSR access API fixes and enhancements (Andy Lutomirski)

   - early exception handling improvements (Andy Lutomirski)

   - user-space FS/GS prctl usage fixes and improvements (Andy
     Lutomirski)

   - Remove the cpu_has_*() APIs and replace them with equivalents
     (Borislav Petkov)

   - task switch micro-optimization (Brian Gerst)

   - 32-bit entry code simplification (Denys Vlasenko)

   - enhance PAT handling in enumated CPUs (Toshi Kani)

  ... and lots of other cleanups/fixlets"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  x86/arch_prctl/64: Restore accidentally removed put_cpu() in ARCH_SET_GS
  x86/entry/32: Remove asmlinkage_protect()
  x86/entry/32: Remove GET_THREAD_INFO() from entry code
  x86/entry, sched/x86: Don't save/restore EFLAGS on task switch
  x86/asm/entry/32: Simplify pushes of zeroed pt_regs->REGs
  selftests/x86/ldt_gdt: Test set_thread_area() deletion of an active segment
  x86/tls: Synchronize segment registers in set_thread_area()
  x86/asm/64: Rename thread_struct's fs and gs to fsbase and gsbase
  x86/arch_prctl/64: Remove FSBASE/GSBASE < 4G optimization
  x86/segments/64: When load_gs_index fails, clear the base
  x86/segments/64: When loadsegment(fs, ...) fails, clear the base
  x86/asm: Make asm/alternative.h safe from assembly
  x86/asm: Stop depending on ptrace.h in alternative.h
  x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()
  x86/asm: Make sure verify_cpu() has a good stack
  x86/extable: Add a comment about early exception handlers
  x86/msr: Set the return value to zero when native_rdmsr_safe() fails
  x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y
  x86/paravirt: Add paravirt_{read,write}_msr()
  x86/msr: Carry on after a non-"safe" MSR access fails
  ...
2016-05-16 15:15:17 -07:00
Thomas Gleixner
56402d63ee x86/topology: Handle CPUID bogosity gracefully
Joseph reported that a XEN guest dies with a division by 0 in the package
topology setup code. This happens if cpu_info.x86_max_cores is zero.

Handle that case and emit a warning. This does not fix the underlying XEN bug,
but makes the code more robust.

Reported-and-tested-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1605062046270.3540@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-05-07 10:06:55 +02:00
Borislav Petkov
93984fbd4e x86/cpufeature: Replace cpu_has_apic with boot_cpu_has() usage
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: iommu@lists.linux-foundation.org
Cc: linux-pm@vger.kernel.org
Cc: oprofile-list@lists.sf.net
Link: http://lkml.kernel.org/r/1459801503-15600-8-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13 11:37:41 +02:00
Borislav Petkov
8196dab4fc x86/cpu: Get rid of compute_unit_id
It is cpu_core_id anyway.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1458917557-8757-3-git-send-email-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-29 10:45:04 +02:00
Thomas Gleixner
3e8db2246b x86/topology: Use total_cpus not nr_cpu_ids for logical packages
nr_cpu_ids can be limited on the command line via nr_cpus=. That can break the
logical package management because it results in a smaller number of packages,
but the cpus to online are occupying the full package space as the hyper
threads are enumerated after the physical cores typically.

total_cpus is the real possible cpu space not limited by nr_cpus command line
and gives us the proper number of packages.

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Fixes: 1f12e32f4c ("x86/topology: Create logical package id")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiong Zhou <jencce.kernel@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andreas Herrmann <aherrmann@suse.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603181254330.3978@nanos
2016-03-19 10:26:40 +01:00
Peter Zijlstra
63d1e995be x86/topology: Fix Intel HT disable
As per the comment in the code; due to BIOS it is sometimes impossible to know
if there actually are smp siblings until the machine is fully enumerated. So
we rather overestimate the number of possible packages.

Fixes: 1f12e32f4c ("x86/topology: Create logical package id")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: aherrmann@suse.com
Cc: jencce.kernel@gmail.com
Cc: bp@alien8.de
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Link: http://lkml.kernel.org/r/20160318150538.611014173@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-19 10:26:40 +01:00
Peter Zijlstra
b5d5f27d93 x86/topology: Fix logical package mapping
That first branch testing pkg against __max_logical_packages is wrong,
because if the first pkg id is larger, then the find_first_zero will
find us logical package id 0. However, if the second pkg id is indeed
0, we'll again claim it without testing if it was already taken.

Also, it fails to print the mapping.

Fixes: 1f12e32f4c ("x86/topology: Create logical package id")
Reported-by: Xiong Zhou <jencce.kernel@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: aherrmann@suse.com
Cc: bp@alien8.de
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Link: http://lkml.kernel.org/r/20160317095220.GO6344@twins.programming.kicks-ass.net
Link: http://lkml.kernel.org/r/20160318150538.482393396@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-19 10:26:40 +01:00
Linus Torvalds
710d60cbf1 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu hotplug updates from Thomas Gleixner:
 "This is the first part of the ongoing cpu hotplug rework:

   - Initial implementation of the state machine

   - Runs all online and prepare down callbacks on the plugged cpu and
     not on some random processor

   - Replaces busy loop waiting with completions

   - Adds tracepoints so the states can be followed"

More detailed commentary on this work from an earlier email:
 "What's wrong with the current cpu hotplug infrastructure?

   - Asymmetry

     The hotplug notifier mechanism is asymmetric versus the bringup and
     teardown.  This is mostly caused by the notifier mechanism.

   - Largely undocumented dependencies

     While some notifiers use explicitely defined notifier priorities,
     we have quite some notifiers which use numerical priorities to
     express dependencies without any documentation why.

   - Control processor driven

     Most of the bringup/teardown of a cpu is driven by a control
     processor.  While it is understandable, that preperatory steps,
     like idle thread creation, memory allocation for and initialization
     of essential facilities needs to be done before a cpu can boot,
     there is no reason why everything else must run on a control
     processor.  Before this patch series, bringup looks like this:

       Control CPU                     Booting CPU

       do preparatory steps
       kick cpu into life

                                       do low level init

       sync with booting cpu           sync with control cpu

       bring the rest up

   - All or nothing approach

     There is no way to do partial bringups.  That's something which is
     really desired because we waste e.g.  at boot substantial amount of
     time just busy waiting that the cpu comes to life.  That's stupid
     as we could very well do preparatory steps and the initial IPI for
     other cpus and then go back and do the necessary low level
     synchronization with the freshly booted cpu.

   - Minimal debuggability

     Due to the notifier based design, it's impossible to switch between
     two stages of the bringup/teardown back and forth in order to test
     the correctness.  So in many hotplug notifiers the cancel
     mechanisms are either not existant or completely untested.

   - Notifier [un]registering is tedious

     To [un]register notifiers we need to protect against hotplug at
     every callsite.  There is no mechanism that bringup/teardown
     callbacks are issued on the online cpus, so every caller needs to
     do it itself.  That also includes error rollback.

  What's the new design?

     The base of the new design is a symmetric state machine, where both
     the control processor and the booting/dying cpu execute a well
     defined set of states.  Each state is symmetric in the end, except
     for some well defined exceptions, and the bringup/teardown can be
     stopped and reversed at almost all states.

     So the bringup of a cpu will look like this in the future:

       Control CPU                     Booting CPU

       do preparatory steps
       kick cpu into life

                                       do low level init

       sync with booting cpu           sync with control cpu

                                       bring itself up

     The synchronization step does not require the control cpu to wait.
     That mechanism can be done asynchronously via a worker or some
     other mechanism.

     The teardown can be made very similar, so that the dying cpu cleans
     up and brings itself down.  Cleanups which need to be done after
     the cpu is gone, can be scheduled asynchronously as well.

  There is a long way to this, as we need to refactor the notion when a
  cpu is available.  Today we set the cpu online right after it comes
  out of the low level bringup, which is not really correct.

  The proper mechanism is to set it to available, i.e. cpu local
  threads, like softirqd, hotplug thread etc. can be scheduled on that
  cpu, and once it finished all booting steps, it's set to online, so
  general workloads can be scheduled on it.  The reverse happens on
  teardown.  First thing to do is to forbid scheduling of general
  workloads, then teardown all the per cpu resources and finally shut it
  off completely.

  This patch series implements the basic infrastructure for this at the
  core level.  This includes the following:

   - Basic state machine implementation with well defined states, so
     ordering and prioritization can be expressed.

   - Interfaces to [un]register state callbacks

     This invokes the bringup/teardown callback on all online cpus with
     the proper protection in place and [un]installs the callbacks in
     the state machine array.

     For callbacks which have no particular ordering requirement we have
     a dynamic state space, so that drivers don't have to register an
     explicit hotplug state.

     If a callback fails, the code automatically does a rollback to the
     previous state.

   - Sysfs interface to drive the state machine to a particular step.

     This is only partially functional today.  Full functionality and
     therefor testability will be achieved once we converted all
     existing hotplug notifiers over to the new scheme.

   - Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
     processor:

       Control CPU                     Booting CPU

       do preparatory steps
       kick cpu into life

                                       do low level init

       sync with booting cpu           sync with control cpu
       wait for boot
                                       bring itself up

                                       Signal completion to control cpu

     In a previous step of this work we've done a full tree mechanical
     conversion of all hotplug notifiers to the new scheme.  The balance
     is a net removal of about 4000 lines of code.

     This is not included in this series, as we decided to take a
     different approach.  Instead of mechanically converting everything
     over, we will do a proper overhaul of the usage sites one by one so
     they nicely fit into the symmetric callback scheme.

     I decided to do that after I looked at the ugliness of some of the
     converted sites and figured out that their hotplug mechanism is
     completely buggered anyway.  So there is no point to do a
     mechanical conversion first as we need to go through the usage
     sites one by one again in order to achieve a full symmetric and
     testable behaviour"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  cpu/hotplug: Document states better
  cpu/hotplug: Fix smpboot thread ordering
  cpu/hotplug: Remove redundant state check
  cpu/hotplug: Plug death reporting race
  rcu: Make CPU_DYING_IDLE an explicit call
  cpu/hotplug: Make wait for dead cpu completion based
  cpu/hotplug: Let upcoming cpu bring itself fully up
  arch/hotplug: Call into idle with a proper state
  cpu/hotplug: Move online calls to hotplugged cpu
  cpu/hotplug: Create hotplug threads
  cpu/hotplug: Split out the state walk into functions
  cpu/hotplug: Unpark smpboot threads from the state machine
  cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
  cpu/hotplug: Implement setup/removal interface
  cpu/hotplug: Make target state writeable
  cpu/hotplug: Add sysfs state interface
  cpu/hotplug: Hand in target state to _cpu_up/down
  cpu/hotplug: Convert the hotplugged cpu work to a state machine
  cpu/hotplug: Convert to a state machine for the control processor
  cpu/hotplug: Add tracepoints
  ...
2016-03-15 13:50:29 -07:00
Thomas Gleixner
fc6d73d674 arch/hotplug: Call into idle with a proper state
Let the non boot cpus call into idle with the corresponding hotplug state, so
the hotplug core can handle the further bringup. That's a first step to
convert the boot side of the hotplugged cpus to do all the synchronization
with the other side through the state machine. For now it'll only start the
hotplug thread and kick the full bringup of the cpu.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Rik van Riel <riel@redhat.com>
Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20160226182341.614102639@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-01 20:36:57 +01:00
Thomas Gleixner
1f12e32f4c x86/topology: Create logical package id
For per package oriented services we must be able to rely on the number of CPU
packages to be within bounds. Create a tracking facility, which

- calculates the number of possible packages depending on nr_cpu_ids after boot

- makes sure that the package id is within the number of possible packages. If
  the apic id is outside we map it to a logical package id if there is enough
  space available.

Provide interfaces for drivers to query the mapping and do translations from
physcial to logical ids.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Harish Chegondi <harish.chegondi@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20160222221011.541071755@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:35:18 +01:00
Borislav Petkov
362f924b64 x86/cpufeature: Remove unused and seldomly used cpu_has_xx macros
Those are stupid and code should use static_cpu_has_safe() or
boot_cpu_has() instead. Kill the least used and unused ones.

The remaining ones need more careful inspection before a conversion can
happen. On the TODO.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1449481182-27541-4-git-send-email-bp@alien8.de
Cc: David Sterba <dsterba@suse.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19 11:49:55 +01:00
Thomas Gleixner
0fa85119cd Merge branch 'linus' into x86/cleanups
Pull in upstream changes so we can apply depending patches.
2015-12-19 11:49:13 +01:00
Len Brown
656279a1f3 x86 smpboot: Re-enable init_udelay=0 by default on modern CPUs
commit f1ccd24931 allowed the cmdline "cpu_init_udelay=" to work
with all values, including the default of 10000.

But in setting the default of 10000, it over-rode the code that sets
the delay 0 on modern processors.

Also, tidy up use of INT/UINT.

Fixes: f1ccd24931 "x86/smpboot: Fix cpu_init_udelay=10000 corner case boot parameter misbehavior"
Reported-by: Shane <shrybman@teksavvy.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: dparsons@brightdsl.net
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/9082eb809ef40dad02db714759c7aaf618c518d4.1448232494.git.len.brown@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-25 23:17:48 +01:00
Juergen Gross
4609586592 x86/paravirt: Remove unused pv_apic_ops structure
The only member of that structure is startup_ipi_hook which is always
set to paravirt_nop.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: jeremy@goop.org
Cc: chrisw@sous-sol.org
Cc: akataria@vmware.com
Cc: rusty@rustcorp.com.au
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xen.org
Cc: konrad.wilk@oracle.com
Cc: boris.ostrovsky@oracle.com
Link: http://lkml.kernel.org/r/1447767872-16730-1-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-11-19 11:03:58 +01:00
Len Brown
fcafddec4e x86/smpboot: Fix CPU #1 boot timeout
The following commit:

  a9bcaa02a5 ("x86/smpboot: Remove SIPI delays from cpu_up()")

Caused some Intel Core2 processors to time-out when bringing up CPU #1,
resulting in the missing of that CPU after bootup.

That patch reduced the SIPI delays from udelay() 300, 200 to udelay() 0,
0 on modern processors.

Several Intel(R) Core(TM)2 systems failed to bring up CPU #1 10/10 times
after that change.

Increasing either of the SIPI delays to udelay(1) results in
success. So here we increase both to udelay(10).  While this may
be 20x slower than the absolute minimum, it is still 20x to 30x
faster than the original code.

Tested-by: Donald Parsons <dparsons@brightdsl.net>
Tested-by: Shane <shrybman@teksavvy.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dparsons@brightdsl.net
Cc: shrybman@teksavvy.com
Link: http://lkml.kernel.org/r/6dd554ee8945984d85aafb2ad35793174d068af0.1444968087.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19 09:14:41 +02:00
Len Brown
f1ccd24931 x86/smpboot: Fix cpu_init_udelay=10000 corner case boot parameter misbehavior
For legacy machines cpu_init_udelay defaults to 10,000.
For modern machines it is set to 0.

The user should be able to set cpu_init_udelay to
any value on the cmdline, including 10,000.

Before this patch, that was seen as "unchanged from default"
and thus on a modern machine, the user request was ignored
and the delay was set to 0.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dparsons@brightdsl.net
Cc: shrybman@teksavvy.com
Link: http://lkml.kernel.org/r/de363cdbbcfcca1d22569683f7eb9873e0177251.1444968087.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19 09:14:41 +02:00
Linus Torvalds
0c0fee018d Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 init code fixlet from Ingo Molnar:
 "A single change: fix obsolete init code annotations"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Drop bogus __ref / __refdata annotations
2015-09-01 09:33:26 -07:00
Len Brown
656bba3068 x86/smpboot: Remove APIC.wait_for_init_deassert and atomic init_deasserted
Both the per-APIC flag ".wait_for_init_deassert",
and the global atomic_t "init_deasserted"
are dead code -- remove them.

For all APIC types, "wait_for_master()"
prevents an AP from proceeding until the BSP has set
cpu_callout_mask, making "init_deasserted" {unnecessary}:

	BSP: <de-assert INIT>
	...
	BSP: {set init_deasserted}
	AP: wait_for_master()
		set cpu_initialized_mask
		wait for cpu_callout_mask
	BSP: test cpu_initialized_mask
	BSP: set cpu_callout_mask
	AP: test cpu_callout_mask
	AP: {wait for init_deasserted}
	...
	AP: <touch APIC>

Deleting the {dead code} above is necessary to enable
some parallelism in a future patch.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/de4b3a9bab894735e285870b5296da25ee6a8a5a.1439739165.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-17 10:42:28 +02:00
Len Brown
a9bcaa02a5 x86/smpboot: Remove SIPI delays from cpu_up()
MPS 1.4 example code shows the following required delays during processor
on-lining:

	INIT
	 udelay(10,000)
	SIPI
	 udelay(200)
	SIPI
	 udelay(200) /* Linux actually implements this as udelay(300) */

Linux skips the udelay(10,000) on modern processors.
This patch removes the udelay(200) after each SIPI
on those same processors.

All three legacy delays can be restored by the cmdline
"cpu_init_udelay=10000".

As measured by analyze_suspend.py, this patch speeds
processor resume time on my desktop from 2.4ms to 1.8ms, per AP.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/a5dfdbc8fbfdd813784da204aad5677fe459ac37.1439739165.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-17 10:42:27 +02:00
Len Brown
2d99af8e8f x86/smpboot: Remove udelay(100) when polling cpu_callin_map
After the BSP sends INIT/SIPI/SIP to the AP and sees the AP
in the cpu_initialized_map, it sets the AP loose via the
cpu_callout_map, and waits for it via the cpu_callin_map.

The BSP polls the cpu_callin_map with a udelay(100)
and a schedule() in each iteration.

The udelay(100) adds no value.

For example, on my 4-CPU dekstop, the AP finishes
cpu_callin() in under 70 usec and sets the cpu_callin_mask.
The BSP, however, doesn't see that setting until over 30 usec
later, because it was still running its udelay(100)
when the AP finished.

Deleting the udelay(100) in the cpu_callin_mask polling loop,
saves from 0 to 100 usec per Application Processor.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/0aade12eabeb89a688c929fe80856eaea0544bb7.1439739165.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-17 10:42:27 +02:00
Len Brown
6e38f1e79d x86/smpboot: Remove udelay(100) when polling cpu_initialized_map
After the BSP sends the APIC INIT/SIPI/SIPI to the AP,
it waits for the AP to come up and indicate that it is alive
by setting its own bit in the cpu_initialized_mask.

Linux polls for up to 10 seconds for this to happen.
Each polling loop has a udelay(100) and a call to schedule().

The udelay(100) adds no value.

For example, on my desktop, the BSP waits for the
other 3 CPUs to come on line at boot for 305, 404, 405 usec.
For resume from S3, it waits 317, 404, 405 usec.

But when the udelay(100) is removed, the BSP waits
305, 310, 306 for boot, and 305, 307, 306 for resume.

So for both boot and resume, removing the udelay(100)
speeds online by about 100us in 2 of 3 cases.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/33ef746c67d2489cad0a9b1958cf71167232ff2b.1439739165.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-17 10:42:27 +02:00
Mathias Krause
4daa832d99 x86: Drop bogus __ref / __refdata annotations
The __ref / __refdata annotations used to be needed because of
referencing functions / variables annotated __cpuinit /
__cpuinitdata.

But those annotations vanished during the development of v3.11.

Therefore most of the __ref / __refdata annotations are not needed
anymore. As they may hide legitimate sections mismatches, we
better get rid of them.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1437409973-8927-1-git-send-email-minipli@googlemail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-20 18:57:20 +02:00
Thomas Gleixner
ce0d3c0a6f genirq: Revert sparse irq locking around __cpu_up() and move it to x86 for now
Boris reported that the sparse_irq protection around __cpu_up() in the
generic code causes a regression on Xen. Xen allocates interrupts and
some more in the xen_cpu_up() function, so it deadlocks on the
sparse_irq_lock.

There is no simple fix for this and we really should have the
protection for all architectures, but for now the only solution is to
move it to x86 where actual wreckage due to the lack of protection has
been observed.

Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Fixes: a899418167 'hotplug: Prevent alloc/free of irq descriptors during cpu up/down'
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: xiao jin <jin.xiao@intel.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
2015-07-15 10:39:17 +02:00
Thomas Gleixner
5a3f75e3f0 x86/irq: Plug irq vector hotplug race
Jin debugged a nasty cpu hotplug race which results in leaking a irq
vector on the newly hotplugged cpu.

cpu N				cpu M
native_cpu_up                   device_shutdown
  do_boot_cpu			  free_msi_irqs
  start_secondary                   arch_teardown_msi_irqs
    smp_callin                        default_teardown_msi_irqs
       setup_vector_irq                  arch_teardown_msi_irq
        __setup_vector_irq		   native_teardown_msi_irq
          lock(vector_lock)		     destroy_irq 
          install vectors
          unlock(vector_lock)
					       lock(vector_lock)
--->                                  	       __clear_irq_vector
                                    	       unlock(vector_lock)
    lock(vector_lock)
    set_cpu_online
    unlock(vector_lock)

This leaves the irq vector(s) which are torn down on CPU M stale in
the vector array of CPU N, because CPU M does not see CPU N online
yet. There is a similar issue with concurrent newly setup interrupts.

The alloc/free protection of irq descriptors does not prevent the
above race, because it merily prevents interrupt descriptors from
going away or changing concurrently.

Prevent this by moving the call to setup_vector_irq() into the
vector_lock held region which protects set_cpu_online():

cpu N				cpu M
native_cpu_up                   device_shutdown
  do_boot_cpu			  free_msi_irqs
  start_secondary                   arch_teardown_msi_irqs
    smp_callin                        default_teardown_msi_irqs
       lock(vector_lock)                arch_teardown_msi_irq
       setup_vector_irq()
        __setup_vector_irq		   native_teardown_msi_irq
          install vectors		     destroy_irq 
       set_cpu_online
       unlock(vector_lock)
					       lock(vector_lock)
                                  	       __clear_irq_vector
                                    	       unlock(vector_lock)

So cpu M either sees the cpu N online before clearing the vector or
cpu N installs the vectors after cpu M has cleared it.

Reported-by: xiao jin <jin.xiao@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Link: http://lkml.kernel.org/r/20150705171102.141898931@linutronix.de
2015-07-07 11:54:04 +02:00
Zhu Guihua
20d5e4a9cd x86/espfix: Init espfix on the boot CPU side
As we alloc pages with GFP_KERNEL in init_espfix_ap() which is
called before we enable local irqs, so the lockdep sub-system
would (correctly) trigger a warning about the potentially
blocking API.

So we allocate them on the boot CPU side when the secondary CPU is
brought up by the boot CPU, and hand them over to the secondary
CPU.

And we use alloc_pages_node() with the secondary CPU's node, to
make sure the espfix stack is NUMA-local to the CPU that is
going to use it.

Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Cc: <bp@alien8.de>
Cc: <luto@amacapital.net>
Cc: <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c97add2670e9abebb90095369f0cfc172373ac94.1435824469.git.zhugh.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-06 15:00:34 +02:00
Zhu Guihua
1db875631f x86/espfix: Add 'cpu' parameter to init_espfix_ap()
Add a CPU index parameter to init_espfix_ap(), so that the
parameter could be propagated to the function for espfix
page allocation.

Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Cc: <bp@alien8.de>
Cc: <luto@amacapital.net>
Cc: <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/cde3fcf1b3211f3f03feb1a995bce3fee850f0fc.1435824469.git.zhugh.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-06 15:00:33 +02:00
Linus Torvalds
d70b3ef54c Merge branch 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Ingo Molnar:
 "There were so many changes in the x86/asm, x86/apic and x86/mm topics
  in this cycle that the topical separation of -tip broke down somewhat -
  so the result is a more traditional architecture pull request,
  collected into the 'x86/core' topic.

  The topics were still maintained separately as far as possible, so
  bisectability and conceptual separation should still be pretty good -
  but there were a handful of merge points to avoid excessive
  dependencies (and conflicts) that would have been poorly tested in the
  end.

  The next cycle will hopefully be much more quiet (or at least will
  have fewer dependencies).

  The main changes in this cycle were:

   * x86/apic changes, with related IRQ core changes: (Jiang Liu, Thomas
     Gleixner)

     - This is the second and most intrusive part of changes to the x86
       interrupt handling - full conversion to hierarchical interrupt
       domains:

          [IOAPIC domain]   -----
                                 |
          [MSI domain]      --------[Remapping domain] ----- [ Vector domain ]
                                 |   (optional)          |
          [HPET MSI domain] -----                        |
                                                         |
          [DMAR domain]     -----------------------------
                                                         |
          [Legacy domain]   -----------------------------

       This now reflects the actual hardware and allowed us to distangle
       the domain specific code from the underlying parent domain, which
       can be optional in the case of interrupt remapping.  It's a clear
       separation of functionality and removes quite some duct tape
       constructs which plugged the remap code between ioapic/msi/hpet
       and the vector management.

     - Intel IOMMU IRQ remapping enhancements, to allow direct interrupt
       injection into guests (Feng Wu)

   * x86/asm changes:

     - Tons of cleanups and small speedups, micro-optimizations.  This
       is in preparation to move a good chunk of the low level entry
       code from assembly to C code (Denys Vlasenko, Andy Lutomirski,
       Brian Gerst)

     - Moved all system entry related code to a new home under
       arch/x86/entry/ (Ingo Molnar)

     - Removal of the fragile and ugly CFI dwarf debuginfo annotations.
       Conversion to C will reintroduce many of them - but meanwhile
       they are only getting in the way, and the upstream kernel does
       not rely on them (Ingo Molnar)

     - NOP handling refinements. (Borislav Petkov)

   * x86/mm changes:

     - Big PAT and MTRR rework: making the code more robust and
       preparing to phase out exposing direct MTRR interfaces to drivers -
       in favor of using PAT driven interfaces (Toshi Kani, Luis R
       Rodriguez, Borislav Petkov)

     - New ioremap_wt()/set_memory_wt() interfaces to support
       Write-Through cached memory mappings.  This is especially
       important for good performance on NVDIMM hardware (Toshi Kani)

   * x86/ras changes:

     - Add support for deferred errors on AMD (Aravind Gopalakrishnan)

       This is an important RAS feature which adds hardware support for
       poisoned data.  That means roughly that the hardware marks data
       which it has detected as corrupted but wasn't able to correct, as
       poisoned data and raises an APIC interrupt to signal that in the
       form of a deferred error.  It is the OS's responsibility then to
       take proper recovery action and thus prolonge system lifetime as
       far as possible.

     - Add support for Intel "Local MCE"s: upcoming CPUs will support
       CPU-local MCE interrupts, as opposed to the traditional system-
       wide broadcasted MCE interrupts (Ashok Raj)

     - Misc cleanups (Borislav Petkov)

   * x86/platform changes:

     - Intel Atom SoC updates

  ... and lots of other cleanups, fixlets and other changes - see the
  shortlog and the Git log for details"

* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (222 commits)
  x86/hpet: Use proper hpet device number for MSI allocation
  x86/hpet: Check for irq==0 when allocating hpet MSI interrupts
  x86/mm/pat, drivers/infiniband/ipath: Use arch_phys_wc_add() and require PAT disabled
  x86/mm/pat, drivers/media/ivtv: Use arch_phys_wc_add() and require PAT disabled
  x86/platform/intel/baytrail: Add comments about why we disabled HPET on Baytrail
  genirq: Prevent crash in irq_move_irq()
  genirq: Enhance irq_data_to_desc() to support hierarchy irqdomain
  iommu, x86: Properly handle posted interrupts for IOMMU hotplug
  iommu, x86: Provide irq_remapping_cap() interface
  iommu, x86: Setup Posted-Interrupts capability for Intel iommu
  iommu, x86: Add cap_pi_support() to detect VT-d PI capability
  iommu, x86: Avoid migrating VT-d posted interrupts
  iommu, x86: Save the mode (posted or remapped) of an IRTE
  iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
  iommu: dmar: Provide helper to copy shared irte fields
  iommu: dmar: Extend struct irte for VT-d Posted-Interrupts
  iommu: Add new member capability to struct irq_remap_ops
  x86/asm/entry/64: Disentangle error_entry/exit gsbase/ebx/usermode code
  x86/asm/entry/32: Shorten __audit_syscall_entry() args preparation
  x86/asm/entry/32: Explain reloading of registers after __audit_syscall_entry()
  ...
2015-06-22 17:59:09 -07:00
Linus Torvalds
e75c73ad64 Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Ingo Molnar:
 "This tree contains two main changes:

   - The big FPU code rewrite: wide reaching cleanups and reorganization
     that pulls all the FPU code together into a clean base in
     arch/x86/fpu/.

     The resulting code is leaner and faster, and much easier to
     understand.  This enables future work to further simplify the FPU
     code (such as removing lazy FPU restores).

     By its nature these changes have a substantial regression risk: FPU
     code related bugs are long lived, because races are often subtle
     and bugs mask as user-space failures that are difficult to track
     back to kernel side backs.  I'm aware of no unfixed (or even
     suspected) FPU related regression so far.

   - MPX support rework/fixes.  As this is still not a released CPU
     feature, there were some buglets in the code - should be much more
     robust now (Dave Hansen)"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (250 commits)
  x86/fpu: Fix double-increment in setup_xstate_features()
  x86/mpx: Allow 32-bit binaries on 64-bit kernels again
  x86/mpx: Do not count MPX VMAs as neighbors when unmapping
  x86/mpx: Rewrite the unmap code
  x86/mpx: Support 32-bit binaries on 64-bit kernels
  x86/mpx: Use 32-bit-only cmpxchg() for 32-bit apps
  x86/mpx: Introduce new 'directory entry' to 'addr' helper function
  x86/mpx: Add temporary variable to reduce masking
  x86: Make is_64bit_mm() widely available
  x86/mpx: Trace allocation of new bounds tables
  x86/mpx: Trace the attempts to find bounds tables
  x86/mpx: Trace entry to bounds exception paths
  x86/mpx: Trace #BR exceptions
  x86/mpx: Introduce a boot-time disable flag
  x86/mpx: Restrict the mmap() size check to bounds tables
  x86/mpx: Remove redundant MPX_BNDCFG_ADDR_MASK
  x86/mpx: Clean up the code by not passing a task pointer around when unnecessary
  x86/mpx: Use the new get_xsave_field_ptr()API
  x86/fpu/xstate: Wrap get_xsave_addr() to make it safer
  x86/fpu/xstate: Fix up bad get_xsave_addr() assumptions
  ...
2015-06-22 17:16:11 -07:00
Bartosz Golaszewski
7d79a7bd75 x86: Replace cpu_**_mask() with topology_**_cpumask()
The former duplicate the functionalities of the latter but are
neither documented nor arch-independent.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benoit Cousson <bcousson@baylibre.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/1432645896-12588-9-git-send-email-bgolaszewski@baylibre.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27 15:22:17 +02:00
Ingo Molnar
78f7f1e54b x86/fpu: Rename fpu-internal.h to fpu/internal.h
This unifies all the FPU related header files under a unified, hiearchical
naming scheme:

 - asm/fpu/types.h:      FPU related data types, needed for 'struct task_struct',
                         widely included in almost all kernel code, and hence kept
                         as small as possible.

 - asm/fpu/api.h:        FPU related 'public' methods exported to other subsystems.

 - asm/fpu/internal.h:   FPU subsystem internal methods

 - asm/fpu/xsave.h:      XSAVE support internal methods

(Also standardize the header guard in asm/fpu/internal.h.)

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:31 +02:00