Commit Graph

20800 Commits

Author SHA1 Message Date
Ingo Molnar
fb802d6393 x86/alternatives: Rename 'text_poke_loc_init()' to 'text_poke_int3_loc_init()'
This name is actively confusing as well, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_loc_init()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.

Rename it to text_poke_int3_loc_init() to make it clear which layer
it belongs to.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-18-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
732c7c33a0 x86/alternatives: Rename 'text_poke_queue()' to 'smp_text_poke_batch_add()'
This name is actively confusing as well, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_queue()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.

Rename it to smp_text_poke_batch_add() to make it clear which layer
it belongs to.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-17-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
e8d7b8c2bb x86/alternatives: Rename 'text_poke_finish()' to 'smp_text_poke_batch_finish()'
This name is actively confusing as well, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_finish()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.

Rename it to smp_text_poke_batch_finish() to make it clear which layer
it belongs to.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-16-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
aedb60c2c6 x86/alternatives: Rename 'text_poke_flush()' to 'smp_text_poke_batch_flush()'
This name is actually actively confusing, because the simple text_poke*()
APIs use MM-switching based code patching, while text_poke_flush()
is part of the INT3 based text_poke_int3_*() machinery that is an
additional layer of functionality on top of regular text_poke*() functionality.

Rename it to smp_text_poke_batch_flush() to make it clear which layer
it belongs to.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-15-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
f5afa2e8ef x86/alternatives: Remove the confusing, inaccurate & unnecessary 'temp_mm_state_t' abstraction
So the temp_mm_state_t abstraction used by use_temporary_mm() and
unuse_temporary_mm() is super confusing:

 - The whole machinery is about temporarily switching to the
   text_poke_mm utility MM that got allocated during bootup
   for text-patching purposes alone:

	temp_mm_state_t prev;

        /*
         * Loading the temporary mm behaves as a compiler barrier, which
         * guarantees that the PTE will be set at the time memcpy() is done.
         */
        prev = use_temporary_mm(text_poke_mm);

 - Yet the value that gets saved in the temp_mm_state_t variable
   is not the temporary MM ... but the previous MM...

 - Ie. we temporarily put the non-temporary MM into a variable
   that has the temp_mm_state_t type. This makes no sense whatsoever.

 - The confusion continues in unuse_temporary_mm():

	static inline void unuse_temporary_mm(temp_mm_state_t prev_state)

   Here we unuse an MM that is ... not the temporary MM, but the
   previous MM. :-/

Fix up all this confusion by removing the unnecessary layer of
abstraction and using a bog-standard 'struct mm_struct *prev_mm'
variable to save the MM to.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-14-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
762255b743 x86/alternatives: Remove duplicate 'text_poke_early()' prototype
It's declared in <asm/text-patching.h> already.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-12-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
e84c31b9c9 x86/alternatives: Rename 'bp_desc' to 'int3_desc'
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-11-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
da364fc547 x86/alternatives: Rename 'poking_addr' to 'text_poke_mm_addr'
Put it into the text_poke_* namespace of <asm/text-patching.h>.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-10-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
a5c832e047 x86/alternatives: Rename 'poking_mm' to 'text_poke_mm'
Put it into the text_poke_* namespace of <asm/text-patching.h>.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-9-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
5236b6a0fe x86/alternatives: Rename 'poke_int3_handler()' to 'smp_text_poke_int3_handler()'
All related functions in this subsystem already have a
text_poke_int3_ prefix - add it to the trap handler
as well.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-8-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
9586ae48e7 x86/alternatives: Rename 'text_poke_bp()' to 'smp_text_poke_single()'
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-7-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
bee4fcfbc1 x86/alternatives: Rename 'text_poke_bp_batch()' to 'smp_text_poke_batch_process()'
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-6-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
28fb79092d x86/alternatives: Rename 'bp_refs' to 'text_poke_array_refs'
Make it clear that these reference counts lock access
to text_poke_array.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-5-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Ingo Molnar
84e5ba949b x86/alternatives: Rename 'struct bp_patching_desc' to 'struct text_poke_int3_vec'
Follow the INT3 text-poking nomenclature, and also adopt the
'vector' name for the entire object, instead of the rather
opaque 'descriptor' naming.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250411054105.2341982-4-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Peter Zijlstra
d60e4b2410 x86/alternatives: Document the text_poke_bp_batch() synchronization rules a bit more
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250411054105.2341982-3-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Eric Dumazet
4334336e76 x86/alternatives: Improve code-patching scalability by removing false sharing in poke_int3_handler()
eBPF programs can be run 50,000,000 times per second on busy servers.

Whenever /proc/sys/kernel/bpf_stats_enabled is turned off,
hundreds of calls sites are patched from text_poke_bp_batch()
and we see a huge loss of performance due to false sharing
on bp_desc.refs lasting up to three seconds.

   51.30%  server_bin       [kernel.kallsyms]           [k] poke_int3_handler
            |
            |--46.45%--poke_int3_handler
            |          exc_int3
            |          asm_exc_int3
            |          |
            |          |--24.26%--cls_bpf_classify
            |          |          tcf_classify
            |          |          __dev_queue_xmit
            |          |          ip6_finish_output2
            |          |          ip6_output
            |          |          ip6_xmit
            |          |          inet6_csk_xmit
            |          |          __tcp_transmit_skb

Fix this by replacing bp_desc.refs with a per-cpu bp_refs.

Before the patch, on a host with 240 cores (480 threads):

  $ sysctl -wq kernel.bpf_stats_enabled=0

  text_poke_bp_batch(nr_entries=164) : Took 2655300 usec

  $ bpftool prog | grep run_time_ns
  ...
  105: sched_cls  name hn_egress  tag 699fc5eea64144e3  gpl run_time_ns
  3009063719 run_cnt 82757845 : average cost is 36 nsec per call

After this patch:

  $ sysctl -wq kernel.bpf_stats_enabled=0

  text_poke_bp_batch(nr_entries=164) : Took 702 usec

  $ bpftool prog | grep run_time_ns
  ...
  105: sched_cls  name hn_egress  tag 699fc5eea64144e3  gpl run_time_ns
  1928223019 run_cnt 67682728 : average cost is 28 nsec per call

Ie. text-patching performance improved 3700x: from 2.65 seconds
to 0.0007 seconds.

Since the atomic_cond_read_acquire(refs, !VAL) spin-loop was not triggered
even once in my tests, add an unlikely() annotation, because this appears
to be the common case.

[ mingo: Improved the changelog some more. ]

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250411054105.2341982-2-mingo@kernel.org
2025-04-11 11:01:33 +02:00
Fernando Fernandez Mancera
3940f5349b x86/i8253: Call clockevent_i8253_disable() with interrupts disabled
There's a lockdep false positive warning related to i8253_lock:

  WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
  ...
  systemd-sleep/3324 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
  ffffffffb2c23398 (i8253_lock){+.+.}-{2:2}, at: pcspkr_event+0x3f/0xe0 [pcspkr]

  ...
  ... which became HARDIRQ-irq-unsafe at:
  ...
    lock_acquire+0xd0/0x2f0
    _raw_spin_lock+0x30/0x40
    clockevent_i8253_disable+0x1c/0x60
    pit_timer_init+0x25/0x50
    hpet_time_init+0x46/0x50
    x86_late_time_init+0x1b/0x40
    start_kernel+0x962/0xa00
    x86_64_start_reservations+0x24/0x30
    x86_64_start_kernel+0xed/0xf0
    common_startup_64+0x13e/0x141
  ...

Lockdep complains due pit_timer_init() using the lock in an IRQ-unsafe
fashion, but it's a false positive, because there is no deadlock
possible at that point due to init ordering: at the point where
pit_timer_init() is called there is no other possible usage of
i8253_lock because the system is still in the very early boot stage
with no interrupts.

But in any case, pit_timer_init() should disable interrupts before
calling clockevent_i8253_disable() out of general principle, and to
keep lockdep working even in this scenario.

Use scoped_guard() for that, as suggested by Thomas Gleixner.

[ mingo: Cleaned up the changelog. ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/Z-uwd4Bnn7FcCShX@gmail.com
2025-04-11 07:28:20 +02:00
David Woodhouse
de085ddd49 x86/kexec: Invalidate GDT/IDT from relocate_kernel() instead of earlier
Reduce the window during which exceptions are unhandled, by leaving the
GDT/IDT in place all the way into the relocate_kernel() function, until
the moment that %cr3 gets replaced.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250326142404.256980-4-dwmw2@infradead.org
2025-04-10 12:17:14 +02:00
David Woodhouse
7516e7216b x86/kexec: Add 8250 MMIO serial port output
This supports the same 32-bit MMIO-mapped 8250 as the early_printk code.

It's not clear why the early_printk code supports this form and only this
form; the actual runtime 8250_pci doesn't seem to support it. But having
hacked up QEMU to expose such a device, early_printk does work with it,
and now so does the kexec debug code.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250326142404.256980-3-dwmw2@infradead.org
2025-04-10 12:17:14 +02:00
David Woodhouse
d358b45120 x86/kexec: Add 8250 serial port output
If a serial port was configured for early_printk, use it for debug output
from the relocate_kernel exception handler too.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250326142404.256980-2-dwmw2@infradead.org
2025-04-10 12:17:13 +02:00
Ingo Molnar
eef476f15c x86/msr: Rename 'wrmsrl_cstar()' to 'wrmsrq_cstar()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:59:37 +02:00
Ingo Molnar
7cbc2ba7c1 x86/msr: Rename 'native_wrmsrl()' to 'native_wrmsrq()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:59:28 +02:00
Ingo Molnar
604d15d15e x86/msr: Rename 'wrmsrl_amd_safe()' to 'wrmsrq_amd_safe()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:59:24 +02:00
Ingo Molnar
e2b8af0c69 x86/msr: Rename 'rdmsrl_amd_safe()' to 'rdmsrq_amd_safe()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:59:19 +02:00
Ingo Molnar
8e44e83f57 x86/msr: Rename 'mce_wrmsrl()' to 'mce_wrmsrq()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:59:14 +02:00
Ingo Molnar
ebe29309c4 x86/msr: Rename 'mce_rdmsrl()' to 'mce_rdmsrq()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:59:09 +02:00
Ingo Molnar
c895ecdab2 x86/msr: Rename 'wrmsrl_on_cpu()' to 'wrmsrq_on_cpu()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:59:05 +02:00
Ingo Molnar
d7484babd2 x86/msr: Rename 'rdmsrl_on_cpu()' to 'rdmsrq_on_cpu()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:59:00 +02:00
Ingo Molnar
27a23a544a x86/msr: Rename 'wrmsrl_safe_on_cpu()' to 'wrmsrq_safe_on_cpu()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:55 +02:00
Ingo Molnar
5e404cb7ac x86/msr: Rename 'rdmsrl_safe_on_cpu()' to 'rdmsrq_safe_on_cpu()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:49 +02:00
Ingo Molnar
6fa17efe45 x86/msr: Rename 'wrmsrl_safe()' to 'wrmsrq_safe()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:44 +02:00
Ingo Molnar
6fe22abacd x86/msr: Rename 'rdmsrl_safe()' to 'rdmsrq_safe()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:38 +02:00
Ingo Molnar
78255eb239 x86/msr: Rename 'wrmsrl()' to 'wrmsrq()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:33 +02:00
Ingo Molnar
c435e608cf x86/msr: Rename 'rdmsrl()' to 'rdmsrq()'
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:27 +02:00
Ingo Molnar
73bd1e01e9 x86/msr: Use u64 in rdmsrl_amd_safe() and wrmsrl_amd_safe()
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 11:58:02 +02:00
Ingo Molnar
78a84fbfa4 Linux 6.15-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfy3/YeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/ygIAItY5dzf5fVnVEPy
 UrF+EzIaWGWRw3N+41AyT5X7z77FPX7E0cA6MD4KxfWW/OYzeAoeZSyrM2xIsEh3
 26qiohvJjpHjfHdzvKmxNItvW8+xBv3km00U/CWWqJo89JsIVnJtrSBHOut2/gNp
 f6sGoOrrR4GXXz8JX3yG/pmizr23lN81ZkVdz0ayYEK4uY92hSsBspvyFWcdffgF
 o8NCtR+JVGac8xm+f3VPSLyunLMXsh8NWETumMHP6tHQif36I3BQqeU8DgXCgjEK
 pfZ8gEyRtXIKbEt+qniUetT+2Cwu/lAN2GjTu0LqIe9Ro3HzjtotwQdk5h6kC+Lc
 BogxIs8=
 =bf5G
 -----END PGP SIGNATURE-----

Merge tag 'v6.15-rc1' into x86/mm, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-09 22:00:25 +02:00
Ingo Molnar
6ce0fdaae0 Linux 6.15-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfy3/YeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/ygIAItY5dzf5fVnVEPy
 UrF+EzIaWGWRw3N+41AyT5X7z77FPX7E0cA6MD4KxfWW/OYzeAoeZSyrM2xIsEh3
 26qiohvJjpHjfHdzvKmxNItvW8+xBv3km00U/CWWqJo89JsIVnJtrSBHOut2/gNp
 f6sGoOrrR4GXXz8JX3yG/pmizr23lN81ZkVdz0ayYEK4uY92hSsBspvyFWcdffgF
 o8NCtR+JVGac8xm+f3VPSLyunLMXsh8NWETumMHP6tHQif36I3BQqeU8DgXCgjEK
 pfZ8gEyRtXIKbEt+qniUetT+2Cwu/lAN2GjTu0LqIe9Ro3HzjtotwQdk5h6kC+Lc
 BogxIs8=
 =bf5G
 -----END PGP SIGNATURE-----

Merge tag 'v6.15-rc1' into x86/asm, to refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-09 21:39:43 +02:00
Ahmed S. Darwish
d02c83d75f x86/cacheinfo: Properly parse CPUID(0x80000006) L2/L3 associativity
Complete the AMD CPUID(4) emulation logic, which uses CPUID(0x80000006)
for L2/L3 cache info and an assocs[] associativity mapping array, by
adding entries for 3-way caches and 6-way caches.

Properly handle the case where CPUID(0x80000006) returns an L2/L3
associativity of 9.  This is not real associativity, but a marker to
indicate that the respective L2/L3 cache information should be retrieved
from CPUID(0x8000001d) instead.  If such a marker is encountered, return
early from legacy_amd_cpuid4(), thus effectively emulating an "invalid
index" CPUID(4) response with a cache type of zero.

When checking if CPUID(0x80000006) L2/L3 cache info output is valid, and
given the associtivity marker 9 above, do not just check if the whole
ECX/EDX register is zero.  Rather, check if the associativity is zero or
9.  An associativity of zero implies no L2/L3 cache, which make it the
more correct check anyway vs. a zero check of the whole output register.

Fixes: a326e948c5 ("x86, cacheinfo: Fixup L3 cache information for AMD multi-node processors")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250409122233.1058601-3-darwi@linutronix.de
2025-04-09 20:47:05 +02:00
Ahmed S. Darwish
d274cde0db x86/cacheinfo: Properly parse CPUID(0x80000005) L1d/L1i associativity
For the AMD CPUID(4) emulation cache info logic, the same associativity
mapping array, assocs[], is used for both CPUID(0x80000005) and
CPUID(0x80000006).

This is incorrect since per the AMD manuals, the mappings for
CPUID(0x80000005) L1d/L1i associativity is:

   n = 0x1 -> 0xfe	n
   n = 0xff		fully associative

while assocs[] maps these values to:

   n = 0x1, 0x2, 0x4	n
   n = 0x3, 0x7, 0x9	0
   n = 0x6		8
   n = 0x8		16
   n = 0xa		32
   n = 0xb		48
   n = 0xc		64
   n = 0xd		96
   n = 0xe		128
   n = 0xf		fully associative

which is only valid for CPUID(0x80000006).

Parse CPUID(0x80000005) L1d/L1i associativity values as shown in the AMD
manuals.  Since the 0xffff literal is used to denote full associativity
at the AMD CPUID(4)-emulation logic, define AMD_CPUID4_FULLY_ASSOCIATIVE
for it instead of spreading that literal in more places.

Mark the assocs[] mapping array as only valid for CPUID(0x80000006) L2/L3
cache information.

Fixes: a326e948c5 ("x86, cacheinfo: Fixup L3 cache information for AMD multi-node processors")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: x86-cpuid@lists.linux.dev
Link: https://lore.kernel.org/r/20250409122233.1058601-2-darwi@linutronix.de
2025-04-09 20:47:05 +02:00
Dave Hansen
f0df00ebc5 x86/cpu: Avoid running off the end of an AMD erratum table
The NULL array terminator at the end of erratum_1386_microcode was
removed during the switch from x86_cpu_desc to x86_cpu_id. This
causes readers to run off the end of the array.

Replace the NULL.

Fixes: f3f3251526 ("x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
2025-04-09 07:57:16 -07:00
Josh Poimboeuf
83f6665a49 x86/bugs: Add RSB mitigation document
Create a document to summarize hard-earned knowledge about RSB-related
mitigations, with references, and replace the overly verbose yet
incomplete comments with a reference to the document.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/ab73f4659ba697a974759f07befd41ae605e33dd.1744148254.git.jpoimboe@kernel.org
2025-04-09 12:42:09 +02:00
Josh Poimboeuf
27ce8299bc x86/bugs: Don't fill RSB on context switch with eIBRS
User->user Spectre v2 attacks (including RSB) across context switches
are already mitigated by IBPB in cond_mitigation(), if enabled globally
or if either the prev or the next task has opted in to protection.  RSB
filling without IBPB serves no purpose for protecting user space, as
indirect branches are still vulnerable.

User->kernel RSB attacks are mitigated by eIBRS.  In which case the RSB
filling on context switch isn't needed, so remove it.

Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Amit Shah <amit.shah@amd.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/98cdefe42180358efebf78e3b80752850c7a3e1b.1744148254.git.jpoimboe@kernel.org
2025-04-09 12:42:09 +02:00
Josh Poimboeuf
18bae0dfec x86/bugs: Don't fill RSB on VMEXIT with eIBRS+retpoline
eIBRS protects against guest->host RSB underflow/poisoning attacks.
Adding retpoline to the mix doesn't change that.  Retpoline has a
balanced CALL/RET anyway.

So the current full RSB filling on VMEXIT with eIBRS+retpoline is
overkill.  Disable it or do the VMEXIT_LITE mitigation if needed.

Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Amit Shah <amit.shah@amd.com>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Link: https://lore.kernel.org/r/84a1226e5c9e2698eae1b5ade861f1b8bf3677dc.1744148254.git.jpoimboe@kernel.org
2025-04-09 12:41:55 +02:00
Josh Poimboeuf
b1b19cfcf4 x86/bugs: Fix RSB clearing in indirect_branch_prediction_barrier()
IBPB is expected to clear the RSB.  However, if X86_BUG_IBPB_NO_RET is
set, that doesn't happen.  Make indirect_branch_prediction_barrier()
take that into account by calling write_ibpb() which clears RSB on
X86_BUG_IBPB_NO_RET:

	/* Make sure IBPB clears return stack preductions too. */
	FILL_RETURN_BUFFER %rax, RSB_CLEAR_LOOPS, X86_BUG_IBPB_NO_RET

Note that, as of the previous patch, write_ibpb() also reads
'x86_pred_cmd' in order to use SBPB when applicable:

	movl	_ASM_RIP(x86_pred_cmd), %eax

Therefore that existing behavior in indirect_branch_prediction_barrier()
is not lost.

Fixes: 50e4b3b940 ("x86/entry: Have entry_ibpb() invalidate return predictions")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/bba68888c511743d4cd65564d1fc41438907523f.1744148254.git.jpoimboe@kernel.org
2025-04-09 12:41:30 +02:00
Josh Poimboeuf
13235d6d50 x86/bugs: Rename entry_ibpb() to write_ibpb()
There's nothing entry-specific about entry_ibpb().  In preparation for
calling it from elsewhere, rename it to write_ibpb().

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/1e54ace131e79b760de3fe828264e26d0896e3ac.1744148254.git.jpoimboe@kernel.org
2025-04-09 12:41:29 +02:00
Andy Shevchenko
996457176b x86/early_printk: Use 'mmio32' for consistency, fix comments
First of all, using 'mmio' prevents proper implementation of 8-bit accessors.
Second, it's simply inconsistent with uart8250 set of options. Rename it to
'mmio32'. While at it, remove rather misleading comment in the documentation.
From now on mmio32 is self-explanatory and pciserial supports not only 32-bit
MMIO accessors.

Also, while at it, fix the comment for the "pciserial" case. The comment
seems to be a copy'n'paste error when mentioning "serial" instead of
"pciserial" (with double quotes). Fix this.

With that, move it upper, so we don't calculate 'buf' twice.

Fixes: 3181424aea ("x86/early_printk: Add support for MMIO-based UARTs")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Denis Mukhin <dmukhin@ford.com>
Link: https://lore.kernel.org/r/20250407172214.792745-1-andriy.shevchenko@linux.intel.com
2025-04-09 12:27:08 +02:00
James Morse
45c2e30bbd x86/resctrl: Fix rdtgroup_mkdir()'s unlocked use of kernfs_node::name
Since

  741c10b096 ("kernfs: Use RCU to access kernfs_node::name.")

a helper rdt_kn_name() that checks that rdtgroup_mutex is held has been used
for all accesses to the kernfs node name.

rdtgroup_mkdir() uses the name to determine if a valid monitor group is being
created by checking the parent name is "mon_groups". This is done without
holding rdtgroup_mutex, and now triggers the following warning:

  | WARNING: suspicious RCU usage
  | 6.15.0-rc1 #4465 Tainted: G            E
  | -----------------------------
  | arch/x86/kernel/cpu/resctrl/internal.h:408 suspicious rcu_dereference_check() usage!
  [...]
  | Call Trace:
  |  <TASK>
  |  dump_stack_lvl
  |  lockdep_rcu_suspicious.cold
  |  is_mon_groups
  |  rdtgroup_mkdir
  |  kernfs_iop_mkdir
  |  vfs_mkdir
  |  do_mkdirat
  |  __x64_sys_mkdir
  |  do_syscall_64
  |  entry_SYSCALL_64_after_hwframe

Creating a control or monitor group calls mkdir_rdt_prepare(), which uses
rdtgroup_kn_lock_live() to take the rdtgroup_mutex.

To avoid taking and dropping the lock, move the check for the monitor group
name and position into mkdir_rdt_prepare() so that it occurs under
rdtgroup_mutex. Hoist is_mon_groups() earlier in the file.

  [ bp: Massage. ]

Fixes: 741c10b096 ("kernfs: Use RCU to access kernfs_node::name.")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250407124637.2433230-1-james.morse@arm.com
2025-04-09 11:35:08 +02:00
Myrrh Periwinkle
f2f29da9f0 x86/e820: Fix handling of subpage regions when calculating nosave ranges in e820__register_nosave_regions()
While debugging kexec/hibernation hangs and crashes, it turned out that
the current implementation of e820__register_nosave_regions() suffers from
multiple serious issues:

 - The end of last region is tracked by PFN, causing it to find holes
   that aren't there if two consecutive subpage regions are present

 - The nosave PFN ranges derived from holes are rounded out (instead of
   rounded in) which makes it inconsistent with how explicitly reserved
   regions are handled

Fix this by:

 - Treating reserved regions as if they were holes, to ensure consistent
   handling (rounding out nosave PFN ranges is more correct as the
   kernel does not use partial pages)

 - Tracking the end of the last RAM region by address instead of pages
   to detect holes more precisely

These bugs appear to have been introduced about ~18 years ago with the very
first version of e820_mark_nosave_regions(), and its flawed assumptions were
carried forward uninterrupted through various waves of rewrites and renames.

[ mingo: Added Git archeology details, for kicks and giggles. ]

Fixes: e8eff5ac29 ("[PATCH] Make swsusp avoid memory holes and reserved memory regions on x86_64")
Reported-by: Roberto Ricci <io@r-ricci.it>
Tested-by: Roberto Ricci <io@r-ricci.it>
Signed-off-by: Myrrh Periwinkle <myrrhperiwinkle@qtmlabs.xyz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Len Brown <len.brown@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250406-fix-e820-nosave-v3-1-f3787bc1ee1d@qtmlabs.xyz
Closes: https://lore.kernel.org/all/Z4WFjBVHpndct7br@desktop0a/
2025-04-07 19:20:08 +02:00
Petr Vaněk
8b37357a78 x86/acpi: Don't limit CPUs to 1 for Xen PV guests due to disabled ACPI
Xen disables ACPI for PV guests in DomU, which causes acpi_mps_check() to
return 1 when CONFIG_X86_MPPARSE is not set. As a result, the local APIC is
disabled and the guest is later limited to a single vCPU, despite being
configured with more.

This regression was introduced in version 6.9 in commit 7c0edad364
("x86/cpu/topology: Rework possible CPU management"), which added an
early check that limits CPUs to 1 if apic_is_disabled.

Update the acpi_mps_check() logic to return 0 early when running as a Xen
PV guest in DomU, preventing APIC from being disabled in this specific case
and restoring correct multi-vCPU behaviour.

Fixes: 7c0edad364 ("x86/cpu/topology: Rework possible CPU management")
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250407132445.6732-2-arkamar@atlas.cz
2025-04-07 16:35:21 +02:00
Boris Ostrovsky
321550859f x86/microcode/AMD: Clean the cache if update did not load microcode
If microcode did not get loaded there is no reason to keep it in the cache.
Moreover, if loading failed it will not be possible to load an earlier version
of microcode since the failed revision will always be selected from the cache
on the next reload attempt.

Since the failed revisions is not easily available at this point just clean the
whole cache. It will be rebuilt later if needed.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250327230503.1850368-3-boris.ostrovsky@oracle.com
2025-04-07 14:46:56 +02:00
Andi Kleen
e37aa1211f x86/cpuid: Add AMX and SPEC_CTRL dependencies
Add some missing dependencies to the CPUID dependency table:

 - All the AMX features depend on AMX_TILE
 - All the SPEC_CTRL features depend on SPEC_CTRL

[ mingo: Keep the AMX part of the table grouped ... ]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Sohil Mehta <sohil.mehta@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Ahmed S. Darwish <darwi@linutronix.de>
Link: https://lore.kernel.org/r/20240924170128.2611854-1-ak@linux.intel.com
2025-04-06 19:54:35 +02:00
Linus Torvalds
16cd1c2657 A set of final cleanups for the timer subsystem:
1) Convert all del_timer[_sync]() instances over to the new
      timer_delete[_sync]() API and remove the legacy wrappers.
 
      Conversion was done with coccinelle plus some manual fixups as
      coccinelle chokes on scoped_guard().
 
   2) The final cleanup of the hrtimer_init() to hrtimer_setup() conversion.
 
      This has been delayed to the end of the merge window, so that all
      patches which have been merged through other trees are in mainline and
      all new users are catched.
 
 Doing this right before rc1 ensures that new code which is merged post rc1
 is not introducing new instances of the original functionality.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmfyXi0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYzlD/4ykDZbUzgTreYOxEQpBJ9elPwBhxfL
 1v8OwDjRWlNrmLup8RiUfKrlbmztGl1J/u9ld0qhjcqkywCCBC1N5S+DhCjYetyP
 MPWLbi2Dc35cFA+M7i8fMgxI2K9MLz2Zj1UKxz1MdsSuNHm07N3mul/3T11Ye4Rz
 nPlzeQBTBDFCKTEGKjr8zjuoD15Wl48sObM0AjV35BPuQR1jfY4CE6VXo2h78+0c
 jYwpJpDmcd+o1bDrfFhWUME2DzABEkHhn4wNSETnM4E5RXZRMUbi4UiigzInibQr
 JOUTKwPJXTMX/Erd0XyXErrYf2qy1X9BQy6NlyDDOv+8kLEVRsC9Efplx9uoEtfi
 QvVT/UmgmhZFJBfIT3/B8OvasrfwOropaYoG4L0zbDpp1b09VY47N5lCLlNr/mZf
 jb2TwIln8Szy2EfIT2RSd0ZNupyU8V4aH/mYNpSlbUJ6mfvfIAttBSS/YH+Zeqku
 7zOJkoCusaySOCZCOQkeikL3ZBN+FHtNteXxmGnp34ed/tsfgGZj1lsbmkM2rrWo
 f2mQsYAclUA4KQeY9z/Xf7/c5wJUkME69PxOaaN23dOpBR7GA58Cvb0PQTnPlAiT
 KnH/JRweBHtcv4KEHMi2f5no4cxcmXyKTj7/TLyYNjc8LATL9Eo/nxG36PLxy4lN
 QPOWz11zEBLjQQ==
 =8Ftq
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A set of final cleanups for the timer subsystem:

   - Convert all del_timer[_sync]() instances over to the new
     timer_delete[_sync]() API and remove the legacy wrappers.

     Conversion was done with coccinelle plus some manual fixups as
     coccinelle chokes on scoped_guard().

   - The final cleanup of the hrtimer_init() to hrtimer_setup()
     conversion.

     This has been delayed to the end of the merge window, so that all
     patches which have been merged through other trees are in mainline
     and all new users are catched.

  Doing this right before rc1 ensures that new code which is merged post
  rc1 is not introducing new instances of the original functionality"

* tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracing/timers: Rename the hrtimer_init event to hrtimer_setup
  hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack()
  hrtimers: Rename debug_init() to debug_setup()
  hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper()
  hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns()
  hrtimers: Make callback function pointer private
  hrtimers: Merge __hrtimer_init() into __hrtimer_setup()
  hrtimers: Switch to use __htimer_setup()
  hrtimers: Delete hrtimer_init()
  treewide: Convert new and leftover hrtimer_init() users
  treewide: Switch/rename to timer_delete[_sync]()
2025-04-06 08:35:37 -07:00
Thomas Gleixner
8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Jiri Slaby (SUSE)
825dfab23b irqdomain: Rename irq_set_default_host() to irq_set_default_domain()
Naming interrupt domains host is confusing at best and the irqdomain code
uses both domain and host inconsistently.

Therefore rename irq_set_default_host() to irq_set_default_domain().

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-3-jirislaby@kernel.org
2025-04-04 16:39:10 +02:00
Andrew Cooper
1f13c60d84 x86/idle: Remove MFENCEs for X86_BUG_CLFLUSH_MONITOR in mwait_idle_with_hints() and prefer_mwait_c1_over_halt()
The following commit, 12 years ago:

  7e98b71920 ("x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers")

added barriers around the CLFLUSH in mwait_idle_with_hints(), justified with:

  ... and add memory barriers around it since the documentation is explicit
  that CLFLUSH is only ordered with respect to MFENCE.

This also triggered, 11 years ago, the same adjustment in:

  f8e617f458 ("sched/idle/x86: Optimize unnecessary mwait_idle() resched IPIs")

during development, although it failed to get the static_cpu_has_bug() treatment.

X86_BUG_CLFLUSH_MONITOR (a.k.a the AAI65 errata) is specific to Intel CPUs,
and the SDM currently states:

  Executions of the CLFLUSH instruction are ordered with respect to each
  other and with respect to writes, locked read-modify-write instructions,
  and fence instructions[1].

With footnote 1 reading:

  Earlier versions of this manual specified that executions of the CLFLUSH
  instruction were ordered only by the MFENCE instruction.  All processors
  implementing the CLFLUSH instruction also order it relative to the other
  operations enumerated above.

i.e. The SDM was incorrect at the time, and barriers should not have been
inserted.  Double checking the original AAI65 errata (not available from
intel.com any more) shows no mention of barriers either.

Note: If this were a general codepath, the MFENCEs would be needed, because
      AMD CPUs of the same vintage do sport otherwise-unordered CLFLUSHs.

Remove the unnecessary barriers. Furthermore, use a plain alternative(),
rather than static_cpu_has_bug() and/or no optimisation.  The workaround
is a single instruction.

Use an explicit %rax pointer rather than a general memory operand, because
MONITOR takes the pointer implicitly in the same way.

[ mingo: Cleaned up the commit a bit. ]

Fixes: 7e98b71920 ("x86, idle: Use static_cpu_has() for CLFLUSH workaround, add barriers")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20250402172458.1378112-1-andrew.cooper3@citrix.com
2025-04-02 22:02:26 +02:00
Linus Torvalds
6cb094583a * Avoid direct HLT instruction execution in TDX guests
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmfsa8QACgkQaDWVMHDJ
 krBCAhAAodPYiIEy+qpad1Q8HPhaKYUJ5jzkIdt1GYXCBf2dfY6Zj8w7edSApUhA
 7og9gK8ku8hwpf6oCGmp2Lm74FgATIj7q0ac07XBW3OsrfFQc73DfPJn6WMDYjRV
 ec9baSzX5GcqUyezq7woyJayZT9LRLBexF/vk7dAQ7nuecCOUhqLXWBN5eUT0e+K
 58kFjZoZZx/4Y9zh7UIxBQyCbL88IeI6rclW5tZJlRHNuD7B64x606ETwQJKK9GK
 YHPhqRKtjJRzSOn/xGYT4AQDPbF9u14Q4WGVO+bvgv8Z6BtmiYV2fG0q5GU14h0z
 +gwjja3Edo+F6zSIIZonQbrSVHspwm1IPJQQZHljhFOEt7Ezu3hLIYouUWVlNRgl
 mRzubZBmhQUfJOAtfGmHktdg6j+QinYDQr+/CjoXoeh8EknL+KtqamXJnyb8KAMN
 qH6X+N2coaCcl334zW44m6YTmTipdIhmHFj6edYwqdR3Ux6DDaX9PKopIIpiZEcb
 GH1o++4JMp9OBIaTu0Yp1WgWJ+EyUSWDJbydqCMOdthuESqKW45IQkLhPxZpIhB4
 5Wra4Ot7AdsThyPqNPaEu3ND+BXu4tAAa8r8GK+AP7DqRxXz/bbWTHqNepm9wSvP
 pnOlLyVTri/difMWWsJJPK6QRYbNnemrny3Do3PbIZVKS08vgLs=
 =XvoD
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 TDX updates from Dave Hansen:
 "Avoid direct HLT instruction execution in TDX guests.

  TDX guests aren't expected to use the HLT instruction directly. It
  causes a virtualization exception (#VE). While the #VE _can_ be
  handled, the current handling is slow and buggy and the easiest thing
  is just to avoid HLT in the first place. Plus, the kernel already has
  paravirt infrastructure that makes it relatively painless.

  Make TDX guests require paravirt and add some TDX-specific paravirt
  handlers which avoid HLT in the normal halt routines. Also add a
  warning in case another HLT sneaks in.

  There was a report that this leads to a "major performance
  improvement" on specjbb2015, probably because of the extra #VE
  overhead or missed wakeups from the buggy HLT handling"

* tag 'x86_tdx_for_6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tdx: Emit warning if IRQs are enabled during HLT #VE handling
  x86/tdx: Fix arch_safe_halt() execution for TDX VMs
  x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
2025-04-02 11:33:20 -07:00
Sohil Mehta
f2e01dcf6d x86/nmi: Improve NMI duration console printouts
Convert the last remaining printk() in nmi.c to pr_info(). Along with
it, use timespec macros to calculate the NMI handler duration.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-10-sohil.mehta@intel.com
2025-04-01 22:26:38 +02:00
Sohil Mehta
05279a2863 x86/nmi: Clean up NMI selftest
The expected_testcase_failures variable in the NMI selftest has never
been set since its introduction. Remove this unused variable along with
the related checks to simplify the code.

While at it, replace printk() with the corresponding pr_{cont,info}()
calls. Also, get rid of the superfluous testname wrapper and the
redundant file path comment.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-9-sohil.mehta@intel.com
2025-04-01 22:26:32 +02:00
Sohil Mehta
59cddd397a x86/nmi: Improve and relocate NMI handler comments
Some of the comments in the default NMI handling code are out of place
or inadequate. Move them to the appropriate locations and update them as
needed.

Move the comment related to CPU-specific NMIs closer to the actual code.
Also, add more details about how back-to-back NMIs are detected since
that isn't immediately obvious.

Opportunistically, replace an #ifdef section in the vicinity with an
IS_ENABLED() check to make the code easier to read.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-7-sohil.mehta@intel.com
2025-04-01 22:26:16 +02:00
Sohil Mehta
b4bc3144c1 x86/nmi: Fix comment in unknown_nmi_error()
The comment in unknown_nmi_error() is incorrect and misleading. There
is no longer a restriction on having a single Unknown NMI handler. Also,
nmi_handle() never used the 'b2b' parameter.

The commits that made the comment outdated are:

  0d443b70cc ("x86/platform: Remove warning message for duplicate NMI handlers")
  bf9f2ee28d ("x86/nmi: Remove the 'b2b' parameter from nmi_handle()")

Remove the old comment and update it to reflect the current logic.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-6-sohil.mehta@intel.com
2025-04-01 22:26:11 +02:00
Sohil Mehta
6325f94701 x86/nmi: Remove export of local_touch_nmi()
Commit:

  feb6cd6a0f ("thermal/intel_powerclamp: stop sched tick in forced idle")

got rid of the last exported user of local_touch_nmi() a while back.

Remove the unnecessary export.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-5-sohil.mehta@intel.com
2025-04-01 22:26:06 +02:00
Sohil Mehta
4a8fba4be8 x86/nmi: Use a macro to initialize NMI descriptors
The NMI descriptors for each NMI type are stored in an array. However,
they are currently initialized using raw numbers, which makes it
difficult to understand the code.

Introduce a macro to initialize the NMI descriptors using the NMI type
enum values to make the code more readable.

No functional change intended.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-4-sohil.mehta@intel.com
2025-04-01 22:26:01 +02:00
Sohil Mehta
78a0323506 x86/nmi: Consolidate NMI panic variables
Commit:

  c305a4e983 ("x86: Move sysctls into arch/x86")

recently moved the sysctl handling of panic_on_unrecovered_nmi and
panic_on_io_nmi to x86-specific code. These variables no longer need to
be declared in the generic header file.

Relocate the variable definitions and declarations closer to where they
are used. This makes all the NMI panic options consistent and easier to
track.

[ mingo: Fixed up the SHA1 of the commit reference. ]

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Joel Granados <joel.granados@kernel.org>
Link: https://lore.kernel.org/r/20250327234629.3953536-3-sohil.mehta@intel.com
2025-04-01 22:25:56 +02:00
Sohil Mehta
2e016da1cb x86/nmi: Simplify unknown NMI panic handling
The unknown_nmi_panic variable is used to control whether the kernel
should panic on unknown NMIs. There is a sysctl entry under:

  /proc/sys/kernel/unknown_nmi_panic

which can be used to change the behavior at runtime.

However, it seems that in some places, the option unnecessarily depends
on CONFIG_X86_LOCAL_APIC. Other code in nmi.c uses unknown_nmi_panic
without such a dependency. This results in a few messy #ifdefs
splattered across the code. The dependency was likely introduce due to a
potential build bug reported a long time ago:

  https://lore.kernel.org/lkml/40BC67F9.3000609@myrealbox.com/

This build bug no longer exists.

Also, similar NMI panic options, such as panic_on_unrecovered_nmi and
panic_on_io_nmi, do not have an explicit dependency on the local APIC
either.

Though, it's hard to imagine a production system without the local APIC
configuration, making a specific NMI sysctl option dependent on it
doesn't make sense.

Remove the explicit dependency between unknown NMI handling and the
local APIC to make the code cleaner and more consistent.

While at it, reorder the header includes to maintain alphabetical order.

[ mingo: Cleaned up the changelog a bit, truly ordered the headers ... ]

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20250327234629.3953536-2-sohil.mehta@intel.com
2025-04-01 22:25:41 +02:00
Linus Torvalds
2cd5769fb0 Driver core updates for 6.15-rc1
Here is the big set of driver core updates for 6.15-rc1.  Lots of stuff
 happened this development cycle, including:
   - kernfs scaling changes to make it even faster thanks to rcu
   - bin_attribute constify work in many subsystems
   - faux bus minor tweaks for the rust bindings
   - rust binding updates for driver core, pci, and platform busses,
     making more functionaliy available to rust drivers.  These are all
     due to people actually trying to use the bindings that were in 6.14.
   - make Rafael and Danilo full co-maintainers of the driver core
     codebase
   - other minor fixes and updates.
 
 This has been in linux-next for a while now, with the only reported
 issue being some merge conflicts with the rust tree.  Depending on which
 tree you pull first, you will have conflicts in one of them.  The merge
 resolution has been in linux-next as an example of what to do, or can be
 found here:
 	https://lore.kernel.org/r/CANiq72n3Xe8JcnEjirDhCwQgvWoE65dddWecXnfdnbrmuah-RQ@mail.gmail.com
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ+mMrg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylRgwCdH58OE3BgL0uoFY5vFImStpmPtqUAoL5HpVWI
 jtbJ+UuXGsnmO+JVNBEv
 =gy6W
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updatesk from Greg KH:
 "Here is the big set of driver core updates for 6.15-rc1. Lots of stuff
  happened this development cycle, including:

   - kernfs scaling changes to make it even faster thanks to rcu

   - bin_attribute constify work in many subsystems

   - faux bus minor tweaks for the rust bindings

   - rust binding updates for driver core, pci, and platform busses,
     making more functionaliy available to rust drivers. These are all
     due to people actually trying to use the bindings that were in
     6.14.

   - make Rafael and Danilo full co-maintainers of the driver core
     codebase

   - other minor fixes and updates"

* tag 'driver-core-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (52 commits)
  rust: platform: require Send for Driver trait implementers
  rust: pci: require Send for Driver trait implementers
  rust: platform: impl Send + Sync for platform::Device
  rust: pci: impl Send + Sync for pci::Device
  rust: platform: fix unrestricted &mut platform::Device
  rust: pci: fix unrestricted &mut pci::Device
  rust: device: implement device context marker
  rust: pci: use to_result() in enable_device_mem()
  MAINTAINERS: driver core: mark Rafael and Danilo as co-maintainers
  rust/kernel/faux: mark Registration methods inline
  driver core: faux: only create the device if probe() succeeds
  rust/faux: Add missing parent argument to Registration::new()
  rust/faux: Drop #[repr(transparent)] from faux::Registration
  rust: io: fix devres test with new io accessor functions
  rust: io: rename `io::Io` accessors
  kernfs: Move dput() outside of the RCU section.
  efi: rci2: mark bin_attribute as __ro_after_init
  rapidio: constify 'struct bin_attribute'
  firmware: qemu_fw_cfg: constify 'struct bin_attribute'
  powerpc/perf/hv-24x7: Constify 'struct bin_attribute'
  ...
2025-04-01 11:02:03 -07:00
Linus Torvalds
d6b02199cd - The 7 patch series "powerpc/crash: use generic crashkernel
reservation" from Sourabh Jain changes powerpc's kexec code to use more
   of the generic layers.
 
 - The 2 patch series "get_maintainer: report subsystem status
   separately" from Vlastimil Babka makes some long-requested improvements
   to the get_maintainer output.
 
 - The 4 patch series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizing the refcounting in the ucount
   code.
 
 - The 12 patch series "reboot: support runtime configuration of
   emergency hw_protection action" from Ahmad Fatoum improves the ability
   for a driver to perform an emergency system shutdown or reboot.
 
 - The 16 patch series "Converge on using secs_to_jiffies() part two"
   from Easwar Hariharan performs further migrations from
   msecs_to_jiffies() to secs_to_jiffies().
 
 - The 7 patch series "lib/interval_tree: add some test cases and
   cleanup" from Wei Yang permits more userspace testing of kernel library
   code, adds some more tests and performs some cleanups.
 
 - The 2 patch series "hung_task: Dump the blocking task stacktrace" from
   Masami Hiramatsu arranges for the hung_task detector to dump the stack
   of the blocking task and not just that of the blocked task.
 
 - The 4 patch series "resource: Split and use DEFINE_RES*() macros" from
   Andy Shevchenko provides some cleanups to the resource definition
   macros.
 
 - Plus the usual shower of singleton patches - please see the individual
   changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nuqwAKCRDdBJ7gKXxA
 jtNqAQDxqJpjWkzn4yN9CNSs1ivVx3fr6SqazlYCrt3u89WQvwEA1oRrGpETzUGq
 r6khQUIcQImPPcjFqEFpuiSOU0MBZA0=
 =Kii8
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - The series "powerpc/crash: use generic crashkernel reservation" from
   Sourabh Jain changes powerpc's kexec code to use more of the generic
   layers.

 - The series "get_maintainer: report subsystem status separately" from
   Vlastimil Babka makes some long-requested improvements to the
   get_maintainer output.

 - The series "ucount: Simplify refcounting with rcuref_t" from
   Sebastian Siewior cleans up and optimizing the refcounting in the
   ucount code.

 - The series "reboot: support runtime configuration of emergency
   hw_protection action" from Ahmad Fatoum improves the ability for a
   driver to perform an emergency system shutdown or reboot.

 - The series "Converge on using secs_to_jiffies() part two" from Easwar
   Hariharan performs further migrations from msecs_to_jiffies() to
   secs_to_jiffies().

 - The series "lib/interval_tree: add some test cases and cleanup" from
   Wei Yang permits more userspace testing of kernel library code, adds
   some more tests and performs some cleanups.

 - The series "hung_task: Dump the blocking task stacktrace" from Masami
   Hiramatsu arranges for the hung_task detector to dump the stack of
   the blocking task and not just that of the blocked task.

 - The series "resource: Split and use DEFINE_RES*() macros" from Andy
   Shevchenko provides some cleanups to the resource definition macros.

 - Plus the usual shower of singleton patches - please see the
   individual changelogs for details.

* tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits)
  mailmap: consolidate email addresses of Alexander Sverdlin
  fs/procfs: fix the comment above proc_pid_wchan()
  relay: use kasprintf() instead of fixed buffer formatting
  resource: replace open coded variant of DEFINE_RES()
  resource: replace open coded variants of DEFINE_RES_*_NAMED()
  resource: replace open coded variant of DEFINE_RES_NAMED_DESC()
  resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED()
  samples: add hung_task detector mutex blocking sample
  hung_task: show the blocker task if the task is hung on mutex
  kexec_core: accept unaccepted kexec segments' destination addresses
  watchdog/perf: optimize bytes copied and remove manual NUL-termination
  lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap()
  lib/interval_tree: skip the check before go to the right subtree
  lib/interval_tree: add test case for span iteration
  lib/interval_tree: add test case for interval_tree_iter_xxx() helpers
  lib/rbtree: add random seed
  lib/rbtree: split tests
  lib/rbtree: enable userland test suite for rbtree related data structure
  checkpatch: describe --min-conf-desc-length
  scripts/gdb/symbols: determine KASLR offset on s390
  ...
2025-04-01 10:06:52 -07:00
Linus Torvalds
eb0ece1602 - The 6 patch series "Enable strict percpu address space checks" from
Uros Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.
 
   This has caused a small amount of fallout - two or three issues were
   reported.  In all cases the calling code was founf to be incorrect.
 
 - The 4 patch series "Some cleanup for memcg" from Chen Ridong
   implements some relatively monir cleanups for the memcontrol code.
 
 - The 17 patch series "mm: fixes for device-exclusive entries (hmm)"
   from David Hildenbrand fixes a boatload of issues which David found then
   using device-exclusive PTE entries when THP is enabled.  More work is
   needed, but this makes thins better - our own HMM selftests now succeed.
 
 - The 2 patch series "mm: zswap: remove z3fold and zbud" from Yosry
   Ahmed remove the z3fold and zbud implementations.  They have been
   deprecated for half a year and nobody has complained.
 
 - The 5 patch series "mm: further simplify VMA merge operation" from
   Lorenzo Stoakes implements numerous simplifications in this area.  No
   runtime effects are anticipated.
 
 - The 4 patch series "mm/madvise: remove redundant mmap_lock operations
   from process_madvise()" from SeongJae Park rationalizes the locking in
   the madvise() implementation.  Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.
 
 - The 12 patch series "Tiny cleanup and improvements about SWAP code"
   from Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.
 
 - The 2 patch series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak user-visible
   output.
 
 - The 2 patch series "mm/damon/paddr: fix large folios access and
   schemes handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.
 
 - The 3 patch series "mm/damon/core: fix wrong and/or useless
   damos_walk() behaviors" from SeongJae Park fixes a few issues with the
   accuracy of kdamond's walking of DAMON regions.
 
 - The 3 patch series "expose mapping wrprotect, fix fb_defio use" from
   Lorenzo Stoakes changes the interaction between framebuffer deferred-io
   and core MM.  No functional changes are anticipated - this is
   preparatory work for the future removal of page structure fields.
 
 - The 4 patch series "mm/damon: add support for hugepage_size DAMOS
   filter" from Usama Arif adds a DAMOS filter which permits the filtering
   by huge page sizes.
 
 - The 4 patch series "mm: permit guard regions for file-backed/shmem
   mappings" from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state.  The feature now covers shmem and
   file-backed mappings.
 
 - The 4 patch series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping for
   pte-mapped large folios.
 
 - The 18 patch series "reimplement per-vma lock as a refcount" from
   Suren Baghdasaryan puts the vm_lock back into the vma.  Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy.  This patchset provides small (0-10%) improvements on one
   microbenchmark.
 
 - The 5 patch series "Docs/mm/damon: misc DAMOS filters documentation
   fixes and improves" from SeongJae Park does some maintenance work on the
   DAMON docs.
 
 - The 27 patch series "hugetlb/CMA improvements for large systems" from
   Frank van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for unmapped
   pages" from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.
 
 - The 19 patch series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.
 
 - The 12 patch series "selftests/mm: Some cleanups from trying to run
   them" from Brendan Jackman fixes a pile of unrelated issues which
   Brendan encountered while runnimg our selftests.
 
 - The 2 patch series "fs/proc/task_mmu: add guard region bit to pagemap"
   from Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.
 
 - The 7 patch series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply wasn't
   being effective.
 
 - The 5 patch series "mm: cleanups for device-exclusive entries (hmm)"
   from David Hildenbrand implements a number of unrelated cleanups in this
   code.
 
 - The 5 patch series "mm: Rework generic PTDUMP configs" from Anshuman
   Khandual implements a number of preparatoty cleanups to the
   GENERIC_PTDUMP Kconfig logic.
 
 - The 8 patch series "mm/damon: auto-tune aggregation interval" from
   SeongJae Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.
 
 - The 5 patch series "Fix lazy mmu mode" from Ryan Roberts fixes some
   issues in powerpc, sparc and x86 lazy MMU implementations.  Ryan did
   this in preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.
 
 - The 2 patch series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the code
   easier to follow.
 
 - The 3 patch series "page_counter cleanup and size reduction" from
   Shakeel Butt cleans up the page_counter code and fixes a size increase
   which we accidentally added late last year.
 
 - The 3 patch series "Add a command line option that enables control of
   how many threads should be used to allocate huge pages" from Thomas
   Prescher does that.  It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.
 
 - The 3 patch series "Fix calculations in trace_balance_dirty_pages()
   for cgwb" from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.
 
 - The 9 patch series "mm/damon: make allow filters after reject filters
   useful and intuitive" from SeongJae Park improves the handling of allow
   and reject filters.  Behaviour is made more consistent and the
   documention is updated accordingly.
 
 - The 5 patch series "Switch zswap to object read/write APIs" from Yosry
   Ahmed updates zswap to the new object read/write APIs and thus permits
   the removal of some legacy code from zpool and zsmalloc.
 
 - The 6 patch series "Some trivial cleanups for shmem" from Baolin Wang
   does as it claims.
 
 - The 20 patch series "fs/dax: Fix ZONE_DEVICE page reference counts"
   from Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.
 
 - The 4 patch series "refactor mremap and fix bug" from Lorenzo Stoakes
   is a preparatoty refactoring and cleanup of the mremap() code.
 
 - The 20 patch series "mm: MM owner tracking for large folios (!hugetlb)
   + CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.
 
 - The 8 patch series "mm/damon: add sysfs dirs for managing DAMOS
   filters based on handling layers" from SeongJae Park adds a couple of
   new sysfs directories to ease the management of DAMON/DAMOS filters.
 
 - The 13 patch series "arch, mm: reduce code duplication in mem_init()"
   from Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.
 
 - The 13 patch series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.
 
 - The 3 patch series "mm: page_ext: Introduce new iteration API" from
   Luiz Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.
 
 - The 8 patch series "Buddy allocator like (or non-uniform) folio split"
   from Zi Yan reworks the code to split a folio into smaller folios.  The
   main benefit is lessened memory consumption: fewer post-split folios are
   generated.
 
 - The 2 patch series "Minimize xa_node allocation during xarry split"
   from Zi Yan reduces the number of xarray xa_nodes which are generated
   during an xarray split.
 
 - The 2 patch series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.
 
 - The 3 patch series "Add tracepoints for lowmem reserves, watermarks
   and totalreserve_pages" from Martin Liu adds some more tracepoints to
   the page allocator code.
 
 - The 4 patch series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which SeongJae
   observed during his earlier madvise work.
 
 - The 3 patch series "mm/hwpoison: Fix regressions in memory failure
   handling" from Shuai Xue addresses two quite serious regressions which
   Shuai has observed in the memory-failure implementation.
 
 - The 5 patch series "mm: reliable huge page allocator" from Johannes
   Weiner makes huge page allocations cheaper and more reliable by reducing
   fragmentation.
 
 - The 5 patch series "Minor memcg cleanups & prep for memdescs" from
   Matthew Wilcox is preparatory work for the future implementation of
   memdescs.
 
 - The 4 patch series "track memory used by balloon drivers" from Nico
   Pache introduces a way to track memory used by our various balloon
   drivers.
 
 - The 2 patch series "mm/damon: introduce DAMOS filter type for active
   pages" from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.
 
 - The 2 patch series "Adding Proactive Memory Reclaim Statistics" from
   Hao Jia separates the proactive reclaim statistics from the direct
   reclaim statistics.
 
 - The 2 patch series "mm/vmscan: don't try to reclaim hwpoison folio"
   from Jinjiang Tu fixes our handling of hwpoisoned pages within the
   reclaim code.
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ+nZaAAKCRDdBJ7gKXxA
 jsOWAPiP4r7CJHMZRK4eyJOkvS1a1r+TsIarrFZtjwvf/GIfAQCEG+JDxVfUaUSF
 Ee93qSSLR1BkNdDw+931Pu0mXfbnBw==
 =Pn2K
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "Enable strict percpu address space checks" from Uros
   Bizjak uses x86 named address space qualifiers to provide
   compile-time checking of percpu area accesses.

   This has caused a small amount of fallout - two or three issues were
   reported. In all cases the calling code was found to be incorrect.

 - The series "Some cleanup for memcg" from Chen Ridong implements some
   relatively monir cleanups for the memcontrol code.

 - The series "mm: fixes for device-exclusive entries (hmm)" from David
   Hildenbrand fixes a boatload of issues which David found then using
   device-exclusive PTE entries when THP is enabled. More work is
   needed, but this makes thins better - our own HMM selftests now
   succeed.

 - The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
   remove the z3fold and zbud implementations. They have been deprecated
   for half a year and nobody has complained.

 - The series "mm: further simplify VMA merge operation" from Lorenzo
   Stoakes implements numerous simplifications in this area. No runtime
   effects are anticipated.

 - The series "mm/madvise: remove redundant mmap_lock operations from
   process_madvise()" from SeongJae Park rationalizes the locking in the
   madvise() implementation. Performance gains of 20-25% were observed
   in one MADV_DONTNEED microbenchmark.

 - The series "Tiny cleanup and improvements about SWAP code" from
   Baoquan He contains a number of touchups to issues which Baoquan
   noticed when working on the swap code.

 - The series "mm: kmemleak: Usability improvements" from Catalin
   Marinas implements a couple of improvements to the kmemleak
   user-visible output.

 - The series "mm/damon/paddr: fix large folios access and schemes
   handling" from Usama Arif provides a couple of fixes for DAMON's
   handling of large folios.

 - The series "mm/damon/core: fix wrong and/or useless damos_walk()
   behaviors" from SeongJae Park fixes a few issues with the accuracy of
   kdamond's walking of DAMON regions.

 - The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
   Stoakes changes the interaction between framebuffer deferred-io and
   core MM. No functional changes are anticipated - this is preparatory
   work for the future removal of page structure fields.

 - The series "mm/damon: add support for hugepage_size DAMOS filter"
   from Usama Arif adds a DAMOS filter which permits the filtering by
   huge page sizes.

 - The series "mm: permit guard regions for file-backed/shmem mappings"
   from Lorenzo Stoakes extends the guard region feature from its
   present "anon mappings only" state. The feature now covers shmem and
   file-backed mappings.

 - The series "mm: batched unmap lazyfree large folios during
   reclamation" from Barry Song cleans up and speeds up the unmapping
   for pte-mapped large folios.

 - The series "reimplement per-vma lock as a refcount" from Suren
   Baghdasaryan puts the vm_lock back into the vma. Our reasons for
   pulling it out were largely bogus and that change made the code more
   messy. This patchset provides small (0-10%) improvements on one
   microbenchmark.

 - The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
   improves" from SeongJae Park does some maintenance work on the DAMON
   docs.

 - The series "hugetlb/CMA improvements for large systems" from Frank
   van der Linden addresses a pile of issues which have been observed
   when using CMA on large machines.

 - The series "mm/damon: introduce DAMOS filter type for unmapped pages"
   from SeongJae Park enables users of DMAON/DAMOS to filter my the
   page's mapped/unmapped status.

 - The series "zsmalloc/zram: there be preemption" from Sergey
   Senozhatsky teaches zram to run its compression and decompression
   operations preemptibly.

 - The series "selftests/mm: Some cleanups from trying to run them" from
   Brendan Jackman fixes a pile of unrelated issues which Brendan
   encountered while runnimg our selftests.

 - The series "fs/proc/task_mmu: add guard region bit to pagemap" from
   Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
   determine whether a particular page is a guard page.

 - The series "mm, swap: remove swap slot cache" from Kairui Song
   removes the swap slot cache from the allocation path - it simply
   wasn't being effective.

 - The series "mm: cleanups for device-exclusive entries (hmm)" from
   David Hildenbrand implements a number of unrelated cleanups in this
   code.

 - The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
   implements a number of preparatoty cleanups to the GENERIC_PTDUMP
   Kconfig logic.

 - The series "mm/damon: auto-tune aggregation interval" from SeongJae
   Park implements a feedback-driven automatic tuning feature for
   DAMON's aggregation interval tuning.

 - The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
   powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
   preparation for implementing lazy mmu mode for arm64 to optimize
   vmalloc.

 - The series "mm/page_alloc: Some clarifications for migratetype
   fallback" from Brendan Jackman reworks some commentary to make the
   code easier to follow.

 - The series "page_counter cleanup and size reduction" from Shakeel
   Butt cleans up the page_counter code and fixes a size increase which
   we accidentally added late last year.

 - The series "Add a command line option that enables control of how
   many threads should be used to allocate huge pages" from Thomas
   Prescher does that. It allows the careful operator to significantly
   reduce boot time by tuning the parallalization of huge page
   initialization.

 - The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
   from Tang Yizhou fixes the tracing output from the dirty page
   balancing code.

 - The series "mm/damon: make allow filters after reject filters useful
   and intuitive" from SeongJae Park improves the handling of allow and
   reject filters. Behaviour is made more consistent and the documention
   is updated accordingly.

 - The series "Switch zswap to object read/write APIs" from Yosry Ahmed
   updates zswap to the new object read/write APIs and thus permits the
   removal of some legacy code from zpool and zsmalloc.

 - The series "Some trivial cleanups for shmem" from Baolin Wang does as
   it claims.

 - The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
   Alistair Popple regularizes the weird ZONE_DEVICE page refcount
   handling in DAX, permittig the removal of a number of special-case
   checks.

 - The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
   preparatoty refactoring and cleanup of the mremap() code.

 - The series "mm: MM owner tracking for large folios (!hugetlb) +
   CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
   which we determine whether a large folio is known to be mapped
   exclusively into a single MM.

 - The series "mm/damon: add sysfs dirs for managing DAMOS filters based
   on handling layers" from SeongJae Park adds a couple of new sysfs
   directories to ease the management of DAMON/DAMOS filters.

 - The series "arch, mm: reduce code duplication in mem_init()" from
   Mike Rapoport consolidates many per-arch implementations of
   mem_init() into code generic code, where that is practical.

 - The series "mm/damon/sysfs: commit parameters online via
   damon_call()" from SeongJae Park continues the cleaning up of sysfs
   access to DAMON internal data.

 - The series "mm: page_ext: Introduce new iteration API" from Luiz
   Capitulino reworks the page_ext initialization to fix a boot-time
   crash which was observed with an unusual combination of compile and
   cmdline options.

 - The series "Buddy allocator like (or non-uniform) folio split" from
   Zi Yan reworks the code to split a folio into smaller folios. The
   main benefit is lessened memory consumption: fewer post-split folios
   are generated.

 - The series "Minimize xa_node allocation during xarry split" from Zi
   Yan reduces the number of xarray xa_nodes which are generated during
   an xarray split.

 - The series "drivers/base/memory: Two cleanups" from Gavin Shan
   performs some maintenance work on the drivers/base/memory code.

 - The series "Add tracepoints for lowmem reserves, watermarks and
   totalreserve_pages" from Martin Liu adds some more tracepoints to the
   page allocator code.

 - The series "mm/madvise: cleanup requests validations and
   classifications" from SeongJae Park cleans up some warts which
   SeongJae observed during his earlier madvise work.

 - The series "mm/hwpoison: Fix regressions in memory failure handling"
   from Shuai Xue addresses two quite serious regressions which Shuai
   has observed in the memory-failure implementation.

 - The series "mm: reliable huge page allocator" from Johannes Weiner
   makes huge page allocations cheaper and more reliable by reducing
   fragmentation.

 - The series "Minor memcg cleanups & prep for memdescs" from Matthew
   Wilcox is preparatory work for the future implementation of memdescs.

 - The series "track memory used by balloon drivers" from Nico Pache
   introduces a way to track memory used by our various balloon drivers.

 - The series "mm/damon: introduce DAMOS filter type for active pages"
   from Nhat Pham permits users to filter for active/inactive pages,
   separately for file and anon pages.

 - The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
   separates the proactive reclaim statistics from the direct reclaim
   statistics.

 - The series "mm/vmscan: don't try to reclaim hwpoison folio" from
   Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
   code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
  mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
  x86/mm: restore early initialization of high_memory for 32-bits
  mm/vmscan: don't try to reclaim hwpoison folio
  mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
  cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
  mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
  selftests/mm: speed up split_huge_page_test
  selftests/mm: uffd-unit-tests support for hugepages > 2M
  docs/mm/damon/design: document active DAMOS filter type
  mm/damon: implement a new DAMOS filter type for active pages
  fs/dax: don't disassociate zero page entries
  MM documentation: add "Unaccepted" meminfo entry
  selftests/mm: add commentary about 9pfs bugs
  fork: use __vmalloc_node() for stack allocation
  docs/mm: Physical Memory: Populate the "Zones" section
  xen: balloon: update the NR_BALLOON_PAGES state
  hv_balloon: update the NR_BALLOON_PAGES state
  balloon_compaction: update the NR_BALLOON_PAGES state
  meminfo: add a per node counter for balloon drivers
  mm: remove references to folio in __memcg_kmem_uncharge_page()
  ...
2025-04-01 09:29:18 -07:00
Linus Torvalds
01d5b167dc Modules changes for 6.15-rc1
- Use RCU instead of RCU-sched
 
   The mix of rcu_read_lock(), rcu_read_lock_sched() and preempt_disable()
   in the module code and its users has been replaced with just
   rcu_read_lock().
 
 - The rest of changes are smaller fixes and updates.
 
 The changes have been on linux-next for at least 2 weeks, with the RCU
 cleanup present for 2 months. One performance problem was reported with the
 RCU change when KASAN + lockdep were enabled, but it was effectively
 addressed by the already merged ee57ab5a32 ("locking/lockdep: Disable
 KASAN instrumentation of lockdep.c").
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEEIduBR9MnFA82q/jtumpXJwqY6poFAmfmwrsUHHBldHIucGF2
 bHVAc3VzZS5jb20ACgkQumpXJwqY6prWxgf/S7Pvdywm10vJ6fooYa+GxXNMwhyh
 XRjZ4m9gjeTNf2KLwX0XHv0XZeFHOmHfjd3iI+pS6CXZnCFTN9J3XPLYsrTxXUb6
 U6zzLf8Zsz8TzeI4dgvSBsZln7oICSACkAgdJCq23hpNKeaeRo91dgiZaIwyZJG3
 FekqSFtP7pYhfFoNkrFKysqbgl1+RWWZ79L2qRJA0bPzVFlvRUuh6cOHQw+8RMqf
 BYLwnArjTkW8AcXpxIXSiwphDHVZ81B96xoplavyoprA5FDpv1W+8y4DtxdWFn+1
 QVWCs/ZV3KrwXWpZev625w3fIOOIXILqRINOzLfvXTw+1xFS3TzSQEpVeg==
 =4OKc
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux

Pull modules updates from Petr Pavlu:

 - Use RCU instead of RCU-sched

   The mix of rcu_read_lock(), rcu_read_lock_sched() and
   preempt_disable() in the module code and its users has
   been replaced with just rcu_read_lock()

 - The rest of changes are smaller fixes and updates

* tag 'modules-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: (32 commits)
  MAINTAINERS: Update the MODULE SUPPORT section
  module: Remove unnecessary size argument when calling strscpy()
  module: Replace deprecated strncpy() with strscpy()
  params: Annotate struct module_param_attrs with __counted_by()
  bug: Use RCU instead RCU-sched to protect module_bug_list.
  static_call: Use RCU in all users of __module_text_address().
  kprobes: Use RCU in all users of __module_text_address().
  bpf: Use RCU in all users of __module_text_address().
  jump_label: Use RCU in all users of __module_text_address().
  jump_label: Use RCU in all users of __module_address().
  x86: Use RCU in all users of __module_address().
  cfi: Use RCU while invoking __module_address().
  powerpc/ftrace: Use RCU in all users of __module_text_address().
  LoongArch: ftrace: Use RCU in all users of __module_text_address().
  LoongArch/orc: Use RCU in all users of __module_address().
  arm64: module: Use RCU in all users of __module_text_address().
  ARM: module: Use RCU in all users of __module_text_address().
  module: Use RCU in all users of __module_text_address().
  module: Use RCU in all users of __module_address().
  module: Use RCU in search_module_extables().
  ...
2025-03-30 15:44:36 -07:00
Linus Torvalds
7405c0f01a Miscellaneous x86 fixes and updates:
- Fix a large number of x86 Kconfig dependency and help text accuracy
    bugs/problems, by Mateusz Jończyk and David Heideberg.
 
  - Fix a VM_PAT interaction with fork() crash. This also touches
    core kernel code.
 
  - Fix an ORC unwinder bug for interrupt entries
 
  - Fixes and cleanups.
 
  - Fix an AMD microcode loader bug that can promote verification failures
    into success.
 
  - Add early-printk support for MMIO based UARTs on an x86 board that
    had no other serial debugging facility and also experienced early
    boot crashes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfnFBERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iVDxAAmiB4soT3/WbaWJJdeVyxEL7sOmUNOm04
 5kAVHJVK8QGdje0eWa6h7xmuQD3UOxafE2coCrOxHhZi2qpAAY6CPIIy6oIBRwZK
 gLgT5xn1CHojfm4UFC3YUOyecBRPUF2C5jfkajWdZHumyPP/sOObqvGanpQRAYd5
 bfPHEvrBpeEeS7WkATCdyF2j+I5xYflD4g/MDAsMmqasQHOnjBuFX5VBeVxxkysC
 dMsFkFpxqcA95MnnyOnxXzgOtRTY0UystX07D3Bk1pqhG9zor+mp8OynsTRCU87T
 ZPPbUr2qACNmCqEEXl+F1mAkgj5H66xE2gaJdYx0/jBAIbX8Nwih7mMxhJShVU07
 Lhc0tukmVrDoDaVIr2HsxqI8iokuYLszUjDAqEQmQDrgelL6usPYghN1b2bDSJ9r
 0hCO/s79024H/U9oMrC+CF52D5UH/fE98ipigrbKRIO/hOsoxiiniF3DG2NVWZM2
 n5nPnOdbperqjCEteN1nxQfr7XZkvP95Bwmuqqc90XH+tzKJdHruUkbm4ua7NEEz
 WKgsUIYFjeN5ZrHbJaNtHlQueTyvsyGmL1nlaLi/MaJbSXPsM/WfwvHsaKTh3NrE
 BFwEAhMZVLDHEfnFT0Ev7Mm1MGpW8MbHoRBR1+E5FWWNS4X0yGLKXWRp8diw25Tm
 W3ZVsn65E6U=
 =/qKX
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes and updates from Ingo Molnar:

 - Fix a large number of x86 Kconfig dependency and help text accuracy
   bugs/problems, by Mateusz Jończyk and David Heideberg

 - Fix a VM_PAT interaction with fork() crash. This also touches core
   kernel code

 - Fix an ORC unwinder bug for interrupt entries

 - Fixes and cleanups

 - Fix an AMD microcode loader bug that can promote verification
   failures into success

 - Add early-printk support for MMIO based UARTs on an x86 board that
   had no other serial debugging facility and also experienced early
   boot crashes

* tag 'x86-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode/AMD: Fix __apply_microcode_amd()'s return value
  x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()
  x86/fpu: Update the outdated comment above fpstate_init_user()
  x86/early_printk: Add support for MMIO-based UARTs
  x86/dumpstack: Fix inaccurate unwinding from exception stacks due to misplaced assignment
  x86/entry: Fix ORC unwinder for PUSH_REGS with save_ret=1
  x86/Kconfig: Fix lists in X86_EXTENDED_PLATFORM help text
  x86/Kconfig: Correct X86_X2APIC help text
  x86/speculation: Remove the extra #ifdef around CALL_NOSPEC
  x86/Kconfig: Document release year of glibc 2.3.3
  x86/Kconfig: Make CONFIG_PCI_CNB20LE_QUIRK depend on X86_32
  x86/Kconfig: Document CONFIG_PCI_MMCONFIG
  x86/Kconfig: Update lists in X86_EXTENDED_PLATFORM
  x86/Kconfig: Move all X86_EXTENDED_PLATFORM options together
  x86/Kconfig: Always enable ARCH_SPARSEMEM_ENABLE
  x86/Kconfig: Enable X86_X2APIC by default and improve help text
2025-03-30 15:25:15 -07:00
Linus Torvalds
b4c5c57c2d Miscellaneous locking fixes and updates:
- Fix a locking self-test FAIL on PREEMPT_RT kernels
  - Fix nr_unused_locks accounting bug
  - Simplify the split-lock debugging feature's fast-path
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfnDqERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jXVw/+IGVWjYnw5Ac+HyccKwY7+OEmKiGx+9C4
 yS3pc1wrofkpIVRsr0GuOv2h5V+YZJvuRrZPbatVidUkEJAQ++Iaeef+gnunfMX2
 0mRk5B0CG1M+y3lNFOhIEafkGDQOtdFTOmoRv508Xj3mb8cRMRCgkf8S3600l2IB
 ZEw67f8waHy1PwKC+dFlr0QjGL0+qbkrDqB1m4VSy0/egaSNMFjYVLLBAe6gn+/X
 zU/SWBqyV9jSfmwABSiKCpjt/GziuwGoADJpxAbSR/1NaCy866VSBLRjiiLI/c5q
 xdKd3GyD3IhLDCbzKrwVugeioxoPzXLmZKajVhZARv2kPW8DbzUMymP/eVkOf7VJ
 6dDNV3Yyq568YQBi3PQu2vESumHRctLOsRSAr2TnXlbFzBvjBM+Ia11e+p4mg9FU
 Hcn98Rt3FfgigrzH4IeLaYTbm6A5amj2ymoMwjR/vchnB8tlXddc3/KumIdak04Q
 hHRHA0cyNg1YvB/8yB0NKf5jETpbYaaKhCO9KlsceLDjuhrwlmi/XFz+bwBJIBgD
 zug2hevSW2RVRGidJ5Qz91qG76xBkhH038qorcegzGybMQ0p+Hw+/BN7OzPSueaq
 ZMRFde3CtrB332SS73KToG5XvyuffhW8XyCvkOt5W6z9P7n3UOFzDcKr/wrePV9d
 9kEJPdOCOwM=
 =jcdr
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc locking fixes and updates from Ingo Molnar:

 - Fix a locking self-test FAIL on PREEMPT_RT kernels

 - Fix nr_unused_locks accounting bug

 - Simplify the split-lock debugging feature's fast-path

* tag 'locking-urgent-2025-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/lockdep: Decrease nr_unused_locks if lock unused in zap_class()
  lockdep: Fix wait context check on softirq for PREEMPT_RT
  x86/split_lock: Simplify reenabling
2025-03-30 15:18:36 -07:00
Boris Ostrovsky
31ab12df72 x86/microcode/AMD: Fix __apply_microcode_amd()'s return value
When verify_sha256_digest() fails, __apply_microcode_amd() should propagate
the failure by returning false (and not -1 which is promoted to true).

Fixes: 50cef76d5c ("x86/microcode/AMD: Load only SHA256-checksummed patches")
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250327230503.1850368-2-boris.ostrovsky@oracle.com
2025-03-28 12:49:02 +01:00
Linus Torvalds
7b667acd69 powerpc updates for 6.15
- Removal of support for IBM Cell Blades
 
  - SMP support for microwatt platform
 
  - Support for inline static calls on PPC32
 
  - Enable pmu selftests for power11 platform
 
  - Enable hardware trace macro (HTM) hcall support
 
  - Support for limited address mode capability
 
  - Changes to RMA size from 512 MB to 768 MB to handle fadump
 
  - Misc fixes and cleanups
 
 Thanks to: Abhishek Dubey, Amit Machhiwal, Andreas Schwab, Arnd Bergmann,
 Athira Rajeev, Avnish Chouhan, Christophe Leroy, Disha Goel, Donet Tom, Gaurav
 Batra, Gautam Menghani, Hari Bathini, Kajol Jain, Kees Cook, Mahesh Salgaonkar,
 Michael Ellerman, Paul Mackerras, Ritesh Harjani (IBM), Sathvika Vasireddy,
 Segher Boessenkool, Sourabh Jain, Vaibhav Jain, Venkat Rao Bagalkote.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqX2DNAOgU8sBX3pRpnEsdPSHZJQFAmfjZnYACgkQpnEsdPSH
 ZJTmKg/+NGW7wyFY9d8Iai9ncYY7GSzsMSDTaan7qg0QWOd5gjHbsdbava7TM/DW
 8p9XsC+17kSeftRNUjtc52bSN8Ei2gBdsXIagQG1alfB2X2e6wkNauifK+dz3Su6
 usMEZZTO5R/jFDotFXNM1nsUj+8dvjnPgOUrji/P8k7PT5295wpza0hz1fy5SrOA
 hM5cliBP36UgFe5Efvgm4OUX2gQIhbc3stt9MVfymW/k0Mit5f41UIPuVGiTWowY
 s0cUJGkhxUlGXT3VfOVKuZfn4u9KMha7UCl9afSceJzXOdnUIKIbskui1VEv6cD/
 iSIxi839uErAobFHlsLYprgYFciYLII3xe2qNZCA/ZxeIMS/Mm6xokESeWLhBnfa
 P7ke6l0z3GDtTvgI2eSeU9BdrVveF1NgbP9GYSKgT6gtw/kRRnxgHF8tzmLON5PT
 KXpQlzz8VuSBRtF2jnLFU89+FFwSA1bRUhDrp89HyYFqw1B5g4N7kFFTUJWHOuKS
 fwPGy+cveKehmCUBedeTRFqHvvqdwpD/WnPlQzCly3WxqdL8U/eTXYftMiAwuK28
 ovLuSs3vRThKRQ8DnUa5oB0UGsjMpRV5LdvYkhw+x8mZKUR59oj4fx2ae4TtPakg
 dbAYuPPkCORdaSga/nV6vQgsLprFpcGX3dq6E19+BVBAY5D+1PE=
 =GFcj
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Madhavan Srinivasan:

 - Remove support for IBM Cell Blades

 - SMP support for microwatt platform

 - Support for inline static calls on PPC32

 - Enable pmu selftests for power11 platform

 - Enable hardware trace macro (HTM) hcall support

 - Support for limited address mode capability

 - Changes to RMA size from 512 MB to 768 MB to handle fadump

 - Misc fixes and cleanups

Thanks to Abhishek Dubey, Amit Machhiwal, Andreas Schwab, Arnd Bergmann,
Athira Rajeev, Avnish Chouhan, Christophe Leroy, Disha Goel, Donet Tom,
Gaurav Batra, Gautam Menghani, Hari Bathini, Kajol Jain, Kees Cook,
Mahesh Salgaonkar, Michael Ellerman, Paul Mackerras, Ritesh Harjani
(IBM), Sathvika Vasireddy, Segher Boessenkool, Sourabh Jain, Vaibhav
Jain, and Venkat Rao Bagalkote.

* tag 'powerpc-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (61 commits)
  powerpc/kexec: fix physical address calculation in clear_utlb_entry()
  crypto: powerpc: Mark ghashp8-ppc.o as an OBJECT_FILES_NON_STANDARD
  powerpc: Fix 'intra_function_call not a direct call' warning
  powerpc/perf: Fix ref-counting on the PMU 'vpa_pmu'
  KVM: PPC: Enable CAP_SPAPR_TCE_VFIO on pSeries KVM guests
  powerpc/prom_init: Fixup missing #size-cells on PowerBook6,7
  powerpc/microwatt: Add SMP support
  powerpc: Define config option for processors with broadcast TLBIE
  powerpc/microwatt: Define an idle power-save function
  powerpc/microwatt: Device-tree updates
  powerpc/microwatt: Select COMMON_CLK in order to get the clock framework
  net: toshiba: Remove reference to PPC_IBM_CELL_BLADE
  net: spider_net: Remove powerpc Cell driver
  cpufreq: ppc_cbe: Remove powerpc Cell driver
  genirq: Remove IRQ_EDGE_EOI_HANDLER
  docs: Remove reference to removed CBE_CPUFREQ_SPU_GOVERNOR
  powerpc: Remove UDBG_RTAS_CONSOLE
  powerpc/io: Use standard barrier macros in io.c
  powerpc/io: Rename _insw_ns() etc.
  powerpc/io: Use generic raw accessors
  ...
2025-03-27 19:39:08 -07:00
Vishal Annapurve
9f98a4f4e7 x86/tdx: Fix arch_safe_halt() execution for TDX VMs
Direct HLT instruction execution causes #VEs for TDX VMs which is routed
to hypervisor via TDCALL. If HLT is executed in STI-shadow, resulting #VE
handler will enable interrupts before TDCALL is routed to hypervisor
leading to missed wakeup events, as current TDX spec doesn't expose
interruptibility state information to allow #VE handler to selectively
enable interrupts.

Commit bfe6ed0c67 ("x86/tdx: Add HLT support for TDX guests")
prevented the idle routines from executing HLT instruction in STI-shadow.
But it missed the paravirt routine which can be reached via this path
as an example:

	kvm_wait()       =>
        safe_halt()      =>
        raw_safe_halt()  =>
        arch_safe_halt() =>
        irq.safe_halt()  =>
        pv_native_safe_halt()

To reliably handle arch_safe_halt() for TDX VMs, introduce explicit
dependency on CONFIG_PARAVIRT and override paravirt halt()/safe_halt()
routines with TDX-safe versions that execute direct TDCALL and needed
interrupt flag updates. Executing direct TDCALL brings in additional
benefit of avoiding HLT related #VEs altogether.

As tested by Ryan Afranji:

  "Tested with the specjbb2015 benchmark. It has heavy lock contention which leads
   to many halt calls. TDX VMs suffered a poor score before this patchset.

   Verified the major performance improvement with this patchset applied."

Fixes: bfe6ed0c67 ("x86/tdx: Add HLT support for TDX guests")
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Ryan Afranji <afranji@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250228014416.3925664-3-vannapurve@google.com
2025-03-26 08:51:20 +01:00
Kirill A. Shutemov
22cc5ca5de x86/paravirt: Move halt paravirt calls under CONFIG_PARAVIRT
CONFIG_PARAVIRT_XXL is mainly defined/used by XEN PV guests. For
other VM guest types, features supported under CONFIG_PARAVIRT
are self sufficient. CONFIG_PARAVIRT mainly provides support for
TLB flush operations and time related operations.

For TDX guest as well, paravirt calls under CONFIG_PARVIRT meets
most of its requirement except the need of HLT and SAFE_HLT
paravirt calls, which is currently defined under
CONFIG_PARAVIRT_XXL.

Since enabling CONFIG_PARAVIRT_XXL is too bloated for TDX guest
like platforms, move HLT and SAFE_HLT paravirt calls under
CONFIG_PARAVIRT.

Moving HLT and SAFE_HLT paravirt calls are not fatal and should not
break any functionality for current users of CONFIG_PARAVIRT.

Fixes: bfe6ed0c67 ("x86/tdx: Add HLT support for TDX guests")
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Ryan Afranji <afranji@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20250228014416.3925664-2-vannapurve@google.com
2025-03-26 08:48:18 +01:00
Linus Torvalds
ee6740fd34 CRC updates for 6.15
Another set of improvements to the kernel's CRC (cyclic redundancy
 check) code:
 
 - Rework the CRC64 library functions to be directly optimized, like what
   I did last cycle for the CRC32 and CRC-T10DIF library functions.
 
 - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
   support and acceleration for crc64_be and crc64_nvme.
 
 - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
   crc_t10dif, crc64_be, and crc64_nvme.
 
 - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they
   are no longer needed there.
 
 - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect.
 
 - Add kunit test cases for crc64_nvme and crc7.
 
 - Eliminate redundant functions for calculating the Castagnoli CRC32,
   settling on just crc32c().
 
 - Remove unnecessary prompts from some of the CRC kconfig options.
 
 - Further optimize the x86 crc32c code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ+CGGhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3wRAP4tbnzawUmlIHIF0hleoADXehUgAhMt
 NZn15mGvyiuwIQEA8W9qvnLdFXZkdxhxAEvDDFjyrRauL6eGtr/GvCx4AQY=
 =wmKG
 -----END PGP SIGNATURE-----

Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull CRC updates from Eric Biggers:
 "Another set of improvements to the kernel's CRC (cyclic redundancy
  check) code:

   - Rework the CRC64 library functions to be directly optimized, like
     what I did last cycle for the CRC32 and CRC-T10DIF library
     functions

   - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
     support and acceleration for crc64_be and crc64_nvme

   - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
     crc_t10dif, crc64_be, and crc64_nvme

   - Remove crc_t10dif and crc64_rocksoft from the crypto API, since
     they are no longer needed there

   - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect

   - Add kunit test cases for crc64_nvme and crc7

   - Eliminate redundant functions for calculating the Castagnoli CRC32,
     settling on just crc32c()

   - Remove unnecessary prompts from some of the CRC kconfig options

   - Further optimize the x86 crc32c code"

* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
  x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
  lib/crc: remove unnecessary prompt for CONFIG_CRC64
  lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
  lib/crc: remove unnecessary prompt for CONFIG_CRC8
  lib/crc: remove unnecessary prompt for CONFIG_CRC7
  lib/crc: remove unnecessary prompt for CONFIG_CRC4
  lib/crc7: unexport crc7_be_syndrome_table
  lib/crc_kunit.c: update comment in crc_benchmark()
  lib/crc_kunit.c: add test and benchmark for crc7_be()
  x86/crc32: optimize tail handling for crc32c short inputs
  riscv/crc64: add Zbc optimized CRC64 functions
  riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
  riscv/crc32: reimplement the CRC32 functions using new template
  riscv/crc: add "template" for Zbc optimized CRC functions
  x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
  x86/crc32: improve crc32c_arch() code generation with clang
  x86/crc64: implement crc64_be and crc64_nvme using new template
  x86/crc-t10dif: implement crc_t10dif using new template
  x86/crc32: implement crc32_le using new template
  x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
  ...
2025-03-25 18:33:04 -07:00
Linus Torvalds
7d20aa5c32 Power management updates for 6.15-rc1
- Manage sysfs attributes and boost frequencies efficiently from
    cpufreq core to reduce boilerplate code in drivers (Viresh Kumar).
 
  - Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider,
    Dhananjay Ugwekar, Imran Shaik, zuoqian).
 
  - Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky
    Bai).
 
  - cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski).
 
  - Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng).
 
  - Optimize the amd-pstate driver to avoid cases where call paths end
    up calling the same writes multiple times and needlessly caching
    variables through code reorganization, locking overhaul and tracing
    adjustments (Mario Limonciello, Dhananjay Ugwekar).
 
  - Make it possible to avoid enabling capacity-aware scheduling (CAS) in
    the intel_pstate driver and relocate a check for out-of-band (OOB)
    platform handling in it to make it detect OOB before checking HWP
    availability (Rafael Wysocki).
 
  - Fix dbs_update() to avoid inadvertent conversions of negative integer
    values to unsigned int which causes CPU frequency selection to be
    inaccurate in some cases when the "conservative" cpufreq governor is
    in use (Jie Zhan).
 
  - Update the handling of the most recent idle intervals in the menu
    cpuidle governor to prevent useful information from being discarded
    by it in some cases and improve the prediction accuracy (Rafael
    Wysocki).
 
  - Make it possible to tell the intel_idle driver to ignore its built-in
    table of idle states for the given processor, clean up the handling
    of auto-demotion disabling on Baytrail and Cherrytrail chips in it,
    and update its MAINTAINERS entry (David Arcari, Artem Bityutskiy,
    Rafael Wysocki).
 
  - Make some cpuidle drivers use for_each_present_cpu() instead of
    for_each_possible_cpu() during initialization to avoid issues
    occurring when nosmp or maxcpus=0 are used (Jacky Bai).
 
  - Clean up the Energy Model handling code somewhat (Rafael Wysocki).
 
  - Use kfree_rcu() to simplify the handling of runtime Energy Model
    updates (Li RongQing).
 
  - Add an entry for the Energy Model framework to MAINTAINERS as
    properly maintained (Lukasz Luba).
 
  - Address RCU-related sparse warnings in the Energy Model code (Rafael
    Wysocki).
 
  - Remove ENERGY_MODEL dependency on SMP and allow it to be selected
    when DEVFREQ is set without CPUFREQ so it can be used on a wider
    range of systems (Jeson Gao).
 
  - Unify error handling during runtime suspend and runtime resume in the
    core to help drivers to implement more consistent runtime PM error
    handling (Rafael Wysocki).
 
  - Drop a redundant check from pm_runtime_force_resume() and rearrange
    documentation related to __pm_runtime_disable() (Rafael Wysocki).
 
  - Rework the handling of the "smart suspend" driver flag in the PM core
    to avoid issues hat may occur when drivers using it depend on some
    other drivers and clean up the related PM core code (Rafael Wysocki,
    Colin Ian King).
 
  - Fix the handling of devices with the power.direct_complete flag set
    if device_suspend() returns an error for at least one device to avoid
    situations in which some of them may not be resumed (Rafael Wysocki).
 
  - Use mutex_trylock() in hibernate_compressor_param_set() to avoid a
    possible deadlock that may occur if the "compressor" hibernation
    module parameter is accessed during the registration of a new
    ieee80211 device (Lizhi Xu).
 
  - Suppress sleeping parent warning in device_pm_add() in the case when
    new children are added under a device with the power.direct_complete
    set after it has been processed by device_resume() (Xu Yang).
 
  - Remove needless return in three void functions related to system
    wakeup (Zijun Hu).
 
  - Replace deprecated kmap_atomic() with kmap_local_page() in the
    hibernation core code (David Reaver).
 
  - Remove unused helper functions related to system sleep (David Alan
    Gilbert).
 
  - Clean up s2idle_enter() so it does not lock and unlock CPU offline
    in vain and update comments in it (Ulf Hansson).
 
  - Clean up broken white space in dpm_wait_for_children() (Geert
    Uytterhoeven).
 
  - Update the cpupower utility to fix lib version-ing in it and memory
    leaks in error legs, remove hard-coded values, and implement CPU
    physical core querying (Thomas Renninger, John B. Wyatt IV, Shuah
    Khan, Yiwei Lin, Zhongqiu Han).
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmfhhTYSHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO16/gIAKuRiG1fFgUcUSXC1iFu42vrB/1i4wpA
 02GICACqM3K6/5jd3ct/WOU28GUgDs+xcmqH7CnMaM6y9nXEWjWarmSfFekAO+0q
 TPtQ7xTy0hBCB3he1P2uLKBJBin4Wn47U9/rvs4J7mQd5zDxTINKIiVoHg2lEE+s
 HAeSoNRb2sp5IZDm9+/LfhHNYRP1mJ97cbZlymqctGB3xgDL7qMLid/1+gFPHAQS
 4/LXj3IgyU8DpA/j5nhtpaAqjN5g2QxIUfQgADRIcESK99Y/7aAMs1/G0WhJKaay
 9yx+4/xmkGvVCZQx1DphksFLISEzltY0SFWLsoppPzBTGVEW2GQQsNI=
 =LqVy
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These are dominated by cpufreq updates which in turn are dominated by
  updates related to boost support in the core and drivers and
  amd-pstate driver optimizations.

  Apart from the above, there are some cpuidle updates including a
  rework of the most recent idle intervals handling in the venerable
  menu governor that leads to significant improvements in some
  performance benchmarks, as the governor is now more likely to predict
  a shorter idle duration in some cases, and there are updates of the
  core device power management code, mostly related to system suspend
  and resume, that should help to avoid potential issues arising when
  the drivers of devices depending on one another want to use different
  optimizations.

  There is also a usual collection of assorted fixes and cleanups,
  including removal of some unused code.

  Specifics:

   - Manage sysfs attributes and boost frequencies efficiently from
     cpufreq core to reduce boilerplate code in drivers (Viresh Kumar)

   - Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider,
     Dhananjay Ugwekar, Imran Shaik, zuoqian)

   - Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky
     Bai)

   - cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski)

   - Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng)

   - Optimize the amd-pstate driver to avoid cases where call paths end
     up calling the same writes multiple times and needlessly caching
     variables through code reorganization, locking overhaul and tracing
     adjustments (Mario Limonciello, Dhananjay Ugwekar)

   - Make it possible to avoid enabling capacity-aware scheduling (CAS)
     in the intel_pstate driver and relocate a check for out-of-band
     (OOB) platform handling in it to make it detect OOB before checking
     HWP availability (Rafael Wysocki)

   - Fix dbs_update() to avoid inadvertent conversions of negative
     integer values to unsigned int which causes CPU frequency selection
     to be inaccurate in some cases when the "conservative" cpufreq
     governor is in use (Jie Zhan)

   - Update the handling of the most recent idle intervals in the menu
     cpuidle governor to prevent useful information from being discarded
     by it in some cases and improve the prediction accuracy (Rafael
     Wysocki)

   - Make it possible to tell the intel_idle driver to ignore its
     built-in table of idle states for the given processor, clean up the
     handling of auto-demotion disabling on Baytrail and Cherrytrail
     chips in it, and update its MAINTAINERS entry (David Arcari, Artem
     Bityutskiy, Rafael Wysocki)

   - Make some cpuidle drivers use for_each_present_cpu() instead of
     for_each_possible_cpu() during initialization to avoid issues
     occurring when nosmp or maxcpus=0 are used (Jacky Bai)

   - Clean up the Energy Model handling code somewhat (Rafael Wysocki)

   - Use kfree_rcu() to simplify the handling of runtime Energy Model
     updates (Li RongQing)

   - Add an entry for the Energy Model framework to MAINTAINERS as
     properly maintained (Lukasz Luba)

   - Address RCU-related sparse warnings in the Energy Model code
     (Rafael Wysocki)

   - Remove ENERGY_MODEL dependency on SMP and allow it to be selected
     when DEVFREQ is set without CPUFREQ so it can be used on a wider
     range of systems (Jeson Gao)

   - Unify error handling during runtime suspend and runtime resume in
     the core to help drivers to implement more consistent runtime PM
     error handling (Rafael Wysocki)

   - Drop a redundant check from pm_runtime_force_resume() and rearrange
     documentation related to __pm_runtime_disable() (Rafael Wysocki)

   - Rework the handling of the "smart suspend" driver flag in the PM
     core to avoid issues hat may occur when drivers using it depend on
     some other drivers and clean up the related PM core code (Rafael
     Wysocki, Colin Ian King)

   - Fix the handling of devices with the power.direct_complete flag set
     if device_suspend() returns an error for at least one device to
     avoid situations in which some of them may not be resumed (Rafael
     Wysocki)

   - Use mutex_trylock() in hibernate_compressor_param_set() to avoid a
     possible deadlock that may occur if the "compressor" hibernation
     module parameter is accessed during the registration of a new
     ieee80211 device (Lizhi Xu)

   - Suppress sleeping parent warning in device_pm_add() in the case
     when new children are added under a device with the
     power.direct_complete set after it has been processed by
     device_resume() (Xu Yang)

   - Remove needless return in three void functions related to system
     wakeup (Zijun Hu)

   - Replace deprecated kmap_atomic() with kmap_local_page() in the
     hibernation core code (David Reaver)

   - Remove unused helper functions related to system sleep (David Alan
     Gilbert)

   - Clean up s2idle_enter() so it does not lock and unlock CPU offline
     in vain and update comments in it (Ulf Hansson)

   - Clean up broken white space in dpm_wait_for_children() (Geert
     Uytterhoeven)

   - Update the cpupower utility to fix lib version-ing in it and memory
     leaks in error legs, remove hard-coded values, and implement CPU
     physical core querying (Thomas Renninger, John B. Wyatt IV, Shuah
     Khan, Yiwei Lin, Zhongqiu Han)"

* tag 'pm-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (139 commits)
  PM: sleep: Fix bit masking operation
  dt-bindings: cpufreq: cpufreq-qcom-hw: Narrow properties on SDX75, SA8775p and SM8650
  dt-bindings: cpufreq: cpufreq-qcom-hw: Drop redundant minItems:1
  dt-bindings: cpufreq: cpufreq-qcom-hw: Add missing constraint for interrupt-names
  dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCS8300 compatible
  cpufreq: Init cpufreq only for present CPUs
  PM: sleep: Fix handling devices with direct_complete set on errors
  cpuidle: Init cpuidle only for present CPUs
  PM: clk: Remove unused pm_clk_remove()
  PM: sleep: core: Fix indentation in dpm_wait_for_children()
  PM: s2idle: Extend comment in s2idle_enter()
  PM: s2idle: Drop redundant locks when entering s2idle
  PM: sleep: Remove unused pm_generic_ wrappers
  cpufreq: tegra186: Share policy per cluster
  cpupower: Make lib versioning scheme more obvious and fix version link
  PM: EM: Rework the depends on for CONFIG_ENERGY_MODEL
  PM: EM: Address RCU-related sparse warnings
  cpupower: Implement CPU physical core querying
  pm: cpupower: remove hard-coded topology depth values
  pm: cpupower: Fix cmd_monitor() error legs to free cpu_topology
  ...
2025-03-25 15:00:18 -07:00
Linus Torvalds
21e0ff5b10 ACPI updates for 6.15-rc1
- Use the str_on_off() helper function instead of hard-coded strings in
    the ACPI power resources handling code (Thorsten Blum).
 
  - Add fan speed reporting for ACPI fans that have _FST, but otherwise
    do not support the entire ACPI 4 fan interface (Joshua Grisham).
 
  - Fix a stale comment regarding trip points in acpi_thermal_add() that
    diverged from the commented code after removing _CRT evaluation from
    acpi_thermal_get_trip_points() (xueqin Luo).
 
  - Make ACPI button driver also subscribe to system events (Mario
    Limonciello).
 
  - Use the str_yes_no() helper function instead of hard-coded strings in
    the ACPI backlight (video) driver (Thorsten Blum).
 
  - Add a missing header file include to the x86 arch CPPC code (Mario
    Limonciello).
 
  - Rework the sysfs attributes implementation in the ACPI platform-profile
    driver and improve the unregistration code in it (Nathan Chancellor,
    Kurt Borja).
 
  - Prevent the ACPI HED driver from being built as a module and change
    its initcall level to subsys_initcall to avoid initialization ordering
    issues related to it (Xiaofei Tan).
 
  - Update a maintainer email address in the ACPI PMIC entry in
    MAINTAINERS (Mika Westerberg).
 
  - Address a GCC 15's -Wunterminated-string-initialization warning in
    the core PNP subsystem code and remove some dead code from it (Kees
    Cook, David Alan Gilbert).
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmfhhZYSHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1yCMH/3IbFftpA2sJg504igRgLdMzDAhTc/Y3
 x8Y37v1e1Psxyp0SQni84H4E11QWaytSXngemnp39+LgN+14KW243z6v+PBGioyI
 +cdJw7kAk8v1aX+Ujel2Z3BIz9QPFqZd6d3R0AsZkZcI/28VW7kHNRXZ6p2kYmxK
 7acx0Y1cM1k0UotzpzQ4RDaTnFNKUGJFQwdKTEJU237gsFrVh8ev2hLm9Hy4FU6R
 zGtjrU2/oBEmoCVgrXG4n6bZYP4dZwCZ8ewckIvepGuTPTHP8tYMxrELw/3A1901
 +lnN6zK6nMpvCd9cl0ongT5iFG4gsuBGanvnP7Mf51YI380jmNSLeCI=
 =/b+r
 -----END PGP SIGNATURE-----

Merge tag 'acpi-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI updates from Rafael Wysocki:
 "From the functional perspective, the most significant changes here are
  the ACPI fan driver update allowing it to handle fans with
  fine-grained state checking supported, but without fine-grained
  control, and the ACPI button driver update making it subscribe to
  system event notifications (in addition to device notifications) which
  on some systems is requisite for waking up the system from sleep.

  The rest is fixes and cleanups including removal of some dead code.

  Specifics:

   - Use the str_on_off() helper function instead of hard-coded strings
     in the ACPI power resources handling code (Thorsten Blum)

   - Add fan speed reporting for ACPI fans that have _FST, but otherwise
     do not support the entire ACPI 4 fan interface (Joshua Grisham)

   - Fix a stale comment regarding trip points in acpi_thermal_add()
     that diverged from the commented code after removing _CRT
     evaluation from acpi_thermal_get_trip_points() (xueqin Luo)

   - Make ACPI button driver also subscribe to system events (Mario
     Limonciello)

   - Use the str_yes_no() helper function instead of hard-coded strings
     in the ACPI backlight (video) driver (Thorsten Blum)

   - Add a missing header file include to the x86 arch CPPC code (Mario
     Limonciello)

   - Rework the sysfs attributes implementation in the ACPI
     platform-profile driver and improve the unregistration code in it
     (Nathan Chancellor, Kurt Borja)

   - Prevent the ACPI HED driver from being built as a module and change
     its initcall level to subsys_initcall to avoid initialization
     ordering issues related to it (Xiaofei Tan)

   - Update a maintainer email address in the ACPI PMIC entry in
     MAINTAINERS (Mika Westerberg)

   - Address a GCC 15's -Wunterminated-string-initialization warning in
     the core PNP subsystem code and remove some dead code from it (Kees
     Cook, David Alan Gilbert)"

* tag 'acpi-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PNP: Expand length of fixup id string
  PNP: Remove prehistoric deadcode
  ACPI: button: Install notifier for system events as well
  ACPI: fan: Add fan speed reporting for fans with only _FST
  ACPI: HED: Always initialize before evged
  x86/ACPI: CPPC: Add missing include
  ACPI: video: Use str_yes_no() helper in acpi_video_bus_add()
  ACPI: platform_profile: Improve platform_profile_unregister()
  ACPI: platform-profile: Fix CFI violation when accessing sysfs files
  ACPI: power: Use str_on_off() helper function
  ACPI: thermal: Fix stale comment regarding trip points
  MAINTAINERS: Use my kernel.org address for ACPI PMIC work
2025-03-25 14:56:33 -07:00
Linus Torvalds
a5b3d8660b hyperv-next for 6.15
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmfhlLATHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXgchCADOz33rSm4G4w4r0qT05dTDi/lZkEdK
 64dQq322XXP/C9FfR66d30243gsAmuM5a0SvzFHLXAOu6yqM270Xehd/Rud+Um2s
 lSVnc0Ux0AWBgksqFd0t577aN7zmJEukosEYO5lBNop+zOcadrm3S6Th/AoL2h/D
 yphPkhH13bsCK+Wll/eBOQLIhC9iA0konYbBLuEQ5MqvUbrzc6Rmb5gxsHHZKOqg
 vLjkrYR/d3s2gIpKxiFp0RwvzGyffZEHxvU/YF3hTenPMlTlnXWbyspBSTVmWggP
 13IFLzqxDdW9RgUnGB4xRc424AC1LKqEr42QPQE7zGvl2jdJriA2Q1LT
 =BXqj
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Add support for running as the root partition in Hyper-V (Microsoft
   Hypervisor) by exposing /dev/mshv (Nuno and various people)

 - Add support for CPU offlining in Hyper-V (Hamza Mahfooz)

 - Misc fixes and cleanups (Roman Kisel, Tianyu Lan, Wei Liu, Michael
   Kelley, Thorsten Blum)

* tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
  x86/hyperv: fix an indentation issue in mshyperv.h
  x86/hyperv: Add comments about hv_vpset and var size hypercall input args
  Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
  hyperv: Add definitions for root partition driver to hv headers
  x86: hyperv: Add mshv_handler() irq handler and setup function
  Drivers: hv: Introduce per-cpu event ring tail
  Drivers: hv: Export some functions for use by root partition module
  acpi: numa: Export node_to_pxm()
  hyperv: Introduce hv_recommend_using_aeoi()
  arm64/hyperv: Add some missing functions to arm64
  x86/mshyperv: Add support for extended Hyper-V features
  hyperv: Log hypercall status codes as strings
  x86/hyperv: Fix check of return value from snp_set_vmsa()
  x86/hyperv: Add VTL mode callback for restarting the system
  x86/hyperv: Add VTL mode emergency restart callback
  hyperv: Remove unused union and structs
  hyperv: Add CONFIG_MSHV_ROOT to gate root partition support
  hyperv: Change hv_root_partition into a function
  hyperv: Convert hypercall statuses to linux error codes
  drivers/hv: add CPU offlining support
  ...
2025-03-25 14:47:04 -07:00
Linus Torvalds
0d86c23953 - A cleanup to the MCE notification machinery
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmfixMAACgkQEsHwGGHe
 VUrZ/A/+LmqCE4TQEsS04oLezqu9hfPHiQ1Z1DK1RpjLyx23YZGBoYslGHtltknZ
 rXt15UDINSYXlbnVBzIFtuTQX8GKWySr/DbN9rV/rb4xW81giOPwW96/M/LfnPBh
 kV3KCrgsPvIfKE1pmCkK0ThkemcVcjtvq83Jpn3C6ppsqDZdOrSH+e8ZCnKSVK3q
 n3IwDQXmBJXV+wZFiAMvMTqUpJVNCTeiQj+ACrfOqgnAZsGFsEsEKqZkdO2ouaEy
 7QAdY+6AELX3LnAu4mJKDVsZ/HSUip7uVOqqM02TMw5O4z/cpzd3rH8tObSTW9hr
 DLlLXmfOJliQdJGHAECv79DiViycF6axVdZh6WvfvJHzZYUyNSrjoWUTDUW3JFI1
 ZikhBh/hQlfas12k0dYYcObgF1li45LyfFl/uSyIfoO1aIno+Od8yv1/jRrd8s50
 7ehS5OFtpb4EsqCED2arAsiDiaoHwrYWAP8aoJVwXg5AZB6ShritSC6QlQpOgDCw
 81VOeARaJoYJggxDzxGYCjLQORzoweDuuMs41qZLqn3DfinYvotHUjo6j5DL2JEm
 iFEce2NeKvi+T2dB8k1EzqyGL0VKSh1ogI53RzGnaWUt1f8JnJuM7Je+VXI5LGL3
 Ce8sSVzZbc5MFPOCncoxXw7f68aND+P0lm+yA79lVMT7ytp54KA=
 =rY2E
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS update from Borislav Petkov:

 - A cleanup to the MCE notification machinery

* tag 'ras_core_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce/inject: Remove call to mce_notify_irq()
2025-03-25 14:13:35 -07:00
Linus Torvalds
2899aa3973 - First part of the MPAM work: split the architectural part of resctrl from the
filesystem part so that ARM's MPAM varian of resource control can be added
   later while sharing the user interface with x86 (James Morse)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmfi118ACgkQEsHwGGHe
 VUrhZhAAj9brJYnluZpNgOMl231QRJaK0Exz1TLMFvmEZxQSnRs6TJ4PVDqU7QQb
 lrvqbobf77BfO8u3jtFLZvcoXxG+zzaTEoDEqGc/57Gu4G/kC64S8kYWa88aUf4I
 lHS5kZvNUxVBh4L/33QaprigN61pZbhLoejOCdr3zWRJ62+/xoNXs1rV8N3Zwgdv
 6p/B56MMi1CBXXHbFSzBI1bXSb/gW9jMjTnvrHbg3sOzrvVVuigJMVYgEfcEi0lh
 npc0Iz/Gz3Bzemxcl05bm2eJ+Z9WR9CIHMp+PAewqL7eJCV0OBHUkClU9Ui92Js+
 BA7XhL4XAnnZAaXHQoBfskGzcQ91pWPpkjJwSQO7y3zl8A8lvTFJCb89tZMWiLDl
 bF9MmbyjJFMtEaIYLHlhoasilN2laRrnTW41ZhxEtSJ0IofE4OInJ2+pPB/TfT7O
 HfZtkadIDrH6p5qLXy9bRwPxHskuM+NX0bw0OxWfu49DGw3O8pRhTFkiQ/+ofuBb
 oJNwVBAH11AiXUZBR1ZunpYEkwMFlL4FyNOkq/OS6C51UUE72dYITR5HB0/wkTp2
 cc2oiX3CSQPKrA4G8BAvMb7zGTmryXRZ7nOkTzScVTm8BoyyZf9F69aTpg1Deuuf
 W8Z9WrabVBCEs7EhZ7OH9bvmpBFapoNDUwmt+gnTAw6U0QDspZw=
 =FRzf
 -----END PGP SIGNATURE-----

Merge tag 'x86_cache_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 resource control updates from Borislav Petkov:

 - First part of the MPAM work: split the architectural part of resctrl
   from the filesystem part so that ARM's MPAM varian of resource
   control can be added later while sharing the user interface with x86
   (James Morse)

* tag 'x86_cache_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  x86/resctrl: Move get_{mon,ctrl}_domain_from_cpu() to live with their callers
  x86/resctrl: Move get_config_index() to a header
  x86/resctrl: Handle throttle_mode for SMBA resources
  x86/resctrl: Move RFTYPE flags to be managed by resctrl
  x86/resctrl: Make resctrl_arch_pseudo_lock_fn() take a plr
  x86/resctrl: Make prefetch_disable_bits belong to the arch code
  x86/resctrl: Allow an architecture to disable pseudo lock
  x86/resctrl: Add resctrl_arch_ prefix to pseudo lock functions
  x86/resctrl: Move mbm_cfg_mask to struct rdt_resource
  x86/resctrl: Move mba_mbps_default_event init to filesystem code
  x86/resctrl: Change mon_event_config_{read,write}() to be arch helpers
  x86/resctrl: Add resctrl_arch_is_evt_configurable() to abstract BMEC
  x86/resctrl: Move the is_mbm_*_enabled() helpers to asm/resctrl.h
  x86/resctrl: Rewrite and move the for_each_*_rdt_resource() walkers
  x86/resctrl: Move monitor init work to a resctrl init call
  x86/resctrl: Move monitor exit work to a resctrl exit call
  x86/resctrl: Add an arch helper to reset one resource
  x86/resctrl: Move resctrl types to a separate header
  x86/resctrl: Move rdt_find_domain() to be visible to arch and fs code
  x86/resctrl: Expose resctrl fs's init function to the rest of the kernel
  ...
2025-03-25 13:51:28 -07:00
Linus Torvalds
906174776c - Some preparatory work to convert the mitigations machinery to mitigating
attack vectors instead of single vulnerabilities
 
 - Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag
 
 - Add support for a Zen5-specific SRSO mitigation
 
 - Cleanups and minor improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmfixS0ACgkQEsHwGGHe
 VUpi1xAAgvH2u8Eo8ibT5dABQpD65w3oQiykO+9aDpObG9w9beDVGlld8DJE61Rz
 6tcE0Clp2H/tMcCbn8zXIJ92TQ3wIX/85uZwLi1VEM1Tx7A6VtAbPv8WKfZE3FCX
 9v92HRKnK3ql+A2ZR+oyy+/8RedUmia7y7/bXH1H7Zf2uozoKkmq5cQnwfq5iU4A
 qNiKuvSlQwjZ8Zz6Ax1ugHUkE4R7mlKh8rccLXl4+mVr63/lkPHSY3OFTjcYf4HW
 Ir92N86Spfo0/l0vsOOsWoYKmoaiVP7ouJh7YbKR3B0BGN0pt2MT476mehkEs427
 m4J6XhRKhIrsYmzEkLvvpsg12zO4/PKk8BEYNS7YPYlRaOwjV4ivyFS2aY6e55rh
 yUHyo9s+16f/Mp+/fNFXll3mdMxYBioPWh3M191nJkdfyKMrtf0MdKPRibaJB8wH
 yMF4D1gMx+hFbs0/VOS6dtqD9DKW7VgPg0LW+RysfhnLTuFFb5iBcH6Of7l7Z/Ca
 vVK+JxrhB1EDVI1+MKnESKPF9c6j3DRa2xrQHi/XYje1TGqnQ1v4CmsEObYBuJDN
 9M9t4QLzNuA/DA5tS7cxxtQ3YUthuJjPLcO4EVHOCvnqCAxkzp0i3dVMUr+YISl+
 2yFqaZdTt8s8FjTI21LOyuloCo30ZLlzaorFa0lp2cIyYup+1vg=
 =btX/
 -----END PGP SIGNATURE-----

Merge tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 speculation mitigation updates from Borislav Petkov:

 - Some preparatory work to convert the mitigations machinery to
   mitigating attack vectors instead of single vulnerabilities

 - Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag

 - Add support for a Zen5-specific SRSO mitigation

 - Cleanups and minor improvements

* tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2
  x86/bugs: Use the cpu_smt_possible() helper instead of open-coded code
  x86/bugs: Add AUTO mitigations for mds/taa/mmio/rfds
  x86/bugs: Relocate mds/taa/mmio/rfds defines
  x86/bugs: Add X86_BUG_SPECTRE_V2_USER
  x86/bugs: Remove X86_FEATURE_USE_IBPB
  KVM: nVMX: Always use IBPB to properly virtualize IBRS
  x86/bugs: Use a static branch to guard IBPB on vCPU switch
  x86/bugs: Remove the X86_FEATURE_USE_IBPB check in ib_prctl_set()
  x86/mm: Remove X86_FEATURE_USE_IBPB checks in cond_mitigation()
  x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers
  x86/bugs: KVM: Add support for SRSO_MSR_FIX
2025-03-25 13:30:18 -07:00
Linus Torvalds
2d09a9449e arm64 updates for 6.15:
Perf and PMUs:
 
  - Support for the "Rainier" CPU PMU from Arm
 
  - Preparatory driver changes and cleanups that pave the way for BRBE
    support
 
  - Support for partial virtualisation of the Apple-M1 PMU
 
  - Support for the second event filter in Arm CSPMU designs
 
  - Minor fixes and cleanups (CMN and DWC PMUs)
 
  - Enable EL2 requirements for FEAT_PMUv3p9
 
 Power, CPU topology:
 
  - Support for AMUv1-based average CPU frequency
 
  - Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It adds
    a generic topology_is_primary_thread() function overridden by x86 and
    powerpc
 
 New(ish) features:
 
  - MOPS (memcpy/memset) support for the uaccess routines
 
 Security/confidential compute:
 
  - Fix the DMA address for devices used in Realms with Arm CCA. The
    CCA architecture uses the address bit to differentiate between shared
    and private addresses
 
  - Spectre-BHB: assume CPUs Linux doesn't know about vulnerable by
    default
 
 Memory management clean-ups:
 
  - Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs
 
  - Some minor page table accessor clean-ups
 
  - PIE/POE (permission indirection/overlay) helpers clean-up
 
 Kselftests:
 
  - MTE: skip hugetlb tests if MTE is not supported on such mappings and
    user correct naming for sync/async tag checking modes
 
 Miscellaneous:
 
  - Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people
    request)
 
  - Sysreg updates for new register fields
 
  - CPU type info for some Qualcomm Kryo cores
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmfjB2QACgkQa9axLQDI
 XvGrfg//W3Bx9+jw1G/XHHEQqGEVFmvltvxZUkvgV0Qki0rPSMnappJhZRL9n0Nm
 V6PvGd2KoKHZuL3g5ViZb3cs2R9BiD2JB6PncwBKuxumHGh3vz3kk1JMkDVfWdHv
 qAceOckFJD9rXjPZn+PDsfYiEi2i3RRWIP5VglZ14ue8j3prHQ6DJXLUQF2GYvzE
 /bgLSq44wp5N59ddy23+qH9rxrHzz3bgpbVv/F56W/LErvE873mRmyFwiuGJm+M0
 Pn8ra572rI6a4sgSwrMTeNPBU+F9o5AbqwauVhkz428RdMvgfEuW6qHUBnGWJDmt
 HotXmu+4Eb2KJks/iQkDo4OTJ38yUqvvZZJtP171ms3E4yqESSJngWP6O2A6LF+y
 xhe0sESF/Ew6jLhM6/hvOmBcE2AyB14JE3ymqLkXbWub4NXddBn2AF1WXFjF4CBw
 F8KSUhNLekrCYKv1k9M3nhvkcpoS9FkTF/TI+zEg546alI/GLPih6uDRkgMAODh1
 RDJYixHsf2NDDRQbfwvt9Xua/KKpDF6qNkHLA4OiqqVUwh1hkas24Lrnp8vmce4o
 wIpWCLqYWey8Rl3XWuWgWz2Xu58fHH4Dl2k72Z8I0pwp3abCDa9xEj79G0Svk7Si
 Q+FCYrNlpKee1RXBC+1MUD/Gl5r/28dEUFkAzPD80F7AgafXPd0=
 =Kc9c
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 "Nothing major this time around.

  Apart from the usual perf/PMU updates, some page table cleanups, the
  notable features are average CPU frequency based on the AMUv1
  counters, CONFIG_HOTPLUG_SMT and MOPS instructions (memcpy/memset) in
  the uaccess routines.

  Perf and PMUs:

   - Support for the 'Rainier' CPU PMU from Arm

   - Preparatory driver changes and cleanups that pave the way for BRBE
     support

   - Support for partial virtualisation of the Apple-M1 PMU

   - Support for the second event filter in Arm CSPMU designs

   - Minor fixes and cleanups (CMN and DWC PMUs)

   - Enable EL2 requirements for FEAT_PMUv3p9

  Power, CPU topology:

   - Support for AMUv1-based average CPU frequency

   - Run-time SMT control wired up for arm64 (CONFIG_HOTPLUG_SMT). It
     adds a generic topology_is_primary_thread() function overridden by
     x86 and powerpc

  New(ish) features:

   - MOPS (memcpy/memset) support for the uaccess routines

  Security/confidential compute:

   - Fix the DMA address for devices used in Realms with Arm CCA. The
     CCA architecture uses the address bit to differentiate between
     shared and private addresses

   - Spectre-BHB: assume CPUs Linux doesn't know about vulnerable by
     default

  Memory management clean-ups:

   - Drop the P*D_TABLE_BIT definition in preparation for 128-bit PTEs

   - Some minor page table accessor clean-ups

   - PIE/POE (permission indirection/overlay) helpers clean-up

  Kselftests:

   - MTE: skip hugetlb tests if MTE is not supported on such mappings
     and user correct naming for sync/async tag checking modes

  Miscellaneous:

   - Add a PKEY_UNRESTRICTED definition as 0 to uapi (toolchain people
     request)

   - Sysreg updates for new register fields

   - CPU type info for some Qualcomm Kryo cores"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
  arm64: mm: Don't use %pK through printk
  perf/arm_cspmu: Fix missing io.h include
  arm64: errata: Add newer ARM cores to the spectre_bhb_loop_affected() lists
  arm64: cputype: Add MIDR_CORTEX_A76AE
  arm64: errata: Add KRYO 2XX/3XX/4XX silver cores to Spectre BHB safe list
  arm64: errata: Assume that unknown CPUs _are_ vulnerable to Spectre BHB
  arm64: errata: Add QCOM_KRYO_4XX_GOLD to the spectre_bhb_k24_list
  arm64/sysreg: Enforce whole word match for open/close tokens
  arm64/sysreg: Fix unbalanced closing block
  arm64: Kconfig: Enable HOTPLUG_SMT
  arm64: topology: Support SMT control on ACPI based system
  arch_topology: Support SMT control for OF based system
  cpu/SMT: Provide a default topology_is_primary_thread()
  arm64/mm: Define PTDESC_ORDER
  perf/arm_cspmu: Add PMEVFILT2R support
  perf/arm_cspmu: Generalise event filtering
  perf/arm_cspmu: Move register definitons to header
  arm64/kernel: Always use level 2 or higher for early mappings
  arm64/mm: Drop PXD_TABLE_BIT
  arm64/mm: Check pmd_table() in pmd_trans_huge()
  ...
2025-03-25 13:16:16 -07:00
Linus Torvalds
0f40464674 Updates for interrupt chip drivers:
- Support for hard indices on RISC-V. The hart index identifies a hart
     (core) within a specific interrupt domain in RISC-V's Priviledged
     Architecture.
 
   - Rework of the RISC-V MSI driver.
 
     This moves the driver over to the generic MSI library and solves the
     affinity problem of unmaskable PCI/MSI controllers. Unmaskable PCI/MSI
     controllers are prone to lose interrupts when the MSI message is
     updated to change the affinity because the message write consists of
     three 32-bit subsequent writes, which update address and data. As these
     writes are non-atomic versus the device raising an interrupt, the
     device can observe a half written update and issue an interrupt on the
     wrong vector. This is mitiated by a carefully orchestrated step by step
     update and the observation of an eventually pending interrupt on the
     CPU which issues the update. The algorithm follows the well established
     method of the X86 MSI driver.
 
   - A new driver for the RISC-V Sophgo SG2042 MSI controller
 
   - Overhaul of the Renesas RZQ2L driver.
 
     Simplification of the probe function by using devm_*() mechanisms,
     which avoid the endless list of error prone gotos in the failure paths.
 
   - Expand the Renesas RZV2H driver to support RZ/G3E SoCs
 
   - A workaround for Rockchip 3568002 erratum in the GIC-V3 driver to
     ensure that the addressing is limited to the lower 32-bit of the
     physical address space.
 
   - Add support for the Allwinner AS23 NMI controller
 
   - Expand the IMX irqsteer driver to handle up to 960 input interrupts
 
   - The usual small updates, cleanups and device tree changes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff454THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoZqoD/4kdHzbxfLpf7vC3NnG8NWwTq5FpbSx
 6grQC9hWNMAs4n2IFjJRFLrjeX3AcdAQXL/BWuM0LfW9tQDQaVmqlSIlB/bn69KB
 7HyAR6ozbOgnHKGAqFUXSLf+4pq+6q3mOgGKIF289dy14HFu4ta0DqKgkPZeQnVs
 R/J8i7REUnn+YuxzSt5eOqyDPyt2EHJosSUABSWQZBlrM9jy1W7f6NqDFwawiVsa
 +tv4U/bz91vjzVxwTIgt7nJK+b2HVYdxoZYuKJwPaTsj26ANPp6ltjRTeOmZhb5h
 uKgw+OyzDnk6q+tjGcRqrqwl291VKxCvnRiqHFfu3CERdmI9qvpN9IRcEJqIbkcN
 cakekhAyt7OO7sEPcql5vBL97e9hpb7EcH78gYxwHf8Dy0rFZUvSC5v+L6VRFnJS
 XcKA1L+f9B6u5qxnBtLan9IW08HYNdvmPq6AuVjk+ndKioPUFqB2q6AtXpuA3Rmu
 Y3XH/wh/q5wk0pgeByxQW6swsfpMN3OYK3mpLx475wFh2NKzcdGlwGhDFhiw8DKX
 m1AESy3UZatj1a0qGaFS/M+mm9KGrDYIMrje832Wf4Yf1LGmTsDkd3/V99oazSsq
 Jm4qhDASXChJXd0imQICX9hPw0aHTlLYNs54obUXVULH4HivQKIgWhUXrjG0dBDL
 +tttjuv5FJxr3A==
 =jPHa
 -----END PGP SIGNATURE-----

Merge tag 'irq-drivers-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq driver updates from Thomas Gleixner:

 - Support for hard indices on RISC-V. The hart index identifies a hart
   (core) within a specific interrupt domain in RISC-V's Priviledged
   Architecture.

 - Rework of the RISC-V MSI driver

   This moves the driver over to the generic MSI library and solves the
   affinity problem of unmaskable PCI/MSI controllers. Unmaskable
   PCI/MSI controllers are prone to lose interrupts when the MSI message
   is updated to change the affinity because the message write consists
   of three 32-bit subsequent writes, which update address and data. As
   these writes are non-atomic versus the device raising an interrupt,
   the device can observe a half written update and issue an interrupt
   on the wrong vector. This is mitiated by a carefully orchestrated
   step by step update and the observation of an eventually pending
   interrupt on the CPU which issues the update. The algorithm follows
   the well established method of the X86 MSI driver.

 - A new driver for the RISC-V Sophgo SG2042 MSI controller

 - Overhaul of the Renesas RZQ2L driver

   Simplification of the probe function by using devm_*() mechanisms,
   which avoid the endless list of error prone gotos in the failure
   paths.

 - Expand the Renesas RZV2H driver to support RZ/G3E SoCs

 - A workaround for Rockchip 3568002 erratum in the GIC-V3 driver to
   ensure that the addressing is limited to the lower 32-bit of the
   physical address space.

 - Add support for the Allwinner AS23 NMI controller

 - Expand the IMX irqsteer driver to handle up to 960 input interrupts

 - The usual small updates, cleanups and device tree changes

* tag 'irq-drivers-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  irqchip/imx-irqsteer: Support up to 960 input interrupts
  irqchip/sunxi-nmi: Support Allwinner A523 NMI controller
  dt-bindings: irq: sun7i-nmi: Document the Allwinner A523 NMI controller
  irqchip/davinci-cp-intc: Remove public header
  irqchip/renesas-rzv2h: Add RZ/G3E support
  irqchip/renesas-rzv2h: Update macros ICU_TSSR_TSSEL_{MASK,PREP}
  irqchip/renesas-rzv2h: Update TSSR_TIEN macro
  irqchip/renesas-rzv2h: Add field_width to struct rzv2h_hw_info
  irqchip/renesas-rzv2h: Add max_tssel to struct rzv2h_hw_info
  irqchip/renesas-rzv2h: Add struct rzv2h_hw_info with t_offs variable
  irqchip/renesas-rzv2h: Use devm_pm_runtime_enable()
  irqchip/renesas-rzv2h: Use devm_reset_control_get_exclusive_deasserted()
  irqchip/renesas-rzv2h: Simplify rzv2h_icu_init()
  irqchip/renesas-rzv2h: Drop irqchip from struct rzv2h_icu_priv
  irqchip/renesas-rzv2h: Fix wrong variable usage in rzv2h_tint_set_type()
  dt-bindings: interrupt-controller: renesas,rzv2h-icu: Document RZ/G3E SoC
  riscv: sophgo: dts: Add msi controller for SG2042
  irqchip: Add the Sophgo SG2042 MSI interrupt controller
  dt-bindings: interrupt-controller: Add Sophgo SG2042 MSI
  arm64: dts: rockchip: rk356x: Move PCIe MSI to use GIC ITS instead of MBI
  ...
2025-03-25 09:54:36 -07:00
David Woodhouse
3d66af75b0 x86/kexec: Debugging support: Dump registers on exception
The actual serial output function is a no-op for now.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20250314173226.3062535-3-dwmw2@infradead.org
2025-03-25 12:49:05 +01:00
David Woodhouse
8df505af7f x86/kexec: Debugging support: Load an IDT and basic exception entry points
[ mingo: Minor readability edits ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20250314173226.3062535-2-dwmw2@infradead.org
2025-03-25 12:49:05 +01:00
Ahmed S. Darwish
0dd09e215a x86/cacheinfo: Apply maintainer-tip coding style fixes
The x86/cacheinfo code has been heavily refactored and fleshed out at
parent commits, where any necessary coding style fixes were also done
in place.

Apply Documentation/process/maintainer-tip.rst coding style fixes to the
rest of the code, and align its assignment expressions for readability.

Standardize on CPUID(n) when mentioning leaf queries.

Avoid breaking long lines when doing so helps readability.

At cacheinfo_amd_init_llc_id(), rename variable 'msb' to 'index_msb' as
this is how it's called at the rest of cacheinfo.c code.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-30-darwi@linutronix.de
2025-03-25 10:23:37 +01:00
Ahmed S. Darwish
6c963c42fc x86/cacheinfo: Introduce cpuid_amd_hygon_has_l3_cache()
Multiple code paths at cacheinfo.c and amd_nb.c check for AMD/Hygon CPUs
L3 cache presensce by directly checking leaf 0x80000006 EDX output.

Extract that logic into its own function.  While at it, rework the
AMD/Hygon LLC topology ID caclculation comments for clarity.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-29-darwi@linutronix.de
2025-03-25 10:23:30 +01:00
Ahmed S. Darwish
eeeebc4fc6 x86/cacheinfo: Relocate CPUID leaf 0x4 cache_type mapping
The cache_type_map[] array is used to map Intel leaf 0x4 cache_type
values to their corresponding types at <linux/cacheinfo.h>.

Move that array's definition after the actual CPUID leaf 0x4 structures,
instead of having it in the middle of AMD leaf 0x4 emulation code.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-28-darwi@linutronix.de
2025-03-25 10:23:27 +01:00
Ahmed S. Darwish
05d48035e5 x86/cacheinfo: Extract out cache self-snoop checks
The logic of not doing a cache flush if the CPU declares cache self
snooping support is repeated across the x86/cacheinfo code.  Extract it
into its own function.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-27-darwi@linutronix.de
2025-03-25 10:23:24 +01:00
Ahmed S. Darwish
fda5f817ae x86/cacheinfo: Extract out cache level topology ID calculation
For Intel CPUID leaf 0x4 parsing, refactor the cache level topology ID
calculation code into its own method instead of repeating the same logic
twice for L2 and L3.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-26-darwi@linutronix.de
2025-03-25 10:23:21 +01:00
Ahmed S. Darwish
66122616e2 x86/cacheinfo: Separate Intel CPUID leaf 0x4 handling
init_intel_cacheinfo() was overly complex.  It parsed leaf 0x4 data,
leaf 0x2 data, and performed post-processing, all within one function.
Parent commit moved leaf 0x2 parsing and the post-processing logic into
their own functions.

Continue the refactoring by extracting leaf 0x4 parsing into its own
function.  Initialize local L2/L3 topology ID variables to BAD_APICID by
default, thus ensuring they can be used unconditionally.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-25-darwi@linutronix.de
2025-03-25 10:23:18 +01:00
Ahmed S. Darwish
5adfd36758 x86/cacheinfo: Separate CPUID leaf 0x2 handling and post-processing logic
The logic of init_intel_cacheinfo() is quite convoluted: it mixes leaf
0x4 parsing, leaf 0x2 parsing, plus some post-processing, in a single
place.

Begin simplifying its logic by extracting the leaf 0x2 parsing code, and
the post-processing logic, into their own functions.  While at it,
rework the SMT LLC topology ID comment for clarity.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-24-darwi@linutronix.de
2025-03-25 10:23:15 +01:00
Ahmed S. Darwish
4772304ee6 x86/cpu: Use consolidated CPUID leaf 0x2 descriptor table
CPUID leaf 0x2 output is a stream of one-byte descriptors, each implying
certain details about the CPU's cache and TLB entries.

At previous commits, the mapping tables for such descriptors were merged
into one consolidated table.  The mapping was also transformed into a
hash lookup instead of a loop-based lookup for each descriptor.

Use the new consolidated table and its hash-based lookup through the
for_each_leaf_0x2_tlb_entry() accessor.

Remove the TLB-specific mapping, intel_tlb_table[], as it is now no
longer used.  Remove the <cpuid/types.h> macro, for_each_leaf_0x2_desc(),
since the converted code was its last user.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-23-darwi@linutronix.de
2025-03-25 10:23:12 +01:00
Ahmed S. Darwish
da23a62598 x86/cacheinfo: Use consolidated CPUID leaf 0x2 descriptor table
CPUID leaf 0x2 output is a stream of one-byte descriptors, each implying
certain details about the CPU's cache and TLB entries.

At previous commits, the mapping tables for such descriptors were merged
into one consolidated table.  The mapping was also transformed into a
hash lookup instead of a loop-based lookup for each descriptor.

Use the new consolidated table and its hash-based lookup through the
for_each_leaf_0x2_tlb_entry() accessor.  Remove the old cache-specific
mapping, cache_table[], as it is no longer used.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-22-darwi@linutronix.de
2025-03-25 10:23:08 +01:00
Thomas Gleixner
37aedb806b x86/cpu: Consolidate CPUID leaf 0x2 tables
CPUID leaf 0x2 describes TLBs and caches. So there are two tables with the
respective descriptor constants in intel.c and cacheinfo.c. The tables
occupy almost 600 byte and require a loop based lookup for each variant.

Combining them into one table occupies exactly 1k rodata and allows to get
rid of the loop based lookup by just using the descriptor byte provided by
CPUID leaf 0x2 as index into the table, which simplifies the code and
reduces text size.

The conversion of the intel.c and cacheinfo.c code is done separately.

[ darwi: Actually define struct leaf_0x2_table.
	 Tab-align all of cpuid_0x2_table[] mapping entries.
	 Define needed SZ_* macros at <linux/sizes.h> instead (merged commit.)
	 Use CACHE_L1_{INST,DATA} as names for L1 cache descriptor types.
	 Set descriptor 0x63 type as TLB_DATA_1G_2M_4M and explain why.
	 Use enums for cache and TLB descriptor types (parent commits.)
	 Start enum types at 1 since type 0 is reserved for unknown descriptors.
	 Ensure that cache and TLB enum type values do not intersect.
	 Add leaf 0x2 table accessor for_each_leaf_0x2_entry() + documentation. ]

Co-developed-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-21-darwi@linutronix.de
2025-03-25 10:23:04 +01:00
Ahmed S. Darwish
543904cdfe x86/cpu: Use enums for TLB descriptor types
The leaf 0x2 one-byte TLB descriptor types:

	TLB_INST_4K
	TLB_INST_4M
	TLB_INST_2M_4M
	...

are just discriminators to be used within the intel_tlb_table[] mapping.
Their specific values are irrelevant.

Use enums for such types.

Make the enum packed and static assert that its values remain within a
single byte so that the intel_tlb_table[] size do not go out of hand.

Use a __CHECKER__ guard for the static_assert(sizeof(enum) == 1) line as
sparse ignores the __packed annotation on enums.

This is similar to:

  fe3944fb24 ("fs: Move enum rw_hint into a new header file")

for the core SCSI code.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/Z9rsTirs9lLfEPD9@lx-t490
Link: https://lore.kernel.org/r/20250324133324.23458-20-darwi@linutronix.de
2025-03-25 10:23:00 +01:00
Ahmed S. Darwish
e1e6b57146 x86/cacheinfo: Use enums for cache descriptor types
The leaf 0x2 one-byte cache descriptor types:

	CACHE_L1_INST
	CACHE_L1_DATA
	CACHE_L2
	CACHE_L3

are just discriminators to be used within the cache_table[] mapping.
Their specific values are irrelevant.

Use enums for such types.

Make the enum packed and static assert that its values remain within a
single byte so that the cache_table[] array size do not go out of hand.

Use a __CHECKER__ guard for the static_assert(sizeof(enum) == 1) line as
sparse ignores the __packed annotation on enums.

This is similar to:

  fe3944fb24 ("fs: Move enum rw_hint into a new header file")

for the core SCSI code.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/Z9rsTirs9lLfEPD9@lx-t490
Link: https://lore.kernel.org/r/20250324133324.23458-19-darwi@linutronix.de
2025-03-25 10:22:56 +01:00
Ahmed S. Darwish
7596ab7a10 x86/cacheinfo: Clarify type markers for CPUID leaf 0x2 cache descriptors
CPUID leaf 0x2 output is a stream of one-byte descriptors, each implying
certain details about the CPU's cache and TLB entries.

Two separate tables exist for interpreting these descriptors: one for
TLBs at intel.c and one for caches at cacheinfo.c.  These mapping tables
will be merged in further commits, among other improvements to their
model.

In preparation for this, use more descriptive type names for the leaf
0x2 descriptors associated with cpu caches.  Namely:

	LVL_1_INST	=>	CACHE_L1_INST
	LVL_1_DATA	=>	CACHE_L1_DATA
	LVL_2		=>	CACHE_L2
	LVL_3		=>	CACHE_L3

After the TLB and cache descriptors mapping tables are merged, this will
make it clear that such descriptors correspond to cpu caches.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-18-darwi@linutronix.de
2025-03-25 10:22:52 +01:00
Ahmed S. Darwish
eb1c7c08c5 x86/cacheinfo: Rename 'struct _cpuid4_info_regs' to 'struct _cpuid4_info'
Parent commits decoupled amd_northbridge from _cpuid4_info_regs, moved
AMD L3 northbridge cache_disable_0/1 sysfs code to its own file, and
splitted AMD vs. Intel leaf 0x4 handling into:

    amd_fill_cpuid4_info()
    intel_fill_cpuid4_info()
    fill_cpuid4_info()

After doing all that, the "_cpuid4_info_regs" name becomes a mouthful.
It is also not totally accurate, as the structure holds cpuid4 derived
information like cache node ID and size -- not just regs.

Rename struct _cpuid4_info_regs to _cpuid4_info.  That new name also
better matches the AMD/Intel leaf 0x4 functions mentioned above.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-17-darwi@linutronix.de
2025-03-25 10:22:49 +01:00
Ahmed S. Darwish
2d56cc8722 x86/cacheinfo: Separate Intel and AMD CPUID leaf 0x4 code paths
The CPUID leaf 0x4 parsing code at cpuid4_cache_lookup_regs() is ugly and
convoluted.  It is tangled with multiple nested conditions to handle:

  * AMD with TOPEXT, or Hygon CPUs via leaf 0x8000001d

  * Legacy AMD fallback via leaf 0x4 emulation

  * Intel CPUs via the actual CPUID leaf 0x4

Moreover, AMD L3 northbridge initialization is also awkwardly placed
alongside the CPUID calls of the first two scenarios above.  Refactor all
of that as follows:

  * Update AMD's leaf 0x4 emulation comment to represent current state

  * Clearly label the AMD leaf 0x4 emulation function as a fallback

  * Split AMD/Hygon and Intel code paths into separate functions

  * Move AMD L3 northbridge initialization out of CPUID leaf 0x4 code,
    and into populate_cache_leaves() where it belongs.  There,
    ci_info_init() can directly store the initialized object in the
    private pointer of the <linux/cacheinfo.h> API.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-16-darwi@linutronix.de
2025-03-25 10:22:45 +01:00
Ahmed S. Darwish
071f4ad649 x86/cacheinfo: Use sysfs_emit() for sysfs attributes show()
Per Documentation/filesystems/sysfs.rst, a sysfs attribute's show()
method should only use sysfs_emit() or sysfs_emit_at() when returning
values to user space.

Use sysfs_emit() for the AMD L3 cache sysfs attributes cache_disable_0,
cache_disable_1, and subcaches.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-15-darwi@linutronix.de
2025-03-25 10:22:43 +01:00
Ahmed S. Darwish
365e960d29 x86/cacheinfo: Move AMD cache_disable_0/1 handling to separate file
Parent commit decoupled amd_northbridge out of _cpuid4_info_regs, where
it was merely "parked" there until ci_info_init() can store it in the
private pointer of the <linux/cacheinfo.h> API.

Given that decoupling, move the AMD-specific L3 cache_disable_0/1 sysfs
code from the generic (and already extremely convoluted) x86/cacheinfo
code into its own file.

Compile the file only if CONFIG_AMD_NB and CONFIG_SYSFS are both
enabled, which mirrors the existing logic.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-14-darwi@linutronix.de
2025-03-25 10:22:39 +01:00
Ahmed S. Darwish
c58ed2d4da x86/cacheinfo: Separate amd_northbridge from _cpuid4_info_regs
'struct _cpuid4_info_regs' is meant to hold the CPUID leaf 0x4
output registers (EAX, EBX, and ECX), as well as derived information
such as the cache node ID and size.

It also contains a reference to amd_northbridge, which is there only to
be "parked" until ci_info_init() can store it in the priv pointer of the
<linux/cacheinfo.h> API.  That priv pointer is then used by AMD-specific
L3 cache_disable_0/1 sysfs attributes.

Decouple amd_northbridge from _cpuid4_info_regs and pass it explicitly
through the functions at x86/cacheinfo.  Doing so clarifies when
amd_northbridge is actually needed (AMD-only code) and when it is
not (Intel-specific code).  It also prepares for moving the AMD-specific
L3 cache_disable_0/1 sysfs code into its own file in next commit.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-13-darwi@linutronix.de
2025-03-25 10:22:36 +01:00
Ahmed S. Darwish
77676e6802 x86/cacheinfo: Consolidate AMD/Hygon leaf 0x8000001d calls
While gathering CPU cache info, CPUID leaf 0x8000001d is invoked in two
separate if blocks: one for Hygon CPUs and one for AMDs with topology
extensions.  After each invocation, amd_init_l3_cache() is called.

Merge the two if blocks into a single condition, thus removing the
duplicated code.  Future commits will expand these if blocks, so
combining them now is both cleaner and more maintainable.

Note, while at it, remove a useless "better error?" comment that was
within the same function since the 2005 commit e2cac78935 ("[PATCH]
x86_64: When running cpuid4 need to run on the correct CPU").

Note, as previously done at commit aec28d852ed2 ("x86/cpuid: Standardize
on u32 in <asm/cpuid/api.h>"), standardize on using 'u32' and 'u8' types.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-12-darwi@linutronix.de
2025-03-25 10:22:32 +01:00
Ahmed S. Darwish
1374ff60ed x86/cacheinfo: Standardize _cpuid4_info_regs instance naming
The cacheinfo code frequently uses the output registers from CPUID leaf
0x4.  Such registers are cached in 'struct _cpuid4_info_regs', augmented
with related information, and are then passed across functions.

The naming of these _cpuid4_info_regs instances is confusing at best.

Some instances are called "this_leaf", which is vague as "this" lacks
context and "leaf" is overly generic given that other CPUID leaves are
also processed within cacheinfo.  Other _cpuid4_info_regs instances are
just called "base", adding further ambiguity.

Standardize on id4 for all instances.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-11-darwi@linutronix.de
2025-03-25 10:22:29 +01:00
Ahmed S. Darwish
036a73b517 x86/cacheinfo: Align ci_info_init() assignment expressions
The ci_info_init() function initializes 10 members of a 'struct cacheinfo'
instance using passed data from CPUID leaf 0x4.

Such assignment expressions are difficult to read in their current form.
Align them for clarity.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-10-darwi@linutronix.de
2025-03-25 10:22:26 +01:00
Ahmed S. Darwish
7b83e0d2b2 x86/cacheinfo: Constify _cpuid4_info_regs instances
_cpuid4_info_regs instances are passed through a large number of
functions at cacheinfo.c.  For clarity, constify the instance parameters
where _cpuid4_info_regs is only read from.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-9-darwi@linutronix.de
2025-03-25 10:22:23 +01:00
Thomas Gleixner
cf87582052 x86/cacheinfo: Use proper name for cacheinfo instances
The cacheinfo structure defined at <include/linux/cacheinfo.h> is a
generic cache info object representation.

Calling its instances at x86 cacheinfo.c "leaf" confuses it with a CPUID
leaf -- especially that multiple CPUID calls are already sprinkled across
that file.  Most of such instances also have a redundant "this_" prefix.

Rename all of the cacheinfo "this_leaf" instances to just "ci".

[ darwi: Move into separate commit and write commit log ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-8-darwi@linutronix.de
2025-03-25 10:22:19 +01:00
Thomas Gleixner
21e2a452dc x86/cacheinfo: Properly name amd_cpuid4()'s first parameter
amd_cpuid4()'s first parameter, "leaf", is not a CPUID leaf as the name
implies.  Rather, it's an index emulating CPUID(4)'s subleaf semantics;
i.e. an ID for the cache object currently enumerated.  Rename that
parameter to "index".

Apply minor coding style fixes to the rest of the function as well.

[ darwi: Move into a separate commit and write commit log.
	 Use "index" instead of "subleaf" for amd_cpuid4() first param,
	 as that's the name typically used at the whole of cacheinfo.c. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-7-darwi@linutronix.de
2025-03-25 10:22:16 +01:00
Thomas Gleixner
ee159792a4 x86/cacheinfo: Refactor CPUID leaf 0x2 cache descriptor lookup
Extract the cache descriptor lookup logic out of the leaf 0x2 parsing
code and into a dedicated function.  This disentangles such lookup from
the deeply nested leaf 0x2 parsing loop.

Remove the cache table termination entry, as it is no longer needed
after the ARRAY_SIZE()-based lookup.

[ darwi: Move refactoring logic into this separate commit + commit log.
	 Remove the cache table termination entry. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-6-darwi@linutronix.de
2025-03-25 10:22:13 +01:00
Ahmed S. Darwish
a078aaa38a x86/cacheinfo: Use CPUID leaf 0x2 parsing helpers
Parent commit introduced CPUID leaf 0x2 parsing helpers at
<asm/cpuid/leaf_0x2_api.h>.  The new API allows sharing leaf 0x2's output
validation and iteration logic across both intel.c and cacheinfo.c.

Convert cacheinfo.c to that new API.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-5-darwi@linutronix.de
2025-03-25 10:22:10 +01:00
Ahmed S. Darwish
fe78079ec0 x86/cpu: Introduce and use CPUID leaf 0x2 parsing helpers
Introduce CPUID leaf 0x2 parsing helpers at <asm/cpuid/leaf_0x2_api.h>.
This allows sharing the leaf 0x2's output validation and iteration logic
across both x86/cpu intel.c and cacheinfo.c.

Start by converting intel.c to the new API.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-4-darwi@linutronix.de
2025-03-25 10:22:06 +01:00
Ahmed S. Darwish
09a1da4beb x86/cacheinfo: Remove CPUID leaf 0x2 parsing loop
Leaf 0x2 output includes a "query count" byte where it was supposed to
specify the number of repeated CPUID leaf 0x2 subleaf 0 queries needed to
extract all of the CPU's cache and TLB descriptors.

Per current Intel manuals, all CPUs supporting this leaf "will always"
return an iteration count of 1.

Remove the leaf 0x2 query loop and just query the hardware once.

Note, as previously done at commit aec28d852ed2 ("x86/cpuid: Standardize
on u32 in <asm/cpuid/api.h>"), standardize on using 'u32' and 'u8' types.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-3-darwi@linutronix.de
2025-03-25 10:22:02 +01:00
Ahmed S. Darwish
b5969494c8 x86/cpu: Remove CPUID leaf 0x2 parsing loop
Leaf 0x2 output includes a "query count" byte where it was supposed to
specify the number of repeated CPUID leaf 0x2 subleaf 0 queries needed to
extract all of the CPU's cache and TLB descriptors.

Per current Intel manuals, all CPUs supporting this leaf "will always"
return an iteration count of 1.

Remove the leaf 0x2 query loop and just query the hardware once.

Note, as previously done in:

  aec28d852ed2 ("x86/cpuid: Standardize on u32 in <asm/cpuid/api.h>")

standardize on using 'u32' and 'u8' types.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250324133324.23458-2-darwi@linutronix.de
2025-03-25 10:21:46 +01:00
Maksim Davydov
0e1ff67d16 x86/split_lock: Simplify reenabling
When split_lock_mitigate is disabled, each CPU needs its own delayed_work
structure. They are used to reenable split lock detection after its
disabling. But delayed_work structure must be correctly initialized after
its allocation.

Current implementation uses deferred initialization that makes the
split lock handler code unclear. The code can be simplified a bit
if the initialization is moved to the appropriate initcall.

sld_setup() is called before setup_per_cpu_areas(), thus it can't be used
for this purpose, so introduce an independent initcall for
the initialization.

[ mingo: Simplified the 'work' assignment line a bit more. ]

Signed-off-by: Maksim Davydov <davydov-max@yandex-team.ru>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250325085807.171885-1-davydov-max@yandex-team.ru
2025-03-25 10:16:50 +01:00
Chao Gao
878477a595 x86/fpu: Update the outdated comment above fpstate_init_user()
fpu_init_fpstate_user() was removed in:

  commit 582b01b6ab ("x86/fpu: Remove old KVM FPU interface").

Update that comment to accurately reflect the current state regarding its
callers.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250324131931.2097905-1-chao.gao@intel.com
2025-03-25 09:57:33 +01:00
Denis Mukhin
3181424aea x86/early_printk: Add support for MMIO-based UARTs
During the bring-up of an x86 board, the kernel was crashing before
reaching the platform's console driver because of a bug in the firmware,
leaving no trace of the boot progress.

The only available method to debug the kernel boot process was via the
platform's MMIO-based UART, as the board lacked an I/O port-based UART,
PCI UART, or functional video output.

Then it turned out that earlyprintk= does not have a knob to configure
the MMIO-mapped UART.

Extend the early printk facility to support platform MMIO-based UARTs
on x86 systems, enabling debugging during the system bring-up phase.

The command line syntax to enable platform MMIO-based UART is:

  earlyprintk=mmio,membase[,{nocfg|baudrate}][,keep]

Note, the change does not integrate MMIO-based UART support to:

  arch/x86/boot/early_serial_console.c

Also, update kernel parameters documentation with the new syntax and
add the missing 'nocfg' setting to the PCI serial cards description.

Signed-off-by: Denis Mukhin <dmukhin@ford.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250324-earlyprintk-v3-1-aee7421dc469@ford.com
2025-03-25 08:35:38 +01:00
Jann Horn
2c118f50d7 x86/dumpstack: Fix inaccurate unwinding from exception stacks due to misplaced assignment
Commit:

  2e4be0d011 ("x86/show_trace_log_lvl: Ensure stack pointer is aligned, again")

was intended to ensure alignment of the stack pointer; but it also moved
the initialization of the "stack" variable down into the loop header.

This was likely intended as a no-op cleanup, since the commit
message does not mention it; however, this caused a behavioral change
because the value of "regs" is different between the two places.

Originally, get_stack_pointer() used the regs provided by the caller; after
that commit, get_stack_pointer() instead uses the regs at the top of the
stack frame the unwinder is looking at. Often, there are no such regs at
all, and "regs" is NULL, causing get_stack_pointer() to fall back to the
task's current stack pointer, which is not what we want here, but probably
happens to mostly work. Other times, the original regs will point to
another regs frame - in that case, the linear guess unwind logic in
show_trace_log_lvl() will start unwinding too far up the stack, causing the
first frame found by the proper unwinder to never be visited, resulting in
a stack trace consisting purely of guess lines.

Fix it by moving the "stack = " assignment back where it belongs.

Fixes: 2e4be0d011 ("x86/show_trace_log_lvl: Ensure stack pointer is aligned, again")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250325-2025-03-unwind-fixes-v1-2-acd774364768@google.com
2025-03-25 08:30:43 +01:00
Linus Torvalds
a49a879f0a Miscellaneous x86 cleanups by Arnd Bergmann, Charles Han,
Mirsad Todorovac, Randy Dunlap, Thorsten Blum and Zhang Kunbo.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfeo6ERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gl9w//ZRxGhf3eXYVyPW35aoqaRe7W3fBJhaPn
 eLP0JuuglZO3JSTJfmbTmSXrNN4kkzI3nnuTnchK0iYZ+2Tn4E0iaz4tm5sjL5T1
 KW9ajv7QQxMOwXS69hrklY/eGxVimfs4mIyhxEhxLjPvzbGJOVS6WFgvaYH5klQQ
 7sPYXzrNGZJhGCmRqCelHSPhl7b6YVS/yWtMe+BpPBn1AqIQN+O+S/Kbu7dQokmq
 7MNy4eUe7GYgnSB7Qhq/jNSqjmGWsVwhMiuoDxx7GvShM73+kYatQZT0H+qzTxzp
 90rIPcPTNCOlKJO3xSWqXStcMH00MbplOhKdEU6SzeM+xC2aD2t4GoIG0EFxIQdS
 X0wh40Psm5GehThhpmMBJdmM4Le4TTSkHBhedpQ+sp+6BIX4A9hoeCoybIlcngaJ
 W89YKHqC5ruPSRDOzyaTMSypaEGH5VHP6AMj0CmkuyHRbxADsMazifpOVOw9AvVk
 IR0USCzyenjLMiugRrJgRpy8R2Q1jp35bP0e7Y4QvmydEGC1Wl5mW5Y8B9Zdlk0C
 iuxoqwntem7PAkEA2JejW6rzuRICxXloKv+xjtQjKzXpK+pwhhSp830zT8gaWZ1Z
 hdQRz7gSlJg3JzWinX8+XNusdHw9TxCUqwgLw8486nzNp01GU62gby5DWgECEdae
 2IyRRSd0FyA=
 =GsK4
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
 "Miscellaneous x86 cleanups by Arnd Bergmann, Charles Han, Mirsad
  Todorovac, Randy Dunlap, Thorsten Blum and Zhang Kunbo"

* tag 'x86-cleanups-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/coco: Replace 'static const cc_mask' with the newly introduced cc_get_mask() function
  x86/delay: Fix inconsistent whitespace
  selftests/x86/syscall: Fix coccinelle WARNING recommending the use of ARRAY_SIZE()
  x86/platform: Fix missing declaration of 'x86_apple_machine'
  x86/irq: Fix missing declaration of 'io_apic_irqs'
  x86/usercopy: Fix kernel-doc func param name in clean_cache_range()'s description
  x86/apic: Use str_disabled_enabled() helper in print_ipi_mode()
2025-03-24 22:39:53 -07:00
Linus Torvalds
71b639af06 x86/fpu updates for v6.15:
- Improve crypto performance by making kernel-mode FPU reliably usable
    in softirqs ((Eric Biggers)
 
  - Fully optimize out WARN_ON_FPU() (Eric Biggers)
 
  - Initial steps to support Support Intel APX (Advanced Performance Extensions)
    (Chang S. Bae)
 
  - Fix KASAN for arch_dup_task_struct() (Benjamin Berg)
 
  - Refine and simplify the FPU magic number check during signal return
    (Chang S. Bae)
 
  - Fix inconsistencies in guest FPU xfeatures (Chao Gao, Stanislav Spassov)
 
  - selftests/x86/xstate: Introduce common code for testing extended states
    (Chang S. Bae)
 
  - Misc fixes and cleanups (Borislav Petkov, Colin Ian King, Uros Bizjak)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfepZQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1grLw/+Lt71PQWu/uVt3/dU0VkpppTqRmUTujuR
 XVwFlBVxMFw9A8wi56bLD3WT28mKWr0SepynBt1Wr7/pZCnaW4SH/91ERChRqTzU
 ifawmqQqhrVhFlXziDOKp9lcDy9f7NjJVKBfEBa1VBQGC0Q8FzeasOhCwTJw8qXa
 Lj2z8zenhDq6NBkk22VlScc1CvsUBiXppvm8uk3md7HKaDqdzmCakm/a0FgaEppt
 K6P/u1iaVvGXme2CHSskCshkYoEFIF6LRNYnkVloXsVV4AeWeLaJ54xW+syPd/4H
 EX6oMLifdXCmxmsmi3LPRUjBgdSMsDaAkLsPNoX1w4uWjsihqyUl4wTEpobMIuu1
 PWOrCxQKxtaGoECJ0nsE4uR7MZEQ0sYCGKv4JSiiXeUVf/EDRuUBfvBCoGlDWXYA
 GZcoMmH+BtcYuQdzbrc/vWkfo3bpTxL3x5zsgtFkQL3xyzhugRPNgeOn9yuSZ9x6
 AD8QS0G6uRcc9a5ZeeTEcYxoIE/+vvfnh1wfMjisehkVn179ixfZdgcSGrZJUNyH
 a7LmUmwLvLYNO5DUUGs1upN8YHP7jiohNc/r2ZC1IX5LHbuK1gyXA4xo6hAZ8r/7
 XZ2FfRp0dr3PvuWIwq2v6T5JHvALADsKCAjqvNdwHLkxF87ygixT+B87wTA3H6ov
 LSY10A/eRx4=
 =U7PR
 -----END PGP SIGNATURE-----

Merge tag 'x86-fpu-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86/fpu updates from Ingo Molnar:

 - Improve crypto performance by making kernel-mode FPU reliably usable
   in softirqs ((Eric Biggers)

 - Fully optimize out WARN_ON_FPU() (Eric Biggers)

 - Initial steps to support Support Intel APX (Advanced Performance
   Extensions) (Chang S. Bae)

 - Fix KASAN for arch_dup_task_struct() (Benjamin Berg)

 - Refine and simplify the FPU magic number check during signal return
   (Chang S. Bae)

 - Fix inconsistencies in guest FPU xfeatures (Chao Gao, Stanislav
   Spassov)

 - selftests/x86/xstate: Introduce common code for testing extended
   states (Chang S. Bae)

 - Misc fixes and cleanups (Borislav Petkov, Colin Ian King, Uros
   Bizjak)

* tag 'x86-fpu-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu/xstate: Fix inconsistencies in guest FPU xfeatures
  x86/fpu: Clarify the "xa" symbolic name used in the XSTATE* macros
  x86/fpu: Use XSAVE{,OPT,C,S} and XRSTOR{,S} mnemonics in xstate.h
  x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs
  x86/fpu/xstate: Simplify print_xstate_features()
  x86/fpu: Refine and simplify the magic number check during signal return
  selftests/x86/xstate: Fix spelling mistake "hader" -> "header"
  x86/fpu: Avoid copying dynamic FP state from init_task in arch_dup_task_struct()
  vmlinux.lds.h: Remove entry to place init_task onto init_stack
  selftests/x86/avx: Add AVX tests
  selftests/x86/xstate: Clarify supported xstates
  selftests/x86/xstate: Consolidate test invocations into a single entry
  selftests/x86/xstate: Introduce signal ABI test
  selftests/x86/xstate: Refactor ptrace ABI test
  selftests/x86/xstate: Refactor context switching test
  selftests/x86/xstate: Enumerate and name xstate components
  selftests/x86/xstate: Refactor XSAVE helpers for general use
  selftests/x86: Consolidate redundant signal helper functions
  x86/fpu: Fix guest FPU state buffer allocation size
  x86/fpu: Fully optimize out WARN_ON_FPU()
2025-03-24 22:27:18 -07:00
Linus Torvalds
b58386a9bd Updates to the x86 boot code for the v6.15 cycle:
- Memblock setup and other early boot code cleanups (Mike Rapoport)
   - Export e820_table_kexec[] to sysfs (Dave Young)
   - Baby steps of adding relocate_kernel() debugging support (David Woodhouse)
   - Replace open-coded parity calculation with parity8() (Kuan-Wei Chiu)
   - Move the LA57 trampoline to separate source file (Ard Biesheuvel)
   - Misc micro-optimizations (Uros Bizjak)
   - Drop obsolete E820_TYPE_RESERVED_KERN and related code (Mike Rapoport)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfeoawRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gCRBAAm5MwAxTOtqRQtwUBkbGB8HEfjCHJTLIe
 FiLLric6lHEn2uVw/9uhlN646pWxa+487QtxRAHlR2hpm0JyEiZkawhFpnWWx8s6
 WXdLVPK+CNQNKgcWC2AsIj7C71JcKBNJI2Lj8/p9Cn3AgB0s7m4e3GfuugMk43Lq
 aw8JHd1zzqyT9NsdfNkglwn12iui9Y0t7q0EuZgQhRXLvThwZZblJg+dvub30LGg
 FE2QM4dQC4K0IUhE42ea5wWylX3tmiDYpdEH/CwxPobfra4kMxnoUrrh9Dk82cma
 QR3wwOc4JZ6mXUWVumbtk+cyUvZ1wTGFgiSUGmomkoKz9dJewqNV4b6iRa5URGzG
 izZaAZyJDQk9r2dCnwLbjzQjr2SHXLvvTpmS8AlAyOEPTnc+388Fg4h4oL9N/rcM
 ZIxxKpfuSjiWT8tRGKGPePhqAIg7kllk/w3zSkyAsx9/DG/UrLhpLSzq0+4GPQ0E
 d0V6WwX41iouoAH+kmDDj3KkaezQ/ZfXcxKk2d3wSCvIEMfJkSSXFBDlanE+skrM
 x/0QCWVyN5zajYEEoWv8WoXov7Q67Ar6HdxtPRLtQcd/ZhpTFeq4wuitV+4phb3m
 twWQo43wkMI5jFf9U2b+PD//8PWfcBJhzP0BEN8rNJaq8KVa93eHsOpMqZK+5wC6
 q03Wx00ewfE=
 =cUeH
 -----END PGP SIGNATURE-----

Merge tag 'x86-boot-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot code updates from Ingo Molnar:

 - Memblock setup and other early boot code cleanups (Mike Rapoport)

 - Export e820_table_kexec[] to sysfs (Dave Young)

 - Baby steps of adding relocate_kernel() debugging support (David
   Woodhouse)

 - Replace open-coded parity calculation with parity8() (Kuan-Wei Chiu)

 - Move the LA57 trampoline to separate source file (Ard Biesheuvel)

 - Misc micro-optimizations (Uros Bizjak)

 - Drop obsolete E820_TYPE_RESERVED_KERN and related code (Mike
   Rapoport)

* tag 'x86-boot-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kexec: Add relocate_kernel() debugging support: Load a GDT
  x86/boot: Move the LA57 trampoline to separate source file
  x86/boot: Do not test if AC and ID eflags are changeable on x86_64
  x86/bootflag: Replace open-coded parity calculation with parity8()
  x86/bootflag: Micro-optimize sbf_write()
  x86/boot: Add missing has_cpuflag() prototype
  x86/kexec: Export e820_table_kexec[] to sysfs
  x86/boot: Change some static bootflag functions to bool
  x86/e820: Drop obsolete E820_TYPE_RESERVED_KERN and related code
  x86/boot: Split parsing of boot_params into the parse_boot_params() helper function
  x86/boot: Split kernel resources setup into the setup_kernel_resources() helper function
  x86/boot: Move setting of memblock parameters to e820__memblock_setup()
2025-03-24 22:25:21 -07:00
Linus Torvalds
e34c38057a [ Merge note: this pull request depends on you having merged
two locking commits in the locking tree,
 	      part of the locking-core-2025-03-22 pull request. ]
 
 x86 CPU features support:
   - Generate the <asm/cpufeaturemasks.h> header based on build config
     (H. Peter Anvin, Xin Li)
   - x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
   - Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
   - Enable modifying CPU bug flags with '{clear,set}puid='
     (Brendan Jackman)
   - Utilize CPU-type for CPU matching (Pawan Gupta)
   - Warn about unmet CPU feature dependencies (Sohil Mehta)
   - Prepare for new Intel Family numbers (Sohil Mehta)
 
 Percpu code:
   - Standardize & reorganize the x86 percpu layout and
     related cleanups (Brian Gerst)
   - Convert the stackprotector canary to a regular percpu
     variable (Brian Gerst)
   - Add a percpu subsection for cache hot data (Brian Gerst)
   - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
   - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)
 
 MM:
   - Add support for broadcast TLB invalidation using AMD's INVLPGB instruction
     (Rik van Riel)
   - Rework ROX cache to avoid writable copy (Mike Rapoport)
   - PAT: restore large ROX pages after fragmentation
     (Kirill A. Shutemov, Mike Rapoport)
   - Make memremap(MEMREMAP_WB) map memory as encrypted by default
     (Kirill A. Shutemov)
   - Robustify page table initialization (Kirill A. Shutemov)
   - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
   - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
     (Matthew Wilcox)
 
 KASLR:
   - x86/kaslr: Reduce KASLR entropy on most x86 systems,
     to support PCI BAR space beyond the 10TiB region
     (CONFIG_PCI_P2PDMA=y) (Balbir Singh)
 
 CPU bugs:
   - Implement FineIBT-BHI mitigation (Peter Zijlstra)
   - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
   - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta)
   - RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta)
 
 System calls:
   - Break up entry/common.c (Brian Gerst)
   - Move sysctls into arch/x86 (Joel Granados)
 
 Intel LAM support updates: (Maciej Wieczor-Retman)
   - selftests/lam: Move cpu_has_la57() to use cpuinfo flag
   - selftests/lam: Skip test if LAM is disabled
   - selftests/lam: Test get_user() LAM pointer handling
 
 AMD SMN access updates:
   - Add SMN offsets to exclusive region access (Mario Limonciello)
   - Add support for debugfs access to SMN registers (Mario Limonciello)
   - Have HSMP use SMN through AMD_NODE (Yazen Ghannam)
 
 Power management updates: (Patryk Wlazlyn)
   - Allow calling mwait_play_dead with an arbitrary hint
   - ACPI/processor_idle: Add FFH state handling
   - intel_idle: Provide the default enter_dead() handler
   - Eliminate mwait_play_dead_cpuid_hint()
 
 Bootup:
 
 Build system:
   - Raise the minimum GCC version to 8.1 (Brian Gerst)
   - Raise the minimum LLVM version to 15.0.0
     (Nathan Chancellor)
 
 Kconfig: (Arnd Bergmann)
   - Add cmpxchg8b support back to Geode CPUs
   - Drop 32-bit "bigsmp" machine support
   - Rework CONFIG_GENERIC_CPU compiler flags
   - Drop configuration options for early 64-bit CPUs
   - Remove CONFIG_HIGHMEM64G support
   - Drop CONFIG_SWIOTLB for PAE
   - Drop support for CONFIG_HIGHPTE
   - Document CONFIG_X86_INTEL_MID as 64-bit-only
   - Remove old STA2x11 support
   - Only allow CONFIG_EISA for 32-bit
 
 Headers:
   - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers
     (Thomas Huth)
 
 Assembly code & machine code patching:
   - x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf)
   - x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
   - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
   - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
   - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
   - x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
     (Uros Bizjak)
   - Use named operands in inline asm (Uros Bizjak)
   - Improve performance by using asm_inline() for atomic locking instructions
     (Uros Bizjak)
 
 Earlyprintk:
   - Harden early_serial (Peter Zijlstra)
 
 NMI handler:
   - Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus()
     (Waiman Long)
 
 Miscellaneous fixes and cleanups:
 
   - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel,
     Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst,
     Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin,
     Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport,
     Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker,
     Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra,
     Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
     Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak,
     Vitaly Kuznetsov, Xin Li, liuye.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfenkQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g1FRAAi6OFTSn/5aeLMI0IMNBxJ6ddQiFc3imd
 7+C/vU5nul4CyDs8mKyj/+f/DDrbkG9lKz3VG631Yl237lXHjD8XWcVMeC/1z/q0
 3zInDIloE9/nBHRPkF6F7fARBLBZ0LFgaBsGrCo7mwpGybiQdqGcqcxllvTbtXaw
 OHta4q6ok+lBDNlfc0v6H4cRnzhmmlKu6Ng0j6UI3V7uFhi3vtxas32ltDQtzorq
 2+jbV6/+kbrrv+xPC+jlzOFhTEKRupNPQXmvyQteoQg6G3kqAKMDvBthGXd1rHuX
 Qa+BoDIifE/2NiVeRwNrhoqYH/pHCzUzDREW5IW8+ca+4XNKuzAC6EuC8CeCzyK1
 q8ZjZjooQW4zEeVFeJYllHONzJYfxfSH5CLsnbcuhq99yfGlrQhF1qL72/Omn1w/
 DfPJM8Zt5zyKvLqUg3Md+fkVCO2wyDNhB61QPzRgHF+yD+rvuDpoqvUWir+w7cSn
 fwEDVZGXlFx6dumtSrqRaTd1nvFt80s8yP2ll09DMvGQ8D/yruS7hndGAmmJVCSW
 NAfd8pSjq5v2+ux2UR92/Cc3VF3SjaUqHBOp/Nq9rESya18ZVa3cJpHhVYYtPIVf
 THW0h07RIkGVKs1uq+5ekLCr/8uAZg58UPIqmhTuW0ttymRHCNfohR45FQZzy+0M
 tJj1oc2TIZw=
 =Dcb3
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core x86 updates from Ingo Molnar:
 "x86 CPU features support:
   - Generate the <asm/cpufeaturemasks.h> header based on build config
     (H. Peter Anvin, Xin Li)
   - x86 CPUID parsing updates and fixes (Ahmed S. Darwish)
   - Introduce the 'setcpuid=' boot parameter (Brendan Jackman)
   - Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan
     Jackman)
   - Utilize CPU-type for CPU matching (Pawan Gupta)
   - Warn about unmet CPU feature dependencies (Sohil Mehta)
   - Prepare for new Intel Family numbers (Sohil Mehta)

  Percpu code:
   - Standardize & reorganize the x86 percpu layout and related cleanups
     (Brian Gerst)
   - Convert the stackprotector canary to a regular percpu variable
     (Brian Gerst)
   - Add a percpu subsection for cache hot data (Brian Gerst)
   - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak)
   - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak)

  MM:
   - Add support for broadcast TLB invalidation using AMD's INVLPGB
     instruction (Rik van Riel)
   - Rework ROX cache to avoid writable copy (Mike Rapoport)
   - PAT: restore large ROX pages after fragmentation (Kirill A.
     Shutemov, Mike Rapoport)
   - Make memremap(MEMREMAP_WB) map memory as encrypted by default
     (Kirill A. Shutemov)
   - Robustify page table initialization (Kirill A. Shutemov)
   - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn)
   - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW
     (Matthew Wilcox)

  KASLR:
   - x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI
     BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir
     Singh)

  CPU bugs:
   - Implement FineIBT-BHI mitigation (Peter Zijlstra)
   - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta)
   - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan
     Gupta)
   - RFDS: Exclude P-only parts from the RFDS affected list (Pawan
     Gupta)

  System calls:
   - Break up entry/common.c (Brian Gerst)
   - Move sysctls into arch/x86 (Joel Granados)

  Intel LAM support updates: (Maciej Wieczor-Retman)
   - selftests/lam: Move cpu_has_la57() to use cpuinfo flag
   - selftests/lam: Skip test if LAM is disabled
   - selftests/lam: Test get_user() LAM pointer handling

  AMD SMN access updates:
   - Add SMN offsets to exclusive region access (Mario Limonciello)
   - Add support for debugfs access to SMN registers (Mario Limonciello)
   - Have HSMP use SMN through AMD_NODE (Yazen Ghannam)

  Power management updates: (Patryk Wlazlyn)
   - Allow calling mwait_play_dead with an arbitrary hint
   - ACPI/processor_idle: Add FFH state handling
   - intel_idle: Provide the default enter_dead() handler
   - Eliminate mwait_play_dead_cpuid_hint()

  Build system:
   - Raise the minimum GCC version to 8.1 (Brian Gerst)
   - Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor)

  Kconfig: (Arnd Bergmann)
   - Add cmpxchg8b support back to Geode CPUs
   - Drop 32-bit "bigsmp" machine support
   - Rework CONFIG_GENERIC_CPU compiler flags
   - Drop configuration options for early 64-bit CPUs
   - Remove CONFIG_HIGHMEM64G support
   - Drop CONFIG_SWIOTLB for PAE
   - Drop support for CONFIG_HIGHPTE
   - Document CONFIG_X86_INTEL_MID as 64-bit-only
   - Remove old STA2x11 support
   - Only allow CONFIG_EISA for 32-bit

  Headers:
   - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI
     headers (Thomas Huth)

  Assembly code & machine code patching:
   - x86/alternatives: Simplify alternative_call() interface (Josh
     Poimboeuf)
   - x86/alternatives: Simplify callthunk patching (Peter Zijlstra)
   - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf)
   - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf)
   - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra)
   - x86/kexec: Merge x86_32 and x86_64 code using macros from
     <asm/asm.h> (Uros Bizjak)
   - Use named operands in inline asm (Uros Bizjak)
   - Improve performance by using asm_inline() for atomic locking
     instructions (Uros Bizjak)

  Earlyprintk:
   - Harden early_serial (Peter Zijlstra)

  NMI handler:
   - Add an emergency handler in nmi_desc & use it in
     nmi_shootdown_cpus() (Waiman Long)

  Miscellaneous fixes and cleanups:
   - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem
     Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan
     Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar,
     Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej
     Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter
     Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner,
     Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly
     Kuznetsov, Xin Li, liuye"

* tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits)
  zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault
  x86/asm: Make asm export of __ref_stack_chk_guard unconditional
  x86/mm: Only do broadcast flush from reclaim if pages were unmapped
  perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones
  perf/x86/intel, x86/cpu: Simplify Intel PMU initialization
  x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers
  x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers
  x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions
  x86/asm: Use asm_inline() instead of asm() in clwb()
  x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h>
  x86/hweight: Use asm_inline() instead of asm()
  x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm()
  x86/hweight: Use named operands in inline asm()
  x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP
  x86/head/64: Avoid Clang < 17 stack protector in startup code
  x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h>
  x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro
  x86/cpu/intel: Limit the non-architectural constant_tsc model checks
  x86/mm/pat: Replace Intel x86_model checks with VFM ones
  x86/cpu/intel: Fix fast string initialization for extended Families
  ...
2025-03-24 22:06:11 -07:00
Linus Torvalds
327ecdbc0f Performance events updates for v6.15:
Core:
 
   - Move perf_event sysctls into kernel/events/ (Joel Granados)
   - Use POLLHUP for pinned events in error (Namhyung Kim)
   - Avoid the read if the count is already updated (Peter Zijlstra)
   - Allow the EPOLLRDNORM flag for poll (Tao Chen)
 
   - locking/percpu-rwsem: Add guard support (Peter Zijlstra)
     [ NOTE: this got (mis-)merged into the perf tree due to related work. ]
 
 perf_pmu_unregister() related improvements: (Peter Zijlstra)
 
   - Simplify the perf_event_alloc() error path
   - Simplify the perf_pmu_register() error path
   - Simplify perf_pmu_register()
   - Simplify perf_init_event()
   - Simplify perf_event_alloc()
   - Merge struct pmu::pmu_disable_count into struct perf_cpu_pmu_context::pmu_disable_count
   - Add this_cpc() helper
   - Introduce perf_free_addr_filters()
   - Robustify perf_event_free_bpf_prog()
   - Simplify the perf_mmap() control flow
   - Further simplify perf_mmap()
   - Remove retry loop from perf_mmap()
   - Lift event->mmap_mutex in perf_mmap()
   - Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes
   - Fix perf_mmap() failure path
 
 Uprobes:
 
   - Harden x86 uretprobe syscall trampoline check (Jiri Olsa)
   - Remove redundant spinlock in uprobe_deny_signal() (Liao Chang)
   - Remove the spinlock within handle_singlestep() (Liao Chang)
 
 x86 Intel PMU enhancements:
 
   - Support PEBS counters snapshotting (Kan Liang)
   - Fix intel_pmu_read_event() (Kan Liang)
   - Extend per event callchain limit to branch stack (Kan Liang)
   - Fix system-wide LBR profiling (Kan Liang)
   - Allocate bts_ctx only if necessary (Li RongQing)
   - Apply static call for drain_pebs (Peter Zijlstra)
 
 x86 AMD PMU enhancements: (Ravi Bangoria)
 
   - Remove pointless sample period check
   - Fix ->config to sample period calculation for OP PMU
   - Fix perf_ibs_op.cnt_mask for CurCnt
   - Don't allow freq mode event creation through ->config interface
   - Add PMU specific minimum period
   - Add ->check_period() callback
   - Ceil sample_period to min_period
   - Add support for OP Load Latency Filtering
   - Update DTLB/PageSize decode logic
 
 Hardware breakpoints:
 
   - Return EOPNOTSUPP for unsupported breakpoint type (Saket Kumar Bhaskar)
 
 Hardlockup detector improvements: (Li Huafei)
 
   - perf_event memory leak
   - Warn if watchdog_ev is leaked
 
 Fixes and cleanups:
 
   - Misc fixes and cleanups (Andy Shevchenko, Kan Liang, Peter Zijlstra,
     Ravi Bangoria, Thorsten Blum, XieLudan)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfehRIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hF3g//TCAQijI6OFNpYiD1xoyMq4m+baIYhYx0
 lnxwxhsN58JFcEJeWIEGLACqUePyH68jNKVSr9sIoeV4gnnMX+x2Ny6rh/1H3Ox+
 jQyVmPdFmKa8QG7wGNjcDteIzlEKK4zqruXWaG54LX2e6kbQZWwd0I21MyXkrHXb
 oMIfyZbCAWuPW1wefZm8FPgImT+nvwOosyx90OVagGqk5mYdNb9DFhMjQveStHdQ
 BnWU6rYdW1c2eXKpeuvxY4uWQoCELC6WntLimvcswy6fb+9LtbglpCYQOGGDrGvp
 v3RASf/8clFVSau8P/8NEaNgLgjN/e3eN/fAoSut8Z22nAeBC6qv4qjFt1piDpbs
 AaEXYCYM0/Tfzjp3ctPsFrxbKvB8q2qhxSm37Co0Ix6WyJn3JQbNx48g8GIod2os
 eGPXSZzoz9O8coeTKKbxWp4fpAjFfyfe/ovWQuVd8JI4bYj7Mi63J+RxQDd2TkJP
 H+IgxZoamJExgS1YcKJUBtw7QKQm5pHFx03Br7KsNxgmHy7JdoN9bh0h14pkeXjB
 MnAvWOS5ouuriJgQ+4bqAezS8DSHnDdmFmWgNEEqAlOD9Zy9hDXJ2GiqbHKMyRNC
 ae35o0PDUFTIX9O5NPIDUyWtJb5uH/S1lQhS7GD+ODlMDIX+ny+REXf9krSCR1H0
 GUqq2UmxBGA=
 =iPmA
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "Core:
   - Move perf_event sysctls into kernel/events/ (Joel Granados)
   - Use POLLHUP for pinned events in error (Namhyung Kim)
   - Avoid the read if the count is already updated (Peter Zijlstra)
   - Allow the EPOLLRDNORM flag for poll (Tao Chen)
   - locking/percpu-rwsem: Add guard support [ NOTE: this got
     (mis-)merged into the perf tree due to related work ] (Peter
     Zijlstra)

  perf_pmu_unregister() related improvements: (Peter Zijlstra)
   - Simplify the perf_event_alloc() error path
   - Simplify the perf_pmu_register() error path
   - Simplify perf_pmu_register()
   - Simplify perf_init_event()
   - Simplify perf_event_alloc()
   - Merge struct pmu::pmu_disable_count into struct
     perf_cpu_pmu_context::pmu_disable_count
   - Add this_cpc() helper
   - Introduce perf_free_addr_filters()
   - Robustify perf_event_free_bpf_prog()
   - Simplify the perf_mmap() control flow
   - Further simplify perf_mmap()
   - Remove retry loop from perf_mmap()
   - Lift event->mmap_mutex in perf_mmap()
   - Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes
   - Fix perf_mmap() failure path

  Uprobes:
   - Harden x86 uretprobe syscall trampoline check (Jiri Olsa)
   - Remove redundant spinlock in uprobe_deny_signal() (Liao Chang)
   - Remove the spinlock within handle_singlestep() (Liao Chang)

  x86 Intel PMU enhancements:
   - Support PEBS counters snapshotting (Kan Liang)
   - Fix intel_pmu_read_event() (Kan Liang)
   - Extend per event callchain limit to branch stack (Kan Liang)
   - Fix system-wide LBR profiling (Kan Liang)
   - Allocate bts_ctx only if necessary (Li RongQing)
   - Apply static call for drain_pebs (Peter Zijlstra)

  x86 AMD PMU enhancements: (Ravi Bangoria)
   - Remove pointless sample period check
   - Fix ->config to sample period calculation for OP PMU
   - Fix perf_ibs_op.cnt_mask for CurCnt
   - Don't allow freq mode event creation through ->config interface
   - Add PMU specific minimum period
   - Add ->check_period() callback
   - Ceil sample_period to min_period
   - Add support for OP Load Latency Filtering
   - Update DTLB/PageSize decode logic

  Hardware breakpoints:
   - Return EOPNOTSUPP for unsupported breakpoint type (Saket Kumar
     Bhaskar)

  Hardlockup detector improvements: (Li Huafei)
   - perf_event memory leak
   - Warn if watchdog_ev is leaked

  Fixes and cleanups:
   - Misc fixes and cleanups (Andy Shevchenko, Kan Liang, Peter
     Zijlstra, Ravi Bangoria, Thorsten Blum, XieLudan)"

* tag 'perf-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
  perf: Fix __percpu annotation
  perf: Clean up pmu specific data
  perf/x86: Remove swap_task_ctx()
  perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode
  perf: Supply task information to sched_task()
  perf: attach/detach PMU specific data
  locking/percpu-rwsem: Add guard support
  perf: Save PMU specific data in task_struct
  perf: Extend per event callchain limit to branch stack
  perf/ring_buffer: Allow the EPOLLRDNORM flag for poll
  perf/core: Use POLLHUP for pinned events in error
  perf/core: Use sysfs_emit() instead of scnprintf()
  perf/core: Remove optional 'size' arguments from strscpy() calls
  perf/x86/intel/bts: Check if bts_ctx is allocated when calling BTS functions
  uprobes/x86: Harden uretprobe syscall trampoline check
  watchdog/hardlockup/perf: Warn if watchdog_ev is leaked
  watchdog/hardlockup/perf: Fix perf_event memory leak
  perf/x86: Annotate struct bts_buffer::buf with __counted_by()
  perf/core: Clean up perf_try_init_event()
  perf/core: Fix perf_mmap() failure path
  ...
2025-03-24 21:46:36 -07:00
Linus Torvalds
32b22538be Scheduler updates for v6.15:
[ Merge note, these two commits are identical:
 
    - f3fa0e40df ("sched/clock: Don't define sched_clock_irqtime as static key")
    - b9f2b29b94 ("sched: Don't define sched_clock_irqtime as static key")
 
   The first one is a cherry-picked version of the second, and the first one
   is already upstream. ]
 
 Core & fair scheduler changes:
 
   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to elimate looping (I Hsin Cheng)
   - Add unlikey branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun Dong)
 
 Deadline scheduler: (Juri Lelli)
 
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h
 
 Uclamp:
 
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)
 
 Scheduler topology support: (Juri Lelli)
 
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked
 
 RSEQ: (Michael Jeanson)
 
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes
 
 Membarriers:
 
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)
 
 Scheduler debugging:
 
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)
 
 Fixes and cleanups:
 
   - Always save/restore x86 TSC sched_clock() on suspend/resume
    (Guilherme G. Piccoli)
 
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli,
     Sebastian Andrzej Siewior)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfejsoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ivkhAAwBF2tYRBS1oIHcC/OKK3JJoHVDp2LFbU
 9sm5S3ZlGD/Ns2fbpY+9A8UFgUFfjYiTSV7hvf2B9Vge0XSxTmMNFu/MdxLBbo9r
 w6GSeNcNDQKpjEGLkrmPFsa2fiYI4dmH0IzDbS9V2cNPk470QBKjAKXNPaSER691
 n2wLnQq+m5o4gXnPjnSz6RrrzisRnm2GOWnDV/iqR47pZFNlX2wWlo3s5r7//Hw0
 a+QfEfpgKehhy/VSDXmSAgpqnNffjc78yBV6LNoVUddwahnOWiQMS3XViOqgy5VO
 jUGBrzW+sKkdRMBppxwJ/0XWgHGC27amIgnU0ZE5u+eiUEu8H9qWl1cRCFxyeB0O
 8+WNfwmkH+FPWUdsn84kdePhSsZy6HfM6h44Xe0hx1V7tQXEXfbPzK3TnQg8Ktt1
 Ky6ctbZt4cGpqGQuIqvba21A/racrD/DgvB7mHeZksnqZoKTDwxhT/nlQGpuwPoy
 SJYd1ynFVJvfC69SwMdwnaglimvEZx1GfT0o5XtCMslY5NkWCou5u+e65WX7ccU5
 94wBCwI1/+KiFMJZp6TlPw07Q/Hsj9dDryxLc3OunU3zMVt3++1GZnBS/eb5FN1A
 5TlAEpxgH9c5Q4/XJFCvx21DrTVHSuIrR6naK91bgqHo0cEfaHGtO/ejPtJGSfxe
 YIFnnu1dhRw=
 =SuQE
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core & fair scheduler changes:

   - Cancel the slice protection of the idle entity (Zihan Zhou)
   - Reduce the default slice to avoid tasks getting an extra tick
     (Zihan Zhou)
   - Force propagating min_slice of cfs_rq when {en,de}queue tasks
     (Tianchen Ding)
   - Refactor can_migrate_task() to elimate looping (I Hsin Cheng)
   - Add unlikey branch hints to several system calls (Colin Ian King)
   - Optimize current_clr_polling() on certain architectures (Yujun
     Dong)

  Deadline scheduler: (Juri Lelli)
   - Remove redundant dl_clear_root_domain call
   - Move dl_rebuild_rd_accounting to cpuset.h

  Uclamp:
   - Use the uclamp_is_used() helper instead of open-coding it (Xuewen
     Yan)
   - Optimize sched_uclamp_used static key enabling (Xuewen Yan)

  Scheduler topology support: (Juri Lelli)
   - Ignore special tasks when rebuilding domains
   - Add wrappers for sched_domains_mutex
   - Generalize unique visiting of root domains
   - Rebuild root domain accounting after every update
   - Remove partition_and_rebuild_sched_domains
   - Stop exposing partition_sched_domains_locked

  RSEQ: (Michael Jeanson)
   - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y
   - Fix segfault on registration when rseq_cs is non-zero
   - selftests: Add rseq syscall errors test
   - selftests: Ensure the rseq ABI TLS is actually 1024 bytes

  Membarriers:
   - Fix redundant load of membarrier_state (Nysal Jan K.A.)

  Scheduler debugging:
   - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior)
   - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar)

  Fixes and cleanups:
   - Always save/restore x86 TSC sched_clock() on suspend/resume
     (Guilherme G. Piccoli)
   - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian
     Andrzej Siewior)"

* tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling()
  sched/debug: Remove CONFIG_SCHED_DEBUG
  sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files
  sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation
  sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
  sched/debug: Make 'const_debug' tunables unconditional __read_mostly
  sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
  rseq/selftests: Fix namespace collision with rseq UAPI header
  include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h
  sched/topology: Stop exposing partition_sched_domains_locked
  cgroup/cpuset: Remove partition_and_rebuild_sched_domains
  sched/topology: Remove redundant dl_clear_root_domain call
  sched/deadline: Rebuild root domain accounting after every update
  sched/deadline: Generalize unique visiting of root domains
  sched/topology: Wrappers for sched_domains_mutex
  sched/deadline: Ignore special tasks when rebuilding domains
  tracing: Use preempt_model_str()
  xtensa: Rely on generic printing of preemption model
  x86: Rely on generic printing of preemption model
  s390: Rely on generic printing of preemption model
  ...
2025-03-24 21:28:12 -07:00
Linus Torvalds
5a658afd46 Objtool changes for v6.15:
- The biggest change is the new option to automatically fail
    the build on objtool warnings: CONFIG_OBJTOOL_WERROR.
 
    While there are no currently known unfixed false positives
    left, such an expansion in the severity of objtool warnings
    inevitably creates a risk of build failures, so it's disabled by
    default and depends on !COMPILE_TEST, so it shouldn't be enabled
    on allyesconfig/allmodconfig builds and won't be forced on people
    who just accept build-time defaults in 'make oldconfig'.
 
    While the option is strongly recommended, only people who enable
    it explicitly should see it.
 
    (Josh Poimboeuf)
 
  - Disable branch profiling in noinstr code with a broad
    brush that includes all of arch/x86/ and kernel/sched/. (Josh Poimboeuf)
 
  - Create backup object files on objtool errors and print exact
    objtool arguments to make failure analysis easier (Josh Poimboeuf)
 
  - Improve noreturn handling (Josh Poimboeuf)
 
  - Improve rodata handling (Tiezhu Yang)
 
  - Support jump tables, switch tables and goto tables on LoongArch (Tiezhu Yang)
 
  - Misc cleanups and fixes (Josh Poimboeuf, David Engraf, Ingo Molnar)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfefkARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1inlRAAvd9Hom2qh9Iu+KYYF58vsg9zsxWZA6I1
 blouKI4SUA8Xjuw6Nihx+emaPaMW1boGSLTNsFzrCa3S1+4UHTTp/Y8snZWJ/Mc/
 Peg52N6u/LIcoQM+vNJYRtd9y4wabX87vl0qTxte0kB0Neps3/yQvxtUa2K1srXp
 8nwHK+PdzNsgPuIrIiNc9ymsPvbqFHmVIRRVNVKX4BlPJi2kJs9kx43kszweQR/X
 kW/bs1315m1HS5i02K0Zs/XdOZHLsk9ERu+aviBJV1txrgZIukATIqbODiI+3RZX
 0oa3KxfzEVFN2k3OukrezV2INzETkN+oOSTAZIUOqwSVe+8rdQVBSdYT6svYn/yy
 aS8Bi5Mm1nfizTU+cRrzU7FxWCmwsxm9r4fHPTV8Owjxg0uoGk/E/qlvERuR2rpA
 p2tHMo1lp2Yo+VZBZPfm5KDHFG4tSGhF9eav2bqSI7/Kf5AWxRl8kBs5iLrcxsXh
 4qk3FalnuM7A+1McAUcBJAvM897Yie0s2G83ZipyYyA6U3LSBhBMWh9FlIiAjuIh
 YnX6IFkW9tVzVZpJFEGQn+2Ewl5Y2Go5bokKk03vkWCZCgg+hEUsVh6Cnm1ocZpO
 Ll3/UF4i8XjjyuAuHDn6mzyHIgch2xRN02v7dJSb09+O9b8vIoPoSqbWSLJUOBqf
 r6UesXDG8mY=
 =iswJ
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - The biggest change is the new option to automatically fail the build
   on objtool warnings: CONFIG_OBJTOOL_WERROR.

   While there are no currently known unfixed false positives left, such
   an expansion in the severity of objtool warnings inevitably creates a
   risk of build failures, so it's disabled by default and depends on
   !COMPILE_TEST, so it shouldn't be enabled on
   allyesconfig/allmodconfig builds and won't be forced on people who
   just accept build-time defaults in 'make oldconfig'.

   While the option is strongly recommended, only people who enable it
   explicitly should see it.

   (Josh Poimboeuf)

 - Disable branch profiling in noinstr code with a broad brush that
   includes all of arch/x86/ and kernel/sched/. (Josh Poimboeuf)

 - Create backup object files on objtool errors and print exact objtool
   arguments to make failure analysis easier (Josh Poimboeuf)

 - Improve noreturn handling (Josh Poimboeuf)

 - Improve rodata handling (Tiezhu Yang)

 - Support jump tables, switch tables and goto tables on LoongArch
   (Tiezhu Yang)

 - Misc cleanups and fixes (Josh Poimboeuf, David Engraf, Ingo Molnar)

* tag 'objtool-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  tracing: Disable branch profiling in noinstr code
  objtool: Use O_CREAT with explicit mode mask
  objtool: Add CONFIG_OBJTOOL_WERROR
  objtool: Create backup on error and print args
  objtool: Change "warning:" to "error:" for --Werror
  objtool: Add --Werror option
  objtool: Add --output option
  objtool: Upgrade "Linked object detected" warning to error
  objtool: Consolidate option validation
  objtool: Remove --unret dependency on --rethunk
  objtool: Increase per-function WARN_FUNC() rate limit
  objtool: Update documentation
  objtool: Improve __noreturn annotation warning
  objtool: Fix error handling inconsistencies in check()
  x86/traps: Make exc_double_fault() consistently noreturn
  LoongArch: Enable jump table for objtool
  objtool/LoongArch: Add support for goto table
  objtool/LoongArch: Add support for switch table
  objtool: Handle PC relative relocation type
  objtool: Handle different entry size of rodata
  ...
2025-03-24 21:18:05 -07:00
Linus Torvalds
23608993bb Locking changes for v6.15:
Locking primitives:
 
     - Micro-optimize percpu_{,try_}cmpxchg{64,128}_op() and {,try_}cmpxchg{64,128}
       on x86 (Uros Bizjak)
 
     - mutexes: extend debug checks in mutex_lock() (Yunhui Cui)
 
     - Misc cleanups (Uros Bizjak)
 
   Lockdep:
 
     - Fix might_fault() lockdep check of current->mm->mmap_lock (Peter Zijlstra)
 
     - Don't disable interrupts on RT in disable_irq_nosync_lockdep.*()
       (Sebastian Andrzej Siewior)
 
     - Disable KASAN instrumentation of lockdep.c (Waiman Long)
 
     - Add kasan_check_byte() check in lock_acquire() (Waiman Long)
 
     - Misc cleanups (Sebastian Andrzej Siewior)
 
   Rust runtime integration:
 
     - Use Pin for all LockClassKey usages (Mitchell Levy)
     - sync: Add accessor for the lock behind a given guard (Alice Ryhl)
     - sync: condvar: Add wait_interruptible_freezable() (Alice Ryhl)
     - sync: lock: Add an example for Guard:: Lock_ref() (Boqun Feng)
 
   Split-lock detection feature (x86):
 
     - Fix warning mode with disabled mitigation mode (Maksim Davydov)
 
   Locking events:
 
     - Add locking events for rtmutex slow paths (Waiman Long)
     - Add locking events for lockdep (Waiman Long)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfeeMARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1j+DQ/7BUherLWPuvGCx+0+FwBG9T2XFKm8cy8r
 8p7UM/0gOzMn+EvBz+LFL2/b5BWu2tjB2Qyen2tvq8NcDQdS8GpFA+u9rxeXQzQY
 tu5LxsnQbvQe0UXl5aJX1D8ft2xmRSU+a/2uQC3/PAbXByTwN/dkEqDoxJQG6GuP
 0mpULlbG0D5j2YiiaQyG2+3xKj+fd1mg/aEoG5lx88ko6Bgoguj8b+tX/4f70YWl
 igNxWoJ8CZxCBbd7+o8vFFvvYpk1sj6Ni3LyTs658t5deJpfxOu9xkrmlxGm/d7q
 IryuiQC7yYwWWFF96W3yJ13lyojKZTVCYr50hzMd88HE/NGJawZZQJMtyeRGS2r9
 7wNZDl0JiPRUgl8bTFOHZUgVU5IIgTSGpgv4XHvUFF0+QtZ91IqB+/fcMIpdEBV9
 K02wOfqIb3uUsCXGmNfFVi1E7TeXWUDudqHN7rosxOpFDSm1PvGI4rnnaNjddVr3
 kerNfRSyoBaj5Ff1zr59yM8XZVBPmY8MrruwoODMxxcfasM6vllEjv9McBRSoxlb
 HC3+wXaadWlUnaitaVU6Xak9qIj0djaSgQfQ9nS48XuN4EfztepLYM9OEPAsNWXh
 5NZDdYXB1ndYsDTlCLiEl2c0831duJpy2kpVOkaCqC3hu+JjVt82ZeeBhOZeAXQK
 glwrSkq0FiU=
 =33q7
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Ingo Molnar:
 "Locking primitives:
   - Micro-optimize percpu_{,try_}cmpxchg{64,128}_op() and
     {,try_}cmpxchg{64,128} on x86 (Uros Bizjak)
   - mutexes: extend debug checks in mutex_lock() (Yunhui Cui)
   - Misc cleanups (Uros Bizjak)

  Lockdep:
   - Fix might_fault() lockdep check of current->mm->mmap_lock (Peter
     Zijlstra)
   - Don't disable interrupts on RT in disable_irq_nosync_lockdep.*()
     (Sebastian Andrzej Siewior)
   - Disable KASAN instrumentation of lockdep.c (Waiman Long)
   - Add kasan_check_byte() check in lock_acquire() (Waiman Long)
   - Misc cleanups (Sebastian Andrzej Siewior)

  Rust runtime integration:
   - Use Pin for all LockClassKey usages (Mitchell Levy)
   - sync: Add accessor for the lock behind a given guard (Alice Ryhl)
   - sync: condvar: Add wait_interruptible_freezable() (Alice Ryhl)
   - sync: lock: Add an example for Guard:: Lock_ref() (Boqun Feng)

  Split-lock detection feature (x86):
   - Fix warning mode with disabled mitigation mode (Maksim Davydov)

  Locking events:
   - Add locking events for rtmutex slow paths (Waiman Long)
   - Add locking events for lockdep (Waiman Long)"

* tag 'locking-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Remove disable_irq_lockdep()
  lockdep: Don't disable interrupts on RT in disable_irq_nosync_lockdep.*()
  rust: lockdep: Use Pin for all LockClassKey usages
  rust: sync: condvar: Add wait_interruptible_freezable()
  rust: sync: lock: Add an example for Guard:: Lock_ref()
  rust: sync: Add accessor for the lock behind a given guard
  locking/lockdep: Add kasan_check_byte() check in lock_acquire()
  locking/lockdep: Disable KASAN instrumentation of lockdep.c
  locking/lock_events: Add locking events for lockdep
  locking/lock_events: Add locking events for rtmutex slow paths
  x86/split_lock: Fix the delayed detection logic
  lockdep/mm: Fix might_fault() lockdep check of current->mm->mmap_lock
  x86/locking: Remove semicolon from "lock" prefix
  locking/mutex: Add MUTEX_WARN_ON() into fast path
  x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations
  x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
2025-03-24 20:55:03 -07:00
Rafael J. Wysocki
1774be7cfc Merge branch 'pm-cpufreq'
Merge cpufreq updates for 6.15-rc1:

 - Manage sysfs attributes and boost frequencies efficiently from
   cpufreq core to reduce boilerplate code from drivers (Viresh Kumar).

 - Minor cleanups to cpufreq drivers (Aaron Kling, Benjamin Schneider,
   Dhananjay Ugwekar, Imran Shaik, and zuoqian).

 - Migrate some cpufreq drivers to using for_each_present_cpu() (Jacky
   Bai).

 - cpufreq-qcom-hw DT binding fixes (Krzysztof Kozlowski).

 - Use str_enable_disable() helper in cpufreq_online() (Lifeng Zheng).

 - Optimize the amd-pstate driver to avoid cases where call paths end
   up calling the same writes multiple times and needlessly caching
   variables through code reorganization, locking overhaul and tracing
   adjustments (Mario Limonciello, Dhananjay Ugwekar).

 - Make it possible to avoid enabling capacity-aware scheduling (CAS) in
   the intel_pstate driver and relocate a check for out-of-band (OOB)
   platform handling in it to make it detect OOB before checking HWP
   availability (Rafael Wysocki).

 - Fix dbs_update() to avoid inadvertent conversions of negative integer
   values to unsigned int which causes CPU frequency selection to be
   inaccurate in some cases when the "conservative" cpufreq governor is
   in use (Jie Zhan).

* pm-cpufreq: (91 commits)
  dt-bindings: cpufreq: cpufreq-qcom-hw: Narrow properties on SDX75, SA8775p and SM8650
  dt-bindings: cpufreq: cpufreq-qcom-hw: Drop redundant minItems:1
  dt-bindings: cpufreq: cpufreq-qcom-hw: Add missing constraint for interrupt-names
  dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCS8300 compatible
  cpufreq: Init cpufreq only for present CPUs
  cpufreq: tegra186: Share policy per cluster
  cpufreq/amd-pstate: Drop actions in amd_pstate_epp_cpu_offline()
  cpufreq/amd-pstate: Stop caching EPP
  cpufreq/amd-pstate: Rework CPPC enabling
  cpufreq/amd-pstate: Drop debug statements for policy setting
  cpufreq/amd-pstate: Update cppc_req_cached for shared mem EPP writes
  cpufreq/amd-pstate: Move all EPP tracing into *_update_perf and *_set_epp functions
  cpufreq/amd-pstate: Cache CPPC request in shared mem case too
  cpufreq/amd-pstate: Replace all AMD_CPPC_* macros with masks
  cpufreq/amd-pstate-ut: Adjust variable scope
  cpufreq/amd-pstate-ut: Run on all of the correct CPUs
  cpufreq/amd-pstate-ut: Drop SUCCESS and FAIL enums
  cpufreq/amd-pstate-ut: Allow lowest nonlinear and lowest to be the same
  cpufreq/amd-pstate-ut: Use _free macro to free put policy
  cpufreq/amd-pstate: Drop `cppc_cap1_cached`
  ...
2025-03-24 14:19:50 +01:00
Rafael J. Wysocki
8b30d2a396 Merge branches 'acpi-x86', 'acpi-platform-profile', 'acpi-apei' and 'acpi-misc'
Merge an ACPI CPPC update, ACPI platform-profile driver updates, an ACPI
APEI update and a MAINTAINERS update related to ACPI for 6.15-rc1:

 - Add a missing header file include to the x86 arch CPPC code (Mario
   Limonciello).

 - Rework the sysfs attributes implementation in the ACPI platform-profile
   driver and improve the unregistration code in it (Nathan Chancellor,
   Kurt Borja).

 - Prevent the ACPI HED driver from being built as a module and change
   its initcall level to subsys_initcall to avoid initialization ordering
   issues related to it (Xiaofei Tan).

 - Update a maintainer email address in the ACPI PMIC entry in
   MAINTAINERS (Mika Westerberg).

* acpi-x86:
  x86/ACPI: CPPC: Add missing include

* acpi-platform-profile:
  ACPI: platform_profile: Improve platform_profile_unregister()
  ACPI: platform-profile: Fix CFI violation when accessing sysfs files

* acpi-apei:
  ACPI: HED: Always initialize before evged

* acpi-misc:
  MAINTAINERS: Use my kernel.org address for ACPI PMIC work
2025-03-24 13:57:44 +01:00
Josh Poimboeuf
2cbb20b008 tracing: Disable branch profiling in noinstr code
CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update()
for each use of likely() or unlikely().  That breaks noinstr rules if
the affected function is annotated as noinstr.

Disable branch profiling for files with noinstr functions.  In addition
to some individual files, this also includes the entire arch/x86
subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle
directories, all of which are noinstr-heavy.

Due to the nature of how sched binaries are built by combining multiple
.c files into one, branch profiling is disabled more broadly across the
sched code than would otherwise be needed.

This fixes many warnings like the following:

  vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section
  vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section
  ...

Reported-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
2025-03-22 09:49:26 +01:00
Nuno Das Neves
e2575ffe57 x86: hyperv: Add mshv_handler() irq handler and setup function
Add mshv_handler() to process messages related to managing guest
partitions such as intercepts, doorbells, and scheduling messages.

In a (non-nested) root partition, the same interrupt vector is shared
between the vmbus and mshv_root drivers.

Introduce a stub for mshv_handler() and call it in
sysvec_hyperv_callback alongside vmbus_handler().

Even though both handlers will be called for every Hyper-V interrupt,
the messages for each driver are delivered to different offsets
within the SYNIC message page, so they won't step on each other.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-9-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-9-git-send-email-nunodasneves@linux.microsoft.com>
2025-03-20 21:23:04 +00:00
Nuno Das Neves
21050f6197 Drivers: hv: Export some functions for use by root partition module
hv_get_hypervisor_version(), hv_call_deposit_pages(), and
hv_call_create_vp(), are all needed in-module with CONFIG_MSHV_ROOT=m.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@microsoft.linux.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Link: https://lore.kernel.org/r/1741980536-3865-7-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-7-git-send-email-nunodasneves@linux.microsoft.com>
2025-03-20 21:23:04 +00:00
Stanislav Kinsburskii
8cac51796e x86/mshyperv: Add support for extended Hyper-V features
Extend the "ms_hyperv_info" structure to include a new field,
"ext_features", for capturing extended Hyper-V features.
Update the "ms_hyperv_init_platform" function to retrieve these features
using the cpuid instruction and include them in the informational output.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1741980536-3865-3-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1741980536-3865-3-git-send-email-nunodasneves@linux.microsoft.com>
2025-03-20 21:23:03 +00:00
Sohil Mehta
fadb6f569b x86/cpu/intel: Limit the non-architectural constant_tsc model checks
X86_FEATURE_CONSTANT_TSC is a Linux-defined, synthesized feature flag.
It is used across several vendors. Intel CPUs will set the feature when
the architectural CPUID.80000007.EDX[1] bit is set. There are also some
Intel CPUs that have the X86_FEATURE_CONSTANT_TSC behavior but don't
enumerate it with the architectural bit.  Those currently have a model
range check.

Today, virtually all of the CPUs that have the CPUID bit *also* match
the "model >= 0x0e" check. This is confusing. Instead of an open-ended
check, pick some models (INTEL_IVYBRIDGE and P4_WILLAMETTE) as the end
of goofy CPUs that should enumerate the bit but don't.  These models are
relatively arbitrary but conservative pick for this.

This makes it obvious that later CPUs (like Family 18+) no longer need
to synthesize X86_FEATURE_CONSTANT_TSC.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250219184133.816753-14-sohil.mehta@intel.com
2025-03-19 11:19:56 +01:00
Sohil Mehta
15b7ddcf66 x86/cpu/intel: Fix fast string initialization for extended Families
X86_FEATURE_REP_GOOD is a linux defined feature flag to track whether
fast string operations should be used for copy_page(). It is also used
as a second alternative for clear_page() if enhanced fast string
operations (ERMS) are not available.

X86_FEATURE_ERMS is an Intel-specific hardware-defined feature flag that
tracks hardware support for Enhanced Fast strings.  It is used to track
whether Fast strings should be used for similar memory copy and memory
clearing operations.

On top of these, there is a FAST_STRING enable bit in the
IA32_MISC_ENABLE MSR. It is typically controlled by the BIOS to provide
a hint to the hardware and the OS on whether fast string operations are
preferred.

Commit:

  161ec53c70 ("x86, mem, intel: Initialize Enhanced REP MOVSB/STOSB")

introduced a mechanism to honor the BIOS preference for fast string
operations and clear the above feature flags if needed.

Unfortunately, the current initialization code for Intel to set and
clear these bits is confusing at best and likely incorrect.

X86_FEATURE_REP_GOOD is cleared in early_init_intel() if
MISC_ENABLE.FAST_STRING is 0. But it gets set later on unconditionally
for all Family 6 processors in init_intel(). This not only overrides the
BIOS preference but also contradicts the earlier check.

Fix this by combining the related checks and always relying on the BIOS
provided preference for fast string operations. This simplification
makes sure the upcoming Intel Family 18 and 19 models are covered as
well.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250219184133.816753-12-sohil.mehta@intel.com
2025-03-19 11:19:51 +01:00
Sohil Mehta
7a2ad75274 x86/smpboot: Fix INIT delay assignment for extended Intel Families
Some old crusty CPUs need an extra delay that slows down booting. See
the comment above 'init_udelay' for details. Newer CPUs don't need the
delay.

Right now, for Intel, Family 6 and only Family 6 skips the delay. That
leaves out both the Family 15 (Pentium 4s) and brand new Family 18/19
models.

The omission of Family 15 (Pentium 4s) seems like an oversight and 18/19
do not need the delay.

Skip the delay on all Intel processors Family 6 and beyond.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250219184133.816753-11-sohil.mehta@intel.com
2025-03-19 11:19:50 +01:00
Sohil Mehta
58d1c1fd03 x86/smpboot: Remove confusing quirk usage in INIT delay
Very old multiprocessor systems required a 10 msec delay between
asserting and de-asserting INIT but modern processors do not require
this delay.

Over time the usage of the "quirk" wording while setting the INIT delay
has become misleading. The code comments suggest that modern processors
need to be quirked, which clears the default init_udelay of 10 msec,
while legacy processors don't need the quirk and continue to use the
default init_udelay.

With a lot more modern processors, the wording should be inverted if at
all needed. Instead, simplify the comments and the code by getting rid
of "quirk" usage altogether and clarifying the following:

  - Old legacy processors -> Set the "legacy" 10 msec delay
  - Modern processors     -> Do not set any delay

No functional change.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250219184133.816753-10-sohil.mehta@intel.com
2025-03-19 11:19:48 +01:00
Sohil Mehta
337959860d x86/acpi/cstate: Improve Intel Family model checks
Update the Intel Family checks to consistently use Family 15 instead of
Family 0xF. Also, get rid of one of last usages of x86_model by using
the new VFM checks.

Update the incorrect comment since the check has changed since the
initial commit:

  ee1ca48fae ("ACPI: Disable ARB_DISABLE on platforms where it is not needed")

The two changes were:

 - 3e2ada5867 ("ACPI: fix Compaq Evo N800c (Pentium 4m) boot hang regression")
   removed the P4 - Family 15.

 - 03a05ed115 ("ACPI: Use the ARB_DISABLE for the CPU which model id is less than 0x0f.")
   got rid of CORE_YONAH - Family 6, model E.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-9-sohil.mehta@intel.com
2025-03-19 11:19:46 +01:00
Sohil Mehta
eb1ac33305 x86/cpu/intel: Replace Family 5 model checks with VFM ones
Introduce names for some Family 5 models and convert some of the checks
to be VFM based.

Also, to keep the file sorted by family, move Family 5 to the top of the
header file.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-8-sohil.mehta@intel.com
2025-03-19 11:19:44 +01:00
Sohil Mehta
fc866f2472 x86/cpu/intel: Replace Family 15 checks with VFM ones
Introduce names for some old pentium 4 models and replace the x86_model
checks with VFM ones.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-7-sohil.mehta@intel.com
2025-03-19 11:19:43 +01:00
Sohil Mehta
eaa472f76d x86/cpu/intel: Replace early Family 6 checks with VFM ones
Introduce names for some old pentium models and replace the x86_model
checks with VFM ones.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-6-sohil.mehta@intel.com
2025-03-19 11:19:41 +01:00
Sohil Mehta
a8cb451458 x86/mtrr: Modify a x86_model check to an Intel VFM check
Simplify one of the last few Intel x86_model checks in arch/x86 by
substituting it with a VFM one.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-5-sohil.mehta@intel.com
2025-03-19 11:19:40 +01:00
Sohil Mehta
7e6b0a2e41 x86/microcode: Update the Intel processor flag scan check
The Family model check to read the processor flag MSR is misleading and
potentially incorrect. It doesn't consider Family while comparing the
model number. The original check did have a Family number but it got
lost/moved during refactoring.

intel_collect_cpu_info() is called through multiple paths such as early
initialization, CPU hotplug as well as IFS image load. Some of these
flows would be error prone due to the ambiguous check.

Correct the processor flag scan check to use a Family number and update
it to a VFM based one to make it more readable.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-4-sohil.mehta@intel.com
2025-03-19 11:19:38 +01:00
Sohil Mehta
7e67f36172 x86/cpu/intel: Fix the MOVSL alignment preference for extended Families
The alignment preference for 32-bit MOVSL based bulk memory move has
been 8-byte for a long time. However this preference is only set for
Family 6 and 15 processors.

Use the same preference for upcoming Family numbers 18 and 19. Also, use
a simpler VFM based check instead of switching based on Family numbers.
Refresh the comment to reflect the new check.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250219184133.816753-3-sohil.mehta@intel.com
2025-03-19 11:19:31 +01:00
Sohil Mehta
680d9b2a56 x86/apic: Fix 32-bit APIC initialization for extended Intel Families
APIC detection is currently limited to a few specific Families and will
not match the upcoming Families >=18.

Extend the check to include all Families 6 or greater. Also convert it
to a VFM check to make it simpler.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250219184133.816753-2-sohil.mehta@intel.com
2025-03-19 11:19:29 +01:00
Brian Gerst
9a93e29f16 x86/syscall: Move sys_ni_syscall()
Move sys_ni_syscall() to kernel/process.c, and remove the now empty
entry/common.c

No functional changes.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250314151220.862768-6-brgerst@gmail.com
2025-03-19 11:19:17 +01:00
Mario Limonciello
9c19cc1f5f x86/amd_node: Add support for debugfs access to SMN registers
There are certain registers on AMD Zen systems that can only be accessed
through SMN.

Introduce a new interface that provides debugfs files for accessing SMN.  As
this introduces the capability for userspace to manipulate the hardware in
unpredictable ways, taint the kernel when writing.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250130-wip-x86-amd-nb-cleanup-v4-3-b5cc997e471b@amd.com
2025-03-19 11:18:33 +01:00
Mario Limonciello
8351845307 x86/amd_node: Add SMN offsets to exclusive region access
Offsets 0x60 and 0x64 are used internally by kernel drivers that call
the amd_smn_read() and amd_smn_write() functions. If userspace accesses
the regions at the same time as the kernel it may cause malfunctions in
drivers using the offsets.

Add these offsets to the exclusions so that the kernel is tainted if a
non locked down userspace tries to access them.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250130-wip-x86-amd-nb-cleanup-v4-2-b5cc997e471b@amd.com
2025-03-19 11:18:23 +01:00
Yazen Ghannam
8a3dc0f7c4 x86/amd_node, platform/x86/amd/hsmp: Have HSMP use SMN through AMD_NODE
The HSMP interface is just an SMN interface with different offsets.

Define an HSMP wrapper in the SMN code and have the HSMP platform driver
use that rather than a local solution.

Also, remove the "root" member from AMD_NB, since there are no more
users of it.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Carlos Bilbao <carlos.bilbao@kernel.org>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20250130-wip-x86-amd-nb-cleanup-v4-1-b5cc997e471b@amd.com
2025-03-19 11:18:05 +01:00
Thorsten Blum
ad5a3a8f41 x86/mtrr: Use str_enabled_disabled() helper in print_mtrr_state()
Remove hard-coded strings by using the str_enabled_disabled() helper
function.

Suggested-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/20250117144900.171684-2-thorsten.blum%40linux.dev
2025-03-19 11:17:56 +01:00
Sohil Mehta
07e4a6eec2 x86/cpufeatures: Warn about unmet CPU feature dependencies
Currently, the cpuid_deps[] table is only exercised when a particular
feature is explicitly disabled and clear_cpu_cap() is called. However,
some of these listed dependencies might already be missing during boot.

These types of errors shouldn't generally happen in production
environments, but they could sometimes sneak through, especially when
VMs and Kconfigs are in the mix. Also, the kernel might introduce
artificial dependencies between unrelated features, such as making LAM
depend on LASS.

Unexpected failures can occur when the kernel tries to use such
features. Add a simple boot-time scan of the cpuid_deps[] table to
detect the missing dependencies. One option is to disable all of such
features during boot, but that may cause regressions in existing
systems. For now, just warn about the missing dependencies to create
awareness.

As a trade-off between spamming the kernel log and keeping track of all
the features that have been warned about, only warn about the first
missing dependency. Any subsequent unmet dependency will only be logged
after the first one has been resolved.

Features are typically represented through unsigned integers within the
kernel, though some of them have user-friendly names if they are exposed
via /proc/cpuinfo.

Show the friendlier name if available, otherwise display the
X86_FEATURE_* numerals to make it easier to identify the feature.

Suggested-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20250313201608.3304135-1-sohil.mehta@intel.com
2025-03-19 11:17:31 +01:00
Pawan Gupta
722fa0dba7 x86/rfds: Exclude P-only parts from the RFDS affected list
The affected CPU table (cpu_vuln_blacklist) marks Alderlake and Raptorlake
P-only parts affected by RFDS. This is not true because only E-cores are
affected by RFDS. With the current family/model matching it is not possible
to differentiate the unaffected parts, as the affected and unaffected
hybrid variants have the same model number.

Add a cpu-type match as well for such parts so as to exclude P-only parts
being marked as affected.

Note, family/model and cpu-type enumeration could be inaccurate in
virtualized environments. In a guest affected status is decided by RFDS_NO
and RFDS_CLEAR bits exposed by VMMs.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250311-add-cpu-type-v8-5-e8514dcaaff2@linux.intel.com
2025-03-19 11:17:23 +01:00
Pawan Gupta
adf2de5e8d x86/cpu: Update x86_match_cpu() to also use cpu-type
Non-hybrid CPU variants that share the same Family/Model could be
differentiated by their cpu-type. x86_match_cpu() currently does not use
cpu-type for CPU matching.

Dave Hansen suggested to use below conditions to match CPU-type:

  1. If CPU_TYPE_ANY (the wildcard), then matched
  2. If hybrid, then matched
  3. If !hybrid, look at the boot CPU and compare the cpu-type to determine
     if it is a match.

  This special case for hybrid systems allows more compact vulnerability
  list.  Imagine that "Haswell" CPUs might or might not be hybrid and that
  only Atom cores are vulnerable to Meltdown.  That means there are three
  possibilities:

  	1. P-core only
  	2. Atom only
  	3. Atom + P-core (aka. hybrid)

  One might be tempted to code up the vulnerability list like this:

  	MATCH(     HASWELL, X86_FEATURE_HYBRID, MELTDOWN)
  	MATCH_TYPE(HASWELL, ATOM,               MELTDOWN)

  Logically, this matches #2 and #3. But that's a little silly. You would
  only ask for the "ATOM" match in cases where there *WERE* hybrid cores in
  play. You shouldn't have to _also_ ask for hybrid cores explicitly.

  In short, assume that processors that enumerate Hybrid==1 have a
  vulnerable core type.

Update x86_match_cpu() to also match cpu-type. Also treat hybrid systems as
special, and match them to any cpu-type.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250311-add-cpu-type-v8-4-e8514dcaaff2@linux.intel.com
2025-03-19 11:17:11 +01:00
H. Peter Anvin (Intel)
841326332b x86/cpufeatures: Generate the <asm/cpufeaturemasks.h> header based on build config
Introduce an AWK script to auto-generate the <asm/cpufeaturemasks.h> header
with required and disabled feature masks based on <asm/cpufeatures.h>
and the current build config.

Thus for any CPU feature with a build config, e.g., X86_FRED, simply add:

  config X86_DISABLED_FEATURE_FRED
	def_bool y
	depends on !X86_FRED

to arch/x86/Kconfig.cpufeatures, instead of adding a conditional CPU
feature disable flag, e.g., DISABLE_FRED.

Lastly, the generated required and disabled feature masks will be added to
their corresponding feature masks for this particular compile-time
configuration.

  [ Xin: build integration improvements ]
  [ mingo: Improved changelog and comments ]

Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250305184725.3341760-3-xin@zytor.com
2025-03-19 11:15:11 +01:00
Kirill A. Shutemov
775d37d8f0 x86/acpi: Replace manual page table initialization with kernel_ident_mapping_init()
The init_transition_pgtable() functions maps the page with
asm_acpi_mp_play_dead() into an identity mapping.

Replace open-coded manual page table initialization with
kernel_ident_mapping_init() to avoid code duplication.

Use x86_mapping_info::offset to get the page mapped at the
correct location.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20241016111458.846228-3-kirill.shutemov@linux.intel.com
2025-03-19 11:12:29 +01:00
Rik van Riel
440a65b7d2 x86/mm: Enable AMD translation cache extensions
With AMD TCE (translation cache extensions) only the intermediate mappings
that cover the address range zapped by INVLPG / INVLPGB get invalidated,
rather than all intermediate mappings getting zapped at every TLB invalidation.

This can help reduce the TLB miss rate, by keeping more intermediate mappings
in the cache.

From the AMD manual:

Translation Cache Extension (TCE) Bit. Bit 15, read/write. Setting this bit to
1 changes how the INVLPG, INVLPGB, and INVPCID instructions operate on TLB
entries. When this bit is 0, these instructions remove the target PTE from the
TLB as well as all upper-level table entries that are cached in the TLB,
whether or not they are associated with the target PTE.  When this bit is set,
these instructions will remove the target PTE and only those upper-level
entries that lead to the target PTE in the page table hierarchy, leaving
unrelated upper-level entries intact.

  [ bp: use cpu_has()... I know, it is a mess. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-13-riel@surriel.com
2025-03-19 11:12:29 +01:00
Rik van Riel
767ae437a3 x86/mm: Add INVLPGB feature and Kconfig entry
In addition, the CPU advertises the maximum number of pages that can be
shot down with one INVLPGB instruction in CPUID. Save that information
for later use.

  [ bp: use cpu_has(), typos, massage. ]

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226030129.530345-3-riel@surriel.com
2025-03-19 11:08:52 +01:00
Ingo Molnar
89771319e0 Linux 6.14-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfXVtUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGN/sH/i5423Gt/z51gDjA
 s4v5Z7GaBJ9zOGBahn2RWFe72zytTqKrEJmMnGfguirs0atD1DtQj4WAP7iFKP+e
 WyO663X6HF7i5y37ja0Yd4PZc31hwtqzKH8LjBf8f8tTy8UsEVqumdi5A4sS9KTM
 qm4kTyyVEY9D/s7oRY8ywjDlRJtO6nT0aKMp4kAqNEbrNUYbilT/a0hgXcgSmPyB
 uIjmjL2fZfutxGI5LgfbaSHCa1ElmhvTvivOMpaAmZSGCRVHCKGgT0CTNnHyn/7C
 dB145JkRO4ZOUqirCdO4PE/23id3ajq9fcixJGBzAv7c45y+B3JZ1r2kAfKalE8/
 qrOKLys=
 =8r7a
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc7' into x86/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-19 11:03:06 +01:00
Shuai Xue
1a15bb8303 x86/mce: use is_copy_from_user() to determine copy-from-user context
Patch series "mm/hwpoison: Fix regressions in memory failure handling",
v4.

## 1. What am I trying to do:

This patchset resolves two critical regressions related to memory failure
handling that have appeared in the upstream kernel since version 5.17, as
compared to 5.10 LTS.

    - copyin case: poison found in user page while kernel copying from user space
    - instr case: poison found while instruction fetching in user space

## 2. What is the expected outcome and why

- For copyin case:

Kernel can recover from poison found where kernel is doing get_user() or
copy_from_user() if those places get an error return and the kernel return
-EFAULT to the process instead of crashing.  More specifily, MCE handler
checks the fixup handler type to decide whether an in kernel #MC can be
recovered.  When EX_TYPE_UACCESS is found, the PC jumps to recovery code
specified in _ASM_EXTABLE_FAULT() and return a -EFAULT to user space.

- For instr case:

If a poison found while instruction fetching in user space, full recovery
is possible.  User process takes #PF, Linux allocates a new page and fills
by reading from storage.


## 3. What actually happens and why

- For copyin case: kernel panic since v5.17

Commit 4c132d1d84 ("x86/futex: Remove .fixup usage") introduced a new
extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the
extable fixup type for copy-from-user operations, changing it from
EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG.  It breaks previous EX_TYPE_UACCESS
handling when posion found in get_user() or copy_from_user().

- For instr case: user process is killed by a SIGBUS signal due to #CMCI
  and #MCE race

When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a UCNA
signature, and the core reporting and SRAR signature machine check when
the data is about to be consumed.

### Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1]

Prior to Icelake memory controllers reported patrol scrub events that
detected a previously unseen uncorrected error in memory by signaling a
broadcast machine check with an SRAO (Software Recoverable Action
Optional) signature in the machine check bank.  This was overkill because
it's not an urgent problem that no core is on the verge of consuming that
bad data.  It's also found that multi SRAO UCE may cause nested MCE
interrupts and finally become an IERR.

Hence, Intel downgrades the machine check bank signature of patrol scrub
from SRAO to UCNA (Uncorrected, No Action required), and signal changed to
#CMCI.  Just to add to the confusion, Linux does take an action (in
uc_decode_notifier()) to try to offline the page despite the UC*NA*
signature name.

### Background: why #CMCI and #MCE race when poison is consuming in
    Intel platform [1]

Having decided that CMCI/UCNA is the best action for patrol scrub errors,
the memory controller uses it for reads too.  But the memory controller is
executing asynchronously from the core, and can't tell the difference
between a "real" read and a speculative read.  So it will do CMCI/UCNA if
an error is found in any read.

Thus:

1) Core is clever and thinks address A is needed soon, issues a
   speculative read.

2) Core finds it is going to use address A soon after sending the read
   request

3) The CMCI from the memory controller is in a race with MCE from the
   core that will soon try to retire the load from address A.

Quite often (because speculation has got better) the CMCI from the memory
controller is delivered before the core is committed to the instruction
reading address A, so the interrupt is taken, and Linux offlines the page
(marking it as poison).


## Why user process is killed for instr case

Commit 046545a661 ("mm/hwpoison: fix error page recovered but reported
"not recovered"") tries to fix noise message "Memory error not recovered"
and skips duplicate SIGBUSs due to the race.  But it also introduced a bug
that kill_accessing_process() return -EHWPOISON for instr case, as result,
kill_me_maybe() send a SIGBUS to user process.

# 4. The fix, in my opinion, should be:

- For copyin case:

The key point is whether the error context is in a read from user memory. 
We do not care about the ex-type if we know its a MOV reading from
userspace.

is_copy_from_user() return true when both of the following two checks are
true:

    - the current instruction is copy
    - source address is user memory

If copy_user is true, we set

m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;

Then do_machine_check() will try fixup_exception() first.

- For instr case: let kill_accessing_process() return 0 to prevent a SIGBUS.

- For patch 3:

The return value of memory_failure() is quite important while discussed
instr case regression with Tony and Miaohe for patch 2, so add comment
about the return value.


This patch (of 3):

Commit 4c132d1d84 ("x86/futex: Remove .fixup usage") introduced a new
extable fixup type, EX_TYPE_EFAULT_REG, and commit 4c132d1d84
("x86/futex: Remove .fixup usage") updated the extable fixup type for
copy-from-user operations, changing it from EX_TYPE_UACCESS to
EX_TYPE_EFAULT_REG.  The error context for copy-from-user operations no
longer functions as an in-kernel recovery context.  Consequently, the
error context for copy-from-user operations no longer functions as an
in-kernel recovery context, resulting in kernel panics with the message:
"Machine check: Data load in unrecoverable area of kernel."

To address this, it is crucial to identify if an error context involves a
read operation from user memory.  The function is_copy_from_user() can be
utilized to determine:

    - the current operation is copy
    - when reading user memory

When these conditions are met, is_copy_from_user() will return true,
confirming that it is indeed a direct copy from user memory.  This check
is essential for correctly handling the context of errors in these
operations without relying on the extable fixup types that previously
allowed for in-kernel recovery.

So, use is_copy_from_user() to determine if a context is copy user directly.

Link: https://lkml.kernel.org/r/20250312112852.82415-1-xueshuai@linux.alibaba.com
Link: https://lkml.kernel.org/r/20250312112852.82415-2-xueshuai@linux.alibaba.com
Fixes: 4c132d1d84 ("x86/futex: Remove .fixup usage")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Ruidong Tian <tianruidong@linux.alibaba.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:07:05 -07:00
Mike Rapoport (Microsoft)
e120d1bc12 arch, mm: set high_memory in free_area_init()
high_memory defines upper bound on the directly mapped memory.  This bound
is defined by the beginning of ZONE_HIGHMEM when a system has high memory
and by the end of memory otherwise.

All this is known to generic memory management initialization code that
can set high_memory while initializing core mm structures.

Add a generic calculation of high_memory to free_area_init() and remove
per-architecture calculation except for the architectures that set and use
high_memory earlier than that.

Link: https://lkml.kernel.org/r/20250313135003.836600-11-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>	[x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17 22:06:52 -07:00
Chao Gao
dda366083e x86/fpu/xstate: Fix inconsistencies in guest FPU xfeatures
Guest FPUs manage vCPU FPU states. They are allocated via
fpu_alloc_guest_fpstate() and are resized in fpstate_realloc() when XFD
features are enabled.

Since the introduction of guest FPUs, there have been inconsistencies in
the kernel buffer size and xfeatures:

 1. fpu_alloc_guest_fpstate() uses fpu_user_cfg since its introduction. See:

    69f6ed1d14 ("x86/fpu: Provide infrastructure for KVM FPU cleanup")
    36487e6228 ("x86/fpu: Prepare guest FPU for dynamically enabled FPU features")

 2. __fpstate_reset() references fpu_kernel_cfg to set storage attributes.

 3. fpu->guest_perm uses fpu_kernel_cfg, affecting fpstate_realloc().

A recent commit in the tip:x86/fpu tree partially addressed the inconsistency
between (1) and (3) by using fpu_kernel_cfg for size calculation in (1),
but left fpu_guest->xfeatures and fpu_guest->perm still referencing
fpu_user_cfg:

  https://lore.kernel.org/all/20250218141045.85201-1-stanspas@amazon.de/

  1937e18cc3 ("x86/fpu: Fix guest FPU state buffer allocation size")

The inconsistencies within fpu_alloc_guest_fpstate() and across the
mentioned functions cause confusion.

Fix them by using fpu_kernel_cfg consistently in fpu_alloc_guest_fpstate(),
except for fields related to the UABI buffer. Referencing fpu_kernel_cfg
won't impact functionalities, as:

 1. fpu_guest->perm is overwritten shortly in fpu_init_guest_permissions()
    with fpstate->guest_perm, which already uses fpu_kernel_cfg.

 2. fpu_guest->xfeatures is solely used to check if XFD features are enabled.
    Including supervisor xfeatures doesn't affect the check.

Fixes: 36487e6228 ("x86/fpu: Prepare guest FPU for dynamically enabled FPU features")
Suggested-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Link: https://lore.kernel.org/r/20250317140613.1761633-1-chao.gao@intel.com
2025-03-17 23:52:31 +01:00
Borislav Petkov (AMD)
4348e91778 x86/fpu: Clarify the "xa" symbolic name used in the XSTATE* macros
Tie together the %[xa] in the XSAVE/XRSTOR definitions with the
respective usage in the asm macros so that it is perfectly clear.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-03-17 13:47:12 +01:00
Josh Poimboeuf
8085fcd78c x86/traps: Make exc_double_fault() consistently noreturn
The CONFIG_X86_ESPFIX64 version of exc_double_fault() can return to its
caller, but the !CONFIG_X86_ESPFIX64 version never does.  In the latter
case the compiler and/or objtool may consider it to be implicitly
noreturn.

However, due to the currently inflexible way objtool detects noreturns,
a function's noreturn status needs to be consistent across configs.

The current workaround for this issue is to suppress unreachable
warnings for exc_double_fault()'s callers.  Unfortunately that can
result in ORC coverage gaps and potentially worse issues like inert
static calls and silently disabled CPU mitigations.

Instead, prevent exc_double_fault() from ever being implicitly marked
noreturn by forcing a return behind a never-taken conditional.

Until a more integrated noreturn detection method exists, this is likely
the least objectionable workaround.

Fixes: 55eeab2a8a ("objtool: Ignore exc_double_fault() __noreturn warnings")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/r/d1f4026f8dc35d0de6cc61f2684e0cb6484009d1.1741975349.git.jpoimboe@kernel.org
2025-03-17 11:35:59 +01:00
Sebastian Andrzej Siewior
96389cf365 x86: Rely on generic printing of preemption model
After __die_header(), __die_body() is always invoked. There we have
show_regs() -> show_regs_print_info() which prints the current
preemption model.
Remove it from the initial line.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250314160810.2373416-8-bigeasy@linutronix.de
2025-03-17 11:23:40 +01:00
Sourabh Jain
7b54a96f30 crash: remove an unused argument from reserve_crashkernel_generic()
cmdline argument is not used in reserve_crashkernel_generic() so remove
it.  Correspondingly, all the callers have been updated as well.

No functional change intended.

Link: https://lkml.kernel.org/r/20250131113830.925179-3-sourabhjain@linux.ibm.com
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:30:47 -07:00
Frank van der Linden
665eaf3133 x86/setup: call hugetlb_bootmem_alloc early
Call hugetlb_bootmem_allloc in an earlier spot in setup, after
hugelb_cma_reserve.  This will make vmemmap preinit of the sections
covered by the allocated hugetlb pages possible.

Link: https://lkml.kernel.org/r/20250228182928.2645936-21-fvdl@google.com
Signed-off-by: Frank van der Linden <fvdl@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Roman Gushchin (Cruise) <roman.gushchin@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:06:29 -07:00
Uros Bizjak
6b7ce3134f x86/kgdb: use IS_ERR_PCPU() macro
Patch series "Enable strict percpu address space checks", v4.

Enable strict percpu address space checks via x86 named address space
qualifiers.  Percpu variables are declared in __seg_gs/__seg_fs named AS
and kept named AS qualified until they are dereferenced via percpu
accessor.  This approach enables various compiler checks for
cross-namespace variable assignments.

Please note that current version of sparse doesn't know anything about
__typeof_unqual__() operator.  Avoid the usage of __typeof_unqual__() when
sparse checking is active to prevent sparse errors with unknowing keyword.
The proposed patch by Dan Carpenter to implement __typeof_unqual__()
handling in sparse is located at:

https://lore.kernel.org/lkml/5b8d0dee-8fb6-45af-ba6c-7f74aff9a4b8@stanley.mountain/


This patch (of 6):

Use IS_ERR_PCPU() when checking the error pointer in the percpu address
space.  This macro adds intermediate cast to unsigned long when switching
named address spaces.

The patch will avoid future build errors due to pointer address space
mismatch with enabled strict percpu address space checks.

Link: https://lkml.kernel.org/r/20250127160709.80604-1-ubizjak@gmail.com
Link: https://lkml.kernel.org/r/20250127160709.80604-2-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Nadav Amit <nadav.amit@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16 22:05:52 -07:00
David Woodhouse
b25eb5f5e4 x86/kexec: Add relocate_kernel() debugging support: Load a GDT
There are some failure modes which lead to triple-faults in the
relocate_kernel() function, which is fairly much undebuggable
for normal mortals.

Adding a GDT in the relocate_kernel() environment is step 1 towards
being able to catch faults and do something more useful.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250312144257.2348250-2-dwmw2@infradead.org
2025-03-14 11:01:53 +01:00
Ajay Kaher
a2ab25529b x86/vmware: Parse MP tables for SEV-SNP enabled guests under VMware hypervisors
Under VMware hypervisors, SEV-SNP enabled VMs are fundamentally able to boot
without UEFI, but this regressed a year ago due to:

  0f4a1e8098 ("x86/sev: Skip ROM range scans and validation for SEV-SNP guests")

In this case, mpparse_find_mptable() has to be called to parse MP
tables which contains the necessary boot information.

[ mingo: Updated the changelog. ]

Fixes: 0f4a1e8098 ("x86/sev: Skip ROM range scans and validation for SEV-SNP guests")
Co-developed-by: Ye Li <ye.li@broadcom.com>
Signed-off-by: Ye Li <ye.li@broadcom.com>
Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ye Li <ye.li@broadcom.com>
Reviewed-by: Kevin Loughlin <kevinloughlin@google.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250313173111.10918-1-ajay.kaher@broadcom.com
2025-03-13 19:01:09 +01:00
Uros Bizjak
2883b4c216 x86/fpu: Use XSAVE{,OPT,C,S} and XRSTOR{,S} mnemonics in xstate.h
Current minimum required version of binutils is 2.25, which
supports XSAVE{,OPT,C,S} and XRSTOR{,S} instruction mnemonics.

Replace the byte-wise specification of XSAVE{,OPT,C,S}
and XRSTOR{,S} with these proper mnemonics.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250313130251.383204-1-ubizjak@gmail.com
2025-03-13 18:36:52 +01:00
James Morse
823beb31e5 x86/resctrl: Move get_{mon,ctrl}_domain_from_cpu() to live with their callers
Each of get_{mon,ctrl}_domain_from_cpu() only has one caller.

Once the filesystem code is moved to /fs/, there is no equivalent to
core.c.

Move these functions to each live next to their caller. This allows
them to be made static and the header file entries to be removed.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-31-james.morse@arm.com
2025-03-12 12:24:58 +01:00
James Morse
f62b4e45e0 x86/resctrl: Move get_config_index() to a header
get_config_index() is used by the architecture specific code to map
a CLOSID+type pair to an index in the configuration arrays.

MPAM needs to do this too to preserve the ABI to user-space, there is no
reason to do it differently.

Move the helper to a header file to allow all architectures that either
use or emulate CDP to use the same pattern of CLOSID values. Moving
this to a header file means it must be marked inline, which matches
the existing compiler choice for this static function.

Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-30-james.morse@arm.com
2025-03-12 12:24:54 +01:00
James Morse
6c2282d42c x86/resctrl: Handle throttle_mode for SMBA resources
Now that the visibility of throttle_mode is being managed by resctrl, it
should consider resources other than MBA that may have a throttle_mode.  SMBA
is one such resource.

Extend thread_throttle_mode_init() to check SMBA for a throttle_mode.

Adding support for multiple resources means it is possible for a platform with
both MBA and SMBA, but an undefined throttle_mode on one of them to make the
file visible.

Add the 'undefined' case to rdt_thread_throttle_mode_show().

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-29-james.morse@arm.com
2025-03-12 12:24:46 +01:00
James Morse
373af4ecfd x86/resctrl: Move RFTYPE flags to be managed by resctrl
resctrl_file_fflags_init() is called from the architecture specific code to
make the 'thread_throttle_mode' file visible. The architecture specific code
has already set the membw.throttle_mode in the rdt_resource.

This forces the RFTYPE flags used by resctrl to be exposed to the architecture
specific code.

This doesn't need to be specific to the architecture, the throttle_mode can be
used by resctrl to determine if the 'thread_throttle_mode' file should be
visible. This allows the RFTYPE flags to be private to resctrl.

Add thread_throttle_mode_init(), and use it to call resctrl_file_fflags_init()
from resctrl_init(). This avoids publishing an extra function between the
architecture and filesystem code.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-28-james.morse@arm.com
2025-03-12 12:24:37 +01:00
James Morse
4cf9acfc8f x86/resctrl: Make resctrl_arch_pseudo_lock_fn() take a plr
resctrl_arch_pseudo_lock_fn() has architecture specific behaviour,
and takes a struct rdtgroup as an argument.

After the filesystem code moves to /fs/, the definition of struct
rdtgroup will not be available to the architecture code.

The only reason resctrl_arch_pseudo_lock_fn() wants the rdtgroup is
for the CLOSID. Embed that in the pseudo_lock_region as a closid,
and move the definition of struct pseudo_lock_region to resctrl.h.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-27-james.morse@arm.com
2025-03-12 12:24:33 +01:00
James Morse
4d20f38ab6 x86/resctrl: Make prefetch_disable_bits belong to the arch code
prefetch_disable_bits is set by rdtgroup_locksetup_enter() from a value
provided by the architecture, but is largely read by other architecture
helpers.

Make resctrl_arch_get_prefetch_disable_bits() set prefetch_disable_bits so
that it can be isolated to arch-code from where the other arch-code helpers
can use its cached value.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-26-james.morse@arm.com
2025-03-12 12:24:30 +01:00
James Morse
7028840552 x86/resctrl: Allow an architecture to disable pseudo lock
Pseudo-lock relies on knowledge of the micro-architecture to disable
prefetchers etc.

On arm64 these controls are typically secure only, meaning Linux can't access
them. Arm's cache-lockdown feature works in a very different way. Resctrl's
pseudo-lock isn't going to be used on arm64 platforms.

Add a Kconfig symbol that can be selected by the architecture. This enables or
disables building of the pseudo_lock.c file, and replaces the functions with
stubs. An additional IS_ENABLED() check is needed in rdtgroup_mode_write() so
that attempting to enable pseudo-lock reports an "Unknown or unsupported mode"
to user-space via the last_cmd_status file.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-25-james.morse@arm.com
2025-03-12 12:24:25 +01:00
James Morse
7d0ec14c64 x86/resctrl: Add resctrl_arch_ prefix to pseudo lock functions
resctrl's pseudo lock has some copy-to-cache and measurement functions that
are micro-architecture specific.

For example, pseudo_lock_fn() is not at all portable.

Label these 'resctrl_arch_' so they stay under /arch/x86.  To expose these
functions to the filesystem code they need an entry in a header file, and
can't be marked static.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-24-james.morse@arm.com
2025-03-12 12:24:22 +01:00
James Morse
c32a7d7777 x86/resctrl: Move mbm_cfg_mask to struct rdt_resource
The mbm_cfg_mask field lists the bits that user-space can set when configuring
an event. This value is output via the last_cmd_status file.

Once the filesystem parts of resctrl are moved to live in /fs/, the struct
rdt_hw_resource is inaccessible to the filesystem code. Because this value is
output to user-space, it has to be accessible to the filesystem code.

Move it to struct rdt_resource.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-23-james.morse@arm.com
2025-03-12 12:24:09 +01:00
James Morse
37bae17567 x86/resctrl: Move mba_mbps_default_event init to filesystem code
mba_mbps_default_event is initialised based on whether mbm_local or mbm_total
is supported. In the case of both, it is initialised to mbm_local.
mba_mbps_default_event is initialised in core.c's get_rdt_mon_resources(),
while all the readers are in rdtgroup.c.

After this code is split into architecture-specific and filesystem code,
get_rdt_mon_resources() remains part of the architecture code, which would
mean mba_mbps_default_event has to be exposed by the filesystem code.

Move the initialisation to the filesystem's resctrl_mon_resource_init().

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-22-james.morse@arm.com
2025-03-12 12:23:52 +01:00
James Morse
650680d651 x86/resctrl: Change mon_event_config_{read,write}() to be arch helpers
mon_event_config_{read,write}() are called via IPI and access model specific
registers to do their work.

To support another architecture, this needs abstracting.

Rename mon_event_config_{read,write}() to have a "resctrl_arch_" prefix, and
move their struct mon_config_info parameter into <linux/resctrl.h>.  This
allows another architecture to supply an implementation of these.

As struct mon_config_info is now exposed globally, give it a 'resctrl_'
prefix. MPAM systems need access to the domain to do this work, add the
resource and domain to struct resctrl_mon_config_info.

Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-21-james.morse@arm.com
2025-03-12 12:23:49 +01:00
James Morse
d81826f87a x86/resctrl: Add resctrl_arch_is_evt_configurable() to abstract BMEC
When BMEC is supported the resctrl event can be configured in a number of
ways. This depends on architecture support. rdt_get_mon_l3_config() modifies
the struct mon_evt and calls resctrl_file_fflags_init() to create the files
that allow the configuration.

Splitting this into separate architecture and filesystem parts would require
the struct mon_evt and resctrl_file_fflags_init() to be exposed.

Instead, add resctrl_arch_is_evt_configurable(), and use this from
resctrl_mon_resource_init() to initialise struct mon_evt and call
resctrl_file_fflags_init().

resctrl_arch_is_evt_configurable() calls rdt_cpu_has() so it doesn't obviously
benefit from being inlined. Putting it in core.c will allow rdt_cpu_has() to
eventually become static.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-20-james.morse@arm.com
2025-03-12 12:23:45 +01:00
James Morse
d012b66a16 x86/resctrl: Move the is_mbm_*_enabled() helpers to asm/resctrl.h
The architecture specific parts of resctrl provide helpers like
is_mbm_total_enabled() and is_mbm_local_enabled() to hide accesses to the
rdt_mon_features bitmap.

Exposing a group of helpers between the architecture and filesystem code is
preferable to a single unsigned-long like rdt_mon_features. Helpers can be more
readable and have a well defined behaviour, while allowing architectures to hide
more complex behaviour.

Once the filesystem parts of resctrl are moved, these existing helpers can no
longer live in internal.h. Move them to include/linux/resctrl.h Once these are
exposed to the wider kernel, they should have a 'resctrl_arch_' prefix, to fit
the rest of the arch<->fs interface.

Move and rename the helpers that touch rdt_mon_features directly. is_mbm_event()
and is_mbm_enabled() are only called from rdtgroup.c, so can be moved into that
file.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-19-james.morse@arm.com
2025-03-12 12:23:33 +01:00
James Morse
88464bff03 x86/resctrl: Rewrite and move the for_each_*_rdt_resource() walkers
The for_each_*_rdt_resource() helpers walk the architecture's array of
structures, using the resctrl visible part as an iterator. These became
over-complex when the structures were split into a filesystem and
architecture-specific struct. This approach avoided the need to touch every
call site, and was done before there was a helper to retrieve a resource by
rid.

Once the filesystem parts of resctrl are moved to /fs/, both the arch's
resource array, and the definition of those structures is no longer
accessible. To support resctrl, each architecture would have to provide
equally complex macros.

Rewrite the macro to make use of resctrl_arch_get_resource(), and move these
to include/linux/resctrl.h so existing x86 arch code continues to use them.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-18-james.morse@arm.com
2025-03-12 12:23:30 +01:00
James Morse
4b6bdbf27f x86/resctrl: Move monitor init work to a resctrl init call
rdt_get_mon_l3_config() is called from the arch's resctrl_arch_late_init(),
and initialises both architecture specific fields, such as hw_res->mon_scale
and resctrl filesystem fields by calling dom_data_init().

To separate the filesystem and architecture parts of resctrl, this function
needs splitting up.

Add resctrl_mon_resource_init() to do the filesystem specific work, and call
it from resctrl_init(). This runs later, but is still before the filesystem is
mounted and the rmid_ptrs[] array can be used.

  [ bp: Massage commit message. ]

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-17-james.morse@arm.com
2025-03-12 12:23:21 +01:00
James Morse
011842727f x86/resctrl: Move monitor exit work to a resctrl exit call
rdt_put_mon_l3_config() is called via the architecture's resctrl_arch_exit()
call, and appears to free the rmid_ptrs[] and closid_num_dirty_rmid[] arrays.
In reality this code is marked __exit, and is removed by the linker as resctrl
can't be built as a module.

To separate the filesystem and architecture parts of resctrl, this free()ing
work needs to be triggered by the filesystem, as these structures belong to
the filesystem code.

Rename rdt_put_mon_l3_config() to resctrl_mon_resource_exit() and call it from
resctrl_exit(). The kfree() is currently dependent on r->mon_capable.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-16-james.morse@arm.com
2025-03-12 12:23:13 +01:00
James Morse
9be68b144a x86/resctrl: Add an arch helper to reset one resource
On umount(), resctrl resets each resource back to its default configuration.
It only ever does this for all resources in one go.

reset_all_ctrls() is architecture specific as it works with struct
rdt_hw_resource.

Make reset_all_ctrls() an arch helper that resets one resource.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-15-james.morse@arm.com
2025-03-12 12:23:10 +01:00
James Morse
f16adbaf92 x86/resctrl: Move resctrl types to a separate header
When resctrl is fully factored into core and per-arch code, each arch will
need to use some resctrl common definitions in order to define its own
specializations and helpers.  Following conventional practice, it would be
desirable to put the dependent arch definitions in an <asm/resctrl.h> header
that is included by the common <linux/resctrl.h> header.  However, this can
make it awkward to avoid a circular dependency between <linux/resctrl.h> and
the arch header.

To avoid such dependencies, move the affected common types and constants into
a new header that does not need to depend on <linux/resctrl.h> or on the arch
headers.

The same logic applies to the monitor-configuration defines, move these too.

Some kind of enumeration for events is needed between the filesystem and
architecture code. Take the x86 definition as its convenient for x86.

The definition of enum resctrl_event_id is needed to allow the architecture
code to define resctrl_arch_mon_ctx_alloc() and resctrl_arch_mon_ctx_free().

The definition of enum resctrl_res_level is needed to allow the architecture
code to define resctrl_arch_set_cdp_enabled() and
resctrl_arch_get_cdp_enabled().

The bits for mbm_local_bytes_config et al are ABI, and must be the same on all
architectures. These are documented in Documentation/arch/x86/resctrl.rst

The maintainers entry for these headers was missed when resctrl.h was created.
Add a wildcard entry to match both resctrl.h and resctrl_types.h.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-14-james.morse@arm.com
2025-03-12 12:23:00 +01:00
James Morse
e3d5138cef x86/resctrl: Move rdt_find_domain() to be visible to arch and fs code
rdt_find_domain() finds a domain given a resource and a cache-id.  This is
used by both the architecture code and the filesystem code.

After the filesystem code moves to live in /fs/, this helper is either
duplicated by all architectures, or needs exposing by the filesystem code.

Add the declaration to the global header file. As it's now globally visible,
and has only a handful of callers, swap the 'rdt' for 'resctrl'. Move the
function to live with its caller in ctrlmondata.c as the filesystem code will
not have anything corresponding to core.c.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-13-james.morse@arm.com
2025-03-12 12:22:56 +01:00
James Morse
8079565d17 x86/resctrl: Expose resctrl fs's init function to the rest of the kernel
rdtgroup_init() needs exposing to the rest of the kernel so that arch code can
call it once it lives in core code. As this is one of the few functions
exposed, rename it to have "resctrl" in the name. The same goes for the exit
call.

Rename x86's arch code init functions for RDT to have an arch prefix to make
it clear these are part of the architecture code.

Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-12-james.morse@arm.com
2025-03-12 12:22:54 +01:00
James Morse
6f06aee356 x86/resctrl: Remove rdtgroup from update_cpu_closid_rmid()
update_cpu_closid_rmid() takes a struct rdtgroup as an argument, which it uses
to update the local CPUs default pqr values. This is a problem once the
resctrl parts move out to /fs/, as the arch code cannot poke around inside
struct rdtgroup.

Rename update_cpu_closid_rmid() as resctrl_arch_sync_cpus_defaults() to be
used as the target of an IPI, and pass the effective CLOSID and RMID in a new
struct.

Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-11-james.morse@arm.com
2025-03-12 12:22:51 +01:00
James Morse
aebd5354dd x86/resctrl: Add helper for setting CPU default properties
rdtgroup_rmdir_ctrl() and rdtgroup_rmdir_mon() set the per-CPU pqr_state for
CPUs that were part of the rmdir()'d group.

Another architecture might not have a 'pqr_state', its hardware may need the
values in a different format. MPAM's equivalent of RMID values are not unique,
and always need the CLOSID to be provided too.

There is only one caller that modifies a single value, (rdtgroup_rmdir_mon()).
MPAM always needs both CLOSID and RMID for the hardware value as these are
written to the same system register.

As rdtgroup_rmdir_mon() has the CLOSID on hand, only provide a helper to set
both values. These values are read by __resctrl_sched_in(), but may be written
by a different CPU without any locking, add READ/WRTE_ONCE() to avoid torn
values.

Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-10-james.morse@arm.com
2025-03-12 12:22:48 +01:00
James Morse
dbc58f7eec x86/resctrl: Generate default_ctrl instead of sharing it
The struct rdt_resource default_ctrl is used by both the architecture code for
resetting the hardware controls, and sometimes by the filesystem code as the
default value for the schema, unless the bandwidth software controller is in
use.

Having the default exposed by the architecture code causes unnecessary
duplication for each architecture as the default value must be specified, but
can be derived from other schema properties. Now that the maximum bandwidth is
explicitly described, resctrl can derive the default value from the schema
format and the other resource properties.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-9-james.morse@arm.com
2025-03-12 12:22:44 +01:00
James Morse
634ebb98b9 x86/resctrl: Add max_bw to struct resctrl_membw
__rdt_get_mem_config_amd() and __get_mem_config_intel() both use the
default_ctrl property as a maximum value. This is because the MBA schema works
differently between these platforms. Doing this complicates determining
whether the default_ctrl property belongs to the arch code, or can be derived
from the schema format.

Deriving the maximum or default value from the schema format would avoid the
architecture code having to tell resctrl such obvious things as the maximum
percentage is 100, and the maximum bitmap is all ones.

Maximum bandwidth is always going to vary per platform. Add max_bw as
a special case. This is currently used for the maximum MBA percentage on Intel
platforms, but can be removed from the architecture code if 'percentage'
becomes a schema format resctrl supports directly.

This value isn't needed for other schema formats.

This will allow the default_ctrl to be generated from the schema properties
when it is needed.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-8-james.morse@arm.com
2025-03-12 12:22:41 +01:00
James Morse
43312b8ea1 x86/resctrl: Remove data_width and the tabular format
The resctrl architecture code provides a data_width for the controls of each
resource. This is used to zero pad all control values in the schemata file so
they appear in columns. The same is done with the resource names to complete
the visual effect. e.g.

  | SMBA:0=2048
  |   L3:0=00ff

AMD platforms discover their maximum bandwidth for the MB resource from
firmware, but hard-code the data_width to 4. If the maximum bandwidth requires
more digits - the tabular format is silently broken.  This is also broken when
the mba_MBps mount option is used as the field width isn't updated. If new
schema are added resctrl will need to be able to determine the maximum width.
The benefit of this pretty-printing is questionable.

Instead of handling runtime discovery of the data_width for AMD platforms,
remove the feature. These fields are always zero padded so should be harmless
to remove if the whole field has been treated as a number.  In the above
example, this would now look like this:

  | SMBA:0=2048
  |   L3:0=ff

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-7-james.morse@arm.com
2025-03-12 12:22:36 +01:00
James Morse
bb9343c8f2 x86/resctrl: Use schema type to determine the schema format string
Resctrl's architecture code gets to specify a format string that is
used when printing schema entries. This is expected to be one of two
values that the filesystem code supports.

Setting this format string allows the architecture code to change
the ABI resctrl presents to user-space.

Instead, use the schema format enum to choose which format string to
use.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-6-james.morse@arm.com
2025-03-12 12:22:33 +01:00
James Morse
c24f5eab6b x86/resctrl: Use schema type to determine how to parse schema values
Resctrl's architecture code gets to specify a function pointer that is used
when parsing schema entries. This is expected to be one of two helpers from
the filesystem code.

Setting this function pointer allows the architecture code to change the ABI
resctrl presents to user-space, and forces resctrl to expose these helpers.

Instead, add a schema format enum to choose which schema parser to use. This
allows the helpers to be made static and the structs used for passing
arguments moved out of shared headers.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-5-james.morse@arm.com
2025-03-12 12:22:30 +01:00
James Morse
131dab13a8 x86/resctrl: Remove fflags from struct rdt_resource
The resctrl arch code specifies whether a resource controls a cache or memory
using the fflags field. This field is then used by resctrl to determine which
files should be exposed in the filesystem.

Allowing the architecture to pick this value means the RFTYPE_ flags have to
be in a shared header, and allows an architecture to create a combination that
resctrl does not support.

Remove the fflags field, and pick the value based on the resource id.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-4-james.morse@arm.com
2025-03-12 12:22:26 +01:00
James Morse
3c02153113 x86/resctrl: Add a helper to avoid reaching into the arch code resource list
Resctrl occasionally wants to know something about a specific resource, in
these cases it reaches into the arch code's rdt_resources_all[] array.

Once the filesystem parts of resctrl are moved to /fs/, this means it will
need visibility of the architecture specific struct rdt_hw_resource
definition, and the array of all resources.  All architectures would also need
a r_resctrl member in this struct.

Instead, abstract this via a helper to allow architectures to do different
things here. Move the level enum to the resctrl header and add a helper to
retrieve the struct rdt_resource by 'rid'.

resctrl_arch_get_resource() should not return NULL for any value in the enum,
it may instead return a dummy resource that is !alloc_enabled && !mon_enabled.

Co-developed-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-3-james.morse@arm.com
2025-03-12 12:22:22 +01:00
James Morse
a121798ae6 x86/resctrl: Fix allocation of cleanest CLOSID on platforms with no monitors
Commit

  6eac36bb9e ("x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid")

added logic that causes resctrl to search for the CLOSID with the fewest dirty
cache lines when creating a new control group, if requested by the arch code.
This depends on the values read from the llc_occupancy counters. The logic is
applicable to architectures where the CLOSID effectively forms part of the
monitoring identifier and so do not allow complete freedom to choose an unused
monitoring identifier for a given CLOSID.

This support missed that some platforms may not have these counters.  This
causes a NULL pointer dereference when creating a new control group as the
array was not allocated by dom_data_init().

As this feature isn't necessary on platforms that don't have cache occupancy
monitors, add this to the check that occurs when a new control group is
allocated.

Fixes: 6eac36bb9e ("x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid")
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com> # arm64
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Amit Singh Tomar <amitsinght@marvell.com> # arm64
Tested-by: Shanker Donthineni <sdonthineni@nvidia.com> # arm64
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/20250311183715.16445-2-james.morse@arm.com
2025-03-12 12:21:48 +01:00
Greg Kroah-Hartman
993a47bd7b Linux 6.14-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfOKBUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG1aQH/iC+Oyij4VxAjBek
 BOXIT/p6CwlIXb8ObiWWcRjDPizlcxb3RaV8J2RO+IqaQ2wltxpFANq2G7Re2FPm
 SNcEpIURAOVcxHGedcfFA91srO5F4FzNTO8LVp7MIbcgMYy3pdk+dbZmi6A691R+
 t9pb74m+MAnF1o/MUx7pUlhAT/4ymuuR0F7WCSg4h0Xwe5m0nlJY89kJBC7PCjyd
 n3mdhsz3rDSLmt/z/T7HGD89r8sYSvm9cOKtL3ELgGTrm7boQV8ii9Y9w04DI8PQ
 JmIernugcCxmhH36mVUAHgJf2+/T388xFUh/D5+skeUOUZpaJZG866rnb32WpsHc
 eWLFUeg=
 =Wypt
 -----END PGP SIGNATURE-----

Merge 6.14-rc6 into driver-core-next

We need the driver core fix in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-03-10 17:37:25 +01:00
Florent Revest
e3e89178a9 x86/microcode/AMD: Fix out-of-bounds on systems with CPU-less NUMA nodes
Currently, load_microcode_amd() iterates over all NUMA nodes, retrieves their
CPU masks and unconditionally accesses per-CPU data for the first CPU of each
mask.

According to Documentation/admin-guide/mm/numaperf.rst:

  "Some memory may share the same node as a CPU, and others are provided as
  memory only nodes."

Therefore, some node CPU masks may be empty and wouldn't have a "first CPU".

On a machine with far memory (and therefore CPU-less NUMA nodes):
- cpumask_of_node(nid) is 0
- cpumask_first(0) is CONFIG_NR_CPUS
- cpu_data(CONFIG_NR_CPUS) accesses the cpu_info per-CPU array at an
  index that is 1 out of bounds

This does not have any security implications since flashing microcode is
a privileged operation but I believe this has reliability implications by
potentially corrupting memory while flashing a microcode update.

When booting with CONFIG_UBSAN_BOUNDS=y on an AMD machine that flashes
a microcode update. I get the following splat:

  UBSAN: array-index-out-of-bounds in arch/x86/kernel/cpu/microcode/amd.c:X:Y
  index 512 is out of range for type 'unsigned long[512]'
  [...]
  Call Trace:
   dump_stack
   __ubsan_handle_out_of_bounds
   load_microcode_amd
   request_microcode_amd
   reload_store
   kernfs_fop_write_iter
   vfs_write
   ksys_write
   do_syscall_64
   entry_SYSCALL_64_after_hwframe

Change the loop to go over only NUMA nodes which have CPUs before determining
whether the first CPU on the respective node needs microcode update.

  [ bp: Massage commit message, fix typo. ]

Fixes: 7ff6edf4fe ("x86/microcode/AMD: Fix mixed steppings support")
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250310144243.861978-1-revest@chromium.org
2025-03-10 16:02:54 +01:00
Vladis Dronov
65be5c95d0 x86/sgx: Warn explicitly if X86_FEATURE_SGX_LC is not enabled
The kernel requires X86_FEATURE_SGX_LC to be able to create SGX enclaves,
not just X86_FEATURE_SGX.

There is quite a number of hardware which has X86_FEATURE_SGX but not
X86_FEATURE_SGX_LC. A kernel running on such hardware does not create
the /dev/sgx_enclave file and does so silently.

Explicitly warn if X86_FEATURE_SGX_LC is not enabled to properly notify
users that the kernel disabled the SGX driver.

The X86_FEATURE_SGX_LC, a.k.a. SGX Launch Control, is a CPU feature
that enables LE (Launch Enclave) hash MSRs to be writable (with
additional opt-in required in the 'feature control' MSR) when running
enclaves, i.e. using a custom root key rather than the Intel proprietary
key for enclave signing.

I've hit this issue myself and have spent some time researching where
my /dev/sgx_enclave file went on SGX-enabled hardware.

Related links:

  https://github.com/intel/linux-sgx/issues/837
  https://patchwork.kernel.org/project/platform-driver-x86/patch/20180827185507.17087-3-jarkko.sakkinen@linux.intel.com/

[ mingo: Made the error message a bit more verbose, and added other cases
         where the kernel fails to create the /dev/sgx_enclave device node. ]

Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250309172215.21777-2-vdronov@redhat.com
2025-03-10 12:29:18 +01:00
Sebastian Andrzej Siewior
14daa3bca2 x86: Use RCU in all users of __module_address().
__module_address() can be invoked within a RCU section, there is no
requirement to have preemption disabled.

Replace the preempt_disable() section around __module_address() with
RCU.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250108090457.512198-23-bigeasy@linutronix.de
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-03-10 11:54:45 +01:00
Borislav Petkov (AMD)
058a6bec37 x86/microcode/AMD: Add some forgotten models to the SHA check
Add some more forgotten models to the SHA check.

Fixes: 50cef76d5c ("x86/microcode/AMD: Load only SHA256-checksummed patches")
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/r/20250307220256.11816-1-bp@kernel.org
2025-03-08 20:09:37 +01:00
Ingo Molnar
14296d0e85 Merge branch 'linus' into x86/urgent, to pick up dependent patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-08 20:09:27 +01:00
Ingo Molnar
f23ecef20a Merge branch 'locking/urgent' into locking/core, to pick up locking fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-08 00:54:06 +01:00
Linus Torvalds
042751d353 Miscellaneous x86 fixes:
- Fix CPUID leaf 0x2 parsing bugs
  - Sanitize very early boot parameters to avoid crash
  - Fix size overflows in the SGX code
  - Make CALL_NOSPEC use consistent
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmfK46YRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hrTA/9EKBx/Uu+WKaVCfX/+hrdHfiMhwCFFOrT
 Aa5QEcsEcFP2QDo3M8mXj9DzNJda8L5ilBWqv+E08zYexvv5/wQ2Q7bEQthD+4Su
 hFbbTjlsF5r6D2sfzT03RuDbvTO3lv7uD/dYoE9e82GKKhvGt2mdcyQ6ZIt1Iz8R
 gIcqGBv3KMXHlx91+VeRHZmI7ZHL1lqz8SyXvnY5jCI5x44QNdlBAMSbxwkw9YzJ
 1w59Dfxz9UDHJ0vzj7OpZFL/JjABKXEIeTXHfSevd0f3V05T3hTlZUPEKvqNpHlP
 dnAHbH5HoqQpBWdt4CYmJcEhdz9fM5h0s7JBMN+0k+ORZ7J1wJ+tM35cm3uWpd0/
 zpDzsFU33AdVbt9ryY62kZk5Lr9Pvas5GRjhB8ZkcSjFPu03xKvZ0sFs02H4zsOc
 k5Nd8PHKRbBhoJhtBtSRMuhADKKEIA490TUz5NVCJYqVnopyotYEX74GlQhFcXf6
 qXE8wcIfpC+bYbzPEVWe0+MhqQrcuN85D+KaIUIR9EITkAQ57VEvfwRv0FB3j7zS
 //0sM1bsNJB9dbGgkL2/PqihwBtXGf8MzyDvvQPLK/DiyD32B4zgtQHYScgQZqH+
 Bze+GR33T2eifkEkBEIOjfIVc8t4iTuoni9SjLIcTTNo8kedCKkGRCEHgqWZO8ko
 a53vzAMQSv8=
 =ziXk
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2025-03-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

 - Fix CPUID leaf 0x2 parsing bugs

 - Sanitize very early boot parameters to avoid crash

 - Fix size overflows in the SGX code

 - Make CALL_NOSPEC use consistent

* tag 'x86-urgent-2025-03-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Sanitize boot params before parsing command line
  x86/sgx: Fix size overflows in sgx_encl_create()
  x86/cpu: Properly parse CPUID leaf 0x2 TLB descriptor 0x63
  x86/cpu: Validate CPUID leaf 0x2 EDX output
  x86/cacheinfo: Validate CPUID leaf 0x2 EDX output
  x86/speculation: Add a conditional CS prefix to CALL_NOSPEC
  x86/speculation: Simplify and make CALL_NOSPEC consistent
2025-03-07 10:05:32 -10:00
Andrew Cooper
14cb5d8306 x86/amd_nb: Use rdmsr_safe() in amd_get_mmconfig_range()
Xen doesn't offer MSR_FAM10H_MMIO_CONF_BASE to all guests.  This results
in the following warning:

  unchecked MSR access error: RDMSR from 0xc0010058 at rIP: 0xffffffff8101d19f (xen_do_read_msr+0x7f/0xa0)
  Call Trace:
   xen_read_msr+0x1e/0x30
   amd_get_mmconfig_range+0x2b/0x80
   quirk_amd_mmconfig_area+0x28/0x100
   pnp_fixup_device+0x39/0x50
   __pnp_add_device+0xf/0x150
   pnp_add_device+0x3d/0x100
   pnpacpi_add_device_handler+0x1f9/0x280
   acpi_ns_get_device_callback+0x104/0x1c0
   acpi_ns_walk_namespace+0x1d0/0x260
   acpi_get_devices+0x8a/0xb0
   pnpacpi_init+0x50/0x80
   do_one_initcall+0x46/0x2e0
   kernel_init_freeable+0x1da/0x2f0
   kernel_init+0x16/0x1b0
   ret_from_fork+0x30/0x50
   ret_from_fork_asm+0x1b/0x30

based on quirks for a "PNP0c01" device.  Treating MMCFG as disabled is the
right course of action, so no change is needed there.

This was most likely exposed by fixing the Xen MSR accessors to not be
silently-safe.

Fixes: 3fac3734c4 ("xen/pv: support selecting safe/unsafe msr accesses")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250307002846.3026685-1-andrew.cooper3@citrix.com
2025-03-07 13:28:31 +01:00
Maksim Davydov
c929d08df8 x86/split_lock: Fix the delayed detection logic
If the warning mode with disabled mitigation mode is used, then on each
CPU where the split lock occurred detection will be disabled in order to
make progress and delayed work will be scheduled, which then will enable
detection back.

Now it turns out that all CPUs use one global delayed work structure.
This leads to the fact that if a split lock occurs on several CPUs
at the same time (within 2 jiffies), only one CPU will schedule delayed
work, but the rest will not.

The return value of schedule_delayed_work_on() would have shown this,
but it is not checked in the code.

A diagram that can help to understand the bug reproduction:

 - sld_update_msr() enables/disables SLD on both CPUs on the same core

 - schedule_delayed_work_on() internally checks WORK_STRUCT_PENDING_BIT.
   If a work has the 'pending' status, then schedule_delayed_work_on()
   will return an error code and, most importantly, the work will not
   be placed in the workqueue.

Let's say we have a multicore system on which split_lock_mitigate=0 and
a multithreaded application is running that calls splitlock in multiple
threads. Due to the fact that sld_update_msr() affects the entire core
(both CPUs), we will consider 2 CPUs from different cores. Let the 2
threads of this application schedule to CPU0 (core 0) and to CPU 2
(core 1), then:

|                                 ||                                   |
|             CPU 0 (core 0)      ||          CPU 2 (core 1)           |
|_________________________________||___________________________________|
|                                 ||                                   |
| 1) SPLIT LOCK occured           ||                                   |
|                                 ||                                   |
| 2) split_lock_warn()            ||                                   |
|                                 ||                                   |
| 3) sysctl_sld_mitigate == 0     ||                                   |
|    (work = &sl_reenable)        ||                                   |
|                                 ||                                   |
| 4) schedule_delayed_work_on()   ||                                   |
|    (reenable will be called     ||                                   |
|     after 2 jiffies on CPU 0)   ||                                   |
|                                 ||                                   |
| 5) disable SLD for core 0       ||                                   |
|                                 ||                                   |
|    -------------------------    ||                                   |
|                                 ||                                   |
|                                 || 6) SPLIT LOCK occured             |
|                                 ||                                   |
|                                 || 7) split_lock_warn()              |
|                                 ||                                   |
|                                 || 8) sysctl_sld_mitigate == 0       |
|                                 ||    (work = &sl_reenable,          |
|                                 ||     the same address as in 3) )   |
|                                 ||                                   |
|            2 jiffies            || 9) schedule_delayed_work_on()     |
|                                 ||    fials because the work is in   |
|                                 ||    the pending state since 4).    |
|                                 ||    The work wasn't placed to the  |
|                                 ||    workqueue. reenable won't be   |
|                                 ||    called on CPU 2                |
|                                 ||                                   |
|                                 || 10) disable SLD for core 0        |
|                                 ||                                   |
|                                 ||     From now on SLD will          |
|                                 ||     never be reenabled on core 1  |
|                                 ||                                   |
|    -------------------------    ||                                   |
|                                 ||                                   |
|    11) enable SLD for core 0 by ||                                   |
|        __split_lock_reenable    ||                                   |
|                                 ||                                   |

If the application threads can be scheduled to all processor cores,
then over time there will be only one core left, on which SLD will be
enabled and split lock will be able to be detected; and on all other
cores SLD will be disabled all the time.

Most likely, this bug has not been noticed for so long because
sysctl_sld_mitigate default value is 1, and in this case a semaphore
is used that does not allow 2 different cores to have SLD disabled at
the same time, that is, strictly only one work is placed in the
workqueue.

In order to fix the warning mode with disabled mitigation mode,
delayed work has to be per-CPU. Implement it.

Fixes: 727209376f ("x86/split_lock: Add sysctl to control the misery mode")
Signed-off-by: Maksim Davydov <davydov-max@yandex-team.ru>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250115131704.132609-1-davydov-max@yandex-team.ru
2025-03-07 13:01:02 +01:00
Ingo Molnar
82354fce16 Merge branch 'sched/urgent' into sched/core, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-06 22:26:36 +01:00
Rafael J. Wysocki
f96d92fcbb amd-pstate content for 6.15 (3/6/25)
A lot of code optimization to avoid cases where call paths will end up calling
 the same writes multiple times and needlessly caching variables.  To accomplish
 this some of the writes are now made into an atomically written "perf" variable.
 Locking has been overhauled to ensure it only applies to the necessary functions.
 Tracing has been adjusted to ensure trace events only are used right before
 writing out to the hardware.
 
 NOTE: This is a redo of amd-pstate-v6.15-2025-03-03 with a fixed Fixes tag.
 -----BEGIN PGP SIGNATURE-----
 
 iQJOBAABCgA4FiEECwtuSU6dXvs5GA2aLRkspiR3AnYFAmfJ8w8aHG1hcmlvLmxp
 bW9uY2llbGxvQGFtZC5jb20ACgkQLRkspiR3AnaugQ/9GvheK3/lWweGF3f4ixdl
 xS0XTj3DYQBxDvP2H35QRBHaUQQADJnJLjpl1HHy0x9Hem6m3oLZH3QpqKDLb1xX
 QWzI2JJZ9+KPZRKRwlBnzENcHtBraKX4M/+vvhTo6LSbLfSz47cdZhX6r+9siwPR
 ofWfusfBlEVPPLaul2Q+oOaQzvwy9Or2q2y3xICzo+Vo3QilSk3pekN8bd2CVd9y
 L1t1sAeN1StiOLBRbdbh8B2oyJfCCmSI85P8+sEKKR58G1q42Dv4NRl7fgzuJAfO
 4ATgX0RjNZcI9yLPmLr5PFmHJYvpHr29MAWKAo5a8VwMu1cBTv0FFkEboRvUgiXl
 3+hjQJGkyCZJTxIWhnvKl6Ci00N5apmwc1iMeoa1A1xvvTVkXLmn8RrYIJm2xMCH
 27SPJAQx+7e2cRcgjb3orz2Ymj+I9WoGy68/wdmnWM+ZxF1QaKguhM0lrIKFbu7K
 6CzVnndVgtYt9kFi/SQ4m1e8zr6RnvPWTxqQxTCVQZBNahe3b7DgnrrNrB/NBU86
 QotiTBIxQed6oqHAtL2mFRb9SdTGh9PE5xGcR94qlEGqVtu9KZIFqCDG1OgZKkR5
 9yLLai7sUGfMCLuEsdM5hXvf4fOHaGSQ0Cxx2qYWC3DDnS651fMNKh/taBo8RmS3
 r18loH2dKvBvOi2beCA5sMk=
 =VkdW
 -----END PGP SIGNATURE-----

Merge tag 'amd-pstate-v6.15-2025-03-06' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux

Merge amd-pstate updates for 6.15 (3/6/25) from Mario Limonciello:

"A lot of code optimization to avoid cases where call paths will end up
 calling the same writes multiple times and needlessly caching variables.
 To accomplish this some of the writes are now made into an atomically
 written "perf" variable. Locking has been overhauled to ensure it only
 applies to the necessary functions. Tracing has been adjusted to ensure
 trace events only are used right before writing out to the hardware."

NOTE: This is a redo of amd-pstate-v6.15-2025-03-03 with a fixed Fixes tag.

* tag 'amd-pstate-v6.15-2025-03-06' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux: (29 commits)
  cpufreq/amd-pstate: Drop actions in amd_pstate_epp_cpu_offline()
  cpufreq/amd-pstate: Stop caching EPP
  cpufreq/amd-pstate: Rework CPPC enabling
  cpufreq/amd-pstate: Drop debug statements for policy setting
  cpufreq/amd-pstate: Update cppc_req_cached for shared mem EPP writes
  cpufreq/amd-pstate: Move all EPP tracing into *_update_perf and *_set_epp functions
  cpufreq/amd-pstate: Cache CPPC request in shared mem case too
  cpufreq/amd-pstate: Replace all AMD_CPPC_* macros with masks
  cpufreq/amd-pstate-ut: Adjust variable scope
  cpufreq/amd-pstate-ut: Run on all of the correct CPUs
  cpufreq/amd-pstate-ut: Drop SUCCESS and FAIL enums
  cpufreq/amd-pstate-ut: Allow lowest nonlinear and lowest to be the same
  cpufreq/amd-pstate-ut: Use _free macro to free put policy
  cpufreq/amd-pstate: Drop `cppc_cap1_cached`
  cpufreq/amd-pstate: Overhaul locking
  cpufreq/amd-pstate: Move perf values into a union
  cpufreq/amd-pstate: Drop min and max cached frequencies
  cpufreq/amd-pstate: Show a warning when a CPU fails to setup
  cpufreq/amd-pstate: Invalidate cppc_req_cached during suspend
  cpufreq/amd-pstate: Fix the clamping of perf values
  ...
2025-03-06 21:28:48 +01:00
Mario Limonciello
b4cc466b97 cpufreq/amd-pstate: Replace all AMD_CPPC_* macros with masks
Bitfield masks are easier to follow and less error prone.

Reviewed-by: Dhananjay Ugwekar <dhananjay.ugwekar@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2025-03-06 13:01:25 -06:00
Eric Biggers
d021985504 x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs
Background:
===========

Currently kernel-mode FPU is not always usable in softirq context on
x86, since softirqs can nest inside a kernel-mode FPU section in task
context, and nested use of kernel-mode FPU is not supported.

Therefore, x86 SIMD-optimized code that can be called in softirq context
has to sometimes fall back to non-SIMD code.  There are two options for
the fallback, both of which are pretty terrible:

  (a) Use a scalar fallback.  This can be 10-100x slower than vectorized
      code because it cannot use specialized instructions like AES, SHA,
      or carryless multiplication.

  (b) Execute the request asynchronously using a kworker.  In other
      words, use the "crypto SIMD helper" in crypto/simd.c.

Currently most of the x86 en/decryption code (skcipher and aead
algorithms) uses option (b), since this avoids the slow scalar fallback
and it is easier to wire up.  But option (b) is still really bad for its
own reasons:

  - Punting the request to a kworker is bad for performance too.

  - It forces the algorithm to be marked as asynchronous
    (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
    users who request a synchronous algorithm.  That's another huge
    performance problem, which is especially unfortunate for users who
    don't even do en/decryption in softirq context.

  - It makes all en/decryption operations take a detour through
    crypto/simd.c.  That involves additional checks and an additional
    indirect call, which slow down en/decryption for *everyone*.

Fortunately, the skcipher and aead APIs are only usable in task and
softirq context in the first place.  Thus, if kernel-mode FPU were to be
reliably usable in softirq context, no fallback would be needed.
Indeed, other architectures such as arm, arm64, and riscv have already
done this.

Changes implemented:
====================

Therefore, this patch updates x86 accordingly to reliably support
kernel-mode FPU in softirqs.

This is done by just disabling softirq processing in kernel-mode FPU
sections (when hardirqs are not already disabled), as that prevents the
nesting that was problematic.

This will delay some softirqs slightly, but only ones that would have
otherwise been nested inside a task context kernel-mode FPU section.
Any such softirqs would have taken the slow fallback path before if they
tried to do any en/decryption.  Now these softirqs will just run at the
end of the task context kernel-mode FPU section (since local_bh_enable()
runs pending softirqs) and will no longer take the slow fallback path.

Alternatives considered:
========================

- Make kernel-mode FPU sections fully preemptible.  This would require
  growing task_struct by another struct fpstate which is more than 2K.

- Make softirqs save/restore the kernel-mode FPU state to a per-CPU
  struct fpstate when nested use is detected.  Somewhat interesting, but
  seems unnecessary when a simpler solution exists.

Performance results:
====================

I did some benchmarks with AES-XTS encryption of 16-byte messages (which is
unrealistically small, but this makes it easier to see the overhead of
kernel-mode FPU...).  The baseline was 384 MB/s.  Removing the use of
crypto/simd.c, which this work makes possible, increases it to 487 MB/s,
a +27% improvement in throughput.

CPU was AMD Ryzen 9 9950X (Zen 5).  No debugging options were enabled.

[ mingo: Prettified the changelog and added performance results. ]

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250304204954.3901-1-ebiggers@kernel.org
2025-03-06 12:44:09 +01:00
Jiri Olsa
fa6192adc3 uprobes/x86: Harden uretprobe syscall trampoline check
Jann reported a possible issue when trampoline_check_ip returns
address near the bottom of the address space that is allowed to
call into the syscall if uretprobes are not set up:

   https://lore.kernel.org/bpf/202502081235.5A6F352985@keescook/T/#m9d416df341b8fbc11737dacbcd29f0054413cbbf

Though the mmap minimum address restrictions will typically prevent
creating mappings there, let's make sure uretprobe syscall checks
for that.

Fixes: ff474a78ce ("uprobe: Add uretprobe syscall to speed up return probe")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250212220433.3624297-1-jolsa@kernel.org
2025-03-06 12:22:45 +01:00
Jarkko Sakkinen
0d3e0dfd68 x86/sgx: Fix size overflows in sgx_encl_create()
The total size calculated for EPC can overflow u64 given the added up page
for SECS.  Further, the total size calculated for shmem can overflow even
when the EPC size stays within limits of u64, given that it adds the extra
space for 128 byte PCMD structures (one for each page).

Address this by pre-evaluating the micro-architectural requirement of
SGX: the address space size must be power of two. This is eventually
checked up by ECREATE but the pre-check has the additional benefit of
making sure that there is some space for additional data.

Fixes: 888d249117 ("x86/sgx: Add SGX_IOC_ENCLAVE_CREATE")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250305050006.43896-1-jarkko@kernel.org

Closes: https://lore.kernel.org/linux-sgx/c87e01a0-e7dd-4749-a348-0980d3444f04@stanley.mountain/
2025-03-05 09:51:41 +01:00
Linus Torvalds
bb2281fb05 - Load only sha256-signed microcode patch blobs
- Other good cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAme+3r8ACgkQEsHwGGHe
 VUq7thAAvpWb51ep3pegXtUcTF6hmB5rKfFmWSd5ntzcBudhxZyFMAIkoNGtjA/m
 lwamrDowhcGMLQ16aWzruYLeXJao6ROTJHq0hiTdxdBqyemW7Z1exPc0s4OolHCD
 1hL3pP7Z3ubBEwm+2jI1M0KWJ0ZKv54SLQEaTxIStFqvVNq4/mI+5zpsFM+iElZ+
 bc3D5ISyCwdEGvVHLKAD0W6J1cWNFygLaAM74JvkGkY3ByCR7HnUJa9lYx/b8PEk
 uWeDu5IWY+gwKDor9NgW1mI66x3jiGExVcwhi30r1V/jX8v+H0+zO0s84hF8YFra
 VmAoa2YkmOmO1zqrs/S2gGb+kX2nKlck6jRALVFFq8eXOP1BsmyvGk12XbVCeBYQ
 kl6M89AS4nSF62k7PCJdzvkuRpVx1DSCbaTUB+fnzxP9BcOCKY1eobIJs/rBwDSx
 0AcD178j2eb5uP8nACnBUshJwHYmssHXX2sXH/NbCvSqulYfTrnYs/EYQuf7M4g+
 yGkBsH/T24RswobT9VzdO4EyzGaL1SPn6pQ5yKnDBSAAZNMfjaNNYFzBpSttzviR
 8j4PBJ6u6mCGohEsMVXAmCYs2EvfSr+lxSFHMRV9Ym09SFnI2muhG5Rd29ast1Mz
 H0YSRs9pRMHLDHorMVU6+AgFq+8XIt0f13zkIyM/S3b2ZLRVgiQ=
 =2mhO
 -----END PGP SIGNATURE-----

Merge tag 'x86_microcode_for_v6.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull AMD microcode loading fixes from Borislav Petkov:

 - Load only sha256-signed microcode patch blobs

 - Other good cleanups

* tag 'x86_microcode_for_v6.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode/AMD: Load only SHA256-checksummed patches
  x86/microcode/AMD: Add get_patch_level()
  x86/microcode/AMD: Get rid of the _load_microcode_amd() forward declaration
  x86/microcode/AMD: Merge early_apply_microcode() into its single callsite
  x86/microcode/AMD: Remove unused save_microcode_in_initrd_amd() declarations
  x86/microcode/AMD: Remove ugly linebreak in __verify_patch_section() signature
2025-03-04 19:05:53 -10:00
Uros Bizjak
6d536cad0d x86/percpu: Fix __per_cpu_hot_end marker
Make __per_cpu_hot_end marker point to the end of the percpu cache
hot data, not to the end of the percpu cache hot section.

This fixes CONFIG_MPENTIUM4 case where X86_L1_CACHE_SHIFT
is set to 7 (128 bytes).

Also update assert message accordingly.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Link: https://lore.kernel.org/r/20250304173455.89361-1-ubizjak@gmail.com

Closes: https://lore.kernel.org/lkml/Z8a-NVJs-pm5W-mG@gmail.com/
2025-03-04 20:30:33 +01:00
Brian Gerst
06aa03056f x86/smp: Move this_cpu_off to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-12-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Brian Gerst
f3856cd343 x86/stackprotector: Move __stack_chk_guard to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-11-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Brian Gerst
a1e4cc0155 x86/percpu: Move current_task to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-10-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Brian Gerst
385f72c83e x86/percpu: Move top_of_stack to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-9-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Brian Gerst
c6a0918072 x86/irq: Move irq stacks to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-8-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Brian Gerst
c8f1ac2bd7 x86/softirq: Move softirq_pending to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-7-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Brian Gerst
839be1619f x86/retbleed: Move call depth to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-6-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Brian Gerst
01c7bc5198 x86/smp: Move cpu number to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-5-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Brian Gerst
46e8fff6d4 x86/preempt: Move preempt count to percpu hot section
No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-4-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Brian Gerst
972f9cdff9 x86/percpu: Move pcpu_hot to percpu hot section
Also change the alignment of the percpu hot section:

 -       PERCPU_SECTION(INTERNODE_CACHE_BYTES)
 +       PERCPU_SECTION(L1_CACHE_BYTES)

As vSMP will muck with INTERNODE_CACHE_BYTES that invalidates the
too-large-section assert we do:

  ASSERT(__per_cpu_hot_end - __per_cpu_hot_start <= 64, "percpu cache hot section too large")

[ mingo: Added INTERNODE_CACHE_BYTES fix & explanation. ]

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-3-brgerst@gmail.com
2025-03-04 20:30:33 +01:00
Ingo Molnar
71c2ff150f Merge branch 'x86/asm' into x86/core, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-04 20:29:35 +01:00
Uros Bizjak
c8b584fe82 x86/irq/32: Change some static functions to bool
The return values of these functions is 0/1, but they use an int
type instead of bool:

  check_stack_overflow()
  execute_on_irq_stack()

Change the type of these function to bool and adjust their return
values and affected helper variables.

[ mingo: Rewrote the changelog ]

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303155446.112769-5-ubizjak@gmail.com
2025-03-04 20:28:58 +01:00
Uros Bizjak
d4432fb5b8 x86/irq/32: Use current_stack_pointer to avoid asm() in check_stack_overflow()
Make code more readable by using the 'current_stack_pointer' global variable.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303155446.112769-4-ubizjak@gmail.com
2025-03-04 20:28:58 +01:00
Uros Bizjak
76f7113781 x86/irq/32: Add missing clobber to inline asm
i386 ABI declares %edx as a call-clobbered register.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303155446.112769-2-ubizjak@gmail.com
2025-03-04 20:12:40 +01:00
Uros Bizjak
0ec914707c x86/irq/32: Use named operands in inline asm
Also use inout "+" constraint modifier where appropriate.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303155446.112769-1-ubizjak@gmail.com
2025-03-04 20:12:40 +01:00
Ingo Molnar
cfdaa618de Merge branch 'x86/cpu' into x86/asm, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-04 11:19:21 +01:00
Ahmed S. Darwish
6309ff98f0 x86/cacheinfo: Remove unnecessary headers and reorder the rest
Remove the headers at cacheinfo.c that are no longer required.

Alphabetically reorder what remains since more headers will be included
in further commits.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-13-darwi@linutronix.de
2025-03-04 11:17:33 +01:00
Thomas Gleixner
b3a756bd72 x86/cacheinfo: Remove the P4 trace leftovers for real
Commit 851026a2bf ("x86/cacheinfo: Remove unused trace variable") removed
the switch case for LVL_TRACE but did not get rid of the surrounding gunk.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-12-darwi@linutronix.de
2025-03-04 11:17:33 +01:00
Thomas Gleixner
1f61dfdf16 x86/cpu: Remove unused TLB strings
Commit:

  e0ba94f14f ("x86/tlb_info: get last level TLB entry number of CPU")

added the TLB table for parsing CPUID(0x4), including strings
describing them. The string entry in the table was never used.

Convert them to comments.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-10-darwi@linutronix.de
2025-03-04 11:17:33 +01:00
Thomas Gleixner
535d9a8270 x86/cpu: Get rid of the smp_store_cpu_info() indirection
smp_store_cpu_info() is just a wrapper around identify_secondary_cpu()
without further value.

Move the extra bits from smp_store_cpu_info() into identify_secondary_cpu()
and remove the wrapper.

[ darwi: Make it compile and fix up the xen/smp_pv.c instance ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-9-darwi@linutronix.de
2025-03-04 11:17:33 +01:00
Ahmed S. Darwish
8b7e54b542 x86/cpu: Simplify TLB entry count storage
Commit:

  e0ba94f14f ("x86/tlb_info: get last level TLB entry number of CPU")

introduced u16 "info" arrays for each TLB type.

Since 2012 and each array stores just one type of information: the
number of TLB entries for its respective TLB type.

Replace such arrays with simple variables.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-8-darwi@linutronix.de
2025-03-04 11:17:33 +01:00
Ahmed S. Darwish
cb5f4c76b2 x86/cpu: Use max() for CPUID leaf 0x2 TLB descriptors parsing
The conditional statement "if (x < y) { x = y; }" appears 22 times at
the Intel leaf 0x2 descriptors parsing logic.

Replace each of such instances with a max() expression to simplify
the code.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-7-darwi@linutronix.de
2025-03-04 11:17:33 +01:00
Ahmed S. Darwish
dec7fdc0b7 x86/cpu: Remove unnecessary headers and reorder the rest
Remove the headers at intel.c that are no longer required.

Alphabetically reorder what remains since more headers will be included
in further commits.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-6-darwi@linutronix.de
2025-03-04 11:17:33 +01:00
Ingo Molnar
1b4c36f9b1 Merge branch 'x86/urgent' into x86/cpu, to pick up dependent commits
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-04 11:15:26 +01:00
Brendan Jackman
d0ba9bcf00 x86/cpu: Log CPU flag cmdline hacks more verbosely
Since using these options is very dangerous, make details as visible as
possible:

- Instead of a single message for each of the cmdline options, print a
  separate pr_warn() for each individual flag.

- Say explicitly whether the flag is a "feature" or a "bug".

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250303-setcpuid-taint-louder-v1-3-8d255032cb4c@google.com
2025-03-04 11:15:12 +01:00
Brendan Jackman
681955761b x86/cpu: Warn louder about the {set,clear}cpuid boot parameters
Commit 814165e9fd ("x86/cpu: Add the 'setcpuid=' boot parameter")
recently expanded the user's ability to break their system horribly by
overriding effective CPU flags. This was reflected with updates to the
documentation to try and make people aware that this is dangerous.

To further reduce the risk of users mistaking this for a "real feature",
and try to help them figure out why their kernel is tainted if they do
use it:

- Upgrade the existing printk to pr_warn, to help ensure kernel logs
  reflect what changes are in effect.

- Print an extra warning that tries to be as dramatic as possible, while
  also highlighting the fact that it tainted the kernel.

Suggested-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250303-setcpuid-taint-louder-v1-2-8d255032cb4c@google.com
2025-03-04 11:15:03 +01:00
Brendan Jackman
27c3b452c1 x86/cpu: Remove unnecessary macro indirection related to CPU feature names
These macros used to abstract over CONFIG_X86_FEATURE_NAMES, but that
was removed in:

  7583e8fbdc ("x86/cpu: Remove X86_FEATURE_NAMES")

Now they are just an unnecessary indirection, remove them.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250303-setcpuid-taint-louder-v1-1-8d255032cb4c@google.com
2025-03-04 11:14:53 +01:00
Josh Poimboeuf
4e32645cd8 x86/smp: Fix mwait_play_dead() and acpi_processor_ffh_play_dead() noreturn behavior
Fix some related issues (done in a single patch to avoid introducing
intermediate bisect warnings):

  1) The SMP version of mwait_play_dead() doesn't return, but its
     !SMP counterpart does.  Make its calling behavior consistent by
     resolving the !SMP version to a BUG().  It should never be called
     anyway, this just enforces that at runtime and enables its callers
     to be marked as __noreturn.

  2) While the SMP definition of mwait_play_dead() is annotated as
     __noreturn, the declaration isn't.  Nor is it listed in
     tools/objtool/noreturns.h.  Fix that.

  3) Similar to #1, the SMP version of acpi_processor_ffh_play_dead()
     doesn't return but its !SMP counterpart does.  Make the !SMP
     version a BUG().  It should never be called.

  4) acpi_processor_ffh_play_dead() doesn't return, but is lacking any
     __noreturn annotations.  Fix that.

This fixes the following objtool warnings:

  vmlinux.o: warning: objtool: acpi_processor_ffh_play_dead+0x67: mwait_play_dead() is missing a __noreturn annotation
  vmlinux.o: warning: objtool: acpi_idle_play_dead+0x3c: acpi_processor_ffh_play_dead() is missing a __noreturn annotation

Fixes: a7dd183f0b ("x86/smp: Allow calling mwait_play_dead with an arbitrary hint")
Fixes: 541ddf31e3 ("ACPI/processor_idle: Add FFH state handling")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/e885c6fa9e96a61471b33e48c2162d28b15b14c5.1740962711.git.jpoimboe@kernel.org
2025-03-04 11:14:25 +01:00
Ahmed S. Darwish
f6bdaab79e x86/cpu: Properly parse CPUID leaf 0x2 TLB descriptor 0x63
CPUID leaf 0x2's one-byte TLB descriptors report the number of entries
for specific TLB types, among other properties.

Typically, each emitted descriptor implies the same number of entries
for its respective TLB type(s).  An emitted 0x63 descriptor is an
exception: it implies 4 data TLB entries for 1GB pages and 32 data TLB
entries for 2MB or 4MB pages.

For the TLB descriptors parsing code, the entry count for 1GB pages is
encoded at the intel_tlb_table[] mapping, but the 2MB/4MB entry count is
totally ignored.

Update leaf 0x2's parsing logic 0x2 to account for 32 data TLB entries
for 2MB/4MB pages implied by the 0x63 descriptor.

Fixes: e0ba94f14f ("x86/tlb_info: get last level TLB entry number of CPU")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304085152.51092-4-darwi@linutronix.de
2025-03-04 09:59:14 +01:00
Ahmed S. Darwish
1881148215 x86/cpu: Validate CPUID leaf 0x2 EDX output
CPUID leaf 0x2 emits one-byte descriptors in its four output registers
EAX, EBX, ECX, and EDX.  For these descriptors to be valid, the most
significant bit (MSB) of each register must be clear.

Leaf 0x2 parsing at intel.c only validated the MSBs of EAX, EBX, and
ECX, but left EDX unchecked.

Validate EDX's most-significant bit as well.

Fixes: e0ba94f14f ("x86/tlb_info: get last level TLB entry number of CPU")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304085152.51092-3-darwi@linutronix.de
2025-03-04 09:59:14 +01:00
Ahmed S. Darwish
8177c6bedb x86/cacheinfo: Validate CPUID leaf 0x2 EDX output
CPUID leaf 0x2 emits one-byte descriptors in its four output registers
EAX, EBX, ECX, and EDX.  For these descriptors to be valid, the most
significant bit (MSB) of each register must be clear.

The historical Git commit:

  019361a20f016 ("- pre6: Intel: start to add Pentium IV specific stuff (128-byte cacheline etc)...")

introduced leaf 0x2 output parsing.  It only validated the MSBs of EAX,
EBX, and ECX, but left EDX unchecked.

Validate EDX's most-significant bit.

Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304085152.51092-2-darwi@linutronix.de
2025-03-04 09:59:14 +01:00
Ingo Molnar
1fff9f8730 Linux 6.14-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmfEtgQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGUI8H/0TJm0ia8TOseHDT
 bV0alw4qlNdq7Thi7M5HCsInzyhKImdHDBd/cY3HI2QRZS927WoPHUExDvEd6vUE
 sETrrBrl/iGJ5nLvXPKkCxXC43ZD3nCI/chBPWwspH2DA/Nih8LAmMkESGVFC7fa
 TUOKqY1FsAWMYUJ64hP4s9Dwi5XECps7/Di46ypgtr7sVA15jpfF3ePi2mfR73Bm
 hSfF7E5Xa3E22IBE2NPxvO4fHiYJWbNJk8Vv2ewfHroE/zKsJ/zCMk9mOtFII0P/
 TODkBSImFqUx+cDZVc0bJAy8rwA2lNXo3LU28N0Ca4EAoqyyIIdTPQ2wYA82Z2Y2
 hE/BeN0=
 =a9Md
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc5' into x86/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-03 21:05:45 +01:00
Brian Gerst
604ea3e90b x86/smp/32: Remove safe_smp_processor_id()
The safe_smp_processor_id() function was originally implemented in:

  dc2bc768a0 ("stack overflow safe kdump: safe_smp_processor_id()")

to mitigate the CPU number corruption on a stack overflow.  At the time,
x86-32 stored the CPU number in thread_struct, which was located at the
bottom of the task stack and thus vulnerable to an overflow.

The CPU number is now located in percpu memory, so this workaround
is no longer needed.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250303170115.2176553-1-brgerst@gmail.com
2025-03-03 20:30:09 +01:00
Brian Gerst
399fd7a264 x86/asm: Merge KSTK_ESP() implementations
Commit:

  263042e463 ("Save user RSP in pt_regs->sp on SYSCALL64 fastpath")

simplified the 64-bit implementation of KSTK_ESP() which is
now identical to 32-bit.  Merge them into a common definition.

No functional change.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250303183111.2245129-1-brgerst@gmail.com
2025-03-03 20:28:33 +01:00
Breno Leitao
98fdaeb296 x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2
Change the default value of spectre v2 in user mode to respect the
CONFIG_MITIGATION_SPECTRE_V2 config option.

Currently, user mode spectre v2 is set to auto
(SPECTRE_V2_USER_CMD_AUTO) by default, even if
CONFIG_MITIGATION_SPECTRE_V2 is disabled.

Set the spectre_v2 value to auto (SPECTRE_V2_USER_CMD_AUTO) if the
Spectre v2 config (CONFIG_MITIGATION_SPECTRE_V2) is enabled, otherwise
set the value to none (SPECTRE_V2_USER_CMD_NONE).

Important to say the command line argument "spectre_v2_user" overwrites
the default value in both cases.

When CONFIG_MITIGATION_SPECTRE_V2 is not set, users have the flexibility
to opt-in for specific mitigations independently. In this scenario,
setting spectre_v2= will not enable spectre_v2_user=, and command line
options spectre_v2_user and spectre_v2 are independent when
CONFIG_MITIGATION_SPECTRE_V2=n.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Kaplan <David.Kaplan@amd.com>
Link: https://lore.kernel.org/r/20241031-x86_bugs_last_v2-v2-2-b7ff1dab840e@debian.org
2025-03-03 12:48:41 +01:00
Breno Leitao
2a08b83271 x86/bugs: Use the cpu_smt_possible() helper instead of open-coded code
There is a helper function to check if SMT is available. Use this helper
instead of performing the check manually.

The helper function cpu_smt_possible() does exactly the same thing as
was being done manually inside spectre_v2_user_select_mitigation().
Specifically, it returns false if CONFIG_SMP is disabled, otherwise
it checks the cpu_smt_control global variable.

This change improves code consistency and reduces duplication.

No change in functionality intended.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: David Kaplan <David.Kaplan@amd.com>
Link: https://lore.kernel.org/r/20241031-x86_bugs_last_v2-v2-1-b7ff1dab840e@debian.org
2025-03-03 12:48:17 +01:00
Dr. David Alan Gilbert
3101900218 x86/paravirt: Remove unused paravirt_disable_iospace()
The last use of paravirt_disable_iospace() was removed in 2015 by
commit d1c29465b8 ("lguest: don't disable iospace.")

Remove it.

Note the comment above it about 'entry.S' is unrelated to this
but stayed when intervening code got deleted.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20250303004441.250451-1-linux@treblig.org
2025-03-03 11:19:52 +01:00
Peter Zijlstra
73e8079be9 x86/ibt: Make cfi_bhi a constant for FINEIBT_BHI=n
Robot yielded a .config that tripped:

  vmlinux.o: warning: objtool: do_jit+0x276: relocation to !ENDBR: .noinstr.text+0x6a60

This is the result of using __bhi_args[1] in unreachable code; make
sure the compiler is able to determine this is unreachable and trigger
DCE.

Closes: https://lore.kernel.org/oe-kbuild-all/202503030704.H9KFysNS-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250303094911.GL5880@noisy.programming.kicks-ass.net
2025-03-03 10:54:11 +01:00
David Kaplan
b8ce25df29 x86/bugs: Add AUTO mitigations for mds/taa/mmio/rfds
Add AUTO mitigations for mds/taa/mmio/rfds to create consistent vulnerability
handling.  These AUTO mitigations will be turned into the appropriate default
mitigations in the <vuln>_select_mitigation() functions.  Later, these will be
used with the new attack vector controls to help select appropriate
mitigations.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250108202515.385902-4-david.kaplan@amd.com
2025-02-28 12:40:21 +01:00
David Kaplan
2c93762ec4 x86/bugs: Relocate mds/taa/mmio/rfds defines
Move the mds, taa, mmio, and rfds mitigation enums earlier in the file to
prepare for restructuring of these mitigations as they are all inter-related.

No functional change.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250108202515.385902-3-david.kaplan@amd.com
2025-02-28 12:39:17 +01:00
David Kaplan
98c7a713db x86/bugs: Add X86_BUG_SPECTRE_V2_USER
All CPU vulnerabilities with command line options map to a single X86_BUG bit
except for Spectre V2 where both the spectre_v2 and spectre_v2_user command
line options are related to the same bug.

The spectre_v2 command line options mostly relate to user->kernel and
guest->host mitigations, while the spectre_v2_user command line options relate
to user->user or guest->guest protections.

Define a new X86_BUG bit for spectre_v2_user so each *_select_mitigation()
function in bugs.c is related to a unique X86_BUG bit.

No functional changes.

Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250108202515.385902-2-david.kaplan@amd.com
2025-02-28 12:34:30 +01:00
Brendan Jackman
ab68d2e365 x86/cpu: Enable modifying CPU bug flags with '{clear,set}puid='
Sometimes it can be very useful to run CPU vulnerability mitigations on
systems where they aren't known to mitigate any real-world
vulnerabilities. This can be handy for mundane reasons like debugging
HW-agnostic logic on whatever machine is to hand, but also for research
reasons: while some mitigations are focused on individual vulns and
uarches, others are fairly general, and it's strategically useful to
have an idea how they'd perform on systems where they aren't currently
needed.

As evidence for this being useful, a flag specifically for Retbleed was
added in:

  5c9a92dec3 ("x86/bugs: Add retbleed=force").

Since CPU bugs are tracked using the same basic mechanism as features,
and there are already parameters for manipulating them by hand, extend
that mechanism to support bug as well as capabilities.

With this patch and setcpuid=srso, a QEMU guest running on an Intel host
will boot with Safe-RET enabled.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-3-7dc71bce742a@google.com
2025-02-28 10:57:50 +01:00
Brendan Jackman
814165e9fd x86/cpu: Add the 'setcpuid=' boot parameter
In preparation for adding support to inject fake CPU bugs at boot-time,
add a general facility to force enablement of CPU flags.

The flag taints the kernel and the documentation attempts to be clear
that this is highly unsuitable for uses outside of kernel development
and platform experimentation.

The new arg is parsed just like clearcpuid, but instead of leading to
setup_clear_cpu_cap() it leads to setup_force_cpu_cap().

I've tested this by booting a nested QEMU guest on an Intel host, which
with setcpuid=svm will claim that it supports AMD virtualization.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-2-7dc71bce742a@google.com
2025-02-28 10:57:49 +01:00
Brendan Jackman
f034937f5a x86/cpu: Create helper function to parse the 'clearcpuid=' boot parameter
This is in preparation for a later commit that will reuse this code, to
make review convenient.

Factor out a helper function which does the full handling for this arg
including printing info to the console.

No functional change intended.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241220-force-cpu-bug-v2-1-7dc71bce742a@google.com
2025-02-28 10:57:49 +01:00
Max Grobecker
a4248ee16f x86/cpu: Don't clear X86_FEATURE_LAHF_LM flag in init_amd_k8() on AMD when running in a virtual machine
When running in a virtual machine, we might see the original hardware CPU
vendor string (i.e. "AuthenticAMD"), but a model and family ID set by the
hypervisor. In case we run on AMD hardware and the hypervisor sets a model
ID < 0x14, the LAHF cpu feature is eliminated from the the list of CPU
capabilities present to circumvent a bug with some BIOSes in conjunction with
AMD K8 processors.

Parsing the flags list from /proc/cpuinfo seems to be happening mostly in
bash scripts and prebuilt Docker containers, as it does not need to have
additionals tools present – even though more reliable ways like using "kcpuid",
which calls the CPUID instruction instead of parsing a list, should be preferred.
Scripts, that use /proc/cpuinfo to determine if the current CPU is
"compliant" with defined microarchitecture levels like x86-64-v2 will falsely
claim the CPU is incapable of modern CPU instructions when "lahf_lm" is missing
in that flags list.

This can prevent some docker containers from starting or build scripts to create
unoptimized binaries.

Admittably, this is more a small inconvenience than a severe bug in the kernel
and the shoddy scripts that rely on parsing /proc/cpuinfo
should be fixed instead.

This patch adds an additional check to see if we're running inside a
virtual machine (X86_FEATURE_HYPERVISOR is present), which, to my
understanding, can't be present on a real K8 processor as it was introduced
only with the later/other Athlon64 models.

Example output with the "lahf_lm" flag missing in the flags list
(should be shown between "hypervisor" and "abm"):

    $ cat /proc/cpuinfo
    processor       : 0
    vendor_id       : AuthenticAMD
    cpu family      : 15
    model           : 6
    model name      : Common KVM processor
    stepping        : 1
    microcode       : 0x1000065
    cpu MHz         : 2599.998
    cache size      : 512 KB
    physical id     : 0
    siblings        : 1
    core id         : 0
    cpu cores       : 1
    apicid          : 0
    initial apicid  : 0
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 13
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
                      cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp
                      lm rep_good nopl cpuid extd_apicid tsc_known_freq pni
                      pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt
                      tsc_deadline_timer aes xsave avx f16c hypervisor abm
                      3dnowprefetch vmmcall bmi1 avx2 bmi2 xsaveopt

... while kcpuid shows the feature to be present in the CPU:

    # kcpuid -d | grep lahf
         lahf_lm             - LAHF/SAHF available in 64-bit mode

[ mingo: Updated the comment a bit, incorporated Boris's review feedback. ]

Signed-off-by: Max Grobecker <max@grobecker.info>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
2025-02-28 10:42:28 +01:00
Zhang Kunbo
e008eeec78 x86/platform: Fix missing declaration of 'x86_apple_machine'
Fix this sparse warning:

  arch/x86/kernel/quirks.c:662:6: warning: symbol 'x86_apple_machine' was not declared. Should it be static?

Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241126015636.3463994-1-zhangkunbo@huawei.com
2025-02-27 22:52:37 +01:00
Zhang Kunbo
b8ffd97935 x86/irq: Fix missing declaration of 'io_apic_irqs'
Fix this sparse warning:

  arch/x86/kernel/i8259.c:57:15: warning: symbol 'io_apic_irqs' was not declared. Should it be static?

Signed-off-by: Zhang Kunbo <zhangkunbo@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20241126020511.3464664-1-zhangkunbo@huawei.com
2025-02-27 22:52:37 +01:00
Xin Li (Intel)
ad546940b5 x86/ia32: Leave NULL selector values 0~3 unchanged
The first GDT descriptor is reserved as 'NULL descriptor'.  As bits 0
and 1 of a segment selector, i.e., the RPL bits, are NOT used to index
GDT, selector values 0~3 all point to the NULL descriptor, thus values
0, 1, 2 and 3 are all valid NULL selector values.

When a NULL selector value is to be loaded into a segment register,
reload_segments() sets its RPL bits.  Later IRET zeros ES, FS, GS, and
DS segment registers if any of them is found to have any nonzero NULL
selector value.  The two operations offset each other to actually effect
a nop.

Besides, zeroing of RPL in NULL selector values is an information leak
in pre-FRED systems as userspace can spot any interrupt/exception by
loading a nonzero NULL selector, and waiting for it to become zero.
But there is nothing software can do to prevent it before FRED.

ERETU, the only legit instruction to return to userspace from kernel
under FRED, by design does NOT zero any segment register to avoid this
problem behavior.

As such, leave NULL selector values 0~3 unchanged and close the leak.

Do the same on 32-bit kernel as well.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241126184529.1607334-1-xin@zytor.com
2025-02-27 22:46:11 +01:00
Chang S. Bae
69a2fdf446 x86/fpu/xstate: Simplify print_xstate_features()
print_xstate_features() currently invokes print_xstate_feature() multiple
times on separate lines, which can be simplified in a loop.

print_xstate_feature() already checks the feature's enabled status and is
only called within print_xstate_features(). Inline print_xstate_feature()
and iterate over features in a loop to streamline the enabling message.

No functional changes.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250227184502.10288-2-chang.seok.bae@intel.com
2025-02-27 19:54:41 +01:00
Chang S. Bae
dc8aa31a7a x86/fpu: Refine and simplify the magic number check during signal return
Before restoring xstate from the user space buffer, the kernel performs
sanity checks on these magic numbers: magic1 in the software reserved
area, and magic2 at the end of XSAVE region.

The position of magic2 is calculated based on the xstate size derived
from the user space buffer. But, the in-kernel record is directly
available and reliable for this purpose.

This reliance on user space data is also inconsistent with the recent
fix in:

  d877550eaf ("x86/fpu: Stop relying on userspace for info to fault in xsave buffer")

Simply use fpstate->user_size, and then get rid of unnecessary
size-evaluation code.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211014500.3738-1-chang.seok.bae@intel.com
2025-02-27 19:38:06 +01:00
Kuan-Wei Chiu
9c94c14ca3 x86/bootflag: Replace open-coded parity calculation with parity8()
Refactor parity calculations to use the standard parity8() helper. This
change eliminates redundant implementations and improves code
efficiency.

[ ubizjak: Updated the patch to apply to the latest x86 tree. ]

Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250227125616.2253774-1-ubizjak@gmail.com
2025-02-27 14:00:30 +01:00
Pawan Gupta
db5157df14 x86/cpu: Remove get_this_hybrid_cpu_*()
Because calls to get_this_hybrid_cpu_type() and
get_this_hybrid_cpu_native_id() are not required now. cpu-type and
native-model-id are cached at boot in per-cpu struct cpuinfo_topology.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211-add-cpu-type-v5-4-2ae010f50370@linux.intel.com
2025-02-27 13:34:52 +01:00
Pawan Gupta
4a412c70af x86/cpu: Prefix hexadecimal values with 0x in cpu_debug_show()
The hex values in CPU debug interface are not prefixed with 0x. This may
cause misinterpretation of values. Fix it.

[ mingo: Restore previous vertical alignment of the output. ]

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20241211-add-cpu-type-v5-1-2ae010f50370@linux.intel.com
2025-02-27 13:26:53 +01:00
Arnd Bergmann
0abf508675 x86/smp: Drop 32-bit "bigsmp" machine support
The x86-32 kernel used to support multiple platforms with more than eight
logical CPUs, from the 1999-2003 timeframe: Sequent NUMA-Q, IBM Summit,
Unisys ES7000 and HP F8. Support for all except the latter was dropped
back in 2014, leaving only the F8 based DL740 and DL760 G2 machines in
this catery, with up to eight single-core Socket-603 Xeon-MP processors
with hyperthreading.

Like the already removed machines, the HP F8 servers at the time cost
upwards of $100k in typical configurations, but were quickly obsoleted
by their 64-bit Socket-604 cousins and the AMD Opteron.

Earlier servers with up to 8 Pentium Pro or Xeon processors remain
fully supported as they had no hyperthreading. Similarly, the more
common 4-socket Xeon-MP machines with hyperthreading using Intel
or ServerWorks chipsets continue to work without this, and all the
multi-core Xeon processors also run 64-bit kernels.

While the "bigsmp" support can also be used to run on later 64-bit
machines (including VM guests), it seems best to discourage that
and get any remaining users to update their kernels to 64-bit builds
on these. As a side-effect of this, there is also no more need to
support NUMA configurations on 32-bit x86, as all true 32-bit
NUMA platforms are already gone.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250226213714.4040853-3-arnd@kernel.org
2025-02-27 11:19:05 +01:00
Ingo Molnar
30667e5547 Merge branch 'x86/mm' into x86/cpu, to avoid conflicts
We are going to apply a new series that conflicts with pending
work in x86/mm, so merge in x86/mm to avoid it, and also to
refresh the x86/cpu branch with fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-02-27 11:17:37 +01:00
Yosry Ahmed
8f64eee70c x86/bugs: Remove X86_FEATURE_USE_IBPB
X86_FEATURE_USE_IBPB was introduced in:

  2961298efe ("x86/cpufeatures: Clean up Spectre v2 related CPUID flags")

to have separate flags for when the CPU supports IBPB (i.e. X86_FEATURE_IBPB)
and when an IBPB is actually used to mitigate Spectre v2.

Ever since then, the uses of IBPB expanded. The name became confusing
because it does not control all IBPB executions in the kernel.
Furthermore, because its name is generic and it's buried within
indirect_branch_prediction_barrier(), it's easy to use it not knowing
that it is specific to Spectre v2.

X86_FEATURE_USE_IBPB is no longer needed because all the IBPB executions
it used to control are now controlled through other means (e.g.
switch_mm_*_ibpb static branches).

Remove the unused feature bit.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20250227012712.3193063-7-yosry.ahmed@linux.dev
2025-02-27 10:57:21 +01:00
Yosry Ahmed
80dacb0804 x86/bugs: Use a static branch to guard IBPB on vCPU switch
Instead of using X86_FEATURE_USE_IBPB to guard the IBPB execution in KVM
when a new vCPU is loaded, introduce a static branch, similar to
switch_mm_*_ibpb.

This makes it obvious in spectre_v2_user_select_mitigation() what
exactly is being toggled, instead of the unclear X86_FEATURE_USE_IBPB
(which will be shortly removed). It also provides more fine-grained
control, making it simpler to change/add paths that control the IBPB in
the vCPU switch path without affecting other IBPBs.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250227012712.3193063-5-yosry.ahmed@linux.dev
2025-02-27 10:57:20 +01:00
Yosry Ahmed
bd9a8542ce x86/bugs: Remove the X86_FEATURE_USE_IBPB check in ib_prctl_set()
If X86_FEATURE_USE_IBPB is not set, then both spectre_v2_user_ibpb and
spectre_v2_user_stibp are set to SPECTRE_V2_USER_NONE in
spectre_v2_user_select_mitigation(). Since ib_prctl_set() already checks
for this before performing the IBPB, the X86_FEATURE_USE_IBPB check is
redundant. Remove it.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20250227012712.3193063-4-yosry.ahmed@linux.dev
2025-02-27 10:57:20 +01:00
Yosry Ahmed
549435aab4 x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers
indirect_branch_prediction_barrier() only performs the MSR write if
X86_FEATURE_USE_IBPB is set, using alternative_msr_write(). In
preparation for removing X86_FEATURE_USE_IBPB, move the feature check
into the callers so that they can be addressed one-by-one, and use
X86_FEATURE_IBPB instead to guard the MSR write.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250227012712.3193063-2-yosry.ahmed@linux.dev
2025-02-27 10:57:20 +01:00
Uros Bizjak
adf6819278 x86/bootflag: Micro-optimize sbf_write()
Change parity bit with XOR when !parity instead of masking bit out
and conditionally setting it when !parity.

Saves a couple of bytes in the object file.

Co-developed-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250226153709.6370-1-ubizjak@gmail.com
2025-02-27 10:14:00 +01:00
Christophe Leroy
d856bc3ac7 static_call_inline: Provide trampoline address when updating sites
In preparation of support of inline static calls on powerpc, provide
trampoline address when updating sites, so that when the destination
function is too far for a direct function call, the call site is
patched with a call to the trampoline.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/5efe0cffc38d6f69b1ec13988a99f1acff551abf.1733245362.git.christophe.leroy@csgroup.eu
2025-02-26 21:09:43 +05:30
Benjamin Berg
5d3b81d4d8 x86/fpu: Avoid copying dynamic FP state from init_task in arch_dup_task_struct()
The init_task instance of struct task_struct is statically allocated and
may not contain the full FP state for userspace. As such, limit the copy
to the valid area of both init_task and 'dst' and ensure all memory is
initialized.

Note that the FP state is only needed for userspace, and as such it is
entirely reasonable for init_task to not contain parts of it.

Fixes: 5aaeb5c01c ("x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86")
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250226133136.816901-1-benjamin@sipsolutions.net
----

v2:
- Fix code if arch_task_struct_size < sizeof(init_task) by using
  memcpy_and_pad.
2025-02-26 15:32:34 +01:00
Borislav Petkov
8442df2b49 x86/bugs: KVM: Add support for SRSO_MSR_FIX
Add support for

  CPUID Fn8000_0021_EAX[31] (SRSO_MSR_FIX). If this bit is 1, it
  indicates that software may use MSR BP_CFG[BpSpecReduce] to mitigate
  SRSO.

Enable BpSpecReduce to mitigate SRSO across guest/host boundaries.

Switch back to enabling the bit when virtualization is enabled and to
clear the bit when virtualization is disabled because using a MSR slot
would clear the bit when the guest is exited and any training the guest
has done, would potentially influence the host kernel when execution
enters the kernel and hasn't VMRUN the guest yet.

More detail on the public thread in Link below.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241202120416.6054-1-bp@kernel.org
2025-02-26 15:13:06 +01:00
Peter Zijlstra
dfebe7362f x86/ibt: Optimize the fineibt-bhi arity 1 case
Saves a CALL to an out-of-line thunk for the common case of 1
argument.

Suggested-by: Scott Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.927885784@infradead.org
2025-02-26 13:49:11 +01:00
Peter Zijlstra
0c92385dc0 x86/ibt: Implement FineIBT-BHI mitigation
While WAIT_FOR_ENDBR is specified to be a full speculation stop; it
has been shown that some implementations are 'leaky' to such an extend
that speculation can escape even the FineIBT preamble.

To deal with this, add additional hardening to the FineIBT preamble.

Notably, using a new LLVM feature:

  e223485c9b

which encodes the number of arguments in the kCFI preamble's register.

Using this register<->arity mapping, have the FineIBT preamble CALL
into a stub clobbering the relevant argument registers in the
speculative case.

Scott sayeth thusly:

Microarchitectural attacks such as Branch History Injection (BHI) and
Intra-mode Branch Target Injection (IMBTI) [1] can cause an indirect
call to mispredict to an adversary-influenced target within the same
hardware domain (e.g., within the kernel). Instructions at the
mispredicted target may execute speculatively and potentially expose
kernel data (e.g., to a user-mode adversary) through a
microarchitectural covert channel such as CPU cache state.

CET-IBT [2] is a coarse-grained control-flow integrity (CFI) ISA
extension that enforces that each indirect call (or indirect jump)
must land on an ENDBR (end branch) instruction, even speculatively*.
FineIBT is a software technique that refines CET-IBT by associating
each function type with a 32-bit hash and enforcing (at the callee)
that the hash of the caller's function pointer type matches the hash
of the callee's function type. However, recent research [3] has
demonstrated that the conditional branch that enforces FineIBT's hash
check can be coerced to mispredict, potentially allowing an adversary
to speculatively bypass the hash check:

__cfi_foo:
  ENDBR64
  SUB R10d, 0x01234567
  JZ foo    # Even if the hash check fails and ZF=0, this branch could still mispredict as taken
  UD2
foo:
  ...

The techniques demonstrated in [3] require the attacker to be able to
control the contents of at least one live register at the mispredicted
target. Therefore, this patch set introduces a sequence of CMOV
instructions at each indirect-callable target that poisons every live
register with data that the attacker cannot control whenever the
FineIBT hash check fails, thus mitigating any potential attack.

The security provided by this scheme has been discussed in detail on
an earlier thread [4].

 [1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html
 [2] Intel Software Developer's Manual, Volume 1, Chapter 18
 [3] https://www.vusec.net/projects/native-bhi/
 [4] https://lore.kernel.org/lkml/20240927194925.707462984@infradead.org/
 *There are some caveats for certain processors, see [1] for more info

Suggested-by: Scott Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.820402212@infradead.org
2025-02-26 13:49:11 +01:00
Ingo Molnar
ac3144f91b Linux 6.14-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAme7hfkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGU+IH/1bk6zIvAwXXS5yu
 KNsQ8dEkC3Xme6HqLtPsAhRLF+5YJf6MaGm1ip5dDMyIvasa2gwvCQQQoOpeMbKj
 79VKT+m9t3szMHZaQYjlOuYHBmNSJ4cMCD2Qh6ktXHGPfTTWDFGf7fBwBOkVNeJU
 1Ask+bxeop21aJMhfYXrUta3OYyerLBUR6jCiCM82A/GLtdv6oNGXBu3ygDt9Tjx
 ZHSl+CYjKpmGUP8JnMKwCBHVguEfqgzZ//dY1H16AvOLed9k2jkMFn8O5Vi3vjnx
 TWMMXoiJimuamGzbjxtCCqzxNlFFDT4gRpDqeJxb16W/gDTFmbRr9LDjNehCZe33
 AigLZ6M=
 =Y/7F
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc4' into x86/fpu, to pick up fixes and refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-02-26 13:05:16 +01:00
Peter Zijlstra
97e59672a9 x86/ibt: Add paranoid FineIBT mode
Due to concerns about circumvention attacks against FineIBT on 'naked'
ENDBR, add an additional caller side hash check to FineIBT. This
should make it impossible to pivot over such a 'naked' ENDBR
instruction at the cost of an additional load.

The specific pivot reported was against the SYSCALL entry site and
FRED will have all those holes fixed up.

  https://lore.kernel.org/linux-hardening/Z60NwR4w%2F28Z7XUa@ubun/

This specific fineibt_paranoid_start[] sequence was concocted by
Scott.

Suggested-by: Scott Constable <scott.d.constable@intel.com>
Reported-by: Jennifer Miller <jmill@asu.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.598033084@infradead.org
2025-02-26 12:27:45 +01:00
Peter Zijlstra
029f718fed x86/traps: Decode LOCK Jcc.d8 as #UD
Because overlapping code sequences are all the rage.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.486463917@infradead.org
2025-02-26 12:24:17 +01:00
Peter Zijlstra
06926c6cdb x86/ibt: Optimize the FineIBT instruction sequence
Scott notes that non-taken branches are faster. Abuse overlapping code
that traps instead of explicit UD2 instructions.

And LEA does not modify flags and will have less dependencies.

Suggested-by: Scott Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.371942555@infradead.org
2025-02-26 12:24:09 +01:00
Peter Zijlstra
e33d805a10 x86/traps: Allow custom fixups in handle_bug()
The normal fixup in handle_bug() is simply continuing at the next
instruction. However upcoming patches make this the wrong thing, so
allow handlers (specifically handle_cfi_failure()) to over-ride
regs->ip.

The callchain is such that the fixup needs to be done before it is
determined if the exception is fatal, as such, revert any changes in
that case.

Additionally, have handle_cfi_failure() remember the regs->ip value it
starts with for reporting.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.275223080@infradead.org
2025-02-26 12:22:39 +01:00
Peter Zijlstra
2e044911be x86/traps: Decode 0xEA instructions as #UD
FineIBT will start using 0xEA as #UD. Normally '0xEA' is a 'bad',
invalid instruction for the CPU.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.166774696@infradead.org
2025-02-26 12:22:10 +01:00
Nikolay Borisov
6447828875 x86/mce/inject: Remove call to mce_notify_irq()
The call to mce_notify_irq() has been there since the initial version of
the soft inject mce machinery, introduced in

  ea149b36c7 ("x86, mce: add basic error injection infrastructure").

At that time it was functional since injecting an MCE resulted in the
following call chain:

  raise_mce()
    ->machine_check_poll()
        ->mce_log() - sets notfiy_user_bit
  ->mce_notify_user() (current mce_notify_irq) consumed the bit and called the
  usermode helper.

However, with the introduction of

  011d826111 ("RAS: Add a Corrected Errors Collector")

the code got moved around and the usermode helper began to be called via the
early notifier mce_first_notifier() rendering the call in raise_local()
defunct as the mce_need_notify bit (ex notify_user) is only being set from the
early notifier.

Remove the noop call and make mce_notify_irq() static.

No functional changes.

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250225143348.268469-1-nik.borisov@suse.com
2025-02-26 12:18:37 +01:00
Ingo Molnar
5d703825fd x86/alternatives: Clean up preprocessor conditional block comments
When in the middle of a kernel source code file a kernel developer
sees a lone #else or #endif:

   ...

   #else

   ...

It's not obvious at a glance what those preprocessor blocks are
conditional on, if the starting #ifdef is outside visible range.

So apply the standard pattern we use in such cases elsewhere in
the kernel for large preprocessor blocks:

  #ifdef CONFIG_XXX
  ...
  ...
  ...
  #endif /* CONFIG_XXX */

  ...

  #ifdef CONFIG_XXX
  ...
  ...
  ...
  #else /* !CONFIG_XXX: */
  ...
  ...
  ...
  #endif /* !CONFIG_XXX */

( Note that in the  #else case we use the /* !CONFIG_XXX */ marker
  in the final #endif, not /* CONFIG_XXX */, which serves as an easy
  visual marker to differentiate #else or #elif related #endif closures
  from singular #ifdef/#endif blocks. )

Also clean up __CFI_DEFAULT definition with a bit more vertical alignment
applied, and a pointless tab converted to the standard space we use in
such definitions.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-02-26 12:15:15 +01:00
Peter Zijlstra
500a41acb0 x86/ibt: Add exact_endbr() helper
For when we want to exactly match ENDBR, and not everything that we
can scribble it with.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.059556588@infradead.org
2025-02-26 12:11:18 +01:00
Peter Zijlstra
9a54fb3134 x86/cfi: Add 'cfi=warn' boot option
Rebuilding with CONFIG_CFI_PERMISSIVE=y enabled is such a pain, esp. since
clang is so slow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124159.924496481@infradead.org
2025-02-26 12:10:48 +01:00
Arnd Bergmann
9de7695925 x86/irq: Define trace events conditionally
When both of X86_LOCAL_APIC and X86_THERMAL_VECTOR are disabled,
the irq tracing produces a W=1 build warning for the tracing
definitions:

  In file included from include/trace/trace_events.h:27,
                 from include/trace/define_trace.h:113,
                 from arch/x86/include/asm/trace/irq_vectors.h:383,
                 from arch/x86/kernel/irq.c:29:
  include/trace/stages/init.h:2:23: error: 'str__irq_vectors__trace_system_name' defined but not used [-Werror=unused-const-variable=]

Make the tracepoints conditional on the same symbosl that guard
their usage.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250225213236.3141752-1-arnd@kernel.org
2025-02-25 22:44:35 +01:00
Russell Senior
bebe35bb73 x86/CPU: Fix warm boot hang regression on AMD SC1100 SoC systems
I still have some Soekris net4826 in a Community Wireless Network I
volunteer with. These devices use an AMD SC1100 SoC. I am running
OpenWrt on them, which uses a patched kernel, that naturally has
evolved over time.  I haven't updated the ones in the field in a
number of years (circa 2017), but have one in a test bed, where I have
intermittently tried out test builds.

A few years ago, I noticed some trouble, particularly when "warm
booting", that is, doing a reboot without removing power, and noticed
the device was hanging after the kernel message:

  [    0.081615] Working around Cyrix MediaGX virtual DMA bugs.

If I removed power and then restarted, it would boot fine, continuing
through the message above, thusly:

  [    0.081615] Working around Cyrix MediaGX virtual DMA bugs.
  [    0.090076] Enable Memory-Write-back mode on Cyrix/NSC processor.
  [    0.100000] Enable Memory access reorder on Cyrix/NSC processor.
  [    0.100070] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
  [    0.110058] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
  [    0.120037] CPU: NSC Geode(TM) Integrated Processor by National Semi (family: 0x5, model: 0x9, stepping: 0x1)
  [...]

In order to continue using modern tools, like ssh, to interact with
the software on these old devices, I need modern builds of the OpenWrt
firmware on the devices. I confirmed that the warm boot hang was still
an issue in modern OpenWrt builds (currently using a patched linux
v6.6.65).

Last night, I decided it was time to get to the bottom of the warm
boot hang, and began bisecting. From preserved builds, I narrowed down
the bisection window from late February to late May 2019. During this
period, the OpenWrt builds were using 4.14.x. I was able to build
using period-correct Ubuntu 18.04.6. After a number of bisection
iterations, I identified a kernel bump from 4.14.112 to 4.14.113 as
the commit that introduced the warm boot hang.

  07aaa7e3d6

Looking at the upstream changes in the stable kernel between 4.14.112
and 4.14.113 (tig v4.14.112..v4.14.113), I spotted a likely suspect:

  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=20afb90f730982882e65b01fb8bdfe83914339c5

So, I tried reverting just that kernel change on top of the breaking
OpenWrt commit, and my warm boot hang went away.

Presumably, the warm boot hang is due to some register not getting
cleared in the same way that a loss of power does. That is
approximately as much as I understand about the problem.

More poking/prodding and coaching from Jonas Gorski, it looks
like this test patch fixes the problem on my board: Tested against
v6.6.67 and v4.14.113.

Fixes: 18fb053f9b ("x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors")
Debugged-by: Jonas Gorski <jonas.gorski@gmail.com>
Signed-off-by: Russell Senior <russell@personaltelco.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/CAHP3WfOgs3Ms4Z+L9i0-iBOE21sdMk5erAiJurPjnrL9LSsgRA@mail.gmail.com
Cc: Matthew Whitehead <tedheadster@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
2025-02-25 22:44:01 +01:00
Dmytro Maluka
96f41f644c x86/of: Don't use DTB for SMP setup if ACPI is enabled
There are cases when it is useful to use both ACPI and DTB provided by
the bootloader, however in such cases we should make sure to prevent
conflicts between the two. Namely, don't try to use DTB for SMP setup
if ACPI is enabled.

Precisely, this prevents at least:

- incorrectly calling register_lapic_address(APIC_DEFAULT_PHYS_BASE)
  after the LAPIC was already successfully enumerated via ACPI, causing
  noisy kernel warnings and probably potential real issues as well

- failed IOAPIC setup in the case when IOAPIC is enumerated via mptable
  instead of ACPI (e.g. with acpi=noirq), due to
  mpparse_parse_smp_config() overridden by x86_dtb_parse_smp_config()

Signed-off-by: Dmytro Maluka <dmaluka@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250105172741.3476758-2-dmaluka@chromium.org
2025-02-25 22:13:02 +01:00
Thorsten Blum
8e8f030649 x86/mtrr: Remove unnecessary strlen() in mtrr_write()
The local variable length already holds the string length after calling
strncpy_from_user(). Using another local variable linlen and calling
strlen() is therefore unnecessary and can be removed. Remove linlen
and strlen() and use length instead.

No change in functionality intended.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250225131621.329699-2-thorsten.blum@linux.dev
2025-02-25 20:50:55 +01:00
Waiman Long
fe37c699ae x86/nmi: Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus()
Depending on the type of panics, it was found that the
__register_nmi_handler() function can be called in NMI context from
nmi_shootdown_cpus() leading to a lockdep splat:

  WARNING: inconsistent lock state
  inconsistent {INITIAL USE} -> {IN-NMI} usage.

   lock(&nmi_desc[0].lock);
   <Interrupt>
     lock(&nmi_desc[0].lock);

  Call Trace:
    _raw_spin_lock_irqsave
    __register_nmi_handler
    nmi_shootdown_cpus
    kdump_nmi_shootdown_cpus
    native_machine_crash_shutdown
    __crash_kexec

In this particular case, the following panic message was printed before:

  Kernel panic - not syncing: Fatal hardware error!

This message seemed to be given out from __ghes_panic() running in
NMI context.

The __register_nmi_handler() function which takes the nmi_desc lock
with irq disabled shouldn't be called from NMI context as this can
lead to deadlock.

The nmi_shootdown_cpus() function can only be invoked once. After the
first invocation, all other CPUs should be stuck in the newly added
crash_nmi_callback() and cannot respond to a second NMI.

Fix it by adding a new emergency NMI handler to the nmi_desc
structure and provide a new set_emergency_nmi_handler() helper to set
crash_nmi_callback() in any context. The new emergency handler will
preempt other handlers in the linked list. That will eliminate the need
to take any lock and serve the panic in NMI use case.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250206191844.131700-1-longman@redhat.com
2025-02-25 14:38:43 +01:00
Uros Bizjak
dc8bd769e7 x86/ioperm: Use atomic64_inc_return() in ksys_ioperm()
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref)
to use optimized implementation on targets that define
atomic_inc_return() and to remove now unneeded initialization of the
%eax/%edx register pair before the call to atomic64_inc_return().

On x86_32 the code improves from:

 1b0:    b9 00 00 00 00           mov    $0x0,%ecx
            1b1: R_386_32    .bss
 1b5:    89 43 0c                 mov    %eax,0xc(%ebx)
 1b8:    31 d2                    xor    %edx,%edx
 1ba:    b8 01 00 00 00           mov    $0x1,%eax
 1bf:    e8 fc ff ff ff           call   1c0 <ksys_ioperm+0xa8>
            1c0: R_386_PC32    atomic64_add_return_cx8
 1c4:    89 03                    mov    %eax,(%ebx)
 1c6:    89 53 04                 mov    %edx,0x4(%ebx)

to:

 1b0:    be 00 00 00 00           mov    $0x0,%esi
            1b1: R_386_32    .bss
 1b5:    89 43 0c                 mov    %eax,0xc(%ebx)
 1b8:    e8 fc ff ff ff           call   1b9 <ksys_ioperm+0xa1>
            1b9: R_386_PC32    atomic64_inc_return_cx8
 1bd:    89 03                    mov    %eax,(%ebx)
 1bf:    89 53 04                 mov    %edx,0x4(%ebx)

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250223161355.3607-1-ubizjak@gmail.com
2025-02-23 19:18:18 +01:00
Linus Torvalds
5cf80612d3 Miscellaneous x86 fixes:
- Fix AVX-VNNI CPU feature dependency bug triggered via
    the 'noxsave' boot option
 
  - Fix typos in the SVA documentation
 
  - Add Tony Luck as RDT co-maintainer and remove Fenghua Yu
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAme52akRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hP5g/9He4e+kBsObHbnr4vUcgHi4LR3xcFdBTM
 9rHRNeLDbPZl85n1rrbKNp+z35bLqs02ULGIbM4y40UwgbRco6Ax/6VCWxbQyFMO
 CMQRUQhlmnT7k3sSJ5/AH6V/uInjv2vzH2CnBQMm756GFOqKe0Erk/rGn7/xHbUv
 Ji0K6Hm35dgJpYpyhxAPUM6047lJ2DivjFm8h/Gd646sMmjUQRr+DcHCRb4Dda1g
 goT/dr7OhTE3jBThD8TPzEZeJzyR66YoknOEQPHx4SLzvYXKrDjnrPmf9VzI8VZX
 MAHM2Cs/lYhwC1lC03p1FT0ExCAS1NWDyhPQFza0dP4zYxdTMmXwOPkgWu+BBesn
 R7mSO9LdjDIAaxXPHhs8tdpQImfZzRr/S7T5vq63eWW/yT16TMdfhhvol54X6dgT
 67wxKXRxInwWQVnA3y+YtmwxxR7l1xmqSZ5fDK5VGz4TSRTg8rZxvv4pWydcJ9Or
 sEWi88qNGrGnevXYVjowyyOrAG4KyfmC7Idgytkjezh1m0HyF+LY/0f7rlHc0eXD
 1HAMu0cGun4WkOyoAuCmgdV0WmKDycihQO/U3o2mvtSvc+kAjGGss/b4PXzJTVmk
 8FQsItveOBffvWKxdaDz9R5oOZ0Vzzo3NnIkrvEWfBSBP2XcH3Qy0Xr4prXUA3cl
 p+/KdMjNohs=
 =CR8P
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - Fix AVX-VNNI CPU feature dependency bug triggered via the 'noxsave'
   boot option

 - Fix typos in the SVA documentation

 - Add Tony Luck as RDT co-maintainer and remove Fenghua Yu

* tag 'x86-urgent-2025-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  docs: arch/x86/sva: Fix two grammar errors under Background and FAQ
  x86/cpufeatures: Make AVX-VNNI depend on AVX
  MAINTAINERS: Change maintainer for RDT
2025-02-22 10:45:02 -08:00
Dave Young
a2498e5c45 x86/kexec: Export e820_table_kexec[] to sysfs
Previously the e820_table_kexec[] was exported to sysfs since kexec-tools uses
the memmap entries to prepare the e820 table for the new kernel.

The following commit, ~8 years ago, introduced e820_table_firmware[] and changed
the behavior to export the firmware table instead:

  12df216c61 ("x86/boot/e820: Introduce the bootloader provided e820_table_firmware[] table")

Originally the kexec_file_load and kexec_load syscalls both used e820_table_kexec[].
Since the sysfs exported entries are from e820_table_firmware[] people
now need to tune both tables for kexec.

Restore the old behavior so the kexec_load and kexec_file_load syscalls work with
only one table update.  The e820_table_firmware[] is used by hibernation kernel
code and it works without the sysfs exporting. Also remove the SEV
e820_table_firmware[] updating code.

Also update the code comments here and drop the comments about setup_data
reservation since it is not needed any more after this change was made
a year ago:

  fc7f27cda8 ("x86/kexec: Do not update E820 kexec table for setup_data")

[ mingo: Tidy up the changelog and comments. ]

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/Z5jcb1GKhLvH8kDc@darkstar.users.ipa.redhat.com
2025-02-22 12:44:45 +01:00
Uros Bizjak
5bebe2e4fe x86/boot: Change some static bootflag functions to bool
The return values of some functions are of boolean type. Change the
type of these function to bool and adjust their return values.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250129154920.6773-1-ubizjak@gmail.com
2025-02-22 12:30:59 +01:00
Borislav Petkov (AMD)
50cef76d5c x86/microcode/AMD: Load only SHA256-checksummed patches
Load patches for which the driver carries a SHA256 checksum of the patch
blob.

This can be disabled by adding "microcode.amd_sha_check=off" on the
kernel cmdline. But it is highly NOT recommended.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-02-22 11:46:05 +01:00
Nuno Das Neves
db912b8954 hyperv: Change hv_root_partition into a function
Introduce hv_curr_partition_type to store the partition type
as an enum.

Right now this is limited to guest or root partition, but there will
be other kinds in future and the enum is easily extensible.

Set up hv_curr_partition_type early in Hyper-V initialization with
hv_identify_partition_type(). hv_root_partition() just queries this
value, and shouldn't be called before that.

Making this check into a function sets the stage for adding a config
option to gate the compilation of root partition code. In particular,
hv_root_partition() can be stubbed out always be false if root
partition support isn't desired.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/1740167795-13296-3-git-send-email-nunodasneves@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1740167795-13296-3-git-send-email-nunodasneves@linux.microsoft.com>
2025-02-22 02:21:45 +00:00
Brian Gerst
684b12916a x86/arch_prctl/64: Clean up ARCH_MAP_VDSO_32
process_64.c is not built on native 32-bit, so CONFIG_X86_32 will never
be set.

No change in functionality intended.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250202202323.422113-3-brgerst@gmail.com
2025-02-21 22:32:25 +01:00
Brian Gerst
2df1ad0d25 x86/arch_prctl: Simplify sys_arch_prctl()
Use in_ia32_syscall() instead of a compat syscall entry.

No change in functionality intended.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20250202202323.422113-2-brgerst@gmail.com
2025-02-21 22:32:25 +01:00
Thorsten Blum
0156338a18 x86/apic: Use str_disabled_enabled() helper in print_ipi_mode()
Remove hard-coded strings by using the str_disabled_enabled() helper.

No change in functionality intended.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250209210333.5666-2-thorsten.blum@linux.dev
2025-02-21 16:54:22 +01:00
Rik van Riel
f2c5c21058 x86/mm: Remove pv_ops.mmu.tlb_remove_table call
Every pv_ops.mmu.tlb_remove_table call ends up calling tlb_remove_table.

Get rid of the indirection by simply calling tlb_remove_table directly,
and not going through the paravirt function pointers.

Suggested-by: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250213161423.449435-3-riel@surriel.com
2025-02-21 16:20:12 +01:00
Rik van Riel
a37259732a x86/mm: Make MMU_GATHER_RCU_TABLE_FREE unconditional
Currently x86 uses CONFIG_MMU_GATHER_TABLE_FREE when using
paravirt, and not when running on bare metal.

There is no real good reason to do things differently for
each setup. Make them all the same.

Currently get_user_pages_fast synchronizes against page table
freeing in two different ways:

 - on bare metal, by blocking IRQs, which block TLB flush IPIs
 - on paravirt, with MMU_GATHER_RCU_TABLE_FREE

This is done because some paravirt TLB flush implementations
handle the TLB flush in the hypervisor, and will do the flush
even when the target CPU has interrupts disabled.

Always handle page table freeing with MMU_GATHER_RCU_TABLE_FREE.
Using RCU synchronization between page table freeing and get_user_pages_fast()
allows bare metal to also do TLB flushing while interrupts are disabled.

Various places in the mm do still block IRQs or disable preemption
as an implicit way to block RCU frees.

That makes it safe to use INVLPGB on AMD CPUs.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Tested-by: Brendan Jackman <jackmanb@google.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20250213161423.449435-2-riel@surriel.com
2025-02-21 16:20:11 +01:00
Mike Rapoport (Microsoft)
efe659ac01 x86/e820: Drop obsolete E820_TYPE_RESERVED_KERN and related code
E820_TYPE_RESERVED_KERN is a relict from the ancient history that was used
to early reserve setup_data, see:

  28bb223795 ("x86: move reserve_setup_data to setup.c")

Nowadays setup_data is anyway reserved in memblock and there is no point in
carrying E820_TYPE_RESERVED_KERN that behaves exactly like E820_TYPE_RAM
but only complicates the code.

A bonus for removing E820_TYPE_RESERVED_KERN is a small but measurable
speedup of 20 microseconds in init_mem_mappings() on a VM with 32GB or RAM.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20250214090651.3331663-5-rppt@kernel.org
2025-02-21 16:05:00 +01:00
Mike Rapoport (Microsoft)
d45dd0a9b2 x86/boot: Split parsing of boot_params into the parse_boot_params() helper function
Makes setup_arch() a bit easier to comprehend.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250214090651.3331663-4-rppt@kernel.org
2025-02-21 16:05:00 +01:00
Mike Rapoport (Microsoft)
297fb82eba x86/boot: Split kernel resources setup into the setup_kernel_resources() helper function
Makes setup_arch() a bit easier to comprehend.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250214090651.3331663-3-rppt@kernel.org
2025-02-21 16:05:00 +01:00
Mike Rapoport (Microsoft)
2d6bff3139 x86/boot: Move setting of memblock parameters to e820__memblock_setup()
Changing memblock parameters, namely bottom_up and allocation upper
limit does not have any effect before memblock initialization in
e820__memblock_setup().

Move the calls to memblock_set_bottom_up() and memblock_set_current_limit()
to e820__memblock_setup() to group all the memblock initial setup and make
setup_arch() more readable.

No functional changes.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250214090651.3331663-2-rppt@kernel.org
2025-02-21 16:05:00 +01:00
Guilherme G. Piccoli
d90c9de9de x86/tsc: Always save/restore TSC sched_clock() on suspend/resume
TSC could be reset in deep ACPI sleep states, even with invariant TSC.

That's the reason we have sched_clock() save/restore functions, to deal
with this situation. But what happens is that such functions are guarded
with a check for the stability of sched_clock - if not considered stable,
the save/restore routines aren't executed.

On top of that, we have a clear comment in native_sched_clock() saying
that *even* with TSC unstable, we continue using TSC for sched_clock due
to its speed.

In other words, if we have a situation of TSC getting detected as unstable,
it marks the sched_clock as unstable as well, so subsequent S3 sleep cycles
could bring bogus sched_clock values due to the lack of the save/restore
mechanism, causing warnings like this:

  [22.954918] ------------[ cut here ]------------
  [22.954923] Delta way too big! 18446743750843854390 ts=18446744072977390405 before=322133536015 after=322133536015 write stamp=18446744072977390405
  [22.954923] If you just came from a suspend/resume,
  [22.954923] please switch to the trace global clock:
  [22.954923]   echo global > /sys/kernel/tracing/trace_clock
  [22.954923] or add trace_clock=global to the kernel command line
  [22.954937] WARNING: CPU: 2 PID: 5728 at kernel/trace/ring_buffer.c:2890 rb_add_timestamp+0x193/0x1c0

Notice that the above was reproduced even with "trace_clock=global".

The fix for that is to _always_ save/restore the sched_clock on suspend
cycle _if TSC is used_ as sched_clock - only if we fallback to jiffies
the sched_clock_stable() check becomes relevant to save/restore the
sched_clock.

Debugged-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250215210314.351480-1-gpiccoli@igalia.com
2025-02-21 15:27:38 +01:00
Artem Bityutskiy
64aad4749d ACPI/processor_idle: Export acpi_processor_ffh_play_dead()
The kernel test robot reported the following build error:

  >> ERROR: modpost: "acpi_processor_ffh_play_dead" [drivers/acpi/processor.ko] undefined!

Caused by this recently merged commit:

  541ddf31e3 ("ACPI/processor_idle: Add FFH state handling")

The build failure is due to an oversight in the 'CONFIG_ACPI_PROCESSOR=m' case,
the function export is missing. Add it.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202502151207.FA9UO1iX-lkp@intel.com/
Fixes: 541ddf31e3 ("ACPI/processor_idle: Add FFH state handling")
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/de5bf4f116779efde315782a15146fdc77a4a044.camel@linux.intel.com
2025-02-21 15:17:07 +01:00
Ingo Molnar
affe678f35 Linux 6.14-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmeyYIQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNy0H/jWdgjddRaEHQ1RB
 e18Oi6MJcTQikHbCHKGZGlyxR4dYxdAONuMmWwgt+266K8qUJSZcNXePwqGEWjx2
 qkJ9Tu0Agr8KkfVDtGHGXyd4tuZRpx9Fco6+jKkKiMjjtif7nrUajUGGwRsqGoib
 YYzrhbjNZDl17/J58O1E4YZs3w7Lu26PwDR58RZMsSG0pygAfU2fogKcYmi1pTYV
 w86icn0LlO8b5Y7fsrY56rLrawnI1RGlxfylUTHzo4QkoIUGvQLB8c6XPMYsVf9R
 lvkphu+/fGVnSw577WlVy8DTBso+Pj2nWw4jUTiEAy9hYY6zMxrqrX3XowAwbxj1
 m6zP+F8=
 =ieVA
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc3' into x86/mm, to pick up fixes before merging new changes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-02-21 15:02:56 +01:00
Stanislav Spassov
1937e18cc3 x86/fpu: Fix guest FPU state buffer allocation size
Ongoing work on an optimization to batch-preallocate vCPU state buffers
for KVM revealed a mismatch between the allocation sizes used in
fpu_alloc_guest_fpstate() and fpstate_realloc(). While the former
allocates a buffer sized to fit the default set of XSAVE features
in UABI form (as per fpu_user_cfg), the latter uses its ksize argument
derived (for the requested set of features) in the same way as the sizes
found in fpu_kernel_cfg, i.e. using the compacted in-kernel
representation.

The correct size to use for guest FPU state should indeed be the
kernel one as seen in fpstate_realloc(). The original issue likely
went unnoticed through a combination of UABI size typically being
larger than or equal to kernel size, and/or both amounting to the
same number of allocated 4K pages.

Fixes: 69f6ed1d14 ("x86/fpu: Provide infrastructure for KVM FPU cleanup")
Signed-off-by: Stanislav Spassov <stanspas@amazon.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250218141045.85201-1-stanspas@amazon.de
2025-02-21 14:47:26 +01:00
Dan Carpenter
06dd759b68 x86/module: Remove unnecessary check in module_finalize()
The "calls" pointer can no longer be NULL after the following
commit:

   ab9fea5948 ("x86/alternative: Simplify callthunk patching")

Delete this unnecessary check.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/fcbb2f57-0714-4139-b441-8817365c16a1@stanley.mountain
2025-02-21 14:27:42 +01:00
Eric Biggers
5171207284 x86/cpufeatures: Make AVX-VNNI depend on AVX
The 'noxsave' boot option disables support for AVX, but support for the
AVX-VNNI feature was still declared on CPUs that support it.  Fix this.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20250220060124.89622-1-ebiggers@kernel.org
2025-02-21 14:19:16 +01:00
Thomas Gleixner
751dc837da genirq: Introduce common irq_force_complete_move() implementation
CONFIG_GENERIC_PENDING_IRQ requires an architecture specific implementation
of irq_force_complete_move() for CPU hotplug. At the moment, only x86
implements this unconditionally, but for RISC-V irq_force_complete_move()
is only needed when the RISC-V IMSIC driver is in use and not needed
otherwise.

To allow runtime configuration of this mechanism, introduce a common
irq_force_complete_move() implementation in the interrupt core code, which
only invokes the completion function, when a interrupt chip in the
hierarchy implements it.

Switch X86 over to the new mechanism. No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250217085657.789309-5-apatel@ventanamicro.com
2025-02-20 15:19:26 +01:00
Mario Limonciello
c9e6f7fb1c x86/ACPI: CPPC: Add missing include
Some minimial kernel configurations will fail with -Werror=implicit-function-declaration
due to a missing header include.

Add that header.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/20250211203314.762755-1-superm1@kernel.org
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-02-18 19:15:08 +01:00
Joel Granados
c305a4e983 x86: Move sysctls into arch/x86
Move the following sysctl tables into arch/x86/kernel/setup.c:

  panic_on_{unrecoverable_nmi,io_nmi}
  bootloader_{type,version}
  io_delay_type
  unknown_nmi_panic
  acpi_realmode_flags

Variables moved from include/linux/ to arch/x86/include/asm/ because there
is no longer need for them outside arch/x86/kernel:

  acpi_realmode_flags
  panic_on_{unrecoverable_nmi,io_nmi}

Include <asm/nmi.h> in arch/s86/kernel/setup.h in order to bring in
panic_on_{io_nmi,unrecovered_nmi}.

This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kerenel/sysctl.c.

Signed-off-by: Joel Granados <joel.granados@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250218-jag-mv_ctltables-v1-8-cd3698ab8d29@kernel.org
2025-02-18 11:08:36 +01:00
Ingo Molnar
e8f925c320 Linux 6.14-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmeyYIQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNy0H/jWdgjddRaEHQ1RB
 e18Oi6MJcTQikHbCHKGZGlyxR4dYxdAONuMmWwgt+266K8qUJSZcNXePwqGEWjx2
 qkJ9Tu0Agr8KkfVDtGHGXyd4tuZRpx9Fco6+jKkKiMjjtif7nrUajUGGwRsqGoib
 YYzrhbjNZDl17/J58O1E4YZs3w7Lu26PwDR58RZMsSG0pygAfU2fogKcYmi1pTYV
 w86icn0LlO8b5Y7fsrY56rLrawnI1RGlxfylUTHzo4QkoIUGvQLB8c6XPMYsVf9R
 lvkphu+/fGVnSw577WlVy8DTBso+Pj2nWw4jUTiEAy9hYY6zMxrqrX3XowAwbxj1
 m6zP+F8=
 =ieVA
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc3' into x86/core, to pick up fixes

Pick up upstream x86 fixes before applying new patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-02-18 11:07:15 +01:00
Brian Gerst
38a4968b31 x86/percpu/64: Remove INIT_PER_CPU macros
Now that the load and link addresses of percpu variables are the same,
these macros are no longer necessary.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-12-brgerst@gmail.com
2025-02-18 10:15:50 +01:00
Brian Gerst
b5c4f95351 x86/percpu/64: Remove fixed_percpu_data
Now that the stack protector canary value is a normal percpu variable,
fixed_percpu_data is unused and can be removed.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-10-brgerst@gmail.com
2025-02-18 10:15:43 +01:00
Brian Gerst
9d7de2aa8b x86/percpu/64: Use relative percpu offsets
The percpu section is currently linked at absolute address 0, because
older compilers hard-coded the stack protector canary value at a fixed
offset from the start of the GS segment.  Now that the canary is a
normal percpu variable, the percpu section does not need to be linked
at a specific address.

x86-64 will now calculate the percpu offsets as the delta between the
initial percpu address and the dynamically allocated memory, like other
architectures.  Note that GSBASE is limited to the canonical address
width (48 or 57 bits, sign-extended).  As long as the kernel text,
modules, and the dynamically allocated percpu memory are all in the
negative address space, the delta will not overflow this limit.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-9-brgerst@gmail.com
2025-02-18 10:15:27 +01:00
Brian Gerst
80d47defdd x86/stackprotector/64: Convert to normal per-CPU variable
Older versions of GCC fixed the location of the stack protector canary
at %gs:40.  This constraint forced the percpu section to be linked at
absolute address 0 so that the canary could be the first data object in
the percpu section.  Supporting the zero-based percpu section requires
additional code to handle relocations for RIP-relative references to
percpu data, extra complexity to kallsyms, and workarounds for linker
bugs due to the use of absolute symbols.

GCC 8.1 supports redefining where the canary is located, allowing it to
become a normal percpu variable instead of at a fixed location.  This
removes the constraint that the percpu section must be zero-based.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-8-brgerst@gmail.com
2025-02-18 10:15:09 +01:00
Ard Biesheuvel
78c4374ef8 x86/module: Deal with GOT based stack cookie load on Clang < 17
Clang versions before 17 will not honour -fdirect-access-external-data
for the load of the stack cookie emitted into each function's prologue
and epilogue.

This is not an issue for the core kernel, as the linker will relax these
loads into LEA instructions that take the address of __stack_chk_guard
directly. For modules, however, we need to work around this, by dealing
with R_X86_64_REX_GOTPCRELX relocations that refer to __stack_chk_guard.

In this case, given that this is a GOT load, the reference should not
refer to __stack_chk_guard directly, but to a memory location that holds
its address. So take the address of __stack_chk_guard into a static
variable, and fix up the relocations to refer to that.

[ mingo: Fix broken R_X86_64_GOTPCRELX definition. ]

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-7-brgerst@gmail.com
2025-02-18 10:15:05 +01:00
Brian Gerst
a9a76b38aa x86/boot: Disable stack protector for early boot code
On 64-bit, this will prevent crashes when the canary access is changed
from %gs:40 to %gs:__stack_chk_guard(%rip).  RIP-relative addresses from
the identity-mapped early boot code will target the wrong address with
zero-based percpu.  KASLR could then shift that address to an unmapped
page causing a crash on boot.

This early boot code runs well before user-space is active and does not
need stack protector enabled.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250123190747.745588-4-brgerst@gmail.com
2025-02-18 10:14:51 +01:00
Beata Michalska
38e480d4fc cpufreq: Allow arch_freq_get_on_cpu to return an error
Allow arch_freq_get_on_cpu to return an error for cases when retrieving
current CPU frequency is not possible, whether that being due to lack of
required arch support or due to other circumstances when the current
frequency cannot be determined at given point of time.

Signed-off-by: Beata Michalska <beata.michalska@arm.com>
Reviewed-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20250131162439.3843071-2-beata.michalska@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-02-17 18:09:20 +00:00
Borislav Petkov (AMD)
037e81fb9d x86/microcode/AMD: Add get_patch_level()
Put the MSR_AMD64_PATCH_LEVEL reading of the current microcode revision
the hw has, into a separate function.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250211163648.30531-6-bp@kernel.org
2025-02-17 09:42:40 +01:00
Borislav Petkov (AMD)
b39c387164 x86/microcode/AMD: Get rid of the _load_microcode_amd() forward declaration
Simply move save_microcode_in_initrd() down.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250211163648.30531-5-bp@kernel.org
2025-02-17 09:42:37 +01:00
Borislav Petkov (AMD)
dc15675074 x86/microcode/AMD: Merge early_apply_microcode() into its single callsite
No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250211163648.30531-4-bp@kernel.org
2025-02-17 09:42:34 +01:00
Borislav Petkov (AMD)
3ef0740d10 x86/microcode/AMD: Remove unused save_microcode_in_initrd_amd() declarations
Commit

  a7939f0167 ("x86/microcode/amd: Cache builtin/initrd microcode early")

renamed it to save_microcode_in_initrd() and made it static. Zap the
forgotten declarations.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250211163648.30531-3-bp@kernel.org
2025-02-17 09:42:31 +01:00
Borislav Petkov (AMD)
7103f0589a x86/microcode/AMD: Remove ugly linebreak in __verify_patch_section() signature
No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250211163648.30531-2-bp@kernel.org
2025-02-17 09:42:13 +01:00
Greg Kroah-Hartman
2ce177e9b3 Merge 6.14-rc3 into driver-core-next
We need the faux_device changes in here for future work.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-17 07:24:33 +01:00
Sebastian Andrzej Siewior
741c10b096 kernfs: Use RCU to access kernfs_node::name.
Using RCU lifetime rules to access kernfs_node::name can avoid the
trouble with kernfs_rename_lock in kernfs_name() and kernfs_path_from_node()
if the fs was created with KERNFS_ROOT_INVARIANT_PARENT. This is usefull
as it allows to implement kernfs_path_from_node() only with RCU
protection and avoiding kernfs_rename_lock. The lock is only required if
the __parent node can be changed and the function requires an unchanged
hierarchy while it iterates from the node to its parent.
The change is needed to allow the lookup of the node's path
(kernfs_path_from_node()) from context which runs always with disabled
preemption and or interrutps even on PREEMPT_RT. The problem is that
kernfs_rename_lock becomes a sleeping lock on PREEMPT_RT.

I went through all ::name users and added the required access for the lookup
with a few extensions:
- rdtgroup_pseudo_lock_create() drops all locks and then uses the name
  later on. resctrl supports rename with different parents. Here I made
  a temporal copy of the name while it is used outside of the lock.

- kernfs_rename_ns() accepts NULL as new_parent. This simplifies
  sysfs_move_dir_ns() where it can set NULL in order to reuse the current
  name.

- kernfs_rename_ns() is only using kernfs_rename_lock if the parents are
  different. All users use either kernfs_rwsem (for stable path view) or
  just RCU for the lookup. The ::name uses always RCU free.

Use RCU lifetime guarantees to access kernfs_node::name.

Suggested-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reported-by: syzbot+6ea37e2e6ffccf41a7e6@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/67251dc6.050a0220.529b6.015e.GAE@google.com/
Reported-by: Hillf Danton <hdanton@sina.com>
Closes: https://lore.kernel.org/20241102001224.2789-1-hdanton@sina.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250213145023.2820193-7-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-15 17:46:32 +01:00
Sebastian Andrzej Siewior
633488947e kernfs: Use RCU to access kernfs_node::parent.
kernfs_rename_lock is used to obtain stable kernfs_node::{name|parent}
pointer. This is a preparation to access kernfs_node::parent under RCU
and ensure that the pointer remains stable under the RCU lifetime
guarantees.

For a complete path, as it is done in kernfs_path_from_node(), the
kernfs_rename_lock is still required in order to obtain a stable parent
relationship while computing the relevant node depth. This must not
change while the nodes are inspected in order to build the path.
If the kernfs user never moves the nodes (changes the parent) then the
kernfs_rename_lock is not required and the RCU guarantees are
sufficient. This "restriction" can be set with
KERNFS_ROOT_INVARIANT_PARENT. Otherwise the lock is required.

Rename kernfs_node::parent to kernfs_node::__parent to denote the RCU
access and use RCU accessor while accessing the node.
Make cgroup use KERNFS_ROOT_INVARIANT_PARENT since the parent here can
not change.

Acked-by: Tejun Heo <tj@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250213145023.2820193-6-bigeasy@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-02-15 17:46:32 +01:00
Peter Zijlstra
882b86fd4e x86/ibt: Handle FineIBT in handle_cfi_failure()
Sami reminded me that FineIBT failure does not hook into the regular
CFI failure case, and as such CFI_PERMISSIVE does not work.

Reported-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lkml.kernel.org/r/20250214092619.GB21726@noisy.programming.kicks-ass.net
2025-02-14 10:32:07 +01:00
Peter Zijlstra
306859de59 x86/early_printk: Harden early_serial
Scott found that mem32_serial_in() is an ideal speculation gadget, an
indirectly callable function that takes an adddress and offset and
immediately does a load.

Use static_call() to take away the need for indirect calls and
explicitly seal the functions to ensure they're not callable on IBT
enabled parts.

Reported-by: Scott Constable <scott.d.constable@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.919773202@infradead.org
2025-02-14 10:32:07 +01:00
Peter Zijlstra
c4239a72a2 x86/ibt: Clean up poison_endbr()
Basically, get rid of the .warn argument and explicitly don't call the
function when we know there isn't an endbr. This makes the calling
code clearer.

Note: perhaps don't add functions to .cfi_sites when the function
doesn't have endbr -- OTOH why would the compiler emit the prefix if
it has already determined there are no indirect callers and has
omitted the ENDBR instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.815505775@infradead.org
2025-02-14 10:32:06 +01:00
Peter Zijlstra
c20ad96c9a x86/traps: Cleanup and robustify decode_bug()
Notably, don't attempt to decode an immediate when MOD == 3.

Additionally have it return the instruction length, such that WARN
like bugs can more reliably skip to the correct instruction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.721120726@infradead.org
2025-02-14 10:32:06 +01:00
Peter Zijlstra
ab9fea5948 x86/alternative: Simplify callthunk patching
Now that paravirt call patching is implemented using alternatives, it
is possible to avoid having to patch the alternative sites by
including the altinstr_replacement calls in the call_sites list.

This means we're now stacking relative adjustments like so:

  callthunks_patch_builtin_calls():
    patches all function calls to target: func() -> func()-10
    since the CALL accounting lives in the CALL_PADDING.

    This explicitly includes .altinstr_replacement

  alt_replace_call():
    patches: x86_BUG() -> target()

    this patching is done in a relative manner, and will preserve
    the above adjustment, meaning that with calldepth patching it
    will do: x86_BUG()-10 -> target()-10

  apply_relocation():
    does code relocation, and adjusts all RIP-relative instructions
    to the new location, also in a relative manner.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.617187089@infradead.org
2025-02-14 10:32:06 +01:00
Peter Zijlstra
93f16a1ab7 x86/boot: Mark start_secondary() with __noendbr
The handoff between the boot stubs and start_secondary() are before IBT is
enabled and is definitely not subject to kCFI. As such, suppress all that for
this function.

Notably when the ENDBR poison would become fatal (ud1 instead of nop) this will
trigger a tripple fault because we haven't set up the IDT to handle #UD yet.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.509520369@infradead.org
2025-02-14 10:32:05 +01:00
Peter Zijlstra
582077c940 x86/cfi: Clean up linkage
With the introduction of kCFI the addition of ENDBR to
SYM_FUNC_START* no longer suffices to make the function indirectly
callable. This now requires the use of SYM_TYPED_FUNC_START.

As such, remove the implicit ENDBR from SYM_FUNC_START* and add some
explicit annotations to fix things up again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.409116003@infradead.org
2025-02-14 10:32:05 +01:00
Peter Zijlstra
72e213a7cc x86/ibt: Clean up is_endbr()
Pretty much every caller of is_endbr() actually wants to test something at an
address and ends up doing get_kernel_nofault(). Fold the lot into a more
convenient helper.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20250207122546.181367417@infradead.org
2025-02-14 10:32:04 +01:00
Patrick Bellasi
318e8c339c x86/cpu/kvm: SRSO: Fix possible missing IBPB on VM-Exit
In [1] the meaning of the synthetic IBPB flags has been redefined for a
better separation of concerns:
 - ENTRY_IBPB     -- issue IBPB on entry only
 - IBPB_ON_VMEXIT -- issue IBPB on VM-Exit only
and the Retbleed mitigations have been updated to match this new
semantics.

Commit [2] was merged shortly before [1], and their interaction was not
handled properly. This resulted in IBPB not being triggered on VM-Exit
in all SRSO mitigation configs requesting an IBPB there.

Specifically, an IBPB on VM-Exit is triggered only when
X86_FEATURE_IBPB_ON_VMEXIT is set. However:

 - X86_FEATURE_IBPB_ON_VMEXIT is not set for "spec_rstack_overflow=ibpb",
   because before [1] having X86_FEATURE_ENTRY_IBPB was enough. Hence,
   an IBPB is triggered on entry but the expected IBPB on VM-exit is
   not.

 - X86_FEATURE_IBPB_ON_VMEXIT is not set also when
   "spec_rstack_overflow=ibpb-vmexit" if X86_FEATURE_ENTRY_IBPB is
   already set.

   That's because before [1] this was effectively redundant. Hence, e.g.
   a "retbleed=ibpb spec_rstack_overflow=bpb-vmexit" config mistakenly
   reports the machine still vulnerable to SRSO, despite an IBPB being
   triggered both on entry and VM-Exit, because of the Retbleed selected
   mitigation config.

 - UNTRAIN_RET_VM won't still actually do anything unless
   CONFIG_MITIGATION_IBPB_ENTRY is set.

For "spec_rstack_overflow=ibpb", enable IBPB on both entry and VM-Exit
and clear X86_FEATURE_RSB_VMEXIT which is made superfluous by
X86_FEATURE_IBPB_ON_VMEXIT. This effectively makes this mitigation
option similar to the one for 'retbleed=ibpb', thus re-order the code
for the RETBLEED_MITIGATION_IBPB option to be less confusing by having
all features enabling before the disabling of the not needed ones.

For "spec_rstack_overflow=ibpb-vmexit", guard this mitigation setting
with CONFIG_MITIGATION_IBPB_ENTRY to ensure UNTRAIN_RET_VM sequence is
effectively compiled in. Drop instead the CONFIG_MITIGATION_SRSO guard,
since none of the SRSO compile cruft is required in this configuration.
Also, check only that the required microcode is present to effectively
enabled the IBPB on VM-Exit.

Finally, update the KConfig description for CONFIG_MITIGATION_IBPB_ENTRY
to list also all SRSO config settings enabled by this guard.

Fixes: 864bcaa38e ("x86/cpu/kvm: Provide UNTRAIN_RET_VM") [1]
Fixes: d893832d0e ("x86/srso: Add IBPB on VMEXIT") [2]
Reported-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Patrick Bellasi <derkling@google.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-02-11 10:07:52 -08:00
Eric Biggers
ccb7735a1e x86/fpu: Fully optimize out WARN_ON_FPU()
Currently WARN_ON_FPU evaluates its argument even if
CONFIG_X86_DEBUG_FPU is disabled, which adds unnecessary instructions to
several functions, for example kernel_fpu_begin().  Fix this by using
BUILD_BUG_ON_INVALID(x) in the no-debug case rather than (void)(x).

Fixes: 83242c5158 ("x86/fpu: Make WARN_ON_FPU() more robust in the !CONFIG_X86_DEBUG_FPU case")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20250127224523.94300-1-ebiggers%40kernel.org
2025-02-10 14:45:59 -08:00
Eric Biggers
968e9bc4ce x86: move ZMM exclusion list into CPU feature flag
Lift zmm_exclusion_list in aesni-intel_glue.c into the x86 CPU setup
code, and add a new x86 CPU feature flag X86_FEATURE_PREFER_YMM that is
set when the CPU is on this list.

This allows other code in arch/x86/, such as the CRC library code, to
apply the same exclusion list when deciding whether to execute 256-bit
or 512-bit optimized functions.

Note that full AVX512 support including ZMM registers is still exposed
to userspace and is still supported for in-kernel use.  This flag just
indicates whether in-kernel code should prefer to use YMM registers.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250210174540.161705-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-02-10 09:48:43 -08:00
Patryk Wlazlyn
96040f7273 x86/smp: Eliminate mwait_play_dead_cpuid_hint()
Currently, mwait_play_dead_cpuid_hint() looks up the MWAIT hint of the
deepest idle state by inspecting CPUID leaf 0x5 with the assumption
that, if the number of sub-states for a given major C-state is nonzero,
those sub-states are always represented by consecutive numbers starting
from 0. This assumption is not based on the documented platform behavior
and in fact it is not met on recent Intel platforms.

For example, Intel's Sierra Forest report two C-states with two
substates each in cpuid leaf 0x5:

  Name*   target cstate    target subcstate (mwait hint)
  ===========================================================
  C1      0x00             0x00
  C1E     0x00             0x01

  --      0x10             ----

  C6S     0x20             0x22
  C6P     0x20             0x23

  --      0x30             ----

  /* No more (sub)states all the way down to the end. */
  ===========================================================

   * Names of the cstates are not included in the CPUID leaf 0x5,
     they are taken from the product specific documentation.

Notice that hints 0x20 and 0x21 are not defined for C-state 0x20
(C6), so the existing MWAIT hint lookup in
mwait_play_dead_cpuid_hint() based on the CPUID leaf 0x5 contents does
not work in this case.

Instead of using MWAIT hint lookup that is not guaranteed to work,
make native_play_dead() rely on the idle driver for the given platform
to put CPUs going offline into appropriate idle state and, if that
fails, fall back to hlt_play_dead().

Accordingly, drop mwait_play_dead_cpuid_hint() altogether and make
native_play_dead() call cpuidle_play_dead() instead of it
unconditionally with the assumption that it will not return if it is
successful. Still, in case cpuidle_play_dead() fails, call
hlt_play_dead() at the end.

Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/20250205155211.329780-5-artem.bityutskiy%40linux.intel.com
2025-02-05 10:44:52 -08:00
Patryk Wlazlyn
541ddf31e3 ACPI/processor_idle: Add FFH state handling
Recent Intel platforms will depend on the idle driver to pass the
correct hint for playing dead via mwait_play_dead_with_hint(). Expand
the existing enter_dead interface with handling for FFH states and pass
the MWAIT hint to the mwait_play_dead code.

Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/20250205155211.329780-3-artem.bityutskiy%40linux.intel.com
2025-02-05 10:44:52 -08:00
Patryk Wlazlyn
a7dd183f0b x86/smp: Allow calling mwait_play_dead with an arbitrary hint
Introduce a helper function to allow offlined CPUs to enter idle states
with a specific MWAIT hint. The new helper will be used in subsequent
patches by the acpi_idle and intel_idle drivers.

No functional change intended.

Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/20250205155211.329780-2-artem.bityutskiy%40linux.intel.com
2025-02-05 10:44:52 -08:00
Tony Luck
1e66d6cf88 x86/cpu: Fix #define name for Intel CPU model 0x5A
This CPU was mistakenly given the name INTEL_ATOM_AIRMONT_MID. But it
uses a Silvermont core, not Airmont.

Change #define name to INTEL_ATOM_SILVERMONT_MID2

Reported-by: Christian Ludloff <ludloff@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241007165701.19693-1-tony.luck%40intel.com
2025-02-04 10:05:53 -08:00
Mike Rapoport (Microsoft)
1d7e707af4 Revert "x86/module: prepare module loading for ROX allocations of text"
The module code does not create a writable copy of the executable memory
anymore so there is no need to handle it in module relocation and
alternatives patching.

This reverts commit 9bfc4824fd.

Signed-off-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250126074733.1384926-8-rppt@kernel.org
2025-02-03 11:46:02 +01:00
Linus Torvalds
c545cd3276 x86/mm changes for v6.14:
- The biggest changes are the TLB flushing scalability optimizations,
    to update the mm_cpumask lazily and related changes. This feature
    has both a track record and a continued risk of performance regressions,
    so it was already delayed by a cycle - but it's all 100% perfect now™.
    (Rik van Riel)
 
  - Also miscellaneous fixes and cleanups. (Gautam Somani,
    Kirill A. Shutemov, Sebastian Andrzej Siewior)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeclXoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iDixAAjmTv/3KBuXaW/EoqGkyr/dgJld/Cww5a
 4yyM6pbVOkiP+pmSTiHChhn07A4eB1TMCP0RJHXUgsCr6VLY8+68MdafCMIn9hWK
 mZYbCFF2yWy2EP4a26ifTi/3P355x5WILxJH5K4fHxcsXjRy5LgCLaq0tObEqnZ8
 OAGIBw+g3t7CYurqlKfYiVSUiUG8PbXbS9Bh/0SjRe5FRbJDre3XJy9ks2c83wHU
 anPe5qpkw3mg8hPiFQfv3EYyGe1NhAs9hBMYLKqUyyxZEixymZDsvjYnOe154OMI
 9xk3XpeFFejwvBJ1pfSS3V5svm5sqtnRpZSivUl/gsT7LM65N8RqKMrTvcpT+fm7
 cQs8JK3LP+S2ih3S4wTZRdVGnIQGzqHkp9R6e8T4r9FQ2688mk/OvqJOCZEAcPgx
 VRHiMXtgZ3e8OsMiY+82TGt9wyujCR/kk+hzgXtNC1Lr++jCz848n3UcUe+wvzzw
 Lo8LGGdAzBRviwiwwrRxCYKtlUtkIwbIKtfswv5pfapji2cTHckhvuKAcujpvaXd
 +qgnX8XNVZWoG57tN02jZ8ZgAFgZlV2A03WG5e0c1wb4/3AnGQDGpCEWX2/lMj1J
 U/FFwNA6+jzcVMYyN/LQAETv0Go7sJOVTTie7mAHEhyHvxvb2YfV9VJ60V2WBKn5
 znIuU0l2qyQ=
 =g00u
 -----END PGP SIGNATURE-----

Merge tag 'x86-mm-2025-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Ingo Molnar:

 - The biggest changes are the TLB flushing scalability optimizations,
   to update the mm_cpumask lazily and related changes.

   This feature has both a track record and a continued risk of
   performance regressions, so it was already delayed by a cycle - but
   it's all 100% perfect now™ (Rik van Riel)

 - Also miscellaneous fixes and cleanups. (Gautam Somani, Kirill
   Shutemov, Sebastian Andrzej Siewior)

* tag 'x86-mm-2025-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Remove unnecessary include of <linux/extable.h>
  x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
  x86/mm/selftests: Fix typo in lam.c
  x86/mm/tlb: Only trim the mm_cpumask once a second
  x86/mm/tlb: Also remove local CPU from mm_cpumask if stale
  x86/mm/tlb: Add tracepoint for TLB flush IPI to stale CPU
  x86/mm/tlb: Update mm_cpumask lazily
2025-01-31 10:39:07 -08:00
Linus Torvalds
2a9f04bde0 RTC for 6.13
Subsystem:
  - use boolean values with device_init_wakeup()
 
 Drivers:
  - pcf2127: add BSM support
  - pcf85063: fix possible out of bound write
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEBqsFVZXh8s/0O5JiY6TcMGxwOjIFAmeb/UQACgkQY6TcMGxw
 OjLgTA//fUNMueHNrdwEA2RATolmOpfz5tlplE2DPfIAaknJDOpZFZo6GuVsMb9S
 B0oIdwfpNa9+cJyK2cA5Bvjqh/TeLJCrH7UPbZXBczQQG3YFmwsoFhpcjJAR2JDr
 es72pLK+uALrWI//pN3y7cbtfOXm+5rGBoKCWxJTuFdWpuxbrgs7bBSDY3EGXefd
 jR+RU3IkJSmjauSv5IYfkmg0g5H0yREwQkPk2ymZvIf0Vao9XsTKlWdUucdugfDV
 7nPIcIdgsYKyB/+U1WmBo2eu/kcAz1cjj8aAfViYww0MgGvtU4heJx3v+Gpp5O8D
 D8xGUAIp28UG6pj9BNJBOP/Y3fahTnqGp9HvyCl0DnaqZYfQPLlqCOkXDlktfGB5
 YBRnzkecRqzJAFroTrrx8E9CIvp2u0kGBOikDKZ/l1dleYiWVJVmALfXH0KFLsVR
 ByiPKayaq8kGCqjZR8Ge1QDd4y8vQ+QqXQvADrPnRmreck8nqLCZrvsReGWjMpWq
 x0gSrhZU6k8tyYiufDO2JyyxoD96bHc8w6FmQquMKylzjVjNcoEjPLToIReyb+h1
 ql2JfTeY4jkcyFj/H6vkrtehumYNxzl2nHP8QtV4yOgbfn/UTxdAfAsB9m9e7AAz
 gdHsm2pt6gFkxirm0xST/Z5CohZRR+9/m9agvbM1l2Lu5q+WFu4=
 =BxV0
 -----END PGP SIGNATURE-----

Merge tag 'rtc-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

Pull RTC updates from Alexandre Belloni:
 "Not much this cycle, there are multiple small fixes.

  Core:
   - use boolean values with device_init_wakeup()

  Drivers:
   - pcf2127: add BSM support
   - pcf85063: fix possible out of bounds write"

* tag 'rtc-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
  rtc: pcf2127: add BSM support
  rtc: Remove hpet_rtc_dropped_irq()
  dt-bindings: rtc: mxc: Document fsl,imx31-rtc
  rtc: stm32: Use syscon_regmap_lookup_by_phandle_args
  rtc: zynqmp: Fix optional clock name property
  rtc: loongson: clear TOY_MATCH0_REG in loongson_rtc_isr()
  rtc: pcf85063: fix potential OOB write in PCF85063 NVMEM read
  rtc: tps6594: Fix integer overflow on 32bit systems
  rtc: use boolean values with device_init_wakeup()
  rtc: RTC_DRV_SPEAR should not default to y when compile-testing
2025-01-30 17:50:02 -08:00
Linus Torvalds
47cbb41a2e ACPI fixes for 6.14-rc1
Add a new ACPI-related quirk for Vexia EDU ATLA 10 tablet 5V (Hans
 de Goede) and fix the MADT parsing code so that CPUs with different
 entry types (LAPIC and x2APIC) are initialized in the order in which
 they appear in the MADT as required by the ACPI specification (Zhang
 Rui).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmeb5+ESHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxyScP/1wBdhFmU2RrFNoCfZLBYsp4tsQjNyFA
 D4Wib8pMwlydyFW+Nig8G7ldEomZoEslhmLlKhz0lMX0JSfz9gnelKrrEXVxYS22
 cCeNFTwl58gBk1R6uOqfJSaE851G+Z+3LxN1OOiV5QyP59stfBxz0Nqp36e9yABR
 kRfY8LkHe8yD+G+9QaMPOowQm0Q3Gv/GMUAAUEOLR/qPoAymOYxCLbs8V5IrOec1
 OJkSzvaoD6UdSR62an+6vx5nA4CgflZnwn1g+OLVPSjlcY64zl+HlqrjvntdPyLm
 MTiICp84nXPZQDWMvGV6drZqIOVokuZay2BFg6hAqfMgU27xrZ6B7lmjq/uvhVUb
 Ejmtst6MWaVsrHuHaqv6+gBdiuQqtDiHs1WsIdv1i25DhQvLWSXrypkcLhtArO5V
 qOmr+cI2aR0WTmV3dG+IBdE9UczCGKZl92r6Rr4aAnbF6OkBLb059t40s64XjJf6
 gHApQS0vYnNdXK7/21RkGP7mz+Zr7YiBPyoGGiIKEbTJB9UwoITVfUFjfUxYSUF1
 HLMzGvh/Ra3NcHrHNAHbzYZHuNev8qloPDMJ9lVVz+AEh6qVSKDWYxmeBWY6JBnK
 7/5SR1e/5x+WP5m/1PZnTt4UOmvwvGE1O63k6wrdqr3k6C25xwfcS7Yggbd/VWLN
 q+ZPtietj8ML
 =JXCp
 -----END PGP SIGNATURE-----

Merge tag 'acpi-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
 "Add a new ACPI-related quirk for Vexia EDU ATLA 10 tablet 5V (Hans de
  Goede) and fix the MADT parsing code so that CPUs with different entry
  types (LAPIC and x2APIC) are initialized in the order in which they
  appear in the MADT as required by the ACPI specification (Zhang Rui)"

* tag 'acpi-6.14-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  x86/acpi: Fix LAPIC/x2APIC parsing order
  ACPI: x86: Add skip i2c clients quirk for Vexia EDU ATLA 10 tablet 5V
2025-01-30 15:22:18 -08:00
Joel Granados
1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Linus Torvalds
9c5968db9e The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
 
 - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
   page allocator so we end up with the ability to allocate and free
   zero-refcount pages.  So that callers (ie, slab) can avoid a refcount
   inc & dec.
 
 - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
   large folios other than PMD-sized ones.
 
 - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
   fixes for this small built-in kernel selftest.
 
 - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
   the mapletree code.
 
 - "mm: fix format issues and param types" from Keren Sun implements a
   few minor code cleanups.
 
 - "simplify split calculation" from Wei Yang provides a few fixes and a
   test for the mapletree code.
 
 - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
   continues the work of moving vma-related code into the (relatively) new
   mm/vma.c.
 
 - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
   Hildenbrand cleans up and rationalizes handling of gfp flags in the page
   allocator.
 
 - "readahead: Reintroduce fix for improper RA window sizing" from Jan
   Kara is a second attempt at fixing a readahead window sizing issue.  It
   should reduce the amount of unnecessary reading.
 
 - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
   addresses an issue where "huge" amounts of pte pagetables are
   accumulated
   (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
   Qi's series addresses this windup by synchronously freeing PTE memory
   within the context of madvise(MADV_DONTNEED).
 
 - "selftest/mm: Remove warnings found by adding compiler flags" from
   Muhammad Usama Anjum fixes some build warnings in the selftests code
   when optional compiler warnings are enabled.
 
 - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
   Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
 
 - "pkeys kselftests improvements" from Kevin Brodsky implements various
   fixes and cleanups in the MM selftests code, mainly pertaining to the
   pkeys tests.
 
 - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
   estimate application working set size.
 
 - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
   provides some cleanups to memcg's hugetlb charging logic.
 
 - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
   removes the global swap cgroup lock.  A speedup of 10% for a tmpfs-based
   kernel build was demonstrated.
 
 - "zram: split page type read/write handling" from Sergey Senozhatsky
   has several fixes and cleaups for zram in the area of zram_write_page().
   A watchdog softlockup warning was eliminated.
 
 - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
   cleans up the pagetable destructor implementations.  A rare
   use-after-free race is fixed.
 
 - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
   simplifies and cleans up the debugging code in the VMA merging logic.
 
 - "Account page tables at all levels" from Kevin Brodsky cleans up and
   regularizes the pagetable ctor/dtor handling.  This results in
   improvements in accounting accuracy.
 
 - "mm/damon: replace most damon_callback usages in sysfs with new core
   functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
   file interface logic.
 
 - "mm/damon: enable page level properties based monitoring" from
   SeongJae Park increases the amount of information which is presented in
   response to DAMOS actions.
 
 - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
   DAMON's long-deprecated debugfs interfaces.  Thus the migration to sysfs
   is completed.
 
 - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
   Xu cleans up and generalizes the hugetlb reservation accounting.
 
 - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
   removes a never-used feature of the alloc_pages_bulk() interface.
 
 - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
   extends DAMOS filters to support not only exclusion (rejecting), but
   also inclusion (allowing) behavior.
 
 - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
   "introduces a new memory descriptor for zswap.zpool that currently
   overlaps with struct page for now.  This is part of the effort to reduce
   the size of struct page and to enable dynamic allocation of memory
   descriptors."
 
 - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
   simplifies the swap allocator locking.  A speedup of 400% was
   demonstrated for one workload.  As was a 35% reduction for kernel build
   time with swap-on-zram.
 
 - "mm: update mips to use do_mmap(), make mmap_region() internal" from
   Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
   mmap_region() can be made MM-internal.
 
 - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
   regressions and otherwise improves MGLRU performance.
 
 - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
   updates DAMON documentation.
 
 - "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
 
 - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
   provides various cleanups in the areas of hugetlb folios, THP folios and
   migration.
 
 - "Uncached buffered IO" from Jens Axboe implements the new
   RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
   reading and writing.  To permite userspace to address issues with
   massive buildup of useless pagecache when reading/writing fast devices.
 
 - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
   Weißschuh fixes and optimizes some of the MM selftests.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
 jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
 uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
 =Ib2s
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "The various patchsets are summarized below. Plus of course many
  indivudual patches which are described in their changelogs.

   - "Allocate and free frozen pages" from Matthew Wilcox reorganizes
     the page allocator so we end up with the ability to allocate and
     free zero-refcount pages. So that callers (ie, slab) can avoid a
     refcount inc & dec

   - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
     use large folios other than PMD-sized ones

   - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
     and fixes for this small built-in kernel selftest

   - "mas_anode_descend() related cleanup" from Wei Yang tidies up part
     of the mapletree code

   - "mm: fix format issues and param types" from Keren Sun implements a
     few minor code cleanups

   - "simplify split calculation" from Wei Yang provides a few fixes and
     a test for the mapletree code

   - "mm/vma: make more mmap logic userland testable" from Lorenzo
     Stoakes continues the work of moving vma-related code into the
     (relatively) new mm/vma.c

   - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
     Hildenbrand cleans up and rationalizes handling of gfp flags in the
     page allocator

   - "readahead: Reintroduce fix for improper RA window sizing" from Jan
     Kara is a second attempt at fixing a readahead window sizing issue.
     It should reduce the amount of unnecessary reading

   - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
     addresses an issue where "huge" amounts of pte pagetables are
     accumulated:

       https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/

     Qi's series addresses this windup by synchronously freeing PTE
     memory within the context of madvise(MADV_DONTNEED)

   - "selftest/mm: Remove warnings found by adding compiler flags" from
     Muhammad Usama Anjum fixes some build warnings in the selftests
     code when optional compiler warnings are enabled

   - "mm: don't use __GFP_HARDWALL when migrating remote pages" from
     David Hildenbrand tightens the allocator's observance of
     __GFP_HARDWALL

   - "pkeys kselftests improvements" from Kevin Brodsky implements
     various fixes and cleanups in the MM selftests code, mainly
     pertaining to the pkeys tests

   - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
     estimate application working set size

   - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
     provides some cleanups to memcg's hugetlb charging logic

   - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
     removes the global swap cgroup lock. A speedup of 10% for a
     tmpfs-based kernel build was demonstrated

   - "zram: split page type read/write handling" from Sergey Senozhatsky
     has several fixes and cleaups for zram in the area of
     zram_write_page(). A watchdog softlockup warning was eliminated

   - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
     Brodsky cleans up the pagetable destructor implementations. A rare
     use-after-free race is fixed

   - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
     simplifies and cleans up the debugging code in the VMA merging
     logic

   - "Account page tables at all levels" from Kevin Brodsky cleans up
     and regularizes the pagetable ctor/dtor handling. This results in
     improvements in accounting accuracy

   - "mm/damon: replace most damon_callback usages in sysfs with new
     core functions" from SeongJae Park cleans up and generalizes
     DAMON's sysfs file interface logic

   - "mm/damon: enable page level properties based monitoring" from
     SeongJae Park increases the amount of information which is
     presented in response to DAMOS actions

   - "mm/damon: remove DAMON debugfs interface" from SeongJae Park
     removes DAMON's long-deprecated debugfs interfaces. Thus the
     migration to sysfs is completed

   - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
     Peter Xu cleans up and generalizes the hugetlb reservation
     accounting

   - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
     removes a never-used feature of the alloc_pages_bulk() interface

   - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
     extends DAMOS filters to support not only exclusion (rejecting),
     but also inclusion (allowing) behavior

   - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
     introduces a new memory descriptor for zswap.zpool that currently
     overlaps with struct page for now. This is part of the effort to
     reduce the size of struct page and to enable dynamic allocation of
     memory descriptors

   - "mm, swap: rework of swap allocator locks" from Kairui Song redoes
     and simplifies the swap allocator locking. A speedup of 400% was
     demonstrated for one workload. As was a 35% reduction for kernel
     build time with swap-on-zram

   - "mm: update mips to use do_mmap(), make mmap_region() internal"
     from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
     mmap_region() can be made MM-internal

   - "mm/mglru: performance optimizations" from Yu Zhao fixes a few
     MGLRU regressions and otherwise improves MGLRU performance

   - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
     Park updates DAMON documentation

   - "Cleanup for memfd_create()" from Isaac Manjarres does that thing

   - "mm: hugetlb+THP folio and migration cleanups" from David
     Hildenbrand provides various cleanups in the areas of hugetlb
     folios, THP folios and migration

   - "Uncached buffered IO" from Jens Axboe implements the new
     RWF_DONTCACHE flag which provides synchronous dropbehind for
     pagecache reading and writing. To permite userspace to address
     issues with massive buildup of useless pagecache when
     reading/writing fast devices

   - "selftests/mm: virtual_address_range: Reduce memory" from Thomas
     Weißschuh fixes and optimizes some of the MM selftests"

* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm/compaction: fix UBSAN shift-out-of-bounds warning
  s390/mm: add missing ctor/dtor on page table upgrade
  kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
  tools: add VM_WARN_ON_VMG definition
  mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
  seqlock: add missing parameter documentation for raw_seqcount_try_begin()
  mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
  mm/page_alloc: remove the incorrect and misleading comment
  zram: remove zcomp_stream_put() from write_incompressible_page()
  mm: separate move/undo parts from migrate_pages_batch()
  mm/kfence: use str_write_read() helper in get_access_type()
  selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
  kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
  selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
  selftests/mm: vm_util: split up /proc/self/smaps parsing
  selftests/mm: virtual_address_range: unmap chunks after validation
  selftests/mm: virtual_address_range: mmap() without PROT_WRITE
  selftests/memfd/memfd_test: fix possible NULL pointer dereference
  mm: add FGP_DONTCACHE folio creation flag
  mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
  ...
2025-01-26 18:36:23 -08:00
Linus Torvalds
c159dfbdd4 Mainly individually changelogged singleton patches. The patch series in
this pull are:
 
 - "lib min_heap: Improve min_heap safety, testing, and documentation"
   from Kuan-Wei Chiu provides various tightenings to the min_heap library
   code.
 
 - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some
   cleanup and Rust preparation in the xarray library code.
 
 - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes
   pathnames in some code comments.
 
 - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the
   new secs_to_jiffies() in various places where that is appropriate.
 
 - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
   switches two filesystems to the new mount API.
 
 - "Convert ocfs2 to use folios" from Matthew Wilcox does that.
 
 - "Remove get_task_comm() and print task comm directly" from Yafang Shao
   removes now-unneeded calls to get_task_comm() in various places.
 
 - "squashfs: reduce memory usage and update docs" from Phillip Lougher
   implements some memory savings in squashfs and performs some
   maintainability work.
 
 - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
   tightens the sort code's behaviour and adds some maintenance work.
 
 - "nilfs2: protect busy buffer heads from being force-cleared" from
   Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a
   corrupted image.
 
 - "nilfs2: fix kernel-doc comments for function return values" from
   Ryusuke Konishi fixes some nilfs kerneldoc.
 
 - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
   addresses some nilfs BUG_ONs which syzbot was able to trigger.
 
 - "minmax.h: Cleanups and minor optimisations" from David Laight
   does some maintenance work on the min/max library code.
 
 - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work
   on the xarray library code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA
 jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX
 mJnaZ1/ibpEpearrChCViApQtcyEGQI=
 =ti4E
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Mainly individually changelogged singleton patches. The patch series
  in this pull are:

   - "lib min_heap: Improve min_heap safety, testing, and documentation"
     from Kuan-Wei Chiu provides various tightenings to the min_heap
     library code

   - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms
     some cleanup and Rust preparation in the xarray library code

   - "Update reference to include/asm-<arch>" from Geert Uytterhoeven
     fixes pathnames in some code comments

   - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses
     the new secs_to_jiffies() in various places where that is
     appropriate

   - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen
     switches two filesystems to the new mount API

   - "Convert ocfs2 to use folios" from Matthew Wilcox does that

   - "Remove get_task_comm() and print task comm directly" from Yafang
     Shao removes now-unneeded calls to get_task_comm() in various
     places

   - "squashfs: reduce memory usage and update docs" from Phillip
     Lougher implements some memory savings in squashfs and performs
     some maintainability work

   - "lib: clarify comparison function requirements" from Kuan-Wei Chiu
     tightens the sort code's behaviour and adds some maintenance work

   - "nilfs2: protect busy buffer heads from being force-cleared" from
     Ryusuke Konishi fixes an issues in nlifs when the fs is presented
     with a corrupted image

   - "nilfs2: fix kernel-doc comments for function return values" from
     Ryusuke Konishi fixes some nilfs kerneldoc

   - "nilfs2: fix issues with rename operations" from Ryusuke Konishi
     addresses some nilfs BUG_ONs which syzbot was able to trigger

   - "minmax.h: Cleanups and minor optimisations" from David Laight does
     some maintenance work on the min/max library code

   - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance
     work on the xarray library code"

* tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits)
  ocfs2: use str_yes_no() and str_no_yes() helper functions
  include/linux/lz4.h: add some missing macros
  Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent
  Xarray: remove repeat check in xas_squash_marks()
  Xarray: distinguish large entries correctly in xas_split_alloc()
  Xarray: move forward index correctly in xas_pause()
  Xarray: do not return sibling entries from xas_find_marked()
  ipc/util.c: complete the kernel-doc function descriptions
  gcov: clang: use correct function param names
  latencytop: use correct kernel-doc format for func params
  minmax.h: remove some #defines that are only expanded once
  minmax.h: simplify the variants of clamp()
  minmax.h: move all the clamp() definitions after the min/max() ones
  minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
  minmax.h: reduce the #define expansion of min(), max() and clamp()
  minmax.h: update some comments
  minmax.h: add whitespace around operators and after commas
  nilfs2: do not update mtime of renamed directory that is not moved
  nilfs2: handle errors that nilfs_prepare_chunk() may return
  CREDITS: fix spelling mistake
  ...
2025-01-26 17:50:53 -08:00
Guo Weikang
c6f239796b mm/memblock: add memblock_alloc_or_panic interface
Before SLUB initialization, various subsystems used memblock_alloc to
allocate memory.  In most cases, when memory allocation fails, an
immediate panic is required.  To simplify this behavior and reduce
repetitive checks, introduce `memblock_alloc_or_panic`.  This function
ensures that memory allocation failures result in a panic automatically,
improving code readability and consistency across subsystems that require
this behavior.

[guoweikang.kernel@gmail.com: arch/s390: save_area_alloc default failure behavior changed to panic]
  Link: https://lkml.kernel.org/r/20250109033136.2845676-1-guoweikang.kernel@gmail.com
  Link: https://lore.kernel.org/lkml/Z2fknmnNtiZbCc7x@kernel.org/
Link: https://lkml.kernel.org/r/20250102072528.650926-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>	[s390]
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:38 -08:00
Qi Zheng
ee0934b035 x86: pgtable: move pagetable_dtor() to __tlb_remove_table()
Move pagetable_dtor() to __tlb_remove_table(), so that ptlock and page
table pages can be freed together (regardless of whether RCU is used). 
This prevents the use-after-free problem where the ptlock is freed
immediately but the page table pages is freed later via RCU.

Link: https://lkml.kernel.org/r/27b3cdc8786bebd4f748380bf82f796482718504.1736317725.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:22 -08:00
Qi Zheng
0b6476f939 x86: pgtable: convert __tlb_remove_table() to use struct ptdesc
Convert __tlb_remove_table() to use struct ptdesc, which will help to move
pagetable_dtor() to __tlb_remove_table().

And page tables shouldn't have swap cache, so use pagetable_free() instead
of free_page_and_swap_cache() to free page table pages.

Link: https://lkml.kernel.org/r/39f60f93143ff77cf5d6b3c3e75af0ffc1480adb.1736317725.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25 20:22:22 -08:00
Linus Torvalds
382e391365 hyperv-next for v6.14
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmeTFQ4THHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXqMWB/4uHjnu50u+m00OwXAKQr6i92zh50BZ
 RQragd9s9C8tuUNwPDmS/ct2BNAhoy43KJ0ClegdZjKxT1Ys8cLv4Wr5CaGckqWq
 +WCHqTgt+cPe0vUofqahB5wiAZMsnBgzFkV/OfFwBx0wkub9y5T3qVq5KapYlaDI
 7Gftb+wg1AAsrdZ/HuLRy5ZVvkM/73rU2uoi8WXjr/T14E1krCFR/qirLd1OXo6Q
 Jb97qhnCt/N9JPwIq5/VnYWde5Mpqz6UgtA2rFLDXgNGz+h9/ND6ecWFHjZWNVdc
 AKWZTO5t+fRVBOSyahoyRoYSntPw3wlxyL7A2/54h6j4Dex7wLt6NQBj
 =empO
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv updates from Wei Liu:

 - Introduce a new set of Hyper-V headers in include/hyperv and replace
   the old hyperv-tlfs.h with the new headers (Nuno Das Neves)

 - Fixes for the Hyper-V VTL mode (Roman Kisel)

 - Fixes for cpu mask usage in Hyper-V code (Michael Kelley)

 - Document the guest VM hibernation behaviour (Michael Kelley)

 - Miscellaneous fixes and cleanups (Jacob Pan, John Starks, Naman Jain)

* tag 'hyperv-next-signed-20250123' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  Documentation: hyperv: Add overview of guest VM hibernation
  hyperv: Do not overlap the hvcall IO areas in hv_vtl_apicid_to_vp_id()
  hyperv: Do not overlap the hvcall IO areas in get_vtl()
  hyperv: Enable the hypercall output page for the VTL mode
  hv_balloon: Fallback to generic_online_page() for non-HV hot added mem
  Drivers: hv: vmbus: Log on missing offers if any
  Drivers: hv: vmbus: Wait for boot-time offers during boot and resume
  uio_hv_generic: Add a check for HV_NIC for send, receive buffers setup
  iommu/hyper-v: Don't assume cpu_possible_mask is dense
  Drivers: hv: Don't assume cpu_possible_mask is dense
  x86/hyperv: Don't assume cpu_possible_mask is dense
  hyperv: Remove the now unused hyperv-tlfs.h files
  hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
  hyperv: Add new Hyper-V headers in include/hyperv
  hyperv: Clean up unnecessary #includes
  hyperv: Move hv_connection_id to hyperv-tlfs.h
2025-01-25 09:22:55 -08:00
Linus Torvalds
5b7f7234ff x86/boot changes for v6.14:
- A large and involved preparatory series to pave the way to add exception
    handling for relocate_kernel - which will be a debugging facility that
    has aided in the field to debug an exceptionally hard to debug early boot bug.
    Plus assorted cleanups and fixes that were discovered along the way,
    by David Woodhouse:
 
       - Clean up and document register use in relocate_kernel_64.S
       - Use named labels in swap_pages in relocate_kernel_64.S
       - Only swap pages for ::preserve_context mode
       - Allocate PGD for x86_64 transition page tables separately
       - Copy control page into place in machine_kexec_prepare()
       - Invoke copy of relocate_kernel() instead of the original
       - Move relocate_kernel to kernel .data section
       - Add data section to relocate_kernel
       - Drop page_list argument from relocate_kernel()
       - Eliminate writes through kernel mapping of relocate_kernel page
       - Clean up register usage in relocate_kernel()
       - Mark relocate_kernel page as ROX instead of RWX
       - Disable global pages before writing to control page
       - Ensure preserve_context flag is set on return to kernel
       - Use correct swap page in swap_pages function
       - Fix stack and handling of re-entry point for ::preserve_context
       - Mark machine_kexec() with __nocfi
       - Cope with relocate_kernel() not being at the start of the page
       - Use typedef for relocate_kernel_fn function prototype
       - Fix location of relocate_kernel with -ffunction-sections (fix by Nathan Chancellor)
 
  - A series to remove the last remaining absolute symbol references from
    .head.text, and enforce this at build time, by Ard Biesheuvel:
 
       - Avoid WARN()s and panic()s in early boot code
       - Don't hang but terminate on failure to remap SVSM CA
       - Determine VA/PA offset before entering C code
       - Avoid intentional absolute symbol references in .head.text
       - Disable UBSAN in early boot code
       - Move ENTRY_TEXT to the start of the image
       - Move .head.text into its own output section
       - Reject absolute references in .head.text
 
  - Which build-time enforcement uncovered a handful of bugs of essentially
    non-working code, and a wrokaround for a toolchain bug, fixed by
    Ard Biesheuvel as well:
 
       - Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
       - Disable UBSAN on SEV code that may execute very early
       - Disable ftrace branch profiling in SEV startup code
 
  - And miscellaneous cleanups:
 
        - kexec_core: Add and update comments regarding the KEXEC_JUMP flow (Rafael J. Wysocki)
        - x86/sysfs: Constify 'struct bin_attribute' (Thomas Weißschuh)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeQDmURHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1inwRAAjD5QR/Yu7Yiv2nM/ncUwAItsFkv9Jk4Y
 HPGz9qNJoZxKxuZVj9bfQhWDe3g6VLnlDYgatht9BsyP5b12qZrUe+yp/TOH54Z3
 wPD+U/jun4jiSr7oJkJC+bFn+a/tL39pB8Y6m+jblacgVglleO3SH5fBWNE1UbIV
 e2iiNxi0bfuHy3wquegnKaMyF1e7YLw1p5laGSwwk21g5FjT7cLQOqC0/9u8u9xX
 Ha+iaod7JOcjiQOqIt/MV57ldWEFCrUhQozRV3tK5Ptf5aoGFpisgQoRoduWUtFz
 UbHiHhv6zE4DOIUzaAbJjYfR1Z/LCviwON97XJgeOOkJaULF7yFCfhGxKSyQoMIh
 qZtlBs4VsGl2/dOl+iW6xKwgRiNundTzSQtt5D/xuFz5LnDxe/SrlZnYp8lOPP8R
 w9V2b/fC0YxmUzEW6EDhBqvfuScKiNWoic47qvYfZPaWyg1ESpvWTIh6AKB5ThUR
 upgJQdA4HW+y5C57uHW40TSe3xEeqM3+Slk0jxLElP7/yTul5r7jrjq2EkwaAv/j
 6/0LsMSr33r9fVFeMP1qLXPUaipcqTWWTpeeTr8NBGUcvOKzw5SltEG4NihzCyhF
 3/UMQhcQ6KE3iFMPlRu4hV7ZV4gErZmLoRwh9Uk28f2Xx8T95uoV8KTg1/sRZRTo
 uQLeRxYnyrw=
 =vGWS
 -----END PGP SIGNATURE-----

Merge tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot updates from Ingo Molnar:

 - A large and involved preparatory series to pave the way to add
   exception handling for relocate_kernel - which will be a debugging
   facility that has aided in the field to debug an exceptionally hard
   to debug early boot bug. Plus assorted cleanups and fixes that were
   discovered along the way, by David Woodhouse:

      - Clean up and document register use in relocate_kernel_64.S
      - Use named labels in swap_pages in relocate_kernel_64.S
      - Only swap pages for ::preserve_context mode
      - Allocate PGD for x86_64 transition page tables separately
      - Copy control page into place in machine_kexec_prepare()
      - Invoke copy of relocate_kernel() instead of the original
      - Move relocate_kernel to kernel .data section
      - Add data section to relocate_kernel
      - Drop page_list argument from relocate_kernel()
      - Eliminate writes through kernel mapping of relocate_kernel page
      - Clean up register usage in relocate_kernel()
      - Mark relocate_kernel page as ROX instead of RWX
      - Disable global pages before writing to control page
      - Ensure preserve_context flag is set on return to kernel
      - Use correct swap page in swap_pages function
      - Fix stack and handling of re-entry point for ::preserve_context
      - Mark machine_kexec() with __nocfi
      - Cope with relocate_kernel() not being at the start of the page
      - Use typedef for relocate_kernel_fn function prototype
      - Fix location of relocate_kernel with -ffunction-sections (fix by Nathan Chancellor)

 - A series to remove the last remaining absolute symbol references from
   .head.text, and enforce this at build time, by Ard Biesheuvel:

      - Avoid WARN()s and panic()s in early boot code
      - Don't hang but terminate on failure to remap SVSM CA
      - Determine VA/PA offset before entering C code
      - Avoid intentional absolute symbol references in .head.text
      - Disable UBSAN in early boot code
      - Move ENTRY_TEXT to the start of the image
      - Move .head.text into its own output section
      - Reject absolute references in .head.text

 - The above build-time enforcement uncovered a handful of bugs of
   essentially non-working code, and a wrokaround for a toolchain bug,
   fixed by Ard Biesheuvel as well:

      - Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
      - Disable UBSAN on SEV code that may execute very early
      - Disable ftrace branch profiling in SEV startup code

 - And miscellaneous cleanups:

      - kexec_core: Add and update comments regarding the KEXEC_JUMP flow (Rafael J. Wysocki)
      - x86/sysfs: Constify 'struct bin_attribute' (Thomas Weißschuh)"

* tag 'x86-boot-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  x86/sev: Disable ftrace branch profiling in SEV startup code
  x86/kexec: Use typedef for relocate_kernel_fn function prototype
  x86/kexec: Cope with relocate_kernel() not being at the start of the page
  kexec_core: Add and update comments regarding the KEXEC_JUMP flow
  x86/kexec: Mark machine_kexec() with __nocfi
  x86/kexec: Fix location of relocate_kernel with -ffunction-sections
  x86/kexec: Fix stack and handling of re-entry point for ::preserve_context
  x86/kexec: Use correct swap page in swap_pages function
  x86/kexec: Ensure preserve_context flag is set on return to kernel
  x86/kexec: Disable global pages before writing to control page
  x86/sev: Don't hang but terminate on failure to remap SVSM CA
  x86/sev: Disable UBSAN on SEV code that may execute very early
  x86/boot/64: Fix spurious undefined reference when CONFIG_X86_5LEVEL=n, on GCC-12
  x86/sysfs: Constify 'struct bin_attribute'
  x86/kexec: Mark relocate_kernel page as ROX instead of RWX
  x86/kexec: Clean up register usage in relocate_kernel()
  x86/kexec: Eliminate writes through kernel mapping of relocate_kernel page
  x86/kexec: Drop page_list argument from relocate_kernel()
  x86/kexec: Add data section to relocate_kernel
  x86/kexec: Move relocate_kernel to kernel .data section
  ...
2025-01-24 05:54:26 -08:00
Zhang Rui
0141978ae7 x86/acpi: Fix LAPIC/x2APIC parsing order
On some systems, the same CPU (with the same APIC ID) is assigned a
different logical CPU id after commit ec9aedb2aa ("x86/acpi: Ignore
invalid x2APIC entries").

This means that Linux enumerates the CPUs in a different order, which
violates ACPI specification[1] that states:

  "OSPM should initialize processors in the order that they appear in
   the MADT"

The problematic commit parses all LAPIC entries before any x2APIC
entries, aiming to ignore x2APIC entries with APIC ID < 255 when valid
LAPIC entries exist. However, it disrupts the CPU enumeration order on
systems where x2APIC entries precede LAPIC entries in the MADT.

Fix this problem by:

 1) Parsing LAPIC entries first without registering them in the
    topology to evaluate whether valid LAPIC entries exist.

 2) Restoring the MADT in order parser which invokes either the LAPIC
    or the X2APIC parser function depending on the entry type.

The X2APIC parser still ignores entries < 0xff in case that #1 found
valid LAPIC entries independent of their position in the MADT table.

Link: https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#madt-processor-local-apic-sapic-structure-entry-order
Cc: All applicable <stable@vger.kernel.org>
Reported-by: Jim Mattson <jmattson@google.com>
Closes: https://lore.kernel.org/all/20241010213136.668672-1-jmattson@google.com/
Fixes: ec9aedb2aa ("x86/acpi: Ignore invalid x2APIC entries")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Tested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20250117081420.4046737-1-rui.zhang@intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-24 14:27:54 +01:00
Linus Torvalds
2e04247f7c ftrace updates for v6.14:
- Have fprobes built on top of function graph infrastructure
 
   The fprobe logic is an optimized kprobe that uses ftrace to attach to
   functions when a probe is needed at the start or end of the function. The
   fprobe and kretprobe logic implements a similar method as the function
   graph tracer to trace the end of the function. That is to hijack the
   return address and jump to a trampoline to do the trace when the function
   exits. To do this, a shadow stack needs to be created to store the
   original return address.  Fprobes and function graph do this slightly
   differently. Fprobes (and kretprobes) has slots per callsite that are
   reserved to save the return address. This is fine when just a few points
   are traced. But users of fprobes, such as BPF programs, are starting to add
   many more locations, and this method does not scale.
 
   The function graph tracer was created to trace all functions in the
   kernel. In order to do this, when function graph tracing is started, every
   task gets its own shadow stack to hold the return address that is going to
   be traced. The function graph tracer has been updated to allow multiple
   users to use its infrastructure. Now have fprobes be one of those users.
   This will also allow for the fprobe and kretprobe methods to trace the
   return address to become obsolete. With new technologies like CFI that
   need to know about these methods of hijacking the return address, going
   toward a solution that has only one method of doing this will make the
   kernel less complex.
 
 - Cleanup with guard() and free() helpers
 
   There were several places in the code that had a lot of "goto out" in the
   error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.
 
 - Remove disabling of interrupts in the function graph tracer
 
   When function graph tracer was first introduced, it could race with
   interrupts and NMIs. To prevent that race, it would disable interrupts and
   not trace NMIs. But the code has changed to allow NMIs and also
   interrupts. This change was done a long time ago, but the disabling of
   interrupts was never removed. Remove the disabling of interrupts in the
   function graph tracer is it is not needed. This greatly improves its
   performance.
 
 - Allow the :mod: command to enable tracing module functions on the kernel
   command line.
 
   The function tracer already has a way to enable functions to be traced in
   modules by writing ":mod:<module>" into set_ftrace_filter. That will
   enable either all the functions for the module if it is loaded, or if it
   is not, it will cache that command, and when the module is loaded that
   matches <module>, its functions will be enabled. This also allows init
   functions to be traced. But currently events do not have that feature.
 
   Because enabling function tracing can be done very early at boot up
   (before scheduling is enabled), the commands that can be done when
   function tracing is started is limited. Having the ":mod:" command to
   trace module functions as they are loaded is very useful. Update the
   kernel command line function filtering to allow it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZ42E2RQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qqXSAPwOMxuhye8tb1GYG62QD9+w7e6nOmlC
 2GCPj4detnEM2QD/ciivkhespVKhHpZHRewAuSnJgHPSM45NQ3EVESzjWQ4=
 =snbx
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace updates from Steven Rostedt:

 - Have fprobes built on top of function graph infrastructure

   The fprobe logic is an optimized kprobe that uses ftrace to attach to
   functions when a probe is needed at the start or end of the function.
   The fprobe and kretprobe logic implements a similar method as the
   function graph tracer to trace the end of the function. That is to
   hijack the return address and jump to a trampoline to do the trace
   when the function exits. To do this, a shadow stack needs to be
   created to store the original return address. Fprobes and function
   graph do this slightly differently. Fprobes (and kretprobes) has
   slots per callsite that are reserved to save the return address. This
   is fine when just a few points are traced. But users of fprobes, such
   as BPF programs, are starting to add many more locations, and this
   method does not scale.

   The function graph tracer was created to trace all functions in the
   kernel. In order to do this, when function graph tracing is started,
   every task gets its own shadow stack to hold the return address that
   is going to be traced. The function graph tracer has been updated to
   allow multiple users to use its infrastructure. Now have fprobes be
   one of those users. This will also allow for the fprobe and kretprobe
   methods to trace the return address to become obsolete. With new
   technologies like CFI that need to know about these methods of
   hijacking the return address, going toward a solution that has only
   one method of doing this will make the kernel less complex.

 - Cleanup with guard() and free() helpers

   There were several places in the code that had a lot of "goto out" in
   the error paths to either unlock a lock or free some memory that was
   allocated. But this is error prone. Convert the code over to use the
   guard() and free() helpers that let the compiler unlock locks or free
   memory when the function exits.

 - Remove disabling of interrupts in the function graph tracer

   When function graph tracer was first introduced, it could race with
   interrupts and NMIs. To prevent that race, it would disable
   interrupts and not trace NMIs. But the code has changed to allow NMIs
   and also interrupts. This change was done a long time ago, but the
   disabling of interrupts was never removed. Remove the disabling of
   interrupts in the function graph tracer is it is not needed. This
   greatly improves its performance.

 - Allow the :mod: command to enable tracing module functions on the
   kernel command line.

   The function tracer already has a way to enable functions to be
   traced in modules by writing ":mod:<module>" into set_ftrace_filter.
   That will enable either all the functions for the module if it is
   loaded, or if it is not, it will cache that command, and when the
   module is loaded that matches <module>, its functions will be
   enabled. This also allows init functions to be traced. But currently
   events do not have that feature.

   Because enabling function tracing can be done very early at boot up
   (before scheduling is enabled), the commands that can be done when
   function tracing is started is limited. Having the ":mod:" command to
   trace module functions as they are loaded is very useful. Update the
   kernel command line function filtering to allow it.

* tag 'ftrace-v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (26 commits)
  ftrace: Implement :mod: cache filtering on kernel command line
  tracing: Adopt __free() and guard() for trace_fprobe.c
  bpf: Use ftrace_get_symaddr() for kprobe_multi probes
  ftrace: Add ftrace_get_symaddr to convert fentry_ip to symaddr
  Documentation: probes: Update fprobe on function-graph tracer
  selftests/ftrace: Add a test case for repeating register/unregister fprobe
  selftests: ftrace: Remove obsolate maxactive syntax check
  tracing/fprobe: Remove nr_maxactive from fprobe
  fprobe: Add fprobe_header encoding feature
  fprobe: Rewrite fprobe on function-graph tracer
  s390/tracing: Enable HAVE_FTRACE_GRAPH_FUNC
  ftrace: Add CONFIG_HAVE_FTRACE_GRAPH_FUNC
  bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled
  tracing/fprobe: Enable fprobe events with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
  tracing: Add ftrace_fill_perf_regs() for perf event
  tracing: Add ftrace_partial_regs() for converting ftrace_regs to pt_regs
  fprobe: Use ftrace_regs in fprobe exit handler
  fprobe: Use ftrace_regs in fprobe entry handler
  fgraph: Pass ftrace_regs to retfunc
  fgraph: Replace fgraph_ret_regs with ftrace_regs
  ...
2025-01-21 15:15:28 -08:00
Linus Torvalds
4c551165e7 Updates for the interrupt subsystem:
- Consolidation of the machine_kexec_mask_interrupts() by providing a
     generic implementation and replacing the copy & pasta orgy in the
     relevant architectures.
 
   - Prevent unconditional operations on interrupt chips during kexec
     shutdown, which can trigger warnings in certain cases when the
     underlying interrupt has been shut down before.
 
   - Make the enforcement of interrupt handling in interrupt context
     unconditionally available, so that it actually works for non x86
     related interrupt chips. The earlier enablement for ARM GIC chips set
     the required chip flag, but did not notice that the check was hidden
     behind a config switch which is not selected by ARM[64].
 
   - Decrapify the handling of deferred interrupt affinity setting. Some
     interrupt chips require that affinity changes are made from the context
     of handling an interrupt to avoid certain race conditions. For x86 this
     was the default, but with interrupt remapping this requirement was
     lifted and a flag was introduced which tells the core code that
     affinity changes can be done in any context. Unrestricted affinity
     changes are the default for the majority of interrupt chips. RISCV has
     the requirement to add the deferred mode to one of it's interrupt
     controllers, but with the original implementation this would require to
     add the any context flag to all other RISC-V interrupt chips. That's
     backwards, so reverse the logic and require that chips, which need the
     deferred mode have to be marked accordingly. That avoids chasing the
     'sane' chips and marking them.
 
   - Add multi-node support to the Loongarch AVEC interrupt controller
     driver.
 
   - The usual tiny cleanups, fixes and improvements all over the place.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmePkVITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRbQD/9bHVph/V9Ekl7JAX3aY4gG4JbRhOc7
 dp1VAcHRhktRfoTztYRbjsbMu2nvZ58GKA8bkOS2jHSF/m3PbkIJfOhwk0YdIAoa
 +kdy5yDgqCGfkqW43DN4Cr+CnzGjWMitw67tFp3fhwehMDpDjdt2L28IjtanSS0f
 hO6FV7o65MWeJwxk4Isb2/nvkO+X23Lrp6RrWS8SXBnF9FFXxiPIg/fiOPTizhCh
 1W/bSGxLLb9WwsVzmlGAKVFlXDij0QGaIUug2fdVZ63OsELXD7tJrLSPG133yk92
 ppIa0s6BT4IBsfM00us4hG15PkLuJmP3yWWcoquG0rP8Wq58VOXiN6+rcJIyvB+5
 mWceTH6IKfZGoRQKwXC7BxeBAIb147reiJtb06meq1/8ADIvzafiNy0c8x9i/UaV
 QiyhPVENjaGCGDomZmJQqN7Yb02Wge1k8InQnodDrHxZNl/bX/B1Z8Bxd0n6hPHg
 NSJXYif2AxgaddpohsdygqRDbT6SNyQdj7YjJFY5qAGJ3yFyJ4JB6WTqkWW4o1vH
 3FVqdAnJmejAmmYSkah0Hkem2T5QASQmTWb93PLxiV6q+d0NM8stWAujjyVdIV/B
 W4Uj9mQ20cz54TjLtxqX+A1k6KcqOWRgh1l2QbUlFsgsOP3V8yz47yqYdR9qMWlO
 9kNEjI3sw+G/IQ==
 =q4rj
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull interrupt subsystem updates from Thomas Gleixner:

 - Consolidate the machine_kexec_mask_interrupts() by providing a
   generic implementation and replacing the copy & pasta orgy in the
   relevant architectures.

 - Prevent unconditional operations on interrupt chips during kexec
   shutdown, which can trigger warnings in certain cases when the
   underlying interrupt has been shut down before.

 - Make the enforcement of interrupt handling in interrupt context
   unconditionally available, so that it actually works for non x86
   related interrupt chips. The earlier enablement for ARM GIC chips set
   the required chip flag, but did not notice that the check was hidden
   behind a config switch which is not selected by ARM[64].

 - Decrapify the handling of deferred interrupt affinity setting.

   Some interrupt chips require that affinity changes are made from the
   context of handling an interrupt to avoid certain race conditions.
   For x86 this was the default, but with interrupt remapping this
   requirement was lifted and a flag was introduced which tells the core
   code that affinity changes can be done in any context. Unrestricted
   affinity changes are the default for the majority of interrupt chips.

   RISCV has the requirement to add the deferred mode to one of it's
   interrupt controllers, but with the original implementation this
   would require to add the any context flag to all other RISC-V
   interrupt chips. That's backwards, so reverse the logic and require
   that chips, which need the deferred mode have to be marked
   accordingly. That avoids chasing the 'sane' chips and marking them.

 - Add multi-node support to the Loongarch AVEC interrupt controller
   driver.

 - The usual tiny cleanups, fixes and improvements all over the place.

* tag 'irq-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set()
  genirq/timings: Add kernel-doc for a function parameter
  genirq: Remove IRQ_MOVE_PCNTXT and related code
  x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
  genirq: Provide IRQCHIP_MOVE_DEFERRED
  hexagon: Remove GENERIC_PENDING_IRQ leftover
  ARC: Remove GENERIC_PENDING_IRQ
  genirq: Remove handle_enforce_irqctx() wrapper
  genirq: Make handle_enforce_irqctx() unconditionally available
  irqchip/loongarch-avec: Add multi-nodes topology support
  irqchip/ts4800: Replace seq_printf() by seq_puts()
  irqchip/ti-sci-inta : Add module build support
  irqchip/ti-sci-intr: Add module build support
  irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function
  irqchip: keystone: Use syscon_regmap_lookup_by_phandle_args
  genirq/kexec: Prevent redundant IRQ masking by checking state before shutdown
  kexec: Consolidate machine_kexec_mask_interrupts() implementation
  genirq: Reuse irq_thread_fn() for forced thread case
  genirq: Move irq_thread_fn() further up in the code
2025-01-21 13:51:07 -08:00
Linus Torvalds
62de6e1685 Scheduler enhancements for v6.14:
- Fair scheduler (SCHED_FAIR) enhancements:
 
    - Behavioral improvements:
      - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
 
    - Delayed-dequeue enhancements & fixes: (Vincent Guittot)
 
      - Rename h_nr_running into h_nr_queued
      - Add new cfs_rq.h_nr_runnable
      - Use the new cfs_rq.h_nr_runnable
      - Removed unsued cfs_rq.h_nr_delayed
      - Rename cfs_rq.idle_h_nr_running into h_nr_idle
      - Remove unused cfs_rq.idle_nr_running
      - Rename cfs_rq.nr_running into nr_queued
      - Do not try to migrate delayed dequeue task
      - Fix variable declaration position
      - Encapsulate set custom slice in a __setparam_fair() function
 
    - Fixes:
      - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
      - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)
 
    - Cleanups:
      - Clean up in migrate_degrades_locality() to improve
        readability (Peter Zijlstra)
      - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
      - Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
      - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
        (Valentin Schneider)
 
  - Deadline scheduler (SCHED_DL) enhancements:
 
    - Restore dl_server bandwidth on non-destructive root domain
      changes (Juri Lelli)
 
    - Correctly account for allocated bandwidth during
      hotplug (Juri Lelli)
 
    - Check bandwidth overflow earlier for hotplug (Juri Lelli)
 
    - Clean up goto label in pick_earliest_pushable_dl_task()
      (John Stultz)
 
    - Consolidate timer cancellation (Wander Lairson Costa)
 
  - Load-balancer enhancements:
 
    - Improve performance by prioritizing migrating eligible
      tasks in sched_balance_rq() (Hao Jia)
 
    - Do not compute NUMA Balancing stats unnecessarily during
      load-balancing (K Prateek Nayak)
 
    - Do not compute overloaded status unnecessarily during
      load-balancing (K Prateek Nayak)
 
  - Generic scheduling code enhancements:
 
    - Use READ_ONCE() in task_on_rq_queued(), to consistently use
      the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
 
  - Isolated CPUs support enhancements: (Waiman Long)
 
    - Make "isolcpus=nohz" equivalent to "nohz_full"
    - Consolidate housekeeping cpumasks that are always identical
    - Remove HK_TYPE_SCHED
    - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
 
  - RSEQ enhancements:
 
    - Validate read-only fields under DEBUG_RSEQ config
      (Mathieu Desnoyers)
 
  - PSI enhancements:
 
    - Fix race when task wakes up before psi_sched_switch()
      adjusts flags (Chengming Zhou)
 
  - IRQ time accounting performance enhancements: (Yafang Shao)
 
    - Define sched_clock_irqtime as static key
    - Don't account irq time if sched_clock_irqtime is disabled
 
  - Virtual machine scheduling enhancements:
 
    - Don't try to catch up excess steal time (Suleiman Souhlal)
 
  - Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)
 
    - Convert "sysctl_sched_itmt_enabled" to boolean
    - Use guard() for itmt_update_mutex
    - Move the "sched_itmt_enabled" sysctl to debugfs
    - Remove x86_smt_flags and use cpu_smt_flags directly
    - Use x86_sched_itmt_flags for PKG domain unconditionally
 
  - Debugging code & instrumentation enhancements:
 
    - Change need_resched warnings to pr_err() (David Rientjes)
    - Print domain name in /proc/schedstat (K Prateek Nayak)
    - Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
    - Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
    - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
    - Update Schedstat version to 17 (Swapnil Sapkal)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmePSRcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hrdBAAjYiLl5Q8SHM0xnl+kbvuUkCTgEB/gSgA
 mfrZtHRUgRZuA89NZ9NljlCkQSlsLTOjnpNuaeFzs529GMg9iemc99dbnz3BP5F3
 V5qpYvWe7yIkJ3hd0TOGLmYEPMNQaAW57YBOrxcPjWNLJ4cr9iMdccVA1OQtcmqD
 ZUh3nibv81QI8HDmT2G+figxEIqH3yBV1+SmEIxbrdkQpIJ5702Ng6+0KQK5TShN
 xwjFELWZUl2TfkoCc4nkIpkImV6cI1DvXSw1xK6gbb1xEVOrsmFW3TYFw4trKHBu
 2RBG4wtmzNjh+12GmSdIBJHogPNcay+JIJW9EG/unT7jirqzkkeP1X2eJEbh+X1L
 CMa7GsD9Vy72jCzeJDMuiy7bKfG/MiKUtDXrAZQDo2atbw7H88QOzMuTE5a5WSV+
 tRxXGI/dgFVOk+JQUfctfJbYeXjmG8GAflawvXtGDAfDZsja6M+65fH8p0AOgW1E
 HHmXUzAe2E2xQBiSok/DYHPQeCDBAjoJvU93YhGiXv8UScb2UaD4BAfzfmc8P+Zs
 Eox6444ah5U0jiXmZ3HU707n1zO+Ql4qKoyyMJzSyP+oYHE/Do7NYTElw2QovVdN
 FX/9Uae8T4ttA/5lFe7FNoXgKvSxXDKYyKLZcysjVrWJF866Ui/TWtmxA6w8Osn7
 sfucuLawLPM=
 =5ZNW
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Fair scheduler (SCHED_FAIR) enhancements:

   - Behavioral improvements:
      - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)

   - Delayed-dequeue enhancements & fixes: (Vincent Guittot)
      - Rename h_nr_running into h_nr_queued
      - Add new cfs_rq.h_nr_runnable
      - Use the new cfs_rq.h_nr_runnable
      - Removed unsued cfs_rq.h_nr_delayed
      - Rename cfs_rq.idle_h_nr_running into h_nr_idle
      - Remove unused cfs_rq.idle_nr_running
      - Rename cfs_rq.nr_running into nr_queued
      - Do not try to migrate delayed dequeue task
      - Fix variable declaration position
      - Encapsulate set custom slice in a __setparam_fair() function

   - Fixes:
      - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
      - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal
        Chourasia)

   - Cleanups:
      - Clean up in migrate_degrades_locality() to improve readability
        (Peter Zijlstra)
      - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
      - Update comments after sched_tick() rename (Sebastian Andrzej
        Siewior)
      - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
        (Valentin Schneider)

  Deadline scheduler (SCHED_DL) enhancements:

   - Restore dl_server bandwidth on non-destructive root domain changes
     (Juri Lelli)

   - Correctly account for allocated bandwidth during hotplug (Juri
     Lelli)

   - Check bandwidth overflow earlier for hotplug (Juri Lelli)

   - Clean up goto label in pick_earliest_pushable_dl_task() (John
     Stultz)

   - Consolidate timer cancellation (Wander Lairson Costa)

  Load-balancer enhancements:

   - Improve performance by prioritizing migrating eligible tasks in
     sched_balance_rq() (Hao Jia)

   - Do not compute NUMA Balancing stats unnecessarily during
     load-balancing (K Prateek Nayak)

   - Do not compute overloaded status unnecessarily during
     load-balancing (K Prateek Nayak)

  Generic scheduling code enhancements:

   - Use READ_ONCE() in task_on_rq_queued(), to consistently use the
     WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)

  Isolated CPUs support enhancements: (Waiman Long)

   - Make "isolcpus=nohz" equivalent to "nohz_full"
   - Consolidate housekeeping cpumasks that are always identical
   - Remove HK_TYPE_SCHED
   - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE

  RSEQ enhancements:

   - Validate read-only fields under DEBUG_RSEQ config (Mathieu
     Desnoyers)

  PSI enhancements:

   - Fix race when task wakes up before psi_sched_switch() adjusts flags
     (Chengming Zhou)

  IRQ time accounting performance enhancements: (Yafang Shao)

   - Define sched_clock_irqtime as static key
   - Don't account irq time if sched_clock_irqtime is disabled

  Virtual machine scheduling enhancements:

   - Don't try to catch up excess steal time (Suleiman Souhlal)

  Heterogenous x86 CPU scheduling enhancements: (K Prateek Nayak)

   - Convert "sysctl_sched_itmt_enabled" to boolean
   - Use guard() for itmt_update_mutex
   - Move the "sched_itmt_enabled" sysctl to debugfs
   - Remove x86_smt_flags and use cpu_smt_flags directly
   - Use x86_sched_itmt_flags for PKG domain unconditionally

  Debugging code & instrumentation enhancements:

   - Change need_resched warnings to pr_err() (David Rientjes)
   - Print domain name in /proc/schedstat (K Prateek Nayak)
   - Fix value reported by hot tasks pulled in /proc/schedstat (Peter
     Zijlstra)
   - Report the different kinds of imbalances in /proc/schedstat
     (Swapnil Sapkal)
   - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
   - Update Schedstat version to 17 (Swapnil Sapkal)"

* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  rseq: Fix rseq unregistration regression
  psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched: Define sched_clock_irqtime as static key
  sched/fair: Do not compute overloaded status unnecessarily during lb
  sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
  x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
  x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
  x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
  x86/itmt: Use guard() for itmt_update_mutex
  x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
  sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
  sched/debug: Change need_resched warnings to pr_err
  sched/fair: Encapsulate set custom slice in a __setparam_fair() function
  sched: Fix race between yield_to() and try_to_wake_up()
  docs: Update Schedstat version to 17
  sched/stats: Print domain name in /proc/schedstat
  sched: Move sched domain name out of CONFIG_SCHED_DEBUG
  sched: Report the different kinds of imbalances in /proc/schedstat
  ...
2025-01-21 11:32:36 -08:00
Linus Torvalds
858df1de21 Miscellaneous x86 cleanups and typo fixes, and also the removal
of the "disablelapic" boot parameter.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmePTD8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jf5g//Wo1WKUXukRrBANr2nIlx9B7xJliRmUxv
 mJ0VKo49YPl6C34fjSHhBs3+nPbYD+CyWVKAz5PqkfkFRGBgpQi26EnyKaIhLVFW
 HWhW5vQm/FJfzBIrfFg7g/H1PK+rEYa4mv8JF9vhwp7BOfuqx4ABGKWQnrvOGg2B
 VivE5k7/kxWRPTg45Kgb1iwlS2gcfWCRi9qdCzdJgY/4XYE6k6hKeV0PgTT3Vojf
 pZKsgZRq8tzMaX75obtyyrX3TWj0nkRec0XbgyXBFvlFh/l3e0RswxzGGAjrC1XP
 R+qmscdCkczUwRGc1mGj9MoCqMRRffU6/hTNsjqu8o7Q2gzZzXWHcUc+X7UwOeKZ
 2guxOj4iagdn7+mIso6uAjY+OOdFVw7/C8ysbCmwo3MiaDsfaK2NkdBoT2xDWuIw
 NP/45RMpTIsgL0wG6upzXXApKgYxfWhNSq+oHDF4/TjWY4i779hjMghvtX1BI7yb
 LXIh2SsRcnmEPl42UGaz6xmdmkulWZPPxI5rghixU48Eazkngfp7ZTHYpm5NFoRP
 Qc3JNcKo7rGmkoo/sA7uwawjnaTz/H77SDNjfAufzjVAKidvUqW6xaK/8JM1fq0n
 du+9sQN5MrAqdKx5Lu624s/7ektwkDeUdQFGazqS9y0GBT25T9Rw+LQDuec7BG3p
 v8sok4IaPA0=
 =Hzj3
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Ingo Molnar:
 "Miscellaneous x86 cleanups and typo fixes, and also the removal of
  the 'disablelapic' boot parameter"

* tag 'x86-cleanups-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ioapic: Remove a stray tab in the IO-APIC type string
  x86/cpufeatures: Remove "AMD" from the comments to the AMD-specific leaf
  Documentation/kernel-parameters: Fix a typo in kvm.enable_virt_at_load text
  x86/cpu: Fix typo in x86_match_cpu()'s doc
  x86/apic: Remove "disablelapic" cmdline option
  Documentation: Merge x86-specific boot options doc into kernel-parameters.txt
  x86/ioremap: Remove unused size parameter in remapping functions
  x86/ioremap: Simplify setup_data mapping variants
  x86/boot/compressed: Remove unused header includes from kaslr.c
2025-01-21 11:15:29 -08:00
Linus Torvalds
6c4aa896eb Performance events changes for v6.14:
- Seqlock optimizations that arose in a perf context and were
    merged into the perf tree:
 
    - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
    - mm: Convert mm_lock_seq to a proper seqcount ((Suren Baghdasaryan)
    - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren Baghdasaryan)
    - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)
 
  - Core perf enhancements:
 
    - Reduce 'struct page' footprint of perf by mapping pages
      in advance (Lorenzo Stoakes)
    - Save raw sample data conditionally based on sample type (Yabin Cui)
    - Reduce sampling overhead by checking sample_type in
      perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin Cui)
    - Export perf_exclude_event() (Namhyung Kim)
 
  - Uprobes scalability enhancements: (Andrii Nakryiko)
 
    - Simplify find_active_uprobe_rcu() VMA checks
    - Add speculative lockless VMA-to-inode-to-uprobe resolution
    - Simplify session consumer tracking
    - Decouple return_instance list traversal and freeing
    - Ensure return_instance is detached from the list before freeing
    - Reuse return_instances between multiple uretprobes within task
    - Guard against kmemdup() failing in dup_return_instance()
 
  - AMD core PMU driver enhancements:
 
    - Relax privilege filter restriction on AMD IBS (Namhyung Kim)
 
  - AMD RAPL energy counters support: (Dhananjay Ugwekar)
 
    - Introduce topology_logical_core_id() (K Prateek Nayak)
 
    - Remove the unused get_rapl_pmu_cpumask() function
    - Remove the cpu_to_rapl_pmu() function
    - Rename rapl_pmu variables
    - Make rapl_model struct global
    - Add arguments to the init and cleanup functions
    - Modify the generic variable names to *_pkg*
    - Remove the global variable rapl_msrs
    - Move the cntr_mask to rapl_pmus struct
    - Add core energy counter support for AMD CPUs
 
  - Intel core PMU driver enhancements:
 
    - Support RDPMC 'metrics clear mode' feature (Kan Liang)
    - Clarify adaptive PEBS processing (Kan Liang)
    - Factor out functions for PEBS records processing (Kan Liang)
    - Simplify the PEBS records processing for adaptive PEBS (Kan Liang)
 
  - Intel uncore driver enhancements: (Kan Liang)
 
    - Convert buggy pmu->func_id use to pmu->registered
    - Support more units on Granite Rapids
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeOJdQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i2yQ/+MXl7yfJOgdbwjBpgGGzH4burEO7ppak+
 ktzz+YjpNgjODe/xMAJGjjblouuYArCnRolc1UPvPm6M7jSY76wi42Y6c4dRtFoB
 2ReSrRqnreLOcrRS9nsTjvWRHfJHqJDVSd9TfHX6ILfzbaizCZOGYk558ZxAKRqu
 Lw7FOvLEe/Y3tg4z8dDg083jsasalKySP9wIPc0BkSqQTOfusd3KXju/Fux/9wkn
 hZcUgF4ds+0bH7xtO1/G9ILqGyeq97X1McIR9bAjln5Mxykclen4hSjRaWWHHo9O
 mzBKmd/blIATisfuuW+QLDQow3M1k3688cz7e9QOeWHHd/dJiMb9RLV90jdND/T/
 uLINC5vNemzyWEfnNiYQ31LjhG3SeuDiKWzRp36MbQcCh6EBdRXWLBgtmxq1L/3o
 ZCaCdtFu5+6epycdyOVZEpWDnjdx4GmLXMZi5WJfZ7fZ/IFjNkjk4OdzI1iRQ+i3
 Sbi75ep59ayTUhm5AB7gCJsP3R7EsZsiPHUenQdA2n9Sj6xE+IuhlS/QDQ9g5mdY
 Ijs0jHeVCGmhYoOD1xWnCZSzlnkEVU3zwfypAK+MC7pgtFMwDy5/Bu1USGxXXDy+
 aKsrJRSgHbtZ1gwoHstqkV+DeCTfElCLYkvigzI5Nmyib5Zp4vkwy2ZLWQjaNjm7
 mqRI7PugUkU=
 =c8XB
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance events updates from Ingo Molnar:
 "Seqlock optimizations that arose in a perf context and were merged
  into the perf tree:

   - seqlock: Add raw_seqcount_try_begin (Suren Baghdasaryan)
   - mm: Convert mm_lock_seq to a proper seqcount (Suren Baghdasaryan)
   - mm: Introduce mmap_lock_speculate_{try_begin|retry} (Suren
     Baghdasaryan)
   - mm/gup: Use raw_seqcount_try_begin() (Peter Zijlstra)

  Core perf enhancements:

   - Reduce 'struct page' footprint of perf by mapping pages in advance
     (Lorenzo Stoakes)
   - Save raw sample data conditionally based on sample type (Yabin Cui)
   - Reduce sampling overhead by checking sample_type in
     perf_sample_save_callchain() and perf_sample_save_brstack() (Yabin
     Cui)
   - Export perf_exclude_event() (Namhyung Kim)

  Uprobes scalability enhancements: (Andrii Nakryiko)

   - Simplify find_active_uprobe_rcu() VMA checks
   - Add speculative lockless VMA-to-inode-to-uprobe resolution
   - Simplify session consumer tracking
   - Decouple return_instance list traversal and freeing
   - Ensure return_instance is detached from the list before freeing
   - Reuse return_instances between multiple uretprobes within task
   - Guard against kmemdup() failing in dup_return_instance()

  AMD core PMU driver enhancements:

   - Relax privilege filter restriction on AMD IBS (Namhyung Kim)

  AMD RAPL energy counters support: (Dhananjay Ugwekar)

   - Introduce topology_logical_core_id() (K Prateek Nayak)
   - Remove the unused get_rapl_pmu_cpumask() function
   - Remove the cpu_to_rapl_pmu() function
   - Rename rapl_pmu variables
   - Make rapl_model struct global
   - Add arguments to the init and cleanup functions
   - Modify the generic variable names to *_pkg*
   - Remove the global variable rapl_msrs
   - Move the cntr_mask to rapl_pmus struct
   - Add core energy counter support for AMD CPUs

  Intel core PMU driver enhancements:

   - Support RDPMC 'metrics clear mode' feature (Kan Liang)
   - Clarify adaptive PEBS processing (Kan Liang)
   - Factor out functions for PEBS records processing (Kan Liang)
   - Simplify the PEBS records processing for adaptive PEBS (Kan Liang)

  Intel uncore driver enhancements: (Kan Liang)

   - Convert buggy pmu->func_id use to pmu->registered
   - Support more units on Granite Rapids"

* tag 'perf-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  perf: map pages in advance
  perf/x86/intel/uncore: Support more units on Granite Rapids
  perf/x86/intel/uncore: Clean up func_id
  perf/x86/intel: Support RDPMC metrics clear mode
  uprobes: Guard against kmemdup() failing in dup_return_instance()
  perf/x86: Relax privilege filter restriction on AMD IBS
  perf/core: Export perf_exclude_event()
  uprobes: Reuse return_instances between multiple uretprobes within task
  uprobes: Ensure return_instance is detached from the list before freeing
  uprobes: Decouple return_instance list traversal and freeing
  uprobes: Simplify session consumer tracking
  uprobes: add speculative lockless VMA-to-inode-to-uprobe resolution
  uprobes: simplify find_active_uprobe_rcu() VMA checks
  mm: introduce mmap_lock_speculate_{try_begin|retry}
  mm: convert mm_lock_seq to a proper seqcount
  mm/gup: Use raw_seqcount_try_begin()
  seqlock: add raw_seqcount_try_begin
  perf/x86/rapl: Add core energy counter support for AMD CPUs
  perf/x86/rapl: Move the cntr_mask to rapl_pmus struct
  perf/x86/rapl: Remove the global variable rapl_msrs
  ...
2025-01-21 10:52:03 -08:00
Linus Torvalds
a6640c8c2f Objtool changes for v6.14:
- Introduce the generic section-based annotation
    infrastructure a.k.a. ASM_ANNOTATE/ANNOTATE (Peter Zijlstra)
 
  - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra)
 
     - ANNOTATE_NOENDBR
     - ANNOTATE_RETPOLINE_SAFE
     - instrumentation_{begin,end}()
     - VALIDATE_UNRET_BEGIN
     - ANNOTATE_IGNORE_ALTERNATIVE
     - ANNOTATE_INTRA_FUNCTION_CALL
     - {.UN}REACHABLE
 
  - Optimize the annotation-sections parsing code (Peter Zijlstra)
 
  - Centralize annotation definitions in <linux/objtool.h>
 
  - Unify & simplify the barrier_before_unreachable()/unreachable()
    definitions (Peter Zijlstra)
 
  - Convert unreachable() calls to BUG() in x86 code, as
    unreachable() has unreliable code generation (Peter Zijlstra)
 
  - Remove annotate_reachable() and annotate_unreachable(), as it's
    unreliable against compiler optimizations (Peter Zijlstra)
 
  - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra)
 
  - Robustify the annotation code by warning about unknown annotation
    types (Peter Zijlstra)
 
  - Allow arch code to discover jump table size, in preparation of
    annotated jump table support (Ard Biesheuvel)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmeOHiARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gATw/7Bn4A+Isqk9bKo6QgYEnKRoyf760ALQl6
 av/toEy1qCHT/CXCiEn1Hut1JEy4YyD6lIarC1scRl5xy7amRDEcCL0i2CKz3orn
 pf6Fk8/Pi68G2K50o4LTiq8t3uPBJXPlGyDlngh2hFTYRfPRT4m+cig784hmJEXG
 Xq2YzzUNG++U/4Uwe3JH7bX/vcZTYkZfM62FWfp3I4V0OqKU4c+Pkiv4u3Rs7L7b
 c3xk5/PktKZWV5TDsz0wU4SAGxYFGV47hhYM6cxdSYD3la7RVO+qZcqxsJByjpcL
 bvOmGKQ1SAXr08rV7TB+Fh8icaNE8Rbbmxf6slB0hdXBQb8STAZ810mZJFey6pnm
 kXgfhhfBOK5Sq+UbTfzF2JgquCGAbKK75bmNGgf2HaLnVLkFIw3AyMsuFqnxhI4X
 vXRHGnHCYpYUHTxzRYTFYR8XL8twA2kgjWkSe7hYrX/RQZV3XfyKOc2jyoJFMXeX
 LecfGJCE/pziZyj60SXT9WaUTvKc8gjWOEuAnW1pJQRM0zJqB9kjLh1cDYUseuwv
 gGkH59KEu0kcfOb5t/jWoqW3PTENJjEAhOmjun6Jv8wgbOxU88TMmSCWppj54O2X
 c2ibO407535u1SKBWZuaKFBLYftS2GM4WaGsdyTyh+ta48C8An90HMfYNKTHM9Nz
 F61Q7Zbn65E=
 =9nGt
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:

 - Introduce the generic section-based annotation infrastructure a.k.a.
   ASM_ANNOTATE/ANNOTATE (Peter Zijlstra)

 - Convert various facilities to ASM_ANNOTATE/ANNOTATE: (Peter Zijlstra)
    - ANNOTATE_NOENDBR
    - ANNOTATE_RETPOLINE_SAFE
    - instrumentation_{begin,end}()
    - VALIDATE_UNRET_BEGIN
    - ANNOTATE_IGNORE_ALTERNATIVE
    - ANNOTATE_INTRA_FUNCTION_CALL
    - {.UN}REACHABLE

 - Optimize the annotation-sections parsing code (Peter Zijlstra)

 - Centralize annotation definitions in <linux/objtool.h>

 - Unify & simplify the barrier_before_unreachable()/unreachable()
   definitions (Peter Zijlstra)

 - Convert unreachable() calls to BUG() in x86 code, as unreachable()
   has unreliable code generation (Peter Zijlstra)

 - Remove annotate_reachable() and annotate_unreachable(), as it's
   unreliable against compiler optimizations (Peter Zijlstra)

 - Fix non-standard ANNOTATE_REACHABLE annotation order (Peter Zijlstra)

 - Robustify the annotation code by warning about unknown annotation
   types (Peter Zijlstra)

 - Allow arch code to discover jump table size, in preparation of
   annotated jump table support (Ard Biesheuvel)

* tag 'objtool-core-2025-01-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Convert unreachable() to BUG()
  objtool: Allow arch code to discover jump table size
  objtool: Warn about unknown annotation types
  objtool: Fix ANNOTATE_REACHABLE to be a normal annotation
  objtool: Convert {.UN}REACHABLE to ANNOTATE
  objtool: Remove annotate_{,un}reachable()
  loongarch: Use ASM_REACHABLE
  x86: Convert unreachable() to BUG()
  unreachable: Unify
  objtool: Collect more annotations in objtool.h
  objtool: Collapse annotate sequences
  objtool: Convert ANNOTATE_INTRA_FUNCTION_CALL to ANNOTATE
  objtool: Convert ANNOTATE_IGNORE_ALTERNATIVE to ANNOTATE
  objtool: Convert VALIDATE_UNRET_BEGIN to ANNOTATE
  objtool: Convert instrumentation_{begin,end}() to ANNOTATE
  objtool: Convert ANNOTATE_RETPOLINE_SAFE to ANNOTATE
  objtool: Convert ANNOTATE_NOENDBR to ANNOTATE
  objtool: Generic annotation infrastructure
2025-01-21 10:13:11 -08:00
Linus Torvalds
b9d8a295ed - The first part of a restructuring of AMD's representation of a northbridge
which is legacy now, and the creation of the new AMD node concept which
   represents the Zen architecture of having a collection of I/O devices within
   an SoC. Those nodes comprise the so-called data fabric on Zen. This has
   at least one practical advantage of not having to add a PCI ID each time
   a new data fabric PCI device releases. Eventually, the lot more uniform
   provider of data fabric functionality amd_node.c will be used by all the
   drivers which need it
 
 - Smaller cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmePuPIACgkQEsHwGGHe
 VUpU6Q//S9j9+YC9EpredFoJ5W0BfERR5XOum7YjlLxq2mVTStrf9Q1ecrwmS4Q6
 4mAydIDfhqNlouUjMBgNNFJcvm8lat+/pjY78oT8ZdjumslMbMxo81VmQ3fX+6fE
 izMrL81DG4j8zeleUyz5ecJEK/KPw1s3SkY736511PeJSalOU4hLYmU819imfAk/
 5c9os2GNhszIROE1YUYZQ3zXne1t2PNXKvctzVrJYjyKpIDgFNzTj6gXhePzXBNO
 iFdApqSgKdnnsD6VsfxYVnOKP+cSIl27Tbge6dm7DHQbSs00aVL64JPcX8/hWtp6
 ExrwBYiFk6yafwsNUu7/PmqbZNKYxDgvXFq8jSOFfioh6Km/QZYs8y1/qXN3qmSU
 78Ah5jyO+U+++FsSa2o9eRpU2l84UIQqvp84PeSLylzh7iLFyFCWsMfreNeIsF9v
 Jsost58JQOCufRK3qfMiDO88QUZRKyCfFymDAVcvPoBwp5nK9R1ohlbxgXrCPsE7
 Bd7J6jrlpcoRyYc8vhshkrnK2Sk6pP77OZOh5AZ9AybnALH0afUNLzk6sBtaObkZ
 xIJcSIBkKz3P4zWFKsXmqGYHWp1IsKsYRsNjCt5FExWOF+uKKKBjynHmlKeS0l/b
 J6bwDUPVW/gfkBqDV8bILultj9Gm8L5Z8SwvD1ww69OYN+c7oVk=
 =ZAjD
 -----END PGP SIGNATURE-----

Merge tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Borislav Petkov:

 - The first part of a restructuring of AMD's representation of a
   northbridge which is legacy now, and the creation of the new AMD node
   concept which represents the Zen architecture of having a collection
   of I/O devices within an SoC. Those nodes comprise the so-called data
   fabric on Zen.

   This has at least one practical advantage of not having to add a PCI
   ID each time a new data fabric PCI device releases. Eventually, the
   lot more uniform provider of data fabric functionality amd_node.c
   will be used by all the drivers which need it

 - Smaller cleanups

* tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/amd_node: Use defines for SMN register offsets
  x86/amd_node: Remove dependency on AMD_NB
  x86/amd_node: Update __amd_smn_rw() error paths
  x86/amd_nb: Move SMN access code to a new amd_node driver
  x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id()
  x86/amd_nb: Simplify function 3 search
  x86/amd_nb: Use topology info to get AMD node count
  x86/amd_nb: Simplify root device search
  x86/amd_nb: Simplify function 4 search
  x86: Start moving AMD node functionality out of AMD_NB
  x86/amd_nb: Clean up early_is_amd_nb()
  x86/amd_nb: Restrict init function to AMD-based systems
  x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
2025-01-21 09:38:52 -08:00
Linus Torvalds
48795f90cb - Remove the less generic CPU matching infra around struct x86_cpu_desc and
use the generic struct x86_cpu_id thing
 
 - Remove magic naked numbers for CPUID functions and use proper defines of the
   prefix CPUID_LEAF_*. Consolidate some of the crazy use around the tree
 
 - Smaller cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmePjeIACgkQEsHwGGHe
 VUqRBA//TinKFcWagaQB3lsnoBRwqyg6JJZIBNMF9sBMDD9HnvEZ/JduC+3+g1rx
 iztuCmRSgQsi/QvRaEFNuDMOgk6gACyXxi7Uf6eXsQkSlsZFViaqbXsy9kqslRbl
 7QP1NS1sfdSd42JPp2UZT/lg9kluuVnn5b40zZIwy2AAzwrNFfZAS4Yg7Qe4XQDF
 xBcHi8MAF+LTm5Tv0hLmx2UcfZLhi7hXy8mTAIFS0Liww+Y5qaam33xw9KxNU5lZ
 tVepzY5my43pRs4MB1CvaQCiZ84GxvAVqz3JYsg5YhVp45xh7P2WtjBeeOqLljaW
 MkWnDLOmlaD4Y0kL4QA3ReyBVux54RbDGKC0E/t5fwYlk3dQ7gYwSEvh5358R+0z
 kwxw3NdnNngoLRXAX45EonSxj36jb6KCBHAGqXSfL73OOt30RWCqknEnixcOp/BP
 chNxCiIx7qko+rAYOD62QkguEEPFdb8roeayhIKtiKL5zUwQAr+jt/pKVx2htWLi
 xxqSaVoCFu4edWpsEJnanqhS0Es0v7YiBU3jDC37rZJ+dtzf0C2ewD7Nb1g+wUTn
 NzDkmt58hQW4jBxoxHBIclLfhEETISTEGAAObTa5I5r8IDb7Dv+ZnSv7RfjoR9fL
 RWMz1bJ1Scem+Fx7fc/IRJFSElC41giSwFlhThHdAzI1m95zJN8=
 =9Hdg
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpuid updates from Borislav Petkov:

 - Remove the less generic CPU matching infra around struct x86_cpu_desc
   and use the generic struct x86_cpu_id thing

 - Remove magic naked numbers for CPUID functions and use proper defines
   of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around
   the tree

 - Smaller cleanups and improvements

* tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Make all all CPUID leaf names consistent
  x86/fpu: Remove unnecessary CPUID level check
  x86/fpu: Move CPUID leaf definitions to common code
  x86/tsc: Remove CPUID "frequency" leaf magic numbers.
  x86/tsc: Move away from TSC leaf magic numbers
  x86/cpu: Move TSC CPUID leaf definition
  x86/cpu: Refresh DCA leaf reading code
  x86/cpu: Remove unnecessary MwAIT leaf checks
  x86/cpu: Use MWAIT leaf definition
  x86/cpu: Move MWAIT leaf definition to common header
  x86/cpu: Remove 'x86_cpu_desc' infrastructure
  x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'
  x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id'
  x86/cpu: Expose only stepping min/max interface
  x86/cpu: Introduce new microcode matching helper
  x86/cpufeature: Document cpu_feature_enabled() as the default to use
  x86/paravirt: Remove the WBINVD callback
  x86/cpufeatures: Free up unused feature bits
2025-01-21 09:30:59 -08:00
Linus Torvalds
13b6931c44 - A segmented Reverse Map table (RMP) is a across-nodes distributed
table of sorts which contains per-node descriptors of each node-local
   4K page, denoting its ownership (hypervisor, guest, etc) in the realm
   of confidential computing.  Add support for such a table in order to
   improve referential locality when accessing or modifying RMP table
   entries
 
 - Add support for reading the TSC in SNP guests by removing any
   interference or influence the hypervisor might have, with the goal of
   making a confidential guest even more independent from the hypervisor
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeOYLsACgkQEsHwGGHe
 VUrywg//WBuywe3+TNPwF0Iw8becqtD7lKMftmUoqpcf20JhiHSCexb+3/r7U2Kb
 WL1/T5cxX1rA45HzkwovUljlvin8B9bdpY40dUqrKFPMnWLfs4ru0HPA6UxPBsAq
 r/8XrXuRrI22MLbrAeQ2xSt8dqw3DpbJyUcyr0qOb6OsbtAy05uElYCzMSyzT06F
 QsTmenosuJqSo1gIGTxfU4nKyd1o8EJ5b1ThK11hvZaIOffgLjEU6g39cG9AeF4X
 TOkh9CdIlQc3ot14rJeWMy15YEW+xBdXdMEv0ZPOSZiKzTHA7wwdl0VmPm1EK57f
 BQkZikuoJezJA0r5wSwVgslTaYO0GTXNewwL5jxK1mqRgoK06IgC6xAkX8N7NTYL
 K6DX+tfaKjSJGY1z9TYOzs+wGV4MBAXmbLwnuhcPumkTYXPFbRFZqx6ec2BLIU+Y
 bZfwhlr3q+bfFeBYMzyWPHJ87JinOjwu4Ah0uLVmkoRtgb0S3pIdlyRYZAcEl6fn
 Tgfu0/RNLGGsH/a3BF7AQdt+hOv1ms5hEMYXg++30uC59LR8XbuKnLdUPRi0nVeD
 e9xyxFybu5ySesnnXabtaO9bSUF+8HV4nkclKglFvuHpLMQ5GlPxTnBj1V1podYR
 l12G2htXKsSV5JJK4x+WfYBe6Nn3tbcpgZD8M8g0lso8kejqMjs=
 =hh1m
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - A segmented Reverse Map table (RMP) is a across-nodes distributed
   table of sorts which contains per-node descriptors of each node-local
   4K page, denoting its ownership (hypervisor, guest, etc) in the realm
   of confidential computing. Add support for such a table in order to
   improve referential locality when accessing or modifying RMP table
   entries

 - Add support for reading the TSC in SNP guests by removing any
   interference or influence the hypervisor might have, with the goal of
   making a confidential guest even more independent from the hypervisor

* tag 'x86_sev_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sev: Add the Secure TSC feature for SNP guests
  x86/tsc: Init the TSC for Secure TSC guests
  x86/sev: Mark the TSC in a secure TSC guest as reliable
  x86/sev: Prevent RDTSC/RDTSCP interception for Secure TSC enabled guests
  x86/sev: Prevent GUEST_TSC_FREQ MSR interception for Secure TSC enabled guests
  x86/sev: Change TSC MSR behavior for Secure TSC enabled guests
  x86/sev: Add Secure TSC support for SNP guests
  x86/sev: Relocate SNP guest messaging routines to common code
  x86/sev: Carve out and export SNP guest messaging init routines
  virt: sev-guest: Replace GFP_KERNEL_ACCOUNT with GFP_KERNEL
  virt: sev-guest: Remove is_vmpck_empty() helper
  x86/sev/docs: Document the SNP Reverse Map Table (RMP)
  x86/sev: Add full support for a segmented RMP table
  x86/sev: Treat the contiguous RMP table as a single RMP segment
  x86/sev: Map only the RMP table entries instead of the full RMP range
  x86/sev: Move the SNP probe routine out of the way
  x86/sev: Require the RMPREAD instruction after Zen4
  x86/sev: Add support for the RMPREAD instruction
  x86/sev: Prepare for using the RMPREAD instruction to access the RMP
2025-01-21 09:00:31 -08:00
Linus Torvalds
254d763310 - A bunch of minor cleanups
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeOWnYACgkQEsHwGGHe
 VUpEfQ/+MaFbHwEyewHWxGqhe9bOcoFsKUxQvK1jO99KbcPYjg1sEZ1DDeLydtea
 vwFoNNMnZDmKYA/+yAeshUIkcxyKYa3Iy6Tc8MXw1Yx1Z+AduKwDm47rvxcn47ig
 oQt1Zj0zMzON6aF7Xx7xlF5wDDBPMuRLKdKBSuggiio/IJz2G+m2CNI4KTUGAN9S
 aHE+TIE64MHftcabuoKdRtZ44FUZyb3BRvtpj3+G1BpRR6e/HFl9ghM3emvOv4BE
 e9QUvMFdVHnXQyib+JV+lOyu/2MpTVuhZqS51e8vSaNToJ0KfOc+j/Sl8uRCSciL
 qXALGIB5uEHlotKIVk9p9frGeUruc8H5YvdOUt7VEmSofuqbfFaEMSdhBfSxfq+S
 1LGBl3hXtVaspgQfzmPpEN5pMQAIMF6xlMiFeZSwBnUhb6n4QVlj38lodBbjct2j
 mzjBaYtu9NiVNi1H6dr1s/16Vl69tqAeRncGFdzvbJV0ZzY6KN/yFWy7g6PCPwHx
 gXdXFuOiwsoXQSsQn1l3nsxuXq1zp+TjUFrREPnmYv7t/GyEdieL5LfPnAwXB9x5
 vBAPDLa0c5XZRn8QyaeH1gZVX3HMi88cjXBEDI9auejkGr3e1VfdDSZkQpYMIK9b
 oIWwD2Wr2oU6uz5ye6M0cZpoZBVsbuluE5/kLJ70aj8QEJT3OsM=
 =xrfO
 -----END PGP SIGNATURE-----

Merge tag 'x86_microcode_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 microcode loader updates from Borislav Petkov:

 - A bunch of minor cleanups

* tag 'x86_microcode_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode/AMD: Remove ret local var in early_apply_microcode()
  x86/microcode/AMD: Have __apply_microcode_amd() return bool
  x86/microcode/AMD: Make __verify_patch_size() return bool
  x86/microcode/AMD: Remove bogus comment from parse_container()
  x86/microcode/AMD: Return bool from find_blobs_in_containers()
2025-01-21 08:33:10 -08:00
Linus Torvalds
3357d1d1f9 - Extend resctrl with the capability of total memory bandwidth monitoring,
thus accomodating systems which support only total but not local memory
   bandwidth monitoring. Add the respective new mount options
 
 - The usual cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeOU2QACgkQEsHwGGHe
 VUoqTw/+LOt/36tCMbqUEIhUIvPciqdhAk+gBzP9XzGEWRJDcYLmgmBvlrFcYgIL
 lZVZ/tBsYqljycMFaJ4K4vishpkJr9s8GtDCOItkMA4cYOJcXSaGSpNojctC2Qs6
 42J+qABENRZYWFmWWcCkf8jTG0QWebmVRJwV9Q/4uYRicLFW/B+Em0sOTFn9n7OX
 39ZyCXKlmu2wsTW/DrQ8wX0KkW5hevLkJk/NFN3CDuWg7LOs09IotALs/U7ayj+C
 BIU5ZEAvan8LAPLX4Nrhb5HArsP6eCmB0Kbr+Mm+X9lRBLSbMmThXy58WlRDT0v6
 LEx+IbKUnEVoYJnC2QiqxPB4FghhZs9RE6cDnCQivwkigFhIau9krPZUc2klnVtv
 AmjUx8BporU3rcvrtKycInQQCVdzi9Q8ai7WRrpCqNEItHSOtNk9BipnFGqcnFDr
 va0SCZ3N1vIiVnZkTZ0CObA2F8RRLMiBCAB9XNfD0TzY+K0p2KCNqS6UnxmrOtR+
 8Fbk+C624u7LhmjbrPr7Hj30wIu2b/CqncIDHrh8O+jK2awvTEl6T0Tryd+jXxR1
 ERy0OSUhHEcnW73ckdfdn3/6fsL/9bnBZrTufRWxPzaM+3E9YR9KGmyc7hf5fkSX
 WE/ZRMse4QwELmSajdv9NK7oO48VDOije+cgUzPgme0i4Lsg4c0=
 =i4Yw
 -----END PGP SIGNATURE-----

Merge tag 'x86_cache_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 resource control updates from Borislav Petkov:

 - Extend resctrl with the capability of total memory bandwidth
   monitoring, thus accomodating systems which support only total but
   not local memory bandwidth monitoring. Add the respective new mount
   options

 - The usual cleanups

* tag 'x86_cache_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Document the new "mba_MBps_event" file
  x86/resctrl: Add write option to "mba_MBps_event" file
  x86/resctrl: Add "mba_MBps_event" file to CTRL_MON directories
  x86/resctrl: Make mba_sc use total bandwidth if local is not supported
  x86/resctrl: Compute memory bandwidth for all supported events
  x86/resctrl: Modify update_mba_bw() to use per CTRL_MON group event
  x86/resctrl: Prepare for per-CTRL_MON group mba_MBps control
  x86/resctrl: Introduce resctrl_file_fflags_init() to initialize fflags
  x86/resctrl: Use kthread_run_on_cpu()
2025-01-21 08:31:04 -08:00
Linus Torvalds
d80825ee4a - Add support for AMD hardware which is not affected by SRSO on the
user/kernel attack vector and advertise it to guest userspace
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeOTMwACgkQEsHwGGHe
 VUoMKhAAjMp7tYNmh8687oz8A7ujXDYvbaIh8d3zRnOKq2cEpsGKSOgkw50tbs/I
 LE5o5k2NJ6evIYEkqZZH0WvksealwzoTY1LWGqHj2zotbyP6ypZn+GKORH+MsNNL
 fUaoj6DLELqPbLrr48GJG2uabtwmPOgiElZ6bqKrFnGDPI2LSLkrY7fugM3aU4h7
 VXDUAz2N2kIRKXFedVTArZtYiVO+O4/fM1VxjIRv/KrQt0lTatsjUYc6jei/7Rqa
 xPCmw6WsYfPPY8FjsgR3oaGfUQPzs8nv96Vh9lnIFw5/ajkDbwtvRuPEwSYe9MBZ
 mE+oOqdPz4of12Mv++/BkQL/tKuVPG/e38aeZUQPo/hj2LOWdUdwdAuZuslfrqaA
 9xKZgslhPBKr0yRAku60hRpbqnp07cEHuM6JMpmFoDqN1ESnWlDapWKQj+jOpGyz
 /w0Gp00R03TVhF9QTV7KUyj/U1ykhWG+4q843G5acrgh0geWzy+fYL+jPHgtBbWp
 E+NFKmnCg9YNbTiB6y9xIcEU9siq6iMXyhp3iv0qlpwhF5WueCvc3BiUwavgpoM6
 IpVqrrJspLy6/K7tMKNVKDCIkbHvJ6vKxSM9o3yzqMTL7B3ISlG9o3MSTKQVjytR
 qEnIQAwwfsWfmeWGEDun+hh83b+HsZ+tyLyrFNleGoe4yJosZtc=
 =bWI/
 -----END PGP SIGNATURE-----

Merge tag 'x86_bugs_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 CPU speculation update from Borislav Petkov:

 - Add support for AMD hardware which is not affected by SRSO on the
   user/kernel attack vector and advertise it to guest userspace

* tag 'x86_bugs_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM: x86: Advertise SRSO_USER_KERNEL_NO to userspace
  x86/bugs: Add SRSO_USER_KERNEL_NO support
2025-01-21 08:22:40 -08:00
Linus Torvalds
d3504411a4 - Remove the shared threshold bank hack on AMD and streamline and simplify it
- Cleanup and sanitize MCA code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeOR1oACgkQEsHwGGHe
 VUo06hAAlUk3F5sp+53djAKInHdbUoHZdGX4vy5dppjnDfu5JQcg/OwO0l6RbfhA
 YUTytWHT0SdkyoJG7CnxQOteOkmimeHMfVTZw9LWm9xLMtulNMxsxJnKHlpvBbfR
 P/1eot/R+wLMzZoqqf6O8MfF1Gs/S3WjO2+T3wbdgwW7YXprbTSwW51FXhLOaObR
 PR+PfDXqcu4u4+b+bC0HSo3dN0Sc4J71cdb0tt7VIeQwVUAcfEZgdM1opXSxtQJJ
 G/Ekbjg5dJo4ZRFXXrxVNWxOXJsKbuubc6mw0C+cgCcbDklcF1gmQYvL9+NSExeP
 vDyhmMhuEDbtvUBJPQFnFywqYH/a1neo00RJUqw6xVXsn+ebBHVGLik8mgbQOaHt
 fh8bATsQ1aETAk6nx3RMPk9saiqFHk8t4qIV9FwjskXzuKDh5LzM1rGuiFLl5py/
 5hazmwn7/jYTxYJyG2ZEHD1ro2jcZFevu9dPTOSaJL3ODtlH4fQUugBcoukq6re3
 OEf/v+J8LcX+fvo8ylJYyXXT9ZDTpckjTNipU8JiEjVcro0MrxEzTnTWma4tcn+w
 Hp7lZ+/AEmHwKQcNab7frKhTPdxLFRbJyYIGiRAt9mwxXz49IBTDpoVFpXUf56zV
 Djcd6wmG1gKvM+27or/tuuiDyCZlK0+s7twRYxP50cMuvFcvmNk=
 =hSqE
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 RAS updates from Borislav Petkov:

 - Remove the shared threshold bank hack on AMD and streamline and
   simplify it

 - Cleanup and sanitize MCA code

* tag 'ras_core_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce/amd: Remove shared threshold bank plumbing
  x86/mce: Remove the redundant mce_hygon_feature_init()
  x86/mce: Convert family/model mixed checks to VFM-based checks
  x86/mce: Break up __mcheck_cpu_apply_quirks()
  x86/mce: Make four functions return bool
  x86/mce/threshold: Remove the redundant this_cpu_dec_return()
  x86/mce: Make several functions return bool
2025-01-21 08:16:24 -08:00
Thomas Gleixner
7d04319a05 x86/apic: Convert to IRQCHIP_MOVE_DEFERRED
Instead of marking individual interrupts as safe to be migrated in
arbitrary contexts, mark the interrupt chips, which require the interrupt
to be moved in actual interrupt context, with the new IRQCHIP_MOVE_DEFERRED
flag. This makes more sense because this is a per interrupt chip property
and not restricted to individual interrupts.

That flips the logic from the historical opt-out to a opt-in model. This is
simpler to handle for other architectures, which default to unrestricted
affinity setting. It also allows to cleanup the redundant core logic
significantly.

All interrupt chips, which belong to a top-level domain sitting directly on
top of the x86 vector domain are marked accordingly, unless the related
setup code marks the interrupts with IRQ_MOVE_PCNTXT, i.e. XEN.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/all/20241210103335.563277044@linutronix.de
2025-01-15 21:38:53 +01:00
Xin Li (Intel)
de31b3cd70 x86/fred: Fix the FRED RSP0 MSR out of sync with its per-CPU cache
The FRED RSP0 MSR is only used for delivering events when running
userspace.  Linux leverages this property to reduce expensive MSR
writes and optimize context switches.  The kernel only writes the
MSR when about to run userspace *and* when the MSR has actually
changed since the last time userspace ran.

This optimization is implemented by maintaining a per-CPU cache of
FRED RSP0 and then checking that against the value for the top of
current task stack before running userspace.

However cpu_init_fred_exceptions() writes the MSR without updating
the per-CPU cache.  This means that the kernel might return to
userspace with MSR_IA32_FRED_RSP0==0 when it needed to point to the
top of current task stack.  This would induce a double fault (#DF),
which is bad.

A context switch after cpu_init_fred_exceptions() can paper over
the issue since it updates the cached value.  That evidently
happens most of the time explaining how this bug got through.

Fix the bug through resynchronizing the FRED RSP0 MSR with its
per-CPU cache in cpu_init_fred_exceptions().

Fixes: fe85ee3919 ("x86/entry: Set FRED RSP0 on return to userspace instead of context switch")
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250110174639.1250829-1-xin%40zytor.com
2025-01-14 14:16:36 -08:00
David Woodhouse
7c61a3d8f7 x86/kexec: Use typedef for relocate_kernel_fn function prototype
Both i386 and x86_64 now copy the relocate_kernel() function into the control
page and execute it from there, using an open-coded function pointer.

Use a typedef for it instead.

  [ bp: Put relocate_kernel_ptr ptr arithmetic on a single line. ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-10-dwmw2@infradead.org
2025-01-14 13:09:08 +01:00
David Woodhouse
e536057543 x86/kexec: Cope with relocate_kernel() not being at the start of the page
A few places in the kexec control code page make the assumption that the first
instruction of relocate_kernel is at the very start of the page.

To allow for Clang CFI information to be added to relocate_kernel(), as well
as the general principle of removing unwarranted assumptions, fix them to use
the external __relocate_kernel_start symbol that the linker adds. This means
using a separate addq and subq for calculating offsets, as the assembler can
no longer calculate the delta directly for itself and relocations aren't that
versatile. But those values can at least be used relative to a local label to
avoid absolute relocations.

Turn the jump from relocate_kernel() to identity_mapped() into a real indirect
'jmp *%rsi' too, while touching it. There was no real reason for it to be
a push+ret in the first place, and adding Clang CFI info will also give
objtool enough visibility to start complaining 'return with modified stack
frame' about it.

  [ bp: Massage commit message. ]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-9-dwmw2@infradead.org
2025-01-14 13:05:14 +01:00
David Woodhouse
2114796ca0 x86/kexec: Mark machine_kexec() with __nocfi
A recent commit caused the relocate_kernel() function to be invoked through
a function pointer, but it does not have CFI information. The resulting trap
occurs after the IDT and GDT have been invalidated, leading to a triple-fault
if CONFIG_CFI_CLANG is enabled.

Using SYM_TYPED_FUNC_START() to provide the CFI information looks like it will
require a prolonged battle with objtool. And is fairly pointless anyway, as
the actual signature comes from a __kcfi_typeid_… symbol emitted from the
C code based on the function prototype it thinks that relocate_kernel has,
rendering the check somewhat tautological.

The simple fix is just to mark machine_kexec() with __nocfi.

Fixes: eeebbde571 ("x86/kexec: Invoke copy of relocate_kernel() instead of the original")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-7-dwmw2@infradead.org
2025-01-14 13:02:40 +01:00
Nathan Chancellor
eeed915041 x86/kexec: Fix location of relocate_kernel with -ffunction-sections
After commit

  cb33ff9e06 ("x86/kexec: Move relocate_kernel to kernel .data section"),

kernels configured with an option that uses -ffunction-sections, such as
CONFIG_LTO_CLANG, crash when kexecing because the value of relocate_kernel
does not match the value of __relocate_kernel_start so incorrect code gets
copied via machine_kexec_prepare().

  $ llvm-nm good-vmlinux &| rg relocate_kernel
  ffffffff83280d41 T __relocate_kernel_end
  ffffffff83280b00 T __relocate_kernel_start
  ffffffff83280b00 T relocate_kernel

  $ llvm-nm bad-vmlinux &| rg relocate_kernel
  ffffffff83266100 D __relocate_kernel_end
  ffffffff83266100 D __relocate_kernel_start
  ffffffff8120b0d8 T relocate_kernel

When -ffunction-sections is enabled, TEXT_MAIN matches on
'.text.[0-9a-zA-Z_]*' to coalesce the function specific functions back
into .text during link time after they have been optimized. Due to the
placement of TEXT_TEXT before KEXEC_RELOCATE_KERNEL in the x86 linker
script, the .text.relocate_kernel section ends up in .text instead of
.data.

Use a second dot in the relocate_kernel section name to avoid matching
on TEXT_MAIN, which matches a similar situation that happened in
commit

  79cd2a1122 ("x86/retpoline,kprobes: Fix position of thunk sections with CONFIG_LTO_CLANG"),

which allows kexec to function properly.

While .data.relocate_kernel still ends up in the .data section via
DATA_MAIN -> DATA_DATA, ensure it is located with the
.text.relocate_kernel section as intended by performing the same
transformation.

Fixes: cb33ff9e06 ("x86/kexec: Move relocate_kernel to kernel .data section")
Fixes: 8dbec5c77b ("x86/kexec: Add data section to relocate_kernel")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-6-dwmw2@infradead.org
2025-01-14 13:00:18 +01:00
David Woodhouse
2cacf7f23a x86/kexec: Fix stack and handling of re-entry point for ::preserve_context
A ::preserve_context kimage can be invoked more than once, and the entry point
can be different every time. When the callee returns to the kernel, it leaves
the address of its entry point for next time on the stack.

That being the case, one might reasonably assume that the caller would
allocate space for it on the stack frame before actually performing the 'call'
into the callee.

Apparently not, though. Ever since the kjump code was first added in 2009, it
has set up a *new* stack at the top of the swap_page scratch page, then just
performed the 'call' without allocating any space for the re-entry address to
be returned. It then reads the re-entry point for next time from 0(%rsp) which
is actually the first qword of the page *after* the swap page, which might not
exist at all! And if the callee has written to that, then it will have
corrupted memory it doesn't own.

Correct this by pushing the entry point of the callee onto the stack before
calling it. The callee may then adjust it, or not, as it sees fit, and
subsequent invocations should work correctly either way.

Remove a stray push of zero to the *relocate_kernel* stack, which may have
been intended for this purpose, but which was actually just noise.

Also, loading the stack for the callee relied on the address of the swap page
being in %r10 without ever documenting that fact. Recent code changes made
that no longer true, so load it directly from the local kexec_pa_swap_page
variable instead.

Fixes: b3adabae8a ("x86/kexec: Drop page_list argument from relocate_kernel()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-5-dwmw2@infradead.org
2025-01-14 12:57:43 +01:00
David Woodhouse
85d724df8c x86/kexec: Use correct swap page in swap_pages function
The swap_pages function expects the swap page to be in %r10, but there
was no documentation to that effect. Once upon a time the setup code
used to load its value from a kernel virtual address and save it to an
address which is accessible in the identity-mapped page tables, and
*happened* to use %r10 to do so, with no comment that it was left there
on *purpose* instead of just being a scratch register. Once that was no
longer necessary, %r10 just holds whatever the kernel happened to leave
in it.

Now that the original value passed by the kernel is accessible via
%rip-relative addressing, load directly from there instead of using %r10
for it. But document the other parameters that the swap_pages function
*does* expect in registers.

Fixes: b3adabae8a ("x86/kexec: Drop page_list argument from relocate_kernel()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-4-dwmw2@infradead.org
2025-01-14 12:54:36 +01:00
David Woodhouse
4d5f1da98f x86/kexec: Ensure preserve_context flag is set on return to kernel
The swap_pages() function will only actually *swap*, as its name implies, if
the preserve_context flag in the %r11 register is non-zero. On the way back
from a ::preserve_context kexec, ensure that the %r11 register is non-zero so
that the pages get swapped back.

Fixes: 9e5683e2d0 ("x86/kexec: Only swap pages for ::preserve_context mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250109140757.2841269-3-dwmw2@infradead.org
2025-01-14 12:52:47 +01:00
David Woodhouse
d144d8a652 x86/kexec: Disable global pages before writing to control page
The kernel switches to a new set of page tables during kexec. The global
mappings (_PAGE_GLOBAL==1) can remain in the TLB after this switch. This
is generally not a problem because the new page tables use a different
portion of the virtual address space than the normal kernel mappings.

The critical exception to that generalisation (and the only mapping
which isn't an identity mapping) is the kexec control page itself —
which was ROX in the original kernel mapping, but should be RWX in the
new page tables. If there is a global TLB entry for that in its prior
read-only state, it definitely needs to be flushed before attempting to
write through that virtual mapping.

It would be possible to just avoid writing to the virtual address of the
page and defer all writes until they can be done through the identity
mapping. But there's no good reason to keep the old TLB entries around,
as they can cause nothing but trouble.

Clear the PGE bit in %cr4 early, before storing data in the control page.

Fixes: 5a82223e07 ("x86/kexec: Mark relocate_kernel page as ROX instead of RWX")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219592
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: "Ning, Hongyu" <hongyu.ning@linux.intel.com>
Link: https://lore.kernel.org/r/20250109140757.2841269-2-dwmw2@infradead.org
2025-01-14 12:46:17 +01:00
Guo Weikang
ccd582059a mm/early_ioremap: add null pointer checks to prevent NULL-pointer dereference
The early_ioremap interface can fail and return NULL in certain cases.  To
prevent NULL-pointer dereference crashes, fixed issues in the acpi_extlog
and copy_early_mem interfaces, improving robustness when handling early
memory.

Link: https://lkml.kernel.org/r/20241212101004.1544070-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang <guoweikang.kernel@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13 22:40:59 -08:00
Qi Zheng
718b13861d x86: mm: free page table pages by RCU instead of semi RCU
Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
will be freed by semi RCU, that is:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

In this way, the page table can be lockless traversed by disabling IRQ in
paths such as fast GUP.  But this is not enough to free the empty PTE page
table pages in paths other that munmap and exit_mmap path, because IPI
cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().

In preparation for supporting empty PTE page table pages reclaimation, let
single table also be freed by RCU like batch table freeing.  Then we can
also use pte_offset_map() etc to prevent PTE page from being freed.

Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
the page table pages:

 - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.

 - For pt_list, it is used to manage the PGD page in x86. Fortunately
   tlb_remove_table() will not be used for free PGD pages, so it is safe
   to use pt_rcu_head.

 - For pmd_huge_pte, it is used for THPs, so it is safe.

After applying this patch, if CONFIG_PT_RECLAIM is enabled, the function
call of free_pte() is as follows:

free_pte
  pte_free_tlb
    __pte_free_tlb
      ___pte_free_tlb
        paravirt_tlb_remove_table
          tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
            [no-free-memory slowpath:]
              tlb_table_invalidate
              tlb_remove_table_one
                __tlb_remove_table_one [frees via RCU]
            [fastpath:]
              tlb_table_flush
                tlb_remove_table_free [frees via RCU]
          native_tlb_remove_table [CONFIG_PARAVIRT on native]
            tlb_remove_table [see above]

Link: https://lkml.kernel.org/r/0287d442a973150b0e1019cc406e6322d148277a.1733305182.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13 22:40:48 -08:00
Dr. David Alan Gilbert
58589c6a6e rtc: Remove hpet_rtc_dropped_irq()
hpet_rtc_dropped_irq() has been unused since
commit f52ef24be2 ("rtc/alpha: remove legacy rtc driver")

Remove it in rtc, and x86 hpet code.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241215022356.181625-1-linux@treblig.org
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2025-01-13 23:07:18 +01:00
K Prateek Nayak
e1bc026465 x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
x86_sched_itmt_flags() returns SD_ASYM_PACKING if ITMT support is
enabled by the system. Without ITMT support being enabled, it returns 0
similar to current x86_die_flags() on non-Hybrid systems
(!X86_HYBRID_CPU and !X86_FEATURE_AMD_HETEROGENEOUS_CORES)

On Intel systems that enable ITMT support, either the MC domain
coincides with the PKG domain, or in case of multiple MC groups
within a PKG domain, either Sub-NUMA Cluster (SNC) is enabled or the
processor features Hybrid core layout (X86_HYBRID_CPU) which leads to
three distinct possibilities:

o If PKG and MC domains coincide, PKG domain is degenerated by
  sd_parent_degenerate() when building sched domain topology.

o If SNC is enabled, PKG domain is never added since
  "x86_has_numa_in_package" is set and the topology will instead contain
  NODE and NUMA domains.

o On X86_HYBRID_CPU which contains multiple MC groups within the PKG,
  the PKG domain requires x86_sched_itmt_flags().

Thus, on Intel systems that contains multiple MC groups within the PKG
and enables ITMT support, the PKG domain requires
x86_sched_itmt_flags(). In all other cases PKG domain is either never
added or is degenerated. Thus, returning x86_sched_itmt_flags()
unconditionally at PKG domain on Intel systems should not lead to any
functional changes.

On AMD systems with multiple LLCs (MC groups) within a PKG domain,
enabling ITMT support requires setting SD_ASYM_PACKING to the PKG domain
since the core rankings are assigned PKG-wide.

Core rankings on AMD processors is currently set by the amd-pstate
driver when Preferred Core feature is supported. A subset of systems that
support Preferred Core feature can be detected using
X86_FEATURE_AMD_HETEROGENEOUS_CORES however, this does not cover all the
systems that support Preferred Core ranking.

Detecting Preferred Core support on AMD systems requires inspecting CPPC
Highest Perf on all present CPUs and checking if it differs on at least
one CPU. Previous suggestion to use a synthetic feature to detect
Preferred Core support [1] was found to be non-trivial to implement
since BSP alone cannot detect if Preferred Core is supported and by the
time AP comes up, alternatives are patched and setting a X86_FEATURE_*
then is not possible.

Since x86 processors enabling ITMT support that consists multiple
non-NUMA MC groups within a PKG requires SD_ASYM_PACKING flag set at the
PKG domain, return x86_sched_itmt_flags unconditionally for the PKG
domain.

Since x86_die_flags() would have just returned x86_sched_itmt_flags()
after the change, remove the unnecessary wrapper and pass
x86_sched_itmt_flags() directly as the flags function.

Fixes: f3a0523918 ("cpufreq: amd-pstate: Enable amd-pstate preferred core support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-6-kprateek.nayak@amd.com
2025-01-13 14:10:24 +01:00
K Prateek Nayak
537e247879 x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
x86_*_flags() wrappers were introduced with commit d3d37d850d
("x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU") to add
x86_sched_itmt_flags() in addition to the default domain flags for SMT
and MC domain.

commit 995998ebde ("x86/sched: Remove SD_ASYM_PACKING from the
SMT domain flags") removed the ITMT flags for SMT domain but not the
x86_smt_flags() wrappers which directly returns cpu_smt_flags().

Remove x86_smt_flags() and directly use cpu_smt_flags() to derive the
flags for SMT domain. No functional changes intended.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-5-kprateek.nayak@amd.com
2025-01-13 14:10:24 +01:00
K Prateek Nayak
d04013a4b2 x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
"sched_itmt_enabled" was only introduced as a debug toggle for any funky
ITMT behavior. Move the sysctl controlled from
"/proc/sys/kernel/sched_itmt_enabled" to debugfs at
"/sys/kernel/debug/x86/sched_itmt_enabled" with a notable change that a
cat on the file will return "Y" or "N" instead of "1" or "0" to
indicate that feature is enabled or disabled respectively. Either "0" or
"N" (or any string that kstrtobool() interprets as false) can be written
to the file will disable the feature, and writing  either "1" or "Y" (or
any string that kstrtobool() interprets as true) will enable it back
when the platform supports ITMT ranking.

Since ITMT is x86 specific (and PowerPC uses SD_ASYM_PACKING too), the
toggle was moved to "/sys/kernel/debug/x86/" as opposed to
"/sys/kernel/debug/sched/"

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-4-kprateek.nayak@amd.com
2025-01-13 14:10:24 +01:00
K Prateek Nayak
fc1055d533 x86/itmt: Use guard() for itmt_update_mutex
Use guard() for itmt_update_mutex which avoids the extra mutex_unlock()
in the bailout and return paths.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-3-kprateek.nayak@amd.com
2025-01-13 14:10:23 +01:00
K Prateek Nayak
2f6f726bdd x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
In preparation to move "sysctl_sched_itmt_enabled" to debugfs, convert
the unsigned int to bool since debugfs readily exposes boolean fops
primitives (debugfs_read_file_bool, debugfs_write_file_bool) which can
streamline the conversion.

Since the current ctl_table initializes extra1 and extra2 to SYSCTL_ZERO
and SYSCTL_ONE respectively, the value of "sysctl_sched_itmt_enabled"
can only be 0 or 1 and this datatype conversion should not cause any
functional changes.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20241223043407.1611-2-kprateek.nayak@amd.com
2025-01-13 14:10:23 +01:00
Yafang Shao
52cd5c4b59 arch: remove get_task_comm() and print task comm directly
Since task->comm is guaranteed to be NUL-terminated, we can print it
directly without the need to copy it into a separate buffer.  This
simplifies the code and avoids unnecessary operations.

Link: https://lkml.kernel.org/r/20241219023452.69907-3-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "André Almeida" <andrealmeid@igalia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Danilo Krummrich <dakr@redhat.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kalle Valo <kvalo@kernel.org>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-12 20:21:15 -08:00
Nuno Das Neves
ef5a3c92a8 hyperv: Switch from hyperv-tlfs.h to hyperv/hvhdk.h
Switch to using hvhdk.h everywhere in the kernel. This header
includes all the new Hyper-V headers in include/hyperv, which form a
superset of the definitions found in hyperv-tlfs.h.

This makes it easier to add new Hyper-V interfaces without being
restricted to those in the TLFS doc (reflected in hyperv-tlfs.h).

To be more consistent with the original Hyper-V code, the names of
some definitions are changed slightly. Update those where needed.

Update comments in mshyperv.h files to point to include/hyperv for
adding new definitions.

Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com
Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-01-10 00:54:21 +00:00
Nikunj A Dadhania
73bbf3b0fb x86/tsc: Init the TSC for Secure TSC guests
Use the GUEST_TSC_FREQ MSR to discover the TSC frequency instead of
relying on kvm-clock based frequency calibration.  Override both CPU and
TSC frequency calibration callbacks with securetsc_get_tsc_khz(). Since
the difference between CPU base and TSC frequency does not apply in this
case, the same callback is being used.

  [ bp: Carve out from
    https://lore.kernel.org/r/20250106124633.1418972-11-nikunj@amd.com ]

Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250106124633.1418972-11-nikunj@amd.com
2025-01-08 21:26:19 +01:00
Yazen Ghannam
79821b907f x86/amd_node: Use defines for SMN register offsets
There are more than one SMN index/data pair available for software use.
The register offsets are different, but the protocol is the same.

Use defines for the SMN offset values and allow the index/data offsets
to be passed to the read/write helper function.

This eases code reuse with other SMN users in the kernel.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241206161210.163701-14-yazen.ghannam@amd.com
2025-01-08 11:02:28 +01:00