Moving the module sanity check earlier skipped the audit logging call in
the failure case and put the check in a place where the previously used
context is unavailable.
Add an audit logging call for the module loading failure case and get
the module name when possible.
Link: https://issues.redhat.com/browse/RHEL-52839
Fixes: 02da2cbab4 ("module: move check_modinfo() early to early_mod_check()")
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Hook alloc_workqueue() and alloc_workqueue_attrs() so that they're
accounted to the callsite. Since we're doing allocations on behalf of
another subsystem, this helps when using memory allocation profiling to
check for leaks.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use NULL instead of 0 to signal no LLC domain, matching numa_span() and
the function comment.
No functional change.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cpumask_next_wrap() is more expressive and more efficient than
cpumask_next() followed by cpumask_first().
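As a rough userspace illustration, the open-coded pattern and a single
wrapping lookup can be compared on a toy 8-bit mask. This is only a
sketch: it is not the kernel's cpumask implementation, and NBITS and
every helper name below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NBITS 8u	/* toy mask width, stands in for the number of CPUs */

/* First set bit strictly after position n, or NBITS if there is none. */
static unsigned int mask_next(uint8_t mask, int n)
{
	assert(n >= -1 && n < (int)NBITS);
	for (unsigned int i = (unsigned int)(n + 1); i < NBITS; i++)
		if (mask & (1u << i))
			return i;
	return NBITS;
}

/* First set bit, or NBITS if the mask is empty. */
static unsigned int mask_first(uint8_t mask)
{
	return mask_next(mask, -1);
}

/* Open-coded pattern: search forward, then restart from the beginning. */
static unsigned int next_then_first(uint8_t mask, int n)
{
	unsigned int cpu = mask_next(mask, n);

	if (cpu >= NBITS)
		cpu = mask_first(mask);
	return cpu;
}

/* Single helper: search forward from n + 1 and wrap around once. */
static unsigned int mask_next_wrap(uint8_t mask, int n)
{
	assert(n >= -1 && n < (int)NBITS);
	for (unsigned int step = 0; step < NBITS; step++) {
		unsigned int i = ((unsigned int)(n + 1) + step) % NBITS;

		if (mask & (1u << i))
			return i;
	}
	return NBITS;
}
```

Both variants return the same bit in this model; the wrapping helper
states the intent in one call.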
Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250614155031.340988-3-yury.norov@gmail.com
cpumask_any_but() is more expressive than cpumask_first() followed by
cpumask_next(). Use it in clocksource_verify_choose_cpus().
Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250614155031.340988-2-yury.norov@gmail.com
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.
tj: Updated subject for clarity.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.
tj: Updated subject for clarity. Fixed stray #else block which wasn't
removed causing build failure.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.
tj: Updated subject for clarity. Replace #if defined() with #ifdef.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.
tj: Updated subject for clarity.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch removes duplicated code.
Eduard points out [1]:
Same cleanup cycles are done in push_stack() and push_async_cb(),
both functions are only reachable from do_check_common() via
do_check() -> do_check_insn().
Hence, I think that cur state should not be freed in push_*()
functions and pop_stack() loop there is not needed.
This would also fix the 'symptom' for [2], but the issue also has a
simpler fix which was sent separately. This fix also makes sure the
push_*() callers always return an error for which
error_recoverable_with_nospec(err) is false. This is required because
otherwise we try to recover and access the stale `state`.
Moving free_verifier_state() and pop_stack(..., pop_log=false) to happen
after the bpf_vlog_reset() call in do_check_common() is fine because the
pop_stack() call that is moved does not call bpf_vlog_reset() with the
pop_log=false parameter.
[1] https://lore.kernel.org/all/b6931bd0dd72327c55287862f821ca6c4c3eb69a.camel@gmail.com/
[2] https://lore.kernel.org/all/68497853.050a0220.33aa0e.036a.GAE@google.com/
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/all/b6931bd0dd72327c55287862f821ca6c4c3eb69a.camel@gmail.com/
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Link: https://lore.kernel.org/r/20250613090157.568349-2-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF_JSET is a conditional jump and currently verifier.c:can_jump()
does not know about that. This can lead to incorrect live registers
and SCC computation.
E.g. in the following example:
1: r0 = 1;
2: r2 = 2;
3: if r1 & 0x7 goto +1;
4: exit;
5: r0 = r2;
6: exit;
W/o this fix insn_successors(3) will return only (4), a jump to (5)
would be missed and r2 won't be marked as alive at (3).
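The effect on successor computation can be sketched with a reduced toy
model. The opcode enum and the toy_*() helpers are hypothetical and do
not reproduce verifier.c; the only assumption taken from BPF is that a
jump at index idx with offset off targets idx + off + 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced opcode set for illustration only. */
enum toy_op { TOY_MOV, TOY_JA, TOY_JEQ, TOY_JSET, TOY_EXIT };

/* True if the instruction may transfer control to idx + off + 1. */
static bool toy_can_jump(enum toy_op op)
{
	switch (op) {
	case TOY_JA:
	case TOY_JEQ:
	case TOY_JSET:	/* the missing case: JSET is a conditional jump */
		return true;
	default:
		return false;
	}
}

/* True if control may fall through to idx + 1. */
static bool toy_can_fallthrough(enum toy_op op)
{
	return op != TOY_JA && op != TOY_EXIT;
}

/* Fill succ[] with the successors of insn idx and return their count. */
static int toy_successors(enum toy_op op, int idx, int off, int succ[2])
{
	int n = 0;

	assert(succ);
	if (toy_can_fallthrough(op))
		succ[n++] = idx + 1;
	if (toy_can_jump(op))
		succ[n++] = idx + off + 1;
	return n;
}
```

With TOY_JSET handled, the instruction at (3) with offset +1 reports both
(4) and (5) as successors, which is what liveness analysis needs to see.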
Fixes: 14c8552db6 ("bpf: simple DFA-based live registers analysis")
Reported-by: syzbot+a36aac327960ff474804@syzkaller.appspotmail.com
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250613175331.3238739-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
If an exiting non-autoreaping task has already passed exit_notify() and
calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent
or debugger right after unlock_task_sighand().
If a concurrent posix_cpu_timer_del() runs at that moment, it won't be
able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or
lock_task_sighand() will fail.
Add the tsk->exit_state check into run_posix_cpu_timers() to fix this.
This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because
exit_task_work() is called before exit_notify(). But the check still
makes sense: task_work_add(&tsk->posix_cputimers_work.work) will fail
anyway in this case.
Cc: stable@vger.kernel.org
Reported-by: Benoît Sevens <bsevens@google.com>
Fixes: 0bdd2ed413 ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit adds the __GFP_ACCOUNT flag to verifier-induced memory
allocations. The intent is to account for all allocations reachable
from the BPF_PROG_LOAD command, which is needed to track verifier memory
consumption in veristat. This includes allocations done in verifier.c
and some allocations in btf.c; functions in log.c do not allocate.
There is also a utility function bpf_memcg_flags() which selectively
adds the __GFP_ACCOUNT flag depending on the `cgroup.memory=nobpf` option.
As far as I understand [1], the idea is to remove bpf_prog instances
and maps from memcg accounting as these objects do not strictly belong
to a cgroup, hence it should not apply here.
(btf_parse_fields() is reachable from both program load and map
creation, but the allocated record is not persistent, as it is freed as
soon as map_check_btf() exits).
[1] https://lore.kernel.org/all/20230210154734.4416-1-laoar.shao@gmail.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250613072147.3938139-2-eddyz87@gmail.com
There are two possible scenarios for syscall filtering:
- having a trusted/allowed range of PCs, and intercepting everything else
- or the opposite: a single untrusted/intercepted range and allowing
everything else (this is relevant for any kind of sandboxing scenario,
or monitoring behavior of a single library)
The current API only allows the former use case due to allowed
range wrap-around check. Add PR_SYS_DISPATCH_INCLUSIVE_ON that
enables the second use case.
Add PR_SYS_DISPATCH_EXCLUSIVE_ON alias for PR_SYS_DISPATCH_ON
to make it clear how it's different from the new
PR_SYS_DISPATCH_INCLUSIVE_ON.
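The difference between the two modes can be sketched as a pure range
check. This is a simplified model with hypothetical names (toy_mode,
toy_should_intercept()); it ignores the selector byte that the real
mechanism also consults and is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mode names that only mirror the two modes described above. */
enum toy_mode { TOY_EXCLUSIVE, TOY_INCLUSIVE };

/*
 * True if pc lies in [start, start + len). The unsigned subtraction wraps
 * for pc < start, so a single comparison covers both bounds.
 */
static bool toy_in_range(uintptr_t pc, uintptr_t start, uintptr_t len)
{
	return pc - start < len;
}

/*
 * Exclusive: the range is trusted, everything outside it is intercepted.
 * Inclusive: only the range is intercepted, everything else is allowed.
 */
static bool toy_should_intercept(enum toy_mode mode, uintptr_t pc,
				 uintptr_t start, uintptr_t len)
{
	bool inside = toy_in_range(pc, start, len);

	assert(mode == TOY_EXCLUSIVE || mode == TOY_INCLUSIVE);
	return mode == TOY_EXCLUSIVE ? !inside : inside;
}
```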
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/97947cc8e205ff49675826d7b0327ef2e2c66eea.1747839857.git.dvyukov@google.com
Initialize `ops` member's pointers properly by using kzalloc() instead of
kmalloc() when allocating the simulation work context. Otherwise the
pointers contain random content leading to invalid dereferencing.
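A userspace analogue, with calloc() standing in for kzalloc(), shows why
zeroed allocation matters for embedded callback pointers. The structure
layout and the toy_*() names are hypothetical, and the sketch assumes a
platform where an all-zero-bytes pointer is NULL.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a work context with embedded callbacks. */
struct toy_ops {
	void (*start)(void);
	void (*stop)(void);
};

struct toy_work {
	struct toy_ops ops;
	int pending;
};

/*
 * calloc() plays the role of kzalloc(): every byte starts as zero, so the
 * callback pointers read as NULL until they are assigned.
 */
static struct toy_work *toy_work_alloc(void)
{
	return calloc(1, sizeof(struct toy_work));
}

/* Callers can now test a callback before invoking it. */
static int toy_work_stop(const struct toy_work *w)
{
	assert(w);
	if (!w->ops.stop)
		return -1;	/* nothing registered */
	w->ops.stop();
	return 0;
}
```

With malloc() in place of calloc(), the NULL test in toy_work_stop()
would read indeterminate memory and could call through a garbage pointer.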
Signed-off-by: Gyeyoung Baek <gye976@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250612124827.63259-1-gye976@gmail.com
There have been a few bugs and/or misunderstandings about the reference
counting, and startup/shutdown behaviors in the IRQ core and related CPU
hotplug code. These 4 test cases try to capture a few interesting cases.
* irq_disable_depth_test: basic request/disable/enable sequence
* irq_free_disabled_test: request/disable/free/re-request sequence -
this catches errors on previous revisions of my work
* irq_cpuhotplug_test: exercises managed-affinity IRQ + CPU hotplug.
This captures a problematic test case which was fixed recently.
This test requires CONFIG_SMP and a hotpluggable CPU#1.
* irq_shutdown_depth_test: exercises similar behavior from
irq_cpuhotplug_test, but directly using irq_*() APIs instead of going
through CPU hotplug. This still requires CONFIG_SMP, because
managed-affinity is stubbed out (and not all APIs are even present)
without it.
Note the use of 'imply SMP': ARCH=um doesn't support SMP, and kunit is
often exercised there. Thus, 'imply' will force SMP on where possible
(such as ARCH=x86_64), but leave it off where it's not.
Behavior on various SMP and ARCH configurations:
$ tools/testing/kunit/kunit.py run 'irq_test_cases*' --arch x86_64 --qemu_args '-smp 2'
[...]
[11:12:24] Testing complete. Ran 4 tests: passed: 4
$ tools/testing/kunit/kunit.py run 'irq_test_cases*' --arch x86_64
[...]
[11:13:27] [SKIPPED] irq_cpuhotplug_test
[11:13:27] ================= [PASSED] irq_test_cases ==================
[11:13:27] ============================================================
[11:13:27] Testing complete. Ran 4 tests: passed: 3, skipped: 1
# default: ARCH=um
$ tools/testing/kunit/kunit.py run 'irq_test_cases*'
[11:14:26] [SKIPPED] irq_shutdown_depth_test
[11:14:26] [SKIPPED] irq_cpuhotplug_test
[11:14:26] ================= [PASSED] irq_test_cases ==================
[11:14:26] ============================================================
[11:14:26] Testing complete. Ran 4 tests: passed: 2, skipped: 2
Without commit 788019eb55 ("genirq: Retain disable depth for managed
interrupts across CPU hotplug"), this fails as follows:
[11:18:55] =============== irq_test_cases (4 subtests) ================
[11:18:55] [PASSED] irq_disable_depth_test
[11:18:55] [PASSED] irq_free_disabled_test
[11:18:55] # irq_shutdown_depth_test: EXPECTATION FAILED at kernel/irq/irq_test.c:147
[11:18:55] Expected desc->depth == 1, but
[11:18:55] desc->depth == 0 (0x0)
[11:18:55] ------------[ cut here ]------------
[11:18:55] Unbalanced enable for IRQ 26
[11:18:55] WARNING: CPU: 1 PID: 36 at kernel/irq/manage.c:792 __enable_irq+0x36/0x60
...
[11:18:55] [FAILED] irq_shutdown_depth_test
[11:18:55] #1
[11:18:55] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:202
[11:18:55] Expected irqd_is_activated(data) to be false, but is true
[11:18:55] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:203
[11:18:55] Expected irqd_is_started(data) to be false, but is true
[11:18:55] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:204
[11:18:55] Expected desc->depth == 1, but
[11:18:55] desc->depth == 0 (0x0)
[11:18:55] ------------[ cut here ]------------
[11:18:55] Unbalanced enable for IRQ 27
[11:18:55] WARNING: CPU: 0 PID: 38 at kernel/irq/manage.c:792 __enable_irq+0x36/0x60
...
[11:18:55] [FAILED] irq_cpuhotplug_test
[11:18:55] # module: irq_test
[11:18:55] # irq_test_cases: pass:2 fail:2 skip:0 total:4
[11:18:55] # Totals: pass:2 fail:2 skip:0 total:4
[11:18:55] ================= [FAILED] irq_test_cases ==================
[11:18:55] ============================================================
[11:18:55] Testing complete. Ran 4 tests: passed: 2, failed: 2
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250522210837.4135244-1-briannorris@chromium.org
Commit 788019eb55 ("genirq: Retain disable depth for managed interrupts
across CPU hotplug") tried to make managed shutdown/startup properly
reference counted, but it missed the fact that the unplug and hotplug code
has an intentional imbalance by skipping IRQS_SUSPENDED interrupts on
the "restore" path.
This means that if a managed-affinity interrupt was both suspended and
managed-shutdown (such as may happen during system suspend / S3), resume
skips calling irq_startup_managed(), and the interrupt would again have
an unbalanced depth, this time with a positive value (i.e., remaining
unexpectedly masked).
This IRQS_SUSPENDED check was introduced in commit a60dd06af6
("genirq/cpuhotplug: Skip suspended interrupts when restoring affinity")
for essentially the same reason as commit 788019eb55, to prevent
irq_startup() from unconditionally re-enabling an interrupt too early.
Because irq_startup_managed() now respects the disable-depth count, the
IRQS_SUSPENDED check is no longer needed, and instead, it causes harm.
Thus, drop the IRQS_SUSPENDED check, and restore balance.
This effectively reverts commit a60dd06af6 ("genirq/cpuhotplug: Skip
suspended interrupts when restoring affinity"), because it is replaced
by commit 788019eb55 ("genirq: Retain disable depth for managed
interrupts across CPU hotplug").
Fixes: 788019eb55 ("genirq: Retain disable depth for managed interrupts across CPU hotplug")
Reported-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Link: https://lore.kernel.org/all/20250612183303.3433234-3-briannorris@chromium.org
Closes: https://lore.kernel.org/lkml/24ec4adc-7c80-49e9-93ee-19908a97ab84@gmail.com/
Commit 788019eb55 ("genirq: Retain disable depth for managed interrupts
across CPU hotplug") intended to only decrement the disable depth once per
managed shutdown, but instead it decrements for each CPU hotplug in the
affinity mask, until its depth reaches a point where it finally gets
re-started.
For example, consider:
1. Interrupt is affine to CPU {M,N}
2. disable_irq() -> depth is 1
3. CPU M goes offline -> interrupt migrates to CPU N / depth is still 1
4. CPU N goes offline -> irq_shutdown() / depth is 2
5. CPU N goes online
-> irq_restore_affinity_of_irq()
-> irqd_is_managed_and_shutdown()==true
-> irq_startup_managed() -> depth is 1
6. CPU M goes online
-> irq_restore_affinity_of_irq()
-> irqd_is_managed_and_shutdown()==true
-> irq_startup_managed() -> depth is 0
*** BUG: driver expects the interrupt is still disabled ***
-> irq_startup() -> irqd_clr_managed_shutdown()
7. enable_irq() -> depth underflow / unbalanced enable_irq() warning
This should clear the managed-shutdown flag at step 6, so that further
hotplugs don't cause further imbalance.
Note: It might be cleaner to also remove the irqd_clr_managed_shutdown()
invocation from __irq_startup_managed(). But this is currently not possible
because of irq_update_affinity_desc() as it sets IRQD_MANAGED_SHUTDOWN and
expects irq_startup() to clear it.
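The sequence above can be replayed with a small state model. This is a
sketch under assumptions, not the genirq code: the toy_*() names are
hypothetical, and the "fixed" variant assumes the managed-shutdown flag
is cleared as soon as the managed startup path has run once.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_irq {
	int depth;		/* disable depth; 0 means enabled */
	bool managed_shutdown;	/* set when the last CPU in the mask left */
};

static void toy_disable_irq(struct toy_irq *irq)
{
	irq->depth++;
}

/* Returns false on an unbalanced enable (depth is already 0). */
static bool toy_enable_irq(struct toy_irq *irq)
{
	if (irq->depth == 0)
		return false;
	irq->depth--;
	return true;
}

static void toy_last_cpu_offline(struct toy_irq *irq)
{
	irq->managed_shutdown = true;
	irq->depth++;
}

static void toy_cpu_online(struct toy_irq *irq, bool clear_flag_first)
{
	if (!irq->managed_shutdown)
		return;
	if (clear_flag_first)
		irq->managed_shutdown = false;	/* one restore per shutdown */
	irq->depth--;
	if (irq->depth == 0)
		irq->managed_shutdown = false;	/* full startup clears it too */
}

/* Replays the example; returns true if the final enable is balanced. */
static bool toy_replay(bool clear_flag_first)
{
	struct toy_irq irq = { 0, false };

	toy_disable_irq(&irq);			/* depth is 1 */
	/* CPU M offline: the interrupt migrates, depth is unchanged */
	toy_last_cpu_offline(&irq);		/* CPU N offline: depth is 2 */
	toy_cpu_online(&irq, clear_flag_first);	/* CPU N online: depth is 1 */
	toy_cpu_online(&irq, clear_flag_first);	/* CPU M online */
	assert(irq.depth >= 0);
	return toy_enable_irq(&irq);
}
```

In this model toy_replay(false) ends with an unbalanced enable, while
toy_replay(true) leaves the depth at 1 until the driver's own enable.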
Fixes: 788019eb55 ("genirq: Retain disable depth for managed interrupts across CPU hotplug")
Reported-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Link: https://lore.kernel.org/all/20250612183303.3433234-2-briannorris@chromium.org
padata_do_parallel() and padata_index_to_cpu() duplicate cpumask_nth().
Fix both and use the generic helper.
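For illustration, the duplicated walk and an "n-th set bit" helper can be
compared on a toy 8-bit mask. The names and NBITS are hypothetical, and
this is a userspace sketch rather than the kernel's cpumask code.

```c
#include <assert.h>
#include <stdint.h>

#define NBITS 8u	/* toy mask width */

/* First set bit at or after position from, or NBITS if there is none. */
static unsigned int toy_next(uint8_t mask, unsigned int from)
{
	assert(from <= NBITS);
	for (unsigned int i = from; i < NBITS; i++)
		if (mask & (1u << i))
			return i;
	return NBITS;
}

/* Open-coded: take the first set bit, then step forward n times. */
static unsigned int toy_index_to_cpu(uint8_t mask, unsigned int n)
{
	unsigned int cpu = toy_next(mask, 0);

	while (n-- > 0 && cpu < NBITS)
		cpu = toy_next(mask, cpu + 1);
	return cpu;
}

/* Generic helper: the n-th set bit (counting from 0), or NBITS. */
static unsigned int toy_mask_nth(uint8_t mask, unsigned int n)
{
	for (unsigned int i = 0; i < NBITS; i++)
		if ((mask & (1u << i)) && n-- == 0)
			return i;
	return NBITS;
}
```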
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
There is a race condition/UAF in padata_reorder that goes back
to the initial commit. A reference count is taken at the start
of the process in padata_do_parallel, and released at the end in
padata_serial_worker.
This reference count is (and only is) required for padata_replace
to function correctly. If padata_replace is never called then
there is no issue.
In the function padata_reorder which serves as the core of padata,
as soon as padata is added to queue->serial.list, and the associated
spin lock released, that padata may be processed and the reference
count on pd would go away.
Fix this by getting the next padata before the squeue->serial lock
is released.
In order to make this possible, simplify padata_reorder by only
calling it once the next padata arrives.
Fixes: 16295bec63 ("padata: Generic parallelization/serialization interface")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Simplify the scheduler by making CONFIG_SMP=y code in the stop-CPU
scheduling class unconditional.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-36-mingo@kernel.org
Simplify the scheduler by making CONFIG_SMP=y primitives and data
structures unconditional in the RT policies scheduler.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-29-mingo@kernel.org
Simplify the scheduler by making the CONFIG_SMP=y version of
idle_thread_set_boot_cpu() unconditional.
Note that idle_thread_set_boot_cpu() is already conditional
on CONFIG_GENERIC_SMP_IDLE_THREAD, which most architectures
select unconditionally on both UP and SMP kernels.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-28-mingo@kernel.org
Simplify the scheduler by making CONFIG_SMP=y data structure
of rq->hrtick_csd unconditional.
Adjust hrtick_start() accordingly, which was split due to the
::hrtick_csd asymmetry and use the SMP version there too.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-23-mingo@kernel.org
Simplify the scheduler by making CONFIG_SMP=y primitives and data
structures unconditional.
Introduce transitory wrappers for functionality not yet converted to SMP.
Note that this patch is pretty large, because there's no clear separation
between various aspects of the SMP scheduler, it's basically a huge block
of #ifdef CONFIG_SMP. A fair amount of it has to be switched on for it to
boot and work on UP systems.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-21-mingo@kernel.org
Simplify the scheduler by making CONFIG_SMP=y primitives and data
structures unconditional.
Unconditionally build kernel/sched/topology.c and the main sched-domains
locking primitives.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-20-mingo@kernel.org
With input changed == NULL, a local variable is used for "changed".
Initialize tmp properly, so that it can be used in the following:
*changed |= err > 0;
Otherwise, UBSAN will complain:
UBSAN: invalid-load in kernel/bpf/verifier.c:18924:4
load of value <some random value> is not a valid value for type '_Bool'
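The pattern can be sketched in isolation. toy_propagate() is a
hypothetical helper with an optional out parameter, not the verifier
function; it only shows why the local fallback must be initialized.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: reports through *changed if the caller wants it. */
static int toy_propagate(int err, bool *changed)
{
	bool tmp = false;	/* must be initialized: '|=' reads it first */

	if (!changed)
		changed = &tmp;	/* caller does not care; use local storage */

	*changed |= err > 0;
	return err < 0 ? err : 0;
}
```

Because `|=` is a read-modify-write, leaving tmp uninitialized means
loading an indeterminate value as a bool, which is what UBSAN flags.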
Fixes: dfb2d4c64b ("bpf: set 'changed' status if propagate_liveness() did any updates")
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250612221100.2153401-1-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Without this, `state->speculative` is used after the cleanup cycles in
push_stack() or push_async_cb() freed `env->cur_state` (i.e., `state`).
Avoid this by relying on the short-circuit logic to only access `state`
if the error is recoverable (and make sure it never is after push_*()
failed).
push_*() callers must always return an error for which
error_recoverable_with_nospec(err) is false if push_*() returns NULL,
otherwise we try to recover and access the stale `state`. This is only
violated by sanitize_ptr_alu(), thus also fix this case to return
-ENOMEM.
state->speculative does not make sense if the error path of push_*()
ran. In that case, `state->speculative &&
error_recoverable_with_nospec(err)` as a whole should already never
evaluate to true (because all cases where push_stack() fails must return
-ENOMEM/-EFAULT). As mentioned, this is only violated by the
push_stack() call in sanitize_speculative_path() which returns -EACCES
without [1] (through REASON_STACK in sanitize_err() after
sanitize_ptr_alu()). To fix this, return -ENOMEM for REASON_STACK (which
is also the behavior we will have after [1]).
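The ordering argument can be sketched as follows. The toy_*() names are
hypothetical, and the set of recoverable errors is an assumption made for
this sketch; the text above only requires that -ENOMEM and -EFAULT are
not in it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_state {
	bool speculative;
};

/* Assumed recoverable set; -ENOMEM and -EFAULT are deliberately absent. */
static bool toy_error_recoverable(int err)
{
	return err == -EPERM || err == -EACCES || err == -EINVAL;
}

/*
 * The error is checked first: '&&' short-circuits, so 'state' is only
 * dereferenced when the error is recoverable, i.e. when no push_*()
 * failure has freed it.
 */
static bool toy_should_recover(int err, const struct toy_state *state)
{
	return toy_error_recoverable(err) && state->speculative;
}
```

This is why a failed push must surface as an unrecoverable error: with
-EACCES, the second operand would still be evaluated on a stale state.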
Checked that it fixes the syzbot reproducer as expected.
[1] https://lore.kernel.org/all/20250603213232.339242-1-luis.gerhorst@fau.de/
Fixes: d6f1c85f22 ("bpf: Fall back to nospec for Spectre v1")
Reported-by: syzbot+b5eb72a560b8149a1885@syzkaller.appspotmail.com
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/all/38862a832b91382cddb083dddd92643bed0723b8.camel@gmail.com/
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250611210728.266563-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The previous patch switched read and precision tracking for
iterator-based loops from state-graph-based loop tracking to
control-flow-graph-based loop tracking.
This patch removes the now-unused `update_loop_entry()` and
`get_loop_entry()` functions, which were part of the state-graph-based
logic.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250611200836.4135542-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Current loop_entry-based exact states comparison logic does not handle
the following case:
.-> A --. Assume the states are visited in the order A, B, C.
| | | Assume that state B reaches a state equivalent to state A.
| v v At this point, state C is not processed yet, so state A
'-- B C has not received any read or precision marks from C.
As a result, these marks won't be propagated to B.
If B has incomplete marks, it is unsafe to use it in states_equal()
checks.
This commit replaces the existing logic with the following:
- Strongly connected components (SCCs) are computed over the program's
control flow graph (intraprocedurally).
- When a verifier state enters an SCC, that state is recorded as the
SCC entry point.
- When a verifier state is found equivalent to another (e.g., B to A
in the example), it is recorded as a states graph backedge.
Backedges are accumulated per SCC.
- When an SCC entry state reaches `branches == 0`, read and precision
marks are propagated through the backedges (e.g., from A to B, from
C to A, and then again from A to B).
To support nested subprogram calls, the entry state and backedge list
are associated not with the SCC itself but with an object called
`bpf_scc_callchain`. A callchain is a tuple `(callsite*, scc_id)`,
where `callsite` is the index of a call instruction for each frame
except the last.
See the comments added in `is_state_visited()` and
`compute_scc_callchain()` for more details.
Fixes: 2a0992829e ("bpf: correct loop detection for iterators convergence")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250611200836.4135542-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The next patch will add a relatively heavy-weight operation to
clean_live_states(). This operation can be skipped if REG_LIVE_DONE
is set. Move the check from clean_verifier_state() to
clean_live_states() as a small refactoring commit.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250611200836.4135542-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add an out parameter to `propagate_liveness()` to record whether any
new liveness bits were set during its execution.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250611200836.4135542-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add an out parameter to `propagate_precision()` to record whether any
new precision bits were set during its execution.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250611200836.4135542-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow `mark_chain_precision()` to run from an arbitrary starting state
by replacing direct references to `env->cur_state` with a parameter.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250611200836.4135542-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add frame_insn_idx(), a function that returns the IP for a given frame
in the call stack of a state. It will be used by the next patch.
The `state->insn_idx = env->insn_idx;` assignment in do_check()
allows frame_insn_idx() to be used with env->cur_state.
At the moment bpf_verifier_state->insn_idx is set when new cached
state is added in is_state_visited() and accessed only in the contexts
when the state is already in the cache. Hence this assignment does not
change verifier behaviour.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250611200836.4135542-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This reverts commit 96a30e469c.
Next patches in the series modify propagate_precision() to allow
arbitrary starting state. Precision propagation requires access to
jump history, and arbitrary states represent history not belonging to
`env->cur_state`.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250611200836.4135542-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Make the logic easier to follow:
- Remove the final return statement, which is never reached, and move the
actual walk-terminating return statement out of the do-while loop.
- Remove the else-clause to reduce indentation. If a non-lonely group is
encountered during the walk, the loop is immediately terminated with a
return statement anyway; no need for an else.
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250606124818.455560-1-ptesarik@suse.com
When porting a CMA-related use case from an x86_64 server to an arm64
server, the "cma=4G@4G" setup failed on arm64. The reason is that arm64
and some other architectures have a specific physical address limit for
the reserved CMA area, such as 4GB due to a device's need for 32-bit DMA.
In practice, many platforms of those architectures don't have this device
DMA limit but still have to obey it, and so are not able to reserve a
huge CMA pool.
This situation can be improved by honoring the user-supplied CMA
physical address rather than the arch limit: when users specify it, they
already know what the default is and that it probably doesn't suit them.
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250612021417.44929-1-feng.tang@linux.alibaba.com
Once the global hash is requested, there is no way to switch back to
the per-task private hash. This is checked at the beginning of the
function.
It is possible that two threads simultaneously request the global hash,
both pass the initial check, and then block on
mm::futex_hash_lock. In this case the first thread performs the switch
to the global hash. The second thread will also attempt to switch to the
global hash and, while doing so, access the nonexistent slot 1 of
struct futex_private_hash.
The same applies if the hash is made immutable: there is no reference
counting and the hash must not be replaced.
Verify under mm_struct::futex_phash that neither the global hash nor an
immutable hash is in use.
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aDwDw9Aygqo6oAx+@ly-workstation/
Fixes: bd54df5ea7 ("futex: Allow to resize the private local hash")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250610104400.1077266-5-bigeasy@linutronix.de/
Both the ARM and IBM CI report an RCU stall, which can be reproduced with
the perf command below.
perf record -a -e cpu-clock -- sleep 2
The issue was introduced by the generic throttle patch set, which
unconditionally invokes event_stop() when throttling is triggered.
The cpu-clock and task-clock are two special SW events, which rely on
the hrtimer. The throttle is invoked in the hrtimer handler. The
event_stop()->hrtimer_cancel() waits for the handler to finish, which is
a deadlock. Instead of invoking stop(), HRTIMER_NORESTART should
be used to stop the timer.
There may be two ways to fix it:
- Introduce a PMU flag to track the case. Avoid the event_stop in
perf_event_throttle() if the flag is detected.
It has been implemented in
https://lore.kernel.org/lkml/20250528175832.2999139-1-kan.liang@linux.intel.com/
The new flag was considered overkill for the issue.
- Add a check in the event_stop. Return immediately if the throttle is
invoked in the hrtimer handler. Rely on the existing HRTIMER_NORESTART
method to stop the timer.
The latter is implemented here.
Move event->hw.interrupts = MAX_INTERRUPTS before the stop(). It makes
the order the same as perf_event_unthrottle(). Apart from this patch, no
one checks hw.interrupts in the stop(), so the order change has no
impact.
When stopping in the throttle, the event should not be updated, i.e.
stop(event, 0). But cpu_clock_event_stop() doesn't handle the flag.
Logically, that is wrong, but it didn't cause any problems with the old
code, because the stop() was not invoked when handling the throttle.
Check the flag before updating the event.
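A simplified userspace model of the chosen approach, for illustration only. The struct, the EF_UPDATE flag and the counters are hypothetical stand-ins. The `throttled` field plays the role of hw.interrupts == MAX_INTERRUPTS, and the enum mirrors HRTIMER_RESTART/HRTIMER_NORESTART:

```c
#include <stdbool.h>

enum timer_restart { TIMER_NORESTART, TIMER_RESTART };

#define EF_UPDATE 0x1	/* hypothetical "update the event" flag */

struct sw_event {
	bool throttled;		/* set before stop(), like MAX_INTERRUPTS */
	int cancel_calls;	/* counts blocking timer cancellations */
	int updates;		/* counts event updates */
};

static void event_stop(struct sw_event *ev, int flags)
{
	/*
	 * When throttled, we are running inside the timer handler.
	 * Cancelling would wait for that handler, i.e. for ourselves.
	 */
	if (!ev->throttled)
		ev->cancel_calls++;	/* stands in for hrtimer_cancel() */
	/* Only update the event when the caller asked for it. */
	if (flags & EF_UPDATE)
		ev->updates++;
}

static enum timer_restart timer_handler(struct sw_event *ev, bool over_limit)
{
	if (!over_limit)
		return TIMER_RESTART;
	ev->throttled = true;	/* mark first, then stop */
	event_stop(ev, 0);
	return TIMER_NORESTART;	/* the timer stops without a cancel */
}
```

In this model the handler never reaches the blocking cancel, and the timer is stopped purely by the return value.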
Fixes: 9734e25fbf ("perf: Fix the throttle logic for a group")
Closes: https://lore.kernel.org/lkml/20250527161656.GJ2566836@e132581.arm.com/
Closes: https://lore.kernel.org/lkml/djxlh5fx326gcenwrr52ry3pk4wxmugu4jccdjysza7tlc5fef@ktp4rffawgcw/
Closes: https://lore.kernel.org/lkml/8e8f51d8-af64-4d9e-934b-c0ee9f131293@linux.ibm.com/
Closes: https://lore.kernel.org/lkml/4ce106d0-950c-aadc-0b6a-f0215cd39913@maine.edu/
Reported-by: Leo Yan <leo.yan@arm.com>
Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20250606192546.915765-1-kan.liang@linux.intel.com
Due to the weird Makefile setup of sched the various files do not
compile as stand alone units. The new generation of editors are trying
to do just this -- mostly to offer fancy things like completions but
also better syntax highlighting and code navigation.
Specifically, I've been playing around with neovim and clangd.
Setting up clangd on the kernel source is a giant pain in the arse
(this really should be improved), but once you do manage, you run into
dumb stuff like the above.
Fix up the scheduler files to at least pretend to work.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20250523164348.GN39944@noisy.programming.kicks-ass.net
The variable "head" is allocated and initialized as a list before
allocating the first "item" for the list. If the allocation of "item"
fails, it frees "head" and then jumps to the label "free_now" which will
process head and free it.
This will cause a UAF of "head", and it doesn't need to free it before
jumping to the "free_now" label as that code will free it.
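The corrected error path can be sketched in plain C as follows. This is a toy model with hypothetical types and names (build_list(), free_list(), head_frees), not the tracing code itself:

```c
#include <stdlib.h>

struct item { struct item *next; };
struct head { struct item *first; };

static int head_frees;	/* instrumentation: how often a head was freed */

/* Frees all items and then the head itself, like the "free_now" code. */
static void free_list(struct head *head)
{
	struct item *it = head->first;

	while (it) {
		struct item *next = it->next;

		free(it);
		it = next;
	}
	free(head);
	head_frees++;
}

/* fail_item != 0 simulates a failed allocation of the first item. */
static int build_list(int fail_item)
{
	struct head *head = calloc(1, sizeof(*head));
	struct item *item;

	if (!head)
		return -1;
	item = fail_item ? NULL : calloc(1, sizeof(*item));
	if (!item)
		goto free_now;	/* do NOT free head here, the label does it */
	head->first = item;
	free_list(head);	/* normal teardown after use */
	return 0;
free_now:
	free_list(head);
	return -1;
}
```

The head is released in exactly one place per path, so the cleanup code never touches freed memory.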
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250610093348.33c5643a@gandalf.local.home
Fixes: a9d0aab5eb ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202506070424.lCiNreTI-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This implements the core of the series and causes the verifier to fall
back to mitigating Spectre v1 using speculation barriers. The approach
was presented at LPC'24 [1] and RAID'24 [2].
If we find any forbidden behavior on a speculative path, we insert a
nospec (e.g., lfence speculation barrier on x86) before the instruction
and stop verifying the path. While verifying a speculative path, we can
furthermore stop verification of that path whenever we encounter a
nospec instruction.
A minimal example program would look as follows:
A = true
B = true
if A goto e
f()
if B goto e
unsafe()
e: exit
There are the following speculative and non-speculative paths
(`cur->speculative` and `speculative` referring to the value of the
push_stack() parameters):
- A = true
- B = true
- if A goto e
- A && !cur->speculative && !speculative
- exit
- !A && !cur->speculative && speculative
- f()
- if B goto e
- B && cur->speculative && !speculative
- exit
- !B && cur->speculative && speculative
- unsafe()
If f() contains any unsafe behavior under Spectre v1 and the unsafe
behavior matches `state->speculative &&
error_recoverable_with_nospec(err)`, do_check() will now add a nospec
before f() instead of rejecting the program:
A = true
B = true
if A goto e
nospec
f()
if B goto e
unsafe()
e: exit
Alternatively, the algorithm also takes advantage of nospec instructions
inserted for other reasons (e.g., Spectre v4). Taking the program above
as an example, speculative path exploration can stop before f() if a
nospec was inserted there because of Spectre v4 sanitization.
In this example, all instructions after the nospec are dead code (and
with the nospec they are also dead code speculatively).
For this, it relies on the fact that speculation barriers generally
prevent all later instructions from executing if the speculation was not
correct:
* On Intel x86_64, lfence acts as full speculation barrier, not only as
a load fence [3]:
An LFENCE instruction or a serializing instruction will ensure that
no later instructions execute, even speculatively, until all prior
instructions complete locally. [...] Inserting an LFENCE instruction
after a bounds check prevents later operations from executing before
the bound check completes.
This was experimentally confirmed in [4].
* On AMD x86_64, lfence is dispatch-serializing [5] (requires MSR
C001_1029[1] to be set if the MSR is supported, this happens in
init_amd()). AMD further specifies "A dispatch serializing instruction
forces the processor to retire the serializing instruction and all
previous instructions before the next instruction is executed" [8]. As
dispatch is not specific to memory loads or branches, lfence therefore
also affects all instructions there. Also, if retiring a branch means
its PC change becomes architectural (should be), this means any
"wrong" speculation is aborted as required for this series.
* ARM's SB speculation barrier instruction also affects "any instruction
that appears later in the program order than the barrier" [6].
* PowerPC's barrier also affects all subsequent instructions [7]:
[...] executing an ori R31,R31,0 instruction ensures that all
instructions preceding the ori R31,R31,0 instruction have completed
before the ori R31,R31,0 instruction completes, and that no
subsequent instructions are initiated, even out-of-order, until
after the ori R31,R31,0 instruction completes. The ori R31,R31,0
instruction may complete before storage accesses associated with
instructions preceding the ori R31,R31,0 instruction have been
performed
Regarding the example, this implies that `if B goto e` will not execute
before `if A goto e` completes. Once `if A goto e` completes, the CPU
should find that the speculation was wrong and continue with `exit`.
If there is any other path that leads to `if B goto e` (and therefore
`unsafe()`) without going through `if A goto e`, then a nospec will
still be needed there. However, this patch assumes this other path will
be explored separately and therefore be discovered by the verifier even
if the exploration discussed here stops at the nospec.
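The exploration of the example program can be modelled with a small toy walker. This is only a sketch under simplifying assumptions: every jump condition is known to be true, both f() and unsafe() are treated as "unsafe under speculation", and all names (explore(), nospec_before[], the insn kinds) are hypothetical rather than taken from verifier.c:

```c
#include <stdbool.h>

enum insn_kind { INSN_ASSIGN, INSN_JMP_TRUE, INSN_UNSAFE_CALL, INSN_EXIT };

struct insn {
	enum insn_kind kind;
	int target;	/* jump target, only used by INSN_JMP_TRUE */
};

#define PROG_LEN 7

static const struct insn prog[PROG_LEN] = {
	{ INSN_ASSIGN, 0 },		/* 0: A = true */
	{ INSN_ASSIGN, 0 },		/* 1: B = true */
	{ INSN_JMP_TRUE, 6 },		/* 2: if A goto e */
	{ INSN_UNSAFE_CALL, 0 },	/* 3: f() */
	{ INSN_JMP_TRUE, 6 },		/* 4: if B goto e */
	{ INSN_UNSAFE_CALL, 0 },	/* 5: unsafe() */
	{ INSN_EXIT, 0 },		/* 6: e: exit */
};

static bool nospec_before[PROG_LEN];

/* Returns 0 if all paths starting at pc are acceptable, -1 to reject. */
static int explore(int pc, bool speculative)
{
	while (pc >= 0 && pc < PROG_LEN) {
		const struct insn *insn = &prog[pc];

		/* A barrier ends every speculative path that reaches it. */
		if (speculative && nospec_before[pc])
			return 0;

		switch (insn->kind) {
		case INSN_ASSIGN:
			pc++;
			break;
		case INSN_JMP_TRUE:
			/* The mispredicted fall-through exists only speculatively. */
			if (explore(pc + 1, true) < 0)
				return -1;
			pc = insn->target;	/* architectural path */
			break;
		case INSN_UNSAFE_CALL:
			if (!speculative)
				return -1;	/* reachable architecturally: reject */
			nospec_before[pc] = true;
			return 0;	/* stop verifying this path */
		case INSN_EXIT:
			return 0;
		}
	}
	return -1;	/* ran off the end of the program */
}
```

Calling explore(0, false) accepts the program, marks a nospec before insn 3 (f()) and never visits insn 5 (unsafe()), matching the rewritten listing above.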
This patch furthermore has the unfortunate consequence that Spectre v1
mitigations now only support architectures which implement BPF_NOSPEC.
Before this commit, Spectre v1 mitigations prevented exploits by
rejecting the programs on all architectures. Because some JITs do not
implement BPF_NOSPEC, this patch therefore may regress unpriv BPF's
security to a limited extent:
* The regression is limited to systems that are vulnerable to Spectre v1, have
unprivileged BPF enabled, and do NOT emit insns for BPF_NOSPEC. The
latter is not the case for x86 64- and 32-bit, arm64, and powerpc
64-bit and they are therefore not affected by the regression.
According to commit a6f6a95f25 ("LoongArch, bpf: Fix jit to skip
speculation barrier opcode"), LoongArch is not vulnerable to Spectre
v1 and therefore also not affected by the regression.
* To the best of my knowledge this regression may therefore only affect
MIPS. This is deemed acceptable because unpriv BPF is still disabled
there by default. As stated in a previous commit, BPF_NOSPEC could be
implemented for MIPS based on GCC's speculation_barrier
implementation.
* It is unclear which other architectures (besides x86 64- and 32-bit,
ARM64, PowerPC 64-bit, LoongArch, and MIPS) supported by the kernel
are vulnerable to Spectre v1. Also, it is not clear if barriers are
available on these architectures. Implementing BPF_NOSPEC on these
architectures therefore is non-trivial. Searching GCC and the kernel
for speculation barrier implementations for these architectures
yielded no result.
* If any of those regressed systems is also vulnerable to Spectre v4,
the system was already vulnerable to Spectre v4 attacks based on
unpriv BPF before this patch and the impact is therefore further
limited.
As an alternative to regressing security, one could still reject
programs if the architecture does not emit BPF_NOSPEC (e.g., by removing
the empty BPF_NOSPEC-case from all JITs except for LoongArch where it
appears justified). However, this will cause rejections on these archs
that are likely unfounded in the vast majority of cases.
In the tests, some are now successful where we previously had a
false-positive (i.e., rejection). Change them to reflect where the
nospec should be inserted (using __xlated_unpriv) and modify the error
message if the nospec is able to mitigate a problem that previously
shadowed another problem (in that case __xlated_unpriv does not work,
therefore just add a comment).
Define SPEC_V1 once here to avoid duplicating this ifdef whenever we
check for nospec insns using __xlated_unpriv. This also improves
readability. PowerPC can probably also be added here. However,
omit it for now because the BPF CI currently does not include a test.
Limit it to EPERM, EACCES, and EINVAL (and not everything except for
EFAULT and ENOMEM) as it already has the desired effect for most
real-world programs. I briefly went through all the occurrences of EPERM,
EINVAL, and EACCES in verifier.c to validate that catching them like
this makes sense.
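A minimal sketch of this filter, assuming the check is a plain comparison against the three error codes named above. The wrapper handle_insn_error() and its out-parameter are hypothetical and only illustrate how the `state->speculative && error_recoverable_with_nospec(err)` condition could be applied:

```c
#include <errno.h>
#include <stdbool.h>

/* Only these errors are treated as fixable by a speculation barrier. */
static bool error_recoverable_with_nospec(int err)
{
	return err == -EPERM || err == -EACCES || err == -EINVAL;
}

/*
 * Hypothetical wrapper: returns 0 and requests a nospec if the error
 * occurred on a speculative path and is recoverable, otherwise passes
 * the error through unchanged.
 */
static int handle_insn_error(bool speculative, int err, bool *insert_nospec)
{
	*insert_nospec = false;
	if (err < 0 && speculative && error_recoverable_with_nospec(err)) {
		*insert_nospec = true;
		return 0;
	}
	return err;
}
```

Errors such as -EFAULT and -ENOMEM, and any error on a non-speculative path, still reject the program in this model.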
Thanks to Dustin for their help in checking the vendor documentation.
[1] https://lpc.events/event/18/contributions/1954/ ("Mitigating
Spectre-PHT using Speculation Barriers in Linux eBPF")
[2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and
Precise Spectre Defenses for Untrusted Linux Kernel Extensions")
[3] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/runtime-speculative-side-channel-mitigations.html
("Managed Runtime Speculative Execution Side Channel Mitigations")
[4] https://dl.acm.org/doi/pdf/10.1145/3359789.3359837 ("Speculator: a
tool to analyze speculative execution attacks and mitigations" -
Section 4.6 "Stopping Speculative Execution")
[5] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf
("White Paper - SOFTWARE TECHNIQUES FOR MANAGING SPECULATION ON AMD
PROCESSORS - REVISION 5.09.23")
[6] https://developer.arm.com/documentation/ddi0597/2020-12/Base-Instructions/SB--Speculation-Barrier-
("SB - Speculation Barrier - Arm Armv8-A A32/T32 Instruction Set
Architecture (2020-12)")
[7] https://wiki.raptorcs.com/w/images/5/5f/OPF_PowerISA_v3.1C.pdf
("Power ISA™ - Version 3.1C - May 26, 2024 - Section 9.2.1 of Book
III")
[8] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf
("AMD64 Architecture Programmer’s Manual Volumes 1–5 - Revision 4.08
- April 2024 - 7.6.4 Serializing Instructions")
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Henriette Herzog <henriette.herzog@rub.de>
Cc: Dustin Nguyen <nguyen@cs.fau.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603212428.338473-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This is made to clarify that this flag will cause a nospec to be added
after this insn and can therefore be relied upon to reduce speculative
path analysis.
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Henriette Herzog <henriette.herzog@rub.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603212024.338154-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This changes the semantics of BPF_NOSPEC (previously a v4-only barrier)
to always emit a speculation barrier that works against both Spectre v1
AND v4. If mitigation is not needed on an architecture, the backend
should set bpf_jit_bypass_spec_v4/v1().
As of now, this commit only has the user-visible implication that unpriv
BPF's performance on PowerPC is reduced. This is the case because we
have to emit additional v1 barrier instructions for BPF_NOSPEC now.
This commit is required for a future commit to allow us to rely on
BPF_NOSPEC for Spectre v1 mitigation. As of this commit, the feature
that nospec acts as a v1 barrier is unused.
Commit f5e81d1117 ("bpf: Introduce BPF nospec instruction for
mitigating Spectre v4") noted that mitigation instructions for v1 and v4
might be different on some archs. While this would potentially offer
improved performance on PowerPC, it was dismissed after the following
considerations:
* Only having one barrier simplifies the verifier and allows us to
easily rely on v4-induced barriers for reducing the complexity of
v1-induced speculative path verification.
* For the architectures that implemented BPF_NOSPEC, only PowerPC has
distinct instructions for v1 and v4. Even there, some insns may be
shared between the barriers for v1 and v4 (e.g., 'ori 31,31,0' and
'sync'). If this is still found to impact performance in an
unacceptable way, BPF_NOSPEC can be split into BPF_NOSPEC_V1 and
BPF_NOSPEC_V4 later. As an optimization, we can already skip v1/v4
insns from being emitted for PowerPC with this setup if
bypass_spec_v1/v4 is set.
Vulnerability-status for BPF_NOSPEC-based Spectre mitigations (v4 as of
this commit, v1 in the future) is therefore:
* x86 (32-bit and 64-bit), ARM64, and PowerPC (64-bit): Mitigated - This
patch implements BPF_NOSPEC for these architectures. The previous
v4-only version was supported since commit f5e81d1117 ("bpf:
Introduce BPF nospec instruction for mitigating Spectre v4") and
commit b7540d6250 ("powerpc/bpf: Emit stf barrier instruction
sequences for BPF_NOSPEC").
* LoongArch: Not Vulnerable - Commit a6f6a95f25 ("LoongArch, bpf: Fix
jit to skip speculation barrier opcode") is the only other past commit
related to BPF_NOSPEC and indicates that the insn is not required
there.
* MIPS: Vulnerable (if unprivileged BPF is enabled) -
Commit a6f6a95f2580 ("LoongArch, bpf: Fix jit to skip speculation
barrier opcode") indicates that it is not vulnerable, but this
contradicts the kernel and Debian documentation. Therefore, I assume
that there exist vulnerable MIPS CPUs (but maybe not from Loongson?).
In the future, BPF_NOSPEC could be implemented for MIPS based on the
GCC speculation_barrier [1]. For now, we rely on unprivileged BPF
being disabled by default.
* Other: Unknown - To the best of my knowledge there is no definitive
information available that indicates that any other arch is
vulnerable. They are therefore left untouched (BPF_NOSPEC is not
implemented, but bypass_spec_v1/v4 is also not set).
I did the following testing to ensure the insn encoding is correct:
* ARM64:
* 'dsb nsh; isb' was successfully tested with the BPF CI in [2]
* 'sb' locally using QEMU v7.2.15 -cpu max (emitted sb insn is
executed for example with './test_progs -t verifier_array_access')
* PowerPC: The following configs were tested locally with ppc64le QEMU
v8.2 '-machine pseries -cpu POWER9':
* STF_BARRIER_EIEIO + CONFIG_PPC_BOOK3S_64
* STF_BARRIER_SYNC_ORI (forced on) + CONFIG_PPC_BOOK3S_64
* STF_BARRIER_FALLBACK (forced on) + CONFIG_PPC_BOOK3S_64
* CONFIG_PPC_E500 (forced on) + STF_BARRIER_EIEIO
* CONFIG_PPC_E500 (forced on) + STF_BARRIER_SYNC_ORI (forced on)
* CONFIG_PPC_E500 (forced on) + STF_BARRIER_FALLBACK (forced on)
* CONFIG_PPC_E500 (forced on) + STF_BARRIER_NONE (forced on)
Most of those combinations should not occur in practice, but I was not
able to get a PPC e6500 rootfs (for testing PPC_E500 without forcing
it on). In any case, this should ensure that there are no unexpected
conflicts between the insns when combined like this. Individual v1/v4
barriers were already emitted elsewhere.
Hari's ack is for the PowerPC changes only.
[1] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=29b74545531f6afbee9fc38c267524326dbfbedf
("MIPS: Add speculation_barrier support")
[2] https://github.com/kernel-patches/bpf/pull/8576
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Cc: Henriette Herzog <henriette.herzog@rub.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603211703.337860-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
JITs can set bpf_jit_bypass_spec_v1/v4() if they want the verifier to
skip analysis/patching for the respective vulnerability. For v4, this
will reduce the number of barriers the verifier inserts. For v1, it
allows more programs to be accepted.
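The idea can be sketched with per-backend hooks that the verifier consults at setup time. This is a hypothetical model: struct jit_backend, verifier_setup() and the `privileged` parameter are invented for illustration, and the kernel's actual wiring may differ:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-architecture hooks; NULL means "do not bypass". */
struct jit_backend {
	bool (*bypass_spec_v1)(void);
	bool (*bypass_spec_v4)(void);
};

struct verifier_cfg {
	bool bypass_spec_v1;	/* skip Spectre v1 analysis */
	bool bypass_spec_v4;	/* skip Spectre v4 patching */
};

/* Example hook for a CPU that is assumed not to be affected by v4. */
static bool arch_bypass_spec_v4(void)
{
	return true;
}

static struct verifier_cfg verifier_setup(const struct jit_backend *jit,
					  bool privileged)
{
	struct verifier_cfg cfg;

	/* Privileged programs skip the mitigations regardless of the CPU. */
	cfg.bypass_spec_v1 = privileged ||
			     (jit->bypass_spec_v1 && jit->bypass_spec_v1());
	cfg.bypass_spec_v4 = privileged ||
			     (jit->bypass_spec_v4 && jit->bypass_spec_v4());
	return cfg;
}
```

A backend that only provides the v4 hook keeps the v1 analysis enabled for unprivileged programs while no longer receiving v4 barriers.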
The primary motivation for this is to not regress unpriv BPF's
performance on ARM64 in a future commit where BPF_NOSPEC is also used
against Spectre v1.
This has the user-visible change that v1-induced rejections on
non-vulnerable PowerPC CPUs are avoided.
For now, this does not change the semantics of BPF_NOSPEC. It is still a
v4-only barrier and must not be implemented if bypass_spec_v4 is always
true for the arch. Changing it to a v1 AND v4-barrier is done in a
future commit.
As an alternative to bypass_spec_v1/v4, one could introduce NOSPEC_V1
AND NOSPEC_V4 instructions and allow backends to skip their lowering as
suggested by commit f5e81d1117 ("bpf: Introduce BPF nospec instruction
for mitigating Spectre v4"). Adding bpf_jit_bypass_spec_v1/v4() was
found to be preferable for the following reasons:
* bypass_spec_v1/v4 benefits non-vulnerable CPUs: Always performing the
same analysis (not taking into account whether the current CPU is
vulnerable), needlessly restricts users of CPUs that are not
vulnerable. The only use case for this would be portability-testing,
but this can later be added easily when needed by allowing users to
force bypass_spec_v1/v4 to false.
* Portability is still acceptable: Directly disabling the analysis
instead of skipping the lowering of BPF_NOSPEC(_V1/V4) might allow
programs on non-vulnerable CPUs to be accepted while the program will
be rejected on vulnerable CPUs. With the fallback to speculation
barriers for Spectre v1 implemented in a future commit, this will only
affect programs that do variable stack-accesses or are very complex.
For PowerPC, the SEC_FTR checking in bpf_jit_bypass_spec_v4() is based
on the check that was previously located in the BPF_NOSPEC case.
For LoongArch, it would likely be safe to set both
bpf_jit_bypass_spec_v1() and _v4() according to
commit a6f6a95f2580 ("LoongArch, bpf: Fix jit to skip speculation
barrier opcode"). This is omitted here as I am unable to do any testing
for LoongArch.
Hari's ack concerns the PowerPC part only.
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Cc: Henriette Herzog <henriette.herzog@rub.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603211318.337474-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This prevents us from trying to recover from these on speculative paths
in the future.
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Henriette Herzog <henriette.herzog@rub.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603205800.334980-4-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Mark these cases as non-recoverable to later prevent them from being
caught when they occur during speculative path verification.
Eduard writes [1]:
The only place I'm aware of that might act upon specific error code
from verifier syscall is libbpf. Looking through libbpf code, it seems
that this change does not interfere with libbpf.
[1] https://lore.kernel.org/all/785b4531ce3b44a84059a4feb4ba458c68fce719.camel@gmail.com/
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Henriette Herzog <henriette.herzog@rub.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603205800.334980-3-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This is required to catch the errors later and fall back to a nospec if
on a speculative path.
Eliminate the regs variable as it is only used once and insn_idx is not
modified in-between the definition and usage.
Do not pass insn but compute it in the function itself. As Eduard points
out [1], insn is assumed to correspond to env->insn_idx in many places
(e.g., __check_reg_arg()).
Move code into do_check_insn(), replace
* "continue" with "return 0" after modifying insn_idx
* "goto process_bpf_exit" with "return PROCESS_BPF_EXIT"
* "goto process_bpf_exit_full" with "return process_bpf_exit_full()"
* "do_print_state = " with "*do_print_state = "
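The shape of this refactoring can be sketched with a toy interpreter loop. The opcodes and the *_sketch function names are hypothetical, and the value 1 for PROCESS_BPF_EXIT is an assumption; only the replacement pattern (return codes instead of continue/goto) is the point:

```c
#include <errno.h>
#include <stdbool.h>

enum { PROCESS_BPF_EXIT = 1 };	/* value assumed for this sketch */

enum op { OP_NOP, OP_SKIP_NEXT, OP_EXIT, OP_INVALID };	/* hypothetical */

/* Former loop body: every exit is now an ordinary return. */
static int do_check_insn_sketch(const enum op *ops, int *insn_idx,
				bool *do_print_state)
{
	switch (ops[*insn_idx]) {
	case OP_NOP:
		(*insn_idx)++;
		return 0;
	case OP_SKIP_NEXT:
		*insn_idx += 2;
		*do_print_state = true;	/* was "do_print_state = " */
		return 0;		/* was "continue" */
	case OP_EXIT:
		return PROCESS_BPF_EXIT;	/* was "goto process_bpf_exit" */
	default:
		return -EINVAL;
	}
}

static int do_check_sketch(const enum op *ops, int len)
{
	bool do_print_state = false;
	int insn_idx = 0;

	while (insn_idx < len) {
		int err = do_check_insn_sketch(ops, &insn_idx, &do_print_state);

		if (err < 0)
			return err;	/* one place to catch all errors */
		if (err == PROCESS_BPF_EXIT)
			return 0;
	}
	return -EINVAL;	/* ran past the end without an exit */
}
```

Because all failures now surface as a return value in the caller, the loop has a single spot where an error could later be intercepted.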
[1] https://lore.kernel.org/all/293dbe3950a782b8eb3b87b71d7a967e120191fd.camel@gmail.com/
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Henriette Herzog <henriette.herzog@rub.de>
Cc: Maximilian Ott <ott@cs.fau.de>
Cc: Milan Stephan <milan.stephan@fau.de>
Link: https://lore.kernel.org/r/20250603205800.334980-2-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When reg->type is CONST_PTR_TO_MAP, it cannot be null. However, the
verifier explores the branches under rX == 0 in check_cond_jmp_op()
even if reg->type is CONST_PTR_TO_MAP, because it was not checked for
in reg_not_null().
Fix this by adding CONST_PTR_TO_MAP to the set of types that are
considered non-nullable in reg_not_null().
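A reduced model of the effect, with a hypothetical and deliberately tiny set of register types (the real reg_not_null() covers more types and conditions):

```c
#include <stdbool.h>

/* Reduced, hypothetical set of register types for illustration. */
enum reg_type_sketch {
	SKETCH_SCALAR,
	SKETCH_MAP_VALUE_OR_NULL,
	SKETCH_MAP_VALUE,
	SKETCH_CONST_PTR_TO_MAP,
};

static bool reg_not_null_sketch(enum reg_type_sketch type)
{
	switch (type) {
	case SKETCH_MAP_VALUE:
	case SKETCH_CONST_PTR_TO_MAP:	/* the case added by this fix */
		return true;
	default:
		return false;
	}
}

/*
 * Outcome of "if rX == 0": 0 means the branch is known not to be taken,
 * -1 means unknown, so both branches have to be explored.
 */
static int is_null_cmp_taken(enum reg_type_sketch type)
{
	return reg_not_null_sketch(type) ? 0 : -1;
}
```

With the map pointer classified as non-nullable, the comparison is decided early and the null branch is not explored.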
An old "unpriv: cmp map pointer with zero" selftest fails with this
change, because the early out now correctly triggers in
check_cond_jmp_op(), making the verification pass.
In practice, the verifier may allow pointer-to-null comparison in unpriv,
since in many cases the relevant branch and comparison op are removed
as dead code. So change the expected test result to __success_unpriv.
Signed-off-by: Ihor Solodrai <isolodrai@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250609183024.359974-2-isolodrai@meta.com
Current cgroup prog ordering is determined by appending at attachment
time. This is not ideal. In some cases, users want specific ordering at a
particular cgroup level. To address this, the existing mprog API seems an
ideal solution with its support for the BPF_F_BEFORE and BPF_F_AFTER flags.
But there are a few obstacles to directly using the kernel mprog interface.
Currently cgroup bpf progs already support prog attach/detach/replace
and link-based attach/detach/replace. For example, in struct
bpf_prog_array_item, the cgroup_storage field needs to be together
with bpf prog. But the mprog API struct bpf_mprog_fp only has bpf_prog
as the member, which makes it difficult to use kernel mprog interface.
In another case, the current cgroup prog detach tries to use the
same flag as in attach. This is different from mprog kernel interface
which uses flags passed from user space.
So to avoid modifying existing behavior, I made the following changes to
support mprog API for cgroup progs:
- The support is for prog list at cgroup level. Cross-level prog list
(a.k.a. effective prog list) is not supported.
- Previously, BPF_F_PREORDER was supported only for prog attach; now
BPF_F_PREORDER is also supported by link-based attach.
- For attach, BPF_F_BEFORE/BPF_F_AFTER/BPF_F_ID/BPF_F_LINK are supported,
similar to kernel mprog but with a different implementation.
- For detach and replace, use the existing implementation.
- For attach, detach and replace, the revision for a particular prog
list, associated with a particular attach type, will be updated
by increasing count by 1.
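To illustrate what relative ordering within one cgroup-level prog list means, here is a small array-based sketch. All names (struct prog_list, prog_list_attach(), the ORDER_* values) are hypothetical. Placing a prog at the front for "before" without an anchor is an assumption modelled on mprog, not something this commit states:

```c
#include <string.h>

enum order { ORDER_DEFAULT, ORDER_BEFORE, ORDER_AFTER };

#define MAX_PROGS 8

struct prog_list {
	int ids[MAX_PROGS];
	int cnt;
};

static int find_idx(const struct prog_list *pl, int id)
{
	for (int i = 0; i < pl->cnt; i++)
		if (pl->ids[i] == id)
			return i;
	return -1;
}

/* relative_id == 0 means "no anchor prog given". Returns 0 or -1. */
static int prog_list_attach(struct prog_list *pl, int id, enum order order,
			    int relative_id)
{
	int pos = pl->cnt;	/* default and "after": append */

	if (pl->cnt == MAX_PROGS)
		return -1;
	if (order == ORDER_BEFORE)
		pos = 0;	/* assumed: front of the list */
	if (relative_id) {
		int idx = find_idx(pl, relative_id);

		if (idx < 0 || order == ORDER_DEFAULT)
			return -1;
		pos = (order == ORDER_BEFORE) ? idx : idx + 1;
	}
	/* Shift the tail by one slot and place the new prog. */
	memmove(&pl->ids[pos + 1], &pl->ids[pos],
		(size_t)(pl->cnt - pos) * sizeof(pl->ids[0]));
	pl->ids[pos] = id;
	pl->cnt++;
	return 0;
}
```

Attaching 10 and 20 by default, then 30 before 20 and 40 after 10, leaves the list in the order 10, 40, 30, 20.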
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250606163141.2428937-1-yonghong.song@linux.dev
One of the key items in the mprog API is the revision for a prog list. The
revision number is increased whenever the prog list changes, e.g., on
attach, detach or replace.
Add 'revisions' field to struct cgroup_bpf, representing revisions for
all cgroup related attachment types. The initial revision value is
set to 1, the same as kernel mprog implementations.
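A minimal sketch of such a per-attach-type revision array. The struct name, the number of attach types and the helper names are hypothetical. The "expected revision of 0 means don't care" convention in revision_is_current() is an assumption for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_ATYPES 4	/* hypothetical number of cgroup attach types */

struct cgroup_bpf_sketch {
	uint64_t revisions[NR_ATYPES];	/* one revision per attach type */
};

static void cgroup_bpf_sketch_init(struct cgroup_bpf_sketch *cb)
{
	for (int i = 0; i < NR_ATYPES; i++)
		cb->revisions[i] = 1;	/* initial revision is 1 */
}

/* Called after every attach, detach or replace on the given type. */
static void prog_list_changed(struct cgroup_bpf_sketch *cb, int atype)
{
	cb->revisions[atype]++;
}

/* Lets a caller detect that the list changed since it last looked. */
static bool revision_is_current(const struct cgroup_bpf_sketch *cb,
				int atype, uint64_t expected)
{
	return expected == 0 || cb->revisions[atype] == expected;
}
```

Each attach type advances independently, so a change to one prog list does not invalidate a revision observed for another.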
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250606163136.2428732-1-yonghong.song@linux.dev
The dedicated helper is more verbose and effective compared to
cpumask_first() followed by cpumask_next().
Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_locked_rq() is used both from ext.c and ext_idle.c, move it to ext.h
as a static inline function.
No functional changes.
v2: Rename locked_rq to scx_locked_rq_state, expose it and make
scx_locked_rq() inline, as suggested by Tejun.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_rq_bypassing() is used both from ext.c and ext_idle.c, move it to
ext.h as a static inline function.
No functional changes.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Functions that are only used within ext_idle.c can be marked as static
to limit their scope.
No functional changes.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There's no need to make scx_bpf_cpu_node() dependent on CONFIG_NUMA,
since cpu_to_node() can be used also in systems with CONFIG_NUMA
disabled.
This also allows always validating the @cpu argument regardless of the
CONFIG_NUMA settings.
Fixes: 01059219b0 ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The kthreads for nbcon consoles are created by nbcon_alloc() at
the beginning of the console registration. But it currently works
only for the 2nd or later nbcon console because the code checks
@printk_kthreads_running.
The kthread for the 1st registered nbcon console is created at the very
end of register_console() by printk_kthreads_check_locked(). As a result,
the entire log is replayed synchronously when the "enabled" message
gets printed. It might block the boot for a long time with a slow serial
console.
Prevent the synchronous flush by creating the kthread even for the 1st
nbcon console when it is safe (kthreads ready and no boot consoles).
Also inform printk() to use the kthread by setting
@printk_kthreads_running. Note that the kthreads already must be
running when it is safe and this is not the 1st nbcon console.
Symmetrically, clear @printk_kthreads_running when the last nbcon
console was unregistered by nbcon_free(). This requires updating
@have_nbcon_console before nbcon_free() gets called.
Note that there is _no_ problem when the 1st nbcon console replaces boot
consoles. In this case, the kthread will be started at the end
of registration after the boot consoles are removed. But the console
does not replay the entire log buffer in this case. Note that
the flag CON_PRINTBUFFER is always cleared when the boot consoles are
removed and vice versa.
Closes: https://lore.kernel.org/r/20250514173514.2117832-1-mcobb@thegoodpenguin.co.uk
Tested-by: Michael Cobb <mcobb@thegoodpenguin.co.uk>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20250604142045.253301-1-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
The renaming to the timer_*() namespace was delayed due to massive conflicts
against Linux-next. Now that everything is upstream, finish the conversion.
Merge tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanup from Thomas Gleixner:
"The delayed from_timer() API cleanup:
The renaming to the timer_*() namespace was delayed due to massive
conflicts against Linux-next. Now that everything is upstream, finish
the conversion"
* tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide, timers: Rename from_timer() to timer_container_of()
Merge tag 'trace-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull more tracing fixes from Steven Rostedt:
- Fix regression of waiting a long time on updating trace event filters
When the faultable trace points were added, they needed task trace RCU
synchronization.
This was added to the tracepoint_synchronize_unregister() function.
The filter logic always called this function whenever it updated the
trace event filters before freeing the old filters. This increased
the time of "trace-cmd record" from taking 13 seconds to running over
2 minutes to complete.
Move the freeing of the filters to call_rcu*() logic, which brings
the time back down to 13 seconds.
- Fix ring_buffer_subbuf_order_set() error path lock protection
The error path of ring_buffer_subbuf_order_set() released the
mutex too early and allowed subsequent accesses to setting the
subbuffer size to corrupt the data and cause a bug.
By holding the mutex until the end of the error path, it prevents
the reentrant access to the critical data and also allows the
function to convert the taking of the mutex over to the guard()
logic.
- Remove unused power management clock events
The clock events were added in 2010 for power management. In 2011 arm
used them. In 2013 the code they were used in was removed. These
events have been wasting memory since then.
- Fix sparse warnings
There were a few places in trace_events_filter.c that sparse warned
about, where file->filter was referenced directly even though it is
annotated with an __rcu tag. Use the helper functions and fix them up
to use rcu_dereference() properly.
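The guard() conversion mentioned above for ring_buffer_subbuf_order_set()
relies on scope-based cleanup: the lock is dropped automatically on every
exit from the scope, including early error returns. A minimal userspace
sketch of that idea follows, assuming a GCC/Clang compiler that supports
__attribute__((cleanup)). The toy_* names and the TOY_GUARD() macro are
hypothetical and are not the kernel's implementation.

```c
/* Toy lock: only tracks whether it is held, no real synchronization. */
struct toy_mutex { int held; };

static struct toy_mutex toy_lock;
static int toy_order;
static int toy_seen_held;	/* records the lock state inside the function */

static void toy_mutex_lock(struct toy_mutex *m)   { m->held = 1; }
static void toy_mutex_unlock(struct toy_mutex *m) { m->held = 0; }

/* Runs automatically when the guard variable goes out of scope. */
static void toy_guard_release(struct toy_mutex **m)
{
	toy_mutex_unlock(*m);
}

/* Hypothetical macro: take the lock now, release it on any scope exit. */
#define TOY_GUARD(m) \
	struct toy_mutex *toy_guard_var \
		__attribute__((cleanup(toy_guard_release), unused)) = \
		(toy_mutex_lock(m), (m))

static int toy_set_order(int order)
{
	TOY_GUARD(&toy_lock);

	toy_seen_held = toy_lock.held;
	if (order < 0)
		return -1;	/* error path: lock is released only on return */

	toy_order = order;
	return 0;
}
```

With this shape there is no separate unlock call on the error path that
could be placed too early.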
* tag 'trace-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Add rcu annotation around file->filter accesses
tracing: PM: Remove unused clock events
ring-buffer: Fix buffer locking in ring_buffer_subbuf_order_set()
tracing: Fix regression of filter waiting a long time on RCU synchronization
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
- Add support for the EXPORT_SYMBOL_GPL_FOR_MODULES() macro, which exports a
symbol only to specified modules
- Improve ABI handling in gendwarfksyms
- Forcibly link lib-y objects to vmlinux even if CONFIG_MODULES=n
- Add checkers for redundant or missing <linux/export.h> inclusion
- Deprecate the extra-y syntax
- Fix a genksyms bug when including enum constants from *.symref files
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmhEZc4VHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGVAgQAKLRdBGga1kBJJFIkUOHWC5+g/je
U/dO5rGnuOLviWDexC6QT8AQV2N+dQXhB11x+KacSu1bwowsEvwuegtA6VqwbETs
tyWmB0PftEzVyPfc+Rjfy0LDfKkiKkm4RhXiMwcem/rlw45gvJXrVU7jJin9fI3A
So8glpOAX+mEizUHkjZkS51nkYCZFDsn7hVo0X43vqjeFrrFGLEQ5xas4Ci+dkY3
9g8Q5bFL8CC5PHjSO8wFftCcAWwTukAht6CSSb522MKGnCVZ9RxTmRwEPXrBmXtS
5eWa8yg6y0tFVmot8iwZGBYleAWDNsj0a2j2oVjUN+EF91sk3WQApJVNBok/nQFb
4MgO3N3UXZdy4tYkBX8tMgOcGkfjZAFoNxSUm5oVouh9NyT0dpqYHhJHBNVbVJoF
igQWeVOYcioDjeU1iXnP2cw64q44ROfxmOpDxOSRz9PTM6CCya1R0m/zzBLV6Lwk
rzlXk1LLf+jIfgmS5RLlkCgrXS1U0vNGXxQH9Ui9dZSEtzdU7qt5WQ/Rz44bEBhS
OeIlJfMMx6QYJztJc/BaUjkKsutTkII52QctRbRCj/nKswHd8SnHV+xk1c2WPxrg
yKq10rPpdg1BcvmODY6cmcndt7ogDRfkogm2gvGQIBZEglRimpmpg51sZQRD0ueE
0rt12TmktsLbglB4
=Dy49
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Add support for the EXPORT_SYMBOL_GPL_FOR_MODULES() macro, which
exports a symbol only to specified modules
- Improve ABI handling in gendwarfksyms
- Forcibly link lib-y objects to vmlinux even if CONFIG_MODULES=n
- Add checkers for redundant or missing <linux/export.h> inclusion
- Deprecate the extra-y syntax
- Fix a genksyms bug when including enum constants from *.symref files
* tag 'kbuild-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (28 commits)
genksyms: Fix enum consts from a reference affecting new values
arch: use always-$(KBUILD_BUILTIN) for vmlinux.lds
kbuild: set y instead of 1 to KBUILD_{BUILTIN,MODULES}
efi/libstub: use 'targets' instead of extra-y in Makefile
module: make __mod_device_table__* symbols static
scripts/misc-check: check unnecessary #include <linux/export.h> when W=1
scripts/misc-check: check missing #include <linux/export.h> when W=1
scripts/misc-check: add double-quotes to satisfy shellcheck
kbuild: move W=1 check for scripts/misc-check to top-level Makefile
scripts/tags.sh: allow to use alternative ctags implementation
kconfig: introduce menu type enum
docs: symbol-namespaces: fix reST warning with literal block
kbuild: link lib-y objects to vmlinux forcibly even when CONFIG_MODULES=n
tinyconfig: enable CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
docs/core-api/symbol-namespaces: drop table of contents and section numbering
modpost: check forbidden MODULE_IMPORT_NS("module:") at compile time
kbuild: move kbuild syntax processing to scripts/Makefile.build
Makefile: remove dependency on archscripts for header installation
Documentation/kbuild: Add new gendwarfksyms kABI rules
Documentation/kbuild: Drop section numbers
...
Running sparse on trace_events_filter.c triggered several warnings about
file->filter being accessed directly even though it's annotated with __rcu.
Add rcu_dereference() around it and shuffle the logic slightly so that
it's always referenced via accessor functions.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250607102821.6c7effbf@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
or aren't considered necessary for -stable kernels. 11 are for MM.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaENzlAAKCRDdBJ7gKXxA
joNYAP9n38QNDUoRR6ChFikzzY77q4alD2NL0aqXBZdcSRXoUgEAlQ8Ea+t6xnzp
GnH+cnsA6FDp4F6lIoZBdENJyBYrkQE=
=ud9O
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes.
6 are cc:stable and the remainder address post-6.15 issues or aren't
considered necessary for -stable kernels. 11 are for MM"
* tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
MAINTAINERS: add mm swap section
kmsan: test: add module description
MAINTAINERS: add tlb trace events to MMU GATHER AND TLB INVALIDATION
mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
mm/hugetlb: unshare page tables during VMA split, not before
MAINTAINERS: add Alistair as reviewer of mm memory policy
iov_iter: use iov_offset for length calculation in iov_iter_aligned_bvec
mm/mempolicy: fix incorrect freeing of wi_kobj
alloc_tag: handle module codetag load errors as module load failures
mm/madvise: handle madvise_lock() failure during race unwinding
mm: fix vmstat after removing NR_BOUNCE
KVM: s390: rename PROT_NONE to PROT_TYPE_DUMMY
* Support for the FWFT SBI extension, which is part of SBI 3.0 and a
dependency for many new SBI and ISA extensions.
* Support for getrandom() in the VDSO.
* Support for mseal.
* Optimized routines for raid6 syndrome and recovery calculations.
* kexec_file() supports loading Image-formatted kernel binaries.
* Improvements to the instruction patching framework to allow for atomic
instruction patching, along with rules as to how systems need to
behave in order to function correctly.
* Support for a handful of new ISA extensions: Svinval, Zicbop, Zabha,
some SiFive vendor extensions.
* Various fixes and cleanups, including: misaligned access handling, perf
symbol mangling, module loading, PUD THPs, and improved uaccess
routines.
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmhDLP8ZHHBhbG1lcmRh
YmJlbHRAZ29vZ2xlLmNvbQAKCRAuExnzX7sYiZhFD/4+Zikkld812VjFb9dTF+Wj
n/x9h86zDwAEFgf2BMIpUQhHru6vtdkO2l/Ky6mQblTPMWLafF4eK85yCsf84sQ0
+RX4sOMLZ0+qvqxKX+aOFe9JXOWB0QIQuPvgBfDDOV4UTm60sglIxwqOpKcsBEHs
2nplXXjiv0ckaMFLos8xlwu1uy4A/jMfT3Y9FDcABxYCqBoKOZ1frcL9ezJZbHbv
BoOKLDH8ZypFxIG/eQ511lIXXtrnLas0l4jHWjrfsWu6pmXTgJasKtbGuH3LoLnM
G/4qvHufR6lpVUOIL5L0V6PpsmYwDi/ciFIFlc8NH2oOZil3qiVaGSEbJIkWGFu9
8lWTXQWnbinZbfg2oYbWp8GlwI70vKomtDyYNyB9q9Cq9jyiTChMklRNODr4764j
ZiEnzc/l4KyvaxUg8RLKCT595lKECiUDnMytbIbunJu05HBqRCoGpBtMVzlQsyUd
ybkRt3BA7eOR8/xFA7ZZQeJofmiu2yxkBs5ggMo8UnSragw27hmv/OA0mWMXEuaD
aaWc4ZKpKqf7qLchLHOvEl5ORUhsisyIJgZwOqdme5rQoWorVtr51faA4AKwFAN4
vcKgc5qJjK8vnpW+rl3LNJF9LtH+h4TgmUI853vUlukPoH2oqRkeKVGSkxG0iAze
eQy2VjP1fJz6ciRtJZn9aw==
=cZGy
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:
- Support for the FWFT SBI extension, which is part of SBI 3.0 and a
dependency for many new SBI and ISA extensions
- Support for getrandom() in the VDSO
- Support for mseal
- Optimized routines for raid6 syndrome and recovery calculations
- kexec_file() supports loading Image-formatted kernel binaries
- Improvements to the instruction patching framework to allow for
atomic instruction patching, along with rules as to how systems need
to behave in order to function correctly
- Support for a handful of new ISA extensions: Svinval, Zicbop, Zabha,
some SiFive vendor extensions
- Various fixes and cleanups, including: misaligned access handling,
perf symbol mangling, module loading, PUD THPs, and improved uaccess
routines
* tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (69 commits)
riscv: uaccess: Only restore the CSR_STATUS SUM bit
RISC-V: vDSO: Wire up getrandom() vDSO implementation
riscv: enable mseal sysmap for RV64
raid6: Add RISC-V SIMD syndrome and recovery calculations
riscv: mm: Add support for Svinval extension
RISC-V: Documentation: Add enough title underlines to CMODX
riscv: Improve Kconfig help for RISCV_ISA_V_PREEMPTIVE
MAINTAINERS: Update Atish's email address
riscv: uaccess: do not do misaligned accesses in get/put_user()
riscv: process: use unsigned int instead of unsigned long for put_user()
riscv: make unsafe user copy routines use existing assembly routines
riscv: hwprobe: export Zabha extension
riscv: Make regs_irqs_disabled() more clear
perf symbols: Ignore mapping symbols on riscv
RISC-V: Kconfig: Fix help text of CMDLINE_EXTEND
riscv: module: Optimize PLT/GOT entry counting
riscv: Add support for PUD THP
riscv: xchg: Prefetch the destination word for sc.w
riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop
riscv: Add support for Zicbop
...
When faultable trace events were added, a trace event could no longer use
normal RCU to synchronize but instead had to use
synchronize_rcu_tasks_trace(). This synchronization takes much longer to
complete.
The filter logic would free the filters by calling
tracepoint_synchronize_unregister() after it unhooked the filter strings
and before freeing them. With this function now calling
synchronize_rcu_tasks_trace() this increased the time to free a filter
tremendously. On a PREEMPT_RT system, it was even more noticeable.
# time trace-cmd record -p function sleep 1
[..]
real 2m29.052s
user 0m0.244s
sys 0m20.136s
As trace-cmd would clear out all the filters before recording, it could
take up to 2 minutes to do a recording of "sleep 1".
To find out where the issue was:
~# trace-cmd sqlhist -e -n sched_stack select start.prev_state as state, end.next_comm as comm, TIMESTAMP_DELTA_USECS as delta, start.STACKTRACE as stack from sched_switch as start join sched_switch as end on start.prev_pid = end.next_pid
Which will produce the following commands (and -e will also execute them):
echo 's:sched_stack s64 state; char comm[16]; u64 delta; unsigned long stack[];' >> /sys/kernel/tracing/dynamic_events
echo 'hist:keys=prev_pid:__arg_18057_2=prev_state,__arg_18057_4=common_timestamp.usecs,__arg_18057_7=common_stacktrace' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
echo 'hist:keys=next_pid:__state_18057_1=$__arg_18057_2,__comm_18057_3=next_comm,__delta_18057_5=common_timestamp.usecs-$__arg_18057_4,__stack_18057_6=$__arg_18057_7:onmatch(sched.sched_switch).trace(sched_stack,$__state_18057_1,$__comm_18057_3,$__delta_18057_5,$__stack_18057_6)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
The above creates a synthetic event that creates a stack trace when a task
schedules out and records it with the time it scheduled back in. Basically
the time a task is off the CPU. It also records the state of the task when
it left the CPU (running, blocked, sleeping, etc). It also saves the comm
of the task as "comm" (needed for the next command).
~# echo 'hist:keys=state,stack.stacktrace:vals=delta:sort=state,delta if comm == "trace-cmd" && state & 3' > /sys/kernel/tracing/events/synthetic/sched_stack/trigger
The above creates a histogram with buckets per state, per stack, and the
value of the total time it was off the CPU for that stack trace. It filters
on tasks with "comm == trace-cmd" and only the sleeping and blocked states
(1 - sleeping, 2 - blocked).
~# trace-cmd record -p function sleep 1
~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -18
{ state: 2, stack.stacktrace __schedule+0x1545/0x3700
schedule+0xe2/0x390
schedule_timeout+0x175/0x200
wait_for_completion_state+0x294/0x440
__wait_rcu_gp+0x247/0x4f0
synchronize_rcu_tasks_generic+0x151/0x230
apply_subsystem_event_filter+0xa2b/0x1300
subsystem_filter_write+0x67/0xc0
vfs_write+0x1e2/0xeb0
ksys_write+0xff/0x1d0
do_syscall_64+0x7b/0x420
entry_SYSCALL_64_after_hwframe+0x76/0x7e
} hitcount: 237 delta: 99756288 <<--------------- Delta is 99 seconds!
Totals:
Hits: 525
Entries: 21
Dropped: 0
This shows that this particular trace waited for 99 seconds on
synchronize_rcu_tasks() in apply_subsystem_event_filter().
In fact, there are a lot of places in the filter code that spend a lot of
time waiting for synchronize_rcu_tasks_trace() in order to free the
filters.
Add helper functions that will use call_rcu*() variants to asynchronously
free the filters. This brings the timings back to normal:
# time trace-cmd record -p function sleep 1
[..]
real 0m14.681s
user 0m0.335s
sys 0m28.616s
And the histogram also shows this:
~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -21
{ state: 2, stack.stacktrace __schedule+0x1545/0x3700
schedule+0xe2/0x390
schedule_timeout+0x175/0x200
wait_for_completion_state+0x294/0x440
__wait_rcu_gp+0x247/0x4f0
synchronize_rcu_normal+0x3db/0x5c0
tracing_reset_online_cpus+0x8f/0x1e0
tracing_open+0x335/0x440
do_dentry_open+0x4c6/0x17a0
vfs_open+0x82/0x360
path_openat+0x1a36/0x2990
do_filp_open+0x1c5/0x420
do_sys_openat2+0xed/0x180
__x64_sys_openat+0x108/0x1d0
do_syscall_64+0x7b/0x420
} hitcount: 2 delta: 77044
Totals:
Hits: 55
Entries: 28
Dropped: 0
Where the total waiting time of synchronize_rcu_tasks_trace() is 77
milliseconds.
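The difference between waiting synchronously and deferring the free can be
sketched in plain userspace C. This is only a single-threaded model of the
pattern, not RCU: the toy_* names are hypothetical, and
toy_grace_period_elapsed() merely stands in for the point at which the
kernel would invoke the queued callbacks.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rcu_head: links deferred callbacks. */
struct toy_head {
	struct toy_head *next;
	void (*func)(struct toy_head *head);
};

/* Hypothetical filter object with an embedded callback head. */
struct toy_filter {
	char *filter_string;
	struct toy_head head;
};

static struct toy_head *toy_pending;	/* callbacks not yet run */
static int toy_freed_count;		/* for demonstration only */

/* Queue a callback and return immediately instead of blocking. */
static void toy_call_deferred(struct toy_head *head,
			      void (*func)(struct toy_head *head))
{
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
}

/* Callback: recover the enclosing filter from its head and free it. */
static void toy_filter_free_cb(struct toy_head *head)
{
	struct toy_filter *f = (struct toy_filter *)
		((char *)head - offsetof(struct toy_filter, head));

	free(f->filter_string);
	free(f);
	toy_freed_count++;
}

/* The updater unhooks the old filter and queues it; it never waits. */
static void toy_replace_filter(struct toy_filter **slot,
			       struct toy_filter *new_filter)
{
	struct toy_filter *old = *slot;

	*slot = new_filter;
	if (old)
		toy_call_deferred(&old->head, toy_filter_free_cb);
}

/* Model of "a grace period has elapsed": run every queued callback. */
static int toy_grace_period_elapsed(void)
{
	int ran = 0;

	while (toy_pending) {
		struct toy_head *head = toy_pending;

		toy_pending = head->next;	/* read before head is freed */
		head->func(head);
		ran++;
	}
	return ran;
}
```

The updater's cost no longer depends on how long the grace period takes,
which is the property the timings above reflect.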
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Andreas Ziegler <ziegler.andreas@siemens.com>
Cc: Felix MOESSBAUER <felix.moessbauer@siemens.com>
Link: https://lore.kernel.org/20250606201936.1e3d09a9@batman.local.home
Reported-by: "Flot, Julien" <julien.flot@siemens.com>
Tested-by: Julien Flot <julien.flot@siemens.com>
Fixes: a363d27cdb ("tracing: Allow system call tracepoints to handle page faults")
Closes: https://lore.kernel.org/all/240017f656631c7dd4017aa93d91f41f653788ea.camel@siemens.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Expose a simple counter to userspace for monitoring tools.
(akpm: 2536c5c7d6 added the documentation but the code changes were lost)
Link: https://lkml.kernel.org/r/20250504180831.4190860-3-max.kellermann@ionos.com
Fixes: 2536c5c7d6 ("kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Cc: Corey Minyard <cminyard@mvista.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Max Kellermann <max.kellermann@ionos.com>
Cc: Song Liu <song@kernel.org>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Failures inside codetag_load_module() are currently ignored. As a result
an error there would not cause a module load failure and freeing of the
associated resources. Correct this behavior by propagating the error code
to the caller and handling possible errors. With this change, a failure to
allocate percpu counters, which happens at this stage, will not be ignored
and will cause a module load failure and freeing of resources. With this
change we also do not need to disable memory allocation profiling when
this error happens; instead we fail to load the module.
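The behavioral change can be illustrated with a small userspace sketch.
The toy_* functions and the simulate_failure flag are hypothetical and do
not reflect the actual module loader code; they only show an error code
being propagated to the caller and resources being unwound.

```c
#include <errno.h>
#include <stdlib.h>

struct toy_module {
	void *counters;	/* stands in for the percpu counters */
	int loaded;
};

/* Hypothetical stand-in for the codetag step: returns 0 or -ENOMEM. */
static int toy_codetag_load(struct toy_module *mod, int simulate_failure)
{
	mod->counters = simulate_failure ? NULL : malloc(64);
	return mod->counters ? 0 : -ENOMEM;
}

static int toy_load_module(struct toy_module *mod, int simulate_failure)
{
	int err;

	err = toy_codetag_load(mod, simulate_failure);
	if (err)
		goto fail;	/* propagate instead of ignoring the result */

	mod->loaded = 1;
	return 0;

fail:
	/* Unwind whatever was set up and report the failure to the caller. */
	free(mod->counters);
	mod->counters = NULL;
	mod->loaded = 0;
	return err;
}
```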
Link: https://lkml.kernel.org/r/20250521160602.1940771-1-surenb@google.com
Fixes: 10075262888b ("alloc_tag: allocate percpu counters for module tags dynamically")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Casey Chen <cachen@purestorage.com>
Closes: https://lore.kernel.org/all/20250520231620.15259-1-cachen@purestorage.com/
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
riscv patches for 6.16-rc1, part 2
* Performance improvements
- Add support for vdso getrandom
- Implement raid6 calculations using vectors
- Introduce svinval tlb invalidation
* Cleanup
- A bunch of deduplication of the macros we use for manipulating instructions
* Misc
- Introduce a kunit test for kprobes
- Add support for mseal as riscv fits the requirements (thanks to Lorenzo for making sure of that :))
[Palmer: There was a rebase between part 1 and part 2, so I've had to do
some more git surgery here... at least two rounds of surgery...]
* alex-pr-2: (866 commits)
RISC-V: vDSO: Wire up getrandom() vDSO implementation
riscv: enable mseal sysmap for RV64
raid6: Add RISC-V SIMD syndrome and recovery calculations
riscv: mm: Add support for Svinval extension
riscv: Add kprobes KUnit test
riscv: kprobes: Remove duplication of RV_EXTRACT_ITYPE_IMM
riscv: kprobes: Remove duplication of RV_EXTRACT_UTYPE_IMM
riscv: kprobes: Remove duplication of RV_EXTRACT_RD_REG
riscv: kprobes: Remove duplication of RVC_EXTRACT_BTYPE_IMM
riscv: kprobes: Remove duplication of RVC_EXTRACT_C2_RS1_REG
riscv: kproves: Remove duplication of RVC_EXTRACT_JTYPE_IMM
riscv: kprobes: Remove duplication of RV_EXTRACT_BTYPE_IMM
riscv: kprobes: Remove duplication of RV_EXTRACT_RS1_REG
riscv: kprobes: Remove duplication of RV_EXTRACT_JTYPE_IMM
riscv: kprobes: Move branch_funct3 to insn.h
riscv: kprobes: Move branch_rs2_idx to insn.h
Linux 6.15-rc6
Input: xpad - fix xpad_device sorting
Input: xpad - add support for several more controllers
Input: xpad - fix Share button on Xbox One controllers
...
This CONFIG option, if supported by the architecture, helps reduce the
size of vmlinux.
For example, the size of vmlinux with ARCH=arm tinyconfig decreases as
follows:
   text    data     bss     dec     hex filename
 631684  104500   18176  754360   b82b8 vmlinux.before
 455316   93404   15472  564192   89be0 vmlinux.after
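To put the table in relative terms, the "dec" column (text + data + bss)
shrinks from 754360 to 564192 bytes, which is roughly a quarter. A small
helper to compute the percentage, with a hypothetical name, could look
like this:

```c
/* Percentage by which `after` is smaller than `before`. */
static double shrink_percent(unsigned long before, unsigned long after)
{
	return 100.0 * (double)(before - after) / (double)before;
}
```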
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
As is, it appears as if pointer arithmetic is allowed for everything
except PTR_TO_{STACK,MAP_VALUE} if one only looks at
sanitize_check_bounds(). However, this is misleading as the function
only works together with retrieve_ptr_limit() and the two must be kept
in sync. This patch documents the interdependency and adds a check to
ensure they stay in sync.
adjust_ptr_min_max_vals(): Because the preceding switch returns -EACCES
for every opcode except for ADD/SUB, the sanitize_needed() following the
sanitize_check_bounds() call is always true if reached. This means,
unless sanitize_check_bounds() detected that the pointer goes OOB
because of the ADD/SUB and returns -EACCES, sanitize_ptr_alu() always
executes after sanitize_check_bounds().
The following shows that this also implies that retrieve_ptr_limit()
runs in all relevant cases.
Note that there are two calls to sanitize_ptr_alu(), these are simply
needed to easily calculate the correct alu_limit as explained in
commit 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic
mask"). The truncation-simulation is already performed on the first
call.
In the second sanitize_ptr_alu(commit_window = true), we always run
retrieve_ptr_limit(), unless:
* can_skip_alu_sanitation() is true, notably `BPF_SRC(insn->code) ==
BPF_K`. BPF_K is fine because it means that there is no scalar
register (which could be subject to speculative scalar confusion due
to Spectre v4) that goes into the ALU operation. The pointer register
can not be subject to v4-based value confusion due to the nospec
added. Thus, in this case it would have been fine to also skip
sanitize_check_bounds().
* If we are on a speculative path (`vstate->speculative`) and in the
second "commit" phase, sanitize_ptr_alu() always just returns 0. This
makes sense because there are no ALU sanitization limits to be learned
from speculative paths. Furthermore, because the sanitization will
ensure that pointer arithmetic stays in (architectural) bounds, the
sanitize_check_bounds() on the speculative path could also be skipped.
The second case needs more attention: Assume we have some ALU operation
that is used with scalars architecturally, but with a
non-PTR_TO_{STACK,MAP_VALUE} pointer (e.g., PTR_TO_PACKET)
speculatively. It might appear as if this would allow unsanitized
pointer ALU operations, but this can not happen because one of the
following two always holds:
* The type mismatch stems from Spectre v4, then it is prevented by a
nospec after the possibly-bypassed store involving the pointer. There
is no speculative path simulated for this case thus it never happens.
* The type mismatch stems from a Spectre v1 gadget like the following:
  r1 = slow(0)
  r4 = fast(0)
  r3 = SCALAR // Spectre v4 scalar confusion
  if (r1) {
    r2 = PTR_TO_PACKET
  } else {
    r2 = 42
  }
  if (r4) {
    r2 += r3
    *r2
  }
If `r2 = PTR_TO_PACKET` is indeed dead code, it will be sanitized to
`goto -1` (as is the case for the r4-if block). If it is not (e.g., if
`r1 = r4 = 1` is possible), it will also be explored on an
architectural path and retrieve_ptr_limit() will reject it.
To summarize, the exception for `vstate->speculative` is safe.
Back to retrieve_ptr_limit(): It only allows the ALU operation if the
involved pointer register (can be either source or destination for ADD)
is PTR_TO_STACK or PTR_TO_MAP_VALUE. Otherwise, it returns -EOPNOTSUPP.
Therefore, sanitize_check_bounds() returning 0 for
non-PTR_TO_{STACK,MAP_VALUE} is fine because retrieve_ptr_limit() also
runs for all relevant cases and prevents unsafe operations.
To summarize, we allow unsanitized pointer arithmetic with 64-bit
ADD/SUB for the following instructions if the requirements from
retrieve_ptr_limit() AND sanitize_check_bounds() hold:
* ptr -=/+= imm32 (i.e. `BPF_SRC(insn->code) == BPF_K`)
* PTR_TO_{STACK,MAP_VALUE} -= scalar
* PTR_TO_{STACK,MAP_VALUE} += scalar
* scalar += PTR_TO_{STACK,MAP_VALUE}
To document the interdependency between sanitize_check_bounds() and
retrieve_ptr_limit(), add a verifier_bug_if() to make sure they stay in
sync.
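As a reading aid, the four-item summary above can be written as a small
predicate. This is only a model of that list, not verifier code: the toy_*
names are hypothetical, and the bounds and limit requirements enforced by
retrieve_ptr_limit() and sanitize_check_bounds() are not modeled.

```c
#include <stdbool.h>

enum toy_ptr_type {
	TOY_PTR_TO_STACK,
	TOY_PTR_TO_MAP_VALUE,
	TOY_PTR_TO_PACKET,	/* example of a type outside the exception */
};

enum toy_op { TOY_ADD, TOY_SUB };

/*
 * Returns true if the (64-bit ADD/SUB) instruction shape appears in the
 * summary list. src_is_imm models `BPF_SRC(insn->code) == BPF_K`;
 * ptr_is_dst says whether the pointer is the destination register.
 */
static bool toy_in_summary_list(enum toy_op op, bool src_is_imm,
				bool ptr_is_dst, enum toy_ptr_type type)
{
	bool stack_or_map = type == TOY_PTR_TO_STACK ||
			    type == TOY_PTR_TO_MAP_VALUE;

	if (src_is_imm)
		return ptr_is_dst;	/* ptr -=/+= imm32 */
	if (!stack_or_map)
		return false;		/* scalar register with another pointer type */
	if (ptr_is_dst)
		return true;		/* ptr += scalar, ptr -= scalar */
	return op == TOY_ADD;		/* scalar += ptr, but not scalar -= ptr */
}
```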
Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/CAP01T76HZ+s5h+_REqRFkRjjoKwnZZn9YswpSVinGicah1pGJw@mail.gmail.com/
Link: https://lore.kernel.org/bpf/CAP01T75oU0zfZCiymEcH3r-GQ5A6GOc6GmYzJEnMa3=53XuUQQ@mail.gmail.com/
Link: https://lore.kernel.org/r/20250603204557.332447-1-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
After commit 68ca5d4eeb ("bpf: support BPF cookie in raw tracepoint
(raw_tp, tp_btf) programs"), we can show the cookie in bpf_link_info
like kprobe etc.
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250603154309.3063644-1-chen.dylane@linux.dev
The following ftrace patch for riscv uses a data store to update ftrace
function. Therefore, a remote fence is required to order it against
function_trace_op updates. The mechanism is similar to the fence between
function_trace_op and update_ftrace_func in the generic ftrace, so we
leverage the same ftrace_sync_ipi function.
[ alex: Fix build warning when !CONFIG_DYNAMIC_FTRACE ]
Signed-off-by: Andy Chiu <andybnac@gmail.com>
Link: https://lore.kernel.org/r/20250407180838.42877-4-andybnac@gmail.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
There may be concurrency between perf_cgroup_switch and
perf_cgroup_event_disable. Consider the following scenario: after a new
perf cgroup event is created on CPU0, the new event may not trigger
a reprogramming, causing ctx->is_active to be 0. In this case, when CPU1
disables this perf event, it executes __perf_remove_from_context->
list_del_event->perf_cgroup_event_disable on CPU1, which causes a race
with perf_cgroup_switch running on CPU0.
The following describes the details of this concurrency scenario:
CPU0                                    CPU1
perf_cgroup_switch:
   ...
   # cpuctx->cgrp is not NULL here
   if (READ_ONCE(cpuctx->cgrp) == NULL)
        return;
                                        perf_remove_from_context:
                                           ...
                                           raw_spin_lock_irq(&ctx->lock);
                                           ...
                                           # ctx->is_active == 0 because reprogramming
                                           # is not triggered, so CPU1 can do
                                           # __perf_remove_from_context for CPU0
                                           __perf_remove_from_context:
                                              perf_cgroup_event_disable:
                                                 ...
                                                 if (--ctx->nr_cgroups)
                                                 ...
   # this warning will happen because CPU1 changed
   # ctx.nr_cgroups to 0.
   WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
[peterz: use guard instead of goto unlock]
Fixes: db4a835601 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250604033924.3914647-3-luogengkun@huaweicloud.com
Commit a3c3c6667 ("perf/core: Fix child_total_time_enabled accounting
bug at task exit") moves the event->state update to before
list_del_event(). This makes the event->state test in list_del_event()
always false; never calling perf_cgroup_event_disable().
As a result, cpuctx->cgrp won't be cleared properly; causing havoc.
Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: David Wang <00107082@163.com>
Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/
While chasing down a missing perf_cgroup_event_disable() elsewhere,
Leo Yan found that both perf_put_aux_event() and
perf_remove_sibling_event() were also missing one.
Specifically, the rule is that events that switch to OFF,ERROR need to
call perf_cgroup_event_disable().
Unify the disable paths to ensure this.
Fixes: ab43762ef0 ("perf: Allow normal events to output AUX data")
Fixes: 9f0c4fa111 ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability")
Reported-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605123343.GD35970@noisy.programming.kicks-ass.net
Baisheng Gao reported an ARM64 crash, which Mark decoded as being a
synchronous external abort -- most likely due to trying to access
MMIO in bad ways.
The crash further shows perf trying to do a user stack sample while in
exit_mmap()'s tlb_finish_mmu() -- i.e. while tearing down the address
space it is trying to access.
It turns out that we stop perf after we tear down the userspace mm; a
recipe for disaster, since perf likes to access userspace for
various reasons.
Flip this order by moving up where we stop perf in do_exit().
Additionally, harden PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER
to abort when the current task does not have an mm (exit_mm() makes
sure to set current->mm = NULL; before commencing with the actual
teardown), so that CPU-wide events don't trip on this same problem.
Fixes: c5ebcedb56 ("perf: Add ability to attach user stack dump to sample")
Reported-by: Baisheng Gao <baisheng.gao@unisoc.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605110815.GQ39944@noisy.programming.kicks-ass.net
Two fixes in the built-in idle selection helpers.
- Fix prev_cpu handling to guarantee that idle selection never returns a CPU
that's not allowed.
- Skip cross-node search when !NUMA which could lead to infinite looping due
to a bug in NUMA iterator.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaECWBw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGfV9AP4mk63TUsLdqwa6iqE0dZArOAhj8QXETsv1Q0+y
FGQKEQEAxPNJx3ifhzgXoQi1SOZkqeq44erbcy0x7owtb+QBVgs=
=G1uN
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"Two fixes in the built-in idle selection helpers:
- Fix prev_cpu handling to guarantee that idle selection never
returns a CPU that's not allowed
- Skip cross-node search when !NUMA which could lead to infinite
looping due to a bug in NUMA iterator"
* tag 'sched_ext-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: idle: Skip cross-node search with !CONFIG_NUMA
sched_ext: idle: Properly handle invalid prev_cpu during idle selection
- Fix UAF in module unload in ftrace when there's a bug in the module
If a module is buggy and triggers ftrace_disable which is set when
an anomaly is detected, when it gets unloaded it doesn't free
the hooks into kallsyms, and when a kallsyms lookup is performed
it may access the mod->modname field and crash via UAF.
Fix this by still freeing the mod_maps that are attached to kallsyms
on module unload regardless if ftrace_disable is set or not.
- Do not bother allocating mod_maps for kallsyms if ftrace_disable is set
- Remove unused trace events
When a trace event or tracepoint is created but not used, it still
creates the code and data structures needed for that trace event.
This just wastes memory.
A patch is being worked on to warn when a trace event is created but
not used: https://lore.kernel.org/linux-trace-kernel/20250529130138.544ffec4@gandalf.local.home/
Remove the trace events that are created but not used. This does not
remove trace events that are created but are not used due to configs
not being set. That will be handled later. This only removes events
that have no user under any config.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaD9LohQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qvrRAP4xRH01dQ3HkNF3mtKXuHEh8NbTlCEE
8wYyiI8ttjVdGAEAzq5sx2BQN2Of4RLOwYtxJSigZgmJjYYGmobeHISPjwc=
=d2Cp
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix UAF in module unload in ftrace when there's a bug in the module
If a module is buggy and triggers ftrace_disable which is set when an
anomaly is detected, when it gets unloaded it doesn't free the hooks
into kallsyms, and when a kallsyms lookup is performed it may access
the mod->modname field and crash via UAF.
Fix this by still freeing the mod_maps that are attached to kallsyms
on module unload regardless if ftrace_disable is set or not.
- Do not bother allocating mod_maps for kallsyms if ftrace_disable is
set
- Remove unused trace events
When a trace event or tracepoint is created but not used, it still
creates the code and data structures needed for that trace event.
This just wastes memory.
Remove the trace events that are created but not used. This does not
remove trace events that are created but are not used due to configs not
being set. That will be handled later. This only removes events that
have no user under any config.
* tag 'trace-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fsdax: Remove unused trace events for dax insert mapping
genirq/matrix: Remove unused irq_matrix_alloc_reserved tracepoint
xdp: Remove unused mem_return_failed event
ftrace: Don't allocate ftrace module map if ftrace is disabled
ftrace: Fix UAF when lookup kallsym after ftrace disabled
rstat per-subsystem split change skipped per-cpu allocation on UP configs;
however even on UP, depending on config options, the size of the percpu
struct may not be zero leading to crashes. Fix it by conditionalizing the
per-cpu area allocation and usage on the size of the per-cpu struct.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaD9Bww4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGa9iAP9z/bO3aW3AjpowrIBydwD7YqlCUiPV6XmfbPAH
MHyfhAEAl7WiHYZiGztupeuOAxUTgywiCDaSVhMtDzT6InJmtw0=
=UO6g
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"The rstat per-subsystem split change skipped per-cpu allocation on UP
configs; however even on UP, depending on config options, the size of
the percpu struct may not be zero leading to crashes.
Fix it by conditionalizing the per-cpu area allocation and usage on
the size of the per-cpu struct"
* tag 'cgroup-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: adjust criteria for rstat subsystem cpu lock access
In the idle CPU selection logic, attempting cross-node searches adds
unnecessary complexity when CONFIG_NUMA is disabled.
Since there's no meaningful concept of nodes in this case, simplify the
logic by restricting the idle CPU search to the current node only.
Fixes: 48849271e6 ("sched_ext: idle: Per-node idle cpumasks")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Make .static_call_sites in modules read-only after init
The .static_call_sites sections in modules have been made read-only after
init to avoid any (non-)accidental modifications, similarly to how they
are read-only after init in vmlinux.
- The rest are minor cleanups.
The changes have been on linux-next for 2 months, with the exception of the
last comment-only cleanup.
As discussed previously, we rotate module maintainership among its
co-maintainers every 6 months. Daniel Gomez is next in line and he will
send the next pull request for the modules.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEIduBR9MnFA82q/jtumpXJwqY6poFAmg9tSoUHHBldHIucGF2
bHVAc3VzZS5jb20ACgkQumpXJwqY6poY/gf9GgexhRSNp10orTltOFD9+c+g6ekn
q4UiLK7ooz4mFqPN4ljrSxvF1IYkjgrkkPDDEPZEZl2bQ7yWFb1xO9csihZeSI0l
Bf5m+Pxo+U2ylNr6wEsOS7P5GntraTkLJ4NHje/bYmIQMAjiowniTF3693FYs51d
bCD4Xn0zzZi+eekoAi/34GL5Zzkk5jhr2nBBvUZ3mXrRlR/iNpwUd4IWxb1Ujbfh
9YjAAGbeBowHzsyfPKoccJ/ybosrUKGqpVrp4KPe0JVoCvw8nwkxAfHGhsjjVk10
G0l4eCZ8KW87rPoJw5HvmTTNIxM10xKuZ4fa67/9ou+kHWRzMSa6ub4Jiw==
=JU3a
-----END PGP SIGNATURE-----
Merge tag 'modules-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull module updates from Petr Pavlu:
- Make .static_call_sites in modules read-only after init
The .static_call_sites sections in modules have been made read-only
after init to avoid any (non-)accidental modifications, similarly to
how they are read-only after init in vmlinux
- The rest are minor cleanups
* tag 'modules-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
module: Remove outdated comment about text_size
module: Make .static_call_sites read-only after init
module: Add a separate function to mark sections as read-only after init
module: Constify parameters of module_enforce_rwx_sections()
Sergey Senozhatsky adds infrastructure for passing algorithm-specific
parameters into zram. A single parameter `winbits' is implemented at
this time.
- The 5 patch series "memcg: nmi-safe kmem charging" from Shakeel Butt
makes memcg charging nmi-safe, which is required by BPF, which can
operate in NMI context.
- The 5 patch series "Some random fixes and cleanup to shmem" from
Kemeng Shi implements small fixes and cleanups in the shmem code.
- The 2 patch series "Skip mm selftests instead when kernel features are
not present" from Zi Yan fixes some issues in the MM selftest code.
- The 2 patch series "mm/damon: build-enable essential DAMON components
by default" from SeongJae Park reworks DAMON Kconfig to make it easier
to enable CONFIG_DAMON.
- The 2 patch series "sched/numa: add statistics of numa balance task
migration" from Libo Chen adds more info into sysfs and procfs files to
improve visibility into the NUMA balancer's task migration activity.
- The 4 patch series "selftests/mm: cow and gup_longterm cleanups" from
Mark Brown provides various updates to some of the MM selftests to make
them play better with the overall containing framework.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDzA9wAKCRDdBJ7gKXxA
js8sAP9V3COg+vzTmimzP3ocTkkbbIJzDfM6nXpE2EQ4BR3ejwD+NsIT2ZLtTF6O
LqAZpgO7ju6wMjR/lM30ebCq5qFbZAw=
=oruw
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- "zram: support algorithm-specific parameters" from Sergey Senozhatsky
adds infrastructure for passing algorithm-specific parameters into
zram. A single parameter `winbits' is implemented at this time.
- "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg
charging nmi-safe, which is required by BPF, which can operate in NMI
context.
- "Some random fixes and cleanup to shmem" from Kemeng Shi implements
small fixes and cleanups in the shmem code.
- "Skip mm selftests instead when kernel features are not present" from
Zi Yan fixes some issues in the MM selftest code.
- "mm/damon: build-enable essential DAMON components by default" from
SeongJae Park reworks DAMON Kconfig to make it easier to enable
CONFIG_DAMON.
- "sched/numa: add statistics of numa balance task migration" from Libo
Chen adds more info into sysfs and procfs files to improve visibility
into the NUMA balancer's task migration activity.
- "selftests/mm: cow and gup_longterm cleanups" from Mark Brown
provides various updates to some of the MM selftests to make them
play better with the overall containing framework.
* tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits)
mm/khugepaged: clean up refcount check using folio_expected_ref_count()
selftests/mm: fix test result reporting in gup_longterm
selftests/mm: report unique test names for each cow test
selftests/mm: add helper for logging test start and results
selftests/mm: use standard ksft_finished() in cow and gup_longterm
selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled
sched/numa: add statistics of numa balance task
sched/numa: fix task swap by skipping kernel threads
tools/testing: check correct variable in open_procmap()
tools/testing/vma: add missing function stub
mm/gup: update comment explaining why gup_fast() disables IRQs
selftests/mm: two fixes for the pfnmap test
mm/khugepaged: fix race with folio split/free using temporary reference
mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
mmu_notifiers: remove leftover stub macros
selftests/mm: deduplicate test names in madv_populate
kcov: rust: add flags for KCOV with Rust
mm: rust: make CONFIG_MMU ifdefs more narrow
mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
mm/damon/Kconfig: enable CONFIG_DAMON by default
...
* Clean up locking of all vCPUs for a VM by using the *_nest_lock()
family of functions, and move duplicated code to virt/kvm/.
kernel/ patches acked by Peter Zijlstra.
* Add MGLRU support to the access tracking perf test.
ARM fixes:
* Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change.
* Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o private
IRQs allocated.
* Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum.
* Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping.
* Avoid dereferencing a NULL pointer in the VGIC debug code, which can
happen if the device doesn't have any mapping yet.
s390:
* Fix interaction between some filesystems and Secure Execution
* Some cleanups and refactorings, preparing for an upcoming big series
x86:
* Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
* Decrypt and dump the VMSA in dump_vmcb() if debugging is enabled for the VM.
* Refine and harden handling of spurious faults.
* Add support for ALLOWED_SEV_FEATURES.
* Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
* Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
* Don't account temporary allocations in sev_send_update_data().
* Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
* Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
* Advertise support to userspace for WRMSRNS and PREFETCHI.
* Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
* Add a module param to control and enumerate support for device posted
interrupts.
* Fix a potential overflow with nested virt on Intel systems running 32-bit kernels.
* Flush shadow VMCSes on emergency reboot.
* Add support for SNP to the various SEV selftests.
* Add a selftest to verify fastops instructions via forced emulation.
* Refine and optimize KVM's software processing of the posted interrupt bitmap, and share
the harvesting code between KVM and the kernel's Posted MSI handler
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmg9TjwUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOUxQf7B7nnWqIKd7jSkGzSD6YsSX9TXktr
2tJIOfWM3zNYg5GRCidg+m4Y5+DqQWd3Hi5hH2P9wUw7RNuOjOFsDe+y0VBr8ysE
ve39t/yp+mYalNmHVFl8s3dBDgrIeGKiz+Wgw3zCQIBZ18rJE1dREhv37RlYZ3a2
wSvuObe8sVpCTyKIowDs1xUi7qJUBoopMSuqfleSHawRrcgCpV99U8/KNFF5plLH
7fXOBAHHniVCVc+mqQN2wxtVJDhST+U3TaU4GwlKy9Yevr+iibdOXffveeIgNEU4
D6q1F2zKp6UdV3+p8hxyaTTbiCVDqsp9WOgY/0I/f+CddYn0WVZgOlR+ow==
=mYFL
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm updates from Paolo Bonzini:
Generic:
- Clean up locking of all vCPUs for a VM by using the *_nest_lock()
family of functions, and move duplicated code to virt/kvm/. kernel/
patches acked by Peter Zijlstra
- Add MGLRU support to the access tracking perf test
ARM fixes:
- Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change
- Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o
private IRQs allocated
- Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum
- Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping
- Avoid dereferencing a NULL pointer in the VGIC debug code, which
can happen if the device doesn't have any mapping yet
s390:
- Fix interaction between some filesystems and Secure Execution
- Some cleanups and refactorings, preparing for an upcoming big
series
x86:
- Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE
to fix a race between AP destroy and VMRUN
- Decrypt and dump the VMSA in dump_vmcb() if debugging is enabled for
the VM
- Refine and harden handling of spurious faults
- Add support for ALLOWED_SEV_FEATURES
- Add #VMGEXIT to the set of handlers special cased for
CONFIG_RETPOLINE=y
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing
features that utilize those bits
- Don't account temporary allocations in sev_send_update_data()
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock
Threshold
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU
IBPB, between SVM and VMX
- Advertise support to userspace for WRMSRNS and PREFETCHI
- Rescan I/O APIC routes after handling EOI that needed to be
intercepted due to the old/previous routing, but not the
new/current routing
- Add a module param to control and enumerate support for device
posted interrupts
- Fix a potential overflow with nested virt on Intel systems running
32-bit kernels
- Flush shadow VMCSes on emergency reboot
- Add support for SNP to the various SEV selftests
- Add a selftest to verify fastops instructions via forced emulation
- Refine and optimize KVM's software processing of the posted
interrupt bitmap, and share the harvesting code between KVM and the
kernel's Posted MSI handler"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
rtmutex_api: provide correct extern functions
KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer
KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race
KVM: arm64: Unmap vLPIs affected by changes to GSI routing information
KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding()
KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock
KVM: arm64: Use lock guard in vgic_v4_set_forwarding()
KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation
arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue
KVM: s390: Simplify and move pv code
KVM: s390: Refactor and split some gmap helpers
KVM: s390: Remove unneeded srcu lock
s390: Remove unneeded includes
s390/uv: Improve splitting of large folios that cannot be split while dirty
s390/uv: Always return 0 from s390_wiggle_split_folio() if successful
s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful
rust: add helper for mutex_trylock
RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs
KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs
x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
...
If ftrace is disabled, it is meaningless to allocate a module map.
Add a check in allocate_ftrace_mod_map() to not allocate if ftrace is
disabled.
Link: https://lore.kernel.org/20250529111955.2349189-3-yebin@huaweicloud.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The following issue happens with a buggy module:
BUG: unable to handle page fault for address: ffffffffc05d0218
PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0
Oops: Oops: 0000 [#1] SMP KASAN PTI
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
RIP: 0010:sized_strscpy+0x81/0x2f0
RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000
RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d
RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68
R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038
R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff
FS: 00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ftrace_mod_get_kallsym+0x1ac/0x590
update_iter_mod+0x239/0x5b0
s_next+0x5b/0xa0
seq_read_iter+0x8c9/0x1070
seq_read+0x249/0x3b0
proc_reg_read+0x1b0/0x280
vfs_read+0x17f/0x920
ksys_read+0xf3/0x1c0
do_syscall_64+0x5f/0x2e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The above issue may happen as follows:
(1) Add kprobe tracepoint;
(2) insmod test.ko;
(3) Module triggers ftrace disabled;
(4) rmmod test.ko;
(5) cat /proc/kallsyms; --> Will trigger UAF as test.ko already removed;
ftrace_mod_get_kallsym()
...
strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
...
The problem is when a module triggers an issue with ftrace and
sets ftrace_disable. The ftrace_disable is set when an anomaly is
discovered and to prevent any more damage, ftrace stops all text
modification. The issue that happened was that the ftrace_disable stops
more than just the text modification.
When a module is loaded, its init functions can also be traced. Because
kallsyms deletes the init functions after a module has loaded, ftrace
saves them when the module is loaded and function tracing is enabled. This
allows the output of the function trace to show the init function names
instead of just their raw memory addresses.
When a module is removed, ftrace_release_mod() is called, and if
ftrace_disable is set, it just returns without doing anything more. The
problem here is that it leaves the mod_list still around and if kallsyms
is called, it will call into this code and access the module memory that
has already been freed as it will return:
strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
Where the "mod" no longer exists and triggers a UAF bug.
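The ordering problem described above, together with the allocation guard from the companion patch, can be sketched as a small userspace model. This is only an illustration under stated assumptions: every toy_ name is hypothetical and none of this is the kernel's actual code. It shows a release path that always drops list entries pointing at the departing module and only afterwards honours the disabled flag, plus an allocation path that refuses to add entries once the flag is set.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct toy_module {
	char name[32];
};

struct toy_mod_map {
	struct toy_mod_map *next;
	struct toy_module *mod;		/* borrowed pointer into the module */
};

static struct toy_mod_map *toy_mod_maps;	/* list walked by symbol lookups */
static bool toy_ftrace_disabled;		/* stands in for ftrace_disable */

/* Record a map for a module; skipped when tracing is already disabled. */
static struct toy_mod_map *toy_alloc_mod_map(struct toy_module *mod)
{
	struct toy_mod_map *map;

	if (toy_ftrace_disabled)
		return NULL;

	map = malloc(sizeof(*map));
	if (!map)
		return NULL;
	map->mod = mod;
	map->next = toy_mod_maps;
	toy_mod_maps = map;
	return map;
}

/*
 * Called on module unload. The list cleanup runs unconditionally, so no
 * entry can keep pointing at a module that is about to be freed.
 */
static void toy_release_mod(struct toy_module *mod)
{
	struct toy_mod_map **pp = &toy_mod_maps;

	while (*pp) {
		struct toy_mod_map *map = *pp;

		if (map->mod == mod) {
			*pp = map->next;
			free(map);
		} else {
			pp = &map->next;
		}
	}

	if (toy_ftrace_disabled)
		return;	/* only the rest of the teardown is skipped */

	/* ... the remaining teardown would go here ... */
}

/* Count entries that still reference a given module. */
static int toy_count_maps_for(const struct toy_module *mod)
{
	int n = 0;

	for (struct toy_mod_map *map = toy_mod_maps; map; map = map->next)
		if (map->mod == mod)
			n++;
	return n;
}
```

In this model an early return placed before the list walk would leave a stale entry behind, which corresponds to the dangling mod pointer in the report.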
Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/
Cc: stable@vger.kernel.org
Fixes: aba4b5c22c ("ftrace: Save module init functions kallsyms symbols for tracing")
Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Commit fb49f07ba1 ("locking/mutex: implement mutex_lock_killable_nest_lock")
changed the set of functions that mutex.c defines when CONFIG_DEBUG_LOCK_ALLOC
is set.
- it removed the "extern" declaration of mutex_lock_killable_nested from
include/linux/mutex.h, and replaced it with a macro since it could be
treated as a special case of _mutex_lock_killable. It also removed a
definition of the function in kernel/locking/mutex.c.
- likewise, it replaced the mutex_trylock() function with the more generic
mutex_trylock_nest_lock() and turned mutex_trylock() into a macro.
However, it left the old definitions in place in kernel/locking/rtmutex_api.c,
which causes failures when building with CONFIG_RT_MUTEXES=y. Bring over
the changes.
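The function-to-macro pattern described above can be sketched in plain C. This is a toy model under stated assumptions: the toy_ names are hypothetical and the kernel's real declarations are not reproduced here. It shows one generic trylock that takes an optional nest-lock annotation, with the plain variant expressed as a macro over it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy lock; not the kernel's struct mutex. */
struct toy_mutex {
	bool locked;
	const void *nest_lock;	/* annotation recorded for debugging only */
};

/* The single generic implementation, defined in exactly one place. */
static int toy_trylock_nest_lock(struct toy_mutex *lock, const void *nest_lock)
{
	if (lock->locked)
		return 0;	/* contended: trylock fails */
	lock->locked = true;
	lock->nest_lock = nest_lock;
	return 1;
}

/* The plain variant is a macro: the generic call with no nest lock. */
#define toy_trylock(lock) toy_trylock_nest_lock((lock), NULL)

static void toy_unlock(struct toy_mutex *lock)
{
	lock->locked = false;
	lock->nest_lock = NULL;
}
```

Once the name is a function-like macro, any other file that still carries an out-of-line definition of the same name no longer compiles, because the preprocessor expands the macro inside the definition. That is presumably the kind of breakage a second implementation file has to be updated for.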
Fixes: fb49f07ba1 ("locking/mutex: implement mutex_lock_killable_nest_lock")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On systems with NUMA balancing enabled, it has been found that tracking
task activities resulting from NUMA balancing is beneficial. NUMA
balancing employs two mechanisms for task migration: one is to migrate
a task to an idle CPU within its preferred node, and the other is to
swap tasks located on different nodes when they are on each other's
preferred nodes.
The kernel already provides NUMA page migration statistics in
/sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However, it
lacks statistics regarding task migration and swapping. Therefore,
relevant counts for task migration and swapping should be added.
The following two new fields:
numa_task_migrated
numa_task_swapped
will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
and /proc/vmstat.
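As a rough sketch of how a monitoring tool might pick these two fields out of vmstat-style text, assuming the usual "name value" line layout of /proc/vmstat and memory.stat: the helper name stat_lookup() is hypothetical, and /proc/{PID}/sched uses a different layout that this helper does not handle.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find "<name> <value>" at the start of a line in vmstat-style text.
 * Returns 0 and stores the value on success, -1 if the field is absent.
 */
static int stat_lookup(const char *text, const char *name,
		       unsigned long long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Require a space after the name so prefixes do not match. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ' &&
		    sscanf(line + len, "%llu", value) == 1)
			return 0;
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}
```

A caller would read the file into a buffer and call stat_lookup(buf, "numa_task_migrated", &v). On a kernel without these counters the helper returns -1.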
Introducing both per-task and per-memory cgroup (memcg) NUMA balancing
statistics facilitates a rapid evaluation of the performance and
resource utilization of the target workload. For instance, users can
first identify the container with high NUMA balancing activity and then
further pinpoint a specific task within that group, and subsequently
adjust the memory policy for that task. In short, although it is
possible to iterate through /proc/$pid/sched to locate the problematic
task, the introduction of aggregated NUMA balancing activity for tasks
within each memcg can assist users in identifying the task more
efficiently through a divide-and-conquer approach.
As Libo Chen pointed out, the memcg event relies on the text names in
vmstat_text, and /proc/vmstat generates corresponding items based on
vmstat_text. Thus, the relevant task migration and swapping events
introduced in vmstat_text also need to be populated by
count_vm_numa_event(), otherwise these values are zero in /proc/vmstat.
In theory, task migration and swap events are part of the scheduler's
activities. The reason for exposing them through the
memory.stat/vmstat interface is that we already have NUMA balancing
statistics in memory.stat/vmstat, and these events are closely related
to each other. Following Shakeel's suggestion, we describe the
end-to-end flow/story of all these events occurring on a timeline for
future reference:
The goal of NUMA balancing is to co-locate a task and its memory pages
on the same NUMA node. There are two strategies: migrate the pages to
the task's node, or migrate the task to the node where its pages
reside.
Suppose a task p1 is running on Node 0, but its pages are located on
Node 1. NUMA page fault statistics for p1 reveal its "page footprint"
across nodes. If NUMA balancing detects that most of p1's pages are on
Node 1:
1. Page Migration Attempt:
NUMA balancing first tries to migrate p1's pages to Node 0.
The numa_page_migrate counter increments.
2. Task Migration Strategies:
After the page migration finishes, NUMA balancing checks every
1 second to see if p1 can be migrated to Node 1.
Case 2.1: Idle CPU Available
If Node 1 has an idle CPU, p1 is directly scheduled there. This
event is logged as numa_task_migrated.
Case 2.2: No Idle CPU (Task Swap)
If all CPUs on Node 1 are busy, direct migration could cause CPU
contention or load imbalance. Instead, NUMA balancing selects a
candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its own
page footprint). p1 and p2 are swapped. This cross-node swap is
recorded as numa_task_swapped.
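The two task-migration cases above can be summarised as a small decision function. This is a simplified model, not the scheduler's code: the toy_ names are hypothetical, and the two boolean inputs stand in for checks the real implementation performs internally.

```c
#include <stdbool.h>

/* Hypothetical counters named after the new statistics. */
struct toy_numa_stats {
	unsigned long numa_task_migrated;
	unsigned long numa_task_swapped;
};

enum toy_numa_action { TOY_NUMA_NONE, TOY_NUMA_MIGRATE, TOY_NUMA_SWAP };

/*
 * Decide how to move a task toward its preferred node:
 *  - an idle CPU on that node allows a direct migration (case 2.1);
 *  - otherwise a suitable swap candidate allows a task swap (case 2.2);
 *  - with neither, nothing happens and no counter changes.
 */
static enum toy_numa_action toy_numa_place(struct toy_numa_stats *stats,
					   bool idle_cpu_on_preferred_node,
					   bool swap_candidate_available)
{
	if (idle_cpu_on_preferred_node) {
		stats->numa_task_migrated++;
		return TOY_NUMA_MIGRATE;
	}
	if (swap_candidate_available) {
		stats->numa_task_swapped++;
		return TOY_NUMA_SWAP;
	}
	return TOY_NUMA_NONE;
}
```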
Link: https://lkml.kernel.org/r/d00edb12ba0f0de3c5222f61487e65f2ac58f5b1.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/7ef90a88602ed536be46eba7152ed0d33bad5790.1748002400.git.yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ayush Jain <Ayush.jain3@amd.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "sched/numa: add statistics of numa balance task migration",
v6.
Introduce task migration and swap statistics in the following places:
/sys/fs/cgroup/{GROUP}/memory.stat
/proc/{PID}/sched
/proc/vmstat
These statistics facilitate a rapid evaluation of the performance and
resource utilization of the target workload.
This patch (of 2):
Task swapping is triggered when there are no idle CPUs in task A's
preferred node. In this case, the NUMA load balancer chooses a task B
on A's preferred node and swaps B with A. This helps improve NUMA
locality without introducing load imbalance between nodes. In the
current implementation, B's NUMA node preference is not mandatory.
That is to say, a kernel thread might be incorrectly chosen as B.
However, kernel threads and user space threads that do not have an mm
are not supposed to be covered by NUMA balancing because NUMA balancing
only considers user pages via VMAs.
According to Peter's suggestion for fixing this issue, we use
PF_KTHREAD to skip the kernel thread. curr->mm is also checked because
it is possible that user_mode_thread() might create a user thread
without an mm. As per Prateek's analysis, after adding the PF_KTHREAD
check, there is no need to further check the PF_IDLE flag:
: - play_idle_precise() already ensures PF_KTHREAD is set before adding
: PF_IDLE
:
: - cpu_startup_entry() is only called from the startup thread which
: should be marked with PF_KTHREAD (based on my understanding looking at
: commit cff9b2332a ("kernel/sched: Modify initial boot task idle
: setup"))
In summary, the check in task_numa_compare() now aligns with
task_tick_numa().
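A minimal sketch of the eligibility test described above, assuming a toy task structure: the toy_ types and the flag bit value are placeholders, not the kernel's definitions. A task is a valid swap candidate only if it is not a kernel thread and it has an mm.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PF_KTHREAD 0x1u	/* placeholder bit, not the kernel's value */

struct toy_mm {
	int unused;
};

struct toy_task {
	unsigned int flags;
	struct toy_mm *mm;	/* NULL when the task has no address space */
};

/* Return true if the task may be picked as the swap partner. */
static bool toy_numa_swap_candidate(const struct toy_task *t)
{
	if (t->flags & TOY_PF_KTHREAD)
		return false;	/* kernel threads are never balanced */
	if (!t->mm)
		return false;	/* user thread without an mm: nothing to balance */
	return true;
}
```

Both conditions are kept because, as the message notes, a user thread can exist without an mm, so the flag check alone would not cover that case.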
Link: https://lkml.kernel.org/r/cover.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/43d68b356b25d124f0d222ebedf3859e86eefb9f.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/cover.1748002400.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/eaacc9c9bd37bac92d43a671867d85b2fdad3b06.1748002400.git.yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Tested-by: Ayush Jain <Ayush.jain3@amd.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All users of page->index have been converted to not refer to it any more.
Update a few pieces of documentation that were missed and prevent new
users from appearing (or at least make them easy to grep for).
Link: https://lkml.kernel.org/r/20250514181508.3019795-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
semaphore" from Lance Yang enhances the hung task detector. The
detector presently dumps the blocking task's stack when it is blocked
on a mutex. Lance's series extends this to semaphores.
- The 2 patch series "nilfs2: improve sanity checks in dirty state
propagation" from Wentao Liang addresses a couple of minor flaws in
nilfs2.
- The 2 patch series "scripts/gdb: Fixes related to lx_per_cpu()" from
Illia Ostapyshyn fixes a couple of issues in the gdb scripts.
- The 9 patch series "Support kdump with LUKS encryption by reusing LUKS
volume keys" from Coiby Xu addresses a usability problem with kdump.
When the dump device is LUKS-encrypted, the kdump kernel may not have
the keys to the encrypted filesystem. A full writeup of this is in the
series [0/N] cover letter.
- The 2 patch series "sysfs: add counters for lockups and stalls" from
Max Kellermann adds /sys/kernel/hardlockup_count,
/sys/kernel/softlockup_count and /sys/kernel/rcu_stall_count.
- The 3 patch series "fork: Page operation cleanups in the fork code"
from Pasha Tatashin implements a number of code cleanups in fork.c.
- The 3 patch series "scripts/gdb/symbols: determine KASLR offset on
s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in
the gdb scripts.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDuCvQAKCRDdBJ7gKXxA
jrkxAQCnFAp/uK9ckkbN4nfpJ0+OMY36C+A+dawSDtuRsIkXBAEAq3e6MNAUdg5W
Ca0cXdgSIq1Op7ZKEA+66Km6Rfvfow8=
=g45L
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "hung_task: extend blocking task stacktrace dump to semaphore" from
Lance Yang enhances the hung task detector.
The detector presently dumps the blocking task's stack when it is
blocked on a mutex. Lance's series extends this to semaphores
- "nilfs2: improve sanity checks in dirty state propagation" from
Wentao Liang addresses a couple of minor flaws in nilfs2
- "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn
fixes a couple of issues in the gdb scripts
- "Support kdump with LUKS encryption by reusing LUKS volume keys" from
Coiby Xu addresses a usability problem with kdump.
When the dump device is LUKS-encrypted, the kdump kernel may not have
the keys to the encrypted filesystem. A full writeup of this is in
the series [0/N] cover letter
- "sysfs: add counters for lockups and stalls" from Max Kellermann adds
/sys/kernel/hardlockup_count, /sys/kernel/softlockup_count and
/sys/kernel/rcu_stall_count
- "fork: Page operation cleanups in the fork code" from Pasha Tatashin
implements a number of code cleanups in fork.c
- "scripts/gdb/symbols: determine KASLR offset on s390 during early
boot" from Ilya Leoshkevich fixes some s390 issues in the gdb
scripts
* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
llist: make llist_add_batch() a static inline
delayacct: remove redundant code and adjust indentation
squashfs: add optional full compressed block caching
crash_dump, nvme: select CONFIGFS_FS as built-in
scripts/gdb/symbols: determine KASLR offset on s390 during early boot
scripts/gdb/symbols: factor out pagination_off()
scripts/gdb/symbols: factor out get_vmlinux()
kernel/panic.c: format kernel-doc comments
mailmap: update and consolidate Casey Connolly's name and email
nilfs2: remove wbc->for_reclaim handling
fork: define a local GFP_VMAP_STACK
fork: check charging success before zeroing stack
fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
fork: clean-up ifdef logic around stack allocation
kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
x86/crash: make the page that stores the dm crypt keys inaccessible
x86/crash: pass dm crypt keys to kdump kernel
Revert "x86/mm: Remove unused __set_memory_prot()"
crash_dump: retrieve dm crypt keys in kdump kernel
...
simplifies the act of creating a pte which addresses the first page in a
folio and reduces the amount of plumbing which architecture must
implement to provide this.
- The 8 patch series "Misc folio patches for 6.16" from Matthew Wilcox
is a shower of largely unrelated folio infrastructure changes which
clean things up and better prepare us for future work.
- The 3 patch series "memory,x86,acpi: hotplug memory alignment
advisement" from Gregory Price adds early-init code to prevent x86 from
leaving physical memory unused when physical address regions are not
aligned to memory block size.
- The 2 patch series "mm/compaction: allow more aggressive proactive
compaction" from Michal Clapinski provides some tuning of the (sadly,
hard-coded (more sadly, not auto-tuned)) thresholds for our invocation
of proactive compaction. In a simple test case, the reduction of a guest
VM's memory consumption was dramatic.
- The 8 patch series "Minor cleanups and improvements to swap freeing
code" from Kemeng Shi provides some code cleanups and a small efficiency
improvement to this part of our swap handling code.
- The 6 patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API"
from Dmitry Levin adds the ability for a ptracer to modify syscall
arguments. At this time we can alter only "system call information that
are used by strace system call tampering, namely, syscall number,
syscall arguments, and syscall return value".
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- The 3 patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report
guard regions" from Andrei Vagin extends the info returned by the
PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more
efficiently get at the info about guard regions.
- The 2 patch series "Fix parameter passed to page_mapcount_is_type()"
from Gavin Shan implements that fix. No runtime effect is expected
because validate_page_before_insert() happens to fix up this error.
- The 3 patch series "kernel/events/uprobes: uprobe_write_opcode()
rewrite" from David Hildenbrand basically brings uprobe text poking into
the current decade. Remove a bunch of hand-rolled implementation in
favor of using more current facilities.
- The 3 patch series "mm/ptdump: Drop assumption that pxd_val() is u64"
from Anshuman Khandual provides enhancements and generalizations to the
pte dumping code. This might be needed when 128-bit Page Table
Descriptors are enabled for ARM.
- The 12 patch series "Always call constructor for kernel page tables"
from Kevin Brodsky "ensures that the ctor/dtor is always called for
kernel pgtables, as it already is for user pgtables". This permits the
addition of more functionality such as "insert hooks to protect page
tables". This change does result in various architectures performing
unnecessary work, but this is fixed up where it is anticipated to occur.
- The 9 patch series "Rust support for mm_struct, vm_area_struct, and
mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM
structures.
- The 3 patch series "fix incorrectly disallowed anonymous VMA merges"
from Lorenzo Stoakes takes advantage of some VMA merging opportunities
which we've been missing for 15 years.
- The 4 patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED
and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB
flushing. Instead of flushing each address range in the provided iovec,
we batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- The 6 patch series "Track node vacancy to reduce worst case allocation
counts" from Sidhartha Kumar makes the maple tree smarter about its node
preallocation. stress-ng mmap performance increased by single-digit
percentages and the amount of unnecessarily preallocated memory was
dramatically reduced.
- The 3 patch series "mm/gup: Minor fix, cleanup and improvements" from
Baoquan He removes a few unnecessary things which Baoquan noted when
reading the code.
- The 3 patch series "Enhance sysfs handling for memory hotplug in
weighted interleave" from Rakie Kim "enhances the weighted interleave
policy in the memory management subsystem by improving sysfs handling,
fixing memory leaks, and introducing dynamic sysfs updates for memory
hotplug support". Fixes things on error paths which we are unlikely to
hit.
- The 7 patch series "mm/damon: auto-tune DAMOS for NUMA setups
including tiered memory" from SeongJae Park introduces new DAMOS quota
goal metrics which eliminate the manual tuning which is required when
utilizing DAMON for memory tiering.
- The 5 patch series "mm/vmalloc.c: code cleanup and improvements" from
Baoquan He provides cleanups and small efficiency improvements which
Baoquan found via code inspection.
- The 2 patch series "vmscan: enforce mems_effective during demotion"
from Gregory Price "changes reclaim to respect cpuset.mems_effective
during demotion when possible", because "presently, reclaim explicitly
ignores cpuset.mems_effective when demoting, which may cause the cpuset
settings to be violated." "This is useful for isolating workloads on a
multi-tenant system from certain classes of memory more consistently."
- The 2 patch series "Clean up split_huge_pmd_locked() and remove
unnecessary folio pointers" from Gavin Guo provides minor cleanups and
efficiency gains in the huge page splitting and migrating code.
- The 3 patch series "Use kmem_cache for memcg alloc" from Huan Yang
creates a slab cache for `struct mem_cgroup', yielding improved memory
utilization.
- The 4 patch series "add max arg to swappiness in memory.reclaim and
lru_gen" from Zhongkun He adds a new "max" value for the "swappiness="
argument of memory.reclaim and MGLRU's lru_gen. This directs proactive
reclaim to reclaim from only anon folios rather than file-backed folios.
- The 17 patch series "kexec: introduce Kexec HandOver (KHO)" from Mike
Rapoport is the first step on the path to permitting the kernel to
maintain existing VMs while replacing the host kernel via file-based
kexec. At this time only memblock's reserve_mem is preserved.
- The 7 patch series "mm: Introduce for_each_valid_pfn()" from David
Woodhouse provides and uses a smarter way of looping over a pfn range,
by skipping ranges of invalid pfns.
- The 2 patch series "sched/numa: Skip VMA scanning on memory pinned to
one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless
VMA scanning when a task is pinned to a single NUMA node. Dramatic
performance benefits were seen in some real world cases.
- The 2 patch series "JFS: Implement migrate_folio for
jfs_metapage_aops" from Shivank Garg addresses a warning which occurs
during memory compaction when using JFS.
- The 4 patch series "move all VMA allocation, freeing and duplication
logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c
into the more appropriate mm/vma.c.
- The 6 patch series "mm, swap: clean up swap cache mapping helper" from
Kairui Song provides code consolidation and cleanups related to the
folio_index() function.
- The 2 patch series "mm/gup: Cleanup memfd_pin_folios()" from Vishal
Moola does that.
- The 8 patch series "memcg: Fix test_memcg_min/low test failures" from
Waiman Long addresses some bogus failures which are being reported by
the test_memcontrol selftest.
- The 3 patch series "eliminate mmap() retry merge, add .mmap_prepare
hook" from Lorenzo Stoakes commences the deprecation of
file_operations.mmap() in favor of the new
file_operations.mmap_prepare(). The latter is more restrictive and
prevents drivers from messing with things in ways which, amongst other
problems, may defeat VMA merging.
- The 4 patch series "memcg: decouple memcg and objcg stocks" from
Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's
one. This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- The 6 patch series "mm/damon: minor fixups and improvements for code,
tests, and documents" from SeongJae Park is "yet another batch of
miscellaneous DAMON changes. Fix and improve minor problems in code,
tests and documents."
- The 7 patch series "memcg: make memcg stats irq safe" from Shakeel
Butt converts memcg stats to be irq safe. Another step along the way to
making memcg charging and stats updates NMI-safe, a BPF requirement.
- The 4 patch series "Let unmap_hugepage_range() and several related
functions take folio instead of page" from Fan Ni provides folio
conversions in the hugetlb code.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDt5qgAKCRDdBJ7gKXxA
ju6XAP9nTiSfRz8Cz1n5LJZpFKEGzLpSihCYyR6P3o1L9oe3mwEAlZ5+XAwk2I5x
Qqb/UGMEpilyre1PayQqOnct3aSL9Ao=
=tYYm
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
creating a pte which addresses the first page in a folio and reduces
the amount of plumbing which architecture must implement to provide
this.
- "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
largely unrelated folio infrastructure changes which clean things up
and better prepare us for future work.
- "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
Price adds early-init code to prevent x86 from leaving physical
memory unused when physical address regions are not aligned to memory
block size.
- "mm/compaction: allow more aggressive proactive compaction" from
Michal Clapinski provides some tuning of the (sadly, hard-coded (more
sadly, not auto-tuned)) thresholds for our invocation of proactive
compaction. In a simple test case, the reduction of a guest VM's
memory consumption was dramatic.
- "Minor cleanups and improvements to swap freeing code" from Kemeng
Shi provides some code cleanups and a small efficiency improvement to
this part of our swap handling code.
- "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
adds the ability for a ptracer to modify syscall arguments. At this
time we can alter only "system call information that are used by
strace system call tampering, namely, syscall number, syscall
arguments, and syscall return value".
This series should have been incorporated into mm.git's "non-MM"
branch, but I goofed.
- "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
against /proc/pid/pagemap. This permits CRIU to more efficiently get
at the info about guard regions.
- "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
implements that fix. No runtime effect is expected because
validate_page_before_insert() happens to fix up this error.
- "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
Hildenbrand basically brings uprobe text poking into the current
decade. Remove a bunch of hand-rolled implementation in favor of
using more current facilities.
- "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
Khandual provides enhancements and generalizations to the pte dumping
code. This might be needed when 128-bit Page Table Descriptors are
enabled for ARM.
- "Always call constructor for kernel page tables" from Kevin Brodsky
ensures that the ctor/dtor is always called for kernel pgtables, as
it already is for user pgtables.
This permits the addition of more functionality such as "insert hooks
to protect page tables". This change does result in various
architectures performing unnecessary work, but this is fixed up where
it is anticipated to occur.
- "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
Ryhl adds plumbing to permit Rust access to core MM structures.
- "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
Stoakes takes advantage of some VMA merging opportunities which we've
been missing for 15 years.
- "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
SeongJae Park optimizes process_madvise()'s TLB flushing.
Instead of flushing each address range in the provided iovec, we
batch the flushing across all the iovec entries. The syscall's cost
was approximately halved with a microbenchmark which was designed to
load this particular operation.
- "Track node vacancy to reduce worst case allocation counts" from
Sidhartha Kumar makes the maple tree smarter about its node
preallocation.
stress-ng mmap performance increased by single-digit percentages and
the amount of unnecessarily preallocated memory was dramatically
reduced.
- "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
a few unnecessary things which Baoquan noted when reading the code.
- "Enhance sysfs handling for memory hotplug in weighted interleave"
from Rakie Kim "enhances the weighted interleave policy in the memory
management subsystem by improving sysfs handling, fixing memory
leaks, and introducing dynamic sysfs updates for memory hotplug
support". Fixes things on error paths which we are unlikely to hit.
- "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
from SeongJae Park introduces new DAMOS quota goal metrics which
eliminate the manual tuning which is required when utilizing DAMON
for memory tiering.
- "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
provides cleanups and small efficiency improvements which Baoquan
found via code inspection.
- "vmscan: enforce mems_effective during demotion" from Gregory Price
changes reclaim to respect cpuset.mems_effective during demotion when
possible, because presently, reclaim explicitly ignores
cpuset.mems_effective when demoting, which may cause the cpuset
settings to be violated.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently.
- "Clean up split_huge_pmd_locked() and remove unnecessary folio
pointers" from Gavin Guo provides minor cleanups and efficiency gains
in the huge page splitting and migrating code.
- "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
for `struct mem_cgroup', yielding improved memory utilization.
- "add max arg to swappiness in memory.reclaim and lru_gen" from
Zhongkun He adds a new "max" value for the "swappiness=" argument
of memory.reclaim and MGLRU's lru_gen.
This directs proactive reclaim to reclaim from only anon folios
rather than file-backed folios.
- "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
first step on the path to permitting the kernel to maintain existing
VMs while replacing the host kernel via file-based kexec. At this
time only memblock's reserve_mem is preserved.
- "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
and uses a smarter way of looping over a pfn range, by skipping
ranges of invalid pfns.
- "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
when a task is pinned to a single NUMA node.
Dramatic performance benefits were seen in some real world cases.
- "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
Garg addresses a warning which occurs during memory compaction when
using JFS.
- "move all VMA allocation, freeing and duplication logic to mm" from
Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
appropriate mm/vma.c.
- "mm, swap: clean up swap cache mapping helper" from Kairui Song
provides code consolidation and cleanups related to the folio_index()
function.
- "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.
- "memcg: Fix test_memcg_min/low test failures" from Waiman Long
addresses some bogus failures which are being reported by the
test_memcontrol selftest.
- "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
Stoakes commences the deprecation of file_operations.mmap() in favor
of the new file_operations.mmap_prepare().
The latter is more restrictive and prevents drivers from messing with
things in ways which, amongst other problems, may defeat VMA merging.
- "memcg: decouple memcg and objcg stocks" from Shakeel Butt decouples
the per-cpu memcg charge cache from the objcg's one.
This is a step along the way to making memcg and objcg charging
NMI-safe, which is a BPF requirement.
- "mm/damon: minor fixups and improvements for code, tests, and
documents" from SeongJae Park is yet another batch of miscellaneous
DAMON changes. Fix and improve minor problems in code, tests and
documents.
- "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
stats to be irq safe. Another step along the way to making memcg
charging and stats updates NMI-safe, a BPF requirement.
- "Let unmap_hugepage_range() and several related functions take folio
instead of page" from Fan Ni provides folio conversions in the
hugetlb code.
* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
mm: pcp: increase pcp->free_count threshold to trigger free_high
mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
mm/hugetlb: pass folio instead of page to unmap_ref_private()
memcg: objcg stock trylock without irq disabling
memcg: no stock lock for cpu hot-unplug
memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
memcg: make count_memcg_events re-entrant safe against irqs
memcg: make mod_memcg_state re-entrant safe against irqs
memcg: move preempt disable to callers of memcg_rstat_updated
memcg: memcg_rstat_updated re-entrant safe against irqs
mm: khugepaged: decouple SHMEM and file folios' collapse
selftests/eventfd: correct test name and improve messages
alloc_tag: check mem_profiling_support in alloc_tag_init
Docs/damon: update titles and brief introductions to explain DAMOS
selftests/damon/_damon_sysfs: read tried regions directories in order
mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
...
x86 already uses gcc-8 as the minimum version, this changes all other
architectures to the same version. gcc-8 is used in Debian 10 and Red
Hat Enterprise Linux 8, both of which are still supported, and binutils
2.30 is the oldest corresponding version on those. Ubuntu Pro 18.04 and
SUSE Linux Enterprise Server 15 both use gcc-7 as the system compiler
but additionally include toolchains that remain supported.
With the new minimum toolchain versions, a number of workarounds for older
versions can be dropped, in particular on x86_64 and arm64. Importantly,
the updated compiler version allows removing two of the five remaining
gcc plugins, as support for sancov and structleak features is already
included in modern compiler versions.
I tried collecting the known changes that are possible based on the
new toolchain version, but expect that more cleanups will be possible.
Since this touches multiple architectures, I merged the patches through
the asm-generic tree.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmg6vNMACgkQmmx57+YA
GNkOmg/+LtR9B2P27GPBeG8HnLTZ8hKELiyYeSk6ZFgQv5hevE37HV35Yru7e7gu
wcF6CgYr8ff4CVcHM7y0790oGew1thkqq5CklFIH0EwCDJx/mWfZR1SS2jfZIEWM
HSDOlQQd1S8oWia14tSnQos3nW3CB9/ABVTHH+Wvl3xn48WMRvgK2LJgGLuxJrt8
5DD9auHiLjchWB5tB4DU98IgWWgFUGMTsI6IayZ4dkF4CdWqd89h0Y3pjJYeBgHS
mPxzR2q8WjEmG9hp7QuZQgn/pAYleJAwHvvkoLrkQ2ieqx3FjWiwFbQp4CG1Sc8L
eBR1lnkqS2z/e7xJLfe86fOoKWWu4I0tZKhRan/0+UOGm5nXrGpqSxKS8ZDsRuAp
3fvyhIp1cYSa7Xkok8BFhLEFR0tguXJXnXBc3tWE5VXIfFNd0Ohh1GUYhXDAqWKh
i0jN9dSNhokM3AqBi6qZl5kmBnRA3UsIaOg3QRrqN8IlBPp+u7i5xsrJIUWvD95o
TO06admmLcCJT8n6ZfNVfRjBgzu8+t54UVaDx9YYwxoNGOSFwqOb8CSPTWPxLmDr
RKDUOvO8DBlP7uFz9neP+LxluA3DjurRZvb0z0AmCZ8/RXEmTMCyfP5a6esxquXt
0Bqo6hM9q+TeXTHNS1CNvqLSWWikw+AzS/ZPPvriYFn5lxtbq6c=
=pdDC
-----END PGP SIGNATURE-----
Merge tag 'gcc-minimum-version-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull compiler version requirement update from Arnd Bergmann:
"Require gcc-8 and binutils-2.30
x86 already uses gcc-8 as the minimum version, this changes all other
architectures to the same version. gcc-8 is used in Debian 10 and Red
Hat Enterprise Linux 8, both of which are still supported, and
binutils 2.30 is the oldest corresponding version on those.
Ubuntu Pro 18.04 and SUSE Linux Enterprise Server 15 both use gcc-7 as
the system compiler but additionally include toolchains that remain
supported.
With the new minimum toolchain versions, a number of workarounds for
older versions can be dropped, in particular on x86_64 and arm64.
Importantly, the updated compiler version allows removing two of the
five remaining gcc plugins, as support for sancov and structleak
features is already included in modern compiler versions.
I tried collecting the known changes that are possible based on the
new toolchain version, but expect that more cleanups will be possible.
Since this touches multiple architectures, I merged the patches
through the asm-generic tree."
* tag 'gcc-minimum-version-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
Makefile.kcov: apply needed compiler option unconditionally in CFLAGS_KCOV
Documentation: update binutils-2.30 version reference
gcc-plugins: remove SANCOV gcc plugin
Kbuild: remove structleak gcc plugin
arm64: drop binutils version checks
raid6: skip avx512 checks
kbuild: require gcc-8 and binutils-2.30
- Allow the persistent ring buffer to be memory mapped
In the last merge window there were issues with the implementation of
mapping the persistent ring buffer because it was assumed that the
persistent memory was just physical memory without being part of the
kernel virtual address space. But this was incorrect and the persistent
ring buffer can be mapped the same way as the allocated ring buffer is
mapped.
The meta data for the persistent ring buffer is different than the normal
ring buffer and the organization of mapping it to user space is a little
different. Make the updates needed to the meta data to allow the
persistent ring buffer to be mapped to user space.
- Fix cpus_read_lock() with buffer->mutex and cpu_buffer->mapping_lock
Mapping the ring buffer to user space uses the cpu_buffer->mapping_lock.
The buffer->mutex can be taken when the mapping_lock is held, giving the
locking order of: cpu_buffer->mapping_lock -->> buffer->mutex. But there
also exists the ordering:
buffer->mutex -->> cpus_read_lock()
mm->mmap_lock -->> cpu_buffer->mapping_lock
cpus_read_lock() -->> mm->mmap_lock
causing a circular chain of:
cpu_buffer->mapping_lock -->> buffer->mutex -->> cpus_read_lock() -->>
mm->mmap_lock -->> cpu_buffer->mapping_lock
Moving the cpus_read_lock() outside the buffer->mutex, so that the order
is cpus_read_lock() -->> buffer->mutex, breaks the deadlock chain.
- Do not trigger WARN_ON() for commit overrun
When the ring buffer is user space mapped and there's a "commit overrun"
(where an interrupt preempted an event, and then added so many events it
filled the buffer having to drop events when it hit the preempted event)
a WARN_ON() was triggered if this was read via a memory mapped buffer.
This is due to "missed events" being non zero when the reader page ended
up with the commit page. The idea was, if the writer is on the reader page,
there's only one page that has been written to and there should be no
missed events. But if a commit overrun is done where the writer is off the
commit page and looped around to the commit page causing missed events, it
is possible that the reader page is the commit page with missed events.
Instead of triggering a WARN_ON() when the reader page is the commit page
with missed events, trigger it when the reader page is the tail_page with
missed events. That's because the writer is always on the tail_page if
an event was interrupted (which holds the commit event) and continues off
the commit page.
- Reset the persistent buffer if it is fully consumed
On boot up, if the user fully consumes the last boot buffer of the
persistent buffer and then reboots without enabling it, there will still be
events in the buffer which can cause confusion. Instead, reset the buffer
when it is fully consumed, so that the data is not read again.
- Clean up some goto out jumps
There are a few cases where the code jumps to the "out:" label that simply
returns a value. There used to be more work done at those labels but now
that they simply return a value use a return instead of jumping to a
label.
- Use guard() to simplify some of the code
Add guard() around some locking instead of jumping to a label to do the
unlocking.
- Use free() to simplify some of the code
Use free(kfree) on variables that will get freed on error and use
return_ptr() to return the variable when it's not freed. There's one
instance where free(kfree) simplifies the code on a temp variable that was
allocated just for the function use.
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCaDjJMxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qkDzAP468AZOnjIxezfzYEmtcDl8ZUgf2U3I
XtXjn7aKH/gZiwD/dCCZX2IY2gddqAb6s9Bo4/AWgtYbjacLPL+pWYbTJwQ=
=DOfF
-----END PGP SIGNATURE-----
Merge tag 'trace-ringbuffer-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer updates from Steven Rostedt:
- Allow the persistent ring buffer to be memory mapped
In the last merge window there were issues with the implementation of
mapping the persistent ring buffer because it was assumed that the
persistent memory was just physical memory without being part of the
kernel virtual address space. But this was incorrect and the
persistent ring buffer can be mapped the same way as the allocated
ring buffer is mapped.
The metadata for the persistent ring buffer is different than the
normal ring buffer and the organization of mapping it to user space
is a little different. Make the updates needed to the meta data to
allow the persistent ring buffer to be mapped to user space.
- Fix cpus_read_lock() with buffer->mutex and cpu_buffer->mapping_lock
Mapping the ring buffer to user space uses the
cpu_buffer->mapping_lock. The buffer->mutex can be taken when the
mapping_lock is held, giving the locking order of:
cpu_buffer->mapping_lock -->> buffer->mutex. But there also exists
the ordering:
buffer->mutex -->> cpus_read_lock()
mm->mmap_lock -->> cpu_buffer->mapping_lock
cpus_read_lock() -->> mm->mmap_lock
causing a circular chain of:
cpu_buffer->mapping_lock -->> buffer->mutex -->> cpus_read_lock() -->>
mm->mmap_lock -->> cpu_buffer->mapping_lock
Moving the cpus_read_lock() outside the buffer->mutex, so that the order
is cpus_read_lock() -->> buffer->mutex, breaks the deadlock chain.
- Do not trigger WARN_ON() for commit overrun
When the ring buffer is user space mapped and there's a "commit
overrun" (where an interrupt preempted an event, and then added so
many events it filled the buffer having to drop events when it hit
the preempted event) a WARN_ON() was triggered if this was read via a
memory mapped buffer.
This is due to "missed events" being non zero when the reader page
ended up with the commit page. The idea was, if the writer is on the
reader page, there's only one page that has been written to and there
should be no missed events.
But if a commit overrun is done where the writer is off the commit
page and looped around to the commit page causing missed events, it
is possible that the reader page is the commit page with missed
events.
Instead of triggering a WARN_ON() when the reader page is the commit
page with missed events, trigger it when the reader page is the
tail_page with missed events. That's because the writer is always on
the tail_page if an event was interrupted (which holds the commit
event) and continues off the commit page.
- Reset the persistent buffer if it is fully consumed
On boot up, if the user fully consumes the last boot buffer of the
persistent buffer and then reboots without enabling it, there will
still be events in the buffer which can cause confusion. Instead,
reset the buffer when it is fully consumed, so that the data is not
read again.
- Clean up some goto out jumps
There are a few cases where the code jumps to the "out:" label that
simply returns a value. There used to be more work done at those
labels but now that they simply return a value use a return instead
of jumping to a label.
- Use guard() to simplify some of the code
Add guard() around some locking instead of jumping to a label to do
the unlocking.
- Use free() to simplify some of the code
Use free(kfree) on variables that will get freed on error and use
return_ptr() to return the variable when it's not freed. There's one
instance where free(kfree) simplifies the code on a temp variable
that was allocated just for the function use.
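The guard() and free(kfree) items above rely on scope-based cleanup, which
the compiler provides through the "cleanup" variable attribute. The
following is a minimal user-space sketch of that pattern, not the kernel's
own macros: struct toy_lock, scoped_unlock, scoped_free and dup_nonempty()
are hypothetical names invented for illustration, and the manual hand-off
at the end only mimics what the text calls return_ptr().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a mutex, so the sketch needs no threading library. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { l->held = 0; }

/* Cleanup handlers receive a pointer to the annotated variable. */
static void cleanup_unlock(struct toy_lock **l) { toy_lock_release(*l); }
static void cleanup_free(char **p) { free(*p); }

#define scoped_unlock __attribute__((cleanup(cleanup_unlock), unused))
#define scoped_free   __attribute__((cleanup(cleanup_free)))

static struct toy_lock table_lock;

/* Returns a heap copy of src, or NULL if src is empty or allocation fails. */
static char *dup_nonempty(const char *src)
{
	toy_lock_acquire(&table_lock);
	/* Released on every return path below; no "goto out" label needed. */
	scoped_unlock struct toy_lock *guard = &table_lock;

	scoped_free char *buf = malloc(strlen(src) + 1);
	if (!buf)
		return NULL;
	strcpy(buf, src);
	if (buf[0] == '\0')
		return NULL;	/* buf is freed by its cleanup handler */

	/* Success: hand ownership to the caller and disarm the cleanup. */
	char *ret = buf;
	buf = NULL;
	return ret;
}
```

Cleanup handlers run in reverse declaration order, so in this sketch the
buffer is freed before the lock is dropped.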
* tag 'trace-ringbuffer-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Simplify functions with __free(kfree) to free allocations
ring-buffer: Make ring_buffer_{un}map() simpler with guard(mutex)
ring-buffer: Simplify ring_buffer_read_page() with guard()
ring-buffer: Simplify reset_disabled_cpu_buffer() with use of guard()
ring-buffer: Remove jump to out label in ring_buffer_swap_cpu()
ring-buffer: Removed unnecessary if() goto out where out is the next line
tracing: Reset last-boot buffers when reading out all cpu buffers
ring-buffer: Allow reserve_mem persistent ring buffers to be mmapped
ring-buffer: Do not trigger WARN_ON() due to a commit_overrun
ring-buffer: Move cpus_read_lock() outside of buffer->mutex
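The lock-ordering fix described in this pull can be checked on paper by
treating each "A -->> B" ordering as a directed edge and looking for a
cycle. The sketch below does that for the four locks named in the message.
It is an illustration only: the enum, struct lock_graph and the helper
names are hypothetical, and this is not how lockdep is implemented.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_id { MAPPING_LOCK, BUFFER_MUTEX, CPUS_READ_LOCK, MMAP_LOCK, NR_LOCKS };

/* edge[a][b] is true when lock a may be held while taking lock b. */
struct lock_graph { bool edge[NR_LOCKS][NR_LOCKS]; };

static void add_order(struct lock_graph *g, enum lock_id outer, enum lock_id inner)
{
	g->edge[outer][inner] = true;
}

/* Depth-first search; state is 0 = unvisited, 1 = on the stack, 2 = done. */
static bool visit(const struct lock_graph *g, int n, int *state)
{
	state[n] = 1;
	for (int m = 0; m < NR_LOCKS; m++) {
		if (!g->edge[n][m])
			continue;
		if (state[m] == 1)
			return true;	/* back edge: circular ordering */
		if (state[m] == 0 && visit(g, m, state))
			return true;
	}
	state[n] = 2;
	return false;
}

static bool has_cycle(const struct lock_graph *g)
{
	int state[NR_LOCKS] = { 0 };

	for (int n = 0; n < NR_LOCKS; n++)
		if (state[n] == 0 && visit(g, n, state))
			return true;
	return false;
}

/* Orderings listed in the message; only the second edge differs after the fix. */
static struct lock_graph ring_buffer_orders(bool fixed)
{
	struct lock_graph g = { 0 };

	add_order(&g, MAPPING_LOCK, BUFFER_MUTEX);
	if (fixed)
		add_order(&g, CPUS_READ_LOCK, BUFFER_MUTEX);
	else
		add_order(&g, BUFFER_MUTEX, CPUS_READ_LOCK);
	add_order(&g, MMAP_LOCK, MAPPING_LOCK);
	add_order(&g, CPUS_READ_LOCK, MMAP_LOCK);
	return g;
}

static void check_ring_buffer_orders(void)
{
	struct lock_graph before = ring_buffer_orders(false);
	struct lock_graph after = ring_buffer_orders(true);

	assert(has_cycle(&before));	/* the circular chain from the message */
	assert(!has_cycle(&after));	/* reversing one edge breaks it */
}
```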
The default idle selection policy doesn't properly handle the case where
@prev_cpu is not part of the task's allowed CPUs.
In this situation, it may return an idle CPU that is not usable by the
task, breaking the assumption that the returned CPU must always be
within the allowed cpumask, causing inefficiencies or even stalls in
certain cases.
This issue can arise in the following cases:
- The task's affinity may have changed by the time the function is
invoked, especially now that the idle selection logic can be used
from multiple contexts (i.e., BPF test_run call).
- The BPF scheduler may provide a @prev_cpu that is not part of the
allowed mask, either unintentionally or as a placement hint. In fact
@prev_cpu may not necessarily refer to the CPU the task last ran on,
but it can also be considered as a target CPU that the scheduler
wishes to use for the task.
Therefore, enforce the right behavior by always checking whether
@prev_cpu is in the allowed mask (when using scx_bpf_select_cpu_and())
and is also usable by the task (@p->cpus_ptr). If it is not, try to
find a valid CPU nearby @prev_cpu, following the usual locality-aware
fallback path (SMT, LLC, node, allowed CPUs).
This ensures the returned CPU is always allowed, improving robustness to
affinity changes and invalid scheduler hints, while preserving locality
as much as possible.
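The rule above can be sketched in user-space C with plain bitmasks. This is
a simplified illustration, not the sched_ext implementation:
pick_idle_cpu(), cpu_in() and NR_CPUS are hypothetical names, and "nearby"
is modelled as distance in CPU number rather than the real SMT, LLC and
node hierarchy.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8	/* assumed small fixed CPU count for the sketch */

static int cpu_in(uint64_t mask, int cpu)
{
	return (int)((mask >> cpu) & 1);
}

/*
 * Return an idle CPU the task is allowed to run on, preferring prev_cpu,
 * or -1 if there is none. prev_cpu is only a hint and may be outside
 * the allowed mask.
 */
static int pick_idle_cpu(int prev_cpu, uint64_t allowed, uint64_t idle)
{
	uint64_t usable = allowed & idle;

	assert(prev_cpu >= 0 && prev_cpu < NR_CPUS);

	/* Use the hint only if it is both allowed and idle. */
	if (cpu_in(usable, prev_cpu))
		return prev_cpu;

	/* Otherwise scan outward from prev_cpu, wrapping around. */
	for (int dist = 1; dist < NR_CPUS; dist++) {
		int cpu = (prev_cpu + dist) % NR_CPUS;

		if (cpu_in(usable, cpu))
			return cpu;
	}
	return -1;
}
```

With allowed = 0xF0 and every CPU idle, a hint of CPU 2 is rejected and the
scan returns CPU 4, the first allowed CPU after the hint.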
Fixes: a730e3f7a4 ("sched_ext: idle: Consolidate default idle CPU selection kfuncs")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Previously it was found that on uniprocessor machines the size of
raw_spinlock_t could be zero so a pre-processor conditional was used to
avoid the allocation of ss->rstat_ss_cpu_lock. The conditional did not take
into account cases where lock debugging features were enabled. Cover these
cases along with the original non-smp case by explicitly using the size of
the lock type as criteria for allocation/access where applicable.
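The idea of keying allocation on the size of the lock type, rather than on
a configuration macro, can be sketched in user-space C as follows.
alloc_percpu_locks() and its parameters are hypothetical; the zero size is
passed in explicitly here to stand for a lock type that compiles down to
nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns a zeroed array of nr_cpus locks, or NULL
 * when lock_size is 0 (nothing to allocate) or when allocation fails.
 * *err distinguishes the two NULL cases: 0 means "no lock needed".
 */
static void *alloc_percpu_locks(size_t lock_size, size_t nr_cpus, int *err)
{
	assert(nr_cpus > 0);
	*err = 0;

	/* A zero-sized lock type needs no storage and must not be accessed. */
	if (lock_size == 0)
		return NULL;

	void *locks = calloc(nr_cpus, lock_size);
	if (!locks)
		*err = -1;
	return locks;
}
```

Because the check uses the size itself, a build where debugging options
give the lock a nonzero size takes the allocation path without any extra
conditional.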
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Fixes: 748922dcfa "cgroup: use subsystem-specific rstat locks to avoid contention"
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202505281034.7ae1668d-lkp@intel.com
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Calling conventions of ->d_automount() made saner (flagday change)
vfs_submount() is gone - its sole remaining user (trace_automount) had
been switched to saner primitives.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCaDoRWQAKCRBZ7Krx/gZQ
6wxMAQCzuMc2GiGBMXzeK4SGA7d5rsK71unf+zczOd8NvbTImQEAs1Cu3u3bF3pq
EmHQWFTKBpBf+RHsLSoDHwUA+9THowM=
=GXLi
-----END PGP SIGNATURE-----
Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull automount updates from Al Viro:
"Automount wart removal
A bunch of odd boilerplate gone from instances - the reason for
those was the need to protect the yet-to-be-attached mount from
mark_mounts_for_expiry() deciding to take it out.
But that's easy to detect and take care of in mark_mounts_for_expiry()
itself; no need to have every instance simulate mount being busy by
grabbing an extra reference to it, with finish_automount() undoing
that once it attaches that mount.
Should've done it that way from the very beginning... This is a
flagday change, thankfully there are very few instances.
vfs_submount() is gone - its sole remaining user (trace_automount)
had been switched to saner primitives"
* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill vfs_submount()
saner calling conventions for ->d_automount()
Merge tag 'trace-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt:
- Have module addresses get updated in the persistent ring buffer
The addresses of the modules from the previous boot are saved in the
persistent ring buffer. If the same modules are loaded and an address
in the old buffer points to an address that was both saved in the
persistent ring buffer and is loaded in memory, shift the address to
point to the address that is loaded in memory in the trace event.
- Print function names for irqs off and preempt off callsites
When ignoring the print fmt of a trace event and just printing the
fields directly, have the fields for preempt off and irqs off events
still show the function name (via kallsyms) instead of just showing
the raw address.
- Clean ups of the histogram code
The histogram functions saved over 800 bytes on the stack to process
events as they come in. Instead, create per-cpu buffers that can hold
this information and have a separate location for each context level
(thread, softirq, IRQ and NMI).
Also add some more comments to the code.
- Add "common_comm" field for histograms
Add "common_comm" that uses the current->comm as a field in an event
histogram and acts like any of the other fields of the event.
- Show "subops" in the enabled_functions file
When the function graph infrastructure is used, a subsystem has a
"subops" that it attaches its callback function to. Instead of the
enabled_functions just showing a function calling the function that
calls the subops functions, also show the subops functions that will
get called for that function too.
- Add "copy_trace_marker" option to instances
There are cases where an instance is created for tooling to write
into, but the old tooling has the top level instance hardcoded into
the application. New tools want to consume the data from an instance
and not the top level buffer. By adding a copy_trace_marker option,
whenever the top instance trace_marker is written into, a copy of it
is also written into the instance with this option set. This allows
new tools to read what old tools are writing into the top buffer.
If this option is cleared by the top instance, then what is written
into the trace_marker is not written into the top instance. This is a
way to redirect the trace_marker writes into another instance.
- Have tracepoints created by DECLARE_TRACE() use trace_<name>_tp()
If a tracepoint is created by DECLARE_TRACE() instead of
TRACE_EVENT(), then it will not be exposed via tracefs. Currently
there's no way to differentiate in the kernel the tracepoint
functions between those that are exposed via tracefs or not. A
calling convention has been made manually to append a "_tp" suffix
for events created by DECLARE_TRACE(). Instead of doing this
manually, force it so that all DECLARE_TRACE() events have this
notation.
- Use __string() for task->comm in some sched events
Instead of hardcoding the comm to be TASK_COMM_LEN in some of the
scheduler events use __string() which makes it dynamic. Note, if
these events are parsed by user space they may break, and the
event may have to be converted back to the hardcoded size.
- Have function graph "depth" be unsigned to the user
Internally to the kernel, the "depth" field of the function graph
event is signed due to -1 being used for end of boundary. What
actually gets recorded in the event itself is zero or positive.
Reflect this to user space by showing "depth" as unsigned int and be
consistent across all events.
- Allow an arbitrarily long CPU string to osnoise_cpus_write()
The filtering of which CPUs to write to can exceed 256 bytes. If a
machine has 256 CPUs, and the filter is to filter every other CPU,
the write would take a string larger than 256 bytes. Instead of using
a fixed size buffer on the stack that is 256 bytes, allocate it to
handle what is passed in.
- Stop having ftrace check the per-cpu data "disabled" flag
The "disabled" flag in the data structure passed to most ftrace
functions is checked to know if tracing has been disabled or not.
This flag was added back in 2008 before the ring buffer had its own
way to disable tracing. The "disabled" flag is now not always set
when needed, and the ring buffer flag should be used in all locations
where the disabled state is needed. Since the "disabled" flag is
redundant and incorrect, stop using it. Fix up some locations that
use the "disabled" flag to use the ring buffer info.
- Use a new tracer_tracing_disable/enable() instead of data->disable
flag
There are a few cases that set the data->disable flag to stop tracing,
but this flag is not consistently used. It is also an on/off switch
where if a function sets it and calls another function that sets it,
the called function may incorrectly enable it.
Use a new tracer_tracing_disable() and tracer_tracing_enable() that
use a counter and can be nested. These use the ring buffer flags
which are always checked making the disabling more consistent.
- Save the trace clock in the persistent ring buffer
Save what clock was used for tracing in the persistent ring buffer
and set it back to that clock after a reboot.
- Remove unused reference to a per CPU data pointer in mmiotrace
functions
- Remove unused buffer_page field from trace_array_cpu structure
- Remove more strncpy() instances
- Other minor clean ups and fixes
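The difference between an on/off flag and a nesting counter, as described in the tracer_tracing_disable/enable() item above, can be modelled with a small userspace sketch. All names here (struct demo_tracer, demo_inner(), demo_outer()) are hypothetical and none of this is the tracing code itself; it only shows why a nested enable on a plain flag turns tracing back on too early.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model holding both styles side by side. */
struct demo_tracer {
	bool flag_disabled;	/* old style: a plain on/off switch */
	int disable_depth;	/* new style: counts nested disables */
};

static void demo_flag_disable(struct demo_tracer *t) { t->flag_disabled = true; }
static void demo_flag_enable(struct demo_tracer *t) { t->flag_disabled = false; }

static void demo_count_disable(struct demo_tracer *t) { t->disable_depth++; }
static void demo_count_enable(struct demo_tracer *t) { t->disable_depth--; }

static bool demo_count_is_on(const struct demo_tracer *t)
{
	return t->disable_depth == 0;
}

/* Inner helper that disables tracing around its own work. */
static void demo_inner(struct demo_tracer *t)
{
	demo_flag_disable(t);
	demo_count_disable(t);
	/* ... work that must not be traced ... */
	demo_flag_enable(t);	/* also undoes the caller's disable */
	demo_count_enable(t);	/* only undoes this function's disable */
}

/* Outer caller: reports whether tracing is still off after the nested call. */
static void demo_outer(struct demo_tracer *t, bool *flag_still_off,
		       bool *count_still_off)
{
	demo_flag_disable(t);
	demo_count_disable(t);
	demo_inner(t);
	*flag_still_off = t->flag_disabled;
	*count_still_off = !demo_count_is_on(t);
	demo_flag_enable(t);
	demo_count_enable(t);
}
```

After demo_inner() returns, the flag reports tracing as enabled although the outer caller still expects it to be off, while the counter keeps it off until the outer enable runs.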
* tag 'trace-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (36 commits)
tracing: Fix compilation warning on arm32
tracing: Record trace_clock and recover when reboot
tracing/sched: Use __string() instead of fixed lengths for task->comm
tracepoint: Have tracepoints created with DECLARE_TRACE() have _tp suffix
tracing: Cleanup upper_empty() in pid_list
tracing: Allow the top level trace_marker to write into another instances
tracing: Add a helper function to handle the dereference arg in verifier
tracing: Remove unnecessary "goto out" that simply returns ret is trigger code
tracing: Fix error handling in event_trigger_parse()
tracing: Rename event_trigger_alloc() to trigger_data_alloc()
tracing: Replace deprecated strncpy() with strscpy() for stack_trace_filter_buf
tracing: Remove unused buffer_page field from trace_array_cpu structure
tracing: Use atomic_inc_return() for updating "disabled" counter in irqsoff tracer
tracing: Convert the per CPU "disabled" counter to local from atomic
tracing: branch: Use trace_tracing_is_on_cpu() instead of "disabled" field
ring-buffer: Add ring_buffer_record_is_on_cpu()
tracing: Do not use per CPU array_buffer.data->disabled for cpumask
ftrace: Do not disabled function graph based on "disabled" field
tracing: kdb: Use tracer_tracing_on/off() instead of setting per CPU disabled
tracing: Use tracer_tracing_disable() instead of "disabled" field for ftrace_dump_one()
...
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"As far as x86 goes this pull request "only" includes TDX host support.
Quotes are appropriate because (at 6k lines and 100+ commits) it is
much bigger than the rest, which will come later this week and
consists mostly of bugfixes and selftests. s390 changes will also come
in the second batch.
ARM:
- Add large stage-2 mapping (THP) support for non-protected guests
when pKVM is enabled, clawing back some performance.
- Enable nested virtualisation support on systems that support it,
though it is disabled by default.
- Add UBSAN support to the standalone EL2 object used in nVHE/hVHE
and protected modes.
- Large rework of the way KVM tracks architecture features and links
them with the effects of control bits. While this has no functional
impact, it ensures correctness of emulation (the data is
automatically extracted from the published JSON files), and helps
dealing with the evolution of the architecture.
- Significant changes to the way pKVM tracks ownership of pages,
avoiding page table walks by storing the state in the hypervisor's
vmemmap. This in turn enables the THP support described above.
- New selftest checking the pKVM ownership transition rules
- Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
even if the host didn't have it.
- Fixes for the address translation emulation, which happened to be
rather buggy in some specific contexts.
- Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
from the number of counters exposed to a guest and addressing a
number of issues in the process.
- Add a new selftest for the SVE host state being corrupted by a
guest.
- Keep HCR_EL2.xMO set at all times for systems running with the
kernel at EL2, ensuring that the window for interrupts is slightly
bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
- Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
from a pretty bad case of TLB corruption unless accesses to HCR_EL2
are heavily synchronised.
- Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
tables in a human-friendly fashion.
- and the usual random cleanups.
LoongArch:
- Don't flush tlb if the host supports hardware page table walks.
- Add KVM selftests support.
RISC-V:
- Add vector registers to get-reg-list selftest
- VCPU reset related improvements
- Remove scounteren initialization from VCPU reset
- Support VCPU reset from userspace using set_mpstate() ioctl
x86:
- Initial support for TDX in KVM.
This finally makes it possible to use the TDX module to run
confidential guests on Intel processors. This is quite a large
series, including support for private page tables (managed by the
TDX module and mirrored in KVM for efficiency), forwarding some
TDVMCALLs to userspace, and handling several special VM exits from
the TDX module.
This has been in the works for literally years and it's not really
possible to describe everything here, so I'll defer to the various
merge commits up to and including commit 7bcf7246c4 ('Merge
branch 'kvm-tdx-finish-initial' into HEAD')"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
x86/tdx: mark tdh_vp_enter() as __flatten
Documentation: virt/kvm: remove unreferenced footnote
RISC-V: KVM: lock the correct mp_state during reset
KVM: arm64: Fix documentation for vgic_its_iter_next()
KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
KVM: arm64: Stage-2 huge mappings for np-guests
KVM: arm64: Add a range to pkvm_mappings
KVM: arm64: Convert pkvm_mappings to interval tree
KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
KVM: arm64: Add a range to __pkvm_host_unshare_guest()
KVM: arm64: Add a range to __pkvm_host_share_guest()
KVM: arm64: Introduce for_each_hyp_page
KVM: arm64: Handle huge mappings for np-guest CMOs
KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
RISC-V: KVM: Remove scounteren initialization
KVM: RISC-V: remove unnecessary SBI reset state
...
The function rb_allocate_pages() allocates cpu_buffer and on error needs
to free it. It has a single return. Use __free(kfree) and return directly
on errors and have the return use return_ptr(cpu_buffer).
The function alloc_buffer() allocates buffer and on error needs to free
it. It has a single return. Use __free(kfree) and return directly on
errors and have the return use return_ptr(buffer).
The function __rb_map_vma() allocates a temporary array "pages". Have it
use __free() and not worry about freeing it when returning.
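For readers unfamiliar with the pattern, the following userspace sketch shows the idea behind __free(kfree) and return_ptr() using the compiler's cleanup attribute (GCC and Clang). It is an approximation under stated assumptions: demo_buffer_cleanup(), demo_return_ptr() and struct demo_buffer are hypothetical names, and free() stands in for kfree().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_buffer {
	size_t nr_pages;
};

static int demo_auto_frees;	/* counts automatic frees, for illustration only */

/* Cleanup handler: called with the address of the annotated variable. */
static void demo_buffer_cleanup(struct demo_buffer **bufp)
{
	if (*bufp) {
		demo_auto_frees++;
		free(*bufp);
	}
}

/* Hand ownership to the caller: clear the variable so cleanup sees NULL. */
#define demo_return_ptr(p) \
	do { __typeof__(p) demo_tmp_ = (p); (p) = NULL; return demo_tmp_; } while (0)

static struct demo_buffer *demo_alloc_buffer(size_t nr_pages)
{
	struct demo_buffer *buf __attribute__((cleanup(demo_buffer_cleanup))) =
		calloc(1, sizeof(*buf));

	if (!buf)
		return NULL;
	if (nr_pages == 0)
		return NULL;	/* error path: buf is freed on scope exit */

	buf->nr_pages = nr_pages;
	demo_return_ptr(buf);	/* success: caller now owns the memory */
}
```

Every early return frees the allocation without a shared error label, and only the success path passes the pointer out.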
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527143144.6edc4625@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Convert the taking of the buffer->mutex and the cpu_buffer->mapping_lock
over to guard(mutex) and simplify the ring_buffer_map() and
ring_buffer_unmap() functions.
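A scope-based guard of this kind can be sketched in userspace with the cleanup attribute (GCC and Clang). The lock type below is a toy with a single "held" field and DEMO_GUARD() is a hypothetical macro, so this only illustrates the shape of guard(mutex), not its kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lock: a stand-in for a mutex, sufficient to observe lock state. */
struct demo_lock {
	int held;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void demo_lock_release(struct demo_lock *l)
{
	l->held = 0;
}

/* Cleanup handler: runs when the guard variable goes out of scope. */
static void demo_guard_exit(struct demo_lock **l)
{
	if (*l)
		demo_lock_release(*l);
}

/* Take the lock now and release it automatically at end of scope. */
#define DEMO_GUARD(lockp) \
	struct demo_lock *demo_guard_ __attribute__((cleanup(demo_guard_exit))) = \
		(demo_lock_acquire(lockp), (lockp))

/* Hypothetical map operation with an early error return. */
static int demo_map(struct demo_lock *l, int *user_mapped, int fail)
{
	DEMO_GUARD(l);

	if (fail)
		return -1;	/* lock is dropped automatically */

	(*user_mapped)++;
	return 0;		/* lock is dropped here as well */
}
```

Because the unlock is tied to scope exit, the function needs no unlock label and every return path leaves the lock released.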
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250527122009.267efb72@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function ring_buffer_read_page() had two gotos. One was simply
returning "ret" and the other was unlocking the reader_lock.
There's no reason to use goto to simply return the "ret" variable. Instead
just return the value.
The jump to the unlocking of the reader_lock can be replaced by
guard(raw_spinlock_irqsave)(&cpu_buffer->reader_lock).
With these two changes the "ret" variable is no longer used and can be
removed. The return value on non-error is what was read and is stored in
the "read" variable.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527145216.0187cf36@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use guard(raw_spinlock_irqsave)() in reset_disabled_cpu_buffer() to
simplify the locking.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527144623.77a9cc47@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function ring_buffer_swap_cpu() has a bunch of jumps to the label out
that simply returns "ret". There's no reason to jump to a label that
simply returns a value. Just return directly from there.
This goes back to almost the beginning when commit 8aabee573d
("ring-buffer: remove unneeded get_online_cpus") was introduced. That
commit removed a put_online_cpus() from that label, but never updated all
the jumps to it that now no longer needed to do anything but return a
value.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527145753.6b45d840@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In the function ring_buffer_discard_commit() there's an if statement that
jumps to the next line:
	if (rb_try_to_discard(cpu_buffer, event))
		goto out;
 out:
This was caused by the change that modified the way timestamps were taken
in interrupt context, and removed the code between the if statement and
the goto, but failed to update the conditional logic.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527155116.227f35be@gandalf.local.home
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reset the last-boot ring buffers when read() reads out all cpu
buffers through trace_pipe/trace_pipe_raw. This prevents ftrace from
unwinding the ring buffer read pointer on the next boot.
Note that this resets only when all per-cpu buffers are empty, and
read via read(2) syscall. For example, if you read only one of the
per-cpu trace_pipe files, the buffer is not reset. Also, reading the
buffer with the splice(2) syscall does not reset it because some data
remains in the reader (the last) page.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174792929202.496143.8184644221859580999.stgit@mhiramat.tok.corp.google.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the persistent ring buffer is created from the memory returned by
reserve_mem there is nothing prohibiting it from being memory mapped to
user space. The memory is the same as the pages allocated by alloc_page().
The way the memory is managed by the ring buffer code is slightly
different though and needs to be addressed.
The persistent memory uses the page->id for its own purpose whereas the
user mmap buffer currently uses that for the subbuf array mapped to user
space. If the buffer is a persistent buffer, use the page index into that
buffer as the identifier instead of the page->id.
That is, the page->id for a persistent buffer represents the position of
the page in the linked list. ->id == 0 means it is the reader page.
When a reader page is swapped, the new reader page's ->id gets zero, and
the old reader page gets the ->id of the page that it swapped with.
For the user space mapping, the ->id is the index of where the page was
mapped in user space and does not change while it is mapped.
Since the persistent buffer is fixed in its location, the index of where
a page is in the memory range can be used as the "id" to put in the meta
page array, and it can be mapped in the same order to user space as it is
in the persistent memory.
A new rb_page_id() helper function is used to get and set the id depending
on if the page is a normal memory allocated buffer or a physical memory
mapped buffer.
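The get-and-set behaviour described for rb_page_id() might look roughly like the userspace sketch below. The structures and demo_page_id() are hypothetical simplifications, not the ring buffer code: a persistent buffer derives the id from the page's offset in its fixed memory range and leaves ->id alone, while a normal buffer stores and returns the id it is given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_page {
	int id;		/* persistent buffer: list position, owned by that code */
	char *data;	/* start of this sub-buffer's memory */
};

struct demo_cpu_buffer {
	bool persistent;	/* true when backed by a fixed memory range */
	char *range_start;	/* start of that range (persistent only) */
	size_t subbuf_size;	/* size of one sub-buffer in bytes */
};

/*
 * Return the id to place in the meta page array. For a persistent buffer
 * the id is the index of the page within the memory range and page->id is
 * left untouched; otherwise the passed id is stored in the page.
 */
static int demo_page_id(const struct demo_cpu_buffer *cb,
			struct demo_page *page, int id)
{
	if (cb->persistent)
		return (int)((size_t)(page->data - cb->range_start) /
			     cb->subbuf_size);

	page->id = id;
	return id;
}
```

Since the index depends only on where the page sits in the range, it stays stable across reader page swaps, which is what a user space mapping needs.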
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250401203332.246646011@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When reading a memory mapped buffer the reader page is just swapped out
with the last page written in the write buffer. If the reader page is the
same as the commit buffer (the buffer that is currently being written to)
it was assumed that it should never have missed events. If it does, it
triggers a WARN_ON_ONCE().
But there just happens to be one scenario where this can legitimately
happen. That is on a commit_overrun. A commit overrun is when an interrupt
preempts an event being written to the buffer and then the interrupt adds
so many new events that it fills and wraps the buffer back to the commit.
Any new events would then be dropped and be reported as "missed_events".
In this case, the next page to read is the commit buffer and after the
swap of the reader page, the reader page will be the commit buffer, but
this time there will be missed events and this triggers the following
warning:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1127 at kernel/trace/ring_buffer.c:7357 ring_buffer_map_get_reader+0x49a/0x780
Modules linked in: kvm_intel kvm irqbypass
CPU: 2 UID: 0 PID: 1127 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00004-g478bc2824b45-dirty #564 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:ring_buffer_map_get_reader+0x49a/0x780
Code: 00 00 00 48 89 fe 48 c1 ee 03 80 3c 2e 00 0f 85 ec 01 00 00 4d 3b a6 a8 00 00 00 0f 85 8a fd ff ff 48 85 c0 0f 84 55 fe ff ff <0f> 0b e9 4e fe ff ff be 08 00 00 00 4c 89 54 24 58 48 89 54 24 50
RSP: 0018:ffff888121787dc0 EFLAGS: 00010002
RAX: 00000000000006a2 RBX: ffff888100062800 RCX: ffffffff8190cb49
RDX: ffff888126934c00 RSI: 1ffff11020200a15 RDI: ffff8881010050a8
RBP: dffffc0000000000 R08: 0000000000000000 R09: ffffed1024d26982
R10: ffff888126934c17 R11: ffff8881010050a8 R12: ffff888126934c00
R13: ffff8881010050b8 R14: ffff888101005000 R15: ffff888126930008
FS: 00007f95c8cd7540(0000) GS:ffff8882b576e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f95c8de4dc0 CR3: 0000000128452002 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __pfx_ring_buffer_map_get_reader+0x10/0x10
tracing_buffers_ioctl+0x283/0x370
__x64_sys_ioctl+0x134/0x190
do_syscall_64+0x79/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f95c8de48db
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
RSP: 002b:00007ffe037ba110 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffe037bb2b0 RCX: 00007f95c8de48db
RDX: 0000000000000000 RSI: 0000000000005220 RDI: 0000000000000006
RBP: 00007ffe037ba180 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe037bb6f8 R14: 00007f95c9065000 R15: 00005575c7492c90
</TASK>
irq event stamp: 5080
hardirqs last enabled at (5079): [<ffffffff83e0adb0>] _raw_spin_unlock_irqrestore+0x50/0x70
hardirqs last disabled at (5080): [<ffffffff83e0aa83>] _raw_spin_lock_irqsave+0x63/0x70
softirqs last enabled at (4182): [<ffffffff81516122>] handle_softirqs+0x552/0x710
softirqs last disabled at (4159): [<ffffffff815163f7>] __irq_exit_rcu+0x107/0x210
---[ end trace 0000000000000000 ]---
The above was triggered by running on a kernel with both lockdep and KASAN
as well as kmemleak enabled and executing the following command:
# perf record -o perf-test.dat -a -- trace-cmd record --nosplice -e all -p function hackbench 50
With perf interjecting a lot of interrupts and trace-cmd enabling all
events as well as function tracing, with lockdep, KASAN and kmemleak
enabled, an interrupt that preempts an event being written could add
enough events to wrap the buffer. trace-cmd was modified to have
--nosplice use mmap instead of reading the buffer.
To differentiate this case from the normal case, where only one page was
written to and the swap of the reader page received that one page (which
is the commit page), check if the tail page is on the reader page. The
difference between the commit page and the tail page is
that the tail page is where new writes go to, and the commit page holds
the first write that hasn't been committed yet. In the case of an
interrupt preempting the write of an event and filling the buffer, it
would move the tail page but not the commit page.
Have the warning only trigger if the tail page is also on the reader page,
and also print out the number of events dropped by a commit overrun as
that can not yet be safely added to the page so that the reader can see
there were events dropped.
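The decision described above can be summarised as a small predicate. This is a hedged model only: struct demo_rb, the integer page indices and the verdict names are hypothetical, and the real code works on page pointers under the reader lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Integer indices stand in for page pointers in this model. */
struct demo_rb {
	int reader_page;
	int commit_page;
	int tail_page;
	unsigned long missed_events;
};

enum demo_verdict {
	DEMO_OK,		/* nothing unusual */
	DEMO_COMMIT_OVERRUN,	/* legitimate: report the dropped events */
	DEMO_WARN,		/* unexpected: missed events with no overrun */
};

static enum demo_verdict demo_check_reader(const struct demo_rb *rb)
{
	/* Only the "reader page is the commit page" case is of interest. */
	if (rb->reader_page != rb->commit_page || !rb->missed_events)
		return DEMO_OK;

	/* Tail still on the reader page: no writer ran ahead, so warn. */
	if (rb->reader_page == rb->tail_page)
		return DEMO_WARN;

	/* Tail moved on while commit did not: a commit overrun. */
	return DEMO_COMMIT_OVERRUN;
}
```

In the overrun case the model returns a distinct verdict so that a caller could print the number of dropped events instead of warning.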
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250528121555.2066527e@gandalf.local.home
Fixes: fe832be05a ("ring-buffer: Have mmapped ring buffer keep track of missed events")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Fix and improve BTF deduplication of identical BTF types (Alan
Maguire and Andrii Nakryiko)
- Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
Alexis Lothoré)
- Support load-acquire and store-release instructions in BPF JIT on
riscv64 (Andrea Parri)
- Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
Protopopov)
- Streamline allowed helpers across program types (Feng Yang)
- Support atomic update for hashtab of BPF maps (Hou Tao)
- Implement json output for BPF helpers (Ihor Solodrai)
- Several s390 JIT fixes (Ilya Leoshkevich)
- Various sockmap fixes (Jiayuan Chen)
- Support mmap of vmlinux BTF data (Lorenz Bauer)
- Support BPF rbtree traversal and list peeking (Martin KaFai Lau)
- Tests for sockmap/sockhash redirection (Michal Luczaj)
- Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)
- Add support for dma-buf iterators in BPF (T.J. Mercier)
- The verifier support for __bpf_trap() (Yonghong Song)
* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
bpf, arm64: Remove unused-but-set function and variable.
selftests/bpf: Add tests with stack ptr register in conditional jmp
bpf: Do not include stack ptr register in precision backtracking bookkeeping
selftests/bpf: enable many-args tests for arm64
bpf, arm64: Support up to 12 function arguments
bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
bpf: Avoid __bpf_prog_ret0_warn when jit fails
bpftool: Add support for custom BTF path in prog load/loadall
selftests/bpf: Add unit tests with __bpf_trap() kfunc
bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
bpf: Remove special_kfunc_set from verifier
selftests/bpf: Add test for open coded dmabuf_iter
selftests/bpf: Add test for dmabuf_iter
bpf: Add open coded dmabuf iterator
bpf: Add dmabuf iterator
dma-buf: Rename debugfs symbols
bpf: Fix error return value in bpf_copy_from_user_dynptr
libbpf: Use mmap to parse vmlinux BTF from sysfs
selftests: bpf: Add a test for mmapable vmlinux BTF
btf: Allow mmap of vmlinux btf
...
Core
----
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x
faster.
- Convert queue related netlink ops to instance lock, reducing
again the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREEMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
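The idea behind a helper such as nlmsg_payload, one call that both checks the message length and returns the payload pointer, can be sketched in user-space C as below. This is only an illustration under stated assumptions: `struct msg_hdr_sketch` and `payload_sketch()` are hypothetical names, and the real kernel helper's signature and alignment handling may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a netlink-style message header. */
struct msg_hdr_sketch {
	uint32_t len;    /* total message length, header included */
	uint16_t type;
	uint16_t flags;
	uint32_t seq;
	uint32_t pid;
};

/*
 * Return a pointer to the payload if the message is long enough to
 * carry at least `want` payload bytes, or NULL otherwise.
 */
static void *payload_sketch(struct msg_hdr_sketch *hdr, size_t want)
{
	/* Reject messages shorter than the header itself. */
	if (hdr->len < sizeof(*hdr))
		return NULL;

	/* Compare against the remaining bytes to avoid overflow in an addition. */
	if (hdr->len - sizeof(*hdr) < want)
		return NULL;

	return (unsigned char *)hdr + sizeof(*hdr);
}
```

A caller then needs a single NULL check instead of a separate length test followed by a data access.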
Netfilter
---------
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools
still use this interface.
- Implement support for wildcard netdevice in netdev basechain
and flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF
---
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols
---------
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the single
flow maximum throughput on a 200Gb/s link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the source address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API
----------
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling
-----------------
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes, addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers
----------------------
- OpenVPN virtual driver: offload OpenVPN data channels processing
to the kernel-space, increasing the data transfer throughput WRT
the user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers
-------
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the steering table handling to reduce significantly
the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow steering error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission preemption
- igb: add persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter, RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalability
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (ksz88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Add RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmg3D64SHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkcIsQAK2eEc+BxQer975wzvtMg6gF9eoex4a+
rZ7jxfDzDtNvTauoQsrpehDZp0FnySaVGCU36lHGB2OvDnhCpPc5hXzKDWQpOuqQ
SHrGG3/6FTbdTG/HfHUcbNyrUzIf53SADSObiQ3qg4gyEQ3sCpcOKtVtMcU8rvsY
/HqMnsJWFaROUMjMtCcnUSgjmeY9kBvha3sTXUqgeRugEOCvZD7z4rpqFIcQqHw7
e2Fi8dwIXEYNxqPp6MRq2qdyUTewCRruE8ZIMAFuhtfYeMElUZMPlqlMENX3AzTQ
cr0EgwcFOUxRA7oZRxhoBNBsVXavtSpQr4ZDoWplxP4aQ37n5tc1E9Q72axpB/Og
FbJRl6GvWYnCd8071BczgmfHlKaTAigPvt2Z4r6JjM5I/Bij/IZ3k+On1OTuOAj/
EqfFkdZ0a5cfKrwUMP+oSGtSAywkMVUtnIKJlZeRbjSj2432sCfe2jVAlS8ELM43
3LUgXYrAKtA87g171LlsRu5EEpI5QmqPb+i5LpPlEXe2TJEgPisyfecJ3NafF/2+
j575lm+TFNm9NTNhGGjDPEvw0djI5wSGGMe9J4gC74eWi6s5t6C4cuUf84TKWdwR
x+9H0IB7rfFncAwXHJuUUtzd+fPHaYzs5dDGbSgMQOXr1cr1wlubCK8mQ1r/Wt/a
3GjFIOQKW2Q5
=t/Tz
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x
faster.
- Convert queue related netlink ops to instance lock, reducing again
the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREEMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
Netfilter:
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools still
use this interface.
- Implement support for wildcard netdevice in netdev basechain and
flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF:
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols:
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the
single flow maximum throughput on a 200Gb/s link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the source address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API:
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling:
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes, addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers:
- OpenVPN virtual driver: offload OpenVPN data channels processing to
the kernel-space, increasing the data transfer throughput WRT the
user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the steering table handling to significantly
reduce the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow steering error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission preemption
- igb: add persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter, RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalability
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (ksz88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Add RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature"
* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
selftests/bpf: Fix bpf selftest build warning
selftests: netfilter: Fix skip of wildcard interface test
net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
net: openvswitch: Fix the dead loop of MPLS parse
calipso: Don't call calipso functions for AF_INET sk.
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
octeontx2-pf: QOS: Perform cache sync on send queue teardown
net: mana: Add support for Multi Vports on Bare metal
net: devmem: ncdevmem: remove unused variable
net: devmem: ksft: upgrade rx test to send 1K data
net: devmem: ksft: add 5 tuple FS support
net: devmem: ksft: add exit_wait to make rx test pass
net: devmem: ksft: add ipv4 support
net: devmem: preserve sockc_err
page_pool: fix ugly page_pool formatting
net: devmem: move list_add to net_devmem_bind_dmabuf.
selftests: netfilter: nft_queue.sh: include file transfer duration in log message
net: phy: mscc: Fix memory leak when using one step timestamping
...
On arm32, size_t is defined to be unsigned int, while PAGE_SIZE is
unsigned long. This hence triggers a compilation warning as min()
asserts the type of two operands to be equal. Casting PAGE_SIZE to size_t
solves this issue and works on other target architectures as well.
Compilation warning details:
kernel/trace/trace.c: In function 'tracing_splice_read_pipe':
./include/linux/minmax.h:20:28: warning: comparison of distinct pointer types lacks a cast
(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
^
./include/linux/minmax.h:26:4: note: in expansion of macro '__typecheck'
(__typecheck(x, y) && __no_side_effects(x, y))
^~~~~~~~~~~
...
kernel/trace/trace.c:6771:8: note: in expansion of macro 'min'
min((size_t)trace_seq_used(&iter->seq),
^~~
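The fix described above amounts to bringing both operands to the same type before taking the minimum. A minimal user-space sketch, assuming a 4096-byte page; `PAGE_SIZE_SKETCH` and `copy_len_sketch()` are hypothetical names, and a plain comparison stands in for the kernel's type-checking min() macro:

```c
#include <stddef.h>

/* Hypothetical stand-in for PAGE_SIZE: an unsigned long constant. */
#define PAGE_SIZE_SKETCH 4096UL

/*
 * Clamp a byte count to one page. Casting the page size to size_t makes
 * both operands the same type, which is what a strict min() requires on
 * targets where size_t is unsigned int rather than unsigned long.
 */
static size_t copy_len_sketch(size_t used)
{
	size_t page = (size_t)PAGE_SIZE_SKETCH;

	return used < page ? used : page;
}
```

The cast does not change the value on either 32-bit or 64-bit targets, since the page size fits in both types.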
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250526013731.1198030-1-pantaixi@huaweicloud.com
Fixes: f5178c41bb ("tracing: Fix oob write in trace_seq_to_buffer()")
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Pan Taixi <pantaixi@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmg17ccUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPkCxAA4jAZfBVGFbaB9sPKlAOnkeIwI2iX
HNXAwmG1GHjxmCqQSY84bPYmllkNkENHbxFmtLOdWtlWL+1JTPKjylasPvzDIUsW
rqKyGUwJS5cZyIG0iyUGQEoTNlT4MbwvBBWZrLStCL+VRCyVOzyDo/kvs+AcxOYP
vKuZZG9ke2roAnrgKrbZFy0kvzTFzcUJPMOwZIE/WneNogHY23+LzcTy06DVtVLt
DRwzOSgPX99hJjtDTN5G8o7wY1FjlNlad4z3tuR+kEQsIImSgB1mZTnhCx5xNwfr
LhujN2Po8EPKBA7U0AEtwmO2yl2OL69QlEuveEMl7SxFxSnMrnTlIpGr7EOmxcrS
PFUBkOEAYfWTPtjxtbs1mYgJRcDlsLV2M0xJg58aESImMxcZPZ9oheHcAZX/A/eR
V5bNR6I9CFkbSkrH2JG810AMB3NmNEw6ztH/vhqW1x8xYP8M/AxQmqYw00xkjQby
3Qaek1+fIy1chdAJW19BKux38YRxYY48UosA73/G94Dm3N4C99zHaqOMmguaZJJ7
bAxPNe+cBcMdf/XAw5TngihOXEs2n2qpRkN3K+RzGsqBcRBA/pTBM128mANxN7Ra
MHj16OUc9m91TjRxPWgoL+g8DtQf9pM9T9DNL72cki2DeN1JBN8XuaKu+n04keKt
heaVbCBYncs/pJw=
=T6/o
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20250527' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
- Always record AUDIT_ANOM events when auditing is enabled.
Prior to this patch we only recorded AUDIT_ANOM events if auditing
was enabled and the admin/distro had explicitly configured audit
beyond the defaults. Considering that AUDIT_ANOM events are anomalous
events considered to be "security relevant", it seems wise to record
these events as long as auditing is enabled, even if the system is
running with a default audit configuration.
- Mark the audit_log_vformat() function with the __printf() attribute
to quiet GCC.
* tag 'audit-pr-20250527' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: record AUDIT_ANOM_* events regardless of presence of rules
audit: mark audit_log_vformat() with __printf() attribute
-----BEGIN PGP SIGNATURE-----
iIoEABYKADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCaDYRjRQcem9oYXJAbGlu
dXguaWJtLmNvbQAKCRDLwZzRsCrn5fpRAQDsdIIwCgyQLFQhZq3wW5dhTUBQW8o9
GjaNHpROKV57cwD9GqT78xi9qsxgaYW0lUUh5+zvlGI5cAtnl8/Fkby7hgY=
=ohlH
-----END PGP SIGNATURE-----
Merge tag 'integrity-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"Carrying the IMA measurement list across kexec is not a new feature,
but is updated to address a couple of issues:
- Carrying the IMA measurement list across kexec required knowing
a priori all the file measurements between the "kexec load" and
"kexec execute" in order to measure them before the "kexec load".
Any delay between the "kexec load" and "kexec exec" exacerbated the
problem.
- Any file measurements post "kexec load" were not carried across
kexec, resulting in the measurement list being out of sync with the
TPM PCR.
With these changes, the buffer for the IMA measurement list is still
allocated at "kexec load", but copying the IMA measurement list is
deferred to after quiescing the TPM.
Two new kexec critical data records are defined"
* tag 'integrity-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: do not copy measurement list to kdump kernel
ima: measure kexec load and exec events as critical data
ima: make the kexec extra memory configurable
ima: verify if the segment size has changed
ima: kexec: move IMA log copy from kexec load to execute
ima: kexec: define functions to copy IMA log at soft boot
ima: kexec: skip IMA segment validation after kexec soft reboot
kexec: define functions to map and unmap segments
ima: define and call ima_alloc_kexec_file_buf()
ima: rename variable the seq_file "file" to "ima_kexec_file"
- More in-kernel idle CPU selection improvements. Expand topology awareness
coverage and add scx_bpf_select_cpu_and() to allow more flexibility. The idle
CPU selection kfuncs can now be called from unlocked contexts too.
- A bunch of reorganization changes to lay the foundation for multiple
hierarchical scheduler support. This isn't ready yet and the included
changes don't make meaningful behavior differences. One notable change is
replacing some static_key tests with dynamic tests as the test results may
differ depending on the scheduler instance. This isn't expected to cause
meaningful performance difference.
- Other minor and doc updates.
- There were multiple patches in for-6.15-fixes which conflicted with
changes in for-6.16. for-6.15-fixes were pulled three times into for-6.16
to resolve the conflicts.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaDYZMw4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGfbcAQDRloVb/d5RfC6VYlue9EV1jHuoJefTYHvR3jmO
ju70EQEAjLBXw58XAePQ9La/570JELgsC5FzJp3tLTilGx2JyQA=
=7cDG
-----END PGP SIGNATURE-----
Merge tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:
- More in-kernel idle CPU selection improvements. Expand topology
awareness coverage and add scx_bpf_select_cpu_and() to allow more
flexibility. The idle CPU selection kfuncs can now be called from
unlocked contexts too.
- A bunch of reorganization changes to lay the foundation for multiple
hierarchical scheduler support. This isn't ready yet and the included
changes don't make meaningful behavior differences. One notable
change is replacing some static_key tests with dynamic tests as the
test results may differ depending on the scheduler instance. This
isn't expected to cause meaningful performance difference.
- Other minor and doc updates.
- There were multiple patches in for-6.15-fixes which conflicted with
changes in for-6.16. for-6.15-fixes were pulled three times into
for-6.16 to resolve the conflicts.
* tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (49 commits)
sched_ext: Call ops.update_idle() after updating builtin idle bits
sched_ext, docs: convert mentions of "CFS" to "fair-class scheduler"
selftests/sched_ext: Update test enq_select_cpu_fails
sched_ext: idle: Consolidate default idle CPU selection kfuncs
selftests/sched_ext: Add test for scx_bpf_select_cpu_and() via test_run
sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked context
sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and()
sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c
sched_ext, docs: add label
sched_ext: Explain the temporary situation around scx_root dereferences
sched_ext: Add @sch to SCX_CALL_OP*()
sched_ext: Cleanup [__]scx_exit/error*()
sched_ext: Add @sch to SCX_CALL_OP*()
sched_ext: Clean up scx_root usages
Documentation: scheduler: Changed lowercase acronyms to uppercase
sched_ext: Avoid NULL scx_root deref in __scx_exit()
sched_ext: Add RCU protection to scx_root in DSQ iterator
sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn()
sched_ext: Move disable machinery into scx_sched
sched_ext: Move event_stats_cpu into scx_sched
...
- cgroup rstat shared the tracking tree across all controllers with the
rationale being that a cgroup which is using one resource is likely to be
using other resources at the same time (i.e. if something is allocating
memory, it's probably consuming CPU cycles). However, this turned out to
not scale very well especially with memcg using rstat for internal
operations which made memcg stat read and flush patterns substantially
different from other controllers. JP Kobryn split the rstat tree per
controller.
- cgroup BPF support was hooking into cgroup init/exit paths directly.
Convert them to use a notifier chain instead so that other usages can be
added easily. Two of the patches which implement this are mislabeled
as belonging to sched_ext instead of cgroup. Sorry.
- Relatively minor cpuset updates.
- Documentation updates.
-----BEGIN PGP SIGNATURE-----
iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaDYUmA4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGRhbAP90v8QwUkWEKGQSam8JY3by7PvrW6pV5ot+BGuM
4xu3BAEAjsJ9FdiwYLwKYqG7y59xhhBFOo6GpcP52kPp3znl+QQ=
=6MIT
-----END PGP SIGNATURE-----
Merge tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cgroup rstat shared the tracking tree across all controllers with the
rationale being that a cgroup which is using one resource is likely
to be using other resources at the same time (i.e. if something is
allocating memory, it's probably consuming CPU cycles).
However, this turned out to not scale very well especially with memcg
using rstat for internal operations which made memcg stat read and
flush patterns substantially different from other controllers. JP
Kobryn split the rstat tree per controller.
- cgroup BPF support was hooking into cgroup init/exit paths directly.
Convert them to use a notifier chain instead so that other usages can
be added easily. Two of the patches which implement this are
mislabeled as belonging to sched_ext instead of cgroup. Sorry.
- Relatively minor cpuset updates
- Documentation updates
* tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier
sched_ext: Introduce cgroup_lifetime_notifier
cgroup: Minor reorganization of cgroup_create()
cgroup, docs: cpu controller's interaction with various scheduling policies
cgroup, docs: convert space indentation to tab indentation
cgroup: avoid per-cpu allocation of size zero rstat cpu locks
cgroup, docs: be specific about bandwidth control of rt processes
cgroup: document the rstat per-cpu initialization
cgroup: helper for checking rstat participation of css
cgroup: use subsystem-specific rstat locks to avoid contention
cgroup: use separate rstat trees for each subsystem
cgroup: compare css to cgroup::self in helper for distingushing css
cgroup: warn on rstat usage by early init subsystems
cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask()
cgroup/rstat: Improve cgroup_rstat_push_children() documentation
cgroup: fix goto ordering in cgroup_init()
cgroup: fix pointer check in css_rstat_init()
cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs
cgroup/cpuset: Fix obsolete comment in cpuset_css_offline()
cgroup/cpuset: Always use cpu_active_mask
...
* Move kern_table members out of kernel/sysctl.c
Moved a subset (tracing, panic, signal, stack_tracer and sparc) out of the
kern_table array. The goal is for kern_table to only have sysctl elements. All
this increases modularity by placing the ctl_tables closer to where they are
used while reducing the chances of merge conflicts in kernel/sysctl.c.
* Fixed sysctl unit test panic by relocating it to selftests
* Testing
These have been in linux-next from rc2, so they have had more than a month
worth of testing.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmgwLsAACgkQupfNUreW
QU9ghwv/VKZW+IXEvSjc8OiwntWkL7e5ddHY6O2Vf44MzhBefLTXmfx2HfkEA0Xw
RaOQ28Hf/zQL83RqHHnXqI7JdGWQJUm8bCPwk4H3DCaF8qOfPVvblVYmfNL2auSY
oyRRpRzZuY5EtKcrNjiHFHL2WIC8KvPVwS748oHY1eZY7kn1fcs8DDnNO4iuWop+
uJeDxu87wkRCFXF3DIM+MAHRvxSa8GHtZvb9EjAl/EHMbAyVSz3uTb7FdQDdnE09
s7P30EC03RHtgi3sd2Ku04dJsHLz7VErvpToxSH2KFlcdpJuWuCSCTT8XaD8kII8
kYYCxNpmPOf4LzEy/J2vVZB0PSHrHvuQCH7iGy+8wOPk9GHTOMkKMMXVmeGnAsef
AiosPYroxXp/nBFcuNs6/1LKpsdpFr2F6u6oMgbzLaW1Xe/oc+6oynuOgeVj9LuM
FrSxSwaVvpdwHYHujYPQAAWIgKRzITiEXnCgtSyohFquKb+7E8ZspwjOqYH2xWMQ
WwABNRqY
=45X2
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Move kern_table members out of kernel/sysctl.c
Moved a subset (tracing, panic, signal, stack_tracer and sparc) out
of the kern_table array. The goal is for kern_table to only have
sysctl elements. All this increases modularity by placing the
ctl_tables closer to where they are used while reducing the chances
of merge conflicts in kernel/sysctl.c.
- Fixed sysctl unit test panic by relocating it to selftests
* tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: Close test ctl_headers with a for loop
sysctl: call sysctl tests with a for loop
sysctl: Add 0012 to test the u8 range check
sysctl: move u8 register test to lib/test_sysctl.c
sparc: mv sparc sysctls into their own file under arch/sparc/kernel
stack_tracer: move sysctl registration to kernel/trace/trace_stack.c
tracing: Move trace sysctls into trace.c
signal: Move signal ctl tables into signal.c
panic: Move panic ctl tables into panic.c
Merge tag 'for-linus-6.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- A fix for running as a Xen dom0 on the iMX8QXP Arm platform
- An update of the xen.config adding XEN_UNPOPULATED_ALLOC for better
support of PVH dom0
- A fix of the Xen balloon driver when running without
CONFIG_XEN_UNPOPULATED_ALLOC
- A fix of the dm_op Xen hypercall on Arm needed to pass user space
buffers to the hypervisor in certain configurations
* tag 'for-linus-6.16-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/arm: call uaccess_ttbr0_enable for dm_op hypercall
xen/x86: fix initial memory balloon target
xen: enable XEN_UNPOPULATED_ALLOC as part of xen.config
xen: swiotlb: Wire up map_resource callback
- new two step DMA mapping API, which is a first step on a long path
to provide alternatives to scatterlist and to remove hacks, abuses and
design mistakes related to scatterlists; this new approach optimizes
some calls to DMA-IOMMU layer and cache maintenance by batching them,
reduces memory usage as there is no need to store mapped DMA addresses to
unmap them, and reduces some function call overhead; it is a combined
effort of many people, led and developed by Christoph Hellwig and Leon
Romanovsky
Merge tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping updates from Marek Szyprowski:
"New two step DMA mapping API, which is a first step on a long path
to provide alternatives to scatterlist and to remove hacks, abuses and
design mistakes related to scatterlists.
This new approach optimizes some calls to DMA-IOMMU layer and cache
maintenance by batching them, reduces memory usage as there is no need to
store mapped DMA addresses to unmap them, and reduces some function
call overhead. It is a combined effort of many people, led and
developed by Christoph Hellwig and Leon Romanovsky"
* tag 'dma-mapping-6.16-2025-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
docs: core-api: document the IOVA-based API
dma-mapping: add a dma_need_unmap helper
dma-mapping: Implement link/unlink ranges API
iommu/dma: Factor out a iommu_dma_map_swiotlb helper
dma-mapping: Provide an interface to allow allocate IOVA
iommu: add kernel-doc for iommu_unmap_fast
iommu: generalize the batched sync after map interface
dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
PCI/P2PDMA: Refactor the p2pdma mapping helpers
Remove redundant code and adjust indentation of xxx_delay_max/min.
Link: https://lkml.kernel.org/r/20250521093157668iQrhhcMjA-th5LQf4-A3c@zte.com.cn
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Configfs can be configured as a loadable module, which causes a link-time
failure for dm-crypt crash dump support:
crash_dump_dm_crypt.c:(.text+0x3a4): undefined reference to `config_item_init_type_name'
aarch64-linux-ld: kernel/crash_dump_dm_crypt.o: in function `configfs_dmcrypt_keys_init':
crash_dump_dm_crypt.c:(.init.text+0x90): undefined reference to `config_group_init'
aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xb4): undefined reference to `configfs_register_subsystem'
aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xd8): undefined reference to `configfs_unregister_subsystem'
This could be avoided with a dependency on CONFIGFS_FS=y, but the
dependency has an additional problem of causing Kconfig dependency loops
since most other uses select the symbol.
Using a simple 'select CONFIGFS_FS' here in turn fails with
CONFIG_DM_CRYPT=m, because that still only causes configfs to be a
loadable module.
The only version I found that fixes this reliably uses an additional
Kconfig symbol to ensure the 'select' actually turns on configfs as
builtin, with two additional changes to avoid dependency loops with nvme
and sysfs.
There is no compile-time dependency between configfs and sysfs, so
selecting configfs from a driver with sysfs disabled does not cause link
failures, only the default /sys/kernel/config mount point will not be
created.
Link: https://lkml.kernel.org/r/20250521160359.2132363-1-arnd@kernel.org
Fixes: 6b23858fd63b ("crash_dump: make dm crypt keys persist for the kdump kernel")
Fixes: 1fb4704084 ("nvme-loop: add configfs dependency")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Fix potential division-by-zero error in em_compute_costs() (Yaxiong
Tian).
- Fix typos in energy model documentation and example driver code (Moon
Hee Lee, Atul Kumar Pant).
- Rearrange the energy model management code and add a new function for
adjusting a CPU energy model after adjusting the capacity of the
given CPU to it (Rafael Wysocki).
- Refactor cpufreq_online(), add and use cpufreq policy locking guards,
use __free() in policy reference counting, and clean up core cpufreq
code on top of that (Rafael Wysocki).
- Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
Kumar).
- Fix des_perf clamping with max_perf in amd_pstate_update() (Dhananjay
Ugwekar).
- Add offline, online and suspend callbacks to the amd-pstate driver,
rename and use the existing amd_pstate_epp callbacks in it (Dhananjay
Ugwekar).
- Add support for the "Requested CPU Min frequency" BIOS option to the
amd-pstate driver (Dhananjay Ugwekar).
- Reset amd-pstate driver mode after running selftests (Swapnil
Sapkal).
- Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
Chancellor).
- Add helper for governor checks to the schedutil cpufreq governor and
move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki).
- Populate the cpu_capacity sysfs entries from the intel_pstate driver
after registering asym capacity support (Ricardo Neri).
- Add support for enabling Energy-aware scheduling (EAS) to the
intel_pstate driver when operating in the passive mode on a hybrid
platform (Rafael Wysocki).
- Drop redundant cpus_read_lock() from store_local_boost() in the
cpufreq core (Seyediman Seyedarab).
- Replace sscanf() with kstrtouint() in the cpufreq code and use a
symbol instead of a raw number in it (Bowen Yu).
- Add support for autonomous CPU performance state selection to the
CPPC cpufreq driver (Lifeng Zheng).
- OPP: Add dev_pm_opp_set_level() (Praveen Talari).
- Introduce scope-based cleanup headers and mutex locking guards in OPP
core (Viresh Kumar).
- Switch OPP to use kmemdup_array() (Zhang Enpei).
- Optimize bucket assignment when next_timer_ns equals KTIME_MAX in the
menu cpuidle governor (Zhongqiu Han).
- Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla).
- Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem
Bityutskiy).
- Fix typos in two comments in the teo cpuidle governor (Atul Kumar
Pant).
- Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
Kalla).
- Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki).
- Add new devm_ functions for enabling runtime PM and runtime PM
reference counting (Bence Csókás).
- Remove size arguments from strscpy() calls in the hibernation core
code (Thorsten Blum).
- Adjust the handling of devices with asynchronous suspend enabled
during system suspend and resume to start resuming them immediately
after resuming their parents and to start suspending such a device
immediately after suspending its first child (Rafael Wysocki).
- Adjust messages printed during tasks freezing to avoid using
pr_cont() (Andrew Sayers, Paul Menzel).
- Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
Zhang).
- Add missing wakeup source attribute relax_count to sysfs and
remove the space character at the end of the string produced by
pm_show_wakelocks() (Zijun Hu).
- Add configurable pm_test delay for hibernation (Zihuan Zhang).
- Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
cypd4226 device on Tegra boards from suspending prematurely (Jon
Hunter).
- Unbreak printing PM debug messages during hibernation and clean up
some related code (Rafael Wysocki).
- Add a systemd service to run cpupower and change cpupower binding's
Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli).
Merge tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"Once again, the changes are dominated by cpufreq updates, but this
time the majority of them are cpufreq core changes, mostly related to
the introduction of policy locking guards and __free() usage, and
fixes related to boost handling.
Still, there is also a significant update of the intel_pstate driver
making it register an energy model when running on a hybrid platform
which is used for enabling energy-aware scheduling (EAS) if the driver
operates in the passive mode (and schedutil is used as the cpufreq
governor for all CPUs which is the passive mode default).
There are some amd-pstate driver updates too, for good measure,
including the "Requested CPU Min frequency" BIOS option support and
new online/offline callbacks.
In the cpuidle space, the most significant change is the addition of a
C1 demotion on/off sysfs knob to intel_idle which should help some
users to configure their systems more precisely. There is also the
conversion of the PSCI cpuidle driver to a faux device one and there
are two small updates of cpuidle governors.
Device power management is also modified quite a bit, especially the
handling of devices with asynchronous suspend and resume enabled
during system transitions. They are now going to be handled more
asynchronously during suspend transitions and somewhat less
aggressively during resume transitions.
Apart from the above, the operating performance points (OPP) library
is now going to use mutex locking guards and scope-based cleanup
helpers and there is the usual bunch of assorted fixes and code
cleanups.
Specifics:
- Fix potential division-by-zero error in em_compute_costs() (Yaxiong
Tian)
- Fix typos in energy model documentation and example driver code
(Moon Hee Lee, Atul Kumar Pant)
- Rearrange the energy model management code and add a new function
for adjusting a CPU energy model after adjusting the capacity of
the given CPU to it (Rafael Wysocki)
- Refactor cpufreq_online(), add and use cpufreq policy locking
guards, use __free() in policy reference counting, and clean up
core cpufreq code on top of that (Rafael Wysocki)
- Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
Kumar)
- Fix des_perf clamping with max_perf in amd_pstate_update()
(Dhananjay Ugwekar)
- Add offline, online and suspend callbacks to the amd-pstate driver,
rename and use the existing amd_pstate_epp callbacks in it
(Dhananjay Ugwekar)
- Add support for the "Requested CPU Min frequency" BIOS option to
the amd-pstate driver (Dhananjay Ugwekar)
- Reset amd-pstate driver mode after running selftests (Swapnil
Sapkal)
- Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
Chancellor)
- Add helper for governor checks to the schedutil cpufreq governor
and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki)
- Populate the cpu_capacity sysfs entries from the intel_pstate
driver after registering asym capacity support (Ricardo Neri)
- Add support for enabling Energy-aware scheduling (EAS) to the
intel_pstate driver when operating in the passive mode on a hybrid
platform (Rafael Wysocki)
- Drop redundant cpus_read_lock() from store_local_boost() in the
cpufreq core (Seyediman Seyedarab)
- Replace sscanf() with kstrtouint() in the cpufreq code and use a
symbol instead of a raw number in it (Bowen Yu)
- Add support for autonomous CPU performance state selection to the
CPPC cpufreq driver (Lifeng Zheng)
- OPP: Add dev_pm_opp_set_level() (Praveen Talari)
- Introduce scope-based cleanup headers and mutex locking guards in
OPP core (Viresh Kumar)
- Switch OPP to use kmemdup_array() (Zhang Enpei)
- Optimize bucket assignment when next_timer_ns equals KTIME_MAX in
the menu cpuidle governor (Zhongqiu Han)
- Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla)
- Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem
Bityutskiy)
- Fix typos in two comments in the teo cpuidle governor (Atul Kumar
Pant)
- Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
Kalla)
- Move debug runtime PM attributes to runtime_attrs[] (Rafael
Wysocki)
- Add new devm_ functions for enabling runtime PM and runtime PM
reference counting (Bence Csókás)
- Remove size arguments from strscpy() calls in the hibernation core
code (Thorsten Blum)
- Adjust the handling of devices with asynchronous suspend enabled
during system suspend and resume to start resuming them immediately
after resuming their parents and to start suspending such a device
immediately after suspending its first child (Rafael Wysocki)
- Adjust messages printed during tasks freezing to avoid using
pr_cont() (Andrew Sayers, Paul Menzel)
- Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
Zhang)
- Add missing wakeup source attribute relax_count to sysfs and remove
the space character at the end of the string produced by
pm_show_wakelocks() (Zijun Hu)
- Add configurable pm_test delay for hibernation (Zihuan Zhang)
- Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
cypd4226 device on Tegra boards from suspending prematurely (Jon
Hunter)
- Unbreak printing PM debug messages during hibernation and clean up
some related code (Rafael Wysocki)
- Add a systemd service to run cpupower and change cpupower binding's
Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli)"
* tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits)
cpufreq: CPPC: Add support for autonomous selection
cpufreq: Update sscanf() to kstrtouint()
cpufreq: Replace magic number
OPP: switch to use kmemdup_array()
PM: freezer: Rewrite restarting tasks log to remove stray *done.*
PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
cpufreq: drop redundant cpus_read_lock() from store_local_boost()
cpupower: do not install files to /etc/default/
cpupower: do not call systemctl at install time
cpupower: do not write DESTDIR to cpupower.service
PM: sleep: Introduce pm_sleep_transition_in_progress()
cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
cpufreq: intel_pstate: Document hybrid processor support
cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
cpufreq: intel_pstate: EAS support for hybrid platforms
PM: EM: Introduce em_adjust_cpu_capacity()
PM: EM: Move CPU capacity check to em_adjust_new_capacity()
PM: EM: Documentation: Fix typos in example driver code
cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
PM: sleep: Introduce pm_suspend_in_progress()
...
Add two tests:
- one test has 'rX <op> r10' where rX is not r10, and
- another test has 'rX <op> rY' where rX and rY are not r10
but there is an early insn 'rX = r10'.
Without previous verifier change, both tests will fail.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250524041340.4046304-1-yonghong.song@linux.dev
syzkaller reported an issue:
WARNING: CPU: 3 PID: 217 at kernel/bpf/core.c:2357 __bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357
Modules linked in:
CPU: 3 UID: 0 PID: 217 Comm: kworker/u32:6 Not tainted 6.15.0-rc4-syzkaller-00040-g8bac8898fe39
RIP: 0010:__bpf_prog_ret0_warn+0xa/0x20 kernel/bpf/core.c:2357
Call Trace:
<TASK>
bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline]
__bpf_prog_run include/linux/filter.h:718 [inline]
bpf_prog_run include/linux/filter.h:725 [inline]
cls_bpf_classify+0x74a/0x1110 net/sched/cls_bpf.c:105
...
When creating a bpf program, 'fp->jit_requested' depends on bpf_jit_enable.
This issue is triggered because CONFIG_BPF_JIT_ALWAYS_ON is not set
and bpf_jit_enable is set to 1, causing the arch to attempt to JIT the prog,
but the JIT failed due to FAULT_INJECTION. As a result, the kernel incorrectly
treats the program as valid, and when the program runs it calls
`__bpf_prog_ret0_warn` and triggers the WARN_ON_ONCE(1).
Reported-by: syzbot+0903f6d7f285e41cdf10@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/6816e34e.a70a0220.254cdc.002c.GAE@google.com
Fixes: fa9dd599b4 ("bpf: get rid of pure_initcall dependency to enable jits")
Signed-off-by: KaFai Wan <mannkafai@gmail.com>
Link: https://lore.kernel.org/r/20250526133358.2594176-1-mannkafai@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Marc Suñé (Isovalent, part of Cisco) reported an issue where an
uninitialized variable caused the generated bpf prog binary code to not
work as expected. The reproducer is in [1], where the flags
"-Wall -Werror" are enabled, but there is no warning as the compiler
takes advantage of the uninitialized variable to do aggressive optimization.
The optimized code looks like below:
; {
0: bf 16 00 00 00 00 00 00 r6 = r1
; bpf_printk("Start");
1: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
0000000000000008: R_BPF_64_64 .rodata
3: b4 02 00 00 06 00 00 00 w2 = 0x6
4: 85 00 00 00 06 00 00 00 call 0x6
; DEFINE_FUNC_CTX_POINTER(data)
5: 61 61 4c 00 00 00 00 00 w1 = *(u32 *)(r6 + 0x4c)
; bpf_printk("pre ipv6_hdrlen_offset");
6: 18 01 00 00 06 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x6 ll
0000000000000030: R_BPF_64_64 .rodata
8: b4 02 00 00 17 00 00 00 w2 = 0x17
9: 85 00 00 00 06 00 00 00 call 0x6
<END>
The verifier will report the following failure:
9: (85) call bpf_trace_printk#6
last insn is not an exit or jmp
The above verifier log does not give a clear hint about how to fix
the problem, and the user may take quite some time to figure out that
the issue is due to the compiler taking advantage of an uninitialized variable.
In llvm internals, uninitialized variable usage may generate
'unreachable' IR insns, and these 'unreachable' IR insns may indicate
the impact of an uninitialized variable on code optimization. So far, the llvm
BPF backend ignores 'unreachable' IR, hence the above code is generated.
With clang21 patch [2], those 'unreachable' IR insns are converted
to func __bpf_trap(). In order to maintain a proper control flow
graph for bpf progs, [2] also adds an 'exit' insn after __bpf_trap()
if __bpf_trap() is the last insn in the function. The new code looks like:
; {
0: bf 16 00 00 00 00 00 00 r6 = r1
; bpf_printk("Start");
1: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
0000000000000008: R_BPF_64_64 .rodata
3: b4 02 00 00 06 00 00 00 w2 = 0x6
4: 85 00 00 00 06 00 00 00 call 0x6
; DEFINE_FUNC_CTX_POINTER(data)
5: 61 61 4c 00 00 00 00 00 w1 = *(u32 *)(r6 + 0x4c)
; bpf_printk("pre ipv6_hdrlen_offset");
6: 18 01 00 00 06 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x6 ll
0000000000000030: R_BPF_64_64 .rodata
8: b4 02 00 00 17 00 00 00 w2 = 0x17
9: 85 00 00 00 06 00 00 00 call 0x6
10: 85 10 00 00 ff ff ff ff call -0x1
0000000000000050: R_BPF_64_32 __bpf_trap
11: 95 00 00 00 00 00 00 00 exit
<END>
In the kernel, a new kfunc __bpf_trap() is added. During insn
verification, any hit with __bpf_trap() will result in
verification failure. The kernel is able to provide a better
log message for debugging.
With llvm patch [2] and without this patch (no __bpf_trap()
kfunc for existing kernel), e.g., for old kernels, the verifier
outputs
10: <invalid kfunc call>
kfunc '__bpf_trap' is referenced but wasn't resolved
Basically, the kernel does not support the __bpf_trap() kfunc.
This still does not give a clear signal about the possible reason.
With llvm patch [2] and with this patch, the verifier outputs
10: (85) call __bpf_trap#74479
unexpected __bpf_trap() due to uninitialized variable?
It gives a much better hint for the verification failure.
[1] https://github.com/msune/clang_bpf/blob/main/Makefile#L3
[2] https://github.com/llvm/llvm-project/pull/131731
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250523205326.1291640-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Currently, the verifier has both special_kfunc_set and special_kfunc_list.
When adding a new kfunc usage to the verifier, it is often unclear
whether that kfunc should be added to special_kfunc_set, to
special_kfunc_list, or to both. For example, some kfuncs, e.g., bpf_dynptr_from_skb,
bpf_dynptr_clone, bpf_wq_set_callback_impl, do not need to be
in special_kfunc_set.
To avoid potential future confusion, special_kfunc_set is deleted
and btf_id_set_contains(&special_kfunc_set, ...) is removed.
The code is refactored with a new func check_special_kfunc(),
which contains all code covered by the original branch
meta.btf == btf_vmlinux && btf_id_set_contains(&special_kfunc_set, meta.func_id)
There is no functionality change.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250523205321.1291431-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This open coded iterator allows for more flexibility when creating BPF
programs. It can support output in formats other than text. With an open
coded iterator, a single BPF program can traverse multiple kernel data
structures (now including dmabufs), allowing for more efficient analysis
of kernel data compared to multiple reads from procfs, sysfs, or
multiple traditional BPF iterator invocations.
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250522230429.941193-4-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The dmabuf iterator traverses the list of all DMA buffers.
DMA buffers are refcounted through their associated struct file. A
reference is taken on each buffer as the list is iterated to ensure each
buffer persists for the duration of the bpf program execution without
holding the list mutex.
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250522230429.941193-3-tjmercier@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
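The reference-per-step pattern described above can be illustrated with a
small user-space model. This is only a sketch under stated assumptions: the
toy_* names are hypothetical, the "mutex" is a plain flag so the example runs
single-threaded, and dropping the previous buffer's reference when advancing
is an assumption about the iterator, not something the message above states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted buffer on a global list. */
struct toy_buf {
	int refcount;            /* 0 models a buffer that is being torn down */
	struct toy_buf *next;
};

struct toy_list {
	struct toy_buf *head;
	int mutex_held;          /* stand-in for the list mutex */
};

/* Take a reference unless the count has already dropped to zero. */
static int toy_get_unless_zero(struct toy_buf *b)
{
	if (b->refcount == 0)
		return 0;
	b->refcount++;
	return 1;
}

static void toy_put(struct toy_buf *b)
{
	assert(b->refcount > 0);
	b->refcount--;
}

/*
 * Return the next live buffer after prev (or the first one when prev is
 * NULL) with a reference held, then drop the reference held on prev.
 * The list lock is held only while walking, never while the caller runs.
 */
static struct toy_buf *toy_iter_next(struct toy_list *l, struct toy_buf *prev)
{
	struct toy_buf *b;

	l->mutex_held = 1;
	for (b = prev ? prev->next : l->head; b; b = b->next)
		if (toy_get_unless_zero(b))
			break;
	l->mutex_held = 0;

	if (prev)
		toy_put(prev);
	return b;
}
```

The caller sees each buffer with an extra reference and with the list
unlocked, which is the property the commit message relies on.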
KVM's SEV intra-host migration code needs to lock all vCPUs
of the source and the target VM, before it proceeds with the migration.
The number of vCPUs that belong to each VM is not bounded by anything
except a self-imposed KVM limit of CONFIG_KVM_MAX_NR_VCPUS vCPUs which is
significantly larger than the depth of lockdep's lock stack.
Luckily, the locks in both of the cases mentioned above, are held under
the 'kvm->lock' of each VM, which means that we can use the little
known lockdep feature called a "nest_lock" to support this use case in
a cleaner way, compared to the way it's currently done.
Implement and expose 'mutex_lock_killable_nest_lock' for this
purpose.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Despite the fact that several lockdep-related checks are skipped when
calling trylock* versions of the locking primitives, for example
mutex_trylock, each time the mutex is acquired, a held_lock is still
placed onto the lockdep stack by __lock_acquire() which is called
regardless of whether the trylock* or regular locking API was used.
This means that if the caller successfully acquires more than
MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock,
lockdep will still complain that the maximum depth of the held lock stack
has been reached and disable itself.
For example, the following error currently occurs in the ARM version
of KVM, once the code tries to lock all vCPUs of a VM configured with more
than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern
systems, where having more than 48 CPUs is common, and it's also common to
run VMs that have vCPU counts approaching that number:
[ 328.171264] BUG: MAX_LOCK_DEPTH too low!
[ 328.175227] turning off the locking correctness validator.
[ 328.180726] Please attach the output of /proc/lock_stat to the bug report
[ 328.187531] depth: 48 max: 48!
[ 328.190678] 48 locks held by qemu-kvm/11664:
[ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0
[ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
[ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0
Luckily, in all instances that require locking all vCPUs, the
'kvm->lock' is taken a priori, and that fact makes it possible to use
the little known feature of lockdep, called a 'nest_lock', to avoid this
warning and subsequent lockdep self-disablement.
Providing a 'nest_lock' to lockdep's lock_acquire() causes lockdep to
detect that the top of the held lock stack contains a lock of the same
class and then increment its reference counter instead of pushing a new
held_lock item onto that stack.
See __lock_acquire for more information.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
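The effect described above can be shown with a toy model of the held lock
stack. This is a sketch, not lockdep code: the struct layout, the model_*
function names and the class ids are hypothetical, and the reference count
starts at 1 for simplicity. Only the MAX_LOCK_DEPTH value of 48 is taken from
the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOCK_DEPTH 48	/* matches "depth: 48 max: 48!" in the log */

/* Toy stand-in for one held_lock entry (hypothetical layout). */
struct held_lock_model {
	int class_id;		/* lock class, e.g. the vCPU mutex class */
	int references;		/* same-class locks folded into this entry */
};

struct lock_stack_model {
	struct held_lock_model entries[MAX_LOCK_DEPTH];
	int depth;
};

/*
 * Record one acquisition. With nested == true, a lock whose class matches
 * the entry on top of the stack is folded into that entry instead of being
 * pushed. Returns false when the stack would overflow.
 */
static bool model_acquire(struct lock_stack_model *s, int class_id, bool nested)
{
	assert(s != NULL);
	if (nested && s->depth > 0 &&
	    s->entries[s->depth - 1].class_id == class_id) {
		s->entries[s->depth - 1].references++;
		return true;
	}
	if (s->depth >= MAX_LOCK_DEPTH)
		return false;	/* models "BUG: MAX_LOCK_DEPTH too low!" */
	s->entries[s->depth].class_id = class_id;
	s->entries[s->depth].references = 1;
	s->depth++;
	return true;
}

/*
 * Take one outer lock (class 0) and then nr_vcpus locks of class 1.
 * Returns the final stack depth, or -1 if the stack overflowed.
 */
static int model_lock_all_vcpus(int nr_vcpus, bool nested)
{
	struct lock_stack_model s = { .depth = 0 };

	if (!model_acquire(&s, 0, false))
		return -1;
	for (int i = 0; i < nr_vcpus; i++)
		if (!model_acquire(&s, 1, nested))
			return -1;
	return s.depth;
}
```

In this model, locking 64 vCPUs without nesting overflows the stack, while
the nested variant ends with a depth of 2: one entry for the outer lock and
one entry standing for all vCPU locks.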
- Rework the initialization of the posix-timer kmem_cache and move the
cache pointer into the timer_data structure to prevent false sharing.
- Switch the alarmtimer code to lock guards.
- Improve the CPU selection criteria in the per CPU validation of the
clocksource watchdog to avoid arbitrary selections (or omissions) on
systems with a small number of CPUs.
- The usual cleanups and improvements
Merge tag 'timers-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
"Updates for the time/timer core code:
- Rework the initialization of the posix-timer kmem_cache and move
the cache pointer into the timer_data structure to prevent false
sharing
- Switch the alarmtimer code to lock guards
- Improve the CPU selection criteria in the per CPU validation of the
clocksource watchdog to avoid arbitrary selections (or omissions)
on systems with a small number of CPUs
- The usual cleanups and improvements"
* tag 'timers-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/nohz: Remove unused tick_nohz_full_add_cpus_to()
clocksource: Fix the CPUs' choice in the watchdog per CPU verification
alarmtimer: Switch spin_{lock,unlock}_irqsave() to guards
alarmtimer: Remove dead return value in clock2alarm()
time/jiffies: Change register_refined_jiffies() to void __init
timers: Remove unused __round_jiffies(_up)
posix-timers: Initialize cache early and move pointer into __timer_data
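For readers unfamiliar with lock guards, the idea behind the alarmtimer
conversion above can be sketched in user space with the compiler's cleanup
attribute (GCC and Clang). All toy_* names and the TOY_GUARD macro are
hypothetical and only illustrate scope-based unlocking; they are not the
kernel's guard() implementation.

```c
#include <assert.h>

/* Toy lock: a counter is enough to show when release happens. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->held++;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held > 0);
	l->held--;
}

/* Cleanup handler: called with a pointer to the guard variable. */
static void toy_guard_exit(struct toy_lock **lp)
{
	toy_lock_release(*lp);
}

/* Acquire now, release automatically when the enclosing scope ends. */
#define TOY_GUARD(lock)							\
	struct toy_lock *toy_guard_var					\
		__attribute__((cleanup(toy_guard_exit), unused)) =	\
		(toy_lock_acquire(lock), (lock))

/* Every return path drops the lock without an explicit unlock call. */
static int toy_update(struct toy_lock *l, int *value, int delta)
{
	TOY_GUARD(l);

	if (delta == 0)
		return -1;	/* early return: cleanup still runs */
	*value += delta;
	return 0;
}
```

The benefit is that error paths no longer need matching unlock calls, which
is the kind of simplification such conversions aim for.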
- Convert init_timer*(), try_to_del_timer_sync() and
destroy_timer_on_stack() over to the canonical timer_*() namespace
convention.
There is another large conversion pending, which has not been included
because it would have caused a gazillion of merge conflicts in next. The
conversion scripts will be run towards the end of the merge window and a
pull request sent once all conflict dependencies have been merged.
Merge tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"Another set of timer API cleanups:
- Convert init_timer*(), try_to_del_timer_sync() and
destroy_timer_on_stack() over to the canonical timer_*()
namespace convention.
There is another large conversion pending, which has not been included
because it would have caused a gazillion of merge conflicts in next.
The conversion scripts will be run towards the end of the merge window
and a pull request sent once all conflict dependencies have been
merged"
* tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()
treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try()
timers: Rename init_timers() as timers_init()
timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA
timers: Rename __init_timer_on_stack() as __timer_init_on_stack()
timers: Rename __init_timer() as __timer_init()
timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack()
timers: Rename init_timer_key() as timer_init_key()
- Switch the MSI descriptor locking to lock guards
- Replace a broken and naive implementation of PCI/MSI-X control word
updates in the PCI/TPH driver with a properly serialized variant in the
PCI/MSI core code.
- Remove the MSI descriptor abuse in the SCSI/UFS/QCOM driver by
replacing the direct access to the MSI descriptors with the proper API
function calls. People will never understand that APIs exist for a
reason...
- Provide core infrastructure for the upcoming PCI endpoint library
extensions. Currently limited to ARM GICv3+, but in theory extensible
to other architectures.
- Provide an MSI domain::teardown() callback, which allows drivers to undo
the effects of the prepare() callback.
- Move the MSI domain::prepare() callback invocation to domain creation
time to avoid redundant (and in case of ARM/GIC-V3-ITS confusing)
invocations on every allocation.
In combination with the new teardown callback this removes some ugly
hacks in the GIC-V3-ITS driver, which pretended to work around the
shortcomings of the core code so far. With this update the code is
correct by design and implementation.
- Make the irqchip MSI library globally available, provide an MSI parent
domain creation helper and convert a bunch of (PCI/)MSI drivers over to
the modern MSI parent mechanism. This is the first step to get rid of
at least one incarnation of the three PCI/MSI management schemes.
- The usual small cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzgFsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoR0KD/402K12tlI/D70H2aTG25dbTx+dkVk+
pKpJz0985uUlLJiPCR54dZL0ofcfRU+CdjEIf1I+6TPshtg6IWLJCfqu7OWVPYzz
2lJDO0yeUGwJqc0CIa1vttvJWvcUcxfWBX/ZSkOIM5avaXqSwRwsFNfd7TQ+T+eG
79VS1yyW197mUva53ekSF2voa8EEPWfEslAjoX1dRg5d4viAxaLtKm/KpBqo1oPh
Eb+E67xEWiIonvWNdr1AOisxnbi19PyDo1xnftgBToaeXXYBodNrNIAfAkx40YUZ
IZQLHvhZ91x15hXYIS4Cz1RXqPECbu/tHxs4AFUgGvqdgJUF89wzI3C21ymrKA6E
tDlWfpIcuE3vV/bsqj1gHGL5G5m1tyBRgIdIAOOmMoTHvwp5rrQtuZzpuqzGmEzj
iVIHnn5m08kRpOZQc7+PlxQMh3eunEyj9WWG49EJgoAnJPb5lou4shTwBUheHcKm
NXxKsfo4x5C+WehGTxv80UlnMcK3Yh/TuWf2OPR6QuT2iHP2VL5jyHjIs0ICn0cp
1tvSJtdc1rgvk/4Vn4lu5eyVaTx5ZAH8ZXNQfwwBTWTp3ZyAW+7GkaCq3LPaNJoZ
4LWpgZ5gs6wT+1XNT3boKdns81VolmeTI8P1ciQKpUtaTt6Cy9P/i2az/J+BCS4U
Fn5Qqk08PHGrUQ==
=OBMj
-----END PGP SIGNATURE-----
Merge tag 'irq-msi-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
"Updates for the MSI subsystem (core code and PCI):
- Switch the MSI descriptor locking to lock guards
- Replace a broken and naive implementation of PCI/MSI-X control word
updates in the PCI/TPH driver with a properly serialized variant in
the PCI/MSI core code.
- Remove the MSI descriptor abuse in the SCSI/UFS/QCOM driver by
replacing the direct access to the MSI descriptors with the proper
API function calls. People will never understand that APIs exist
for a reason...
- Provide core infrastructure for the upcoming PCI endpoint library
extensions. Currently limited to ARM GICv3+, but in theory
extensible to other architectures.
- Provide an MSI domain::teardown() callback, which allows drivers to
undo the effects of the prepare() callback.
- Move the MSI domain::prepare() callback invocation to domain
creation time to avoid redundant (and in case of ARM/GIC-V3-ITS
confusing) invocations on every allocation.
In combination with the new teardown callback this removes some
ugly hacks in the GIC-V3-ITS driver, which pretended to work around
the shortcomings of the core code so far. With this update the
code is correct by design and implementation.
- Make the irqchip MSI library globally available, provide an MSI
parent domain creation helper and convert a bunch of (PCI/)MSI
drivers over to the modern MSI parent mechanism. This is the first
step to get rid of at least one incarnation of the three PCI/MSI
management schemes.
- The usual small cleanups and improvements"
* tag 'irq-msi-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
PCI/MSI: Use bool for MSI enable state tracking
PCI: tegra: Convert to MSI parent infrastructure
PCI: xgene: Convert to MSI parent infrastructure
PCI: apple: Convert to MSI parent infrastructure
irqchip/msi-lib: Honour the MSI_FLAG_NO_AFFINITY flag
irqchip/mvebu: Convert to msi_create_parent_irq_domain() helper
irqchip/gic: Convert to msi_create_parent_irq_domain() helper
genirq/msi: Add helper for creating MSI-parent irq domains
irqchip: Make irq-msi-lib.h globally available
irqchip/gic-v3-its: Use allocation size from the prepare call
genirq/msi: Engage the .msi_teardown() callback on domain removal
genirq/msi: Move prepare() call to per-device allocation
irqchip/gic-v3-its: Implement .msi_teardown() callback
genirq/msi: Add .msi_teardown() callback as the reverse of .msi_prepare()
irqchip/gic-v3-its: Add support for device tree msi-map and msi-mask
dt-bindings: PCI: pci-ep: Add support for iommu-map and msi-map
irqchip/gic-v3-its: Set IRQ_DOMAIN_FLAG_MSI_IMMUTABLE for ITS
irqdomain: Add IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and irq_domain_is_msi_immutable()
platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
genirq/msi: Rename msi_[un]lock_descs()
...
- Consolidate on one set of functions for the interrupt domain code to
get rid of pointlessly duplicated code with only marginally different
semantics.
- Update the documentation accordingly and consolidate the coding style
of the irqdomain header.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzd+MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodTRD/0RmG5tngCbEJmTw6lPDQzRZH4OO3ja
yRYlyBipemoRmvJRGjV4uHqN2QPrdOuoqMuyBO1aWcMdkpww5bAHcbgSFrlGM1lW
kqtaxVMbufPiLQSGYe7OQf478CE1ykoBd5Va8whFKrtA73qEUdEMfWT0stspg780
7BlmQOemL91p7Ytf03FbDdo8tZ5Xu9uXGAulwY9FZsFtsCNyvhl7nOv5Sk8ZQtGO
xHRCeunjZLWR+IaK59hdakvQybXwSnjT6jODp96nlyKABEKSPShGSPFDWd3g9px7
4911QwgnvTbcrsk6YmQEmPIOgXZzypjbnjpJr8tFpTbkVIy+6chi5cBJzXoqsUaM
ylTwFcUQNvcP8yF447qb+nyPFKM5xsC07W0UpZMuJUDmhhPRtDm5pK0jpsif96GP
l4aMsWe65PUmXHQqLdE89RJXAa8XQ2qspKVtNKq9DmEVgTviQ09Z9SSQIx4U0yIx
w+YPde8kH2+O+YtMUn/MmfHhUP4MKya7j5zd8Bnv8wLBi7XGPPA5EKKh9I0dz9m+
X94lweNXyH+Q8U9mt2cQf8VG8Yzgk0eeC0sliJIlybwRgEgRcQbVWw0VvZUA1ySa
VBlaj3SinO90FEQ0CctT51ss2mUJ/XsGCnxpiGZXfqIZzFbyD1YfZQnXJH0H67DI
CqdHw22I27Mu/A==
=9nLp
-----END PGP SIGNATURE-----
Merge tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq cleanups from Thomas Gleixner:
"A set of cleanups for the generic interrupt subsystem:
- Consolidate on one set of functions for the interrupt domain code
to get rid of pointlessly duplicated code with only marginally
different semantics.
- Update the documentation accordingly and consolidate the coding
style of the irqdomain header"
* tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
irqdomain: Consolidate coding style
irqdomain: Fix kernel-doc and add it to Documentation
Documentation: irqdomain: Update it
Documentation: irq-domain.rst: Simple improvements
Documentation: irq/concepts: Minor improvements
Documentation: irq/concepts: Add commas and reflow
irqdomain: Improve kernel-docs of functions
irqdomain: Make struct irq_domain_info variables const
irqdomain: Use irq_domain_instantiate()'s return value as initializers
irqdomain: Drop irq_linear_revmap()
pinctrl: keembay: Switch to irq_find_mapping()
irqchip/armada-370-xp: Switch to irq_find_mapping()
gpu: ipu-v3: Switch to irq_find_mapping()
gpio: idt3243x: Switch to irq_find_mapping()
sh: Switch to irq_find_mapping()
powerpc: Switch to irq_find_mapping()
irqdomain: Drop irq_domain_add_*() functions
powerpc: Switch irq_domain_add_nomap() to use fwnode
thermal: Switch to irq_domain_create_linear()
soc: Switch to irq_domain_create_*()
...
- Convert the generic interrupt chip to lock guards to remove copy &
pasta boilerplate code and gotos.
- A new driver for the interrupt controller in the EcoNet EN751221 MIPS SoC.
- Extend the SG2042-MSI driver to support the new SG2044 SoC
- Updates and cleanups for the (ancient) VT8500 driver
- Improve the scalability of the ARM GICv4.1 ITS driver by utilizing
node-local copies of a VM's interrupt translation table when possible. This
results in a 12% reduction of VM IPI latency in certain workloads.
- The usual cleanups and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzfSwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoc4lD/0U24B8okpp2PxVVZOtNzWgl7kcAQSJ
2U834ep1DhqJPNW0JjT+5Lb55NfAEN/uCuirjLZDsKYNNel4LXhAY951BCJMytYX
ebH/J7wGjEphRogxn9QTGGC/mguThwFnOiqOLq4aU0Sq/oRH6Uj+P6hMod7ym9bn
P+bZv9WWhLQQ3x/RimcauReCEDW6pW2soQV+zhN+xTxTW+R1zRcksz1x4+b/B7Vk
ZH6KFBpZJyC34T0aXOJFhrEo01z2iZWifgmX1zz2ZgZjeUklFxtW9vGqBRS0mU2P
9bW/qXDsSdOStyfuXbG7Q3s2z9s5Voj9okgBiA5DUD3DuplVHG/3x8do8ZHrvMoV
k59ORecx29g0nBaVMjT13gH1XfaqI3W52qff6yksqqByh+5urhGXeYzvQ07M9ldm
eUA8NxNad+6Gir6AcMN+COA+W8oOP17gvoSuFlUhdM/MZvPP0Gb8GkNk3o2Kfil/
JjvcHJHCAZv6x1L7jhFhAmTUvR9ibmMJDmXJM2tIHvS1HrHNfKAIyxy00GAVg7TN
f5Iv0+vqB7C6PHzMYIIQpZ3hrJL2GR6jdToPdAWIfr5BzugglDIRUlhEIsxhSXQn
WMmoif5bKS8wxQRyP2F3FPv+eKYT2XVlVri3LHBkqKbkJW/sqJWHHFGIdaDrwVhX
vZlmkT07PD3jbQ==
=OS2H
-----END PGP SIGNATURE-----
Merge tag 'irq-drivers-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq controller updates from Thomas Gleixner:
"Update for interrupt chip drivers:
- Convert the generic interrupt chip to lock guards to remove copy &
pasta boilerplate code and gotos.
- A new driver for the interrupt controller in the EcoNet EN751221
MIPS SoC.
- Extend the SG2042-MSI driver to support the new SG2044 SoC
- Updates and cleanups for the (ancient) VT8500 driver
- Improve the scalability of the ARM GICv4.1 ITS driver by utilizing
node-local copies of a VM's interrupt translation table when possible.
This results in a 12% reduction of VM IPI latency in certain
workloads.
- The usual cleanups and improvements all over the place"
* tag 'irq-drivers-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
irqchip/irq-pruss-intc: Simplify chained interrupt handler setup
irqchip/gic-v4.1: Use local 4_1 ITS to generate VSGI
irqchip/econet-en751221: Switch to of_fwnode_handle()
irqchip/irq-vt8500: Switch to irq_domain_create_*()
irqchip/econet-en751221: Switch to irq_domain_create_linear()
irqchip/irq-vt8500: Use fewer global variables and add error handling
irqchip/irq-vt8500: Use a dedicated chained handler function
irqchip/irq-vt8500: Don't require 8 interrupts from a chained controller
irqchip/irq-vt8500: Drop redundant copy of the device node pointer
irqchip/irq-vt8500: Split up ack/mask functions
irqchip/sg2042-msi: Fix wrong type cast in sg2044_msi_irq_ack()
irqchip/sg2042-msi: Add the Sophgo SG2044 MSI interrupt controller
irqchip/sg2042-msi: Introduce configurable chipinfo for SG2042
irqchip/sg2042-msi: Rename functions and data structures to be SG2042 agnostic
dt-bindings: interrupt-controller: Add Sophgo SG2044 MSI controller
genirq/generic-chip: Fix incorrect lock guard conversions
genirq/generic-chip: Remove unused lock wrappers
irqchip: Convert generic irqchip locking to guards
gpio: mvebu: Convert generic irqchip locking to guard()
ARM: orion/gpio:: Convert generic irqchip locking to guard()
...
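The lock-guard conversions in this pull replace explicit lock/unlock pairs
with scope-bound locking. The kernel code is C; what follows is only a Python
analogy built on a context manager, and chip_lock, mask_cache and both
functions are hypothetical names that do not appear in the series. It is a
sketch of the idea, not of the actual irqchip code.

```python
import threading

# Hypothetical stand-ins for a generic-chip lock and its cached mask register.
chip_lock = threading.Lock()
mask_cache = {"value": 0}


def mask_irq_manual(bit):
    """Explicit locking: every exit path has to remember to unlock."""
    chip_lock.acquire()
    if bit < 0:
        chip_lock.release()  # easy to forget on an early return
        return False
    mask_cache["value"] |= 1 << bit
    chip_lock.release()
    return True


def mask_irq_guarded(bit):
    """Guard style: the lock is dropped automatically when the scope ends."""
    with chip_lock:
        if bit < 0:
            return False  # no explicit unlock needed
        mask_cache["value"] |= 1 << bit
        return True
```

Both functions behave the same; the guarded variant simply has no unlock
calls left to get wrong, which is the kind of boilerplate the conversion
removes.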
- Address a long standing subtle problem in the CPU hotplug code for
affinity-managed interrupts.
Affinity-managed interrupts are shut down by the core code when the
last CPU in the affinity set goes offline and started up again when the
first CPU in the affinity set becomes online again. This unfortunately
does not take into account whether an interrupt has been disabled
before the last CPU goes offline and starts up the interrupt
unconditionally when the first CPU becomes online again. That's
obviously not what drivers expect.
Address this by preserving the disabled state for affinity-managed
interrupts across these CPU hotplug operations. All non-managed
interrupts are not affected by this because startup/shutdown is coupled
to request/free_irq() which obviously has to reset state.
- Support three-cell scheme interrupts to allow GPIO drivers to specify
interrupts from an already existing scheme
- Switch the interrupt subsystem core to lock guards. This gets rid of
quite some copy & pasta boilerplate code all over the place.
- The usual small cleanups and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzesoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoYunD/44Z1kB7HGO97GC+Xyfit06ZDBBO794
CLoo6cpZ2cU7f7YuBnKVjXcW6S9WUcWwnLg0lhxoIBsKDzTTwRPX8UM5Ro4G0YvZ
ovCQGHKcps8c3ord6dZnyCV/9Ktzr30g5PCzFQkSLKM38DTKgOTH8pQnKYox0XlL
VGa5ExlWxrOF40GEFXWsJLIyYo2B3LzENZaWihT+6mtW6+ry1ZamW9g/1sron8ad
cd6UEgQvmNKKscaIqOW4hgDGr4F99oPzyyRGBd+uyqzpeOEH1wGbN5EFu9Mfuy+I
QZDusm3muIovAhRRhSR7XNQOv13D/RDkjws9sGWDWAVlnFnQTpov7f6cm7ofCkvF
H898oup43DA97+FJqWo4HlG/37T3gRaQfP0BED0u3vZQ0SWmgTO3eAS/J1OPxvO/
lWcoPIpPIvCWJYC7XBV1vQi47Kb+gTUNtVID5p8e/6hsE2H2TtOQ6T2rWbVZrkf4
0QzNCn7V0HaBLvv2ztJ/A3HlwMqmz7DGO+Q0nGG92SmsccuMt0MRP8zv2cbrGhp5
bzYUD1ZuJV9GQ4XgQ3RG+8dVur6Nu84enjIOewzyyBnOIg3bRLNuJ8RdmgIotiCF
Q+Si8iXrON0wAPqO36Z0n3Xg05okPZkTGyemgKl60n8lmK8dRgwkH1PxngpfK0Vw
NKg1wG8Kfc1hRg==
=xHtm
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
"Updates for the generic interrupt subsystem core code:
- Address a long standing subtle problem in the CPU hotplug code for
affinity-managed interrupts.
Affinity-managed interrupts are shut down by the core code when the
last CPU in the affinity set goes offline and started up again when
the first CPU in the affinity set becomes online again.
This unfortunately does not take into account whether an interrupt
has been disabled before the last CPU goes offline and starts up
the interrupt unconditionally when the first CPU becomes online
again.
That's obviously not what drivers expect.
Address this by preserving the disabled state for affinity-managed
interrupts across these CPU hotplug operations. All non-managed
interrupts are not affected by this because startup/shutdown is
coupled to request/free_irq() which obviously has to reset state.
- Support three-cell scheme interrupts to allow GPIO drivers to
specify interrupts from an already existing scheme
- Switch the interrupt subsystem core to lock guards. This gets rid
of quite some copy & pasta boilerplate code all over the place.
- The usual small cleanups and improvements all over the place"
* tag 'irq-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
genirq/irqdesc: Remove double locking in hwirq_show()
genirq: Retain disable depth for managed interrupts across CPU hotplug
genirq: Bump the size of the local variable for sprintf()
genirq/manage: Use the correct lock guard in irq_set_irq_wake()
genirq: Consistently use '%u' format specifier for unsigned int variables
genirq: Ensure flags in lock guard is consistently initialized
genirq: Fix inverted condition in handle_nested_irq()
genirq/cpuhotplug: Fix up lock guards conversion brainf..t
genirq: Use scoped_guard() to shut clang up
genirq: Remove unused remove_percpu_irq()
genirq: Remove irq_[get|put]_desc*()
genirq/manage: Rework irq_set_irqchip_state()
genirq/manage: Rework irq_get_irqchip_state()
genirq/manage: Rework teardown_percpu_nmi()
genirq/manage: Rework prepare_percpu_nmi()
genirq/manage: Rework disable_percpu_irq()
genirq/manage: Rework irq_percpu_is_enabled()
genirq/manage: Rework enable_percpu_irq()
genirq/manage: Rework irq_set_parent()
genirq/manage: Rework can_request_irq()
...
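The affinity-managed interrupt problem described in this pull can be
illustrated with a toy state model. This is a sketch under stated
assumptions, not the genirq implementation: the ManagedIrq class, its
fields and the preserve_disabled switch are all hypothetical and exist only
to contrast the old unconditional startup with the new behaviour.

```python
class ManagedIrq:
    """Toy model of an affinity-managed interrupt (hypothetical names)."""

    def __init__(self, affinity):
        self.affinity = set(affinity)
        self.online = set(affinity)   # CPUs of the affinity set that are up
        self.disable_depth = 0        # > 0 means a driver disabled the irq
        self.delivering = True        # can the interrupt currently fire?

    def disable(self):
        self.disable_depth += 1
        self.delivering = False

    def enable(self):
        if self.disable_depth > 0:
            self.disable_depth -= 1
        if self.disable_depth == 0 and self.online:
            self.delivering = True

    def cpu_offline(self, cpu):
        self.online.discard(cpu)
        if not self.online:
            # Last CPU of the affinity set is gone: shut the interrupt down.
            self.delivering = False

    def cpu_online(self, cpu, preserve_disabled=True):
        was_empty = not self.online
        if cpu in self.affinity:
            self.online.add(cpu)
        if was_empty and self.online:
            # First CPU of the set is back: start the interrupt up again,
            # unless it was disabled before and that state is preserved.
            if preserve_disabled and self.disable_depth > 0:
                return
            self.delivering = True
```

With preserve_disabled=False the model starts a driver-disabled interrupt
as soon as the first CPU returns, which is the surprise the pull describes;
with the default it stays quiet until enable() is called.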
- Move LoongArch and RISC-V ret_from_fork() implementations to C code so
that syscall_exit_to_user_mode() can be inlined.
- Split the RISC-V ret_from_fork() implementation into return to user and
return to kernel, which gives a measurable performance improvement.
- Inline syscall_exit_to_user_mode() which benefits all architectures by
avoiding a function call and letting the compiler do better
optimizations.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzdscTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoey4D/9VAZsXsPpYkeR+mBtfy5rJFQtDSbT5
wBYOJcrQOiekfyHXTn+YyY3EtIKyqzK98Bm48f1C3DgfLU1S3J5hK/YH3HmRHGc+
50WSy0q2t2OgdFObxAq56paSYIBW10KKVqyXPO/mQ0oLgECf1nai8NgV64aU1ET7
jPQHGNZuZLKm8jKl5OcFFXWSFyGO9SPBfae5FEGH/0e7LPv62DP0ph1bQ1PLmHCb
8QKWJV56zxYWDUP4Kjojy62RcG+hBeraNMqnxtzKmauBhUyX21MJdKI3OQwbfu2U
r3qQG2Y/BKOWs6jSb7yvOO+NKWAGIPD7iMMxtJs0vJzjRMDE9pkkfyPFvzQfcqGn
gLo6Dp5VxSLfGYoNFvrrQcojrcpvInRUidlZZBykogHb07RCfeXBMkvCxuAuPaDh
MoH+NeTFCi2oTkc2VHlpBC1+RCAcQ8cz1CqxXDDOXazSRqVrnLnflqLnP0Ldxzcn
jyGv+1/iP/Fz1w3HtEdIeHrHPY7SgqR4RkOkT11KVGYc2h1PpbHUws2PAxjst9gB
C3iNnR+izFzg/wjQZ7opHvJvXTJRgEAgyWly3GJorT927G8VA2SiAdzOAsRdCnBG
g7gEZEQ48MtOr7v5YaviAerAikkJWgLOU+X5pZsrha+DSme8mn5iwhsposJpFsJy
VHEmKrt5vpxrpg==
=sbxa
-----END PGP SIGNATURE-----
Merge tag 'core-entry-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry code updates from Thomas Gleixner:
"Updates for the generic and architecture entry code:
- Move LoongArch and RISC-V ret_from_fork() implementations to C code
so that syscall_exit_to_user_mode() can be inlined
- Split the RISC-V ret_from_fork() implementation into return to user
and return to kernel, which gives a measurable performance
improvement
- Inline syscall_exit_to_user_mode() which benefits all architectures by
avoiding a function call and letting the compiler do better
optimizations"
* tag 'core-entry-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
LoongArch: entry: Fix include order
entry: Inline syscall_exit_to_user_mode()
LoongArch: entry: Migrate ret_from_fork() to C
riscv: entry: Split ret_from_fork() into user and kernel
riscv: entry: Convert ret_from_fork() to C
Core & generic-arch updates:
- Add support for dynamic constraints and propagate it to
the Intel driver (Kan Liang)
- Fix & enhance driver-specific throttling support (Kan Liang)
- Record sample last_period before updating on the
x86 and PowerPC platforms (Mark Barnett)
- Make perf_pmu_unregister() usable (Peter Zijlstra)
- Unify perf_event_free_task() / perf_event_exit_task_context()
(Peter Zijlstra)
- Simplify perf_event_release_kernel() and perf_event_free_task()
(Peter Zijlstra)
- Allocate non-contiguous AUX pages by default (Yabin Cui)
Uprobes updates:
- Add support to emulate NOP instructions (Jiri Olsa)
- selftests/bpf: Add 5-byte NOP uprobe trigger benchmark (Jiri Olsa)
x86 Intel PMU enhancements:
- Support Intel Auto Counter Reload [ACR] (Kan Liang)
- Add PMU support for Clearwater Forest (Dapeng Mi)
- Arch-PEBS preparatory changes: (Dapeng Mi)
- Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
- Decouple BTS initialization from PEBS initialization
- Introduce pairs of PEBS static calls
x86 AMD PMU enhancements:
- Use hrtimer for handling overflows in the AMD uncore driver
(Sandipan Das)
- Prevent UMC counters from saturating (Sandipan Das)
Fixes and cleanups:
- Fix put_ctx() ordering (Frederic Weisbecker)
- Fix irq work dereferencing garbage (Frederic Weisbecker)
- Misc fixes and cleanups (Changbin Du, Frederic Weisbecker,
Ian Rogers, Ingo Molnar, Kan Liang, Peter Zijlstra, Qing Wang,
Sandipan Das, Thorsten Blum)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy4zoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j6QRAAvQ4GBPrdJLb8oXkLjCmWSp9PfM1h2IW0
reUrcV0BPRAwz4T60QEU2KyiEjvKxNghR6bNw4i3slAZ8EFwP9eWE/0ZYOo5+W/N
wv8vsopv/oZd2L2G5TgxDJf+tLPkqnTvp651LmGAbquPFONN1lsya9UHVPnt2qtv
fvFhjW6D828VoevRcUCsdoEUNlFDkUYQ2c3M1y5H2AI6ILDVxLsp5uYtuVUP+2lQ
7UI/elqRIIblTGT7G9LvTGiXZMm8T58fe1OOLekT6NdweJ3XEt1kMdFo/SCRYfzU
eDVVVLSextZfzBXNPtAEAlM3aSgd8+4m5sACiD1EeOUNjo5J9Sj1OOCa+bZGF/Rl
XNv5Kcp6Kh1T4N5lio8DE/NabmHDqDMbUGfud+VTS8uLLku4kuOWNMxJTD1nQ2Zz
BMfJhP89G9Vk07F9fOGuG1N6mKhIKNOgXh0S92tB7XDHcdJegueu2xh4ZszBL1QK
JVXa4DbnDj+y0LvnV+A5Z6VILr5RiCAipDb9ascByPja6BbN10Nf9Aj4nWwRTwbO
ut5OK/fDKmSjEHn1+a42d4iRxdIXIWhXCyxEhH+hJXEFx9htbQ3oAbXAEedeJTlT
g9QYGAjL96QEd0CqviorV8KyU59nVkEPoLVCumXBZ0WWhNwU6GdAmsW1hLfxQdLN
sp+XHhfxf8M=
=tPRs
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
"Core & generic-arch updates:
- Add support for dynamic constraints and propagate it to the Intel
driver (Kan Liang)
- Fix & enhance driver-specific throttling support (Kan Liang)
- Record sample last_period before updating on the x86 and PowerPC
platforms (Mark Barnett)
- Make perf_pmu_unregister() usable (Peter Zijlstra)
- Unify perf_event_free_task() / perf_event_exit_task_context()
(Peter Zijlstra)
- Simplify perf_event_release_kernel() and perf_event_free_task()
(Peter Zijlstra)
- Allocate non-contiguous AUX pages by default (Yabin Cui)
Uprobes updates:
- Add support to emulate NOP instructions (Jiri Olsa)
- selftests/bpf: Add 5-byte NOP uprobe trigger benchmark (Jiri Olsa)
x86 Intel PMU enhancements:
- Support Intel Auto Counter Reload [ACR] (Kan Liang)
- Add PMU support for Clearwater Forest (Dapeng Mi)
- Arch-PEBS preparatory changes: (Dapeng Mi)
- Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
- Decouple BTS initialization from PEBS initialization
- Introduce pairs of PEBS static calls
x86 AMD PMU enhancements:
- Use hrtimer for handling overflows in the AMD uncore driver
(Sandipan Das)
- Prevent UMC counters from saturating (Sandipan Das)
Fixes and cleanups:
- Fix put_ctx() ordering (Frederic Weisbecker)
- Fix irq work dereferencing garbage (Frederic Weisbecker)
- Misc fixes and cleanups (Changbin Du, Frederic Weisbecker, Ian
Rogers, Ingo Molnar, Kan Liang, Peter Zijlstra, Qing Wang, Sandipan
Das, Thorsten Blum)"
* tag 'perf-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
perf/headers: Clean up <linux/perf_event.h> a bit
perf/uapi: Clean up <uapi/linux/perf_event.h> a bit
perf/uapi: Fix PERF_RECORD_SAMPLE comments in <uapi/linux/perf_event.h>
mips/perf: Remove driver-specific throttle support
xtensa/perf: Remove driver-specific throttle support
sparc/perf: Remove driver-specific throttle support
loongarch/perf: Remove driver-specific throttle support
csky/perf: Remove driver-specific throttle support
arc/perf: Remove driver-specific throttle support
alpha/perf: Remove driver-specific throttle support
perf/apple_m1: Remove driver-specific throttle support
perf/arm: Remove driver-specific throttle support
s390/perf: Remove driver-specific throttle support
powerpc/perf: Remove driver-specific throttle support
perf/x86/zhaoxin: Remove driver-specific throttle support
perf/x86/amd: Remove driver-specific throttle support
perf/x86/intel: Remove driver-specific throttle support
perf: Only dump the throttle log for the leader
perf: Fix the throttle logic for a group
perf/core: Add the is_event_in_freq_mode() helper to simplify the code
...
Futexes:
- Add support for task local hash maps (Sebastian Andrzej Siewior,
Peter Zijlstra)
- Implement the FUTEX2_NUMA ABI, a feature that extends the futex
interface to be NUMA-aware. On NUMA-aware futexes a second u32
word containing the NUMA node is added after the u32 futex value
word. (Peter Zijlstra)
- Implement the FUTEX2_MPOL ABI, a feature that extends the futex
interface to be mempolicy-aware as well, to further refine futex
node mappings and lookups. (Peter Zijlstra)
Locking primitives:
- Misc cleanups (Andy Shevchenko, Borislav Petkov, Colin Ian King,
Ingo Molnar, Nam Cao, Peter Zijlstra)
Lockdep:
- Prevent abuse of lockdep subclasses (Waiman Long)
- Add number of dynamic keys to /proc/lockdep_stats (Waiman Long)
Plus misc cleanups and fixes.
Note that the tree includes the following dependent out-of-subsystem
changes as well:
- rcuref: Provide rcuref_is_dead()
- mm: Add vmalloc_huge_node()
- mm: Add the mmap_read_lock guard to <linux/mmap_lock.h>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy3E8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1isNw/9FS6+ZReiV3NLHvhwIw8+6U2vV733wLY+
mFzDk2CRwv2d6xg+QUrhLNI93i2fZnwNvK1f6LcRZMa1pNmwCcEghKgm0G+fRgbv
skiGrlkUCoEqsDUxRW++/aTBcMo0vqG3NOObnUOrddG2W9tfrR8jq/EwlzB99dO7
q8qaBNl9W1vLT3gh9/RPP5uKt0NKIf8ObvsyhWCGaywg81h2lC4AHf0Xlj3ZD95T
TO5jhUhl/muhYtaqxeYPK0gDtCrgFz8NwZdjKx1nyP7Gbko6+L50AvOVXog0SIAU
nncftvutGJg2ki7dbSYPDoHQrHO0JsF1vUfVZRjaKFebWpFo2yYdNMbITOeXVhSC
QSpbH2qvyn21nT/YSj9dottHWBoNYBEgrcSf6DO4g0d8A0Jh7egXjQdA852RpeQ0
LWGYx4rfiKhnjiXlKKQHrURZkcxxa40o+ls3RfFl2/kWA+7aUybvw6nAeDEkV0oL
s2U0vZxsY37EPWDm40rTe9r4YpPqcB65i9YIesPzhtbcHJVmN0gts0o5l+x53GhR
CeftFiiUi2nm6JaT+1wGvBDT3hQ8+NZ8GkPSeA6pEJWE3i4KquZlcBZLOSLZ3k/B
df58zQi99Yun33is5f1kqDNspqvJOg/1nxUK68PgNSdCMKeuZkJYrcmh/rKNnXSC
f7M1XHoWFb0=
=La/x
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"Futexes:
- Add support for task local hash maps (Sebastian Andrzej Siewior,
Peter Zijlstra)
- Implement the FUTEX2_NUMA ABI, a feature that extends the futex
interface to be NUMA-aware. On NUMA-aware futexes a second u32 word
containing the NUMA node is added after the u32 futex value word
(Peter Zijlstra)
- Implement the FUTEX2_MPOL ABI, a feature that extends the futex
interface to be mempolicy-aware as well, to further refine futex
node mappings and lookups (Peter Zijlstra)
Locking primitives:
- Misc cleanups (Andy Shevchenko, Borislav Petkov, Colin Ian King,
Ingo Molnar, Nam Cao, Peter Zijlstra)
Lockdep:
- Prevent abuse of lockdep subclasses (Waiman Long)
- Add number of dynamic keys to /proc/lockdep_stats (Waiman Long)
Plus misc cleanups and fixes"
* tag 'locking-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
selftests/futex: Fix spelling mistake "unitiliazed" -> "uninitialized"
futex: Correct the kernedoc return value for futex_wait_setup().
tools headers: Synchronize prctl.h ABI header
futex: Use RCU_INIT_POINTER() in futex_mm_init().
selftests/futex: Use TAP output in futex_numa_mpol
selftests/futex: Use TAP output in futex_priv_hash
futex: Fix kernel-doc comments
futex: Relax the rcu_assign_pointer() assignment of mm->futex_phash in futex_mm_init()
futex: Fix outdated comment in struct restart_block
locking/lockdep: Add number of dynamic keys to /proc/lockdep_stats
locking/lockdep: Prevent abuse of lockdep subclass
locking/lockdep: Move hlock_equal() to the respective #ifdeffery
futex,selftests: Add another FUTEX2_NUMA selftest
selftests/futex: Add futex_numa_mpol
selftests/futex: Add futex_priv_hash
selftests/futex: Build without headers nonsense
tools/perf: Allow to select the number of hash buckets
tools headers: Synchronize prctl.h ABI header
futex: Implement FUTEX2_MPOL
futex: Implement FUTEX2_NUMA
...
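The FUTEX2_NUMA description above says a second u32 word holding the NUMA
node follows the u32 futex value word. A minimal sketch of that two-word
layout, using Python's struct module purely for illustration: the helper
names are hypothetical, native byte order is an assumption, and nothing here
models alignment rules or which node values the kernel treats specially.

```python
import struct

# Two consecutive unsigned 32-bit words: futex value first, NUMA node second.
# "=" selects native byte order with standard sizes (an assumption here).
NUMA_FUTEX = struct.Struct("=II")


def pack_numa_futex(value, node):
    """Build the 8-byte image of a NUMA-aware futex (hypothetical helper)."""
    return NUMA_FUTEX.pack(value, node)


def unpack_numa_futex(buf):
    """Return (value, node) from an 8-byte image (hypothetical helper)."""
    return NUMA_FUTEX.unpack(buf)
```

The only point of the sketch is the ordering: the value word sits at offset
0 and the node word at offset 4.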
Summary of changes:
- Removed swake_up_one_online() workaround
- Reverted an incorrect rcuog wake-up fix from offline softirq
- Rust RCU Guard methods marked as inline
- Updated MAINTAINERS with Joel’s and Zqiang's new email address
- Replaced magic constant in rcu_seq_done_exact() with named constant
- Added warning mechanism to validate rcu_seq_done_exact()
- Switched SRCU polling API to use rcu_seq_done_exact()
- Commented on redundant delta check in rcu_seq_done_exact()
- Made ->gpwrap tests in rcutorture more frequent
- Fixed reuse of ARM64 images in rcutorture
- rcutorture improved to check Kconfig and reader conflict handling
- Extracted logic from rcu_torture_one_read() for clarity
- Updated LWN RCU API documentation links
- Enabled --do-rt in torture.sh for CONFIG_PREEMPT_RT
- Added tests for SRCU up/down reader primitives
- Added comments and delay checks in rcutorture
- Deprecated srcu_read_lock_lite() and srcu_read_unlock_lite() via checkpatch
- Added --do-normal and --do-no-normal to torture.sh
- Added RCU Rust binding tests to torture.sh
- Reduced CPU overcommit and removed MAXSMP/CPUMASK_OFFSTACK in TREE01
- Replaced kmalloc() with kcalloc() in rcuscale
- Refined listRCU example code for stale data elimination
- Fixed hardirq count bug for x86 in cpu_stall_cputime
- Added safety checks in rcu/nocb for offloaded rdp access
- Other miscellaneous changes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEcoCIrlGe4gjE06JJqA4nf2o45hAFAmgoF5oACgkQqA4nf2o4
5hDvVw//TNsJ/g0HTMu02uXMmtFIrgvpTnH7OEGJ+2p/KErrmWYsBJQw41ueLAQL
Drtq3q9888UFF5LLA43HC88DFmT9uV8V8TmmURH+pZWdmJY1Ekn8UBSBhDPGGpC5
sGIO2jJKjHN8G7fyJKoPtL9jxKSulHF/XQTIL2pP23jopAIwosoCHVAwGvnGVvBC
smXfMSu+bd3IifNFroodsqjVXgnNQwWUNboOkz0KfkiiosgZsWWW8DaM3NGjdp+C
tUHLs1zfC6sgJUjdpokTE3TcNudlMgVlB2Quj5jhh1YvsvedgIJXl4wpR6JVutyN
F9awKt1AZkyZ+cTp+JpohaWaN9aKfNNG7jZ+rxQ0VcuRh35wmBJtiWNjEtJ38R82
kTC1RI7MEus+6OZRt92jv5TNSa9t3wHbi5fBjNRiQ8PYq5cibZy7Lyrj2JOK7Zqs
pgmdUnhQH2Uhf52b+clG5hWO55gEtACY8pin6kNewClcRtz04Jew7gkiYDGka4F4
EXbuDHSWi25eSb3FzT2BqR72OZcJ0kv747OTp+2yTv2TaBA5p+OD8hvL/WbWC2Ok
DK1YQ4RgEerTSZ4PbgPtWkNnlf6xjdWBaYNwmo+G/DgfjPoTOy1Jp73Z4b1AqSB5
IPEQy1d/799QgGTYkbrvRtvWHg8yfOMz3ByZoHg31rafr0AsrXM=
=6mun
-----END PGP SIGNATURE-----
Merge tag 'next.2025.05.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Joel Fernandes:
- Removed swake_up_one_online() workaround
- Reverted an incorrect rcuog wake-up fix from offline softirq
- Rust RCU Guard methods marked as inline
- Updated MAINTAINERS with Joel’s and Zqiang's new email address
- Replaced magic constant in rcu_seq_done_exact() with named constant
- Added warning mechanism to validate rcu_seq_done_exact()
- Switched SRCU polling API to use rcu_seq_done_exact()
- Commented on redundant delta check in rcu_seq_done_exact()
- Made ->gpwrap tests in rcutorture more frequent
- Fixed reuse of ARM64 images in rcutorture
- rcutorture improved to check Kconfig and reader conflict handling
- Extracted logic from rcu_torture_one_read() for clarity
- Updated LWN RCU API documentation links
- Enabled --do-rt in torture.sh for CONFIG_PREEMPT_RT
- Added tests for SRCU up/down reader primitives
- Added comments and delay checks in rcutorture
- Deprecated srcu_read_lock_lite() and srcu_read_unlock_lite() via checkpatch
- Added --do-normal and --do-no-normal to torture.sh
- Added RCU Rust binding tests to torture.sh
- Reduced CPU overcommit and removed MAXSMP/CPUMASK_OFFSTACK in TREE01
- Replaced kmalloc() with kcalloc() in rcuscale
- Refined listRCU example code for stale data elimination
- Fixed hardirq count bug for x86 in cpu_stall_cputime
- Added safety checks in rcu/nocb for offloaded rdp access
- Other miscellaneous changes
* tag 'next.2025.05.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (27 commits)
rcutorture: Fix issue with re-using old images on ARM64
rcutorture: Remove MAXSMP and CPUMASK_OFFSTACK from TREE01
rcutorture: Reduce TREE01 CPU overcommit
torture: Check for "Call trace:" as well as "Call Trace:"
rcutorture: Perform more frequent testing of ->gpwrap
torture: Add testing of RCU's Rust bindings to torture.sh
torture: Add --do-{,no-}normal to torture.sh
checkpatch: Deprecate srcu_read_lock_lite() and srcu_read_unlock_lite()
rcutorture: Comment invocations of tick_dep_set_task()
rcu/nocb: Add Safe checks for access offloaded rdp
rcuscale: using kcalloc() to relpace kmalloc()
doc/RCU/listRCU: refine example code for eliminating stale data
doc: Update LWN RCU API links in whatisRCU.rst
Revert "rcu/nocb: Fix rcuog wake-up from offline softirq"
rust: sync: rcu: Mark Guard methods as inline
rcu/cpu_stall_cputime: fix the hardirq count for x86 architecture
rcu: Remove swake_up_one_online() bandaid
MAINTAINERS: Update Zqiang's email address
rcutorture: Make torture.sh --do-rt use CONFIG_PREEMPT_RT
srcu: Use rcu_seq_done_exact() for polling API
...
Merge updates related to system sleep handling and runtime PM for 6.16-rc1:
- Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
Kalla).
- Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki).
- Add new devm_ functions for enabling runtime PM and runtime PM
reference counting (Bence Csókás).
- Remove size arguments from strscpy() calls in the hibernation core
code (Thorsten Blum).
- Adjust the handling of devices with asynchronous suspend enabled
during system suspend and resume to start resuming them immediately
after resuming their parents and to start suspending such a device
immediately after suspending its first child (Rafael Wysocki).
- Adjust messages printed during tasks freezing to avoid using
pr_cont() (Andrew Sayers, Paul Menzel).
- Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
Zhang).
- Add missing wakeup source attribute relax_count to sysfs and
remove the space character at the end of the string produced by
pm_show_wakelocks() (Zijun Hu).
- Add configurable pm_test delay for hibernation (Zihuan Zhang).
- Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
cypd4226 device on Tegra boards from suspending prematurely (Jon
Hunter).
- Unbreak printing PM debug messages during hibernation and clean up
some related code (Rafael Wysocki).
* pm-runtime:
PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
PM: sysfs: Move debug runtime PM attributes to runtime_attrs[]
PM: runtime: Add new devm functions
* pm-sleep:
PM: freezer: Rewrite restarting tasks log to remove stray *done.*
PM: sleep: Introduce pm_sleep_transition_in_progress()
PM: sleep: Introduce pm_suspend_in_progress()
PM: sleep: Print PM debug messages during hibernation
ucsi_ccg: Disable async suspend in ucsi_ccg_probe()
PM: hibernate: add configurable delay for pm_test
PM: wakeup: Delete space in the end of string shown by pm_show_wakelocks()
PM: wakeup: Add missing wakeup source attribute relax_count
PM: sleep: Remove unnecessary !!
PM: sleep: Use two lines for "Restarting..." / "done" messages
PM: sleep: Make suspend of devices more asynchronous
PM: sleep: Suspend async parents after suspending children
PM: sleep: Resume children after resuming the parent
PM: hibernate: Remove size arguments when calling strscpy()
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X
MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q
IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy
fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO
xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW
6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH
3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R
Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn
i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X
chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8
75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb
Y6NDJw5+kQ==
=1PNK
-----END PGP SIGNATURE-----
Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
Merge cpufreq updates for 6.16-rc1:
- Refactor cpufreq_online(), add and use cpufreq policy locking guards,
use __free() in policy reference counting, and clean up core cpufreq
code on top of that (Rafael Wysocki).
- Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
Kumar).
- Fix des_perf clamping with max_perf in amd_pstate_update() (Dhananjay
Ugwekar).
- Add offline, online and suspend callbacks to the amd-pstate driver,
rename and use the existing amd_pstate_epp callbacks in it (Dhananjay
Ugwekar).
- Add support for the "Requested CPU Min frequency" BIOS option to the
amd-pstate driver (Dhananjay Ugwekar).
- Reset amd-pstate driver mode after running selftests (Swapnil
Sapkal).
- Add helper for governor checks to the schedutil cpufreq governor and
move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki).
- Populate the cpu_capacity sysfs entries from the intel_pstate driver
after registering asym capacity support (Ricardo Neri).
- Add support for enabling Energy-aware scheduling (EAS) to the
intel_pstate driver when operating in the passive mode on a hybrid
platform (Rafael Wysocki).
- Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
Chancellor).
- Drop redundant cpus_read_lock() from store_local_boost() in the
cpufreq core (Seyediman Seyedarab).
- Replace sscanf() with kstrtouint() in the cpufreq code and use a
symbol instead of a raw number in it (Bowen Yu).
- Add support for autonomous CPU performance state selection to the
CPPC cpufreq driver (Lifeng Zheng).
* pm-cpufreq: (31 commits)
cpufreq: CPPC: Add support for autonomous selection
cpufreq: Update sscanf() to kstrtouint()
cpufreq: Replace magic number
cpufreq: drop redundant cpus_read_lock() from store_local_boost()
cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
cpufreq: intel_pstate: Document hybrid processor support
cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
cpufreq: intel_pstate: EAS support for hybrid platforms
cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
cpufreq: intel_pstate: Populate the cpu_capacity sysfs entries
arch_topology: Relocate cpu_scale to topology.[h|c]
cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
cpufreq/sched: schedutil: Add helper for governor checks
amd-pstate-ut: Reset amd-pstate driver mode after running selftests
cpufreq/amd-pstate: Add support for the "Requested CPU Min frequency" BIOS option
cpufreq/amd-pstate: Add offline, online and suspend callbacks for amd_pstate_driver
cpufreq: Force sync policy boost with global boost on sysfs update
cpufreq: Preserve policy's boost state after resume
cpufreq: Introduce policy_set_boost()
cpufreq: Don't unnecessarily call set_boost()
...
Merge energy model management code updates for 6.16-rc1:
- Fix potential division-by-zero error in em_compute_costs() (Yaxiong
Tian).
- Fix typos in energy model documentation and example driver code (Moon
Hee Lee, Atul Kumar Pant).
- Rearrange the energy model management code and add a new function for
adjusting a CPU energy model after adjusting the capacity of the
given CPU to it (Rafael Wysocki).
* pm-em:
PM: EM: Introduce em_adjust_cpu_capacity()
PM: EM: Move CPU capacity check to em_adjust_new_capacity()
PM: EM: Documentation: Fix typos in example driver code
PM: EM: Documentation: fix typo in energy-model.rst
PM: EM: Fix potential division-by-zero error in em_compute_costs()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
ov4zAP4yfqKBAz6eMt9CzDgHCdVQJ9Nuur1EiRdot3maPzHTcQEA2hVkJrvVo1Y/
jCVAf7gmGX1Uu6nCUF6Vjy35g8i20gA=
=nzYS
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
"Features:
- Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
socket option
SO_PEERPIDFD is a socket option that retrieves a pidfd for
the process that called connect() or listen(). This is heavily used
to safely authenticate clients in userspace avoiding security bugs
due to pid recycling races (dbus, polkit, systemd, etc.)
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In
this case it currently returns EINVAL. Userspace still wants to get
a pidfd for a reaped process to have a stable handle it can pass
on. This is especially useful now that it is possible to retrieve
exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
PIDFD_INFO_EXIT flag
Another summary has been provided by David Rheinsberg:
> A pidfd can outlive the task it refers to, and thus user-space
> must already be prepared that the task underlying a pidfd is
> gone at the time they get their hands on the pidfd. For
> instance, resolving the pidfd to a PID via the fdinfo must be
> prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several
> kernel APIs currently add another layer that checks for this. In
> particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
> already reaped, but returns a stale pidfd if the task is reaped
> immediately after the respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways
> to check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though
> there is no particular reason to distinguish both cases. This
> also propagates through user-space APIs, which pass on pidfds.
> They must be prepared to pass on `-1` *or* the pidfd, because
> there is no guaranteed way to get a stale pidfd from the kernel.
>
> Userspace must already deal with a pidfd referring to a reaped
> task as the task may exit and get reaped at any time while there
> are still many pidfds referring to it
In order to allow handing out reaped pidfds, SO_PEERPIDFD needs to
ensure that PIDFD_INFO_EXIT information is available whenever a
pidfd for a reaped task is created by SO_PEERPIDFD. The uapi
promises that reaped pidfds are only handed out if it is guaranteed
that the caller sees the exit information:
TEST_F(pidfd_info, success_reaped)
{
	struct pidfd_info info = {
		.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
	};

	/*
	 * Process has already been reaped and PIDFD_INFO_EXIT been set.
	 * Verify that we can retrieve the exit status of the process.
	 */
	ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
	ASSERT_TRUE(WIFEXITED(info.exit_code));
	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}
To hand out pidfds for reaped processes we thus allocate a pidfs
entry for the relevant sk->sk_peer_pid at the time the
sk->sk_peer_pid is stashed and drop it when the socket is
destroyed. This guarantees that exit information will always be
recorded for the sk->sk_peer_pid task and we can hand out pidfds
for reaped processes
- Hand a pidfd to the coredump usermode helper process
Give userspace a way to instruct the kernel to install a pidfd for
the crashing process into the process started as a usermode helper.
There are still tricky race windows that userspace cannot close
easily, or sometimes at all. There are various workarounds, like
looking at the start time of a process to make sure that the
usermode helper process is started after the crashing process, but
they are all very brittle and fraught with peril
The crashed-but-not-reaped process can be killed by userspace
before coredump processing programs like systemd-coredump have had
time to manually open a PIDFD from the PID the kernel provides
them, which means they can be tricked into reading from an
arbitrary process, and they run with full privileges as they are
usermode helper processes
Even if that specific race-window wouldn't exist it's still the
safest and cleanest way to let the kernel provide the pidfd
directly instead of requiring userspace to do it manually. In
parallel with this commit we already have systemd adding support
for this in [1]
When the usermode helper process is forked we install a pidfd as
file descriptor three in the usermode helper's file descriptor table
so it's available to the exec'd program
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is
empty and can thus always use three as the file descriptor number
Note that we'll install a pidfd for the thread-group leader even
if a subthread is calling do_coredump(). We know that task linkage
hasn't been removed yet and even if this @current isn't the actual
thread-group leader we know that the thread-group leader cannot be
reaped until @current has exited
- Allow distinguishing a task that has not been found from finding the
wrong task when creating a pidfd
We currently report EINVAL whenever a struct pid has no task
attached anymore, thereby conflating two concepts:
(1) The task has already been reaped
(2) The caller requested a pidfd for a thread-group leader but the
pid actually references a struct pid that isn't used as a
thread-group leader
This is causing issues for non-threaded workloads, which expect
ESRCH to be reported, not EINVAL
So allow userspace to reliably distinguish between (1) and (2)
- Make it possible to detect when a pidfs entry would outlive the
struct pid it pinned
- Add a range of new selftests
Cleanups:
- Remove unneeded NULL check from pidfd_prepare() for passed struct
pid
- Avoid pointless reference count bump during release_task()
Fixes:
- Various fixes to the pidfd and coredump selftests
- Fix error handling for replace_fd() when spawning coredump usermode
helper"
* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: detect refcount bugs
coredump: hand a pidfd to the usermode coredump helper
coredump: fix error handling for replace_fd()
pidfs: move O_RDWR into pidfs_alloc_file()
selftests: coredump: Raise timeout to 2 minutes
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Properly initialize pointer
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
pidfs: get rid of __pidfd_prepare()
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
pidfs: register pid in pidfs
net, pidfd: report EINVAL for ESRCH
release_task: kill the no longer needed get/put_pid(thread_pid)
pidfs: ensure consistent ENOENT/ESRCH reporting
exit: move wake_up_all() pidfd waiters into __unhash_process()
selftest/pidfd: add test for thread-group leader pidfd open for thread
pidfd: improve uapi when task isn't found
pidfd: remove unneeded NULL check from pidfd_prepare()
selftests/pidfd: adapt to recent changes
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
oi3BAQD/IBxTbAZIe7vEAsuLlBoKbWrzPGvxzd4UeMGo6OY18wEAvvyJM+arQy51
jS0ZErDOJnPNe7jps+Gh+WDx6d3NMAY=
=lqAG
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs freezing updates from Christian Brauner:
"This contains various filesystem freezing related work for this cycle:
- Allow the power subsystem to support filesystem freeze for suspend
and hibernate.
Now all the pieces are in place to actually allow the power
subsystem to freeze/thaw filesystems during suspend/resume.
Filesystems are only frozen and thawed if the power subsystem does
actually own the freeze.
If the filesystem is already frozen by the time we've frozen all
userspace processes we don't care to freeze it again. That's
userspace's job once the process resumes. We only actually freeze
filesystems if we absolutely have to and we ignore other failures
to freeze.
We could bubble up errors and fail suspend/resume if the error
isn't EBUSY (aka it's already frozen) but I don't think that this
is worth it. Filesystem freezing during suspend/resume is
best-effort. If the user has 500 ext4 filesystems mounted and 4
fail to freeze for whatever reason then we simply skip them.
What we have now is already a big improvement and let's see how we
fare with it before making our lives even harder (and uglier) than
we have to.
- Allow efivars to support freeze and thaw
Allow efivarfs to resync variable state during system hibernation
and suspend by adding freeze/thaw support.
This is a pretty straightforward implementation. We simply add
regular freeze/thaw support for both userspace and the kernel.
efivars is the first pseudofilesystem that adds support for
filesystem freezing and thawing.
The simplicity comes from the fact that we simply always resync
variable state after efivarfs has been frozen. It doesn't matter
whether that's because of suspend, userspace initiated freeze or
hibernation. Efivars is simple enough that it doesn't matter that
we walk all dentries. There are no directories and there aren't
insane amounts of entries and both freeze/thaw are already
heavy-handed operations. If userspace initiated a freeze/thaw cycle
they would need CAP_SYS_ADMIN in the initial user namespace (as
that's where efivarfs is mounted) so it can't be triggered by
random userspace. IOW, we really really don't care"
* tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
f2fs: fix freezing filesystem during resize
kernfs: add warning about implementing freeze/thaw
efivarfs: support freeze/thaw
power: freeze filesystems during suspend/resume
libfs: export find_next_child()
super: add filesystem freezing helpers for suspend and hibernate
gfs2: pass through holder from the VFS for freeze/thaw
super: use common iterator (Part 2)
super: use a common iterator (Part 1)
super: skip dying superblocks early
super: simplify user_get_super()
super: remove pointless s_root checks
fs: allow all writers to be frozen
locking/percpu-rwsem: add freezable alternative to down_read
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBPTwAKCRCRxhvAZXjc
om0+AQDMxKLweJXplqQQ7jxuvW2dEa60YpE2EalEKWGg9YA3KgEA3nI4kyKMKn7Y
PRFXgIcKvhs62oJLKsq8SGQUqExqvAE=
=atEw
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Use folios for symlinks in the page cache
FUSE already uses folios for its symlinks. Mirror that conversion
in the generic code and the NFS code. That lets us get rid of a few
folio->page->folio conversions in this path, and some of the few
remaining users of read_cache_page() / read_mapping_page()
- Try and make a few filesystem operations killable on the VFS
inode->i_mutex level
- Add sysctl vfs_cache_pressure_denom for bulk file operations
Some workloads need to preserve more dentries than we currently
allow through our sysctl interface
Consider HDFS servers with 12 HDDs each. HDFS datanode startup
involves scanning all files and caching their metadata (including
dentries and inodes) in memory. Each HDD contains approximately 2
million files, resulting in a total of ~20 million cached dentries
after initialization
To minimize dentry reclamation, they set vfs_cache_pressure to 1.
Despite this configuration, memory pressure conditions can still
trigger reclamation of up to 50% of cached dentries, reducing the
cache from 20 million to approximately 10 million entries. During
the subsequent cache rebuild period, any HDFS datanode restart
operation incurs substantial latency penalties until full cache
recovery completes
To maintain service stability, more dentries need to be preserved
during memory reclamation. The current minimum reclaim ratio (1/100
of total dentries) remains too aggressive for such workloads. This
patch introduces vfs_cache_pressure_denom for more granular cache
pressure control
The configuration [vfs_cache_pressure=1,
vfs_cache_pressure_denom=10000] effectively maintains the full 20
million dentry cache under memory pressure, preventing datanode
restart performance degradation
- Avoid some jumps in inode_permission() using likely()/unlikely()
- Avoid a memory access which is most likely a cache miss when
descending into devcgroup_inode_permission()
- Add fastpath predicts for stat() and fdput()
- Anonymous inodes currently don't come with a proper mode causing
issues in the kernel when we want to add useful VFS debug asserts.
Fix that by giving them a proper mode and masking it off when we
report it to userspace which relies on them not having any mode
- Anonymous inodes currently allow inode attributes to be changed because
the VFS falls back to simple_setattr() if i_op->setattr isn't
implemented. This means the ownership and mode for every single
user of anon_inode_inode can be changed. Block that as it's either
useless or actively harmful. If specific ownership is needed the
respective subsystem should allocate anonymous inodes from their
own private superblock
- Raise SB_I_NODEV and SB_I_NOEXEC on the anonymous inode superblock
- Add proper tests for anonymous inode behavior
- Make it easy to detect proper anonymous inodes and to ensure that
we can detect them in codepaths such as readahead()
Cleanups:
- Port pidfs to the new anon_inode_{g,s}etattr() helpers
- Try to remove the uselib() system call
- Add unlikely branch hint return path for poll
- Add unlikely branch hint on return path for core_sys_select
- Don't allow signals to interrupt getdents copying for fuse
- Provide a size hint to dir_context for use during readdir()
- Use writeback_iter directly in mpage_writepages
- Update compression and mtime descriptions in initramfs
documentation
- Update main netfs API document
- Remove useless plus one in super_cache_scan()
- Remove unnecessary NULL-check guards during setns()
- Add separate {get,put}_cgroup_ns no-op cases
Fixes:
- Fix typo in root= kernel parameter description
- Use KERN_INFO for infof()|info_plog()|infofc()
- Correct comments of fs_validate_description()
- Mark an unlikely if condition with unlikely() in
vfs_parse_monolithic_sep()
- Delete macro fsparam_u32hex()
- Remove unused and problematic validate_constant_table()
- Fix potential unsigned integer underflow in fs_name()
- Make file-nr output the total allocated file handles"
* tag 'vfs-6.16-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (43 commits)
fs: Pass a folio to page_put_link()
nfs: Use a folio in nfs_get_link()
fs: Convert __page_get_link() to use a folio
fs/read_write: make default_llseek() killable
fs/open: make do_truncate() killable
fs/open: make chmod_common() and chown_common() killable
include/linux/fs.h: add inode_lock_killable()
readdir: supply dir_context.count as readdir buffer size hint
vfs: Add sysctl vfs_cache_pressure_denom for bulk file operations
fuse: don't allow signals to interrupt getdents copying
Documentation: fix typo in root= kernel parameter description
include/cgroup: separate {get,put}_cgroup_ns no-op case
kernel/nsproxy: remove unnecessary guards
fs: use writeback_iter directly in mpage_writepages
fs: remove useless plus one in super_cache_scan()
fs: add S_ANON_INODE
fs: remove uselib() system call
device_cgroup: avoid access to ->i_rdev in the common case in devcgroup_inode_permission()
fs/fs_parse: Remove unused and problematic validate_constant_table()
fs: touch up predicts in inode_permission()
...
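Returning to the vfs_cache_pressure_denom item in the pull message above, the arithmetic can be sketched in a few lines. This is a minimal illustration, not the kernel's code: it assumes the share of dentries offered for reclaim scales as objects * vfs_cache_pressure / vfs_cache_pressure_denom, which is consistent with the "1/100 of total dentries" minimum quoted in the text. The helper name scaled_reclaim_target() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel's implementation): scale a count of
 * cached objects by pressure/denom, as the pull message describes.
 */
static uint64_t scaled_reclaim_target(uint64_t nr_objects,
				      uint64_t pressure, uint64_t denom)
{
	assert(denom != 0);	/* a zero denominator is meaningless here */

	/* Split the division so nr_objects * pressure cannot overflow early. */
	return (nr_objects / denom) * pressure +
	       (nr_objects % denom) * pressure / denom;
}
```

Under that assumption, 20 million dentries with pressure 1 and a denominator of 100 give 200,000, while the [vfs_cache_pressure=1, vfs_cache_pressure_denom=10000] configuration from the text gives 2,000.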
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBN6wAKCRCRxhvAZXjc
ok32AQD9DTiSCAoVg+7s+gSBuLTi8drPTN++mCaxdTqRh5WpRAD9GVyrGQT0s6LH
eo9bm8d1TAYjilEWM0c0K0TxyQ7KcAA=
=IW7H
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs directory lookup updates from Christian Brauner:
"This contains cleanups for the lookup_one*() family of helpers.
We expose a set of functions with names containing "lookup_one_len"
and others without the "_len". This difference has nothing to do with
"len". It's rather a historical accident that can be confusing.
The functions without "_len" take a "mnt_idmap" pointer. This is found
in the "vfsmount" and that is an important question when choosing
which to use: do you have a vfsmount, or are you "inside" the
filesystem. A related question is "is permission checking relevant
here?".
nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len
functions. They pass nop_mnt_idmap and refuse to work on filesystems
which have any other idmap.
This work changes nfsd and cachefiles to use the lookup_one family of
functions and to explicitly pass &nop_mnt_idmap which is consistent
with all other vfs interfaces used where &nop_mnt_idmap is explicitly
passed.
The remaining uses of the "_one" functions do not require permission
checks so these are renamed to be "_noperm" and the permission
checking is removed.
This series also changes these lookup functions to take a qstr instead
of separate name and len. In many cases this simplifies the call"
* tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
VFS: change lookup_one_common and lookup_noperm_common to take a qstr
Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
VFS: rename lookup_one_len family to lookup_noperm and remove permission check
cachefiles: Use lookup_one() rather than lookup_one_len()
nfsd: Use lookup_one() rather than lookup_one_len()
VFS: improve interface for lookup_one functions
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDLNqwAKCRDdBJ7gKXxA
juanAQD4aZn7ACTpbIgDIlLVJouq6OOHEYye9hhxz19UN2mAUgEAn8jPqvBDav3S
HxjMFSdgLUQVO03FCs9tpNJchi69nw0=
=R3UI
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2025-05-25-00-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"22 hotfixes.
13 are cc:stable and the remainder address post-6.14 issues or aren't
considered necessary for -stable kernels. 19 are for MM"
* tag 'mm-hotfixes-stable-2025-05-25-00-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
mailmap: add Jarkko's employer email address
mm: fix copy_vma() error handling for hugetlb mappings
memcg: always call cond_resched() after fn()
mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb folios
mm: vmalloc: only zero-init on vrealloc shrink
mm: vmalloc: actually use the in-place vrealloc region
alloc_tag: allocate percpu counters for module tags dynamically
module: release codetag section when module load fails
mm/cma: make detection of highmem_start more robust
MAINTAINERS: add mm memory policy section
MAINTAINERS: add mm ksm section
kasan: avoid sleepable page allocation from atomic context
highmem: add folio_test_partial_kmap()
MAINTAINERS: add hung-task detector section
taskstats: fix struct taskstats breaks backward compatibility since version 15
mm/truncate: fix out-of-bounds when doing a right-aligned split
MAINTAINERS: add mm reclaim section
MAINTAINERS: update page allocator section
mm: fix VM_UFFD_MINOR == VM_SHADOW_STACK on USERFAULTFD=y && ARM64_GCS=y
mm: mmap: map MAP_STACK to VM_NOHUGEPAGE only if THP is enabled
...
Sean noted that scripts/Makefile.lib:name-fix-token rule will mangle
the module name with s/-/_/g.
Since this happens late in the build, only the kernel needs to bother
with this, the modpost tool still sees the original name.
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Instead of only accepting "module:${name}", extend it with a comma
separated list of module names and add tail glob support.
That is, something like: "module:foo-*,bar" is now possible.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
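The matching rule from the commit above (a comma-separated list of names, each optionally ending in a glob) can be sketched in userspace C. This is an illustrative sketch, not the kernel's implementation: namespace_allows() and same_mod_chars() are hypothetical names, and treating '-' and '_' as equal is an assumption drawn from the s/-/_/g name mangling mentioned in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Compare n bytes, treating '-' and '_' as the same character. */
static bool same_mod_chars(const char *pat, const char *name, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		char cp = pat[i] == '-' ? '_' : pat[i];
		char cn = name[i] == '-' ? '_' : name[i];

		/* Also stops safely when name ends before the pattern does. */
		if (cp != cn)
			return false;
	}
	return true;
}

/* Does the namespace string "module:a,b-*" admit module `modname`? */
static bool namespace_allows(const char *ns, const char *modname)
{
	static const char prefix[] = "module:";
	const char *entry;
	size_t modlen;

	assert(ns && modname);
	if (strncmp(ns, prefix, strlen(prefix)) != 0)
		return false;

	modlen = strlen(modname);
	entry = ns + strlen(prefix);
	while (*entry) {
		size_t len = strcspn(entry, ",");	/* length of this list entry */
		const char *next = entry + len;
		bool glob = len > 0 && entry[len - 1] == '*';
		size_t cmp = glob ? len - 1 : len;	/* bytes that must match */

		if (len > 0 && same_mod_chars(entry, modname, cmp) &&
		    (glob || cmp == modlen))
			return true;

		entry = *next ? next + 1 : next;	/* skip the comma, if any */
	}
	return false;
}
```

With the example from the text, namespace_allows("module:foo-*,bar", "foo_bar") and namespace_allows("module:foo-*,bar", "bar") both hold, while a module named "barn" is rejected because "bar" carries no glob.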
Designate the "module:${modname}" symbol namespace to mean: 'only
export to the named module'.
Notably, explicit imports of anything in the "module:" space are
forbidden.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
When module load fails after memory for the codetag section is ready, the
codetag section memory is not properly released. This causes a memory leak,
and if the next module load happens to get the same module address, codetag
may pick the uninitialized section when manipulating tags during module
unload, leading to an "unable to handle page fault" BUG.
Link: https://lkml.kernel.org/r/20250519163823.7540-1-00107082@163.com
Fixes: 0db6f8d782 ("alloc_tag: load module tags into separate contiguous memory")
Closes: https://lore.kernel.org/all/20250516131246.6244-1-00107082@163.com/
Signed-off-by: David Wang <00107082@163.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On error, copy_from_user() returns the number of bytes not copied to the
destination, but the current implementation of copy_user_data_sleepable()
does not handle that correctly and returns it as the error value, which may
confuse users expecting a meaningful negative error value.
Fixes: a498ee7576 ("bpf: Implement dynptr copy kfuncs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250523181705.261585-1-mykyta.yatsenko5@gmail.com
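The pitfall described above can be shown with a userspace mock. This is a sketch under stated assumptions: mock_copy_from_user() is a hypothetical stand-in that only mimics the "returns bytes not copied" convention, and mapping any shortfall to -EFAULT is the conventional choice rather than something this commit message spells out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_from_user(): copies at most `limit` bytes
 * and returns the number of bytes it could NOT copy (0 means success).
 */
static size_t mock_copy_from_user(void *dst, const void *src,
				  size_t n, size_t limit)
{
	size_t done = n < limit ? n : limit;

	memcpy(dst, src, done);
	return n - done;
}

/* Translate the "bytes left" count into a negative errno for callers. */
static int copy_or_efault(void *dst, const void *src, size_t n, size_t limit)
{
	assert(dst && src);

	if (mock_copy_from_user(dst, src, n, limit) != 0)
		return -EFAULT;	/* never leak the positive remainder */
	return 0;
}
```

Returning the raw remainder instead would hand callers a positive number such as 5, which they cannot tell apart from a success value.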
User space needs access to kernel BTF for many modern features of BPF.
Right now each process needs to read the BTF blob either in pieces or
as a whole. Allow mmaping the sysfs file so that processes can directly
access the memory allocated for it in the kernel.
remap_pfn_range is used instead of vm_insert_page due to aarch64
compatibility issues.
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/bpf/20250520-vmlinux-mmap-v5-1-e8c941acc414@isovalent.com
Filesystems like XFS can implement atomic write I/O using either
REQ_ATOMIC flag set in the bio or via CoW operation. It will be useful
if we have a flag in trace events to distinguish between the two. This
patch adds char 'U' (Untorn writes) to rwbs field of the trace events
if REQ_ATOMIC flag is set in the bio.
<W/ REQ_ATOMIC>
=================
xfs_io-4238 [009] ..... 4148.126843: block_rq_issue: 259,0 WFSU 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0 [009] d.h1. 4148.129864: block_rq_complete: 259,0 WFSU () 768 + 32 none,0,0 [0]
<W/O REQ_ATOMIC>
===============
xfs_io-4237 [010] ..... 4143.325616: block_rq_issue: 259,0 WS 16384 () 768 + 32 none,0,0 [xfs_io]
<idle>-0 [010] d.H1. 4143.329138: block_rq_complete: 259,0 WS () 768 + 32 none,0,0 [0]
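A simplified sketch of how such an rwbs string could be assembled. The SK_*
flag bits and sketch_fill_rwbs() are hypothetical names for illustration and
do not correspond to the kernel's REQ_* values or to its actual helper; the
character order is chosen only to match the "WFSU" trace lines above:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits for this sketch; not the kernel's REQ_* values. */
enum {
    SK_WRITE  = 1u << 0,
    SK_FUA    = 1u << 1,
    SK_SYNC   = 1u << 2,
    SK_ATOMIC = 1u << 3,
};

/*
 * Emit one character per set flag: the operation first, then 'F' for FUA,
 * 'S' for sync, and finally 'U' (untorn write) when the atomic flag is set.
 */
void sketch_fill_rwbs(char *rwbs, size_t size, unsigned int flags)
{
    size_t i = 0;

    assert(size >= 5); /* up to four flag characters plus the terminator */
    rwbs[i++] = (flags & SK_WRITE) ? 'W' : 'R';
    if (flags & SK_FUA)
        rwbs[i++] = 'F';
    if (flags & SK_SYNC)
        rwbs[i++] = 'S';
    if (flags & SK_ATOMIC)
        rwbs[i++] = 'U';
    rwbs[i] = '\0';
}
```

A write carrying the FUA, sync and atomic flags yields "WFSU", while the same
write without the atomic flag and FUA yields "WS", as in the two traces.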
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/44317cb2ec4588f6a2c1501a96684e6a1196e8ba.1747921498.git.ritesh.list@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
PVH dom0 is useless without XEN_UNPOPULATED_ALLOC, as otherwise it will
very likely balloon out all dom0 memory to map foreign and grant pages.
Enable it by default as part of xen.config. This also requires enabling
MEMORY_HOTREMOVE and ZONE_DEVICE.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Message-ID: <20250514092037.28970-1-roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
The "try_" prefix is confusing, since it made people believe that
try_alloc_pages() is analogous to spin_trylock() and NULL return means
EAGAIN. This is not the case. If it returns NULL there is no reason to
call it again. It will most likely return NULL again. Hence rename it to
alloc_pages_nolock() to make it symmetrical to free_pages_nolock() and
document that NULL means ENOMEM.
Link: https://lkml.kernel.org/r/20250517003446.60260-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
BPF schedulers that use both builtin CPU idle mechanism and
ops.update_idle() may want to use the latter to create interlocking between
ops.enqueue() and CPU idle transitions so that either ops.enqueue() sees the
idle bit or ops.update_idle() sees the task queued somewhere. This can
prevent race conditions where CPUs go idle while tasks are waiting in DSQs.
For such interlocking to work, ops.update_idle() must be called after
builtin CPU masks are updated. Relocate the invocation. Currently, there are
no ordering requirements on transitions from idle and this relocation isn't
expected to make meaningful differences in that direction.
This also makes the ops.update_idle() behavior semantically consistent:
any action performed in this callback should be able to override the
builtin idle state, not the other way around.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-and-tested-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Other subsystems may make use of the cgroup hierarchy with the cgroup_bpf
support being one such example. For such a feature, it's useful to be able
to hook into cgroup creation and destruction paths to perform
feature-specific initializations and cleanups.
Add cgroup_lifetime_notifier which generates CGROUP_LIFETIME_ONLINE and
CGROUP_LIFETIME_OFFLINE events whenever cgroups are created and destroyed,
respectively.
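The notifier idea can be sketched in plain userspace C as follows. All sk_*
names are hypothetical and the fixed-size array is a simplification; the
kernel's notifier chains have their own registration API and locking, which
this sketch does not attempt to reproduce:

```c
#include <assert.h>
#include <stddef.h>

enum sk_lifetime_event { SK_LIFETIME_ONLINE, SK_LIFETIME_OFFLINE };

typedef void (*sk_notifier_fn)(enum sk_lifetime_event ev, void *cgrp);

#define SK_MAX_NOTIFIERS 4
static sk_notifier_fn sk_chain[SK_MAX_NOTIFIERS];
static size_t sk_chain_len;

/* Subscribe a feature to cgroup creation/destruction events. */
int sk_lifetime_register(sk_notifier_fn fn)
{
    assert(fn != NULL);
    if (sk_chain_len == SK_MAX_NOTIFIERS)
        return -1;
    sk_chain[sk_chain_len++] = fn;
    return 0;
}

/* Called from the (modelled) cgroup create and destroy paths. */
void sk_lifetime_notify(enum sk_lifetime_event ev, void *cgrp)
{
    for (size_t i = 0; i < sk_chain_len; i++)
        sk_chain[i](ev, cgrp);
}

/* Example subscriber: keeps a count of cgroups that are currently online. */
int sk_live_cgroups;

void sk_count_live(enum sk_lifetime_event ev, void *cgrp)
{
    (void)cgrp; /* a real user would set up or tear down per-cgroup state */
    if (ev == SK_LIFETIME_ONLINE)
        sk_live_cgroups++;
    else
        sk_live_cgroups--;
}
```

The point of the design is that the cgroup core only calls
sk_lifetime_notify() and needs no knowledge of who is listening.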
The next patch will convert cgroup_bpf to use the new notifier and other
uses are planned.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_bpf init and exit handling will be moved to a notifier chain. In
preparation, reorganize cgroup_create() a bit so that the new cgroup is fully
initialized before any outside changes are made.
- cgrp->ancestors[] initialization and the hierarchical nr_descendants and
nr_frozen_descendants updates were in the same loop. Separate them out and
do the former earlier and do the latter later.
- Relocate cgroup_bpf_inherit() call so that it's after all cgroup
initializations are complete.
No visible behavior changes expected.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.15-rc8).
Conflicts:
80f2ab46c2 ("irdma: free iwdev->rf after removing MSI-X")
4bcc063939 ("ice, irdma: fix an off by one in error handling code")
c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au
No extra adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This reverts commit 4a8f635a60.
Although get_pid_task() internally already calls rcu_read_lock() and
rcu_read_unlock(), find_vpid() was left unprotected.
The documentation for find_vpid() clearly states:
"Must be called with the tasklist_lock or rcu_read_lock() held."
Add proper rcu_read_lock/unlock() to protect the find_vpid().
Fixes: 4a8f635a60 ("bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic")
Reported-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250520054943.5002-1-xuewen.yan@unisoc.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Subsystem rstat locks are dynamically allocated per-cpu. It was discovered
that a panic can occur during this allocation when the lock size is zero.
This is the case on non-smp systems, since arch_spinlock_t is defined as an
empty struct. Prevent this allocation when !CONFIG_SMP by adding a
pre-processor conditional around the affected block.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Reported-by: Klara Modin <klarasmodin@gmail.com>
Fixes: 748922dcfa ("cgroup: use subsystem-specific rstat locks to avoid contention")
Signed-off-by: Tejun Heo <tj@kernel.org>
The current allocation of VMAP stack memory is using (THREADINFO_GFP &
~__GFP_ACCOUNT) which is a complicated way of saying (GFP_KERNEL |
__GFP_ZERO):
<linux/thread_info.h>:
  #define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO)
<linux/gfp_types.h>:
  #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
This is an unfortunate side-effect of independent changes blurring the
picture:
commit 19809c2da2 changed (THREADINFO_GFP |
__GFP_HIGHMEM) to just THREADINFO_GFP since highmem became implicit.
commit 9b6f7e163c then added stack caching
and rewrote the allocation to (THREADINFO_GFP & ~__GFP_ACCOUNT) as cached
stacks need to be accounted separately. However that code, when it
eventually accounts the memory does this:
ret = memcg_kmem_charge(vm->pages[i], GFP_KERNEL, 0)
so the memory is charged as a GFP_KERNEL allocation.
Define a unique GFP_VMAP_STACK to use
GFP_KERNEL | __GFP_ZERO and move the comment there.
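The flag arithmetic above can be checked with a small sketch. The SK_* bit
values are made up for illustration (the real masks live in
<linux/gfp_types.h>); the only assumption that matters is that the
accounting bit is distinct from the other two masks:

```c
#include <assert.h>

/* Hypothetical bit values; only their distinctness matters here. */
#define SK_GFP_KERNEL   0x1u
#define SK_GFP_ACCOUNT  0x2u
#define SK_GFP_ZERO     0x4u

#define SK_GFP_KERNEL_ACCOUNT (SK_GFP_KERNEL | SK_GFP_ACCOUNT)
#define SK_THREADINFO_GFP     (SK_GFP_KERNEL_ACCOUNT | SK_GFP_ZERO)
#define SK_GFP_VMAP_STACK     (SK_GFP_KERNEL | SK_GFP_ZERO)

/* Old spelling: build the accounted mask, then strip the accounting bit. */
unsigned int sk_old_vmap_stack_gfp(void)
{
    return SK_THREADINFO_GFP & ~SK_GFP_ACCOUNT;
}

/* New spelling: say directly what is meant. */
unsigned int sk_new_vmap_stack_gfp(void)
{
    return SK_GFP_VMAP_STACK;
}

/* Both spellings must produce the same mask. */
void sk_check_equivalence(void)
{
    assert(sk_old_vmap_stack_gfp() == sk_new_vmap_stack_gfp());
}
```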
Link: https://lkml.kernel.org/r/20250509-gfp-stack-v1-1-82f6f7efc210@linaro.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reported-by: Mateusz Guzik <mjguzik@gmail.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are two data types: "struct vm_struct" and "struct vm_stack" that
have the same local variable names: vm_stack, or vm, or s, which makes the
code confusing to read.
Change the code so the naming is consistent:
struct vm_struct is always called vm_area
struct vm_stack is always called vm_stack
One change altering vfree(vm_stack) to vfree(vm_area->addr) may look like
a semantic change but it is not: vm_area->addr points to the vm_stack.
This was done to improve readability.
[linus.walleij@linaro.org: rebased and added new users of the variable names, address review comments]
Link: https://lore.kernel.org/20240311164638.2015063-4-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-2-e6c69dd356f2@linaro.org
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "fork: Page operation cleanups in the fork code", v3.
This patchset consists of outtakes from a 1 year+ old patchset from Pasha,
which all stand on their own. See:
https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
These are good cleanups for readability so I split these off, rebased on
v6.15-rc1, addressed review comments and sent them separately.
All mentions of dynamic stack are removed from the patchset as we have no
idea whether that will go anywhere.
This patch (of 3):
There is an unneeded OR in the ifdef functions that are used to allocate and
free kernel stacks based on direct map or vmap.
Therefore, clean up by changing the order so OR is no longer needed.
[linus.walleij@linaro.org: rebased]
Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-1-e6c69dd356f2@linaro.org
Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-0-e6c69dd356f2@linaro.org
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "sysfs: add counters for lockups and stalls", v2.
Commits 9db89b4111 ("exit: Expose "oops_count" to sysfs") and
8b05aa2633 ("panic: Expose "warn_count" to sysfs") added counters for
oopses and warnings to sysfs, and these two patches do the same for
hard/soft lockups and RCU stalls.
All of these counters are useful for monitoring tools to detect whether
the machine is healthy. If the kernel has experienced a lockup or a
stall, it's probably due to a kernel bug, and I'd like to detect that
quickly and easily. There is currently no way to detect that other than
parsing dmesg or observing indirect effects, such as certain tasks not
responding, but then I need to observe all tasks, and it may take a while
until these effects become visible/measurable. I'd rather be able to
detect the primary cause more quickly, possibly before everything falls
apart.
This patch (of 2):
There is /proc/sys/kernel/hung_task_detect_count, /sys/kernel/warn_count
and /sys/kernel/oops_count but there is no userspace-accessible counter
for hard/soft lockups. Having this is useful for monitoring tools.
Link: https://lkml.kernel.org/r/20250504180831.4190860-1-max.kellermann@ionos.com
Link: https://lkml.kernel.org/r/20250504180831.4190860-2-max.kellermann@ionos.com
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Cc: Corey Minyard <cminyard@mvista.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The crash kernel will retrieve the dm crypt keys based on the dmcryptkeys
command line parameter. When user space writes the key description to
/sys/kernel/config/crash_dm_crypt_keys/restore, the crash kernel will save
the encryption keys to the user keyring. Then user space, e.g.
cryptsetup's --volume-key-keyring API, can use them to unlock the encrypted
device.
Link: https://lkml.kernel.org/r/20250502011246.99238-6-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When there are CPU and memory hot un/plugs, the dm crypt keys may need to
be reloaded depending on the solution for crash hotplug support.
Currently, there are two solutions. One is to utilize udev to instruct
user space to reload the kdump kernel image, initrd, elfcorehdr, etc.
The other is to only update the elfcorehdr segment, introduced in
commit 2472627561 ("crash: add generic infrastructure for crash hotplug
support").
For the 1st solution, the dm crypt keys need to be reloaded. User space
can write true to /sys/kernel/config/crash_dm_crypt_keys/reuse
so the stored keys can be re-used.
For the 2nd solution, the dm crypt keys don't need to be reloaded.
Currently, only x86 supports the 2nd solution. If the 2nd solution gets
extended to all arches, this patch can be dropped.
Link: https://lkml.kernel.org/r/20250502011246.99238-5-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the kdump kernel image and initrd are loaded, the dm crypt keys will
be read from the keyring and then stored in kdump reserved memory.
Assume a key won't exceed 256 bytes thus MAX_KEY_SIZE=256 according to
"cryptsetup benchmark".
Link: https://lkml.kernel.org/r/20250502011246.99238-4-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A configfs /sys/kernel/config/crash_dm_crypt_keys is provided for user
space to make the dm crypt keys persist for the kdump kernel. Take the
case of dumping to a LUKS-encrypted target as an example, here is the life
cycle of the kdump copies of LUKS volume keys,
1. After the 1st kernel loads the initramfs during boot, systemd uses
a user-input passphrase to decrypt the LUKS volume keys or simply
TPM-sealed volume keys and then saves the volume keys to the specified
keyring (using the --link-vk-to-keyring API) and the keys will expire
within the specified time.
2. A user space tool (kdump initramfs loader like kdump-utils) creates
key items inside /sys/kernel/config/crash_dm_crypt_keys to inform
the 1st kernel which keys are needed.
3. When the kdump initramfs is loaded by the kexec_file_load
syscall, the 1st kernel will iterate created key items, save the
keys to kdump reserved memory.
4. When the 1st kernel crashes and the kdump initramfs is booted, the
kdump initramfs asks the kdump kernel to create a user key using the
key stored in kdump reserved memory by writing yes to
/sys/kernel/config/crash_dm_crypt_keys/restore. Then the LUKS encrypted
device is unlocked with libcryptsetup's --volume-key-keyring API.
5. The system gets rebooted to the 1st kernel after dumping vmcore to
the LUKS encrypted device is finished.
Eventually the keys have to stay in the kdump reserved memory for the
kdump kernel to unlock encrypted volumes. During this process, some
measures like letting the keys expire within specified time are desirable
to reduce security risk.
This patch assumes,
1) there are 128 LUKS devices at maximum to be unlocked thus
MAX_KEY_NUM=128.
2) a key description won't exceed 128 bytes thus KEY_DESC_MAX_LEN=128.
And here is a demo on how to interact with
/sys/kernel/config/crash_dm_crypt_keys,
# Add key #1
mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720
# Add key #1's description
echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720/description
# how many keys do we have now?
cat /sys/kernel/config/crash_dm_crypt_keys/count
1
# Add key #2 in the same way
# how many keys do we have now?
cat /sys/kernel/config/crash_dm_crypt_keys/count
2
# the tree structure of /crash_dm_crypt_keys configfs
tree /sys/kernel/config/crash_dm_crypt_keys/
/sys/kernel/config/crash_dm_crypt_keys/
├── 7d26b7b4-e342-4d2d-b660-7426b0996720
│   └── description
├── count
├── fce2cd38-4d59-4317-8ce2-1fd24d52c46a
│   └── description
Link: https://lkml.kernel.org/r/20250502011246.99238-3-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Support kdump with LUKS encryption by reusing LUKS volume
keys", v9.
LUKS is the standard for Linux disk encryption, widely adopted by users,
and in some cases, such as Confidential VMs, it is a requirement. With
kdump enabled, when the first kernel crashes, the system can boot into the
kdump/crash kernel to dump the memory image (i.e., /proc/vmcore) to a
specified target. However, there are two challenges when dumping vmcore
to a LUKS-encrypted device:
- Kdump kernel may not be able to decrypt the LUKS partition. For some
machines, a system administrator may not have a chance to enter the
password to decrypt the device in kdump initramfs after the 1st kernel
crashes; For cloud confidential VMs, depending on the policy the
kdump kernel may not be able to unseal the keys with TPM and the
console virtual keyboard is untrusted.
- LUKS2 by default uses the memory-hard Argon2 key derivation function
which is quite memory-consuming compared to the limited memory reserved
for kdump. Take Fedora as an example: by default, only 256M is reserved
for systems having memory between 4G-64G. With LUKS enabled, ~1300M needs
to be reserved for kdump. Note that the memory reserved for kdump can't
be used by the 1st kernel, i.e. a user sees ~1300M memory missing in the
1st kernel.
Besides, users (at least for Fedora) usually expect kdump to work out of
the box, i.e. no manual password input or custom crashkernel value is
needed. And it doesn't make sense to derive the keys again in the kdump
kernel, which seems to be redundant work.
This patchset addresses the above issues by making the LUKS volume keys
persistent for kdump kernel with the help of cryptsetup's new APIs
(--link-vk-to-keyring/--volume-key-keyring). Here is the life cycle of
the kdump copies of LUKS volume keys,
1. After the 1st kernel loads the initramfs during boot, systemd
uses a user-input passphrase to decrypt the LUKS volume keys
or TPM-sealed key and then saves the volume keys to the specified keyring
(using the --link-vk-to-keyring API) and the key will expire within
the specified time.
2. A user space tool (kdump initramfs loader like kdump-utils) creates
key items inside /sys/kernel/config/crash_dm_crypt_keys to inform
the 1st kernel which keys are needed.
3. When the kdump initramfs is loaded by the kexec_file_load
syscall, the 1st kernel will iterate created key items, save the
keys to kdump reserved memory.
4. When the 1st kernel crashes and the kdump initramfs is booted, the
kdump initramfs asks the kdump kernel to create a user key using the
key stored in kdump reserved memory by writing yes to
/sys/kernel/config/crash_dm_crypt_keys/restore. Then the LUKS encrypted
device is unlocked with libcryptsetup's --volume-key-keyring API.
5. The system gets rebooted to the 1st kernel after dumping vmcore to
the LUKS encrypted device is finished.
After libcryptsetup saves the LUKS volume keys to the specified keyring,
whoever takes this should be responsible for the safety of these copies of
keys. The keys will be saved in the memory area exclusively reserved for
kdump where even the 1st kernel has no direct access. Furthermore,
two additional protections are added,
- save the copy randomly in kdump reserved memory as suggested by Jan
- clear the _PAGE_PRESENT flag of the page that stores the copy as
suggested by Pingfan
This patchset only supports x86. There will be patches to support other
architectures once this patch set gets merged.
This patch (of 9):
Currently, kexec_buf is placed in order which means for the same machine,
the info in the kexec_buf is always located at the same position each time
the machine is booted. This may cause a risk for sensitive information
like LUKS volume key. Now struct kexec_buf has a new field random which
indicates it's supposed to be placed in a random position.
Note this feature is enabled only when CONFIG_CRASH_DUMP is enabled. So
it only takes effect for kdump and won't impact kexec reboot.
Link: https://lkml.kernel.org/r/20250502011246.99238-1-coxu@redhat.com
Link: https://lkml.kernel.org/r/20250502011246.99238-2-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Suggested-by: Jan Pazdziora <jpazdziora@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is no reason to restrict scx_bpf_select_cpu_dfl() invocations to
ops.select_cpu() while allowing scx_bpf_select_cpu_and() to be used from
multiple contexts, as both provide equivalent functionality, with the
latter simply accepting an additional "allowed" cpumask.
Therefore, unify the two APIs, enabling both kfuncs to be used from
ops.select_cpu(), ops.enqueue(), and unlocked contexts (e.g., via BPF
test_run).
This allows schedulers to implement a consistent idle CPU selection
policy and helps reduce code duplication.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
&desc->lock is acquired on 2 consecutive lines in hwirq_show(). This
obviously leads to a deadlock. Drop the raw_spin_lock_irq() and keep guard().
Fixes: 5d964a9f7c ("genirq/irqdesc: Switch to lock guards")
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250521142541.3832130-1-claudiu.beznea.uj@bp.renesas.com
The PERF_RECORD_THROTTLE records are dumped for all throttled events.
It's not necessary for group events, which are throttled altogether.
Optimize it by only dumping the throttle log for the leader.
The sample right after the THROTTLE record must be generated by the
actual target event. It is good enough for the perf tool to locate the
actual target event.
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-3-kan.liang@linux.intel.com
The current throttle logic doesn't work well with a group, e.g., the
following sampling-read case.
$ perf record -e "{cycles,cycles}:S" ...
$ perf report -D | grep THROTTLE | tail -2
THROTTLE events: 426 ( 9.0%)
UNTHROTTLE events: 425 ( 9.0%)
$ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
... sample_read:
.... group nr 2
..... id 0000000000000327, value 000000000cbb993a, lost 0
..... id 0000000000000328, value 00000002211c26df, lost 0
The second cycles event has a much larger value than the first cycles
event in the same group.
The current throttle logic in the generic code only logs the THROTTLE
event. It relies on the specific driver implementation to disable
events. For all ARCHs, the implementation is similar. Only the event is
disabled, rather than the group.
The logic to disable the group should be generic for all ARCHs. Add the
logic in the generic code. The following patch will remove the buggy
driver-specific implementation.
The throttle only happens when an event is overflowed. Stop the entire
group when any event in the group triggers the throttle.
MAX_INTERRUPTS is set on all throttled events.
Unthrottling can happen in 3 places.
- event/group sched. All events in the group are scheduled one by one.
All of them will be unthrottled eventually. Nothing needs to be
changed.
- The perf_adjust_freq_unthr_events for each tick. Needs to restart the
group altogether.
- The __perf_event_period(). The whole group needs to be restarted
altogether as well.
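The group-wide behaviour described above can be modelled in a few lines of
userspace C. The sk_* structures and helpers are hypothetical and far
simpler than struct perf_event; the sketch only shows that throttling and
unthrottling act on every member of the group rather than on the single
event that overflowed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_GROUP 8

struct sk_event {
    bool stopped;   /* counter is not running */
    bool throttled; /* stands in for "interrupts == MAX_INTERRUPTS" */
};

struct sk_group {
    struct sk_event events[SK_MAX_GROUP]; /* events[0] is the leader */
    size_t nr;
};

/* Any member hitting the throttle limit stops the whole group. */
void sk_group_throttle(struct sk_group *g)
{
    assert(g->nr > 0 && g->nr <= SK_MAX_GROUP);
    for (size_t i = 0; i < g->nr; i++) {
        g->events[i].stopped = true;
        g->events[i].throttled = true;
    }
}

/* The tick and period-change paths restart the group as a unit. */
void sk_group_unthrottle(struct sk_group *g)
{
    assert(g->nr > 0 && g->nr <= SK_MAX_GROUP);
    for (size_t i = 0; i < g->nr; i++) {
        g->events[i].throttled = false;
        g->events[i].stopped = false;
    }
}
```

Because all members stop and restart together, sibling counters cover the
same time window, which is why the two values match in the output below.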
With the fix,
$ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
... sample_read:
.... group nr 2
..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0
Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-2-kan.liang@linux.intel.com
The kerneldoc for futex_wait_setup() states it can return "0" or "<1".
This isn't true because the error case is "<0" not less than 1.
Document that <0 is returned on error. Drop the possible return values
and state possible reasons.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20250517151455.1065363-6-bigeasy@linutronix.de
Commit dfa0a574cb ("sched/uclamg: Handle delayed dequeue")
added the sched_delayed check to prevent double uclamp_dec/inc.
However, it put uclamp_rq_inc() after enqueue_task().
This may lead to the following issues:
When a task with uclamp goes through enqueue_task() and could trigger a
cpufreq update, its uclamp won't even be considered in the cpufreq
update. Only after enqueue is the uclamp added to the rq buckets, and
cpufreq will only pick it up at the next update.
This could cause a delay in frequency updating. It may affect
performance (uclamp_min > 0) or power (uclamp_max < 1024).
So, just like util_est, put uclamp_rq_inc() before enqueue_task().
As for sched_delayed tasks, same as util_est: use the sched_delayed
flag to avoid incrementing a sched_delayed task's uclamp, and use the
ENQUEUE_DELAYED flag to allow incrementing the uclamp of a sched_delayed
task that is being woken up.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20250417043457.10632-3-xuewen.yan@unisoc.com
To prevent double enqueue/dequeue of the util-est for sched_delayed tasks,
commit 729288bc68 ("kernel/sched: Fix util_est accounting for DELAY_DEQUEUE")
added the corresponding check. This check excludes double en/dequeue during
task migration and priority changes.
In fact, these conditions can be simplified.
For util_est_dequeue, we know that sched_delayed flag is set in dequeue_entity.
When the task is sleeping, we need to call util_est_dequeue to subtract
util-est from the cfs_rq. At this point, sched_delayed has not yet been set.
If we find that sched_delayed is already set, it indicates that this task
has already called dequeue_task_fair once. In this case, there is no need to
call util_est_dequeue again. Therefore, simply checking the sched_delayed flag
should be sufficient to prevent unnecessary util_est updates during the dequeue.
For util_est_enqueue, our goal is to add the util_est to the cfs_rq
when the task is enqueued. However, we don't want to add the util_est of a
sched_delayed task to the cfs_rq because the task is sleeping.
Therefore, we can exclude the util_est_enqueue for sched_delayed tasks
by checking the sched_delayed flag. However, when waking up a delayed task,
the sched_delayed flag is cleared after util_est_enqueue. As a result,
if we only check the sched_delayed flag, we would miss the util_est_enqueue.
Since waking up a sched_delayed task calls enqueue_task with the ENQUEUE_DELAYED flag,
we can determine whether to call util_est_enqueue by checking if the
enqueue_flag contains ENQUEUE_DELAYED.
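The two simplified conditions can be written down as predicates. This is a
userspace sketch: struct sk_task, the SK_ENQUEUE_DELAYED value and the
function names are hypothetical, and only the boolean logic follows the
reasoning above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_ENQUEUE_DELAYED 0x1 /* hypothetical value for this sketch */

struct sk_task {
    bool sched_delayed; /* task is asleep but still kept on the runqueue */
};

/*
 * Enqueue side: skip tasks that are still marked delayed, unless this
 * enqueue is the wakeup of such a task (the flag is cleared only later).
 */
bool sk_should_util_est_enqueue(const struct sk_task *p, int flags)
{
    assert(p != NULL);
    return !p->sched_delayed || (flags & SK_ENQUEUE_DELAYED);
}

/*
 * Dequeue side: if the flag is already set, the task went through the
 * dequeue path once and its util_est has been subtracted already.
 */
bool sk_should_util_est_dequeue(const struct sk_task *p)
{
    assert(p != NULL);
    return !p->sched_delayed;
}
```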
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20250417043457.10632-2-xuewen.yan@unisoc.com
The delayed dequeue feature keeps a sleeping task enqueued until its
lag has elapsed. As a result, it also stays visible in rq->nr_running.
So in wake_affine_idle(), we should use the number of really running
tasks in the rq to check whether we should place the woken task on the
current CPU.
Also add a helper function to return the number of delayed tasks.
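As a rough illustration, the "real" running count is the visible count minus
the delayed tasks. struct sk_rq and sk_nr_really_running() are hypothetical
names for this sketch and do not mirror the kernel's runqueue layout:

```c
#include <assert.h>
#include <stddef.h>

struct sk_rq {
    unsigned int nr_running; /* includes delayed-dequeue tasks */
    unsigned int nr_delayed; /* sleeping tasks still kept enqueued */
};

/* Number of tasks that can actually run, ignoring delayed ones. */
unsigned int sk_nr_really_running(const struct sk_rq *rq)
{
    assert(rq != NULL && rq->nr_delayed <= rq->nr_running);
    return rq->nr_running - rq->nr_delayed;
}
```

A runqueue showing three tasks of which two are delayed has only one task
that is really running, which changes the outcome of a check based on the
raw nr_running.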
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-and-tested-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250303105241.17251-2-xuewen.yan@unisoc.com
Merge tag 'v6.15-p7' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes a regression in padata as well as an ancient double-free
bug in af_alg"
* tag 'v6.15-p7' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: algif_hash - fix double free in hash_accept
padata: do not leak refcount in reorder_work
Allow scx_bpf_select_cpu_and() to be used from an unlocked context, in
addition to ops.enqueue() or ops.select_cpu().
This enables schedulers, including user-space ones, to implement a
consistent idle CPU selection policy and helps reduce code duplication.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Validate locking correctness when accessing p->nr_cpus_allowed and
p->cpus_ptr inside scx_bpf_select_cpu_and(): if the rq lock is held,
access is safe; otherwise, require that p->pi_lock is held.
This makes it possible to catch potentially unsafe calls to
scx_bpf_select_cpu_and().
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Relocate scx_kf_allowed_if_unlocked() so that it can be used from other
source files (e.g., ext_idle.c).
No functional change.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are a few places where a conditional check is performed to validate a
given css on its rstat participation. This new helper tries to make the
code more readable where this check is performed.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
It is possible to eliminate contention between subsystems when
updating/flushing stats by using subsystem-specific locks. Let the existing
rstat locks be dedicated to the cgroup base stats and rename them to
reflect that. Add similar locks to the cgroup_subsys struct for use with
individual subsystems.
Lock initialization is done in the new function ss_rstat_init(ss) which
replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
global base stat locks will be initialized. Otherwise, the subsystem locks
will be initialized.
Change the existing lock helper functions to accept a reference to a css.
Then within these functions, conditionally select the appropriate locks
based on the subsystem affiliation of the given css. Add helper functions
for this selection routine to avoid repeated code.
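The lock selection described above can be pictured with a small userspace
model. This is only a sketch under stated assumptions: the model_* names are
hypothetical, a plain struct stands in for the real locks, and a NULL ss
pointer is assumed to mark a css that belongs to the cgroup base stats.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's types. */
struct model_lock {
	int id;
};

struct model_subsys {
	struct model_lock rstat_lock;	/* per-subsystem lock */
};

struct model_css {
	struct model_subsys *ss;	/* assumed NULL for the base stats */
};

/* Lock dedicated to the cgroup base stats. */
static struct model_lock model_base_lock = { .id = 0 };

/* Pick the lock that matches the subsystem affiliation of @css. */
static struct model_lock *model_rstat_lock(struct model_css *css)
{
	if (!css->ss)
		return &model_base_lock;
	return &css->ss->rstat_lock;
}
```

Keeping the selection in one helper means that every lock/unlock wrapper
makes the same decision for a given css.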
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Different subsystems may call cgroup_rstat_updated() within the same
cgroup, resulting in a tree of pending updates from multiple subsystems.
When one of these subsystems is flushed via cgroup_rstat_flush(), all
other subsystems with pending updates on the tree will also be flushed.
Change the paradigm of having a single rstat tree for all subsystems to
having separate trees for each subsystem. This separation allows for
subsystems to perform flushes without the side effects of other subsystems.
As an example, flushing the cpu stats will no longer cause the memory stats
to be flushed and vice versa.
In order to achieve subsystem-specific trees, change the tree node type
from cgroup to cgroup_subsys_state pointer. Then remove those pointers from
the cgroup and instead place them on the css. Finally, change update/flush
functions to make use of the different node type (css). These changes allow
a specific subsystem to be associated with an update or flush. Separate
rstat trees will now exist for each unique subsystem.
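As a rough illustration of the effect, the following userspace model reduces
each per-subsystem tree to a single "pending" flag. All names are
hypothetical and the real trees are per-CPU structures; the sketch only shows
that a flush keyed by subsystem leaves the other subsystems untouched.

```c
#include <stdbool.h>

enum model_ssid { MODEL_CPU, MODEL_MEMORY, MODEL_NR_SS };

/* One "tree" of pending updates per subsystem, reduced to a flag. */
struct model_cgroup {
	bool pending[MODEL_NR_SS];
};

/* Record that @ssid has pending stats on @cgrp. */
static void model_rstat_updated(struct model_cgroup *cgrp, enum model_ssid ssid)
{
	cgrp->pending[ssid] = true;
}

/* Flush only the given subsystem and leave the others as they are. */
static void model_rstat_flush(struct model_cgroup *cgrp, enum model_ssid ssid)
{
	cgrp->pending[ssid] = false;
}
```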
Since updating/flushing will now be done at the subsystem level, there is
no longer a need to keep track of updated css nodes at the cgroup level.
The list management of these nodes done within the cgroup (rstat_css_list
and related) has been removed accordingly.
Conditional guards for checking validity of a given css were placed within
css_rstat_updated/flush() to prevent undefined behavior occurring from kfunc
usage in bpf programs. Guards were also placed within css_rstat_init/exit()
in order to help consolidate calls to them. At call sites for all four
functions, the existing guards were removed.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Adjust the implementation of css_is_cgroup() so that it compares the given
css to cgroup::self. Rename the function to css_is_self() in order to
reflect that. Change the existing css->ss NULL check to a warning in the
true branch. Finally, adjust call sites to use the new function name.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
An early init subsystem that attempts to make use of rstat can lead to
failures during early boot. The reason for this is the timing in which the
css's of the root cgroup have css_online() invoked on them. At the point of
this call, there is a stated assumption that a cgroup has "successfully
completed all allocations" [0]. An example of a subsystem that relies on
the previously mentioned assumption [0] is the memory subsystem. Within its
implementation of css_online(), work is queued to asynchronously begin
flushing via rstat. In the early init path for a given subsystem, having
rstat enabled leads to this sequence:
cgroup_init_early()
    for_each_subsys(ss, ssid)
        if (ss->early_init)
            cgroup_init_subsys(ss, true)
cgroup_init_subsys(ss, early_init)
    css = ss->css_alloc(...)
    init_and_link_css(css, ss, ...)
    ...
    online_css(css)
online_css(css)
    ss = css->ss
    ss->css_online(css)
Continuing to use the memory subsystem as an example, the issue with this
sequence is that css_rstat_init() has not been called yet. This means there
is now a race between the pending async work to flush rstat and the call to
css_rstat_init(). So a flush can occur within the given cgroup while the
rstat fields are not initialized.
Since we are in the early init phase, the rstat fields cannot be
initialized because they require per-cpu allocations. So it's not possible
to have css_rstat_init() called early enough (before online_css()). This
patch treats the combination of early init and rstat the same as other
invalid conditions.
[0] Documentation/admin-guide/cgroup-v1/cgroups.rst (section: css_online)
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Throughout the verifier's logic, there are multiple checks for
inconsistent states that should never happen and would indicate a
verifier bug. These bugs are typically logged in the verifier logs and
sometimes preceded by a WARN_ONCE.
This patch reworks these checks to consistently emit a verifier log AND
a warning when CONFIG_DEBUG_KERNEL is enabled. The consistent use of
WARN_ONCE should help fuzzers (ex. syzkaller) expose any situation
where they are actually able to reach one of those buggy verifier
states.
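The "always log, warn once" behaviour can be sketched in userspace as below.
This is not the verifier's actual helper: the model_* names, the counters,
and the model_debug_kernel flag (standing in for CONFIG_DEBUG_KERNEL) are
assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

static const bool model_debug_kernel = true;	/* stands in for CONFIG_DEBUG_KERNEL */
static int model_log_lines;	/* entries written to the "verifier log" */
static int model_warnings;	/* warnings emitted, at most one */

/*
 * Report an inconsistent state: always log it, warn only the first time.
 * Returns @cond so that callers can bail out on the same expression.
 */
static bool model_bug_if(bool cond, const char *what)
{
	static bool warned;

	if (!cond)
		return false;

	model_log_lines++;
	fprintf(stderr, "verifier bug: %s\n", what);

	if (model_debug_kernel && !warned) {
		warned = true;
		model_warnings++;
	}
	return true;
}
```

A caller would typically write something like
"if (model_bug_if(state_is_bad, "bad state")) return -1;", so the check and
the report cannot drift apart.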
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/aCs1nYvNNMq8dAWP@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A recent patch that addressed a UAF introduced a reference count leak:
the parallel_data refcount is incremented unconditionally, regardless
of the return value of queue_work(). If the work item is already queued,
the incremented refcount is never decremented.
Fix this by checking the return value of queue_work() and decrementing
the refcount when necessary.
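The pattern of the fix can be sketched with a userspace model. The types and
the model_* functions are hypothetical stand-ins, not the padata code; the
only property borrowed from the kernel is that queue_work() returns false
when the work item was already queued.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the work item and the refcounted object. */
struct model_work {
	bool pending;
};

struct model_pd {
	int refcnt;
	struct model_work reorder_work;
};

/* Mimics queue_work(): returns false if the work was already queued. */
static bool model_queue_work(struct model_work *w)
{
	if (w->pending)
		return false;
	w->pending = true;
	return true;
}

/* Take a reference for the work item; drop it again if nothing was queued. */
static void model_schedule_reorder(struct model_pd *pd)
{
	pd->refcnt++;
	if (!model_queue_work(&pd->reorder_work))
		pd->refcnt--;
}
```

Calling model_schedule_reorder() twice in a row leaves exactly one extra
reference, which matches the single queued work item that will later drop it.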
Resolves:
Unreferenced object 0xffff9d9f421e3d80 (size 192):
comm "cryptomgr_probe", pid 157, jiffies 4294694003
hex dump (first 32 bytes):
80 8b cf 41 9f 9d ff ff b8 97 e0 89 ff ff ff ff ...A............
d0 97 e0 89 ff ff ff ff 19 00 00 00 1f 88 23 00 ..............#.
backtrace (crc 838fb36):
__kmalloc_cache_noprof+0x284/0x320
padata_alloc_pd+0x20/0x1e0
padata_alloc_shell+0x3b/0xa0
0xffffffffc040a54d
cryptomgr_probe+0x43/0xc0
kthread+0xf6/0x1f0
ret_from_fork+0x2f/0x50
ret_from_fork_asm+0x1a/0x30
Fixes: dd7d37ccf6 ("padata: avoid UAF for reorder_work")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dominik Grzegorzek <dominik.grzegorzek@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The text_size bit referred to by the comment has been removed as of commit
ac3b432839 ("module: replace module_layout with module_memory")
and is thus no longer relevant. Remove it and comment about the contents of
the masks array instead.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20250429113242.998312-23-vschneid@redhat.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Section .static_call_sites holds data structures that need to be sorted and
processed only at module load time. This initial processing happens in
static_call_add_module(), which is invoked as a callback to the
MODULE_STATE_COMING notification from prepare_coming_module().
The section is never modified afterwards. Make it therefore read-only after
module initialization to avoid any (non-)accidental modifications.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250306131430.7016-4-petr.pavlu@suse.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Move the logic to mark special sections as read-only after module
initialization into a separate function, alongside other related code in
strict_rwx.c. Use a table with names of such sections to make it easier to
add more.
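A name table of this kind might look like the following userspace sketch.
The table contents and the model_* names are illustrative assumptions, not a
copy of the code in strict_rwx.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative table; the authoritative list lives in the kernel source. */
static const char *const model_ro_after_init[] = {
	".data..ro_after_init",
	".static_call_sites",
};

/* Return true if @secname should become read-only after module init. */
static bool model_is_ro_after_init(const char *secname)
{
	size_t i;
	size_t n = sizeof(model_ro_after_init) / sizeof(model_ro_after_init[0]);

	for (i = 0; i < n; i++) {
		if (strcmp(secname, model_ro_after_init[i]) == 0)
			return true;
	}
	return false;
}
```

Supporting another section then only requires one more string in the table.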
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250306131430.7016-3-petr.pavlu@suse.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Minor cleanup, this is a non-functional change.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250306131430.7016-2-petr.pavlu@suse.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Merge tag 'mm-hotfixes-stable-2025-05-17-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"Nine singleton hotfixes, all MM. Four are cc:stable"
* tag 'mm-hotfixes-stable-2025-05-17-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: userfaultfd: correct dirty flags set for both present and swap pte
zsmalloc: don't underflow size calculation in zs_obj_write()
mm/page_alloc: fix race condition in unaccepted memory handling
mm/page_alloc: ensure try_alloc_pages() plays well with unaccepted memory
MAINTAINERS: add mm GUP section
mm/codetag: move tag retrieval back upfront in __free_pages()
mm/memory: fix mapcount / refcount sanity check for mTHP reuse
kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
mm: hugetlb: fix incorrect fallback for subpool
Add a helper to check if an event is in freq mode to improve readability.
No functional changes.
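Such a helper is essentially a named predicate. The sketch below uses a
reduced, hypothetical attribute struct and a made-up function name; it
assumes that frequency mode means the freq flag is set and a non-zero
sample_freq was requested.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced, hypothetical copy of the relevant attribute fields. */
struct model_event_attr {
	bool freq;		/* sample_freq is used instead of a period */
	uint64_t sample_freq;	/* requested samples per second */
};

/* True when the event samples by frequency rather than by fixed period. */
static bool model_event_in_freq_mode(const struct model_event_attr *attr)
{
	return attr->freq && attr->sample_freq;
}
```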
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250516182853.2610284-2-kan.liang@linux.intel.com
`pr_cont()` unfortunately does not work here, as other parts of the
Linux kernel log between the two log lines:
[18445.295056] r8152-cfgselector 4-1.1.3: USB disconnect, device number 5
[18445.295112] OOM killer enabled.
[18445.295115] Restarting tasks ...
[18445.295185] usb 3-1: USB disconnect, device number 2
[18445.295193] usb 3-1.1: USB disconnect, device number 3
[18445.296262] usb 3-1.5: USB disconnect, device number 4
[18445.297017] done.
[18445.297029] random: crng reseeded on system resumption
`pr_cont()` also uses the default log level, normally warning, if the
corresponding log line is interrupted.
Therefore, replace the `pr_cont()`, and explicitly log it as a separate
line with log level info:
Restarting tasks: Starting
[…]
Restarting tasks: Done
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://patch.msgid.link/20250511174648.950430-1-pmenzel@molgen.mpg.de
[ rjw: Rebase on top of an earlier analogous change ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Creating an irq domain that serves as an MSI parent requires
a substantial amount of esoteric boiler-plate code, some of
which is often provided twice (such as the bus token).
To make things a bit simpler for the unsuspecting MSI tinkerer,
provide a helper that does it for them, and serves as documentation
of what needs to be provided.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250513172819.2216709-3-maz@kernel.org
irqdomain.c's kernel-doc exists, but is not plugged into Documentation/
yet.
Before plugging it in, fix it first: irq_domain_get_irq_data() and
irq_domain_set_info() were documented twice, identically, by the
definitions for both CONFIG_IRQ_DOMAIN_HIERARCHY and
!CONFIG_IRQ_DOMAIN_HIERARCHY.
Therefore, switch the second kernel-doc into an ordinary comment -- change
"/**" to simple "/*". This avoids sphinx's: WARNING: Duplicate C
declaration
Next, in commit b7b377332b ("irqdomain: Fix the kernel-doc and plug it
into Documentation"), irqdomain.h's (header) kernel-doc was added into
core-api/genericirq.rst. But given the amount of irqdomain functions and
structures, move all these to core-api/irq/irq-domain.rst now.
Finally, add these newly fixed irqdomain.c's (source) docs there as
well.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-58-jirislaby@kernel.org
Most irq_domain_add_*() functions are unused now, so drop them. The
remaining ones are moved to the deprecated section and will be removed
during the merge window after the patches in various trees have been
merged.
Note: The Chinese docs are touched but unfinished. I cannot parse those.
[ tglx: Remove the leftover in irq-domain.rst and handle merge logistics ]
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-41-jirislaby@kernel.org
There is no reason to export the function as an extra symbol. It is
simple enough and is just a wrapper to already exported functions.
Therefore, switch the exported function to an inline.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-13-jirislaby@kernel.org
All uses of of_node_to_fwnode() in non-irqdomain code were changed to
"officially" defined of_fwnode_handle(). Therefore, the former can be
dropped along with the last uses in the irqdomain code.
Due to merge logistics the inline cannot be dropped immediately. Move it to
a deprecated section, which will be removed during the merge window.
[ tglx: Handle merge logistics ]
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-12-jirislaby@kernel.org
Currently, ->gpwrap is not tested (at all, per my testing) due to the
requirement of a large delta between a CPU's rdp->gp_seq and its node's
rnp->gp_seq.
This results in no testing of ->gpwrap being set. This patch by default
adds 5 minutes of testing with ->gpwrap forced by lowering the delta
between rdp->gp_seq and rnp->gp_seq to just 8 GPs. All of this is
configurable, including the active time for the setting and a full
testing cycle.
By default, the first 25 minutes of a test will have the current
_default_ behavior (a delta of ULONG_MAX / 4). Then for 5 minutes,
we switch to a smaller delta causing 1-2 wraps in 5 minutes. I believe
this is reasonable since we at least add a little bit of testing for
usecases where ->gpwrap is set.
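The role of the delta can be sketched with a wrap-safe comparison on unsigned
counters. The function names and the lag parameter are hypothetical; the lag
is expressed in raw counter units, and how it maps to a number of grace
periods is deliberately left out.

```c
#include <limits.h>
#include <stdbool.h>

/* Wrap-safe "a < b" for unsigned long sequence counters. */
static bool model_seq_lt(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 < a - b;
}

/* True when the node's counter is more than @lag ahead of the CPU's. */
static bool model_gpwrap_needed(unsigned long rdp_seq, unsigned long rnp_seq,
				unsigned long lag)
{
	return model_seq_lt(rdp_seq + lag, rnp_seq);
}
```

With a lag of ULONG_MAX / 4 the condition is practically unreachable in a
test run, whereas a small lag makes it trigger as soon as the CPU falls a
few counter steps behind its node.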
[ Apply fix for Dan Carpenter's bug report on init path cleanup. ]
[ Apply kernel doc warning fix from Akira Yokosawa. ]
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
The rcu_torture_reader() and rcu_torture_fwd_prog_cr() functions
run CPU-bound for extended periods of time (tens or even
hundreds of milliseconds), so they invoke tick_dep_set_task() and
tick_dep_clear_task() to ensure that the scheduling-clock tick helps
move grace periods forward.
So why doesn't rcu_torture_fwd_prog_nr() also invoke tick_dep_set_task()
and tick_dep_clear_task()? Because the point of this function is to test
RCU's ability to (eventually) force grace periods forward even when the
tick has been disabled during long CPU-bound kernel execution.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
For kernels built with CONFIG_PROVE_RCU=y and CONFIG_PREEMPT_RT=y,
disabling BH does not change the SOFTIRQ bits in preempt_count() but
instead changes current->softirq_disable_cnt. This results in the
following splat:
WARNING: suspicious RCU usage
kernel/rcu/tree_plugin.h:36 Unsafe read of RCU_NOCB offloaded state!
stack backtrace:
CPU: 0 UID: 0 PID: 22 Comm: rcuc/0
Call Trace:
[ 0.407907] <TASK>
[ 0.407910] dump_stack_lvl+0xbb/0xd0
[ 0.407917] dump_stack+0x14/0x20
[ 0.407920] lockdep_rcu_suspicious+0x133/0x210
[ 0.407932] rcu_rdp_is_offloaded+0x1c3/0x270
[ 0.407939] rcu_core+0x471/0x900
[ 0.407942] ? lockdep_hardirqs_on+0xd5/0x160
[ 0.407954] rcu_cpu_kthread+0x25f/0x870
[ 0.407959] ? __pfx_rcu_cpu_kthread+0x10/0x10
[ 0.407966] smpboot_thread_fn+0x34c/0xa50
[ 0.407970] ? trace_preempt_on+0x54/0x120
[ 0.407977] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 0.407982] kthread+0x40e/0x840
[ 0.407990] ? __pfx_kthread+0x10/0x10
[ 0.407994] ? rt_spin_unlock+0x4e/0xb0
[ 0.407997] ? rt_spin_unlock+0x4e/0xb0
[ 0.408000] ? __pfx_kthread+0x10/0x10
[ 0.408006] ? __pfx_kthread+0x10/0x10
[ 0.408011] ret_from_fork+0x40/0x70
[ 0.408013] ? __pfx_kthread+0x10/0x10
[ 0.408018] ret_from_fork_asm+0x1a/0x30
[ 0.408042] </TASK>
Currently, triggering an rdp offloaded state change requires the
corresponding rdp's CPU to go offline, at which point the rcuc kthread
is already parked. This means that the rcuc kthread can safely read the
offloaded state of its rdp while the corresponding CPU is online.
This commit therefore adds a softirq_count() check for Preempt-RT
kernels.
Suggested-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
It is safer to use kcalloc() because it guards against overflow in the
size calculation.
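The kind of check that an array allocator adds over an open-coded
multiplication can be sketched in userspace as follows. The function
model_zalloc_array() is a hypothetical analogue built on malloc(); it is not
the kernel implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a zeroed array of @n elements of @size bytes each.
 * Returns NULL instead of a short buffer when n * size would overflow.
 */
static void *model_zalloc_array(size_t n, size_t size)
{
	void *p;

	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size does not fit in size_t */

	p = malloc(n * size);
	if (p)
		memset(p, 0, n * size);
	return p;
}
```

An open-coded malloc(n * size) would silently wrap for a huge n and return a
buffer that is far smaller than the caller expects.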
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
This reverts commit f7345ccc62.
swake_up_one_online() has been removed because hrtimers can now assign
a proper online target to hrtimers queued from offline CPUs. Therefore
remove the related hackery.
Link: https://lore.kernel.org/all/20241231170712.149394-4-frederic@kernel.org/
Reviewed-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
It's now ok to perform a wake-up from an offline CPU because the
resulting armed scheduler bandwidth hrtimers are now correctly targeted
by hrtimer infrastructure.
Remove the obsolete hackery.
Link: https://lore.kernel.org/all/20241231170712.149394-3-frederic@kernel.org/
Reviewed-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Fix those:
./kernel/futex/futex.h:208: warning: Function parameter or struct member 'drop_hb_ref' not described in 'futex_q'
./kernel/futex/waitwake.c:343: warning: expecting prototype for futex_wait_queue(). Prototype was for futex_do_wait() instead
./kernel/futex/waitwake.c:594: warning: Function parameter or struct member 'task' not described in 'futex_wait_setup'
Fixes: 93f1b6d79a ("futex: Move futex_queue() into futex_wait_setup()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250512185641.0450a99b@canb.auug.org.au # report
Link: https://lore.kernel.org/r/20250515171641.24073-1-bp@kernel.org # submission
perf always allocates contiguous AUX pages based on aux_watermark.
However, this contiguous allocation doesn't benefit all PMUs. For
instance, ARM SPE and TRBE operate with virtual pages, and Coresight
ETR allocates a separate buffer. For these PMUs, allocating contiguous
AUX pages unnecessarily exacerbates memory fragmentation. This
fragmentation can prevent their use on long-running devices.
This patch modifies the perf driver to be memory-friendly by default,
by allocating non-contiguous AUX pages. For PMUs requiring contiguous
pages (Intel BTS and some Intel PT), the existing
PERF_PMU_CAP_AUX_NO_SG capability can be used. For PMUs that don't
require but can benefit from contiguous pages (some Intel PT), a new
capability, PERF_PMU_CAP_AUX_PREFER_LARGE, is added to maintain their
existing behavior.
Signed-off-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250508232642.148767-1-yabinc@google.com
Affinity-managed interrupts can be shut down and restarted during CPU
hotunplug/plug. Thereby the interrupt may be left in an unexpected state.
Specifically:
1. Interrupt is affine to CPU N
2. disable_irq() -> depth is 1
3. CPU N goes offline
4. irq_shutdown() -> depth is set to 1 (again)
5. CPU N goes online
6. irq_startup() -> depth is set to 0 (BUG! driver expects that the interrupt
is still disabled)
7. enable_irq() -> depth underflow / unbalanced enable_irq() warning
This is only a problem for managed interrupts and CPU hotplug. All other
cases, such as request()/free()/request(), truly need to reset a possibly
stale disable depth value.
Provide a startup function that takes the disable depth into account, and
invoke it for the managed interrupts in the CPU hotplug path.
This requires changing irq_shutdown() to do a depth increment instead of
setting it to 1, which retains the disable depth but is harmless for the
other code paths using irq_startup(), which will still reset the disable
depth unconditionally to keep the original correct behaviour.
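The depth bookkeeping in the scenario above can be replayed with a small
userspace model. Everything here is a hypothetical simplification: the
model_* names are invented, a depth of 0 is taken to mean "enabled", and the
managed startup is modelled as undoing only the increment done at shutdown.

```c
/* Hypothetical model of an interrupt descriptor's disable depth. */
struct model_irq {
	unsigned int depth;	/* 0 means enabled */
};

static void model_disable_irq(struct model_irq *irq)
{
	irq->depth++;
}

/* Returns -1 on an unbalanced enable, 0 otherwise. */
static int model_enable_irq(struct model_irq *irq)
{
	if (irq->depth == 0)
		return -1;
	irq->depth--;
	return 0;
}

/* Old shutdown: overwrite the depth, losing any earlier disable. */
static void model_shutdown_old(struct model_irq *irq)
{
	irq->depth = 1;
}

/* New shutdown: increment, so an earlier disable is retained. */
static void model_shutdown_new(struct model_irq *irq)
{
	irq->depth++;
}

/* Generic startup: reset the depth unconditionally. */
static void model_startup(struct model_irq *irq)
{
	irq->depth = 0;
}

/* Hotplug path for managed interrupts: undo only the shutdown increment. */
static void model_startup_managed(struct model_irq *irq)
{
	if (irq->depth)
		irq->depth--;
}
```

Replaying disable, shutdown, startup, enable with the old helpers ends in an
unbalanced enable, while the new shutdown and managed startup leave the
interrupt disabled until the driver's own enable call.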
KUnit tests will be added separately to cover some of these aspects.
[ tglx: Massaged changelog ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250514201353.3481400-2-briannorris@chromium.org
GCC is not happy about a sprintf() call on a buffer that might be too small
for the given formatting string.
kernel/irq/debugfs.c:233:26: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=]
Fix this by bumping the size of the local variable for sprintf().
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250515085516.2913290-1-andriy.shevchenko@linux.intel.com
Closes: https://lore.kernel.org/oe-kbuild-all/202505151057.xbyXAbEn-lkp@intel.com/
Merge tag 'kbuild-fixes-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Add proper pahole version dependency to CONFIG_GENDWARFKSYMS to avoid
module loading errors
- Fix UAPI header tests for the OpenRISC architecture
- Add dependency on the libdw package in Debian and RPM packages
- Disable -Wdefault-const-init-unsafe warnings on Clang
- Make "make clean ARCH=um" also clean the arch/x86/ directory
- Revert the use of -fmacro-prefix-map=, which causes issues with
debugger usability
* tag 'kbuild-fixes-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: fix typos "module.builtin" to "modules.builtin"
Revert "kbuild, rust: use -fremap-path-prefix to make paths relative"
Revert "kbuild: make all file references relative to source root"
kbuild: fix dependency on sorttable
init: remove unused CONFIG_CC_CAN_LINK_STATIC
um: let 'make clean' properly clean underlying SUBARCH as well
kbuild: Disable -Wdefault-const-init-unsafe
kbuild: rpm-pkg: Add (elfutils-devel or libdw-devel) to BuildRequires
kbuild: deb-pkg: Add libdw-dev:native to Build-Depends-Arch
usr/include: openrisc: don't HDRTEST bpf_perf_event.h
kbuild: Require pahole <v1.28 or >v1.29 with GENDWARFKSYMS on X86
There is currently some confusion in the s390x JIT regarding whether
orig_call can be NULL and what that means. Originally the NULL value
was used to distinguish the struct_ops case, but this was superseded by
BPF_TRAMP_F_INDIRECT (see commit 0c970ed2f8 ("s390/bpf: Fix indirect
trampoline generation")).
The remaining reason to have this check is that NULL can actually be
passed to the arch_bpf_trampoline_size() call, but not to the
respective arch_prepare_bpf_trampoline() call, by
bpf_struct_ops_prepare_trampoline().
Remove this asymmetry by passing stub_func to both functions, so that
JITs may rely on orig_call never being NULL.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250512221911.61314-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix sample code that uses trace_array_printk()
The sample code for in kernel use of trace_array (that creates an
instance for use within the kernel) and shows how to use
trace_array_printk() that writes into the created instance, used
trace_printk_init_buffers(). But that function is used to initialize
normal trace_printk() and produces the NOTICE banner which is not
needed for use of trace_array_printk(). The function to initialize
that is trace_array_init_printk() that takes the created trace array
instance as a parameter.
Update the sample code to reflect the proper usage.
- Fix preemption count output for stacktrace event
The tracing buffer shows the preempt count level when an event
executes. Because writing the event itself disables preemption, this
needs to be accounted for when recording. The stacktrace event did
not account for this so the output of the stacktrace event showed
preemption was disabled while the event that triggered the stacktrace
shows preemption is enabled and this leads to confusion. Account for
preemption being disabled for the stacktrace event.
The same happened for stack traces triggered by function tracer.
- Fix persistent ring buffer when trace_pipe is used
The ring buffer swaps the reader page with the next page to read from
the write buffer when trace_pipe is used. If there's only a page of
data in the ring buffer, this swap will cause the "commit" pointer
(last data written) to be on the reader page. If more data is written
to the buffer, it is added to the reader page until it falls off back
into the write buffer.
If the system reboots and the commit pointer is still on the reader
page, even if new data was written, the persistent buffer validator
will miss finding the commit pointer because it only checks the write
buffer and does not check the reader page. This causes the validator
to fail the validation and clear the buffer, where the new data is
lost.
There was a check for this, but it checked the "head pointer", which
was incorrect, because the "head pointer" always stays on the write
buffer and is the next page to swap out for the reader page. Fix the
logic to catch this case and allow the user to still read the data
after reboot.
* tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Fix persistent buffer when commit page is the reader page
ftrace: Fix preemption accounting for stacktrace filter command
ftrace: Fix preemption accounting for stacktrace trigger command
tracing: samples: Initialize trace_array_printk() with the correct function
The ring buffer is made up of sub buffers (sometimes called pages as they
are by default PAGE_SIZE). It has the following "pages":
  "tail page"   - this is the page that the next write will write to
  "head page"   - this is the page that the reader will swap the reader page with.
  "reader page" - This belongs to the reader, where it will swap the head
                  page from the ring buffer so that the reader does not
                  race with the writer.
The writer may end up on the "reader page" if the ring buffer hasn't
written more than one page, where the "tail page" and the "head page" are
the same.
The persistent ring buffer has meta data that points to where these pages
exist so on reboot it can re-create the pointers to the cpu_buffer
descriptor. But when the commit page is on the reader page, the logic is
incorrect.
The check to see if the commit page is on the reader page checked if the
head page was the reader page, which would never happen, as the head page
is always in the ring buffer. The correct check would be to test if the
commit page is on the reader page. If that's the case, then it can exit
out early as the commit page is only on the reader page when there's only
one page of data in the buffer. There's no reason to iterate the ring
buffer pages to find the "commit page" as it is already found.
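The difference between the old and the corrected check can be modeled in a few lines. This is a simplified user-space sketch in Python, not the kernel code: the function names, the list-based ring and the string return values are all hypothetical and only illustrate why testing the head page could never match while testing the commit page does.

```python
def find_commit_old(ring, head_idx, reader_page, commit_page):
    """Model of the old check: compares the head page to the reader page."""
    # The head page always lives in the ring, so this test never succeeds.
    if ring[head_idx] == reader_page:
        return "reader"
    # Walk the ring starting at the head page, looking for the commit page.
    for step in range(len(ring)):
        if ring[(head_idx + step) % len(ring)] == commit_page:
            return "ring"
    return None  # models "commit page not found": the buffer would be reset


def find_commit_fixed(ring, head_idx, reader_page, commit_page):
    """Model of the fixed check: compares the commit page to the reader page."""
    if commit_page == reader_page:
        return "reader"  # exit early, no need to iterate the ring pages
    for step in range(len(ring)):
        if ring[(head_idx + step) % len(ring)] == commit_page:
            return "ring"
    return None
```

With the commit page on the reader page, the old model returns None (validation fails) while the fixed model reports the reader page.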
To trigger this bug:
# echo 1 > /sys/kernel/tracing/instances/boot_mapped/events/syscalls/sys_enter_fchownat/enable
# touch /tmp/x
# chown sshd /tmp/x
# reboot
On boot up, the dmesg will have:
Ring buffer meta [0] is from previous boot!
Ring buffer meta [1] is from previous boot!
Ring buffer meta [2] is from previous boot!
Ring buffer meta [3] is from previous boot!
Ring buffer meta [4] commit page not found
Ring buffer meta [5] is from previous boot!
Ring buffer meta [6] is from previous boot!
Ring buffer meta [7] is from previous boot!
Where the buffer on CPU 4 had a "commit page not found" error and that
buffer is cleared and reset causing the output to be empty and the data lost.
When it works correctly, it has:
# cat /sys/kernel/tracing/instances/boot_mapped/trace_pipe
<...>-1137 [004] ..... 998.205323: sys_enter_fchownat: __syscall_nr=0x104 (260) dfd=0xffffff9c (4294967196) filename=(0xffffc90000a0002c) user=0x3e8 (1000) group=0xffffffff (4294967295) flag=0x0 (0
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250513115032.3e0b97f7@gandalf.local.home
Fixes: 5f3b6e839f ("ring-buffer: Validate boot range memory events")
Reported-by: Tasos Sahanidis <tasos@tasossah.com>
Tested-by: Tasos Sahanidis <tasos@tasossah.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When using the stacktrace trigger command to trace syscalls, the
preemption count was consistently reported as 1 when the system call
event itself had 0 (".").
For example:
root@ubuntu22-vm:/sys/kernel/tracing/events/syscalls/sys_enter_read
$ echo stacktrace > trigger
$ echo 1 > enable
sshd-416 [002] ..... 232.864910: sys_read(fd: a, buf: 556b1f3221d0, count: 8000)
sshd-416 [002] ...1. 232.864913: <stack trace>
=> ftrace_syscall_enter
=> syscall_trace_enter
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
The root cause is that the trace framework disables preemption in __DO_TRACE before
invoking the trigger callback.
Use tracing_gen_ctx_dec(), which accounts for the increase of the
preemption count in __DO_TRACE when calling the callback. The result
is the accurate reporting of:
sshd-410 [004] ..... 210.117660: sys_read(fd: 4, buf: 559b725ba130, count: 40000)
sshd-410 [004] ..... 210.117662: <stack trace>
=> ftrace_syscall_enter
=> syscall_trace_enter
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
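The accounting can be illustrated with a toy model. The sketch below is Python, not kernel code, and every name in it is hypothetical; it only assumes what the text states, namely that the tracing framework raises the preemption count by one before the trigger callback runs and that the "_dec" variant compensates for that.

```python
def gen_ctx(preempt_count):
    """Model of recording the preemption count exactly as seen in the callback."""
    return preempt_count


def gen_ctx_dec(preempt_count):
    """Model of the compensating variant: subtract the framework's own increment."""
    return preempt_count - 1 if preempt_count > 0 else 0


def run_trigger(event_preempt_count, record):
    """Model of a trigger callback invoked with preemption disabled by the framework."""
    count_in_callback = event_preempt_count + 1  # increment done before the callback
    return record(count_in_callback)


def preempt_flag(count):
    """Render the preempt-depth column: '.' for zero, the count otherwise."""
    return "." if count == 0 else str(count)
```

For an event fired with a preemption count of 0, the uncompensated model records 1 and the compensated one records 0, matching the two outputs above.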
Cc: stable@vger.kernel.org
Fixes: ce33c845b0 ("tracing: Dump stacktrace trigger to the corresponding instance")
Link: https://lore.kernel.org/20250512094246.1167956-1-dolinux.peng@gmail.com
Signed-off-by: pengdonglin <dolinux.peng@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Naked scx_root dereferences are being used as temporary markers to indicate
that they need to be updated to point to the right scheduler instance.
Explain the situation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Record trace_clock information in the trace_scratch area and recover
the trace_clock on boot, so that the reader can decode the timestamps
correctly.
Note that since most trace_clocks record the timestamp in nanoseconds,
this is not a bug. But some trace_clocks, like counter and tsc, record
the raw counter value. Only users of those trace_clocks need this
information.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174720625803.1925039.1815089037443798944.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Instead of find_first_bit() use the dedicated bitmap_empty(),
and make upper_empty() a nice one-liner.
While there, fix opencoded BITS_PER_TYPE().
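As an illustration of why the two forms are interchangeable, here is a small Python model of the bitmap helpers. It is a sketch under assumptions: bitmaps are modeled as Python integers, the helper semantics (find_first_bit() returning the bitmap size when no bit is set) follow the usual kernel convention, and bits_per_type() is a hypothetical stand-in for BITS_PER_TYPE() built on ctypes.

```python
import ctypes

BITS_PER_BYTE = 8


def find_first_bit(bitmap, size):
    """Index of the lowest set bit among the first `size` bits, or `size` if none."""
    masked = bitmap & ((1 << size) - 1)
    if masked == 0:
        return size
    return (masked & -masked).bit_length() - 1


def bitmap_empty(bitmap, size):
    """True when none of the first `size` bits is set."""
    return bitmap & ((1 << size) - 1) == 0


def bits_per_type(ctype):
    """Model of BITS_PER_TYPE(): size of the type in bytes times bits per byte."""
    return ctypes.sizeof(ctype) * BITS_PER_BYTE
```

In this model `find_first_bit(b, n) == n` holds exactly when `bitmap_empty(b, n)` does, so the dedicated helper expresses the intent directly.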
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250429195119.620204-1-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In preparation of hierarchical scheduling support, add @sch to scx_exit()
and friends:
- scx_exit/error() updated to take explicit @sch instead of assuming
scx_root.
- scx_kf_exit/error() added. These are to be used from kfuncs, don't take
@sch and internally determine the scx_sched instance to abort. Currently,
it's always scx_root but once multiple scheduler support is in place, it
will be the scx_sched instance that invoked the kfunc. This simplifies
many callsites and defers scx_sched lookup until error is triggered.
- @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU
validity conditions in ops_cpu_valid() are factored into __cpu_valid() to
implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error().
- All users are converted. Most conversions are straightforward.
check_rq_for_timeouts() and scx_softlockup() are updated to use explicit
rcu_dereference*(scx_root) for safety as they may execute asynchronously to
the exit path. scx_tick() is also updated to use rcu_dereference(). While
not strictly necessary due to the preceding scx_enabled() test and IRQ
disabled context, this removes the subtlety at no noticeable cost.
No behavior changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
__scx_exit() is the base exit implementation and there are three wrappers on
top of it - scx_exit(), __scx_error() and scx_error(). This is more
confusing than helpful especially given that there are only a couple users
of scx_exit() and __scx_error(). To simplify the situation:
- Make __scx_exit() take va_list and rename it to scx_vexit(). This is to
ease implementing more complex extensions on top.
- Make scx_exit() a varargs wrapper around scx_vexit(). scx_exit() now
takes both @kind and @exit_code.
- Convert existing scx_exit() and __scx_error() users to use the new
scx_exit().
- scx_error() remains unchanged.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
In preparation of hierarchical scheduling support, make SCX_CALL_OP*() take
explicit @sch instead of assuming scx_root. As scx_root is still the only
scheduler instance, this patch doesn't make any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
- Always cache scx_root into local variable sch before using.
- Don't use scx_root if cached sch is available.
- Wrap !sch test with unlikely().
- Pass @sch into scx_cgroup_init/exit().
No behavior changes intended.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
With the goal of deprecating / removing VOLUNTARY preempt, live-patch
needs to stop relying on cond_resched() to make forward progress.
Instead, rely on schedule() with TASK_FREEZABLE set. Just like
live-patching, the freezer needs to be able to stop tasks in a safe /
known state.
[bigeasy: use likely() in __klp_sched_try_switch() and update comments]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Tested-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20250509113659.wkP_HJ5z@linutronix.de
Kindly inform the MSI driver that the domain is torn down, providing the
allocation context previously populated on domain creation.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250513163144.2215824-5-maz@kernel.org
The current device MSI infrastructure is subtly broken, as it will issue an
.msi_prepare() callback into the MSI controller driver every time it needs
to allocate an MSI. That's pretty wrong, as the contract (or unwarranted
assumption, depending who you ask) between the MSI controller and the core
code is that .msi_prepare() is called exactly once per device.
This leads to some subtle breakage in some MSI controller drivers, as it
gives the impression that there are multiple endpoints sharing a bus
identifier (RID in PCI parlance, DID for GICv3+). It implies that whatever
allocation the ITS driver (for example) has done on behalf of these devices
cannot be undone, as there is no way to track the shared state. This is
particularly bad for wire-MSI devices, for which .msi_prepare() is called
for each input line.
To address this issue, move the call to .msi_prepare() to take place at the
point of irq domain allocation, which is the only place that makes
sense. The msi_alloc_info_t structure is made part of the
msi_domain_template, so that its life-cycle is that of the domain as well.
Finally, the msi_info::alloc_data field is made to point at this allocation
tracking structure, ensuring that it is carried around the block.
This is all pretty straightforward, except for the non-device-MSI
leftovers, which still have to call .msi_prepare() at the old spot. One
day...
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250513163144.2215824-4-maz@kernel.org
The ITS driver currently nukes the structure representing an endpoint
device translating via an ITS on freeing the last LPI allocated for it.
That's an unfortunate state of affairs, as it is pretty common for a driver
to allocate a single MSI, do something clever, teardown this MSI, and
reallocate a whole bunch of them. The NVME driver does exactly that,
amongst others.
What happens in that case is that the core code is accidentally issuing
another .msi_prepare() call, even if it shouldn't. This luckily cancels
the above behaviour and hides the problem.
In order to fix the core code, start by implementing the new
.msi_teardown() callback. Nothing calls it yet, so a side effect is that
the its_dev structure will not be freed and that the DID will stay
mapped. Not a big deal, and this will be solved in following patches.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250513163144.2215824-3-maz@kernel.org
While the MSI ops do have a .msi_prepare() callback that is responsible for
setting up the relevant (usually per-device) allocation, there is no
callback reversing this setup.
For this purpose, add .msi_teardown() callback.
In order to avoid breaking the ITS driver that suffers from related issues,
do not call the callback just yet.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250513163144.2215824-2-maz@kernel.org
Commit 8589e325ba ("genirq/manage: Rework irq_set_irq_wake()") updated
the irq_set_irq_wake() to use the new guards for locking the interrupt
descriptor.
However, in doing so it inadvertently changed irq_set_irq_wake() such that
the 'chip_bus_lock' is no longer acquired. This has caused system suspend
tests to fail on some Tegra platforms.
Fix this by correcting the guard used in irq_set_irq_wake() to ensure the
'chip_bus_lock' is held.
Fixes: 8589e325ba ("genirq/manage: Rework irq_set_irq_wake()")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250514095041.1109783-1-jonathanh@nvidia.com
Instead of hardcoding the list of kfuncs that need prog->aux passed to
them with a combination of fixup_kfunc_call adjustment + __ign suffix,
combine both in __prog suffix, which ignores the argument passed in, and
fixes it up to the prog->aux. This allows kfuncs to have the prog->aux
passed into them without having to touch the verifier.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250513142812.1021591-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The "suspend in progress" check in device_wakeup_enable() does not
cover hibernation, but arguably it should. Introduce
pm_sleep_transition_in_progress(), which covers transitions during both
system suspend and hibernation, and use it there as well as in
pm_debug_messages_should_print().
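The combined predicate can be sketched as follows. This is a Python model with hypothetical names and state, not the kernel implementation: it assumes the suspend side is tracked by a target state that differs from an "on" value while a transition is running, and that hibernation is tracked by a separate flag.

```python
from dataclasses import dataclass

PM_SUSPEND_ON = 0  # assumed value of the "system is running" state in this model


@dataclass
class SleepState:
    suspend_target_state: int = PM_SUSPEND_ON
    hibernation_in_progress: bool = False  # hypothetical flag for this sketch


def suspend_in_progress(state):
    """The older check: only system suspend/resume transitions are seen."""
    return state.suspend_target_state != PM_SUSPEND_ON


def sleep_transition_in_progress(state):
    """The broader check: suspend or hibernation transitions."""
    return suspend_in_progress(state) or state.hibernation_in_progress
```

A state with only the hibernation flag set is missed by the older check and caught by the broader one.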
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/7820474.EvYhyI6sBW@rjwysocki.net
[ rjw: Move the new function definition under CONFIG_PM_SLEEP ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge tag 'probes-fixes-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- fprobe: Fix RCU warning message in list traversal
fprobe_module_callback() traverses the fprobe list using
hlist_for_each_entry_rcu() while holding fprobe_mutex instead of the
RCU read lock, which is sufficient. Add lockdep_is_held() to avoid the
warning.
- tracing: eprobe: Add missing trace_probe_log_clear for eprobe
__trace_eprobe_create() uses trace_probe_log but forgot to clear it
at exit. Add trace_probe_log_clear() calls.
- tracing: probes: Fix possible race in trace_probe_log APIs
trace_probe_log APIs are used in probe event (dynamic_events,
kprobe_events and uprobe_events) creation. Only dynamic_events uses
the dyn_event_ops_mutex mutex to serialize it. This change makes kprobe
and uprobe events take the same mutex to serialize their creation and
avoid races in the trace_probe_log APIs.
* tag 'probes-fixes-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: probes: Fix a possible race in trace_probe_log APIs
tracing: add missing trace_probe_log_clear for eprobes
tracing: fprobe: Fix RCU warning message in list traversal
Right now, if the clocksource watchdog detects a clocksource skew, it might
perform a per CPU check, for example in the TSC case on x86. In other
words: supposing TSC is detected as unstable by the clocksource watchdog
running at CPU1, as part of marking TSC unstable the kernel will also run a
check of TSC readings on some CPUs to be sure it is synced between them
all.
But that check happens only on some CPUs, not all of them; this choice is
based on the parameter "verify_n_cpus" and on a random cpumask
calculation. So, the watchdog runs such per CPU checks on up to
"verify_n_cpus" random CPUs among all online CPUs, with the risk of
repeating CPUs (that aren't double checked) in the cpumask random
calculation.
But if "verify_n_cpus" > num_online_cpus(), it should skip the random
calculation and just go ahead and check the clocksource sync between
all online CPUs, without the risk of skipping some CPUs due to
duplication in the random cpumask calculation.
Tests on a 4 CPU laptop with TSC skew detected led to some cases of the per
CPU verification skipping some CPU even with verify_n_cpus=8, due to the
duplication in the random cpumask generation. Skipping the randomization
when the number of online CPUs is smaller than verify_n_cpus solves that.
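The effect of duplicates in the random selection can be reproduced with a small model. The following Python sketch is not the kernel function: the names are hypothetical, the random source is passed in so the behaviour can be made deterministic, and excluding the current CPU from the result is an assumption of this model.

```python
import random


def choose_cpus_random(online, verify_n_cpus, this_cpu, rng):
    """Model of the random path: draw verify_n_cpus CPUs, repeats collapse."""
    chosen = set()
    for _ in range(verify_n_cpus):
        chosen.add(rng.choice(online))  # a repeated draw adds nothing new
    chosen.discard(this_cpu)  # assumption: the checking CPU is not verified
    return chosen


def choose_cpus(online, verify_n_cpus, this_cpu, rng):
    """Model of the fixed logic: skip randomization when it cannot help."""
    if verify_n_cpus > len(online):
        return set(online) - {this_cpu}  # check every other online CPU
    return choose_cpus_random(online, verify_n_cpus, this_cpu, rng)
```

For example, `choose_cpus([0, 1, 2, 3], 8, 0, random.Random())` always yields CPUs 1, 2 and 3, whereas the purely random path may return fewer CPUs whenever draws repeat.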
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/all/20250323173857.372390-1-gpiccoli@igalia.com
Since the shared trace_probe_log variable can be accessed and
modified via probe event create operation of kprobe_events,
uprobe_events, and dynamic_events, it should be protected.
In the dynamic_events, all operations are serialized by
`dyn_event_ops_mutex`. But kprobe_events and uprobe_events
interfaces are not serialized.
To solve this issue, introduce dyn_event_create(), which runs the
create() operation under the mutex, for kprobe_events and
uprobe_events. This also uses lockdep to check that the mutex is
held when using trace_probe_log* APIs.
Link: https://lore.kernel.org/all/174684868120.551552.3068655787654268804.stgit@devnote2/
Reported-by: Paul Cacheux <paulcacheux@gmail.com>
Closes: https://lore.kernel.org/all/20250510074456.805a16872b591e2971a4d221@kernel.org/
Fixes: ab105a4fb8 ("tracing: Use tracing error_log with probe events")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a function for updating the Energy Model for a CPU after its
capacity has changed, which subsequently will be used by the
intel_pstate driver.
An EM_PERF_DOMAIN_ARTIFICIAL check is added to em_recalc_and_update()
to prevent it from calling em_compute_costs() for an "artificial" perf
domain with a NULL cb parameter which would cause it to crash.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/3637203.iIbC2pHGDl@rjwysocki.net
Move the check of the CPU capacity currently stored in the energy model
against the arch_scale_cpu_capacity() value to em_adjust_new_capacity()
so it will be done regardless of where the latter is called from.
This will be useful when a new em_adjust_new_capacity() caller is added
subsequently.
While at it, move the pd local variable declaration in
em_check_capacity_update() into the loop in which it is used.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/7810787.EvYhyI6sBW@rjwysocki.net
Introduce pm_suspend_in_progress() to be used for checking if a system-
wide suspend or resume transition is in progress, instead of comparing
pm_suspend_target_state directly to PM_SUSPEND_ON, and use it where
applicable.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/2020901.PYKUYFuaPT@rjwysocki.net
Commit cdb8c100d8 ("include/linux/suspend.h: Only show pm_pr_dbg
messages at suspend/resume") caused PM debug messages to only be
printed during system-wide suspend and resume in progress, but it
forgot about hibernation.
Address this by adding a check for hibernation in progress to
pm_debug_messages_should_print().
Fixes: cdb8c100d8 ("include/linux/suspend.h: Only show pm_pr_dbg messages at suspend/resume")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://patch.msgid.link/4998903.GXAFRqVoOG@rjwysocki.net
There are three cases in the genirq code when the irq, as an unsigned
integer variable, is converted to text representation by sprintf().
Two of them use the '%d' specifier, which is for signed values. While
this is not a problem right now, it could become one in the future
if a number larger than INT_MAX ever appears there.
Consistently use '%u' format specifier for @irq which is declared as
unsigned int in all these cases.
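What a value above INT_MAX typically looks like through each specifier can be shown with ctypes. This Python sketch only models the common two's-complement outcome with 32-bit integers; the helper names are hypothetical, and in C itself passing such a value to '%d' is not something the standard guarantees a result for.

```python
import ctypes


def as_percent_d(irq):
    """Model of printing a 32-bit value with '%d': reinterpret it as signed."""
    return str(ctypes.c_int32(irq).value)


def as_percent_u(irq):
    """Model of printing a 32-bit value with '%u': keep it unsigned."""
    return str(ctypes.c_uint32(irq).value)
```

For small numbers both helpers agree, but 3000000000 comes out as "-1294967296" through the signed path and as "3000000000" through the unsigned one.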
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250509154643.1499171-1-andriy.shevchenko@linux.intel.com
After the conversion to locking guards within the interrupt core code,
several builds with clang show the "Interrupts were enabled early"
WARN() in start_kernel() on boot.
In class_irqdesc_lock_constructor(), _t.flags is initialized via
__irq_get_desc_lock() within the _t initializer list. However, the C11
standard 6.7.9.23 states that the evaluations of the initialization list
expressions are indeterminately sequenced relative to one another,
meaning _t.flags could be initialized by __irq_get_desc_lock() then be
initialized to zero due to flags being absent from the initializer list.
To ensure _t.flags is consistently initialized, move the call to
__irq_get_desc_lock() and the assignment of its result to _t.lock out of
the designated initializer.
Fixes: 0f70a49f3f ("genirq: Provide conditional lock guards")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/all/20250513-irq-guards-fix-flags-init-v1-1-1dca3f5992d6@kernel.org
Right now these are performed in kernel/fork.c which is odd and a
violation of separation of concerns, as well as preventing us from
integrating this and related logic into userland VMA testing going
forward.
There is a fly in the ointment - nommu - mmap.c is not compiled if
CONFIG_MMU is not set, and neither is vma.c.
To square the circle, let's add a new file - vma_init.c. This will be
compiled for both CONFIG_MMU and nommu builds, and will also form part of
the VMA userland testing.
This allows us to de-duplicate code, while maintaining separation of
concerns and the ability for us to userland test this logic.
Update the VMA userland tests accordingly, additionally adding a
detach_free_vma() helper function to correctly detach VMAs before freeing
them in test code, as this change was triggering the assert for this.
[akpm@linux-foundation.org: remove stray newline, per Liam]
Link: https://lkml.kernel.org/r/f97b3a85a6da0196b28070df331b99e22b263be8.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is a key step in our being able to abstract and isolate VMA
allocation and destruction logic.
This function is the last one where vm_area_free() and vm_area_dup() are
directly referenced outside of mmap, so having this in mm allows us to
isolate these.
We do the same for the nommu version which is substantially simpler.
We place the declaration for dup_mmap() in mm/internal.h and have
kernel/fork.c import this in order to prevent improper use of this
functionality elsewhere in the kernel.
While we're here, we remove the useless #ifdef CONFIG_MMU check around
mmap_read_lock_maybe_expand() in mmap.c, as mmap.c is compiled only if
CONFIG_MMU is set.
Link: https://lkml.kernel.org/r/e49aad3d00212f5539d9fa5769bfda4ce451db3e.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "sched/numa: Skip VMA scanning on memory pinned to one NUMA
node via cpuset.mems", v5.
This patch (of 2):
When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page
faults as they will just be overhead. With this change, there will be no
more unnecessary PTE updates or page faults in this scenario.
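The early exit can be pictured as a test on the set of memory nodes the task is allowed to use. The Python sketch below is an assumption about the shape of that test, not the actual scheduler code: the function names are hypothetical and the node set is modeled as an integer bitmask.

```python
def nodes_weight(mask):
    """Number of NUMA nodes set in the bitmask."""
    return bin(mask).count("1")


def should_skip_numa_scan(cpusets_enabled, mems_allowed_mask):
    """Skip VMA scanning when memory is pinned to exactly one node."""
    return cpusets_enabled and nodes_weight(mems_allowed_mask) == 1
```

With memory pinned to node 1 only (mask 0b10) the scan is skipped; with two allowed nodes (mask 0b11) it proceeds as before.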
We have seen up to a 6x improvement on a typical java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
platform, we have seen a 20% improvement in a microbench that creates a
30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
pages in a fixed number of loops.
Link: https://lkml.kernel.org/r/20250424024523.2298272-1-libo.chen@oracle.com
Link: https://lkml.kernel.org/r/20250424024523.2298272-2-libo.chen@oracle.com
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We have all generic code in place now to support Kexec with KHO. This
patch adds a config option that depends on architecture support to enable
KHO support.
Link: https://lkml.kernel.org/r/20250509074635.3187114-9-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kexec has 2 modes: A user space driven mode and a kernel driven mode. For
the kernel driven mode, kernel code determines the physical addresses of
all target buffers that the payload gets copied into.
With KHO, we can only safely copy payloads into the "scratch area". Teach
the kexec file loader about it, so it only allocates for that area. In
addition, enlighten it with support to ask the KHO subsystem for its
respective payloads to copy into target memory. Also teach the KHO
subsystem how to fill the images for file loads.
Link: https://lkml.kernel.org/r/20250509074635.3187114-8-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce APIs allowing KHO users to preserve memory across kexec and get
access to that memory after boot of the kexeced kernel:
  kho_preserve_folio() - record a folio to be preserved over kexec
  kho_restore_folio()  - recreates the folio from the preserved memory
  kho_preserve_phys()  - record physically contiguous range to be
                         preserved over kexec.
The memory preservations are tracked by two levels of xarrays to manage
chunks of per-order 512 byte bitmaps. For instance if PAGE_SIZE = 4096,
the entire 1G order of a 1TB x86 system would fit inside a single 512 byte
bitmap. For order 0 allocations each bitmap will cover 16M of address
space. Thus, for 16G of memory at most 512K of bitmap memory will be
needed for order 0.
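The sizing figures above can be checked with a few lines of arithmetic. This is only a worked calculation, not kernel code: the function names are hypothetical, and order 18 is used for the "1G order" on the assumption that PAGE_SIZE = 4096 (4096 << 18 = 1G).

```python
# Worked arithmetic for the per-order bitmap sizing described above.
PAGE_SIZE = 4096                      # bytes, as in the example
BITMAP_BYTES = 512                    # size of one bitmap chunk
BITS_PER_BITMAP = BITMAP_BYTES * 8    # 4096 bits, one bit per block of the given order

KIB, MIB, GIB, TIB = 1 << 10, 1 << 20, 1 << 30, 1 << 40


def coverage_bytes(order: int) -> int:
    """Address space covered by one 512 byte bitmap at the given order."""
    return BITS_PER_BITMAP * (PAGE_SIZE << order)


def bitmap_bytes_needed(memory_bytes: int, order: int) -> int:
    """Upper bound on bitmap memory needed to track memory_bytes at one order."""
    per_bitmap = coverage_bytes(order)
    n_bitmaps = -(-memory_bytes // per_bitmap)   # ceiling division
    return n_bitmaps * BITMAP_BYTES


if __name__ == "__main__":
    print(coverage_bytes(0) // MIB, "MiB covered by one order-0 bitmap")
    print(bitmap_bytes_needed(16 * GIB, 0) // KIB, "KiB of bitmaps for 16G at order 0")
    print(bitmap_bytes_needed(1 * TIB, 18), "bytes of bitmaps for 1TB at the 1G order")
```

One order-0 bitmap covers 4096 pages of 4096 bytes each, which is where the 16M figure comes from; 16G then needs 1024 such bitmaps, i.e. 512K.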
At serialization time all bitmaps are recorded in a linked list of pages
for the next kernel to process and the physical address of the list is
recorded in KHO FDT.
The next kernel then processes that list, reserves the memory ranges and
later, when a user requests a folio or a physical range, KHO restores
corresponding memory map entries.
Link: https://lkml.kernel.org/r/20250509074635.3187114-7-changyuanl@google.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we have a KHO kexec, we get an FDT blob and scratch region to
populate the state of the system. Provide helper functions that allow
architecture code to easily handle memory reservations based on them and
give device drivers visibility into the KHO FDT and memory reservations so
they can recover their own state.
Include a fix from Arnd Bergmann <arnd@arndb.de>
https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/.
Link: https://lkml.kernel.org/r/20250509074635.3187114-6-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Yadav <ptyadav@amazon.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add the infrastructure to generate Kexec HandOver metadata. Kexec
HandOver is a mechanism that allows Linux to preserve state - arbitrary
properties as well as memory locations - across kexec.
It does so using 2 concepts:
1) KHO FDT - Every KHO kexec carries a KHO specific flattened device tree
blob that describes preserved memory regions. Device drivers can
register to KHO to serialize and preserve their states before kexec.
2) Scratch Regions - CMA regions that we allocate in the first kernel.
CMA gives us the guarantee that no handover pages land in those
regions, because handover pages must be at a static physical memory
location. We use these regions as the place to load future kexec
images so that they won't collide with any handover data.
Link: https://lkml.kernel.org/r/20250509074635.3187114-5-changyuanl@google.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ashish Kalra <ashish.kalra@amd.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Saravana Kannan <saravanak@google.com>
Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is possible for a reclaimer to cause demotions of an lruvec belonging
to a cgroup with cpuset.mems set to exclude some nodes. Attempt to apply
this limitation based on the lruvec's memcg and prevent demotion.
Notably, this may still allow demotion of shared libraries or any memory
first instantiated in another cgroup. This means cpusets still cannot
guarantee complete isolation when demotion is enabled, and the docs
have been updated to reflect this.
This is useful for isolating workloads on a multi-tenant system from
certain classes of memory more consistently - with the noted exceptions.
Note on locking:
The cgroup_get_e_css reference protects the css->effective_mems, and calls
of this interface would be subject to the same race conditions associated
with a non-atomic access to cs->effective_mems.
So while this interface cannot make strong guarantees of correctness, it
can therefore avoid taking a global or rcu_read_lock for performance.
Link: https://lkml.kernel.org/r/20250424202806.52632-3-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "vmscan: enforce mems_effective during demotion", v5.
Change reclaim to respect cpuset.mems_effective during demotion when
possible. Presently, reclaim explicitly ignores cpuset.mems_effective
when demoting, which may cause the cpuset settings to be violated.
Implement cpuset_node_allowed() to check the cpuset.mems_effective
associated with the mem_cgroup of the lruvec being scanned. This only
applies to cgroup/cpuset v2, as cpuset exists in a different hierarchy
than mem_cgroup in v1.
This requires renaming the existing cpuset_node_allowed() to be
cpuset_current_node_allowed() - which is more descriptive anyway - to
implement the new cpuset_node_allowed() which takes a target cgroup.
This patch (of 2):
Rename cpuset_node_allowed to reflect that the function checks the current
task's cpuset.mems. This allows us to make a new cpuset_node_allowed
function that checks a target cgroup's cpuset.mems.
Link: https://lkml.kernel.org/r/20250424202806.52632-1-gourry@gourry.net
Link: https://lkml.kernel.org/r/20250424202806.52632-2-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch introduces a new set of kfuncs for working with dynptrs in
BPF programs, enabling reading variable-length user or kernel data
into dynptr directly. To ensure memory safety, the verifier allows only
constant-sized reads via the existing bpf_probe_read_{user|kernel} etc.
helpers; the dynptr-based kfuncs allow dynamically-sized reads without
memory safety shortcomings.
The following kfuncs are introduced:
* `bpf_probe_read_kernel_dynptr()`: probes kernel-space data into a dynptr
* `bpf_probe_read_user_dynptr()`: probes user-space data into a dynptr
* `bpf_probe_read_kernel_str_dynptr()`: probes kernel-space string into
a dynptr
* `bpf_probe_read_user_str_dynptr()`: probes user-space string into a
dynptr
* `bpf_copy_from_user_dynptr()`: sleepable, copies user-space data into
a dynptr for the current task
* `bpf_copy_from_user_str_dynptr()`: sleepable, copies user-space string
into a dynptr for the current task
* `bpf_copy_from_user_task_dynptr()`: sleepable, copies user-space data
of the task into a dynptr
* `bpf_copy_from_user_task_str_dynptr()`: sleepable, copies user-space
string of the task into a dynptr
The implementation is built on two generic functions:
* __bpf_dynptr_copy
* __bpf_dynptr_copy_str
These functions take function pointers as arguments, enabling the
copying of data from various sources, including both kernel and user
space.
Use __always_inline for generic functions and callbacks to make sure the
compiler doesn't generate indirect calls into callbacks, which are more
expensive, especially on some kernel configurations. Inlining allows the
compiler to put direct calls into all the specific callback implementations
(copy_user_data_sleepable, copy_user_data_nofault, and so on).
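The "generic function plus copy callback" pattern described above can be modelled in user space. The sketch below is not the kernel implementation: `dynptr_copy`, `make_reader`, the chunk size, and the fake address space are all hypothetical, and Python has no equivalent of __always_inline. It only shows how one generic routine can serve several data sources by taking the copy routine as an argument.

```python
import errno
from typing import Callable

CHUNK = 8  # hypothetical staging size, kept tiny so the chunking loop is visible

# A copy callback fills `buf` from source address `addr`; returns 0 or -errno.
CopyFn = Callable[[memoryview, int], int]


def make_reader(memory: bytes) -> CopyFn:
    """Build a callback that reads from a fake address space (a bytes object)."""
    def read(buf: memoryview, addr: int) -> int:
        if addr < 0 or addr + len(buf) > len(memory):
            return -errno.EFAULT          # models a faulting access
        buf[:] = memory[addr:addr + len(buf)]
        return 0
    return read


def dynptr_copy(dst: bytearray, dst_off: int, size: int,
                src: int, copy_fn: CopyFn) -> int:
    """Generic copy: bounds-check the destination, then copy chunk by chunk."""
    if size > len(dst) or dst_off > len(dst) - size:
        return -errno.E2BIG
    view = memoryview(dst)
    done = 0
    while done < size:
        n = min(CHUNK, size - done)
        err = copy_fn(view[dst_off + done:dst_off + done + n], src + done)
        if err:
            return err
        done += n
    return 0


if __name__ == "__main__":
    reader = make_reader(b"hello, dynptr!")
    dst = bytearray(16)
    print(dynptr_copy(dst, 2, 5, 0, reader), bytes(dst[2:7]))
```

The same `dynptr_copy` works unchanged with any other callback of the same shape, which is the point of passing the copy routine in rather than hard-coding it.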
Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Make bpf_dynptr_slice_rdwr, bpf_dynptr_check_off_len and
__bpf_dynptr_write available outside of helpers.c by
adding their prototypes to include/linux/bpf.h.
The bpf_dynptr_check_off_len() implementation is moved to the header and
made explicitly inline, as small functions should typically be inlined.
These functions are going to be used from bpf_trace.c in the next
patch of this series.
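For readers unfamiliar with this kind of helper, the following is a minimal user-space model of an offset/length bounds check. It assumes the check has the usual overflow-safe shape (compare the offset against size minus length instead of adding offset and length); the function names and the u32 wrap-around model are illustrative and not taken from the patch.

```python
import errno

U32_MASK = 0xFFFFFFFF  # models 32-bit unsigned wrap-around as in C


def check_off_len(size: int, offset: int, length: int) -> int:
    """Overflow-safe check: never computes offset + length."""
    if length > size or offset > size - length:
        return -errno.E2BIG
    return 0


def naive_check_off_len(size: int, offset: int, length: int) -> int:
    """Naive check that adds first; with u32 arithmetic the sum can wrap."""
    if ((offset + length) & U32_MASK) > size:
        return -errno.E2BIG
    return 0


if __name__ == "__main__":
    # offset + length wraps to 0 in 32 bits, so the naive check accepts it.
    print(naive_check_off_len(16, 0xFFFFFFFF, 1))   # wrongly accepted
    print(check_off_len(16, 0xFFFFFFFF, 1))         # rejected
```

A function this small is a natural candidate for a header-level inline, since a call would cost about as much as the two comparisons it performs.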
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-2-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A little bit invasive for rc6 but they're important fixes, pass tests fine
and won't break anything outside sched_ext.
- scx_bpf_cpuperf_set() calls internal functions that require the rq to be
locked. It assumed that the BPF caller has rq locked but that's not always
true. Fix it by tracking whether rq is currently held by the CPU and
grabbing it if necessary.
- bpf_iter_scx_dsq_new() was leaving the DSQ iterator in an uninitialized
state after an error. However, next() and destroy() can be called on an
iterator which failed initialization and thus they always need to be
initialized even after an init error. Fix by always initializing the
iterator.
- Remove duplicate BTF_ID_FLAGS() entries.
Merge tag 'sched_ext-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"A little bit invasive for rc6 but they're important fixes, pass tests
fine and won't break anything outside sched_ext:
- scx_bpf_cpuperf_set() calls internal functions that require the rq
to be locked. It assumed that the BPF caller has rq locked but
that's not always true. Fix it by tracking whether rq is currently
held by the CPU and grabbing it if necessary
- bpf_iter_scx_dsq_new() was leaving the DSQ iterator in an
uninitialized state after an error. However, next() and destroy()
can be called on an iterator which failed initialization and thus
they always need to be initialized even after an init error. Fix by
always initializing the iterator
- Remove duplicate BTF_ID_FLAGS() entries"
* tag 'sched_ext-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator
sched_ext: Fix rq lock state in hotplug ops
sched_ext: Remove duplicate BTF_ID_FLAGS definitions
sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
sched_ext: Track currently locked rq
One low-risk patch to fix a cpuset bug where it over-eagerly tries to modify
CPU affinity of kernel threads.
Merge tag 'cgroup-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"One low-risk patch to fix a cpuset bug where it over-eagerly tries to
modify CPU affinity of kernel threads"
* tag 'cgroup-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Extend kthread_is_per_cpu() check to all PF_NO_SETAFFINITY tasks
With CONFIG_GENDWARFKSYMS, __gendwarfksyms_ptr variables are
added to the kernel in EXPORT_SYMBOL() to ensure DWARF type
information is available for exported symbols in the TUs where
they're actually exported. These symbols are dropped when linking
vmlinux, but dangling references to them remain in DWARF.
With CONFIG_DEBUG_INFO_BTF enabled on X86, pahole versions after
commit 47dcb534e253 ("btf_encoder: Stop indexing symbols for
VARs") and before commit 9810758003ce ("btf_encoder: Verify 0
address DWARF variables are in ELF section") place these symbols
in the .data..percpu section, which results in an "Invalid
offset" error in btf_datasec_check_meta() during boot, as all
the variables are at zero offset and have non-zero size. If
CONFIG_DEBUG_INFO_BTF_MODULES is enabled, this also results in a
failure to load modules with:
failed to validate module [$module] BTF: -22
As the issue occurs in pahole v1.28 and the fix was merged
after v1.29 was released, require pahole <v1.28 or >v1.29 when
GENDWARFKSYMS is enabled with DEBUG_INFO_BTF on X86.
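The resulting constraint can be written as a small predicate. This is only a restatement of the rule above in executable form; the function name and the (major, minor) tuple encoding of the pahole version are assumptions made for the example and do not reflect how Kconfig encodes it.

```python
from typing import Tuple

AFFECTED_FIRST = (1, 28)   # first pahole release with the issue
AFFECTED_LAST = (1, 29)    # last release without the fix


def pahole_acceptable(version: Tuple[int, int], gendwarfksyms: bool,
                      debug_info_btf: bool, x86: bool) -> bool:
    """True if this pahole version may be used with the given configuration."""
    if not (gendwarfksyms and debug_info_btf and x86):
        return True                       # the restriction does not apply
    return version < AFFECTED_FIRST or version > AFFECTED_LAST


if __name__ == "__main__":
    for v in [(1, 27), (1, 28), (1, 29), (1, 30)]:
        print(v, pahole_acceptable(v, True, True, True))
```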
Reported-by: Paolo Pisati <paolo.pisati@canonical.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
This user of SHA-256 does not support any other algorithm, so the
crypto_shash abstraction provides no value. Just use the SHA-256 library
API instead, which is much simpler and easier to use.
Tested with '/sbin/kexec --kexec-file-syscall'.
Link: https://lkml.kernel.org/r/20250428185721.844686-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When updating `watchdog_thresh`, there is a race condition between writing
the new `watchdog_thresh` value and stopping the old watchdog timer. If
the old timer triggers during this window, it may falsely detect a
softlockup due to the old interval and the new `watchdog_thresh` value
being used. The problem can be described as follows:
 # We assume the previous watchdog_thresh is 60, so the watchdog timer
 # fires every 24s.
echo 10 > /proc/sys/kernel/watchdog_thresh (User space)
|
+------>+ update watchdog_thresh (We are in kernel now)
        |
        |         # using old interval and new `watchdog_thresh`
        +------>+ watchdog hrtimer (irq context: detect softlockup)
                |
                |
        +-------+
        |
        |
        + softlockup_stop_all
To fix this problem, introduce a shadow variable for `watchdog_thresh`.
The update to the actual `watchdog_thresh` is delayed until after the old
timer is stopped, preventing false positives.
The following testcase may help to understand this problem.
---------------------------------------------
echo RT_RUNTIME_SHARE > /sys/kernel/debug/sched/features
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo 0 > /sys/kernel/debug/sched/fair_server/cpu3/runtime
echo 60 > /proc/sys/kernel/watchdog_thresh
taskset -c 3 chrt -r 99 /bin/bash -c "while true;do true; done" &
echo 10 > /proc/sys/kernel/watchdog_thresh &
---------------------------------------------
The test case above first removes the throttling restrictions for
real-time tasks. It then sets watchdog_thresh to 60 and executes a
real-time task, a simple while(1) loop, on cpu3. Consequently, the final
command gets blocked because the presence of this real-time thread
prevents kworker:3 from being selected by the scheduler. This eventually
triggers a softlockup detection on cpu3 due to watchdog_timer_fn operating
with inconsistent variables - using both the old interval and the updated
watchdog_thresh simultaneously.
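The mismatch can be put into numbers. The sketch below assumes the softlockup threshold is twice watchdog_thresh and the hrtimer period is one fifth of that threshold, which is consistent with the "60 gives a 24s timer" figure quoted above; the function names are hypothetical and this is arithmetic only, not the watchdog code.

```python
def softlockup_threshold(watchdog_thresh: int) -> int:
    """Seconds without progress before a softlockup is reported (assumed 2x)."""
    return watchdog_thresh * 2


def timer_period(watchdog_thresh: int) -> float:
    """Seconds between watchdog hrtimer ticks (assumed threshold / 5)."""
    return softlockup_threshold(watchdog_thresh) / 5


def stale_tick_trips(old_thresh: int, new_thresh: int) -> bool:
    """True if a timer still running at the old period sees a gap between
    ticks that already exceeds the threshold derived from the new value."""
    return timer_period(old_thresh) > softlockup_threshold(new_thresh)


if __name__ == "__main__":
    print(timer_period(60))              # old timer period in seconds
    print(softlockup_threshold(10))      # threshold after the write
    print(stale_tick_trips(60, 10))      # old period vs new threshold
```

With the old value of 60 the timer fires every 24s, while the new value of 10 yields a 20s threshold, so a single stale tick is enough to look like a lockup. Deferring the visible update until the old timer is stopped removes that window.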
[nysal@linux.ibm.com: fix the SOFTLOCKUP_DETECTOR=n case]
Link: https://lkml.kernel.org/r/20250502111120.282690-1-nysal@linux.ibm.com
Link: https://lkml.kernel.org/r/20250421035021.3507649-1-luogengkun@huaweicloud.com
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Nysal Jan K.A." <nysal@linux.ibm.com>
Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is a spelling mistake in a pr_warn message. Fix it.
Link: https://lkml.kernel.org/r/20250418120331.535086-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The last use of relay_late_setup_files() was removed in 2018 by commit
2b47733045 ("drm/i915/guc: Merge log relay file and channel creation")
Remove it and the helper it used.
relay_late_setup_files() was used for eventually registering 'buffer only'
channels. With it gone, delete the docs that explain how to do that.
Which suggests it should be possible to lose the 'has_base_filename'
flags.
(Are there any other uses??)
Link: https://lkml.kernel.org/r/20250418234932.490863-1-linux@treblig.org
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reduces single-threaded overhead as it avoids one lock+irq trip on
exit.
It also improves scalability of spawning and killing threads within one
process (just shy of 5% when doing it on 24 cores on my test jig).
Both routines are moved below kcov and kmsan exit, which should be
harmless.
Link: https://lkml.kernel.org/r/20250319195436.1864415-1-mjguzik@gmail.com
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On Intel TDX guest, unaccepted memory is unusable free memory which is not
managed by buddy, until it's accepted by guest. Before that, it cannot be
accessed by the first kernel as well as the kexec'ed kernel. The kexec'ed
kernel will skip these pages and fill in zero data for the reader of
vmcore.
A dump tool like makedumpfile creates a page descriptor (size 24 bytes)
for each non-free page, including zero data pages, but it will not create
a descriptor for free pages. If it is not able to distinguish these
unaccepted pages from zero data pages, a certain amount of space will be
wasted in proportion (~1/170). In fact, as a special kind of free page,
the unaccepted pages should be excluded, like the real free pages.
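The ~1/170 figure follows directly from the descriptor size, assuming 4096-byte pages (the page size is an assumption here; the 24-byte descriptor size is from the text above). The helper name is hypothetical.

```python
PAGE_SIZE = 4096        # bytes, assumed
DESCRIPTOR_SIZE = 24    # bytes per page descriptor, as stated above


def wasted_descriptor_bytes(unaccepted_bytes: int) -> int:
    """Dump-file space spent on descriptors for unaccepted pages."""
    return (unaccepted_bytes // PAGE_SIZE) * DESCRIPTOR_SIZE


if __name__ == "__main__":
    print(PAGE_SIZE / DESCRIPTOR_SIZE)                  # pages' worth of data per descriptor
    print(wasted_descriptor_bytes(1 << 40) / (1 << 30), "GiB per TiB of unaccepted memory")
```

Each descriptor costs 24 bytes for a 4096-byte page, a ratio of roughly 1 to 170, so a guest with 1 TiB of unaccepted memory would spend about 6 GiB of dump space on descriptors for pages that carry no data.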
Export the page type PAGE_UNACCEPTED_MAPCOUNT_VALUE to vmcoreinfo, so that
dump tool can identify whether a page is unaccepted.
[zhiquan1.li@intel.com: fix docs: "Title underline too short" warning]
Link: https://lore.kernel.org/all/20240809114854.3745464-5-kirill.shutemov@linux.intel.com/
Link: https://lkml.kernel.org/r/20250405060610.860465-1-zhiquan1.li@intel.com
Link: https://lkml.kernel.org/r/20250403030801.758687-1-zhiquan1.li@intel.com
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is useful to be able to access current->mm at task exit to, say, record
a bunch of VMA information right before the task exits (e.g., for stack
symbolization reasons when dealing with short-lived processes that exit in
the middle of profiling session). Currently, trace_sched_process_exit()
is triggered after exit_mm() which resets current->mm to NULL making this
tracepoint unsuitable for inspecting and recording task's
mm_struct-related data when tracing process lifetimes.
There is a particularly suitable place, though, right after
taskstats_exit() is called, but before we do exit_mm() and other exit_*()
resource teardowns. taskstats performs a similar kind of accounting that
some applications do with BPF, and so co-locating them seems like a good
fit. So that's where trace_sched_process_exit() is moved with this patch.
Also, existing trace_sched_process_exit() tracepoint is notoriously
missing `group_dead` flag that is certainly useful in practice and some of
our production applications have to work around this. So plumb
`group_dead` through while at it, to have a richer and more complete
tracepoint.
Note that we can't use sched_process_template anymore, and so we use
TRACE_EVENT()-based tracepoint definition. But all the field names and
order, as well as assign and output logic remain intact. We just add one
extra field at the end in backwards-compatible way.
[andrii@kernel.org: document sched_process_exit and sched_process_template relation]
Link: https://lkml.kernel.org/r/20250403174120.4087794-1-andrii@kernel.org
Link: https://lkml.kernel.org/r/20250402180925.90914-1-andrii@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
uprobe_write_opcode() does some pretty low-level things that really, it
shouldn't be doing: for example, manually breaking COW by allocating
anonymous folios and replacing mapped pages.
Further, it does seem to do some shaky things: for example, writing to
possible COW-shared anonymous pages or zapping anonymous pages that might
be pinned. We're also not taking care of uffd, uffd-wp, softdirty ...
although rather corner cases here. Let's just get it right like ordinary
ptrace writes would.
Let's rewrite the code, leaving COW-breaking to core-MM, triggered by
FOLL_FORCE|FOLL_WRITE (note that the code was already using FOLL_FORCE).
We'll use GUP to lookup/faultin the page and break COW if required. Then,
we'll walk the page tables using a folio_walk to perform our page
modification atomically by temporarily unmapping the PTE and flushing the TLB.
Likely, we could avoid the temporary unmap in case we can just atomically
write the instruction, but that will be a separate project.
Unfortunately, we still have to implement the zapping logic manually,
because we only want to zap in specific circumstances (e.g., page content
identical).
Note that we can now handle large folios (compound pages) and the shared
zeropage just fine, so drop these checks.
Link: https://lkml.kernel.org/r/20250321113713.204682-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: tongtiangen <tongtiangen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We already have the VMA, no need to look it up using
get_user_page_vma_remote(). We can now switch to get_user_pages_remote().
Link: https://lkml.kernel.org/r/20250321113713.204682-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: tongtiangen <tongtiangen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "kernel/events/uprobes: uprobe_write_opcode() rewrite", v3.
Currently, uprobe_write_opcode() implements COW-breaking manually, which
is really far from ideal. Further, there is interest in supporting
uprobes on hugetlb pages [1], and leaving at least the COW-breaking to the
core will make this much easier.
Also, I think the current code doesn't really handle some things properly
(see patch #3) when replacing/zapping pages.
Let's rewrite it, to leave COW-breaking to the fault handler, and handle
registration/unregistration by temporarily unmapping the anonymous page,
modifying it, and mapping it again. We still have to implement zapping of
anonymous pages ourselves, unfortunately.
We could look into not performing the temporary unmapping if we can
perform the write atomically, which would likely also make adding hugetlb
support a lot easier. But, limited (e.g., only PMD/PUD) hugetlb support
could be added on top of this with some tweaking.
Note that we now won't have to allocate another anonymous folio when
unregistering (which will be beneficial for hugetlb as well), we can
simply modify the already-mapped one from the registration (if any). When
registering a uprobe, we'll first trigger a ptrace-like write fault to
break COW, to then modify the already-mapped page.
Briefly sanity tested with perf probes and with the bpf uprobes selftest.
This patch (of 3):
Pass VMA instead of MM to remove_breakpoint() and remove the "MM" argument
from install_breakpoint(), because it can easily be derived from the VMA.
Link: https://lkml.kernel.org/r/20250321113713.204682-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250321113713.204682-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: tongtiangen <tongtiangen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of system
calls the tracee is blocked in.
This API allows ptracers to obtain and modify system call details in a
straightforward and architecture-agnostic way, providing a consistent way
of manipulating the system call number and arguments across architectures.
As in the case of PTRACE_GET_SYSCALL_INFO, PTRACE_SET_SYSCALL_INFO also does
not aim to address numerous architecture-specific system call ABI
peculiarities, like differences in the number of system call arguments for
such system calls as pread64 and preadv.
The current implementation supports changing only those bits of system
call information that are used by strace system call tampering, namely,
syscall number, syscall arguments, and syscall return value.
Support of changing additional details returned by
PTRACE_GET_SYSCALL_INFO, such as instruction pointer and stack pointer,
could be added later if needed, by using struct ptrace_syscall_info.flags
to specify the additional details that should be set. Currently, "flags"
and "reserved" fields of struct ptrace_syscall_info must be initialized
with zeroes; "arch", "instruction_pointer", and "stack_pointer" fields are
currently ignored.
PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.
Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen. The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was the apparent failure
to provide an API for changing the first system call argument on the riscv
architecture.
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
Modify information about the system call that caused the stop.
The "data" argument is a pointer to struct ptrace_syscall_info
that specifies the system call information to be set.
The "addr" argument should be set to sizeof(struct ptrace_syscall_info).
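As a rough illustration of the rules above, the following Python sketch builds a syscall-entry request with ctypes. The structure layout and the value of PTRACE_SYSCALL_INFO_ENTRY are assumptions modelled on the uapi header, and make_entry_request() is a hypothetical helper; no ptrace() call is issued, because the request number is not given here.

```python
import ctypes

# Assumed value, modelled on the PTRACE_GET_SYSCALL_INFO uapi definitions.
PTRACE_SYSCALL_INFO_ENTRY = 1


class _Entry(ctypes.Structure):
    _fields_ = [("nr", ctypes.c_uint64), ("args", ctypes.c_uint64 * 6)]


class _Exit(ctypes.Structure):
    _fields_ = [("rval", ctypes.c_int64), ("is_error", ctypes.c_uint8)]


class _Seccomp(ctypes.Structure):
    _fields_ = [("nr", ctypes.c_uint64), ("args", ctypes.c_uint64 * 6),
                ("ret_data", ctypes.c_uint32), ("reserved2", ctypes.c_uint32)]


class _Details(ctypes.Union):
    _fields_ = [("entry", _Entry), ("exit", _Exit), ("seccomp", _Seccomp)]


class PtraceSyscallInfo(ctypes.Structure):
    # Assumed layout of struct ptrace_syscall_info; check the real header.
    _anonymous_ = ("details",)
    _fields_ = [
        ("op", ctypes.c_uint8),
        ("reserved", ctypes.c_uint8),       # must stay zero
        ("flags", ctypes.c_uint16),         # must stay zero
        ("arch", ctypes.c_uint32),          # currently ignored on set
        ("instruction_pointer", ctypes.c_uint64),  # currently ignored on set
        ("stack_pointer", ctypes.c_uint64),        # currently ignored on set
        ("details", _Details),
    ]


def make_entry_request(nr, args):
    """Hypothetical helper: describe a new syscall number and arguments."""
    if len(args) > 6:
        raise ValueError("at most six syscall arguments")
    info = PtraceSyscallInfo()  # ctypes zero-initialises every field
    info.op = PTRACE_SYSCALL_INFO_ENTRY
    info.entry.nr = nr
    for i, value in enumerate(args):
        info.entry.args[i] = value
    return info
```

A tracer would then pass a pointer to the structure as "data" and ctypes.sizeof(info) as "addr".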
Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
Link: https://lkml.kernel.org/r/20250303112044.GF24170@strace.io
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Eugene Syromiatnikov <esyr@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: anton ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Zankel <chris@zankel.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davide Berardi <berardi.dav@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Renzo Davoi <renzo@cs.unibo.it>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Move the code that calculates the type of the system call stop out of
ptrace_get_syscall_info() into a separate function
ptrace_get_syscall_info_op() which is going to be used later to implement
PTRACE_SET_SYSCALL_INFO API.
Link: https://lkml.kernel.org/r/20250303112038.GE24170@strace.io
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexey Gladkov (Intel) <legion@kernel.org>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: anton ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Zankel <chris@zankel.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davide Berardi <berardi.dav@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Renzo Davoi <renzo@cs.unibo.it>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is not intuitive, but vm_area_dup(), located in kernel/fork.c, is used
not only for duplicating VMAs during fork(), but also for duplicating VMAs
when splitting VMAs or when mremap()'ing them.
VM_PFNMAP mappings can at least get ordinarily mremap()'ed (no change in
size) and apparently also shrunk during mremap(), which implies
duplicating the VMA in __split_vma() first.
In case of ordinary mremap() (no change in size), we first duplicate the
VMA in copy_vma_and_data()->copy_vma() and then call untrack_pfn_clear() on
the old VMA: we effectively move the VM_PAT reservation. So the
untrack_pfn_clear() call on the new VMA when duplicating is wrong in that
context.
Splitting of VMAs seems problematic, because we don't duplicate/adjust the
reservation when splitting the VMA. Instead, in memtype_erase() -- called
during zapping/munmap -- we shrink a reservation in case only the end
address matches: Assume we split a VMA into A and B, both would share a
reservation until B is unmapped.
So when unmapping B, the reservation would be updated to cover only A.
When unmapping A, we would properly remove the now-shrunk reservation.
That scenario describes the mremap() shrinking (old_size > new_size),
where we split + unmap B, and the untrack_pfn_clear() on the new VMA is
wrong.
What if we manage to split a VM_PFNMAP VMA into A and B and unmap A first?
It would be broken because we would never free the reservation. Likely,
there are ways to trigger such a VMA split outside of mremap().
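The two unmap orders can be walked through with a toy model. This is only a sketch of the behaviour described above (exact match removes, end-only match shrinks), not the kernel's memtype_erase(); the class and method names are hypothetical.

```python
class ReservationTable:
    """Toy stand-in for the reservation tracking described above."""

    def __init__(self):
        self.ranges = []  # each entry is a [start, end) pair

    def reserve(self, start, end):
        self.ranges.append([start, end])

    def erase(self, start, end):
        """Return True if a reservation was removed or shrunk."""
        for r in self.ranges:
            if r == [start, end]:
                # Exact match: drop the whole reservation.
                self.ranges.remove(r)
                return True
            if r[1] == end and r[0] < start:
                # Only the end address matches: shrink the reservation.
                r[1] = start
                return True
        # Nothing matched: the request is rejected, nothing is freed.
        return False


# One reservation [0, 8) shared by A = [0, 4) and B = [4, 8) after a split.
table = ReservationTable()
table.reserve(0, 8)
table.erase(4, 8)        # unmap B: reservation shrinks to [0, 4)
table.erase(0, 4)        # unmap A: reservation is removed
print(table.ranges)      # nothing left

table.reserve(0, 8)
table.erase(0, 4)        # unmap A first: no match, nothing happens
table.erase(4, 8)        # unmap B: shrinks to [0, 4), which is never freed
print(table.ranges)      # the leftover reservation
```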
Affecting other VMA duplication was not intended; vm_area_dup() being used
outside of kernel/fork.c was an oversight. So let's fix that for now; how to
handle VMA splits better should be investigated separately.
With a simple reproducer that uses mprotect() to split such a VMA I can
trigger
x86/PAT: pat_mremap:26448 freeing invalid memtype [mem 0x00000000-0x00000fff]
Link: https://lkml.kernel.org/r/20250422144942.2871395-1-david@redhat.com
Fixes: dc84bc2aba ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Fix time keeping bugs in CLOCK_MONOTONIC_COARSE clocks
- Work around absolute relocations into vDSO code that
GCC erroneously emits in certain arm64 build environments
- Fix a false positive lockdep warning in the i8253 clocksource
driver
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'timers-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc timers fixes from Ingo Molnar:
- Fix time keeping bugs in CLOCK_MONOTONIC_COARSE clocks
- Work around absolute relocations into vDSO code that GCC erroneously
emits in certain arm64 build environments
- Fix a false positive lockdep warning in the i8253 clocksource driver
* tag 'timers-urgent-2025-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/i8253: Use raw_spinlock_irqsave() in clockevent_i8253_disable()
arm64: vdso: Work around invalid absolute relocations from GCC
timekeeping: Prevent coarse clocks going backwards
Make sure trace_probe_log_clear is called in the tracing
eprobe code path, matching the trace_probe_log_init call.
Link: https://lore.kernel.org/all/20250504-fix-trace-probe-log-race-v3-1-9e99fec7eddc@gmail.com/
Signed-off-by: Paul Cacheux <paulcacheux@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When CONFIG_PROVE_RCU_LIST is enabled, fprobe triggers the following
warning:
WARNING: suspicious RCU usage
kernel/trace/fprobe.c:457 RCU-list traversed in non-reader section!!
other info that might help us debug this:
#1: ffffffff863c4e08 (fprobe_mutex){+.+.}-{4:4}, at: fprobe_module_callback+0x7b/0x8c0
Call Trace:
fprobe_module_callback
notifier_call_chain
blocking_notifier_call_chain
This warning occurs because fprobe_remove_node_in_module() traverses an
RCU list using RCU primitives without holding an RCU read lock. However,
the function is only called from fprobe_module_callback(), which holds
the fprobe_mutex lock that provides sufficient protection for safely
traversing the list.
Fix the warning by specifying the locking design to the
CONFIG_PROVE_RCU_LIST mechanism. Add the lockdep_is_held() argument to
hlist_for_each_entry_rcu() to inform the RCU checker that fprobe_mutex
provides the required protection.
Link: https://lore.kernel.org/all/20250410-fprobe-v1-1-068ef5f41436@debian.org/
Fixes: a3dc2983ca ("tracing: fprobe: Cleanup fprobe hash when module unloading")
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Antonio Quartulli <antonio@mandelbit.com>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add support for retrieving ref_ctr_offset for the uprobe perf link,
which was somehow omitted from the initial uprobe link info changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20250509153539.779599-2-jolsa@kernel.org
There are applications that are hard-coded to write into the top level
trace_marker instance (/sys/kernel/tracing/trace_marker). This can be
annoying if a profiler is using that instance for other work, or if it
needs all writes to go into a new instance.
A new option is created called "copy_trace_marker". By default, the top
level has this set, as that is the default buffer that writing into the
top level trace_marker file will go to. But now if an instance is created
and sets this option, all writes into the top level trace_marker will also
be written into that instance buffer just as if an application were to
write into the instance's trace_marker file.
If the top level instance disables this option, then writes to its own
trace_marker and trace_marker_raw files will not go into its buffer.
If no instance has this option set, then the write will return an error
and errno will contain ENODEV.
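The routing rule can be sketched with a small user-space model. The Instance class and write_top_level_marker() are hypothetical names used for illustration only; the real logic lives in the kernel's tracing code.

```python
import errno


class Instance:
    """Toy trace instance with its own buffer and option flag."""

    def __init__(self, name, copy_trace_marker=False):
        self.name = name
        self.copy_trace_marker = copy_trace_marker
        self.buffer = []


def write_top_level_marker(instances, text):
    """Model a write to the top level trace_marker file."""
    targets = [inst for inst in instances if inst.copy_trace_marker]
    if not targets:
        # No instance wants the data: the write fails with ENODEV.
        raise OSError(errno.ENODEV, "no instance has copy_trace_marker set")
    for inst in targets:
        inst.buffer.append(text)
    return len(text)


top = Instance("top", copy_trace_marker=True)   # set by default
profiler = Instance("profiler", copy_trace_marker=True)
write_top_level_marker([top, profiler], "hello")  # lands in both buffers

top.copy_trace_marker = False
write_top_level_marker([top, profiler], "only profiler")
```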
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250508095639.39f84eda@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add a helper function called handle_dereference_arg() to replace the logic
that is identical in two locations of test_event_printk().
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250507191703.5dd8a61d@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There are several functions that have "goto out;" where the label out is just:
out:
return ret;
Simplify the code by just doing the return in the location and removing
all the out labels and jumps.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250507145456.121186494@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
According to trigger_data_alloc() doc, trigger_data_free() should be
used to free an event_trigger_data object. This fixes a mismatch introduced
when kzalloc was replaced with trigger_data_alloc without updating
the corresponding deallocation calls.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250507145455.944453325@goodmis.org
Link: https://lore.kernel.org/20250318112737.4174-1-linmq006@gmail.com
Fixes: e1f187d09e ("tracing: Have existing event_command.parse() implementations use helpers")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
[ SDR: Changed event_trigger_alloc/free() to trigger_data_alloc/free() ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function event_trigger_alloc() creates an event_trigger_data
descriptor and states that it needs to be freed via event_trigger_free().
This is incorrect; it needs to be freed by trigger_data_free(), as
event_trigger_free() adds ref counting.
Rename event_trigger_alloc() to trigger_data_alloc() and state that it
needs to be freed via trigger_data_free(). The old naming convention
was introducing bugs.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250507145455.776436410@goodmis.org
Fixes: 86599dbe2c ("tracing: Add helper functions to simplify event_command.parse() callback handling")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace_array_cpu had a "buffer_page" field that was originally going to
be used as a backup page for the ring buffer. But the ring buffer has its
own way of reusing pages and this field was never used.
Remove it.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.738849456@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The irqsoff tracer uses the per CPU "disabled" field to prevent corruption
of the accounting when it starts to trace interrupts disabled, but there's
a slight race that could happen if for some reason it was called twice.
Use atomic_inc_return() instead.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.567884756@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The per CPU "disabled" counter is used for the latency tracers and stack
tracers to make sure that their accounting isn't messed up by an NMI or
interrupt coming in and affecting the same CPU data. But the counter is an
atomic_t type. As it only needs to synchronize against the current CPU,
switch it over to local_t type.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.394925376@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The branch tracer currently checks the per CPU "disabled" field to know if
tracing is enabled or not for the CPU. As the "disabled" value is not used
anymore to turn off tracing generically, use tracer_tracing_is_on_cpu()
instead.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.224658526@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add the function ring_buffer_record_is_on_cpu() that returns true if the
ring buffer for a given CPU is writable and false otherwise.
Also add tracer_tracing_is_on_cpu() to return whether the ring buffer for a
given CPU is writable for a given trace_array.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212236.059853898@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.
Do not bother setting the per CPU disabled flag of the array_buffer data
to determine which CPUs can write to the buffer; rely only on the ring
buffer code itself to disable it.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212235.885452497@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.
Do not bother disabling the function graph tracer if the per CPU disabled
field is set. Just record as normal. If tracing is disabled in the ring
buffer it will not be recorded.
Also, when tracing is enabled again, it will not drop the return call of
the function.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212235.715752008@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.
The kdb_ftdump() function iterates over all the current tracing CPUs and
increments the "disabled" counter before doing the dump, and decrements it
afterward.
As the disabled flag can be ignored, doing this today is not reliable.
Instead, simply call tracer_tracing_off() and then tracer_tracing_on() to
disable and then enable the entire ring buffer in one go!
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Thompson <danielt@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/20250505212235.549033722@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.
The ftrace_dump_one() function iterates over all the current tracing CPUs and
increments the "disabled" counter before doing the dump, and decrements it
afterward.
As the disabled flag can be ignored, doing this today is not reliable.
Instead use the new tracer_tracing_disable() that calls into the ring
buffer code to do the disabling.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212235.381188238@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Allow a tracer to disable writing to its buffer for a temporary amount of
time and re-enable it.
The tracer_tracing_disable() will disable writing to the trace array
buffer, and requires a tracer_tracing_enable() to re-enable it.
The difference between tracer_tracing_disable() and tracer_tracing_off()
is that the disable version can nest, and requires as many enable() calls
as disable() calls to re-enable the buffer, whereas the off() function
can be called multiple times and only requires a single tracer_tracing_on()
to re-enable the buffer.
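The difference between the two pairs can be modelled with a counter and a flag. This is a sketch only: the TraceBuffer class is hypothetical, and treating the two mechanisms as independent of each other is an assumption not stated above.

```python
class TraceBuffer:
    """Toy model of nesting disable/enable versus a plain off/on switch."""

    def __init__(self):
        self.off = False        # toggled by tracing_off() / tracing_on()
        self.disable_depth = 0  # counted by tracing_disable() / tracing_enable()

    def tracing_off(self):
        self.off = True         # repeated calls change nothing further

    def tracing_on(self):
        self.off = False        # a single call undoes any number of off()s

    def tracing_disable(self):
        self.disable_depth += 1

    def tracing_enable(self):
        if self.disable_depth == 0:
            raise RuntimeError("enable without matching disable")
        self.disable_depth -= 1

    def writable(self):
        return not self.off and self.disable_depth == 0


buf = TraceBuffer()
buf.tracing_disable()
buf.tracing_disable()
buf.tracing_enable()
print(buf.writable())   # still disabled: one enable() is missing
buf.tracing_enable()
print(buf.writable())   # writable again
```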
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daniel Thompson <danielt@kernel.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/20250505212235.210330010@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Marek reported that the rework of handle_nested_irq() introduced an inverted
condition, which prevents handling of interrupts. Fix it up.
Fixes: 2ef2e13094 ("genirq/chip: Rework handle_nested_irq()")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://lore.kernel.org/all/46ed4040-ca11-4157-8bd7-13c04c113734@samsung.com
task_storage_{get,delete} has been moved to bpf_base_func_proto.
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-3-yangfeng59949@163.com
Commit ec5fbdfb99 ("cgroup/cpuset: Enable update_tasks_cpumask()
on top_cpuset") enabled us to pull CPUs dedicated to child partitions
from tasks in top_cpuset by ignoring per cpu kthreads. However, there
can be other kthreads that are not per cpu but have PF_NO_SETAFFINITY
flag set to indicate that we shouldn't mess with their CPU affinity.
For other kthreads, their affinity will be changed to skip CPUs dedicated
to child partitions whether it is an isolating or a scheduling one.
As all the per cpu kthreads have PF_NO_SETAFFINITY set, the
PF_NO_SETAFFINITY tasks are essentially a superset of per cpu kthreads.
Fix this issue by dropping the kthread_is_per_cpu() check and checking
the PF_NO_SETAFFINITY flag instead.
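The effect of swapping the check can be seen in a small model. The Kthread record and the three example threads are hypothetical; only the two predicates mirror the change described above.

```python
from dataclasses import dataclass


@dataclass
class Kthread:
    name: str
    per_cpu: bool          # stands in for kthread_is_per_cpu()
    no_setaffinity: bool   # stands in for the PF_NO_SETAFFINITY flag


def skipped_before(task):
    """Old rule: only per cpu kthreads keep their affinity."""
    return task.per_cpu


def skipped_after(task):
    """New rule: every PF_NO_SETAFFINITY kthread keeps its affinity."""
    return task.no_setaffinity


def example_tasks():
    return [
        # Per cpu kthreads always have PF_NO_SETAFFINITY set.
        Kthread("percpu", per_cpu=True, no_setaffinity=True),
        # Not per cpu, but still must not have its affinity changed.
        Kthread("pinned", per_cpu=False, no_setaffinity=True),
        # Ordinary kthread whose affinity may be updated.
        Kthread("plain", per_cpu=False, no_setaffinity=False),
    ]


tasks = example_tasks()
print([t.name for t in tasks if skipped_before(t)])
print([t.name for t in tasks if skipped_after(t)])
```

The "pinned" thread is the case the old check missed.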
Fixes: ec5fbdfb99 ("cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Empty cpumasks can't intersect with any others. Therefore, testing for
non-emptiness before testing for intersection is redundant.
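The argument can be checked exhaustively on small bitmasks. In this sketch plain Python integers stand in for cpumasks, and all function names are hypothetical.

```python
def intersects(a, b):
    """Plain intersection test on integer bitmasks."""
    return (a & b) != 0


def intersects_guarded(a, b):
    """Same test preceded by the redundant non-emptiness checks."""
    return a != 0 and b != 0 and (a & b) != 0


def guards_are_redundant(nbits):
    """Compare both forms over every pair of masks with nbits bits."""
    masks = range(1 << nbits)
    return all(intersects(a, b) == intersects_guarded(a, b)
               for a in masks for b in masks)


print(guards_are_redundant(4))
```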
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In preparation for supporting BPF load-acquire and store-release
instructions for architectures where bpf_jit_needs_zext() returns true
(e.g. riscv64), make insn_def_regno() handle load-acquires properly.
Acked-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Peilin Ye <yepeilin@google.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/09cb2aec979aaed9d16db41f0f5b364de39377c0.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Turn the default 5 second test delay for hibernation into a
configurable module parameter, so users can determine how
long to wait in this pseudo-hibernate state before resuming
the system.
The configurable delay parameter has been added for suspend, so
add an analogous one for hibernation.
Example (wait 30 seconds):
# echo 30 > /sys/module/hibernate/parameters/pm_test_delay
# echo core > /sys/power/pm_test
Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20250507063520.419635-1-zhangzihuan@kylinos.cn
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
pm_show_wakelocks() is called to generate a string when showing
attributes /sys/power/wake_(lock|unlock), but the string ends
with an unwanted space that was added back by mistake by commit
c9d967b2ce ("PM: wakeup: simplify the output logic of
pm_show_wakelocks()").
Remove the unwanted space.
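The formatting problem can be illustrated in user space as follows. Both functions are hypothetical models of the output logic; the assumption is that each name was emitted with a trailing separator before the final newline.

```python
def show_wakelocks_with_trailing_space(names):
    """Assumed shape of the faulty output: a space follows every name."""
    return "".join(name + " " for name in names) + "\n"


def show_wakelocks(names):
    """Names separated by single spaces, no space before the newline."""
    return " ".join(names) + "\n"


locks = ["alarm", "net"]
print(repr(show_wakelocks_with_trailing_space(locks)))
print(repr(show_wakelocks(locks)))
```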
Fixes: c9d967b2ce ("PM: wakeup: simplify the output logic of pm_show_wakelocks()")
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Link: https://patch.msgid.link/20250505-fix_power-v1-1-0f7f2c2f338c@quicinc.com
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In free_nsproxy() and the error path of create_new_namespaces() the
put_*_ns() calls are guarded by unnecessary NULL checks.
put_pid_ns(), put_ipc_ns(), put_uts_ns(), and put_time_ns() will never
receive a NULL argument unless their namespace type is disabled, and in
this case all four become no-ops at compile time anyway. put_mnt_ns()
will never receive a NULL argument at any time.
This unguarded usage is in line with other call sites of put_*_ns().
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Link: https://lore.kernel.org/20250508184930.183040-2-jsavitz@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Now all the pieces are in place to actually allow the power subsystem
to freeze/thaw filesystems during suspend/resume. Filesystems are only
frozen and thawed if the power subsystem does actually own the freeze.
We could bubble up errors and fail suspend/resume if the error isn't
EBUSY (aka it's already frozen) but I don't think that this is worth it.
Filesystem freezing during suspend/resume is best-effort. If the user
has 500 ext4 filesystems mounted and 4 fail to freeze for whatever
reason then we simply skip them.
What we have now is already a big improvement and let's see how we fare
with it before making our lives even harder (and uglier) than we have
to.
We add a new sysctl knob /sys/power/freeze_filesystems that will allow
userspace to freeze filesystems during suspend/hibernate. For now it
defaults to off. The thaw logic doesn't require checking whether
freezing is enabled because the power subsystem exclusively owns frozen
filesystems for the duration of suspend/hibernate and is able to skip
filesystems it doesn't need to freeze.
Also, it is technically possible that filesystem_freeze_enabled is true
and power freezes the filesystems, but before freezing all processes
another process disables filesystem_freeze_enabled. If power were to
place the filesystems_thaw() call under filesystem_freeze_enabled, it
would fail to thaw the filesystems it froze. The exclusive holder
mechanism makes it possible to iterate through the list without any
concern, making sure that no filesystems are left frozen.
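The ownership argument can be sketched with a toy model. The Filesystem class and both functions are hypothetical and only mirror the reasoning above: freezing is gated on the knob, thawing is gated on ownership alone.

```python
class Filesystem:
    """Toy filesystem that records who holds its freeze, if anyone."""

    def __init__(self, name, frozen_by=None):
        self.name = name
        self.frozen_by = frozen_by  # None, "user" or "power"


def filesystems_freeze(filesystems, freeze_enabled):
    """Freeze on suspend, but only when the knob is on."""
    if not freeze_enabled:
        return
    for fs in filesystems:
        if fs.frozen_by is None:    # already frozen: skip, best effort
            fs.frozen_by = "power"


def filesystems_thaw(filesystems):
    """Thaw on resume, deliberately without consulting the knob."""
    for fs in filesystems:
        if fs.frozen_by == "power":  # only touch freezes that power owns
            fs.frozen_by = None


mounts = [Filesystem("root"), Filesystem("data", frozen_by="user")]
filesystems_freeze(mounts, freeze_enabled=True)
# Even if the knob is switched off here, thawing still works.
filesystems_thaw(mounts)
print([(fs.name, fs.frozen_by) for fs in mounts])
```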
Link: https://lore.kernel.org/r/20250402-work-freeze-v2-3-6719a97b52ac@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
While an event tears down all links to it as an aux, the iteration
happens on the event's group leader instead of the group itself.
If the event is a group leader, it has no effect because the event is
also its own group leader. But otherwise there would be a risk of detaching
all the sibling events from the wrong group leader.
It just happens to work because each sibling's aux link is tested
against the right event before proceeding. Also the ctx lock is the same
for the events and their group leader so the iteration is safe.
Yet the iteration is confusing. Clarify the actual intent.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424161128.29176-5-frederic@kernel.org
The CPU hotplug handlers are called twice: at prepare and online stage.
Their role is to:
1) Enable/disable a CPU context. This is irrelevant and even buggy at
the prepare stage because the CPU is still offline. On early
secondary CPU up, creating an event attached to that CPU might
silently fail because the CPU context is observed as online but the
context installation's IPI failure is ignored.
2) Update the scope cpumasks and re-migrate the events accordingly in
the CPU down case. This is irrelevant at the prepare stage.
3) Remove the events attached to the context of the offlining CPU. It
even uses an (unnecessary) IPI for it. This is also irrelevant at the
prepare stage.
Also none of the *_PREPARE and *_STARTING architecture perf related CPU
hotplug callbacks rely on CPUHP_PERF_PREPARE.
CPUHP_AP_PERF_ONLINE is enough and the right place to perform the work.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424161128.29176-4-frederic@kernel.org
The following commit:
da916e96e2 ("perf: Make perf_pmu_unregister() useable")
has introduced two significant changes to the lifecycle of an event's parent:
1) An event that has exited now has EVENT_TOMBSTONE as a parent.
This can result in a situation where the delayed wakeup irq_work can
accidentally dereference EVENT_TOMBSTONE on:
CPU 0                                   CPU 1
-----                                   -----
__schedule()
    local_irq_disable()
    rq_lock()
    <NMI>
    perf_event_overflow()
        irq_work_queue(&child->pending_irq)
    </NMI>
    perf_event_task_sched_out()
        raw_spin_lock(&ctx->lock)
        ctx_sched_out()
        ctx->is_active = 0
        event_sched_out(child)
        raw_spin_unlock(&ctx->lock)
                                        perf_event_release_kernel(parent)
                                            perf_remove_from_context(child)
                                            raw_spin_lock_irq(&ctx->lock)
                                            // Sees !ctx->is_active
                                            // Removes from context inline
                                            __perf_remove_from_context(child)
                                                perf_child_detach(child)
                                                    event->parent = EVENT_TOMBSTONE
    raw_spin_rq_unlock_irq(rq);
    <IRQ>
    perf_pending_irq()
        perf_event_wakeup(child)
            ring_buffer_wakeup(child)
                rcu_dereference(child->parent->rb) <--- CRASH
This also concerns the call to kill_fasync() on parent->fasync.
2) The final parent reference count decrement can now happen before
the final child reference count decrement, i.e. the parent can now
be freed before its child. On PREEMPT_RT, this can result in a
situation where the delayed wakeup irq_work accidentally
dereferences a freed parent:
CPU 0                           CPU 1                           CPU 2
-----                           -----                           ------
perf_pmu_unregister()
    pmu_detach_events()
        pmu_get_event()
            atomic_long_inc_not_zero(&child->refcount)
                                <NMI>
                                perf_event_overflow()
                                    irq_work_queue(&child->pending_irq);
                                </NMI>
                                <IRQ>
                                irq_work_run()
                                    wake_irq_workd()
                                </IRQ>
                                preempt_schedule_irq()
                                =========> SWITCH to workd
                                irq_work_run_list()
                                    perf_pending_irq()
                                        perf_event_wakeup(child)
                                            ring_buffer_wakeup(child)
                                                event = child->parent
                                                                perf_event_release_kernel(parent)
                                                                    // Not last ref, PMU holds it
                                                                    put_event(child)
                                                                    // Last ref
                                                                    put_event(parent)
                                                                        free_event()
                                                                            call_rcu(...)
                                                                rcu_core()
                                                                    free_event_rcu()
                                                rcu_dereference(event->rb) <--- CRASH
This also concerns the call to kill_fasync() on parent->fasync.
The "easy" solution to 1) is to check that event->parent is not
EVENT_TOMBSTONE in perf_event_wakeup() (covering both the ring buffer
and fasync uses).
The "easy" solution to 2) is to make perf_event_wakeup() run entirely
under rcu_read_lock().
However, because of 2), sanity would prescribe making event::parent
an __rcu pointer and annotating every user to prove they are
reliable.
Propose an alternate solution and restore the stable pointer to the
parent until all its children have called _free_event() themselves to
avoid any further accident. Also revert the EVENT_TOMBSTONE design
that is mostly here to determine which caller of perf_event_exit_event()
must perform the refcount decrement on a child event matching the
increment in inherit_event().
Arrange instead for checking the attach state of an event prior to its
removal and decrement the refcount of the child accordingly.
Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
When inherit_event() fails after the child allocation but before the
parent refcount has been incremented, calling put_event() wrongly
drops a reference to the parent, risking freeing it too early.
Also pmu_get_event() can't be holding a reference to the child
concurrently at this point since it is under pmus_srcu critical section.
Fix it by restoring the deleted free_event() function and calling it on
the failing child in order to free it directly under the verified
assumption that its refcount is only 1. The reference to the parent is
then deliberately left untouched.
Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250424161128.29176-2-frederic@kernel.org
The ignore_pid boolean on the per CPU data descriptor is updated at
sched_switch when a new task is scheduled in. If the new task is to be
ignored, it is set to true, otherwise it is set to false. The current task
should always have the correct value as it is updated when the task is
scheduled in.
Instead of breaking up the read of this value, which requires preemption
to be disabled, just use this_cpu_read() which gives a snapshot of the
value. Since the value will always be correct for a given task (because
it's updated at sched switch) it doesn't need preemption disabled.
This will also allow trace events to be called with preemption enabled.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212235.038958766@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The per CPU "disabled" value was the original way to disable tracing when
the tracing subsystem was first created. Today, the ring buffer
infrastructure has its own way to disable tracing. In fact, things have
changed so much since 2008 that many things ignore the disable flag.
There's no reason for the function tracer to check it: if tracing is
disabled, the ring buffer will not record the event anyway.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212234.868972758@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The mmiotracer referenced the per CPU array_buffer->data descriptor but
never actually used it. Remove the references to it.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250505212234.696945463@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Allocate kernel memory for processing CPU string
(/sys/kernel/tracing/osnoise/cpus) also in osnoise_cpus_write to allow
the writing of a CPU string of an arbitrary length.
This replaces the 256-byte buffer, which is insufficient with the rising
number of CPUs. For example, if I wanted to measure on every even CPU
on a system with 256 CPUs, the string would be 456 characters long.
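The 456-character figure can be checked with a small userspace sketch. The helper name even_cpu_list_len() is hypothetical and is not part of the osnoise code; it only counts the characters of a list of the form "0,2,4,...".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not kernel code): length, excluding the trailing
 * NUL, of the list "0,2,4,..." naming every even CPU below nr_cpus.
 */
size_t even_cpu_list_len(unsigned int nr_cpus)
{
	size_t len = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu += 2) {
		int digits;

		if (cpu)
			len++;	/* separating comma */
		/* snprintf(NULL, 0, ...) returns the number of characters needed */
		digits = snprintf(NULL, 0, "%u", cpu);
		assert(digits > 0);
		len += (size_t)digits;
	}
	return len;
}
```

For 256 CPUs this gives 329 digits plus 127 commas, which does not fit in a 256-byte buffer.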
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250425091839.343289-1-tglozar@redhat.com
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The structure ftrace_func_mapper only contains a single field and that is
a ftrace_hash. The wrapper abstracts the hash away from a normal
ftrace_hash to control how users may modify it.
The freeing of a ftrace_func_mapper structure is:
    free_ftrace_hash(&mapper->hash);
Without context, this looks like a bug. Add a comment stating that it is
not a bug and that the structure is intentionally freed this way.
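Why freeing through the member is safe can be shown with a userspace sketch. It assumes, as the description above implies, that the wrapper is only another view of the hash allocation. All names ending in _sketch, and the live counter, are made up for illustration and are not the ftrace API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins; the real ftrace structures carry more state. */
struct hash_sketch {
	unsigned long count;
};

struct mapper_sketch {
	struct hash_sketch hash;	/* sole member, so it sits at offset 0 */
};

static int live_hashes;	/* allocations not yet freed, for illustration */

struct mapper_sketch *mapper_alloc(void)
{
	struct hash_sketch *hash = calloc(1, sizeof(*hash));

	if (!hash)
		return NULL;
	assert(offsetof(struct mapper_sketch, hash) == 0);
	live_hashes++;
	/* The mapper is just another view of the hash allocation. */
	return (struct mapper_sketch *)hash;
}

void hash_free(struct hash_sketch *hash)
{
	free(hash);
	live_hashes--;
}

void mapper_free(struct mapper_sketch *mapper)
{
	/* Same address as mapper itself, so this releases the whole object. */
	hash_free(&mapper->hash);
}

int mapper_live_count(void)
{
	return live_hashes;
}
```

Because the hash is the first and only member, &mapper->hash and mapper are the same address, so nothing is leaked.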
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250416165420.5c717420@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Depth is stored as int because the code uses negative values to break
out of iterations. But what is recorded is always zero or positive. So
expose it as unsigned int instead of int.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250412221847.17310-3-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function graph infrastructure uses subops of the function tracer.
These are not shown in enabled_functions. Add a "subops:" section to the
enabled_functions line to show what functions are attached via subops. If
the subops is from the function_graph infrastructure, then show the entry
and return callbacks that are attached.
Here's an example of the output:
schedule_on_each_cpu (1) tramp: 0xffffffffc03ef000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:trace_graph_entry+0x0/0x20 ret:trace_graph_return+0x0/0x150}
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250410153830.5d97f108@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The lock guard conversion converted raw_spin_lock_irq() to
scoped_guard(raw_spinlock), which is obviously bogus and makes lockdep
mightily unhappy.
Note to self: Copy and pasta without using brain is a patently bad idea.
Fixes: 88a4df117a ("genirq/cpuhotplug: Convert to lock guards")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Doing cpufreq-specific EAS checks that require accessing policy
internals directly from sched_is_eas_possible() is a bit unfortunate,
so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
into that new function and make sched_is_eas_possible() call it.
While at it, address a possible race between the EAS governor check
and governor change by doing the former under the policy rwsem.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/2317800.iZASKD2KPV@rjwysocki.net
Add a helper for checking if schedutil is the current governor for
a given cpufreq policy and use it in sched_is_eas_possible() to avoid
accessing cpufreq policy internals directly from there.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/3365956.44csPzL39Z@rjwysocki.net
In 'lookup_or_create_module_kobject()', an internal kobject is created
using 'module_ktype'. A call to 'kobject_put()' on the error handling
path therefore causes an attempt to use an uninitialized completion
pointer in 'module_kobject_release()'. In this scenario, we just want to
release the kobject without the extra synchronization required for the
regular module unloading process, so adding a check for whether
'complete()' is actually required makes 'kobject_put()' safe.
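The shape of such a check can be sketched in userspace as follows. The types and functions here are simplified stand-ins with made-up names, and the sketch assumes the unused completion pointer is NULL; it is not the actual module code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a completion: just records that it was signalled. */
struct completion_sketch {
	int done;
};

/* Stand-in for the module kobject wrapper. */
struct mkobj_sketch {
	struct completion_sketch *kobj_completion;	/* NULL if nobody waits */
};

static void complete_sketch(struct completion_sketch *c)
{
	c->done = 1;
}

void mkobj_release(struct mkobj_sketch *mk)
{
	assert(mk != NULL);
	/* Only the regular unload path installs a completion to signal. */
	if (mk->kobj_completion)
		complete_sketch(mk->kobj_completion);
}
```

An internally created object never installs a completion, so its release becomes a plain no-op instead of a dereference of an unset pointer.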
Reported-by: syzbot+7fb8a372e1f6add936dd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7fb8a372e1f6add936dd
Fixes: 942e443127 ("module: Fix mod->mkobj.kobj potentially freed too early")
Cc: stable@vger.kernel.org
Suggested-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Link: https://lore.kernel.org/r/20250507065044.86529-1-dmantipov@yandex.ru
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
To receive 428dc9fc08 ("sched_ext: bpf_iter_scx_dsq_new() should always
initialize iterator") which conflicts with cdf5a6faa8 ("sched_ext: Move
dsq_hash into scx_sched"). The conflict is a simple context conflict which
can be resolved by taking the changes from both commits in the right order.
BPF programs may call next() and destroy() on BPF iterators even after new()
returns an error value (e.g. bpf_for_each() macro ignores error returns from
new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized
state after an error return causing bpf_iter_scx_dsq_next() to dereference
garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that
next() and destroy() become noops.
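The "always initialize, even on error" rule can be sketched with a toy iterator over an int array. struct iter_sketch and its functions are hypothetical and unrelated to the real DSQ iterator layout; only the pattern is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical iterator over an int array; not the sched_ext structure. */
struct iter_sketch {
	const int *items;	/* NULL means "nothing to iterate" */
	size_t nr;
	size_t pos;
};

int iter_new(struct iter_sketch *it, const int *items, size_t nr)
{
	assert(it != NULL);
	/* Clear first so that every error path leaves a benign iterator. */
	it->items = NULL;
	it->nr = 0;
	it->pos = 0;

	if (!items || !nr)
		return -EINVAL;

	it->items = items;
	it->nr = nr;
	return 0;
}

const int *iter_next(struct iter_sketch *it)
{
	if (!it->items || it->pos >= it->nr)
		return NULL;	/* also the result after a failed iter_new() */
	return &it->items[it->pos++];
}

void iter_destroy(struct iter_sketch *it)
{
	it->items = NULL;	/* safe whether or not iter_new() succeeded */
}
```

A caller that ignores the error from iter_new() and still calls iter_next() and iter_destroy() then only ever sees NULL, never stale memory.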
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 650ba21b13 ("sched_ext: Implement DSQ iterator")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
This code pattern trips clang up:
    if (fail)
        goto undo;
    guard(lock)(lock);
    do_stuff();
    return 0;
    undo:
    ...
as it somehow extends the scope of the guard beyond the return statement.
Replace it with a scoped guard to help it to get its act together.
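A userspace sketch of the scoped variant, using the GCC/clang cleanup attribute that the kernel guards build upon. struct lock_sketch and the helper names are invented for illustration, and the plain block stands in for the kernel's scoped_guard() macro.

```c
#include <assert.h>

struct lock_sketch {
	int held;
	int work_done;
};

static struct lock_sketch *lock_acquire(struct lock_sketch *l)
{
	assert(!l->held);
	l->held = 1;
	return l;
}

/* Cleanup handler: called with a pointer to the guard variable. */
static void lock_release(struct lock_sketch **guard)
{
	if (*guard)
		(*guard)->held = 0;
}

int guarded_op(struct lock_sketch *l, int fail)
{
	if (fail)
		goto undo;

	{
		/* The guard lives only inside this block. */
		struct lock_sketch *guard __attribute__((cleanup(lock_release))) =
			lock_acquire(l);

		assert(guard->held);
		l->work_done++;
		return 0;	/* lock_release() runs on the way out */
	}

undo:
	/* Outside the guard's scope, so the goto does not jump into it. */
	return -1;
}
```

Confining the guard to its own block keeps the undo label outside the guarded region, so the early goto never bypasses the guard's initialization.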
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Closes: https://lore.kernel.org/oe-kbuild-all/202505071809.ajpPxfoZ-lkp@intel.com/
Split hib_submit_io into a sync and async version. The sync version is
a small wrapper around bdev_rw_virt which implements all the logic to
add a kernel direct mapping range to a bio and synchronously submits it,
while the async version is slightly simplified using the
bio_add_virt_nofail for adding the single range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
remove_percpu_irq() has been unused since it was added in 2011 by
commit 31d9d9b6d8 ("genirq: Add support for per-cpu dev_id interrupts").
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250420164656.112641-1-linux@treblig.org
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.670808288@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.612184618@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.552884529@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.494561120@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.435932527@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.376836282@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.315844964@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.258216558@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
Make the return value boolean to reflect its meaning.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.187250840@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/87ldrhq0hc.ffs@tglx
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.071157729@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065422.013088277@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/87ikmlq0fk.ffs@tglx
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.897188799@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.830357569@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.650454052@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.590753128@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.532308759@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.473563978@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.415072350@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.355673840@linutronix.de
Use the new guards to get and lock the interrupt descriptor and tidy up the
code.
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.295400891@linutronix.de
Convert all lock/unlock pairs to guards and tidy up the code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.236248749@linutronix.de
Use the new helpers to decide whether the interrupt should be handled and
switch the descriptor locking to guard().
Note: The mask_irq() operation in the second condition was redundant as the
interrupt is already masked right at the beginning of the function.
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.175652864@linutronix.de
Use the new helpers to decide whether the interrupt should be handled and
switch the descriptor locking to guard().
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.105015800@linutronix.de
Use the new helpers to decide whether the interrupt should be handled and
switch the descriptor locking to guard().
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065421.045492336@linutronix.de
Use the new helpers to decide whether the interrupt should be handled and
switch the descriptor locking to guard().
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.986002418@linutronix.de
Use the new helpers to decide whether the interrupt should be handled and
switch the descriptor locking to guard().
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.926362488@linutronix.de
Use the new helpers to decide whether the interrupt should be handled and
switch the descriptor locking to guard().
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.865212916@linutronix.de
Use the new helpers to decide whether the interrupt should be handled and
switch the descriptor locking to guard().
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.804683349@linutronix.de
Use the new helpers to decide whether the interrupt should be handled and
switch the descriptor locking to guard().
Fixup the kernel doc comment while at it.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.744042890@linutronix.de
The interrupt flow handlers have similar patterns to decide whether to
handle an interrupt or not.
Provide common helper functions to allow removal of duplicated code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.682547546@linutronix.de
Convert all lock/unlock pairs to guards and tidy up the code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.620200108@linutronix.de
Convert all lock/unlock pairs to guards and tidy up the code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.560083665@linutronix.de
Convert all lock/unlock pairs to guards and tidy up the code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.497714413@linutronix.de
Convert all lock/unlock pairs to guards and tidy up the code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.373998838@linutronix.de
Convert all lock/unlock pairs to guards and tidy up the code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.312487167@linutronix.de
Convert all lock/unlock pairs to guards and tidy up the code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.251299112@linutronix.de
Replace all lock/unlock pairs with lock guards and simplify the code flow.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/871ptaqhoo.ffs@tglx
The interrupt core code has an ever repeating pattern:
    unsigned long flags;
    struct irq_desc *desc = irq_get_desc_[bus]lock(irq, &flags, mode);
    if (!desc)
        return -EINVAL;
    ....
    irq_put_desc_[bus]unlock(desc, flags);
That requires gotos in failure paths and just creates visual clutter.
Provide lock guards, which allow the code to be simplified.
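How a guard removes the explicit unlock and the failure gotos can be sketched in userspace with the compiler's cleanup attribute. The descriptor table and every name ending in _sketch are assumptions made for this example, not the genirq API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_DESCS 4

struct desc_sketch {
	int locked;
	int depth;
};

static struct desc_sketch descs[NR_DESCS];

/* Look up and lock a descriptor; NULL if the number is out of range. */
static struct desc_sketch *desc_get_and_lock(unsigned int irq)
{
	if (irq >= NR_DESCS)
		return NULL;
	assert(!descs[irq].locked);
	descs[irq].locked = 1;
	return &descs[irq];
}

/* Cleanup handler: unlocks only if the lookup succeeded. */
static void desc_put_and_unlock(struct desc_sketch **desc)
{
	if (*desc)
		(*desc)->locked = 0;
}

int disable_sketch(unsigned int irq)
{
	struct desc_sketch *desc __attribute__((cleanup(desc_put_and_unlock))) =
		desc_get_and_lock(irq);

	if (!desc)
		return -EINVAL;	/* cleanup sees NULL and does nothing */
	desc->depth++;
	return 0;		/* unlock happens automatically */
}

const struct desc_sketch *desc_peek(unsigned int irq)
{
	return irq < NR_DESCS ? &descs[irq] : NULL;
}
```

Every return path, including the early -EINVAL, leaves the descriptor unlocked without a goto or a flags variable in the caller.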
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250429065420.061659985@linutronix.de
The kernel fq qdisc implementation only needs to look at the fields of
the first node in a list; it does not always need to remove that node
from the list. It is more convenient to have a peek kfunc for the list.
It works similarly to bpf_rbtree_first().
This patch adds the bpf_list_{front,back} kfuncs. The verifier is changed
such that a kfunc returning "struct bpf_list_node *" will have its return
pointer marked as non-owning. The exception is a KF_ACQUIRE kfunc. The
net effect is that only the new bpf_list_{front,back} kfuncs will
have their return pointers marked as non-owning.
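The difference between peeking and popping can be sketched on a plain circular doubly linked list in userspace. The types and functions below are illustrative only; they are not the BPF list kfuncs and they ignore locking and ownership tracking.

```c
#include <assert.h>
#include <stddef.h>

struct node_sketch {
	struct node_sketch *prev, *next;
	int val;
};

struct list_sketch {
	struct node_sketch head;	/* sentinel of a circular list */
};

void list_init(struct list_sketch *l)
{
	l->head.prev = l->head.next = &l->head;
}

void list_push_back(struct list_sketch *l, struct node_sketch *n)
{
	assert(n != &l->head);
	n->prev = l->head.prev;
	n->next = &l->head;
	l->head.prev->next = n;
	l->head.prev = n;
}

/* Peek: return the first node but leave it linked. */
struct node_sketch *list_front(struct list_sketch *l)
{
	return l->head.next == &l->head ? NULL : l->head.next;
}

/* Peek: return the last node but leave it linked. */
struct node_sketch *list_back(struct list_sketch *l)
{
	return l->head.prev == &l->head ? NULL : l->head.prev;
}

/* Pop: unlink the first node and hand it to the caller. */
struct node_sketch *list_pop_front(struct list_sketch *l)
{
	struct node_sketch *n = list_front(l);

	if (!n)
		return NULL;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = NULL;
	return n;
}
```

A peek leaves the list unchanged, so repeated calls return the same node, whereas a pop transfers the node to the caller.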
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-8-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The next patch will add the bpf_list_{front,back} kfuncs to peek at the
head and tail of a list. Both of them will return a 'struct bpf_list_node *'.
Following the earlier change for rbtree, this patch checks whether the
return btf type is a 'struct bpf_list_node' pointer instead
of checking each kfunc individually to decide if
mark_reg_graph_node should be called. This will make
the bpf_list_{front,back} kfunc addition easier in
the later patch.
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-7-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_rbtree_{remove,left,right} require the root's lock to be held.
They also check via node_internal->owner that the node is still owned
by that root before proceeding, so it is safe to allow a refcounted
bpf_rb_node pointer to be used in these kfuncs.
In a bpf fq implementation which is much closer to the kernel fq,
https://lore.kernel.org/bpf/20250418224652.105998-13-martin.lau@linux.dev/,
a networking flow (allocated by bpf_obj_new) can be added to two different
rbtrees. There are cases where the flow is looked up in one rbtree,
a refcount on the flow is taken, and the flow is then removed from
another rbtree:
    struct fq_flow {
        struct bpf_rb_node  fq_node;
        struct bpf_rb_node  rate_node;
        struct bpf_refcount refcount;
        unsigned long       sk_long;
    };

    int bpf_fq_enqueue(...)
    {
        /* ... */
        bpf_spin_lock(&root->lock);
        while (can_loop) {
            /* ... */
            if (!p)
                break;
            gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
            if (gc_f->sk_long == sk_long) {
                f = bpf_refcount_acquire(gc_f);
                break;
            }
            /* ... */
        }
        bpf_spin_unlock(&root->lock);
        if (f) {
            bpf_spin_lock(&q->lock);
            bpf_rbtree_remove(&q->delayed, &f->rate_node);
            bpf_spin_unlock(&q->lock);
        }
    }
bpf_rbtree_{left,right} do not need this change but are relaxed together
with bpf_rbtree_remove instead of adding extra verifier logic
to exclude these kfuncs.
To avoid a bisect failure, this patch also changes the selftests.
The "rbtree_api_remove_unadded_node" test no longer expects a verifier error.
The test now expects bpf_rbtree_remove(&groot, &m->node) to return NULL.
The test uses __retval(0) to ensure this NULL return value.
Some of the "only take non-owning..." failure messages are changed as well.
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-5-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A bpf fq implementation that is much closer to the kernel fq
needs to traverse the rbtree:
https://lore.kernel.org/bpf/20250418224652.105998-13-martin.lau@linux.dev/
The much simplified logic that uses bpf_rbtree_{root,left,right}
to traverse the rbtree looks like this:
    struct fq_flow {
        struct bpf_rb_node  fq_node;
        struct bpf_rb_node  rate_node;
        struct bpf_refcount refcount;
        unsigned long       sk_long;
    };

    struct fq_flow_root {
        struct bpf_spin_lock lock;
        struct bpf_rb_root   root __contains(fq_flow, fq_node);
    };

    struct fq_flow *fq_classify(...)
    {
        struct bpf_rb_node *tofree[FQ_GC_MAX];
        struct fq_flow_root *root;
        struct fq_flow *gc_f, *f;
        struct bpf_rb_node *p;
        int i, fcnt = 0;

        /* ... */

        f = NULL;
        bpf_spin_lock(&root->lock);
        p = bpf_rbtree_root(&root->root);
        while (can_loop) {
            if (!p)
                break;
            gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
            if (gc_f->sk_long == sk_long) {
                f = bpf_refcount_acquire(gc_f);
                break;
            }
            /* To be removed from the rbtree */
            if (fcnt < FQ_GC_MAX && fq_gc_candidate(gc_f, jiffies_now))
                tofree[fcnt++] = p;
            if (gc_f->sk_long > sk_long)
                p = bpf_rbtree_left(&root->root, p);
            else
                p = bpf_rbtree_right(&root->root, p);
        }
        /* remove from the rbtree */
        for (i = 0; i < fcnt; i++) {
            p = tofree[i];
            tofree[i] = bpf_rbtree_remove(&root->root, p);
        }
        bpf_spin_unlock(&root->lock);

        /* bpf_obj_drop the fq_flow(s) that have just been removed
         * from the rbtree.
         */
        for (i = 0; i < fcnt; i++) {
            p = tofree[i];
            if (p) {
                gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
                bpf_obj_drop(gc_f);
            }
        }
        return f;
    }
The above simplified code needs to traverse the rbtree for two purposes,
1) find the flow with the desired sk_long value
2) while searching for the sk_long, collect flows that are
the fq_gc_candidate. They will be removed from the rbtree.
This patch adds the bpf_rbtree_{root,left,right} kfuncs to enable
the rbtree traversal. The returned bpf_rb_node pointer will be a
non-owning reference, the same as the pointer returned
by the existing bpf_rbtree_first kfunc.
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The current rbtree kfuncs, bpf_rbtree_{first, remove}, return a
bpf_rb_node pointer. check_kfunc_call currently checks the
kfunc btf_id instead of its return pointer type to decide
if it needs to do mark_reg_graph_node(reg0) and ref_set_non_owning(reg0).
The later patch will add bpf_rbtree_{root,left,right} that will also
return a bpf_rb_node pointer. Instead of adding more kfunc btf_id
checks to the "if" case, this patch changes the test to check the
kfunc's return type. The is_rbtree_node_type() function is added to
test if a pointer type is a bpf_rb_node. The callers have already
skipped the modifiers of the pointer type.
A note on ref_set_non_owning(): although bpf_rbtree_remove()
also returns a bpf_rb_node pointer, bpf_rbtree_remove()
has the KF_ACQUIRE flag. Thus, its reg0 will not become non-owning.
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In a later patch, two new kfuncs will take the bpf_rb_node pointer arg.
struct bpf_rb_node *bpf_rbtree_left(struct bpf_rb_root *root,
struct bpf_rb_node *node);
struct bpf_rb_node *bpf_rbtree_right(struct bpf_rb_root *root,
struct bpf_rb_node *node);
In the check_kfunc_call, there is a "case KF_ARG_PTR_TO_RB_NODE"
to check if the reg->type should be an allocated pointer or should be
a non_owning_ref.
The later patch will need to ensure that the bpf_rb_node pointer passed
to the new bpf_rbtree_{left,right} is a non_owning_ref. This is the
same requirement as for the existing bpf_rbtree_remove.
This patch swaps the current "if else" statement. Instead of checking
the bpf_rbtree_remove, it checks the bpf_rbtree_add. Then the new
bpf_rbtree_{left,right} will fall into the "else" case to make
the later patch simpler. bpf_rbtree_add should be the only
one that needs an allocated pointer.
This should be a no-op change considering there are only two kfunc(s)
taking bpf_rb_node pointer arg, rbtree_add and rbtree_remove.
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The last remaining user of vfs_submount() (tracefs) is easy to convert
to fs_context_for_submount(); do that and bury that thing, along with
SB_SUBMOUNT.
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There have been recent reports about running out of lockdep keys:
MAX_LOCKDEP_KEYS too low!
One possible reason is that too many dynamic keys have been registered.
A possible culprit is the lockdep_register_key() call in qdisc_alloc()
of net/sched/sch_generic.c.
Currently, there is no way to find out how many dynamic keys have been
registered. Add such a stat to the /proc/lockdep_stats to get better
clarity.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: llvm@lists.linux.dev
Link: https://lore.kernel.org/r/20250506042049.50060-4-boqun.feng@gmail.com
To catch the code trying to use a subclass value >= MAX_LOCKDEP_SUBCLASSES (8),
add a DEBUG_LOCKS_WARN_ON() statement to notify the users that such a
large value is not allowed.
[ boqun: Reword the commit log with a more objective tone ]
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: llvm@lists.linux.dev
Link: https://lore.kernel.org/r/20250506042049.50060-3-boqun.feng@gmail.com
When hlock_equal() is unused, it prevents kernel builds with clang,
`make W=1` and CONFIG_WERROR=y, CONFIG_LOCKDEP=y and
CONFIG_LOCKDEP_SMALL=n:
lockdep.c:2005:20: error: unused function 'hlock_equal' [-Werror,-Wunused-function]
Fix this by moving the function under the existing ifdeffery of its
only user.
See also:
6863f5643d ("kbuild: allow Clang to find unused static inline functions for W=1 build")
Fixes: 68e3056785 ("lockdep: Adjust check_redundant() for recursive read change")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <nick.desaulniers+lkml@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: llvm@lists.linux.dev
Link: https://lore.kernel.org/r/20250506042049.50060-2-boqun.feng@gmail.com
If one wants to trace the name of the task that wakes up a process and
pass that to the synthetic events, there's nothing currently that lets the
synthetic events do that. Add a "common_comm" to the histogram logic that
allows histograms to save the current->comm as a variable that can be passed
through and added to a synthetic event:
# cd /sys/kernel/tracing
# echo 's:wake_lat char[] waker; char[] wakee; u64 delta;' >> dynamic_events
# echo 'hist:keys=pid:comm=common_comm:ts=common_timestamp.usecs if !(common_flags & 0x18)' > events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:wake_comm=$comm:delta=common_timestamp.usecs-$ts:onmatch(sched.sched_waking).trace(wake_lat,$wake_comm,next_comm,$delta)' > events/sched/sched_switch/trigger
The above will create a synthetic trace event that will save both the name
of the waker and the wakee but only if the wakeup did not happen in a hard
or soft interrupt context.
The "common_comm" is used to save the task->comm at the time of the
initial event and is passed via the "comm" variable to the second event,
and that is saved as the "waker" field in the "wake_lat" synthetic event.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250407154912.3c6c6246@gandalf.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The histogram trigger has three somewhat large arrays on the kernel stack:
unsigned long entries[HIST_STACKTRACE_DEPTH];
u64 var_ref_vals[TRACING_MAP_VARS_MAX];
char compound_key[HIST_KEY_SIZE_MAX];
Checking the function event_hist_trigger() stack frame size, it currently
uses 816 bytes for its stack frame due to these variables!
Instead, allocate a per CPU structure that holds these arrays for each
context level (normal, softirq, irq and NMI). That is, each CPU will have
4 of these structures. This will be allocated when the first histogram
trigger is enabled and freed when the last is disabled. When the
histogram callback triggers, it will request this structure. The request
will disable preemption, get the per CPU structure at the index of the
per CPU variable, and increment that variable.
The callback will use the arrays in this structure to perform its work and
then release the structure. That in turn will simply decrement the per CPU
index and enable preemption.
Moving the variables from the kernel stack to the per CPU structure brings
the stack frame of event_hist_trigger() down to just 112 bytes.
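The get/release scheme described above can be sketched as follows. This is a userspace illustration only: struct cpu_pad, pad_get() and pad_put() are hypothetical names, KEY_SIZE is a placeholder rather than HIST_KEY_SIZE_MAX, and a plain struct stands in for real per-CPU storage.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CONTEXTS 4	/* normal, softirq, irq and NMI */
#define KEY_SIZE    64	/* placeholder size for the sketch */

/* The arrays that used to live on the stack would go in here. */
struct hist_bufs {
	char compound_key[KEY_SIZE];
};

/* One CPU's worth of state; the kernel would use a per-CPU allocation. */
struct cpu_pad {
	struct hist_bufs bufs[NR_CONTEXTS];
	int depth;	/* index of the next free slot */
};

/* Hand out the slot for the current nesting level, then bump the index. */
static struct hist_bufs *pad_get(struct cpu_pad *pad)
{
	/* kernel: preemption would be disabled here */
	if (pad->depth >= NR_CONTEXTS)
		return NULL;
	return &pad->bufs[pad->depth++];
}

/* Release the most recently handed out slot. */
static void pad_put(struct cpu_pad *pad)
{
	assert(pad->depth > 0);
	pad->depth--;
	/* kernel: preemption would be re-enabled here */
}
```

Because an interrupting context always finishes before the interrupted one resumes, a simple counter is enough to give each nesting level its own slot.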
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Link: https://lore.kernel.org/20250407123851.74ea8d58@gandalf.local.home
Fixes: 067fe038e7 ("tracing: Add variable reference handling to hist triggers")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The add_to_key() function tests if the key is a string or some data. If
it's a string it does some further calculations of the string size (still
truncating it to the max size it can be), and calls strncpy().
If the key isn't a string it calls memcpy(). The interesting point is
that both use the exact same parameters:
strncpy(compound_key + key_field->offset, (char *)key, size);
} else
memcpy(compound_key + key_field->offset, key, size);
As strncpy() is being used simply as a memcpy() for a string, and since
strncpy() is deprecated, just call memcpy() for both memory and string
keys.
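A small userspace sketch of why the two calls are interchangeable here: when the size is already limited to at most strlen(key) + 1, strncpy() has no padding to do and copies the same bytes as memcpy(). The helper name copy_key() and its max parameter are hypothetical and do not mirror add_to_key().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a string key into dst, truncated to at most `max` bytes.
 * Since size never exceeds strlen(key) + 1, memcpy() writes exactly
 * the bytes strncpy(dst, key, size) would have written.
 */
static size_t copy_key(char *dst, const char *key, size_t max)
{
	size_t size;

	assert(dst && key);
	size = strlen(key) + 1;
	if (size > max)
		size = max;
	memcpy(dst, key, size);
	return size;
}
```

Note that a truncated key is not NUL-terminated in either variant, so the destination must be treated as a fixed-size key field rather than a C string.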
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/20250403210637.1c477d4a@gandalf.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the "fields" option is set in a trace instance, it ignores the "print fmt"
portion of the trace event and just prints the raw fields defined by the
TP_STRUCT__entry() of the TRACE_EVENT() macro.
The preempt_disable/enable and irq_disable/enable events record only the
caller offset from _stext to save space in the ring buffer. Even though
the "fields" option only prints the fields, it also tries to print what
they represent too, which includes function names.
Add a check in the output of the event field printing to see if the field
name is "caller_offs" or "parent_offs" and then print the function at the
offset from _stext of that field.
Instead of just showing:
irq_disable: caller_offs=0xba634d (12215117) parent_offs=0x39d10e2 (60625122)
Show:
irq_disable: caller_offs=trace_hardirqs_off.part.0+0xad/0x130 0xba634d (12215117) parent_offs=_raw_spin_lock_irqsave+0x62/0x70 0x39d10e2 (60625122)
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250506105131.4b6089a9@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add adjustments to the values of the "fields" output if the buffer is a
persistent ring buffer to adjust the addresses to both the kernel core and
kernel modules if they match a module in the persistent memory and that
module is also loaded.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250325185619.54b85587@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The trace_adjust_address() will take a given address and examine the
persistent ring buffer to see if the address matches a module that is
listed there. If it does not, it will just adjust the value to the core
kernel delta. But if the address was for something that was not part of
the core kernel text or data it should not be adjusted.
Check the result of the adjustment and only return the adjustment if it
lands in the current kernel text or data. If not, return the original
address.
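The check described above amounts to "apply the delta, keep the result only if it lands inside a known range". A minimal sketch, assuming half-open [start, end) ranges; struct range, in_range() and adjust_address() are hypothetical names and not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct range {
	uintptr_t start, end;
};

static int in_range(const struct range *r, uintptr_t addr)
{
	return addr >= r->start && addr < r->end;
}

/*
 * Apply the core kernel delta to addr, but only trust the result if it
 * falls inside the current text or data range. Otherwise the address did
 * not belong to the core kernel and is returned unchanged.
 */
static uintptr_t adjust_address(uintptr_t addr, intptr_t delta,
				const struct range *text,
				const struct range *data)
{
	uintptr_t adj = addr + (uintptr_t)delta;

	assert(text && data);
	if (in_range(text, adj) || in_range(data, adj))
		return adj;
	return addr;
}
```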
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250506102300.0ba2f9e0@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the "fields" option is enabled, the "print fmt" of the trace event is
ignored and only the fields are printed. But some fields contain function
pointers. Instead of just showing the hex value in this case, show the
function name when possible:
Instead of having:
# echo 1 > options/fields
# cat trace
[..]
kmem_cache_free: call_site=0xffffffffa9afcf31 (-1448095951) ptr=0xffff888124452910 (-131386736039664) name=kmemleak_object
Have it output:
kmem_cache_free: call_site=rcu_do_batch+0x3d1/0x14a0 (-1768960207) ptr=0xffff888132ea5ed0 (854220496) name=kmemleak_object
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250325213919.624181915@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Now that module addresses are saved in the persistent ring buffer, their
addresses can be used to adjust the address in the persistent ring buffer
to the address of the module that is currently loaded.
Instead of blindly using the text_delta that only works for core kernel
code, call the trace_adjust_address() that will see if the address matches
an address saved in the persistent ring buffer, and then uses that against
the matching module if it is loaded.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250506111648.5df7f3ec@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Now that we have pidfs_{get,register}_pid() that needs to be paired with
pidfs_put_pid() it's possible that someone pairs them with put_pid().
Thus freeing struct pid while it's still used by pidfs. Notice when that
happens. I'll also add a scheme to detect invalid uses of
pidfs_get_pid() and pidfs_put_pid() later.
Link: https://lore.kernel.org/20250506-uferbereich-guttun-7c8b1a0a431f@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add a helper that allows a driver to skip calling dma_unmap_*
if the DMA layer can guarantee that they are no-ops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
To support the upcoming non-scatterlist mapping helpers, we need to go
back to having them called outside of the DMA API. Thus move them out of
dma-map-ops.h, which is only for DMA API implementations, to pci-p2pdma.h,
which is for driver use.
Note that the core helper is still not exported as the mapping is
expected to be done only by very high-level subsystem code at least for
now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
The current scheme with a single helper to determine the P2P status
and map a scatterlist segment forces users to always use the map_sg
helper to DMA map, which we're trying to get away from because they
are very cache inefficient.
Refactor the code so that there is a single helper that checks the P2P
state for a page, including the result that it is not a P2P page to
simplify the callers, and a second one to perform the address translation
for a bus mapped P2P transfer that does not depend on the scatterlist
structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
The block layer bounce buffering support is unused now, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently the calling conventions for ->d_automount() instances have
an odd wart - returned new mount to be attached is expected to have
refcount 2.
That kludge is intended to make sure that mark_mounts_for_expiry() called
before we get around to attaching that new mount to the tree won't decide
to take it out. finish_automount() drops the extra reference after it's
done with attaching mount to the tree - or drops the reference twice in
case of error. ->d_automount() instances have rather counterintuitive
boilerplate in them.
There's a much simpler approach: have mark_mounts_for_expiry() skip the
mounts that are yet to be mounted. And to hell with grabbing/dropping
those extra references. Makes for simpler correctness analysis, at that...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is a conditional that covers all the code for the entire function.
Invert it and decrease indentation level. This also helps for further
changes to be clearer and tidier.
[ tglx: Removed line breaks ]
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250416114122.2191820-2-andriy.shevchenko@linux.intel.com
Merge tag 'trace-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix read out of bounds bug in tracing_splice_read_pipe()
The size of the sub page being read can now be greater than a page.
But the buffer used in tracing_splice_read_pipe() only allocates a
page size. The data copied to the buffer is the amount in sub buffer
which can overflow the buffer.
Use min((size_t)trace_seq_used(&iter->seq), PAGE_SIZE) to limit the
amount copied to the buffer to a max of PAGE_SIZE.
- Fix the test for NULL from "!filter_hash" to "!*filter_hash"
The add_next_hash() function checked for NULL at the wrong pointer
level.
- Do not use the array in trace_adjust_address() if there are no
elements
The trace_adjust_address() finds the offset of a module that was
stored in the persistent buffer when reading the previous boot buffer
to see if the address belongs to a module that was loaded in the
previous boot. An array is created that matches currently loaded
modules with previously loaded modules. The trace_adjust_address()
uses that array to find the new offset of the address that's in the
previous buffer. But if no module was loaded, it ends up reading the
last element in an array that was never allocated.
Check if nr_entries is zero and exit out early if it is.
- Remove nested lock of trace_event_sem in print_event_fields()
The print_event_fields() function iterates over the ftrace_events
list and requires the trace_event_sem semaphore held for read. But
this function is always called with that semaphore held for read.
Remove the taking of the semaphore and replace it with
lockdep_assert_held_read(&trace_event_sem)
* tag 'trace-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Do not take trace_event_sem in print_event_fields()
tracing: Fix trace_adjust_address() when there is no modules in scratch area
ftrace: Fix NULL memory allocation check
tracing: Fix oob write in trace_seq_to_buffer()
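The first fix in the list above is a length clamp before a copy into a page-sized buffer. A minimal userspace sketch, assuming a 4096-byte page; PAGE_SZ and copy_to_page() are hypothetical names and this is not the kernel's trace_seq_to_buffer():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096	/* assumed page size for the sketch */

/*
 * Copy at most one page of the `used` bytes available in src.
 * Returns the number of bytes actually copied.
 */
static size_t copy_to_page(char *page, const char *src, size_t used)
{
	size_t len = used < PAGE_SZ ? used : PAGE_SZ;

	assert(page && src);
	memcpy(page, src, len);
	return len;
}
```

Without the clamp, a sub-buffer larger than a page would make the copy run past the end of the destination.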
Extend the futex2 interface to be aware of mempolicy.
When FUTEX2_MPOL is specified and there is a MPOL_PREFERRED or
home_node specified covering the futex address, use that hash-map.
Notably, in this case the futex will go to the global node hashtable,
even if it is a PRIVATE futex.
When FUTEX2_NUMA|FUTEX2_MPOL is specified and the user specified node
value is FUTEX_NO_NODE, the MPOL lookup (as described above) will be
tried first before reverting to setting node to the local node.
[bigeasy: add CONFIG_FUTEX_MPOL, add MPOL to FUTEX2_VALID_MASK, write
the node only to user if FUTEX_NO_NODE was supplied]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-18-bigeasy@linutronix.de
Extend the futex2 interface to be numa aware.
When FUTEX2_NUMA is specified for a futex, the user value is extended
to two words (of the same size). The first is the user value we all
know, the second one will be the node to place this futex on.
struct futex_numa_32 {
u32 val;
u32 node;
};
When node is set to ~0, WAIT will set it to the current node_id such
that WAKE knows where to find it. If userspace corrupts the node value
between WAIT and WAKE, the futex will not be found and no wakeup will
happen.
When FUTEX2_NUMA is not set, the node is simply an extension of the
hash, such that traditional futexes are still interleaved over the
nodes.
This is done to avoid having to have a separate !numa hash-table.
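The node handling on the WAIT side can be sketched in userspace as below. This only illustrates the "~0 means pick the current node" rule from the text; NO_NODE and wait_resolve_node() are hypothetical names, and the struct is redeclared with fixed-width userspace types.

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE UINT32_MAX	/* the "~0" value from the text */

struct futex_numa_32 {
	uint32_t val;	/* the familiar futex word */
	uint32_t node;	/* node to place the futex on */
};

/*
 * On WAIT, an unset node is replaced by the current node and written back,
 * so that a later WAKE looks at the same node. An explicit node is kept.
 */
static uint32_t wait_resolve_node(struct futex_numa_32 *f,
				  uint32_t current_node)
{
	assert(f && current_node != NO_NODE);
	if (f->node == NO_NODE)
		f->node = current_node;
	return f->node;
}
```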
[bigeasy: ensure to have at least hashsize of 4 in futex_init(), add
pr_info() for size and allocation information. Cast the naddr math to
void*]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-17-bigeasy@linutronix.de
My initial testing showed that:
perf bench futex hash
reported fewer operations/sec with the private hash. After using the same
number of buckets in the private hash as used by the global hash, the
operations/sec were about the same.
This changed once the private hash became resizable. This feature added
an RCU section and reference counting via atomic inc+dec operation into
the hot path.
The reference counting can be avoided if the private hash is made
immutable.
Extend PR_FUTEX_HASH_SET_SLOTS by a fourth argument which denotes if the
private hash should be made immutable. Once set (to true), a further
resize is not allowed (same if set to the global hash).
Add PR_FUTEX_HASH_GET_IMMUTABLE which returns true if the hash can not
be changed.
Update "perf bench" suite.
For comparison, results of "perf bench futex hash -s":
- Xeon CPU E5-2650, 2 NUMA nodes, total 32 CPUs:
- Before introducing the task local hash
shared Averaged 1.487.148 operations/sec (+- 0,53%), total secs = 10
private Averaged 2.192.405 operations/sec (+- 0,07%), total secs = 10
- With the series
shared Averaged 1.326.342 operations/sec (+- 0,41%), total secs = 10
-b128 Averaged 141.394 operations/sec (+- 1,15%), total secs = 10
-Ib128 Averaged 851.490 operations/sec (+- 0,67%), total secs = 10
-b8192 Averaged 131.321 operations/sec (+- 2,13%), total secs = 10
-Ib8192 Averaged 1.923.077 operations/sec (+- 0,61%), total secs = 10
128 is the default allocation of hash buckets.
8192 was the previous amount of allocated hash buckets.
- Xeon(R) CPU E7-8890 v3, 4 NUMA nodes, total 144 CPUs:
- Before introducing the task local hash
shared Averaged 1.810.936 operations/sec (+- 0,26%), total secs = 20
private Averaged 2.505.801 operations/sec (+- 0,05%), total secs = 20
- With the series
shared Averaged 1.589.002 operations/sec (+- 0,25%), total secs = 20
-b1024 Averaged 42.410 operations/sec (+- 0,20%), total secs = 20
-Ib1024 Averaged 740.638 operations/sec (+- 1,51%), total secs = 20
-b65536 Averaged 48.811 operations/sec (+- 1,35%), total secs = 20
-Ib65536 Averaged 1.963.165 operations/sec (+- 0,18%), total secs = 20
1024 is the default allocation of hash buckets.
65536 was the previous amount of allocated hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20250416162921.513656-16-bigeasy@linutronix.de
The mm_struct::futex_hash_lock guards the futex_hash_bucket assignment/
replacement. The futex_hash_allocate()/ PR_FUTEX_HASH_SET_SLOTS
operation can now be invoked at runtime and resize an already existing
internal private futex_hash_bucket to another size.
The reallocation is based on an idea by Thomas Gleixner: The initial
allocation of struct futex_private_hash sets the reference count
to one. Every user acquires a reference on the local hash before using
it and drops it after it enqueued itself on the hash bucket. There is no
reference held while the task is scheduled out while waiting for the
wake up.
The resize process allocates a new struct futex_private_hash and drops
the initial reference. Synchronized with mm_struct::futex_hash_lock it
is checked if the reference counter for the currently used
mm_struct::futex_phash is marked as DEAD. If so, then all users enqueued
on the current private hash are requeued on the new private hash and the
new private hash is set to mm_struct::futex_phash. Otherwise the newly
allocated private hash is saved as mm_struct::futex_phash_new and the
rehashing and reassigning is delayed to the futex_hash() caller once the
reference counter is marked DEAD.
The replacement is not performed at rcuref_put() time because certain
callers, such as futex_wait_queue(), drop their reference after changing
the task state. This change will be destroyed once the futex_hash_lock
is acquired.
The user can change the number of slots with PR_FUTEX_HASH_SET_SLOTS
multiple times. Both an increase and a decrease are allowed, and the
request blocks until the assignment is done.
The private hash allocated at thread creation is changed from 16 to
16 <= 4 * number_of_threads <= global_hash_size
where number_of_threads can not exceed the number of online CPUs. Should
the user invoke PR_FUTEX_HASH_SET_SLOTS, the auto scaling is disabled.
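The bound "16 <= 4 * number_of_threads <= global_hash_size" can be sketched as a clamp. The helper names are hypothetical, and rounding up to a power of two is an assumption made here (slot counts are described elsewhere as powers of two), not something this message states.

```c
#include <assert.h>

/* Smallest power of two that is >= v (for v >= 1). */
static unsigned int roundup_pow2(unsigned int v)
{
	unsigned int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/*
 * Pick a private hash size: four slots per thread, with the thread count
 * capped at the number of online CPUs, and the result clamped to
 * [16, global_size].
 */
static unsigned int private_hash_slots(unsigned int threads,
				       unsigned int online_cpus,
				       unsigned int global_size)
{
	unsigned int slots;

	assert(threads >= 1 && global_size >= 16);
	if (threads > online_cpus)
		threads = online_cpus;
	slots = roundup_pow2(4 * threads);	/* assumption: power of two */
	if (slots < 16)
		slots = 16;
	if (slots > global_size)
		slots = global_size;
	return slots;
}
```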
[peterz: reorganize the code to avoid state tracking and simplify new
object handling, block the user until changes are in effect, allow
increase and decrease of the hash].
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-15-bigeasy@linutronix.de
The futex hash is system wide and shared by all tasks. Each slot
is hashed based on futex address and the VMA of the thread. Due to
randomized VMAs (and memory allocations) the same logical lock (pointer)
can end up in a different hash bucket on each invocation of the
application. This in turn means that different applications may share a
hash bucket on the first invocation but not on the second and it is not
always clear which applications will be involved. This can result in
high latencies to acquire the futex_hash_bucket::lock especially if the
lock owner is limited to a CPU and can not be effectively PI boosted.
Introduce basic infrastructure for a process local hash which is shared
by all threads of a process. This hash will only be used for a
PROCESS_PRIVATE FUTEX operation.
The hashmap can be allocated via:
prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num);
A `num' of 0 means that the global hash is used instead of a private
hash.
Other values for `num' specify the number of slots for the hash and the
number must be a power of two, starting with two.
The prctl() returns zero on success. This function can only be used
before a thread is created.
The current status for the private hash can be queried via:
num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
which returns the current number of slots. The value 0 means that the
global hash is used. Values greater than 0 indicate the number of slots
that are used. A negative number indicates an error.
For optimisation, for the private hash jhash2() uses only two arguments:
the address and the offset. This omits the VMA which is always the same.
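A caller might want to validate `num' before handing it to the prctl() shown above. The check below follows the rules stated in this message (0 selects the global hash, otherwise a power of two starting with two); futex_slots_valid() is a hypothetical userspace helper, not part of any API.

```c
#include <assert.h>	/* for caller-side checks as in the usage note */
#include <stdbool.h>

/*
 * Return true if num is acceptable as a slot count:
 * 0 (use the global hash) or a power of two >= 2.
 */
static bool futex_slots_valid(unsigned long num)
{
	if (num == 0)
		return true;
	/* A power of two has exactly one bit set. */
	return num >= 2 && (num & (num - 1)) == 0;
}
```

A program could assert futex_slots_valid(num) and then call prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num), checking for a zero return as described above.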
[peterz: Use 0 for global hash. A bit shuffling and renaming. ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de
Factor out the futex_hash_bucket initialisation into a helper function.
The helper function will be used in a follow up patch implementing
process private hash buckets.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-12-bigeasy@linutronix.de
futex_lock_pi() and __fixup_pi_state_owner() acquire the
futex_q::lock_ptr without holding a reference assuming the previously
obtained hash bucket and the assigned lock_ptr are still valid. This
isn't the case once the private hash can be resized and becomes invalid
after the reference drop.
Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in
futex_q::lock_ptr. The lock pointer is read in a RCU section to ensure
that it does not go away if the hash bucket has been replaced and the
old pointer has been observed. After locking the pointer needs to be
compared to check if it changed. If so then the hash bucket has been
replaced and the user has been moved to the new one and lock_ptr has
been updated. The lock operation needs to be redone in this case.
The locked hash bucket is not returned.
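The "lock, then re-check the pointer" pattern described above can be sketched with C11 atomics. This is a simplified userspace illustration: struct bucket, struct waiter and waiter_lock() are hypothetical names, a spinning atomic_flag stands in for the bucket lock, and the RCU read section that keeps the old bucket alive in the kernel is omitted.

```c
#include <assert.h>
#include <stdatomic.h>

struct bucket {
	atomic_flag lock;	/* stand-in for the hash bucket lock */
};

struct waiter {
	_Atomic(struct bucket *) lock_ptr;	/* may be switched by a resize */
};

static void bucket_lock(struct bucket *b)
{
	while (atomic_flag_test_and_set_explicit(&b->lock,
						 memory_order_acquire))
		;	/* spin until the flag was clear */
}

static void bucket_unlock(struct bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/*
 * Lock the bucket currently recorded in w->lock_ptr. If the pointer
 * changed while we were acquiring the lock, the waiter was moved to a
 * new bucket, so drop the stale lock and try again.
 */
static void waiter_lock(struct waiter *w)
{
	for (;;) {
		struct bucket *b = atomic_load(&w->lock_ptr);

		assert(b);
		bucket_lock(b);
		if (b == atomic_load(&w->lock_ptr))
			return;
		bucket_unlock(b);
	}
}
```

As in the text, the locked bucket is not returned; a caller would unlock through the same lock_ptr.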
A special case is an early return in futex_lock_pi() (due to signal or
timeout) and a successful futex_wait_requeue_pi(). In both cases a valid
futex_q::lock_ptr is expected (and its matching hash bucket) but since
the waiter has been removed from the hash this can no longer be
guaranteed. Therefore, before the waiter is removed, a reference is
acquired which is later dropped by the waiter to avoid a resize.
Add futex_q_lockptr_lock() and use it.
Acquire an additional reference in requeue_pi_wake_futex() and
futex_unlock_pi() while the futex_q is removed, denote this extra
reference in futex_q::drop_hb_ref and let the waiter drop the reference
in this case.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-11-bigeasy@linutronix.de
To support runtime resizing of the process private hash, it's required
to not use the obtained hash bucket once the reference count has been
dropped. The reference will be dropped after the unlock of the hash
bucket.
The amount of waiters is decremented after the unlock operation. There
is no requirement that this needs to happen after the unlock. The
increment happens before acquiring the lock to signal early that there
will be a waiter. The waker can avoid blocking on the lock if it is
known that there will be no waiter.
There is no difference in terms of ordering if the decrement happens
before or after the unlock.
Decrease the waiter count before the unlock operation.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-10-bigeasy@linutronix.de
futex_wait_multiple_setup() changes task_struct::__state to
!TASK_RUNNING and then enqueues on multiple futexes. Every
futex_q_lock() acquires a reference on the global hash which is
dropped later.
If a rehash is in progress then the loop will block on
mm_struct::futex_hash_bucket for the rehash to complete and this will
lose the previously set task_struct::__state.
Acquire a reference on the local hash to avoid blocking on
mm_struct::futex_hash_bucket.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-9-bigeasy@linutronix.de
Create explicit scopes for hb variables; almost pure re-indent.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-6-bigeasy@linutronix.de
Getting the hash bucket and queuing it are two distinct actions. In
light of wanting to add a put hash bucket function later, untangle
them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-5-bigeasy@linutronix.de
futex_wait_setup() has a weird calling convention in order to return
hb to use as an argument to futex_queue().
Mostly such that requeue can have an extra test in between.
Reorder code a little to get rid of this and keep the hb usage inside
futex_wait_setup().
[bigeasy: fixes]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de
The function trace_adjust_address() is used to map addresses of modules
that are stored in the persistent memory and are also loaded in the
current boot, returning the current address for the module.
If there's only one module entry, it will simply use that, otherwise it
performs a bsearch of the entry array to find the module to offset with.
The issue is if there are no modules in the array. The code does not
account for that and ends up referencing the first element in the array
which does not exist and causes a crash.
If nr_entries is zero, exit out early as if this was a core kernel
address.
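A minimal userspace sketch of the lookup with the early exit, under stated
assumptions: struct mod_entry_model and adjust_address_model() are
hypothetical names, the entries are assumed to be sorted by base address, a
hand-written binary search stands in for the kernel's bsearch, and "core
kernel" handling is modeled as returning the address unchanged.

```c
#include <stddef.h>

/* Hypothetical entry: previous-boot base address and the offset to apply. */
struct mod_entry_model {
	unsigned long base;	/* module start address in the previous boot */
	long delta;		/* current address minus previous address */
};

static unsigned long adjust_address_model(const struct mod_entry_model *entries,
					  size_t nr_entries, unsigned long addr)
{
	size_t lo = 0, hi = nr_entries;

	/* The fix: with no entries there is no first element to look at. */
	if (nr_entries == 0)
		return addr;

	/* Below every module: treated like a core kernel address here. */
	if (addr < entries[0].base)
		return addr;

	/* Find the last entry whose base is <= addr (entries are sorted). */
	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (entries[mid].base <= addr)
			lo = mid;
		else
			hi = mid;
	}

	/* With a single entry the loop never runs and entry 0 is used. */
	return addr + (unsigned long)entries[lo].delta;
}
```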
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250501151909.65910359@gandalf.local.home
Fixes: 35a380ddbc ("tracing: Show last module text symbols in the stacktrace")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In the current implementation if the program is dev-bound to a specific
device, it will not be possible to perform XDP_REDIRECT into a DEVMAP
or CPUMAP even if the program is running in the driver NAPI context and
it is not attached to any map entry. This seems to contradict the
explanation available in the bpf_prog_map_compatible() routine.
Fix the issue by introducing the __bpf_prog_map_compatible() utility
routine in order to avoid the bpf_prog_is_dev_bound() check when running
bpf_check_tail_call() at program load time (bpf_prog_select_runtime()).
Continue to forbid attaching a dev-bound program to XDP maps
(BPF_MAP_TYPE_PROG_ARRAY, BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_CPUMAP).
Fixes: 3d76a4d3d4 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
The check for a failed memory allocation tests the wrong level of
pointer indirection: it checks !filter_hash rather than !*filter_hash.
Fix this.
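The difference between the two checks can be shown with a small userspace
sketch. struct hash_model and alloc_hash_model() are hypothetical; the
'fail' parameter only exists to simulate an allocation failure.

```c
#include <stdlib.h>

struct hash_model {
	int count;
};

/*
 * Allocate a hash into *out. Returns 0 on success and -1 on allocation
 * failure. 'out' itself is assumed to be a valid pointer supplied by the
 * caller, so testing it says nothing about the allocation.
 */
static int alloc_hash_model(struct hash_model **out, int fail)
{
	*out = fail ? NULL : calloc(1, sizeof(**out));

	/*
	 * Wrong level:  if (!out)   - the caller's pointer-to-pointer.
	 * Right level:  if (!*out)  - the object that was just allocated.
	 */
	if (!*out)
		return -1;

	return 0;
}
```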
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250422221335.89896-1-colin.i.king@gmail.com
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
syzbot reported this bug:
==================================================================
BUG: KASAN: slab-out-of-bounds in trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
BUG: KASAN: slab-out-of-bounds in tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
Write of size 4507 at addr ffff888032b6b000 by task syz.2.320/7260
CPU: 1 UID: 0 PID: 7260 Comm: syz.2.320 Not tainted 6.15.0-rc1-syzkaller-00301-g3bde70a2c827 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:408 [inline]
print_report+0xc3/0x670 mm/kasan/report.c:521
kasan_report+0xe0/0x110 mm/kasan/report.c:634
check_region_inline mm/kasan/generic.c:183 [inline]
kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189
__asan_memcpy+0x3c/0x60 mm/kasan/shadow.c:106
trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
....
==================================================================
It has been reported that trace_seq_to_buffer() tries to copy more data
than PAGE_SIZE to buf. Therefore, to prevent this, we should use the
smaller of trace_seq_used(&iter->seq) and PAGE_SIZE as an argument.
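The clamping can be sketched in userspace as follows. PAGE_SIZE_MODEL and
copy_to_page_model() are hypothetical names, and a page size of 4096 bytes
is an assumption of this sketch, not something the report states.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_MODEL 4096u	/* assumed page size for this sketch */

/*
 * Copy at most one page from src into the page-sized buffer 'page' and
 * return the number of bytes actually copied.
 */
static size_t copy_to_page_model(void *page, const void *src, size_t used)
{
	/* Use the smaller of the two sizes so the copy cannot overrun. */
	size_t len = used < PAGE_SIZE_MODEL ? used : PAGE_SIZE_MODEL;

	memcpy(page, src, len);
	return len;
}
```

With the 4507 byte write from the report above, this sketch copies 4096
bytes and leaves the remainder uncopied.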
Link: https://lore.kernel.org/20250422113026.13308-1-aha310510@gmail.com
Reported-by: syzbot+c8cd2d2c412b868263fb@syzkaller.appspotmail.com
Fixes: 3c56819b14 ("tracing: splice support for tracing_pipe")
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A sched_ext scheduler may trigger __scx_exit() from a BPF timer
callback, where scx_root may not be safely dereferenced.
This can lead to a NULL pointer dereference as shown below (triggered by
scx_tickless):
BUG: kernel NULL pointer dereference, address: 0000000000000330
...
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
RIP: 0010:__scx_exit+0x2b/0x190
...
Call Trace:
<IRQ>
scx_bpf_get_idle_smtmask+0x59/0x80
bpf_prog_8320d4217989178c_dispatch_all_cpus+0x35/0x1b6
...
bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
bpf_timer_cb+0x7a/0x140
__hrtimer_run_queues+0x1f9/0x3a0
hrtimer_run_softirq+0x8c/0xd0
handle_softirqs+0xd3/0x3d0
__irq_exit_rcu+0x9a/0xc0
irq_exit_rcu+0xe/0x20
Fix this by checking for a valid scx_root and adding proper RCU
protection.
Fixes: 48e1267773 ("sched_ext: Introduce scx_sched")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Using a DSQ iterator from a timer callback can trigger the following
lockdep splat when accessing scx_root:
=============================
WARNING: suspicious RCU usage
6.14.0-virtme #1 Not tainted
-----------------------------
kernel/sched/ext.c:6907 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
no locks held by swapper/0/0.
stack backtrace:
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
Sched_ext: tickless (enabled+all)
Call Trace:
<IRQ>
dump_stack_lvl+0x6f/0xb0
lockdep_rcu_suspicious.cold+0x4e/0xa3
bpf_iter_scx_dsq_new+0xb1/0xd0
bpf_prog_63f4fd1bccc101e7_dispatch_cpu+0x3e/0x156
bpf_prog_8320d4217989178c_dispatch_all_cpus+0x153/0x1b6
bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
? hrtimer_run_softirq+0x4f/0xd0
bpf_timer_cb+0x7a/0x140
__hrtimer_run_queues+0x1f9/0x3a0
hrtimer_run_softirq+0x8c/0xd0
handle_softirqs+0xd3/0x3d0
__irq_exit_rcu+0x9a/0xc0
irq_exit_rcu+0xe/0x20
sysvec_apic_timer_interrupt+0x73/0x80
Add a proper dereference check to explicitly validate RCU-safe access to
scx_root from rcu_read_lock() contexts and also from contexts that hold
rcu_read_lock_bh(), such as timer callbacks.
Fixes: cdf5a6faa8 ("sched_ext: Move dsq_hash into scx_sched")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
irq_domain_debug_show_one() calls msi_domain_debug_show() with a non-NULL
domain pointer and a NULL irq_data pointer. irq_debug_show_data() calls it
with a NULL domain pointer.
The domain pointer is not used, but the irq_data pointer is required to be
non-NULL and lacks a NULL pointer check.
Add the missing NULL pointer check to ensure there is a non-NULL irq_data
pointer in msi_domain_debug_show() before dereferencing it.
[ tglx: Massaged change log ]
Fixes: 01499ae673 ("genirq/msi: Expose MSI message data in debugfs")
Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250430124836.49964-2-ajones@ventanamicro.com
Commit a3e8fe814a ("x86/build: Raise the minimum GCC version to 8.1")
raised the minimum compiler version as enforced by Kbuild to gcc-8.1
and clang-15 for x86.
This is actually the same gcc version that has been discussed as the
minimum for all architectures several times in the past, with little
objection. A previous concern was the kernel for SLE15-SP7 needing to
be built with gcc-7. As this ended up still using linux-6.4 and there
is no plan for an SP8, this is no longer a problem.
Change it for all architectures and adjust the documentation accordingly.
A few version checks can be removed in the process. The binutils
version 2.30 is the lowest version used in combination with gcc-8 on
common distros, so use that as the corresponding minimum.
Link: https://lore.kernel.org/lkml/20240925150059.3955569-32-ardb+git@google.com/
Link: https://lore.kernel.org/lkml/871q7yxrgv.wl-tiwai@suse.de/
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
A single series is present to properly handle the module_kobject creation.
It fixes a problem with missing /sys/module/<module>/drivers for built-in
modules.
The fix has been on linux-next for two weeks with no reported issues.
Merge tag 'modules-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull modules fixes from Petr Pavlu:
"A single series to properly handle the module_kobject creation.
This fixes a problem with missing /sys/module/<module>/drivers for
built-in modules"
* tag 'modules-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
drivers: base: handle module_kobject creation
kernel: globalize lookup_or_create_module_kobject()
kernel: refactor lookup_or_create_module_kobject()
kernel: param: rename locate_module_kobject
It was reported that in 6.12, smpboot_create_threads() was
taking much longer than in 6.6.
I narrowed down the call path to:
smpboot_create_threads()
-> kthread_create_on_cpu()
-> kthread_bind()
-> __kthread_bind_mask()
-> wait_task_inactive()
Where in wait_task_inactive() we were regularly hitting the
queued case, which sets a 1 tick timeout, which when called
multiple times in a row, accumulates quickly into a long
delay.
I noticed disabling the DELAY_DEQUEUE sched feature recovered
the performance, and it seems the newly created tasks are usually
sched_delayed and left on the runqueue.
So in wait_task_inactive() when we see the task
p->se.sched_delayed, manually dequeue the sched_delayed task
with DEQUEUE_DELAYED, so we don't have to constantly wait a
tick.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: peter-yc.chang@mediatek.com
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250429150736.3778580-1-jstultz@google.com
Use guard()/scoped_guard() to simplify the code. Using guard() to remove
the 'goto unlock' label is especially neat.
[ tglx: Brought back the scoped_guard()'s which were dropped in v2 and
simplified alarmtimer_rtc_add_device() ]
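The pattern can be sketched in userspace with the GCC/Clang cleanup
attribute. This is only an illustration: struct lock_model, GUARD_MODEL()
and find_positive_model() are hypothetical and are not the kernel macros.

```c
/* Toy lock that only counts how often it is currently held. */
struct lock_model {
	int held;
};

static void lock_model_acquire(struct lock_model *l)
{
	l->held++;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void lock_model_release(struct lock_model **lp)
{
	(*lp)->held--;
}

/* Hypothetical guard: take the lock now, release it when the scope ends. */
#define GUARD_MODEL(l)							\
	struct lock_model *guard_ptr_					\
		__attribute__((cleanup(lock_model_release))) =		\
		(lock_model_acquire(l), (l))

/* Return the index of the first positive value, or -1 if there is none. */
static int find_positive_model(struct lock_model *l, const int *vals, int n)
{
	GUARD_MODEL(l);

	for (int i = 0; i < n; i++) {
		if (vals[i] > 0)
			return i;	/* lock is released on this early return */
	}

	return -1;	/* and on this one, without a 'goto unlock' label */
}
```

Every return path releases the lock, which is what makes a dedicated unlock
label unnecessary.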
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250430032734.2079290-4-suhui@nfschina.com
'clockid' can only be ALARM_REALTIME and ALARM_BOOTTIME. It's impossible to
return -1 and callers never check the return value.
Only alarm_clock_get_timespec(), alarm_clock_get_ktime(),
alarm_timer_create() and alarm_timer_nsleep() call clock2alarm(). These
callers use clockid_to_kclock() to get 'struct k_clock', which ensures
that clock2alarm() never returns -1.
Remove the impossible -1 return value, and add a warning to notify about any
future misuse of this function.
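A userspace sketch of the resulting shape, assuming hypothetical *_MODEL
names throughout. The clock id values are placeholders, and a one-shot
message on stderr stands in for the kernel warning.

```c
#include <stdio.h>

enum alarm_type_model {
	ALARM_REALTIME_MODEL,
	ALARM_BOOTTIME_MODEL,
};

/* Placeholder clock ids for this sketch only. */
#define CLOCK_REALTIME_ALARM_MODEL 8
#define CLOCK_BOOTTIME_ALARM_MODEL 9

static int warned_model;	/* set once an unexpected clock id was seen */

static enum alarm_type_model clock2alarm_model(int clockid)
{
	if (clockid == CLOCK_REALTIME_ALARM_MODEL)
		return ALARM_REALTIME_MODEL;

	/* Anything else should be the boottime alarm; warn once if not. */
	if (clockid != CLOCK_BOOTTIME_ALARM_MODEL && !warned_model) {
		warned_model = 1;
		fprintf(stderr, "unexpected clockid %d\n", clockid);
	}

	return ALARM_BOOTTIME_MODEL;
}
```

The function always returns a valid alarm type, so callers have no error
value to check.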
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250430032734.2079290-3-suhui@nfschina.com
register_refined_jiffies() is only used in setup code and always returns 0.
Mark it as __init to save some bytes and change it to void.
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250430032734.2079290-2-suhui@nfschina.com
The IMA log is currently copied to the new kernel during kexec 'load' using
ima_dump_measurement_list(). However, the IMA measurement list copied at
kexec 'load' may result in loss of IMA measurement records that only
occurred after the kexec 'load'. Move the IMA measurement list log copy
from kexec 'load' to 'execute'.
Make the kexec_segment_size variable a local static variable within the
file, so it can be accessed during both kexec 'load' and 'execute'.
Define kexec_post_load() as a wrapper for calling ima_kexec_post_load() and
machine_kexec_post_load(). Replace the existing direct call to
machine_kexec_post_load() with kexec_post_load().
When there is insufficient memory to copy all the measurement logs, copy as
much of the measurement list as possible.
Co-developed-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Signed-off-by: Steven Chen <chenste@linux.microsoft.com>
Tested-by: Stefan Berger <stefanb@linux.ibm.com> # ppc64/kvm
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Currently, the function kexec_calculate_store_digests() calculates and
stores the digest of the segment during the kexec_file_load syscall,
where the IMA segment is also allocated.
Later, the IMA segment will be updated with the measurement log at the
kexec execute stage when a kexec reboot is initiated. Therefore, the
digests should be updated for the IMA segment in the normal case. The
problem is that the content of memory segments carried over to the new
kernel during the kexec system call can be changed at kexec 'execute'
stage, but the size and the location of the memory segments cannot be
changed at kexec 'execute' stage.
To address this, skip the calculation and storage of the digest for the
IMA segment in kexec_calculate_store_digests() so that it is not added
to the purgatory_sha_regions.
With this change, the IMA segment is not included in the digest
calculation, storage, and verification.
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Co-developed-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Steven Chen <chenste@linux.microsoft.com>
Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Tested-by: Stefan Berger <stefanb@linux.ibm.com> # ppc64/kvm
[zohar@linux.ibm.com: Fixed Signed-off-by tag to match author's email ]
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Implement kimage_map_segment() to enable IMA to map the measurement log
list to the kimage structure during the kexec 'load' stage. This function
gathers the source pages within the specified address range, and maps them
to a contiguous virtual address range.
This is a preparation for later usage.
Implement kimage_unmap_segment() for unmapping segments using vunmap().
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Co-developed-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Steven Chen <chenste@linux.microsoft.com>
Acked-by: Baoquan He <bhe@redhat.com>
Tested-by: Stefan Berger <stefanb@linux.ibm.com> # ppc64/kvm
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
With the global states and disable machinery moved into scx_sched,
scx_disable_workfn() can only be scheduled and run for the specific
scheduler instance. This makes it impossible for scx_disable_workfn() to see
SCX_EXIT_NONE. Turn that condition into WARN_ON_ONCE().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Because disable can be triggered from any place and the scheduler cannot be
trusted, disable path uses an irq_work to bounce and a kthread_work which is
executed on an RT helper kthread to perform disable. These must be per
scheduler instance to guarantee forward progress. Move them into scx_sched.
- If an scx_sched is accessible, its helper kthread is always valid making
the `helper` check in schedule_scx_disable_work() unnecessary. As the
function becomes trivial after the removal of the test, inline it.
- scx_create_rt_helper() has only one user - creation of the disable helper
kthread. Inline it into scx_alloc_and_add_sched().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
The event counters are going to become per scheduler instance. Move
event_stats_cpu into scx_sched.
- [__]scx_add_event() are updated to take @sch. While at it, add missing
parentheses around @cnt expansions.
- scx_read_events() is updated to take @sch.
- scx_bpf_events() accesses scx_root under RCU read lock.
v2: - Replace stale scx_bpf_get_event_stat() reference in a comment with
scx_bpf_events().
- Trivial goto label rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
In preparation for moving event_stats_cpu into scx_sched, factor
scx_read_events() out of scx_bpf_events() and update the in-kernel users -
scx_attr_events_show() and scx_dump_state() - to use scx_read_events()
instead of scx_bpf_events(). No observable behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
In preparation for moving event_stats_cpu into scx_sched, move scx_event_stats
definitions above scx_sched definition. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Global DSQs are going to become per scheduler instance. Move global_dsqs
into scx_sched. find_global_dsq() already takes a task_struct pointer as an
argument and should later be able to determine the scx_sched to use from
that. For now, assume scx_root.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
User DSQs are going to become per scheduler instance. Move dsq_hash into
scx_sched. This shifts the code that assumes scx_root to be the only
scx_sched instance up the call stack but doesn't remove them yet.
v2: Add missing rcu_read_lock() in scx_bpf_destroy_dsq() as per Andrea.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
More will be moved into scx_sched. Factor out the allocation and kobject
addition path into scx_alloc_and_add_sched().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
create_dsq() is only used by scx_bpf_create_dsq() and the separation gets in
the way of making dsq_hash per scx_sched. Inline it into
scx_bpf_create_dsq(). While at it, add unlikely() around
SCX_DSQ_FLAG_BUILTIN test.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
To prepare for supporting multiple schedulers, make scx_sched allocated
dynamically. scx_sched->kobj is now an embedded field and the kobj's
lifetime determines the lifetime of the containing scx_sched.
- Enable path is updated so that kobj init and addition are performed later.
- scx_sched freeing is initiated in scx_kobj_release() and also goes through
an rcu_work so that scx_root can be accessed from an unsynchronized path -
scx_disable().
- sched_ext_ops->priv is added and used to point to scx_sched instance
created for the ops instance. This is used by bpf_scx_unreg() to determine
the scx_sched instance to disable and put.
No behavior changes intended.
v2: Andrea reported a kernel oops due to bpf_scx_unreg() trying to deref NULL
scx_root after scheduler init failure. sched_ext_ops->priv added so that
bpf_scx_unreg() can always find the scx_sched instance to unregister
even if it failed early during init.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
SCX_HAS_OP() tests scx_root->has_op bitmap. The bitmap is currently in a
statically allocated struct scx_sched and initialized while loading the BPF
scheduler and cleared while unloading, and thus can be tested anytime.
However, scx_root will be switched to dynamic allocation and thus won't
always be dereferenceable.
Most usages of SCX_HAS_OP() are already protected by scx_enabled() either
directly or indirectly (e.g. through a task which is on SCX). However, there
are a couple places that could try to dereference NULL scx_root. Update them
so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called.
- In handle_hotplug(), test whether scx_root is NULL before doing anything
else. This is safe because scx_root updates will be protected by
cpus_read_lock().
- In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(),
which should guarantee that scx_root won't turn NULL. This is also in line
with other cgroup operations. As the code path is synchronized against
scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any
behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
To support multiple scheduler instances, collect some of the global
variables that should be specific to a scheduler instance into the new
struct scx_sched. scx_root is the root scheduler instance and points to a
static instance of struct scx_sched. Except for an extra dereference through
the scx_root pointer, this patch makes no functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume
that the rq involved in the operation is locked, which is not the case
during hotplug, triggering the following warning:
WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340
Fix by not tracking the target rq as locked in the context of
ops.cpu_online() and ops.cpu_offline().
Fixes: 18853ba782 ("sched_ext: Track currently locked rq")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Similar to commit 221a164035 ("entry: Move syscall_enter_from_user_mode()
to header file"), move syscall_exit_to_user_mode() to the header file as
well.
Testing was done with the byte-unixbench syscall benchmark (which calls
getpid) and QEMU. On riscv I measured a 7.09246% improvement, on x86 a
2.98843% improvement, on loongarch a 6.07954% improvement, and on s390 a
11.1328% improvement.
The Intel bot also reported "kernel test robot noticed a 1.9% improvement
of stress-ng.seek.ops_per_sec".
Signed-off-by: Charlie Jenkins <charlie@rivosinc.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/all/20250320-riscv_optimize_entry-v6-4-63e187e26041@rivosinc.com
Link: https://lore.kernel.org/linux-riscv/202502051555.85ae6844-lkp@intel.com/
When the device is of a non-CPU type, table[i].performance won't be
initialized in the previous em_init_performance(), resulting in division
by zero when calculating costs in em_compute_costs().
Since the 'cost' algorithm is only used for EAS energy efficiency
calculations and is currently not utilized by other device drivers, we
should add the _is_cpu_device(dev) check to prevent this division-by-zero
issue.
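The guard can be sketched in userspace as follows. struct perf_state_model
and compute_costs_model() are hypothetical, and the power-to-performance
ratio used for 'cost' is only an illustrative formula for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

struct perf_state_model {
	unsigned long power;
	unsigned long performance;	/* left at 0 for non-CPU devices */
	unsigned long cost;
};

static void compute_costs_model(struct perf_state_model *table, size_t n,
				bool is_cpu_device)
{
	/* Non-CPU devices never had .performance set up: skip the division. */
	if (!is_cpu_device)
		return;

	for (size_t i = 0; i < n; i++)
		table[i].cost = table[i].power * 10 / table[i].performance;
}
```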
Fixes: 1b600da510 ("PM: EM: Optimize em_cpu_energy() and remove division")
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/tencent_7F99ED4767C1AF7889D0D8AD50F34859CE06@qq.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
inconsistencies. Lei tracked down that this was being caused by the
adjustment:
tk->tkr_mono.xtime_nsec -= offset;
which is made to compensate for the unaccumulated cycles in offset when the
multiplicator is adjusted forward, so that the non-_COARSE clockids don't
see inconsistencies.
However, the _COARSE clockid getter functions use the adjusted xtime_nsec
value directly and do not compensate the negative offset via the
clocksource delta multiplied with the new multiplicator. In that case the
caller can observe time going backwards in consecutive calls.
By design, this negative adjustment should be fine, because the logic run
from timekeeping_adjust() is done after it accumulated approximately
multiplicator * interval_cycles
into xtime_nsec. The accumulated value is always larger than the
mult_adj * offset
value, which is subtracted from xtime_nsec. Both operations are done
together under the tk_core.lock, so the net change to xtime_nsec is
always positive.
However, do_adjtimex() calls into timekeeping_advance() as well, to
apply the NTP frequency adjustment immediately. In this case,
timekeeping_advance() does not return early when the offset is smaller
than interval_cycles. In that case there is no time accumulated into
xtime_nsec. But the subsequent call into timekeeping_adjust(), which
modifies the multiplicator, subtracts from xtime_nsec to correct for the
new multiplicator.
Here because there was no accumulation, xtime_nsec becomes smaller than
before, which opens a window up to the next accumulation, where the
_COARSE clockid getters, which don't compensate for the offset, can
observe the inconsistency.
An earlier attempt to fix this forwarded the timekeeper in the case
that adjtimex() adjusts the multiplier, which resets the offset to zero:
757b000f7b ("timekeeping: Fix possible inconsistencies in _COARSE clockids")
That works correctly, but unfortunately causes a regression on the
adjtimex() side. There are two issues:
1) The forwarding of the base time moves the update out of the original
period and establishes a new one.
2) The clearing of the accumulated NTP error is changing the behaviour as
well.
User space expects that multiplier/frequency updates are in effect when the
syscall returns, so delaying the update to the next tick is not solving the
problem either.
Commit 757b000f7b was reverted so that the established expectations of
user space implementations (ntpd, chronyd) are restored, but that obviously
brought the inconsistencies back.
One of the initial approaches to fix this was to establish a separate
storage for the coarse time getter nanoseconds part by calculating it from
the offset. That was dropped on the floor because not having yet another
state to maintain was simpler. But given the result of the above exercise,
this solution turns out to be the right one. Bring it back in a slightly
modified form.
Thus introduce timekeeper::coarse_nsec and store that nanoseconds part in
it, switch the time getter functions and the VDSO update to use that value.
coarse_nsec is set on operations which forward or initialize the timekeeper
and after time was accumulated during a tick. If there is no accumulation
the timestamp is unchanged.
This leaves the adjtimex() behaviour unmodified and prevents coarse time
from going backwards.
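The idea can be illustrated with a heavily simplified userspace model.
struct tk_model and its helpers are hypothetical, plain nanoseconds are
used instead of the shifted values the timekeeper works with, and locking
is omitted entirely.

```c
#include <stdint.h>

struct tk_model {
	int64_t xtime_nsec;	/* may shrink when the multiplier is adjusted */
	int64_t coarse_nsec;	/* what the coarse getters hand out */
};

/* Tick path: time was accumulated, so refresh the coarse snapshot. */
static void tk_model_accumulate(struct tk_model *tk, int64_t nsec)
{
	tk->xtime_nsec += nsec;
	tk->coarse_nsec = tk->xtime_nsec;
}

/*
 * adjtimex()-style multiplier adjustment without accumulation:
 * xtime_nsec is corrected, coarse_nsec is deliberately left unchanged.
 */
static void tk_model_adjust(struct tk_model *tk, int64_t mult_adj,
			    int64_t offset)
{
	tk->xtime_nsec -= mult_adj * offset;
}

static int64_t tk_model_get_coarse(const struct tk_model *tk)
{
	return tk->coarse_nsec;
}
```

In this model the adjustment lowers xtime_nsec, but a coarse reader keeps
seeing the snapshot from the last accumulation until the next one occurs.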
[ jstultz: Simplified the coarse_nsec calculation and kept behavior so
coarse clockids aren't adjusted on each inter-tick adjtimex
call, slightly reworked the comments and commit message ]
Fixes: da15cfdae0 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/20250419054706.2319105-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
[ Arguably if pick_eevdf()/pick_next_entity() was less trusting
of complex math being correct it could have de-escalated a
crash into a warning, but that's for a different patch. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix sporadic crashes in dequeue_entities() due to ... bad math.
[ Arguably if pick_eevdf()/pick_next_entity() was less trusting of
complex math being correct it could have de-escalated a crash into
a warning, but that's for a different patch ]"
* tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
- Use POLLERR for events in error state, instead of
the ambiguous POLLHUP error value
- Fix non-sampling (counting) events on certain x86 platforms
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc perf events fixes from Ingo Molnar:
- Use POLLERR for events in error state, instead of the ambiguous
POLLHUP error value
- Fix non-sampling (counting) events on certain x86 platforms
* tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix non-sampling (counting) events on certain x86 platforms
perf/core: Change to POLLERR for pinned events with error
There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.
The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:
1. In the if (entity_is_task(se)) else block at the beginning of
dequeue_entities(), slice is set to
cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration
tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the
first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to
the saved slice, which is still U64_MAX.
This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:
6. In update_entity_lag(), se->slice is used to calculate limit, which
ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
is negative, vlag > limit, so se->vlag is set to the same huge
negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in
another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or
decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
incorrectly returns false because the vruntime is so far from the
other vruntimes on the queue, causing the
(vruntime - cfs_rq->min_vruntime) * load calculation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of
pick_eevdf() and crashes.
Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).
Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.
Fixes: aef6987d89 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
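Steps 6 and 7 above hinge on what a clamp does when its bound is
negative. The following is a minimal userspace sketch, assuming a clamp
written as min(max(val, lo), hi); the kernel's own macro may be
implemented differently, and the helper names are hypothetical:

```c
#include <stdint.h>

/* Illustrative clamp: min(max(val, lo), hi). Not the kernel macro. */
static int64_t example_clamp_s64(int64_t val, int64_t lo, int64_t hi)
{
	int64_t v = val > lo ? val : lo;

	return v < hi ? v : hi;
}

/*
 * Mirrors the shape of "vlag = clamp(vlag, -limit, limit)".
 * With a sane (positive) limit the result stays within [-limit, limit].
 * With a negative limit the bounds are inverted (lo > hi), so any vlag
 * above the limit collapses to the limit itself.
 */
static int64_t example_clamp_vlag(int64_t vlag, int64_t limit)
{
	return example_clamp_s64(vlag, -limit, limit);
}
```

With a positive limit of 1000, a vlag of 123 passes through unchanged.
With a large negative limit, the same vlag comes back as that negative
limit, which is the bogus value that later steps scale and subtract.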
Fold it into pidfd_prepare() and rename PIDFD_CLONE to PIDFD_STALE to
indicate that the passed pid might not have task linkage and no explicit
check for that should be performed.
Link: https://lore.kernel.org/20250425-work-pidfs-net-v2-3-450a19461e75@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: David Rheinsberg <david@readahead.eu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The cgroup_rstat_push_children() function converts a set of
updated_children lists from different cgroups into a single ordered
list of cgroups to be flushed via the rstat_flush_next pointer.
The algorithm used isn't that well illustrated and it takes time to
grasp what it is doing. Improve the embedded documentation and variable
names to better illustrate the transformation process and make the code
easier to understand.
Also cgroup_rstat_lock must be held for the whole duration
from where the rstat_flush_next list is being constructed in
cgroup_rstat_push_children() to when it is consumed later in
css_rstat_flush(). Otherwise, list corruption can happen leading to
system crash as reported in [1]. In this particular case, the branch
being used has commit 093c8812de ("cgroup: rstat: Cleanup flushing
functions and locking") which breaks this rule, but is missing the fix
commit 7d6c63c319 ("cgroup: rstat: call cgroup_rstat_updated_list
with cgroup_rstat_lock") that fixes it.
This patch has no functional change.
[1] https://lore.kernel.org/lkml/BY5PR04MB68495E9E8A46CA9614D62669BCBB2@BY5PR04MB6849.namprd04.prod.outlook.com/
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since initcall_debug is a bool variable, it is not necessary to convert
it to bool with the help of a double logical negation (!!).
Remove the redundant operation.
Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
Link: https://patch.msgid.link/20250424060339.73119-1-zhangzihuan@kylinos.cn
[ rjw: Changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some kfuncs specific to the idle CPU selection policy are registered in
both the scx_kfunc_ids_any and scx_kfunc_ids_idle blocks, even though
they should only be defined in the latter.
Remove the duplicates from scx_kfunc_ids_any.
Fixes: 337d1b354a ("sched_ext: Move built-in idle CPU selection policy to a separate file")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- avoid unused variable warnings (Arnd Bergmann, Marek Szyprowski)
- add runtime warnings and debug messages for devices with limited DMA
capabilities (Balbir Singh, Chen-Yu Tsai)
Merge tag 'dma-mapping-6.15-2025-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
- avoid unused variable warnings (Arnd Bergmann, Marek Szyprowski)
- add runtime warnings and debug messages for devices with limited DMA
capabilities (Balbir Singh, Chen-Yu Tsai)
* tag 'dma-mapping-6.15-2025-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-coherent: Warn if OF reserved memory is beyond current coherent DMA mask
dma-mapping: Fix warning reported for missing prototype
dma-mapping: avoid potential unused data compilation warning
dma/mapping.c: dev_dbg support for dma_addressing_limited
dma/contiguous: avoid warning about unused size_bytes
Add namespace to BPF internal symbols used by light skeleton
to prevent abuse and document with the code their allowed usage.
Fixes: b1d18a7574 ("bpf: Extend sys_bpf commands for bpf_syscall programs.")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20250425014542.62385-1-alexei.starovoitov@gmail.com
The _safe variant used here gets the next element before running the callback,
avoiding the endless loop condition.
Signed-off-by: Brandon Kammerdiener <brandon.kammerdiener@intel.com>
Link: https://lore.kernel.org/r/20250424153246.141677-2-brandon.kammerdiener@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
According to the throttling mechanism, the number of PMU interrupts must
not exceed max_samples_per_tick in one tick. But this mechanism is
ineffective when max_samples_per_tick=1, because the throttling check is
skipped during the first interrupt and only performed when the second
interrupt arrives.
This bug may have little effect within a single tick, but over a larger
time scale the problem cannot be underestimated.
When max_samples_per_tick = 1:
Allowed-interrupts-per-second  max-samples-per-second  default-HZ  ARCH
200                            100                     100         X86
500                            250                     250         ARM64
...
Clearly, the number of PMU interrupts far exceeds what the user expects.
Fixes: e050e3f0a7 ("perf: Fix broken interrupt rate throttling")
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250405141635.243786-3-wangqing7171@gmail.com
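The numbers in the table above can be reproduced with a simplified
model. This is only a sketch of the described behaviour, not the kernel
code: the function names are hypothetical, and the "checked" variant is
an assumed illustration of checking every interrupt, not necessarily the
actual fix.

```c
/*
 * Model of the behaviour described above: the first interrupt in a tick
 * is never checked, later ones are throttled once the count exceeds the
 * limit. Returns how many interrupts are handled in one tick.
 */
static unsigned long long example_unchecked_first(unsigned int max_samples_per_tick)
{
	unsigned long long handled = 0;

	for (;;) {
		handled++;		/* one more interrupt is handled */
		if (handled == 1)
			continue;	/* first interrupt: check skipped */
		if (handled > max_samples_per_tick)
			return handled;	/* throttled from here on */
	}
}

/*
 * Assumed alternative for comparison: every interrupt is checked and
 * the event is throttled as soon as the limit is reached.
 */
static unsigned long long example_checked_always(unsigned int max_samples_per_tick)
{
	unsigned long long handled = 0;

	for (;;) {
		handled++;
		if (handled >= max_samples_per_tick)
			return handled;
	}
}
```

With max_samples_per_tick = 1, the first model handles two interrupts
per tick, so HZ=100 gives 200 per second and HZ=250 gives 500 per
second, matching the table. The second model handles one per tick.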
Go to the appropriate section labels when css_rstat_init() or
psi_cgroup_alloc() fails.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Fixes: a97915559f ("cgroup: change rstat function signatures from cgroup-based to css-based")
Signed-off-by: Tejun Heo <tj@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.15-rc4).
This pull includes wireless and a fix to vxlan which isn't
in Linus's tree just yet. The latter creates a silent conflict
/ build breakage, so merging it now to avoid causing problems.
drivers/net/vxlan/vxlan_vnifilter.c
094adad913 ("vxlan: Use a single lock to protect the FDB table")
087a9eb9e5 ("vxlan: vnifilter: Fix unlocked deletion of default FDB entry")
https://lore.kernel.org/20250423145131.513029-1-idosch@nvidia.com
No "normal" conflicts, or adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI
code, adding an open-coded erratum check for Cavium ThunderX
* Bugfixes from a planned posted interrupt rework
* Do not use kvm_rip_read() unconditionally to cater for guests
with inaccessible register state.
Remove two trivial but long unused functions.
__round_jiffies() has been unused since 2008's
commit 9c133c469d ("Add round_jiffies_up and related routines")
__round_jiffies_up() has been unused since 2019's
commit 7ae3f6e130 ("powerpc/watchdog: Use hrtimers for per-CPU
heartbeat")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250418200803.427911-1-linux@treblig.org
The ops.running() and ops.stopping() callbacks can be invoked from a CPU
other than the one the task is assigned to, particularly when a task
property is changed, as both scx_next_task_scx() and dequeue_task_scx() may
run on CPUs different from the task's target CPU.
This behavior can lead to confusion or incorrect assumptions if not
properly clarified, potentially resulting in bugs (see [1]).
Therefore, update the documentation to clarify this aspect and advise
users to use scx_bpf_task_cpu() to determine the actual CPU the task
will run on or was running on.
[1] https://github.com/sched-ext/scx/pull/1728
Cc: Jake Hillion <jake@hillion.co.uk>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add support for accessing const void pointer arguments in tracing
programs.
Currently, tracing programs are allowed to access void pointers. If we try
to access an argument that is a pointer to const void, like the 2nd
argument in kfree, the verifier fails to load the program with:
0: R1=ctx() R10=fp0
; asm volatile ("r2 = *(u64 *)(r1 + 8); ");
0: (79) r2 = *(u64 *)(r1 +8)
func 'kfree' arg1 type UNKNOWN is not a struct
Extend the is_int_ptr check to cover void in addition to generic
integers, and rename it to is_void_or_int_ptr.
Signed-off-by: KaFai Wan <mannkafai@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20250423121329.3163461-2-mannkafai@gmail.com
Many conditional checks in the switch-case are redundant
with bpf_base_func_proto and should be removed.
Regarding the permission checks in bpf_base_func_proto:
The permission checks in bpf_prog_load (as outlined below)
ensure that tracing programs have both CAP_BPF and CAP_PERFMON
capabilities, thus enabling the use of the corresponding prototypes
in bpf_base_func_proto without adverse effects.
bpf_prog_load
......
bpf_cap = bpf_token_capable(token, CAP_BPF);
......
if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
type != BPF_PROG_TYPE_CGROUP_SKB &&
!bpf_cap)
goto put_token;
......
if (is_perfmon_prog_type(type) && !bpf_token_capable(token, CAP_PERFMON))
goto put_token;
......
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20250423073151.297103-1-yangfeng59949@163.com
The calculation of the index used to access the mask field in 'struct
bpf_raw_tp_null_args' is done with 'int' type, which could overflow when
the tracepoint being attached has more than 8 arguments.
While none of the tracepoints mentioned in raw_tp_null_args[] currently
have more than 8 arguments, there do exist tracepoints that had more
than 8 arguments (e.g. iocost_iocg_forgive_debt), so use the correct
type for calculation and avoid Smatch static checker warning.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/bpf/20250418074946.35569-1-shung-hsi.yu@suse.com
Closes: https://lore.kernel.org/r/843a3b94-d53d-42db-93d4-be10a4090146@stanley.mountain/
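The overflow described above comes from shifting a 32-bit int by 32 or
more bits. A minimal sketch of the safe form follows, assuming for
illustration that the mask is 64 bits wide with 4 bits per argument (an
inference from the 8-argument boundary, not taken from the patch); the
names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for illustration: 4 flag bits per argument. */
#define EXAMPLE_BITS_PER_ARG 4

static bool example_arg_flag_is_set(uint64_t mask, unsigned int arg)
{
	/*
	 * Writing (1 << (arg * EXAMPLE_BITS_PER_ARG)) would shift a 32-bit
	 * int, which is undefined once arg reaches 8. Shifting a 64-bit
	 * constant keeps the index valid for arg values 0 through 15.
	 */
	return mask & (UINT64_C(1) << (arg * EXAMPLE_BITS_PER_ARG));
}
```

With this form, a flag stored at bit 32 is found for argument index 8
and not for index 7.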
Fixed a race condition in incrementing wq->stats[PWQ_STAT_COMPLETED] by
moving the operation under pool->lock.
Reported-by: syzbot+01affb1491750534256d@syzkaller.appspotmail.com
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
A small number of fixes.
virtgpu is exempt from reset shutdown for now -
a more complete fix is in the works
spec compliance fixes in:
virtio-pci cap commands
vhost_scsi_send_bad_target
virtio console resize
missing locking fix in vhost-scsi
virtio ring - a KCSAN false positive fix
VHOST_*_OWNER documentation fix
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"A small number of fixes:
- virtgpu is exempt from reset shutdown for now - a more complete fix
is in the works
- spec compliance fixes in:
- virtio-pci cap commands
- vhost_scsi_send_bad_target
- virtio console resize
- missing locking fix in vhost-scsi
- virtio ring - a KCSAN false positive fix
- VHOST_*_OWNER documentation fix"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost-scsi: Fix vhost_scsi_send_status()
vhost-scsi: Fix vhost_scsi_send_bad_target()
vhost-scsi: protect vq->log_used with vq->mutex
vhost_task: fix vhost_task_create() documentation
virtio_console: fix order of fields cols and rows
virtio_console: fix missing byte order handling for cols and rows
virtgpu: don't reset on shutdown
virtio_ring: Fix data race by tagging event_triggered as racy for KCSAN
vhost: fix VHOST_*_OWNER documentation
virtio_pci: Use self group type for cap commands
Commit:
f4b07fd62d ("perf/core: Use POLLHUP for pinned events in error")
started to emit POLLHUP for pinned events in an error state.
But POLLHUP is also used to signal that the task an event is attached to
has terminated. To distinguish pinned per-task events in the error state,
one would need to check whether the task is still alive.
Change it to POLLERR to make it clear.
Suggested-by: Gabriel Marin <gmx@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250422223318.180343-1-namhyung@kernel.org
a11d6784d7 ("sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()")
added a call to scx_ops_error() which was renamed to scx_error() in
for-6.16. Fix it up.
scx_bpf_cpuperf_set() can be used to set a performance target level on
any CPU. However, it doesn't correctly acquire the corresponding rq
lock, which may lead to unsafe behavior and trigger the following
warning, due to the lockdep_assert_rq_held() check:
[ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[ 51.713836] Call trace:
[ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c
[ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0
[ 51.713845] bpf_scx_reg+0x18/0x30
[ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0
[ 51.713849] __sys_bpf+0x1934/0x22a0
Fix by properly acquiring the rq lock when possible or raising an error
if we try to operate on a CPU that is not the one currently locked.
Fixes: d86adb4fc0 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Some kfuncs provided by sched_ext may need to operate on a struct rq,
but they can be invoked from various contexts, specifically, different
scx callbacks.
While some of these callbacks are invoked with a particular rq already
locked, others are not. This makes it impossible for a kfunc to reliably
determine whether it's safe to access a given rq, triggering potential
bugs or unsafe behaviors, see for example [1].
To address this, track the currently locked rq whenever a sched_ext
callback is invoked via SCX_CALL_OP*().
This allows kfuncs that need to operate on an arbitrary rq to retrieve
the currently locked one and apply the appropriate action as needed.
[1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a reserved memory region described in the device tree is attached
to a device, it is expected that the device's limitations are correctly
included in that description.
However, if the device driver failed to implement DMA address masking
or addressing beyond the default 32 bits (on arm64), then bad things
could happen because the DMA address was truncated, such as playing
back audio with no actual audio coming out, or DMA overwriting random
blocks of kernel memory.
Check against the coherent DMA mask when the memory regions are attached
to the device. Give a warning when the memory region can not be covered
by the mask.
A warning instead of a hard error was chosen, because it is possible
that existing drivers could be working fine even if they forgot to
extend the coherent DMA mask.
Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250421083930.374173-1-wenst@chromium.org
lkp reported a warning about missing prototype for a recent patch.
The kernel-doc style comments are out of sync, move them to the right
function.
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Christoph Hellwig <hch@lst.de>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202504190615.g9fANxHw-lkp@intel.com/
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
[mszyprow: reformatted subject]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250422114034.3535515-1-balbirs@nvidia.com
Other messages are occasionally printed between these two, for example:
[203104.106534] Restarting tasks ...
[203104.106559] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
[203104.112354] done.
This seems to be a timing issue, seen in two of the eleven
hibernation exits in my current `dmesg` output.
When printed on its own, the "done" message has the default log level.
This makes the output of `dmesg --level=warn` quite misleading.
Add enough context for the "done" messages to make sense on their own,
and use the same log level for all messages.
Change the messages to "<event>..." / "Done <event>.", unlike a449dfbfc0
which uses "<event>..." / "<event> completed.". Front-loading the unique
part of the message makes it easier to scan the log, and reduces ambiguity
for users who aren't confident in their English comprehension.
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Andrew Sayers <kernel.org@pileofstuff.org>
Link: https://patch.msgid.link/20250411152632.2806038-1-kernel.org@pileofstuff.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Use kvzalloc() so that large exit_dump buffer allocations don't fail
easily.
- Remove cpu.weight / cpu.idle unimplemented warnings which are more
annoying than helpful. This makes SCX_OPS_HAS_CGROUP_WEIGHT unnecessary.
Mark it for deprecation.
Merge tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Use kvzalloc() so that large exit_dump buffer allocations don't fail
easily
- Remove cpu.weight / cpu.idle unimplemented warnings which are more
annoying than helpful.
This makes SCX_OPS_HAS_CGROUP_WEIGHT unnecessary. Mark it for
deprecation
* tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation
sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings
sched_ext: Use kvzalloc for large exit_dump allocation
- Fix compilation in CONFIG_LOCKDEP && !CONFIG_PROVE_RCU configurations.
- Allow "cpuset_v2_mode" mount option for "cpuset" filesystem type to make
life easier for android.
Merge tag 'cgroup-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix compilation in CONFIG_LOCKDEP && !CONFIG_PROVE_RCU configurations
- Allow "cpuset_v2_mode" mount option for "cpuset" filesystem type to
make life easier for android
* tag 'cgroup-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset-v1: Add missing support for cpuset_v2_mode
cgroup: Fix compilation issue due to cgroup_mutex not being exported
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Martin KaFai Lau says:
====================
pull-request: bpf-next 2025-04-17
We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 18 files changed, 1748 insertions(+), 19 deletions(-).
The main changes are:
1) bpf qdisc support, from Amery Hung.
A qdisc can be implemented in bpf struct_ops programs and
can be used the same as other existing qdiscs in the
"tc qdisc" command.
2) Add xsk tail adjustment tests, from Tushar Vyavahare.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: Test attaching bpf qdisc to mq and non root
selftests/bpf: Add a bpf fq qdisc to selftest
selftests/bpf: Add a basic fifo qdisc test
libbpf: Support creating and destroying qdisc
bpf: net_sched: Disable attaching bpf qdisc to non root
bpf: net_sched: Support updating bstats
bpf: net_sched: Add a qdisc watchdog timer
bpf: net_sched: Add basic bpf qdisc kfuncs
bpf: net_sched: Support implementation of Qdisc_ops in bpf
bpf: Prepare to reuse get_ctx_arg_idx
selftests/xsk: Add tail adjustment tests and support check
selftests/xsk: Add packet stream replacement function
====================
Link: https://patch.msgid.link/20250417184338.3152168-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In css_rstat_init() allocations are done for the cgroup's pointers
rstat_cpu and rstat_base_cpu. Make sure the allocation checks are
consistent with what they are allocating.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Revert the hfs{plus} deprecation warning that's also included in this
pull request. The commit introducing the deprecation warning resides
rather early in this branch. So simply dropping it would've rebased
all other commits which I decided to avoid. Hence the revert in the
same branch
[ Background - the deprecation warning discussion resulted in people
stepping up, and so hfs{plus} will have a maintainer taking care of
it after all.. - Linus ]
- Switch CONFIG_SYSFS_SYCALL default to n and decouple from
CONFIG_EXPERT
- Fix an audit bug caused by changes to our kernel path lookup helpers
this cycle. Audit needs the parent path even if the dentry it tried
to look up is negative
- Ensure that the kernel path lookup helpers leave the passed in path
argument clean when they return an error. This is consistent with all
our other helpers
- Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant
information is available to kernel consumers as well
- Don't set a timer and call schedule() if the timer will expire
immediately in epoll
- Mark netfs lookup tables with __nonstring
* tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Revert "hfs{plus}: add deprecation warning"
fs: move the bdex_statx call to vfs_getattr_nosec
netfs: Mark __nonstring lookup tables
eventpoll: Set epoll timeout if it's in the future
fs: ensure that *path_locked*() helpers leave passed path pristine
fs: add kern_path_locked_negative()
hfs{plus}: add deprecation warning
Kconfig: switch CONFIG_SYSFS_SYCALL default to n
- Initialize hash variables in ftrace subops logic
The fix that simplified the ftrace subops logic opened a path where some
variables could be used without being initialized, and done subtly where
the compiler did not catch it. Initialize those variables to the
EMPTY_HASH, which is the default hash.
- Reinitialize the hash pointers after they are freed
Some of the hash pointers in the subop logic were freed but may still be
referenced later. To prevent use-after-free bugs, initialize them back to
the EMPTY_HASH.
- Free the ftrace hashes when they are replaced
The fix that simplified the subops logic updated some hash pointers, but
left the original hash that they were pointing to where they are no longer
used. This caused a memory leak. Free the hashes that are pointed to by
the pointers when they are replaced.
- Fix size initialization of ftrace direct function hash
The ftrace direct function hash used by BPF initialized the hash size
incorrectly. It checked the number of items against a hard-coded 32, which
made the hash bit size 5. The hash size is supposed to be limited by the bit
size of the hash, as the bitmask is allowed to be greater than 5. Rework
the size check to first pass the number of elements to fls() and then
compare that to FTRACE_HASH_MAX_BITS before allocating the hash.
- Fix format output of ftrace_graph_ent_entry event
The field depth of the ftrace_graph_ent_entry event is of size 4 but the
output showed it as unsigned long and used "%lu". Change it to unsigned int
and use "%u" in the print format that is displayed to user space.
- Fix the trace event filter on strings
Events can be filtered on numbers or string values. The return value
checked from strncpy_from_kernel_nofault() and strncpy_from_user_nofault()
was used to determine if reading the strings would fault or not. It would
return fault if the value was non-zero, which basically meant that it
was always considering the read as a fault.
- Add selftest to test trace event string filtering
In order to catch the breakage of the string filtering, add a self test to
make sure that it continues to work.
Merge tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Initialize hash variables in ftrace subops logic
The fix that simplified the ftrace subops logic opened a path where
some variables could be used without being initialized, and done
subtly where the compiler did not catch it. Initialize those
variables to the EMPTY_HASH, which is the default hash.
- Reinitialize the hash pointers after they are freed
Some of the hash pointers in the subop logic were freed but may still
be referenced later. To prevent use-after-free bugs, initialize them
back to the EMPTY_HASH.
- Free the ftrace hashes when they are replaced
The fix that simplified the subops logic updated some hash pointers,
but left the original hash that they were pointing to where they are
no longer used. This caused a memory leak. Free the hashes that are
pointed to by the pointers when they are replaced.
- Fix size initialization of ftrace direct function hash
The ftrace direct function hash used by BPF initialized the hash size
incorrectly. It checked the number of items against a hard-coded 32,
which made the hash bit size 5. The hash size is supposed to be limited
by the bit size of the hash, as the bitmask is allowed to be greater
than 5. Rework the size check to first pass the number of elements to
fls() and then compare that to FTRACE_HASH_MAX_BITS before allocating
the hash.
- Fix format output of ftrace_graph_ent_entry event
The field depth of the ftrace_graph_ent_entry event is of size 4 but
the output showed it as unsigned long and used "%lu". Change it to
unsigned int and use "%u" in the print format that is displayed to
user space.
- Fix the trace event filter on strings
Events can be filtered on numbers or string values. The return value
checked from strncpy_from_kernel_nofault() and
strncpy_from_user_nofault() was used to determine if reading the
strings would fault or not. It reported a fault if the value was
non-zero, which basically meant that it always considered the read
a fault.
- Add selftest to test trace event string filtering
In order to catch the breakage of the string filtering, add a self
test to make sure that it continues to work.
* tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: selftests: Add testing a user string to filters
tracing: Fix filter string testing
ftrace: Fix type of ftrace_graph_ent_entry.depth
ftrace: fix incorrect hash size in register_ftrace_direct()
ftrace: Free ftrace hashes after they are replaced in the subops code
ftrace: Reinitialize hash to EMPTY_HASH after freeing
ftrace: Initialize variables for ftrace_startup/shutdown_subops()
Add a helper for refilling a task with the default slice and updating
the event statistics accordingly.
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs
when tasks are enqueued, which is not accurate. What it actually
counts is a refill with the default slice, and this can occur either on
enqueue or in pick_task. Rename the variable to
SCX_EV_REFILL_SLICE_DFL.
Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit cb380909ae ("vhost: return task creation error instead of NULL")
changed the return value of vhost_task_create(), but did not update the
documentation.
Reflect the change in the documentation: on an error, vhost_task_create()
returns an ERR_PTR() and no longer NULL.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20250327124435.142831-1-sgarzare@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The filter string testing uses strncpy_from_kernel/user_nofault() to
retrieve the string to test the filter against. The if() statement was
incorrect as it considered 0 as a fault, when only a negative return
value indicates a fault.
Running the following commands:
# cd /sys/kernel/tracing
# echo "filename.ustring ~ \"/proc*\"" > events/syscalls/sys_enter_openat/filter
# echo 1 > events/syscalls/sys_enter_openat/enable
# ls /proc/$$/maps
# cat trace
Would produce nothing, but with the fix it will produce something like:
ls-1192 [007] ..... 8169.828333: sys_openat(dfd: ffffffffffffff9c, filename: 7efc18359904, flags: 80000, mode: 0)
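The return-value convention that the fix relies on can be sketched in user space. This is a minimal illustration, not the kernel code: copy_string_nofault() is a hypothetical stand-in for strncpy_from_kernel_nofault()/strncpy_from_user_nofault(), assumed here to return a negative errno on a fault and the number of bytes copied (including the trailing NUL) otherwise.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the *_nofault() string copy helpers:
 * negative errno on a (simulated) fault, otherwise the number of
 * bytes copied including the trailing NUL.
 */
static long copy_string_nofault(char *dst, const char *src, size_t count)
{
	size_t len;

	if (!src || count == 0)
		return -EFAULT;		/* simulate a faulting read */

	len = strnlen(src, count - 1);
	memcpy(dst, src, len);
	dst[len] = '\0';
	return (long)len + 1;
}

/* Only a negative return value means that the read faulted. */
static bool read_faulted(long ret)
{
	return ret < 0;
}
```

Under these assumptions a successful copy always returns a positive value, so any test other than "ret < 0" misclassifies either successful reads or faults.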
Link: https://lore.kernel.org/all/CAEf4BzbVPQ=BjWztmEwBPRKHUwNfKBkS3kce-Rzka6zvbQeVpg@mail.gmail.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20250417183003.505835fb@gandalf.local.home
Fixes: 77360f9bbc ("tracing: Add test for user space strings when filtering on string pointers")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Reported-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
ftrace_graph_ent.depth is int, but ftrace_graph_ent_entry.depth is
unsigned long. This confuses trace-cmd on 64-bit big-endian systems and
makes it print a huge number of spaces. Fix this by using unsigned int,
which has a matching size, instead.
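Why a size mismatch like this is sensitive to endianness can be sketched in user space. The exact way trace-cmd parses the field is not described here, so the sketch only assumes a reader that consumes 8 bytes where 4 were written; read_be() and read_le() are hypothetical helpers, not part of any tracing library.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble `size` bytes at `p` as a big-endian unsigned integer. */
static uint64_t read_be(const unsigned char *p, size_t size)
{
	uint64_t v = 0;

	for (size_t i = 0; i < size; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Assemble `size` bytes at `p` as a little-endian unsigned integer. */
static uint64_t read_le(const unsigned char *p, size_t size)
{
	uint64_t v = 0;

	for (size_t i = size; i > 0; i--)
		v = (v << 8) | p[i - 1];
	return v;
}
```

With a 4-byte depth of 2 followed by 4 zero bytes, a big-endian 8-byte read yields 2 << 32, while a little-endian 8-byte read still yields 2, which would be consistent with the problem only showing up on big-endian systems.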
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/20250412221847.17310-2-iii@linux.ibm.com
Fixes: ff5c9c576e ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
register_ftrace_direct() caps the number of ftrace hash bits at fls(32),
which seems illogical. Fix it by capping the hash bits at
FTRACE_HASH_MAX_BITS instead.
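The reworked size computation (take fls() of the element count, then clamp to the maximum number of hash bits) can be sketched in user space as follows. fls_sketch() is a portable stand-in for the kernel's fls(), and the value 12 for SKETCH_HASH_MAX_BITS is an assumption made for illustration; the kernel source defines the real FTRACE_HASH_MAX_BITS.

```c
/* Assumed cap for illustration; see FTRACE_HASH_MAX_BITS in the kernel. */
#define SKETCH_HASH_MAX_BITS 12

/* Stand-in for fls(): 1-based index of the highest set bit, 0 for x == 0. */
static int fls_sketch(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Number of hash bits to allocate for nr_entries elements. */
static int direct_hash_bits(unsigned int nr_entries)
{
	int bits = fls_sketch(nr_entries);

	if (bits > SKETCH_HASH_MAX_BITS)
		bits = SKETCH_HASH_MAX_BITS;
	return bits;
}
```

The point of the ordering is that the clamp is applied to the bit count rather than to the element count.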
Link: https://lore.kernel.org/20250413014444.36724-1-dongml2@chinatelecom.cn
Fixes: d05cb47066 ("ftrace: Fix modification of direct_function hash while in use")
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The subops processing creates new hashes when adding and removing subops.
In some places the old hashes that were replaced were not freed, which
caused memory leaks.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417135939.245b128d@gandalf.local.home
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There are several locations that free an ftrace hash pointer that may be
referenced again. Reset them to EMPTY_HASH so that a u-a-f bug doesn't
happen.
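The free-then-reset pattern can be sketched in user space. All names below (struct hash_sketch, EMPTY_HASH_SKETCH, free_and_reset()) are hypothetical and only mirror the idea of a shared, never-freed empty hash; they are not the ftrace implementation.

```c
#include <stdlib.h>

struct hash_sketch {
	int count;
};

/* Shared sentinel that stands for "no entries"; it is never freed. */
static struct hash_sketch empty_hash_storage;
#define EMPTY_HASH_SKETCH (&empty_hash_storage)

static struct hash_sketch *alloc_hash(int count)
{
	struct hash_sketch *h = malloc(sizeof(*h));

	if (h)
		h->count = count;
	return h;
}

/* Free the hash behind *slot and leave the slot pointing at the sentinel. */
static void free_and_reset(struct hash_sketch **slot)
{
	if (*slot && *slot != EMPTY_HASH_SKETCH)
		free(*slot);
	*slot = EMPTY_HASH_SKETCH;
}
```

After free_and_reset() a later dereference of the slot sees a valid empty hash rather than freed memory, and calling it a second time is harmless.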
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417110933.20ab718b@gandalf.local.home
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The reworking to fix and simplify ftrace_startup_subops() and
ftrace_shutdown_subops() made it possible for the filter_hash and
notrace_hash variables to be used uninitialized in a way that the compiler
did not catch.
Initialize both filter_hash and notrace_hash to the EMPTY_HASH as that is
what they should be if they are never used.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250417104017.3aea66c2@gandalf.local.home
Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Fixes: 0ae6b8ce20 ("ftrace: Fix accounting of subop hashes")
Closes: https://lore.kernel.org/all/1db64a42-626d-4b3a-be08-c65e47333ce2@linux.ibm.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Rename get_ctx_arg_idx to bpf_ctx_arg_idx, and allow others to call it.
No functional change.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409214606.2000194-2-ameryhung@gmail.com
Android has mounted the v1 cpuset controller using filesystem type
"cpuset" (not "cgroup") since 2015 [1], and depends on the resulting
behavior where the controller name is not added as a prefix for cgroupfs
files. [2]
Later, a problem was discovered where cpu hotplug onlining did not
affect the cpuset/cpus files, which Android carried an out-of-tree patch
to address for a while. An attempt was made to upstream this patch, but
the recommendation was to use the "cpuset_v2_mode" mount option
instead. [3]
An effort was made to do so, but this fails with "cgroup: Unknown
parameter 'cpuset_v2_mode'" because commit e1cba4b85d ("cgroup: Add
mount flag to enable cpuset to use v2 behavior in v1 cgroup") did not
update the special cased cpuset_mount(), and only the cgroup (v1)
filesystem type was updated.
Add parameter parsing to the cpuset filesystem type so that
cpuset_v2_mode works like the cgroup filesystem type:
$ mkdir /dev/cpuset
$ mount -t cpuset -ocpuset_v2_mode none /dev/cpuset
$ mount|grep cpuset
none on /dev/cpuset type cgroup (rw,relatime,cpuset,noprefix,cpuset_v2_mode,release_agent=/sbin/cpuset_release_agent)
[1] b769c8d24f
[2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/setup/cgroup_map_write.cpp;drc=2dac5d89a0f024a2d0cc46a80ba4ee13472f1681;l=192
[3] https://lore.kernel.org/lkml/f795f8be-a184-408a-0b5a-553d26061385@redhat.com/T/
Fixes: e1cba4b85d ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Waiman Long <longman@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When adding folio_memcg function call in the zram module for
Android16-6.12, the following error occurs during compilation:
ERROR: modpost: "cgroup_mutex" [../soc-repo/zram.ko] undefined!
This error is caused by the indirect call to lockdep_is_held(&cgroup_mutex)
within folio_memcg. The export setting for cgroup_mutex is controlled by
the CONFIG_PROVE_RCU macro. If CONFIG_LOCKDEP is enabled while
CONFIG_PROVE_RCU is not, this compilation error will occur.
To resolve this issue, also export cgroup_mutex when CONFIG_LOCKDEP is
enabled, so that it is properly exported whenever it is needed.
Signed-off-by: gao xu <gaoxu2@honor.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Notice that ignore_dl_rate_limit() need not piggy back on the
limits_changed handling to achieve its goal (which is to enforce a
frequency update before its due time).
Namely, if sugov_should_update_freq() is updated to check
sg_policy->need_freq_update and return 'true' if it is set when
sg_policy->limits_changed is not set, ignore_dl_rate_limit() may
set the former directly instead of setting the latter, so it can
avoid hitting the memory barrier in sugov_should_update_freq().
Update the code accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/10666429.nUPlyArG6x@rjwysocki.net
The handling of the limits_changed flag in struct sugov_policy needs to
be explicitly synchronized to ensure that cpufreq policy limits updates
will not be missed in some cases.
Without that synchronization it is theoretically possible that
the limits_changed update in sugov_should_update_freq() will be
reordered with respect to the reads of the policy limits in
cpufreq_driver_resolve_freq() and in that case, if the limits_changed
update in sugov_limits() clobbers the one in sugov_should_update_freq(),
the new policy limits may not take effect for a long time.
Likewise, the limits_changed update in sugov_limits() may theoretically
get reordered with respect to the updates of the policy limits in
cpufreq_set_policy() and if sugov_should_update_freq() runs between
them, the policy limits change may be missed.
To ensure that the above situations will not take place, add memory
barriers preventing the reordering in question from taking place and
add READ_ONCE() and WRITE_ONCE() annotations around all of the
limits_changed flag updates to prevent the compiler from messing
with that code.
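The ordering requirement described above can be sketched with C11 atomics in user space. This is an analogy under stated assumptions, not the schedutil code: struct policy_sketch and both functions are hypothetical, and the C11 fences only play the role of the kernel memory barriers mentioned in the text.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct policy_sketch {
	atomic_uint min_freq;
	atomic_uint max_freq;
	atomic_bool limits_changed;
};

/* Writer side: publish the new limits, then raise the flag. */
static void limits_update(struct policy_sketch *p, unsigned int min,
			  unsigned int max)
{
	atomic_store_explicit(&p->min_freq, min, memory_order_relaxed);
	atomic_store_explicit(&p->max_freq, max, memory_order_relaxed);
	/* Keep the limits stores ordered before the flag store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->limits_changed, true, memory_order_relaxed);
}

/* Reader side: consume the flag, then read the limits. */
static bool should_update(struct policy_sketch *p)
{
	if (!atomic_load_explicit(&p->limits_changed, memory_order_relaxed))
		return false;

	atomic_store_explicit(&p->limits_changed, false, memory_order_relaxed);
	/* Keep the flag clear ordered before later reads of the limits. */
	atomic_thread_fence(memory_order_seq_cst);
	return true;
}
```

In this sketch a reader that observes the flag and clears it is guaranteed to read the limits that were stored before the flag was raised, and a concurrent writer that raises the flag again is not lost.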
Fixes: 600f5badb7 ("cpufreq: schedutil: Don't skip freq update when limits change")
Cc: 5.3+ <stable@vger.kernel.org> # 5.3+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3376719.44csPzL39Z@rjwysocki.net
Commit 8e461a1cb4 ("cpufreq: schedutil: Fix superfluous updates caused
by need_freq_update") modified sugov_should_update_freq() to set the
need_freq_update flag only for drivers with CPUFREQ_NEED_UPDATE_LIMITS
set, but that flag generally needs to be set when the policy limits
change because the driver callback may need to be invoked for the new
limits to take effect.
However, if the return value of cpufreq_driver_resolve_freq() after
applying the new limits is still equal to the previously selected
frequency, the driver callback needs to be invoked only in the case
when CPUFREQ_NEED_UPDATE_LIMITS is set (which means that the driver
specifically wants its callback to be invoked every time the policy
limits change).
Update the code accordingly to avoid missing policy limits changes for
drivers without CPUFREQ_NEED_UPDATE_LIMITS.
Fixes: 8e461a1cb4 ("cpufreq: schedutil: Fix superfluous updates caused by need_freq_update")
Closes: https://lore.kernel.org/lkml/Z_Tlc6Qs-tYpxWYb@linaro.org/
Reported-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/3010358.e9J7NaK4W3@rjwysocki.net
Due to an oversight in merging:
da916e96e2 ("perf: Make perf_pmu_unregister() useable")
on top of:
a3c3c66670 ("perf/core: Fix child_total_time_enabled accounting bug at task exit")
the timekeeping fix from this latter patch got undone.
Redo it.
Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20250417080815.GI38216@noisy.programming.kicks-ass.net
Due to an oversight in merging:
da916e96e2 ("perf: Make perf_pmu_unregister() useable")
on top of:
56799bc035 ("perf: Fix hang while freeing sigtrap event")
.. it is now possible to hit put_event(EVENT_TOMBSTONE), which makes
the computer sad.
This also means that for the event->parent == EVENT_TOMBSTONE, the
put_event() matching inherit_event() has gone missing.
Previously this was done in perf_event_release_kernel() after calling
perf_remove_from_context(), but with it delegated to put_event(), this
case is now entirely missed, leading to leaks.
Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
Tested-by: James Clark <james.clark@linaro.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Closes: https://lore.kernel.org/oe-lkp/202504131701.941039cd-lkp@intel.com
Link: https://lkml.kernel.org/r/20250415131446.GN5600@noisy.programming.kicks-ass.net
So there are three situations:
* If perf_event_free_task() has removed all the children from the parent list
before perf_event_release_kernel() got a chance to even iterate them, then
it's all good as there is no get_ctx() pending.
* If perf_event_release_kernel() iterates a child event, but it gets freed
meanwhile by perf_event_free_task() while the mutexes are temporarily
unlocked, it's all good because while locking again the ctx mutex,
perf_event_release_kernel() observes TASK_TOMBSTONE.
* But if perf_event_release_kernel() frees the child event before
perf_event_free_task() got a chance, we may face this scenario:
perf_event_release_kernel()                       perf_event_free_task()
--------------------------                        ------------------------
mutex_lock(&event->child_mutex)
get_ctx(child->ctx)
mutex_unlock(&event->child_mutex)
mutex_lock(ctx->mutex)
mutex_lock(&event->child_mutex)
perf_remove_from_context(child)
mutex_unlock(&event->child_mutex)
mutex_unlock(ctx->mutex)
                                                  // This lock acquires ctx->refcount == 2
                                                  // visibility
                                                  mutex_lock(ctx->mutex)
                                                  ctx->task = TASK_TOMBSTONE
                                                  mutex_unlock(ctx->mutex)
                                                  wait_var_event()
                                                      // enters prepare_to_wait() since
                                                      // ctx->refcount == 2
                                                      // is guaranteed to be seen
                                                      set_current_state(TASK_INTERRUPTIBLE)
                                                      smp_mb()
                                                      if (ctx->refcount != 1)
                                                          schedule()
put_ctx()
    // NOT fully ordered! Only RELEASE semantics
    refcount_dec_and_test()
        atomic_fetch_sub_release()
// So TASK_TOMBSTONE is not guaranteed to be seen
if (ctx->task == TASK_TOMBSTONE)
    wake_up_var()
Basically it's a broken store buffer:
perf_event_release_kernel()                            perf_event_free_task()
--------------------------                             ------------------------
smp_store_release(&ctx->refcount, ctx->refcount - 1)   ctx->task = TASK_TOMBSTONE
                                                       smp_mb()
READ_ONCE(ctx->task)                                   READ_ONCE(ctx->refcount)
So we need a smp_mb__after_atomic() before looking at ctx->task.
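The store-buffer pattern and the added barrier can be sketched with C11 atomics in user space. This is an analogy, not the perf code: struct ctx_sketch and both functions are hypothetical, and the sequentially consistent fence after the decrement plays the role that smp_mb__after_atomic() plays in the fix.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct ctx_sketch {
	atomic_int refcount;
	_Atomic(void *) task;
};

static int tombstone_storage;
#define TASK_TOMBSTONE_SKETCH ((void *)&tombstone_storage)

/* Release side: drop a reference, then check whether a waiter needs a wakeup. */
static bool put_ctx_needs_wakeup(struct ctx_sketch *ctx)
{
	/* The decrement alone only has RELEASE semantics. */
	atomic_fetch_sub_explicit(&ctx->refcount, 1, memory_order_release);
	/* Full fence so that the load below cannot pass the decrement. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&ctx->task, memory_order_relaxed) ==
	       TASK_TOMBSTONE_SKETCH;
}

/* Free side: mark the context dead, then check whether it has to wait. */
static bool mark_dead_must_wait(struct ctx_sketch *ctx)
{
	atomic_store_explicit(&ctx->task, TASK_TOMBSTONE_SKETCH,
			      memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&ctx->refcount, memory_order_relaxed) != 1;
}
```

With a full fence on both sides, at least one of the two sides observes the other's store: either the free side sees the dropped reference and does not wait, or the release side sees the tombstone and issues the wakeup.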
Fixes: 59f3aa4a3e ("perf: Simplify perf_event_free_task() wait")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/Z_ZvmEhjkAhplCBE@localhost.localdomain
In the zeal to adjust all event->state checks to include the new
REVOKED state, one adjustment was made in error. Notably it resulted
in read() on the perf filedesc to stop working for any state lower
than ERROR, specifically EXIT.
This leads to problems with (among others) perf-stat, which wants to
read the counts after a program has finished execution.
Fixes: da916e96e2 ("perf: Make perf_pmu_unregister() useable")
Reported-by: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
Reported-by: James Clark <james.clark@linaro.org>
Tested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/77036114-8723-4af9-a068-1d535f4e2e81@linaro.org
Link: https://lore.kernel.org/r/20250417080725.GH38216@noisy.programming.kicks-ass.net
Mike reports that commit 6d71a9c616 ("sched/fair: Fix EEVDF entity
placement bug causing scheduling lag") relies on commit 4423af84b2
("sched/fair: optimize the PLACE_LAG when se->vlag is zero") to not
trip a WARN in place_entity().
What happens is that the lag of the very last entity is 0 per
definition -- the average of one element matches the value of that
element. Therefore place_entity() will match the condition skipping
the lag adjustment:
if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) {
Without the 'se->vlag' condition -- it will attempt to adjust the zero
lag even though we're inserting into an empty tree.
Notably, we should have failed the 'cfs_rq->nr_queued' condition, but
don't because they didn't get updated.
Additionally, move update_load_add() after placement() as is
consistent with other place_entity() users -- this change is
non-functional, place_entity() does not use cfs_rq->load.
Fixes: 6d71a9c616 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/c216eb4ef0e0e0029c600aefc69d56681cee5581.camel@gmx.de
Add a file to read local group's "asym_prefer_cpu" from debugfs. This
information was useful when debugging issues where "asym_prefer_cpu" was
incorrectly set to a CPU with a lower asym priority.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409053446.23367-5-kprateek.nayak@amd.com
A subset of AMD Processors supporting Preferred Core Rankings also
feature the ability to dynamically switch these rankings at runtime to
bias load balancing towards or away from the LLC domain with larger
cache.
To support dynamically updating "sg->asym_prefer_cpu" without needing to
rebuild the sched domain, introduce sched_update_asym_prefer_cpu() which
recomputes the "asym_prefer_cpu" when the core-ranking of a CPU changes.
sched_update_asym_prefer_cpu() swaps the "sg->asym_prefer_cpu" with the
CPU whose ranking has changed if the new ranking is greater than that of
the "asym_prefer_cpu". If the CPU whose ranking has changed is the current
"asym_prefer_cpu", it scans the CPUs of the sched groups to find the new
"asym_prefer_cpu" and sets it accordingly.
get_group() for non-overlapping sched domains returns the sched group
for the first CPU in the sched_group_span() which ensures all CPUs in
the group see the updated value of "asym_prefer_cpu".
Overlapping groups are allocated differently and will require moving the
"asym_prefer_cpu" to "sg->sgc" but since the current implementations do
not set "SD_ASYM_PACKING" at NUMA domains, skip additional
indirection and place a SCHED_WARN_ON() to alert any future users.
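The update rule described above for sched_update_asym_prefer_cpu() can be sketched in user space. This is a simplified illustration under assumptions: struct group_sketch, the prio[] array indexed by CPU number and update_asym_prefer_cpu() are hypothetical, and locking, READ_ONCE()/WRITE_ONCE() and the sched domain hierarchy are left out.

```c
#include <stddef.h>

struct group_sketch {
	const int *cpus;	/* CPU numbers that belong to the group */
	size_t nr_cpus;
	int asym_prefer_cpu;	/* currently preferred CPU of the group */
};

/*
 * Call after prio[cpu] has been changed. prio[] is indexed by CPU
 * number; a larger value means a more preferred CPU.
 */
static void update_asym_prefer_cpu(struct group_sketch *g, const int *prio,
				   int cpu)
{
	int best;

	if (cpu != g->asym_prefer_cpu) {
		/* Another CPU changed: take over only with a greater ranking. */
		if (prio[cpu] > prio[g->asym_prefer_cpu])
			g->asym_prefer_cpu = cpu;
		return;
	}

	/* The preferred CPU itself changed: rescan the whole group. */
	best = g->cpus[0];
	for (size_t i = 1; i < g->nr_cpus; i++) {
		if (prio[g->cpus[i]] > prio[best])
			best = g->cpus[i];
	}
	g->asym_prefer_cpu = best;
}
```

The fast path avoids a scan whenever a CPU other than the preferred one changes its ranking; only a change to the preferred CPU itself needs the full pass over the group.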
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409053446.23367-3-kprateek.nayak@amd.com
Subsequent commits add the support to dynamically update the sched_group
struct's "asym_prefer_cpu" member from a remote CPU. Use READ_ONCE()
when reading the "sg->asym_prefer_cpu" to ensure load balancer always
reads the latest value.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250409053446.23367-2-kprateek.nayak@amd.com
lookup_or_create_module_kobject() is marked as static and __init. Drop
the static keyword to make it global.
Since this function can be called from non-init code, use __modinit
instead of __init. The __modinit marker makes it __init if
CONFIG_MODULES is not defined.
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
Link: https://lore.kernel.org/r/20250227184930.34163-4-shyamsaini@linux.microsoft.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
In the unlikely event of the allocation failing, it is better to let
the machine boot with a not fully populated sysfs than to kill it with
this BUG_ON(). All callers are already prepared for
lookup_or_create_module_kobject() returning NULL.
This is also preparation for calling this function from non __init
code, where using BUG_ON for allocation failure handling is not
acceptable.
Since we are here, also start using IS_ENABLED instead of #ifdef
construct.
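The change in failure handling can be sketched in user space: a lookup-or-create helper that reports an allocation failure to its caller by returning NULL instead of aborting. Everything below (struct mk_sketch, the fixed-size table and lookup_or_create()) is hypothetical and only illustrates the calling convention, not the sysfs code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_KOBJS 8
#define SKETCH_NAME_LEN  32

struct mk_sketch {
	char name[SKETCH_NAME_LEN];
};

static struct mk_sketch *table[SKETCH_MAX_KOBJS];
static size_t table_len;

/* Return the entry for `name`, creating it if needed; NULL on failure. */
static struct mk_sketch *lookup_or_create(const char *name)
{
	struct mk_sketch *mk;

	for (size_t i = 0; i < table_len; i++) {
		if (strcmp(table[i]->name, name) == 0)
			return table[i];
	}

	if (table_len == SKETCH_MAX_KOBJS || strlen(name) >= SKETCH_NAME_LEN)
		return NULL;

	mk = calloc(1, sizeof(*mk));
	if (!mk)
		return NULL;	/* let the caller decide, do not abort */

	strcpy(mk->name, name);
	table[table_len++] = mk;
	return mk;
}
```

A caller then checks the result and skips the optional work when it is NULL, which matches the statement above that all callers are already prepared for a NULL return.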
Suggested-by: Thomas Weißschuh <linux@weissschuh.net>
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
Link: https://lore.kernel.org/r/20250227184930.34163-3-shyamsaini@linux.microsoft.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
The locate_module_kobject() function looks up an existing
module_kobject for a given module name. If it cannot find the
corresponding module_kobject, it creates one for the given name.
This commit renames locate_module_kobject() to
lookup_or_create_module_kobject() to better describe its operations.
This doesn't change anything functionality wise.
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com>
Link: https://lore.kernel.org/r/20250227184930.34163-2-shyamsaini@linux.microsoft.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Follow the advice of the Documentation/filesystems/sysfs.rst that show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the value
to be returned to user space.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250416101651.2128688-1-andriy.shevchenko@linux.intel.com
The audit code relies on the fact that kern_path_locked() returned a
path even for a negative dentry. If it doesn't find a valid dentry it
immediately calls:
audit_find_parent(d_backing_inode(parent_path.dentry));
which assumes that parent_path.dentry is still valid. But it isn't since
kern_path_locked() has been changed to path_put() also for a negative
dentry.
Fix this by adding a helper that implements the required audit semantics
and allows us to fix the immediate bleeding. We can find a unified
solution for this afterwards.
Link: https://lore.kernel.org/20250414-rennt-wimmeln-f186c3a780f1@brauner
Fixes: 1c3cb50b58 ("VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry")
Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Since commit 7903f907a2 ("pid: perform free_pid() calls outside
of tasklist_lock"), __unhash_process() -> detach_pid() no longer calls
free_pid(), so proc_flush_pid() can just use p->thread_pid without the
now pointless get_pid() + put_pid().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/20250411121857.GA10550@redhat.com
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
destroy_workqueue() does not ensure that non-pending work submitted with
queue_delayed_work() gets cancelled. The caller has to ensure that
manually.
Add this information about delayed_work in destroy_workqueue()'s
docstring.
Add a TODO for destroy_workqueue() to wait for all delayed_work.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
While debugging and resolving an issue involving forced use of bounce
buffers (see 7170130e4c ("x86/mm/init: Handle the special case of device
private pages in add_pages(), to not increase max_pfn and trigger
dma_addressing_limited() bounce buffers")), it would have been easier
to debug the issue if dma_addressing_limited() had debug information
about the device not being able to address all of memory and thus forcing
all accesses through a bounce buffer. Please see [2].
Implement dev_dbg to debug the potential use of bounce buffers
when we hit the condition. When swiotlb is used,
dma_addressing_limited() is used to determine the maximum DMA
buffer size in dma_direct_max_mapping_size(). The debug prints could be
triggered in that check as well (when enabled).
Link: https://lore.kernel.org/lkml/20250401000752.249348-1-balbirs@nvidia.com/ [1]
Link: https://lore.kernel.org/lkml/20250310112206.4168-1-spasswolf@web.de/ [2]
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250414113752.3298276-1-balbirs@nvidia.com
If the test added in commit b5ffbd1396 ("sysctl: move the extra1/2
boundary check of u8 to sysctl_check_table_array") is run as a module, a
lingering reference to the module is left behind, and a 'sysctl -a'
leads to a panic.
To reproduce, build with:
CONFIG_KUNIT=y
CONFIG_SYSCTL_KUNIT_TEST=m
Then run these commands:
modprobe sysctl-test
rmmod sysctl-test
sysctl -a
The panic varies but generally looks something like this:
BUG: unable to handle page fault for address: ffffa4571c0c7db4
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 100000067 P4D 100000067 PUD 100351067 PMD 114f5e067 PTE 0
Oops: Oops: 0000 [#1] SMP NOPTI
... ... ...
RIP: 0010:proc_sys_readdir+0x166/0x2c0
... ... ...
Call Trace:
<TASK>
iterate_dir+0x6e/0x140
__se_sys_getdents+0x6e/0x100
do_syscall_64+0x70/0x150
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Move the test to lib/test_sysctl.c, where the registration reference is
handled on module exit.
Fixes: b5ffbd1396 ("sysctl: move the extra1/2 boundary check of u8 to sysctl_check_table_array")
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
- Hide get_vm_area() from MMUless builds
The function get_vm_area() is not defined when CONFIG_MMU is not defined.
Hide that function within #ifdef CONFIG_MMU.
- Fix output of synthetic events when they have dynamic strings
The print fmt of the synthetic event's format file used to have "%.*s" for
dynamic size strings even though the user space exported arguments had
only __get_str() macro that provided just a nul terminated string. This
was fixed so that user space could parse this properly. But the reason
that it had "%.*s" was because internally it provided the maximum size of
the string as one of the arguments. The fix that replaced "%.*s" with "%s"
caused the trace output (when the kernel reads the event) to write
"(efault)" as it would now read the length of the string as "%s".
As the string provided is always nul terminated, there's no reason for the
internal code to use "%.*s" anyway. Just remove the length argument to
match the "%s" that is now in the format.
- Fix the ftrace subops hash logic of the manager ops hash
The function_graph uses the ftrace subops code. The subops code is a way
to have a single ftrace_ops registered with ftrace to determine what
functions will call the ftrace_ops callback. More than one user of
function graph can register a ftrace_ops with it. The function graph
infrastructure will then add this ftrace_ops as a subops with the main
ftrace_ops it registers with ftrace. This is because the functions will
always call the function graph callback which in turn calls the subops
ftrace_ops callbacks.
The main ftrace_ops must add a callback to all the functions that the
subops want a callback from. When a subops is registered, it will update
the main ftrace_ops hash to include the functions it wants. This is the
logic that was broken.
The ftrace_ops hash has a "filter_hash" and a "notrace_hash" where all the
functions in the filter_hash but not in the notrace_hash are attached by
ftrace. The original logic would have the main ftrace_ops filter_hash be a
union of all the subops filter_hashes and the main notrace_hash would be an
intersect of all the subops filter hashes. But this was incorrect because
the notrace hash depends on the filter_hash it is associated to and not
the union of all filter_hashes.
Instead, when a subops is added, just include all the functions of the
subops hash that are in its filter_hash but not in its notrace_hash. The
main subops hash should not use its notrace hash, unless all of its subops
hashes have an empty filter_hash (which means to attach to all functions),
and then, and only then, the main ftrace_ops notrace hash can be the
intersect of all the subops hashes.
This not only fixes the bug, but also simplifies the code.
- Add a selftest to better test the subops filtering
Add a selftest that would catch the bug fixed by the above change.
- Fix extra newline printed in function tracing with retval
The function parameter code changed the output logic slightly and called
print_graph_retval() and also printed a newline. The print_graph_retval()
also prints a newline which caused blank lines to be printed in the
function graph tracer when retval was added. This caused one of the
selftests to fail if retvals were enabled. Instead remove the new line
output from print_graph_retval() and have the callers always print the
new line so that it doesn't have to do special logic if it calls
print_graph_retval() or not.
- Fix out-of-bound memory access in the runtime verifier
When rv_is_container_monitor() is called on the last entry on the link
list it references the next entry, which is the list head and causes an
out-of-bound memory access.
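The subops hash rule described in the third item above can be sketched in user space with bitmasks, one bit per function. This is a simplified model under assumptions, not the ftrace implementation: struct ops_hash_sketch and merge_subops() are hypothetical, an empty filter_hash is modelled by the filter_all flag, and the mixed case (some subops with an empty filter, some without) is handled conservatively by tracing everything, which the text above does not specify.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ops_hash_sketch {
	bool filter_all;	/* empty filter_hash: attach to all functions */
	uint32_t filter;	/* functions to attach to when !filter_all */
	uint32_t notrace;	/* functions to exclude */
};

/* Compute the hash of the main ops from the hashes of its subops. */
static struct ops_hash_sketch merge_subops(const struct ops_hash_sketch *sub,
					   size_t n)
{
	struct ops_hash_sketch main_hash = { false, 0, 0 };
	uint32_t common_notrace = UINT32_MAX;
	bool any_all = false, all_all = true;

	if (n == 0)
		return main_hash;	/* nothing registered, nothing attached */

	for (size_t i = 0; i < n; i++) {
		if (sub[i].filter_all) {
			any_all = true;
			common_notrace &= sub[i].notrace;
		} else {
			all_all = false;
			/* Functions in the filter but not in its own notrace. */
			main_hash.filter |= sub[i].filter & ~sub[i].notrace;
		}
	}

	if (all_all) {
		/* Only here may the main notrace hash be used. */
		main_hash.filter_all = true;
		main_hash.filter = 0;
		main_hash.notrace = common_notrace;
	} else if (any_all) {
		/* Assumption of this sketch: fall back to tracing everything. */
		main_hash.filter_all = true;
		main_hash.filter = 0;
		main_hash.notrace = 0;
	}
	return main_hash;
}
```

Each subops notrace mask is applied only to its own filter, so a function excluded by one subops remains attached when another subops asks for it.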
Merge tag 'trace-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Hide get_vm_area() from MMUless builds
The function get_vm_area() is not defined when CONFIG_MMU is not
defined. Hide that function within #ifdef CONFIG_MMU.
- Fix output of synthetic events when they have dynamic strings
The print fmt of the synthetic event's format file used to have "%.*s"
for dynamic size strings even though the user space exported
arguments had only __get_str() macro that provided just a nul
terminated string. This was fixed so that user space could parse this
properly.
But the reason that it had "%.*s" was because internally it provided
the maximum size of the string as one of the arguments. The fix that
replaced "%.*s" with "%s" caused the trace output (when the kernel
reads the event) to write "(efault)" as it would now read the length
of the string as "%s".
As the string provided is always nul terminated, there's no reason
for the internal code to use "%.*s" anyway. Just remove the length
argument to match the "%s" that is now in the format.
- Fix the ftrace subops hash logic of the manager ops hash
The function_graph uses the ftrace subops code. The subops code is a
way to have a single ftrace_ops registered with ftrace to determine
what functions will call the ftrace_ops callback. More than one user
of function graph can register a ftrace_ops with it. The function
graph infrastructure will then add this ftrace_ops as a subops with
the main ftrace_ops it registers with ftrace. This is because the
functions will always call the function graph callback which in turn
calls the subops ftrace_ops callbacks.
The main ftrace_ops must add a callback to all the functions that the
subops want a callback from. When a subops is registered, it will
update the main ftrace_ops hash to include the functions it wants.
This is the logic that was broken.
The ftrace_ops hash has a "filter_hash" and a "notrace_hash" where
all the functions in the filter_hash but not in the notrace_hash are
attached by ftrace. The original logic would have the main ftrace_ops
filter_hash be a union of all the subops filter_hashes and the main
notrace_hash would be an intersect of all the subops filter hashes.
But this was incorrect because the notrace hash depends on the
filter_hash it is associated to and not the union of all
filter_hashes.
Instead, when a subops is added, just include all the functions of
the subops hash that are in its filter_hash but not in its
notrace_hash. The main subops hash should not use its notrace hash,
unless all of its subops hashes have an empty filter_hash (which
means to attach to all functions), and then, and only then, the main
ftrace_ops notrace hash can be the intersect of all the subops
hashes.
This not only fixes the bug, but also simplifies the code.
- Add a selftest to better test the subops filtering
Add a selftest that would catch the bug fixed by the above change.
- Fix extra newline printed in function tracing with retval
The function parameter code changed the output logic slightly and
called print_graph_retval() and also printed a newline. The
print_graph_retval() also prints a newline which caused blank lines
to be printed in the function graph tracer when retval was added.
This caused one of the selftests to fail if retvals were enabled.
Instead remove the new line output from print_graph_retval() and have
the callers always print the new line so that it doesn't have to do
special logic if it calls print_graph_retval() or not.
- Fix out-of-bound memory access in the runtime verifier
When rv_is_container_monitor() is called on the last entry of the
linked list, it references the next entry, which is the list head, and
causes an out-of-bounds memory access.
* tag 'trace-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Fix out-of-bound memory access in rv_is_container_monitor()
ftrace: Do not have print_graph_retval() add a newline
tracing/selftest: Add test to better test subops filtering of function graph
ftrace: Fix accounting of subop hashes
ftrace: Properly merge notrace hashes
tracing: Do not add length to print format in synthetic events
tracing: Hide get_vm_area() from MMUless builds
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Followup fixes for resilient spinlock (Kumar Kartikeya Dwivedi):
- Make res_spin_lock test less verbose, since it was spamming BPF
CI on failure, and make the check for AA deadlock stronger
- Fix rebasing mistake and use architecture provided
res_smp_cond_load_acquire
- Convert BPF maps (queue_stack and ringbuf) to resilient spinlock
to address long standing syzbot reports
- Make sure that classic BPF load instruction from SKF_[NET|LL]_OFF
offsets works when skb is fragmented (Willem de Bruijn)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Convert ringbuf map to rqspinlock
bpf: Convert queue_stack map to rqspinlock
bpf: Use architecture provided res_smp_cond_load_acquire
selftests/bpf: Make res_spin_lock AA test condition stronger
selftests/net: test sk_filter support for SKF_NET_OFF on frags
bpf: support SKF_NET_OFF and SKF_LL_OFF on skb frags
selftests/bpf: Make res_spin_lock test less verbose
When rv_is_container_monitor() is called on the last monitor in
rv_monitors_list, KASAN yells:
BUG: KASAN: global-out-of-bounds in rv_is_container_monitor+0x101/0x110
Read of size 8 at addr ffffffff97c7c798 by task setup/221
The buggy address belongs to the variable:
rv_monitors_list+0x18/0x40
This is because list_next_entry() is called on the last entry in the list.
It wraps around to the first list_head, and the first list_head is not
embedded in struct rv_monitor_def.
Fix it by checking if the monitor is last in the list.
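The wrap-around described above can be sketched in userspace C. This is
only an illustration, not the kernel code: the list helpers are minimal
stand-ins for the kernel's list API, and struct monitor and
monitor_next() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel's circular list helpers. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static bool list_is_last(const struct list_head *item,
			 const struct list_head *head)
{
	return item->next == head;
}

/* Hypothetical monitor type; only the embedded list node matters here. */
struct monitor {
	int id;
	struct list_head list;
};

/*
 * Return the monitor that follows @mon, or NULL when @mon is the last
 * entry. Without the list_is_last() check, container_of() would be
 * applied to the bare list head, which is not embedded in a struct
 * monitor, so the result would point outside any valid object.
 */
static struct monitor *monitor_next(struct monitor *mon,
				    struct list_head *head)
{
	assert(mon && head);
	if (list_is_last(&mon->list, head))
		return NULL;
	return container_of(mon->list.next, struct monitor, list);
}
```

With two monitors on the list, monitor_next() on the first returns the
second, and monitor_next() on the second returns NULL instead of a
pointer computed from the list head.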
Cc: stable@vger.kernel.org
Cc: Gabriele Monaco <gmonaco@redhat.com>
Fixes: cb85c660fc ("rv: Add option for nested monitors and include sched")
Link: https://lore.kernel.org/e85b5eeb7228bfc23b8d7d4ab5411472c54ae91b.1744355018.git.namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The retval and retaddr options for function_graph tracer will add a
comment at the end of a function for both leaf and non leaf functions that
looks like:
__wake_up_common(); /* ret=0x1 */
} /* pick_next_task_fair ret=0x0 */
The function print_graph_retval() adds a newline after the "*/". But if
that's not called, the caller function needs to make sure there's a
newline added.
This is confusing and when the function parameters code was added, it
added a newline even when calling print_graph_retval() as the fact that
the print_graph_retval() function prints a newline isn't obvious.
This caused an extra newline to be printed and that made it fail the
selftests when the retval option was set, as the selftests were not
expecting blank lines being injected into the trace.
Instead of having print_graph_retval() print a newline, just have the
caller always print the newline regardless of whether it calls
print_graph_retval(). This not only fixes the bug, but also simplifies
the code.
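A rough userspace sketch of that convention follows. The names
toy_print_retval() and toy_print_leaf() are hypothetical and the output
goes to a plain buffer rather than a trace_seq; the point is only that
the helper never emits the newline and the caller always does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Append only the return-value comment; never a newline. */
static int toy_print_retval(char *buf, size_t len, unsigned long retval)
{
	return snprintf(buf, len, " /* ret=0x%lx */", retval);
}

/*
 * Print one leaf-function line. The newline is added here, in the
 * caller, whether or not the retval comment was printed, so no path
 * can end up with two newlines or with none.
 */
static int toy_print_leaf(char *buf, size_t len, const char *func,
			  bool show_retval, unsigned long retval)
{
	int n;

	assert(buf && func);
	n = snprintf(buf, len, "%s();", func);
	if (show_retval && n >= 0 && (size_t)n < len)
		n += toy_print_retval(buf + n, len - (size_t)n, retval);
	if (n >= 0 && (size_t)n < len)
		n += snprintf(buf + n, len - (size_t)n, "\n");
	return n;
}
```

Both the retval and the non-retval paths end with exactly one newline,
which is what the selftests expect.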
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250411133015.015ca393@gandalf.local.home
Reported-by: Mark Brown <broonie@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/ccc40f2b-4b9e-4abd-8daf-d22fce2a86f0@sirena.org.uk/
Fixes: ff5c9c576e ("ftrace: Add support for function argument to graph tracer")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In a prior patch series we tried to cleanly differentiate between:
(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
actually references a struct pid that isn't used as a thread-group
leader.
as this was causing issues for non-threaded workloads.
But there are cases where the current simple logic is wrong. Specifically,
if the pid was a leader pid and the check races with __unhash_process().
Stabilize this by using the pidfd waitqueue lock.
Link: https://lore.kernel.org/20250411-work-pidfs-enoent-v2-2-60b2d3bb545f@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Move the pidfd notification out of __change_pid() and into
__unhash_process(). The only valid call to __change_pid() with a NULL
argument and PIDTYPE_PID is from __unhash_process(). This is a lot more
obvious than calling it from __change_pid().
Link: https://lore.kernel.org/20250411-work-pidfs-enoent-v2-1-60b2d3bb545f@kernel.org
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The function graph infrastructure uses ftrace to hook to functions. It has
a single ftrace_ops to manage all the users of function graph. Each
individual user (tracing, bpf, fprobes, etc) has its own ftrace_ops to
track the functions it will have its callback called from. These
ftrace_ops are "subops" to the main ftrace_ops of the function graph
infrastructure.
Each ftrace_ops has a filter_hash and a notrace_hash that is defined as:
Only trace functions that are in the filter_hash but not in the
notrace_hash.
If the filter_hash is empty, it means to trace all functions.
If the notrace_hash is empty, it means do not disable any function.
The function graph main ftrace_ops needs to be a superset containing all
the functions to be traced by all the subops it has. The algorithm to
perform this merge was incorrect.
When the first subops was added to the main ops, it simply made the main
ops a copy of the subops (same filter_hash and notrace_hash).
When a second ops was added, it joined the new subops filter_hash with the
main ops filter_hash as a union of the two sets. The intersect between the
new subops notrace_hash and the main ops notrace_hash was created as the
new notrace_hash of the main ops.
The issue here is that it would then start tracing functions that no
subops were tracing. For example if you had two subops that had:
subops 1:
filter_hash = '*sched*' # trace all functions with "sched" in it
notrace_hash = '*time*' # except do not trace functions with "time"
subops 2:
filter_hash = '*lock*' # trace all functions with "lock" in it
notrace_hash = '*clock*' # except do not trace functions with "clock"
The intersect of '*time*' functions with '*clock*' functions could be the
empty set. That means the main ops will be tracing all the '*sched*' and
'*lock*' functions, including those with '*time*' and '*clock*' in them!
Instead, modify the algorithm to be a bit simpler and correct.
First, when adding a new subops, even if it's the first one, do not add
the notrace_hash if the filter_hash is not empty. Instead, just add the
functions that are in the filter_hash of the subops but not in the
notrace_hash of the subops into the main ops filter_hash. There's no
reason to add anything to the main ops notrace_hash.
The notrace_hash of the main ops should be non-empty iff all subops
filter_hashes are empty (meaning to trace all functions) and all subops
notrace_hashes share some functions.
That is, the main ops notrace_hash is empty if any subops filter_hash is
non empty.
The main ops notrace_hash only has content in it if all subops
filter_hashes are empty, and the content is only the functions in the
intersection of all the subops notrace_hashes. If any subops notrace_hash
is empty, then so is the main ops notrace_hash.
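The difference between the old and new merge can be sketched with a toy
model in which each function is one bit of a mask. This is not the
ftrace code: struct toy_ops, toy_merge_old() and toy_merge_new() are
hypothetical, and an explicit trace_all flag stands in for an empty
filter_hash. How the toy handles a mix of empty and non-empty
filter_hashes (a plain union) is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ALL_FUNCS 0xffu	/* toy universe of 8 functions */

struct toy_ops {
	bool     trace_all;	/* stands in for an empty filter_hash */
	uint32_t filter;	/* used only when !trace_all */
	uint32_t notrace;	/* 0 means "exclude nothing" */
};

/* Functions an ops really traces: in the filter but not in notrace. */
static uint32_t toy_traced(const struct toy_ops *ops)
{
	uint32_t wanted = ops->trace_all ? TOY_ALL_FUNCS : ops->filter;

	return wanted & ~ops->notrace;
}

/* Old logic: union of the filters, intersection of the notraces. */
static struct toy_ops toy_merge_old(const struct toy_ops *subs, size_t n)
{
	struct toy_ops main_ops = { false, 0, TOY_ALL_FUNCS };
	size_t i;

	assert(subs || n == 0);
	for (i = 0; i < n; i++) {
		if (subs[i].trace_all)
			main_ops.trace_all = true;
		main_ops.filter |= subs[i].filter;
		main_ops.notrace &= subs[i].notrace;
	}
	return main_ops;
}

/*
 * New logic: add only what each subops really traces to the main
 * filter. The main notrace is used only when every subops traces all
 * functions, and is then the intersection of the subops notraces.
 */
static struct toy_ops toy_merge_new(const struct toy_ops *subs, size_t n)
{
	struct toy_ops main_ops = { true, 0, TOY_ALL_FUNCS };
	bool all_filters_empty = true;
	size_t i;

	assert(subs || n == 0);
	for (i = 0; i < n; i++) {
		if (!subs[i].trace_all)
			all_filters_empty = false;
		main_ops.filter |= toy_traced(&subs[i]);
		main_ops.notrace &= subs[i].notrace;
	}
	if (n == 0 || !all_filters_empty) {
		main_ops.trace_all = false;
		main_ops.notrace = 0;
	} else {
		main_ops.filter = 0;
	}
	return main_ops;
}
```

Modelling the two subops from the example above as filter 0x3 with
notrace 0x2 and filter 0xc with notrace 0x8, the old merge ends up
tracing 0xf, which includes two functions neither subops asked for,
while the new merge traces only 0x5.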
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Andy Chiu <andybnac@gmail.com>
Link: https://lore.kernel.org/20250409152720.216356767@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The global notrace hash should be jointly decided by the intersection of
each subops's notrace hash, but not the filter hash.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250408160258.48563-1-andybnac@gmail.com
Fixes: 5fccc7552c ("ftrace: Add subops logic to allow one ops to manage many")
Signed-off-by: Andy Chiu <andybnac@gmail.com>
[ fixed removing of freeing of filter_hash ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When no audit rules are in place, AUDIT_ANOM_{LINK,CREAT} events
reported in audit_log_path_denied() are unconditionally dropped due to
an explicit check for the existence of any audit rules. Given this is a
report of a security violation, allow it to be recorded regardless of
the existence of any audit rules.
To test,
mkdir -p /root/tmp
chmod 1777 /root/tmp
touch /root/tmp/test.txt
useradd test
chown test /root/tmp/test.txt
{ echo C0644 12 test.txt; printf 'hello\ntest1\n'; printf \\000;} | \
scp -t /root/tmp
Check with
ausearch -m ANOM_CREAT -ts recent
Link: https://issues.redhat.com/browse/RHEL-9065
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
audit_log_vformat() takes a printf()-style format, and GCC
(Debian 14.2.0-17) is not happy about this:
kernel/audit.c:1978:9: error: function ‘audit_log_vformat’
might be a candidate for ‘gnu_printf’ format attribute
kernel/audit.c:1987:17: error: function ‘audit_log_vformat’
might be a candidate for ‘gnu_printf’ format attribute
Fix the compilation errors (`make W=1` when CONFIG_WERROR=y, which is
default) by adding __printf() attribute.
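For illustration, a userspace sketch of the same kind of annotation.
PRINTF_LIKE() is a hypothetical stand-in for the kernel's __printf()
macro, and toy_vformat()/toy_format() are made-up helpers; the argument
indices below apply to these helpers only, not to audit_log_vformat().

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the kernel's __printf(a, b) macro (GCC/Clang only). */
#define PRINTF_LIKE(fmt_idx, first_arg) \
	__attribute__((format(printf, fmt_idx, first_arg)))

/*
 * va_list variant: the format string is argument 3. The second index
 * is 0 because there are no variadic arguments for the compiler to
 * check here, only the format string itself.
 */
static PRINTF_LIKE(3, 0)
int toy_vformat(char *buf, size_t len, const char *fmt, va_list args)
{
	assert(buf && fmt);
	return vsnprintf(buf, len, fmt, args);
}

/* Variadic wrapper: format is argument 3, checked args start at 4. */
static PRINTF_LIKE(3, 4)
int toy_format(char *buf, size_t len, const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = toy_vformat(buf, len, fmt, args);
	va_end(args);
	return ret;
}
```

With the attribute in place, the compiler no longer suggests it for the
va_list helper, and it can check the format of calls to the variadic
wrapper against their arguments.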
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[PM: commit description line wrap fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>