Commit Graph

4774 Commits

Author SHA1 Message Date
Tejun Heo
43e33f53e2 sched/core: Reorganize cgroup bandwidth control interface file reads
- Update tg_get_cfs_*() to return u64 values. These are now used as the low
  level accessors to the fair's bandwidth configuration parameters.
  Translation to usecs takes place in these functions.

- Add tg_bandwidth() which reads all three bandwidth parameters using
  tg_get_cfs_*().

- Reimplement cgroup interface read functions using tg_bandwidth(). Drop cfs
  from the function names.

This is to prepare for adding bandwidth control support to sched_ext.
tg_bandwidth() will be used as the muxing point similar to tg_weight(). No
functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250614012346.2358261-4-tj@kernel.org
2025-06-18 13:59:57 +02:00
Tejun Heo
de4c80c696 sched/core: Relocate tg_get_cfs_*() and cpu_cfs_*_read_*()
Collect the getters, relocate the trivial interface file wrappers, and put
all of them in period, quota, burst order to prepare for future changes.
Pure reordering. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250614012346.2358261-3-tj@kernel.org
2025-06-18 13:59:57 +02:00
Tejun Heo
d403a3689a sched/fair: Move max_cfs_quota_period decl and default_cfs_period() def from fair.c to sched.h
max_cfs_quota_period is defined in core.c but has a declaration in fair.c.
Move the declaration to kernel/sched/sched.h. Also, move
default_cfs_period() from fair.c to sched.h. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250614012346.2358261-2-tj@kernel.org
2025-06-18 13:59:56 +02:00
Tejun Heo
33796b9187 sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()
During task_group creation, sched_create_group() calls
scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext
portion. This is premature and ends up calling ops.cgroup_set_weight() with
an incorrect @cgrp before ops.cgroup_init() is called.

sched_create_group() should just initialize SCX related fields in the new
task_group. Fix it by factoring out scx_tg_init() from sched_init() and
making sched_create_group() call that function instead of
scx_group_set_weight().

v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads
    to build failures on !CONFIG_GROUP_SCHED configs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
2025-06-17 08:19:55 -10:00
Tejun Heo
c50784e99f sched_ext: Make scx_group_set_weight() always update tg->scx.weight
Otherwise, tg->scx.weight can go out of sync while scx_cgroup is not enabled
and ops.cgroup_init() may be called with a stale weight value.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 8195136669 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
2025-06-17 08:19:43 -10:00
Cheng-Yang Chou
f479fee382 sched_ext: Return NULL in llc_span
Use NULL instead of 0 to signal no LLC domain, matching numa_span() and
the function comment.

No functional change.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-16 07:32:37 -10:00
Cheng-Yang Chou
545b343015 sched_ext: Always use SMP versions in kernel/sched/ext_idle.h
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.

tj: Updated subject for clarity.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-13 14:47:59 -10:00
Cheng-Yang Chou
8834ace4a8 sched_ext: Always use SMP versions in kernel/sched/ext_idle.c
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.

tj: Updated subject for clarity. Fixed stray #else block which wasn't
    removed causing build failure.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-13 14:47:52 -10:00
Cheng-Yang Chou
6a1cda143c sched_ext: Always use SMP versions in kernel/sched/ext.h
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.

tj: Updated subject for clarity. Replace #if defined() with #ifdef.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-13 14:30:58 -10:00
Cheng-Yang Chou
165af41516 sched_ext: Always use SMP versions in kernel/sched/ext.c
Simplify the scheduler by making formerly SMP-only primitives and data
structures unconditional.

tj: Updated subject for clarity.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-13 14:29:39 -10:00
Tejun Heo
9ec5e0be0e Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.17 2025-06-13 14:21:47 -10:00
Ingo Molnar
dabe1be4e8 sched/smp: Use the SMP version of double_rq_clock_clear_update()
Simplify the scheduler by making CONFIG_SMP=y code in
double_rq_clock_clear_update() unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-44-mingo@kernel.org
2025-06-13 08:47:23 +02:00
Ingo Molnar
fc75ac3c91 sched/smp: Use the SMP version of add_nr_running()
Simplify the scheduler by making CONFIG_SMP=y code in
add_nr_running() unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-43-mingo@kernel.org
2025-06-13 08:47:23 +02:00
Ingo Molnar
241c307b05 sched/smp: Use the SMP version of ENQUEUE_MIGRATED
Simplify the scheduler by making the CONFIG_SMP-only ENQUEUE_MIGRATED
flag unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-42-mingo@kernel.org
2025-06-13 08:47:23 +02:00
Ingo Molnar
0203244600 sched/smp: Use the SMP version of WF_ and SD_ flag sanity checks
Simplify the scheduler by making CONFIG_SMP=y asserts related
to WF_ and SD_ flags unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-41-mingo@kernel.org
2025-06-13 08:47:23 +02:00
Ingo Molnar
ea100b31ee sched/smp: Use the SMP version of task_on_cpu()
Simplify the scheduler by making CONFIG_SMP=y code in task_on_cpu()
unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-40-mingo@kernel.org
2025-06-13 08:47:22 +02:00
Ingo Molnar
703b8e8545 sched/smp: Use the SMP version of rq_pin_lock()
Simplify the scheduler by making a CONFIG_SMP-only warning
in rq_pin_lock() unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-39-mingo@kernel.org
2025-06-13 08:47:22 +02:00
Ingo Molnar
9fd5da7989 sched/smp: Use the SMP version of is_migration_disabled()
Simplify the scheduler by making the CONFIG_SMP-only code in
is_migration_disabled() unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-38-mingo@kernel.org
2025-06-13 08:47:22 +02:00
Ingo Molnar
1724088119 sched/smp: Use the SMP version of cpu_of()
Simplify the scheduler by making CONFIG_SMP=y code in cpu_of()
unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-37-mingo@kernel.org
2025-06-13 08:47:22 +02:00
Ingo Molnar
caf5bde9c5 sched/smp: Use the SMP version of the stop-CPU scheduling class
Simplify the scheduler by making CONFIG_SMP=y code in the stop-CPU
scheduling class unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-36-mingo@kernel.org
2025-06-13 08:47:21 +02:00
Ingo Molnar
482c4dae75 sched/smp: Use the SMP version of the idle scheduling class
Simplify the scheduler by making CONFIG_SMP=y code in the
idle scheduling classunconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-35-mingo@kernel.org
2025-06-13 08:47:21 +02:00
Ingo Molnar
6c8d251621 sched/smp: Use the SMP version of sched_update_asym_prefer_cpu()
Simplify the scheduler by making CONFIG_SMP=y code in
sched_update_asym_prefer_cpu() unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-34-mingo@kernel.org
2025-06-13 08:47:21 +02:00
Ingo Molnar
8a9246ddc1 sched/smp: Use the SMP version of the scheduler syscalls
Simplify the scheduler by making CONFIG_SMP=y code in
idle_cpu(), __sched_setscheduler() and sched_setaffinity()
unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-33-mingo@kernel.org
2025-06-13 08:47:21 +02:00
Ingo Molnar
9d9af2372f sched/smp: Use the SMP version of schedstats
Simplify the scheduler by making CONFIG_SMP=y schedstats
debugging output unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-32-mingo@kernel.org
2025-06-13 08:47:21 +02:00
Ingo Molnar
02fb885ebd sched/smp: Use the SMP version of scheduler debugging data
Simplify the scheduler by making CONFIG_SMP=y debug output
unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-31-mingo@kernel.org
2025-06-13 08:47:20 +02:00
Ingo Molnar
6324dce8f6 sched/smp: Use the SMP version of the deadline scheduling class
Simplify the scheduler by making CONFIG_SMP=y code
in prio_changed_dl() unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-30-mingo@kernel.org
2025-06-13 08:47:20 +02:00
Ingo Molnar
15125a229a sched/smp: Use the SMP version of the RT scheduling class
Simplify the scheduler by making CONFIG_SMP=y primitives and data
structures unconditional in the RT policies scheduler.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-29-mingo@kernel.org
2025-06-13 08:47:20 +02:00
Ingo Molnar
74063c1755 sched/smp: Use the SMP version of idle_thread_set_boot_cpu()
Simplify the scheduler by making the CONFIG_SMP=y version of
idle_thread_set_boot_cpu() unconditional.

Note that idle_thread_set_boot_cpu() is already conditional
on CONFIG_GENERIC_SMP_IDLE_THREAD, which most architectures
select unconditionally on both UP and SMP kernels.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-28-mingo@kernel.org
2025-06-13 08:47:20 +02:00
Ingo Molnar
1f25730e5a sched/smp: Use the SMP version of sched_exec()
Simplify the scheduler making CONFIG_SMP=y sched_exec()
code unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-27-mingo@kernel.org
2025-06-13 08:47:20 +02:00
Ingo Molnar
588467616c sched/smp: Use the SMP version of wake_up_new_task()
Simplify the scheduler by making CONFIG_SMP=y code in wake_up_new_task()
unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-26-mingo@kernel.org
2025-06-13 08:47:19 +02:00
Ingo Molnar
8039addbe5 sched/smp: Use the SMP version of __task_needs_rq_lock()
Simplify the scheduler by making CONFIG_SMP=y code in
__task_needs_rq_lock() unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-25-mingo@kernel.org
2025-06-13 08:47:19 +02:00
Ingo Molnar
d0a0a055a5 sched/smp: Use the SMP version of try_to_wake_up()
Simplify the scheduler by making CONFIG_SMP=y logic within
try_to_wake_up() unconditional.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-24-mingo@kernel.org
2025-06-13 08:47:19 +02:00
Ingo Molnar
a1416303d1 sched/smp: Always define rq->hrtick_csd
Simplify the scheduler by making CONFIG_SMP=y data structure
of rq->hrtick_csd unconditional.

Adjust hrtick_start() accordingly, which was split due to the
::hrtick_csd asymmetry and use the SMP version there too.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-23-mingo@kernel.org
2025-06-13 08:47:19 +02:00
Ingo Molnar
cac5cefbad sched/smp: Make SMP unconditional
Simplify the scheduler by making CONFIG_SMP=y primitives and data
structures unconditional.

Introduce transitory wrappers for functionality not yet converted to SMP.

Note that this patch is pretty large, because there's no clear separation
between various aspects of the SMP scheduler, it's basically a huge block
of #ifdef CONFIG_SMP. A fair amount of it has to be switched on for it to
boot and work on UP systems.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-21-mingo@kernel.org
2025-06-13 08:47:18 +02:00
Ingo Molnar
5202c25dd1 sched/smp: Always define sched_domains_mutex_lock()/unlock(), def_root_domain and sched_domains_mutex
Simplify the scheduler by making CONFIG_SMP=y primitives and data
structures unconditional.

Unconditionally build kernel/sched/topology.c and the main sched-domains
locking primitives.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-20-mingo@kernel.org
2025-06-13 08:47:18 +02:00
Ingo Molnar
f1c6b957f7 sched: Clean up and standardize #if/#else/#endif markers in sched/topology.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-19-mingo@kernel.org
2025-06-13 08:47:18 +02:00
Ingo Molnar
23d27e2cfb sched: Clean up and standardize #if/#else/#endif markers in sched/syscalls.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-18-mingo@kernel.org
2025-06-13 08:47:18 +02:00
Ingo Molnar
91433cd6e4 sched: Clean up and standardize #if/#else/#endif markers in sched/stats.[ch]
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-17-mingo@kernel.org
2025-06-13 08:47:17 +02:00
Ingo Molnar
fdccd0c792 sched: Clean up and standardize #if/#else/#endif markers in sched/sched.h
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-16-mingo@kernel.org
2025-06-13 08:47:17 +02:00
Ingo Molnar
3eca109a78 sched: Clean up and standardize #if/#else/#endif markers in sched/rt.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-15-mingo@kernel.org
2025-06-13 08:47:17 +02:00
Ingo Molnar
fd3db705f7 sched: Clean up and standardize #if/#else/#endif markers in sched/psi.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-14-mingo@kernel.org
2025-06-13 08:47:17 +02:00
Ingo Molnar
311bb3f7b7 sched: Clean up and standardize #if/#else/#endif markers in sched/pelt.[ch]
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-13-mingo@kernel.org
2025-06-13 08:47:17 +02:00
Ingo Molnar
c215dff7f8 sched: Clean up and standardize #if/#else/#endif markers in sched/loadavg.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-12-mingo@kernel.org
2025-06-13 08:47:16 +02:00
Ingo Molnar
833840a94f sched: Clean up and standardize #if/#else/#endif markers in sched/idle.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-11-mingo@kernel.org
2025-06-13 08:47:16 +02:00
Ingo Molnar
416d5f78e4 sched: Clean up and standardize #if/#else/#endif markers in sched/fair.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-10-mingo@kernel.org
2025-06-13 08:47:16 +02:00
Ingo Molnar
29dd6f8cd2 sched: Clean up and standardize #if/#else/#endif markers in sched/debug.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-9-mingo@kernel.org
2025-06-13 08:47:16 +02:00
Ingo Molnar
c503c3dc2d sched: Clean up and standardize #if/#else/#endif markers in sched/deadline.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

 - Fix whitespace noise and other inconsistencies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-8-mingo@kernel.org
2025-06-13 08:47:16 +02:00
Ingo Molnar
4aec8669ff sched: Clean up and standardize #if/#else/#endif markers in sched/cputime.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-7-mingo@kernel.org
2025-06-13 08:47:15 +02:00
Ingo Molnar
79af17344c sched: Clean up and standardize #if/#else/#endif markers in sched/cpupri.h
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-6-mingo@kernel.org
2025-06-13 08:47:15 +02:00
Ingo Molnar
8bb9b0c5ae sched: Clean up and standardize #if/#else/#endif markers in sched/cpufreq_schedutil.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-5-mingo@kernel.org
2025-06-13 08:47:15 +02:00
Ingo Molnar
b7ebb75856 sched: Clean up and standardize #if/#else/#endif markers in sched/core.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

	#if CONFIG_FOO
	...
	#else /* !CONFIG_FOO: */
	...
	#endif /* !CONFIG_FOO */

 - Apply this simplification:

	-#if defined(CONFIG_FOO)
	+#ifdef CONFIG_FOO

 - Fix whitespace noise.

 - Use vertical alignment to better visualize nested #ifdef blocks,
   where appropriate:

	#ifdef CONFIG_FOO
	# ifdef CONFIG_BAR
	...
	# endif
	#endif

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-4-mingo@kernel.org
2025-06-13 08:47:15 +02:00
Ingo Molnar
bbb1b274e8 sched: Clean up and standardize #if/#else/#endif markers in sched/clock.c
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-3-mingo@kernel.org
2025-06-13 08:47:14 +02:00
Ingo Molnar
69ab14ee52 sched: Clean up and standardize #if/#else/#endif markers in sched/autogroup.[ch]
- Use the standard #ifdef marker format for larger blocks,
   where appropriate:

        #if CONFIG_FOO
        ...
        #else /* !CONFIG_FOO: */
        ...
        #endif /* !CONFIG_FOO */

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250528080924.2273858-2-mingo@kernel.org
2025-06-13 08:47:14 +02:00
wang wei
b01f2d9597 sched/eevdf: Correct the comment in place_entity
Correct "l" to "vl_i".

Signed-off-by: wang wei <a929244872@163.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250605152931.22804-1-a929244872@163.com
2025-06-11 11:20:53 +02:00
Peter Zijlstra
3d7e10188a sched: Make clangd usable
Due to the weird Makefile setup of sched the various files do not
compile as stand alone units. The new generation of editors are trying
to do just this -- mostly to offer fancy things like completions but
also better syntax highlighting and code navigation.

Specifically, I've been playing around with neovim and clangd.

Setting up clangd on the kernel source is a giant pain in the arse
(this really should be improved), but once you do manage, you run into
dumb stuff like the above.

Fix up the scheduler files to at least pretend to work.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20250523164348.GN39944@noisy.programming.kicks-ass.net
2025-06-11 11:20:53 +02:00
Andrea Righi
086ed90a64 sched_ext: Make scx_locked_rq() inline
scx_locked_rq() is used both from ext.c and ext_idle.c, move it to ext.h
as a static inline function.

No functional changes.

v2: Rename locked_rq to scx_locked_rq_state, expose it and make
    scx_locked_rq() inline, as suggested by Tejun.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09 06:25:35 -10:00
Andrea Righi
e212743bd7 sched_ext: Make scx_rq_bypassing() inline
scx_rq_bypassing() is used both from ext.c and ext_idle.c, move it to
ext.h as a static inline function.

No functional changes.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09 06:25:24 -10:00
Andrea Righi
353656eb84 sched_ext: idle: Make local functions static in ext_idle.c
Functions that are only used within ext_idle.c can be marked as static
to limit their scope.

No functional changes.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09 06:25:14 -10:00
Andrea Righi
c68ea8243c sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()
There's no need to make scx_bpf_cpu_node() dependent on CONFIG_NUMA,
since cpu_to_node() can be used also in systems with CONFIG_NUMA
disabled.

This also allows to always validate the @cpu argument regardless of the
CONFIG_NUMA settings.

Fixes: 01059219b0 ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09 06:25:02 -10:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
16b70698aa sched_ext: Fixes for v6.16-rc1
Two fixes in the built-in idle selection helpers.
 
 - Fix prev_cpu handling to guarantee that idle selection never returns a CPU
   that's not allowed.
 
 - Skip cross-node search when !NUMA which could lead to infinite looping due
   to a bug in NUMA iterator.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaECWBw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGfV9AP4mk63TUsLdqwa6iqE0dZArOAhj8QXETsv1Q0+y
 FGQKEQEAxPNJx3ifhzgXoQi1SOZkqeq44erbcy0x7owtb+QBVgs=
 =G1uN
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:
 "Two fixes in the built-in idle selection helpers:

   - Fix prev_cpu handling to guarantee that idle selection never
     returns a CPU that's not allowed

   - Skip cross-node search when !NUMA which could lead to infinite
     looping due to a bug in NUMA iterator"

* tag 'sched_ext-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: idle: Skip cross-node search with !CONFIG_NUMA
  sched_ext: idle: Properly handle invalid prev_cpu during idle selection
2025-06-04 12:07:16 -07:00
Andrea Righi
9960be72a5 sched_ext: idle: Skip cross-node search with !CONFIG_NUMA
In the idle CPU selection logic, attempting cross-node searches adds
unnecessary complexity when CONFIG_NUMA is disabled.

Since there's no meaningful concept of nodes in this case, simplify the
logic by restricting the idle CPU search to the current node only.

Fixes: 48849271e6 ("sched_ext: idle: Per-node idle cpumasks")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-03 08:22:27 -10:00
Linus Torvalds
fd1f847350 - The 2 patch series "zram: support algorithm-specific parameters" from
Sergey Senozhatsky adds infrastructure for passing algorithm-specific
   parameters into zram.  A single parameter `winbits' is implemented at
   this time.
 
 - The 5 patch series "memcg: nmi-safe kmem charging" from Shakeel Butt
   makes memcg charging nmi-safe, which is required by BFP, which can
   operate in NMI context.
 
 - The 5 patch series "Some random fixes and cleanup to shmem" from
   Kemeng Shi implements small fixes and cleanups in the shmem code.
 
 - The 2 patch series "Skip mm selftests instead when kernel features are
   not present" from Zi Yan fixes some issues in the MM selftest code.
 
 - The 2 patch series "mm/damon: build-enable essential DAMON components
   by default" from SeongJae Park reworks DAMON Kconfig to make it easier
   to enable CONFIG_DAMON.
 
 - The 2 patch series "sched/numa: add statistics of numa balance task
   migration" from Libo Chen adds more info into sysfs and procfs files to
   improve visibility into the NUMA balancer's task migration activity.
 
 - The 4 patch series "selftests/mm: cow and gup_longterm cleanups" from
   Mark Brown provides various updates to some of the MM selftests to make
   them play better with the overall containing framework.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDzA9wAKCRDdBJ7gKXxA
 js8sAP9V3COg+vzTmimzP3ocTkkbbIJzDfM6nXpE2EQ4BR3ejwD+NsIT2ZLtTF6O
 LqAZpgO7ju6wMjR/lM30ebCq5qFbZAw=
 =oruw
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - "zram: support algorithm-specific parameters" from Sergey Senozhatsky
   adds infrastructure for passing algorithm-specific parameters into
   zram. A single parameter `winbits' is implemented at this time.

 - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg
   charging nmi-safe, which is required by BFP, which can operate in NMI
   context.

 - "Some random fixes and cleanup to shmem" from Kemeng Shi implements
   small fixes and cleanups in the shmem code.

 - "Skip mm selftests instead when kernel features are not present" from
   Zi Yan fixes some issues in the MM selftest code.

 - "mm/damon: build-enable essential DAMON components by default" from
   SeongJae Park reworks DAMON Kconfig to make it easier to enable
   CONFIG_DAMON.

 - "sched/numa: add statistics of numa balance task migration" from Libo
   Chen adds more info into sysfs and procfs files to improve visibility
   into the NUMA balancer's task migration activity.

 - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown
   provides various updates to some of the MM selftests to make them
   play better with the overall containing framework.

* tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits)
  mm/khugepaged: clean up refcount check using folio_expected_ref_count()
  selftests/mm: fix test result reporting in gup_longterm
  selftests/mm: report unique test names for each cow test
  selftests/mm: add helper for logging test start and results
  selftests/mm: use standard ksft_finished() in cow and gup_longterm
  selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled
  sched/numa: add statistics of numa balance task
  sched/numa: fix task swap by skipping kernel threads
  tools/testing: check correct variable in open_procmap()
  tools/testing/vma: add missing function stub
  mm/gup: update comment explaining why gup_fast() disables IRQs
  selftests/mm: two fixes for the pfnmap test
  mm/khugepaged: fix race with folio split/free using temporary reference
  mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
  mmu_notifiers: remove leftover stub macros
  selftests/mm: deduplicate test names in madv_populate
  kcov: rust: add flags for KCOV with Rust
  mm: rust: make CONFIG_MMU ifdefs more narrow
  mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
  mm/damon/Kconfig: enable CONFIG_DAMON by default
  ...
2025-06-02 16:00:26 -07:00
Chen Yu
ad6b26b6a0 sched/numa: add statistics of numa balance task
On systems with NUMA balancing enabled, it has been found that tracking
task activities resulting from NUMA balancing is beneficial.  NUMA
balancing employs two mechanisms for task migration: one is to migrate
a task to an idle CPU within its preferred node, and the other is to
swap tasks located on different nodes when they are on each other's
preferred nodes.

The kernel already provides NUMA page migration statistics in
/sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.  However, it
lacks statistics regarding task migration and swapping.  Therefore,
relevant counts for task migration and swapping should be added.

The following two new fields:

numa_task_migrated
numa_task_swapped

will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched
and /proc/vmstat.

Introducing both per-task and per-memory cgroup (memcg) NUMA balancing
statistics facilitates a rapid evaluation of the performance and
resource utilization of the target workload.  For instance, users can
first identify the container with high NUMA balancing activity and then
further pinpoint a specific task within that group, and subsequently
adjust the memory policy for that task.  In short, although it is
possible to iterate through /proc/$pid/sched to locate the problematic
task, the introduction of aggregated NUMA balancing activity for tasks
within each memcg can assist users in identifying the task more
efficiently through a divide-and-conquer approach.

As Libo Chen pointed out, the memcg event relies on the text names in
vmstat_text, and /proc/vmstat generates corresponding items based on
vmstat_text.  Thus, the relevant task migration and swapping events
introduced in vmstat_text also need to be populated by
count_vm_numa_event(), otherwise these values are zero in /proc/vmstat.

In theory, task migration and swap events are part of the scheduler's
activities.  The reason for exposing them through the
memory.stat/vmstat interface is that we already have NUMA balancing
statistics in memory.stat/vmstat, and these events are closely related
to each other.  Following Shakeel's suggestion, we describe the
end-to-end flow/story of all these events occurring on a timeline for
future reference:

The goal of NUMA balancing is to co-locate a task and its memory pages
on the same NUMA node.  There are two strategies: migrate the pages to
the task's node, or migrate the task to the node where its pages
reside.

Suppose a task p1 is running on Node 0, but its pages are located on
Node 1.  NUMA page fault statistics for p1 reveal its "page footprint"
across nodes.  If NUMA balancing detects that most of p1's pages are on
Node 1:

1.Page Migration Attempt:
The Numa balance first tries to migrate p1's pages to Node 0.
The numa_page_migrate counter increments.

2.Task Migration Strategies:
After the page migration finishes, Numa balance checks every
1 second to see if p1 can be migrated to Node 1.

Case 2.1: Idle CPU Available

  If Node 1 has an idle CPU, p1 is directly scheduled there.  This
  event is logged as numa_task_migrated.

Case 2.2: No Idle CPU (Task Swap)

  If all CPUs on Node1 are busy, direct migration could cause CPU
  contention or load imbalance.  Instead: The Numa balance selects a
  candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its own
  page footprint).  p1 and p2 are swapped.  This cross-node swap is
  recorded as numa_task_swapped.

Link: https://lkml.kernel.org/r/d00edb12ba0f0de3c5222f61487e65f2ac58f5b1.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/7ef90a88602ed536be46eba7152ed0d33bad5790.1748002400.git.yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: Ayush Jain <Ayush.jain3@amd.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:15 -07:00
Libo Chen
9709eb0f84 sched/numa: fix task swap by skipping kernel threads
Patch series "sched/numa: add statistics of numa balance task migration",
v6.

Introduce task migration and swap statistics in the following places:
/sys/fs/cgroup/{GROUP}/memory.stat
/proc/{PID}/sched
/proc/vmstat

These statistics facilitate a rapid evaluation of the performance and
resource utilization of the target workload.


This patch (of 2):

Task swapping is triggered when there are no idle CPUs in task A's
preferred node.  In this case, the NUMA load balancer chooses a task B
on A's preferred node and swaps B with A.  This helps improve NUMA
locality without introducing load imbalance between nodes.  In the
current implementation, B's NUMA node preference is not mandatory. 
That is to say, a kernel thread might be incorrectly chosen as B. 
However, kernel thread and user space thread that does not have mm are
not supposed to be covered by NUMA balancing because NUMA balancing
only considers user pages via VMAs.

According to Peter's suggestion for fixing this issue, we use
PF_KTHREAD to skip the kernel thread.  curr->mm is also checked because
it is possible that user_mode_thread() might create a user thread
without an mm.  As per Prateek's analysis, after adding the PF_KTHREAD
check, there is no need to further check the PF_IDLE flag:

: - play_idle_precise() already ensures PF_KTHREAD is set before adding
:   PF_IDLE
: 
: - cpu_startup_entry() is only called from the startup thread which
:   should be marked with PF_KTHREAD (based on my understanding looking at
:   commit cff9b2332a ("kernel/sched: Modify initial boot task idle
:   setup"))

In summary, the check in task_numa_compare() now aligns with
task_tick_numa().

Link: https://lkml.kernel.org/r/cover.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/43d68b356b25d124f0d222ebedf3859e86eefb9f.1748493462.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/cover.1748002400.git.yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/eaacc9c9bd37bac92d43a671867d85b2fdad3b06.1748002400.git.yu.c.chen@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Tested-by: Ayush Jain <Ayush.jain3@amd.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Aubrey Li <aubrey.li@intel.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:15 -07:00
Linus Torvalds
00c010e130 - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox
simplifies the act of creating a pte which addresses the first page in a
   folio and reduces the amount of plumbing which architecture must
   implement to provide this.
 
 - The 8 patch series "Misc folio patches for 6.16" from Matthew Wilcox
   is a shower of largely unrelated folio infrastructure changes which
   clean things up and better prepare us for future work.
 
 - The 3 patch series "memory,x86,acpi: hotplug memory alignment
   advisement" from Gregory Price adds early-init code to prevent x86 from
   leaving physical memory unused when physical address regions are not
   aligned to memory block size.
 
 - The 2 patch series "mm/compaction: allow more aggressive proactive
   compaction" from Michal Clapinski provides some tuning of the (sadly,
   hard-coded (more sadly, not auto-tuned)) thresholds for our invokation
   of proactive compaction.  In a simple test case, the reduction of a guest
   VM's memory consumption was dramatic.
 
 - The 8 patch series "Minor cleanups and improvements to swap freeing
   code" from Kemeng Shi provides some code cleaups and a small efficiency
   improvement to this part of our swap handling code.
 
 - The 6 patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API"
   from Dmitry Levin adds the ability for a ptracer to modify syscalls
   arguments.  At this time we can alter only "system call information that
   are used by strace system call tampering, namely, syscall number,
   syscall arguments, and syscall return value.
 
   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.
 
 - The 3 patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report
   guard regions" from Andrei Vagin extends the info returned by the
   PAGEMAP_SCAN ioctl against /proc/pid/pagemap.  This permits CRIU to more
   efficiently get at the info about guard regions.
 
 - The 2 patch series "Fix parameter passed to page_mapcount_is_type()"
   from Gavin Shan implements that fix.  No runtime effect is expected
   because validate_page_before_insert() happens to fix up this error.
 
 - The 3 patch series "kernel/events/uprobes: uprobe_write_opcode()
   rewrite" from David Hildenbrand basically brings uprobe text poking into
   the current decade.  Remove a bunch of hand-rolled implementation in
   favor of using more current facilities.
 
 - The 3 patch series "mm/ptdump: Drop assumption that pxd_val() is u64"
   from Anshuman Khandual provides enhancements and generalizations to the
   pte dumping code.  This might be needed when 128-bit Page Table
   Descriptors are enabled for ARM.
 
 - The 12 patch series "Always call constructor for kernel page tables"
   from Kevin Brodsky "ensures that the ctor/dtor is always called for
   kernel pgtables, as it already is for user pgtables".  This permits the
   addition of more functionality such as "insert hooks to protect page
   tables".  This change does result in various architectures performing
   unnecesary work, but this is fixed up where it is anticipated to occur.
 
 - The 9 patch series "Rust support for mm_struct, vm_area_struct, and
   mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM
   structures.
 
 - The 3 patch series "fix incorrectly disallowed anonymous VMA merges"
   from Lorenzo Stoakes takes advantage of some VMA merging opportunities
   which we've been missing for 15 years.
 
 - The 4 patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED
   and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB
   flushing.  Instead of flushing each address range in the provided iovec,
   we batch the flushing across all the iovec entries.  The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.
 
 - The 6 patch series "Track node vacancy to reduce worst case allocation
   counts" from Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.  stress-ng mmap performance increased by single-digit
   percentages and the amount of unnecessarily preallocated memory was
   dramaticelly reduced.
 
 - The 3 patch series "mm/gup: Minor fix, cleanup and improvements" from
   Baoquan He removes a few unnecessary things which Baoquan noted when
   reading the code.
 
 - The 3 patch series ""Enhance sysfs handling for memory hotplug in
   weighted interleave" from Rakie Kim "enhances the weighted interleave
   policy in the memory management subsystem by improving sysfs handling,
   fixing memory leaks, and introducing dynamic sysfs updates for memory
   hotplug support".  Fixes things on error paths which we are unlikely to
   hit.
 
 - The 7 patch series "mm/damon: auto-tune DAMOS for NUMA setups
   including tiered memory" from SeongJae Park introduces new DAMOS quota
   goal metrics which eliminate the manual tuning which is required when
   utilizing DAMON for memory tiering.
 
 - The 5 patch series "mm/vmalloc.c: code cleanup and improvements" from
   Baoquan He provides cleanups and small efficiency improvements which
   Baoquan found via code inspection.
 
 - The 2 patch series "vmscan: enforce mems_effective during demotion"
   from Gregory Price "changes reclaim to respect cpuset.mems_effective
   during demotion when possible".  because "presently, reclaim explicitly
   ignores cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated." "This is useful for isolating workloads on a
   multi-tenant system from certain classes of memory more consistently."
 
 - The 2 patch series ""Clean up split_huge_pmd_locked() and remove
   unnecessary folio pointers" from Gavin Guo provides minor cleanups and
   efficiency gains in in the huge page splitting and migrating code.
 
 - The 3 patch series "Use kmem_cache for memcg alloc" from Huan Yang
   creates a slab cache for `struct mem_cgroup', yielding improved memory
   utilization.
 
 - The 4 patch series "add max arg to swappiness in memory.reclaim and
   lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness="
   argument for memory.reclaim MGLRU's lru_gen.  This directs proactive
   reclaim to reclaim from only anon folios rather than file-backed folios.
 
 - The 17 patch series "kexec: introduce Kexec HandOver (KHO)" from Mike
   Rapoport is the first step on the path to permitting the kernel to
   maintain existing VMs while replacing the host kernel via file-based
   kexec.  At this time only memblock's reserve_mem is preserved.
 
 - The 7 patch series "mm: Introduce for_each_valid_pfn()" from David
   Woodhouse provides and uses a smarter way of looping over a pfn range.
   By skipping ranges of invalid pfns.
 
 - The 2 patch series "sched/numa: Skip VMA scanning on memory pinned to
   one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless
   VMA scanning when a task is pinned a single NUMA mode.  Dramatic
   performance benefits were seen in some real world cases.
 
 - The 2 patch series "JFS: Implement migrate_folio for
   jfs_metapage_aops" from Shivank Garg addresses a warning which occurs
   during memory compaction when using JFS.
 
 - The 4 patch series "move all VMA allocation, freeing and duplication
   logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c
   into the more appropriate mm/vma.c.
 
 - The 6 patch series "mm, swap: clean up swap cache mapping helper" from
   Kairui Song provides code consolidation and cleanups related to the
   folio_index() function.
 
 - The 2 patch series "mm/gup: Cleanup memfd_pin_folios()" from Vishal
   Moola does that.
 
 - The 8 patch series "memcg: Fix test_memcg_min/low test failures" from
   Waiman Long addresses some bogus failures which are being reported by
   the test_memcontrol selftest.
 
 - The 3 patch series "eliminate mmap() retry merge, add .mmap_prepare
   hook" from Lorenzo Stoakes commences the deprecation of
   file_operations.mmap() in favor of the new
   file_operations.mmap_prepare().  The latter is more restrictive and
   prevents drivers from messing with things in ways which, amongst other
   problems, may defeat VMA merging.
 
 - The 4 patch series "memcg: decouple memcg and objcg stocks"" from
   Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's
   one.  This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.
 
 - The 6 patch series "mm/damon: minor fixups and improvements for code,
   tests, and documents" from SeongJae Park is "yet another batch of
   miscellaneous DAMON changes.  Fix and improve minor problems in code,
   tests and documents."
 
 - The 7 patch series "memcg: make memcg stats irq safe" from Shakeel
   Butt converts memcg stats to be irq safe.  Another step along the way to
   making memcg charging and stats updates NMI-safe, a BPF requirement.
 
 - The 4 patch series "Let unmap_hugepage_range() and several related
   functions take folio instead of page" from Fan Ni provides folio
   conversions in the hugetlb code.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDt5qgAKCRDdBJ7gKXxA
 ju6XAP9nTiSfRz8Cz1n5LJZpFKEGzLpSihCYyR6P3o1L9oe3mwEAlZ5+XAwk2I5x
 Qqb/UGMEpilyre1PayQqOnct3aSL9Ao=
 =tYYm
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
   creating a pte which addresses the first page in a folio and reduces
   the amount of plumbing which architecture must implement to provide
   this.

 - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
   largely unrelated folio infrastructure changes which clean things up
   and better prepare us for future work.

 - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
   Price adds early-init code to prevent x86 from leaving physical
   memory unused when physical address regions are not aligned to memory
   block size.

 - "mm/compaction: allow more aggressive proactive compaction" from
   Michal Clapinski provides some tuning of the (sadly, hard-coded (more
   sadly, not auto-tuned)) thresholds for our invokation of proactive
   compaction. In a simple test case, the reduction of a guest VM's
   memory consumption was dramatic.

 - "Minor cleanups and improvements to swap freeing code" from Kemeng
   Shi provides some code cleaups and a small efficiency improvement to
   this part of our swap handling code.

 - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
   adds the ability for a ptracer to modify syscalls arguments. At this
   time we can alter only "system call information that are used by
   strace system call tampering, namely, syscall number, syscall
   arguments, and syscall return value.

   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.

 - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
   Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
   against /proc/pid/pagemap. This permits CRIU to more efficiently get
   at the info about guard regions.

 - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
   implements that fix. No runtime effect is expected because
   validate_page_before_insert() happens to fix up this error.

 - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
   Hildenbrand basically brings uprobe text poking into the current
   decade. Remove a bunch of hand-rolled implementation in favor of
   using more current facilities.

 - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
   Khandual provides enhancements and generalizations to the pte dumping
   code. This might be needed when 128-bit Page Table Descriptors are
   enabled for ARM.

 - "Always call constructor for kernel page tables" from Kevin Brodsky
   ensures that the ctor/dtor is always called for kernel pgtables, as
   it already is for user pgtables.

   This permits the addition of more functionality such as "insert hooks
   to protect page tables". This change does result in various
   architectures performing unnecesary work, but this is fixed up where
   it is anticipated to occur.

 - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
   Ryhl adds plumbing to permit Rust access to core MM structures.

 - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
   Stoakes takes advantage of some VMA merging opportunities which we've
   been missing for 15 years.

 - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
   SeongJae Park optimizes process_madvise()'s TLB flushing.

   Instead of flushing each address range in the provided iovec, we
   batch the flushing across all the iovec entries. The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.

 - "Track node vacancy to reduce worst case allocation counts" from
   Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.

   stress-ng mmap performance increased by single-digit percentages and
   the amount of unnecessarily preallocated memory was dramaticelly
   reduced.

 - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
   a few unnecessary things which Baoquan noted when reading the code.

 - ""Enhance sysfs handling for memory hotplug in weighted interleave"
   from Rakie Kim "enhances the weighted interleave policy in the memory
   management subsystem by improving sysfs handling, fixing memory
   leaks, and introducing dynamic sysfs updates for memory hotplug
   support". Fixes things on error paths which we are unlikely to hit.

 - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
   from SeongJae Park introduces new DAMOS quota goal metrics which
   eliminate the manual tuning which is required when utilizing DAMON
   for memory tiering.

 - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
   provides cleanups and small efficiency improvements which Baoquan
   found via code inspection.

 - "vmscan: enforce mems_effective during demotion" from Gregory Price
   changes reclaim to respect cpuset.mems_effective during demotion when
   possible. because presently, reclaim explicitly ignores
   cpuset.mems_effective when demoting, which may cause the cpuset
   settings to violated.

   This is useful for isolating workloads on a multi-tenant system from
   certain classes of memory more consistently.

 - "Clean up split_huge_pmd_locked() and remove unnecessary folio
   pointers" from Gavin Guo provides minor cleanups and efficiency gains
   in in the huge page splitting and migrating code.

 - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
   for `struct mem_cgroup', yielding improved memory utilization.

 - "add max arg to swappiness in memory.reclaim and lru_gen" from
   Zhongkun He adds a new "max" argument to the "swappiness=" argument
   for memory.reclaim MGLRU's lru_gen.

   This directs proactive reclaim to reclaim from only anon folios
   rather than file-backed folios.

 - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
   first step on the path to permitting the kernel to maintain existing
   VMs while replacing the host kernel via file-based kexec. At this
   time only memblock's reserve_mem is preserved.

 - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
   and uses a smarter way of looping over a pfn range. By skipping
   ranges of invalid pfns.

 - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
   cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
   when a task is pinned a single NUMA mode.

   Dramatic performance benefits were seen in some real world cases.

 - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
   Garg addresses a warning which occurs during memory compaction when
   using JFS.

 - "move all VMA allocation, freeing and duplication logic to mm" from
   Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
   appropriate mm/vma.c.

 - "mm, swap: clean up swap cache mapping helper" from Kairui Song
   provides code consolidation and cleanups related to the folio_index()
   function.

 - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

 - "memcg: Fix test_memcg_min/low test failures" from Waiman Long
   addresses some bogus failures which are being reported by the
   test_memcontrol selftest.

 - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
   Stoakes commences the deprecation of file_operations.mmap() in favor
   of the new file_operations.mmap_prepare().

   The latter is more restrictive and prevents drivers from messing with
   things in ways which, amongst other problems, may defeat VMA merging.

 - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
   the per-cpu memcg charge cache from the objcg's one.

   This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.

 - "mm/damon: minor fixups and improvements for code, tests, and
   documents" from SeongJae Park is yet another batch of miscellaneous
   DAMON changes. Fix and improve minor problems in code, tests and
   documents.

 - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
   stats to be irq safe. Another step along the way to making memcg
   charging and stats updates NMI-safe, a BPF requirement.

 - "Let unmap_hugepage_range() and several related functions take folio
   instead of page" from Fan Ni provides folio conversions in the
   hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
  mm: pcp: increase pcp->free_count threshold to trigger free_high
  mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
  mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: pass folio instead of page to unmap_ref_private()
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: memcg_rstat_updated re-entrant safe against irqs
  mm: khugepaged: decouple SHMEM and file folios' collapse
  selftests/eventfd: correct test name and improve messages
  alloc_tag: check mem_profiling_support in alloc_tag_init
  Docs/damon: update titles and brief introductions to explain DAMOS
  selftests/damon/_damon_sysfs: read tried regions directories in order
  mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
  mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
  mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
  ...
2025-05-31 15:44:16 -07:00
Andrea Righi
ee9a4e9279 sched_ext: idle: Properly handle invalid prev_cpu during idle selection
The default idle selection policy doesn't properly handle the case where
@prev_cpu is not part of the task's allowed CPUs.

In this situation, it may return an idle CPU that is not usable by the
task, breaking the assumption that the returned CPU must always be
within the allowed cpumask, causing inefficiencies or even stalls in
certain cases.

This issue can arise in the following cases:

 - The task's affinity may have changed by the time the function is
   invoked, especially now that the idle selection logic can be used
   from multiple contexts (i.e., BPF test_run call).

 - The BPF scheduler may provide a @prev_cpu that is not part of the
   allowed mask, either unintentionally or as a placement hint. In fact
   @prev_cpu may not necessarily refer to the CPU the task last ran on,
   but it can also be considered as a target CPU that the scheduler
   wishes to use for the task.

Therefore, enforce the right behavior by always checking whether
@prev_cpu is in the allowed mask, when using scx_bpf_select_cpu_and(),
and it's also usable by the task (@p->cpus_ptr). If it is not, try to
find a valid CPU nearby @prev_cpu, following the usual locality-aware
fallback path (SMT, LLC, node, allowed CPUs).

This ensures the returned CPU is always allowed, improving robustness to
affinity changes and invalid scheduler hints, while preserving locality
as much as possible.

Fixes: a730e3f7a4 ("sched_ext: idle: Consolidate default idle CPU selection kfuncs")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-30 14:42:30 -10:00
Linus Torvalds
90b83efa67 bpf-next-6.16
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmg3NqgACgkQ6rmadz2v
 bTpNUQ/8DPeYtn3nskpsP2OwFy6O3hhfCe6gjOAmUVSk000xbG+AcI/h1DnGZWgk
 xlVcEs93ekzUzHd7k1+RJ2c5yDLXieLJAtb66rbFU1enkxs2cWlcWSKE6K/gaoh3
 G1BCARVlKwtrJhrVrsXtYP/eGZxKRSUZFK7xhtCk7lp7sRI3xkTLE+FJBcDkTJ6W
 HwF14i3zO+BkqNGdFwwlASCCqRItSNBBiM3KjW1DbETOTfAKlvCTrcgdUiODqxhF
 PNnULW+xmICABDFlKfDMlUAGNlSHKjiI3+g31LdblA5eyEhIqiCRgBGFYoCnsluk
 qUauRSie61KqC7fxN3qVpC3bXJfD1td7uIvoqSkDLtTv8a5+HAoiohzi1qBzCayl
 LAGkBYewAfDtdDDjNY38JLH2RCdyY6zG9DhqghPHdPlM7zj7L5zZgj34igEwesMM
 mfj9TuFFF99yfX5UUeSxKpDGR1eO4Ew0p7tg8CRs8Fqh6AIQSmboREZrsncVRCTS
 4SDHSI4KcO4LO2pEKzy+X4dewganN7aESnQG34iG0liyvDDwJOgUnDWLRwPLas7k
 3b/zIfBLxOJpA5R+0hhAMtjMA4NgyKJf4yFZwEieuasQjvzwTApi24YhZ/b3HSEB
 2Dp8kHEEbwezv0OFFz/fJ88dNQnrDmtJ+QByN/liA8kj4Yuh2+Q=
 =j3t8
 -----END PGP SIGNATURE-----

Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Pull bpf updates from Alexei Starovoitov:

 - Fix and improve BTF deduplication of identical BTF types (Alan
   Maguire and Andrii Nakryiko)

 - Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
   Alexis Lothoré)

 - Support load-acquire and store-release instructions in BPF JIT on
   riscv64 (Andrea Parri)

 - Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
   Protopopov)

 - Streamline allowed helpers across program types (Feng Yang)

 - Support atomic update for hashtab of BPF maps (Hou Tao)

 - Implement json output for BPF helpers (Ihor Solodrai)

 - Several s390 JIT fixes (Ilya Leoshkevich)

 - Various sockmap fixes (Jiayuan Chen)

 - Support mmap of vmlinux BTF data (Lorenz Bauer)

 - Support BPF rbtree traversal and list peeking (Martin KaFai Lau)

 - Tests for sockmap/sockhash redirection (Michal Luczaj)

 - Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)

 - Add support for dma-buf iterators in BPF (T.J. Mercier)

 - The verifier support for __bpf_trap() (Yonghong Song)

* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
  bpf, arm64: Remove unused-but-set function and variable.
  selftests/bpf: Add tests with stack ptr register in conditional jmp
  bpf: Do not include stack ptr register in precision backtracking bookkeeping
  selftests/bpf: enable many-args tests for arm64
  bpf, arm64: Support up to 12 function arguments
  bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
  bpf: Avoid __bpf_prog_ret0_warn when jit fails
  bpftool: Add support for custom BTF path in prog load/loadall
  selftests/bpf: Add unit tests with __bpf_trap() kfunc
  bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
  bpf: Remove special_kfunc_set from verifier
  selftests/bpf: Add test for open coded dmabuf_iter
  selftests/bpf: Add test for dmabuf_iter
  bpf: Add open coded dmabuf iterator
  bpf: Add dmabuf iterator
  dma-buf: Rename debugfs symbols
  bpf: Fix error return value in bpf_copy_from_user_dynptr
  libbpf: Use mmap to parse vmlinux BTF from sysfs
  selftests: bpf: Add a test for mmapable vmlinux BTF
  btf: Allow mmap of vmlinux btf
  ...
2025-05-28 15:52:42 -07:00
Linus Torvalds
feacb1774b sched_ext: Changes for v6.16
- More in-kernel idle CPU selection improvements. Expand topology awareness
   coverage add scx_bpf_select_cpu_and() to allow more flexibility. The idle
   CPU selection kfuncs can now be called from unlocked contexts too.
 
 - A bunch of reorganization changes to lay the foundation for multiple
   hierarchical scheduler support. This isn't ready yet and the included
   changes don't make meaningful behavior differences. One notable change is
   replacing some static_key tests with dynamic tests as the test results may
   differ depending on the scheduler instance. This isn't expected to cause
   meaningful performance difference.
 
 - Other minor and doc updates.
 
 - There were multiple patches in for-6.15-fixes which conflicted with
   changes in for-6.16. for-6.15-fixes were pulled three times into for-6.16
   to resolve the conflicts.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaDYZMw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGfbcAQDRloVb/d5RfC6VYlue9EV1jHuoJefTYHvR3jmO
 ju70EQEAjLBXw58XAePQ9La/570JELgsC5FzJp3tLTilGx2JyQA=
 =7cDG
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - More in-kernel idle CPU selection improvements. Expand topology
   awareness coverage add scx_bpf_select_cpu_and() to allow more
   flexibility. The idle CPU selection kfuncs can now be called from
   unlocked contexts too.

 - A bunch of reorganization changes to lay the foundation for multiple
   hierarchical scheduler support. This isn't ready yet and the included
   changes don't make meaningful behavior differences. One notable
   change is replacing some static_key tests with dynamic tests as the
   test results may differ depending on the scheduler instance. This
   isn't expected to cause meaningful performance difference.

 - Other minor and doc updates.

 - There were multiple patches in for-6.15-fixes which conflicted with
   changes in for-6.16. for-6.15-fixes were pulled three times into
   for-6.16 to resolve the conflicts.

* tag 'sched_ext-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (49 commits)
  sched_ext: Call ops.update_idle() after updating builtin idle bits
  sched_ext, docs: convert mentions of "CFS" to "fair-class scheduler"
  selftests/sched_ext: Update test enq_select_cpu_fails
  sched_ext: idle: Consolidate default idle CPU selection kfuncs
  selftests/sched_ext: Add test for scx_bpf_select_cpu_and() via test_run
  sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked context
  sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and()
  sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c
  sched_ext, docs: add label
  sched_ext: Explain the temporary situation around scx_root dereferences
  sched_ext: Add @sch to SCX_CALL_OP*()
  sched_ext: Cleanup [__]scx_exit/error*()
  sched_ext: Add @sch to SCX_CALL_OP*()
  sched_ext: Clean up scx_root usages
  Documentation: scheduler: Changed lowercase acronyms to uppercase
  sched_ext: Avoid NULL scx_root deref in __scx_exit()
  sched_ext: Add RCU protection to scx_root in DSQ iterator
  sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn()
  sched_ext: Move disable machinery into scx_sched
  sched_ext: Move event_stats_cpu into scx_sched
  ...
2025-05-27 21:12:50 -07:00
Linus Torvalds
c89756bcf4 Power management updates for 6.16-rc1
- Fix potential division-by-zero error in em_compute_costs() (Yaxiong
    Tian).
 
  - Fix typos in energy model documentation and example driver code (Moon
    Hee Lee, Atul Kumar Pant).
 
  - Rearrange the energy model management code and add a new function for
    adjusting a CPU energy model after adjusting the capacity of the
    given CPU to it (Rafael Wysocki).
 
  - Refactor cpufreq_online(), add and use cpufreq policy locking guards,
    use __free() in policy reference counting, and clean up core cpufreq
    code on top of that (Rafael Wysocki).
 
  - Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
    Kumar).
 
  - Fix des_perf clamping with max_perf in amd_pstate_update() (Dhananjay
    Ugwekar).
 
  - Add offline, online and suspend callbacks to the amd-pstate driver,
    rename and use the existing amd_pstate_epp callbacks in it (Dhananjay
    Ugwekar).
 
  - Add support for the "Requested CPU Min frequency" BIOS option to the
    amd-pstate driver (Dhananjay Ugwekar).
 
  - Reset amd-pstate driver mode after running selftests (Swapnil
    Sapkal).
 
  - Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
    Chancellor).
 
  - Add helper for governor checks to the schedutil cpufreq governor and
    move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki).
 
  - Populate the cpu_capacity sysfs entries from the intel_pstate driver
    after registering asym capacity support (Ricardo Neri).
 
  - Add support for enabling Energy-aware scheduling (EAS) to the
    intel_pstate driver when operating in the passive mode on a hybrid
    platform (Rafael Wysocki).
 
  - Drop redundant cpus_read_lock() from store_local_boost() in the
    cpufreq core (Seyediman Seyedarab).
 
  - Replace sscanf() with kstrtouint() in the cpufreq code and use a
    symbol instead of a raw number in it (Bowen Yu).
 
  - Add support for autonomous CPU performance state selection to the
    CPPC cpufreq driver (Lifeng Zheng).
 
  - OPP: Add dev_pm_opp_set_level() (Praveen Talari).
 
  - Introduce scope-based cleanup headers and mutex locking guards in OPP
    core (Viresh Kumar).
 
  - Switch OPP to use kmemdup_array() (Zhang Enpei).
 
  - Optimize bucket assignment when next_timer_ns equals KTIME_MAX in the
    menu cpuidle governor (Zhongqiu Han).
 
  - Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla).
 
  - Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem
    Bityutskiy).
 
  - Fix typos in two comments in the teo cpuidle governor (Atul Kumar
    Pant).
 
  - Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
    Kalla).
 
  - Move debug runtime PM attributes to runtime_attrs[] (Rafael Wysocki).
 
  - Add new devm_ functions for enabling runtime PM and runtime PM
    reference counting (Bence Csókás).
 
  - Remove size arguments from strscpy() calls in the hibernation core
    code (Thorsten Blum).
 
  - Adjust the handling of devices with asynchronous suspend enabled
    during system suspend and resume to start resuming them immediately
    after resuming their parents and to start suspending such a device
    immediately after suspending its first child (Rafael Wysocki).
 
  - Adjust messages printed during tasks freezing to avoid using
    pr_cont() (Andrew Sayers, Paul Menzel).
 
  - Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
    Zhang).
 
  - Add missing wakeup source attribute relax_count to sysfs and
    remove the space character at the end ofi the string produced by
    pm_show_wakelocks() (Zijun Hu).
 
  - Add configurable pm_test delay for hibernation (Zihuan Zhang).
 
  - Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
    cypd4226 device on Tegra boards from suspending prematurely (Jon
    Hunter).
 
  - Unbreak printing PM debug messages during hibernation and clean up
    some related code (Rafael Wysocki).
 
  - Add a systemd service to run cpupower and change cpupower binding's
    Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli).
 -----BEGIN PGP SIGNATURE-----
 
 iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmg0xS0SHHJqd0Byand5
 c29ja2kubmV0AAoJEO5fvZ0v1OO1AwwH/Rvgza5YBPb9JZqWJT/ZiBw7HcEWHhP1
 fNfcVU1gXPZiF0yoPfjfJua6BcLj6lyQ3d/+zWqqAcWfmRSD6HPe8yYz8qALUAqj
 RWhDa04aGj6B9bQuOjejatznYlQlkwCRT7zec+75D+dAHVMqR/Vt2LFAetCadgHe
 MQibAQmVFXu3RFkBjReTAdGzVoTXkwoZDrzdfA2aFAfMJNtJpOW4atUZvnucuctv
 VK3ZratrctCIw7yXEoB1nWSmlY7R5JlslplBfndjmmOnky3YxNr7C6paqwtbTWoF
 MiX48qkmLOGeO6gS8s/lVCDQ4oZ+UNFQvXRsM5NGjycBikhHX/dp/w4=
 =dIqJ
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "Once again, the changes are dominated by cpufreq updates, but this
  time the majority of them are cpufreq core changes, mostly related to
  the introduction of policy locking guards and __free() usage, and
  fixes related to boost handling.

  Still, there is also a significant update of the intel_pstate driver
  making it register an energy model when running on a hybrid platform
  which is used for enabling energy-aware scheduling (EAS) if the driver
  operates in the passive mode (and schedutil is used as the cpufreq
  governor for all CPUs which is the passive mode default).

  There are some amd-pstate driver updates too, for a good measure,
  including the "Requested CPU Min frequency" BIOS option support and
  new online/offline callbacks.

  In the cpuidle space, the most significant change is the addition of a
  C1 demotion on/off sysfs knob to intel_idle which should help some
  users to configure their systems more precisely. There is also the
  conversion of the PSCI cpuidle driver to a faux device one and there
  are two small updates of cpuidle governors.

  Device power management is also modified quite a bit, especially the
  handling of devices with asynchronous suspend and resume enabled
  during system transitions. They are now going to be handled more
  asynchronously during suspend transitions and somewhat less
  aggressively during resume transitions.

  Apart from the above, the operating performance points (OPP) library
  is now going to use mutex locking guards and scope-based cleanup
  helpers and there is the usual bunch of assorted fixes and code
  cleanups.

  Specifics:

   - Fix potential division-by-zero error in em_compute_costs() (Yaxiong
     Tian)

   - Fix typos in energy model documentation and example driver code
     (Moon Hee Lee, Atul Kumar Pant)

   - Rearrange the energy model management code and add a new function
     for adjusting a CPU energy model after adjusting the capacity of
     the given CPU to it (Rafael Wysocki)

   - Refactor cpufreq_online(), add and use cpufreq policy locking
     guards, use __free() in policy reference counting, and clean up
     core cpufreq code on top of that (Rafael Wysocki)

   - Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
     Kumar)

   - Fix des_perf clamping with max_perf in amd_pstate_update()
     (Dhananjay Ugwekar)

   - Add offline, online and suspend callbacks to the amd-pstate driver,
     rename and use the existing amd_pstate_epp callbacks in it
     (Dhananjay Ugwekar)

   - Add support for the "Requested CPU Min frequency" BIOS option to
     the amd-pstate driver (Dhananjay Ugwekar)

   - Reset amd-pstate driver mode after running selftests (Swapnil
     Sapkal)

   - Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
     Chancellor)

   - Add helper for governor checks to the schedutil cpufreq governor
     and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki)

   - Populate the cpu_capacity sysfs entries from the intel_pstate
     driver after registering asym capacity support (Ricardo Neri)

   - Add support for enabling Energy-aware scheduling (EAS) to the
     intel_pstate driver when operating in the passive mode on a hybrid
     platform (Rafael Wysocki)

   - Drop redundant cpus_read_lock() from store_local_boost() in the
     cpufreq core (Seyediman Seyedarab)

   - Replace sscanf() with kstrtouint() in the cpufreq code and use a
     symbol instead of a raw number in it (Bowen Yu)

   - Add support for autonomous CPU performance state selection to the
     CPPC cpufreq driver (Lifeng Zheng)

   - OPP: Add dev_pm_opp_set_level() (Praveen Talari)

   - Introduce scope-based cleanup headers and mutex locking guards in
     OPP core (Viresh Kumar)

   - Switch OPP to use kmemdup_array() (Zhang Enpei)

   - Optimize bucket assignment when next_timer_ns equals KTIME_MAX in
     the menu cpuidle governor (Zhongqiu Han)

   - Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla)

   - Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem
     Bityutskiy)

   - Fix typos in two comments in the teo cpuidle governor (Atul Kumar
     Pant)

   - Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
     Kalla)

   - Move debug runtime PM attributes to runtime_attrs[] (Rafael
     Wysocki)

   - Add new devm_ functions for enabling runtime PM and runtime PM
     reference counting (Bence Csókás)

   - Remove size arguments from strscpy() calls in the hibernation core
     code (Thorsten Blum)

   - Adjust the handling of devices with asynchronous suspend enabled
     during system suspend and resume to start resuming them immediately
     after resuming their parents and to start suspending such a device
     immediately after suspending its first child (Rafael Wysocki)

   - Adjust messages printed during tasks freezing to avoid using
     pr_cont() (Andrew Sayers, Paul Menzel)

   - Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
     Zhang)

   - Add missing wakeup source attribute relax_count to sysfs and remove
     the space character at the end ofi the string produced by
     pm_show_wakelocks() (Zijun Hu)

   - Add configurable pm_test delay for hibernation (Zihuan Zhang)

   - Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
     cypd4226 device on Tegra boards from suspending prematurely (Jon
     Hunter)

   - Unbreak printing PM debug messages during hibernation and clean up
     some related code (Rafael Wysocki)

   - Add a systemd service to run cpupower and change cpupower binding's
     Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli)"

* tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits)
  cpufreq: CPPC: Add support for autonomous selection
  cpufreq: Update sscanf() to kstrtouint()
  cpufreq: Replace magic number
  OPP: switch to use kmemdup_array()
  PM: freezer: Rewrite restarting tasks log to remove stray *done.*
  PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
  cpufreq: drop redundant cpus_read_lock() from store_local_boost()
  cpupower: do not install files to /etc/default/
  cpupower: do not call systemctl at install time
  cpupower: do not write DESTDIR to cpupower.service
  PM: sleep: Introduce pm_sleep_transition_in_progress()
  cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
  cpufreq: intel_pstate: Document hybrid processor support
  cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
  cpufreq: intel_pstate: EAS support for hybrid platforms
  PM: EM: Introduce em_adjust_cpu_capacity()
  PM: EM: Move CPU capacity check to em_adjust_new_capacity()
  PM: EM: Documentation: Fix typos in example driver code
  cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
  PM: sleep: Introduce pm_suspend_in_progress()
  ...
2025-05-27 16:48:47 -07:00
Linus Torvalds
eaed94d1f6 Scheduler updates for v6.16:
Core & fair scheduler changes:
 
   - Tweak wait_task_inactive() to force dequeue sched_delayed tasks
     (John Stultz)
 
   - Adhere to place_entity() constraints (Peter Zijlstra)
 
   - Allow decaying util_est when util_avg > CPU capacity (Pierre Gondois)
 
   - Fix up wake_up_sync() vs DELAYED_DEQUEUE (Xuewen Yan)
 
 Energy management:
 
   - Introduce sched_update_asym_prefer_cpu() (K Prateek Nayak)
 
   - cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings change
     (K Prateek Nayak)
 
   - Align uclamp and util_est and call before freq update (Xuewen Yan)
 
 CPU isolation:
 
   - Make use of more than one housekeeping CPU (Phil Auld)
 
 RT scheduler:
 
   - Fix race in push_rt_task() (Harshit Agarwal)
 
   - Add kernel cmdline option for rt_group_sched (Michal Koutný)
 
 Scheduler topology support:
 
   - Improve topology_span_sane speed (Steve Wahl)
 
 Scheduler debugging:
 
   - Move and extend the sched_process_exit() tracepoint (Andrii Nakryiko)
 
   - Add RT_GROUP WARN checks for non-root task_groups (Michal Koutný)
 
   - Fix trace_sched_switch(.prev_state) (Peter Zijlstra)
 
   - Untangle cond_resched() and live-patching (Peter Zijlstra)
 
 Fixes and cleanups:
 
   - Misc fixes and cleanups (K Prateek Nayak, Michal Koutný,
     Peter Zijlstra, Xuewen Yan)
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmgy50ARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jFQQ/+KXl2XDg1V/VVmMG8GmtDlR29V3M3ricy
 D7/2s0D1Y1ErHb+pRMBG31EubT9/bXjUshWIuuf51DciSLBmpELHxY5J+AevRa0L
 /pHFwSvP6H5pDakI/xZ01FlYt7PxZGs+1m1o2615Mbwq6J2bjZTan54CYzrdpLOy
 Nqb3OT4tSqU1+7SV7hVForBpZp9u3CvVBRt/wE6vcHltW/I486bM8OCOd2XrUlnb
 QoIRliGI9KHpqCpbAeKPRSKXpf9tZv/AijZ+0WUu2yY8iwSN4p3RbbbwdCipjVQj
 w5I5oqKI6cylFfl2dEFWXVO+tLBihs06w8KSQrhYmQ9DUu4RGBVM9ORINGDBPejL
 bvoQh1mAkqvIL+oodujdbMDIqLupvOEtVSvwzR7SJn8BJSB00js88ngCWLjo/CcU
 imLbWy9FSBLvOswLBzQthgAJEj+ejCkOIbcvM2lINWhX/zNsMFaaqYcO1wRunGGR
 SavTI1s+ZksCQY6vCwRkwPrOZjyg91TA/q4FK102fHL1IcthH6xubE4yi4lTIUYs
 L56HuGm8e7Shc8M2Y5rAYsVG3GoIHFLXnptOn2HnCRWaAAJYsBaLUlzoBy9MxCfw
 I2YVDCylkQxevosSi2XxXo3tbM6auISU9SelAT/dAz32V1rsjWQojRJXeGYKIbu7
 KBuN/dLItW0=
 =s/ra
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core & fair scheduler changes:

   - Tweak wait_task_inactive() to force dequeue sched_delayed tasks
     (John Stultz)

   - Adhere to place_entity() constraints (Peter Zijlstra)

   - Allow decaying util_est when util_avg > CPU capacity (Pierre
     Gondois)

   - Fix up wake_up_sync() vs DELAYED_DEQUEUE (Xuewen Yan)

  Energy management:

   - Introduce sched_update_asym_prefer_cpu() (K Prateek Nayak)

   - cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings
     change (K Prateek Nayak)

   - Align uclamp and util_est and call before freq update (Xuewen Yan)

  CPU isolation:

   - Make use of more than one housekeeping CPU (Phil Auld)

  RT scheduler:

   - Fix race in push_rt_task() (Harshit Agarwal)

   - Add kernel cmdline option for rt_group_sched (Michal Koutný)

  Scheduler topology support:

   - Improve topology_span_sane speed (Steve Wahl)

  Scheduler debugging:

   - Move and extend the sched_process_exit() tracepoint (Andrii
     Nakryiko)

   - Add RT_GROUP WARN checks for non-root task_groups (Michal Koutný)

   - Fix trace_sched_switch(.prev_state) (Peter Zijlstra)

   - Untangle cond_resched() and live-patching (Peter Zijlstra)

  Fixes and cleanups:

   - Misc fixes and cleanups (K Prateek Nayak, Michal Koutný, Peter
     Zijlstra, Xuewen Yan)"

* tag 'sched-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  sched/uclamp: Align uclamp and util_est and call before freq update
  sched/util_est: Simplify condition for util_est_{en,de}queue()
  sched/fair: Fixup wake_up_sync() vs DELAYED_DEQUEUE
  sched,livepatch: Untangle cond_resched() and live-patching
  sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
  sched/fair: Adhere to place_entity() constraints
  sched/debug: Print the local group's asym_prefer_cpu
  cpufreq/amd-pstate: Update asym_prefer_cpu when core rankings change
  sched/topology: Introduce sched_update_asym_prefer_cpu()
  sched/fair: Use READ_ONCE() to read sg->asym_prefer_cpu
  sched/isolation: Make use of more than one housekeeping cpu
  sched/rt: Fix race in push_rt_task
  sched: Add annotations to RT_GROUP_SCHED fields
  sched: Add RT_GROUP WARN checks for non-root task_groups
  sched: Do not construct nor expose RT_GROUP_SCHED structures if disabled
  sched: Bypass bandwitdh checks with runtime disabled RT_GROUP_SCHED
  sched: Skip non-root task_groups with disabled RT_GROUP_SCHED
  sched: Add commadline option for RT_GROUP_SCHED toggling
  sched: Always initialize rt_rq's task_group
  sched: Remove unneeed macro wrap
  ...
2025-05-26 15:19:58 -07:00
Rafael J. Wysocki
f34dc28343 Merge branch 'pm-cpufreq'
Merge cpufreq updates for 6.16-rc1:

 - Refactor cpufreq_online(), add and use cpufreq policy locking guards,
   use __free() in policy reference counting, and clean up core cpufreq
   code on top of that (Rafael Wysocki).

 - Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
   Kumar).

 - Fix des_perf clamping with max_perf in amd_pstate_update() (Dhananjay
   Ugwekar).

 - Add offline, online and suspend callbacks to the amd-pstate driver,
   rename and use the existing amd_pstate_epp callbacks in it (Dhananjay
   Ugwekar).

 - Add support for the "Requested CPU Min frequency" BIOS option to the
   amd-pstate driver (Dhananjay Ugwekar).

 - Reset amd-pstate driver mode after running selftests (Swapnil
   Sapkal).

 - Add helper for governor checks to the schedutil cpufreq governor and
   move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki).

 - Populate the cpu_capacity sysfs entries from the intel_pstate driver
   after registering asym capacity support (Ricardo Neri).

 - Add support for enabling Energy-aware scheduling (EAS) to the
   intel_pstate driver when operating in the passive mode on a hybrid
   platform (Rafael Wysocki).

 - Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
   Chancellor).

 - Drop redundant cpus_read_lock() from store_local_boost() in the
   cpufreq core (Seyediman Seyedarab).

 - Replace sscanf() with kstrtouint() in the cpufreq code and use a
   symbol instead of a raw number in it (Bowen Yu).

 - Add support for autonomous CPU performance state selection to the
   CPPC cpufreq driver (Lifeng Zheng).

* pm-cpufreq: (31 commits)
  cpufreq: CPPC: Add support for autonomous selection
  cpufreq: Update sscanf() to kstrtouint()
  cpufreq: Replace magic number
  cpufreq: drop redundant cpus_read_lock() from store_local_boost()
  cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
  cpufreq: intel_pstate: Document hybrid processor support
  cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
  cpufreq: intel_pstate: EAS support for hybrid platforms
  cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
  cpufreq: intel_pstate: Populate the cpu_capacity sysfs entries
  arch_topology: Relocate cpu_scale to topology.[h|c]
  cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
  cpufreq/sched: schedutil: Add helper for governor checks
  amd-pstate-ut: Reset amd-pstate driver mode after running selftests
  cpufreq/amd-pstate: Add support for the "Requested CPU Min frequency" BIOS option
  cpufreq/amd-pstate: Add offline, online and suspend callbacks for amd_pstate_driver
  cpufreq: Force sync policy boost with global boost on sysfs update
  cpufreq: Preserve policy's boost state after resume
  cpufreq: Introduce policy_set_boost()
  cpufreq: Don't unnecessarily call set_boost()
  ...
2025-05-26 20:19:40 +02:00
Tejun Heo
273cc94965 sched_ext: Call ops.update_idle() after updating builtin idle bits
BPF schedulers that use both builtin CPU idle mechanism and
ops.update_idle() may want to use the latter to create interlocking between
ops.enqueue() and CPU idle transitions so that either ops.enqueue() sees the
idle bit or ops.update_idle() sees the task queued somewhere. This can
prevent race conditions where CPUs go idle while tasks are waiting in DSQs.

For such interlocking to work, ops.update_idle() must be called after
builtin CPU masks are updated. Relocate the invocation. Currently, there are
no ordering requirements on transitions from idle and this relocation isn't
expected to make meaningful differences in that direction.

This also makes the ops.update_idle() behavior semantically consistent:
any action performed in this callback should be able to override the
builtin idle state, not the other way around.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-and-tested-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-05-22 09:25:15 -10:00
Andrea Righi
a730e3f7a4 sched_ext: idle: Consolidate default idle CPU selection kfuncs
There is no reason to restrict scx_bpf_select_cpu_dfl() invocations to
ops.select_cpu() while allowing scx_bpf_select_cpu_and() to be used from
multiple contexts, as both provide equivalent functionality, with the
latter simply accepting an additional "allowed" cpumask.

Therefore, unify the two APIs, enabling both kfuncs to be used from
ops.select_cpu(), ops.enqueue(), and unlocked contexts (e.g., via BPF
test_run).

This allows schedulers to implement a consistent idle CPU selection
policy and helps reduce code duplication.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-21 07:35:48 -10:00
Xuewen Yan
90ca9410da sched/uclamp: Align uclamp and util_est and call before freq update
The commit dfa0a574cb ("sched/uclamg: Handle delayed dequeue")
has add the sched_delayed check to prevent double uclamp_dec/inc.
However, it put the uclamp_rq_inc() after enqueue_task().
This may lead to the following issues:
When a task with uclamp goes through enqueue_task() and could trigger
cpufreq update, its uclamp won't even be considered in the cpufreq
update. It is only after enqueue will the uclamp be added to rq
buckets, and cpufreq will only pick it up at the next update.
This could cause a delay in frequency updating. It may affect
the performance(uclamp_min > 0) or power(uclamp_max < 1024).

So, just like util_est, put the uclamp_rq_inc() before enqueue_task().
And as for the sched_delayed_task, same as util_est, using the
sched_delayed flag to prevent inc the sched_delayed_task's uclamp,
using the ENQUEUE_DELAYED flag to allow inc the sched_delayed_task's uclamp
which is being woken up.

Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20250417043457.10632-3-xuewen.yan@unisoc.com
2025-05-21 13:57:38 +02:00
Xuewen Yan
0212696a84 sched/util_est: Simplify condition for util_est_{en,de}queue()
To prevent double enqueue/dequeue of the util-est for sched_delayed tasks,
commit 729288bc68 ("kernel/sched: Fix util_est accounting for DELAY_DEQUEUE")
added the corresponding check. This check excludes double en/dequeue during
task migration and priority changes.

In fact, these conditions can be simplified.

For util_est_dequeue, we know that sched_delayed flag is set in dequeue_entity.
When the task is sleeping, we need to call util_est_dequeue to subtract
util-est from the cfs_rq. At this point, sched_delayed has not yet been set.
If we find that sched_delayed is already set, it indicates that this task
has already called dequeue_task_fair once. In this case, there is no need to
call util_est_dequeue again. Therefore, simply checking the sched_delayed flag
should be sufficient to prevent unnecessary util_est updates during the dequeue.

For util_est_enqueue, our goal is to add the util_est to the cfs_rq
when task enqueue. However, we don't want to add the util_est of a
sched_delayed task to the cfs_rq because the task is sleeping.
Therefore, we can exclude the util_est_enqueue for sched_delayed tasks
by checking the sched_delayed flag. However, when waking up a delayed task,
the sched_delayed flag is cleared after util_est_enqueue. As a result,
if we only check the sched_delayed flag, we would miss the util_est_enqueue.
Since waking up a sched_delayed task calls enqueue_task with the ENQUEUE_DELAYED flag,
we can determine whether to call util_est_enqueue by checking if the
enqueue_flag contains ENQUEUE_DELAYED.

Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20250417043457.10632-2-xuewen.yan@unisoc.com
2025-05-21 13:57:38 +02:00
Xuewen Yan
aa3ee4f0b7 sched/fair: Fixup wake_up_sync() vs DELAYED_DEQUEUE
Delayed dequeued feature keeps a sleeping task enqueued until its
lag has elapsed. As a result, it stays also visible in rq->nr_running.
So when in wake_affine_idle(), we should use the real running-tasks
in rq to check whether we should place the wake-up task to
current cpu.
On the other hand, add a helper function to return the nr-delayed.

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-and-tested-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250303105241.17251-2-xuewen.yan@unisoc.com
2025-05-21 13:57:37 +02:00
Andrea Righi
4ac760bdf2 sched_ext: idle: Allow scx_bpf_select_cpu_and() from unlocked context
Allow scx_bpf_select_cpu_and() to be used from an unlocked context, in
addition to ops.enqueue() or ops.select_cpu().

This enables schedulers, including user-space ones, to implement a
consistent idle CPU selection policy and helps reduce code duplication.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20 10:24:11 -10:00
Andrea Righi
686d133723 sched_ext: idle: Validate locking correctness in scx_bpf_select_cpu_and()
Validate locking correctness when accessing p->nr_cpus_allowed and
p->cpus_ptr inside scx_bpf_select_cpu_and(): if the rq lock is held,
access is safe; otherwise, require that p->pi_lock is held.

This allows to catch potential unsafe calls to scx_bpf_select_cpu_and().

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20 10:24:05 -10:00
Andrea Righi
617a77018f sched_ext: Make scx_kf_allowed_if_unlocked() available outside ext.c
Relocate the scx_kf_allowed_if_unlocked(), so it can be used from other
source files (e.g., ext_idle.c).

No functional change.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-20 10:23:50 -10:00
Tejun Heo
cb4ff91492 sched_ext: Explain the temporary situation around scx_root dereferences
Naked scx_root dereferences are being used as temporary markers to indicate
that they need to be updated to point to the right scheduler instance.
Explain the situation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 12:37:18 -04:00
Tejun Heo
a8433f7a26 sched_ext: Add @sch to SCX_CALL_OP*()
In preparation of hierarchical scheduling support, add @sch to scx_exit()
and friends:

- scx_exit/error() updated to take explicit @sch instead of assuming
  scx_root.

- scx_kf_exit/error() added. These are to be used from kfuncs, don't take
  @sch and internally determine the scx_sched instance to abort. Currently,
  it's always scx_root but once multiple scheduler support is in place, it
  will be the scx_sched instance that invoked the kfunc. This simplifies
  many callsites and defers scx_sched lookup until error is triggered.

- @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU
  validity conditions in ops_cpu_valid() are factored into __cpu_valid() to
  implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error().

- All users are converted. Most conversions are straightforward.
  check_rq_for_timeouts() and scx_softlockup() are updated to use explicit
  rcu_dereference*(scx_root) for safety as they may execute asynchronous to
  the exit path. scx_tick() is also updated to use rcu_dereference(). While
  not strictly necessary due to the preceding scx_enabled() test and IRQ
  disabled context, this removes the subtlety at no noticeable cost.

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 11:11:48 -04:00
Tejun Heo
c4c286d747 sched_ext: Cleanup [__]scx_exit/error*()
__scx_exit() is the base exit implementation and there are three wrappers on
top of it - scx_exit(), __scx_error() and scx_error(). This is more
confusing than helpful especially given that there are only a couple users
of scx_exit() and __scx_error(). To simplify the situation:

- Make __scx_exit() take va_list and rename it to scx_vexit(). This is to
  ease implementing more complex extensions on top.

- Make scx_exit() a varargs wrapper around __scx_exit(). scx_exit() now
  takes both @kind and @exit_code.

- Convert existing scx_exit() and __scx_error() users to use the new
  scx_exit().

- scx_error() remains unchanged.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 11:11:48 -04:00
Tejun Heo
ab3f497ac1 sched_ext: Add @sch to SCX_CALL_OP*()
In preparation of hierarchical scheduling support, make SCX_CALL_OP*() take
explicit @sch instead of assuming scx_root. As scx_root is still the only
scheduler instance, this patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 11:11:48 -04:00
Tejun Heo
d310fb4009 sched_ext: Clean up scx_root usages
- Always cache scx_root into local variable sch before using.

- Don't use scx_root if cached sch is available.

- Wrap !sch test with unlikely().

- Pass @scx into scx_cgroup_init/exit().

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14 11:11:48 -04:00
Peter Zijlstra
676e8cf70c sched,livepatch: Untangle cond_resched() and live-patching
With the goal of deprecating / removing VOLUNTARY preempt, live-patch
needs to stop relying on cond_resched() to make forward progress.

Instead, rely on schedule() with TASK_FREEZABLE set. Just like
live-patching, the freezer needs to be able to stop tasks in a safe /
known state.

[bigeasy: use likely() in __klp_sched_try_switch() and update comments]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Tested-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20250509113659.wkP_HJ5z@linutronix.de
2025-05-14 13:16:24 +02:00
Rafael J. Wysocki
0b224fcc89 Merge Energy Model management code changes for 6.16 2025-05-13 14:34:54 +02:00
Libo Chen
3fc567e4c0 sched/numa: add tracepoint that tracks the skipping of numa balancing due to cpuset memory pinning
Unlike sched_skip_vma_numa tracepoint which tracks skipped VMAs, this
tracks the task subjected to cpuset.mems pinning and prints out its
allowed memory node mask.

Link: https://lkml.kernel.org/r/20250424024523.2298272-3-libo.chen@oracle.com
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Chen Yu <yu.c.chen@intel.com>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Srikanth Aithal <sraithal@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:46 -07:00
Libo Chen
1f6c6ac03d sched/numa: skip VMA scanning on memory pinned to one NUMA node via cpuset.mems
Patch series "sched/numa: Skip VMA scanning on memory pinned to one NUMA
node via cpuset.mems", v5.


This patch (of 2):

When the memory of the current task is pinned to one NUMA node by cgroup,
there is no point in continuing the rest of VMA scanning and hinting page
faults as they will just be overhead.  With this change, there will be no
more unnecessary PTE updates or page faults in this scenario.

We have seen up to a 6x improvement on a typical java workload running on
VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
AARCH64 system.  With the same pinning, on a 18-cores-per-socket Intel
platform, we have seen 20% improvment in a microbench that creates a
30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
pages in a fixed number of loops.

Link: https://lkml.kernel.org/r/20250424024523.2298272-1-libo.chen@oracle.com
Link: https://lkml.kernel.org/r/20250424024523.2298272-2-libo.chen@oracle.com
Signed-off-by: Libo Chen <libo.chen@oracle.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12 23:50:46 -07:00
Linus Torvalds
e9565e23cd sched_ext: Fixes for v6.15-rc6
A little bit invasive for rc6 but they're important fixes, pass tests fine
 and won't break anything outside sched_ext.
 
 - scx_bpf_cpuperf_set() calls internal functions that require the rq to be
   locked. It assumed that the BPF caller has rq locked but that's not always
   true. Fix it by tracking whether rq is currently held by the CPU and
   grabbing it if necessary.
 
 - bpf_iter_scx_dsq_new() was leaving the DSQ iterator in an uninitialized
   state after an error. However, next() and destroy() can be called on an
   iterator which failed initialization and thus they always need to be
   initialized even after an init error. Fix by always initializing the
   iterator.
 
 - Remove duplicate BTF_ID_FLAGS() entries.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCaCKYDw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGV0nAP9UX1sGJV7a8by8L0hq3luZRSQo7Kjw0pTaMeD+
 /oIlOgD/VX3epHmSKvwIOO2W4bv5oSR8B+Yx+4uAhhLfTgdlxgk=
 =clg7
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:
 "A little bit invasive for rc6 but they're important fixes, pass tests
  fine and won't break anything outside sched_ext:

   - scx_bpf_cpuperf_set() calls internal functions that require the rq
     to be locked. It assumed that the BPF caller has rq locked but
     that's not always true. Fix it by tracking whether rq is currently
     held by the CPU and grabbing it if necessary

   - bpf_iter_scx_dsq_new() was leaving the DSQ iterator in an
     uninitialized state after an error. However, next() and destroy()
     can be called on an iterator which failed initialization and thus
     they always need to be initialized even after an init error. Fix by
     always initializing the iterator

   - Remove duplicate BTF_ID_FLAGS() entries"

* tag 'sched_ext-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator
  sched_ext: Fix rq lock state in hotplug ops
  sched_ext: Remove duplicate BTF_ID_FLAGS definitions
  sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  sched_ext: Track currently locked rq
2025-05-12 18:02:05 -07:00
Feng Yang
8c112a428b sched_ext: Remove bpf_scx_get_func_proto
task_storage_{get,delete} has been moved to bpf_base_func_proto.

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-3-yangfeng59949@163.com
2025-05-09 11:01:45 -07:00
Rafael J. Wysocki
4854649b1f cpufreq/sched: Move cpufreq-specific EAS checks to cpufreq
Doing cpufreq-specific EAS checks that require accessing policy
internals directly from sched_is_eas_possible() is a bit unfortunate,
so introduce cpufreq_ready_for_eas() in cpufreq, move those checks
into that new function and make sched_is_eas_possible() call it.

While at it, address a possible race between the EAS governor check
and governor change by doing the former under the policy rwsem.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/2317800.iZASKD2KPV@rjwysocki.net
2025-05-07 21:17:56 +02:00
Rafael J. Wysocki
f42c8556a0 cpufreq/sched: schedutil: Add helper for governor checks
Add a helper for checking if schedutil is the current governor for
a given cpufreq policy and use it in sched_is_eas_possible() to avoid
accessing cpufreq policy internals directly from there.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/3365956.44csPzL39Z@rjwysocki.net
2025-05-07 21:17:56 +02:00
Tejun Heo
9b30400ff6 Merge branch 'for-6.15-fixes' into for-6.16
To receive 428dc9fc08 ("sched_ext: bpf_iter_scx_dsq_new() should always
initialize iterator") which conflicts with cdf5a6faa8 ("sched_ext: Move
dsq_hash into scx_sched"). The conflict is a simple context conflict which
can be resolved by taking changes from both changes in the right order.
2025-05-07 06:25:39 -10:00
Tejun Heo
428dc9fc08 sched_ext: bpf_iter_scx_dsq_new() should always initialize iterator
BPF programs may call next() and destroy() on BPF iterators even after new()
returns an error value (e.g. bpf_for_each() macro ignores error returns from
new()). bpf_iter_scx_dsq_new() could leave the iterator in an uninitialized
state after an error return causing bpf_iter_scx_dsq_next() to dereference
garbage data. Make bpf_iter_scx_dsq_new() always clear $kit->dsq so that
next() and destroy() become noops.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 650ba21b13 ("sched_ext: Implement DSQ iterator")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
2025-05-07 06:24:07 -10:00
Andrea Righi
c8fafb3485 sched_ext: Avoid NULL scx_root deref in __scx_exit()
A sched_ext scheduler may trigger __scx_exit() from a BPF timer
callback, where scx_root may not be safely dereferenced.

This can lead to a NULL pointer dereference as shown below (triggered by
scx_tickless):

 BUG: kernel NULL pointer dereference, address: 0000000000000330
...
 CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
 RIP: 0010:__scx_exit+0x2b/0x190
...
 Call Trace:
  <IRQ>
  scx_bpf_get_idle_smtmask+0x59/0x80
  bpf_prog_8320d4217989178c_dispatch_all_cpus+0x35/0x1b6
...
  bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
  bpf_timer_cb+0x7a/0x140
  __hrtimer_run_queues+0x1f9/0x3a0
  hrtimer_run_softirq+0x8c/0xd0
  handle_softirqs+0xd3/0x3d0
  __irq_exit_rcu+0x9a/0xc0
  irq_exit_rcu+0xe/0x20

Fix this by checking for a valid scx_root and adding proper RCU
protection.

Fixes: 48e1267773 ("sched_ext: Introduce scx_sched")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-30 12:47:17 -10:00
Andrea Righi
c01adf4097 sched_ext: Add RCU protection to scx_root in DSQ iterator
Using a DSQ iterators from a timer callback can trigger the following
lockdep splat when accessing scx_root:

 =============================
 WARNING: suspicious RCU usage
 6.14.0-virtme #1 Not tainted
 -----------------------------
 kernel/sched/ext.c:6907 suspicious rcu_dereference_check() usage!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 no locks held by swapper/0/0.

 stack backtrace:
 CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
 Sched_ext: tickless (enabled+all)
 Call Trace:
 <IRQ>
 dump_stack_lvl+0x6f/0xb0
 lockdep_rcu_suspicious.cold+0x4e/0xa3
 bpf_iter_scx_dsq_new+0xb1/0xd0
 bpf_prog_63f4fd1bccc101e7_dispatch_cpu+0x3e/0x156
 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x153/0x1b6
 bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
 ? hrtimer_run_softirq+0x4f/0xd0
 bpf_timer_cb+0x7a/0x140
 __hrtimer_run_queues+0x1f9/0x3a0
 hrtimer_run_softirq+0x8c/0xd0
 handle_softirqs+0xd3/0x3d0
 __irq_exit_rcu+0x9a/0xc0
 irq_exit_rcu+0xe/0x20
 sysvec_apic_timer_interrupt+0x73/0x80

Add a proper dereference check to explicitly validate RCU-safe access to
scx_root from rcu_read_lock() contexts and also from contexts that hold
rcu_read_lock_bh(), such as timer callbacks.

Fixes: cdf5a6faa8 ("sched_ext: Move dsq_hash into scx_sched")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-30 12:39:50 -10:00
John Stultz
b7ca5743a2 sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
It was reported that in 6.12, smpboot_create_threads() was
taking much longer then in 6.6.

I narrowed down the call path to:
 smpboot_create_threads()
 -> kthread_create_on_cpu()
    -> kthread_bind()
       -> __kthread_bind_mask()
          ->wait_task_inactive()

Where in wait_task_inactive() we were regularly hitting the
queued case, which sets a 1 tick timeout, which when called
multiple times in a row, accumulates quickly into a long
delay.

I noticed disabling the DELAY_DEQUEUE sched feature recovered
the performance, and it seems the newly create tasks are usually
sched_delayed and left on the runqueue.

So in wait_task_inactive() when we see the task
p->se.sched_delayed, manually dequeue the sched_delayed task
with DEQUEUE_DELAYED, so we don't have to constantly wait a
tick.

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: peter-yc.chang@mediatek.com
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20250429150736.3778580-1-jstultz@google.com
2025-04-30 14:45:41 +02:00
Tejun Heo
9ba7f37e5b sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn()
With the global states and disable machinery moved into scx_sched,
scx_disable_workfn() can only be scheduled and run for the specific
scheduler instance. This makes it impossible for scx_disable_workfn() to see
SCX_EXIT_NONE. Turn that condition into WARN_ON_ONCE().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:11 -10:00
Tejun Heo
bff3b5aec1 sched_ext: Move disable machinery into scx_sched
Because disable can be triggered from any place and the scheduler cannot be
trusted, disable path uses an irq_work to bounce and a kthread_work which is
executed on an RT helper kthread to perform disable. These must be per
scheduler instance to guarantee forward progress. Move them into scx_sched.

- If an scx_sched is accessible, its helper kthread is always valid making
  the `helper` check in schedule_scx_disable_work() unnecessary. As the
  function becomes trivial after the removal of the test, inline it.

- scx_create_rt_helper() has only one user - creation of the disable helper
  kthread. Inline it into scx_alloc_and_add_sched().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 08:40:10 -10:00